# Checking Dataset - Analytics Vidhya - JOB A THON - May 2021

The dataset is from https://www.kaggle.com/nextbigwhat/analytics-vidhya-job-a-thon-may-2021

## The question: 
#### Can we use Machine Learning to predict which customers would be interested in buying a credit card product at this bank based on a set of characteristics?
#### What kind of customer is most likely to be interested in buying a credit product?

In [1]:
#Import Libraries

import warnings
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
plt.rcParams['figure.figsize'] = [12,5]
warnings.filterwarnings("ignore")

# Multiple Line Output
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
#Load dataset and apply variable name
dataset = pd.read_csv('../Datasets/AnalyticsVidhya-JOB_A_THON-May2021/train_s3TEQDk.csv')

# The original data comes as two files, which appear to be one dataset split in two. Since one file is much larger than the other, we use only the larger one.

In [3]:
#Preliminary examination of dataset / EDA

print('Preview')
dataset
print('-----------------------------------------------------')
print('Shape')
dataset.shape
print('-----------------------------------------------------')
print('Describe')
dataset.describe()  # summary statistics for the numeric columns
print('-----------------------------------------------------')
print('Info')
dataset.info()  # column dtypes and non-null counts

Preview


Unnamed: 0,ID,Gender,Age,Region_Code,Occupation,Channel_Code,Vintage,Credit_Product,Avg_Account_Balance,Is_Active,Is_Lead
0,NNVBBKZB,Female,73,RG268,Other,X3,43,No,1045696,No,0
1,IDD62UNG,Female,30,RG277,Salaried,X1,32,No,581988,No,0
2,HD3DSEMC,Female,56,RG268,Self_Employed,X3,26,No,1484315,Yes,0
3,BF3NC7KV,Male,34,RG270,Salaried,X1,19,No,470454,No,0
4,TEASRWXV,Female,30,RG282,Salaried,X1,33,No,886787,No,0
...,...,...,...,...,...,...,...,...,...,...,...
245720,BPAWWXZN,Male,51,RG284,Self_Employed,X3,109,,1925586,No,0
245721,HFNB7JY8,Male,27,RG268,Salaried,X1,15,No,862952,Yes,0
245722,GEHAUCWT,Female,26,RG281,Salaried,X1,13,No,670659,No,0
245723,GE7V8SAH,Female,28,RG273,Salaried,X1,31,No,407504,No,0


-----------------------------------------------------
Shape


(245725, 11)

In [4]:
#Further EDA
print('Count of lead values (yes/no)')
dataset.Is_Lead.value_counts()
print('-----------------------------------------------------')
print('Count of active values (yes/no)')
dataset.Is_Active.value_counts()
print('-----------------------------------------------------')
dataset[['Is_Active', 'Is_Lead']].apply(pd.Series.value_counts)

Count of lead values (yes/no)


0    187437
1     58288
Name: Is_Lead, dtype: int64

-----------------------------------------------------
Count of active values (yes/no)


No     150290
Yes     95435
Name: Is_Active, dtype: int64

-----------------------------------------------------


Unnamed: 0,Is_Active,Is_Lead
No,150290.0,
Yes,95435.0,
0,,187437.0
1,,58288.0


### The target (dependent) variable is 'Is_Lead', which indicates whether the customer has shown interest.
### The independent variables are Gender, Age, Region_Code, Occupation, Channel_Code, Vintage, Credit_Product, Avg_Account_Balance, and Is_Active.
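
A minimal sketch of how the target and the predictors named above could be separated. The `sample` frame holds made-up rows that only mirror some of the column names, and `X`, `y`, and `target_col` are hypothetical names that the notebook itself does not use.

```python
import pandas as pd

# Toy rows that only mimic the column layout of the real dataset (values are made up)
sample = pd.DataFrame({
    "ID": ["A1", "B2", "C3"],
    "Gender": ["Female", "Male", "Female"],
    "Age": [73, 30, 56],
    "Is_Active": ["No", "Yes", "No"],
    "Is_Lead": [0, 1, 0],
})

target_col = "Is_Lead"

# y holds the dependent variable; X holds every other column except the row identifier
y = sample[target_col]
X = sample.drop(columns=[target_col, "ID"])

print(X.columns.tolist())
print(y.tolist())
```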

In [5]:
df = dataset

In [6]:
#Data Cleaning

#Check that there are no duplicate rows

print("Total rows:") 
df.shape
print("Total unique rows:")
df.ID.unique().shape

#Store the original number of rows to compare kept/removed rows later
original_row_num = df.shape[0]


dfmod = df.copy()

#The ID column is an identifier and not needed as a feature, so we'll go ahead and remove it.
dfmod.drop(['ID'], axis=1, inplace=True)




Total rows:


(245725, 11)

Total unique rows:


(245725,)
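
Comparing the row count with the number of unique IDs, as above, checks for repeated identifiers. As a hedged alternative sketch, pandas can also report fully identical rows and ID uniqueness directly. The `toy` frame below is made-up data, not the competition dataset.

```python
import pandas as pd

# Made-up data with one fully duplicated row (B2, 41)
toy = pd.DataFrame({
    "ID": ["A1", "B2", "B2", "C3"],
    "Age": [30, 41, 41, 56],
})

# Number of rows that exactly repeat an earlier row
n_duplicate_rows = int(toy.duplicated().sum())

# True only if every ID appears exactly once
ids_are_unique = toy["ID"].is_unique

# Dropping exact duplicates keeps the first occurrence of each row
deduplicated = toy.drop_duplicates()

print(n_duplicate_rows, ids_are_unique, len(deduplicated))
```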

In [7]:
#Turn records into numerical values where possible
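# Assign the result back to each column rather than calling replace with inplace=True on a column selection
dfmod["Gender"] = dfmod["Gender"].replace({"Male": "0", "Female": "1"})
dfmod["Credit_Product"] = dfmod["Credit_Product"].replace({"No": "0", "Yes": "1"})
dfmod["Is_Active"] = dfmod["Is_Active"].replace({"No": "0", "Yes": "1"})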
dfmod


Unnamed: 0,Gender,Age,Region_Code,Occupation,Channel_Code,Vintage,Credit_Product,Avg_Account_Balance,Is_Active,Is_Lead
0,1,73,RG268,Other,X3,43,0,1045696,0,0
1,1,30,RG277,Salaried,X1,32,0,581988,0,0
2,1,56,RG268,Self_Employed,X3,26,0,1484315,1,0
3,0,34,RG270,Salaried,X1,19,0,470454,0,0
4,1,30,RG282,Salaried,X1,33,0,886787,0,0
...,...,...,...,...,...,...,...,...,...,...
245720,0,51,RG284,Self_Employed,X3,109,,1925586,0,0
245721,0,27,RG268,Salaried,X1,15,0,862952,1,0
245722,1,26,RG281,Salaried,X1,13,0,670659,0,0
245723,1,28,RG273,Salaried,X1,31,0,407504,0,0
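
A hedged sketch of a related approach: `Series.map` with integer codes converts the labels and the dtype in one step. Unlike `replace`, `map` turns any value missing from the dictionary into NaN, so unexpected labels become visible as nulls. The `toy` frame and the `yes_no` dictionary are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows; Credit_Product contains a missing value, as in the real data
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Credit_Product": ["No", np.nan, "Yes"],
})

yes_no = {"No": 0, "Yes": 1}

# map() returns NaN for anything not listed in the dictionary (including NaN itself)
toy["Gender"] = toy["Gender"].map({"Male": 0, "Female": 1})
toy["Credit_Product"] = toy["Credit_Product"].map(yes_no)

print(toy)
print(toy.dtypes)
```

Because of the NaN, the mapped `Credit_Product` column is stored as a float rather than an integer until the missing values are handled.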


In [8]:
# Check the column data types
dfmod.dtypes

Gender                 object
Age                     int64
Region_Code            object
Occupation             object
Channel_Code           object
Vintage                 int64
Credit_Product         object
Avg_Account_Balance     int64
Is_Active              object
Is_Lead                 int64
dtype: object

In [9]:
# Check how many unique values in all columns
for col in dfmod.columns:
    count = dfmod[col].unique().shape[0]
    print(col + ': ' + str(count))

Gender: 2
Age: 63
Region_Code: 35
Occupation: 4
Channel_Code: 4
Vintage: 66
Credit_Product: 3
Avg_Account_Balance: 135292
Is_Active: 2
Is_Lead: 2
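
Credit_Product reports 3 values here because `unique()` counts NaN as a value of its own. A small sketch on made-up data shows the same count with `DataFrame.nunique`, which excludes NaN unless `dropna=False` is passed.

```python
import numpy as np
import pandas as pd

# Made-up data: Credit_Product has two labels plus one missing value
toy = pd.DataFrame({
    "Credit_Product": ["No", "Yes", np.nan, "No"],
    "Gender": ["Male", "Female", "Male", "Male"],
})

# Default behaviour ignores NaN when counting distinct values
without_nan = toy.nunique()

# dropna=False counts NaN as one additional distinct value
with_nan = toy.nunique(dropna=False)

print(without_nan)
print(with_nan)
```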


In [10]:
# Check presence of null values in any columns
dfmod.isnull().any()

Gender                 False
Age                    False
Region_Code            False
Occupation             False
Channel_Code           False
Vintage                False
Credit_Product          True
Avg_Account_Balance    False
Is_Active              False
Is_Lead                False
dtype: bool

In [11]:
# Calculate total number of null values in all of the columns
for col in dfmod.columns:
    count = dfmod[col].isnull().sum()
    print(col + ' ' + str(count))

Gender 0
Age 0
Region_Code 0
Occupation 0
Channel_Code 0
Vintage 0
Credit_Product 29325
Avg_Account_Balance 0
Is_Active 0
Is_Lead 0
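
To put the count in context, the sketch below computes the share of missing Credit_Product values from the figures reported above (29325 of 245725 rows, roughly 12%). It also shows a vectorized way to get per-column null counts and shares. The `toy` frame is made-up data.

```python
import numpy as np
import pandas as pd

# Figures taken from the notebook output above
n_rows = 245725
n_missing_credit_product = 29325
missing_share = n_missing_credit_product / n_rows
print(f"Share of missing Credit_Product values: {missing_share:.2%}")

# The same idea on a made-up frame, without an explicit loop
toy = pd.DataFrame({
    "Credit_Product": ["No", np.nan, "Yes", "No"],
    "Age": [30, 41, 27, 56],
})
null_counts = toy.isnull().sum()    # number of nulls per column
null_share = toy.isnull().mean()    # fraction of nulls per column

print(null_counts)
print(null_share)
```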


In [12]:
print('Count of credit product values')
dfmod.Credit_Product.value_counts()

Count of credit product values


0    144357
1     72043
Name: Credit_Product, dtype: int64

In [13]:
#Null values make up a significant share of the Credit_Product column, so keep those rows and treat "missing" as its own category, coded "2"

dfmod['Credit_Product'] = dfmod['Credit_Product'].fillna("2")
dfmod

Unnamed: 0,Gender,Age,Region_Code,Occupation,Channel_Code,Vintage,Credit_Product,Avg_Account_Balance,Is_Active,Is_Lead
0,1,73,RG268,Other,X3,43,0,1045696,0,0
1,1,30,RG277,Salaried,X1,32,0,581988,0,0
2,1,56,RG268,Self_Employed,X3,26,0,1484315,1,0
3,0,34,RG270,Salaried,X1,19,0,470454,0,0
4,1,30,RG282,Salaried,X1,33,0,886787,0,0
...,...,...,...,...,...,...,...,...,...,...
245720,0,51,RG284,Self_Employed,X3,109,2,1925586,0,0
245721,0,27,RG268,Salaried,X1,15,0,862952,1,0
245722,1,26,RG281,Salaried,X1,13,0,670659,0,0
245723,1,28,RG273,Salaried,X1,31,0,407504,0,0
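
A minimal sketch of the same idea on made-up values: filling the nulls with a third code keeps every row and lets "unknown" act as a category of its own. The `credit` series below is illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up column using the same string codes as the notebook ("0" = No, "1" = Yes)
credit = pd.Series(["0", "1", np.nan, "0", np.nan], name="Credit_Product")

# Replace missing entries with a separate category code instead of dropping the rows
filled = credit.fillna("2")

counts = filled.value_counts()
print(counts)
print("Remaining nulls:", int(filled.isnull().sum()))
```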


In [14]:
# Recheck presence of nulls
dfmod.isnull().any()
print('-----------------------------------------------------')
for col in dfmod.columns:
    count = dfmod[col].isnull().sum()
    print(col + ' ' + str(count))

Gender                 False
Age                    False
Region_Code            False
Occupation             False
Channel_Code           False
Vintage                False
Credit_Product         False
Avg_Account_Balance    False
Is_Active              False
Is_Lead                False
dtype: bool

-----------------------------------------------------
Gender 0
Age 0
Region_Code 0
Occupation 0
Channel_Code 0
Vintage 0
Credit_Product 0
Avg_Account_Balance 0
Is_Active 0
Is_Lead 0


In [15]:
#Check values in that column once more
print('Count of credit product values')
dfmod.Credit_Product.value_counts()

Count of credit product values


0    144357
1     72043
2     29325
Name: Credit_Product, dtype: int64

In [16]:
# Check the column data types again
dfmod.dtypes

Gender                 object
Age                     int64
Region_Code            object
Occupation             object
Channel_Code           object
Vintage                 int64
Credit_Product         object
Avg_Account_Balance     int64
Is_Active              object
Is_Lead                 int64
dtype: object

In [17]:
#Convert the columns that now hold numeric strings to a numeric type
for col in ['Gender', 'Credit_Product', 'Is_Active']:
    dfmod[col] = pd.to_numeric(dfmod[col])
dfmod.dtypes

Gender                  int64
Age                     int64
Region_Code            object
Occupation             object
Channel_Code           object
Vintage                 int64
Credit_Product          int64
Avg_Account_Balance     int64
Is_Active               int64
Is_Lead                 int64
dtype: object

In [19]:
# Check the balance of the dataset
# The dataset is imbalanced (far fewer leads than non-leads), which must be accounted for in modeling and evaluation
dfmod.Is_Lead.value_counts()



0    187437
1     58288
Name: Is_Lead, dtype: int64
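
To quantify the imbalance, the sketch below turns the counts reported above into class shares and a majority-to-minority ratio. The counts come from the output above; the names `lead_counts`, `lead_share`, and `imbalance_ratio` are hypothetical. On the full frame, `dfmod.Is_Lead.value_counts(normalize=True)` would give the shares directly.

```python
import pandas as pd

# Class counts as reported by value_counts() above
lead_counts = pd.Series({0: 187437, 1: 58288}, name="Is_Lead")

# Share of each class among all rows
lead_share = lead_counts / lead_counts.sum()

# How many non-leads there are for every lead
imbalance_ratio = lead_counts.loc[0] / lead_counts.loc[1]

print(lead_share.round(4))
print(f"Non-leads per lead: {imbalance_ratio:.2f}")
```

Leads make up a little under a quarter of the rows, so plain accuracy would look good even for a model that always predicts 0.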