# Project 3 : Income Qualification

## DESCRIPTION

Identify the level of income qualification needed for families in Latin America.

## Problem Statement Scenario:
Many social programs have a hard time making sure the right people are given enough aid. It’s tricky when a program focuses on the poorest segment of the population. 
This segment of population can’t provide the necessary income and expense records to prove that they qualify.

In Latin America, a popular method called Proxy Means Test (PMT) uses an algorithm to verify income qualification. 
With PMT, agencies use a model that considers a family’s observable household attributes like the material of their walls and ceiling or the assets 
found in their homes to classify them and predict their level of need. While this is an improvement, 
accuracy remains a problem as the region’s population grows and poverty declines.

The Inter-American Development Bank (IDB) believes that new methods beyond traditional econometrics, based on a dataset of Costa Rican household characteristics, 
might help improve PMT’s performance.

## Following actions should be performed:
* Identify the output variable.
* Understand the type of data.
* Check if there are any biases in your dataset.
* Check whether all members of the house have the same poverty level.
* Check if there is a house without a family head.
* Set the same poverty level for the head of the house and all members of a family.
* Count how many null values exist in the columns.
* Remove null value rows of the target variable.
* Predict the accuracy using random forest classifier.
* Check the accuracy using a random forest with cross-validation.



### Import necessary libraries

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns',None) # display all the columns of a DataFrame

### Load train and test data

In [3]:
#importing train dataset
train_data=pd.read_csv('train.csv')
train_data.head()

Unnamed: 0,Id,v2a1,hacdor,rooms,hacapo,v14a,refrig,v18q,v18q1,r4h1,r4h2,r4h3,r4m1,r4m2,r4m3,r4t1,r4t2,r4t3,tamhog,tamviv,escolari,rez_esc,hhsize,paredblolad,paredzocalo,paredpreb,pareddes,paredmad,paredzinc,paredfibras,paredother,pisomoscer,pisocemento,pisoother,pisonatur,pisonotiene,pisomadera,techozinc,techoentrepiso,techocane,techootro,cielorazo,abastaguadentro,abastaguafuera,abastaguano,public,planpri,noelec,coopele,sanitario1,sanitario2,sanitario3,sanitario5,sanitario6,energcocinar1,energcocinar2,energcocinar3,energcocinar4,elimbasu1,elimbasu2,elimbasu3,elimbasu4,elimbasu5,elimbasu6,epared1,epared2,epared3,etecho1,etecho2,etecho3,eviv1,eviv2,eviv3,dis,male,female,estadocivil1,estadocivil2,estadocivil3,estadocivil4,estadocivil5,estadocivil6,estadocivil7,parentesco1,parentesco2,parentesco3,parentesco4,parentesco5,parentesco6,parentesco7,parentesco8,parentesco9,parentesco10,parentesco11,parentesco12,idhogar,hogar_nin,hogar_adul,hogar_mayor,hogar_total,dependency,edjefe,edjefa,meaneduc,instlevel1,instlevel2,instlevel3,instlevel4,instlevel5,instlevel6,instlevel7,instlevel8,instlevel9,bedrooms,overcrowding,tipovivi1,tipovivi2,tipovivi3,tipovivi4,tipovivi5,computer,television,mobilephone,qmobilephone,lugar1,lugar2,lugar3,lugar4,lugar5,lugar6,area1,area2,age,SQBescolari,SQBage,SQBhogar_total,SQBedjefe,SQBhogar_nin,SQBovercrowding,SQBdependency,SQBmeaned,agesq,Target
0,ID_279628684,190000.0,0,3,0,1,1,0,,0,1,1,0,0,0,0,1,1,1,1,10,,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,1,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,21eb7fcc1,0,1,0,1,no,10,no,10.0,0,0,0,1,0,0,0,0,0,1,1.0,0,0,1,0,0,0,0,1,1,1,0,0,0,0,0,1,0,43,100,1849,1,100,0,1.0,0.0,100.0,1849,4
1,ID_f29eb3ddd,135000.0,0,4,0,1,1,1,1.0,0,1,1,0,0,0,0,1,1,1,1,12,,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,1,1,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0e5d7a658,0,1,1,1,8,12,no,12.0,0,0,0,0,0,0,0,1,0,1,1.0,0,0,1,0,0,0,0,1,1,1,0,0,0,0,0,1,0,67,144,4489,1,144,0,1.0,64.0,144.0,4489,4
2,ID_68de51c94,,0,8,0,1,1,0,,0,0,0,0,1,1,0,1,1,1,1,11,,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,1,1,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,1,1,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,2c7317ea8,0,1,1,1,8,no,11,11.0,0,0,0,0,1,0,0,0,0,2,0.5,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,92,121,8464,1,0,0,0.25,64.0,121.0,8464,4
3,ID_d671db89c,180000.0,0,5,0,1,1,1,1.0,0,2,2,1,1,2,1,3,4,4,4,9,1.0,4,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,1,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,2b58d945f,2,2,0,4,yes,11,no,11.0,0,0,0,1,0,0,0,0,0,3,1.333333,0,0,1,0,0,0,0,1,3,1,0,0,0,0,0,1,0,17,81,289,16,121,4,1.777778,1.0,121.0,289,4
4,ID_d56d6f5f5,180000.0,0,5,0,1,1,1,1.0,0,2,2,1,1,2,1,3,4,4,4,11,,4,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,1,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2b58d945f,2,2,0,4,yes,11,no,11.0,0,0,0,0,1,0,0,0,0,3,1.333333,0,0,1,0,0,0,0,1,3,1,0,0,0,0,0,1,0,37,121,1369,16,121,4,1.777778,1.0,121.0,1369,4


In [4]:
#importing test dataset
test_data=pd.read_csv('test.csv')
test_data.head()

Unnamed: 0,Id,v2a1,hacdor,rooms,hacapo,v14a,refrig,v18q,v18q1,r4h1,r4h2,r4h3,r4m1,r4m2,r4m3,r4t1,r4t2,r4t3,tamhog,tamviv,escolari,rez_esc,hhsize,paredblolad,paredzocalo,paredpreb,pareddes,paredmad,paredzinc,paredfibras,paredother,pisomoscer,pisocemento,pisoother,pisonatur,pisonotiene,pisomadera,techozinc,techoentrepiso,techocane,techootro,cielorazo,abastaguadentro,abastaguafuera,abastaguano,public,planpri,noelec,coopele,sanitario1,sanitario2,sanitario3,sanitario5,sanitario6,energcocinar1,energcocinar2,energcocinar3,energcocinar4,elimbasu1,elimbasu2,elimbasu3,elimbasu4,elimbasu5,elimbasu6,epared1,epared2,epared3,etecho1,etecho2,etecho3,eviv1,eviv2,eviv3,dis,male,female,estadocivil1,estadocivil2,estadocivil3,estadocivil4,estadocivil5,estadocivil6,estadocivil7,parentesco1,parentesco2,parentesco3,parentesco4,parentesco5,parentesco6,parentesco7,parentesco8,parentesco9,parentesco10,parentesco11,parentesco12,idhogar,hogar_nin,hogar_adul,hogar_mayor,hogar_total,dependency,edjefe,edjefa,meaneduc,instlevel1,instlevel2,instlevel3,instlevel4,instlevel5,instlevel6,instlevel7,instlevel8,instlevel9,bedrooms,overcrowding,tipovivi1,tipovivi2,tipovivi3,tipovivi4,tipovivi5,computer,television,mobilephone,qmobilephone,lugar1,lugar2,lugar3,lugar4,lugar5,lugar6,area1,area2,age,SQBescolari,SQBage,SQBhogar_total,SQBedjefe,SQBhogar_nin,SQBovercrowding,SQBdependency,SQBmeaned,agesq
0,ID_2f6873615,,0,5,0,1,1,0,,1,1,2,0,1,1,1,2,3,3,3,0,,3,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,1,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,72958b30c,1,2,0,3,.5,no,17,16.5,1,0,0,0,0,0,0,0,0,2,1.5,1,0,0,0,0,1,0,1,2,1,0,0,0,0,0,1,0,4,0,16,9,0,1,2.25,0.25,272.25,16
1,ID_1c78846d2,,0,5,0,1,1,0,,1,1,2,0,1,1,1,2,3,3,3,16,,3,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,1,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,72958b30c,1,2,0,3,.5,no,17,16.5,0,0,0,0,0,0,0,1,0,2,1.5,1,0,0,0,0,1,0,1,2,1,0,0,0,0,0,1,0,41,256,1681,9,0,1,2.25,0.25,272.25,1681
2,ID_e5442cf6a,,0,5,0,1,1,0,,1,1,2,0,1,1,1,2,3,3,3,17,,3,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,1,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,72958b30c,1,2,0,3,.5,no,17,16.5,0,0,0,0,0,0,0,0,1,2,1.5,1,0,0,0,0,1,0,1,2,1,0,0,0,0,0,1,0,41,289,1681,9,0,1,2.25,0.25,272.25,1681
3,ID_a8db26a79,,0,14,0,1,1,1,1.0,0,1,1,0,0,0,0,1,1,1,1,16,,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,1,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,5b598fbc9,0,1,0,1,no,16,no,16.0,0,0,0,0,0,0,0,1,0,1,1.0,1,0,0,0,0,1,0,1,2,1,0,0,0,0,0,1,0,59,256,3481,1,256,0,1.0,0.0,256.0,3481
4,ID_a62966799,175000.0,0,4,0,1,1,1,1.0,0,0,0,0,1,1,0,1,1,1,1,11,,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,1,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1e2fc704e,1,0,0,1,8,no,11,,0,0,0,0,1,0,0,0,0,2,0.5,0,0,1,0,0,0,0,1,1,1,0,0,0,0,0,1,0,18,121,324,1,0,1,0.25,64.0,,324


### Dimensions of train and test datasets

In [5]:
print('Shape of the train_data : The train data has {} rows and {} columns'.format(train_data.shape[0],train_data.shape[1]))
print('Shape of the test_data : The test data has {} rows and {} columns'.format(test_data.shape[0],test_data.shape[1]))

Shape of the train_data : The train data has 9557 rows and 143 columns
Shape of the test_data : The test data has 23856 rows and 142 columns


In [6]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9557 entries, 0 to 9556
Columns: 143 entries, Id to Target
dtypes: float64(8), int64(130), object(5)
memory usage: 10.4+ MB


In [7]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23856 entries, 0 to 23855
Columns: 142 entries, Id to agesq
dtypes: float64(8), int64(129), object(5)
memory usage: 25.8+ MB


### Identify the output/target variable.

In [8]:
for cols in train_data.columns:
    if cols not in test_data.columns:
        print('The target variable is {}'.format(cols))

The target variable is Target
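
The same comparison can be written as a set difference on the column indexes. A minimal sketch on two made-up frames (`train` and `test` here are hypothetical stand-ins, not the competition files):

```python
import pandas as pd

# Hypothetical toy stand-ins for the train and test frames.
train = pd.DataFrame({"Id": ["a", "b"], "rooms": [3, 4], "Target": [4, 2]})
test = pd.DataFrame({"Id": ["c", "d"], "rooms": [5, 2]})

# Columns that are present in train but absent from test.
target_cols = train.columns.difference(test.columns).tolist()
print(target_cols)
```

Any column that appears only in the training frame is a candidate output variable.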


In [9]:
train_data[['Target']].head()

Unnamed: 0,Target
0,4
1,4
2,4
3,4
4,4


### Understand the type of the data.

In [10]:
print(train_data.dtypes.value_counts())

int64      130
float64      8
object       5
dtype: int64


There are 
* 130 columns with int type
* 8 columns with float type
* 5 columns with object type

### Checking for categorical variables in the dataset

In [11]:
cat_fea_in_train_data=[cols for cols in train_data.columns if train_data[cols].dtypes=='O']
print('The categorical variables in the dataset are shown below :\n')
for cols in cat_fea_in_train_data:
    print(cols)

The categorical variables in the dataset are shown below :

Id
idhogar
dependency
edjefe
edjefa


In [12]:
train_data[cat_fea_in_train_data].head()

Unnamed: 0,Id,idhogar,dependency,edjefe,edjefa
0,ID_279628684,21eb7fcc1,no,10,no
1,ID_f29eb3ddd,0e5d7a658,8,12,no
2,ID_68de51c94,2c7317ea8,8,no,11
3,ID_d671db89c,2b58d945f,yes,11,no
4,ID_d56d6f5f5,2b58d945f,yes,11,no


In [13]:
# Drop 'Id' (unique row Id) and 'idhogar' (household-level identifier), as they are not used as features in this analysis
train_data.drop(['Id','idhogar'],axis=1,inplace=True)

In [14]:
train_data['dependency'].unique()

array(['no', '8', 'yes', '3', '.5', '.25', '2', '.66666669', '.33333334',
       '1.5', '.40000001', '.75', '1.25', '.2', '2.5', '1.2', '4',
       '1.3333334', '2.25', '.22222222', '5', '.83333331', '.80000001',
       '6', '3.5', '1.6666666', '.2857143', '1.75', '.71428573',
       '.16666667', '.60000002'], dtype=object)

In [15]:
train_data['edjefe'].unique()

array(['10', '12', 'no', '11', '9', '15', '4', '6', '8', '17', '7', '16',
       '14', '5', '21', '2', '19', 'yes', '3', '18', '13', '20'],
      dtype=object)

In [16]:
train_data['edjefa'].unique()

array(['no', '11', '4', '10', '9', '15', '7', '14', '13', '8', '17', '6',
       '5', '3', '16', '19', 'yes', '21', '12', '2', '20', '18'],
      dtype=object)

### Converting categorical variables into numerical variables

In [17]:
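def to_float(val):
    # 'yes' and 'no' are encoded as 1 and 0; every other value is a numeric string.
    # The function is not called 'map', so it does not shadow the Python built-in.
    if val=='yes':
        return(float(1))
    elif val=='no':
        return(float(0))
    else:
        return(float(val))

In [18]:
train_data['dependency']=train_data['dependency'].apply(to_float)
train_data['edjefe']=train_data['edjefe'].apply(to_float)
train_data['edjefa']=train_data['edjefa'].apply(to_float)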

In [19]:
print(train_data.dtypes.value_counts())

int64      130
float64     11
dtype: int64


The result above shows no 'object' dtype in the dataset, which means all the 'object' variables have been converted into numerical variables.
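
The element-wise conversion above can also be written with vectorized pandas calls. A minimal sketch on a hypothetical toy column, assuming the same encoding as above ('yes' is 1, 'no' is 0, everything else is a numeric string):

```python
import pandas as pd

# Hypothetical toy column with the same kind of values as 'dependency'.
raw = pd.Series(["no", "8", "yes", ".5"])

# Replace the two labels with numeric strings, then parse the whole column.
converted = pd.to_numeric(raw.replace({"yes": "1", "no": "0"})).astype(float)
print(converted.tolist())
```

`pd.to_numeric` raises on any value it cannot parse, which surfaces unexpected labels instead of silently passing them through.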

### Identify the variables with zero variance

In [20]:
from sklearn.feature_selection import VarianceThreshold
var_thres=VarianceThreshold(threshold=0)
var_thres.fit(train_data)
constant_columns=[column for column in train_data.columns
                  if column not in train_data.columns[var_thres.get_support()]]
print('The columns with zero variance are :')
for column in constant_columns:
    print(column)

The columns with zero variance are :
elimbasu5


Since 'elimbasu5' has no variability in the dataset, we can drop this variable.

In [21]:
train_data.drop('elimbasu5',axis=1,inplace=True)
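
A constant column can also be found without scikit-learn by counting distinct values per column. A minimal sketch on hypothetical toy data (the column names are borrowed from the dataset, the values are made up):

```python
import pandas as pd

# Hypothetical toy frame with one constant column.
toy = pd.DataFrame({
    "rooms": [3, 4, 5],
    "elimbasu5": [0, 0, 0],
    "age": [43, 67, 92],
})

# A column with a single distinct value has zero variance.
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]
reduced = toy.drop(columns=constant_cols)
print(constant_cols)
```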

### Check if there are any biases in the dataset.

In [22]:
import scipy.stats
from scipy.stats import chi2

In [23]:
# r4t3 - Total persons in the household
# hogar_total - total individuals in the household
contingency_tab=pd.crosstab(train_data['r4t3'],train_data['hogar_total'])
Observed_Values=contingency_tab.values
b=scipy.stats.chi2_contingency(contingency_tab)
Expected_Values = b[3]
no_of_rows=len(contingency_tab.iloc[0:2,0])
no_of_columns=len(contingency_tab.iloc[0,0:2])
df=(no_of_rows-1)*(no_of_columns-1)
print("Degree of Freedom:-",df)
chi_square=sum([(o-e)**2./e for o,e in zip(Observed_Values,Expected_Values)])
chi_square_statistic=chi_square[0]+chi_square[1]
print("chi-square statistic:-",chi_square_statistic)
alpha=0.05
critical_value=chi2.ppf(q=1-alpha,df=df)
print('critical_value:',critical_value)
p_value=1-chi2.cdf(x=chi_square_statistic,df=df)
print('p-value:',p_value)
print('Significance level: ',alpha)
print('Degree of Freedom: ',df)
print('chi-square statistic:',chi_square_statistic)
print('critical_value:',critical_value)
print('p-value:',p_value)
if chi_square_statistic>=critical_value:
    print("Reject H0,There is a relationship between 2 categorical variables")
else:
    print("Retain H0,There is no relationship between 2 categorical variables")
    
if p_value<=alpha:
    print("Reject H0,There is a relationship between 2 categorical variables")
else:
    print("Retain H0,There is no relationship between 2 categorical variables")

Degree of Freedom:- 1
chi-square statistic:- 17022.072400560897
critical_value: 3.841458820694124
p-value: 0.0
Significance level:  0.05
Degree of Freedom:  1
chi-square statistic: 17022.072400560897
critical_value: 3.841458820694124
p-value: 0.0
Reject H0,There is a relationship between 2 categorical variables
Reject H0,There is a relationship between 2 categorical variables


The above result shows that 'r4t3' and 'hogar_total' are related to each other, so for good results we can keep just one of them.
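
The manual computation in the cell above appears to fix the degrees of freedom at 1 and to add up only the first two columns of the contingency table. `scipy.stats.chi2_contingency`, which that cell already calls to get the expected values, also returns the statistic, the p-value and the degrees of freedom for the full table. A minimal sketch on hypothetical toy data (the values are made up, only the column names come from the dataset):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data: two household-size columns that always agree.
toy = pd.DataFrame({
    "r4t3":        [1, 1, 2, 2, 2, 3, 3, 3, 3, 2, 1, 3],
    "hogar_total": [1, 1, 2, 2, 2, 3, 3, 3, 3, 2, 1, 3],
})

table = pd.crosstab(toy["r4t3"], toy["hogar_total"])

# Statistic, p-value, degrees of freedom and expected counts for the whole table.
stat, p_value, dof, expected = chi2_contingency(table)

alpha = 0.05
decision = "Reject H0" if p_value <= alpha else "Retain H0"
print(f"chi2={stat:.2f}, dof={dof}, p={p_value:.5f}, {decision}")
```

For a table with r rows and c columns the degrees of freedom are (r-1)(c-1), so a 3x3 table gives 4 rather than 1.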

In [24]:
# tipovivi3 - rented (=1 if rented)
# v2a1 - Monthly rent payment
contingency_tab=pd.crosstab(train_data['tipovivi3'],train_data['v2a1'])
Observed_Values=contingency_tab.values
b=scipy.stats.chi2_contingency(contingency_tab)
Expected_Values = b[3]
no_of_rows=len(contingency_tab.iloc[0:2,0])
no_of_columns=len(contingency_tab.iloc[0,0:2])
df=(no_of_rows-1)*(no_of_columns-1)
print("Degree of Freedom:-",df)
chi_square=sum([(o-e)**2./e for o,e in zip(Observed_Values,Expected_Values)])
chi_square_statistic=chi_square[0]+chi_square[1]
print("chi-square statistic:-",chi_square_statistic)
alpha=0.05
critical_value=chi2.ppf(q=1-alpha,df=df)
print('critical_value:',critical_value)
p_value=1-chi2.cdf(x=chi_square_statistic,df=df)
print('p-value:',p_value)
print('Significance level: ',alpha)
print('Degree of Freedom: ',df)
print('chi-square statistic:',chi_square_statistic)
print('critical_value:',critical_value)
print('p-value:',p_value)
if chi_square_statistic>=critical_value:
    print("Reject H0,There is a relationship between 2 categorical variables")
else:
    print("Retain H0,There is no relationship between 2 categorical variables")
    
if p_value<=alpha:
    print("Reject H0,There is a relationship between 2 categorical variables")
else:
    print("Retain H0,There is no relationship between 2 categorical variables")

Degree of Freedom:- 1
chi-square statistic:- 54.04781105990782
critical_value: 3.841458820694124
p-value: 1.9562129693895258e-13
Significance level:  0.05
Degree of Freedom:  1
chi-square statistic: 54.04781105990782
critical_value: 3.841458820694124
p-value: 1.9562129693895258e-13
Reject H0,There is a relationship between 2 categorical variables
Reject H0,There is a relationship between 2 categorical variables


The above result shows that 'tipovivi3' and 'v2a1' are related to each other, so for good results we can keep just one of them.

In [25]:
#v18q - owns a tablet
#v18q1 - number of tablets household owns
contingency_tab=pd.crosstab(train_data['v18q'],train_data['v18q1'])
Observed_Values=contingency_tab.values
b=scipy.stats.chi2_contingency(contingency_tab)
Expected_Values = b[3]
no_of_rows=len(contingency_tab.iloc[0:2,0])
no_of_columns=len(contingency_tab.iloc[0,0:2])
df=(no_of_rows-1)*(no_of_columns-1)
print("Degree of Freedom:-",df)
chi_square=sum([(o-e)**2./e for o,e in zip(Observed_Values,Expected_Values)])
chi_square_statistic=chi_square[0]+chi_square[1]
print("chi-square statistic:-",chi_square_statistic)
alpha=0.05
critical_value=chi2.ppf(q=1-alpha,df=df)
print('critical_value:',critical_value)
p_value=1-chi2.cdf(x=chi_square_statistic,df=df)
print('p-value:',p_value)
print('Significance level: ',alpha)
print('Degree of Freedom: ',df)
print('chi-square statistic:',chi_square_statistic)
print('critical_value:',critical_value)
print('p-value:',p_value)
if chi_square_statistic>=critical_value:
    print("Reject H0,There is a relationship between 2 categorical variables")
else:
    print("Retain H0,There is no relationship between 2 categorical variables")
    
if p_value<=alpha:
    print("Reject H0,There is a relationship between 2 categorical variables")
else:
    print("Retain H0,There is no relationship between 2 categorical variables")

Degree of Freedom:- 0
chi-square statistic:- 0.0
critical_value: nan
p-value: nan
Significance level:  0.05
Degree of Freedom:  0
chi-square statistic: 0.0
critical_value: nan
p-value: nan
Retain H0,There is no relationship between 2 categorical variables
Retain H0,There is no relationship between 2 categorical variables


For 'v18q' and 'v18q1' the test is degenerate: the degrees of freedom are 0 and both the critical value and the p-value are NaN, so the "Retain H0" message printed above is not a meaningful result. 'v18q1' is missing for every row where 'v18q' is 0, which leaves the contingency table with a single row. In practice the two variables are closely related, because both describe whether the household owns tablets.
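
A quick way to see why this test yields NaN is to check whether `v18q1` is missing exactly where `v18q` is 0, and what shape the resulting contingency table has. A sketch on hypothetical toy rows (not the competition data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows: the tablet count is missing when no tablet is owned.
toy = pd.DataFrame({
    "v18q":  [0, 1, 0, 1, 1],
    "v18q1": [np.nan, 1.0, np.nan, 2.0, 1.0],
})

# True when the count is missing on exactly the rows that report no tablet.
missing_matches_no_tablet = bool((toy["v18q1"].isna() == (toy["v18q"] == 0)).all())

# crosstab ignores rows with a missing key, so only the v18q == 1 row remains.
table = pd.crosstab(toy["v18q"], toy["v18q1"])
print(missing_matches_no_tablet, table.shape)
```

A table with a single row has (1-1)(c-1) = 0 degrees of freedom, so no chi-square test can be run on it.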

_**Hence, from some of the results above, we can conclude that there is bias in the dataset.**_

### Check if there is a house without a family head.

In [26]:
#parentesco1 - household head (=1 if present)
train_data['parentesco1'].value_counts()

0    6584
1    2973
Name: parentesco1, dtype: int64

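The result above counts individuals rather than houses: 2973 rows are household heads (parentesco1 = 1) and 6584 rows are other household members. Finding houses without a family head requires grouping the rows by household.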
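
Because `parentesco1` is recorded per person, a per-house answer needs the rows grouped by household. A minimal sketch on hypothetical toy rows, assuming the `idhogar` column is still present (this notebook drops it in an earlier cell, so the check would have to run before that):

```python
import pandas as pd

# Hypothetical toy rows: household h2 has no member flagged as head.
toy = pd.DataFrame({
    "idhogar":     ["h1", "h1", "h2", "h2", "h3"],
    "parentesco1": [1, 0, 0, 0, 1],
})

# Number of heads recorded in each household.
heads_per_house = toy.groupby("idhogar")["parentesco1"].sum()

# Households in which nobody is flagged as head.
houses_without_head = heads_per_house[heads_per_house == 0].index.tolist()
print(houses_without_head)
```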

In [27]:
# edjefe - years of education of male head of household, based on the interaction of escolari (years of education), head of household and gender, yes=1 and no=0
# edjefa - years of education of female head of household, based on the interaction of escolari (years of education), head of household and gender, yes=1 and no=0

pd.crosstab(train_data['edjefa'],train_data['edjefe'])

edjefe,0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0,12.0,13.0,14.0,15.0,16.0,17.0,18.0,19.0,20.0,21.0
edjefa
0.0,435,123,194,307,137,222,1845,234,257,486,111,751,113,103,208,285,134,202,19,14,7,43
1.0,69,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2.0,84,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3.0,152,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4.0,136,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5.0,176,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6.0,947,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7.0,179,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8.0,217,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9.0,237,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


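_**The cross table above has 435 rows where both 'edjefe' and 'edjefa' are 0. These rows are individuals, not families, and after the yes/no mapping a 0 can also mean a head with no years of education, so this count does not by itself show that 435 families lack a family head.**_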

### Count how many null values are existing in columns.

In [30]:
#checking whether there are missing values in the dataset
train_data.isnull().sum().any()

True

In [31]:
train_data.isnull().sum().value_counts()

0       135
5         2
7928      1
6860      1
7342      1
dtype: int64

The above result says that:
* 135 columns have 0 missing values,
* 2 columns have 5 missing values,
* 1 column has 7928 missing values,
* 1 column has 6860 missing values, and
* 1 column has 7342 missing values.
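
To see which columns those counts belong to, the null counts can be kept next to the column names and expressed as a share of the rows. A minimal sketch on hypothetical toy data (column names borrowed from the dataset, values made up):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with missing values in two columns.
toy = pd.DataFrame({
    "v2a1": [190000.0, np.nan, np.nan, 180000.0],
    "rooms": [3, 4, 8, 5],
    "meaneduc": [10.0, 12.0, np.nan, 11.0],
})

null_counts = toy.isnull().sum()
summary = pd.DataFrame({
    "missing": null_counts,
    "percent": (null_counts / len(toy) * 100).round(1),
})

# Keep only the columns that actually have missing values, largest first.
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
```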

In [32]:
#checking whether there are missing values in the target variable
train_data['Target'].isnull().sum().any()

False

In [33]:
train_data['Target'].value_counts()

4    5996
2    1597
3    1209
1     755
Name: Target, dtype: int64

From the above result we can see that the target variable does not have any missing values.

In [34]:
float_var_in_train_data=[cols for cols in train_data.columns if train_data[cols].dtypes=='float64']
print('Variables with type float in dataset are :\n')
for cols in float_var_in_train_data:
    print(cols)

Variables with type float in dataset are :

v2a1
v18q1
rez_esc
dependency
edjefe
edjefa
meaneduc
overcrowding
SQBovercrowding
SQBdependency
SQBmeaned


In [35]:
train_data[float_var_in_train_data].isnull().sum()

v2a1               6860
v18q1              7342
rez_esc            7928
dependency            0
edjefe                0
edjefa                0
meaneduc              5
overcrowding          0
SQBovercrowding       0
SQBdependency         0
SQBmeaned             5
dtype: int64

From the above result we can observe that all of the missing values are in variables of float type.

In [36]:
pd.crosstab(train_data['r4t3'],train_data['hogar_total'])

hogar_total,1,2,3,4,5,6,7,8,9,10,11,12,13
r4t3
1,378,0,0,0,0,0,0,0,0,0,0,0,0
2,7,1348,0,0,0,0,0,0,0,0,0,0,0
3,1,10,2249,0,0,0,0,0,0,0,0,0,0
4,0,2,6,2439,0,0,0,0,0,0,0,0,0
5,0,2,3,8,1585,0,0,0,0,0,0,0,0
6,0,0,0,0,5,819,0,0,0,0,0,0,0
7,0,0,0,4,0,0,364,0,0,0,0,0,0
8,0,0,0,0,0,0,0,96,0,0,0,0,0
9,0,0,0,0,0,0,0,0,90,0,0,0,0
10,0,0,0,0,0,0,0,0,0,60,0,0,0


In [37]:
pd.crosstab(train_data['tipovivi3'],train_data['v2a1'])

v2a1,0.0,12000.0,13000.0,14000.0,15000.0,16000.0,17000.0,20000.0,23000.0,25000.0,25310.0,26000.0,27000.0,28000.0,30000.0,32000.0,32600.0,35000.0,36350.0,40000.0,42500.0,44000.0,45000.0,46500.0,50000.0,51000.0,52000.0,52831.0,55000.0,58731.0,60000.0,62539.0,65000.0,68000.0,70000.0,72000.0,72554.0,73000.0,75000.0,77000.0,78000.0,78039.0,80000.0,83333.0,84529.0,85000.0,89000.0,90000.0,92000.0,93000.0,94000.0,95000.0,96000.0,97000.0,100000.0,100297.0,102000.0,104000.0,105000.0,105661.0,106000.0,107000.0,108000.0,110000.0,115000.0,118097.0,119813.0,120000.0,125000.0,125518.0,127000.0,130000.0,132000.0,135000.0,140000.0,142635.0,145000.0,150000.0,155000.0,159751.0,160000.0,163000.0,165000.0,168000.0,169000.0,170000.0,171162.0,172000.0,175000.0,176000.0,178000.0,180000.0,185000.0,188000.0,190000.0,191500.0,200000.0,205000.0,210000.0,215000.0,219087.0,220000.0,225000.0,230000.0,234000.0,240000.0,245000.0,249896.0,250000.0,253000.0,260000.0,268153.0,270000.0,275000.0,278000.0,280000.0,283000.0,285000.0,285270.0,288750.0,290975.0,294000.0,300000.0,320000.0,325000.0,328000.0,342324.0,350000.0,357000.0,360000.0,380000.0,399378.0,400000.0,420000.0,427905.0,432000.0,450000.0,456432.0,470000.0,480000.0,500000.0,510000.0,525000.0,540000.0,542013.0,550000.0,564834.0,570540.0,600000.0,620000.0,684648.0,700000.0,770229.0,800000.0,855810.0,1000000.0,2353477.0
tipovivi3
0,29,0,4,3,0,0,4,14,5,12,0,5,14,4,19,10,2,3,4,17,5,2,12,5,17,3,2,0,2,0,0,0,0,4,11,8,0,4,3,4,2,0,10,2,0,0,5,7,0,0,7,0,4,0,63,0,2,0,0,0,3,3,0,17,5,0,1,23,0,0,5,11,10,0,14,0,2,57,0,2,7,4,0,0,4,4,0,3,5,2,5,30,10,4,1,0,55,0,0,4,2,12,0,4,1,8,0,0,43,1,3,0,4,0,2,5,3,5,11,0,4,2,47,3,0,4,0,25,3,0,5,0,18,2,2,4,6,12,2,4,19,1,2,3,1,8,4,8,11,3,3,7,3,0,8,7,2
1,0,3,0,0,3,2,0,8,0,9,1,0,0,0,29,0,0,17,0,55,0,0,13,0,101,0,0,3,11,2,57,3,19,0,68,0,3,0,11,0,0,4,94,0,7,16,0,74,3,2,0,6,0,3,118,2,0,4,3,6,0,0,1,36,3,1,0,126,20,1,0,42,0,15,25,5,8,176,2,0,39,0,5,4,0,37,7,0,18,0,0,47,10,0,5,2,104,6,6,2,0,18,8,7,0,3,4,1,32,0,7,1,8,2,0,0,0,0,10,2,0,0,29,0,2,0,9,28,0,6,0,5,4,0,3,0,3,7,0,0,0,0,0,0,0,0,0,17,0,0,0,0,0,4,3,0,0


In [38]:
pd.crosstab(train_data['v18q1'],train_data['v18q'])

v18q,1
v18q1
1.0,1586
2.0,444
3.0,129
4.0,37
5.0,13
6.0,6


The variables 'v2a1', 'v18q1' and 'rez_esc' each have more than 70% null values. For 'v2a1', families that own their house pay no rent (so the value should be 0), and similarly for 'v18q1' there can be families with 0 tablets.

Hence we can drop 'tipovivi3' (as 'v2a1' alone shows whether the house is rented) and 'v18q' (as 'v18q1' alone shows whether the household owns a tablet).

Also drop 'r4t3' (as 'hogar_total' alone gives the count of people in a house) and 'rez_esc', which is mostly null.

In [39]:
train_data.drop(['rez_esc','r4t3','tipovivi3','v18q'],axis=1,inplace=True)

### Handling the missing values

In [40]:
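# Assign the filled column back instead of calling fillna(inplace=True) on a column selection
train_data['v2a1']=train_data['v2a1'].fillna(0)
train_data['v18q1']=train_data['v18q1'].fillna(0)
train_data['meaneduc']=train_data['meaneduc'].fillna(train_data['meaneduc'].mean())
train_data['SQBmeaned']=train_data['SQBmeaned'].fillna(train_data['SQBmeaned'].mean())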

In [41]:
if train_data.isnull().sum().any()==True:
    print("There are missing values in the dataset")
else:
    print("There are no missing values in the dataset")

There are no missing values in the dataset


### Set the poverty level of the members and the head of the house same in a family.

Households below the poverty level are likely to be those that do not own a house and pay a low rent; the level also depends on whether the house is in an urban or a rural area.

In [42]:
people_paying_rent=train_data[train_data['v2a1']!=0]
poverty_level=people_paying_rent.groupby('area1')['v2a1'].apply(np.median)
poverty_level

area1
0     80000.0
1    140000.0
Name: v2a1, dtype: float64

* In rural areas (area1 = 0), households paying less than the median rent of 80000 are treated as below the poverty level.
* In urban areas (area1 = 1), households paying less than the median rent of 140000 are treated as below the poverty level.
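
The two thresholds can be applied per row by looking up the cutoff for each row's own area. A minimal sketch on hypothetical toy rows; the cutoffs are the medians printed above, and treating a rent equal to the cutoff as "Above" is an assumption of this sketch:

```python
import pandas as pd

# Hypothetical toy rows (area1: 1 = urban, 0 = rural).
toy = pd.DataFrame({
    "area1": [0, 0, 1, 1, 1],
    "v2a1":  [60000.0, 90000.0, 100000.0, 140000.0, 200000.0],
})

# Median rents per area, copied from the output above.
thresholds = {0: 80000.0, 1: 140000.0}

# Look up each row's own cutoff, then compare the rent against it.
cutoff = toy["area1"].map(thresholds)
toy["poverty"] = (toy["v2a1"] < cutoff).map(
    {True: "Below poverty level", False: "Above poverty level"}
)
print(toy)
```

This gives every row exactly one label, instead of a combined label that still depends on the area.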

In [43]:
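def poverty(x):
    # Note: the rural median printed above is 80000, but the cutoff used here is 8000.
    # No non-zero rent in the data is below 8000, so this first branch never applies.
    if x<8000:
        return('Below poverty level')    
    elif x>140000:
        return('Above poverty level')
    elif x<140000:
        return('Below poverty level : Urban ; Above poverty level : Rural ')   
    # A rent of exactly 140000 matches no branch and returns None.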

In [44]:
df=people_paying_rent['v2a1'].apply(poverty)
df

0                                     Above poverty level
1       Below poverty level : Urban ; Above poverty le...
3                                     Above poverty level
4                                     Above poverty level
5                                     Above poverty level
                              ...                        
9552    Below poverty level : Urban ; Above poverty le...
9553    Below poverty level : Urban ; Above poverty le...
9554    Below poverty level : Urban ; Above poverty le...
9555    Below poverty level : Urban ; Above poverty le...
9556    Below poverty level : Urban ; Above poverty le...
Name: v2a1, Length: 2668, dtype: object

In [45]:
df.shape

(2668,)

In [46]:
pd.crosstab(df,people_paying_rent['area1'])

area1,0,1
v2a1
Above poverty level,139,1103
Below poverty level : Urban ; Above poverty level : Rural,306,1081


From the above results we can conclude that:
* ***A total of 1242 people are above the poverty level regardless of whether the area is rural or urban.***
* ***The poverty level of the remaining people depends on their area:***

   > **Rural :**
   > * Above poverty level = 445
    
   > **Urban :** 
   > * Above poverty level = 1103
   > * Below poverty level = 1081
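
The action list also asks whether all members of a house share one poverty level, and to align the members with the head. The cells above approach poverty through rent; one alternative reading uses the `Target` label directly. A minimal sketch on hypothetical toy rows, assuming the household identifier `idhogar` is still available (this notebook drops it earlier) and that each household has exactly one head:

```python
import pandas as pd

# Hypothetical toy rows: one member of household h1 disagrees with its head.
toy = pd.DataFrame({
    "idhogar":     ["h1", "h1", "h1", "h2", "h2"],
    "parentesco1": [1, 0, 0, 1, 0],
    "Target":      [4, 4, 3, 2, 2],
})

# Households whose members do not all share a single Target value.
n_levels = toy.groupby("idhogar")["Target"].nunique()
inconsistent = n_levels[n_levels > 1].index.tolist()

# Target of the head of each household, indexed by household id.
head_target = toy.loc[toy["parentesco1"] == 1].set_index("idhogar")["Target"]

# Overwrite every member's Target with the head's value; households without
# a head (none in this toy data) keep their original values.
toy["Target"] = toy["idhogar"].map(head_target).fillna(toy["Target"]).astype(int)
print(inconsistent, toy["Target"].tolist())
```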

In [47]:
X_train_data=train_data.drop('Target',axis=1)
X_train_data.head(3)

Unnamed: 0,v2a1,hacdor,rooms,hacapo,v14a,refrig,v18q1,r4h1,r4h2,r4h3,r4m1,r4m2,r4m3,r4t1,r4t2,tamhog,tamviv,escolari,hhsize,paredblolad,paredzocalo,paredpreb,pareddes,paredmad,paredzinc,paredfibras,paredother,pisomoscer,pisocemento,pisoother,pisonatur,pisonotiene,pisomadera,techozinc,techoentrepiso,techocane,techootro,cielorazo,abastaguadentro,abastaguafuera,abastaguano,public,planpri,noelec,coopele,sanitario1,sanitario2,sanitario3,sanitario5,sanitario6,energcocinar1,energcocinar2,energcocinar3,energcocinar4,elimbasu1,elimbasu2,elimbasu3,elimbasu4,elimbasu6,epared1,epared2,epared3,etecho1,etecho2,etecho3,eviv1,eviv2,eviv3,dis,male,female,estadocivil1,estadocivil2,estadocivil3,estadocivil4,estadocivil5,estadocivil6,estadocivil7,parentesco1,parentesco2,parentesco3,parentesco4,parentesco5,parentesco6,parentesco7,parentesco8,parentesco9,parentesco10,parentesco11,parentesco12,hogar_nin,hogar_adul,hogar_mayor,hogar_total,dependency,edjefe,edjefa,meaneduc,instlevel1,instlevel2,instlevel3,instlevel4,instlevel5,instlevel6,instlevel7,instlevel8,instlevel9,bedrooms,overcrowding,tipovivi1,tipovivi2,tipovivi4,tipovivi5,computer,television,mobilephone,qmobilephone,lugar1,lugar2,lugar3,lugar4,lugar5,lugar6,area1,area2,age,SQBescolari,SQBage,SQBhogar_total,SQBedjefe,SQBhogar_nin,SQBovercrowding,SQBdependency,SQBmeaned,agesq
0,190000.0,0,3,0,1,1,0.0,0,1,1,0,0,0,0,1,1,1,10,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,1,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0.0,10.0,0.0,10.0,0,0,0,1,0,0,0,0,0,1,1.0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,1,0,43,100,1849,1,100,0,1.0,0.0,100.0,1849
1,135000.0,0,4,0,1,1,1.0,0,1,1,0,0,0,0,1,1,1,12,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,1,1,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,1,0,0,1,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,8.0,12.0,0.0,12.0,0,0,0,0,0,0,0,1,0,1,1.0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,1,0,67,144,4489,1,144,0,1.0,64.0,144.0,4489
2,0.0,0,8,0,1,1,0.0,0,0,0,0,1,1,0,1,1,1,11,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,1,1,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,1,1,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,8.0,0.0,11.0,11.0,0,0,0,0,1,0,0,0,0,2,0.5,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,92,121,8464,1,0,0,0.25,64.0,121.0,8464


In [48]:
y_train_data=train_data['Target']
y_train_data.head()

0    4
1    4
2    4
3    4
4    4
Name: Target, dtype: int64

In [49]:
print('Shape of X_train_data : ',X_train_data.shape)
print('Shape of y_train_data : ',y_train_data.shape)

Shape of X_train_data :  (9557, 135)
Shape of y_train_data :  (9557,)


### Applying Standard Scaler to the dataset

In [50]:
#importing standard scaler
from sklearn.preprocessing import StandardScaler

In [51]:
#creating a scaler object
ss=StandardScaler()

In [52]:
scaled_X_train_data=ss.fit_transform(X_train_data)
scaled_X_train_data=pd.DataFrame(scaled_X_train_data,columns=X_train_data.columns)
scaled_X_train_data.head()

Unnamed: 0,v2a1,hacdor,rooms,hacapo,v14a,refrig,v18q1,r4h1,r4h2,r4h3,r4m1,r4m2,r4m3,r4t1,r4t2,tamhog,tamviv,escolari,hhsize,paredblolad,paredzocalo,paredpreb,pareddes,paredmad,paredzinc,paredfibras,paredother,pisomoscer,pisocemento,pisoother,pisonatur,pisonotiene,pisomadera,techozinc,techoentrepiso,techocane,techootro,cielorazo,abastaguadentro,abastaguafuera,abastaguano,public,planpri,noelec,coopele,sanitario1,sanitario2,sanitario3,sanitario5,sanitario6,energcocinar1,energcocinar2,energcocinar3,energcocinar4,elimbasu1,elimbasu2,elimbasu3,elimbasu4,elimbasu6,epared1,epared2,epared3,etecho1,etecho2,etecho3,eviv1,eviv2,eviv3,dis,male,female,estadocivil1,estadocivil2,estadocivil3,estadocivil4,estadocivil5,estadocivil6,estadocivil7,parentesco1,parentesco2,parentesco3,parentesco4,parentesco5,parentesco6,parentesco7,parentesco8,parentesco9,parentesco10,parentesco11,parentesco12,hogar_nin,hogar_adul,hogar_mayor,hogar_total,dependency,edjefe,edjefa,meaneduc,instlevel1,instlevel2,instlevel3,instlevel4,instlevel5,instlevel6,instlevel7,instlevel8,instlevel9,bedrooms,overcrowding,tipovivi1,tipovivi2,tipovivi4,tipovivi5,computer,television,mobilephone,qmobilephone,lugar1,lugar2,lugar3,lugar4,lugar5,lugar6,area1,area2,age,SQBescolari,SQBage,SQBhogar_total,SQBedjefe,SQBhogar_nin,SQBovercrowding,SQBdependency,SQBmeaned,agesq
0,1.313389,-0.198986,-1.331829,-0.155629,0.072521,0.210363,-0.466827,-0.566874,-0.53947,-0.794982,-0.576502,-1.781038,-1.708716,-0.749476,-1.541297,-1.692353,-1.649279,0.59183,-1.692353,0.826716,-0.290341,-0.481219,-0.093029,-0.361577,-0.115121,-0.038302,-0.038302,0.665911,-0.535529,-0.030702,-0.032364,-0.129237,-0.265655,-5.693512,7.453207,-0.056115,-0.046927,0.690082,0.191183,-0.180949,-0.059752,0.360281,-0.01772,-0.046927,-0.353012,-0.062342,1.916598,-1.800528,-0.124987,-0.050175,-0.04344,-0.97939,1.086952,-0.229706,0.364531,-0.175955,-0.303897,-0.038302,-0.035457,-0.33783,1.433294,-1.15171,2.605555,-0.636093,-1.183747,2.982176,-0.58166,-1.351237,-0.247111,1.033738,-1.033738,-0.399788,-0.374953,-0.60568,5.50767,-0.258818,-0.185222,-0.722032,1.488153,-0.474943,-0.785899,-0.110846,-0.098048,-0.232219,-0.100732,-0.049116,-0.11605,-0.056115,-0.11324,-0.093599,-1.029646,-1.365718,-0.475749,-1.692353,-0.715826,0.934615,-0.628132,0.184447,-0.39449,-0.45346,-0.512006,2.09603,-0.356377,-0.134976,-0.125847,-0.403126,-0.124987,-1.842307,-0.738356,-1.273275,-0.334359,-0.131725,-0.299355,-0.337253,-0.630742,0.15912,-1.228106,0.837702,-0.319656,-0.257896,-0.300391,-0.321838,-0.296232,0.632039,-0.632039,0.402406,0.335757,0.117871,-0.967066,0.592794,-0.553536,-0.544758,-0.311754,-0.027692,0.117871
1,0.809548,-0.198986,-0.650771,-0.155629,0.072521,0.210363,0.967727,-0.566874,-0.53947,-0.794982,-0.576502,-1.781038,-1.708716,-0.749476,-1.541297,-1.692353,-1.649279,1.014607,-1.692353,-1.209605,-0.290341,-0.481219,-0.093029,2.765659,-0.115121,-0.038302,-0.038302,-1.501702,-0.535529,-0.030702,-0.032364,-0.129237,3.764285,0.175639,-0.13417,-0.056115,-0.046927,0.690082,0.191183,-0.180949,-0.059752,0.360281,-0.01772,-0.046927,-0.353012,-0.062342,1.916598,-1.800528,-0.124987,-0.050175,-0.04344,1.021044,-0.920004,-0.229706,0.364531,-0.175955,-0.303897,-0.038302,-0.035457,-0.33783,1.433294,-1.15171,-0.383795,1.572096,-1.183747,-0.335326,1.719218,-1.351237,-0.247111,1.033738,-1.033738,-0.399788,-0.374953,-0.60568,5.50767,-0.258818,-0.185222,-0.722032,1.488153,-0.474943,-0.785899,-0.110846,-0.098048,-0.232219,-0.100732,-0.049116,-0.11605,-0.056115,-0.11324,-0.093599,-1.029646,-1.365718,1.198922,-1.692353,4.265778,1.31584,-0.628132,0.664479,-0.39449,-0.45346,-0.512006,-0.477092,-0.356377,-0.134976,-0.125847,2.480613,-0.124987,-1.842307,-0.738356,-1.273275,-0.334359,-0.131725,-0.299355,-0.337253,-0.630742,0.15912,-1.228106,0.837702,-0.319656,-0.257896,-0.300391,-0.321838,-0.296232,0.632039,-0.632039,1.512945,0.908871,1.634149,-0.967066,1.15372,-0.553536,-0.544758,4.803672,0.442959,1.634149
2,-0.427153,-0.198986,2.07346,-0.155629,0.072521,0.210363,-0.466827,-0.566874,-1.504237,-1.636173,-0.576502,-0.70923,-0.879604,-0.749476,-1.541297,-1.692353,-1.649279,0.803218,-1.692353,-1.209605,-0.290341,-0.481219,-0.093029,2.765659,-0.115121,-0.038302,-0.038302,0.665911,-0.535529,-0.030702,-0.032364,-0.129237,-0.265655,0.175639,-0.13417,-0.056115,-0.046927,0.690082,0.191183,-0.180949,-0.059752,0.360281,-0.01772,-0.046927,-0.353012,-0.062342,1.916598,-1.800528,-0.124987,-0.050175,-0.04344,1.021044,-0.920004,-0.229706,0.364531,-0.175955,-0.303897,-0.038302,-0.035457,-0.33783,1.433294,-1.15171,-0.383795,-0.636093,0.844775,-0.335326,-0.58166,0.740062,4.046772,-0.967363,0.967363,-0.399788,-0.374953,-0.60568,-0.181565,-0.258818,5.398913,-0.722032,1.488153,-0.474943,-0.785899,-0.110846,-0.098048,-0.232219,-0.100732,-0.049116,-0.11605,-0.056115,-0.11324,-0.093599,-1.029646,-1.365718,1.198922,-1.692353,4.265778,-0.971513,1.757046,0.424463,-0.39449,-0.45346,-0.512006,-0.477092,2.806016,-0.134976,-0.125847,-0.403126,-0.124987,-0.783498,-1.348184,0.785376,-0.334359,-0.131725,-0.299355,-0.337253,-0.630742,-6.284565,-1.902337,0.837702,-0.319656,-0.257896,-0.300391,-0.321838,-0.296232,0.632039,-0.632039,2.669756,0.609289,3.91718,-0.967066,-0.682039,-0.553536,-0.726385,4.803672,0.196937,3.91718
3,1.221781,-0.198986,0.030287,-0.155629,0.072521,0.210363,0.967727,-0.566874,0.425297,0.04621,0.8677,-0.70923,-0.050491,0.205174,-0.153295,0.000531,-0.050412,0.380442,0.000531,0.826716,-0.290341,-0.481219,-0.093029,-0.361577,-0.115121,-0.038302,-0.038302,0.665911,-0.535529,-0.030702,-0.032364,-0.129237,-0.265655,0.175639,-0.13417,-0.056115,-0.046927,0.690082,0.191183,-0.180949,-0.059752,0.360281,-0.01772,-0.046927,-0.353012,-0.062342,1.916598,-1.800528,-0.124987,-0.050175,-0.04344,1.021044,-0.920004,-0.229706,0.364531,-0.175955,-0.303897,-0.038302,-0.035457,-0.33783,-0.697694,0.868274,-0.383795,-0.636093,0.844775,-0.335326,-0.58166,0.740062,-0.247111,1.033738,-1.033738,-0.399788,-0.374953,-0.60568,-0.181565,-0.258818,-0.185222,1.384979,-0.671974,-0.474943,1.272428,-0.110846,-0.098048,-0.232219,-0.100732,-0.049116,-0.11605,-0.056115,-0.11324,-0.093599,0.434361,-0.508095,-0.475749,0.000531,-0.093125,1.125228,-0.628132,0.424463,-0.39449,-0.45346,-0.512006,2.09603,-0.356377,-0.134976,-0.125847,-0.403126,-0.124987,0.27531,-0.331804,-1.273275,-0.334359,-0.131725,-0.299355,-0.337253,-0.630742,0.15912,0.120356,0.837702,-0.319656,-0.257896,-0.300391,-0.321838,-0.296232,0.632039,-0.632039,-0.800678,0.088276,-0.778111,-0.167084,0.860508,0.02234,-0.356403,-0.231825,0.196937,-0.778111
4,1.221781,-0.198986,0.030287,-0.155629,0.072521,0.210363,0.967727,-0.566874,0.425297,0.04621,0.8677,-0.70923,-0.050491,0.205174,-0.153295,0.000531,-0.050412,0.803218,0.000531,0.826716,-0.290341,-0.481219,-0.093029,-0.361577,-0.115121,-0.038302,-0.038302,0.665911,-0.535529,-0.030702,-0.032364,-0.129237,-0.265655,0.175639,-0.13417,-0.056115,-0.046927,0.690082,0.191183,-0.180949,-0.059752,0.360281,-0.01772,-0.046927,-0.353012,-0.062342,1.916598,-1.800528,-0.124987,-0.050175,-0.04344,1.021044,-0.920004,-0.229706,0.364531,-0.175955,-0.303897,-0.038302,-0.035457,-0.33783,-0.697694,0.868274,-0.383795,-0.636093,0.844775,-0.335326,-0.58166,0.740062,-0.247111,-0.967363,0.967363,-0.399788,2.667003,-0.60568,-0.181565,-0.258818,-0.185222,-0.722032,-0.671974,2.105517,-0.785899,-0.110846,-0.098048,-0.232219,-0.100732,-0.049116,-0.11605,-0.056115,-0.11324,-0.093599,0.434361,-0.508095,-0.475749,0.000531,-0.093125,1.125228,-0.628132,0.424463,-0.39449,-0.45346,-0.512006,-0.477092,2.806016,-0.134976,-0.125847,-0.403126,-0.124987,0.27531,-0.331804,-1.273275,-0.334359,-0.131725,-0.299355,-0.337253,-0.630742,0.15912,0.120356,0.837702,-0.319656,-0.257896,-0.300391,-0.321838,-0.296232,0.632039,-0.632039,0.124771,0.609289,-0.157816,-0.167084,0.860508,0.02234,-0.356403,-0.231825,0.196937,-0.157816
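
As a sanity check on what the scaling step does, the sketch below compares `StandardScaler` with the manual formula (x - mean) / std, using the population standard deviation, and then reuses the training statistics on new rows, which is what `ss.transform(test_data)` does later in the notebook. The toy arrays are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (illustrative only): 4 training rows, 2 features.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
new_rows = np.array([[2.5, 25.0], [5.0, 0.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)

# Manual equivalent: subtract the column mean, divide by the population std (ddof=0).
mean = train.mean(axis=0)
std = train.std(axis=0)
manual_scaled = (train - mean) / std

# New rows are scaled with the training mean and std, not with their own.
new_scaled = scaler.transform(new_rows)
manual_new = (new_rows - mean) / std

print(np.allclose(train_scaled, manual_scaled))
print(np.allclose(new_scaled, manual_new))
```

Note that the notebook fits the scaler on all 9557 rows before the train/test split, so the held-out rows contribute to the mean and standard deviation. Fitting on `X_train` only would avoid that, although tree-based models such as random forests are largely insensitive to feature scaling.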


### Model Building

In [115]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

In [116]:
### Train-Test split
X_train,X_test,y_train,y_test=train_test_split(scaled_X_train_data,y_train_data,test_size=0.25,stratify=y_train_data,random_state=42)

In [117]:
X_train.shape,X_test.shape,y_train.shape,y_test.shape

((7167, 135), (2390, 135), (7167,), (2390,))
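
The `stratify` argument is what keeps the class mix of `Target` similar in both parts of the split. A minimal sketch on synthetic, imbalanced labels (the class sizes and the `shares` helper are invented for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (illustrative): four classes, as in Target.
rng = np.random.default_rng(42)
y = np.repeat([1, 2, 3, 4], [80, 160, 120, 640])
X = rng.normal(size=(len(y), 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

def shares(labels):
    """Return the fraction of rows per class, rounded to 3 decimals."""
    values, counts = np.unique(labels, return_counts=True)
    return dict(zip(values.tolist(), (counts / counts.sum()).round(3).tolist()))

# With stratification the three dictionaries should be nearly identical.
print(shares(y))
print(shares(y_tr))
print(shares(y_te))
```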

### Hyperparameter tuning using GridSearchCV

In [58]:
from sklearn.model_selection import GridSearchCV

rfc=RandomForestClassifier(random_state=42)
params={
    'n_estimators':[10,50,80,100,115,130],
    'max_depth':[3,5,10,7,15,18],
}

# 3-fold grid search over all 36 parameter combinations
best_=GridSearchCV(rfc,param_grid=params,cv=3,n_jobs=1)
best_.fit(X_train,y_train)

print("Best CV Score",best_.best_score_)
print("Model Parameters",best_.best_params_)
print("Best Estimator",best_.best_estimator_)

Best CV Score 0.8620064183061253
Model Parameters {'max_depth': 18, 'n_estimators': 130}
Best Estimator RandomForestClassifier(max_depth=18, n_estimators=130, random_state=42)
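
Beyond the single best score, the fitted search object keeps the mean cross-validation score of every parameter combination in `cv_results_`. The sketch below shows one way to inspect it, using synthetic data and a much smaller grid than the notebook's so that it runs quickly; the data and grid values are assumptions for illustration only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class problem standing in for the real training split.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=42
)

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [10, 30], "max_depth": [3, 6]},
    cv=3,
)
search.fit(X, y)

# One row per parameter combination, best combination first.
results = (
    pd.DataFrame(search.cv_results_)[
        ["param_max_depth", "param_n_estimators", "mean_test_score", "rank_test_score"]
    ]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(results)
print(search.best_params_)
```

In the run above, both selected values (`max_depth=18`, `n_estimators=130`) are the largest in their lists, which may indicate that a wider grid is worth trying; that is a judgment call rather than something the output proves.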


In [118]:
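RFC=best_.best_estimator_
# GridSearchCV has already refit best_estimator_ on X_train (refit=True by default); this call simply fits it again
model=RFC.fit(X_train,y_train)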
preds=model.predict(X_test)

In [119]:
preds[:10]

array([2, 4, 4, 4, 2, 4, 4, 4, 4, 4])

In [99]:
# Accuracy on the training split and on the held-out split (a single split, not cross-validation)
print('Model Score of train data : {}'.format(model.score(X_train,y_train)))
print('Model Score of test data : {}'.format(model.score(X_test,y_test)))

Model Score of train data : 0.9974884889074926
Model Score of test data : 0.9112970711297071
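
The two scores above come from a single train/test split. The task list also asks for a cross-validated accuracy; one way to obtain it is `cross_val_score` with stratified folds. This is a minimal sketch on synthetic data, not the notebook's result: the dataset, the number of folds, and the forest settings are all assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic 4-class problem standing in for the scaled training data.
X, y = make_classification(
    n_samples=400, n_features=12, n_informative=6, n_classes=4, random_state=42
)

clf = RandomForestClassifier(n_estimators=50, max_depth=8, random_state=42)

# Five stratified folds: each fold keeps roughly the same class mix.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")

print(scores)
print("mean accuracy:", scores.mean(), "std:", scores.std())
```

On the real data the same call would take `scaled_X_train_data` and `y_train_data` in place of `X` and `y`.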


In [100]:
important_features=pd.DataFrame(model.feature_importances_,scaled_X_train_data.columns,columns=['feature_importance'])

In [101]:
top_50_features=important_features.sort_values(by='feature_importance',ascending=False).head(50).index
top_50_features

Index(['meaneduc', 'SQBmeaned', 'SQBdependency', 'dependency',
       'SQBovercrowding', 'overcrowding', 'qmobilephone', 'SQBhogar_nin',
       'SQBedjefe', 'edjefe', 'hogar_nin', 'rooms', 'cielorazo', 'SQBage',
       'agesq', 'age', 'v2a1', 'edjefa', 'r4t2', 'r4h2', 'r4m3', 'r4t1',
       'SQBescolari', 'r4h3', 'escolari', 'hogar_adul', 'bedrooms', 'r4m1',
       'r4m2', 'hogar_total', 'eviv3', 'tamviv', 'v18q1', 'hhsize', 'tamhog',
       'SQBhogar_total', 'epared3', 'paredblolad', 'pisomoscer', 'r4h1',
       'etecho3', 'lugar1', 'energcocinar2', 'energcocinar3', 'etecho2',
       'epared2', 'hogar_mayor', 'pisocemento', 'eviv2', 'paredpreb'],
      dtype='object')
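
The ranking above relies on the forest's `feature_importances_` attribute, which holds one non-negative weight per column and sums to 1. A small sketch of the same select-the-top-k idea on synthetic data (the column names `f0` to `f7` and the choice of k = 3 are illustrative):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 8 features, of which 3 carry signal.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(8)])

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Label each importance with its column name and sort, largest first.
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)

# Keep only the k most important columns.
top_3 = importances.head(3).index
X_top = X[top_3]

print(importances)
print(X_top.shape)
```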

In [102]:
Top50_scaled_X_train_data=scaled_X_train_data[top_50_features]

In [103]:
X_train,X_test,y_train,y_test=train_test_split(Top50_scaled_X_train_data,y_train_data,test_size=0.25,stratify=y_train_data,random_state=42)

In [104]:
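# Fit a fresh copy on the top-50 features; clone() leaves `model` (fit on all 135 features) untouched
from sklearn.base import clone
model1=clone(RFC).fit(X_train,y_train)
preds=model1.predict(X_test)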

In [105]:
preds

array([2, 4, 1, ..., 4, 4, 4])

In [106]:
from sklearn.metrics import confusion_matrix,f1_score,accuracy_score

In [107]:
confusion_matrix(y_test,preds)

array([[ 153,   14,    1,   21],
       [   6,  335,   12,   46],
       [   0,   14,  235,   53],
       [   4,    8,    2, 1486]])

In [108]:
f1_score(y_test,preds,average='weighted')

0.9221368945110074

In [109]:
accuracy_score(y_test,preds)

0.9242677824267782
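
The accuracy above can be recovered from the confusion matrix, and the same matrix also gives per-class recall and precision, which show where the errors concentrate. The sketch re-enters the matrix by hand and assumes scikit-learn's convention (rows are true classes, columns are predicted classes) with the classes ordered 1 to 4.

```python
import numpy as np

# Confusion matrix copied from the output above.
# Assumption: rows = true Target 1..4, columns = predicted Target 1..4.
cm = np.array([
    [153,  14,   1,   21],
    [  6, 335,  12,   46],
    [  0,  14, 235,   53],
    [  4,   8,   2, 1486],
])

# Accuracy: correctly classified rows (the diagonal) over all rows.
accuracy = np.trace(cm) / cm.sum()

# Recall per class: diagonal over row totals (how many of each true class were found).
recall = np.diag(cm) / cm.sum(axis=1)

# Precision per class: diagonal over column totals (how many predictions were right).
precision = np.diag(cm) / cm.sum(axis=0)

print("accuracy :", accuracy)
print("recall   :", recall.round(4))
print("precision:", precision.round(4))
```

Most misclassified rows end up in the last column, that is, they are predicted as class 4, the largest class.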

### Apply cleaning on test_data and then find the prediction for that.

In [72]:
# Drop the columns that are not used for modelling (they are absent from the training features above)
test_data.drop(['Id','idhogar'],axis=1,inplace=True)
test_data.drop('elimbasu5',axis=1,inplace=True)
test_data.drop(['r4t3','tipovivi3','v18q','rez_esc'],axis=1,inplace=True)

In [74]:
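# Handling missing values
# Note: `map` here is presumably the notebook's own conversion helper defined earlier (it shadows the built-in map)
test_data['dependency']=test_data['dependency'].apply(map)
test_data['edjefe']=test_data['edjefe'].apply(map)
test_data['edjefa']=test_data['edjefa'].apply(map)
# Assign the result back instead of calling fillna(inplace=True) on a column selection
test_data['v2a1']=test_data['v2a1'].fillna(0)
test_data['v18q1']=test_data['v18q1'].fillna(0)
test_data['meaneduc']=test_data['meaneduc'].fillna(test_data['meaneduc'].mean())
test_data['SQBmeaned']=test_data['SQBmeaned'].fillna(test_data['SQBmeaned'].mean())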

In [110]:
if test_data.isnull().values.any():
    print('There are missing values in the test_data')
else:
    print('There are no missing values in the test_data')

There are no missing values in the test_data


In [111]:
#scaling test_data
scaled_test_data=ss.transform(test_data)
scaled_test_data=pd.DataFrame(scaled_test_data,columns=test_data.columns)
scaled_test_data.head()

Unnamed: 0,v2a1,hacdor,rooms,hacapo,v14a,refrig,v18q1,r4h1,r4h2,r4h3,r4m1,r4m2,r4m3,r4t1,r4t2,tamhog,tamviv,escolari,hhsize,paredblolad,paredzocalo,paredpreb,pareddes,paredmad,paredzinc,paredfibras,paredother,pisomoscer,pisocemento,pisoother,pisonatur,pisonotiene,pisomadera,techozinc,techoentrepiso,techocane,techootro,cielorazo,abastaguadentro,abastaguafuera,abastaguano,public,planpri,noelec,coopele,sanitario1,sanitario2,sanitario3,sanitario5,sanitario6,energcocinar1,energcocinar2,energcocinar3,energcocinar4,elimbasu1,elimbasu2,elimbasu3,elimbasu4,elimbasu6,epared1,epared2,epared3,etecho1,etecho2,etecho3,eviv1,eviv2,eviv3,dis,male,female,estadocivil1,estadocivil2,estadocivil3,estadocivil4,estadocivil5,estadocivil6,estadocivil7,parentesco1,parentesco2,parentesco3,parentesco4,parentesco5,parentesco6,parentesco7,parentesco8,parentesco9,parentesco10,parentesco11,parentesco12,hogar_nin,hogar_adul,hogar_mayor,hogar_total,dependency,edjefe,edjefa,meaneduc,instlevel1,instlevel2,instlevel3,instlevel4,instlevel5,instlevel6,instlevel7,instlevel8,instlevel9,bedrooms,overcrowding,tipovivi1,tipovivi2,tipovivi4,tipovivi5,computer,television,mobilephone,qmobilephone,lugar1,lugar2,lugar3,lugar4,lugar5,lugar6,area1,area2,age,SQBescolari,SQBage,SQBhogar_total,SQBedjefe,SQBhogar_nin,SQBovercrowding,SQBdependency,SQBmeaned,agesq
0,-0.427153,-0.198986,0.030287,-0.155629,0.072521,0.210363,-0.466827,0.90211,-0.53947,0.04621,-0.576502,-0.70923,-0.879604,0.205174,-0.847296,-0.563764,-0.583368,-1.522054,-0.563764,0.826716,-0.290341,-0.481219,-0.093029,-0.361577,-0.115121,-0.038302,-0.038302,0.665911,-0.535529,-0.030702,-0.032364,-0.129237,-0.265655,0.175639,-0.13417,-0.056115,-0.046927,0.690082,0.191183,-0.180949,-0.059752,0.360281,-0.01772,-0.046927,-0.353012,-0.062342,1.916598,-1.800528,-0.124987,-0.050175,-0.04344,-0.97939,1.086952,-0.229706,0.364531,-0.175955,-0.303897,-0.038302,-0.035457,-0.33783,-0.697694,0.868274,-0.383795,-0.636093,0.844775,-0.335326,-0.58166,0.740062,-0.247111,1.033738,-1.033738,2.501328,-0.374953,-0.60568,-0.181565,-0.258818,-0.185222,-0.722032,-0.671974,-0.474943,1.272428,-0.110846,-0.098048,-0.232219,-0.100732,-0.049116,-0.11605,-0.056115,-0.11324,-0.093599,-0.297642,-0.508095,-0.475749,-0.563764,-0.404475,-0.971513,3.058052,1.744552,2.534915,-0.45346,-0.512006,-0.477092,-0.356377,-0.134976,-0.125847,-0.403126,-0.124987,-0.783498,-0.128527,0.785376,-0.334359,-0.131725,-0.299355,2.965132,-0.630742,0.15912,-0.553875,0.837702,-0.319656,-0.257896,-0.300391,-0.321838,-0.296232,0.632039,-0.632039,-1.402219,-0.966775,-0.934908,-0.540409,-0.682039,-0.409567,-0.242045,-0.291772,1.8148,-0.934908
1,-0.427153,-0.198986,0.030287,-0.155629,0.072521,0.210363,-0.466827,0.90211,-0.53947,0.04621,-0.576502,-0.70923,-0.879604,0.205174,-0.847296,-0.563764,-0.583368,1.86016,-0.563764,0.826716,-0.290341,-0.481219,-0.093029,-0.361577,-0.115121,-0.038302,-0.038302,0.665911,-0.535529,-0.030702,-0.032364,-0.129237,-0.265655,0.175639,-0.13417,-0.056115,-0.046927,0.690082,0.191183,-0.180949,-0.059752,0.360281,-0.01772,-0.046927,-0.353012,-0.062342,1.916598,-1.800528,-0.124987,-0.050175,-0.04344,-0.97939,1.086952,-0.229706,0.364531,-0.175955,-0.303897,-0.038302,-0.035457,-0.33783,-0.697694,0.868274,-0.383795,-0.636093,0.844775,-0.335326,-0.58166,0.740062,-0.247111,1.033738,-1.033738,-0.399788,-0.374953,1.651038,-0.181565,-0.258818,-0.185222,-0.722032,-0.671974,2.105517,-0.785899,-0.110846,-0.098048,-0.232219,-0.100732,-0.049116,-0.11605,-0.056115,-0.11324,-0.093599,-0.297642,-0.508095,-0.475749,-0.563764,-0.404475,-0.971513,3.058052,1.744552,-0.39449,-0.45346,-0.512006,-0.477092,-0.356377,-0.134976,-0.125847,2.480613,-0.124987,-0.783498,-0.128527,0.785376,-0.334359,-0.131725,-0.299355,2.965132,-0.630742,0.15912,-0.553875,0.837702,-0.319656,-0.257896,-0.300391,-0.321838,-0.296232,0.632039,-0.632039,0.309861,2.367707,0.02138,-0.540409,-0.682039,-0.409567,-0.242045,-0.291772,1.8148,0.02138
2,-0.427153,-0.198986,0.030287,-0.155629,0.072521,0.210363,-0.466827,0.90211,-0.53947,0.04621,-0.576502,-0.70923,-0.879604,0.205174,-0.847296,-0.563764,-0.583368,2.071549,-0.563764,0.826716,-0.290341,-0.481219,-0.093029,-0.361577,-0.115121,-0.038302,-0.038302,0.665911,-0.535529,-0.030702,-0.032364,-0.129237,-0.265655,0.175639,-0.13417,-0.056115,-0.046927,0.690082,0.191183,-0.180949,-0.059752,0.360281,-0.01772,-0.046927,-0.353012,-0.062342,1.916598,-1.800528,-0.124987,-0.050175,-0.04344,-0.97939,1.086952,-0.229706,0.364531,-0.175955,-0.303897,-0.038302,-0.035457,-0.33783,-0.697694,0.868274,-0.383795,-0.636093,0.844775,-0.335326,-0.58166,0.740062,-0.247111,-0.967363,0.967363,-0.399788,-0.374953,1.651038,-0.181565,-0.258818,-0.185222,-0.722032,1.488153,-0.474943,-0.785899,-0.110846,-0.098048,-0.232219,-0.100732,-0.049116,-0.11605,-0.056115,-0.11324,-0.093599,-0.297642,-0.508095,-0.475749,-0.563764,-0.404475,-0.971513,3.058052,1.744552,-0.39449,-0.45346,-0.512006,-0.477092,-0.356377,-0.134976,-0.125847,-0.403126,8.00085,-0.783498,-0.128527,0.785376,-0.334359,-0.131725,-0.299355,2.965132,-0.630742,0.15912,-0.553875,0.837702,-0.319656,-0.257896,-0.300391,-0.321838,-0.296232,0.632039,-0.632039,0.309861,2.797543,0.02138,-0.540409,-0.682039,-0.409567,-0.242045,-0.291772,1.8148,0.02138
3,-0.427153,-0.198986,6.159807,-0.155629,0.072521,0.210363,0.967727,-0.566874,-0.53947,-0.794982,-0.576502,-1.781038,-1.708716,-0.749476,-1.541297,-1.692353,-1.649279,1.86016,-1.692353,0.826716,-0.290341,-0.481219,-0.093029,-0.361577,-0.115121,-0.038302,-0.038302,0.665911,-0.535529,-0.030702,-0.032364,-0.129237,-0.265655,0.175639,-0.13417,-0.056115,-0.046927,0.690082,0.191183,-0.180949,-0.059752,0.360281,-0.01772,-0.046927,-0.353012,-0.062342,1.916598,-1.800528,-0.124987,-0.050175,-0.04344,1.021044,-0.920004,-0.229706,0.364531,-0.175955,-0.303897,-0.038302,-0.035457,-0.33783,-0.697694,0.868274,-0.383795,-0.636093,0.844775,-0.335326,-0.58166,0.740062,-0.247111,1.033738,-1.033738,-0.399788,-0.374953,-0.60568,-0.181565,3.86372,-0.185222,-0.722032,1.488153,-0.474943,-0.785899,-0.110846,-0.098048,-0.232219,-0.100732,-0.049116,-0.11605,-0.056115,-0.11324,-0.093599,-1.029646,-1.365718,-0.475749,-1.692353,-0.715826,2.078291,-0.628132,1.624544,-0.39449,-0.45346,-0.512006,-0.477092,-0.356377,-0.134976,-0.125847,2.480613,-0.124987,-1.842307,-0.738356,0.785376,-0.334359,-0.131725,-0.299355,2.965132,-0.630742,0.15912,-0.553875,0.837702,-0.319656,-0.257896,-0.300391,-0.321838,-0.296232,0.632039,-0.632039,1.142765,2.367707,1.055206,-0.967066,2.581532,-0.553536,-0.544758,-0.311754,1.64098,1.055206
4,1.175977,-0.198986,-0.650771,-0.155629,0.072521,0.210363,0.967727,-0.566874,-1.504237,-1.636173,-0.576502,-0.70923,-0.879604,-0.749476,-1.541297,-1.692353,-1.649279,0.803218,-1.692353,0.826716,-0.290341,-0.481219,-0.093029,-0.361577,-0.115121,-0.038302,-0.038302,0.665911,-0.535529,-0.030702,-0.032364,-0.129237,-0.265655,0.175639,-0.13417,-0.056115,-0.046927,0.690082,0.191183,-0.180949,-0.059752,0.360281,-0.01772,-0.046927,-0.353012,-0.062342,1.916598,-1.800528,-0.124987,-0.050175,-0.04344,1.021044,-0.920004,-0.229706,0.364531,-0.175955,-0.303897,-0.038302,-0.035457,-0.33783,1.433294,-1.15171,-0.383795,-0.636093,0.844775,-0.335326,-0.58166,0.740062,-0.247111,-0.967363,0.967363,-0.399788,-0.374953,-0.60568,-0.181565,-0.258818,-0.185222,1.384979,1.488153,-0.474943,-0.785899,-0.110846,-0.098048,-0.232219,-0.100732,-0.049116,-0.11605,-0.056115,-0.11324,-0.093599,-0.297642,-2.223341,-0.475749,-1.692353,4.265778,-0.971513,1.757046,-0.017773,-0.39449,-0.45346,-0.512006,-0.477092,2.806016,-0.134976,-0.125847,-0.403126,-0.124987,-0.783498,-1.348184,-1.273275,-0.334359,-0.131725,-0.299355,-0.337253,-0.630742,0.15912,-1.228106,0.837702,-0.319656,-0.257896,-0.300391,-0.321838,-0.296232,0.632039,-0.632039,-0.754405,0.609289,-0.758009,-0.967066,-0.682039,-0.409567,-0.726385,4.803672,-0.022245,-0.758009
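
`ss.transform` expects the same features, in the same order, as the data the scaler was fit on. Before scaling it can help to compare the test columns against the training columns explicitly. A minimal sketch with tiny made-up frames; `train_cols` is a hypothetical stand-in for `X_train_data.columns`:

```python
import pandas as pd

# Hypothetical stand-in for the training feature list.
train_cols = ["v2a1", "rooms", "age"]

# Toy test frame with a different column order and one extra column.
test = pd.DataFrame({
    "age": [30, 41],
    "v2a1": [0.0, 120000.0],
    "rooms": [3, 5],
    "Id": ["a", "b"],
})

# Columns the scaler needs but the test frame lacks, and columns it does not know.
missing = [c for c in train_cols if c not in test.columns]
extra = [c for c in test.columns if c not in train_cols]

# Select and reorder so the layout matches the training data exactly.
aligned = test[train_cols]

print("missing:", missing)
print("extra:", extra)
print(list(aligned.columns))
```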


In [112]:
top_50_scaled_test_data=scaled_test_data[top_50_features]

In [113]:
test_preds=model1.predict(top_50_scaled_test_data)
test_preds

array([4, 4, 4, ..., 2, 2, 4])

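### Conclusion

***Using a Random Forest classifier on the 50 most important features, the model reaches about 92% accuracy (weighted F1 of about 0.92) on the held-out 25% of the training data. The same model was then used to predict test_data; since test_data carries no Target labels here, the accuracy of those predictions cannot be measured in this notebook.***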