### Identify the level of income qualification needed for the families in Latin America.

### Problem Statement Scenario:
Many social programs have a hard time ensuring that the right people are given enough aid. It’s tricky when a program focuses on the poorest segment of the population. This segment of the population can’t provide the necessary income and expense records to prove that they qualify.

In Latin America, a popular method called Proxy Means Test (PMT) uses an algorithm to verify income qualification. With PMT, agencies use a model that considers a family’s observable household attributes, like the material of their walls and ceiling or the assets found in their homes, to classify them and predict their level of need.

While this is an improvement, accuracy remains a problem as the region’s population grows and poverty declines.

The Inter-American Development Bank (IDB) believes that new methods beyond traditional econometrics, based on a dataset of Costa Rican household characteristics, might help improve PMT’s performance.

### The following actions should be performed:

- Identify the output variable.
- Understand the type of data.
- Check if there are any biases in your dataset.
- Check whether all members of the house have the same poverty level.
- Check if there is a house without a family head.
- Set poverty level of the members and the head of the house within a family.
- Count how many null values exist in each column.
- Remove null value rows of the target variable.
- Predict the accuracy using random forest classifier.
- Check the accuracy using random forest with cross validation.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
train=pd.read_csv(r'A:\MachineLearning\Project2\train.csv')
test=pd.read_csv(r'A:\MachineLearning\Project2\test.csv')

In [3]:
print('Shape of train dataset is {}'.format(train.shape))
print('Shape of test dataset is {}'.format(test.shape))

Shape of train dataset is (9557, 143)
Shape of test dataset is (23856, 142)


In [4]:
train.head()

Unnamed: 0,Id,v2a1,hacdor,rooms,hacapo,v14a,refrig,v18q,v18q1,r4h1,...,SQBescolari,SQBage,SQBhogar_total,SQBedjefe,SQBhogar_nin,SQBovercrowding,SQBdependency,SQBmeaned,agesq,Target
0,ID_279628684,190000.0,0,3,0,1,1,0,,0,...,100,1849,1,100,0,1.0,0.0,100.0,1849,4
1,ID_f29eb3ddd,135000.0,0,4,0,1,1,1,1.0,0,...,144,4489,1,144,0,1.0,64.0,144.0,4489,4
2,ID_68de51c94,,0,8,0,1,1,0,,0,...,121,8464,1,0,0,0.25,64.0,121.0,8464,4
3,ID_d671db89c,180000.0,0,5,0,1,1,1,1.0,0,...,81,289,16,121,4,1.777778,1.0,121.0,289,4
4,ID_d56d6f5f5,180000.0,0,5,0,1,1,1,1.0,0,...,121,1369,16,121,4,1.777778,1.0,121.0,1369,4


In [5]:
test.head()

Unnamed: 0,Id,v2a1,hacdor,rooms,hacapo,v14a,refrig,v18q,v18q1,r4h1,...,age,SQBescolari,SQBage,SQBhogar_total,SQBedjefe,SQBhogar_nin,SQBovercrowding,SQBdependency,SQBmeaned,agesq
0,ID_2f6873615,,0,5,0,1,1,0,,1,...,4,0,16,9,0,1,2.25,0.25,272.25,16
1,ID_1c78846d2,,0,5,0,1,1,0,,1,...,41,256,1681,9,0,1,2.25,0.25,272.25,1681
2,ID_e5442cf6a,,0,5,0,1,1,0,,1,...,41,289,1681,9,0,1,2.25,0.25,272.25,1681
3,ID_a8db26a79,,0,14,0,1,1,1,1.0,0,...,59,256,3481,1,256,0,1.0,0.0,256.0,3481
4,ID_a62966799,175000.0,0,4,0,1,1,1,1.0,0,...,18,121,324,1,0,1,0.25,64.0,,324


### Identify the output variable.

In [6]:
for i in train.columns:
    if i not in test.columns:
        print("Our Target variable is {}".format(i))

Our Target variable is Target
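The same check can be written as a set difference on the column indexes. The following is a minimal sketch on two small made-up frames; `train_toy` and `test_toy` are hypothetical stand-ins, and only the column names echo the real dataset.

```python
import pandas as pd

# Toy stand-ins for the train/test frames; the values are made up for illustration.
train_toy = pd.DataFrame({"Id": ["a", "b"], "rooms": [3, 4], "Target": [4, 2]})
test_toy = pd.DataFrame({"Id": ["c", "d"], "rooms": [5, 2]})

# Columns present in the training frame but absent from the test frame.
target_cols = train_toy.columns.difference(test_toy.columns).tolist()
print(target_cols)
```

If more than one column came back, the extra names would need to be inspected before treating any of them as the label.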


### Understand the type of data.

In [7]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9557 entries, 0 to 9556
Columns: 143 entries, Id to Target
dtypes: float64(8), int64(130), object(5)
memory usage: 10.4+ MB


In [8]:
print(train.dtypes.value_counts())

int64      130
float64      8
object       5
dtype: int64


In [9]:
for i in train.columns:
    a=train[i].dtype
    if a == 'object':
        print(i)

Id
idhogar
dependency
edjefe
edjefa


In [10]:
train['idhogar'].value_counts()

fd8a6d014    13
ae6cf0558    12
0c7436de6    12
4476ccd4c    11
b7a0b59d7    11
6b35cdcf0    11
3fe29a56b    11
0fc6c05f7    10
63f11d6ea    10
a18c0c0be    10
f2a4cd356    10
6a96a96c0    10
7cad2d6c4    10
06ca88023     9
9d70c1551     9
322cefd2f     9
476b3f2ee     9
d4e1dc02c     9
9fd143d1f     9
ae489f548     9
d43a04997     9
efec7e82c     9
1ed926340     9
493f97dcb     8
2f8fab5de     8
cd7c2ef1d     8
8857dd685     8
4f2bd02b9     8
1c0b1cbd8     8
7b7ebaf70     8
             ..
ec9506d71     1
045038655     1
d501218d4     1
72c73d9a6     1
36b3da67a     1
f36033ff7     1
1b31fd159     1
c86397adb     1
e8bbf32c4     1
64a069f16     1
5575b936f     1
3886a7737     1
15e096859     1
87a666cdd     1
f9ddb6edf     1
fed6bc0bd     1
0a3be8b29     1
1637ac45b     1
00e443b00     1
19a9cacc3     1
1f5ef45bf     1
f1265ca75     1
5463ebd0e     1
49e535bed     1
257e2e949     1
87dfa584d     1
9062ed6bc     1
a00d7a1be     1
d1f953457     1
cc3b2206b     1
Name: idhogar, Length: 2

In [11]:
# Drop the identifier columns Id (person) and idhogar (household)
train.drop(['Id','idhogar'],axis=1,inplace=True)

In [12]:
train['dependency'].value_counts()

yes          2192
no           1747
.5           1497
2             730
1.5           713
.33333334     598
.66666669     487
8             378
.25           260
3             236
4             100
.75            98
.2             90
1.3333334      84
.40000001      84
2.5            77
5              24
1.25           18
.80000001      18
3.5            18
2.25           13
.71428573      12
1.2            11
.22222222      11
.83333331      11
1.75           11
.2857143        9
1.6666666       8
.60000002       8
6               7
.16666667       7
Name: dependency, dtype: int64

In [13]:
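# Convert object variables into numerical data: 'yes' -> 1.0, 'no' -> 0.0, anything else is parsed as a float
# Note: naming this function map shadows Python's built-in map()
def map(i):
    if i=='yes':
        return 1.0
    elif i=='no':
        return 0.0
    else:
        return float(i)
train['dependency']=train['dependency'].apply(map)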

In [14]:
# from sklearn.preprocessing import LabelEncoder
# Label_encoder = LabelEncoder()
# train['dependency'] = Label_encoder.fit_transform(train['dependency'])
# train['dependency'].value_counts()

In [15]:
train['dependency'].value_counts()

1.000000    2192
0.000000    1747
0.500000    1497
2.000000     730
1.500000     713
0.333333     598
0.666667     487
8.000000     378
0.250000     260
3.000000     236
4.000000     100
0.750000      98
0.200000      90
1.333333      84
0.400000      84
2.500000      77
5.000000      24
0.800000      18
3.500000      18
1.250000      18
2.250000      13
0.714286      12
0.222222      11
1.200000      11
0.833333      11
1.750000      11
0.285714       9
0.600000       8
1.666667       8
6.000000       7
0.166667       7
Name: dependency, dtype: int64
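The yes/no conversion above can also be done without a row-wise function. This is a sketch under the same assumption the notebook makes ('yes' becomes 1 and 'no' becomes 0); the small Series is made up, and `converted` is an illustrative name.

```python
import pandas as pd

# Made-up sample of the mixed string values seen in the dependency column.
raw = pd.Series(["yes", "no", ".5", "2", "8"], dtype=object)

# Replace the two labels with numeric strings, then parse the whole column at once.
converted = pd.to_numeric(raw.replace({"yes": "1", "no": "0"}))
print(converted.tolist())
```

`pd.to_numeric` raises on any string it cannot parse, which surfaces unexpected labels early instead of silently passing them through.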

In [16]:
train['edjefe'].value_counts()

no     3762
6      1845
11      751
9       486
3       307
15      285
8       257
7       234
5       222
14      208
17      202
2       194
4       137
16      134
yes     123
12      113
10      111
13      103
21       43
18       19
19       14
20        7
Name: edjefe, dtype: int64

In [17]:
train['edjefe']=train['edjefe'].apply(map)
train['edjefa']=train['edjefa'].apply(map)

In [18]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9557 entries, 0 to 9556
Columns: 141 entries, v2a1 to Target
dtypes: float64(11), int64(130)
memory usage: 10.3 MB


In [19]:
## Variance check of data
# Variance of every column, then the columns whose variance is exactly 0
var_df=pd.DataFrame(np.var(train,0),columns=['variance'])
print('Below are columns with variance 0.')
col=list((var_df[var_df['variance']==0]).index)
print(col)

Below are columns with variance 0.
['elimbasu5']
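A column with zero variance holds a single value in every row, so it cannot help a classifier. As a small sketch of the same check using `DataFrame.var` directly (the toy frame and its values are assumptions; only the column names come from the dataset):

```python
import pandas as pd

# Toy frame: elimbasu5 is constant, the other two columns vary.
toy = pd.DataFrame({
    "elimbasu5": [0, 0, 0, 0],
    "rooms": [3, 4, 8, 5],
    "refrig": [1, 1, 0, 1],
})

# ddof=0 requests the population variance; a constant column is 0 under either ddof.
variances = toy.var(ddof=0)
zero_var_cols = variances[variances == 0].index.tolist()
print(zero_var_cols)
```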


### Check if there are any biases in your dataset.

In [20]:
contingency_tab=pd.crosstab(train['r4t3'],train['hogar_total'])
Observed_Values=contingency_tab.values
Observed_Values

array([[ 378,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0],
       [   7, 1348,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0],
       [   1,   10, 2249,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0],
       [   0,    2,    6, 2439,    0,    0,    0,    0,    0,    0,    0,
           0,    0],
       [   0,    2,    3,    8, 1585,    0,    0,    0,    0,    0,    0,
           0,    0],
       [   0,    0,    0,    0,    5,  819,    0,    0,    0,    0,    0,
           0,    0],
       [   0,    0,    0,    4,    0,    0,  364,    0,    0,    0,    0,
           0,    0],
       [   0,    0,    0,    0,    0,    0,    0,   96,    0,    0,    0,
           0,    0],
       [   0,    0,    0,    0,    0,    0,    0,    0,   90,    0,    0,
           0,    0],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,   60,    0,
           0,    0],
       [   0,    0,    0,    0,    0,    0,    0, 

In [21]:
# Contingency table of r4t3 against hogar_total
contingency_tab=pd.crosstab(train['r4t3'],train['hogar_total'])
Observed_Values=contingency_tab.values

import scipy.stats
# chi2_contingency returns (statistic, p-value, degrees of freedom, expected frequencies)
b=scipy.stats.chi2_contingency(contingency_tab)
Expected_Values = b[3]
# Note: the row and column counts below come from a 2x2 slice of the table, so df is always 1;
# b[2] holds the degrees of freedom for the full table
no_of_rows=len(contingency_tab.iloc[0:2,0])
no_of_columns=len(contingency_tab.iloc[0,0:2])
df=(no_of_rows-1)*(no_of_columns-1)
print("Degree of Freedom:-",df)

from scipy.stats import chi2
# chi_square holds one sum per column of the table
chi_square=sum([(o-e)**2./e for o,e in zip(Observed_Values,Expected_Values)])
# Note: only the first two columns are added here; b[0] holds the statistic for the full table
chi_square_statistic=chi_square[0]+chi_square[1]
print("chi-square statistic:-",chi_square_statistic)
alpha=0.05
critical_value=chi2.ppf(q=1-alpha,df=df)
print('critical_value:',critical_value)
p_value=1-chi2.cdf(x=chi_square_statistic,df=df)
print('p-value:',p_value)
print('Significance level: ',alpha)
print('Degree of Freedom: ',df)
print('chi-square statistic:',chi_square_statistic)
print('critical_value:',critical_value)
print('p-value:',p_value)
# Decision by critical value
if chi_square_statistic>=critical_value:
    print("Reject H0,There is a relationship between 2 categorical variables")
else:
    print("Retain H0,There is no relationship between 2 categorical variables")
# Decision by p-value
if p_value<=alpha:
    print("Reject H0,There is a relationship between 2 categorical variables")
else:
    print("Retain H0,There is no relationship between 2 categorical variables")

Degree of Freedom:- 1
chi-square statistic:- 17022.072400560897
critical_value: 3.841458820694124
p-value: 0.0
Significance level:  0.05
Degree of Freedom:  1
chi-square statistic: 17022.072400560897
critical_value: 3.841458820694124
p-value: 0.0
Reject H0,There is a relationship between 2 categorical variables
Reject H0,There is a relationship between 2 categorical variables
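For reference, `scipy.stats.chi2_contingency` already returns the statistic and degrees of freedom for the whole table, so neither has to be rebuilt by hand. The sketch below uses a made-up 3x2 table of counts (not the r4t3/hogar_total table) and checks the library value against the textbook formula; `manual_stat` is an illustrative name.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Made-up 3x2 table of observed counts, for illustration only.
observed = np.array([[30, 10],
                     [20, 40],
                     [25, 25]])

# Statistic, p-value, degrees of freedom and expected counts for the full table.
stat, p_value, dof, expected = chi2_contingency(observed)

# Textbook formula: sum over every cell of (observed - expected)^2 / expected.
manual_stat = ((observed - expected) ** 2 / expected).sum()

print("degrees of freedom:", dof)  # (rows - 1) * (columns - 1)
print("statistic:", stat, "manual:", manual_stat)
print("p-value:", p_value)
```

Summing over every cell, rather than a subset of columns, is what makes the manual value agree with the library's.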



### Check if there is a house without a family head

In [22]:
train.head()

Unnamed: 0,v2a1,hacdor,rooms,hacapo,v14a,refrig,v18q,v18q1,r4h1,r4h2,...,SQBescolari,SQBage,SQBhogar_total,SQBedjefe,SQBhogar_nin,SQBovercrowding,SQBdependency,SQBmeaned,agesq,Target
0,190000.0,0,3,0,1,1,0,,0,1,...,100,1849,1,100,0,1.0,0.0,100.0,1849,4
1,135000.0,0,4,0,1,1,1,1.0,0,1,...,144,4489,1,144,0,1.0,64.0,144.0,4489,4
2,,0,8,0,1,1,0,,0,0,...,121,8464,1,0,0,0.25,64.0,121.0,8464,4
3,180000.0,0,5,0,1,1,1,1.0,0,2,...,81,289,16,121,4,1.777778,1.0,121.0,289,4
4,180000.0,0,5,0,1,1,1,1.0,0,2,...,121,1369,16,121,4,1.777778,1.0,121.0,1369,4


In [23]:
print(train.parentesco1.value_counts())
pd.crosstab(train['edjefa'],train['edjefe'])

0    6584
1    2973
Name: parentesco1, dtype: int64


edjefe,0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,...,12.0,13.0,14.0,15.0,16.0,17.0,18.0,19.0,20.0,21.0
edjefa,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0.0,435,123,194,307,137,222,1845,234,257,486,...,113,103,208,285,134,202,19,14,7,43
1.0,69,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2.0,84,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3.0,152,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4.0,136,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5.0,176,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6.0,947,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7.0,179,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8.0,217,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9.0,237,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


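The cross tab above shows 435 rows where both edjefe and edjefa are 0, which suggests there are 435 records with no family head. Note that each row is an individual, so this count is of people rather than families, and a value of 0 can also arise from the 'no' entries that were mapped to 0 earlier.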
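A household-level version of this check, and of the "same poverty level" check from the task list, can be sketched with a group-by on the household id. This assumes `idhogar` is still available (it would have to run before that column is dropped) and that `parentesco1 == 1` marks the head of household. The toy frame and the `Target_fixed` column are made up for illustration.

```python
import pandas as pd

# Toy data: h1 has a head but mixed targets, h2 has no head, h3 is a single person.
toy = pd.DataFrame({
    "idhogar": ["h1", "h1", "h2", "h2", "h3"],
    "parentesco1": [1, 0, 0, 0, 1],
    "Target": [4, 3, 2, 3, 1],
})

# Households whose members include no head.
heads = toy.groupby("idhogar")["parentesco1"].sum()
no_head = heads[heads == 0].index.tolist()

# Households whose members do not all share one Target value.
levels = toy.groupby("idhogar")["Target"].nunique()
mixed = levels[levels > 1].index.tolist()

# Give every member the head's Target; keep the original where there is no head.
head_target = toy.loc[toy["parentesco1"] == 1].set_index("idhogar")["Target"]
toy["Target_fixed"] = toy["idhogar"].map(head_target).fillna(toy["Target"]).astype(int)

print(no_head, mixed, toy["Target_fixed"].tolist())
```

Households without a head keep their original labels here; whether to drop or relabel them instead is a modelling decision this sketch does not settle.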

### Count how many null values exist in each column.

In [24]:
train.isna().sum().value_counts()

0       136
5         2
7928      1
6860      1
7342      1
dtype: int64
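The summary above counts how many columns share each null total, but it does not name the columns. A small sketch that lists only the affected columns, on a made-up frame (the values are assumptions):

```python
import numpy as np
import pandas as pd

# Toy frame with missing values in two of its three columns.
toy = pd.DataFrame({
    "v2a1": [190000.0, np.nan, np.nan],
    "rooms": [3, 4, 8],
    "meaneduc": [10.0, np.nan, 11.0],
})

# Null count per column, keeping only columns that actually have nulls.
null_counts = toy.isna().sum()
null_counts = null_counts[null_counts > 0].sort_values(ascending=False)
print(null_counts)
```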


### Remove null value rows of the target variable.

In [25]:
float_col=[]
for i in train.columns:
    a=train[i].dtype
    if a == 'float64':
        float_col.append(i)
print(float_col)

['v2a1', 'v18q1', 'rez_esc', 'dependency', 'edjefe', 'edjefa', 'meaneduc', 'overcrowding', 'SQBovercrowding', 'SQBdependency', 'SQBmeaned']


In [26]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9557 entries, 0 to 9556
Columns: 141 entries, v2a1 to Target
dtypes: float64(11), int64(130)
memory usage: 10.3 MB


In [27]:
train[float_col].isna().sum()

v2a1               6860
v18q1              7342
rez_esc            7928
dependency            0
edjefe                0
edjefa                0
meaneduc              5
overcrowding          0
SQBovercrowding       0
SQBdependency         0
SQBmeaned             5
dtype: int64

In [28]:
train['v18q1'].value_counts()

1.0    1586
2.0     444
3.0     129
4.0      37
5.0      13
6.0       6
Name: v18q1, dtype: int64

In [29]:
# 'v2a1', 'v18q1' and 'rez_esc' each have more than 50% null values. For v2a1, families that own their house pay no rent, so a missing value is treated as 0; similarly for v18q1, a family can have 0 tablets.
train['v2a1']=train['v2a1'].fillna(0)
train['v18q1']=train['v18q1'].fillna(0)

In [30]:
# elimbasu5 has zero variance and rez_esc is mostly null; tipovivi3 and v18q are dropped as well
train.drop(['tipovivi3', 'v18q','rez_esc','elimbasu5'],axis=1,inplace=True)

In [31]:
# Impute the few missing education values with the column mean
train['meaneduc']=train['meaneduc'].fillna(train['meaneduc'].mean())
train['SQBmeaned']=train['SQBmeaned'].fillna(train['SQBmeaned'].mean())
print(train.isna().sum().value_counts())

0    137
dtype: int64


In [36]:
int_col=[]
for i in train.columns:
    a=train[i].dtype
    if a == 'int64':
        int_col.append(i)
print(int_col)

['hacdor', 'rooms', 'hacapo', 'v14a', 'refrig', 'r4h1', 'r4h2', 'r4h3', 'r4m1', 'r4m2', 'r4m3', 'r4t1', 'r4t2', 'r4t3', 'tamhog', 'tamviv', 'escolari', 'hhsize', 'paredblolad', 'paredzocalo', 'paredpreb', 'pareddes', 'paredmad', 'paredzinc', 'paredfibras', 'paredother', 'pisomoscer', 'pisocemento', 'pisoother', 'pisonatur', 'pisonotiene', 'pisomadera', 'techozinc', 'techoentrepiso', 'techocane', 'techootro', 'cielorazo', 'abastaguadentro', 'abastaguafuera', 'abastaguano', 'public', 'planpri', 'noelec', 'coopele', 'sanitario1', 'sanitario2', 'sanitario3', 'sanitario5', 'sanitario6', 'energcocinar1', 'energcocinar2', 'energcocinar3', 'energcocinar4', 'elimbasu1', 'elimbasu2', 'elimbasu3', 'elimbasu4', 'elimbasu6', 'epared1', 'epared2', 'epared3', 'etecho1', 'etecho2', 'etecho3', 'eviv1', 'eviv2', 'eviv3', 'dis', 'male', 'female', 'estadocivil1', 'estadocivil2', 'estadocivil3', 'estadocivil4', 'estadocivil5', 'estadocivil6', 'estadocivil7', 'parentesco1', 'parentesco2', 'parentesco3', 'pa

In [37]:
train[int_col].isna().sum().value_counts()

0    127
dtype: int64

### Set poverty level of the members and the head of the house within a family

In [38]:
train.head()

Unnamed: 0,v2a1,hacdor,rooms,hacapo,v14a,refrig,v18q1,r4h1,r4h2,r4h3,...,SQBescolari,SQBage,SQBhogar_total,SQBedjefe,SQBhogar_nin,SQBovercrowding,SQBdependency,SQBmeaned,agesq,Target
0,190000.0,0,3,0,1,1,0.0,0,1,1,...,100,1849,1,100,0,1.0,0.0,100.0,1849,4
1,135000.0,0,4,0,1,1,1.0,0,1,1,...,144,4489,1,144,0,1.0,64.0,144.0,4489,4
2,0.0,0,8,0,1,1,0.0,0,0,0,...,121,8464,1,0,0,0.25,64.0,121.0,8464,4
3,180000.0,0,5,0,1,1,1.0,0,2,2,...,81,289,16,121,4,1.777778,1.0,121.0,289,4
4,180000.0,0,5,0,1,1,1.0,0,2,2,...,121,1369,16,121,4,1.777778,1.0,121.0,1369,4


In [39]:
Poverty_level=train[train['v2a1'] !=0]
Poverty_level.shape

(2668, 137)

In [40]:
poverty_level=Poverty_level.groupby('area1')['v2a1'].apply(np.median)
poverty_level

area1
0     80000.0
1    140000.0
Name: v2a1, dtype: float64

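Taking the median rent in each area as the threshold: in rural areas (area1 = 0), households paying less than 80,000 in rent are considered below the poverty level; in urban areas (area1 = 1), households paying less than 140,000 are considered below the poverty level.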

In [42]:
def povert(x):
    # Rural cut-off; note the rural median computed above is 80000, while this cell uses 8000
    if x<8000:
        return('Below poverty level')
    # Above the urban median
    elif x>140000:
        return('Above poverty level')
    # Between the two cut-offs; rents of exactly 140000 match no branch and return None
    elif x<140000:
        return('Below poverty level: Urban ; Above poverty level : Rural ')
c=Poverty_level['v2a1'].apply(povert)
pd.crosstab(c,Poverty_level['area1'])

area1,0,1
v2a1,Unnamed: 1_level_1,Unnamed: 2_level_1
Above poverty level,139,1103
Below poverty level: Urban ; Above poverty level : Rural,306,1081
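Comparing each household with the median rent of its own area avoids hard-coded cut-offs. The following is a sketch of that idea only, on made-up rents; the column `below_area_median` is a hypothetical name, and using the area median as a poverty threshold is the notebook's working assumption, not an established rule.

```python
import pandas as pd

# Toy rents: area1 == 0 stands for rural, area1 == 1 for urban.
toy = pd.DataFrame({
    "area1": [0, 0, 0, 1, 1, 1],
    "v2a1": [60000.0, 80000.0, 100000.0, 120000.0, 140000.0, 200000.0],
})

# Median rent of each row's own area, aligned row by row with the frame.
area_median = toy.groupby("area1")["v2a1"].transform("median")

# Flag households paying strictly less than their area's median rent.
toy["below_area_median"] = toy["v2a1"] < area_median
print(pd.crosstab(toy["below_area_median"], toy["area1"]))
```

Because every row gets a True or False flag, no rent value falls outside the classification.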


### Predict the accuracy using random forest classifier.

In [43]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

In [44]:
X_data=train.drop('Target',axis=1)
Y_data=train.Target
X_data_col=X_data.columns

In [45]:
X_train,X_test,Y_train,Y_test=train_test_split(X_data,Y_data,test_size=0.25,stratify=Y_data,random_state=0)
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

In [48]:
rfc=RandomForestClassifier(random_state=0)
parameters={'n_estimators':[10,50,100,300],'max_depth':[3,5,10,15]}
# Grid search over the 16 parameter combinations with 3-fold cross validation
best_=GridSearchCV(rfc,param_grid=parameters,cv=3,n_jobs=1).fit(X_train,Y_train)

In [49]:
print ("Best CV Score",best_.best_score_)
print ("Model Parameters",best_.best_params_)
print("Best Estimator",best_.best_estimator_)

Best CV Score 0.851960373936096
Model Parameters {'max_depth': 15, 'n_estimators': 300}
Best Estimator RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=15, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=300, n_jobs=None,
            oob_score=False, random_state=0, verbose=0, warm_start=False)


In [50]:
RFC=best_.best_estimator_
Model=RFC.fit(X_train,Y_train)
pred=Model.predict(X_test)
print('Model Score of train data : {}'.format(Model.score(X_train,Y_train)))
print('Model Score of test data : {}'.format(Model.score(X_test,Y_test)))

Model Score of train data : 0.9838147062927306
Model Score of test data : 0.8845188284518829


In [51]:
Important_features=pd.DataFrame(Model.feature_importances_,X_data_col,columns=['feature_importance'])
Top50Features=Important_features.sort_values(by='feature_importance',ascending=False).head(50).index
Top50Features

Index(['SQBmeaned', 'meaneduc', 'dependency', 'SQBdependency', 'overcrowding',
       'qmobilephone', 'SQBovercrowding', 'SQBhogar_nin', 'hogar_nin',
       'SQBedjefe', 'edjefe', 'rooms', 'cielorazo', 'escolari', 'r4t1', 'r4h2',
       'v2a1', 'eviv3', 'edjefa', 'age', 'SQBage', 'agesq', 'r4m3', 'r4t2',
       'SQBescolari', 'r4h3', 'hogar_adul', 'r4m1', 'r4m2', 'bedrooms',
       'pisomoscer', 'paredblolad', 'tamviv', 'epared3', 'SQBhogar_total',
       'v18q1', 'tamhog', 'hhsize', 'hogar_total', 'r4t3', 'etecho3', 'r4h1',
       'lugar1', 'energcocinar2', 'eviv2', 'energcocinar3', 'tipovivi1',
       'epared2', 'etecho1', 'television'],
      dtype='object')

In [52]:
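from sklearn.metrics import confusion_matrix,f1_score,accuracy_score
# Refit on the 50 most important features only, using the same split settings
X_data_Top50=X_data[Top50Features]
X_train,X_test,Y_train,Y_test=train_test_split(X_data_Top50,Y_data,test_size=0.25,stratify=Y_data,random_state=0)
# RFC.fit refits the same estimator object in place, so Model_1 and Model refer to the same refitted model
Model_1=RFC.fit(X_train,Y_train)
pred=Model_1.predict(X_test)
# Rows are true classes, columns are predicted classes
confusion_matrix(Y_test,pred)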

array([[ 141,   16,    1,   31],
       [  12,  315,    5,   67],
       [   1,   13,  211,   77],
       [   1,   11,    1, 1487]], dtype=int64)

In [53]:
f1_score(Y_test,pred,average='weighted')

0.8971625790961933

In [54]:
accuracy_score(Y_test,pred)

0.901255230125523
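The last item in the task list, checking accuracy with cross validation, can be sketched with `cross_val_score`. The example below runs on synthetic data from `make_classification` rather than the household dataset, so its scores say nothing about the real problem. The fold count and forest size are assumptions chosen to keep it quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic 4-class data standing in for the household features and Target.
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

model = RandomForestClassifier(n_estimators=50, max_depth=15, random_state=0)

# Stratified folds keep the class proportions similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print("fold accuracies:", scores)
print("mean accuracy:", scores.mean(), "std:", scores.std())
```

On the real data, `X_data_Top50` and `Y_data` would take the place of `X` and `y`. Rows from the same household can land in different folds, so a grouped splitter may give a more conservative estimate.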