![logo_ironhack_blue 7](https://user-images.githubusercontent.com/23629340/40541063-a07a0a8a-601a-11e8-91b5-2f13e4e6b441.png)

# Lab | Handling Data Imbalance in Classification Models

For this lab and the next lessons, we will use the 'Healthcare For All' dataset to build a model that predicts who will donate (`TARGET_B`) and how much they will give (`TARGET_D`). The donation amount will be used in Friday's lab. You will be using the `files_for_lab/learningSet.csv` file, which you have already downloaded in class.

### Scenario

You are revisiting the Healthcare for All case study. You are given historical data about donors and how much they donated. Your task is to build a machine learning model that helps the company identify the people most likely to donate, and then to predict the donation amount.

### Instructions

In this lab, we will first look at the degree of imbalance in the data and then correct it using the techniques we learned in class.

Here is the list of steps to be followed (building a simple model without balancing the data):

- Import the required libraries and modules that you would need.
- Read the data into Python and call the dataframe `donors`.
- Check the datatypes of all the columns in the data.
- Check for null values in the dataframe. Replace the null values using the methods learned in class.
- Split the data into numerical and categorical. Decide if any columns need their dtype changed.
- Concatenate numerical and categorical back together again for your X dataframe. Designate the Target as y.
  
  - Split the data into a training set and a test set.
  - Split further into train_num and train_cat.  Also test_num and test_cat.
  - Scale the features either by using normalizer or a standard scaler. (train_num, test_num)
  - Encode the categorical features using One-Hot Encoding or Ordinal Encoding.  (train_cat, test_cat)
      - **fit** only on the train data, then transform both train and test
      - re-concatenate train_num and train_cat as X_train, and test_num and test_cat as X_test
  - Fit a logistic regression model on the training data.
  - Check the accuracy on the test data.

**Note**: So far we have not balanced the data.

In [79]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings("ignore")
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

In [44]:
numerical = pd.read_csv('numerical_df.csv',index_col=0)
categorical = pd.read_csv('categorical_df.csv')
y = pd.read_csv('Y.csv')

In [None]:
# Check for null values in the dataframe. Replace the null values using the methods learned in class.

In [None]:
# numerical

In [45]:
numerical.isnull().sum()

ODATEDW     0
TCODE       0
DOB         0
AGE         0
INCOME      0
           ..
AVGGIFT     0
CONTROLN    0
HPHONE_D    0
RFA_2F      0
CLUSTER2    0
Length: 322, dtype: int64

In [46]:
df= pd.DataFrame(numerical.isna().sum()/len(numerical)).reset_index()
df.columns = ['column_name', 'nulls']
df[df['nulls']>0].sort_values(by='nulls',ascending=False)

|     | column_name | nulls    |
|-----|-------------|----------|
| 5   | WEALTH1     | 0.46883  |
| 315 | NEXTDATE    | 0.104526 |
| 135 | MSA         | 0.001383 |
| 136 | ADI         | 0.001383 |
| 137 | DMA         | 0.001383 |


In [47]:
numerical['WEALTH1'].value_counts()

9.0    7585
8.0    6793
7.0    6198
6.0    5825
5.0    5280
4.0    4810
3.0    4237
2.0    4085
1.0    3454
0.0    2413
Name: WEALTH1, dtype: int64

In [48]:
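# Fill with the most frequent value as a number (9.0), not the string '9.0', so the column stays numeric
numerical['WEALTH1'] = numerical['WEALTH1'].fillna(9.0)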

In [49]:
numerical['NEXTDATE'] = numerical['NEXTDATE'].fillna(round(np.mean(numerical['NEXTDATE'])))

In [50]:
numerical['MSA'].value_counts()

0.0       21333
4480.0     4606
1600.0     4059
2160.0     2586
520.0      1685
          ...  
9140.0        1
3200.0        1
9280.0        1
743.0         1
8480.0        1
Name: MSA, Length: 298, dtype: int64

In [51]:
# Fill with the most frequent value as a number, not a string
numerical['MSA'] = numerical['MSA'].fillna(0.0)

In [52]:
numerical['ADI'].value_counts()

13.0     7296
51.0     4622
65.0     3765
57.0     2836
105.0    2617
         ... 
651.0       1
103.0       1
601.0       1
161.0       1
147.0       1
Name: ADI, Length: 204, dtype: int64

In [53]:
# Fill with the most frequent value as a number, not a string
numerical['ADI'] = numerical['ADI'].fillna(13.0)

In [54]:
numerical['DMA'].value_counts()

803.0    7296
602.0    4632
807.0    3765
505.0    2839
819.0    2588
         ... 
569.0       1
554.0       1
584.0       1
552.0       1
516.0       1
Name: DMA, Length: 206, dtype: int64

In [55]:
# Fill with the most frequent value as a number, not a string
numerical['DMA'] = numerical['DMA'].fillna(803.0)

In [56]:
df= pd.DataFrame(numerical.isna().sum()/len(numerical)).reset_index()
df.columns = ['column_name', 'nulls']
df[df['nulls']>0].sort_values(by='nulls',ascending=False)

Unnamed: 0,column_name,nulls
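
The column-by-column fills above all follow one pattern: look up the most frequent value, then fill with it. A small helper can do this in one pass. The sketch below is illustrative only; `fill_with_mode` and the toy values are hypothetical, not part of the lab files.

```python
import numpy as np
import pandas as pd

def fill_with_mode(df, columns):
    """Return a copy of df where NaNs in `columns` are replaced by each column's most frequent value."""
    filled = df.copy()
    for col in columns:
        most_frequent = filled[col].mode().iloc[0]  # mode() ignores NaN by default
        filled[col] = filled[col].fillna(most_frequent)
    return filled

# Toy data (hypothetical values) with the same kind of gaps as the lab columns.
toy = pd.DataFrame({
    "WEALTH1": [9.0, 9.0, 8.0, np.nan, np.nan],
    "MSA": [0.0, 0.0, 4480.0, np.nan, 1600.0],
})

filled = fill_with_mode(toy, ["WEALTH1", "MSA"])
print(filled.dtypes)   # both columns remain float64
print(filled)
```

Because the fill value comes from the column itself, it has the column's own dtype, which avoids accidentally mixing strings into a numeric column.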


In [None]:
# categorical

In [57]:
df= pd.DataFrame(categorical.isna().sum()/len(categorical)).reset_index()
df.columns = ['column_name', 'nulls']
df[df['nulls']>0].sort_values(by='nulls',ascending=False)

|     | column_name | nulls    |
|-----|-------------|----------|
| 8   | SOLIH       | 0.935019 |
| 9   | VETERANS    | 0.890727 |
| 6   | GENDER      | 0.030992 |
| 1   | OSOURCE     | 0.009726 |


In [58]:
categorical.drop(['SOLIH'], axis=1,inplace=True)
categorical.drop(['VETERANS'], axis=1,inplace=True)

In [59]:
categorical['GENDER'].value_counts()

F    51277
M    39094
U     1715
J      365
C        2
A        2
Name: GENDER, dtype: int64

In [60]:
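# Treat missing gender as unknown ('U'); assign back instead of using inplace on a column selection
categorical['GENDER'] = categorical['GENDER'].fillna('U')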

In [61]:
# Fold the rare codes J, C and A into the unknown category 'U'
categorical['GENDER'] = categorical['GENDER'].replace(['J', 'C', 'A'], 'U')
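
The GENDER cleanup (fill missing values, then fold rare codes into `U`) can also be written as a single allow-list step. This is a sketch with a hypothetical helper `collapse_rare` and made-up values; it assumes `F` and `M` are the only categories worth keeping.

```python
import pandas as pd

def collapse_rare(series, keep, other="U"):
    """Keep values listed in `keep`; map everything else, including NaN, to `other`."""
    return series.where(series.isin(keep), other)

# Toy column (hypothetical values) with rare codes and a missing entry.
gender = pd.Series(["F", "M", "U", "J", "C", "A", None, "F"])

cleaned = collapse_rare(gender, keep=["F", "M"])
print(cleaned.value_counts())
```

An allow-list is a little more robust than replacing known rare codes one by one, since any unexpected code that shows up later is also mapped to `U`.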

In [62]:
categorical['OSOURCE'].value_counts()

MBC    4539
SYN    3563
AML    3430
BHG    3324
IMP    2986
       ... 
MDD       1
NRM       1
HDP       1
CRP       1
VIC       1
Name: OSOURCE, Length: 895, dtype: int64

In [64]:
# OSOURCE has 895 distinct values, too many to one-hot encode, so drop the column
categorical.drop(['OSOURCE'], axis=1,inplace=True)

In [65]:
df= pd.DataFrame(categorical.isna().sum()/len(categorical)).reset_index()
df.columns = ['column_name', 'nulls']
df[df['nulls']>0].sort_values(by='nulls',ascending=False)

Unnamed: 0,column_name,nulls


In [68]:
categorical.drop(['Unnamed: 0'], axis=1,inplace=True)

In [None]:
#Decide if any columns need their dtype changed.

In [69]:
categorical.dtypes

STATE       object
ZIP         object
CLUSTER      int64
HOMEOWNR    object
GENDER      object
DATASRCE     int64
RFA_2       object
RFA_2R      object
GEOCODE2    object
DOMAIN_A    object
DOMAIN_B     int64
dtype: object

In [73]:
categorical['CLUSTER'] = categorical['CLUSTER'].astype(object)
categorical['DATASRCE'] = categorical['DATASRCE'].astype(object)
categorical['DOMAIN_B'] = categorical['DOMAIN_B'].astype(object)

In [74]:
numerical.dtypes

ODATEDW       int64
TCODE         int64
DOB           int64
AGE         float64
INCOME      float64
             ...   
AVGGIFT     float64
CONTROLN      int64
HPHONE_D      int64
RFA_2F        int64
CLUSTER2    float64
Length: 322, dtype: object

In [75]:
categorical

Unnamed: 0,STATE,ZIP,CLUSTER,HOMEOWNR,GENDER,DATASRCE,RFA_2,RFA_2R,GEOCODE2,DOMAIN_A,DOMAIN_B
0,IL,61081,36,U,F,3,L4E,L,C,T,2
1,CA,91326,14,H,M,3,L2G,L,A,S,1
2,NC,27017,43,U,M,3,L4E,L,C,R,2
3,CA,95953,44,U,F,3,L4E,L,C,R,2
4,FL,33176,16,H,F,3,L2F,L,A,S,2
...,...,...,...,...,...,...,...,...,...,...,...
95407,OTHER,99504,27,U,M,3,L1G,L,C,C,2
95408,TX,77379,24,H,M,3,L1F,L,A,C,1
95409,MI,48910,30,U,M,3,L3E,L,B,C,3
95410,CA,91320,24,H,F,2,L4F,L,A,C,1


In [76]:
numerical

Unnamed: 0,ODATEDW,TCODE,DOB,AGE,INCOME,WEALTH1,HIT,MALEMILI,MALEVET,VIETVETS,...,LASTGIFT,LASTDATE,FISTDATE,NEXTDATE,TIMELAG,AVGGIFT,CONTROLN,HPHONE_D,RFA_2F,CLUSTER2
0,8901,0,3712,60.000000,0.0,9.0,0,0,39,34,...,10.0,9512,8911,9003.0,4.0,7.741935,95515,0,4,39.0
1,9401,1,5202,46.000000,6.0,9.0,16,0,15,55,...,25.0,9512,9310,9504.0,18.0,15.666667,148535,0,2,1.0
2,9001,1,0,61.611649,3.0,1.0,2,0,20,29,...,5.0,9512,9001,9101.0,12.0,7.481481,15078,1,4,60.0
3,8701,0,2801,70.000000,1.0,4.0,2,0,23,14,...,10.0,9512,8702,8711.0,9.0,6.812500,172556,1,4,41.0
4,8601,0,2001,78.000000,3.0,2.0,60,1,28,9,...,15.0,9601,7903,8005.0,14.0,6.864865,7112,1,2,26.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95407,9601,1,0,61.611649,0.0,9.0,0,14,36,47,...,25.0,9602,9602,9151.0,8.0,25.000000,184568,0,1,12.0
95408,9601,1,5001,48.000000,7.0,9.0,1,0,31,43,...,20.0,9603,9603,9151.0,8.0,20.000000,122706,1,1,2.0
95409,9501,1,3801,60.000000,0.0,9.0,0,0,18,46,...,10.0,9610,9410,9501.0,3.0,8.285714,189641,1,3,34.0
95410,8601,0,4005,58.000000,7.0,9.0,0,0,28,35,...,18.0,9701,8612,8704.0,4.0,12.146341,4693,1,4,11.0


In [80]:
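#Scale the features either by using normalizer or a standard scaler.

scaler = MinMaxScaler()

# MinMaxScaler scales every column independently, so a single call replaces the per-column loop.
# Note: the scaler is fit on the full dataset here, before the train/test split.
numerical = pd.DataFrame(scaler.fit_transform(numerical), columns=numerical.columns, index=numerical.index)

numerical.head()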

Unnamed: 0,ODATEDW,TCODE,DOB,AGE,INCOME,WEALTH1,HIT,MALEMILI,MALEVET,VIETVETS,...,LASTGIFT,LASTDATE,FISTDATE,NEXTDATE,TIMELAG,AVGGIFT,CONTROLN,HPHONE_D,RFA_2F,CLUSTER2
0,0.426523,0.0,0.382286,0.608247,0.0,1.0,0.0,0.0,0.393939,0.343434,...,0.01,0.045226,0.927939,0.71939,0.003676,0.006465,0.498045,0.0,1.0,0.622951
1,0.784946,1.4e-05,0.535736,0.463918,0.857143,1.0,0.06639,0.0,0.151515,0.555556,...,0.025,0.045226,0.969489,0.920514,0.016544,0.014399,0.77451,0.0,0.333333,0.0
2,0.498208,1.4e-05,0.0,0.624862,0.428571,0.111111,0.008299,0.0,0.20202,0.292929,...,0.005,0.045226,0.937311,0.758731,0.011029,0.006204,0.078617,1.0,1.0,0.967213
3,0.283154,0.0,0.288465,0.71134,0.142857,0.444444,0.008299,0.0,0.232323,0.141414,...,0.01,0.045226,0.906175,0.602168,0.008272,0.005534,0.899764,1.0,1.0,0.655738
4,0.21147,0.0,0.206076,0.793814,0.428571,0.222222,0.248963,0.010101,0.282828,0.090909,...,0.015,0.492462,0.822972,0.318747,0.012868,0.005586,0.037079,1.0,0.333333,0.409836


In [None]:
#Encode the categorical features using One-Hot Encoding or Ordinal Encoding. (train_cat, test_cat)

In [81]:
categorical.nunique()

STATE          12
ZIP         19938
CLUSTER        53
HOMEOWNR        2
GENDER          3
DATASRCE        3
RFA_2          14
RFA_2R          1
GEOCODE2        4
DOMAIN_A        5
DOMAIN_B        4
dtype: int64

In [82]:
categorical.drop(['ZIP'], axis=1,inplace=True)

In [83]:
one_hot_names = []
for col in categorical.columns:
    col_uniques = sorted(categorical[col].astype(str).unique())
    for unique in col_uniques:
        one_hot_names.append(col+"_"+unique)
        
categorical = pd.DataFrame(OneHotEncoder().fit_transform(categorical.astype(str)).toarray())
categorical.columns = one_hot_names
categorical.head()

Unnamed: 0,STATE_CA,STATE_FL,STATE_GA,STATE_IL,STATE_IN,STATE_MI,STATE_MO,STATE_NC,STATE_OTHER,STATE_TX,...,GEOCODE2_D,DOMAIN_A_C,DOMAIN_A_R,DOMAIN_A_S,DOMAIN_A_T,DOMAIN_A_U,DOMAIN_B_1,DOMAIN_B_2,DOMAIN_B_3,DOMAIN_B_4
0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0


In [None]:
#Concatenate numerical and categorical back together again for your X dataframe. Designate the Target as y.

In [84]:
X = pd.concat([numerical, categorical], axis=1)

In [85]:
X

Unnamed: 0,ODATEDW,TCODE,DOB,AGE,INCOME,WEALTH1,HIT,MALEMILI,MALEVET,VIETVETS,...,GEOCODE2_D,DOMAIN_A_C,DOMAIN_A_R,DOMAIN_A_S,DOMAIN_A_T,DOMAIN_A_U,DOMAIN_B_1,DOMAIN_B_2,DOMAIN_B_3,DOMAIN_B_4
0,0.426523,0.000000,0.382286,0.608247,0.000000,1.000000,0.000000,0.000000,0.393939,0.343434,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
1,0.784946,0.000014,0.535736,0.463918,0.857143,1.000000,0.066390,0.000000,0.151515,0.555556,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
2,0.498208,0.000014,0.000000,0.624862,0.428571,0.111111,0.008299,0.000000,0.202020,0.292929,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0.283154,0.000000,0.288465,0.711340,0.142857,0.444444,0.008299,0.000000,0.232323,0.141414,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,0.211470,0.000000,0.206076,0.793814,0.428571,0.222222,0.248963,0.010101,0.282828,0.090909,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95407,0.928315,0.000014,0.000000,0.624862,0.000000,1.000000,0.000000,0.141414,0.363636,0.474747,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
95408,0.928315,0.000014,0.515036,0.484536,1.000000,1.000000,0.004149,0.000000,0.313131,0.434343,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
95409,0.856631,0.000014,0.391452,0.608247,0.000000,1.000000,0.000000,0.000000,0.181818,0.464646,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
95410,0.211470,0.000000,0.412461,0.587629,1.000000,1.000000,0.000000,0.000000,0.282828,0.353535,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


In [87]:
y.drop(['Unnamed: 0'], axis=1,inplace=True)

In [98]:
y.drop(['TARGET_D'], axis=1, inplace=True)

In [99]:
y

|       | TARGET_B |
|-------|----------|
| 0     | 0        |
| 1     | 0        |
| 2     | 0        |
| 3     | 0        |
| 4     | 0        |
| ...   | ...      |
| 95407 | 0        |
| 95408 | 0        |
| 95409 | 0        |
| 95410 | 1        |


In [100]:
columns_with_nan = X.columns[X.isnull().any()]
print(columns_with_nan)

Index([], dtype='object')


In [101]:
#Split the data into a training set and a test set.
#Split further into train_num and train_cat. Also test_num and test_cat.

In [102]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.head()

Unnamed: 0,ODATEDW,TCODE,DOB,AGE,INCOME,WEALTH1,HIT,MALEMILI,MALEVET,VIETVETS,...,GEOCODE2_D,DOMAIN_A_C,DOMAIN_A_R,DOMAIN_A_S,DOMAIN_A_T,DOMAIN_A_U,DOMAIN_B_1,DOMAIN_B_2,DOMAIN_B_3,DOMAIN_B_4
85225,0.928315,0.000389,0.0,0.624862,0.0,1.0,0.0,0.0,0.292929,0.363636,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
70004,0.498208,0.0,0.350257,0.649485,0.571429,1.0,0.0,0.0,0.232323,0.232323,...,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
88133,0.354839,1.4e-05,0.360556,0.639175,1.0,0.666667,0.008299,0.0,0.252525,0.272727,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
79106,0.354839,2.8e-05,0.267868,0.731959,0.285714,1.0,0.0,0.0,0.343434,0.191919,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
35476,0.426523,0.0,0.0,0.624862,0.142857,1.0,0.0,0.0,0.373737,0.464646,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [103]:
#Fit a logistic regression model on the training data.
#Check the accuracy on the test data.

In [104]:
print(y_train.shape)

(76329, 1)


In [105]:
print(X_train.shape)

(76329, 423)


In [106]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train.values.ravel())  # pass y as a 1-D array, as scikit-learn expects
y_pred = model.predict(X_test)

In [107]:
accuracy = model.score(X_test, y_test)
accuracy

0.9487501965099827

In [108]:
y.value_counts()

TARGET_B
0           90569
1            4843
dtype: int64
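
These class counts put the accuracy above in context. With 90,569 non-donors and 4,843 donors, a model that always predicts 0 is already about 95% accurate. The sketch below reproduces that arithmetic with the counts from the output above; the all-zeros "model" is a hypothetical baseline, not the fitted logistic regression.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Class counts taken from y.value_counts() above.
n_negative, n_positive = 90569, 4843
y_true = np.array([0] * n_negative + [1] * n_positive)

# Hypothetical baseline: always predict the majority class.
y_pred = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_pred)
minority_recall = recall_score(y_true, y_pred)
balanced_accuracy = balanced_accuracy_score(y_true, y_pred)

print(f"accuracy:          {baseline_accuracy:.4f}")
print(f"recall (class 1):  {minority_recall:.4f}")
print(f"balanced accuracy: {balanced_accuracy:.4f}")
```

Accuracy alone therefore says little on data this imbalanced. Recall on class 1 or balanced accuracy shows whether the model finds any donors at all.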

In [None]:
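#Managing imbalance in the dataset
#Use the resampling strategies used in class for upsampling and downsampling to create a balance between the two classes.
#Each time fit the model and see how the accuracy of the model has changed.
#Note: each strategy below resamples the full dataset before splitting, so every accuracy is measured on a
#differently distributed test set and the scores are not directly comparable with each other.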

In [110]:
# OVERSAMPLING
smote = SMOTE()

x_resampled,y_resampled=smote.fit_resample(X,y)
y_resampled.value_counts()

TARGET_B
0           90569
1           90569
dtype: int64

In [111]:
X_train, X_test, y_train, y_test = train_test_split(x_resampled, y_resampled, test_size=0.2, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
accuracy

0.6087556586066026
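
One way to make such scores comparable is to split first and resample only the training rows, so the test set keeps the original class ratio. The sketch below shows that pattern on synthetic data. It uses plain random oversampling via `sklearn.utils.resample` as a simple stand-in for SMOTE. Calling `SMOTE().fit_resample(X_train, y_train)` at the same point would follow the same pattern. The features `f1` and `f2` and all values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data (hypothetical features, roughly 5% positives).
rng = np.random.default_rng(0)
n = 2000
X = pd.DataFrame({"f1": rng.normal(size=n), "f2": rng.normal(size=n)})
y = pd.Series(
    (X["f1"] + rng.normal(scale=0.5, size=n) > 1.8).astype(int), name="TARGET_B"
)

# 1. Split first; the test set keeps the original class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 2. Oversample the minority class in the training data only.
train = pd.concat([X_train, y_train], axis=1)
majority = train[train["TARGET_B"] == 0]
minority = train[train["TARGET_B"] == 1]
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
train_balanced = pd.concat([majority, minority_upsampled])

X_train_bal = train_balanced.drop(columns="TARGET_B")
y_train_bal = train_balanced["TARGET_B"]

# 3. Fit on the balanced training data, evaluate on the untouched test set.
model = LogisticRegression().fit(X_train_bal, y_train_bal)
y_pred = model.predict(X_test)

print("train class counts:", y_train_bal.value_counts().to_dict())
print("test class counts: ", y_test.value_counts().to_dict())
print("minority recall on test:", recall_score(y_test, y_pred))
```

Evaluating on an unresampled test set gives a score that reflects how the model would behave on real, imbalanced data.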

In [113]:
# UNDERSAMPLING
RUS=RandomUnderSampler(random_state=0)
x_resampled,y_resampled=RUS.fit_resample(X,y)

y_resampled.value_counts()

TARGET_B
0           4843
1           4843
dtype: int64

In [114]:
X_train, X_test, y_train, y_test = train_test_split(x_resampled, y_resampled, test_size=0.2, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
accuracy

0.5789473684210527

In [115]:
# TOMEKLINKS
from imblearn.under_sampling import TomekLinks

tl = TomekLinks(sampling_strategy='majority')
X_tl, y_tl=tl.fit_resample(X,y)
y_tl.value_counts()

TARGET_B
0           88837
1            4843
dtype: int64

In [116]:
X_train, X_test, y_train, y_test = train_test_split(X_tl, y_tl, test_size=0.2, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
accuracy

0.9490819812126388