![logo_ironhack_blue 7](https://user-images.githubusercontent.com/23629340/40541063-a07a0a8a-601a-11e8-91b5-2f13e4e6b441.png)

# Lab | Handling Data Imbalance in Classification Models

In this lab and in the next lessons, we will use the 'Healthcare For All' dataset to build a model that predicts who will donate (`TARGET_B`) and how much they will give (`TARGET_D`); the latter will be used in Friday's lab. You will be using the `files_for_lab/learningSet.csv` file, which you have already downloaded in class.

### Scenario

You are revisiting the Healthcare for All case study. You are provided with historical data about donors and how much they donated. Your task is to build a machine learning model that helps the company identify people who are more likely to donate, and then to predict the donation amount.

### Instructions

In this lab, we will first look at the degree of imbalance in the data and then correct it using the techniques we learned in class.

Here is the list of steps to be followed (building a simple model without balancing the data):

- Import the required libraries and modules that you would need.
- Read the data into Python and call the dataframe `donors`.
- Check the datatypes of all the columns in the data.
- Check for null values in the dataframe. Replace the null values using the methods learned in class.
- Split the data into numerical and categorical. Decide if any columns need their dtype changed.
- Concatenate numerical and categorical back together again for your X dataframe. Designate the target as y.
- Split the data into a training set and a test set.
- Split further into `train_num` and `train_cat`, and likewise `test_num` and `test_cat`.
- Scale the numerical features (`train_num`, `test_num`) using either a normalizer or a standard scaler.
- Encode the categorical features (`train_cat`, `test_cat`) using One-Hot Encoding or Ordinal Encoding.
  - **Fit** only on the train data, then transform both train and test.
  - Re-concatenate `train_num` and `train_cat` as `X_train`, and `test_num` and `test_cat` as `X_test`.
- Fit a logistic regression model on the training data.
- Check the accuracy on the test data.
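
The steps above can be sketched end to end on a small synthetic frame. This is a minimal sketch, not the lab solution: the toy columns `age`, `income`, `state` and the target `gave` are made up for illustration, and the helper `assemble` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for `donors`
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age": rng.integers(18, 90, n).astype(float),
    "income": rng.normal(50, 15, n),
    "state": rng.choice(["CA", "NY", "TX"], n),
})
y = pd.Series((toy["income"] > 55).astype(int), name="gave")

X_train, X_test, y_train, y_test = train_test_split(
    toy, y, test_size=0.2, random_state=42
)

# Split each set into numerical and categorical parts
train_num = X_train.select_dtypes(include="number")
test_num = X_test.select_dtypes(include="number")
train_cat = X_train.select_dtypes(exclude="number")
test_cat = X_test.select_dtypes(exclude="number")

# Fit the scaler and the encoder on the training data only
scaler = StandardScaler().fit(train_num)
encoder = OneHotEncoder(handle_unknown="ignore").fit(train_cat)

def assemble(num, cat):
    # Re-concatenate scaled numerical and one-hot encoded categorical features
    return np.hstack([scaler.transform(num), encoder.transform(cat).toarray()])

X_train_ready = assemble(train_num, train_cat)
X_test_ready = assemble(test_num, test_cat)

model = LogisticRegression(max_iter=1000).fit(X_train_ready, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test_ready))
print("Accuracy:", accuracy)
```

Fitting the transformers on the training split alone keeps information from the test rows out of the scaling statistics and the category list.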

**Note**: So far we have not balanced the data.

### Managing imbalance in the dataset

- Check for the imbalance.
- Use the upsampling and downsampling strategies covered in class to balance the two classes.
- After each resampling, refit the model and compare its accuracy with the previous result.
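
One way to check the imbalance and resample by hand is sketched below. It assumes a toy training frame with a binary `TARGET_B` column and uses `sklearn.utils.resample` rather than the `imblearn` samplers; the 95/5 split is illustrative, not taken from the dataset. Only the training set should be resampled, so that the test set keeps the original class ratio.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical training frame: 95 non-donors and 5 donors
train = pd.DataFrame({"feature": range(100), "TARGET_B": [0] * 95 + [1] * 5})

# Check the imbalance as class shares
print(train["TARGET_B"].value_counts(normalize=True))

majority = train[train["TARGET_B"] == 0]
minority = train[train["TARGET_B"] == 1]

# Upsampling: draw minority rows with replacement until the classes match
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
train_up = pd.concat([majority, minority_up])

# Downsampling: draw majority rows without replacement down to the minority size
majority_down = resample(
    majority, replace=False, n_samples=len(minority), random_state=42
)
train_down = pd.concat([majority_down, minority])

print(train_up["TARGET_B"].value_counts().to_dict())
print(train_down["TARGET_B"].value_counts().to_dict())
```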



In [20]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings("ignore")

In [26]:
donors = pd.read_csv('learningSet.csv')
donors['AGE']

0        60.0
1        46.0
2         NaN
3        70.0
4        78.0
         ... 
95407     NaN
95408    48.0
95409    60.0
95410    58.0
95411    80.0
Name: AGE, Length: 95412, dtype: float64

In [22]:
donors.dtypes

ODATEDW       int64
OSOURCE      object
TCODE         int64
STATE        object
ZIP          object
             ...   
MDMAUD_R     object
MDMAUD_F     object
MDMAUD_A     object
CLUSTER2    float64
GEOCODE2     object
Length: 481, dtype: object

In [23]:
null_values = donors.isna().sum()
null_values = null_values[null_values > 0]
null_values

AGE         23665
NUMCHLD     83026
INCOME      21286
WEALTH1     44732
MBCRAFT     52854
            ...  
RAMNT_24    77674
NEXTDATE     9973
TIMELAG      9973
CLUSTER2      132
GEOCODE2      132
Length: 92, dtype: int64
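
With 92 columns containing nulls, it can help to look at the share of missing values per column before choosing how to fill them. The sketch below uses a small stand-in frame and an arbitrary 75% cutoff for dropping very sparse columns; the cutoff and the choice of mean and mode imputation are illustrative assumptions, not the method prescribed by the lab.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `donors`
df = pd.DataFrame({
    "AGE": [60.0, 46.0, np.nan, 70.0, 78.0],
    "NUMCHLD": [np.nan, np.nan, np.nan, np.nan, 1.0],
    "GEOCODE2": ["A", None, "B", "A", "A"],
})

# Share of missing values per column, largest first
null_share = df.isna().mean().sort_values(ascending=False)
print(null_share)

# Drop columns that are mostly empty (0.75 is an arbitrary cutoff)
df = df.drop(columns=null_share[null_share > 0.75].index)

# Fill numerical nulls with the column mean and categorical nulls with the mode
num_cols = df.select_dtypes(include="number").columns
cat_cols = df.columns.difference(num_cols)
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())
for col in cat_cols:
    df[col] = df[col].fillna(df[col].mode().iloc[0])

print(df)
```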

In [49]:
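# Split the columns by dtype
numerical_cols = donors.select_dtypes(include=['float64', 'int64'])
categorical_cols = donors.select_dtypes(include=['object'])
# Fill numerical nulls with the column mean and categorical nulls by forward fill
numerical_cols = numerical_cols.fillna(numerical_cols.mean())
categorical_cols = categorical_cols.ffill()
# Note: TARGET_B and TARGET_D are numeric, so they are still in numerical_cols here;
# leaving them in X leaks the label into the features.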

In [50]:
X = pd.concat([numerical_cols, categorical_cols], axis=1)
y = donors['TARGET_B']

In [51]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [65]:
# Keep numerical columns numeric for the scaler; cast categoricals to str so mixed-type columns encode consistently
train_num = X_train.select_dtypes(include=['float64', 'int64'])
train_cat = X_train.select_dtypes(include=['object']).astype(str)
test_num = X_test.select_dtypes(include=['float64', 'int64'])
test_cat = X_test.select_dtypes(include=['object']).astype(str)

In [59]:
scaler = StandardScaler()
train_num_scaled = scaler.fit_transform(train_num)
test_num_scaled = scaler.transform(test_num)

In [67]:
encoder = OneHotEncoder(handle_unknown='ignore')
train_cat_encoded = encoder.fit_transform(train_cat)
test_cat_encoded = encoder.transform(test_cat)

In [70]:
X_train_encoded = pd.concat([pd.DataFrame(train_num_scaled), pd.DataFrame(train_cat_encoded.toarray())], axis=1)
X_test_encoded = pd.concat([pd.DataFrame(test_num_scaled), pd.DataFrame(test_cat_encoded.toarray())], axis=1)
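
Calling `.toarray()` on the one-hot matrix materialises every dummy column densely, which can be heavy when a column such as `ZIP` has many distinct values. A possible alternative, sketched here on toy arrays, is to keep the encoded part sparse and join it to the scaled part with `scipy.sparse.hstack`; `LogisticRegression` accepts sparse input. The variable names mirror the cells above, but the data is made up.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-ins for train_num, train_cat and y_train
train_num = np.array([[1.0], [2.0], [3.0], [4.0]])
train_cat = np.array([["a"], ["b"], ["a"], ["c"]])
y_train = np.array([0, 0, 1, 1])

train_num_scaled = StandardScaler().fit_transform(train_num)
train_cat_encoded = OneHotEncoder(handle_unknown="ignore").fit_transform(train_cat)

# Stack the scaled block and the sparse one-hot block without densifying
X_train_sparse = sparse.hstack(
    [sparse.csr_matrix(train_num_scaled), train_cat_encoded]
).tocsr()

model = LogisticRegression().fit(X_train_sparse, y_train)
print(X_train_sparse.shape)
```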

In [71]:
model = LogisticRegression()
model.fit(X_train_encoded, y_train)

LogisticRegression()

In [72]:
y_pred = model.predict(X_test_encoded)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.9999475973379448


In [None]:
oversampler = RandomOverSampler(random_state=42)  # fixed seed so the resampling is reproducible
X_train_resampled, y_train_resampled = oversampler.fit_resample(X_train_encoded, y_train)

In [None]:
model_resampled = LogisticRegression()
model_resampled.fit(X_train_resampled, y_train_resampled)


In [None]:
y_pred_resampled = model_resampled.predict(X_test_encoded)
accuracy_resampled = accuracy_score(y_test, y_pred_resampled)
print("Accuracy (Upsampled):", accuracy_resampled)

In [None]:
undersampler = RandomUnderSampler(random_state=42)  # fixed seed so the resampling is reproducible
X_train_resampled, y_train_resampled = undersampler.fit_resample(X_train_encoded, y_train)

In [None]:
model_resampled = LogisticRegression()
model_resampled.fit(X_train_resampled, y_train_resampled)
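
The undersampled model still needs to be evaluated on the untouched test set, as was done for the upsampled one. The following sketch shows that comparison on synthetic data from `make_classification`; the helper `undersample` is a hypothetical NumPy stand-in for `RandomUnderSampler`, and recall is reported next to accuracy as an extra metric not used in the cells above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

def undersample(X, y, seed=42):
    """Hypothetical helper: keep an equal number of rows from every class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]

# Synthetic imbalanced data with roughly 5% positives
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Resample the training data only; the test set keeps its original ratio
X_down, y_down = undersample(X_train, y_train)

runs = {"original": (X_train, y_train), "downsampled": (X_down, y_down)}
for name, (X_fit, y_fit) in runs.items():
    model = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
    y_pred = model.predict(X_test)
    print(
        name,
        "accuracy:", accuracy_score(y_test, y_pred),
        "recall:", recall_score(y_test, y_pred),
    )
```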