## What is data imbalance?
Imbalanced data, or imbalanced classes, is a common problem in machine learning classification: the classes contain a disproportionate ratio of observations.

## Why data imbalance?
Data imbalance occurs when the collected data contains a disproportionate ratio of classes, so one class has far more observations than the other.
Balancing the data matters because an imbalanced training set biases the predictions: the model learns mostly from the larger class and becomes very likely to classify any input as that class.

## What data are we balancing?
We balance the training data, that is, the data used to fit the model, when its classes appear in unequal proportions.

## Methods used to balance data

1. **Change the performance metric**
Accuracy is not the best metric for evaluating models on imbalanced datasets because it can be very misleading: a model that always predicts the majority class still scores high. Metrics that can provide better insight include:
  - Confusion matrix: a table showing correct predictions and the types of incorrect predictions.
  - Precision: the number of true positives divided by the number of positive predictions.
  - Recall: the number of true positives divided by the number of actual positives in the test data.
  - F1 score: the harmonic mean of precision and recall.
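
As a small illustration of why accuracy misleads, the sketch below scores a "model" that always predicts the majority class. The labels are made up for this example and are not taken from the dataset used later; the metric functions are from `sklearn.metrics`.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score)

# Toy labels (made up for illustration): 1 = rare positive class, 0 = majority.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A "model" that always predicts the majority class.
y_pred = [0] * 10

accuracy = accuracy_score(y_true, y_pred)   # 0.8, looks acceptable
cm = confusion_matrix(y_true, y_pred)       # rows = actual, columns = predicted
# zero_division=0 avoids a warning when there are no positive predictions.
precision = precision_score(y_true, y_pred, zero_division=0)
recall = recall_score(y_true, y_pred, zero_division=0)  # 0.0, no positive found
f1 = f1_score(y_true, y_pred, zero_division=0)

print(accuracy, precision, recall, f1)
print(cm)
```

Accuracy is 0.8 even though the model never identifies a single positive case, which recall and F1 expose immediately.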

2. **Resampling techniques**: oversample the minority class
Oversampling can be defined as adding more copies of the minority class. It can be a good choice when you don't have a lot of data to work with. Always split into train and test sets before oversampling. Oversampling before the split can put the exact same observations in both sets, so the model can simply memorize specific data points, and the test score then hides that overfitting and overstates how well the model generalizes.
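
One way to do this is sketched below, using `sklearn.utils.resample` on synthetic data. The array sizes and the 90/10 class split are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic data (an assumption for illustration): 180 majority, 20 minority rows.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.array([0] * 180 + [1] * 20)

# Split first so duplicated rows can never leak into the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Separate the training rows by class.
X_maj = X_train[y_train == 0]
X_min = X_train[y_train == 1]

# Sample the minority class with replacement until it matches the majority.
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)

X_train_bal = np.vstack([X_maj, X_min_up])
y_train_bal = np.array([0] * len(X_maj) + [1] * len(X_min_up))
```

Only `X_train_bal` and `y_train_bal` change; `X_test` and `y_test` keep the original class ratio, so the evaluation still reflects real-world data.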

3. **Change the algorithm**
While it is a good rule of thumb to try a variety of algorithms in every machine learning problem, doing so can be especially beneficial with imbalanced datasets. Decision trees frequently perform well on imbalanced data. They work by learning a hierarchy of if/else questions, which can force both classes to be addressed.
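
A minimal sketch of such a comparison follows. The dataset is generated with `make_classification` and the two candidate models are arbitrary picks for illustration; which model scores higher depends on the data, so no result is claimed here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced problem (an assumption): roughly 95% class 0, 5% class 1.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

# Score each candidate on the minority class (label 1) rather than on accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = f1_score(y_test, model.predict(X_test),
                            pos_label=1, zero_division=0)

print(scores)
```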

4. **Resampling techniques**: undersample the majority class
Undersampling can be defined as removing some observations of the majority class. It can be a good choice when you have a lot of data (think millions of rows). A drawback is that we are removing information that may be valuable, which could lead to underfitting and poor generalization to the test set.
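
The idea can be sketched with pandas alone. The frame `train` and its columns `feature` and `label` are hypothetical stand-ins for a training split, not names from the dataset used later.

```python
import pandas as pd

# Hypothetical training frame: 950 majority rows (label 0), 50 minority rows (label 1).
train = pd.DataFrame({"feature": range(1000), "label": [0] * 950 + [1] * 50})

# Size of the smallest class.
n_min = train["label"].value_counts().min()

# Keep a random sample of n_min rows from every class, without replacement.
train_bal = train.groupby("label").sample(n=n_min, random_state=0)

print(train_bal["label"].value_counts())
```

As with oversampling, this is applied to the training split only, so the test set keeps its original class ratio.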

5. **Generate synthetic samples**
A technique similar to oversampling is to create synthetic samples. Here we will use imblearn's SMOTE (Synthetic Minority Oversampling Technique). SMOTE uses a nearest neighbors algorithm to generate new, synthetic data that we can use for training our model. Again, it is important to generate the new samples only in the training set so that the test set still measures how well the model generalizes to unseen data.
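
To show the interpolation idea behind SMOTE, here is a simplified NumPy sketch. It is not imblearn's implementation: the function `smote_sketch` and its parameters are hypothetical, and it handles only numeric features of a single minority class.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise distances between minority rows; a row is never its own neighbour.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]   # k nearest rows for each row

    base = rng.integers(0, len(X_min), size=n_new)              # random minority rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour each
    gap = rng.random((n_new, 1))                                # position on the segment

    # Each synthetic row lies on the line between a row and one of its neighbours.
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Four minority points at the corners of the unit square (toy data).
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(X_min, n_new=6, k=2)
print(synthetic)
```

With imblearn installed, the usual call is `SMOTE(random_state=0).fit_resample(X_train, y_train)` from `imblearn.over_sampling`, applied to the training split only.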







In [34]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split

In [35]:
data = pd.read_csv('ML-week1-challenge-master/data/train_data_week_1_challenge.csv')
data.head()

Unnamed: 0,continue_drop,student_id,gender,caste,mathematics_marks,english_marks,science_marks,science_teacher,languages_teacher,guardian,internet,school_id,total_students,total_toilets,establishment_year
0,continue,s01746,M,BC,0.666,0.468,0.666,7,6,other,True,305,354,86.0,1986.0
1,continue,s16986,M,BC,0.172,0.42,0.172,8,10,mother,False,331,516,15.0,1996.0
2,continue,s00147,F,BC,0.212,0.601,0.212,1,4,mother,False,311,209,14.0,1976.0
3,continue,s08104,F,ST,0.434,0.611,0.434,2,5,father,True,364,147,28.0,1911.0
4,continue,s11132,F,SC,0.283,0.478,0.283,1,10,mother,True,394,122,15.0,1889.0


In [36]:
# encoding categorical values
cleanup_nums = {"continue_drop":{"continue":1,"drop":0},
                "gender":{"F":0,"M":1},
                "caste":{"BC":0,"SC":1,"OC":2,"ST":3},
                "guardian":{"mother":0,"father":1,"other":2,"mixed":3}
               }

data.replace(cleanup_nums, inplace=True)

data.internet = data.internet.astype(int)
data.head()

Unnamed: 0,continue_drop,student_id,gender,caste,mathematics_marks,english_marks,science_marks,science_teacher,languages_teacher,guardian,internet,school_id,total_students,total_toilets,establishment_year
0,1,s01746,1,0,0.666,0.468,0.666,7,6,2,1,305,354,86.0,1986.0
1,1,s16986,1,0,0.172,0.42,0.172,8,10,0,0,331,516,15.0,1996.0
2,1,s00147,0,0,0.212,0.601,0.212,1,4,0,0,311,209,14.0,1976.0
3,1,s08104,0,3,0.434,0.611,0.434,2,5,1,1,364,147,28.0,1911.0
4,1,s11132,0,1,0.283,0.478,0.283,1,10,0,1,394,122,15.0,1889.0


In [37]:
# dropping unused columns
data.drop(columns=['student_id', 'school_id', 'establishment_year'], inplace=True)
data.head()

Unnamed: 0,continue_drop,gender,caste,mathematics_marks,english_marks,science_marks,science_teacher,languages_teacher,guardian,internet,total_students,total_toilets
0,1,1,0,0.666,0.468,0.666,7,6,2,1,354,86.0
1,1,1,0,0.172,0.42,0.172,8,10,0,0,516,15.0
2,1,0,0,0.212,0.601,0.212,1,4,0,0,209,14.0
3,1,0,3,0.434,0.611,0.434,2,5,1,1,147,28.0
4,1,0,1,0.283,0.478,0.283,1,10,0,1,122,15.0


In [38]:
data.groupby('continue_drop').size()

continue_drop
0      806
1    16384
dtype: int64
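
The counts above show how imbalanced this dataset is. The short calculation below uses only those two numbers; the variable names are introduced here for illustration.

```python
# Class counts copied from the groupby output above.
n_drop, n_continue = 806, 16384
total = n_drop + n_continue

minority_share = n_drop / total          # fraction of students who dropped
baseline_accuracy = n_continue / total   # accuracy of always predicting "continue"
imbalance_ratio = n_continue / n_drop    # majority rows per minority row

print(f"minority share:    {minority_share:.2%}")
print(f"baseline accuracy: {baseline_accuracy:.2%}")
print(f"imbalance ratio:   {imbalance_ratio:.1f} to 1")
```

Only about 4.7% of the rows belong to the "drop" class, so a model that always predicts "continue" would already reach roughly 95.3% accuracy. This is why accuracy alone says little here.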

In [39]:
# checking for null values in the dataset
data.isnull().sum()

continue_drop          0
gender                 0
caste                  0
mathematics_marks      0
english_marks          0
science_marks          0
science_teacher        0
languages_teacher      0
guardian               0
internet               0
total_students         0
total_toilets        312
dtype: int64

In [40]:
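# filling null values in the dataset with the column mean
# (assigning back avoids the chained inplace fillna, which newer pandas versions warn about)
data["total_toilets"] = data["total_toilets"].fillna(data["total_toilets"].mean())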
data.isnull().sum()


continue_drop        0
gender               0
caste                0
mathematics_marks    0
english_marks        0
science_marks        0
science_teacher      0
languages_teacher    0
guardian             0
internet             0
total_students       0
total_toilets        0
dtype: int64

In [41]:
feature = ['gender', 'caste', 'mathematics_marks', 'english_marks', 'science_marks',
           'science_teacher', 'languages_teacher', 'guardian', 'internet',
           'total_students', 'total_toilets']

X = data[feature]
y = data.continue_drop

In [42]:
# Train/Test split (stratify keeps the class ratio the same in both splits)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

In [43]:
# Random Forest classifier
from sklearn.ensemble import RandomForestClassifier
rForest = RandomForestClassifier(n_estimators=10)
rForest.fit(X_train, y_train)             

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [45]:
# Prediction and F1 score check
# Note: f1_score defaults to pos_label=1, which here is the majority class ("continue").
from sklearn.metrics import f1_score
y_pred2 = rForest.predict(X_test)
test_f1 = f1_score(y_test, y_pred2)
print("F1_score: {}".format(test_f1))

F1_score: 1.0
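
The call above uses `f1_score`'s default `pos_label=1`, and in this notebook label 1 is the majority class ("continue"). The toy sketch below, with made-up labels that are not taken from the notebook's predictions, shows how the F1 score can look strong for the majority class while being much weaker for the minority class.

```python
from sklearn.metrics import f1_score

# Made-up labels for illustration: 1 = "continue" (majority), 0 = "drop" (minority).
y_true = [1] * 18 + [0] * 2
# The hypothetical model misses one of the two "drop" cases.
y_pred = [1] * 19 + [0] * 1

f1_majority = f1_score(y_true, y_pred)               # default pos_label=1
f1_minority = f1_score(y_true, y_pred, pos_label=0)  # score for the "drop" class

print(f"F1 (continue): {f1_majority:.3f}")
print(f"F1 (drop):     {f1_minority:.3f}")
```

Reporting the score with `pos_label=0`, or using `classification_report`, makes the performance on the minority class visible.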
