# MATH2319 Machine Learning
## Semester 1, 2020
## Assignment 3

## Honour Code
I solemnly swear that I have not discussed my assignment solutions with anyone in any way and the solutions I am submitting are my own personal work.

Full Name: **Akshay Sunil Salunke** - *s3730440*

In [None]:
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)

In [None]:
df = pd.read_csv("A3_Q1_train.csv")

## Part A - Data preparation
### Task 1

We first check the shape of `df` and print its first 5 rows. The target feature is `annual_income`, which has 2 values: `low_income` & `high_income`. Hereafter we treat `high_income` as the positive target class, i.e. `1`.

In [None]:
print(df.shape)
df.head()

We now extract all the categorical features into a new dataframe, `df_cat`.

In [None]:
df_cat = df.drop(columns=['age', 'education_years', 'annual_income', 'row_id'])

Next, we perform *equal-width binning* on the continuous features `age` & `education_years`.

In [None]:
labels = ['low', 'mid', 'high']
age_cat = pd.cut(df['age'], bins=3, labels=labels)
ed_cat = pd.cut(df['education_years'], bins=3, labels=labels)
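
As a minimal sketch of what equal-width binning does, the example below applies `pd.cut` to a small made-up series (the values are illustrative and not taken from the assignment data). Passing `retbins=True` also returns the bin edges, which shows that the range is split into bins of the same width, with the lowest edge nudged down slightly so that the minimum value is included.

```python
import pandas as pd

# Hypothetical ages, for illustration only
ages = pd.Series([20, 25, 30, 45, 50, 80])

# Split the range [20, 80] into 3 equal-width bins and label them
binned, edges = pd.cut(ages, bins=3, labels=['low', 'mid', 'high'], retbins=True)

print(binned.tolist())  # bin label assigned to each age
print(edges)            # the 4 edges delimiting the 3 bins
```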

Then we add these binned features to the `df_cat` dataframe as `age_cat` and `education_years_cat`. Both are now categorical with the following unique values: `low, mid, high`.

In [None]:
df_cat['age_cat'], df_cat['education_years_cat'] = age_cat.astype(object), ed_cat.astype(object)

We now append the target feature `annual_income` as the last column of `df_cat`

In [None]:
df_cat = df_cat.join(df['annual_income'])
df_all_cat = df_cat.copy()

### Wrap-up

In [None]:
# so that we can see all the columns
print(df_all_cat.shape)
df_all_cat.head()

In [None]:
# please run below in a separate cell!!!
for col in df_all_cat.columns.tolist():  
    print(col + ':')
    print(df_all_cat[col].value_counts())
    print('********')

### Task 2
In this section, we perform **One Hot Encoding (OHE)** on our dataset.

First, we check the datatypes for all our columns.

In [None]:
df_all_cat.dtypes

We now perform *OHE* on all columns except target.

In [None]:
df_all_cat_ohe = pd.get_dummies(df_all_cat.drop(columns=['annual_income']))

Now we add the target column after performing **integer encoding** on it. Instead of using `level_map` with `replace()` to integer encode, we use `get_dummies()` and drop the column `low_income`. Then we rename the column `high_income` generated by `get_dummies()` to `annual_income`, so we get similar results as if we had done integer encoding.

You can get the same result by using `drop_first=True` in `get_dummies()` and then reversing the encoding.

In [None]:
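# drop `low_income`, keep the `high_income` indicator and store it as 0/1 integers
df_all_cat_ohe['annual_income'] = pd.get_dummies(df['annual_income']).drop(columns='low_income')['high_income'].astype(int)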
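
For illustration, the sketch below runs the same two steps on a tiny made-up frame (the `gender` column and all values are hypothetical). It shows how `get_dummies()` expands a categorical column into one indicator column per level, and how keeping only the `high_income` indicator yields a 0/1 target.

```python
import pandas as pd

# Toy data, for illustration only
toy = pd.DataFrame({
    'gender': ['Male', 'Female', 'Female'],
    'annual_income': ['high_income', 'low_income', 'high_income'],
})

# One indicator column per level of each descriptive feature
toy_ohe = pd.get_dummies(toy.drop(columns=['annual_income']))

# Integer-encode the target: 1 for high_income, 0 for low_income
toy_ohe['annual_income'] = pd.get_dummies(toy['annual_income'])['high_income'].astype(int)

print(toy_ohe.columns.tolist())
print(toy_ohe['annual_income'].tolist())
```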

### Wrap-up

In [None]:
print(df_all_cat_ohe.shape)
df_all_cat_ohe.head()

## Part B - Bernoulli NB
Here we fit a *Bernoulli NB* model with default parameters on our data and score it on the same data. Scoring on the training data gives an optimistic estimate of performance, but this is what the assignment asks for.

In [None]:
Data = df_all_cat_ohe.drop(columns=['annual_income']).values
target = df_all_cat_ohe['annual_income'].values

In [None]:
from sklearn.naive_bayes import BernoulliNB, GaussianNB
bnb = BernoulliNB()
bnb.fit(Data, target)
bs = bnb.score(Data, target)
print(bs)

In [None]:
df.loc[len(df)-1]

The value printed by `print(bs)` above is the accuracy of the *Bernoulli NB* model on our dataset.
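
Because scoring on the training data tends to be optimistic, a cross-validated estimate is a useful complement. The sketch below is not part of the assignment: it uses randomly generated binary features (an assumption standing in for the one-hot encoded data) and scikit-learn's `cross_val_score` to contrast the two numbers.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)

# Synthetic stand-in for one-hot encoded features and a binary target
X = rng.integers(0, 2, size=(200, 10))
y = (X[:, 0] | X[:, 1]).astype(int)  # target depends on the first two features

clf = BernoulliNB()
train_acc = clf.fit(X, y).score(X, y)             # accuracy on the data used for fitting
cv_acc = cross_val_score(clf, X, y, cv=5).mean()  # mean accuracy over 5 held-out folds

print(f"training accuracy:  {train_acc:.3f}")
print(f"5-fold CV accuracy: {cv_acc:.3f}")
```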

## Part C - Gaussian NB
Now we fit a *Gaussian NB* model with default parameters on the dataset, and then calculate its score.

In [None]:
gnb = GaussianNB()
gnb.fit(Data, target)
gs = gnb.score(Data, target)
print(gs)

## Part D - Tuning the models
### Task 1 - Tuning

We write a function `test_params()` which accepts `Data`, `target` and `clf`, where `clf` is the classifier that is refitted with different parameter values.

**Parameters**: For *Bernoulli NB*, `alpha` is varied, whereas for *Gaussian NB*, `var_smoothing` is varied.

This function returns a `results` dataframe with every parameter value `p` tested and its accuracy `test_score`, measured on the same data the model was fitted on.

In [None]:
from sklearn import metrics

def test_params(Data, target, clf):
    """Refit `clf` once per candidate parameter value and record its accuracy."""
    if isinstance(clf, BernoulliNB):
        # param = alpha (alpha=0 means no smoothing; depending on the version,
        # scikit-learn may warn about it or clip it to a tiny positive value)
        name, params = 'alpha', [0, 0.5, 1, 2, 3, 5]
    elif isinstance(clf, GaussianNB):
        # param = var_smoothing, from 1 down to 1e-9
        name, params = 'var_smoothing', np.logspace(0, -9, num=10)
    else:
        raise ValueError("Classifier not supported.")

    scores = []
    for p in params:
        clf.set_params(**{name: p})
        clf.fit(Data, target)
        predict = clf.predict(Data)
        scores.append(metrics.accuracy_score(target, predict))

    # one row per tested value, with a numeric score column
    return pd.DataFrame({'p': params, 'test_score': scores})
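
To make the effect of `var_smoothing` more concrete: scikit-learn's `GaussianNB` adds a fraction of the largest feature variance to every per-class variance, and stores the added amount in the fitted attribute `epsilon_`. The sketch below checks this relationship on made-up data (the array values are illustrative only).

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy data, for illustration only: two features, two classes
X = np.array([[1.0, 10.0], [2.0, 30.0], [3.0, 20.0], [4.0, 60.0]])
y = np.array([0, 0, 1, 1])

for vs in (1e-9, 1e-2, 1.0):
    model = GaussianNB(var_smoothing=vs).fit(X, y)
    # epsilon_ is the absolute amount added to each variance
    expected = vs * X.var(axis=0).max()
    print(vs, model.epsilon_, np.isclose(model.epsilon_, expected))
```

Larger values of `var_smoothing` therefore widen every per-class Gaussian, which smooths the decision boundary.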

We then call our `test_params()` function with the `bnb` (*Bernoulli NB*) classifier and print the results dataframe.

Here, `p = alpha`

In [None]:
bnb_result = test_params(Data, target, bnb)
bts = bnb_result['test_score'].max()
bnb_result

Next, we call our `test_params()` function with the `gnb` (*Gaussian NB*) classifier and print the results dataframe.

Here, `p = var_smoothing`

In [None]:
gnb_result = test_params(Data, target, gnb)
gts = gnb_result['test_score'].max()
gnb_result
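
An alternative to the manual loop, shown here only as a sketch, is scikit-learn's `GridSearchCV`, which runs the same kind of sweep but scores each value with cross-validation instead of training accuracy. The data below is synthetic (generated with `make_classification`), so the numbers it prints say nothing about the assignment dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

# Synthetic data standing in for the real features
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Same var_smoothing grid as in test_params(): 1, 0.1, ..., 1e-9
grid = {'var_smoothing': np.logspace(0, -9, num=10)}

search = GridSearchCV(GaussianNB(), grid, cv=5, scoring='accuracy')
search.fit(X, y)

print(search.best_params_)           # value with the best mean CV accuracy
print(round(search.best_score_, 3))  # that mean CV accuracy
```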

### Task 2 - Plotting
In this section we plot the performance of both NB models with respect to different parameter values.

The line plot below shows the performance of *Bernoulli NB* for different values of the `alpha` parameter.

In [None]:
import altair as alt
alt.Chart(bnb_result, 
          title='Bernoulli NB Performance Comparison'
         ).mark_line(point=True).encode(
    alt.X('p', title='alpha'),
    alt.Y('test_score', title='Mean accuracy', scale=alt.Scale(zero=False))
).interactive()

The line plot below shows the performance of *Gaussian NB* for different values of the `var_smoothing` parameter.

In [None]:
alt.Chart(gnb_result, 
          title='Gaussian NB Performance Comparison'
         ).mark_line(point=True).encode(
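    # var_smoothing spans 1e-9 to 1, so a log scale keeps the points evenly spaced
    alt.X('p', title='var_smoothing', scale=alt.Scale(type='log')),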
    alt.Y('test_score', title='Mean accuracy', scale=alt.Scale(zero=False))
).interactive()

## Part E - Hybrid NB
In this section we try to create an ensemble of *Bernoulli NB* and *Gaussian NB*. This is because *Bernoulli NB* performs better on categorical features, whereas *Gaussian NB* performs better on continuous features that follow a *Gaussian probability distribution*.

We drop the ID-like column `row_id`.

In [None]:
df2 = pd.read_csv("A3_Q1_train.csv").drop(columns='row_id')
df2.head()

We first separate the categorical features into the `Data` variable and then perform *One Hot Encoding* on them. Then we separate the target feature into the `target` variable and integer-encode the positive target class `high_income` as `1` and `low_income` as `0`.

In [None]:
Data = df2.select_dtypes(object).drop(columns='annual_income')
Data = pd.get_dummies(Data)
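# map() returns an integer Series directly and avoids the downcasting warning that replace() can raise in recent pandas
target = df2['annual_income'].map({'high_income': 1, 'low_income': 0})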

We then fit a *Bernoulli NB* classifier on categorical features.

In [None]:
bnb_clf = BernoulliNB()
bnb_clf.fit(Data, target)

Next, we fit a *Gaussian NB* classifier on the continuous features `age` & `education_years`.

In [None]:
Data2 = df2.select_dtypes(int)

from sklearn.preprocessing import PowerTransformer
Data2 = PowerTransformer().fit_transform(Data2)
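
`PowerTransformer` (Yeo-Johnson by default) tries to make each feature more Gaussian-like, which matches the assumption behind *Gaussian NB*. As a rough illustration on synthetic right-skewed data (not the assignment data), the skewness should move towards zero after the transform:

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)

# Synthetic, strongly right-skewed feature (log-normal), for illustration only
x = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))

# Default method is 'yeo-johnson'; the output is also standardised
x_t = PowerTransformer().fit_transform(x)

print("skewness before:", skew(x[:, 0]))
print("skewness after: ", skew(x_t[:, 0]))
```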

In [None]:
gnb_clf = GaussianNB()
gnb_clf.fit(Data2, target)

Then we predict the target on the same data using the above 2 classifiers, and store the target class probabilities in corresponding variables.

In [None]:
bnb_prob = bnb_clf.predict_proba(Data)
gnb_prob = gnb_clf.predict_proba(Data2)

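Then we multiply the class probabilities given by *Bernoulli NB* and *Gaussian NB* element-wise and store the result in the variable `total_prob`.

We multiply the probabilities because of the **probability product rule**, which states that *the probability of two (or more) independent events occurring together can be calculated by multiplying the individual probabilities of the events*. Here the rule applies under the naive Bayes assumption that the features are conditionally independent given the class. Note that the plain product of the two classifiers' posteriors is not normalised, so the two class values in each row no longer sum to 1.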
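
Written out, and assuming the categorical features $x_c$ and the continuous features $x_n$ are conditionally independent given the class $y$ (these symbols are introduced here for illustration only), Bayes' rule gives:

$$
P(y \mid x_c, x_n) \;\propto\; P(y)\,P(x_c \mid y)\,P(x_n \mid y) \;\propto\; \frac{P(y \mid x_c)\,P(y \mid x_n)}{P(y)}
$$

So the product of the two posteriors matches the combined posterior up to a division by the class prior $P(y)$ and a normalising constant that makes the class probabilities sum to 1.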

In [None]:
# Element-wise product of the two classifiers' class probabilities.
# This is a 2d array where each row is in the format:
# [prob. of low_income, prob. of high_income]
total_prob = bnb_prob * gnb_prob

# Predict 1 if the product for high_income is >= 0.5, 0 otherwise.
# Note: the rows of total_prob are not renormalised, so this threshold
# is applied to the raw product rather than to a probability that sums to 1.
prediction = (total_prob[:, 1] >= 0.5).astype(int)

Now we can calculate `accuracy_score` by comparing our predictions from *Hybrid NB* classifier vs actual target class.

In [None]:
hs = metrics.accuracy_score(target, prediction)
print(hs)
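
One way to turn the product into proper class probabilities, sketched here under the assumption that both classifiers were fitted on the same rows and therefore share the same class prior, is to divide by that prior once and renormalise each row. The helper name `combine_nb_posteriors` and the toy arrays are hypothetical and only illustrate the arithmetic; this is not the method used for the score printed above.

```python
import numpy as np

def combine_nb_posteriors(prob_a, prob_b, prior):
    """Combine two naive Bayes posteriors that share the same class prior."""
    joint = prob_a * prob_b / prior                  # remove the double-counted prior
    return joint / joint.sum(axis=1, keepdims=True)  # make each row sum to 1 again

# Toy posteriors in the format [prob. of low_income, prob. of high_income]
prob_a = np.array([[0.3, 0.7], [0.8, 0.2]])
prob_b = np.array([[0.3, 0.7], [0.6, 0.4]])
prior = np.array([0.5, 0.5])  # hypothetical class prior

combined = combine_nb_posteriors(prob_a, prob_b, prior)
toy_prediction = combined.argmax(axis=1)  # pick the more probable class

print(combined)
print(toy_prediction)
```

In the first toy row the raw product for `high_income` is 0.49, which falls below a 0.5 threshold, yet after normalisation it is clearly the more probable class. With the fitted models, the prior could be taken from `gnb_clf.class_prior_` or `np.exp(bnb_clf.class_log_prior_)`.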

## Part 4 - Wrapping-up
Below we summarize the results of various classifiers we have trained on our dataset in `df_summary` dataframe.

In [None]:
df_summary = pd.DataFrame(columns=['method', 'accuracy'])
df_summary['method'] = ['Bernoulli NB (default)', 'Gaussian NB (default)', 'Bernoulli NB (tuned)', 'Gaussian NB (tuned)', 'Hybrid NB']
df_summary['accuracy'] = [bs, gs, bts, gts, hs]

In [None]:
df_summary

#### (i) Whether hyper-parameter tuning improves the performance of the Bernoulli and Gaussian NB models respectively. 

We tried tuning *Bernoulli NB* by varying the `alpha` parameter, but there was no performance increase.
*Gaussian NB*, however, shows an improvement in performance (~10%) after tuning its `var_smoothing` parameter.

#### (ii) Whether your Hybrid NB model has more predictive power than the (untuned) Bernoulli and Gaussian NB models respectively.

Based on `Fig. 1` & `Fig. 2`, we could say that, out of the 2 continuous features `age` & `education_years`, `age` has a curve similar to a *bell curve*, which signifies a *normal distribution (Gaussian probability distribution)*, whereas `education_years` has data concentrated towards the right, which suggests the data is not *normally distributed*.

As we know, *Gaussian NB* performs well on *normally distributed* data, but only one out of the two continuous features is *normally distributed*. Hence *Bernoulli NB* performs better (0.83) than *Hybrid NB* (0.77).

But *Hybrid NB* performs better than *Gaussian NB* (0.72), since most (3 out of 5) features in our dataset are categorical, and that's where *Bernoulli NB* shines, making *Hybrid NB* (0.77) the better performer of the two.

In [None]:
import matplotlib.pyplot as plt
df['age'].plot(kind='hist', bins=5, title="Fig.1 - age frequency")

In [None]:
df['education_years'].plot(kind='hist', bins=10, title="Fig.2 - education_years frequency")