# MATH2319 Machine Learning
## Semester 1, 2020
## Assignment 3

## Honour Code
I solemnly swear that I have not discussed my assignment solutions with anyone in any way and the solutions I am submitting are my own personal work.

Full Name: **Akshay Sunil Salunke** - *s3730440*

In [None]:
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)

In [None]:
df = pd.read_csv("A3_Q1_train.csv")

## Part A - Data preparation
### Task 1

We first check the shape of `df` and then print its first 5 rows. The target feature is `annual_income`, which takes two values: `low_income` and `high_income`. From here on, we treat `high_income` as the positive target class, i.e. `1`.

In [None]:
print(df.shape)
df.head()

We now extract all the categorical features into a new dataframe, `df_cat`.

In [None]:
df_cat = df.drop(columns=['age', 'education_years', 'annual_income', 'row_id'])

Next, we perform *equal-width binning* on the continuous features `age` and `education_years`.

In [None]:
labels = ['low', 'mid', 'high']
age_cat = pd.cut(df['age'], bins=3, labels=labels)
ed_cat = pd.cut(df['education_years'], bins=3, labels=labels)
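
As a rough illustration of what `pd.cut(..., bins=3)` does, the sketch below bins a small, made-up series of ages (the values are hypothetical and not taken from `A3_Q1_train.csv`). Passing `retbins=True` also returns the computed bin edges, which makes the equal widths visible.

```python
import pandas as pd

# Hypothetical ages, for illustration only (not from the assignment data)
ages = pd.Series([17, 25, 33, 41, 58, 90])

# bins=3 splits the observed range [min, max] into three equal-width intervals;
# retbins=True additionally returns the bin edges that pandas computed
age_binned, edges = pd.cut(
    ages, bins=3, labels=['low', 'mid', 'high'], retbins=True
)

print(edges)                # four edges delimit three bins
print(age_binned.tolist())  # label assigned to each age
```

Because the bins have equal width rather than equal counts, a skewed feature can leave most rows in a single bin, as happens with the toy values here.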

Then we add these binned features to the `df_cat` dataframe as `age_cat` and `education_years_cat`. Both are now categorical with the unique values `low`, `mid` and `high`.

In [None]:
df_cat['age_cat'], df_cat['education_years_cat'] = age_cat.astype(object), ed_cat.astype(object)

We now append the target feature `annual_income` as the last column of `df_cat`.

In [None]:
df_cat = df_cat.join(df['annual_income'])
df_all_cat = df_cat.copy()

### Wrap-up

In [None]:
# so that we can see all the columns
print(df_all_cat.shape)
df_all_cat.head()

In [None]:
# please run below in a separate cell!!!
for col in df_all_cat.columns.tolist():  
    print(col + ':')
    print(df_all_cat[col].value_counts())
    print('********')

### Task 2
In this section, we perform **One Hot Encoding (OHE)** on our dataset.

First, we check the datatypes for all our columns.

In [None]:
df_all_cat.dtypes

We now perform *OHE* on all columns except target.

In [None]:
df_all_cat_ohe = pd.get_dummies(df_all_cat.drop(columns=['annual_income']))
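
To make the effect of `get_dummies()` concrete, here is a minimal sketch on a toy dataframe. The column `gender` and all values are hypothetical and only stand in for the categorical columns of `df_all_cat`.

```python
import pandas as pd

# Toy categorical data; column names and values are illustrative assumptions
toy = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male'],
    'age_cat': ['low', 'high', 'mid'],
})

# Each categorical column is expanded into one indicator column per level,
# named "<column>_<level>"
toy_ohe = pd.get_dummies(toy)

print(toy_ohe.columns.tolist())
print(toy_ohe.shape)
```

Each row ends up with exactly one active indicator per original feature, so the number of columns grows with the total number of levels.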

Now we add the target column after **integer encoding** it. Instead of using a `level_map` with `replace()`, we call `get_dummies()` on the target and drop the `low_income` column. The remaining `high_income` column is stored as `annual_income`, which gives the same result as integer encoding with `high_income` mapped to `1`.

You can get the same result by passing `drop_first=True` to `get_dummies()` and then reversing the encoding: `drop_first` drops the alphabetically first level, `high_income`, and keeps `low_income`.

In [None]:
# astype(int) keeps the target as 0/1 even when get_dummies() returns booleans
df_all_cat_ohe['annual_income'] = pd.get_dummies(df['annual_income']).drop(columns='low_income').astype(int)
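
For comparison, the sketch below checks on a made-up target series that a `level_map` lookup and the `get_dummies()` route give the same 0/1 values. It uses `map()` for the lookup, and the series `income` is hypothetical.

```python
import pandas as pd

# Hypothetical target values, for illustration only
income = pd.Series(['low_income', 'high_income', 'high_income', 'low_income'])

# Route 1: explicit integer encoding with a level map
level_map = {'low_income': 0, 'high_income': 1}
by_map = income.map(level_map)

# Route 2: one-hot encode, keep only the high_income indicator, cast to int
by_dummies = pd.get_dummies(income)['high_income'].astype(int)

# True when both routes agree on every row
print((by_map == by_dummies).all())
```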

### Wrap-up

In [None]:
print(df_all_cat_ohe.shape)
df_all_cat_ohe.head()

## Part B - Bernoulli NB
Here we fit a *Bernoulli NB* model with default parameters on our data and score it on the same data. Scoring on the training data gives an optimistic accuracy estimate, but this is what the assignment asks for.

In [None]:
Data = df_all_cat_ohe.drop(columns=['annual_income']).values
target = df_all_cat_ohe['annual_income'].values

In [None]:
from sklearn.naive_bayes import BernoulliNB, GaussianNB
bnb = BernoulliNB()
bnb.fit(Data, target)
bnb.score(Data, target)

Above is the training accuracy of the *Bernoulli NB* model on our dataset.
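
The assignment only asks for accuracy on the training data. As a point of comparison, and purely as a sketch on synthetic binary data (not the assignment dataset), the snippet below contrasts training accuracy with 5-fold cross-validated accuracy. The names `X`, `y`, `train_acc` and `cv_acc` are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB

# Synthetic binary features; the target depends on the first two columns
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5))
y = X[:, 0] | X[:, 1]

clf = BernoulliNB()

# Accuracy on the same data used for fitting (what the assignment reports)
train_acc = clf.fit(X, y).score(X, y)

# Mean accuracy over 5 held-out folds
cv_acc = cross_val_score(clf, X, y, cv=5, scoring='accuracy').mean()

print(train_acc, cv_acc)
```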

## Part C - Gaussian NB
Now we fit a *Gaussian NB* model with default parameters on the dataset and then calculate its score.

In [None]:
gnb = GaussianNB()
gnb.fit(Data, target)
gnb.score(Data, target)

## Part D - Tuning the models
### Task 1 - Tuning

We write a function `best_params()` that accepts `Data`, `target` and `clf`, where `clf` is the classifier to refit with each candidate parameter value.

**Parameters**: For *Bernoulli NB*, `alpha` is varied, whereas for *Gaussian NB*, `var_smoothing` is varied.

The function returns a `results` dataframe listing every parameter value `p` tested and its mean accuracy `test_score`, measured on the same data used for fitting.

In [None]:
def best_params(Data, target, clf):
    """Refit clf for each candidate parameter value and record its accuracy."""
    if isinstance(clf, BernoulliNB):
        # alpha: additive smoothing parameter
        param_name = 'alpha'
        params = [0, 0.5, 1, 2, 3, 5]
    elif isinstance(clf, GaussianNB):
        # var_smoothing: 1, 0.1, ..., 1e-9
        param_name = 'var_smoothing'
        params = np.logspace(0, -9, num=10)
    else:
        raise ValueError("Classifier not supported.")

    scores = []
    for p in params:
        clf.set_params(**{param_name: p})
        clf.fit(Data, target)
        # accuracy on the same data used for fitting
        scores.append(clf.score(Data, target))

    return pd.DataFrame({'p': params, 'test_score': scores})
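
The `var_smoothing` grid in `best_params()` comes from `np.logspace(0, -9, num=10)`. The short sketch below prints that grid to show that it covers ten values spaced one decade apart, from `1` down to `1e-9`. The variable name `grid` is illustrative.

```python
import numpy as np

# Ten values evenly spaced on a log10 scale between 10**0 and 10**-9
grid = np.logspace(0, -9, num=10)

print(grid)

# Each value is one tenth of the previous one
print(grid[1:] / grid[:-1])
```

A log-spaced grid is a common choice here because `var_smoothing` scales the variance added for numerical stability, so its useful values span several orders of magnitude.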

We then call `best_params()` with the `bnb` (*Bernoulli NB*) classifier and display the results dataframe.

Here, `p = alpha`

In [None]:
bnb_result = best_params(Data, target, bnb)
bnb_result

Next, we call `best_params()` with the `gnb` (*Gaussian NB*) classifier and display the results dataframe.

Here, `p = var_smoothing`

In [None]:
gnb_result = best_params(Data, target, gnb)
gnb_result
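
One way to read the best setting off a results dataframe is sketched below. The scores are made up for illustration and only mimic the `p` / `test_score` layout returned by `best_params()`. The cast to `float` is a precaution in case `test_score` is stored with object dtype.

```python
import pandas as pd

# Hypothetical results, shaped like the output of best_params()
results = pd.DataFrame({
    'p': [0.5, 1, 2],
    'test_score': [0.81, 0.83, 0.82],
})

# Make sure the scores are numeric before looking for the maximum
results['test_score'] = results['test_score'].astype(float)

# Row with the highest accuracy
best = results.loc[results['test_score'].idxmax()]

print(best['p'], best['test_score'])
```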

### Task 2 - Plotting
In this section we plot the performance of both NB models against the parameter values tested.

The line plot below shows the performance of *Bernoulli NB* for different values of the `alpha` parameter.

In [None]:
import altair as alt
alt.Chart(bnb_result, 
          title='Bernoulli NB Performance Comparison'
         ).mark_line(point=True).encode(
    alt.X('p', title='alpha'),
    alt.Y('test_score', title='Mean accuracy', scale=alt.Scale(zero=False))
).interactive()

The line plot below shows the performance of *Gaussian NB* for different values of the `var_smoothing` parameter.

In [None]:
alt.Chart(gnb_result, 
          title='Gaussian NB Performance Comparison'
         ).mark_line(point=True).encode(
    alt.X('p', title='var_smoothing'),
    alt.Y('test_score', title='Mean accuracy', scale=alt.Scale(zero=False))
).interactive()

## Part E - Hybrid NB