### Question 3: Apply Logistic Regression to the Dataset 

We want to predict whether a person's income is high or low from their experience and age.

Create a logistic regression model that returns the weights for the features used to classify each input as either High Income (Income >= 70000) or Low Income (Income < 70000). Use only the first 500 rows for training.

# Result

- Create a function log_reg(path), where path is the location of the CSV file
- This function should load the file, train the model, and then test the model
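- The function should return a 3-tuple: (coefficient array, intercept, score of the model when run on the last 500 rows). Note that the score reported by LogisticRegression.score is mean classification accuracy rather than R^2
- E.g. ([12, 98], 10, 0.8)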
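To make the third element of the tuple concrete, here is a minimal sketch of the mean accuracy that a classifier's score reports: the fraction of test rows whose predicted label matches the true label. The helper name `mean_accuracy` and the label lists are hypothetical and chosen only for illustration; they do not come from the dataset.

```python
def mean_accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels: 1 = High Income, 0 = Low Income
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]  # one mistake out of five

print(mean_accuracy(y_true, y_pred))  # 0.8
```

A score of 0.8 therefore means that four out of every five test rows were classified correctly.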

In [1]:
# data management

import pandas as pd
import numpy as np

# ML prediction

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

In [2]:
def log_reg(path):
    df = pd.read_csv(path, index_col=0)
    # create income categories (1 = High Income / 0 = Low Income)
    df['High Income'] = np.where(df['Income'] >= 70000, 1, 0)
    features = ['Experience', 'Age']
    # first 500 rows for training, all remaining rows for testing
    train = df.iloc[:500]
    test = df.iloc[500:]
    # 10-fold cross-validated hyperparameter tuning: liblinear was selected
    # because the dataset is small, and it is tested with l1 and l2 penalties
    try_grid = [{'penalty': ['l1', 'l2']}]
    reg_init = GridSearchCV(
        LogisticRegression(solver='liblinear', max_iter=100000),
        param_grid=try_grid, cv=10,
    ).fit(train[features], train['High Income'])
    # refit on the whole training split with the selected penalty
    reg = LogisticRegression(
        penalty=reg_init.best_params_['penalty'], solver='liblinear', max_iter=100000,
    ).fit(train[features], train['High Income'])
    # score() returns the mean accuracy on the test split
    return (reg.coef_, reg.intercept_, reg.score(test[features], test['High Income']))

In [3]:
log_reg('HKUST_FinTech_Income_Dataset.csv')

(array([[ 5.05778021, -0.91952619]]), array([-2.70636228]), 0.992)
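As a rough illustration of how these weights are used, the sketch below applies the standard logistic model, P(High Income) = sigmoid(w1 * Experience + w2 * Age + b), with the coefficients and intercept copied from the output above. The function names and the input values are hypothetical, and the sketch assumes the inputs are expressed in the same units as the Experience and Age columns of the dataset, which are not stated here.

```python
import math

# Fitted values copied from the output above: (Experience, Age) weights and intercept.
COEF = (5.05778021, -0.91952619)
INTERCEPT = -2.70636228


def predict_proba_high(experience, age):
    """P(High Income) under the logistic model: sigmoid(w . x + b)."""
    z = COEF[0] * experience + COEF[1] * age + INTERCEPT
    # Numerically stable sigmoid: avoids overflow in exp() for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_label(experience, age):
    """1 = High Income when the probability exceeds 0.5 (equivalently z > 0)."""
    return int(predict_proba_high(experience, age) > 0.5)


# Hypothetical inputs, for illustration only
print(predict_label(10, 30))  # 1
print(predict_label(2, 40))   # 0
```

The positive weight on Experience and the negative weight on Age mean that, at a fixed age, more experience pushes the prediction toward High Income. The decision boundary is the line where w1 * Experience + w2 * Age + b = 0.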