# Assignment 6  
## Applied Machine Learning

Andrew Chan 
EBE869

This assignment uses the `Suicide Rates Overview 1985 to 2016` dataset from Kaggle: https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016. The notebook assumes that you have downloaded the dataset as `master.csv` into the same directory as this notebook.

## Preprocessing

In [74]:
import pandas as pd
import math
import numpy as np

# Locate and load the data file
df = pd.read_csv('master.csv')

# Sanity check
print(f'N rows={len(df)}, M columns={len(df.columns)}')
df.head()

N rows=27820, M columns=12


Unnamed: 0,country,year,sex,age,suicides_no,population,suicides/100k pop,country-year,HDI for year,gdp_for_year ($),gdp_per_capita ($),generation
0,Albania,1987,male,15-24 years,21,312900,6.71,Albania1987,,2156624900,796,Generation X
1,Albania,1987,male,35-54 years,16,308000,5.19,Albania1987,,2156624900,796,Silent
2,Albania,1987,female,15-24 years,14,289700,4.83,Albania1987,,2156624900,796,Generation X
3,Albania,1987,male,75+ years,1,21800,4.59,Albania1987,,2156624900,796,G.I. Generation
4,Albania,1987,male,25-34 years,9,274300,3.28,Albania1987,,2156624900,796,Boomers


### Adjust the gdp_for_year ($) from string to float.

In [75]:
df[' gdp_for_year ($) ']

0         2,156,624,900
1         2,156,624,900
2         2,156,624,900
3         2,156,624,900
4         2,156,624,900
              ...      
27815    63,067,077,179
27816    63,067,077,179
27817    63,067,077,179
27818    63,067,077,179
27819    63,067,077,179
Name:  gdp_for_year ($) , Length: 27820, dtype: object

In [76]:
df[' gdp_for_year ($) '] = df[' gdp_for_year ($) '].str.replace(',', '')
df[' gdp_for_year ($) '] = df[' gdp_for_year ($) '].astype(float) 
df[' gdp_for_year ($) ']

0        2.156625e+09
1        2.156625e+09
2        2.156625e+09
3        2.156625e+09
4        2.156625e+09
             ...     
27815    6.306708e+10
27816    6.306708e+10
27817    6.306708e+10
27818    6.306708e+10
27819    6.306708e+10
Name:  gdp_for_year ($) , Length: 27820, dtype: float64
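The conversion above can also be sketched on a small standalone example. The two values are copied from the output above; `gdp_raw` and `gdp_clean` are hypothetical names used only for this illustration. Passing `regex=False` makes the comma a literal character, and `pd.to_numeric` raises an error if any value still fails to parse.

```python
import pandas as pd

# Two toy values copied from the column output above (the real column has 27,820 rows).
gdp_raw = pd.Series(["2,156,624,900", "63,067,077,179"], name=" gdp_for_year ($) ")

# Strip the thousands separators, parse as numbers, then store as float
# to match the dtype used in the notebook.
gdp_clean = pd.to_numeric(
    gdp_raw.str.replace(",", "", regex=False), errors="raise"
).astype(float)

print(gdp_clean.dtype)
print(gdp_clean.tolist())
```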

In [77]:
df.head()

Unnamed: 0,country,year,sex,age,suicides_no,population,suicides/100k pop,country-year,HDI for year,gdp_for_year ($),gdp_per_capita ($),generation
0,Albania,1987,male,15-24 years,21,312900,6.71,Albania1987,,2156625000.0,796,Generation X
1,Albania,1987,male,35-54 years,16,308000,5.19,Albania1987,,2156625000.0,796,Silent
2,Albania,1987,female,15-24 years,14,289700,4.83,Albania1987,,2156625000.0,796,Generation X
3,Albania,1987,male,75+ years,1,21800,4.59,Albania1987,,2156625000.0,796,G.I. Generation
4,Albania,1987,male,25-34 years,9,274300,3.28,Albania1987,,2156625000.0,796,Boomers


### Removal of redundant features

#### Remove `suicides_no`

`suicides_no` is highly correlated with `suicides/100k pop`. Since `suicides/100k pop` was chosen as the dependent variable, it does not make sense to use `suicides_no` as a feature: `suicides/100k pop` is a derived attribute, computed as $$\frac{suicides\_no}{population/100000}$$

Another reason why I chose to keep `suicides/100k pop` is because it helps **normalize suicides by population**. A large country may have many suicides because there are more people that live there, but a smaller country may have more suicides per population. 
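As a quick check of the formula above, the rate can be recomputed for the five rows shown in the `df.head()` output. This is only a sketch on those five rows; `sample` and `recomputed` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

# First five rows of the dataset, copied from the df.head() output above.
sample = pd.DataFrame({
    "suicides_no": [21, 16, 14, 1, 9],
    "population": [312900, 308000, 289700, 21800, 274300],
    "suicides/100k pop": [6.71, 5.19, 4.83, 4.59, 3.28],
})

# Recompute the rate from the raw count and the population, rounded to 2 decimals.
recomputed = (sample["suicides_no"] / (sample["population"] / 100_000)).round(2)

print(recomputed.tolist())
print(np.allclose(recomputed, sample["suicides/100k pop"]))
```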

In [78]:
df = df.drop(['suicides_no'], axis=1)

#### Remove `gdp_for_year ($)`

`gdp_for_year ($)` is highly correlated with `gdp_per_capita ($)`. Since I chose to keep `gdp_per_capita ($)` as the GDP feature at this step, it does not make sense to also use `gdp_for_year ($)`: `gdp_per_capita ($)` is a derived attribute, computed as $$\frac{\text{gdp\_for\_year (\$)}}{\text{population}}$$

Another reason why I chose to keep `gdp_per_capita ($)` is that it **normalizes GDP by population**. A large country may have a larger gross domestic product because more people live there, but a smaller country may have a higher GDP per capita.

In [79]:
df = df.drop([' gdp_for_year ($) '], axis=1)

#### Remove `country-year`

Since `country-year` is derived from `country` and `year`, including it would be redundant. Thus, we drop `country-year`.

I also chose to keep `country` and `year` because upon one-hot encoding, there will be **N countries** + **M years** columns versus **N x M** additional columns, which should help with computational speed. 
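The column-count argument can be illustrated with a toy frame of 3 countries and 2 years. The data below is made up for the illustration and is not taken from `master.csv`; `pd.get_dummies` stands in for the one-hot encoding step.

```python
import pandas as pd

# Toy data (an assumption for illustration): every country appears in every year.
toy = pd.DataFrame({
    "country": ["Albania", "Albania", "Brazil", "Brazil", "Chile", "Chile"],
    "year": [1987, 1988, 1987, 1988, 1987, 1988],
})
toy["country-year"] = toy["country"] + toy["year"].astype(str)

# One-hot encode country and year separately: N + M columns.
separate = pd.get_dummies(toy[["country", "year"]], columns=["country", "year"])

# One-hot encode the combined key: up to N x M columns.
combined = pd.get_dummies(toy[["country-year"]], columns=["country-year"])

print("separate:", separate.shape[1])
print("combined:", combined.shape[1])
```

The gap grows quickly: with N and M both in the tens or hundreds, N x M is far larger than N + M.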

In [80]:
df = df.drop(['country-year'], axis=1)

#### Remove `country`, `year`, `population`, `HDI for year`, `gdp_per_capita ($)` since the regression models below use only `sex`, `age`, and `generation`

In [81]:
# Drop all five columns in a single call
df = df.drop(['country', 'year', 'population', 'HDI for year', 'gdp_per_capita ($)'], axis=1)

#### Remove target variable

In [82]:
df_no_target = df.drop(['suicides/100k pop'], axis=1)

In [83]:
df_no_target

Unnamed: 0,sex,age,generation
0,male,15-24 years,Generation X
1,male,35-54 years,Silent
2,female,15-24 years,Generation X
3,male,75+ years,G.I. Generation
4,male,25-34 years,Boomers
...,...,...,...
27815,female,35-54 years,Generation X
27816,female,75+ years,Silent
27817,male,5-14 years,Generation Z
27818,female,5-14 years,Generation Z


### Most Frequent imputation

Any NaN value is replaced with the most frequent value in its column.

In [84]:
from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy="most_frequent") 
df_no_target[:] = imp.fit_transform(df_no_target)
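To make the effect of `most_frequent` imputation visible, here is a minimal sketch on a toy frame that does contain missing values. The toy data and the names `toy`, `imp_toy`, and `filled` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy categorical data with one missing value per column.
toy = pd.DataFrame({
    "sex": ["male", "female", np.nan, "male"],
    "age": ["15-24 years", np.nan, "15-24 years", "75+ years"],
}, dtype=object)

imp_toy = SimpleImputer(strategy="most_frequent")
filled = pd.DataFrame(imp_toy.fit_transform(toy), columns=toy.columns)

# statistics_ holds the fill value learned for each column.
print(imp_toy.statistics_)
print(filled)
```

The fitted imputer remembers the per-column fill values, which is why the same `imp` object can later be applied to a new input row with `transform`.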

In [85]:
df_no_target

Unnamed: 0,sex,age,generation
0,male,15-24 years,Generation X
1,male,35-54 years,Silent
2,female,15-24 years,Generation X
3,male,75+ years,G.I. Generation
4,male,25-34 years,Boomers
...,...,...,...
27815,female,35-54 years,Generation X
27816,female,75+ years,Silent
27817,male,5-14 years,Generation Z
27818,female,5-14 years,Generation Z


In [86]:
final_features = df_no_target.columns

In [87]:
final_features

Index(['sex', 'age', 'generation'], dtype='object')

### One hot encoding of all nominal features

In [88]:
from sklearn.preprocessing import OneHotEncoder
X = df_no_target[['sex','age','generation']].values


In [89]:
X.shape

(27820, 3)

In [90]:
X

array([['male', '15-24 years', 'Generation X'],
       ['male', '35-54 years', 'Silent'],
       ['female', '15-24 years', 'Generation X'],
       ...,
       ['male', '5-14 years', 'Generation Z'],
       ['female', '5-14 years', 'Generation Z'],
       ['female', '55-74 years', 'Boomers']], dtype=object)

In [93]:
from sklearn.compose import ColumnTransformer
c_transf = ColumnTransformer([ 
    ('onehot', OneHotEncoder(), [0,1,2])
])
X = c_transf.fit_transform(X).astype(float)

In [94]:
X.shape

(27820, 14)

In [96]:
X[0].todense()

matrix([[0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.]])

# 1. [20 pts] Keep the variables as one-hot encoded and develop a multiple linear regression model. 
Use your model to predict the target variable for the people with age 20, male, and generation X. What is the MAE error of this prediction? How many line coefficients are there?

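We will fit a multiple linear regression model (`LinearRegression`) on the one-hot encoded features to predict `suicides/100k pop`, which is a continuous target. Logistic regression would not be appropriate here, since it models class probabilities in [0, 1] rather than a continuous rate.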

Set the x and y values:

In [97]:
df_y = df['suicides/100k pop']
y = df_y.values
from sklearn.model_selection import train_test_split
from sklearn import metrics

In [98]:
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(X, y)
print('score',reg.score(X, y))
print('coef',reg.coef_)
print('intercept',reg.intercept_)

score 0.29561613779037377
coef [ -7.42323113   7.42323113  -2.88739403  -0.224249     1.93315305
 -10.96728546   2.62825292   9.51752253   0.24596088   3.06094323
  -0.56435247  -1.34004708  -1.57562376   0.1731192 ]
intercept 12.949631859521169


## Prediction: 
Transform feature vector for age 20, male, and generation X

In [99]:
input = { 
    'age':['15-24 years'],
    'generation':['Generation X'],
    'sex':['male'] 
}

df_input = pd.DataFrame(input,columns = [  
                                         'sex', 
                                         'age', 
                                       'generation'])

In [100]:
df_input

Unnamed: 0,sex,age,generation
0,male,15-24 years,Generation X


#### Imputation

In [101]:
imp.transform(df_input)

array([['male', '15-24 years', 'Generation X']], dtype=object)

In [102]:
df_input[:] = imp.transform(df_input)

In [103]:
df_input

Unnamed: 0,sex,age,generation
0,male,15-24 years,Generation X


#### One-hot

In [104]:
X_input = df_input[['sex','age','generation']].values
X_input = c_transf.transform(X_input).astype(float)

## Answer: Use your model to predict the target variable for the people with age 20, male, and generation X.

In [105]:
reg.predict(X_input)

array([16.92111649])

## Answer: What is the MAE error of this prediction? 

$$MAE = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{y}_i-y_i\right|$$

In [106]:
def mae(_y, _y_pred):
    return (len(_y)**-1) * np.sum(np.abs(_y_pred-_y))
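As a sanity check, the hand-written `mae` can be compared with scikit-learn's `mean_absolute_error` on a small toy example. The arrays `y_toy` and `y_pred_toy` are made up for the illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

def mae(_y, _y_pred):
    # Mean of the absolute differences between predictions and targets.
    return (len(_y) ** -1) * np.sum(np.abs(_y_pred - _y))

# Toy targets and predictions: absolute errors are 1, 0 and 2.
y_toy = np.array([1.0, 2.0, 3.0])
y_pred_toy = np.array([2.0, 2.0, 5.0])

ours = mae(y_toy, y_pred_toy)
reference = mean_absolute_error(y_toy, y_pred_toy)
print(ours, reference)
```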

In [107]:
y_pred = reg.predict(X)
y_pred.shape

(27820,)

In [108]:
y.shape

(27820,)

In [109]:
mae(y,y_pred)

10.180346495478293

## Answer: How many line coefficients are there?

In [110]:
print('coef count:',reg.coef_.size)

coef count: 14


# #2 [30 pts] Now use the original sex, age and generation variables in numerical form and develop a new model. 
Use your model to predict the target value for the people with age
20, male, and generation X. What is the MAE error of this prediction? How many line coefficients are there? (Note that for this step you have to think of a way of encoding the original nominal age feature and generation feature into numerical features.)

In [111]:
df_num = df_no_target.copy()

## Map sex to numerical values
Map sex from strings to the numerical values {0, 1}

In [112]:
df_num['sex'].unique()

array(['male', 'female'], dtype=object)

In [113]:
sex_mapping = { 'male':0,
               'female':1, 
              }

In [114]:
df_num['sex'] = df_num['sex'].map(sex_mapping)

In [115]:
df_num.head()

Unnamed: 0,sex,age,generation
0,0,15-24 years,Generation X
1,0,35-54 years,Silent
2,1,15-24 years,Generation X
3,0,75+ years,G.I. Generation
4,0,25-34 years,Boomers


## Map age to numerical values
Ordered age from strings to [0,...,5]

In [116]:
df_num['age'].unique()

array(['15-24 years', '35-54 years', '75+ years', '25-34 years',
       '55-74 years', '5-14 years'], dtype=object)

In [117]:
age_mapping = { '5-14 years':0,
               '15-24 years':1, 
               '25-34 years':2,
               '35-54 years':3,
               '55-74 years':4,  
               '75+ years':5
              }

In [118]:
df_num['age'] = df_num['age'].map(age_mapping)
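An alternative to a hand-written dictionary is scikit-learn's `OrdinalEncoder` with an explicit category order. This is a sketch of that alternative on a toy column, not what the notebook uses; `age_order`, `toy`, and `codes` are names introduced here.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Explicit order from youngest to oldest, matching age_mapping above.
age_order = ["5-14 years", "15-24 years", "25-34 years",
             "35-54 years", "55-74 years", "75+ years"]

# Toy column with a few of the age groups.
toy = pd.DataFrame({"age": ["15-24 years", "35-54 years", "75+ years", "5-14 years"]})

enc = OrdinalEncoder(categories=[age_order])
codes = enc.fit_transform(toy[["age"]]).ravel()
print(codes)
```

Passing the order explicitly matters: the default alphabetical order would not place `5-14 years` first.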

In [119]:
df_num.head()

Unnamed: 0,sex,age,generation
0,0,1,Generation X
1,0,3,Silent
2,1,1,Generation X
3,0,5,G.I. Generation
4,0,2,Boomers


## Map generation to numerical values
Ordered generation from strings to [0,...,5], from youngest (Generation Z) to oldest (G.I. Generation)

In [120]:
df_num['generation'].unique()

array(['Generation X', 'Silent', 'G.I. Generation', 'Boomers',
       'Millenials', 'Generation Z'], dtype=object)

In [121]:
generation_mapping = { 'Generation Z':0,
                'Millenials':1, 
                'Generation X':2, 
                'Boomers':3,
                'Silent':4, 
                'G.I. Generation':5
              }

In [122]:
df_num['generation'] = df_num['generation'].map(generation_mapping)

In [123]:
df_num.head()

Unnamed: 0,sex,age,generation
0,0,1,2
1,0,3,4
2,1,1,2
3,0,5,5
4,0,2,3


In [124]:
df_num.columns

Index(['sex', 'age', 'generation'], dtype='object')

#### Standardization

In [125]:
from sklearn.preprocessing import StandardScaler
scaler_num = StandardScaler()
df_num[['sex', 
        'age', 
        'generation']] = scaler_num.fit_transform(df_num[[
                                                          'sex', 
                                                          'age', 
                                                          'generation']])
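`StandardScaler` subtracts each column's mean and divides by its population standard deviation. The sketch below checks this by hand on a toy, perfectly balanced 0/1 column, an assumption made for the illustration. In that case the two codes map to exactly -1 and 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy binary column with as many 0s as 1s.
sex_codes = np.array([[0.0], [1.0], [0.0], [1.0]])

scaler = StandardScaler()
z = scaler.fit_transform(sex_codes)

# Manual z-score: (x - mean) / std, using the population std (ddof=0).
manual = (sex_codes - sex_codes.mean(axis=0)) / sex_codes.std(axis=0)

print(z.ravel())
print(np.allclose(z, manual))
```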

In [126]:
df_num.head()

Unnamed: 0,sex,age,generation
0,-1.0,-0.880574,-0.433847
1,-1.0,0.291278,0.972378
2,1.0,-0.880574,-0.433847
3,-1.0,1.46313,1.675491
4,-1.0,-0.294648,0.269265


#### Fit a new model

In [127]:
X_num = df_num.values

In [128]:
X_num.shape

(27820, 3)

In [129]:
y_num = df_y.values

In [130]:
y_num.shape

(27820,)

In [131]:
reg_num = LinearRegression()
reg_num.fit(X_num, y_num)
print('score',reg_num.score(X_num, y_num))
print('coef',reg_num.coef_)
print('intercept',reg_num.intercept_)

score 0.28490936364899955
coef [-7.42323113  6.32876361  0.5998983 ]
intercept 12.816097411933864


## ANSWER: Use your model to predict the target value for the people with age 20, male, and generation X. 

In [132]:
input_num = { 
    'age':['15-24 years'],
    'generation':['Generation X'],
    'sex':['male'] 
}

df_input_num = pd.DataFrame(input_num,columns = ['sex', 
                                             'age', 
                                             'generation'])

In [133]:
df_input_num

Unnamed: 0,sex,age,generation
0,male,15-24 years,Generation X


#### Imputation

In [134]:
df_input_num[:] = imp.transform(df_input_num)

In [135]:
df_input_num

Unnamed: 0,sex,age,generation
0,male,15-24 years,Generation X


#### Mappings

In [136]:
df_input_num['sex'] = df_input_num['sex'].map(sex_mapping)
df_input_num['age'] = df_input_num['age'].map(age_mapping)
df_input_num['generation'] = df_input_num['generation'].map(generation_mapping)

In [137]:
df_input_num

Unnamed: 0,sex,age,generation
0,0,1,2


#### Standardization

In [138]:
df_input_num[[
        'sex', 
        'age', 
        'generation']] = scaler_num.transform(df_input_num[[
                                                          'sex', 
                                                          'age', 
                                                        'generation']])

In [139]:
df_input_num

Unnamed: 0,sex,age,generation
0,-1.0,-0.880574,-0.433847


In [140]:
reg_num.predict(df_input_num.values)

array([14.406119])

## ANSWER: What is the MAE error of this prediction? 

In [141]:
y_pred_num = reg_num.predict(X_num)
mae(y,y_pred_num)

10.33154736948262

## ANSWER: How many line coefficients are there?

In [142]:
print('reg_num coef count:',reg_num.coef_.size)

reg_num coef count: 3


# #3. [10 pts] Did you note any change in these two model performances?

Yes, but the change is small. The one-hot model fits the data slightly better: its R² score is 0.296 versus 0.285 for the numerical model, and its MAE is 10.18 versus 10.33. The one-hot model uses 14 coefficients, while the numerical model uses only 3.

# #4. [10 pts] What is the prediction for age 33, male and generation Alpha (i.e. the generation after generation Z)?
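
One possible way to approach this question is sketched below. It rests on assumptions that are not in the original notebook: age 33 falls in the `25-34 years` group (code 2), and Generation Alpha extends the ordinal scale one step below Generation Z, i.e. code -1. The one-hot model cannot be used directly, because `OneHotEncoder` raises an error on unseen categories by default. The sketch uses the printed coefficients of `reg_num` and recovers the scaler parameters from the standardized values shown in the `df_num.head()` outputs; `recover` and `predict` are hypothetical helpers.

```python
import numpy as np

# Coefficients (sex, age, generation) and intercept copied from the reg_num output.
coef = np.array([-7.42323113, 6.32876361, 0.5998983])
intercept = 12.816097411933864

def recover(x1, z1, x2, z2):
    """Recover mean and scale from two (raw, standardized) pairs, z = (x - mean) / scale."""
    scale = (x2 - x1) / (z2 - z1)
    mean = x1 - z1 * scale
    return mean, scale

# Pairs read off the df_num.head() outputs before and after standardization.
age_mean, age_scale = recover(1, -0.880574, 3, 0.291278)
gen_mean, gen_scale = recover(2, -0.433847, 4, 0.972378)

def predict(sex_z, age_code, gen_code):
    z = np.array([sex_z,
                  (age_code - age_mean) / age_scale,
                  (gen_code - gen_mean) / gen_scale])
    return intercept + coef @ z

# Sanity check: male (-1.0 after scaling), 15-24 years, Generation X.
check = predict(-1.0, 1, 2)
print("check:", check)

# ASSUMPTION: Generation Alpha is encoded as -1, one step below Generation Z.
alpha = predict(-1.0, 2, -1)
print("age 33, male, Generation Alpha:", alpha)
```

The sanity check should reproduce the earlier prediction of about 14.406, up to rounding of the printed values. The Alpha prediction is an extrapolation outside the range of generation codes seen in training, so it should be treated with caution.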

# #5. [10 pts] Give one advantage when using regression (as opposed to classification with nominal features) in terms of input data features.

We have **fewer input data features** (3 instead of 14 in this notebook) when using regression, since we do not one-hot encode the categorical features.

# 6. [10 pts] Give one advantage when using regular numerical values rather than one-hot encoding for regression

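A **simpler model**: with fewer features there are fewer coefficients to estimate (3 instead of 14 here), which reduces the risk of overfitting and can help the model generalize, even though the fit on this data is slightly worse. Ordered numerical codes also allow the model to extrapolate to values it has not seen.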