# Assignment 7  
## Applied Machine Learning

Andrew Chan 
EBE869

This assignment uses the `Suicide Rates Overview 1985 to 2016` dataset from Kaggle: https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016. It assumes that the dataset has been downloaded as `master.csv` into the same directory as this notebook.

## Preprocessing

In [1]:
import pandas as pd
import math
import numpy as np

# Locate and load the data file
df = pd.read_csv('master.csv')

# Sanity check
print(f'N rows={len(df)}, M columns={len(df.columns)}')
df.head()

N rows=27820, M columns=12


Unnamed: 0,country,year,sex,age,suicides_no,population,suicides/100k pop,country-year,HDI for year,gdp_for_year ($),gdp_per_capita ($),generation
0,Albania,1987,male,15-24 years,21,312900,6.71,Albania1987,,2156624900,796,Generation X
1,Albania,1987,male,35-54 years,16,308000,5.19,Albania1987,,2156624900,796,Silent
2,Albania,1987,female,15-24 years,14,289700,4.83,Albania1987,,2156624900,796,Generation X
3,Albania,1987,male,75+ years,1,21800,4.59,Albania1987,,2156624900,796,G.I. Generation
4,Albania,1987,male,25-34 years,9,274300,3.28,Albania1987,,2156624900,796,Boomers


### Convert `gdp_for_year ($)` from string to float

In [2]:
df[' gdp_for_year ($) ']

0         2,156,624,900
1         2,156,624,900
2         2,156,624,900
3         2,156,624,900
4         2,156,624,900
              ...      
27815    63,067,077,179
27816    63,067,077,179
27817    63,067,077,179
27818    63,067,077,179
27819    63,067,077,179
Name:  gdp_for_year ($) , Length: 27820, dtype: object

In [3]:
df[' gdp_for_year ($) '] = df[' gdp_for_year ($) '].str.replace(',', '')
df[' gdp_for_year ($) '] = df[' gdp_for_year ($) '].astype(float) 
df[' gdp_for_year ($) ']

0        2.156625e+09
1        2.156625e+09
2        2.156625e+09
3        2.156625e+09
4        2.156625e+09
             ...     
27815    6.306708e+10
27816    6.306708e+10
27817    6.306708e+10
27818    6.306708e+10
27819    6.306708e+10
Name:  gdp_for_year ($) , Length: 27820, dtype: float64

In [4]:
df.head()

Unnamed: 0,country,year,sex,age,suicides_no,population,suicides/100k pop,country-year,HDI for year,gdp_for_year ($),gdp_per_capita ($),generation
0,Albania,1987,male,15-24 years,21,312900,6.71,Albania1987,,2156625000.0,796,Generation X
1,Albania,1987,male,35-54 years,16,308000,5.19,Albania1987,,2156625000.0,796,Silent
2,Albania,1987,female,15-24 years,14,289700,4.83,Albania1987,,2156625000.0,796,Generation X
3,Albania,1987,male,75+ years,1,21800,4.59,Albania1987,,2156625000.0,796,G.I. Generation
4,Albania,1987,male,25-34 years,9,274300,3.28,Albania1987,,2156625000.0,796,Boomers


### Remove redundant features

#### Remove `suicides_no`

`suicides_no` is highly correlated with `suicides/100k pop`. Since `suicides/100k pop` was chosen as the dependent variable, it does not make sense to use `suicides_no` as a feature: `suicides/100k pop` is a derived attribute, computed as $$\frac{suicides\_no}{population/100000}$$

Another reason I chose to keep `suicides/100k pop` is that it **normalizes suicides by population**. A large country may record many suicides simply because more people live there, while a smaller country may have a higher suicide rate per capita.
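
As a quick check of the formula above, the rate can be recomputed from the first row of the `df.head()` output (21 suicides in a population of 312,900). This is only a minimal sketch; the variable names are illustrative and do not appear in the notebook.

```python
# Values copied from row 0 of the df.head() output above
suicides_no = 21
population = 312900

# suicides per 100k = suicides_no / (population / 100000)
rate_per_100k = suicides_no / (population / 100000)
print(round(rate_per_100k, 2))  # 6.71, matching the `suicides/100k pop` column
```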

In [5]:
df = df.drop(['suicides_no'], axis=1)

#### Remove `gdp_for_year ($)`

`gdp_for_year ($)` is highly correlated with `gdp_per_capita ($)`. Since `gdp_per_capita ($)` was chosen as a feature, it does not make sense to also use `gdp_for_year ($)`: `gdp_per_capita ($)` is a derived attribute, computed as $$\frac{gdp\_for\_year\ (\$)}{population}$$

Another reason I chose to keep `gdp_per_capita ($)` is that it **normalizes GDP by population**. A large country may have a larger gross domestic product simply because more people live there, while a smaller country may have a higher GDP per capita.

In [6]:
df = df.drop([' gdp_for_year ($) '], axis=1)

#### Remove `country-year`

Since `country-year` is derived from `country` and `year`, including it is redundant. Thus, we drop `country-year`.

I also chose to keep `country` and `year` as separate columns because, upon one-hot encoding, they produce **N countries** + **M years** columns, versus up to **N x M** columns for `country-year`, which should help with computational speed.
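
The column-count argument can be illustrated on a small example. The sketch below uses hypothetical toy data (3 countries and 2 years, not taken from `master.csv`) and `pd.get_dummies` purely for illustration.

```python
import pandas as pd

# Hypothetical toy data: 3 countries x 2 years (not taken from master.csv)
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C", "C"],
    "year": ["2000", "2001", "2000", "2001", "2000", "2001"],
})
toy["country-year"] = toy["country"] + toy["year"]

# Encoding the two columns separately gives N + M indicator columns
separate = pd.get_dummies(toy[["country", "year"]])
# Encoding the combined column gives one indicator per observed pair (up to N x M)
combined = pd.get_dummies(toy["country-year"])

print(separate.shape[1], combined.shape[1])  # 5 6
```

With only 3 countries and 2 years the gap is small, but it widens quickly as N and M grow.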

In [7]:
df = df.drop(['country-year'], axis=1)

In [8]:
#### Remove `country` since we are doing regression

In [9]:
df = df.drop(['country'], axis=1)

#### Remove target variable

In [10]:
df_no_target = df.drop(['suicides/100k pop'], axis=1)

In [11]:
df_no_target

Unnamed: 0,year,sex,age,population,HDI for year,gdp_per_capita ($),generation
0,1987,male,15-24 years,312900,,796,Generation X
1,1987,male,35-54 years,308000,,796,Silent
2,1987,female,15-24 years,289700,,796,Generation X
3,1987,male,75+ years,21800,,796,G.I. Generation
4,1987,male,25-34 years,274300,,796,Boomers
...,...,...,...,...,...,...,...
27815,2014,female,35-54 years,3620833,0.675,2309,Generation X
27816,2014,female,75+ years,348465,0.675,2309,Silent
27817,2014,male,5-14 years,2762158,0.675,2309,Generation Z
27818,2014,female,5-14 years,2631600,0.675,2309,Generation Z


### Most Frequent imputation

Every NaN value is replaced with the most frequent value of its column.

In [12]:
from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy="most_frequent") 
df_no_target[:] = imp.fit_transform(df_no_target)
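
To make the behaviour of the `most_frequent` strategy concrete, here is a minimal sketch on a hypothetical single-column array. The names `toy` and `toy_imp` are illustrative and separate from the notebook's `imp`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical HDI-like column with one missing entry
toy = np.array([[0.7], [0.7], [np.nan], [0.5]])

toy_imp = SimpleImputer(strategy="most_frequent")
filled = toy_imp.fit_transform(toy)

print(filled.ravel())       # the NaN is replaced by the mode of the column
print(toy_imp.statistics_)  # per-column fill values learned during fit
```

The fitted imputer stores the fill values, which is why the same `imp` object can later be reused with `transform` on a new input row.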

In [13]:
df_no_target

Unnamed: 0,year,sex,age,population,HDI for year,gdp_per_capita ($),generation
0,1987,male,15-24 years,312900,0.713,796,Generation X
1,1987,male,35-54 years,308000,0.713,796,Silent
2,1987,female,15-24 years,289700,0.713,796,Generation X
3,1987,male,75+ years,21800,0.713,796,G.I. Generation
4,1987,male,25-34 years,274300,0.713,796,Boomers
...,...,...,...,...,...,...,...
27815,2014,female,35-54 years,3620833,0.675,2309,Generation X
27816,2014,female,75+ years,348465,0.675,2309,Silent
27817,2014,male,5-14 years,2762158,0.675,2309,Generation Z
27818,2014,female,5-14 years,2631600,0.675,2309,Generation Z


In [14]:
final_features = df_no_target.columns

In [15]:
final_features

Index(['year', 'sex', 'age', 'population', 'HDI for year',
       'gdp_per_capita ($)', 'generation'],
      dtype='object')

### Normalization and Standardization

In [16]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_no_target_norm = df_no_target.copy()
df_no_target_norm[['year', 'population', 'HDI for year', 'gdp_per_capita ($)']] = scaler.fit_transform(df_no_target_norm[['year', 'population', 'HDI for year', 'gdp_per_capita ($)']])
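
Standardization maps each value to $z = (x - \mu)/\sigma$, where $\mu$ and $\sigma$ are the column mean and standard deviation. The sketch below, on a hypothetical toy column, checks that `StandardScaler` agrees with this formula when the population standard deviation (`ddof=0`) is used.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single numeric feature
x = np.array([[1.0], [2.0], [3.0], [4.0]])

toy_scaler = StandardScaler()
z_sklearn = toy_scaler.fit_transform(x)

# Manual z-score; StandardScaler uses the population std (ddof=0)
z_manual = (x - x.mean(axis=0)) / x.std(axis=0, ddof=0)

print(np.allclose(z_sklearn, z_manual))
```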

### One hot encoding of all nominal features

In [17]:
from sklearn.preprocessing import OneHotEncoder
X = df_no_target_norm[['year','sex','age','population','HDI for year','gdp_per_capita ($)','generation']].values


In [18]:
X.shape

(27820, 7)

In [19]:
X

array([[-1.6836154088722834, 'male', '15-24 years', ...,
        -0.32455746341615943, -0.8508637026494903, 'Generation X'],
       [-1.6836154088722834, 'male', '35-54 years', ...,
        -0.32455746341615943, -0.8508637026494903, 'Silent'],
       [-1.6836154088722834, 'female', '15-24 years', ...,
        -0.32455746341615943, -0.8508637026494903, 'Generation X'],
       ...,
       [1.5045189458533905, 'male', '5-14 years', ...,
        -0.9695480041053613, -0.7707566970933892, 'Generation Z'],
       [1.5045189458533905, 'female', '5-14 years', ...,
        -0.9695480041053613, -0.7707566970933892, 'Generation Z'],
       [1.5045189458533905, 'female', '55-74 years', ...,
        -0.9695480041053613, -0.7707566970933892, 'Boomers']],
      dtype=object)

In [20]:
from sklearn.compose import ColumnTransformer
c_transf = ColumnTransformer([ 
    ('onehot', OneHotEncoder(), [1,2,6]),
    ('nothing', 'passthrough', [0,3,4,5])
])
X = c_transf.fit_transform(X).astype(float)
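
The output width follows from the category counts: 2 sex levels + 6 age groups + 6 generations + 4 passthrough numeric columns = 18 columns, with the one-hot columns placed first. A minimal sketch of the same pattern on a hypothetical toy frame (column names and `toy_transf` are illustrative):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame: one numeric and two nominal columns
toy = pd.DataFrame({
    "year": [1987.0, 1988.0, 1989.0],
    "sex": ["male", "female", "male"],
    "age": ["5-14 years", "15-24 years", "75+ years"],
})

toy_transf = ColumnTransformer(
    [
        ("onehot", OneHotEncoder(), ["sex", "age"]),
        ("nothing", "passthrough", ["year"]),
    ],
    sparse_threshold=0,  # always return a dense array
)
toy_X = toy_transf.fit_transform(toy)

# 2 sex levels + 3 age levels + 1 passthrough column = 6 columns
print(toy_X.shape)
```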

In [21]:
X.shape

(27820, 18)

In [22]:
X[0]

array([ 0.        ,  1.        ,  1.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        1.        ,  0.        ,  0.        ,  0.        , -1.68361541,
       -0.39161747, -0.32455746, -0.8508637 ])

# 1. [20 pts] Keep the variables as one-hot encoded and develop a multiple linear regression model. 
Use your model to predict the target variable for the people with age 20, male, and generation X. What is the MAE error of this prediction? How many line coefficients are there?

We will fit a multiple linear regression model (scikit-learn's `LinearRegression`) that predicts `suicides/100k pop` from the one-hot encoded and standardized features.

Set the x and y values:

In [23]:
df_y = df['suicides/100k pop']
y = df_y.values
from sklearn.model_selection import train_test_split
from sklearn import metrics

# ANSWER: New Model:

In [24]:
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(X, y)
print('score',reg.score(X, y))
print('coef',reg.coef_)
print('intercept',reg.intercept_)

score 0.29857875800200584
coef [ 1.84904210e+12  1.84904210e+12 -4.51501158e+13 -4.51501158e+13
 -4.51501158e+13 -4.51501158e+13 -4.51501158e+13 -4.51501158e+13
 -3.31210233e+13 -3.31210233e+13 -3.31210233e+13 -3.31210233e+13
 -3.31210233e+13 -3.31210233e+13 -1.07718327e+00  7.34825534e-01
  6.26454072e-01 -2.85741976e-02]
intercept 76422096999623.77
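
The very large coefficients, identical within each one-hot group and offset by a similarly large intercept, are consistent with perfect collinearity: the indicator columns of each nominal feature sum to 1, which duplicates the intercept. The sketch below illustrates this general effect on hypothetical toy data by checking the rank of the design matrix; it is not a diagnosis that was run in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data covering every sex/age combination
toy = pd.DataFrame({
    "sex": ["male", "male", "male", "female", "female", "female"],
    "age": ["young", "mid", "old", "young", "mid", "old"],
})

onehot = pd.get_dummies(toy).to_numpy(dtype=float)  # 2 + 3 indicator columns
intercept = np.ones((len(toy), 1))
design = np.hstack([intercept, onehot])             # 6 columns in total

# Each indicator group sums to the intercept column, so the rank is below 6
print(design.shape[1], np.linalg.matrix_rank(design))
```

One common remedy is to drop one level per nominal feature, for example with `OneHotEncoder(drop='first')`. The fitted values on the training data should be unchanged; only the individual coefficients become interpretable.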


## Use your model to predict the target variable for the people with age 20, male, and generation X.

In [25]:
input = { 
    'age':['15-24 years'],
    'generation':['Generation X'],
    'sex':['male'] 
}

df_input = pd.DataFrame(input,columns = [ 'year', 
                                         'sex', 
                                         'age', 
                                         'population', 
                                         'HDI for year',
                                         'gdp_per_capita ($)', 
                                         'generation'])

In [26]:
df_input

Unnamed: 0,year,sex,age,population,HDI for year,gdp_per_capita ($),generation
0,,male,15-24 years,,,,Generation X


#### Imputation

In [27]:
imp.transform(df_input)

array([[2009, 'male', '15-24 years', 24000, 0.713, 1299, 'Generation X']],
      dtype=object)

In [28]:
df_input[:] = imp.transform(df_input)

In [29]:
df_input

Unnamed: 0,year,sex,age,population,HDI for year,gdp_per_capita ($),generation
0,2009,male,15-24 years,24000,0.713,1299,Generation X


#### Standardization

In [30]:
df_input[['year', 'population', 'HDI for year', 'gdp_per_capita ($)']] = scaler.transform(df_input[['year', 'population', 'HDI for year', 'gdp_per_capita ($)']])

In [31]:
df_input

Unnamed: 0,year,sex,age,population,HDI for year,gdp_per_capita ($),generation
0,0.914124,male,15-24 years,-0.465473,-0.324557,-0.824232,Generation X


#### One-hot

In [32]:
X_input = df_input[['year','sex','age','population','HDI for year','gdp_per_capita ($)','generation']].values
X_input = c_transf.transform(X_input).astype(float)

## Answer: 14.390625 suicides/100k pop

In [33]:
reg.predict(X_input)

array([14.390625])

$$MAE = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{y}_i-y_i\right|$$

In [34]:
def mae(_y, _y_pred):
    return (len(_y)**-1) * np.sum(np.abs(_y_pred-_y))
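
As a sanity check, the mean of absolute errors can be compared with scikit-learn's `mean_absolute_error` on a small example. The arrays below are hypothetical and chosen so that the result is easy to verify by hand.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical targets and predictions
y_true = np.array([1.0, 2.0, 3.0])
y_hat = np.array([2.0, 2.0, 5.0])

# Mean of absolute errors: (1 + 0 + 2) / 3
manual_mae = np.mean(np.abs(y_hat - y_true))
print(manual_mae, mean_absolute_error(y_true, y_hat))  # both 1.0
```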

In [35]:
y_pred = reg.predict(X)
y_pred.shape

(27820,)

In [36]:
y.shape

(27820,)

## What is the MAE error of this prediction? 

## Answer: 10.165012266355141

In [37]:
mae(y,y_pred)

10.165012266355141

## How many line coefficients are there?

## Answer: 18

In [38]:
print('coef count:',reg.coef_.size)

coef count: 18


# #2 [30 pts] Now use the original sex, age and generation variables in numerical form and develop a new model. 
Use your model to predict the target value for the people with age
20, male, and generation X. What is the MAE error of this prediction? How many line coefficients are there? (Note that for this step you have to think of a way of encoding the original nominal age feature and generation feature into numerical features.)

In [39]:
df_num = df_no_target.copy()

#### Map sex to numerical values
Map the sex strings to the values {0, 1}.

In [40]:
df_num['sex'].unique()

array(['male', 'female'], dtype=object)

In [41]:
sex_mapping = { 'male':0,
               'female':1, 
              }

In [42]:
df_num['sex'] = df_num['sex'].map(sex_mapping)

In [43]:
df_num.head()

Unnamed: 0,year,sex,age,population,HDI for year,gdp_per_capita ($),generation
0,1987,0,15-24 years,312900,0.713,796,Generation X
1,1987,0,35-54 years,308000,0.713,796,Silent
2,1987,1,15-24 years,289700,0.713,796,Generation X
3,1987,0,75+ years,21800,0.713,796,G.I. Generation
4,1987,0,25-34 years,274300,0.713,796,Boomers


#### Map age to numerical values
Map the age-group strings to ordered integers [0,...,5], from youngest to oldest.

In [44]:
df_num['age'].unique()

array(['15-24 years', '35-54 years', '75+ years', '25-34 years',
       '55-74 years', '5-14 years'], dtype=object)

In [45]:
age_mapping = { '5-14 years':0,
               '15-24 years':1, 
               '25-34 years':2,
               '35-54 years':3,
               '55-74 years':4,  
               '75+ years':5
              }

In [46]:
df_num['age'] = df_num['age'].map(age_mapping)

In [47]:
df_num.head()

Unnamed: 0,year,sex,age,population,HDI for year,gdp_per_capita ($),generation
0,1987,0,1,312900,0.713,796,Generation X
1,1987,0,3,308000,0.713,796,Silent
2,1987,1,1,289700,0.713,796,Generation X
3,1987,0,5,21800,0.713,796,G.I. Generation
4,1987,0,2,274300,0.713,796,Boomers


#### Map generation to numerical values
Map the generation strings to ordered integers [0,...,5], from youngest to oldest generation.

In [48]:
df_num['generation'].unique()

array(['Generation X', 'Silent', 'G.I. Generation', 'Boomers',
       'Millenials', 'Generation Z'], dtype=object)

In [49]:
generation_mapping = { 'Generation Z':0,
                'Millenials':1, 
                'Generation X':2, 
                'Boomers':3,
                'Silent':4, 
                'G.I. Generation':5
              }

In [50]:
df_num['generation'] = df_num['generation'].map(generation_mapping)
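
One property of `Series.map` with a dictionary is worth noting: labels that are missing from the dictionary become NaN instead of raising an error. A minimal sketch, using a copy of the mapping and a hypothetical series (`toy_mapping` and `toy` are illustrative names):

```python
import pandas as pd

# Same ordering as generation_mapping above: youngest generation = 0
toy_mapping = {"Generation Z": 0, "Millenials": 1, "Generation X": 2,
               "Boomers": 3, "Silent": 4, "G.I. Generation": 5}

toy = pd.Series(["Generation X", "Silent", "Generation Alpha"])
encoded = toy.map(toy_mapping)

# Labels missing from the dictionary become NaN rather than raising an error
print(encoded.tolist())
```

This is why an unseen generation has to be given an explicit numeric code, such as the `-1` used for generation Alpha in question 4.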

In [51]:
df_num.head()

Unnamed: 0,year,sex,age,population,HDI for year,gdp_per_capita ($),generation
0,1987,0,1,312900,0.713,796,2
1,1987,0,3,308000,0.713,796,4
2,1987,1,1,289700,0.713,796,2
3,1987,0,5,21800,0.713,796,5
4,1987,0,2,274300,0.713,796,3


In [52]:
df_num.columns

Index(['year', 'sex', 'age', 'population', 'HDI for year',
       'gdp_per_capita ($)', 'generation'],
      dtype='object')

#### Standardization

In [53]:
scaler_num = StandardScaler()
df_num[['year', 
        'sex', 
        'age', 
        'population', 
        'HDI for year',
       'gdp_per_capita ($)', 
        'generation']] = scaler_num.fit_transform(df_num[['year', 
                                                          'sex', 
                                                          'age', 
                                                          'population', 
                                                          'HDI for year',
                                                           'gdp_per_capita ($)', 
                                                          'generation']])

In [54]:
df_num.head()

Unnamed: 0,year,sex,age,population,HDI for year,gdp_per_capita ($),generation
0,-1.683615,-1.0,-0.880574,-0.391617,-0.324557,-0.850864,-0.433847
1,-1.683615,-1.0,0.291278,-0.39287,-0.324557,-0.850864,0.972378
2,-1.683615,1.0,-0.880574,-0.397548,-0.324557,-0.850864,-0.433847
3,-1.683615,-1.0,1.46313,-0.466035,-0.324557,-0.850864,1.675491
4,-1.683615,-1.0,-0.294648,-0.401485,-0.324557,-0.850864,0.269265


#### Fit a new model

In [55]:
X_num = df_num.values

In [56]:
X_num.shape

(27820, 7)

In [57]:
y_num = df_y.values

In [58]:
y_num.shape

(27820,)

# ANSWER: New Model:

In [59]:
reg_num = LinearRegression()
reg_num.fit(X_num, y_num)
print('score',reg_num.score(X_num, y_num))
print('coef',reg_num.coef_)
print('intercept',reg_num.intercept_)

score 0.2899472666470345
coef [-2.04486228e+00 -7.43038681e+00  9.73151535e+00  6.36508865e-01
  6.14307245e-01 -5.89970483e-03 -3.08729256e+00]
intercept 12.816097411933855


## Use your model to predict the target value for the people with age 20, male, and generation X. 

In [60]:
input_num = { 
    'age':['15-24 years'],
    'generation':['Generation X'],
    'sex':['male'] 
}

# Build the one-row frame from input_num (the dict defined in this cell)
df_input_num = pd.DataFrame(input_num,columns = ['year', 'sex', 'age', 'population', 'HDI for year',
       'gdp_per_capita ($)', 'generation'])

In [61]:
df_input_num

Unnamed: 0,year,sex,age,population,HDI for year,gdp_per_capita ($),generation
0,,male,15-24 years,,,,Generation X


#### Imputation

In [62]:
df_input_num[:] = imp.transform(df_input_num)

In [63]:
df_input_num

Unnamed: 0,year,sex,age,population,HDI for year,gdp_per_capita ($),generation
0,2009,male,15-24 years,24000,0.713,1299,Generation X


#### Mappings

In [64]:
df_input_num['sex'] = df_input_num['sex'].map(sex_mapping)
df_input_num['age'] = df_input_num['age'].map(age_mapping)
df_input_num['generation'] = df_input_num['generation'].map(generation_mapping)

In [65]:
df_input_num

Unnamed: 0,year,sex,age,population,HDI for year,gdp_per_capita ($),generation
0,2009,0,1,24000,0.713,1299,2


#### Standardization

In [66]:
df_input_num[['year', 
        'sex', 
        'age', 
        'population', 
        'HDI for year',
       'gdp_per_capita ($)', 
        'generation']] = scaler_num.transform(df_input_num[['year', 
                                                          'sex', 
                                                          'age', 
                                                          'population', 
                                                          'HDI for year',
                                                           'gdp_per_capita ($)', 
                                                          'generation']])

In [67]:
df_input_num

Unnamed: 0,year,sex,age,population,HDI for year,gdp_per_capita ($),generation
0,0.914124,-1.0,-0.880574,-0.465473,-0.324557,-0.824232,-0.433847


## ANSWER: 10.65652782

In [68]:
reg_num.predict(df_input_num.values)

array([10.65652782])
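
The prediction is simply the intercept plus the dot product of the coefficients with the standardized input row. The sketch below reproduces it by hand from the rounded values printed earlier, so it agrees with `reg_num.predict` only up to rounding; the variable names are illustrative.

```python
import numpy as np

# Coefficients and intercept copied from the printed output of In [59]
coef = np.array([-2.04486228e+00, -7.43038681e+00, 9.73151535e+00,
                 6.36508865e-01, 6.14307245e-01, -5.89970483e-03,
                 -3.08729256e+00])
intercept = 12.816097411933855

# Standardized input row copied from the In [67] output
# (year, sex, age, population, HDI for year, gdp_per_capita ($), generation)
x = np.array([0.914124, -1.0, -0.880574, -0.465473,
              -0.324557, -0.824232, -0.433847])

# Linear regression prediction: intercept + sum_j coef_j * x_j
manual_pred = intercept + coef @ x
print(round(manual_pred, 2))  # 10.66, matching reg_num.predict up to rounding
```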

## What is the MAE error of this prediction? 

## ANSWER: 10.29593502788541

In [69]:
y_pred_num = reg_num.predict(X_num)
mae(y,y_pred_num)

10.29593502788541

## How many line coefficients are there?

## ANSWER: 7

In [70]:
print('reg_num coef count:',reg_num.coef_.size)

reg_num coef count: 7


# #3. [10 pts] Did you note any change in these two model performances?

Comparing the two models by `MAE`, computed on the same data used for fitting:  
#1 one-hot encoded categorical features: `MAE` = 10.165  
#2 numerically encoded features: `MAE` = 10.296  
Thus it seems that **one-hot encoding has slightly better performance.**

# #4. [10 pts] What is the prediction for age 33, male and generation Alpha (i.e. the generation after generation Z)?

In [71]:
input_num = { 
    'age':['25-34 years'],
    'generation':[-1],
    'sex':['male'] 
}

df_input_num = pd.DataFrame(input_num,columns = ['year', 'sex', 'age', 'population', 'HDI for year',
       'gdp_per_capita ($)', 'generation'])

In [72]:
df_input_num

Unnamed: 0,year,sex,age,population,HDI for year,gdp_per_capita ($),generation
0,,male,25-34 years,,,,-1


#### Imputation

In [73]:
df_input_num[:] = imp.transform(df_input_num)

In [74]:
df_input_num

Unnamed: 0,year,sex,age,population,HDI for year,gdp_per_capita ($),generation
0,2009,male,25-34 years,24000,0.713,1299,-1


#### Mappings

In [75]:
df_input_num['sex'] = df_input_num['sex'].map(sex_mapping)
df_input_num['age'] = df_input_num['age'].map(age_mapping)


In [76]:
df_input_num

Unnamed: 0,year,sex,age,population,HDI for year,gdp_per_capita ($),generation
0,2009,0,2,24000,0.713,1299,-1


#### Standardization

In [77]:
df_input_num[['year', 
        'sex', 
        'age', 
        'population', 
        'HDI for year',
       'gdp_per_capita ($)', 
        'generation']] = scaler_num.transform(df_input_num[['year', 
                                                          'sex', 
                                                          'age', 
                                                          'population', 
                                                          'HDI for year',
                                                           'gdp_per_capita ($)', 
                                                          'generation']])

In [78]:
df_input_num

Unnamed: 0,year,sex,age,population,HDI for year,gdp_per_capita ($),generation
0,0.914124,-1.0,-0.294648,-0.465473,-0.324557,-0.824232,-2.543186
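
Because `StandardScaler` is an affine transformation, the standardized value for the code `-1` can be recovered by linear extrapolation from two known codes. The sketch below uses the rounded values shown in the In [54] output, so it matches only up to rounding.

```python
# Standardized generation values copied from the In [54] output:
# code 2 (Generation X) -> -0.433847, code 4 (Silent) -> 0.972378
z_code2, z_code4 = -0.433847, 0.972378

# StandardScaler is affine, so one unit of the code shifts z by a constant step
step = (z_code4 - z_code2) / (4 - 2)

# Extrapolate to the code -1 used for generation Alpha
z_alpha = z_code2 + (-1 - 2) * step
print(round(z_alpha, 4))  # -2.5432, in line with the -2.543186 shown above
```

The value lies below every generation code seen during training (the smallest, code 0, standardizes to about -1.84), so the resulting prediction is an extrapolation.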


## ANSWER: 22.87062101

In [79]:
reg_num.predict(df_input_num.values)

array([22.87062101])

# #5. [10 pts] Give one advantage when using regression (as opposed to classification with nominal features) in terms of input data features.

ANSWER: We have **fewer input data features** (7 versus 18 here) when using regression, since we do not one-hot encode the categorical features, which should save memory and computation.

# #6. [10 pts] Give one advantage when using regular numerical values rather than one-hot encoding for regression

ANSWER: A **simpler model**: fewer features should give a model that generalizes better and is less prone to overfitting. One-hot encoding produced 18 coefficients versus 7 for numerical encoding.