# Assignment 6  
## Applied Machine Learning

Andrew Chan 
EBE869

This assignment uses the `Suicide Rates Overview 1985 to 2016` dataset from Kaggle: https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016. It assumes that the dataset has been downloaded as `master.csv` into the same directory as this notebook.

# 1. [20 pts] What is the dependent variable you decided? Why?

The dependent variable is derived from `suicides/100k pop`. We chose it over `suicides_no` because it is normalized by population rather than being an absolute count of suicides. This keeps the model from skewing towards larger countries, which naturally record more suicides.

## Preprocessing
The assignment says "let's get back to the pre-processed dataset Suicide Rates Overview 1985 to 2016 file", so we repeat the preprocessing steps here:

In [148]:
import pandas as pd
import math
import numpy as np

# Locate and load the data file
df = pd.read_csv('master.csv')

# Sanity check
print(f'N rows={len(df)}, M columns={len(df.columns)}')
df.head()

N rows=27820, M columns=12


| | country | year | sex | age | suicides_no | population | suicides/100k pop | country-year | HDI for year | gdp_for_year ($) | gdp_per_capita ($) | generation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Albania | 1987 | male | 15-24 years | 21 | 312900 | 6.71 | Albania1987 | NaN | 2156624900 | 796 | Generation X |
| 1 | Albania | 1987 | male | 35-54 years | 16 | 308000 | 5.19 | Albania1987 | NaN | 2156624900 | 796 | Silent |
| 2 | Albania | 1987 | female | 15-24 years | 14 | 289700 | 4.83 | Albania1987 | NaN | 2156624900 | 796 | Generation X |
| 3 | Albania | 1987 | male | 75+ years | 1 | 21800 | 4.59 | Albania1987 | NaN | 2156624900 | 796 | G.I. Generation |
| 4 | Albania | 1987 | male | 25-34 years | 9 | 274300 | 3.28 | Albania1987 | NaN | 2156624900 | 796 | Boomers |


### Adjust the gdp_for_year ($) from string to float.

In [149]:
df[' gdp_for_year ($) ']

0         2,156,624,900
1         2,156,624,900
2         2,156,624,900
3         2,156,624,900
4         2,156,624,900
              ...      
27815    63,067,077,179
27816    63,067,077,179
27817    63,067,077,179
27818    63,067,077,179
27819    63,067,077,179
Name:  gdp_for_year ($) , Length: 27820, dtype: object

In [150]:
df[' gdp_for_year ($) '] = df[' gdp_for_year ($) '].str.replace(',', '')
df[' gdp_for_year ($) '] = df[' gdp_for_year ($) '].astype(float) 
df[' gdp_for_year ($) ']

0        2.156625e+09
1        2.156625e+09
2        2.156625e+09
3        2.156625e+09
4        2.156625e+09
             ...     
27815    6.306708e+10
27816    6.306708e+10
27817    6.306708e+10
27818    6.306708e+10
27819    6.306708e+10
Name:  gdp_for_year ($) , Length: 27820, dtype: float64
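The conversion above can be reproduced on a toy Series. This is a minimal sketch: the two sample strings are copied from the output shown earlier, while the variable names and the use of `pd.to_numeric` are illustrative assumptions rather than the notebook's own code.

```python
import pandas as pd

# Toy stand-in for the ' gdp_for_year ($) ' column (note the spaces in the name)
gdp_raw = pd.Series(["2,156,624,900", "63,067,077,179"], name=" gdp_for_year ($) ")

# Remove the thousands separators as literal text, then convert to a numeric dtype
gdp = pd.to_numeric(gdp_raw.str.replace(",", "", regex=False))

print(gdp.dtype)
print(gdp.tolist())
```

`pd.read_csv` also accepts a `thousands=','` argument, which may make the manual conversion unnecessary.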

In [151]:
df.head()

| | country | year | sex | age | suicides_no | population | suicides/100k pop | country-year | HDI for year | gdp_for_year ($) | gdp_per_capita ($) | generation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Albania | 1987 | male | 15-24 years | 21 | 312900 | 6.71 | Albania1987 | NaN | 2156625000.0 | 796 | Generation X |
| 1 | Albania | 1987 | male | 35-54 years | 16 | 308000 | 5.19 | Albania1987 | NaN | 2156625000.0 | 796 | Silent |
| 2 | Albania | 1987 | female | 15-24 years | 14 | 289700 | 4.83 | Albania1987 | NaN | 2156625000.0 | 796 | Generation X |
| 3 | Albania | 1987 | male | 75+ years | 1 | 21800 | 4.59 | Albania1987 | NaN | 2156625000.0 | 796 | G.I. Generation |
| 4 | Albania | 1987 | male | 25-34 years | 9 | 274300 | 3.28 | Albania1987 | NaN | 2156625000.0 | 796 | Boomers |


### Removal of redundant features

#### Remove `suicides_no`

`suicides_no` is highly correlated with `suicides/100k pop`. Since `suicides/100k pop` was chosen as the dependent variable and is derived from `suicides_no` as $$\frac{\text{suicides\_no}}{\text{population}/100000}$$ keeping `suicides_no` as a feature alongside `population` would leak the target into the model.

Another reason why I chose to keep `suicides/100k pop` is because it helps **normalize suicides by population**. A large country may have many suicides because there are more people that live there, but a smaller country may have more suicides per population. 

In [152]:
df = df.drop(['suicides_no'], axis=1)
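The relationship between the two columns can be checked by hand against the first rows shown earlier. This is a small sketch; the helper name `rate_per_100k` is hypothetical and the rounding to two decimals is an assumption about how the dataset reports the rate.

```python
def rate_per_100k(suicides_no: int, population: int) -> float:
    """Suicides per 100,000 people, rounded to two decimals."""
    return round(suicides_no / (population / 100_000), 2)

# (suicides_no, population, reported suicides/100k pop) from the first three rows
rows = [(21, 312900, 6.71), (16, 308000, 5.19), (14, 289700, 4.83)]

for suicides_no, population, reported in rows:
    print(rate_per_100k(suicides_no, population), reported)
```

Because the rate can be recomputed from `suicides_no` and `population`, leaving both in the feature set would hand the target to the model.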

#### Remove `gdp_for_year ($)`

`gdp_for_year ($)` is highly correlated with `gdp_per_capita ($)`. Since `gdp_per_capita ($)` is kept as a feature and is itself derived from `gdp_for_year ($)` as $$\frac{\text{gdp\_for\_year}}{\text{population}}$$ it does not make sense to also keep `gdp_for_year ($)`.


Another reason why I chose to keep `gdp_per_capita ($)` is because it helps **normalize GDP by population**. A large country may have more gross domestic product because there are more people that live there, but a smaller country may have more gdp per capita. 

In [153]:
df = df.drop([' gdp_for_year ($) '], axis=1)

#### Remove `country-year`

Since `country-year` already is derived from `country` and `year`, it is redundant to include the `country-year`. Thus, we drop `country-year`.

I also chose to keep `country` and `year` because upon one-hot encoding, there will be **N countries** + **M years** columns versus **N x M** additional columns, which should help with computational speed. 

In [154]:
df = df.drop(['country-year'], axis=1)
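The column-count argument can be illustrated with a toy example. The country and year values below are a hypothetical subset chosen for illustration, not counts taken from the dataset.

```python
from itertools import product

countries = ["Albania", "Austria", "Uzbekistan"]   # N = 3 (toy subset)
years = [1987, 1995, 2005, 2014]                   # M = 4 (toy subset)

# Encoding the two features separately needs one column per distinct value of each
separate_columns = len(countries) + len(years)

# Encoding the combined key needs one column per distinct (country, year) pair;
# N * M is the upper bound, reached when every pair occurs in the data
combined_keys = {f"{c}{y}" for c, y in product(countries, years)}
combined_columns = len(combined_keys)

print(separate_columns, combined_columns)
```

The gap widens quickly as N and M grow, which is the motivation for dropping `country-year`.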

#### Remove target variable

In [155]:
df_no_target = df.drop(['suicides/100k pop'], axis=1)

In [156]:
df_no_target

| | country | year | sex | age | population | HDI for year | gdp_per_capita ($) | generation |
|---|---|---|---|---|---|---|---|---|
| 0 | Albania | 1987 | male | 15-24 years | 312900 | NaN | 796 | Generation X |
| 1 | Albania | 1987 | male | 35-54 years | 308000 | NaN | 796 | Silent |
| 2 | Albania | 1987 | female | 15-24 years | 289700 | NaN | 796 | Generation X |
| 3 | Albania | 1987 | male | 75+ years | 21800 | NaN | 796 | G.I. Generation |
| 4 | Albania | 1987 | male | 25-34 years | 274300 | NaN | 796 | Boomers |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 27815 | Uzbekistan | 2014 | female | 35-54 years | 3620833 | 0.675 | 2309 | Generation X |
| 27816 | Uzbekistan | 2014 | female | 75+ years | 348465 | 0.675 | 2309 | Silent |
| 27817 | Uzbekistan | 2014 | male | 5-14 years | 2762158 | 0.675 | 2309 | Generation Z |
| 27818 | Uzbekistan | 2014 | female | 5-14 years | 2631600 | 0.675 | 2309 | Generation Z |


### Most Frequent imputation

Every NaN value is replaced with the most frequent value in its column.

In [157]:
from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy="most_frequent") 
df_no_target[:] = imp.fit_transform(df_no_target)
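To see what the `most_frequent` strategy does, here is a minimal sketch on a toy frame. The rows are made up for illustration; only the column names and the `SimpleImputer` call mirror the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one missing HDI value (hypothetical rows)
toy = pd.DataFrame({
    "country": ["Albania", "Albania", "Albania", "Uzbekistan"],
    "HDI for year": [np.nan, 0.713, 0.713, 0.675],
})

imp_toy = SimpleImputer(strategy="most_frequent")
filled = pd.DataFrame(imp_toy.fit_transform(toy), columns=toy.columns)

# statistics_ holds the fill value learned for each column
print(imp_toy.statistics_)
print(filled)
```

The fill value is computed over the whole column, so every missing entry receives the same value regardless of country or year.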

In [158]:
df_no_target

| | country | year | sex | age | population | HDI for year | gdp_per_capita ($) | generation |
|---|---|---|---|---|---|---|---|---|
| 0 | Albania | 1987 | male | 15-24 years | 312900 | 0.713 | 796 | Generation X |
| 1 | Albania | 1987 | male | 35-54 years | 308000 | 0.713 | 796 | Silent |
| 2 | Albania | 1987 | female | 15-24 years | 289700 | 0.713 | 796 | Generation X |
| 3 | Albania | 1987 | male | 75+ years | 21800 | 0.713 | 796 | G.I. Generation |
| 4 | Albania | 1987 | male | 25-34 years | 274300 | 0.713 | 796 | Boomers |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 27815 | Uzbekistan | 2014 | female | 35-54 years | 3620833 | 0.675 | 2309 | Generation X |
| 27816 | Uzbekistan | 2014 | female | 75+ years | 348465 | 0.675 | 2309 | Silent |
| 27817 | Uzbekistan | 2014 | male | 5-14 years | 2762158 | 0.675 | 2309 | Generation Z |
| 27818 | Uzbekistan | 2014 | female | 5-14 years | 2631600 | 0.675 | 2309 | Generation Z |


In [159]:
final_features = df_no_target.columns

In [160]:
final_features

Index(['country', 'year', 'sex', 'age', 'population', 'HDI for year',
       'gdp_per_capita ($)', 'generation'],
      dtype='object')

### Normalization and Standardization

In [161]:
from sklearn.preprocessing import StandardScaler

# Standardize the numeric columns to zero mean and unit variance
num_cols = ['year', 'population', 'HDI for year', 'gdp_per_capita ($)']
scaler = StandardScaler()
df_no_target[num_cols] = scaler.fit_transform(df_no_target[num_cols])
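`StandardScaler` replaces each value x with z = (x - mean) / std, using the population standard deviation of the column. The sketch below reproduces that arithmetic with the standard library only; the `standardize` helper and the sample years are hypothetical.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return z-scores using the population standard deviation (ddof=0)."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

years = [1987, 1995, 2005, 2014]   # toy values, not the full column
z = standardize(years)

print([round(v, 3) for v in z])
```

When new rows are scored later, `scaler.transform` reuses the mean and standard deviation learned here, so a query year is scaled on the same footing as the training data.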

### One hot encoding of all nominal features

In [162]:
from sklearn.preprocessing import OneHotEncoder
X = df_no_target[['country','year','sex','age','population','HDI for year','gdp_per_capita ($)','generation']].values


In [163]:
X.shape

(27820, 8)

In [164]:
X

array([['Albania', -1.6836154088722834, 'male', ...,
        -0.32455746341615943, -0.8508637026494903, 'Generation X'],
       ['Albania', -1.6836154088722834, 'male', ...,
        -0.32455746341615943, -0.8508637026494903, 'Silent'],
       ['Albania', -1.6836154088722834, 'female', ...,
        -0.32455746341615943, -0.8508637026494903, 'Generation X'],
       ...,
       ['Uzbekistan', 1.5045189458533905, 'male', ...,
        -0.9695480041053613, -0.7707566970933892, 'Generation Z'],
       ['Uzbekistan', 1.5045189458533905, 'female', ...,
        -0.9695480041053613, -0.7707566970933892, 'Generation Z'],
       ['Uzbekistan', 1.5045189458533905, 'female', ...,
        -0.9695480041053613, -0.7707566970933892, 'Boomers']],
      dtype=object)

In [165]:
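from sklearn.compose import ColumnTransformer

# One-hot encode the nominal columns (country, sex, age, generation) and
# pass the already standardized numeric columns through unchanged
c_transf = ColumnTransformer([ 
    ('onehot', OneHotEncoder(), [0,2,3,7]),
    ('nothing', 'passthrough', [1,4,5,6])
])
X = c_transf.fit_transform(X).astype(float)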

In [166]:
X.shape

(27820, 119)
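The width of the encoded matrix is the number of one-hot categories plus the passthrough columns. With four passthrough columns, the 119 columns above imply 115 one-hot columns. A minimal sketch on toy rows (the data and variable names are hypothetical) shows how to confirm such a count from a fitted transformer:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy rows: [country, year (already scaled), sex]
X_toy = np.array([
    ["Albania", -1.2, "male"],
    ["Albania", 0.3, "female"],
    ["Uzbekistan", 0.9, "male"],
], dtype=object)

ct = ColumnTransformer([
    ("onehot", OneHotEncoder(), [0, 2]),
    ("nothing", "passthrough", [1]),
])
X_enc = ct.fit_transform(X_toy)

# Count the categories the fitted encoder learned for each nominal column
onehot = ct.named_transformers_["onehot"]
n_onehot = sum(len(cats) for cats in onehot.categories_)

print(X_enc.shape, n_onehot)
```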

In [167]:
X[0]

<1x119 sparse matrix of type '<class 'numpy.float64'>'
	with 8 stored elements in Compressed Sparse Row format>

In [169]:
X[0].todense()

matrix([[ 1.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.

# 3. [20 pts] Develop your classification model(s) to solve your defined problem.

We use logistic regression because it outputs a class probability in [0, 1], which can be thresholded to predict a suicide-rate category.

Set the x and y values:

In [170]:
df_y = df['suicides/100k pop']
y = df_y.values
from sklearn.model_selection import train_test_split
from sklearn import metrics

In [171]:
X_train, X_test, y_train, expected = train_test_split(X, y, test_size = 0.2, random_state=0)
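The later cells call a classifier named `pipe_lr` whose definition is not shown. The sketch below is one way such a pipeline could be set up, on synthetic data. Everything in it is an assumption: the mean-rate threshold, the `low suicide rate` label, the name `pipe_lr_demo`, and the single-step pipeline are illustrative and not the notebook's actual definitions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Synthetic stand-ins for the encoded features and the continuous rate
X_demo = rng.normal(size=(500, 5))
rate_demo = 10 + 4 * X_demo[:, 0] + rng.normal(scale=2, size=500)

# ASSUMPTION: a row is "high suicide rate" when its rate exceeds the mean rate
y_demo = np.where(rate_demo > rate_demo.mean(), "high suicide rate", "low suicide rate")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Hypothetical stand-in for pipe_lr: a one-step logistic regression pipeline
pipe_lr_demo = make_pipeline(LogisticRegression(max_iter=1000))
pipe_lr_demo.fit(X_tr, y_tr)

print(pipe_lr_demo.score(X_te, y_te))
```

Binarizing the rate is what turns the regression target into a classification problem. The choice of threshold directly changes the class balance and therefore the accuracy figures reported later.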

In [173]:
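from sklearn.linear_model import LinearRegression

# Ordinary least squares fit on the continuous target.
# score() returns R^2, computed here on the same data used for fitting.
reg = LinearRegression().fit(X, y)
print('score',reg.score(X, y))

print('coef',reg.coef_)

print('intercept',reg.intercept_)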


score 0.5199851921874148
coef [ -9.70008277 -11.87495492  -2.44363435  -9.8608163   -1.42663548
   2.44321436  13.37793148 -11.57726111  -9.70710084  -9.60651188
  -9.28186174  18.07401166  10.71284028  -6.69199241  -7.70420839
  -8.67379267   6.41908892  -1.39250912   1.6170815   -2.20294001
  -8.05080514  -5.84209157  10.73363884   8.44313168  -7.07210626
   6.41958542   5.45297705 -14.23413357  -6.88193596  -2.59407165
  15.42744332  -7.40056721  12.63989858  10.57200418  -8.78373667
   4.78258816  -7.77172032 -10.59971315 -10.19752238   8.93129777
  20.4121927    2.93911064   0.49053308  -2.28324198  -2.99125062
 -12.62796662  10.235657    17.70491055  -6.79971826  -9.59238599
   1.01507628  17.13065248  28.26380286   9.17025452   2.15618392
 -11.20913995  -6.99789203  -1.22212716  -8.97356611   4.01829795
  -2.55195568   0.36056788   3.08512011  -6.40327784   4.68195872
 -10.0427215   -6.81913426  -9.03907051 -11.38595178   2.60423141
  -0.88252794  -1.63403359  -5.25644104  12.58

# 4. [20 pts] Evaluate (and report) the model performance(s) using some of the standard techniques (e.g. 80-20 split, 10-fold cross validation, etc.).

### 80-20 split has accuracy of `0.8909058231488138`

In [None]:
X_train, X_test, y_train, expected = train_test_split(X, y, test_size = 0.2, random_state=0)
pipe_lr.fit(X_train, y_train)
y_pred = pipe_lr.predict(X_test)

In [None]:
print(f'Accuracy for classifier {pipe_lr}:\n{metrics.accuracy_score(expected, y_pred)}\n')

### 10-fold cross validation has mean accuracy of `0.6949676491732568`

In [None]:
from sklearn.model_selection import StratifiedKFold

# Refit the classifier on each of the 10 stratified folds and record its accuracy
kfold = StratifiedKFold(n_splits=10).split(X, y)
scores = []
for k, (train, test) in enumerate(kfold):
    pipe_lr.fit(X[train], y[train])
    score = pipe_lr.score(X[test], y[test])
    scores.append(score)
    print('Fold: %2d, Acc: %.3f' % (k+1, score))
np.mean(scores)
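The manual loop above can also be written with `cross_val_score`. This sketch runs on synthetic data with a plain `LogisticRegression`, since the real `pipe_lr` is not shown. The shuffling and `random_state` are assumptions added for illustration and are not part of the notebook's setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic binary problem standing in for the encoded dataset
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)

# ASSUMPTION: shuffle before splitting so folds are not tied to row order
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv)

print(cv_scores.mean(), cv_scores.std())
```

The rows in `master.csv` appear to be ordered by country, so unshuffled folds may hold out whole blocks of countries. That may be one reason the 10-fold mean is lower than the 80-20 result, which uses a shuffled split.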


# 5. [20 pts] Using your classifier model, what is the predicted category of your dependent variable for the input: "year=2000, generation=Generation X, age=20, gender=male"?

In [None]:
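# Only the fields given in the question are set (age 20 falls in the '15-24 years' band);
# the remaining columns are left as NaN and filled by the imputer below
query = { 'year':[2000],
         'generation':['Generation X'],
         'age':['15-24 years'],
         'sex':['male'] 
}

df_input = pd.DataFrame(query,columns = ['country', 'year', 'sex', 'age', 'population', 'HDI for year',
       'gdp_per_capita ($)', 'generation'])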

In [None]:
df_input

#### Imputation

In [None]:
imp.transform(df_input)

In [None]:
df_input[:] = imp.transform(df_input)

In [None]:
df_input

#### Standardization

In [None]:
df_input[['year', 'population', 'HDI for year', 'gdp_per_capita ($)']] = scaler.transform(df_input[['year', 'population', 'HDI for year', 'gdp_per_capita ($)']])

#### One-hot

In [None]:
X_input = df_input[['country','year','sex','age','population','HDI for year','gdp_per_capita ($)','generation']].values

# Reuse the ColumnTransformer fitted above so the query has the same 119 columns as X
X_input = c_transf.transform(X_input).astype(float)

In [None]:
df_input[['country','year','sex','age','population','HDI for year','generation']]

## ANSWER: 

Prediction is `high suicide rate`

In [None]:
pipe_lr.predict(X_input[0].reshape(1, -1))

# 6. [20 pts bonus] Using your (perhaps a different?) model, what is the actual probability of a "Generation X 20-year-old male living in a country with 40000 gdp_per_capita" would commit suicide?

In [None]:
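# Only the fields given in the question are set (age 20 falls in the '15-24 years' band);
# the remaining columns are left as NaN and filled by the imputer below
query = {'generation':['Generation X'],
         'gdp_per_capita ($)':[40000],
         'age':['15-24 years'],
         'sex':['male'] 
}

df_input = pd.DataFrame(query,columns = ['country', 'year', 'sex', 'age', 'population', 'HDI for year',
       'gdp_per_capita ($)', 'generation'])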

In [None]:
df_input

### Imputation

In [None]:
imp.transform(df_input)

In [None]:
df_input[:] = imp.transform(df_input)

In [None]:
df_input

#### Standardization

In [None]:
df_input[['year', 'population', 'HDI for year', 'gdp_per_capita ($)']] = scaler.transform(df_input[['year', 'population', 'HDI for year', 'gdp_per_capita ($)']])

#### One-hot

In [None]:
X_input = df_input[['country','year','sex','age','population','HDI for year','gdp_per_capita ($)','generation']].values

# Reuse the ColumnTransformer fitted above so the query has the same 119 columns as X
X_input = c_transf.transform(X_input).astype(float)

In [None]:
df_input[['country','year','sex','age','population','HDI for year','generation']]

#### Prediction Probability

In [None]:
pipe_lr.predict_proba(X_input[0].reshape(1, -1))

In [None]:
pipe_lr.predict(X_input[0].reshape(1, -1))

## ANSWER: 
The predicted probability of `high suicide rate` is thus `0.90675194`.
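When reading a value out of `predict_proba`, it helps to pair each column with its class label, since the columns follow the order of `classes_`. The sketch below uses synthetic data and a hypothetical `low suicide rate` label; it is not the notebook's fitted model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic two-class problem with string labels
X_demo = rng.normal(size=(200, 3))
y_demo = np.where(X_demo[:, 0] > 0, "high suicide rate", "low suicide rate")

clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

x_new = np.array([[2.0, 0.0, 0.0]])
proba = clf.predict_proba(x_new)[0]

# predict_proba columns follow clf.classes_, so pair them up explicitly
by_class = dict(zip(clf.classes_, proba))

print(by_class)
print(clf.predict(x_new)[0])
```

Looking up the probability by label avoids relying on a column position, which depends on how the class labels sort.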