# Assignment 6  
## Applied Machine Learning

Andrew Chan 
EBE869

This assignment uses the `Suicide Rates Overview 1985 to 2016` dataset from Kaggle: https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016. It assumes the dataset has been downloaded as `master.csv` into the same directory as this notebook.

In [23]:
import pandas as pd
import math
import numpy as np

# Locate and load the data file
df = pd.read_csv('master.csv')

# Sanity check
print(f'N rows={len(df)}, M columns={len(df.columns)}')
df.head()

N rows=27820, M columns=12


Unnamed: 0,country,year,sex,age,suicides_no,population,suicides/100k pop,country-year,HDI for year,gdp_for_year ($),gdp_per_capita ($),generation
0,Albania,1987,male,15-24 years,21,312900,6.71,Albania1987,,2156624900,796,Generation X
1,Albania,1987,male,35-54 years,16,308000,5.19,Albania1987,,2156624900,796,Silent
2,Albania,1987,female,15-24 years,14,289700,4.83,Albania1987,,2156624900,796,Generation X
3,Albania,1987,male,75+ years,1,21800,4.59,Albania1987,,2156624900,796,G.I. Generation
4,Albania,1987,male,25-34 years,9,274300,3.28,Albania1987,,2156624900,796,Boomers


In [24]:
df.columns.values

array(['country', 'year', 'sex', 'age', 'suicides_no', 'population',
       'suicides/100k pop', 'country-year', 'HDI for year',
       ' gdp_for_year ($) ', 'gdp_per_capita ($)', 'generation'],
      dtype=object)

Convert the ` gdp_for_year ($) ` column (note the padded column name) from a comma-separated string to float.

In [25]:
df[' gdp_for_year ($) ']

0         2,156,624,900
1         2,156,624,900
2         2,156,624,900
3         2,156,624,900
4         2,156,624,900
              ...      
27815    63,067,077,179
27816    63,067,077,179
27817    63,067,077,179
27818    63,067,077,179
27819    63,067,077,179
Name:  gdp_for_year ($) , Length: 27820, dtype: object

In [26]:
df[' gdp_for_year ($) '] = df[' gdp_for_year ($) '].str.replace(',', '')

In [27]:
df[' gdp_for_year ($) '] = df[' gdp_for_year ($) '].astype(float) 

In [28]:
df[' gdp_for_year ($) '] 

0        2.156625e+09
1        2.156625e+09
2        2.156625e+09
3        2.156625e+09
4        2.156625e+09
             ...     
27815    6.306708e+10
27816    6.306708e+10
27817    6.306708e+10
27818    6.306708e+10
27819    6.306708e+10
Name:  gdp_for_year ($) , Length: 27820, dtype: float64
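
The conversion above can also be written as a single step. The following is a minimal sketch on a toy frame, not the notebook's own cell; the frame `raw` and the choice to strip the column name are assumptions made for illustration.

```python
import pandas as pd

# Toy frame mimicking the raw column, including its padded name (illustrative data).
raw = pd.DataFrame({' gdp_for_year ($) ': ['2,156,624,900', '63,067,077,179']})

# Optional: strip stray whitespace from column names.
raw.columns = raw.columns.str.strip()

# Remove the thousands separators as literal text, then convert to float.
gdp = pd.to_numeric(raw['gdp_for_year ($)'].str.replace(',', '', regex=False)).astype(float)
print(gdp.tolist())
```

If the column names are stripped this way, every later cell would have to refer to `'gdp_for_year ($)'` without the padding; the rest of this notebook keeps the padded name.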

# Preprocessing (from Module 3)

The assignment asks us to "get back to the pre-processed dataset Suicide Rates Overview 1985 to 2016 file", so we repeat the preprocessing steps here:

## Preprocessing Steps:

### Convert dependent variable to binary values

In [29]:
suicides_mean = df["suicides/100k pop"].mean()
suicides_mean

12.816097411933894

We will modify `suicides/100k pop` as follows:  
* All values strictly larger than `suicides_mean` will be labelled `high suicide rate` 
* All values smaller than or equal to `suicides_mean` will be labelled `low suicide rate` 

In [30]:
def isHigh(x, mean_val):
    if x > mean_val:
        return 'high suicide rate'
    else:
        return 'low suicide rate'
df['suicides/100k pop'] = df['suicides/100k pop'].apply(isHigh, args=[suicides_mean])
df['suicides/100k pop'].value_counts()

low suicide rate     19061
high suicide rate     8759
Name: suicides/100k pop, dtype: int64
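
The same thresholding can be expressed without a row-wise `apply`. This is a sketch on made-up rates, assuming the same rule as the cell above (strictly greater than the mean is `high`, everything else is `low`); the names `rates` and `labels` are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative rates; the mean of these five values is 14.48.
rates = pd.Series([6.71, 5.19, 30.0, 12.0, 18.5], name='suicides/100k pop')
threshold = rates.mean()

# Vectorised binarisation: a value equal to the mean falls into the low class.
labels = pd.Series(
    np.where(rates > threshold, 'high suicide rate', 'low suicide rate'),
    index=rates.index,
    name=rates.name,
)
print(labels.value_counts())
```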

In [31]:
df.head()

Unnamed: 0,country,year,sex,age,suicides_no,population,suicides/100k pop,country-year,HDI for year,gdp_for_year ($),gdp_per_capita ($),generation
0,Albania,1987,male,15-24 years,21,312900,low suicide rate,Albania1987,,2156625000.0,796,Generation X
1,Albania,1987,male,35-54 years,16,308000,low suicide rate,Albania1987,,2156625000.0,796,Silent
2,Albania,1987,female,15-24 years,14,289700,low suicide rate,Albania1987,,2156625000.0,796,Generation X
3,Albania,1987,male,75+ years,1,21800,low suicide rate,Albania1987,,2156625000.0,796,G.I. Generation
4,Albania,1987,male,25-34 years,9,274300,low suicide rate,Albania1987,,2156625000.0,796,Boomers


### Removal of redundant features

#### Remove `suicides_no`

`suicides_no` is highly correlated with `suicides/100k pop`. Since `suicides/100k pop` was chosen as the dependent variable and is derived from `suicides_no` as $$\frac{\text{suicides\_no}}{\text{population}/100000}$$ it does not make sense to keep `suicides_no` as a feature: together with `population` it would reproduce the target directly.

Another reason why I chose to keep `suicides/100k pop` is because it helps **normalize suicides by population**. A large country may have many suicides because there are more people that live there, but a smaller country may have more suicides per population. 

In [32]:
df = df.drop(['suicides_no'], axis=1)
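
As a quick check of the formula above, the first row of the raw data (21 suicides in a population of 312900) can be worked through by hand. The variable names are illustrative.

```python
# Values taken from the first row of the raw table shown earlier.
suicides_no = 21
population = 312900

# Rate per 100k = count divided by the population expressed in units of 100,000.
rate_per_100k = suicides_no / (population / 100_000)
print(round(rate_per_100k, 2))  # 6.71, matching the dataset's value for that row
```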

#### Remove `gdp_for_year ($)`

`gdp_for_year ($)` is highly correlated with `gdp_per_capita ($)`. Since `gdp_per_capita ($)` is kept as a feature and is derived as $$\frac{\text{gdp\_for\_year (\$)}}{\text{population}}$$ `gdp_for_year ($)` adds little new information and is dropped.


Another reason why I chose to keep `gdp_per_capita ($)` is because it helps **normalize GDP by population**. A large country may have more gross domestic product because there are more people that live there, but a smaller country may have more gdp per capita. 

In [33]:
df = df.drop([' gdp_for_year ($) '], axis=1)

#### Remove `country-year`

Since `country-year` is already derived from `country` and `year`, it is redundant to include it. Thus, we drop `country-year`.

I also chose to keep `country` and `year` because one-hot encoding them separately yields at most **N countries** + **M years** columns, versus up to **N x M** columns for `country-year`, which should help with computational speed. (In this notebook `year` is kept numeric, so it contributes a single column.)

In [34]:
df = df.drop(['country-year'], axis=1)
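
The column-count argument can be illustrated on a toy frame. This is only a sketch: the three countries and two years are made up, and both columns are cast to strings so that `pd.get_dummies` encodes each of them.

```python
import pandas as pd

# Toy data: 3 countries x 2 years (illustrative values only).
toy = pd.DataFrame({
    'country': ['Albania', 'Albania', 'Brazil', 'Brazil', 'Canada', 'Canada'],
    'year': [1987, 1988, 1987, 1988, 1987, 1988],
})
toy['country-year'] = toy['country'] + toy['year'].astype(str)

# Encoding the two columns separately gives N + M dummy columns.
separate = pd.get_dummies(toy[['country', 'year']].astype(str))
# Encoding the combined key gives one dummy per observed (country, year) pair.
combined = pd.get_dummies(toy[['country-year']])

print(separate.shape[1], combined.shape[1])  # 5 6
```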

### One-hot encoding of all nominal features

In [37]:
df_no_target = pd.get_dummies(df[['country','year','sex','age','population','HDI for year','gdp_per_capita ($)','generation']], drop_first = True)

### Mean imputation

All NaN values are replaced with the mean of their column so that rows with missing entries are not discarded.

In [35]:
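# Restrict the mean to numeric columns; recent pandas versions raise on string columns otherwise
df = df.fillna(df.mean(numeric_only=True))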

In [36]:
final_features = df.columns

In [81]:
final_features

Index(['country', 'year', 'sex', 'age', 'population', 'suicides/100k pop',
       'HDI for year', 'gdp_per_capita ($)', 'generation'],
      dtype='object')

### Normalization and Standardization

In [38]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_no_target[['year', 'population', 'HDI for year', 'gdp_per_capita ($)']] = scaler.fit_transform(df[['year', 'population', 'HDI for year', 'gdp_per_capita ($)']])
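
The cell above fits the scaler on the full dataset before the train/test split. A common alternative, shown here only as a sketch on synthetic data and not as the method used in this notebook, is to fit the scaler on the training rows and reuse its statistics for the test rows. All names below (`X_num`, `toy_scaler`, and so on) are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic numeric features and labels (illustrative only).
rng = np.random.default_rng(0)
X_num = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
y_toy = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_num, y_toy, test_size=0.2, random_state=0)

# Mean and standard deviation are estimated from the training rows only.
toy_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = toy_scaler.transform(X_tr)
# The test rows are transformed with the training statistics.
X_te_scaled = toy_scaler.transform(X_te)
```

Fitting on the training rows alone keeps information about the test set out of the preprocessing step.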

In [53]:
df_no_target.head()

Unnamed: 0,year,population,HDI for year,gdp_per_capita ($),country_Antigua and Barbuda,country_Argentina,country_Armenia,country_Aruba,country_Australia,country_Austria,...,age_25-34 years,age_35-54 years,age_5-14 years,age_55-74 years,age_75+ years,generation_G.I. Generation,generation_Generation X,generation_Generation Z,generation_Millenials,generation_Silent
0,-1.683615,-0.391617,-2.819415e-14,-0.850864,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1,-1.683615,-0.39287,-2.819415e-14,-0.850864,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,1
2,-1.683615,-0.397548,-2.819415e-14,-0.850864,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3,-1.683615,-0.466035,-2.819415e-14,-0.850864,0,0,0,0,0,0,...,0,0,0,0,1,1,0,0,0,0
4,-1.683615,-0.401485,-2.819415e-14,-0.850864,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
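
The `HDI for year` values of about `-2.8e-14` in the rows above are what mean imputation followed by standardization produces: an imputed entry equals the column mean, so it maps to roughly zero. A small sketch with made-up HDI values:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative column with two missing entries.
hdi = pd.DataFrame({'HDI for year': [np.nan, 0.70, 0.80, np.nan, 0.90]})

# Mean imputation: the missing entries become the column mean (0.80 here).
hdi = hdi.fillna(hdi.mean(numeric_only=True))

# Standardisation subtracts the mean, so the imputed rows land at about 0.
scaled = StandardScaler().fit_transform(hdi)
print(scaled.ravel())
```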


In [54]:
df_no_target.columns.values

array(['year', 'population', 'HDI for year', 'gdp_per_capita ($)',
       'country_Antigua and Barbuda', 'country_Argentina',
       'country_Armenia', 'country_Aruba', 'country_Australia',
       'country_Austria', 'country_Azerbaijan', 'country_Bahamas',
       'country_Bahrain', 'country_Barbados', 'country_Belarus',
       'country_Belgium', 'country_Belize',
       'country_Bosnia and Herzegovina', 'country_Brazil',
       'country_Bulgaria', 'country_Cabo Verde', 'country_Canada',
       'country_Chile', 'country_Colombia', 'country_Costa Rica',
       'country_Croatia', 'country_Cuba', 'country_Cyprus',
       'country_Czech Republic', 'country_Denmark', 'country_Dominica',
       'country_Ecuador', 'country_El Salvador', 'country_Estonia',
       'country_Fiji', 'country_Finland', 'country_France',
       'country_Georgia', 'country_Germany', 'country_Greece',
       'country_Grenada', 'country_Guatemala', 'country_Guyana',
       'country_Hungary', 'country_Iceland', 'country_

## Final list of features selected

* 'country'
* 'year'
* 'sex'
* 'age'
* 'population'
* 'suicides/100k pop'
* 'HDI for year' 
* 'gdp_per_capita ($)'
* 'generation'

# 1. [20 pts] Due to the severity of this real-world crisis, what information would be the most important one to "machine learn"? Can it be learned?

The most important target variable to machine learn is `suicides/100k pop`, since we would like to predict low versus high suicide rates and to determine which factors may lead to higher rates. Predicting this variable will allow countries to issue multilateral policies to help reduce future deaths. For example, if it is found that `age` is a large factor in suicide rates, countries can focus preventative social wellness programs on the affected age groups. 

# 1. [20 pts] What is the dependent variable you decided? Why?

The dependent variable is generated from `suicides/100k pop`. We use it rather than `suicides_no` because it is normalized by population instead of being an absolute count of suicides. This keeps the model from skewing towards larger countries, which may naturally have more suicides. 

# 2. [20 pts] Set the dependent variable into two categories based on a defensible criteria. 
(Hint: skirts of the probability density function)

To attempt a machine learning solution we will turn this into a binary classification problem of `low suicide rate` versus `high suicide rate`. Framing the problem this way will allow us to train a classifier. 

The steps to take will be the following:
1. We will use the dependent variable `suicides/100k pop`
2. Then calculate the mean of this feature `suicides_mean`
3. Convert the data into a binary classification problem such that   
    a. If the value is greater than `suicides_mean`, convert it to **`high suicide rate`**  
    b. If the value is less than or equal to `suicides_mean`, convert it to **`low suicide rate`**   

# 3. [20 pts] Develop your classification model(s) to solve your defined problem.

We will train a linear support vector classifier (`LinearSVC`) using an 80/20 train/test split.

Set the x and y values:

In [39]:
df_y = df['suicides/100k pop']
df_X = df_no_target
y = df_y.values
X = df_X.values

In [40]:
from sklearn.model_selection import train_test_split

In [41]:
X_train, X_test, y_train, expected = train_test_split(X, y, test_size = 0.2, random_state=0)

Use `LinearSVC`, a linear SVM that handles multiclass problems with a one-vs-rest scheme (the problem here is binary, so a single decision boundary is fit):

In [42]:
from sklearn.svm import LinearSVC
from sklearn import metrics
clf = LinearSVC(random_state=0,max_iter=4000)
clf.fit(X_train, y_train)

LinearSVC(max_iter=4000, random_state=0)

In [43]:
predicted = clf.predict(X_test)

---

Try LogisticRegression

In [44]:
from sklearn.linear_model import LogisticRegression
# L1-penalised logistic regression; class_weight='balanced' reweights classes by inverse frequency
pipe_lr = LogisticRegression(random_state=14,
               penalty='l1',
               solver='liblinear',
               class_weight='balanced',
               max_iter=10000)
pipe_lr.fit(X_train, y_train)
y_pred = pipe_lr.predict(X_test)

In [48]:
X_test.shape

(5564, 115)

In [52]:
pipe_lr.predict_proba(X_test[0].reshape(1, -1))

array([[0.35616371, 0.64383629]])
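
The columns of `predict_proba` follow the order of the fitted model's `classes_` attribute, which scikit-learn stores in sorted order. With the two string labels used in this notebook, the first column therefore corresponds to `high suicide rate` and the second to `low suicide rate`. A sketch on synthetic data (the features and the model name are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic features; the label depends on the sign of the first feature.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.where(X_toy[:, 0] > 0, 'high suicide rate', 'low suicide rate')

model = LogisticRegression(solver='liblinear').fit(X_toy, y_toy)

# classes_ gives the label belonging to each predict_proba column.
print(model.classes_)
proba = model.predict_proba(X_toy[:1])
print(dict(zip(model.classes_, proba[0])))
```

Pairing `classes_` with the probability row, as in the last line, avoids guessing which column belongs to which label.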

In [46]:
print(f'Classification report for classifier {pipe_lr}:\n{metrics.accuracy_score(expected, y_pred)}\n')

Classification report for classifier LogisticRegression(class_weight='balanced', max_iter=10000, penalty='l1',
                   random_state=14, solver='liblinear'):
0.8980948957584471



# 4. [20 pts] Evaluate (and report) the model performance(s) using some of the standard techniques (e.g. 80-20 split, 10-fold cross validation, etc.).

In [45]:
print(f'Classification report for classifier {clf}:\n{metrics.accuracy_score(expected, predicted)}\n')

Classification report for classifier LinearSVC(max_iter=4000, random_state=0):
0.9069015097052481



# 5. [20 pts] Using your classifier model, what is the predicted category of your dependent variable for the input: "year=2000, generation=Generation X, age=20, gender=male"?

In [89]:
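# Build a one-row frame for the query; columns absent from the dict become NaN.
# The dict is named input_row so that it does not shadow the built-in input().
input_row = {'year': [2000],
             'generation': ['Generation X'],
             'age': [20],
             'sex': ['male']}

df_input = pd.DataFrame(input_row, columns = ['country', 'year', 'sex', 'age', 'population', 'suicides/100k pop',
       'HDI for year', 'generation'])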

In [90]:
df_input

Unnamed: 0,country,year,sex,age,population,suicides/100k pop,HDI for year,generation
0,,2000,male,20,,,,Generation X


In [91]:
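# Preview the row with numeric gaps filled by the training means.
# fillna returns a new frame; df_input itself is left unchanged here.
df_input.fillna(df.mean(numeric_only=True))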

Unnamed: 0,country,year,sex,age,population,suicides/100k pop,HDI for year,generation
0,,2000,male,20,1844794.0,,0.776601,Generation X


In [93]:
df_input_one_hot = pd.get_dummies(df_input[['country','year','sex','age','population','HDI for year','generation']], drop_first = True)

In [94]:
df_input_one_hot

Unnamed: 0,year,age
0,2000,20
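
The encoded row above has only two columns, so it cannot be passed to a model trained on 115 features. One way to line a single query up with the training layout is to encode it and then reindex it to the training columns. This is a sketch under assumptions, not the notebook's solution: the toy training frame is made up, and the age of 20 is mapped to the dataset's `15-24 years` bin.

```python
import pandas as pd

# Toy stand-in for the training frame (illustrative rows only).
train = pd.DataFrame({
    'year': [1987, 2000, 2010, 2015],
    'sex': ['male', 'female', 'male', 'female'],
    'age': ['15-24 years', '25-34 years', '15-24 years', '35-54 years'],
    'generation': ['Generation X', 'Boomers', 'Millenials', 'Generation X'],
})
train_encoded = pd.get_dummies(train, drop_first=True)

# Query row: a 20-year-old falls into the '15-24 years' bin.
query = pd.DataFrame({'year': [2000], 'sex': ['male'],
                      'age': ['15-24 years'], 'generation': ['Generation X']})

# Encode without drop_first, then align to the training columns;
# dummy columns the query lacks are filled with 0, extra ones are dropped.
query_encoded = pd.get_dummies(query).reindex(columns=train_encoded.columns, fill_value=0)
print(query_encoded)
```

In the full pipeline, the numeric columns of the aligned row would also need to go through the already fitted `scaler.transform` before calling `predict`.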


# 6. [20 pts bonus] Using your (perhaps a different?) model, what is the actual probability of a "Generation X 20-year-old male living in a country with 40000 gdp_per_capita" would commit suicide?