## 1. What information would be the most important one to "machine learn?"

The most important information for the model to learn is the set of factors that lead to an increased suicide rate. If the model can predict which countries will have the highest suicide rates, then we can study those countries to determine why their rates are so high and possibly remediate the situation. I believe there is a good chance this can be done successfully; however, I'm not sure the dataset in question fully captures the necessary information.

The problem I see is that this dataset gives only a limited view of the possible factors behind increased suicide rates. For instance, a culture might normalize suicide as an acceptable form of death; this would hopefully be captured in the 'country' feature, but it might be hard to isolate. Likewise, certain events can lead to an uptick in suicides, such as the publication of <ins>The Sorrows of Young Werther</ins> (which our model would never be able to predict) or the Great Crash of 1929 (which our model would only be able to account for after the fact).

## 2. How should the problem be set up?

I think this problem is best set up as a supervised regression problem that predicts the suicide rate of a country. In practice, this would be a model that takes the appropriate features as input and returns an expected suicide rate. I think this could reasonably be accomplished with a Random Forest model.

The only way I see to set this up as a classification problem is to pick some threshold that separates 'high suicide rate' from 'low suicide rate.' If we went that route, I think unsupervised learning techniques (clustering, for example) would be appropriate, since the threshold could then be found within the data instead of being set by us. The classification model would then flag countries that we should examine to determine why they have high or low suicide rates.
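
As a rough illustration of finding the threshold within the data, the sketch below splits a one-dimensional list of rates into two clusters with a tiny k-means loop (k = 2) and takes the midpoint between the two cluster centers as the cutoff. The function name `two_means_threshold` and the sample rates are made up for the example; they are not from the dataset, and k-means is only one of several clustering methods that could be used here.

```python
import numpy as np

def two_means_threshold(rates, n_iter=100):
    """Split 1-D values into 'low' and 'high' groups with k-means (k=2).

    Returns the midpoint between the two cluster centers, usable as a cutoff.
    Hypothetical helper for illustration only.
    """
    x = np.asarray(rates, dtype=float)
    # Start the two centers at the extremes of the data
    lo, hi = x.min(), x.max()
    for _ in range(n_iter):
        # Assign each value to the nearer center
        is_high = np.abs(x - hi) < np.abs(x - lo)
        if is_high.all() or not is_high.any():
            break  # degenerate case: every value landed in one cluster
        new_lo, new_hi = x[~is_high].mean(), x[is_high].mean()
        if new_lo == lo and new_hi == hi:
            break  # converged
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2.0

# Made-up rates (suicides per 100k), not taken from the dataset
sample_rates = [2.1, 3.4, 2.8, 4.0, 21.5, 25.2, 19.8]
threshold = two_means_threshold(sample_rates)
labels = ['high' if r > threshold else 'low' for r in sample_rates]
print(threshold, labels)
```

The cutoff here comes entirely from the spread of the values, which is the point of the unsupervised approach: nobody has to decide in advance what counts as "high."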

## 3. What should the dependent variable be?

Based on the information available in the dataset, I believe we should use *suicides/100k pop* as the dependent variable, as it captures the information we want out of the model (the suicide rate) and allows for comparison by country, sex, and age.
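
For reference, this rate is just the raw count normalized by population, which is what makes groups of different sizes comparable. A minimal sketch, using the first two Albania rows that appear in the `df.head()` output later in this notebook (the `rate` column name is an assumption made for the example):

```python
import pandas as pd

# Two rows copied from the df.head() output in this notebook
rows = pd.DataFrame({
    'suicides_no': [21, 16],
    'population': [312900, 308000],
})

# Suicides per 100,000 people: count / population * 100,000
rows['rate'] = rows['suicides_no'] / rows['population'] * 100_000
print(rows.round(2))
```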

## 4. Rank the variables to find some strong correlations between the independent variables and the dependent variable you decided.

The cell below prints the correlation vector for the suicide rate, followed by the correlation vectors of the best independent variables against every other numeric column.

Leaving out *suicides_no* (0.307, but it is the numerator of the rate itself) and *year* (-0.039), the variables I focus on are, in descending order of correlation with the suicide rate: *HDI for year* (0.074), *gdp_for_year (\$)* (0.025), *population* (0.008), and *gdp_per_capita (\$)* (0.002). All of these correlations are weak. I don't think the other variables will be able to tell us much about the reason behind the suicide rate, but I do think they are important to the understanding of the suicide rate.

Of interest, *HDI for year* has the strongest correlation with the suicide rate of these four (though at 0.074 it is still weak), while the GDP features show almost none; however, *gdp_per_capita (\$)* correlates strongly with *HDI for year* (0.771), and *gdp_for_year (\$)* correlates strongly with *population* (0.711).

In [1]:
# Load libraries and dataset
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy.stats import gaussian_kde

from os.path import isfile

sns.set(style='ticks', color_codes=True)

data_path = './datasets/Suicide_Rates.csv'

# Stop early if the file is missing; the cleanup below needs the real columns
if not isfile(data_path):
    raise FileNotFoundError(
        "Dataset not found. Please check that the dataset exists and the path is correct."
    )

df = pd.read_csv(data_path)

    
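# gdp_for_year is recorded as comma-separated strings and not ints, so fix that real quick
# (the original column name also carries stray spaces, hence the rename)
df = df.rename(columns={' gdp_for_year ($) ': 'gdp_for_year ($)'})

df['gdp_for_year ($)'] = df['gdp_for_year ($)'].str.replace(',', '')
# int64 explicitly: these values are too large for a 32-bit int
df['gdp_for_year ($)'] = df['gdp_for_year ($)'].astype('int64')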
    
# Compute the correlation matrix once, over numeric columns only
# (newer pandas raises an error if text columns are passed to corr())
corr = df.corr(numeric_only=True)

print("Dependent variable:\n", corr['suicides/100k pop'], '\n\n')
    
print("Correlations:\n")

print(corr['HDI for year'], '\n\n')

print(corr['population'], '\n\n')

print(corr['gdp_for_year ($)'], '\n\n')

print(corr['gdp_per_capita ($)'])

Dependent variable:
 year                 -0.039037
suicides_no           0.306604
population            0.008285
suicides/100k pop     1.000000
HDI for year          0.074279
gdp_for_year ($)      0.025240
gdp_per_capita ($)    0.001785
Name: suicides/100k pop, dtype: float64 


Correlations:

year                  0.366786
suicides_no           0.151399
population            0.102943
suicides/100k pop     0.074279
HDI for year          1.000000
gdp_for_year ($)      0.305193
gdp_per_capita ($)    0.771228
Name: HDI for year, dtype: float64 


year                  0.008850
suicides_no           0.616162
population            1.000000
suicides/100k pop     0.008285
HDI for year          0.102943
gdp_for_year ($)      0.710697
gdp_per_capita ($)    0.081510
Name: population, dtype: float64 


year                  0.094529
suicides_no           0.430096
population            0.710697
suicides/100k pop     0.025240
HDI for year          0.305193
gdp_for_year ($)      1.000000
gdp_per_capita ($)    0.303405
Name: gdp_for_year ($), dtype: float64 


year                  0.339134
suicides_no           0.061330
population            0.081510
suicides/100k pop     0.001785
HDI for year          0.771228
gdp_for_year ($)      0.303405
gdp_per_capita ($)    1.000000
Name: gdp_per_capita ($), dtype: float64
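
To make the ranking easier to read, the correlations with the dependent variable can be sorted by absolute value. The sketch below copies the numbers printed above into a Series by hand (in the notebook itself the same sort could be applied to the *suicides/100k pop* correlation column directly); the variable names are made up for the example.

```python
import pandas as pd

# Correlations with 'suicides/100k pop', copied from the output above
target_corr = pd.Series({
    'year': -0.039037,
    'suicides_no': 0.306604,
    'population': 0.008285,
    'HDI for year': 0.074279,
    'gdp_for_year ($)': 0.025240,
    'gdp_per_capita ($)': 0.001785,
})

# Rank by magnitude so negative correlations are not pushed to the bottom
order = target_corr.abs().sort_values(ascending=False).index
ranked = target_corr.reindex(order)
print(ranked)
```

Sorting on the absolute value keeps the sign visible in the result while still ordering by strength.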


## 5. Pre-process the dataset and list the major features you want to use.

The cells below clean the data as best I could and then encode the categorical features as one-hot vectors or integers.

After clean-up, the features I would use are *country* (integer encoded), *year*, *suicides_no*, *population*, *HDI for year*, *gdp_for_year (\$)*, *sex* (one-hot encoded), *age* (one-hot encoded), and *generation* (integer encoded).

In [2]:
# Duplicates? (Nope, but let's be sure)
dups = df[df.duplicated()]

if not dups.empty:
    # Show the repeated rows, then keep only the first occurrence of each
    print(dups)
    df = df.drop_duplicates()

In [3]:
# Missing HDI values? (This cell produces an expected and handled warning)

# print(df.isnull().any()) indicates NaNs in 'HDI for year'

# I think we should fill in each country's own mean where possible, instead of the average
# for the whole table; however, for countries with no HDI reporting (like Russian Federation)
# I want to fill in the global mean, computed before we mess with any values

HDI_mean = df['HDI for year'].mean()

countries = df['country'].unique()

# probably a better way to do this
for c in countries:
    # get a country's mean HDI
    selection = df[df['country'] == c]
    HDI = np.mean(selection['HDI for year'])
    
    # apply to NaNs in the selection
    selection['HDI for year'] = selection['HDI for year'].fillna(HDI)
    
    # drop the country and replace with the selection (Impute)
    df.drop(df[df['country'] == c].index, inplace=True)
    df = pd.concat([df, selection])


# Replace the remaining NaNs with global mean from before
df['HDI for year'] = df['HDI for year'].fillna(HDI_mean)

# check that we've eliminated NaNs
display(df.isnull().any())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  selection['HDI for year'] = selection['HDI for year'].fillna(HDI)


country               False
year                  False
sex                   False
age                   False
suicides_no           False
population            False
suicides/100k pop     False
country-year          False
HDI for year          False
gdp_for_year ($)      False
gdp_per_capita ($)    False
generation            False
dtype: bool
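
The per-country loop above could probably be written more compactly. One possibility, sketched here on a small made-up frame (the `toy` data is invented for illustration), is `groupby(...).transform('mean')`, which returns each country's mean aligned to the original rows so it can be passed straight to `fillna`. It assigns to the column directly instead of to a slice, so it should not trigger the warning shown above.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration; country 'C' has no HDI reported at all
toy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B', 'B', 'C'],
    'HDI for year': [0.6, np.nan, 0.8, 0.5, np.nan, np.nan],
})

# Global mean taken before any values are filled in
global_mean = toy['HDI for year'].mean()

# Per-country mean, broadcast back to every row of that country
country_mean = toy.groupby('country')['HDI for year'].transform('mean')

# Fill with the country's mean first, then fall back to the global mean
toy['HDI for year'] = toy['HDI for year'].fillna(country_mean).fillna(global_mean)
print(toy)
```

A country with no reported values gets a NaN group mean, which is why the second `fillna` with the global mean is still needed.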

In [4]:
# Which countries, and when, had 0 reported suicides for an entire year?

# I assume this is entirely missing data and not actually the case that they had 
# zero suicides that year (we can hope that I'm wrong though).
# I was going to impute these, but the more I worked towards that the less I saw the point, as
# some of the zeros were quite probably true values. So all-zero country-years are dropped instead.
country_years = df['country-year'].unique()

for cy in country_years:
    selection = df[df['country-year'] == cy]
    
    if np.mean(selection['suicides_no']) == 0.0:
        index = df[df['country-year'] == cy].index
        df.drop(index, inplace=True)
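
The same filter can also be written without the Python loop. A minimal sketch on made-up rows (the `toy` frame and its values are invented): `groupby('country-year')['suicides_no'].transform('sum')` gives every row its group total, and rows whose group total is zero are dropped. Since counts cannot be negative, a zero sum is equivalent to the zero mean checked above.

```python
import pandas as pd

# Made-up rows: 'X1990' and 'Y1990' report no suicides at all, 'X1991' reports some
toy = pd.DataFrame({
    'country-year': ['X1990', 'X1990', 'X1991', 'X1991', 'Y1990'],
    'suicides_no': [0, 0, 3, 0, 0],
})

# Total suicides for each country-year, repeated on every row of that group
totals = toy.groupby('country-year')['suicides_no'].transform('sum')

# Keep only rows that belong to a country-year with at least one reported suicide
toy = toy[totals > 0]
print(toy)
```

Note that the individual zero in the `X1991` group survives, because only groups that are zero across the board are removed.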

In [5]:
# Get rid of derivative features (country-year, suicides/100k pop, 'gdp_per_capita ($)')
df.drop(columns=['country-year', 'suicides/100k pop', 'gdp_per_capita ($)'], inplace=True)

df.head()

| | country | year | sex | age | suicides_no | population | HDI for year | gdp_for_year ($) | generation |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Albania | 1987 | male | 15-24 years | 21 | 312900 | 0.673 | 2156624900 | Generation X |
| 1 | Albania | 1987 | male | 35-54 years | 16 | 308000 | 0.673 | 2156624900 | Silent |
| 2 | Albania | 1987 | female | 15-24 years | 14 | 289700 | 0.673 | 2156624900 | Generation X |
| 3 | Albania | 1987 | male | 75+ years | 1 | 21800 | 0.673 | 2156624900 | G.I. Generation |
| 4 | Albania | 1987 | male | 25-34 years | 9 | 274300 | 0.673 | 2156624900 | Boomers |


In [6]:
df.dtypes

country              object
year                  int64
sex                  object
age                  object
suicides_no           int64
population            int64
HDI for year        float64
gdp_for_year ($)      int64
generation           object
dtype: object

In [7]:
# From module notebook

# Check unique levels and see whether any marker is used for a missing level
for col in df.columns:
    # np.object was removed from NumPy; the builtin object is the equivalent
    if df[col].dtype == object:
        print(col, df[col].unique())
        
# We're fine though.

country ['Albania' 'Antigua and Barbuda' 'Argentina' 'Armenia' 'Aruba' 'Australia'
 'Austria' 'Azerbaijan' 'Bahamas' 'Bahrain' 'Barbados' 'Belarus' 'Belgium'
 'Belize' 'Bosnia and Herzegovina' 'Brazil' 'Bulgaria' 'Cabo Verde'
 'Canada' 'Chile' 'Colombia' 'Costa Rica' 'Croatia' 'Cuba' 'Cyprus'
 'Czech Republic' 'Denmark' 'Ecuador' 'El Salvador' 'Estonia' 'Fiji'
 'Finland' 'France' 'Georgia' 'Germany' 'Greece' 'Grenada' 'Guatemala'
 'Guyana' 'Hungary' 'Iceland' 'Ireland' 'Israel' 'Italy' 'Jamaica' 'Japan'
 'Kazakhstan' 'Kiribati' 'Kuwait' 'Kyrgyzstan' 'Latvia' 'Lithuania'
 'Luxembourg' 'Macau' 'Maldives' 'Malta' 'Mauritius' 'Mexico' 'Mongolia'
 'Montenegro' 'Netherlands' 'New Zealand' 'Nicaragua' 'Norway' 'Oman'
 'Panama' 'Paraguay' 'Philippines' 'Poland' 'Portugal' 'Puerto Rico'
 'Qatar' 'Republic of Korea' 'Romania' 'Russian Federation' 'Saint Lucia'
 'Saint Vincent and Grenadines' 'San Marino' 'Serbia' 'Seychelles'
 'Singapore' 'Slovakia' 'Slovenia' 'South Africa' 'Spain' 'Sri Lanka'


In [8]:
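# Encode categorical features for flexibility of model

# taken from module notebook
# pandas get_dummies function is the one-hot-encoder
def encode_onehot(_df, feat):
    # dtype=int keeps the 0/1 output shown below (newer pandas defaults to booleans).
    # The module notebook's .max(level=0, axis=1) step is dropped: the `level` argument
    # was removed in pandas 2.0, and one column's dummies have no duplicate names to merge.
    _df2 = pd.get_dummies(_df[feat], prefix='', prefix_sep='', dtype=int).add_prefix(feat + ' - ')
    df3 = pd.concat([_df, _df2], axis=1)
    df3 = df3.drop([feat], axis=1)
    return df3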

# assigns integers to nominal data
# returns the encoded dataframe and a decoder for the encoded feature
def encode_integer(_df, feat):
    _df2 = _df.copy()
    
    # index each feature type in order of first appearance
    feat_index = {ft: idx for idx, ft in enumerate(_df2[feat].unique())}
    
    # replace every value with its integer code in one pass
    _df2[feat] = _df2[feat].map(feat_index)
        
    # reverse the index into a decoder
    feat_decoder = {idx: ft for ft, idx in feat_index.items()}

    # return the encoded dataframe and the decoder dictionary
    return _df2, feat_decoder
    
# adapted from module notebook
# Get nominal variables (except 'country' and 'generation', which get encode_integer() to reduce feature count)
nominals = []
skip = ['country', 'generation']
for f in list(df.columns.values):
    # np.object was removed from NumPy; the builtin object is the equivalent
    if df[f].dtype == object and f not in skip:
        nominals.append(f)

# Encode the one-hots
df_o = df.copy()
for nom in nominals:
    df_o = encode_onehot(df_o, nom)
    
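    # collect only the one-hot columns just created for this feature
    # (a prefix match avoids picking up unrelated columns that merely contain the name)
    cols = [f for f in df_o.columns if f.startswith(nom + ' - ')]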
            
    display(df_o[cols][:10])
      
# Encode 'country' and 'generation' as integers
# I sorta feel like I should do this with sex as well, as it's binary in nature
df_i = df_o.copy()
df_i, country_decoder = encode_integer(df_i, 'country') 
df_i, generation_decoder = encode_integer(df_i, 'generation')

# Show the final dataframe
display(df_i.head())

| | sex - female | sex - male |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 0 | 1 |
| 2 | 1 | 0 |
| 3 | 0 | 1 |
| 4 | 0 | 1 |
| 5 | 1 | 0 |
| 6 | 1 | 0 |
| 7 | 1 | 0 |
| 8 | 0 | 1 |
| 9 | 1 | 0 |


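| | age - 15-24 years | age - 25-34 years | age - 35-54 years | age - 5-14 years | age - 55-74 years | age - 75+ years |
|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 1 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 1 |
| 6 | 0 | 0 | 1 | 0 | 0 | 0 |
| 7 | 0 | 1 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 1 | 0 |
| 9 | 0 | 0 | 0 | 1 | 0 | 0 |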


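| | country | year | suicides_no | population | HDI for year | gdp_for_year ($) | generation | sex - female | sex - male | age - 15-24 years | age - 25-34 years | age - 35-54 years | age - 5-14 years | age - 55-74 years | age - 75+ years |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1987 | 21 | 312900 | 0.673 | 2156624900 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1987 | 16 | 308000 | 0.673 | 2156624900 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 0 | 1987 | 14 | 289700 | 0.673 | 2156624900 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 1987 | 1 | 21800 | 0.673 | 2156624900 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 1987 | 9 | 274300 | 0.673 | 2156624900 | 3 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |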
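
The decoders returned by `encode_integer` make it possible to map the integer codes back to their labels, for example when reading model output. A short sketch, assuming a decoder shaped like the one built above; the dictionary is written out by hand from the first five rows shown in this notebook, so the names here are stand-ins and not the notebook's own objects:

```python
import pandas as pd

# Hand-written stand-in for generation_decoder, based on the rows shown above
generation_decoder_example = {
    0: 'Generation X',
    1: 'Silent',
    2: 'G.I. Generation',
    3: 'Boomers',
}

# Encoded 'generation' values of the first five rows
encoded = pd.Series([0, 1, 0, 2, 3], name='generation')

# Series.map looks each integer code up in the dictionary
decoded = encoded.map(generation_decoder_example)
print(decoded.tolist())
```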


## 6. Devise a classification problem and present a prototype model.

I'm sort of out of time for prototyping a model, but I think what I would like to do is either classify segments of proteins by secondary structure ($\alpha$-helices, $\beta$-sheets, or unfolded chains) or classify drugs into classes (such as catecholamines, barbiturates, etc.) based on IUPAC nomenclature. I think the latter should be possible, as IUPAC names encode the structure of the molecule, but it may be a little advanced for the time being.

SVMs have been used successfully to classify proteins, so I think that would be a good place to start for either of the classification problems above (if I could figure out the best way to feed one IUPAC names).