Continuing with the previous machine learning problem, let's return to the pre-processed Suicide Rates Overview 1985 to 2016 dataset. We would like a machine learning model that predicts the suicide rate, 'suicides/100k pop'.

In [1]:
## PREPROCESS DATASET
import numpy as np 
import pandas as pd 

## SET RANDOM SEED
np.random.seed(0)

## LOAD DATASET
df = pd.read_csv("master.csv")

## DROP POOR FEATURES AND GET ONE-HOT ENCODINGS
clean_df = df.drop(columns=[' gdp_for_year ($) ', 'country-year','HDI for year', 'country'])
clean_df = pd.get_dummies(clean_df, columns=['sex', 'age', 'generation'])

## REMOVE VARIABLES FROM WHICH THE DEPENDENT VARIABLE IS DERIVED
clean_df = clean_df.drop(columns=['suicides_no'])

## REMOVE 2016 DATA
clean_df = clean_df[clean_df["year"] != 2016]

## MOVE DEPENDENT VARIABLE TO LAST COLUMN
cols = clean_df.columns.tolist()
cols.remove("suicides/100k pop")
cols.append("suicides/100k pop")
clean_df = clean_df[cols]

## MIN/MAX NORMALIZE DATA
clean_df = clean_df.astype(np.float32)
mn = clean_df.min().values
mx = clean_df.max().values
norm = lambda x, mn, mx: (x - mn) / (mx-mn+1e-10)
unnorm = lambda x, mn, mx: (x * (mx-mn+1e-10)) + mn
norm_df = norm(clean_df, mn, mx)

# Prepare the input X matrix and target y vector
X = norm_df.loc[:, norm_df.columns != 'suicides/100k pop'].values
y = norm_df.loc[:, norm_df.columns == 'suicides/100k pop'].values.ravel()


NOTE: I removed `Country` to significantly reduce the size of the feature space (and the memory footprint), and because it has no meaningful integer encoding. I am also using the given `suicides/100k pop` feature as the target for the regression problem, rather than the binary category I derived by quantizing and binning the same feature for the classification assignment. The rest of the pre-processing is identical to what I did in Module 3. Please see my submission for assignment 3 for additional plots and the reasoning behind the feature selection and cleaning.
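
The min/max normalization in the cell above can be checked with a small round trip. This is a minimal sketch on made-up values (the toy `data` array is hypothetical); it only restates the `norm` and `unnorm` lambdas as functions and confirms that one undoes the other.

```python
import numpy as np

EPS = 1e-10  # same small constant used in the preprocessing cell


def norm(x, mn, mx):
    """Scale each column into roughly [0, 1]."""
    return (x - mn) / (mx - mn + EPS)


def unnorm(x, mn, mx):
    """Invert norm() to recover the original units."""
    return x * (mx - mn + EPS) + mn


# Hypothetical toy data: one column for year, one for a rate
data = np.array([[1985.0, 2.0], [2000.0, 10.0], [2015.0, 42.0]])
mn, mx = data.min(axis=0), data.max(axis=0)

scaled = norm(data, mn, mx)
restored = unnorm(scaled, mn, mx)

print(scaled)
print(np.allclose(restored, data))
```

Because the target column is normalized too, any error computed on model outputs is in normalized units until it is passed through `unnorm`.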


1. [10 pts] Use your previous pre-processed dataset, keep the variables as one-hot encoded, and develop a multiple linear regression model. How many regression coefficients does this model have? 

In [2]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, train_test_split
from sklearn import metrics

X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, train_size=0.8)
model_config = {"fit_intercept":True}
model = LinearRegression(**model_config)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"MAE: {metrics.mean_absolute_error(y_test, y_pred):.4f}")
print(f"MSE: {metrics.mean_squared_error(y_test, y_pred):.4f}")
print(f"Number of coefficients: {model.n_features_in_+1}")

MAE: 0.0459
MSE: 0.0052
Number of coefficients: 18
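
The count above is the number of input features plus one for the intercept. As a small sketch on synthetic data (the 17 random features and the `demo_model` name are placeholders, not the assignment data), the same number can be read from the fitted `coef_` array and the scalar `intercept_`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical design matrix with 17 features, matching the one-hot layout's width
X_demo = rng.normal(size=(50, 17))
y_demo = X_demo @ rng.normal(size=17) + 0.5

demo_model = LinearRegression(fit_intercept=True).fit(X_demo, y_demo)

# One slope per feature, plus the single intercept term
n_params = demo_model.coef_.size + 1
print(n_params)
```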


2. [10 pts] Use this model to predict the target variable for people with age 20, male, and generation X. Report this prediction. What is the MAE error of this prediction?  

In [3]:
## LOOKUP GROUND TRUTH SAMPLES WITH THE SPECIFIED FEATURES
truth = clean_df[clean_df["sex_male"] == 1.]
truth = truth[truth["age_15-24 years"] == 1.]
truth = truth[truth["generation_Generation X"] == 1.]
truth = np.stack(truth.values)[..., :-1]
norm_truth = norm(truth, mn[:-1], mx[:-1])

## PREDICT ON THE GROUND TRUTH SAMPLES
## (y_true is the mean model prediction over these rows, in normalized units)
y_pred = model.predict(norm_truth)
y_true = np.mean(y_pred)

## GENERATE RANDOM SAMPLES WITH THE SPECIFIED FEATURES
sample = {key:0. for key in norm_df.keys() if key != "suicides/100k pop"}
sample["age_15-24 years"] = 1.
sample["sex_male"] = 1.
sample["generation_Generation X"] = 1.

samples = []
for _ in range(100):
    sample.update({
        "year": np.random.randint(mn[0],mx[0]),
        "population": np.random.uniform(mn[1],mx[1]),
        "gdp_per_capita ($)": np.random.uniform(mn[2],mx[2]),
    })
    x = np.array(list(sample.values()))
    x_norm = norm(x, mn[:-1], mx[:-1])
    samples.append(x_norm)
samples = np.stack(samples)

## PREDICT ON THE RANDOM SAMPLES AND REPORT THE MEAN PREDICTION
y_pred = model.predict(samples)
y_unnorm = unnorm(np.mean(y_pred), mn[-1], mx[-1])
print(f"Predicted suicides/100k pop: {y_unnorm:.4f}")
mae = np.mean(np.abs(y_true - y_pred))
print(f"MAE: {mae:.4f}")

Predicted suicides/100k pop: 20.5034
MAE: 0.0158
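
In the cell above, `y_true` is the mean of the model's own predictions on the matching rows, so the reported MAE measures how far the random-sample predictions sit from that mean. For comparison, here is a minimal sketch with made-up numbers (both arrays are hypothetical) of how an MAE against observed target values differs from a spread around the mean prediction:

```python
import numpy as np

# Hypothetical normalized targets and predictions for rows in one subgroup
y_observed = np.array([0.10, 0.05, 0.20, 0.15])
y_predicted = np.array([0.12, 0.11, 0.13, 0.12])

# Error against what was actually observed
mae_vs_observed = np.mean(np.abs(y_observed - y_predicted))

# Spread of the predictions around their own mean
mae_vs_pred_mean = np.mean(np.abs(np.mean(y_predicted) - y_predicted))

print(f"MAE vs observed targets: {mae_vs_observed:.4f}")
print(f"Spread around mean prediction: {mae_vs_pred_mean:.4f}")
```

The two quantities answer different questions, and the second can be small even when the model is far from the observed values.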


3. [20 pts] Now go back to the original sex, age, and generation variables in their original numerical form (i.e. prior to the one-hot encoding) and build a new model. (I.e., feature engineer the original nominal age and generation features into truly numerical features.) How many regression coefficients are there?

In [4]:
## DROP POOR FEATURES
clean_df = df.drop(columns=[' gdp_for_year ($) ', 'country-year','HDI for year', 'country'])

## CONVERT CATEGORICAL VARIABLES TO NUMERIC
categorical = ["sex","age","generation"]
mapping = {}
mapping["sex"] = {
    "male": 0.,
    "female": 1.,
}
mapping["age"] = {
    "75+ years": 0.,
    "55-74 years": 1.,
    "35-54 years": 2.,
    "25-34 years": 3.,
    "15-24 years": 4.,
    "5-14 years": 5.,
}
mapping["generation"] = {
    "G.I. Generation": 0.,   # GI Generation – 1901-1927.
    "Silent": 1.,            # Silent Generation – 1928-1945.
    "Boomers": 2.,           # Baby Boomers – 1946-1964.
    "Generation X": 3.,      # Generation X – 1965 - 1980.
    "Millenials": 4.,        # Millennials – 1981-1996.
    "Generation Z": 5.,      # Generation Z – 1997-2012.
    "Generation Alpha": 6.,  # Generation Alpha – 2013 - present.
}

for feat in categorical: 
    clean_df[feat] = clean_df[feat].apply(lambda x: mapping[feat][x])

## REMOVE VARIABLES FROM WHICH THE DEPENDENT VARIABLE IS DERIVED
clean_df = clean_df.drop(columns=['suicides_no'])

## REMOVE 2016 DATA
clean_df = clean_df[clean_df["year"] != 2016]

## MOVE DEPENDENT VARIABLE TO LAST COLUMN
cols = clean_df.columns.tolist()
cols.remove("suicides/100k pop")
cols.append("suicides/100k pop")
clean_df = clean_df[cols]

## MIN/MAX NORMALIZE DATA
clean_df = clean_df.astype(np.float32)
mn = clean_df.min().values
mx = clean_df.max().values
norm = lambda x, mn, mx: (x - mn) / (mx-mn+1e-10)
unnorm = lambda x, mn, mx: (x * (mx-mn+1e-10)) + mn
norm_df = norm(clean_df, mn, mx)

# Prepare the input X matrix and target y vector
X = norm_df.loc[:, norm_df.columns != 'suicides/100k pop'].values
y = norm_df.loc[:, norm_df.columns == 'suicides/100k pop'].values.ravel()

X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, train_size=0.8)
model_config = {"fit_intercept":True}
model = LinearRegression(**model_config)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"MAE: {metrics.mean_absolute_error(y_test, y_pred):.4f}")
print(f"MSE: {metrics.mean_squared_error(y_test, y_pred):.4f}")
print(f"Number of coefficients: {model.n_features_in_+1}")

MAE: 0.0463
MSE: 0.0053
Number of coefficients: 7
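
The ordinal encoding step can also be written with `Series.map`, which takes the dictionary directly. This is a sketch on a three-row toy frame (the `toy` data is hypothetical); one behavioral difference from the `apply` version above is that `map` yields `NaN` for a label missing from the dictionary instead of raising a `KeyError`.

```python
import pandas as pd

# Same descending age order as the mapping used in the cell above
age_order = {
    "75+ years": 0.0,
    "55-74 years": 1.0,
    "35-54 years": 2.0,
    "25-34 years": 3.0,
    "15-24 years": 4.0,
    "5-14 years": 5.0,
}

# Hypothetical toy rows
toy = pd.DataFrame({"age": ["15-24 years", "75+ years", "5-14 years"]})
toy["age_code"] = toy["age"].map(age_order)
print(toy)

# A label that is not in the dictionary becomes NaN rather than an error
unknown = pd.Series(["unknown label"]).map(age_order)
print(unknown.isna().all())
```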


4. [10 pts] Use this new Q3. model to predict the target value for the people with age 20, male, and generation X. Report the prediction. What is the MAE error of this prediction?  

In [5]:
## LOOKUP GROUND TRUTH SAMPLES WITH THE SPECIFIED FEATURES
truth = clean_df[clean_df["sex"] == mapping["sex"]["male"]]
truth = truth[truth["age"] == mapping["age"]["15-24 years"]]
truth = truth[truth["generation"] == mapping["generation"]["Generation X"]]
truth = np.stack(truth.values)[..., :-1]
norm_truth = norm(truth, mn[:-1], mx[:-1])

## PREDICT ON THE GROUND TRUTH SAMPLES
## (y_true is the mean model prediction over these rows, in normalized units)
y_pred = model.predict(norm_truth)
y_true = np.mean(y_pred)

## GENERATE RANDOM SAMPLES WITH THE SPECIFIED FEATURES
sample = {key:0. for key in norm_df.keys() if key != "suicides/100k pop"}
sample["sex"] = mapping["sex"]["male"]
sample["age"] = mapping["age"]["15-24 years"]
sample["generation"] = mapping["generation"]["Generation X"]

samples = []
for _ in range(100):
    sample.update({
        "year": np.random.randint(mn[0],mx[0]),
        "population": np.random.uniform(mn[1],mx[1]),
        "gdp_per_capita ($)": np.random.uniform(mn[2],mx[2]),
    })
    x = np.array(list(sample.values()))
    x_norm = norm(x, mn[:-1], mx[:-1])
    samples.append(x_norm)
samples = np.stack(samples)

## PREDICT ON THE RANDOM SAMPLES AND REPORT THE MEAN PREDICTION
y_pred = model.predict(samples)
y_unnorm = unnorm(np.mean(y_pred), mn[-1], mx[-1])
print(f"Predicted suicides/100k pop: {y_unnorm:.4f}")
mae = np.mean(np.abs(y_true - y_pred))
print(f"MAE: {mae:.4f}")

Predicted suicides/100k pop: 12.7163
MAE: 0.0101
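
Positional indices such as `mn[1]` and `mn[2]` depend on column order. Assuming the usual `master.csv` layout, the Q3 feature columns are year, sex, age, population, gdp_per_capita ($), generation, so positions 1 and 2 hold the bounds for sex and age rather than for population and GDP per capita. A sketch of looking up bounds by column name instead, on a made-up two-row frame (`toy`, `lo`, and `hi` are hypothetical names and values):

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame in the assumed Q3 column order (values are made up)
toy = pd.DataFrame({
    "year": [1985.0, 2015.0],
    "sex": [0.0, 1.0],
    "age": [0.0, 5.0],
    "population": [300.0, 4.0e7],
    "gdp_per_capita ($)": [250.0, 1.2e5],
    "generation": [0.0, 5.0],
})
lo, hi = toy.min(), toy.max()  # Series indexed by column name

rng = np.random.default_rng(0)
sample = pd.Series(0.0, index=toy.columns)
sample["sex"] = 0.0         # male
sample["age"] = 4.0         # 15-24 years
sample["generation"] = 3.0  # Generation X
sample["year"] = float(rng.integers(int(lo["year"]), int(hi["year"])))
sample["population"] = rng.uniform(lo["population"], hi["population"])
sample["gdp_per_capita ($)"] = rng.uniform(
    lo["gdp_per_capita ($)"], hi["gdp_per_capita ($)"]
)

# Series arithmetic aligns on labels, so each feature uses its own bounds
x_norm = ((sample - lo) / (hi - lo + 1e-10)).to_numpy()
print(x_norm)
```

Name-based lookup keeps the sampling code valid if columns are added, dropped, or reordered between the one-hot and integer-encoded pipelines.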


5. [10 pts] Did you note any change in these two model performances?

With the one-hot encoded categorical variables the MAE for this query was 0.0158, and with the integer-encoded categorical variables it was 0.0101, a reduction of roughly 36%. On the held-out test split, however, the two models are nearly identical (MAE 0.0459 vs. 0.0463, MSE 0.0052 vs. 0.0053), so the gain applies to this specific query rather than to overall generalization performance.

NOTE: My initial approach for this problem was to use the built-in pandas functions to map the categorical variables to integers. When I examined the mapping, though, the categories were ordered by name rather than by their logical order (i.e. ascending/descending for age and generation). Even so, the model using this arbitrary ordering performed significantly better, at around 0.0032 on average. This is the opposite of what I would have expected, and I'm not sure why logically ordering the numerical mappings made the model worse.
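
One pandas route that produces name-sorted integer codes is `astype("category").cat.codes`; whether this is the exact function used in the initial approach is an assumption. The sketch below shows how the age labels end up ordered when sorted as strings:

```python
import pandas as pd

ages = pd.Series([
    "5-14 years", "15-24 years", "25-34 years",
    "35-54 years", "55-74 years", "75+ years",
])

# String categories are sorted lexically before codes are assigned
codes = ages.astype("category").cat.codes
name_sorted = dict(zip(ages.tolist(), codes.tolist()))
print(name_sorted)
```

Under string sorting, "5-14 years" lands between "35-54 years" and "55-74 years", so the resulting codes are not monotonic in age.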

6. [10 pts] Use your Q3. model to predict the target value for age 33, male, and generation Alpha (i.e. the generation after generation Z); report the prediction. 

In [6]:
## GENERATE RANDOM SAMPLES WITH THE SPECIFIED FEATURES
sample = {key:0. for key in norm_df.keys() if key != "suicides/100k pop"}
sample["sex"] = mapping["sex"]["male"]
sample["age"] = mapping["age"]["25-34 years"]
sample["generation"] = mapping["generation"]["Generation Alpha"]

samples = []
for _ in range(100):
    sample.update({
        "year": np.random.randint(mn[0],mx[0]),
        "population": np.random.uniform(mn[1],mx[1]),
        "gdp_per_capita ($)": np.random.uniform(mn[2],mx[2]),
    })
    x = np.array(list(sample.values()))
    x_norm = norm(x, mn[:-1], mx[:-1])
    samples.append(x_norm)
samples = np.stack(samples)

## PREDICT ON THE RANDOM SAMPLES AND REPORT THE MEAN PREDICTION
y_pred = model.predict(samples)
y_unnorm = unnorm(np.mean(y_pred), mn[-1], mx[-1])
print(f"Predicted suicides/100k pop: {y_unnorm:.4f}")

Predicted suicides/100k pop: 25.3154
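
This prediction is an extrapolation. Assuming the generation codes present in the data run from 0 (G.I. Generation) to 5 (Generation Z), the code 6 for Generation Alpha normalizes to a value above 1, outside the range the model was fitted on. A small arithmetic sketch (the bounds are that assumption, not values read from the data):

```python
EPS = 1e-10  # same constant as in the normalization lambdas

# Assumed bounds of the generation column in the training data
gen_min, gen_max = 0.0, 5.0

alpha_code = 6.0  # mapping["generation"]["Generation Alpha"]
alpha_norm = (alpha_code - gen_min) / (gen_max - gen_min + EPS)
print(round(alpha_norm, 3))
```

A linear model will still return a number for such an input, because it simply extends the fitted line past the observed range.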


7. [10 pts] Give one advantage when using regression (as opposed to classification with nominal features) in terms of independent variables. 

Because regression does not enforce the assumption of independence between features (unlike classification methods like Naïve Bayes), we can take full advantage of features that may be related or correlated. This allows us to use a greater diversity of features, giving the model additional predictive signal to learn from. As a result, we are more likely to end up with a better model given the same data source.

8. [10 pts] Give one advantage when using regular numerical values rather than one-hot encoding for regression. 

I believe the most beneficial advantage is the ability to generalize to never-before-seen categories, as shown in problem 6. The ability to include Generation Alpha, not break the model, and get a meaningful prediction is a huge boost to the generalization capability of the model. The only caveat is that this benefit is restricted to ordinal variables. Other categorical variables that have no inherent order, such as `Country`, cannot take advantage of this. If there is no relation between the integer encodings for `Country`, adding a new country with the encoding n+1 does not tell the model anything meaningful for making a prediction on that never-before-seen country.
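
For contrast, here is a sketch of what happens to a never-before-seen category under one-hot encoding. The two toy frames are hypothetical, and aligning the new row to the training columns with `reindex` is one common approach rather than something done in the cells above.

```python
import pandas as pd

# Hypothetical training categories and one unseen category
train = pd.DataFrame({"generation": ["Silent", "Boomers", "Generation X"]})
new = pd.DataFrame({"generation": ["Generation Alpha"]})

train_cols = pd.get_dummies(train, columns=["generation"]).columns

# Align the new row to the training columns; unknown columns are dropped
new_encoded = pd.get_dummies(new, columns=["generation"]).reindex(
    columns=train_cols, fill_value=0
)
print(new_encoded)
```

The unseen category collapses to a row of zeros, so the one-hot model receives no information about where it falls relative to the known generations.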

9. [10 pts] Now that you developed both a classifier (previously) and a regression model for the problem in this assignment, which method do you suggest to your machine learning model customer? Classifier or regression? Why? 

Because the dependent variable for this dataset is continuous, and because there are a number of useful ordinal categorical variables, I would choose regression in general. Linear regression also provides other benefits that may be very helpful when analyzing suicide rates, such as the ability to investigate feature influence. By determining whether any feature is strongly positively or negatively correlated with suicide rate, we can derive methods for reducing suicides in a given population.

If there were a task that specifically aimed to classify whether a group was at risk, or whether a suicide rate was over a certain threshold, I believe Logistic Regression would be an excellent approach. However, I still think standard regression is a much better fit for this dataset than classification.
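
The threshold-based alternative mentioned above could look roughly like the following. This is a sketch on synthetic data only: the features, the simulated `rate`, and the cutoff of 15 are all hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical features in [0, 1] and a synthetic continuous rate
X_demo = rng.uniform(size=(200, 3))
rate = 30.0 * X_demo[:, 0] + rng.normal(scale=1.0, size=200)

# Binarize the continuous target at a hypothetical cutoff
threshold = 15.0
at_risk = (rate > threshold).astype(int)

clf = LogisticRegression().fit(X_demo, at_risk)
accuracy = clf.score(X_demo, at_risk)
print(f"Training accuracy: {accuracy:.3f}")
```

Binarizing discards how far a group sits above or below the cutoff, which is part of why regression on the continuous rate retains more information.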