#### Andrew Taylor
#### atayl136
#### EN705.601 Applied Machine Learning
### Homework 7

In [1]:
# Preprocessing Suicide Rates Data Set

import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Step 0: Load the data
file_path = 'master.csv'  # Replace with the actual path to the dataset
data = pd.read_csv(file_path)

# Clean column names by removing extra quotes and trimming spaces
data.columns = data.columns.str.strip("' ").str.replace("'", "")

# Step 1: Handle missing values
# Identify columns with missing values and impute them
num_imputer = SimpleImputer(strategy='mean')
nom_imputer = SimpleImputer(strategy='most_frequent')

# Identify numerical and nominal columns. Exclude 'country-year', 'year', and 'HDI for year'
# as redundant or incomplete.
num_cols = ['suicides_no', 'population', 'suicides/100k pop', 'gdp_per_capita ($)']
nom_cols = ['country', 'age', 'sex', 'generation']

# Create separate imputers for numerical and nominal columns
imputers = ColumnTransformer(
    transformers=[
        ('num', num_imputer, num_cols),
        ('nom', nom_imputer, nom_cols)])

# Apply imputers
data_imputed = pd.DataFrame(imputers.fit_transform(data), columns=num_cols+nom_cols)
data_imputed[num_cols] = data_imputed[num_cols].apply(pd.to_numeric)

# Step 2: One-hot encode nominal variables
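# Dense output; `sparse_output` replaced `sparse` in scikit-learn 1.2 (use sparse=False on older versions)
one_hot_encoder = OneHotEncoder(sparse_output=False)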

# Step 3: Normalize numerical features
scaler = StandardScaler()

# Step 4: Convert 'gdp_for_year ($)' to a proper numerical format
data_imputed['gdp_for_year ($)'] = data['gdp_for_year ($)'].str.replace(',', '').astype(float)
num_cols.append('gdp_for_year ($)')

# Create the preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', scaler, num_cols),
        ('nom', one_hot_encoder, nom_cols)
    ])

# Apply the preprocessing pipeline to the data
data_preprocessed = preprocessor.fit_transform(data_imputed)

# Retrieve feature names for one-hot encoded columns
one_hot_feature_names = preprocessor.named_transformers_['nom'].get_feature_names_out(input_features=nom_cols)

# Combine all feature names
all_feature_names = num_cols + one_hot_feature_names.tolist()

# Convert the preprocessed data back to a DataFrame for better readability
data_preprocessed_df = pd.DataFrame(data_preprocessed, columns=all_feature_names)

for column in data_preprocessed_df.columns:
    print(column)

suicides_no
population
suicides/100k pop
gdp_per_capita ($)
gdp_for_year ($)
country_Albania
country_Antigua and Barbuda
country_Argentina
country_Armenia
country_Aruba
country_Australia
country_Austria
country_Azerbaijan
country_Bahamas
country_Bahrain
country_Barbados
country_Belarus
country_Belgium
country_Belize
country_Bosnia and Herzegovina
country_Brazil
country_Bulgaria
country_Cabo Verde
country_Canada
country_Chile
country_Colombia
country_Costa Rica
country_Croatia
country_Cuba
country_Cyprus
country_Czech Republic
country_Denmark
country_Dominica
country_Ecuador
country_El Salvador
country_Estonia
country_Fiji
country_Finland
country_France
country_Georgia
country_Germany
country_Greece
country_Grenada
country_Guatemala
country_Guyana
country_Hungary
country_Iceland
country_Ireland
country_Israel
country_Italy
country_Jamaica
country_Japan
country_Kazakhstan
country_Kiribati
country_Kuwait
country_Kyrgyzstan
country_Latvia
country_Lithuania
country_Luxembourg
country_Macau


In [2]:
# Question 1: Multiple Linear Regression Model with all features

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Data Preparation: Separate the target variable ('suicides/100k pop') and feature variables
X = data_preprocessed_df.drop('suicides/100k pop', axis=1)
y = data_preprocessed_df['suicides/100k pop']

# Data Splitting: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

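# Select the features: every numerical column except 'suicides_no', plus the dummy columns
# (the reference levels country_Albania, sex_female, and generation_Boomers are left out)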
selected_features = [
    'population', 'gdp_per_capita ($)', 'gdp_for_year ($)',
    'country_Antigua and Barbuda', 'country_Argentina', 'country_Armenia', 'country_Aruba', 'country_Australia',
    'country_Austria', 'country_Azerbaijan', 'country_Bahamas', 'country_Bahrain', 'country_Barbados',
    'country_Belarus', 'country_Belgium', 'country_Belize', 'country_Bosnia and Herzegovina', 'country_Brazil',
    'country_Bulgaria', 'country_Cabo Verde', 'country_Canada', 'country_Chile', 'country_Colombia',
    'country_Costa Rica', 'country_Croatia', 'country_Cuba', 'country_Cyprus', 'country_Czech Republic',
    'country_Denmark', 'country_Dominica', 'country_Ecuador', 'country_El Salvador', 'country_Estonia',
    'country_Fiji', 'country_Finland', 'country_France', 'country_Georgia', 'country_Germany', 'country_Greece',
    'country_Grenada', 'country_Guatemala', 'country_Guyana', 'country_Hungary', 'country_Iceland',
    'country_Ireland', 'country_Israel', 'country_Italy', 'country_Jamaica', 'country_Japan',
    'country_Kazakhstan', 'country_Kiribati', 'country_Kuwait', 'country_Kyrgyzstan', 'country_Latvia',
    'country_Lithuania', 'country_Luxembourg', 'country_Macau', 'country_Maldives', 'country_Malta',
    'country_Mauritius', 'country_Mexico', 'country_Mongolia', 'country_Montenegro', 'country_Netherlands',
    'country_New Zealand', 'country_Nicaragua', 'country_Norway', 'country_Oman', 'country_Panama',
    'country_Paraguay', 'country_Philippines', 'country_Poland', 'country_Portugal', 'country_Puerto Rico',
    'country_Qatar', 'country_Republic of Korea', 'country_Romania', 'country_Russian Federation',
    'country_Saint Kitts and Nevis', 'country_Saint Lucia', 'country_Saint Vincent and Grenadines',
    'country_San Marino', 'country_Serbia', 'country_Seychelles', 'country_Singapore', 'country_Slovakia',
    'country_Slovenia', 'country_South Africa', 'country_Spain', 'country_Sri Lanka', 'country_Suriname',
    'country_Sweden', 'country_Switzerland', 'country_Thailand', 'country_Trinidad and Tobago',
    'country_Turkey', 'country_Turkmenistan', 'country_Ukraine', 'country_United Arab Emirates',
    'country_United Kingdom', 'country_United States', 'country_Uruguay', 'country_Uzbekistan', 'age_15-24 years',
    'age_25-34 years', 'age_35-54 years', 'age_5-14 years', 'age_55-74 years', 'age_75+ years',
    'sex_male',
    'generation_G.I. Generation', 'generation_Generation X', 'generation_Generation Z',
    'generation_Millenials', 'generation_Silent'
]

# X_train, X_test, y_train, y_test are defined and data_preprocessed_df contains the preprocessed data
# we only consider the selected features here
X_train_selected = X_train[selected_features]
X_test_selected = X_test[selected_features]

# 3. Model Building: Create the LinearRegression model
model = LinearRegression()

# 4. Model Training: Fit the model on the training data
model.fit(X_train_selected, y_train)

# 5. Model Evaluation: Evaluate the model using the test data
y_pred = model.predict(X_test_selected)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"MAE: {mae}, MSE: {mse}, R2: {r2}")



MAE: 0.45562008581673424, MSE: 0.47913577061013685, R2: 0.5070618631789722


In [3]:
# Question 1: one hot encoded only model

selected_features = [
    'age_15-24 years', 'age_25-34 years', 'age_35-54 years', 'age_5-14 years', 'age_55-74 years', 'age_75+ years',
    'sex_male',
    'generation_G.I. Generation', 'generation_Generation X', 'generation_Generation Z',
    'generation_Millenials', 'generation_Silent'
]

# X_train, X_test, y_train, y_test are defined and data_preprocessed_df contains the preprocessed data
# we only consider the selected features here
X_train_selected = X_train[selected_features]
X_test_selected = X_test[selected_features]

# 3. Model Building: Create the LinearRegression model
model = LinearRegression()

# 4. Model Training: Fit the model on the training data
model.fit(X_train_selected, y_train)

# 5. Model Evaluation: Evaluate the model using the test data
y_pred = model.predict(X_test_selected)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"MAE: {mae}, MSE: {mse}, R2: {r2}")


MAE: 0.5289259937943444, MSE: 0.6883311112973017, R2: 0.29184027507116583


In [5]:
# Question 1: Use one-hot encoded features only to predict for an age 20, male, Generation X input
# Age 20 falls in the '15-24 years' group, so that dummy variable is set

import numpy as np

# Initialize an input vector of zeros, one entry per selected feature
input_vector = np.zeros((1, len(selected_features)))

# Set the dummy variables that describe the query point
input_vector[0][selected_features.index('age_15-24 years')] = 1
input_vector[0][selected_features.index('sex_male')] = 1
input_vector[0][selected_features.index('generation_Generation X')] = 1

# Display the updated input vector
print(f'\nInput Vector: {input_vector} \n')
print('\n')

# X_train, X_test, y_train, y_test are defined and data_preprocessed_df contains the preprocessed data
# we only consider the selected one hot features here
X_train_selected = X_train[selected_features]
X_test_selected = X_test[selected_features]

# 3. Model Building: Create the LinearRegression model
model = LinearRegression()

# 4. Model Training: Fit the model on the training data
model.fit(X_train_selected, y_train)

# 5. Model Evaluation: Evaluate the model using the test data
y_pred = model.predict(X_test_selected)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"MAE: {mae}, MSE: {mse}, R2: {r2}")


# Use the model to make a prediction
predicted_suicide_rate = model.predict(input_vector)

print(f'Predicted Suicide Rate for Age 15-24, Male, Generation X: {predicted_suicide_rate}')


Input Vector: [[1. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0.]] 



MAE: 0.5289259937943444, MSE: 0.6883311112973017, R2: 0.29184027507116583
Predicted Suicide Rate for Age 15-24, Male, Generation X: [0.22070312]




In [4]:
# Retrieve the number of regression coefficients from the model trained on the feature set
num_coefficients = len(model.coef_)

num_coefficients


12

In [6]:
# Question 2: Predicting the same point with Numerical and ordinal features

# Define the mapping for age and generation features
age_mapping = {
    '5-14 years': 0,
    '15-24 years': 1,
    '25-34 years': 2,
    '35-54 years': 3,
    '55-74 years': 4,
    '75+ years': 5
}

generation_mapping = {
    'G.I. Generation': 0,
    'Silent': 1,
    'Boomers': 2,
    'Generation X': 3,
    'Millenials': 4,
    'Generation Z': 5
}

# Apply the mapping to the entire dataset
data['age_encoded'] = data['age'].map(age_mapping)
data['generation_encoded'] = data['generation'].map(generation_mapping)

# binary encode the 'sex' feature for the entire dataset
data['sex_encoded'] = (data['sex'] == 'male').astype(int)

# Extract the features and target variable from the entire dataset
X_all = data[['age_encoded', 'sex_encoded', 'generation_encoded']].values
y_all = data['suicides/100k pop'].values

# Split the data into training and test sets
X_train_all, X_test_all, y_train_all, y_test_all = train_test_split(X_all, y_all, test_size=0.2, random_state=42)

# Create the Linear Regression model and fit it on the training split
model_all = LinearRegression()
model_all.fit(X_train_all, y_train_all)

# Evaluate the model using the test data from the entire dataset
y_pred_all = model_all.predict(X_test_all)
mae_all = mean_absolute_error(y_test_all, y_pred_all)
mse_all = mean_squared_error(y_test_all, y_pred_all)
r2_all = r2_score(y_test_all, y_pred_all)

# Specific prediction for age 15-24 years, male, and Generation X
# 'age_encoded' = 1, 'sex_encoded' = 1, 'generation_encoded' = 3
specific_data_point_all = np.array([[1, 1, 3]])
specific_prediction_all = model_all.predict(specific_data_point_all)

# print the model evaluation metrics and specific prediction
print("Model Evaluation Metrics (Entire Dataset)")
print(f"- Mean Absolute Error (MAE): {mae_all:.2f}")
print(f"- Mean Squared Error (MSE): {mse_all:.2f}")
print(f"- Coefficient of Determination (R^2): {r2_all:.3f}")

print("\nSpecific Prediction")
print(f"- Predicted Suicide Rate for Age 15-24, Male, Generation X: {specific_prediction_all[0]:.2f} per 100,000 population")



Model Evaluation Metrics (Entire Dataset)
- Mean Absolute Error (MAE): 10.17
- Mean Squared Error (MSE): 250.64
- Coefficient of Determination (R^2): 0.283

Specific Prediction
- Predicted Suicide Rate for Age 15-24, Male, Generation X: 14.36 per 100,000 population


In [7]:
# Retrieve the number of coefficients in the linear regression model
num_coefficients = len(model_all.coef_)

print(f"The number of line coefficients in the model is: {num_coefficients}")


The number of line coefficients in the model is: 3


#### Question 3: Performance

The model with all features had these statistics:

MAE: 0.45562008581673424, MSE: 0.47913577061013685, R2: 0.5070618631789722

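The model restricted to the one-hot encoded age, sex, and generation features (12 coefficients) performed as follows:

MAE: 0.5289259937943444, MSE: 0.6883311112973017, R2: 0.29184027507116583

Its prediction for the 15-24 years, male, Generation X input was 0.2207. This value is in standardized units, because `suicides/100k pop` was scaled together with the other numerical columns during preprocessing.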

The model that encodes sex as a binary feature and age and generation as ordinal numeric features has only 3 coefficients:

Model Evaluation Metrics (Entire Dataset)
- Mean Absolute Error (MAE): 10.17
- Mean Squared Error (MSE): 250.64
- Coefficient of Determination (R^2): 0.283

Specific Predictions
- Predicted Suicide Rate for Age 15-24 (age_encoded = 1), Male, Generation X: 14.36 per 100,000 population
- Predicted Suicide Rate for Age 25-34 (age_encoded = 2), Male, Generation X: 18.12 per 100,000 population

The first two models were fit on the standardized target, so their MAE and MSE are in standard-deviation units, whereas this model's errors are in suicides per 100,000 population. Only R2 is directly comparable across the three: 0.507 with all features, 0.292 with the one-hot features only, and 0.283 with the three binary and ordinal features.
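
A prediction made on the standardized target has to be mapped back to suicides per 100,000 before it can be set beside a prediction in raw units. The sketch below shows the idea on made-up numbers: `raw_rates` and `z_pred` are hypothetical and are not taken from `master.csv`. In the notebook itself, the fitted scaler should be reachable as `preprocessor.named_transformers_['num']`, whose `mean_` and `scale_` arrays hold one entry per column in `num_cols`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical raw target values (suicides per 100k); not real data
raw_rates = np.array([[2.0], [5.0], [11.0], [30.0]])

# Fit a scaler on the raw target, as the preprocessing step does for numerical columns
scaler = StandardScaler().fit(raw_rates)

# A hypothetical prediction expressed in standardized units
z_pred = np.array([[0.5]])

# Undo the scaling with the library call ...
raw_pred = scaler.inverse_transform(z_pred)

# ... which is equivalent to raw = z * std + mean
manual = z_pred * scaler.scale_ + scaler.mean_

print(raw_pred, manual)
```

Applying the same formula with the mean and scale of the `suicides/100k pop` column would put the one-hot model's prediction on the same footing as the ordinal model's.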


In [8]:
# Question 4: Prediction for age 33, male and Generation Alpha

# Encoding for age 33 falls under the category '25-34 years', which is encoded as 2
# Encoding for male is 1
# Encoding for generation Alpha would be the generation after Generation Z, so it would be encoded as 6 (one more than Generation Z's encoding of 5)
new_data_point = np.array([[2, 1, 6]])

# Use the model to make a new prediction
new_prediction = model_all.predict(new_data_point)

print(f"Predicted Suicide Rate for Age 33, Male, Generation Alpha: {new_prediction[0]:.2f} per 100,000 population")


Predicted Suicide Rate for Age 33, Male, Generation Alpha: 16.91 per 100,000 population


#### Question 5: Advantages using Regression in terms of independent variables

One advantage of using regression models when dealing with independent variables is the ability to capture and interpret the relationships between variables in a more nuanced way. In regression, the coefficients associated with the independent variables indicate the strength and direction of the relationship with the dependent variable. 

For example, in linear regression, the coefficient for an independent variable tells you how much the dependent variable is expected to increase (or decrease) for each one-unit increase in that independent variable, holding all other variables constant. This allows for a richer understanding of the underlying relationships, and you can quantify how a change in one feature is likely to impact the target variable. 

In contrast, classification models with nominal features generally don't offer this level of interpretability regarding the magnitude of impact of each feature on the target variable. 
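
As a small illustration of this reading of the coefficients, the sketch below fits a linear regression to synthetic, noiseless data generated from a known rule (`y = 3*x1 - 2*x2 + 5`, chosen only for this example) and checks that raising the first feature by one unit shifts the prediction by exactly its coefficient.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data from a known linear rule (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0

model = LinearRegression().fit(X, y)

# Increase the first feature by one unit while holding the second constant
base = np.array([[1.0, 1.0]])
bumped = base + np.array([[1.0, 0.0]])
delta = model.predict(bumped)[0] - model.predict(base)[0]

print("coefficients:", model.coef_)
print("intercept:", model.intercept_)
print("change in prediction:", delta)
```

The change in the prediction matches the first coefficient, which is the "holding all other variables constant" interpretation described above.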

#### Question 6: Advantages when using regular numerical values rather than one-hot encoding

One advantage of using regular numerical values rather than one-hot encoding for regression models is the preservation of ordinal relationships between categories. In many cases, the ordinal nature of a feature (e.g., age groups, education levels, or ratings) carries meaningful information that can be useful for making predictions. By converting these to regular numerical values, the model can capture the inherent order in the data and possibly result in a more accurate and interpretable model.

For example, encoding age groups ('25-34 years', '35-54 years', etc.) as ordinal numerical values (2, 3, etc.) allows the regression model to understand that '35-54 years' is a higher age group compared to '25-34 years'. This ordinal information is lost when using one-hot encoding, which treats each category as an independent feature.

Moreover, using regular numerical values for ordinal features reduces the dimensionality of the dataset, making the model simpler, more interpretable, and less prone to overfitting compared to one-hot encoding, which increases the number of features.
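
The difference in dimensionality can be seen in a minimal sketch. The age labels below match the dataset's groups, but the `rate` values are invented for illustration. The ordinal model learns a single slope (one equal step per age group), while the one-hot model learns one coefficient per group.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

age_order = ['5-14 years', '15-24 years', '25-34 years',
             '35-54 years', '55-74 years', '75+ years']

# Toy data: two observations per age group with made-up rates
toy = pd.DataFrame({
    'age': age_order * 2,
    'rate': [0.5, 9.0, 12.0, 15.0, 16.0, 24.0,
             0.7, 8.0, 13.0, 14.0, 17.0, 23.0],
})

# Ordinal encoding: a single numeric column that preserves the order
ordinal_X = toy['age'].map({label: i for i, label in enumerate(age_order)}).to_frame('age_encoded')

# One-hot encoding: one indicator column per age group
onehot_X = pd.get_dummies(toy['age'], dtype=float)

ordinal_model = LinearRegression().fit(ordinal_X, toy['rate'])
onehot_model = LinearRegression().fit(onehot_X, toy['rate'])

print("ordinal coefficients:", len(ordinal_model.coef_))
print("one-hot coefficients:", len(onehot_model.coef_))
```

The trade-off is that the ordinal encoding assumes the target changes by the same amount between consecutive groups, whereas the one-hot encoding can fit each group separately at the cost of more parameters.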

#### Question 7: Classification or Regression?

In the context of predicting suicide rates, I would recommend using a regression model over a classifier for the following reasons:

##### Why Regression:

1. **Continuous Outcome**: Suicide rates are continuous variables, generally expressed as rates per 100,000 population. Regression is well-suited for predicting a continuous outcome.

2. **Subtlety and Nuance**: Regression can capture the subtlety and nuance in how various factors contribute to suicide rates. The coefficients provide an interpretable way to understand the magnitude and direction of each feature's effect.

3. **Predictive Flexibility**: Regression models can make predictions for combinations of feature values that may not have been present in the training data. This is useful for making predictions about specific subgroups or under specific conditions.

4. **Ordinal Features**: If the dataset contains ordinal features (like age groups), regression models can effectively incorporate this ordinality, as opposed to classifiers that would treat each age group as an independent category.

5. **Policy Implications**: Understanding the rate allows policymakers to gauge the severity of the issue and allocate resources accordingly. A classification model, which would simply categorize an instance as high or low risk, might not provide this level of detail.
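
The loss of detail mentioned in the last point can be illustrated with a short sketch. The rates and the cutoff of 10 per 100,000 are hypothetical values chosen only for this example.

```python
import numpy as np

# Hypothetical suicide rates per 100k and a hypothetical high/low cutoff
rates = np.array([1.2, 9.8, 10.1, 45.0])
threshold = 10.0

# A classifier target would keep only which side of the cutoff each rate falls on
labels = np.where(rates > threshold, 'high', 'low')

for rate, label in zip(rates, labels):
    print(f"{rate:5.1f} per 100k -> {label}")
```

Rates of 10.1 and 45.0 end up in the same class, even though they would call for very different responses. A regression target keeps that magnitude.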

##### Honorable Mentions for the Classifier:

1. **Interpretability**: Classifiers can be easier to interpret when the outcome has clearly defined categories (e.g., 'High Risk' vs 'Low Risk').

2. **Imbalanced Data**: If the data shows extreme imbalances in suicide rates, classification might be more appropriate, as many classifiers have good techniques for handling class imbalance.

3. **Decision Boundaries**: Classifiers can capture complex decision boundaries, which might be useful if the relationship between features and suicide rates is not linear or easily approximable by a regression function.

##### Final Recommendation:

Given the nature of the problem and the type of insights we're likely interested in, a regression model would be more appropriate for predicting suicide rates. It offers the granularity, interpretability, and predictive flexibility that are crucial for understanding a complex issue like this.
