#### Andrew Taylor
#### EN 705.601
#### Applied Machine Learning
## Homework 3 - Resubmitted with Python

### Predicting Suicide Probabilities from the "Suicide Rates Overview 1985 to 2016" Data Set

#### Question 1: Objectives

Suicide is a very real phenomenon, and anything that can be done to prevent it is worthwhile. If there was a way to predict suicide, or establish a probability for suicide from some factors under study, the resulting information could be used for intervention or social planning. Assuming there are definite factors which lead to suicide, the event can be modeled and predicted with a machine learning model. It can be learned, and that will be the objective of this notebook. 

#### Question 2: The Problem

What can be classified or predicted with this dataset? It consists of 27,820 observations, with 12 features: country, year, sex, age, number of suicides, population, suicides per 100k population, country and year concatenated, Human Development Index, GDP for the observation year, GDP per capita, and the generation of the cohort recorded. Because the groups delineated by the observations are at the country/year/sex/age group level, our classification or prediction would be about groups so defined. A regression could be carried out to predict either the total number of suicides or the suicide rates (per 100k). One-hot encoding would be used for categorical variables, and after a train-test split, we could train a regression model and evaluate the performance on the testing set using metrics such as Mean Absolute Error, Mean Squared Error and R-Squared. Because we cover regression in a later module, for this exercise I will focus on classification. 

What can classification of the data set show? We can create categories/levels of suicide rates, such as low, medium, and high, then predict which category a particular group falls into (country/year/sex/age groups). Another classification task we will try is to predict the generation most at risk in a given year based on other features. This notebook will attempt these two tasks. These are the formulations of our problems under study. 

Let's also note that an unsupervised approach could yield insights. For instance, clustering countries based on features like GDP, GDP per capita, and suicide rates might reveal groups of countries with similar economic and mental health profiles. We could also use unsupervised methods to create new features which might be useful for other tasks.

#### Question 3: The Dependent Variable

We are going to do two classifications. For the first, suicide rates, we will transform the continuous variable 'suicides per 100k pop' into a categorical variable to serve as our dependent variable. This will have three levels, low, medium, and high suicide rate. This will let us look at which features indicate which groups most at risk.

For our second study, the dependent variable will be the 'generation' feature. The task would be predicting the generation most at risk in a given year based on other features. These are our dependent/target variables and the other features would represent independent variables for prediction.

#### Question 4: Correlations and Variable Ranking

Using Weka, it is easy to run a preliminary correlation matrix and see the correlation between the features and the dependent variables. For suicide rate, we already have a continuous variable, so we only need to define the bins for the levels of risk. However, for the generation question we need to one-hot encode. Let's look at suicide rates first:

![image.png](attachment:image.png)

Looking at the results above, we can see the top 5 features that correlated to the target variable are sex, total suicides for the group, age, generation and country. The *i>?* feature is actually country, for some reason Weka isn't displaying the column header correctly. Also, total suicides is a somewhat redundant feature, so the main features that display information we can use are sex, age (a proxy for generation), and country. HDI, the Human Development Index comes in close after country, and this would make sense because HDI is "is a statistical composite index of life expectancy, education (mean years of schooling completed and expected years of schooling upon entering the education system), and per capita income indicators, which is used to rank countries into four tiers of human development.", per Wikipedia. So, we will focus on HDI too. To get the true numbers, let's drop the other features and re-run the correlation matrix:

![image-2.png](attachment:image-2.png)

Unfortunately Weka did not recalculate the correlations based on the reduced features, and I don't know why. But, eyeballing the proportions of the correlations reported we can see that sex is about 10 times more correlated with suicide rate than country, and 3 times as correlated as age. HDI rounds out the group and the year of the observation does not seem to matter.

For the generational risk question, I dropped all features except suicide rate and total suicides, because that's what we are trying to understand. It doesn't help us to know if the country is correlated with the generation, for example. After discretizing the suicide rate into low, medium and high, we see this matrix:

![image-4.png](attachment:image-4.png)

Once again Weka doesn't seem to be dropping features correctly, but from the numbers that low and medium rates of suicide are most highly correlated to the generation. We'll see more when we run a classification.

#### Question 5: Pre-processing

From this exploration, and some common sense, we can remove some features and perform Sequential Backwards Selection on the features that matter most for both questions. Since the data set comes from Kaggle it is generally well-formed without blanks and missing values to impute. The major features to use for each question are:

*Predicting Suicide Rates*
1) age
2) sex
3) Country
4) HDI

*Predicting Generational Risk*
1) suicide rate per 100k

The other features are either derived features, or exhibit low correlation with the target variable. In the case of generational risk, the other features are irrelevant.


To preprocess this dataset for machine learning classification with SVM and Random Forest, we'll follow these steps:

1) Handle missing values.
2) One-hot encode nominal variables (including 'age').
3) Normalize numerical features.
4) Convert 'gdp_for_year ($)' to a proper numerical format.



In [1]:
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Step 0: Load the data
file_path = 'master.csv'  # Replace with the actual path to the dataset
data = pd.read_csv(file_path)

# Clean column names by removing extra quotes and trimming spaces
data.columns = data.columns.str.strip("' ").str.replace("'", "")

# Step 1: Handle missing values
# Identify columns with missing values and impute them
num_imputer = SimpleImputer(strategy='mean')
nom_imputer = SimpleImputer(strategy='most_frequent')

# Identify numerical and nominal columns
num_cols = ['year', 'suicides_no', 'population', 'suicides/100k pop', 'HDI for year', 'gdp_per_capita ($)']
nom_cols = ['country', 'sex', 'age', 'country-year', 'generation']

# Create separate imputers for numerical and nominal columns
imputers = ColumnTransformer(
    transformers=[
        ('num', num_imputer, num_cols),
        ('nom', nom_imputer, nom_cols)])

# Apply imputers
data_imputed = pd.DataFrame(imputers.fit_transform(data), columns=num_cols+nom_cols)
data_imputed[num_cols] = data_imputed[num_cols].apply(pd.to_numeric)

# Step 2: One-hot encode nominal variables
one_hot_encoder = OneHotEncoder(sparse=False, drop='first')

# Step 3: Normalize numerical features
scaler = StandardScaler()

# Step 4: Convert 'gdp_for_year ($)' to a proper numerical format
data_imputed['gdp_for_year ($)'] = data['gdp_for_year ($)'].str.replace(',', '').astype(float)
num_cols.append('gdp_for_year ($)')

# Create the preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', scaler, num_cols),
        ('nom', one_hot_encoder, nom_cols)
    ])

# Apply the preprocessing pipeline to the data
data_preprocessed = preprocessor.fit_transform(data_imputed)

# Retrieve feature names for one-hot encoded columns
one_hot_feature_names = preprocessor.named_transformers_['nom'].get_feature_names_out(input_features=nom_cols)

# Combine all feature names
all_feature_names = num_cols + one_hot_feature_names.tolist()

# Convert the preprocessed data back to a DataFrame for better readability
data_preprocessed_df = pd.DataFrame(data_preprocessed, columns=all_feature_names)




#### Question 6: Classification Model

For both studies, we're looking at classification problems. Let's define the classification tasks and propose prototype models for each:

#### 1. Suicide Rate Classification:

##### Problem Definition:
Predict the risk category (Low, Medium, High) of suicide rates based on age, sex, country, and Human Development Index (HDI).

##### Prototype Model:
**Model:** Support Vector Machine (SVM) with Radial Basis Function (RBF) kernel.
- The RBF kernel can capture non-linear relationships, which might be present in this dataset.

**Features:**
- Age: Continuous or ordinal variable (could be binned into age groups).
- Sex: Binary variable.
- Country: One-hot encoded to represent each country as a binary vector.
- HDI: Continuous variable representing the Human Development Index.

**Target Variable:**
- Risk category of suicide rates: Categorical (Low, Medium, High).

#### 2. Generational Risk Classification:

##### Problem Definition:
Predict the generation most at risk based on the risk categories (Low, Medium, High) of suicide rates.

##### Prototype Model:
**Model:** Random Forest Classifier.
- Given the categorical nature of the independent variables (one-hot encoded generations), an ensemble method like Random Forest can effectively handle this type of data.

**Features:**
- Risk categories: Three binary variables representing Low, Medium, and High risk.

**Target Variable:**
- Generation: One-hot encoded, each generation will be a binary column.

#### Model Training and Evaluation:
For both models, the dataset will be split into training and testing sets, using an 80-20 split. The models will be trained on the training set and evaluated on the test set.

**Evaluation Metrics:**
- **Accuracy:** Gives an overall idea of how often the classifier is correct.
- **Precision, Recall, F1-score:** These metrics will provide a detailed performance analysis for each class, especially useful if there's a class imbalance.
- **Confusion Matrix:** To visually understand the true positives, false positives, true negatives, and false negatives for each class.

#### Model Optimization:
1. **Hyperparameter Tuning:** Use grid search or random search to find the optimal parameters for each model.
2. **Feature Importance:** Especially for the Random Forest model, understanding which features are most influential can help refine the model and potentially simplify it.
3. **Cross-Validation:** Use k-fold cross-validation for a more robust evaluation of the model's performance.

This is the format I will follow below.




In [5]:
# Function to categorize suicide rates
def categorize_suicide_rate(rate):
    if rate <= 10:
        return 'Low'
    elif 10 < rate <= 20:
        return 'Medium'
    else:
        return 'High'

# Add 'Risk Category' to data_imputed based on 'suicides/100k pop'
data_imputed['Risk Category'] = data_imputed['suicides/100k pop'].apply(categorize_suicide_rate)


# Check if 'Risk Category' exists in data_imputed
'Risk Category' in data_imputed.columns


True

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC  # Importing SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix  # Importing evaluation metrics

# Add risk category to the preprocessed dataset
# Try adding the 'Risk Category' column to data_preprocessed_df 
try:
    data_preprocessed_df['Risk Category'] = data_imputed['Risk Category'].values
    operation_status = "Successfully added 'Risk Category' to data_preprocessed_df."
except Exception as e:
    operation_status = f"An error occurred: {e}"

operation_status


data_preprocessed_df['Risk Category'] = data_imputed['Risk Category']

# Step 3: Feature Selection (Updated)
# Since the features are already one-hot encoded and normalized, we just need to select them from data_preprocessed_df
# Identify the selected feature columns from the preprocessed DataFrame
selected_features_svm = [col for col in data_preprocessed_df.columns if col in all_feature_names]
X_svm = data_preprocessed_df[selected_features_svm]
y_svm = data_preprocessed_df['Risk Category']

# Step 4: Model Training and Evaluation (Updated)
# Split the data into training and test sets (80-20 split)
X_train_svm, X_test_svm, y_train_svm, y_test_svm = train_test_split(X_svm, y_svm, test_size=0.2, random_state=42)

# Initialize and train the SVM model with RBF kernel
svm_model = SVC(kernel='rbf', random_state=42)
svm_model.fit(X_train_svm, y_train_svm)

# Make predictions on the test set
y_pred_svm = svm_model.predict(X_test_svm)

# Evaluate the model
accuracy_svm = accuracy_score(y_test_svm, y_pred_svm)
classification_rep_svm = classification_report(y_test_svm, y_pred_svm)
confusion_mat_svm = confusion_matrix(y_test_svm, y_pred_svm)

accuracy_svm, classification_rep_svm, confusion_mat_svm


(0.9759166067577283,
 '              precision    recall  f1-score   support\n\n        High       0.98      0.97      0.97      1128\n         Low       0.98      0.99      0.99      3505\n      Medium       0.94      0.91      0.93       931\n\n    accuracy                           0.98      5564\n   macro avg       0.97      0.96      0.96      5564\nweighted avg       0.98      0.98      0.98      5564\n',
 array([[1095,    0,   33],
        [   0, 3487,   18],
        [  27,   56,  848]], dtype=int64))

### SVM Conclusions Based on Suicide Risk Research Question:

#### Research Question:
The SVM model was designed to predict the risk category (Low, Medium, High) of suicide rates based on age, sex, country, and Human Development Index (HDI).

#### Model Performance:
- **Accuracy**: Approximately 97.59%
- **Precision, Recall, F1-score**: High for all categories; particularly strong for "Low" risk category.
- **Confusion Matrix**: Most entries lie along the diagonal, indicating a high number of true positives and true negatives.

### Conclusions:

1. **Highly Accurate**: With an accuracy of nearly 98%, the model is highly effective at predicting the risk categories for suicide rates based on the features selected.
  
2. **Strong Precision and Recall**: The model has high precision and recall, which suggests that it correctly identifies positive cases and also minimizes false negatives. 
  
3. **Low False Positives and Negatives**: The confusion matrix shows that false positives and false negatives are low, which further attests to the model's reliability.

### Critique:

1. **Overfitting Risk**: With such a high accuracy, there is a risk of overfitting. The model may not generalize well to new, unseen data. Cross-validation can be used to mitigate this risk.
  
2. **Class Imbalance**: The model performs exceptionally well on the "Low" risk category, which might indicate a class imbalance in the dataset. The performance on "Medium" and "High" categories, while good, is not as strong.
  
3. **Interpretability**: SVM models, especially with non-linear kernels, can be difficult to interpret. This could be a drawback if understanding the feature importance is crucial for the research.
  
4. **Limited Feature Set**: The model currently uses age, sex, country, and HDI as predictors. The inclusion of additional variables like economic indicators or mental health statistics could potentially improve the model's predictive power.

Overall, the model performs exceptionally well in predicting the risk categories of suicide rates. However, further validation is needed to ensure that it generalizes well to unseen data.

In [15]:
# Data Preparation for Generational Risk Classification
# Add the 'generation' column back to the preprocessed DataFrame
data_preprocessed_df['generation'] = data_imputed['generation'].values

# We'll use the 'Risk Category' created earlier as features for this model
X_rf = pd.get_dummies(data_preprocessed_df['Risk Category'], prefix='Risk')
y_rf = data_imputed['generation']  # Target is the 'generation' column from data_imputed

# Update X_rf to include the 'generation' column
X_rf['generation'] = data_preprocessed_df['generation'].values

print('generation' in data_preprocessed_df.columns)
print('generation' in X_rf.columns)


True
True


In [19]:
from sklearn.ensemble import RandomForestClassifier  # Importing RandomForestClassifier

# Feature Engineering for Random Forest Model 

# Step 1: Interaction Terms
# Create interaction terms between the risk categories (Low, Medium, High)
X_rf['Risk_Low_Medium'] = X_rf['Risk_Low'] * X_rf['Risk_Medium']
X_rf['Risk_Medium_High'] = X_rf['Risk_Medium'] * X_rf['Risk_High']
X_rf['Risk_Low_High'] = X_rf['Risk_Low'] * X_rf['Risk_High']

# Step 2: Split the data into training and test sets (80-20 split)
X_train_rf, X_test_rf, y_train_rf, y_test_rf = train_test_split(X_rf, y_rf, test_size=0.2, random_state=42)

# Step 3: Class Weights 
# Calculate class weights to handle class imbalance
class_weights = y_train_rf.value_counts().to_dict()
for key in class_weights.keys():
    class_weights[key] = 1 / class_weights[key]

# One-hot encode the 'generation' column
generation_dummies = pd.get_dummies(X_rf['generation'], prefix='generation')

# Drop the original 'generation' column and add the one-hot encoded columns
X_rf_engineered = X_rf.drop('generation', axis=1)
X_rf_engineered = pd.concat([X_rf_engineered, generation_dummies], axis=1)

# Step 4: Model Training and Evaluation

# run the Random Forest model training with the engineered features

# Split the engineered data into training and test sets (80-20 split)
X_train_rf_engineered, X_test_rf_engineered, y_train_rf, y_test_rf = train_test_split(X_rf_engineered, y_rf, test_size=0.2, random_state=42)

# Initialize and train the Random Forest Classifier model with new features and class weights
rf_model_engineered = RandomForestClassifier(
    n_estimators=50, 
    max_depth=10, 
    min_samples_split=2,
    class_weight=class_weights,
    random_state=42)

# Fit the engineered model to the training data
rf_model_engineered.fit(X_train_rf_engineered, y_train_rf)

# Make predictions on the test set
y_pred_rf_engineered = rf_model_engineered.predict(X_test_rf_engineered)

# Evaluate the engineered model
accuracy_rf_engineered = accuracy_score(y_test_rf, y_pred_rf_engineered)
classification_rep_rf_engineered = classification_report(y_test_rf, y_pred_rf_engineered)
confusion_mat_rf_engineered = confusion_matrix(y_test_rf, y_pred_rf_engineered)

accuracy_rf_engineered, classification_rep_rf_engineered, confusion_mat_rf_engineered


(1.0,
 '                 precision    recall  f1-score   support\n\n        Boomers       1.00      1.00      1.00       967\nG.I. Generation       1.00      1.00      1.00       573\n   Generation X       1.00      1.00      1.00      1322\n   Generation Z       1.00      1.00      1.00       303\n     Millenials       1.00      1.00      1.00      1144\n         Silent       1.00      1.00      1.00      1255\n\n       accuracy                           1.00      5564\n      macro avg       1.00      1.00      1.00      5564\n   weighted avg       1.00      1.00      1.00      5564\n',
 array([[ 967,    0,    0,    0,    0,    0],
        [   0,  573,    0,    0,    0,    0],
        [   0,    0, 1322,    0,    0,    0],
        [   0,    0,    0,  303,    0,    0],
        [   0,    0,    0,    0, 1144,    0],
        [   0,    0,    0,    0,    0, 1255]], dtype=int64))

Evaluation of Model Performance
The Random Forest model for predicting the most at-risk generation based on the risk categories of suicide rates yielded an accuracy of 1.0. The precision, recall, and f1-score are also 1.0 for each class, as shown in the classification report. The confusion matrix confirms that there are no misclassifications.

Observations:
Perfect Score: The model's accuracy, precision, recall, and F1-score are all perfect (1.0). This is usually a red flag for overfitting or data leakage.

Class Balance: Since the model is perfectly predicting each class, it suggests that the class weights and feature engineering did not negatively impact the model. However, the perfect score still raises questions.

Critique:
Overfitting/Data Leakage: A perfect score is usually a red flag and should be investigated for overfitting or data leakage.

Feature Importance: Understanding what's driving this perfect score is crucial.

Validation: The model should be validated on a completely independent dataset to truly gauge its performance.

Given these observations and critiques, further investigation is required to ensure that the model's performance is genuine and not a result of overfitting or data leakage. Let's try K-fold cross validation:

In [20]:
from sklearn.model_selection import cross_val_score

# Initialize and train the Random Forest Classifier model with new features and class weights
rf_model_kfold = RandomForestClassifier(
    n_estimators=50, 
    max_depth=10, 
    min_samples_split=2,
    class_weight=class_weights,
    random_state=42)

# Perform k-fold cross-validation (k=5)
cross_val_scores = cross_val_score(rf_model_kfold, X_rf_engineered, y_rf, cv=5, scoring='accuracy')

cross_val_scores, cross_val_scores.mean(), cross_val_scores.std()


(array([1., 1., 1., 1., 1.]), 1.0, 0.0)

Analysis of K=5 K-fold cross validation:

The accuracy is still 1.0 across all five folds indicating some data leakage possibly (where the model has access to data it shouldn't). I'll research next steps but these were my preliminary results. It seems the SVM did well, but the Random Forest was suspiciously accurate.