<center>
    <img src="JHU.png" width="200" alt="Johns Hopkins University logo">
</center>

## Hands-on Practice Assessment for Predictive Modeling in Suicide Rate Analysis Using Machine Learning

Estimated Time Needed: **60** minutes

### Overview:

In this practice assignment, you will use a compiled dataset from Kaggle, https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016, to predict suicide rates using both one-hot encoded and numerical features. You will build multiple linear regression models that predict the suicide rate per 100,000 population. The objective of this assignment is to apply different feature engineering techniques and assess the model's performance with each approach.

As you work through the assignment, you may encounter questions or challenges requiring additional clarification. In such cases, you can access hidden solutions by clicking the "Click Here" option. This feature is designed to guide you and help deepen your understanding of the material, ensuring you complete the assignment effectively.

Remember, while the solutions are a helpful resource, try solving the problems independently first to make the most of your practice experience.

### Learning Objectives

* Develop multiple linear regression models using one-hot encoding and original numerical features.
* Evaluate model performance and understand metrics such as Mean Absolute Error (MAE).
* Compare the effectiveness of different feature engineering techniques.
* Explore advantages and disadvantages of regression and encoding methods in predictive modeling.



### Data Dictionary for "Suicide Rates Overview 1985 to 2016" Dataset

| Column Name          | Description                                                                 |
|----------------------|-----------------------------------------------------------------------------|
| `country`            | The name of the country                                                     |
| `year`               | The year of the data entry                                                  |
| `sex`                | The gender of the individuals (male or female)                              |
| `age`                | The age group of the individuals                                            |
| `suicides_no`        | The number of suicides                                                      |
| `population`         | The population of the specified group                                       |
| `suicides/100k pop`  | The number of suicides per 100,000 population                               |
| `country-year`       | A combination of country and year                                           |
| `HDI for year`       | Human Development Index for the year (if available)                         |
| `gdp_for_year ($)`   | Gross Domestic Product for the year in dollars                              |
| `gdp_per_capita ($)` | Gross Domestic Product per capita in dollars                                |
| `generation`         | The generation of the individuals (e.g., Generation X, Silent, Boomers)     |


The target variable is `suicides/100k pop`; the goal is to predict it from the other features.

In [None]:
#importing required libraries

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error,mean_absolute_error
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

### Problem 1: Use your pre-processed dataset, keep the variables as one-hot encoded, and develop a multiple linear regression model. How many regression coefficients does this model have? 

In [None]:
#Loading data into dataframe
data = pd.read_csv("suicide_rates.csv")

In [None]:
## Write your code to display the first 5 rows, list the columns, and count null values in each column.
# Display first five rows of the dataframe

# List the names of columns

#checking the data for null or missing values


<details>
<summary>Click here to view/hide solution</summary>
    
```
# Display first five rows of the dataframe
data.head()
# List the names of columns
data.columns
#checking the data for null or missing values
data.isnull().sum()  
    
```
    
</details>

In [None]:
##  Write your code here to drop the unwanted columns 'HDI for year' and 'country-year'.


<details>
<summary>Click here to view/hide solution</summary>

```
data = data.drop(['HDI for year','country-year'], axis = 1)
data.shape
```
</details>

In [None]:
## Write your code to apply one hot encoding to categorical columns country, age, sex and generation.


<details>
<summary>Click here to view/hide solution</summary>

```

categorical = ['country','age', 'sex', 'generation']
data_encoded = pd.get_dummies(data, columns=categorical)

```

</details>

Before building the model, we need to standardize the numerical columns listed below because they are on very different scales.

- `suicides_no`
- `population`
- `suicides/100k pop`
- `gdp_for_year ($)`
- `gdp_per_capita ($)`
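
As a minimal sketch of what standard scaling does, the toy array below (made-up values, not taken from the dataset) has two columns on very different scales. `StandardScaler` replaces each value `x` with `(x - mean) / std`, computed per column, so every column ends up with mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy values (hypothetical): two columns on very different scales
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 5000.0]])

# Scale with scikit-learn
scaled = StandardScaler().fit_transform(X)

# The same computation by hand: subtract the column mean, divide by the
# column standard deviation (population std, ddof=0, as StandardScaler uses)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, both columns contribute on a comparable scale, which makes the fitted coefficients easier to compare.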

In [None]:
### Write your code to standardize the numerical columns using standard scaling.
# Remove commas from the 'gdp_for_year ($)' column and transform it to a float type.
data_encoded[' gdp_for_year ($) '] = data_encoded[' gdp_for_year ($) '].str.replace(',','').astype(float)

# List the numerical columns to be standardized


# Initialize the StandardScaler


# Apply standard scaling to the numerical columns



<details>
<summary>Click here to view/hide solution</summary>
    
```
# Remove commas from the 'gdp_for_year ($)' column and transform it to a float type.
data_encoded[' gdp_for_year ($) '] = data_encoded[' gdp_for_year ($) '].str.replace(',','').astype(float)

# List the numerical columns to be standardized
numerical = ['suicides_no', 'population', 'suicides/100k pop',
             ' gdp_for_year ($) ', 'gdp_per_capita ($)']

# Initialize the StandardScaler
sc = StandardScaler()

# Apply standard scaling to the numerical columns
data_encoded[numerical] = sc.fit_transform(data_encoded[numerical])

```

</details>


In [None]:
## Write your code here to build the multiple linear regression model and display the number of coefficients.
X = data_encoded.drop(['suicides/100k pop'], axis=1)  # Exclude the target variable
y = data_encoded['suicides/100k pop']


# Split the data into training and testing sets


# Initialize and fit the linear regression model


# Make predictions and evaluate the model


# Get the number of regression coefficients (including intercept)



<details>
<summary>Click here to view/hide solution</summary>
    
```
X = data_encoded.drop(['suicides/100k pop'], axis=1)  # Exclude the target variable
y = data_encoded['suicides/100k pop']


# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and fit the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions and evaluate the model
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

# Get the number of regression coefficients (including intercept)
num_coefficients = len(model.coef_) + 1  # +1 for intercept
print(f"Number of regression coefficients: {num_coefficients}")
```

</details>


### Problem 2: Use this model to predict the target variable for 20-year-old males in Generation X. Report this prediction. What is the MAE of this prediction?

In [None]:
# Write your code here.
# Filter the dataframe to the rows where the columns 'age_15-24 years', 'sex_male' and 'generation_Generation X' are all equal to 1.


# Drop the target variable 'suicides/100k pop' from sample_dataframe to create sample_dataframe_X.


# Use the pre-trained model to predict suicide rates 


# Calculate the Mean Absolute Error (MAE) between the predicted values and the actual values 



<details><summary>Click here to view the solution</summary>


Filter the DataFrame: create a `sample_dataframe` by filtering `data_encoded` for records where:

- Age is '15-24 years'
- Sex is 'male'
- Generation is 'Generation X'

```
sample_dataframe = data_encoded[(data_encoded['age_15-24 years'] == 1) & (data_encoded['sex_male'] == 1) & (data_encoded['generation_Generation X'] == 1)]

#Drop the target variable 'suicides/100k pop' from sample_dataframe to create sample_dataframe_X.
sample_dataframe_X=sample_dataframe.drop(['suicides/100k pop'], axis=1)

#Use the pre-trained model to predict suicide rates 
y_pred = model.predict(sample_dataframe_X)
    
#Calculate the Mean Absolute Error (MAE) between the predicted values and the actual values 
mae_error = mean_absolute_error(sample_dataframe['suicides/100k pop'], y_pred)
print(f"MAE error: {mae_error}")


```

</details>

### Problem 3: Now go back to the sex, age, and generation variables as they were before one-hot encoding and build a new model. That is, feature engineer the original nominal age and generation features into truly numerical features. How many regression coefficients are there?

In [None]:
# Write your code here to map the categorical features to numerical values.
# The age mapping is given below; do the same for the generation and sex columns.
data['age'] = data['age'].map({
    '5-14 years': 10,
    '15-24 years': 20,
    '25-34 years': 30,
    '35-54 years': 45,
    '55-74 years': 65,
    '75+ years': 80
})





<details><summary>Click here to view the solution.</summary>
    
```
# Map categorical variables to numerical values
data['age'] = data['age'].map({
    '5-14 years': 10,
    '15-24 years': 20,
    '25-34 years': 30,
    '35-54 years': 45,
    '55-74 years': 65,
    '75+ years': 80
})
data['generation'] = data['generation'].map({
    'Generation X': 1,
    'Silent': 2,
    'Boomers': 3,
    'G.I. Generation': 4,
    'Millenials': 5
    
})

data['sex'] = data['sex'].map({'male': 1, 'female': 0})

# Generation values missing from the mapping above become NaN; drop those rows
data.dropna(subset=['generation'], inplace=True)

data = data.drop(['country'], axis = 1)
```

</details>


In [None]:
## Write your code to standardize the numerical columns using standard scaling.
#Scaling the numerical data columns.


# List the numerical columns to be standardized


# Initialize the StandardScaler


# Apply standard scaling to the numerical columns



<details><summary>Click here to view the solution.</summary>
    
```

# Remove commas from the 'gdp_for_year ($)' column and transform it to a float type.
data[' gdp_for_year ($) '] = data[' gdp_for_year ($) '].str.replace(',','').astype(float)

# List the numerical columns to be standardized
numerical = ['suicides_no', 'population', 'suicides/100k pop',
             ' gdp_for_year ($) ', 'gdp_per_capita ($)']


# Initialize the StandardScaler
sc = StandardScaler()

# Apply standard scaling to the numerical columns
data[numerical] = sc.fit_transform(data[numerical])
    
```

</details>

In [None]:
## Write your code here to build the multiple linear regression model and display the number of coefficients.

X = data.drop(['suicides/100k pop'], axis=1)  # Exclude the target variable
y = data['suicides/100k pop']

# Split the data into training and testing sets



# Initialize and fit the linear regression model



# Make predictions and evaluate the model



<details><summary>Click here to view the solution.</summary>

```
X = data.drop(['suicides/100k pop'], axis=1)  # Exclude the target variable
y = data['suicides/100k pop']
    
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# Initialize and fit the linear regression model
model_original = LinearRegression()
model_original.fit(X_train, y_train)



# Make predictions and evaluate the model
y_pred = model_original.predict(X_test)

num_line_coefficients = len(model_original.coef_) + 1  # +1 for intercept
print(f"Number of line coefficients: {num_line_coefficients}")

```
</details>

### Problem 4: Use the new model from Problem 3 to predict the target value for 20-year-old males in Generation X. Report the prediction. What is the MAE of this prediction?

In [None]:
## Write your code here 

# Predict the target variable for a specific input (age 20, male, generation X) with original numerical form
sample_dataframe = data[(data['age'] == 20) & (data['sex'] == 1) & (data['generation'] == 1)]

#Drop the target variable 'suicides/100k pop' from sample_dataframe to create sample_dataframe_X.    


#Use the pre-trained model to predict suicide rates 


# Calculate MAE error of this prediction with original numerical form



<details><summary>Click here to view the solution</summary>

```
# Predict the target variable for a specific input (age 20, male, generation X) with original numerical form
sample_dataframe = data[(data['age'] == 20) & (data['sex'] == 1) & (data['generation'] == 1)]

#Drop the target variable 'suicides/100k pop' from sample_dataframe to create sample_dataframe_X.    
sample_dataframe_X=sample_dataframe.drop(['suicides/100k pop'], axis=1)

#Use the pre-trained model to predict suicide rates 
prediction_original = model_original.predict(sample_dataframe_X)
    
print(f"Prediction for the first matching row (age 20, male, generation X): {prediction_original[0]}")

print(f"Predictions for all matching rows: {prediction_original}")

# Calculate MAE error of this prediction with original numerical form

mae_error_original = mean_absolute_error(sample_dataframe['suicides/100k pop'], prediction_original)
print(f"MAE error with original numerical form: {mae_error_original}")

```

</details>

### Inferences drawn from the MAE of the two feature engineering techniques
 
The increase in MAE from `0.3339` to `0.4216` indicates that the model performs somewhat worse with the original numerical form: its average absolute prediction error is higher.
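
Note that the solutions above include `suicides/100k pop` in the list of standardized columns, so these MAE values are expressed in standard deviations of the target rather than in suicides per 100,000 people. The sketch below uses made-up numbers (not the dataset) to show how an MAE computed on a standardized target can be converted back to the original units by multiplying by the scaler's standard deviation.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler

# Toy actual and predicted values (hypothetical), in original units
y_true = np.array([[2.0], [4.0], [6.0], [8.0]])
y_pred = np.array([[3.0], [4.0], [5.0], [10.0]])

# MAE in original units: mean of |actual - predicted|
mae_raw = mean_absolute_error(y_true, y_pred)

# MAE after standardizing both arrays with a scaler fitted on the actual values
scaler = StandardScaler().fit(y_true)
mae_scaled = mean_absolute_error(scaler.transform(y_true), scaler.transform(y_pred))

# Multiplying by the fitted standard deviation recovers the original units
mae_back = mae_scaled * scaler.scale_[0]

print(mae_raw, mae_scaled, mae_back)
```

In the notebook, the equivalent conversion would use the `scale_` entry that corresponds to `suicides/100k pop` in the fitted `sc` object.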
 
The model performed better with one-hot encoding because each category receives its own coefficient, so the model can capture the relationship between the categories and the target variable more effectively than it can with a single numerical code per feature.
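
To illustrate why this can happen, here is a small sketch on hypothetical toy data (the `group` column and `y` values are invented for illustration). The middle category has the highest target value, so a single numerical code forces a straight line through a non-monotonic pattern, while one-hot encoding lets the model fit each category separately.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Toy data (hypothetical): category B has the highest target value
toy = pd.DataFrame({
    "group": ["A", "A", "B", "B", "C", "C"],
    "y": [1.0, 1.0, 5.0, 5.0, 2.0, 2.0],
})

# One-hot encoding: one binary column per category
X_onehot = pd.get_dummies(toy[["group"]], columns=["group"], dtype=float)

# Numerical encoding: one column with an integer code per category
X_ordinal = pd.DataFrame({"group": toy["group"].map({"A": 1, "B": 2, "C": 3})})

pred_onehot = LinearRegression().fit(X_onehot, toy["y"]).predict(X_onehot)
pred_ordinal = LinearRegression().fit(X_ordinal, toy["y"]).predict(X_ordinal)

mae_onehot = mean_absolute_error(toy["y"], pred_onehot)
mae_ordinal = mean_absolute_error(toy["y"], pred_ordinal)

print(f"MAE with one-hot encoding:   {mae_onehot:.4f}")
print(f"MAE with numerical encoding: {mae_ordinal:.4f}")
```

On this toy data the one-hot model reproduces each category's value, whereas the numerical model cannot. The gap on the real dataset is smaller, as the MAE values above show.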
 
 
 

### Problem 5: Use your Q3 model to predict the target value for age 33, male, and generation Alpha (i.e. the generation after Generation Z); report the prediction.

In [None]:
## Write your code here.
# Age 33 falls in the 25-34 bucket (encoded as 30). Generation Alpha is not in the
# data, so this filter uses generation code 5 (Millenials), the highest code in the mapping.
sample_dataframe = data[(data['age'] == 30) & (data['sex'] == 1) & (data['generation'] == 5)]

# Drop the target variable from the test data


#Use the pre-trained model to predict suicide rates 




# Calculate MAE error of this prediction with original numerical form



<details><summary>Click here to view the solution</summary>

```
# Age 33 falls in the 25-34 bucket (encoded as 30). Generation Alpha is not in the
# data, so this filter uses generation code 5 (Millenials), the highest code in the mapping.
sample_dataframe = data[(data['age'] == 30) & (data['sex'] == 1) & (data['generation'] == 5)]

# Drop the target variable from the test data
sample_dataframe_X=sample_dataframe.drop(['suicides/100k pop'], axis=1)

#Use the pre-trained model to predict suicide rates 
prediction_original = model_original.predict(sample_dataframe_X)

print(f"Prediction for the 25-34 age bucket, male, generation code 5: {prediction_original[0]}")

# Calculate MAE error of this prediction with original numerical form

mae_error_original = mean_absolute_error(sample_dataframe['suicides/100k pop'], prediction_original)
print(f"MAE error with original numerical form: {mae_error_original}")
```

</details>

### Model Inferences:

#### Advantage of Using Regression with nominal features

In this problem, the target variable `suicides/100k pop` is continuous. When the outcome variable is continuous, we apply regression techniques rather than classification.

Classification is generally used when the target variable is a categorical variable.

In this problem we encounter several nominal or categorical features, such as `age`, `sex`, and `generation`.

We can still build a regression model because feature engineering techniques such as one-hot encoding transform these features into a form suitable for regression.

This preserves the relationship between the categories and the continuous outcome.

#### Advantage of Using Regular Numerical values rather than one-hot encoding

When we use one-hot encoding, the dimensionality increases because each categorical feature is converted into multiple binary columns, potentially resulting in a significant number of additional columns.

Mapping each category to a single numerical value avoids this issue, because each feature stays in one column and the dimensionality does not grow.
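
The difference in dimensionality can be seen on a small toy frame. The sketch below uses invented rows (the `rate` values are hypothetical) with three age groups and two sexes, and counts the feature columns and regression coefficients under each encoding.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy rows (hypothetical values), reusing category labels from the dataset
toy = pd.DataFrame({
    "age": ["5-14 years", "15-24 years", "25-34 years", "15-24 years"],
    "sex": ["male", "female", "male", "female"],
    "rate": [0.5, 3.0, 7.0, 2.5],
})

# One-hot encoding: one column per distinct category value (3 ages + 2 sexes)
X_onehot = pd.get_dummies(toy[["age", "sex"]], columns=["age", "sex"], dtype=float)

# Numerical encoding: one column per feature
X_numeric = pd.DataFrame({
    "age": toy["age"].map({"5-14 years": 10, "15-24 years": 20, "25-34 years": 30}),
    "sex": toy["sex"].map({"male": 1, "female": 0}),
})

# Coefficient count = one per feature column, plus one for the intercept
n_onehot = len(LinearRegression().fit(X_onehot, toy["rate"]).coef_) + 1
n_numeric = len(LinearRegression().fit(X_numeric, toy["rate"]).coef_) + 1

print(f"One-hot:   {X_onehot.shape[1]} columns, {n_onehot} coefficients")
print(f"Numerical: {X_numeric.shape[1]} columns, {n_numeric} coefficients")
```

With a high-cardinality feature such as `country`, the one-hot column count grows by one for every distinct value, which is why the Problem 1 model has many more coefficients than the Problem 3 model.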



## Summary:

This assignment offers hands-on experience in building and evaluating multiple linear regression models with two feature engineering techniques: one-hot encoding and numerical encoding of categorical features. Based on the MAE values reported above, we conclude that the model performs better with the one-hot encoding approach.