# **Pride and Joy**
### *An investigation of mental health correlates in LGBQ+ people*
| | |
|:-|-:|
|Emily K. Sanders|Capstone Project|
|DSB-318|June 13, 2024|
---

## Prior Notebooks Summary

In the previous notebooks, I thoroughly cleaned the data, imputed missing values, and conducted exploratory data analysis (EDA).  Through that EDA, I determined that my model was most likely to succeed if I used the square root of the Kessler scores as my target variable, rather than the raw scores themselves, and identified predictor variables with promising correlations to those square root scores.

In this notebook, I will build and evaluate several models.  The Python code is included for any readers who wish to follow along or reproduce my work.
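To make the square-root target concrete, here is a minimal sketch of the transform and of how a prediction on the square-root scale could be mapped back to the raw scale. The scores and predictions below are made-up values for illustration, not rows from my data, and the clipping step assumes the standard Kessler-6 total range of 0 to 24.

```python
import numpy as np

# Hypothetical raw Kessler-6 totals (illustrative values only)
raw_scores = np.array([0, 1, 4, 9, 16, 24], dtype=float)

# Square-root transform, analogous to the kessler6_sqrt target
sqrt_scores = np.sqrt(raw_scores)

# Hypothetical model predictions on the square-root scale
sqrt_preds = np.array([0.2, 1.1, 1.9, 3.2, 3.8, 4.7])

# Square the predictions to express them on the raw scale again,
# clipping to the assumed valid range of the instrument (0 to 24)
raw_preds = np.clip(sqrt_preds ** 2, 0, 24)

print(sqrt_scores.round(3).tolist())
print(raw_preds.round(2).tolist())
```

Squaring a square-root-scale prediction is a convenient way to read it in raw units, but it is not an unbiased estimate of the mean raw score, so I treat back-transformed values as approximate.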

## Table of Contents

- [Modeling](#Modeling)
  - [Imports](#Imports)
  - [Model 1](#Model-1)
  - etc.
- [Notebook Summary](#Notebook-Summary)

### Imports

In [1]:
import pandas as pd
import numpy as np
import os
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_absolute_error, mean_squared_error

In [2]:
# Import the most recent dataframe from the previous notebooks
meyer = pd.read_csv('../02_data/df_after_eda.csv')
meyer.shape

(1494, 132)

### Model 1

In [3]:
# I'm just going to eyeball some features from the scatterplots
good_ones = ['chronic_strain', 'suicidality', 'w1age', 'w1auditc_i', 
  'w1connectedness_i', 'w1conversion', 'w1dudit_i', 'w1everyday_i', 
  'w1feltstigma_i', 'w1idcentral_i', 'w1internalized_i', 'w1lifesat_i', 
  'w1meim_i', 'w1pinc_i', 'w1poverty_i_ei', 'w1q03_ei', 'w1q33_ei', 
  'w1q52_ei', 'w1q72_ei', 'w1q74_22_ei', 'w1q74_21_ei', 'w1q140_ei', 
  'w1q166_ei', 'w1q167_ei', 'w1q171_8_ei', 'w1q181_ei_r', 'w1socialwb_i', 
  'w1socsupport_i']

# Make X and y
X = meyer[good_ones]
y = meyer['kessler6_sqrt']

# Very small testing set because this is an inferential model
X_train, X_test, y_train, y_test = train_test_split(X, y, 
  test_size = 0.1, random_state = 6)
  
for i in [X_train, X_test, y_train, y_test]:
  print(i.shape)

(1344, 28)
(150, 28)
(1344,)
(150,)
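The shapes above follow directly from the split arithmetic. As a quick sketch, assuming (as I understand scikit-learn's behavior) that a fractional `test_size` is rounded up to a whole number of rows and the training set takes the remainder:

```python
import math

n_samples = 1494   # rows in the dataframe, from meyer.shape
test_size = 0.1    # fraction passed to train_test_split

# A float test_size is rounded up to a whole number of rows;
# the training set takes whatever remains
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 1344 150
```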


In [4]:
# Instantiate the model
lr = LinearRegression()

# Cross validation just for funsies; keep the scores so they can be inspected later
model_1_cv_scores = cross_val_score(lr, X_train, y_train)

# Fit the model
model_1 = lr.fit(X_train, y_train)

# Make some predictions
model_1_train_preds = model_1.predict(X_train)
model_1_test_preds = model_1.predict(X_test)

# Score the model and print some metrics
# RMSE is computed with np.sqrt because the `squared` argument of
# mean_squared_error is deprecated in newer scikit-learn releases
print('Training Set')
model_1_train_r2 = model_1.score(X_train, y_train)  # R^2, stored but not printed
print(f' MSE: {mean_squared_error(y_train, model_1_train_preds)}')
print(f' RMSE: {np.sqrt(mean_squared_error(y_train, model_1_train_preds))}')
print(f' MAE: {mean_absolute_error(y_train, model_1_train_preds)}')
print('='*20)
print('Testing Set')
model_1_test_r2 = model_1.score(X_test, y_test)  # R^2, stored but not printed
print(f' MSE: {mean_squared_error(y_test, model_1_test_preds)}')
print(f' RMSE: {np.sqrt(mean_squared_error(y_test, model_1_test_preds))}')
print(f' MAE: {mean_absolute_error(y_test, model_1_test_preds)}')

Training Set
 MSE: 0.4842928313042025
 RMSE: 0.6959115111163218
 MAE: 0.5390303622322911
====================
Testing Set
 MSE: 0.4212669570651022
 RMSE: 0.6490508123907575
 MAE: 0.49657612822804764
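For readers who want to see what these metrics measure, the sketch below writes them out directly with NumPy, along with R², which is the quantity `LinearRegression.score` returns. The function name `regression_metrics` and the four data points are hypothetical, chosen only for illustration; they are not my model's predictions.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return R^2, MSE, RMSE, and MAE for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    ss_res = np.sum(residuals ** 2)                   # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation
    return {
        "R2": 1 - ss_res / ss_tot,
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "MAE": np.mean(np.abs(residuals)),
    }

# Hypothetical values on the square-root scale, for illustration only
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]

for name, value in regression_metrics(y_true, y_pred).items():
    print(f" {name}: {value:.4f}")
```

Because the target is the square root of the Kessler score, the MSE, RMSE, and MAE reported above are all in square-root units rather than raw score points.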


## Notebook Summary

In this notebook, I have built several models and presented relevant metrics from each.

In the following notebook, I will interpret these models and discuss my conclusions, limitations, and recommendations for further research.