#### FINAL PROJECT PHASE II

### Research Question:

Can we reliably predict a person's drug use from their personality traits? Which types of drugs are more likely to be used by people with certain demographics and personalities?

In this assignment, we would like to see whether there is a correlation between specific personality traits and drug use. Specifically, we will look at neuroticism, extraversion, openness to experience, agreeableness, conscientiousness, impulsivity, and sensation seeking. We will train a multivariable regression to see whether we can reliably predict a person's drug use from these traits.
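
As a sketch of the model form, assuming an ordinary least-squares fit with one usage-level target per drug (the symbols are our own notation, not the dataset's):

```latex
\hat{y} = \beta_0 + \sum_{j=1}^{7} \beta_j x_j
```

Here $x_1, \dots, x_7$ are the seven trait scores, $\beta_0$ is the intercept, $\beta_1, \dots, \beta_7$ are the fitted coefficients, and $\hat{y}$ is the predicted usage level for a single drug.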

### Importing:

In [3]:
# imports and settings
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from ucimlrepo import fetch_ucirepo 
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, mean_absolute_percentage_error

### Data Overview:

This dataset contains records for 1885 respondents with the following 12 attributes: personality measurements (neuroticism, extraversion, openness to experience, agreeableness, conscientiousness, impulsivity, sensation seeking), level of education, age, gender, country of residence, and ethnicity. It also contains each respondent's reported use of 18 legal and illegal drugs: alcohol, amphetamines, amyl nitrite, benzodiazepine, cannabis, chocolate, cocaine, caffeine, crack, ecstasy, heroin, ketamine, legal highs, LSD, methadone, mushrooms, nicotine, and volatile substance abuse, plus Semeron (a fictitious drug included to identify over-claimers). For each drug, respondents selected whether they had never used it, used it over a decade ago, or used it in the last decade, year, month, week, or day.
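
The seven response categories appear in the raw targets as class strings `CL0` through `CL6`. A minimal sketch of how those codes map to the categories described above; the dictionary `USAGE_LABELS` and the helper `usage_label` are illustrative names, not part of the dataset or of `ucimlrepo`:

```python
# Usage classes, ordered from "CL0" (never used) to "CL6" (used in the last day).
USAGE_LABELS = {
    0: "Never used",
    1: "Used over a decade ago",
    2: "Used in last decade",
    3: "Used in last year",
    4: "Used in last month",
    5: "Used in last week",
    6: "Used in last day",
}


def usage_label(code: str) -> str:
    """Translate a raw class string such as 'CL3' into a readable label."""
    return USAGE_LABELS[int(code[2:])]


print(usage_label("CL0"))
print(usage_label("CL6"))
```

Because the classes are ordered by recency, converting them to the integers 0 to 6 keeps that ordering, although the gaps between adjacent levels are not equal in time.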

This dataset was accessed from the UC Irvine Machine Learning Repository, where it was created by Elaine Fehrman, Vincent Egan, and Evgeny Mirkes. The dataset was created to evaluate an individual's risk of drug consumption and misuse based on categorical data. The data collection was carried out independently by the authors of the dataset. An anonymous online survey built with Survey Gizmo was used to collect the data, which could have influenced responses, because respondents could choose whether to answer each question honestly. Respondents had to take a personality test to produce the personality data. The data was analyzed and organized by the three authors listed at the beginning of this paragraph, and the UCI Machine Learning Repository hosts it in the form we use here. The respondents were aware of the data collection, as they filled out the form voluntarily, and knew what the data was going to be used for.

Here is the link to the dataset: https://archive.ics.uci.edu/dataset/373/drug+consumption+quantified
and here is the GitHub link to the `ucimlrepo` package used to fetch the data: https://github.com/uci-ml-repo/ucimlrepo?tab=readme-ov-file

In [None]:
# fetch dataset 
drug_consumption = fetch_ucirepo(id=373) 
  
# data features
drug_consumption.data.features

We imported the dataset following the directions on the UCI repository website, which load it as pandas data frames. This created a data frame of features, shown above, which holds the personality and demographic data about each participant. It also created a data frame of targets, shown below, which holds the drug use data for each participant.

In [None]:
#data targets
drug_consumption.data.targets 

Below is a data frame describing each variable in the dataset, including its type. It also shows that none of the variables has missing values.

In [None]:
# variable information 
print(drug_consumption.variables) 
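
The missing-value claim can also be checked directly on the data frames rather than read off the metadata. A minimal sketch, using a small made-up frame in place of `drug_consumption.data.features` (the column names and values here are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy stand-in for drug_consumption.data.features; the values are made up.
toy_features = pd.DataFrame({
    "nscore": [0.31, -0.68, np.nan],
    "escore": [-0.58, 1.94, 0.81],
})

# Count missing entries per column; all zeros would mean no missing values.
missing_per_column = toy_features.isna().sum()
print(missing_per_column)

# A single yes/no answer for the whole frame.
has_missing = bool(toy_features.isna().any().any())
print("Any missing values:", has_missing)
```

Running the same two checks on the real features and targets would confirm what the variable table reports.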

##### Data Cleaning:

The first thing we decided to do was focus on the personality traits, so we deleted the columns that were irrelevant to our analysis: level of education, age, gender, country of residence, and ethnicity.

In [None]:
print("Columns before cleaning: ")
print(drug_consumption.data.features.columns)
# Drop the demographic columns in a single call
drug_consumption.data.features = drug_consumption.data.features.drop(
    columns=['age', 'gender', 'education', 'country', 'ethnicity']
)
print(drug_consumption.data.features)
print("Columns after cleaning: ")
print(drug_consumption.data.features.columns)

In [None]:
# Check unique target values for each substance before conversion
print("Unique target values before conversion:")
for column in drug_consumption.data.targets.columns:
    unique_values = drug_consumption.data.targets[column].unique()
    print(f"Unique values for {column}: {unique_values}")

# Create a new DataFrame to hold the converted values
converted_targets = drug_consumption.data.targets.copy()

# Convert target strings "CL0".."CL6" to integers 0 to 6.
# Slicing off the "CL" prefix column by column avoids a slow cell-by-cell loop
# and leaves each column with an integer dtype instead of object.
for column in converted_targets.columns:
    converted_targets[column] = converted_targets[column].str[2:].astype(int)

# Replace the original targets with the converted DataFrame
drug_consumption.data.targets = converted_targets

# Check unique target values after conversion
print("\nUnique target values after conversion:")
for column in drug_consumption.data.targets.columns:
    unique_values = drug_consumption.data.targets[column].unique()
    print(f"Unique values for {column}: {unique_values}")

# Print targets
print(drug_consumption.data.targets)

In [None]:
print("Columns before cleaning: ")
print(drug_consumption.data.targets.columns)
drug_consumption.data.targets = drug_consumption.data.targets.drop(['semer'], axis=1)
print("Columns after cleaning: ")
print(drug_consumption.data.targets.columns)
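
Dropping the `semer` column removes the fictitious drug from the targets but keeps every respondent, including any who claimed to have used it. If those respondents should be excluded, one possible approach is to filter the rows before dropping the column. The sketch below uses a made-up frame of already-converted integer levels; the names `overclaimer` and `filtered_targets` are illustrative:

```python
import pandas as pd

# Toy stand-in for the converted targets (integers 0-6); the values are made up.
toy_targets = pd.DataFrame({
    "alcohol": [5, 6, 4, 5],
    "cannabis": [0, 4, 3, 6],
    "semer": [0, 0, 2, 0],
})

# Anyone reporting any use of the fictitious drug is flagged.
overclaimer = toy_targets["semer"] > 0
print("Respondents flagged:", int(overclaimer.sum()))

# Keep only the remaining respondents, then drop the helper column.
filtered_targets = toy_targets.loc[~overclaimer].drop(columns=["semer"])
print(filtered_targets)
```

The same boolean mask would need to be applied to the features so that the rows of both frames stay aligned.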

### Exploratory Data Analysis:

In [None]:
# Combine features and targets into a single DataFrame
data_combined = pd.concat([drug_consumption.data.features, drug_consumption.data.targets], axis=1)

# Calculate the correlation matrix
correlation_matrix = data_combined.corr()

# Print the correlation matrix
print(correlation_matrix)
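
The full matrix mixes trait-trait, drug-drug, and trait-drug correlations, while the research question concerns only the last group. A sketch of how that block could be pulled out, using a small made-up frame (in the notebook, the two column lists would come from `drug_consumption.data.features.columns` and `drug_consumption.data.targets.columns`):

```python
import pandas as pd

# Toy stand-in for data_combined; the values are made up.
toy = pd.DataFrame({
    "nscore": [-1.0, -0.5, 0.0, 0.5, 1.0],
    "ss": [1.0, 0.2, -0.3, 0.4, -1.2],
    "alcohol": [5, 5, 6, 4, 5],
    "cannabis": [6, 3, 2, 3, 0],
})
feature_cols = ["nscore", "ss"]
target_cols = ["alcohol", "cannabis"]

# Rows are traits, columns are drugs: only trait-vs-drug correlations remain.
trait_drug_corr = toy.corr().loc[feature_cols, target_cols]
print(trait_drug_corr.round(2))

# For each drug, the trait with the largest absolute correlation.
strongest_trait = trait_drug_corr.abs().idxmax()
print(strongest_trait)
```

The resulting block is small enough to read directly, or to pass to `sns.heatmap` for a visual summary.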

In [None]:
# Use the cleaned personality features to predict alcohol use
X = drug_consumption.data.features  # Features: personality trait scores
y = drug_consumption.data.targets["alcohol"]  # Target: alcohol usage level

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print("Training set size:", X_train.shape)
print("Testing set size:", X_test.shape)

In [None]:
# Initialize the Linear Regression model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

# Make predictions
y_hat_train = model.predict(X_train)
y_hat_test = model.predict(X_test)

# Evaluate the model on the train set
mse = mean_squared_error(y_train, y_hat_train)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_train, y_hat_train)
mape = mean_absolute_percentage_error(y_train, y_hat_train)
print("Mean squared error: ", mse)
print("Root mean squared error: ", rmse)
print("Mean absolute error: ", mae)
print("Mean absolute percentage error:", mape)
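
One caution when reading the percentage error: the usage levels include 0 (never used), and MAPE divides each error by the true value. scikit-learn guards that division with a tiny epsilon instead of raising an error, so any row whose true value is 0 can inflate the score enormously. A small sketch with made-up numbers, writing the same formula out in NumPy:

```python
import numpy as np

# Made-up usage levels; the first respondent has "never used" (level 0).
y_true = np.array([0, 5, 6, 4], dtype=float)
y_pred = np.array([4.5, 5.0, 5.5, 4.5])

# MAPE with the epsilon guard: |error| / max(|y_true|, eps), averaged over rows.
eps = np.finfo(np.float64).eps
mape_all = np.mean(np.abs(y_pred - y_true) / np.maximum(np.abs(y_true), eps))

# The same metric restricted to rows where the true value is nonzero.
nonzero = y_true != 0
mape_nonzero = np.mean(
    np.abs(y_pred[nonzero] - y_true[nonzero]) / np.abs(y_true[nonzero])
)

print("MAPE including the zero row:", mape_all)
print("MAPE excluding the zero row:", mape_nonzero)
```

If the reported MAPE is many orders of magnitude larger than the MAE, zero-valued targets are the likely cause, and MAE or RMSE is the safer metric to interpret.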

In [None]:
# Evaluate the model on the test set
mse = mean_squared_error(y_test, y_hat_test)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_hat_test)
mape = mean_absolute_percentage_error(y_test, y_hat_test)
print("Mean squared error: ", mse)
print("Root mean squared error: ", rmse)
print("Mean absolute error: ", mae)
print("Mean absolute percentage error:", mape)
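
The error values are easier to interpret next to a naive baseline. A sketch, assuming a baseline that ignores the traits and always predicts the mean training label (the arrays here are made up; in the notebook they would be `y_train`, `y_test`, and `y_hat_test`):

```python
import numpy as np

# Made-up stand-ins for y_train, y_test, and the model's test predictions.
y_train = np.array([5, 6, 4, 5, 6, 3], dtype=float)
y_test = np.array([5, 4, 6], dtype=float)
y_hat_test = np.array([5.2, 4.6, 5.1])


def rmse(y_true, y_pred):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Baseline: predict the training-set mean for every test respondent.
baseline_pred = np.full_like(y_test, y_train.mean())

model_rmse = rmse(y_test, y_hat_test)
baseline_rmse = rmse(y_test, baseline_pred)
print("Model RMSE:", model_rmse)
print("Baseline RMSE:", baseline_rmse)
```

A model RMSE close to the baseline RMSE would suggest that the personality traits add little predictive power for that drug.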

In [None]:
# Calculate residuals
residuals_train = y_train - y_hat_train

# Create a scatter plot of residuals
plt.figure(figsize=(10, 6))
plt.scatter(y_hat_train, residuals_train, color='blue', alpha=0.5)
plt.axhline(y=0, color='red', linestyle='--')  # Line at y=0 for reference
plt.title('Residuals vs Predicted Values (Training Set)')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.grid()
plt.show()

In [None]:
# Calculate residuals
residuals_test = y_test - y_hat_test

# Create a scatter plot of residuals
plt.figure(figsize=(10, 6))
plt.scatter(y_hat_test, residuals_test, color='blue', alpha=0.5)
plt.axhline(y=0, color='red', linestyle='--')  # Line at y=0 for reference
plt.title('Residuals vs Predicted Values (Test Set)')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.grid()
plt.show()

### Data Limitations:

### Questions for Reviewers: