# Project MLE - Drug Dosage
### Business Understanding:
A pharmaceutical company has developed a new drug that is supposed to lower blood pressure in patients. They need to determine the optimal dosage of the drug to achieve the desired effect while minimizing any potential side effects. They have conducted a clinical trial with a sample of patients, and they want to use machine learning to estimate the optimal dosage.

### Data Understanding:
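The company has collected data from a clinical trial conducted on a sample of patients. The dataset contains each patient's age, gender, weight, and drug dosage, together with blood pressure measurements before and after taking the drug. This notebook uses the `age`, `gender`, `weight`, `dosage`, and `blood_pressure` columns. The data is in a CSV file; the `gender` column is stored as text and must be encoded before modeling.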

### Data Preparation:
The data needs to be split into training and testing sets. The training set will be used to train the machine learning model, while the testing set will be used to evaluate the performance of the model. We will use the scikit-learn library to split the data.

### Modeling:
We will use maximum likelihood estimation (MLE) to estimate the parameters of a linear regression model. The model predicts the blood pressure response (the `blood_pressure` column) from the patient's age, gender, weight, and the dosage of the drug. MLE chooses the parameter values that maximize the likelihood of observing the training data. For a linear model with independent, normally distributed errors, the MLE of the coefficients coincides with the ordinary least squares (OLS) solution, so we fit the model with the OLS estimator from the statsmodels library.
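
The link between MLE and least squares can be written out in a few lines. The derivation below is a sketch that assumes independent Gaussian errors with constant variance $\sigma^2$; that assumption is ours and is not stated in the trial description. Here $\mathbf{x}_i$ is the predictor vector for patient $i$ (including a leading 1 for the intercept) and $X$ is the matrix stacking those vectors.

```latex
\begin{aligned}
y_i &= \mathbf{x}_i^\top \boldsymbol{\beta} + \varepsilon_i,
  \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2) \\
\ell(\boldsymbol{\beta}, \sigma^2)
  &= -\frac{n}{2}\log\!\left(2\pi\sigma^2\right)
     - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - \mathbf{x}_i^\top \boldsymbol{\beta}\right)^2 \\
\hat{\boldsymbol{\beta}}_{\mathrm{MLE}}
  &= \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{n}\left(y_i - \mathbf{x}_i^\top \boldsymbol{\beta}\right)^2
   = \left(X^\top X\right)^{-1} X^\top \mathbf{y} \\
\hat{\sigma}^2_{\mathrm{MLE}}
  &= \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \mathbf{x}_i^\top \hat{\boldsymbol{\beta}}_{\mathrm{MLE}}\right)^2
\end{aligned}
```

Only the second term of the log-likelihood depends on $\boldsymbol{\beta}$, so maximizing $\ell$ over $\boldsymbol{\beta}$ is the same as minimizing the sum of squared residuals.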

### Evaluation:
We will evaluate the performance of the model on the testing set using the mean squared error (MSE) metric. The MSE measures the average squared difference between the predicted and actual blood pressure measurements in the testing set. We will compare the MSE of the linear regression model with the MSE of a baseline model that always predicts the mean blood pressure measurement in the training set.
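
To make the two quantities concrete, here is a small worked example. The numbers are toy values invented for illustration and are not taken from the trial data; `mean_squared_error` is a local helper defined here, not a library call.

```python
import numpy as np

# Toy values, invented for illustration only (not from the trial data).
y_train = np.array([120.0, 130.0, 125.0, 135.0])
y_test = np.array([128.0, 122.0])
y_pred = np.array([126.0, 125.0])  # hypothetical model predictions


def mean_squared_error(actual, predicted):
    """Average squared difference between actual and predicted values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean((actual - predicted) ** 2))


# Model MSE: errors are 2 and -3, so the MSE is (4 + 9) / 2.
model_mse = mean_squared_error(y_test, y_pred)

# Baseline: predict the training mean (127.5) for every test patient.
baseline_pred = np.full_like(y_test, y_train.mean())
baseline_mse = mean_squared_error(y_test, baseline_pred)

print(model_mse, baseline_mse)
```

In this toy case the model's MSE is lower than the baseline's, which is the outcome we would want from a useful model.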

## Data Understanding
In this part, we import the necessary libraries and load the clinical trial data into a pandas DataFrame.

In [1]:
# Import necessary libraries
import pandas as pd

# Load the clinical trial data into a pandas DataFrame
data = pd.read_csv('clinical_trial_data.csv')

## Data Preparation
Here, we split the data into training and testing sets using an 80/20 split, and assign predictor variables (age, gender, weight, dosage) to X and the target variable (blood_pressure) to y.

In [2]:
# Assign the predictors to X and the target to y, then split 80/20
from sklearn.model_selection import train_test_split
X = data[['age', 'gender', 'weight', 'dosage']]
y = data['blood_pressure']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [3]:
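# Check the shape, dtypes, and missing values of the training data
print(X_train.shape)
print(X_train.dtypes)
# print() returns None, so chaining two calls with `and` would skip the second one
print(X_train.isnull().sum())
print(y_train.isnull().sum())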

(24, 4)
age        int64
gender    object
weight     int64
dosage     int64
dtype: object
age       0
gender    0
weight    0
dosage    0
dtype: int64


In [4]:
# One-hot encode gender column
X_train = pd.get_dummies(X_train, columns=['gender'], drop_first=True)

In [5]:
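# Check the shape, dtypes, and missing values again after encoding
print(X_train.shape)
print(X_train.dtypes)
# print() returns None, so chaining two calls with `and` would skip the second one
print(X_train.isnull().sum())
print(y_train.isnull().sum())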

(24, 4)
age         int64
weight      int64
dosage      int64
gender_M    uint8
dtype: object
age         0
weight      0
dosage      0
gender_M    0
dtype: int64


In [9]:
print(X_test.dtypes)


age        int64
gender    object
weight     int64
dosage     int64
dtype: object


In [10]:
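# Encode gender exactly as in the training set (gender_M is 1 for 'M', 0 otherwise)
# and match the training column order so the fitted coefficients line up.
X_test = X_test.assign(gender_M=(X_test['gender'] == 'M').astype('uint8'))
X_test = X_test.drop(columns='gender')[X_train.columns]


In [11]:
print(X_test.dtypes)


age         int64
weight      int64
dosage      int64
gender_M    uint8
dtype: object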
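
The test set has to be encoded with the same dummy columns, in the same order, as the training set. A minimal sketch of one way to do this with pandas follows; the two small frames are toy data invented for illustration, and fixing the category list from the training data is our design choice rather than something the notebook prescribes.

```python
import pandas as pd

# Toy frames invented for illustration; column names mirror the notebook.
train = pd.DataFrame({
    "age": [34, 51, 45],
    "gender": ["F", "M", "F"],
    "weight": [60, 82, 70],
    "dosage": [10, 20, 15],
})
test = pd.DataFrame({
    "age": [40, 29],
    "gender": ["M", "M"],  # only one gender present in this test sample
    "weight": [77, 68],
    "dosage": [10, 20],
})

train_enc = pd.get_dummies(train, columns=["gender"], drop_first=True)

# Take the category list from the training data, so a test set that happens
# to contain a single gender still produces the same dummy columns.
categories = sorted(train["gender"].unique())
test_cat = test.assign(gender=pd.Categorical(test["gender"], categories=categories))
test_enc = pd.get_dummies(test_cat, columns=["gender"], drop_first=True)

# Align column order with the training design matrix.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(test_enc)
```

Calling `pd.get_dummies(..., drop_first=True)` directly on a test set with a single gender would drop the only dummy column, which is why the categories are fixed first.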


## Modeling

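In this part, we add a constant column to the predictor variables (X_train) so that the model includes an intercept term, and then fit a linear regression model to the training data. The fit uses ordinary least squares, which yields the maximum likelihood estimates of the coefficients when the errors are assumed to be normally distributed.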

In [14]:
import statsmodels.api as sm

# Add a constant column to X_train so the model has an intercept term, then fit by
# ordinary least squares. Under normally distributed errors, the OLS coefficients
# are also the maximum likelihood estimates.
model = sm.OLS(y_train, sm.add_constant(X_train)).fit()
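
As a numerical check of the claim that least squares maximizes the Gaussian likelihood, the sketch below uses synthetic data (not the trial data) and NumPy only. The function `gaussian_log_likelihood` is a local helper written for this example, and the Gaussian error model is an assumption.

```python
import numpy as np

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
true_beta = np.array([1.0, 2.0, -0.5])
y = X @ true_beta + rng.normal(scale=0.3, size=n)


def gaussian_log_likelihood(beta, sigma2, X, y):
    """Log-likelihood of a linear model with i.i.d. N(0, sigma2) errors."""
    resid = y - X @ beta
    n_obs = len(y)
    return -0.5 * n_obs * np.log(2 * np.pi * sigma2) - resid @ resid / (2 * sigma2)


# Least-squares coefficients.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# The MLE of the error variance divides the residual sum of squares by n.
sigma2_mle = np.mean((y - X @ beta_ols) ** 2)

ll_at_ols = gaussian_log_likelihood(beta_ols, sigma2_mle, X, y)
# Shifting every coefficient away from the OLS solution lowers the likelihood.
ll_perturbed = gaussian_log_likelihood(beta_ols + 0.05, sigma2_mle, X, y)

print(ll_at_ols > ll_perturbed)
```

The gradient of the log-likelihood with respect to the coefficients is proportional to `X.T @ residuals`, which is numerically zero at the least-squares solution.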




## Model evaluation
Finally, we evaluate the model on the testing set. We predict blood pressure for the test patients from the fitted parameters and the predictor variables in X_test, then compute the Mean Squared Error (MSE) between the predicted and actual values. As a baseline, we compute the MSE obtained by predicting the training-set mean blood pressure for every patient in the test set. The model is only useful if its MSE is lower than the baseline MSE.

In [15]:
# Predict blood pressure for the test set from the fitted model parameters.
# has_constant='add' forces the intercept column to be added even if a test
# column happens to be constant, which can occur in a very small test set.
import numpy as np
y_pred = model.predict(sm.add_constant(X_test, has_constant='add'))

# Calculate the Mean Squared Error (MSE) between the predicted and actual blood pressure values for the test set.
mse = np.mean((y_test - y_pred)**2)

# Calculate the baseline MSE by predicting the mean blood pressure value in the training set for each patient in the test set.
baseline_mse = np.mean((y_test - np.mean(y_train))**2)

# Compare the MSE of the model with the baseline MSE to see if the model performs better than the mean predictor.
if mse < baseline_mse:
    print('The linear regression model performs better than the baseline model.')
else:
    print('The baseline model performs better than the linear regression model.')



The baseline model performs better than the linear regression model.
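
Any comparison involving NaN evaluates to False, so an `if mse < baseline_mse` check silently falls through to the `else` branch when the predictions contain missing values. The sketch below guards against that case; `compare_to_baseline` is a hypothetical helper written for this example, and the input values are toy numbers rather than trial results.

```python
import numpy as np


def compare_to_baseline(y_true, y_pred, y_train):
    """Return (model MSE, baseline MSE, model_is_better), rejecting NaN predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if np.isnan(y_pred).any():
        raise ValueError("Predictions contain NaN; check the test-set preprocessing.")
    mse = float(np.mean((y_true - y_pred) ** 2))
    # Baseline: predict the training mean for every test observation.
    baseline_mse = float(np.mean((y_true - np.mean(y_train)) ** 2))
    return mse, baseline_mse, mse < baseline_mse


# Toy usage with invented numbers.
mse, baseline_mse, model_is_better = compare_to_baseline(
    y_true=[118.0, 124.0],
    y_pred=[117.0, 126.0],
    y_train=[110.0, 120.0, 130.0],
)
print(mse, baseline_mse, model_is_better)
```

Raising an error here is a design choice: it surfaces a preprocessing problem immediately instead of reporting a misleading verdict.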

