# PBHLTH 198, Fall 2023 | Homework 5

# Section 1: Exploring the Basics of Machine Learning

In this section, we'll explore and review the basics of machine learning and modeling, an essential tool in epidemiology and biostatistics. Specifically, we use modeling to measure how certain diseases spread, to see how different groups respond to different levels of treatment, and to explore confounding in our epidemiological studies.

For now, we'll stick to the basics and a general overview of modeling, equipping you with the necessary skills for the final project. Before we get into the questions, let's (re)introduce some of the more commonly used models:

### A General Overview of Some Models

**Linear Regression**: 
- Models a continuous response variable (e.g. price)
- Assumes that the relationship between the variables is *linear* (e.g. If I increase the value of one variable by one unit, how much will my response increase by?)
- Creates a best fit line that minimizes the *Root Mean Square Error* ($\text{RMSE}$) between the line and the data points
    - In its simplest form, the line of best fit is in the form: 
    $$\underbrace{Y}_{\text{Response Variable}} = 
    \underbrace{\beta_0}_{\text{Intercept}}\hspace{1cm} + 
    \hspace{0cm}
    \underbrace{\beta_1 X}_{\text{Data times Some Coefficient}}$$
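
As a rough illustration of the line of best fit described above, the sketch below estimates $\beta_0$ and $\beta_1$ with the closed-form least-squares formulas and then computes the $\text{RMSE}$. The data values are made up for this example, and plain Python is used here only to keep the arithmetic visible; in practice a library would do this for you.

```python
import math

# Hypothetical toy data, invented for illustration only.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [2.1, 4.9, 8.2, 10.8, 14.0]

x_mean = sum(x) / len(x)
y_mean = sum(y) / len(y)

# Closed-form least-squares estimates for a single feature:
# slope = covariance(x, y) / variance(x), intercept = y_mean - slope * x_mean
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
s_xx = sum((xi - x_mean) ** 2 for xi in x)
beta_1 = s_xy / s_xx
beta_0 = y_mean - beta_1 * x_mean

# RMSE between the fitted line and the data points.
y_hat = [beta_0 + beta_1 * xi for xi in x]
rmse = math.sqrt(sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat)) / len(y))

print(beta_0, beta_1, rmse)
```

Minimizing the $\text{RMSE}$ and minimizing the sum of squared errors pick out the same line, which is why the least-squares formulas apply here.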

**Logistic Regression**: 
- Models a binary output variable (e.g. disease or no disease)
- One of the more common models used in biostatistics and epidemiology 
    - Often used to estimate the effect of a treatment by comparing a control group with a treatment group
- Returns the *probability* of the response variable based on features of the data
- Takes the form:
$$\underbrace{P(Y=1)}_{\text{Probability of Y = 1}} = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$
    - Where $\beta_0$ is our intercept, and $\beta_1 X$ is the coefficient $\beta_1$ determined by the model, multiplied by our data $X$
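
To make the formula above concrete, here is a small sketch that evaluates it directly. The function name `logistic_probability` and the coefficient values are hypothetical, chosen only for illustration; a fitted model would estimate $\beta_0$ and $\beta_1$ from the data.

```python
import math

def logistic_probability(x, beta_0, beta_1):
    """Return P(Y = 1) for a single feature value x under the logistic formula."""
    z = beta_0 + beta_1 * x
    return math.exp(z) / (1 + math.exp(z))

# Hypothetical coefficients, not estimated from any real data.
beta_0, beta_1 = -1.0, 2.0

p_unexposed = logistic_probability(0, beta_0, beta_1)  # X = 0
p_exposed = logistic_probability(1, beta_0, beta_1)    # X = 1

print(round(p_unexposed, 4), round(p_exposed, 4))
```

Whatever the coefficients, the output always lies strictly between $0$ and $1$, which is what lets us read it as a probability.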
    
Don't worry too much about the specifics of each model for now. Later on in the semester, we'll work with more complex models that are more specific to the domain of epidemiology and biostatistics. Let's focus more on the overview of constructing models with a Python package. This package is called ```scikit-learn```, also known as ```sklearn```.

For the following questions, we'll be asking about how we prepare our data for model creation using ```sklearn```.

## Question 1:
Given the DataFrame ```df```, look up an ```sklearn``` function that we could use to split our data into a training set and a test set.

_Type your answer here, replacing this text._

## Question 2: 
Now that we've split up our data, what are a training set and a test set? Which of these do you train your model on and why? Should you include the actual ground truth labels in your model? Why or why not?

_Type your answer here, replacing this text._

## Question 3:
For this next part, assume we trained a ```LogisticRegression``` model called ```lr```. 

Great! We've split up our data and trained our model. Before we get to predicting on our test set, let's evaluate how our model is doing overall. What sklearn function can we use to see the predicted labels our model made? Which argument should we pass into this function: the training set or the testing set?

*Hint*: The link to sklearn's ```LogisticRegression``` documentation is found [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

_Type your answer here, replacing this text._

## Question 4:

A brief blurb about accuracy: it is the percentage of correct classifications our model makes.

For a more mathematical approach, accuracy in modeling is defined as:

$$\text{Accuracy}\space =\space \frac{\text{TP} + \text{TN}}{\text{All Points in the Data}},$$

where $\text{TP}$, True Positive, is the number of data points the model correctly predicted as positive (e.g. surviving) and $\text{TN}$, True Negative, is the number of data points the model correctly predicted as negative (e.g. not surviving).
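
As a worked example of the formula above, the sketch below counts $\text{TP}$ and $\text{TN}$ by hand on a small set of made-up labels. The lists `y_true` and `y_pred` are hypothetical and exist only to show the arithmetic.

```python
# Hypothetical labels: 1 = positive, 0 = negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # ground truth
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # model predictions

# True positives: predicted 1 and actually 1.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
# True negatives: predicted 0 and actually 0.
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)
print(tp, tn, accuracy)  # 3 3 0.75
```

Here six of the eight predictions match the ground truth, so the accuracy is $6/8 = 0.75$.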

Given this information, we now have the predicted labels our model made for the set you chose in Question 3. How do we compare these predictions to the actual, ground truth labels of our data? What sklearn function could do this for us and what arguments would we pass in? What does a high accuracy indicate? How about a low one?

*Hint*: Read up on what [this](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) function does. 

_Type your answer here, replacing this text._

## Question 5:

Now that we saw how our model was doing on the training set, let's see how it does on the testing set. After doing the same thing we did in Question 4, we see that the accuracy is not as good as the accuracy on the training set. List some possible explanations as to why this may be and how you would address them.

_Type your answer here, replacing this text._

# Section 2: Working with Machine Learning


Now that we have some of the basics down, let's put our knowledge into practice with some coding. This section will consist primarily of coding, making use of sklearn's functions. We'll be working with a well-known dataset: the Titanic dataset.

The primary objective of using this dataset is to predict whether or not a passenger on the Titanic survived based on certain characteristics of the passenger. Here, because we're predicting a binary outcome ($0$ if the passenger died and $1$ if they survived), we'll be using ```LogisticRegression```.

Load in all the necessary imports below.

In [None]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

## Question 1

Load ```titanic.csv``` into the notebook using pandas.

In [None]:
# YOUR CODE HERE

## Question 2

Now that we have our data loaded in, we want to split our data up, with our training data containing $60\%$ of the data and the testing data containing the remaining $40\%$. Using ```train_test_split```, create two new variables, ```train``` and ```test```, each of which should be a separate DataFrame.

In [None]:
train, test = # YOUR CODE HERE

## Question 3

Looking at the original dataset, there are a lot of missing values in certain columns! Rather than dealing with these missing values, we'll ignore them for now by leaving those columns out of our model. For simplicity, we'll only use the passenger's ```Sex``` as a feature. Run the cell below.

In [None]:
# DO NOT EDIT THIS CODE
train = train[['Survived', 'Sex']]
test = test[['Survived', 'Sex']]

The following cell converts a passenger's sex into a numeric value: $0$ for male and $1$ for female. This is because modeling functions only accept numeric values. Run the cell below.

In [None]:
# DO NOT EDIT THIS CODE
train = train.replace({'Sex' : {'male' : 0, 'female' : 1}})
test = test.replace({'Sex' : {'male' : 0, 'female' : 1}})

In [None]:
train.head()

In [None]:
test.head()

Now that we've isolated only the columns we want to work with, create a ```LogisticRegression``` model below and fit it with the training data.

**NOTE**: Because we are only working with one feature, you may get an error saying: "Expected 2D array, got 1D array instead:". When you're passing in the feature, pass it in as a *DataFrame*, not as a Series. This note carries on to the rest of the questions.
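
To see the difference the note above describes, the sketch below compares the two ways of selecting a single column. The DataFrame `toy` is a hypothetical stand-in with the same column names as the assignment, not the real data.

```python
import pandas as pd

# Hypothetical toy frame, invented for illustration only.
toy = pd.DataFrame({'Survived': [0, 1, 1], 'Sex': [0, 1, 1]})

as_series = toy['Sex']    # single brackets give a 1D Series
as_frame = toy[['Sex']]   # double brackets give a 2D DataFrame with one column

print(type(as_series).__name__, as_series.shape)  # Series (3,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (3, 1)
```

The shape `(3, 1)` is the two-dimensional form (rows by features) that the error message is asking for.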

In [None]:
lr = # YOUR CODE HERE
lr.fit(...)

## Question 4

Now that we've created and fit our model, evaluate how our model is doing on the *training set*. Use ```accuracy_score``` to simplify your calculations.

In [None]:
# YOUR CODE HERE

## Question 5

Now evaluate your model on the *testing set* and report the accuracy. How does it compare with the accuracy on the training set?

In [None]:
# YOUR CODE HERE

_Type your answer here, replacing this text._

## Question 6

That's a pretty good accuracy for both the training and test set from just predicting on one feature! How would you recommend we improve this accuracy? This could range from using more features to using a different model, etc.

For reference, this is what the whole dataset looks like:

In [None]:
pd.read_csv('titanic.csv').head()

_Type your answer here, replacing this text._