# DS 3000 Quiz 3

Due by: Tuesday Nov 14 @ 11:59 PM EST

Time Limit: You have 2 hours to complete the assignment once started

## Instructions

This quiz has 100 points total.

- You are welcome to post a private note on Piazza, but to keep a consistent testing environment for all students, we are unlikely to provide assistance.
- You may not contact other students with information about this quiz.
    - even saying "it was easy/hard" in a general sense can introduce a bias in favor of students who take the quiz earlier or later
- Under no circumstances should you share a copy of this quiz with anyone who isn't a member of the course staff.
- Take this quiz with open notes and feel free to access any online resource / documentation you'd like.  

### Submission Instructions
After completing the quiz, please follow the steps below to submit:
1. "Kernel" -> "Restart & Run All"
1. save your quiz file so that the saved copy reflects this latest run
1. upload the `.ipynb` to Gradescope **before** clicking submit
1. ensure that you can see your Jupyter notebook in the Gradescope interface after clicking "submit"

We emphasize the last step above because Gradescope has allowed students to "submit" without uploading a file.  It is your responsibility to ensure that you've actually submitted a file.

### Academic Integrity Pledge

Input your name below to sign the Academic Integrity Pledge before continuing with the quiz. Failure to do so will result in a score of **0**.

In [None]:
name = 'Student Name Here'
print(f'I, {name}, declare that the following work is entirely my own, and that I did not copy or seek help from any students who have currently or previously taken this course, nor from any online source other than private messages between myself and the professor on Piazza/via email.')

In [None]:
# the following modules may be necessary to complete the quiz
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.linear_model import Perceptron
import matplotlib.pyplot as plt
import scipy.stats as stats
import pylab as py

# Part 1: Linear Perceptron (40 points)

This problem will make use of the `titanic` dataset in the seaborn module. This dataset looks at the passengers on the Titanic, which famously sank in 1912. There are many features in the data set, but the ones relevant to this problem are:

- `survived`: a binary indicator identifying if the passenger survived (1) or did not survive (0)
- `sex`: the sex of the passenger (male, female)
- `age`: the age of the passenger (years)
- `fare`: the fare (dollars) the passenger paid for their ticket

The rest of this problem involves curating the data and then applying and interpreting the linear perceptron algorithm to it. **Advice:** Keep track of time; this is worth 40% of the assignment. If it takes you longer than about 45 minutes, you should move on and try the other Part before coming back to this one.

**Note Also:** Your response need not build any functions, but be sure to name variables appropriately and document your process.

In [None]:
df_titanic = sns.load_dataset('titanic')
df_titanic.head()

#### Part 1.1 (20 pts)
Curate the data set so that it is in the final form you will use for the analysis, such that:

- You drop all rows with missing values of the **four** features we are interested in.
  - Do **not** drop rows with missing values in any other feature.
- The `sex` column is turned into an indicator (dummy variable) column of 0's and 1's.
- You scale normalize the `age` and `fare` features.
- You put the three $x$ features (`sex`, `age`, `fare`) in a 2-d numpy array (called `X`) that also includes a bias column of 1's.
  - There are several ways to do this, but you may find the [`np.column_stack`](https://numpy.org/doc/stable/reference/generated/numpy.column_stack.html) function helpful.
- You put the output feature, `survived`, in a 1-d numpy array (called `y`).

Display the first few rows/observations of the `X` and `y` arrays.
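
As a reminder of the mechanics only, here is a minimal sketch of scaling features and adding a bias column with `np.column_stack`. It uses hypothetical toy arrays (`toy_a`, `toy_b`), not the `titanic` data, and it assumes that "scale normalize" means dividing each feature by its sample standard deviation; use the definition from the course notes if it differs.

```python
import numpy as np

# hypothetical toy features (NOT the titanic data)
toy_a = np.array([2.0, 4.0, 6.0, 8.0])
toy_b = np.array([10.0, 20.0, 10.0, 20.0])

# assumption: scale normalization divides by the sample standard deviation (ddof=1)
toy_a_scaled = toy_a / toy_a.std(ddof=1)
toy_b_scaled = toy_b / toy_b.std(ddof=1)

# place a column of 1's (the bias) next to the scaled feature columns
X_toy = np.column_stack([np.ones(len(toy_a)), toy_a_scaled, toy_b_scaled])

print(X_toy.shape)
print(X_toy[:2])
```

After scaling, each feature column has a standard deviation of 1, while the bias column stays constant at 1.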

#### Part 1.2 (20 pts)
Use Leave One Out Cross Validation (LOO-CV) and the `Perceptron` class from `scikit-learn` to determine how well the three features of `sex`, `age`, and `fare` predict whether a passenger would survive the Titanic sinking. You may use any settings you wish in your call of `Perceptron`, except **do not make `max_iter` more than 1000**, otherwise it will take too long to run.

Print out the cross validated accuracy score and provide a discussion **in a markdown cell** of what it means.
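
For reference, the general shape of a LOO-CV loop can be sketched as follows. This is an illustration on hypothetical, randomly generated toy clusters (`X_toy`, `y_toy`), not the quiz data, and the `Perceptron` settings shown (`max_iter=1000`, `random_state=0`) are just one possible choice.

```python
import numpy as np
from sklearn.linear_model import Perceptron

rng = np.random.default_rng(0)

# hypothetical toy data: two well-separated clusters (NOT the titanic data)
X_toy = np.vstack([rng.normal(-2, 0.5, size=(15, 2)),
                   rng.normal(2, 0.5, size=(15, 2))])
y_toy = np.array([0] * 15 + [1] * 15)

y_pred = np.empty_like(y_toy)
for i in range(len(y_toy)):
    # boolean mask that keeps every observation except the i-th
    keep = np.arange(len(y_toy)) != i

    # fit on the remaining observations, then predict the held-out one
    model = Perceptron(max_iter=1000, random_state=0)
    model.fit(X_toy[keep], y_toy[keep])
    y_pred[i] = model.predict(X_toy[i:i + 1])[0]

# cross-validated accuracy: share of held-out observations predicted correctly
accuracy = np.mean(y_pred == y_toy)
print(accuracy)
```

Each observation is predicted by a model that never saw it during fitting, which is what makes the resulting accuracy a cross-validated estimate.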

# Part 2: Simple Linear Regression (60 points)

One waiter recorded information about each tip he received over a period of a few months working in one restaurant, in total 244 tips. The data are stored in the `tips` dataset in the seaborn module. We will determine if, based on this waiter's data, we can predict the tip amount based on the total bill. While there are a few other features that might be helpful for predicting the tip amount, we will focus only on the total bill, so that the two features of interest are:

- `total_bill`: the total bill amount paid by the customer (in dollars)
- `tip`: the tip amount paid on top of the total bill (in dollars)

The data was reported in a collection of case studies for business statistics, published in 1995 ([source](https://www.worldcat.org/title/Practical-data-analysis-:-case-studies-in-business-statistics/oclc/726362789))

In [None]:
df_tips = sns.load_dataset('tips')
df_tips.head()

#### Part 2.1 (10 points)

Visualize the relationship between `total_bill` and `tip` with a **well-labeled** scatter plot (I have loaded `matplotlib.pyplot` for you above, but you may use `seaborn` or `plotly` if you wish). Then, **in a markdown cell**, comment on your initial thoughts concerning:

- If there seems to be a relationship between `total_bill` and `tip`.
- If so, do you believe that relationship to be linear?

#### Part 2.2 (10 pts)

Calculate the line of best fit through the full set of points. You may use either `numpy` linear algebra, or `scikit-learn`'s `LinearRegression` class, loaded below. Round the slope and intercept to two decimals, and then interpret the slope and intercept in the context of the question.
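
As a refresher on the `numpy` route, here is a minimal sketch of a least-squares line fit. The data (`x_toy`, `y_toy`) are hypothetical points that lie exactly on $y = 1 + 2x$, not the `tips` data, and `np.linalg.lstsq` is only one of several equivalent ways to solve the problem.

```python
import numpy as np

# hypothetical toy data lying exactly on y = 1 + 2x (NOT the tips data)
x_toy = np.array([0.0, 1.0, 2.0, 3.0])
y_toy = 1.0 + 2.0 * x_toy

# design matrix with a bias column of 1's followed by the x feature
X_toy = np.column_stack([np.ones(len(x_toy)), x_toy])

# least-squares solution; the first coefficient is the intercept, the second the slope
coef, *_ = np.linalg.lstsq(X_toy, y_toy, rcond=None)
intercept, slope = coef

print(round(intercept, 2), round(slope, 2))
```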

In [None]:
from sklearn.linear_model import LinearRegression

#### Part 2.3 (10 pts)

Use LOO-CV to compute the cross-validated $R^2$ for the simple linear regression model. In other words:

1. Get a prediction for each observation by holding it out, fitting the line to the remaining observations, and then predicting the held-out observation with that line
2. Calculate $R^2$ using the `r2_score` function (imported below), your predictions from the cross validation, and the true $y$ values

**You may use either** `numpy` linear algebra, or `scikit-learn`'s `LinearRegression()` function, whichever you prefer.
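
The two steps above can be sketched on hypothetical toy data as follows. The arrays `x_toy` and `y_toy` are simulated for illustration and are not the `tips` data; the `scikit-learn` route is shown, but the `numpy` route works the same way inside the loop.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# hypothetical toy data: a noisy line (NOT the tips data)
x_toy = np.linspace(0, 10, 30)
y_toy = 3.0 + 0.5 * x_toy + rng.normal(0, 0.3, size=30)
X_toy = x_toy.reshape(-1, 1)  # scikit-learn expects a 2-d feature array

y_pred = np.empty(len(y_toy))
for i in range(len(y_toy)):
    # hold out observation i, fit on the rest, predict the held-out point
    keep = np.arange(len(y_toy)) != i
    model = LinearRegression().fit(X_toy[keep], y_toy[keep])
    y_pred[i] = model.predict(X_toy[i:i + 1])[0]

# cross-validated R^2: true values first, held-out predictions second
r2_cv = r2_score(y_toy, y_pred)
print(r2_cv)
```

Note that `r2_score` takes the true values as its first argument and the predictions as its second; swapping them changes the result.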

In [None]:
from sklearn.metrics import r2_score

#### Part 2.4 (10 pts)
Calculate the residuals using the true $y$ values and your cross validated predictions. Call these `residuals`. Then, run the following three code cells to produce the residual plots for the cross validated model fit.

In [None]:
plt.scatter(x = range(len(residuals)), y = residuals)
plt.xlabel('index')
plt.ylabel('residuals');

In [None]:
plt.scatter(x = X[:,1], y = residuals)
plt.xlabel('total_bill')
plt.ylabel('residuals');

In [None]:
stats.probplot(residuals, dist="norm", plot=py)
py.show()

#### Part 2.5 (20 pts)
Provide a summary paragraph about whether you believe this simple linear model is adequate for predicting the waiter's tips and what you learned from the model. Be sure to mention specific takeaways, including:

- If a general rule of thumb is to tip around 15\%, does the model suggest the waiter is being tipped enough?
- How do you interpret the cross validated $R^2$, and do you consider it to be a "good" value?
- Were all the assumptions for the model met? If not, which ones were not met, and how would you suggest addressing them?
- Can you think of any other improvements that could be (easily) made to the model?