For this exercise, we'll need several things from `sklearn`:

```
from sklearn import decomposition
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
```

We'll also need three other modules: `pandas`, `plotnine`, and `sqlite3`.

Run the following cell.

It downloads a database file called `friday.db` from the web into the file system hosting this notebook. The data is modified from that used in [Smith et al., 1988](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2245318/).

```
!wget https://github.com/YCMI/summer-course-2020/raw/master/friday.db
```

The database contains three tables, with columns as below:

**demographics**: patient_id, dob, gender, children, state

**patienthistory**: patient_id, diabetespedigree, diagnoses

**testresults**: patient_id, diastolic, systolic, bicepskinthickness, seruminsulin, glucose, bmi


Connect to the `friday.db` database and look at a sample of each table to see how the data is structured.
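A minimal sketch of connecting and peeking at a table. To stay self-contained it builds a throwaway in-memory database with invented `demographics` rows; in the notebook you would pass `'friday.db'` to `sqlite3.connect` instead. The sample values, including how `gender` is encoded, are assumptions for illustration only.

```python
import sqlite3
import pandas as pd

# Assumption: an in-memory database stands in for friday.db here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE demographics "
    "(patient_id INTEGER, dob TEXT, gender TEXT, children INTEGER, state TEXT)"
)
conn.executemany(
    "INSERT INTO demographics VALUES (?, ?, ?, ?, ?)",
    [
        (1, "1960-05-01", "female", 2, "CT"),  # invented rows
        (2, "1971-11-23", "male", 0, "NY"),
        (3, "1985-02-14", "female", 1, "CT"),
    ],
)
conn.commit()

# List the tables in the database, then look at a few rows of one of them.
tables = pd.read_sql_query(
    "SELECT name FROM sqlite_master WHERE type = 'table'", conn
)
sample = pd.read_sql_query("SELECT * FROM demographics LIMIT 5", conn)
print(tables)
print(sample)
```

Repeating the `LIMIT 5` query for each table name gives a quick view of how every table is structured.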

Make a bar chart comparing the number of male and female patients per state in our dataset.

We wish to test the hypotheses that for women, age, number of children, diabetespedigree (family history), blood pressure, serum insulin, glucose levels, and BMI are predictive of whether or not they have diabetes.

Use SQL to get a DataFrame of just the females with information on their: dob, number of children, diabetespedigree, diagnoses, diastolic, systolic, bicepskinthickness, seruminsulin, glucose, and bmi.
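One possible shape for this query is a three-way join on `patient_id`. The sketch below runs against an invented in-memory copy of the schema so that it is self-contained. It assumes females are stored as `'female'` in the `gender` column; check your table samples for the actual encoding before reusing the `WHERE` clause.

```python
import sqlite3
import pandas as pd

# Assumption: toy tables with invented rows mirror the friday.db schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE demographics (patient_id INTEGER, dob TEXT, gender TEXT,
                           children INTEGER, state TEXT);
CREATE TABLE patienthistory (patient_id INTEGER, diabetespedigree REAL,
                             diagnoses TEXT);
CREATE TABLE testresults (patient_id INTEGER, diastolic REAL, systolic REAL,
                          bicepskinthickness TEXT, seruminsulin REAL,
                          glucose REAL, bmi REAL);
INSERT INTO demographics VALUES (1, '1960-05-01', 'female', 2, 'CT');
INSERT INTO demographics VALUES (2, '1971-11-23', 'male', 0, 'NY');
INSERT INTO patienthistory VALUES (1, 0.6, 'diabetes');
INSERT INTO patienthistory VALUES (2, 0.2, '');
INSERT INTO testresults VALUES (1, 72, 120, '3.5 cm', 80, 148, 33.6);
INSERT INTO testresults VALUES (2, 66, 110, '29 mm', 60, 85, 26.6);
""")

query = """
SELECT d.dob, d.children, h.diabetespedigree, h.diagnoses,
       t.diastolic, t.systolic, t.bicepskinthickness,
       t.seruminsulin, t.glucose, t.bmi
FROM demographics AS d
JOIN patienthistory AS h ON h.patient_id = d.patient_id
JOIN testresults AS t ON t.patient_id = d.patient_id
WHERE d.gender = 'female'  -- assumed encoding of the gender column
"""
females = pd.read_sql_query(query, conn)
print(females)
```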

Let's clean up this data a bit.

Create a new column `age` and estimate it from the date of birth by assuming that every year has exactly 365.25 days.

Hint: You'll want to subtract dates and divide by `pd.to_timedelta('365.25 days')`
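As a sketch of the date arithmetic in the hint, the example below uses two invented dates of birth and a fixed reference date so the result is reproducible. In the notebook you would more likely use today's date (for example `pd.Timestamp.today()`) as the reference, and it assumes `dob` is stored in a format that `pd.to_datetime` can parse.

```python
import pandas as pd

# Invented dates of birth, for illustration only.
df = pd.DataFrame({"dob": ["2000-01-01", "1990-01-01"]})

# Fixed reference date so the output does not change from day to day.
reference = pd.Timestamp("2020-01-01")

# Subtracting dates gives timedeltas; dividing by a timedelta gives floats.
df["age"] = (reference - pd.to_datetime(df["dob"])) / pd.to_timedelta("365.25 days")
print(df)
```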

Derive a new column `has_diabetes` (with `True` or `False` values) based on the `diagnoses` column.

Hint: you can do this in one line.
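How the one-liner looks depends on how `diagnoses` is stored, which you should check in your own data. The sketch below assumes it is free text in which the word "diabetes" appears for diabetic patients; the sample strings are invented.

```python
import pandas as pd

# Assumption: diagnoses is free text; these values are made up.
df = pd.DataFrame({"diagnoses": ["diabetes", "hypertension; Diabetes", "", None]})

# Case-insensitive substring match; missing diagnoses count as False.
df["has_diabetes"] = df["diagnoses"].str.contains("diabetes", case=False, na=False)
print(df)
```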

The `bicepskinthickness` column has values with two different units -- cm and mm. Standardize this into a new column `bicepskinthickness_cm` measured in cm without units.

Hints: you could do this with the `re` module, but you don't need it. One approach is a for loop with an if statement that builds the new column one entry at a time, although other approaches work too. You can convert a string `my_numstr` to a number using e.g. `my_num = float(my_numstr)`.
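A loop-based sketch of the unit conversion follows. It assumes each entry is a string ending in `cm` or `mm`, such as `'3.5 cm'` or `'29 mm'`; the actual formatting in the database may differ, so inspect the column first. The sample values are invented.

```python
import pandas as pd

# Assumption: entries look like "<number> cm" or "<number> mm".
df = pd.DataFrame({"bicepskinthickness": ["3.5 cm", "29 mm", "4cm"]})

thickness_cm = []
for entry in df["bicepskinthickness"]:
    text = entry.strip()
    if text.endswith("mm"):
        # 10 mm per cm; float() tolerates the leftover space before the unit.
        thickness_cm.append(float(text[:-2]) / 10)
    elif text.endswith("cm"):
        thickness_cm.append(float(text[:-2]))
    else:
        raise ValueError(f"Unrecognized unit in {entry!r}")

df["bicepskinthickness_cm"] = thickness_cm
print(df)
```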

Are any of the columns missing data? Are they missing a little data or a lot of data? Should we keep the column(s) with missing data?

While we're at it, drop the redundant columns: `dob`, `diagnoses`, and `bicepskinthickness`

Standardize your numeric data (should be everything except the `has_diabetes` column) to have a mean of 0 and a standard deviation of 1.

Hint: You can do this with simple arithmetic operations as we did on Wednesday or use a `StandardScaler` as you learned yesterday. The arithmetic solution can be done in two lines.
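The arithmetic route can be sketched as below, on invented numbers. One detail worth knowing: pandas' `.std()` defaults to the sample standard deviation (`ddof=1`), whereas `StandardScaler` divides by the population standard deviation (`ddof=0`). The sketch passes `ddof=0` so the two approaches agree; with the default the results differ slightly.

```python
import pandas as pd

# Invented values, for illustration only.
df = pd.DataFrame({
    "glucose": [85.0, 148.0, 183.0, 89.0],
    "bmi": [26.6, 33.6, 23.3, 28.1],
    "has_diabetes": [False, True, True, False],
})

# Everything except the outcome column is treated as numeric.
numeric_cols = [col for col in df.columns if col != "has_diabetes"]

# Subtract each column's mean and divide by its standard deviation.
centered = df[numeric_cols] - df[numeric_cols].mean()
df[numeric_cols] = centered / df[numeric_cols].std(ddof=0)
print(df)
```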

# PCA

A good starting place when looking at a new dataset is to run PCA and visualize the first few principal components.

Remember, PCA is simply a rotation of our data that lets us view it along the axes with the greatest variation. This can reveal separation in our data if it exists along those axes, but it may not.

**Adapt yesterday's PCA code to work with today's data and visualize the data in PCA space. Color-code by `has_diabetes`.** (For concreteness, plot all possible pairs from the first three principal components... that is, show three graphs PC1-PC2, PC1-PC3, and PC2-PC3.)
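Since yesterday's code is not reproduced here, the following is only a sketch of the PCA step itself, run on random stand-in data. The column names and the random outcome labels are invented. The resulting `pcs` DataFrame has one column per component plus `has_diabetes`, which is a convenient shape for making the three scatter plots.

```python
import numpy as np
import pandas as pd
from sklearn import decomposition

# Assumption: random numbers stand in for the standardized predictors.
rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.normal(size=(50, 5)),
    columns=["age", "children", "glucose", "bmi", "seruminsulin"],
)
has_diabetes = rng.random(50) < 0.3  # invented labels

# Project the data onto its first three principal components.
pca = decomposition.PCA(n_components=3)
components = pca.fit_transform(X)

pcs = pd.DataFrame(components, columns=["PC1", "PC2", "PC3"])
pcs["has_diabetes"] = has_diabetes

# Fraction of the total variance captured by each component.
print(pca.explained_variance_ratio_)
```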

What do you make of the graphs?

# Training and testing sets

Split your data into training and test sets with 20% of the data being used for the test set. To ensure we all work with the same random split, let's specify a `random_state`, namely `random_state=42`.

If your predictor variables are in `X` and your outcome vector is in `y`, this can be done using:

```
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

Hint: how can you split your DataFrame into a predictor matrix `X` and outcome vector `y`?
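One way to answer the hint, sketched on an invented ten-row DataFrame: drop the outcome column to get `X` and select it to get `y`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented data, for illustration only.
df = pd.DataFrame({
    "glucose": range(10),
    "bmi": range(10, 20),
    "has_diabetes": [True, False] * 5,
})

X = df.drop(columns="has_diabetes")  # predictor matrix
y = df["has_diabetes"]               # outcome vector

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))
```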

# Cross-validation

Using 5-fold cross-validation (`cross_val_score`) on your *training* data only, plot the relationship between the number of estimators (from 10 to 100 in steps of 10) and the average cross-validation score of a `RandomForestClassifier` with a max depth of 4.

Hint: you did cross-validation yesterday in the disease classification exercises with logistic regression.
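The loop over forest sizes might look like the sketch below. It uses synthetic data from `make_classification` in place of your training set, and the `random_state=0` on the classifier is an added assumption to make the scores repeatable. For a classifier, `cross_val_score` reports accuracy by default.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Assumption: synthetic data stands in for X_train and y_train.
X_train, y_train = make_classification(n_samples=200, n_features=8, random_state=0)

n_estimators_options = list(range(10, 101, 10))  # 10, 20, ..., 100
mean_scores = []
for n_estimators in n_estimators_options:
    model = RandomForestClassifier(
        n_estimators=n_estimators, max_depth=4, random_state=0
    )
    # Five accuracy scores, one per fold; keep their average.
    fold_scores = cross_val_score(model, X_train, y_train, cv=5)
    mean_scores.append(fold_scores.mean())

best_n = n_estimators_options[int(np.argmax(mean_scores))]
print(best_n, max(mean_scores))
```

Putting `n_estimators_options` and `mean_scores` into a DataFrame gives the two columns needed for the plot.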

# Testing

Using the number of estimators that gave the best score in the validation phase, train a `RandomForestClassifier` on your entire training set, and `.score` its performance on the test set.

Print the confusion matrix. What does it mean?

Hint: looking at `y_test.value_counts()` might help with the interpretation.
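To help read the output, here is `confusion_matrix` on a tiny hand-made example. In scikit-learn's convention, rows correspond to the true classes and columns to the predicted classes, with classes in sorted order. The labels below are invented.

```python
from sklearn.metrics import confusion_matrix

# Invented labels: 0 = no diabetes, 1 = diabetes.
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[2 1]    row 0: true 0 -> 2 predicted as 0, 1 predicted as 1
#  [1 1]]   row 1: true 1 -> 1 predicted as 0, 1 predicted as 1
```

Each row sums to the number of true cases in that class, which is why `value_counts()` on the true labels helps with the interpretation.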

Finally, make a horizontal bar graph showing the fitted `RandomForestClassifier`'s `feature_importances_`, sorted in order of increasing importance.

Hints: you used `feature_importances_` in the Warfarin exercises. You'll need to specify a `stat` to `geom_bar` and you'll need to modify the coordinate system to make the graph horizontal.
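The data preparation for such a graph could be sketched as follows, leaving the plotting itself to you. The sketch fits a forest on synthetic data with invented feature names, sorts the importances, and makes `feature` an ordered categorical. The categorical step is one way to get a plotting library to keep the sorted order instead of falling back to alphabetical order.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Assumption: synthetic data and made-up feature names, for illustration.
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
feature_names = ["age", "glucose", "bmi", "children"]

model = RandomForestClassifier(n_estimators=50, max_depth=4, random_state=0)
model.fit(X, y)

# One row per feature, sorted from least to most important.
importances = pd.DataFrame(
    {"feature": feature_names, "importance": model.feature_importances_}
).sort_values("importance")

# Fix the category order to the sorted order so a plot preserves it.
importances["feature"] = pd.Categorical(
    importances["feature"], categories=importances["feature"].tolist(), ordered=True
)
print(importances)
```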