# Wrap Up

## 1. Installing Python packages

Install packages in a Jupyter notebook on any machine, including your own! (If you don't have Python and Jupyter notebooks installed on your computer, you can find installation instructions [here](https://www.cs.williams.edu/~cs104/docs/setup-laptop.html).)
The following cell will install the three packages we have been using all semester:

In [None]:
!pip3 install -q cs104@git+https://github.com/cs104williams/cs104-toolbox
!pip3 install -q datascience@git+https://github.com/cs104williams/cs104-datascience
!pip3 install -q numpy 

We import packages that have been installed in order to use their features in our code:

In [None]:
# Second step: import packages 
from datascience import * 
from cs104 import * 
import numpy as np 
%matplotlib inline
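If an import fails, the usual cause is that the package was never installed in the environment the notebook is running in. Here is a minimal sketch of one way to check, using only the standard library. The helper name `is_installed` is made up for this example and is not part of any of the packages above.

```python
import importlib.util

def is_installed(module_name):
    """Return True if `module_name` can be found in the current environment."""
    return importlib.util.find_spec(module_name) is not None

# Report on the three packages installed above.
for name in ["numpy", "datascience", "cs104"]:
    print(name, "installed" if is_installed(name) else "MISSING")
```

A package reported as MISSING needs its `pip3 install` line run again before the import will work.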

Now let's install three different packages we haven't seen yet...

In [None]:
!pip3 install -q pandas 
!pip3 install -q scikit-learn
!pip3 install -q seaborn

In [None]:
import pandas as pd
from sklearn import linear_model, svm  # import just the submodules we use below
import matplotlib.pyplot as plots
import seaborn as sns

## 2. Pandas

[Pandas](https://pandas.pydata.org/) is a library for manipulating and exploring data. Its DataFrames are similar to Tables, but with more functionality.

In [None]:
penguins = pd.read_csv('https://raw.githubusercontent.com/mcnakhaee/palmerpenguins/master/palmerpenguins/data/penguins.csv')
penguins = penguins.drop(columns = ['year'])
penguins.head(6)

In [None]:
# pandas gives us a quick summary of the data
penguins.info()

In [None]:
penguins = penguins.dropna()
penguins.head(6)

In [None]:
print('num rows (after dropping nulls) = ', len(penguins))

In [None]:
penguins[penguins.species == 'Adelie'].mean(numeric_only=True)
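The cell above filters to one species with a boolean mask and then averages. As a rough sketch of how the same idea extends to every species at once, pandas offers `groupby`. The small DataFrame below is made-up toy data (not the real penguin measurements), used only so the example runs without downloading anything.

```python
import numpy as np
import pandas as pd

# Made-up toy values, NOT the real penguin data.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "bill_length_mm": [39.0, np.nan, 46.0, 48.0],
    "body_mass_g": [3700.0, 3800.0, 5000.0, 5200.0],
})

# Drop rows with missing values, as we did for penguins.
clean = toy.dropna()

# One row of means per species, instead of filtering one species at a time.
means = clean.groupby("species").mean(numeric_only=True)
print(means)
```

This plays a role similar to grouping a Table by a column and averaging the rest.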

## 3. Seaborn

[Seaborn](https://seaborn.pydata.org/) is a data visualization library. It interfaces nicely with pandas and makes very pretty plots!

In [None]:
# The cs104 library changes the default plot settings.
# This line switches to seaborn's default theme instead.
sns.set_theme()

In [None]:
sns.scatterplot(penguins, x="bill_length_mm", y="bill_depth_mm", hue="species");

A big advantage of seaborn is that you can quickly visualize different subsets of data:

In [None]:
sns.scatterplot(penguins, x="bill_length_mm", y="bill_depth_mm", hue="island");
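Besides recoloring with `hue`, seaborn can also split subsets into separate panels. The following is a small sketch using `sns.relplot` with its `col` parameter. The DataFrame holds made-up toy values (not the real penguin data) so the example is self-contained.

```python
import pandas as pd
import seaborn as sns

# Made-up toy values, NOT the real penguin data.
toy = pd.DataFrame({
    "island": ["Biscoe", "Biscoe", "Dream", "Dream"],
    "species": ["Adelie", "Gentoo", "Adelie", "Chinstrap"],
    "bill_length_mm": [38.0, 47.0, 39.0, 49.0],
    "bill_depth_mm": [18.0, 15.0, 19.0, 18.5],
})

# One scatter plot panel per island, colored by species.
g = sns.relplot(data=toy, x="bill_length_mm", y="bill_depth_mm",
                hue="species", col="island")
```

Swapping `toy` for `penguins` should give one panel per island in the real data.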

In [None]:
sns.lmplot(penguins, x="bill_length_mm", y="bill_depth_mm", hue="species");

In [None]:
sns.kdeplot(penguins, x="bill_length_mm", y="bill_depth_mm", hue="species");

In [None]:
sns.pairplot(penguins, hue="species");

In [None]:
fig, ax = plots.subplots(1,2,figsize=(12,5))
sns.scatterplot(penguins, x="bill_length_mm", y="bill_depth_mm", hue="species", 
                size="body_mass_g", sizes=(30, 300), alpha=0.5, 
                ax=ax[0])
sns.violinplot(penguins, x="body_mass_g", y="species", hue="sex", 
               ax=ax[1])
fig.tight_layout()

## 4. sklearn  (Scikit-Learn)

[sklearn](https://scikit-learn.org), pronounced "sci-kit learn", is a library for machine learning (statistical pattern matching).

### Linear Regression with sklearn

In [None]:
from sklearn import linear_model

from sklearn.metrics import r2_score as r2_score_sklearn
from sklearn.metrics import mean_squared_error as mse_sklearn
from sklearn.feature_selection import r_regression

In [None]:
# Some data wrangling to get our x and y values, this time with pandas...
chinstrap = penguins[penguins['species'] == 'Chinstrap']
x = chinstrap['bill_length_mm'].to_numpy().reshape(-1, 1)
y = chinstrap['bill_depth_mm'].to_numpy()

In [None]:
model = linear_model.LinearRegression()
model.fit(x, y)

In [None]:
print('slope    ', model.coef_[0])
print('intercept', model.intercept_)
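As a sanity check, the slope and intercept that sklearn reports should match the standard correlation-based formulas, slope = r · SD(y) / SD(x) and intercept = mean(y) − slope · mean(x). Below is a minimal sketch on made-up toy numbers (not the Chinstrap data). It also shows why we call `reshape(-1, 1)`: sklearn expects the features as a 2-D array with one row per sample and one column per feature.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up toy values, NOT the penguin measurements.
x_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_toy = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Slope and intercept from the correlation coefficient.
r = np.corrcoef(x_toy, y_toy)[0, 1]
slope_by_hand = r * np.std(y_toy) / np.std(x_toy)
intercept_by_hand = np.mean(y_toy) - slope_by_hand * np.mean(x_toy)

# sklearn wants x with shape (n_samples, n_features), here (5, 1).
toy_model = LinearRegression().fit(x_toy.reshape(-1, 1), y_toy)

print('by hand:', slope_by_hand, intercept_by_hand)
print('sklearn:', toy_model.coef_[0], toy_model.intercept_)
```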

In [None]:
y_hat = model.predict(x)
sns.scatterplot(chinstrap, x='bill_length_mm', y='bill_depth_mm')
plots.plot(x, y_hat, color='r', lw=2);

A whole lot of metrics we might want are already implemented in sklearn. 

In [None]:
print('Pearson Correlation:', r_regression(x, y)[0])
print('MSE:                ', mse_sklearn(y, y_hat))
print('R2 Score:           ', r2_score_sklearn(y, y_hat))
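To see what two of these metrics measure, here is a sketch that computes them by hand with numpy. The function names `mse_by_hand` and `r2_by_hand` are invented for this example, and the arrays are made-up toy values rather than our penguin predictions.

```python
import numpy as np

def mse_by_hand(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    return np.mean((y_true - y_pred) ** 2)

def r2_by_hand(y_true, y_pred):
    """R2: 1 minus (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

# Made-up toy values.
y_true_toy = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_toy = np.array([1.5, 1.5, 3.5, 3.5])

print('MSE:', mse_by_hand(y_true_toy, y_pred_toy))
print('R2: ', r2_by_hand(y_true_toy, y_pred_toy))
```

Here every residual is ±0.5, so the MSE is 0.25. The residual sum of squares is 1 and the total sum of squares is 5, so the R2 score is 0.8.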

### Non-linear Regression

New: let's fit a non-linear regression curve with sklearn!

In [None]:
# Support vector regression with a polynomial kernel (degree 3 by default),
# which can fit a non-linear curve
model_nonlinear = svm.SVR(kernel='poly')
model_nonlinear.fit(x, y)

In [None]:
# Plot the linear and non-linear predictions over the same range of x values
x_range = np.arange(42.5, 57.5, 0.1).reshape(-1, 1)

y_hat_linear = model.predict(x_range)
y_hat_nonlinear = model_nonlinear.predict(x_range)

sns.scatterplot(chinstrap, x='bill_length_mm', y='bill_depth_mm')
plots.plot(x_range, y_hat_linear, color='r', label='linear', lw=2);
plots.plot(x_range, y_hat_nonlinear, color='b', label='nonlinear', lw=2)
plots.title("Linear and Nonlinear models")
plots.legend();
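Note that `SVR(kernel='poly')` is support vector regression, which is not the same thing as ordinary least-squares polynomial regression. For comparison, here is a sketch of that other technique: expand x into polynomial features, then fit an ordinary linear regression on them. The data is a made-up exact quadratic (not the Chinstrap data), chosen so the fit is easy to verify.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up toy data: an exact quadratic, y = x^2 + 1.
x_toy = np.linspace(-2, 2, 9).reshape(-1, 1)
y_toy = x_toy.ravel() ** 2 + 1

# Step 1 turns each x into the features (1, x, x^2);
# step 2 fits least-squares linear regression on those features.
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(x_toy, y_toy)

print(poly_model.predict(np.array([[3.0]])))  # the true curve gives 3^2 + 1 = 10
```

Because the toy data is exactly quadratic, a degree-2 fit can reproduce it essentially perfectly. Real data is noisier, and a higher degree is not automatically better.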

Take Machine Learning to learn the process for evaluating which model is better!