In [1]:
import seaborn as sns
import numpy as np
import pandas as pd
from datascience import *


# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)
from matplotlib import patches

## Part 1: a quick intro to pandas

Throughout this semester, we've used the `datascience` library: an educational tool that makes functions from matplotlib, pandas, numpy, scipy, and other libraries easier for beginner Python coders to use. In practice, however, most data scientists do not use the `datascience` library; they use those other libraries directly.

In industry, the main tool for **tabular data analysis and manipulation** is `pandas`, whose name is derived from the terms "panel data" and "Python data analysis". You'll notice that pandas dataframes look very similar to the `Table` data type we've seen before, but with a few syntax changes, we get a much more powerful tool.
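To get a feel for the syntax before touching real data, here is a minimal sketch on a tiny dataframe. The name `toy` and every number in it are made up for illustration; this is not the 538 dataset.

```python
import pandas as pd

# Made-up numbers for illustration only; not the 538 dataset.
toy = pd.DataFrame({
    "state": ["AA", "BB", "CC", "DD"],
    "total": [10.0, 20.0, 15.0, 12.0],
    "alcohol": [3.0, 5.0, 6.0, 2.0],
})

first_rows = toy.head(2)                 # first 2 rows, still a DataFrame
one_column = toy["total"]                # single brackets with a name -> Series
two_columns = toy[["state", "total"]]    # brackets with a list of names -> DataFrame

print(type(one_column).__name__)    # Series
print(type(two_columns).__name__)   # DataFrame
print(one_column.loc[2])            # 15.0, the value at index label 2
```

The main thing to notice is the datatype: one column name gives back a Series, while a list of column names gives back a (smaller) DataFrame.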

In [None]:
# Let's use a dataset on car crashes, collected by 538
# It's going to be a pandas dataframe; we can read in data with pd.read_csv
crashes = ...

In [None]:
# To analyze a dataset, we can do a few things
# head will let us view a few rows (like tbl.show())
...

In [None]:
# same with tail
...

In [None]:
# Now, how do we access data from this table?
# To get a column, we can use bracket notation
...
# this is called a series; what if we wanted a specific value from this series?


In [None]:
# I can get multiple columns, but look at the datatype
...

So, what if I wanted to access specific rows/columns from the table? Bracket notation works well for columns, but pandas offers two better tools, the indexers `loc` and `iloc`, for getting specific rows/columns.

Notice the numbers (0, 1, 2, 3, 4) on the left side of the dataframe. This is called the **index**. This will matter when we use loc.
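As a rough sketch of the difference between the two, the example below uses a small made-up dataframe (the name `toy` and its values are hypothetical, not the crashes data). `loc` looks things up by index label, while `iloc` looks them up by integer position.

```python
import pandas as pd

# Made-up numbers for illustration only.
toy = pd.DataFrame({
    "state": ["AA", "BB", "CC", "DD"],
    "total": [10.0, 20.0, 15.0, 12.0],
    "alcohol": [3.0, 5.0, 6.0, 2.0],
})

# .loc selects by index *label*; label slices include both endpoints.
by_label = toy.loc[1:2, "alcohol"]       # rows labeled 1 and 2

# .iloc selects by integer *position*; the end of a slice is excluded.
by_position = toy.iloc[1:2, 2]           # only the row at position 1

# With a string index, .loc takes the strings and .iloc still takes positions.
by_state = toy.set_index("state")
print(by_state.loc["CC", "total"])       # 15.0
print(by_state.iloc[2, 0])               # 15.0 (third row, first remaining column)
```

Note that `set_index` returns a new dataframe rather than changing `toy` in place, which is why the result is assigned to a new name.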

In [None]:
crashes.head()

In [None]:
# df.loc is indexed with square brackets and takes 2 arguments: df.loc[rows, columns]
# Let's find the total crashes where alcohol was involved for index 10 to 20
...

In [None]:
# What states do these values correspond to?
...

In [None]:
# It's a bit annoying to do this, so we can actually just set the states as the index
crashes = ...
crashes.head()

In [None]:
# Now, I can find specific states using .loc
...

In [None]:
# What if I tried using df.iloc?
...

In [None]:
# iloc works with integer positions (0, 1, 2, ...), not index labels
...
# so even though the index holds strings, we can use actual positions instead

So, now that we know how to work with pandas dataframes, let's do some data manipulations, primarily the equivalent of `tbl.where` and adding/modifying columns.

As a note: pandas offers **a lot** more features, such as grouping (`df.groupby("col").agg(func)`), pivoting (`df.pivot`), applying (`df.apply(func)`), and joining (`df.merge()`). These are similar to the datascience tools but support more advanced use cases, for example outer/inner/left/right joins.
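For a taste of grouping and joining, here is a small sketch. The `players` and `cities` tables are invented for illustration only.

```python
import pandas as pd

# Made-up data for illustration only.
players = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "team": ["X", "X", "Y", "Z"],
    "pts": [10, 20, 30, 40],
})
cities = pd.DataFrame({"team": ["X", "Y"], "city": ["Xville", "Ytown"]})

# Grouping: mean points per team (similar to tbl.group with a collect function).
mean_pts = players.groupby("team")["pts"].agg("mean")

# Joining: how="inner" keeps only teams found in both tables;
# how="left" keeps every row of the left table and fills gaps with NaN.
inner = players.merge(cities, on="team", how="inner")
left = players.merge(cities, on="team", how="left")

print(mean_pts["X"])            # 15.0
print(len(inner), len(left))    # 3 4
```

Team Z has no match in `cities`, so it disappears from the inner join but survives the left join with a missing `city`.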

Series also offer more features than plain numpy arrays, including counting how often each value appears (`series.value_counts()`) and finding unique values (`series.unique()`).
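A quick sketch of those Series methods, plus the array-style arithmetic that Series support, using made-up values:

```python
import pandas as pd

# Made-up values for illustration only.
teams = pd.Series(["X", "Y", "X", "Z", "X"])
counts = teams.value_counts()    # count per distinct value, most frequent first
distinct = teams.unique()        # distinct values, in order of first appearance

# Arithmetic is element-wise, just like with numpy arrays.
total = pd.Series([10.0, 20.0, 16.0])
alcohol = pd.Series([3.0, 5.0, 4.0])
prop = alcohol / total

print(counts["X"])       # 3
print(list(distinct))    # ['X', 'Y', 'Z']
print(list(prop))        # [0.3, 0.25, 0.25]
```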

In [None]:
# How do we work with pandas series? Just like arrays!
# Let's find the proportion of crashes that involved alcohol
alcohol_prop = ...

In [None]:
# If I want to add that to the table, all I need to do is use brackets
...
crashes.head()

## Part 2: using sklearn

scikit-learn (imported as `sklearn`) is a machine learning library that offers tools to solve both **supervised learning** problems (regression, classification) and **unsupervised learning** problems (e.g. clustering).

It **abstracts** away a lot of the mathematics, linear algebra, and code involved in creating a proper model. Combined with pandas, it makes our lives a lot easier when we want to build models (especially complex, multi-variable models).

However, **a word of warning:** just because machine learning makes this process easier, that doesn't mean you should ignore the data science lifecycle. Always make sure you're using your tools correctly, visualizing your data, cleaning it as necessary, making the correct assumptions, etc.
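To show how little code the abstraction leaves us with, and where a sanity check fits in, here is a minimal sketch on synthetic data. The dataframe `df`, its columns `x` and `y`, and the "true" line `y = 2x + 1` are all assumptions made up for this example; they are not the NBA data.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data for illustration: y is roughly 2x + 1 plus random noise.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.uniform(0, 10, size=200)})
df["y"] = 2 * df["x"] + 1 + rng.normal(0, 1, size=200)

# Check the assumption first: is the relationship roughly linear?
r, _ = pearsonr(df["x"], df["y"])

# Hold out 20% of the rows so that evaluation uses unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    df[["x"]], df["y"], test_size=0.2, random_state=42
)

# Fit ordinary least squares on the training rows only.
model = LinearRegression(fit_intercept=True).fit(X_train, y_train)
slope, intercept = model.coef_[0], model.intercept_

print(round(r, 2))
print(slope, intercept)    # should land near 2 and 1
```

Note the double brackets in `df[["x"]]`: sklearn expects the features as a 2-D table (a DataFrame), even when there is only one column.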

In [None]:
# Our goal today: using data about NBA players, let's predict the number of points
# a player will average, based on their 3-point field goal attempts

nba = ... # import our data: it's called "nba18-19.csv"
nba.head()

In [None]:
# Checking if linear regression is a decent tool
...

In [None]:
# correlation coefficient
from scipy.stats import pearsonr

...

In [None]:
# Looks good! So, now that we have this, we can start using sklearn
# First off: let's import our tools

from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error 

from sklearn.model_selection import train_test_split

In [None]:
# Before we create a model:
# split our data into training and testing sets (using an 80/20 split)
X_train, X_test, y_train, y_test = ...
X_train

In [None]:
# Steps to create a model:
# 1) create a model object
our_first_model = ... # we want an intercept!
our_first_model

In [None]:
# 2) "train" the model: the computer will use the "ordinary least squares" approach
# implemented by scipy.linalg
...

In [None]:
# 3) use the model to predict
# approach 1: building a line
slope, intercept = ... # slope, intercept

# If a player takes four 3-point attempts, what would be their predicted PTS?


In [None]:
# approach 2: using sklearn predict
...

In [None]:
# 4) predicting and evaluating our model (is this a good predictor?)
# We will use our /test/ set, since this is supposed to be "unseen" data
# (remember, we used the /train/ set to build the model - we want to avoid overfitting/bias)

# 1st thing to check: r**2 (tldr: what proportion of the variation in PTS does the model explain?)
...

In [None]:
# 2nd thing to check: "loss" = mean squared error (how "off" are our predictions?)
predictions = ...
...
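To make these two metrics less of a black box, here is a small sketch that computes both by hand and compares them with sklearn's versions. The arrays `y_true` and `y_pred` are made-up numbers, not output from the NBA model.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted values for illustration only.
y_true = np.array([10.0, 12.0, 15.0, 20.0, 23.0])
y_pred = np.array([11.0, 12.0, 14.0, 19.0, 24.0])

residuals = y_true - y_pred

# Mean squared error: the average squared residual.
mse = np.mean(residuals ** 2)

# R^2: 1 minus (residual sum of squares / total sum of squares),
# i.e. the share of the variation in y that the predictions explain.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse)                                                   # 0.8
print(np.isclose(mse, mean_squared_error(y_true, y_pred)))   # True
print(np.isclose(r2, r2_score(y_true, y_pred)))              # True
```

For a `LinearRegression` model, `model.score(X, y)` returns this same R^2 value, computed from the model's predictions on `X`.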

In [None]:
# Graphing our predictions on the test data!
sns.scatterplot(x = "3PA", y = "PTS", data = nba, alpha = 0.6);
plt.scatter(X_test["3PA"], predictions);


In [None]:
# residual plot; another diagnostic
residuals = ...

# what are we looking for?
plt.scatter(X_test["3PA"], residuals);
plt.axhline(0, color = "red")
plt.ylim(-15, 16);
plt.ylabel("Residual");
plt.xlabel("3PA");

So, why is sklearn useful? It makes multivariable regression (ordinary least squares with more than one predictor) really, really easy! When you fit the model, just pass in a dataframe containing all the variables you want to predict with.

In the next cell, we'll use multiple variables (3PA and AST - assists) to predict PTS.
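Before using the real data, here is a sketch of what a two-variable model does under the hood, on synthetic data. The columns `x1` and `x2` and the coefficients 3, 0.5, and 2 are assumptions chosen for this example so that we know what the fit should recover.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration: y = 3*x1 + 0.5*x2 + 2 exactly (no noise).
rng = np.random.default_rng(1)
X = pd.DataFrame({"x1": rng.uniform(0, 10, 50), "x2": rng.uniform(0, 10, 50)})
y = 3 * X["x1"] + 0.5 * X["x2"] + 2

model = LinearRegression(fit_intercept=True).fit(X, y)
m1, m2 = model.coef_           # one slope per column, in column order
b = model.intercept_

# A prediction is just the line equation applied to each row.
by_hand = m1 * X["x1"] + m2 * X["x2"] + b
by_sklearn = model.predict(X)

print(np.round([m1, m2, b], 3))            # close to 3, 0.5, and 2
print(np.allclose(by_hand, by_sklearn))    # True
```

With real, noisy data the fitted slopes will not match any "true" values exactly, but `predict` still applies the same equation.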

In [None]:
# Notice that this is no longer y = mx + b
# This will be: y = m1x1 + m2x2 + b, because we have multiple x-variables

# Step 1: train/test split
X_train, X_test, y_train, y_test = train_test_split(nba[["3PA", "AST"]], nba["PTS"], 
                                                    train_size = 0.8, test_size = 0.2, random_state = 42)

In [None]:
# Step 2 and 3: create and fit model
multivar_model = LinearRegression(fit_intercept = True) # we want an intercept!
multivar_model.fit(X_train, y_train)

In [None]:
mvslope, mvint = multivar_model.coef_, multivar_model.intercept_
mvslope, mvint
# Line: y_pred = ...

In [None]:
# Step 4: predict with the model and evaluate
multivar_preds = multivar_model.predict(X_test)
multivar_model.score(X_test, y_test) # r**2

In [None]:
mean_squared_error(y_test, multivar_preds) # notice that it fits our data better than the one-variable model!

There is a lot of mathematical and statistical theory behind these tools, so if you are interested in this content, check out DATA 200 (which focuses a lot on modeling in data science and properly applying these tools to real data).

This lesson was a very quick overview of the field, but hopefully it piques your interest.