# Fantasy Football

In this notebook, I attempt to predict game statistics for the NFL player Tom Brady.
These stats could be used in a fantasy football league.

This is a continuation of the previous notebook where we defined and created our `sql` database (using the python module `sqlite3`).
_Now we do not need to interface with `nflgame` to access the data we want._

In this notebook we will load data from the database and make our first predictions.
My predictions will be compared with those from Yahoo! and ESPN, and ultimately graded on the actual outcome in each game.

_This notebook is intended to run on Google Colab, so the repository must first be cloned to make the data available. I don't yet know how to clone a git repository into Google Colab and then run a notebook from that clone, so this step is a bit redundant._

In [None]:
import sqlite3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # used for the performance plots at the end

# Load the modules from scikit-learn
from sklearn import svm
from sklearn import neighbors
from sklearn.metrics import explained_variance_score  # quantifying accuracy of regression
from sklearn.model_selection import train_test_split

We will now prepare the data for training.  We will use `numpy` to compute averages of each column over a growing number of rows (all games played so far) and feed those into the training.

I understand this setup seems a little convoluted:
- convert `nflgame` data into `sql` database
- load `sql` database and re-calculate data, put into `pandas` dataframe
- load into ML tools

At the moment I want some practice with `sql`, and I already know how to interface `pandas` dataframes with ML tools, hence this ~convoluted workflow.  
For more developed workflows, I hope to learn more about `sql` and use it directly to interface data with ML tools, or I may just jump straight from `nflgame` into `pandas`.
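
As a rough illustration of the workflow listed above, here is a minimal sketch that builds a tiny in-memory SQLite table, loads it into a `pandas` dataframe, and pulls out plain arrays for the ML tools. The table name `demo_stats`, its rows, and the variable names are made up for illustration; they are not the real database from the previous notebook.

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory database standing in for the real SQLite file
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE demo_stats (week INTEGER, passing_tds INTEGER, passing_ints INTEGER)"
)
conn.executemany(
    "INSERT INTO demo_stats VALUES (?, ?, ?)",
    [(1, 3, 0), (2, 2, 2), (3, 3, 0)],  # made-up rows
)
conn.commit()

# Load the table into a dataframe, then hand plain arrays to the ML tools
demo_df = pd.read_sql("SELECT * FROM demo_stats ORDER BY week", conn)
conn.close()

X_demo = demo_df[["passing_tds"]].values  # feature matrix, shape (3, 1)
y_demo = demo_df["passing_ints"].values   # target vector, shape (3,)
print(demo_df)
```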

## Learning

There are only 16 games per season, so the way I have this structured is likely not ideal for predicting outcomes.  
Instead, I can imagine a better way would involve using individual plays categorized by many features, e.g., {field position, score, quarter+time remaining, down+distance, OFF/DEF formation, home/away, current stats, and metrics for the game's progression} for both the player of interest and the defense of interest.  
If I have the time, I will investigate such an approach.  For now, I will keep it simple and just use aggregated game information.

In [0]:
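# `sqlite_file` (path to the database) and `unique_opps` (list of table names)
# must already be defined before this cell runs
conn = sqlite3.connect(sqlite_file)
dfs  = {} # contain `sql` database tables as individual dataframes (in a dictionary)
for key in unique_opps+["tb12"]:
    # table names cannot be bound as SQL parameters, so format them into the query
    dfs[key] = pd.read_sql("SELECT * FROM {0}".format(key), conn)
conn.close()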

Now let's update the dataframe so that rows 1-15 (weeks 2-16) hold the averages of all the previous games. Week 1 (row 0) stays as it is, and we won't make predictions for that week.

In [32]:
df_tb12 = dfs['tb12']
print(df_tb12)

    passing_tds  fumbles_lost  passer_rating  passing_att  passing_cmp  \
0             3             0          120.9           35           25   
1             2             1           72.5           36           20   
2             3             0          142.6           27           21   
3             1             0          107.1           24           19   
4             1             0           69.5           44           27   
5             1             0           82.7           32           19   
6             1             0          100.8           27           16   
7             2             0           90.5           36           19   
8             3             0          125.1           43           30   
9             2             0          123.1           25           19   
10            4             0          158.3           27           21   
11            4             0          148.9           29           21   
12            2             0         

In [34]:
# testing code for calculating averages:
updated_values = []
values = df_tb12['passing_ints'].values
for v,value in enumerate(values):
    # values[:v] holds only the games before game v (empty for the first game, so the mean is nan)
    print(value,values[:v],np.mean(values[:v]))
    if v>0:
        updated_values.append(np.mean(values[:v]))
    else:
        updated_values.append(value)  # first game: no history, keep the raw value
print()
print(len(updated_values),updated_values)

0 [] nan
2 [0] 0.0
0 [0 2] 1.0
0 [0 2 0] 0.6666666666666666
2 [0 2 0 0] 0.5
0 [0 2 0 0 2] 0.8
0 [0 2 0 0 2 0] 0.6666666666666666
0 [0 2 0 0 2 0 0] 0.5714285714285714
0 [0 2 0 0 2 0 0 0] 0.5
0 [0 2 0 0 2 0 0 0 0] 0.4444444444444444
0 [0 2 0 0 2 0 0 0 0 0] 0.4
0 [0 2 0 0 2 0 0 0 0 0 0] 0.36363636363636365
0 [0 2 0 0 2 0 0 0 0 0 0 0] 0.3333333333333333
0 [0 2 0 0 2 0 0 0 0 0 0 0 0] 0.3076923076923077
0 [0 2 0 0 2 0 0 0 0 0 0 0 0 0] 0.2857142857142857
0 [0 2 0 0 2 0 0 0 0 0 0 0 0 0 0] 0.26666666666666666

16 [0, 0.0, 1.0, 0.6666666666666666, 0.5, 0.8, 0.6666666666666666, 0.5714285714285714, 0.5, 0.4444444444444444, 0.4, 0.36363636363636365, 0.3333333333333333, 0.3076923076923077, 0.2857142857142857, 0.26666666666666666]
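
The loop above can also be written with `pandas` directly. The sketch below is one possible equivalent, not the notebook's own method: it uses the first five interception counts from the output above, shifts the series by one game so a game never contributes to its own average, and then takes an expanding mean. The names `ints` and `prev_avg` are only for illustration.

```python
import numpy as np
import pandas as pd

# First five interception counts from the output above
ints = pd.Series([0, 2, 0, 0, 2], dtype=float)

# shift(1) drops the current game from its own average;
# expanding().mean() then averages every earlier game seen so far
prev_avg = ints.shift(1).expanding().mean()

# Week 1 has no history, so keep its raw value, as the loop does
prev_avg.iloc[0] = ints.iloc[0]
print(prev_avg.tolist())
```

Applied to a whole dataframe, the same `shift(1).expanding().mean()` chain would update every column at once.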


#### Train/Test Split

With all of the data nicely organized into a SQLite database, let's use pandas to read it and prepare the data for learning.

In [0]:
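# NOTE: `df`, `features`, and the 'SalePrice' target column must be defined before this cell runs
df  = df.fillna(-1)  # replace missing values with a sentinel
# train_test_split shuffles the rows by default, so no separate shuffle is needed
tts = train_test_split(df[features].values,
                       df['SalePrice'].values,
                       test_size=0.25)
X_train,X_test,Y_train,Y_test = tts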

#### Pre-process Data

In [0]:
# Develop the scaling on the training dataset, and then apply the same shift to the test
from sklearn.preprocessing import StandardScaler

# scale features
scaler = StandardScaler()
scaler.fit(X_train)

# scale target values
scaler_target = StandardScaler()
scaler_target.fit(Y_train.reshape(-1,1))

In [0]:
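# Scale values
# targets are reshaped to (n_samples, 1) to match the shape used when fitting scaler_target
X_test_scale  = scaler.transform(X_test)
Y_test_scale  = scaler_target.transform(Y_test.reshape(-1,1))
X_train_scale = scaler.transform(X_train)
Y_train_scale = scaler_target.transform(Y_train.reshape(-1,1))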
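
`StandardScaler` expects 2D input of shape `(n_samples, n_features)`, which is why a 1D target needs `reshape(-1, 1)`. The sketch below uses made-up target values (the `*_demo` names are hypothetical) to show the scaler being fit on the training targets only, applied to the test targets, and then inverted to get back to the original units.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical target values, used only for illustration
y_train_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_test_demo = np.array([2.0, 5.0])

demo_scaler = StandardScaler()
demo_scaler.fit(y_train_demo.reshape(-1, 1))  # fit on the training data only

# Apply the training mean and standard deviation to the test targets
y_test_scaled = demo_scaler.transform(y_test_demo.reshape(-1, 1))

# inverse_transform maps scaled values back to the original units
y_test_back = demo_scaler.inverse_transform(y_test_scaled).ravel()
print(y_test_scaled.ravel(), y_test_back)
```

If a model is trained on scaled targets, its predictions would need the same `inverse_transform` step before being compared with real values.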

### K-Nearest Neighbors

In [0]:
# KNN
n_neighbors = 5
weights = 'uniform'

knn  = neighbors.KNeighborsRegressor(n_neighbors, weights=weights)
fknn = knn.fit(X_train, Y_train)
predictions = fknn.predict(X_test)
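
To make the `weights = 'uniform'` setting concrete: with uniform weights, a KNN regression prediction is the plain average of the targets of the nearest training points. The toy data below is invented for illustration and has nothing to do with the football stats.

```python
import numpy as np
from sklearn import neighbors

# Made-up 1D training data; the last point is a far-away outlier
X_demo = np.array([[0.0], [1.0], [2.0], [10.0]])
y_demo = np.array([1.0, 2.0, 3.0, 100.0])

knn_demo = neighbors.KNeighborsRegressor(n_neighbors=2, weights='uniform')
knn_demo.fit(X_demo, y_demo)

# The two nearest neighbours of 0.4 are 0.0 and 1.0,
# so the prediction is the mean of their targets (1.0 and 2.0)
pred = knn_demo.predict(np.array([[0.4]]))
print(pred)
```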

### Support Vector Machine

In [0]:
# SVM
# with scikit-learn it is incredibly easy to get started
clf = svm.SVR()  # support vector regression
clf.fit(X_train,Y_train)

In [0]:
# Performance
predictions = clf.predict(X_test)
# relative error (Pred-Real)/Real; undefined wherever Y_test is 0
values = np.divide((np.asarray(predictions) - Y_test),Y_test)

fig,ax = plt.subplots(2,1,figsize=(8,8))

plt.subplot(2,1,1)
plt.hist(values,bins=20,density=True)  # `normed` was removed from newer matplotlib releases
plt.xlabel("(Pred-Real)/Real",position=(1,0),ha='right')
plt.ylabel("AU",position=(0,1),ha='right')
plt.text(0.97,0.90,"SVM Non-scaled Values",ha='right',transform=ax[0].transAxes)

plt.subplot(2,1,2)
plt.scatter(predictions,Y_test,color='b',edgecolor='k',alpha=0.5,label="Test Dataset");
plt.plot(Y_test,Y_test,color='r',label="Perfect")
plt.xlim(min(predictions)-10000,max(predictions)+10000)
plt.ylim(0,max(Y_test)+20000)
plt.xlabel("Predicted Sale Price",position=(1,0),ha='right')
plt.ylabel("Real Sale Price",position=(0,1),ha='right')
plt.legend()

evs = explained_variance_score(Y_test,predictions)

print(r"Distribution = {0:.3f} $\pm$ {1:.4f}".format(np.mean(values),np.std(values)))
print(r"EV Score     = {0:.3f}".format(evs))
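
For reference, the explained variance score reported above is one minus the variance of the residuals divided by the variance of the true values. The sketch below checks that formula against `explained_variance_score` on a few made-up numbers (the `*_demo` arrays are hypothetical).

```python
import numpy as np
from sklearn.metrics import explained_variance_score

# Made-up true values and predictions, for illustration only
y_true_demo = np.array([3.0, 1.0, 2.0, 4.0])
y_pred_demo = np.array([2.5, 1.5, 2.0, 3.0])

# Explained variance = 1 - Var(y_true - y_pred) / Var(y_true)
residuals = y_true_demo - y_pred_demo
ev_manual = 1.0 - np.var(residuals) / np.var(y_true_demo)

ev_sklearn = explained_variance_score(y_true_demo, y_pred_demo)
print(ev_manual, ev_sklearn)
```

A score of 1.0 means the predictions account for all of the variance in the true values. Because the residual variance is taken about the residual mean, a constant offset in the predictions does not lower the score.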