# CS 109 Final Project: Breaking Daily Fantasy Basketball

![alt text](http://cdn.playbuzz.com/cdn/b83ad51b-f33e-4b06-879a-8b3f8a509b3e/9f71ec8b-2e9b-4bd4-80ff-ce0f555c5653.jpg)

## Thousands of Contests and Millions of Dollars. Every day.
![alt text](https://tribwtic.files.wordpress.com/2015/10/fanduel-draft-kings.jpg?w=1200)

###Overview and Motivation

Imagine a contest that combines the world's central form of entertainment with the rush of games of chance, all while lowering the barriers to entry and the judgment that come with the gambler's vice. Sports appeal to people from every corner of the world, from the kids watching Champions League games on a rickety signal in Sao Paulo to the executives sipping champagne in the deluxe suites of the STAPLES Center. Somehow universal and divisive, they inspire levels of irrationality and emotion that fuel the gambling industry, as seen in the crowds screaming at the twelve 50-inch plasma screens in every Las Vegas sportsbook.

FanDuel and DraftKings exploit this crossroads with excessive commercials and an approachable interface. Daily fantasy sports don't scream gambling; they're a legitimate way for dudes to have fun while watching sports, while using their talents to win some money. The rising popularity of daily fantasy sports, abetted by incessant ads on every major broadcaster, has called the legality of the endeavor into question. After all, putting money against the performance of certain players sounds a bit like gambling. Heck, it sounds a lot like gambling.

Our goal is to determine whether fantasy sports are indeed games of chance. Using our knowledge of data science, love of the NBA, and a deep well of statistics dating back to 1985, we will attempt to pick the optimal lineup for DraftKings and FanDuel on any given night.

![alt text](https://mikelove.files.wordpress.com/2008/07/james-stein.png)

###Related Work
Although there is no direct model upon which we have built, there are many stories of FanDuel/DraftKings players who use their knowledge of statistics to predict the outcome of games. Some news outlets report that these players, who make up 1.5 percent of fantasy players, collect 80 percent of the winnings. We are under the impression that most of their models involve simple Excel manipulation, not intensive analytics and data science. We have certainly considered their approaches, but we look to create a more nuanced model.

Additionally, we have looked at the work of Bradley Efron and Carl Morris of Stanford University, in conjunction with the work of Nate Silver, to perfect our "FabMelo" model. FabMelo, modeled after Silver's CARMELO, finds a particular player's closest historical relatives. We use that list to improve our estimates for the day's games, with the help of the shrinkage idea described in Efron and Morris's "Stein's Paradox in Statistics".
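To make the Stein's Paradox idea concrete, here is a minimal sketch of the shrinkage estimator in the form Efron and Morris present it: each observed average is pulled toward the grand mean by a factor that depends on how noisy the observations are. The function name `james_stein_shrink`, the `noise_var` argument, and the sample numbers are our own assumptions for illustration; this is not the exact code behind FabMelo.

```python
import numpy as np


def james_stein_shrink(observed, noise_var):
    """Shrink each observed mean toward the grand mean (Efron-Morris form).

    observed  : one observed average per player (hypothetical inputs)
    noise_var : assumed common sampling variance of those averages
    """
    observed = np.asarray(observed, dtype=float)
    k = observed.size
    if k < 4:
        # with fewer than 4 means the (k - 3) factor gives no shrinkage
        return observed.copy()
    grand_mean = observed.mean()
    spread = np.sum((observed - grand_mean) ** 2)
    if spread == 0:
        return observed.copy()
    # shrinkage factor; clipping at 0 is the "positive-part" variant
    c = max(0.0, 1.0 - (k - 3) * noise_var / spread)
    return grand_mean + c * (observed - grand_mean)


# Hypothetical fantasy-point averages for five comparable players
shrunk = james_stein_shrink([10, 20, 30, 40, 50], noise_var=25.0)
```

The noisier the observations relative to the spread between players, the closer every estimate moves to the group average, which is why a list of comparable players is useful in the first place.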

###Initial Questions
#####What questions are you trying to answer? 
Can we predict fantasy scores better than the average daily fantasy player? Does the use of advanced statistics and data science give us an advantage over elite players? Our questions didn't evolve much over the course of the project, but we gained a greater appreciation for our challenges.

###Exploratory Data Analysis
####Scraping
The first step of our project was seeing what data was available and how we could access it. Early on, we decided to rely solely on stats.nba.com for our data. We iterated through the URLs that correspond to each year and pulled the information into dataframes. After a bit of data jiu-jitsu, we were left with one comprehensive data frame.

####Optimizer
Given a list of players, a salary cap, and position count limits, we needed to choose the lineup that maximizes fantasy points. Through a fair amount of research, we discovered that the problem is analogous to the multidimensional knapsack problem, a classic problem in combinatorial optimization. The knapsack problem is NP-hard, so we were understandably a little nervous in approaching it. One of our team members had some experience with optimization problems, so we decided to explore alternatives to linear programming. We took a stab at genetic algorithms, using pyeasyga. In future iterations of our project, we'll build a linear program for the optimization, compare the run times, and use the approach that best balances computational time and lineup quality.
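As a rough illustration of the genetic-algorithm idea, the sketch below evolves 0/1 "genomes" (one bit per player) toward a high-scoring lineup that respects a salary cap and per-position maximums. It uses only the standard library rather than pyeasyga, and the player pool, salaries, projections, and limits are all made up for the example. Treating position limits as maximums rather than exact counts is a simplifying assumption.

```python
import random

# Hypothetical pool: (name, position, salary, projected fantasy points)
PLAYERS = [
    ("Guard A", "G", 9800, 48.0),
    ("Guard B", "G", 7200, 36.5),
    ("Guard C", "G", 4500, 24.0),
    ("Forward A", "F", 9100, 44.0),
    ("Forward B", "F", 6400, 33.0),
    ("Forward C", "F", 3900, 19.5),
    ("Center A", "C", 8700, 41.0),
    ("Center B", "C", 5200, 27.0),
]
SALARY_CAP = 30000
MAX_PER_POSITION = {"G": 2, "F": 2, "C": 1}


def lineup_stats(genome, players):
    """Return (total salary, total points, per-position counts) for a 0/1 genome."""
    salary, points, counts = 0, 0.0, {}
    for picked, (_, pos, cost, proj) in zip(genome, players):
        if picked:
            salary += cost
            points += proj
            counts[pos] = counts.get(pos, 0) + 1
    return salary, points, counts


def fitness(genome, players, cap, limits):
    """Projected points, or 0 when the lineup breaks the cap or a position limit."""
    salary, points, counts = lineup_stats(genome, players)
    if salary > cap:
        return 0.0
    if any(n > limits.get(pos, 0) for pos, n in counts.items()):
        return 0.0
    return points


def genetic_lineup(players, cap, limits, pop_size=40, generations=60,
                   mutation_rate=0.05, seed=0):
    rng = random.Random(seed)
    n = len(players)

    def score(genome):
        return fitness(genome, players, cap, limits)

    # Start from every single-player lineup, then pad with random genomes.
    population = [[int(i == j) for j in range(n)] for i in range(n)]
    while len(population) < pop_size:
        population.append([rng.randint(0, 1) for _ in range(n)])
    best = max(population, key=score)
    for _ in range(generations):
        ranked = sorted(population, key=score, reverse=True)
        parents = ranked[: pop_size // 2]        # truncation selection
        children = [best[:]]                     # elitism: keep the best so far
        while len(children) < pop_size:
            mom, dad = rng.sample(parents, 2)
            cut = rng.randint(1, n - 1)          # one-point crossover
            child = mom[:cut] + dad[cut:]
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
        best = max(population, key=score)
    return best


best = genetic_lineup(PLAYERS, SALARY_CAP, MAX_PER_POSITION)
lineup = [p[0] for picked, p in zip(best, PLAYERS) if picked]
```

A genetic algorithm gives no optimality guarantee, which is one reason comparing it against a linear-programming formulation is worthwhile.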

####FabMelo

Nate Silver did this, and we did it better. Or at least a cheap interpretation of it. Using CARMELO as a model, we tried to find a list of basketball players that could be used to narrow our predictive band for modern players. That is to say, we try to get the ten players from 1985 to today that most closely resemble Stephen Curry at his current age. This was a struggle; we tried to do Nate Silver proud in not too many days. However, we're proud of what we were able to build and how we integrated it into the project. Heck, it could probably serve as a neat feature all on its own.
http://projects.fivethirtyeight.com/carmelo/
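The core of a comparables search like this can be sketched as a nearest-neighbor lookup: standardize each statistic so no single column dominates, then rank historical players by distance to the target. The sketch below is an assumption about the general technique, not the FabMelo code itself. The function name, the choice of Euclidean distance, and the toy stat lines are hypothetical, and the filtering of candidates to the same age is assumed to happen before this step.

```python
import numpy as np


def closest_comparables(target, candidates, names, k=10):
    """Return the k (name, distance) pairs nearest to the target stat line.

    target     : 1D stat line for the player of interest
    candidates : 2D array, one row of the same stats per historical player
    names      : labels for the candidate rows
    """
    candidates = np.asarray(candidates, dtype=float)
    target = np.asarray(target, dtype=float)
    mean = candidates.mean(axis=0)
    std = candidates.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    z_candidates = (candidates - mean) / std
    z_target = (target - mean) / std
    distances = np.sqrt(((z_candidates - z_target) ** 2).sum(axis=1))
    order = np.argsort(distances)[:k]
    return [(names[i], float(distances[i])) for i in order]


# Toy example: columns are points, assists, rebounds per game (made up)
names = ["P1", "P2", "P3", "P4"]
history = [[25, 5, 6], [10, 10, 2], [24, 4, 7], [8, 12, 1]]
top_two = closest_comparables([26, 5, 6], history, names, k=2)
```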

####DavidSearchCV + TimeSeriesCrossValidation
Because our data is time-dependent, we can't rely on standard cross-validation techniques: a randomly shuffled fold would let us use data from the future to predict something in the past. Instead, we developed our own cross-validation function, one that is cognizant of the order of the games. We also wanted to run GridSearchCV over a range of parameters, but we once again ran into the cross-validation issue. So we made our own version of GridSearchCV, called DavidSearchCV (aptly named after one of our group members), to iterate through different parameter combinations and return the one that maximizes our accuracy.
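The two ideas above can be sketched together: an expanding-window splitter in which every test index comes after every training index, and a small grid search that scores each parameter combination on those splits. This is a simplified stand-in written for illustration, not the DavidSearchCV implementation. The function names and the `fit_and_score` callback are hypothetical, and rows are assumed to be sorted by game date already.

```python
import itertools

import numpy as np


def expanding_window_splits(n_samples, n_folds):
    """Yield (train_idx, test_idx); each test block follows its training block in time."""
    if n_samples < n_folds + 1:
        raise ValueError("need at least n_folds + 1 samples")
    fold_size = n_samples // (n_folds + 1)
    for fold in range(1, n_folds + 1):
        train_end = fold * fold_size
        # the last fold absorbs any leftover rows
        test_end = n_samples if fold == n_folds else train_end + fold_size
        yield np.arange(0, train_end), np.arange(train_end, test_end)


def time_aware_grid_search(fit_and_score, param_grid, n_samples, n_folds=3):
    """Return (best_params, best_mean_score) over all parameter combinations.

    fit_and_score(params, train_idx, test_idx) must return a score
    where higher is better.
    """
    keys = sorted(param_grid)
    best_params, best_score = None, -np.inf
    for values in itertools.product(*(param_grid[key] for key in keys)):
        params = dict(zip(keys, values))
        scores = [fit_and_score(params, train_idx, test_idx)
                  for train_idx, test_idx in expanding_window_splits(n_samples, n_folds)]
        mean_score = float(np.mean(scores))
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score
```

In practice `fit_and_score` would fit a model on the training rows and return, for example, the negative error on the test rows. scikit-learn's `TimeSeriesSplit` offers a similar splitting scheme.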

####Feature Selection/Feature Engineering & EWMA
To make use of both current and recent past data, we decided to use an exponentially weighted moving average (EWMA), which weights recent games more heavily than older ones. If we had more time to finish our project, a good share of it would go into improving our feature selection decisions. We definitely have all the necessary parts; it was just a matter of getting the most important ones into our final design. We plan on continuing our work over winter break.
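For reference, here is a small sketch of the lagged EWMA feature as we understand it from the code later in this notebook: a span-3 EWMA, shifted by one game so that the feature for a game only sees earlier games, then log-transformed with a 2.5 offset. The notebook itself calls the older `pd.ewma` function; this sketch uses `Series.ewm`, its replacement in newer pandas releases. The helper name `lagged_log_ewma` and the sample values are ours.

```python
import numpy as np
import pandas as pd


def lagged_log_ewma(values, span=3, offset=2.5):
    """log(EWMA of previous games + offset); the first game has no history, so it is NaN."""
    series = pd.Series(values, dtype=float)
    # shift(1) ensures the feature for game t uses games 0..t-1 only
    lagged = series.ewm(span=span).mean().shift(1)
    return np.log(lagged + offset)


# Hypothetical fantasy points over three games
feature = lagged_log_ewma([10, 20, 30])
```

With `span=3` the smoothing weight is 2 / (span + 1) = 0.5, so each game counts half as much as the one after it.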

###Final Analysis
The data is hard to read, because even similar players have vastly different games from day to day. Predicting on past weeks, we have successfully called the top performers and the sleepers. Our interface needs to improve, but we believe that our winnings over the next couple of weeks should validate us.

In [39]:
#load gamelog_data
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
import time
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")
from pyquery import PyQuery as pq
from bs4 import BeautifulSoup
import json
import requests
import datetime

First, let's define a function that allows us to pull any season subset out of the data.

In [40]:
def season_subset(df, year_season_start, year_season_end = None):
    df["GAME_DATE"] = pd.to_datetime(df["GAME_DATE"])
    if year_season_end is None:
        year_season_end = year_season_start + 1
    df_gt = df[df.GAME_DATE > datetime.date(year_season_start,9,1)]
    df_lt = df_gt[df_gt.GAME_DATE < datetime.date(year_season_end,9,1)]
    return df_lt.sort_values("GAME_DATE") if not df_lt.empty else None

Next up, we'll start pulling together the data we prepared earlier. We'll be pulling from a multitude of sources throughout the project, but we'll start by reading in a combination of all the gamelogs and nontraditional data for future use in our implementation of Nate Silver's CARMELO approach. We follow up by standardizing some columns, creating indicators for others, and calculating other useful metrics.

In [41]:
post85df = pd.read_csv('./gamelogs/final_master.csv')
post85df = post85df.drop('VIDEO_AVAILABLE', axis=1)
df85_15 = season_subset(post85df,1985,2016)
by_player = df85_15.groupby("PLAYER_NAME")
MELOadvanceddf = pd.read_csv('./usage_stats/master_advanced.csv')

In [42]:
# per-player z-score of fantasy points; transform keeps the original row index
df85_15["FANTASY_ZSCORE"] = by_player["FANTASY_PTS"].transform(lambda x: (x - x.mean())/x.std())
# indicator: 1 when a game is more than one standard deviation above the player's mean
df85_15["i_ZSCORE_OVER"] = df85_15["FANTASY_ZSCORE"].map(lambda x: 1 if x > 1 else 0)
# note: by_player groups on PLAYER_NAME only, so these totals span every season in the frame
df85_15["SEASON_MIN"] = by_player['MIN'].transform('sum')
df85_15["GAMES_PLAYED"] = by_player["PLAYER_NAME"].transform(len)
# missing shooting percentages (no attempts) are treated as 0
for x in ['FG_PCT', 'FG3_PCT', 'FT_PCT']:
    df85_15[x] = df85_15[x].fillna(0)
# encode wins as 1 and losses as 0
df85_15["WL"] = (df85_15["WL"] == "W").astype(int)

opp_home = df85_15.MATCHUP.map(lambda x: (x[-3:],0) if "@" in x else (x[-3:],1))
df85_15["OPP"] = opp_home.map(lambda x: x[0])
df85_15["i_HOME"] = opp_home.map(lambda x: x[1])

Next, let's start pulling in all the information we have about the individual players. Most of the data frame changes we must make are pretty trivial (renaming columns, converting types). All straightforward data preparation work.

In [5]:
#Add player bio data for age,weight,height
player_bios_df = pd.read_csv("./player_bios/player_bios.csv")
player_bios_df = player_bios_df.rename(columns = {'PERSON_ID': 'PLAYER_ID', 'DISPLAY_FIRST_LAST': 'PLAYER_NAME'})
player_bios_df["BIRTHDATE"] = pd.to_datetime(player_bios_df["BIRTHDATE"])
player_bios_df['AGE'] = player_bios_df["BIRTHDATE"].map(lambda x: round((pd.to_datetime('today') - x).days / 365.,2))
player_bios_df["WEIGHT"] = player_bios_df["WEIGHT"].astype('str')
player_bios_df["HEIGHT"] = player_bios_df["HEIGHT"].astype('str')
player_bios_df["WEIGHT"] = player_bios_df["WEIGHT"].map(lambda x:  float(x) if x != 'nan' else 0.)
player_bios_df["HEIGHT"] = player_bios_df["HEIGHT"].map(lambda x: (12.*float(x[0]) + float(x[2:])) if x != 'nan' else 0.)

by_player = df85_15.groupby("PLAYER_NAME")

In [36]:
def get_player_bio(name, col_name):
    # take the first matching row explicitly instead of casting a whole Series to float
    return float(player_bios_df[player_bios_df.PLAYER_NAME == name][col_name].iloc[0])

#df85_15["AGE"] = by_player["PLAYER_NAME"].apply(lambda x: x.replace(x.iloc[0],get_player_bio(x.iloc[0],"AGE")))
#df85_15["WEIGHT"] = by_player["PLAYER_NAME"].apply(lambda x: x.replace(x.iloc[0],get_player_bio(x.iloc[0],"WEIGHT")))
#df85_15["HEIGHT"] = by_player["PLAYER_NAME"].apply(lambda x: x.replace(x.iloc[0],get_player_bio(x.iloc[0],"HEIGHT")))

In [37]:
#Integrate ELO Rankings
elo_df = pd.read_csv("./gamelogs/all_elo.csv")
elo_df["date_game"] = pd.to_datetime(elo_df["date_game"])
elo_df["game_location"] = elo_df["game_location"].map(lambda x: 1 if x == "H" else 0)
elo_df = elo_df[elo_df["is_playoffs"] == 0]

curr = elo_df.columns.tolist()
cols = [curr[i] for i in [5,8,11,13,14,17,19,21]]
elo_df = elo_df[cols]
elo_df = elo_df.rename(columns={'date_game': 'GAME_DATE',
                                'team_id':'TEAM_ABBREVIATION',
                                'opp_id':'OPP', 
                                'game_location': 'i_HOME',
                                'elo_i':'ELO',
                                'opp_elo_i': 'OPP_ELO',
                                'win_equiv': 'EXP_WINS',
                                'forecast':'FORECAST'})

elo_df['SHIT'] = elo_df['OPP_ELO'].map(lambda x: 1 if x < 1400 else 0)
elo_df['OKAY'] = elo_df['OPP_ELO'].map(lambda x: 1 if 1400 <= x < 1600 else 0)
elo_df['GOOD'] = elo_df['OPP_ELO'].map(lambda x: 1 if 1600 <= x < 1700 else 0)
elo_df['GREAT'] = elo_df['OPP_ELO'].map(lambda x: 1 if 1700 <= x else 0)
df85_15 = df85_15.merge(season_subset(elo_df,1985,2016))


In [9]:
#Rearrange some columns in df85_15
curr = df85_15.columns.tolist()
cols = curr[:3] + curr[32:37] + curr[3:9] + curr[37:] + curr[9:32]
if len(curr) == len(cols):
    df85_15 = df85_15[cols]


name_pos = player_bios_df[["PLAYER_ID","POSITION","PLAYER_NAME"]]
df85_15 = df85_15.merge(name_pos)

In [18]:
"""
Code not used in final submission

def calc_season_avg(df,col_list,(date_str1,date_str2)):
    date1, date2 = pd.to_datetime(date_str1), pd.to_datetime(date_str2)
    mask = lambda x: (date1 <= x) & (x <= date2)
    return df[df.GAME_DATE.apply(mask)].groupby(["PLAYER_NAME","SEASON_ID"])[col_list].mean().reset_index()
    
def ngames_colname(col_list, ngames):
    return map(lambda x: str(ngames) + 'D_' + x, col_list)
    
def last_ngames(df,ngames,game_date,col_list):
    ngames_df = df[df.GAME_DATE < game_date].nlargest(ngames, "GAME_DATE")
    ngames_col_list = ngames_colname(col_list,ngames)
    num_cols = len(ngames_col_list)
    date_player_tuples = [("GAME_DATE",game_date)]#,("PLAYER_NAME",df.PLAYER_NAME.iloc[0])]
    if ngames_df.empty:
        return dict(date_player_tuples + zip(ngames_col_list,np.array(0).repeat(num_cols)))
    else:
        return dict(date_player_tuples + zip(ngames_col_list,ngames_df[col_list].mean()))
        
def calc_ngame_avg(df,col_list,game_date_str,ngames):
    game_date = pd.to_datetime(game_date_str)
    season_id = df[df.GAME_DATE == game_date]["SEASON_ID"].iloc[0]
    return last_ngames(df[df.SEASON_ID == season_id],ngames,game_date,col_list)
    
def rolling_cols(df,col_list,ngames,rolling_kind):
    if rolling_kind == 'mean':
        rolling_func = lambda (a,b,c): pd.rolling_mean(a,b,min_periods = c)
    elif rolling_kind == 'sum':
        rolling_func = lambda (a,b,c): pd.rolling_sum(a,b,min_periods = c)
    else:
        return None 
    
    rolling_df = (df.groupby(["PLAYER_NAME","SEASON_ID"])
                    .apply(lambda x: add_game_date_pts_col(rolling_func((x[col_list],ngames,1)),x.GAME_DATE,x.FANTASY_PTS).reset_index(drop = True)))
    return rolling_df.reset_index().drop('level_2',axis = 1).rename(columns=dict(zip(col_list,map(lambda x: 'R_' + x,col_list))))

def add_game_date_pts_col(df,game_date_col,fantasy_pts_col):
    new_df = pd.concat([df,game_date_col], axis = 1)
    return new_df
    
def per_season_cumsum(df,col_list):
    cumsum_df = (df.groupby(["PLAYER_NAME","SEASON_ID"])
                   .apply(lambda x: add_game_date_col(x[col_list].cumsum(axis = 0), x.GAME_DATE).reset_index(drop = True)))
    return cumsum_df.reset_index().drop('level_2',axis = 1).rename(columns=dict(zip(col_list,map(lambda x: 'C_' + x,col_list))))

def per_season_cummean(df,col_list):
    cumsum_df = (df.groupby(["PLAYER_NAME","SEASON_ID"])
                   .apply(lambda x: add_game_date_pts_col(pd.expanding_mean(x[col_list], min_periods = 2), x.GAME_DATE, x.FANTASY_PTS).reset_index(drop = True)))
    return cumsum_df.reset_index().drop('level_2',axis = 1).rename(columns=dict(zip(col_list,map(lambda x: 'C_' + x,col_list))))

def enumerate_games(df):
    new_df = df.copy()
    new_df["GAME_NUM"] = range(1,len(df.GAME_DATE) + 1)
    return new_df

def sigmoidfun(x):
    return 1/(1+np.exp(-0.007*(x-800)))

def fantasy_avg_lastn(player_df,last_n_seasons,seasons):
    return player_df[[s in seasons[-last_n_seasons:] for s in player_df.SEASON_ID]]['FANTASY_PTS'].mean()    

def true_fantasy_mean(player_df,last_n_seasons):
    seasons = list(set(player_df.SEASON_ID))
    lastn_mean = fantasy_avg_lastn(player_df,last_n_seasons,seasons)
    return player_df.groupby("SEASON_ID").apply(lambda x: x.apply(lambda y: lastn_mean + sigmoidfun(y.MIN) * (y.C_FANTASY_PTS - lastn_mean),axis = 1))

def fantasy_resp(df):
    return df.groupby('PLAYER_NAME').apply(lambda x: true_fantasy_mean(x,5))
"""

In [19]:
"""
Not used in this version of the code, but it's awesome

def timeseries_cv(df,lcols,resp_str,nfolds):
    resp_str = resp_str + '_RESP'
    train_size = df.shape[0]
    floor_fold_size = (train_size / nfolds)
    final_fold_size = floor_fold_size +  (train_size % nfolds)
    rest = train_size - final_fold_size
    rest_size = rest / (nfolds - 1)
    final_idx = lambda x: (final_fold_size * x, final_fold_size * (x + 1))
    rest_idx = lambda x: (rest_size * x, rest_size * (x + 1))
    folds_idx = map(lambda x: final_idx(x) if x == (nfolds - 1) else rest_idx(x),range(nfolds))
    folds = map(lambda (x,y): df.iloc[x:y], folds_idx)
    xtrain,ytrain = zip(*map(lambda x: (x[:-1][lcols].values, x[:-1][resp_str].values), folds))
    xtest,ytest = zip(*map(lambda x: (x[-1:][lcols].values, x[-1:][resp_str].values), folds))
    train_test_lcols = map(lambda x: map(lambda y: y.reshape((-1,len(lcols))),list(x)),[xtrain,xtest])
    train_test_resp = map(lambda x: map(lambda y: y.reshape((-1,1)),list(x)),[ytrain,ytest])
    xtrain,xtest = tuple(train_test_lcols)
    ytrain,ytest = tuple(train_test_resp)
    return xtrain,ytrain,xtest,ytest
"""

In [20]:
def make_ewma_pos_df2(df, game_date, potential_players):
    sub_df = df[(df.GAME_DATE <= game_date)]
    lower_bound = min_season(sub_df[['PLAYER_NAME','SEASON_ID']],potential_players)
    sub_df2 = sub_df[sub_df.SEASON_ID >= lower_bound]
    ewma_pos = sub_df2.groupby(["OPP",'SEASON_ID',"POSITION","GAME_DATE"]).apply(lambda x: x.FANTASY_PTS.sum())

    ewma_pos_df_temp = (ewma_pos.reset_index().rename(columns={0:'TOT_OPP_POS'})
                                .sort_values('GAME_DATE')
                                .groupby(["OPP",'SEASON_ID',"POSITION"])
                                .apply(lambda x: 
                                    pd.DataFrame(zip(x.GAME_DATE,[-5 if np.isinf(y) else y for y in np.log(pd.ewma(x.TOT_OPP_POS, span = 3) + 2.5)]), 
                                    index = range(x.shape[0])))
                                .rename(columns={0:'GAME_DATE',1:'EWMA_OPP_POS'})
                                .reset_index(level = [0,1,2]))
    merge_on = ['OPP','GAME_DATE','POSITION','SEASON_ID']
    ewma_pos_df = pd.merge(sub_df2,ewma_pos_df_temp,left_on=merge_on, right_on=merge_on)
    league_avg_df = (ewma_pos_df.groupby(["SEASON_ID",'POSITION'])
                     .apply(lambda x: x['EWMA_OPP_POS'].mean())
                     .reset_index()
                     .rename(columns={0:'LEAGUE_AVG_POS'}))
    nan_dict = dict(reduce(lambda x,y: x + y.items(),[{(k1,k2):v} for k1,k2,v in league_avg_df.to_records(index = False)], []))
    nan_rows = pd.isnull(ewma_pos_df['EWMA_OPP_POS'])
    ewma_pos_df.loc[nan_rows,'EWMA_OPP_POS'] = ewma_pos_df[nan_rows].apply(lambda x: nan_dict[x.SEASON_ID - 1,x.POSITION] if x.SEASON_ID > lower_bound else float('nan'), axis = 1)
    # 'players' is not defined in this function; return the list that was passed in
    return ewma_pos_df, potential_players

In [21]:
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.linear_model import LinearRegression as LR
from sklearn.decomposition import PCA
#train_test_split(xrange(df.shape[0]), train_size=0.7)

def mape(ypred, ytrue):
    """ returns the mean absolute percentage error """
    idx = ytrue != 0.0
    # take the absolute value of the whole ratio so negative true values cannot flip the sign
    return 100*np.mean(np.abs((ypred[idx]-ytrue[idx])/ytrue[idx]))

def run_classifier(df, mask, ewma_colresp,ewma_colfeats):
    dftouse = df.copy()

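    ewma_feats = ['EWMA_LOG_' + x for x in ewma_colfeats]
    STANDARDIZABLE = ['EWMA_LOG_' + ewma_colresp, 'EWMA_OPP_POS'] + ewma_feats
    for col in STANDARDIZABLE:
        print(col)
        # StandardScaler expects a 2D array, so reshape each column to (n, 1)
        valstrain = df[col].values[mask].reshape(-1, 1)
        valstest = df[col].values[~mask].reshape(-1, 1)
        scaler = StandardScaler().fit(valstrain)
        outtrain = scaler.transform(valstrain)
        # reuse the training mean and variance; refitting on the test rows would leak test information
        outtest = scaler.transform(valstest)
        out = np.empty(mask.shape[0])
        out[mask] = outtrain.ravel()
        out[~mask] = outtest.ravel()
        dftouse[col] = out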

    lcols = STANDARDIZABLE + ["OKAY","GOOD","GREAT"]


    clf = LR()
    #cs=[.0001,.001,.01,.1,1,10]
    #n_estimators = [1,2,10,100,500,1000]
    #max_depth = [2,3,5,7,10]
    pca = PCA(n_components=3)
    feats = list(set(lcols) - set(['OKAY','GOOD','GREAT']))


    Xmatrix=pca.fit_transform(np.array(dftouse[feats]))
    Yresp=dftouse[ewma_colresp + '_RESP'].values 
    Xmatrix_train=Xmatrix[mask]
    Xmatrix_test=Xmatrix[~mask]
    Yresp_train=Yresp[mask]
    Yresp_test=Yresp[~mask]

    #your code here
    # from sklearn.grid_search import GridSearchCV
    # #{'n_estimators':n_estimators,'max_depth':max_depth}
    # gs=GridSearchCV(clfsvm, param_grid={'C':cs}, cv=5)
    # gs.fit(Xmatrix_train, Yresp_train)
    # print "BEST", gs.best_params_, gs.best_score_, gs.grid_scores_

    # #calculate the accuracy here
    # best = gs.best_estimator_
    # best.fit(Xmatrix_train, Yresp_train)
    # best.score(Xmatrix_test, Yresp_test)
    print(pca.explained_variance_ratio_)
    return clf, Xmatrix_train, Yresp_train, Xmatrix_test, Yresp_test

In [23]:
def get_player_seasons(player_name, game_date,df,ewma_colresp, ewma_colfeat):
    player_df = df[df.PLAYER_NAME == player_name]
    player_df2 = pd.concat([player_df.reset_index(drop = True),player_df.groupby("SEASON_ID").apply(lambda x: np.log(pd.ewma(x[ewma_colresp], span = 3).shift(1) + 2.5).reset_index().drop('index',axis=1).rename(columns={ewma_colresp:'EWMA_LOG_' + ewma_colresp})).reset_index(drop = True)],axis = 1)
    for ewma_col in ewma_colfeat:
        player_df2['EWMA_LOG_' + ewma_col] = player_df2.groupby("SEASON_ID").apply(lambda x: np.log(pd.ewma(x[ewma_col], span = 3).shift(1) + 2.5).reset_index().drop('index',axis=1).rename(columns={ewma_col:'EWMA_LOG_' + ewma_col})).reset_index(drop = True)
    #1 if np.log(y[ewma_colresp] + 1) >= y['EWMA_LOG_' + ewma_colresp] else 0
    resp = player_df2.groupby('SEASON_ID').apply(lambda x: x.apply(lambda y: np.log(y[ewma_colresp] + 2.5), axis = 1).reset_index().drop('index',axis=1).rename(columns={0: ewma_colresp + '_RESP'})).reset_index(drop = True)
    player_df3 = pd.concat([player_df2,resp], axis = 1)
    player_df_final = player_df3.dropna()
    return player_df_final, np.array(player_df_final.GAME_DATE < game_date)

In [24]:
def filter_players_by_season_count(df,players):
    season_count = lambda x: len(set(df[df.PLAYER_NAME == x].SEASON_ID))
    sub_players = filter(lambda x: season_count(x) >= 2, players)
    return sub_players

def reduce_picks(player_name,game_date, df, ewma_colresp, ewma_colfeats):
    seasons = list(set(df[df.PLAYER_NAME == player_name].SEASON_ID))
    season1 = seasons[1]
    dftouse,mask = get_player_seasons(player_name, game_date, df, ewma_colresp, ewma_colfeats)
    clf,xtrain,ytrain,xtest,ytest = run_classifier(dftouse,mask,ewma_colresp, ewma_colfeats)
    clf.fit(xtrain,ytrain)
    print(player_name)
    print('The error is %0.2f%%' % mape(clf.predict(xtest),ytest))
    dfreturn = dftouse[~mask].copy()
    dfreturn['PRED' + ewma_colresp] = clf.predict(xtest)
    return dfreturn

def min_season(df,players):
    season = sorted(map(lambda x: df[df.PLAYER_NAME == x].SEASON_ID.min(),players))[0]
    return season

def make_ewma_pos_df(df, game_date):
    game_day_df = df[(df.GAME_DATE == game_date)]
    sub_df = df[(df.GAME_DATE <= game_date)]
    potential_players = list(set(game_day_df.PLAYER_NAME))
    players = filter_players_by_season_count(sub_df[['PLAYER_NAME','SEASON_ID']],potential_players)
    lower_bound = min_season(sub_df[['PLAYER_NAME','SEASON_ID']],players)
    sub_df2 = sub_df[sub_df.SEASON_ID >= lower_bound]
    ewma_pos = sub_df2.groupby(["OPP",'SEASON_ID',"POSITION","GAME_DATE"]).apply(lambda x: x.FANTASY_PTS.sum())
    
    ewma_pos_df_temp = (ewma_pos.reset_index().rename(columns={0:'TOT_OPP_POS'})
                                .sort_values('GAME_DATE')
                                .groupby(["OPP",'SEASON_ID',"POSITION"])
                                .apply(lambda x: 
                                    pd.DataFrame(zip(x.GAME_DATE,[-5 if np.isinf(y) else y for y in np.log(pd.ewma(x.TOT_OPP_POS, span = 3).shift(1) + 2.5)]), 
                                    index = range(x.shape[0])))
                                .rename(columns={0:'GAME_DATE',1:'EWMA_OPP_POS'})
                                .reset_index(level = [0,1,2]))
    merge_on = ['OPP','GAME_DATE','POSITION','SEASON_ID']
    ewma_pos_df = pd.merge(sub_df2,ewma_pos_df_temp,left_on=merge_on, right_on=merge_on)
    league_avg_df = (ewma_pos_df.groupby(["SEASON_ID",'POSITION'])
                     .apply(lambda x: x['EWMA_OPP_POS'].mean())
                     .reset_index()
                     .rename(columns={0:'LEAGUE_AVG_POS'}))
    nan_dict = dict(reduce(lambda x,y: x + y.items(),[{(k1,k2):v} for k1,k2,v in league_avg_df.to_records(index = False)], []))
    nan_rows = pd.isnull(ewma_pos_df['EWMA_OPP_POS'])
    ewma_pos_df.loc[nan_rows,'EWMA_OPP_POS'] = ewma_pos_df[nan_rows].apply(lambda x: nan_dict[x.SEASON_ID - 1,x.POSITION] if x.SEASON_ID > lower_bound else float('nan'), axis = 1)
    return ewma_pos_df, players

def classify_players_ondate(df,players, game_date,ewma_colresp, ewma_colfeats):
    store_df = []
    for player in players:
        print(player)
        store_df.append(reduce_picks(player,game_date, df, ewma_colresp, ewma_colfeats))
    return pd.concat(store_df, axis = 0)

def make_player_pool(df,game_date,ewma_colresp, ewma_colfeats):
    ewma_pos_df, players = make_ewma_pos_df(df, game_date)
    return classify_players_ondate(ewma_pos_df, players,game_date,ewma_colresp,ewma_colfeats)

In [None]:
#Adding KDE plot
kde_cols = ['WL','MIN','FGM','FGA','FG_PCT','FG3M','FG3A','FG3_PCT','FTM','FTA','FT_PCT',
'OREB','DREB','REB','AST','STL','BLK','TOV','PF','PTS','PLUS_MINUS','i_HOME', 'OPP_ELO','ELO']

indicators = ['WL','i_HOME','SHIT','OKAY','GOOD','GREAT']
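continuous = list(set(kde_cols) - set(indicators))

# FANTASY_ZRESP is never created above; we assume the intended grouping is the
# i_ZSCORE_OVER indicator (1 when a game is more than one standard deviation above the player's mean)
df85_15gb = df85_15.groupby("i_ZSCORE_OVER")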
fig, axes = plt.subplots(nrows=9, ncols=4, figsize=(20, 36), 
                         tight_layout=True)
for ax, p in zip(axes.ravel(), kde_cols):
    for k, v in df85_15gb[p]:
        sns.kdeplot(v, ax=ax, label=str(k)+":"+v.name)

In [27]:
"""
Code not used in final version

ewma_cols = ['MIN','FGM','FGA','FG_PCT','FG3M','FG3A','FG3_PCT','FTM','FTA','FT_PCT','OREB','DREB','REB','AST','STL','BLK','TOV','PF','PTS','FANTASY_PTS']
ewma_cols_renamed = map(lambda x: 'EWMA_LOG_' + x, ewma_cols)
gameday_df = (df85_15[[dfplayer in players for dfplayer in df85_15.PLAYER_NAME]]
                  .groupby("PLAYER_NAME")
                  .apply(lambda x: np.log(pd.ewma(x[ewma_cols] + 2.5, span = 3))
                      .iloc[-1:]).rename(columns=dict(zip(ewma_cols,ewma_cols_renamed)))
                  .reset_index().set_index('level_1'))
gameday_df['GAME_DATE'] = pd.to_datetime(['2015-12-10'] * gameday_df.shape[0])

#final_df = pd.concat([gameday_df,df85_15[[dfplayer in players for dfplayer in df85_15.PLAYER_NAME]]])
"""


In [31]:
"""
Reading in salary data, not used in this version

dk11 = pd.read_csv("./DKSalaries/DKSalaries11.csv")
players = list(dk11.Name)
"""


In [38]:
#class1.predict(df[~mask][['EWMA_LOG_FANTASY_PTS','EWMA_LOG_PTS']])

In [34]:
"""
ewma_cols = ['MIN','FGM','FGA','FG_PCT','FG3M','FG3A','FG3_PCT','FTM','FTA','FT_PCT','OREB','DREB','REB','AST','STL','BLK','TOV','PF','PTS','FANTASY_PTS']
ewma_cols_renamed = map(lambda x: 'EWMA_LOG_' + x, ewma_cols)
ewma_pos_gameday_df, _ = make_ewma_pos_df2(df85_15[[dfplayer in players for dfplayer in df85_15.PLAYER_NAME]], '2015-12-09', players)
gameday_df = (ewma_pos_gameday_df.sort_values('GAME_DATE').groupby("PLAYER_NAME")
                  .apply(lambda x: pd.concat([x.POSITION[-1:],x.EWMA_OPP_POS[-1:],x.SEASON_ID[-1:],x.GAME_DATE[-1:],
                                              np.log(pd.ewma(x[ewma_cols] + 2.5, span = 3)).iloc[-1:]],
                                            axis = 1))
                  .rename(columns=dict(zip(ewma_cols,ewma_cols_renamed)))
                  .reset_index(level = 0))                     
gameday_df = gameday_df[gameday_df.SEASON_ID > 22014]
gameday_df['GAME_DATE'] = gameday_df.GAME_DATE.map(lambda x: '2015-12-10')
"""


In [546]:
#df_test, mask_test = get_player_seasons("Pau Gasol",'2015-12-09', ewma_pos_df,'FANTASY_PTS', ['PTS'],True)

In [800]:
#reduce_picks("Pau Gasol",'2015-12-09', df_test, gameday_df,'FANTASY_PTS', ['PTS'], True)

Pau Gasol




Unnamed: 0,PLAYER_NAME,POSITION,EWMA_OPP_POS,SEASON_ID,GAME_DATE,EWMA_LOG_MIN,EWMA_LOG_FGM,EWMA_LOG_FGA,EWMA_LOG_FG_PCT,EWMA_LOG_FG3M,EWMA_LOG_FG3A,EWMA_LOG_FG3_PCT,EWMA_LOG_FTM,EWMA_LOG_FTA,EWMA_LOG_FT_PCT,EWMA_LOG_OREB,EWMA_LOG_DREB,EWMA_LOG_REB,EWMA_LOG_AST,EWMA_LOG_STL,EWMA_LOG_BLK,EWMA_LOG_TOV,EWMA_LOG_PF,EWMA_LOG_PTS,EWMA_LOG_FANTASY_PTS,PREDFANTASY_PTS
5115,Pau Gasol,Center-Forward,3.697768,22015,2015-12-10,3.482834,2.260191,2.902217,1.083225,1.011601,1.183772,1.011601,1.746953,1.919781,1.160539,1.883738,2.463914,2.761829,1.931758,1.263958,1.631411,1.727296,1.662821,0,3.897591,3.683668


In [835]:
#gameday_df.head()

Unnamed: 0,PLAYER_NAME,POSITION,EWMA_OPP_POS,SEASON_ID,GAME_DATE,EWMA_LOG_MIN,EWMA_LOG_FGM,EWMA_LOG_FGA,EWMA_LOG_FG_PCT,EWMA_LOG_FG3M,EWMA_LOG_FG3A,EWMA_LOG_FG3_PCT,EWMA_LOG_FTM,EWMA_LOG_FTA,EWMA_LOG_FT_PCT,EWMA_LOG_OREB,EWMA_LOG_DREB,EWMA_LOG_REB,EWMA_LOG_AST,EWMA_LOG_STL,EWMA_LOG_BLK,EWMA_LOG_TOV,EWMA_LOG_PF,EWMA_LOG_PTS,EWMA_LOG_FANTASY_PTS
23432,Aaron Brooks,Guard,4.334486,22015,2015-12-10,2.682866,1.432438,1.952078,1.004718,1.135931,1.377687,1.020625,1.354887,1.504377,1.08478,1.04247,1.468689,1.543245,1.741186,0.921823,1.110264,1.388094,1.539713,2.062828,2.775742
21772,Al Horford,Center-Forward,3.688879,22015,2015-12-10,3.613464,2.17279,2.675475,1.105073,1.111852,1.544871,0.978025,1.376023,1.471913,1.063583,1.358179,2.263958,2.398822,1.808964,1.340571,1.421255,1.329398,1.449392,2.837007,3.654966
25305,Andre Roberson,Guard,4.603081,22015,2015-12-10,3.311199,1.626412,2.150399,1.071503,0.944667,1.412812,0.929647,1.323147,1.506389,1.085565,1.420415,1.725825,1.981828,1.021465,1.395628,1.219344,1.198074,1.462999,2.197043,2.987363
25303,Anthony Morrow,Guard,4.603081,22015,2015-12-10,2.944968,1.624212,2.05328,1.106511,1.263107,1.523553,1.144979,1.272653,1.340332,1.173459,1.099446,1.313293,1.440049,1.269641,0.930503,0.917852,0.932087,1.316522,2.277835,2.644308
11575,Arron Afflalo,Guard,4.332924,22015,2015-12-10,3.352348,1.727857,2.245268,1.071642,0.970859,1.471047,0.936217,1.212941,1.254459,1.060821,0.936424,1.83858,1.846633,1.532858,0.934874,0.919411,1.009738,1.638163,2.278367,2.879926


In [855]:
#player_pool, error_players = make_player_pool(df85_15,gameday_df,'2015-12-09',players,'FANTASY_PTS', ['PTS','REB','AST','FANTASY_PTS'], True)

Russell Westbrook
DeMarcus Cousins
Kevin Durant
Blake Griffin
Rajon Rondo
Chris Paul
Paul Millsap
Kristaps Porzingis
Pau Gasol
Brook Lopez
Carmelo Anthony
Jimmy Butler
Thaddeus Young
Robert Covington
DeAndre Jordan
Rudy Gay
Al Horford
Jeff Teague
Derrick Rose
Jahlil Okafor
Jarrett Jack
Serge Ibaka
Nerlens Noel
Jamal Crawford
Joe Johnson
Omri Casspi
Nikola Mirotic
Jerami Grant
Isaiah Canaan
Dennis Schroder
Thabo Sefolosha
Kent Bazemore
Darren Collison
Kosta Koufos
T.J. McConnell
Enes Kanter
Kyle Korver
Joakim Noah
Shane Larkin
Arron Afflalo
Jose Calderon
Bojan Bogdanovic
Dion Waiters
JJ Redick
Taj Gibson
Marco Belinelli
Steven Adams
Robin Lopez
Hollis Thompson
Josh Smith
Langston Galloway
Ben McLemore
Tony Wroten
Aaron Brooks
Austin Rivers
Lance Stephenson
Doug McDermott
Nik Stauskas
Paul Pierce
Kirk Hinrich
Wesley Johnson
Mike Scott
Thomas Robinson
Mike Muscala
Andre Roberson
Kevin Seraphin
Tony Snell
Markel Brown
Anthony Morrow
Luc Richard Mbah a Moute
D.J. Augustin
Wayne Ellington
Lance Thomas
Willie Reed
Kyle O'Quinn
Derrick Williams
Jerian Grant
Jakarr Sampson
Nick Collison
Louis Amundson
Caron Butler
Steve Novak
Sasha Vujacic
Carl Landry
Donald Sloan
Tiago Splitter
Justin Holiday
E'Twaun Moore
Kyle Singler
Cole Aldrich
James Anderson
Seth Curry
Quincy Acy
Shelvin Mack
Lamar Patterson
C.J. Wilcox
Cameron Bairstow
Tim Hardaway Jr.
Duje Dukan
Josh Huestis
Pablo Prigioni
Eric Moreland
Branden Dawson
Richaun Holmes
Mitch McGary
Cleanthony Early
Bobby Portis
Christian Wood
Cameron Payne
Walter Tavares




Walter Tavares
Cristiano Felicio




Cristiano Felicio




In [857]:
#pd.concat(player_pool).to_csv("./player_pool.csv", index = False)

In [858]:
#error_players

['Kristaps Porzingis',
 'Jahlil Okafor',
 'T.J. McConnell',
 'Doug McDermott',
 'Luc Richard Mbah a Moute',
 'Willie Reed',
 'Jerian Grant',
 'Jakarr Sampson',
 'Louis Amundson',
 'Carl Landry',
 'Seth Curry',
 'Lamar Patterson',
 'C.J. Wilcox',
 'Cameron Bairstow',
 'Duje Dukan',
 'Josh Huestis',
 'Eric Moreland',
 'Branden Dawson',
 'Richaun Holmes',
 'Cleanthony Early',
 'Bobby Portis',
 'Christian Wood',
 'Cameron Payne',
 'Walter Tavares',
 'Cristiano Felicio']

In [859]:
#player_pool_df = pd.read_csv("./player_pool.csv")

In [860]:
#player_pool_df

Unnamed: 0,PLAYER_NAME,POSITION,EWMA_OPP_POS,SEASON_ID,GAME_DATE,EWMA_LOG_MIN,EWMA_LOG_FGM,EWMA_LOG_FGA,EWMA_LOG_FG_PCT,EWMA_LOG_FG3M,EWMA_LOG_FG3A,EWMA_LOG_FG3_PCT,EWMA_LOG_FTM,EWMA_LOG_FTA,EWMA_LOG_FT_PCT,EWMA_LOG_OREB,EWMA_LOG_DREB,EWMA_LOG_REB,EWMA_LOG_AST,EWMA_LOG_STL,EWMA_LOG_BLK,EWMA_LOG_TOV,EWMA_LOG_PF,EWMA_LOG_PTS,EWMA_LOG_FANTASY_PTS,PREDFANTASY_PTS
0,Russell Westbrook,Guard,4.603081,22015,2015-12-10,3.495088,2.183868,2.635601,1.132808,1.211603,1.637695,1.050316,1.895582,1.945897,1.233753,1.210169,2.143585,0.883829,1.886610,1.536502,0.919217,1.866419,1.399586,-0.487284,0.672112,3.861697
1,DeMarcus Cousins,Forward-Center,3.581533,22015,2015-12-10,3.616311,2.213739,2.942100,1.072696,1.163555,1.764062,0.994462,2.065856,2.460190,1.127939,1.546178,2.398345,0.149848,0.941011,1.332806,1.154316,1.800623,1.980846,0.189428,0.266317,3.884442
2,Kevin Durant,Forward,4.455925,22015,2015-12-10,3.602911,2.489988,2.872697,1.146228,1.680617,2.092820,1.112480,2.093742,2.184346,1.220307,0.928842,2.492169,1.159493,1.178353,1.498657,1.065611,2.006694,1.259186,0.151408,0.706942,4.006340
3,Blake Griffin,Forward,4.325469,22015,2015-12-10,3.712565,2.355495,3.032262,1.077917,0.923782,1.019300,0.923523,1.958203,2.242235,1.173527,1.529740,2.526434,0.993853,0.695564,1.333307,1.122542,1.731139,1.602328,-0.140501,0.584726,3.781768
4,Rajon Rondo,Guard,4.397432,22015,2015-12-10,3.635751,1.939591,2.435966,1.100122,1.187714,1.462054,1.064279,1.788188,1.939798,1.122992,1.338597,2.048116,1.000420,1.010629,1.586584,0.918632,1.893459,1.228865,0.596637,0.940321,3.661890
5,Chris Paul,Guard,4.608205,22015,2015-12-10,3.595161,2.136022,2.734112,1.085444,1.341359,1.682540,1.064971,1.684167,1.691584,1.250881,1.025794,1.556109,-1.442057,0.965986,1.360016,0.916293,1.714245,1.769216,-0.417555,-0.416583,3.768098
6,Paul Millsap,Forward,4.056157,22015,2015-12-10,3.619013,2.188284,2.763359,1.094134,0.944606,1.442727,0.929728,2.036194,2.237345,1.195158,1.645238,2.277898,1.066879,1.415246,1.590522,1.334977,1.397046,1.731750,0.803408,1.143436,3.683404
7,Pau Gasol,Center-Forward,3.697768,22015,2015-12-10,3.482834,2.260191,2.902217,1.083225,1.011601,1.183772,1.011601,1.746953,1.919781,1.160539,1.883738,2.463914,1.359339,0.945239,1.263958,1.631411,1.727296,1.662821,-0.038117,1.095661,3.779175
8,Brook Lopez,Center,3.611777,22015,2015-12-10,3.635779,2.386021,2.882568,1.117174,0.916291,0.916396,0.916291,2.014610,2.046175,1.238879,1.743572,1.903408,0.142695,-0.035273,1.262982,1.758196,1.915080,1.726889,0.725393,0.875658,3.569226
9,Carmelo Anthony,Forward,4.195589,22015,2015-12-10,3.491466,1.967325,2.815807,1.035719,1.349633,2.025261,1.013131,1.947707,2.129182,1.189826,1.053146,2.025736,-0.481554,0.585447,1.191004,0.973282,1.841673,1.675227,-1.972531,-1.704927,3.640534


In [222]:
"""
def CM_HEIGHT(x):
    MELO_HT = x['HEIGHT'] * 4.5
    return 'MELO_HT',MELO_HT

def CM_WEIGHT(x):
    MELO_WT = x['WEIGHT'] * 2.0
    return 'MELO_WT',MELO_WT

def CM_CAREER_MINUTES(x):
    MELO_CAREER_MIN = x['SEASON_MIN'] * 2.5
    return 'MELO_CAREER_MIN', MELO_CAREER_MIN

def CM_AGE(x):
    MELO_AGE = x['AGE']
    return 'MELO_AGE', MELO_AGE

def CM_MIN_PER(x):
    MIN_PER = x['MIN'] * 4.5
    return 'MELO_MIN_PER',MIN_PER

def CM_MIN_TOT(x):
    MIN_TOT = (x['MIN'] * x['GP']) * 7
    return 'MELO_MIN_TOT', MIN_TOT

def CM_TRUE_PER(x):
    TRUE_PER = x['TS_PCT'] * 6
    return 'MELO_TRUE_PER',TRUE_PER

def CM_USG_PER(x):
    USG_PER = x['USG_PCT'] * 6
    return 'MELO_USG_PER',USG_PER

def CM_AST_PER(x):
    AST_PER = x['AST_PCT'] * 5
    return 'MELO_AST_PCT', AST_PER

def CM_TO_PER(x):
    TO_PER= x['TM_TOV_PCT'] * 2.5
    return 'MELO_TO_PCT', TO_PER

def CM_REB_PER(x):
    REB_PER = x['REB_PCT'] * 5
    return 'MELO_REB_PCT', REB_PER

def CM_OFF_PM(x):
    OFF_PM= x['OFF_RATING'] * 3
    return 'OFF_PM', OFF_PM

def CM_DF_PM(x):
    DEF_PM= x['DEF_RATING'] * 3
    return 'DEF_PM', DEF_PM

def CM_3FEQ(x):
    MELO_3FEQ = x['3PT_FEQ'] * 3.5
    return 'MELO_3FEQ',MELO_3FEQ

def CM_FT_PER(x):
    MELO_FT_PER = x['FT_PER'] * 3.5
    return 'MELO_FT_PER',MELO_FT_PER

def weight_prop(cat_str, weight_dict):
    tot = sum(weight_dict.values())
    prop = weight_dict[cat_str] / tot
    return prop

def make_melo_sim(fab_std,cat_str):
    fab_std_player_idx = fab_std.set_index('PLAYER_NAME')
    fab_comp = pd.DataFrame(index=fab_std_player_idx.index.tolist(), columns=fab_std_player_idx.index.tolist())
    prop = weight_prop(cat_str, weights)
    melo_category = (fab_comp.apply(lambda x: fab_comp.columns,axis = 1)
             .apply(lambda x: fab_std_player_idx.loc[x.name][cat_str] - fab_std_player_idx.loc[x][cat_str], axis = 1)
             .applymap(lambda x: x**2 * prop))
    return melo_category

def fab_melo(player, comboMELO):
    root = comboMELO[comboMELO.PLAYER_NAME == player].sort_values('PLAYER_NAME')
    calc_melo_funcs = [CM_WEIGHT, CM_HEIGHT, CM_MIN_PER,CM_CAREER_MINUTES, CM_3FEQ, CM_MIN_TOT, CM_TRUE_PER, CM_USG_PER, CM_AST_PER, CM_TO_PER, CM_REB_PER, CM_OFF_PM, CM_DF_PM,CM_FT_PER,CM_AGE]
    result = root.groupby('SEASON_ID').apply(lambda x: pd.DataFrame(dict([('SEASON_ID',x.SEASON_ID),('PLAYER_NAME',x.PLAYER_NAME)] + map(lambda y: y(x),calc_melo_funcs))))
    return result

def zscore(col):
    return (col - col.mean())/col.std(ddof=0)
    
store_df = []
melo_advanced_df = pd.read_csv("./usage_stats/comboMELO.csv") 
players = set(season_subset(df85_15,1996,2015)['PLAYER_NAME'])
for player in players:
    store_df.append(fab_melo(player,melo_advanced_df))
FAB_MELO = pd.concat(store_df,axis = 0)
melo_cols = ["MELO_MIN_PER", "MELO_MIN_TOT", "DEF_PM","OFF_PM", "MELO_AST_PCT", "MELO_REB_PCT", "MELO_TO_PCT","MELO_USG_PER", "MELO_TRUE_PER","MELO_3FEQ","MELO_FT_PER","MELO_CAREER_MIN","MELO_WT","MELO_HT"]
weights = dict(zip(melo_cols,[4.5,7.0,3.0,3.0,5.0,5.0,2.5,6.0,6.0,3.5,3.5,2.5,2,4.5]))
FAB_MELO[melo_cols] = FAB_MELO[melo_cols].apply(zscore, axis =0)
get_top_ten(FAB_MELO[FAB_MELO.AGE == 26],weights,"Danny Green")
"""

SyntaxError: invalid syntax (<ipython-input-222-29ab6db4eff2>, line 2)

In [757]:
"""
salary_df = pd.read_csv('./DKSalaries/DKSalaries11.csv')
opt_players = list(set(salary_df.Name))
sampled_salary = salary_df.groupby("Name").apply(lambda x: x.sample(n=1)).reset_index(drop = True)
salary_dict = dict(zip(sampled_salary.Name, sampled_salary.Salary))
salary_dict
"""

{'Aaron Brooks': 3600,
 'Al Horford': 6900,
 'Andre Roberson': 3200,
 'Anthony Morrow': 3100,
 'Arron Afflalo': 4400,
 'Austin Rivers': 3600,
 'Ben McLemore': 3700,
 'Blake Griffin': 9200,
 'Bobby Portis': 3000,
 'Bojan Bogdanovic': 4200,
 'Branden Dawson': 3000,
 'Brook Lopez': 7700,
 'C.J. Wilcox': 3000,
 'Cameron Bairstow': 3000,
 'Cameron Payne': 3000,
 'Carl Landry': 3000,
 'Carmelo Anthony': 7600,
 'Caron Butler': 3000,
 'Chris Paul': 8400,
 'Christian Wood': 3000,
 'Cleanthony Early': 3000,
 'Cole Aldrich': 3000,
 'Cristiano Felicio': 3000,
 'D.J. Augustin': 3100,
 'Darren Collison': 4800,
 'DeAndre Jordan': 7100,
 'DeMarcus Cousins': 10200,
 'Dennis Schroder': 5000,
 'Derrick Rose': 6500,
 'Derrick Williams': 3100,
 'Dion Waiters': 4100,
 'Donald Sloan': 3000,
 'Doug McDermott': 3400,
 'Duje Dukan': 3000,
 "E'Twaun Moore": 3000,
 'Enes Kanter': 4600,
 'Eric Moreland': 3000,
 'Hollis Thompson': 3800,
 'Isaiah Canaan': 5000,
 'JJ Redick': 4000,
 'Jahlil Okafor': 6400,
 'Jakarr Sa

In [824]:
#singleday = pd.merge(gameday_df,player_pool_df[['PREDFANTASY_PTS','PLAYER_NAME']], on = 'PLAYER_NAME')

In [865]:
#dk = dk11.rename(columns={'Position':'DK_POSITION', 'Name': 'PLAYER_NAME','Salary':'SALARY'})

In [869]:
#singleday = pd.merge(player_pool_df,dk[['PLAYER_NAME','SALARY','DK_POSITION']], on=['PLAYER_NAME'])

In [None]:
"""
singleday['REAL_POSITION'] = singleday['DK_POSITION']
singleday.sort_values('PREDFANTASY_PTS', ascending = False)
np.unique(singleday['REAL_POSITION'].values)
#PG = singleday[singleday.POSITION == '']

total_players = singleday['PLAYER_NAME'].values
forwards = singleday[[pos in ["Forward"] for pos in singleday['POSITION'].values]]['PLAYER_NAME'].values
guards = singleday[[pos in ["Guard"] for pos in singleday['POSITION'].values]]['PLAYER_NAME'].values
centers = singleday[[pos in ["Center"] for pos in singleday['POSITION'].values]]['PLAYER_NAME'].values

forward = list(np.random.choice(a=forwards,replace=False,size=3))
guard = list(np.random.choice(a=guards,replace=False,size=3))
center = list(np.random.choice(a=centers,replace=False,size=1))
util = np.random.choice(a=list(set(total_players) - set(forward + guard + center)), replace = False, size = 1)
"""

In [829]:
#singleday.sort_values('PREDFANTASY_PTS', ascending = False)

Unnamed: 0,PLAYER_NAME,POSITION,EWMA_OPP_POS,SEASON_ID,GAME_DATE,EWMA_LOG_MIN,EWMA_LOG_FGM,EWMA_LOG_FGA,EWMA_LOG_FG_PCT,EWMA_LOG_FG3M,EWMA_LOG_FG3A,EWMA_LOG_FG3_PCT,EWMA_LOG_FTM,EWMA_LOG_FTA,EWMA_LOG_FT_PCT,EWMA_LOG_OREB,EWMA_LOG_DREB,EWMA_LOG_REB,EWMA_LOG_AST,EWMA_LOG_STL,EWMA_LOG_BLK,EWMA_LOG_TOV,EWMA_LOG_PF,EWMA_LOG_PTS,EWMA_LOG_FANTASY_PTS,PREDFANTASY_PTS
75,Steven Adams,Center,3.449736,22015,2015-12-10,3.077275,1.463485,2.060684,1.039870,0.916291,0.916291,0.916291,1.358705,1.643646,1.047386,1.876057,1.727886,2.267627,1.033139,1.183509,1.043805,1.141100,1.553914,2.019311,2.935985,8.971705
33,Jimmy Butler,Guard-Forward,3.626039,22015,2015-12-10,3.704987,2.455673,3.065302,1.092319,1.178249,1.917927,1.001862,2.448614,2.571915,1.211725,1.175742,2.063959,2.153812,1.368502,1.300995,1.067505,1.489076,1.753050,3.421964,3.760985,6.814987
55,Nerlens Noel,Forward-Center,3.309777,22015,2015-12-10,3.397559,2.023989,2.421020,1.126885,0.916291,0.916291,0.916291,1.471507,1.798194,1.089406,1.518100,1.863821,2.141462,1.184264,1.351485,1.266440,1.619615,1.659979,2.673645,3.284930,6.810962
49,Langston Galloway,Guard,4.332924,22015,2015-12-10,3.222834,1.417251,2.028969,1.045481,1.219720,1.441595,1.148951,1.037730,1.043604,0.978193,1.024379,1.486051,1.548619,1.311964,1.249991,0.916681,1.030406,1.096557,1.940261,2.625507,6.399312
58,Nikola Mirotic,Forward,4.106751,22015,2015-12-10,3.251855,1.905470,2.555991,1.067434,1.401288,2.130016,1.023090,0.975498,1.076950,0.925315,1.454216,1.690273,1.974371,1.588046,1.237341,1.183463,1.324069,1.545632,2.538293,3.246568,5.316672
25,Hollis Thompson,Guard-Forward,2.788337,22015,2015-12-10,3.206870,1.468122,2.223869,1.009488,1.290093,1.855644,1.008616,1.023234,1.033194,0.945269,0.979918,1.486065,1.522552,1.426433,0.955691,1.092081,1.121622,1.638425,2.027812,2.649120,4.843831
44,Kyle Korver,Guard,4.239683,22015,2015-12-10,3.557963,1.743603,2.381590,1.062806,1.362021,2.098877,1.010429,1.156635,1.162454,1.145876,0.941079,1.646085,1.658109,1.553530,1.270642,1.330444,1.655670,1.237431,2.399631,3.082987,4.448837
79,Thomas Robinson,Forward,4.166999,22015,2015-12-10,2.568297,1.352159,1.746154,1.039127,0.916291,0.916291,0.916291,1.181071,1.324934,1.141003,1.055919,1.845889,1.903352,0.916293,0.933360,1.311312,1.235427,1.654812,1.789989,2.586866,4.213513
2,Andre Roberson,Guard,4.603081,22015,2015-12-10,3.311199,1.626412,2.150399,1.071503,0.944667,1.412812,0.929647,1.323147,1.506389,1.085565,1.420415,1.725825,1.981828,1.021465,1.395628,1.219344,1.198074,1.462999,2.197043,2.987363,4.206345
78,Thaddeus Young,Forward,4.166999,22015,2015-12-10,3.602133,2.490854,2.881331,1.144469,0.916291,1.020586,0.916291,1.240042,1.482099,0.987384,1.686979,2.430832,2.658262,1.681762,1.436894,0.928797,1.806900,1.499662,3.117902,3.768341,4.103714


In [876]:
"""
gafantasypts = singleday['PREDFANTASY_PTS'].values
ga_pg = singleday['DK_POSITION'].map(lambda x: 1 if x == 'PG' else 0).values
ga_sg = singleday['DK_POSITION'].map(lambda x: 1 if x == 'SG' else 0).values
ga_g = singleday['DK_POSITION'].map(lambda x: 1 if (x == 'SG') or (x == 'PG') else 0).values
ga_sf = singleday['DK_POSITION'].map(lambda x: 1 if x == 'SF' else 0).values
ga_pf = singleday['DK_POSITION'].map(lambda x: 1 if x == 'PF' else 0).values
ga_f = singleday['DK_POSITION'].map(lambda x: 1 if (x == 'PF') or (x == 'SF') else 0).values
ga_c = singleday['DK_POSITION'].map(lambda x: 1 if x == 'C' else 0).values
#gautil = np.ones(len(gacenters))
gasalaries = singleday['SALARY'].values
"""

In [871]:
#list(set(singleday.DK_POSITION))

['SG', 'C', 'PF', 'PG', 'SF']

In [879]:
#small_data = zip(gasalaries, gafantasypts, ga_pg, ga_sg, ga_g, ga_sf, ga_pf, ga_f, ga_c)#,gautil)

In [880]:
#small_data

[(10400, 3.8616970464099998, 1, 0, 1, 0, 0, 0, 0),
 (10200, 3.88444163074, 0, 0, 0, 0, 1, 1, 0),
 (10100, 4.0063400173000003, 0, 0, 0, 1, 0, 1, 0),
 (9200, 3.7817678209599994, 0, 0, 0, 0, 1, 1, 0),
 (8500, 3.6618900974500002, 1, 0, 1, 0, 0, 0, 0),
 (8400, 3.76809757417, 1, 0, 1, 0, 0, 0, 0),
 (8300, 3.6834040881199996, 0, 0, 0, 0, 1, 1, 0),
 (7700, 3.77917534394, 0, 0, 0, 0, 0, 0, 1),
 (7700, 3.5692261186500001, 0, 0, 0, 0, 0, 0, 1),
 (7600, 3.6405338707400001, 0, 0, 0, 1, 0, 1, 0),
 (7500, 3.6145681018999998, 0, 1, 1, 0, 0, 0, 0),
 (7400, 3.6021099116800004, 0, 0, 0, 0, 1, 1, 0),
 (7200, 3.3627352879500001, 0, 0, 0, 1, 0, 1, 0),
 (7100, 3.6911223409699994, 0, 0, 0, 0, 0, 0, 1),
 (7000, 3.5825926523400002, 0, 0, 0, 1, 0, 1, 0),
 (6900, 3.5715496396300006, 0, 0, 0, 0, 0, 0, 1),
 (6800, 3.5266029908099998, 1, 0, 1, 0, 0, 0, 0),
 (6500, 3.4913916798300004, 1, 0, 1, 0, 0, 0, 0),
 (6200, 3.3229012253199999, 1, 0, 1, 0, 0, 0, 0),
 (6000, 3.47902331435, 0, 0, 0, 0, 1, 1, 0),
 (5800, 3.4825789

At this point, we can move into the actual optimization. We chose a genetic algorithm because we found decent documentation on solving a multidimensional knapsack problem with pyeasyga. Genetic algorithms are adaptive heuristic search algorithms based on the evolutionary ideas of natural selection and genetics. In short, candidate lineups go through natural selection: they survive across multiple generations, mating and mutating along the way. We input our salary cap and position caps and let the algorithm do its thing. With larger data sets, the algorithm takes much longer to run. In our experience, without a large enough population size, the algorithm returns junk responses, and we've only been able to work around this by increasing the population size.

You can read more about genetic algorithms here: http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol1/hmw/article1.html
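To make the selection, mating and mutation steps concrete, here is a minimal pure-Python sketch of a genetic algorithm on a toy lineup problem. It is not the pyeasyga implementation and not the code we ran: the player pool, the 30,000 cap, the five-player lineup, and the names `POOL`, `fitness` and `run_ga` are all made up for illustration. Only the tuple layout mirrors `small_data`.

```python
import random

# Toy pool, same tuple layout as small_data:
# (salary, predicted_points, pg, sg, g, sf, pf, f, c). All values are invented.
POOL = [
    (9000, 45.0, 1, 0, 1, 0, 0, 0, 0),
    (6000, 30.0, 1, 0, 1, 0, 0, 0, 0),
    (8000, 40.0, 0, 1, 1, 0, 0, 0, 0),
    (5000, 24.0, 0, 1, 1, 0, 0, 0, 0),
    (8500, 42.0, 0, 0, 0, 1, 0, 1, 0),
    (4500, 22.0, 0, 0, 0, 1, 0, 1, 0),
    (7500, 38.0, 0, 0, 0, 0, 1, 1, 0),
    (4000, 20.0, 0, 0, 0, 0, 1, 1, 0),
    (7000, 36.0, 0, 0, 0, 0, 0, 0, 1),
    (3500, 18.0, 0, 0, 0, 0, 0, 0, 1),
]
SALARY_CAP = 30000  # toy cap, not the real DraftKings cap
LINEUP_SIZE = 5     # toy lineup: one player per position


def fitness(individual, data):
    """Total predicted points, or 0 if any constraint is violated."""
    chosen = [item for selected, item in zip(individual, data) if selected]
    salary = sum(item[0] for item in chosen)
    points = sum(item[1] for item in chosen)
    pg, sg, sf, pf, c = (sum(item[k] for item in chosen) for k in (2, 3, 5, 6, 8))
    feasible = (salary <= SALARY_CAP
                and len(chosen) == LINEUP_SIZE
                and min(pg, sg, sf, pf, c) >= 1)
    return points if feasible else 0


def run_ga(data, pop_size=200, generations=60, mutation_prob=0.05, seed=0):
    """Return (best_fitness, best_mask) after a fixed number of generations."""
    rng = random.Random(seed)
    n = len(data)
    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    best = max(population, key=lambda ind: fitness(ind, data))
    for _ in range(generations):
        # Selection: keep the fitter half of the population as parents.
        ranked = sorted(population, key=lambda ind: fitness(ind, data), reverse=True)
        parents = ranked[: pop_size // 2]
        children = []
        while len(children) < pop_size:
            mom, dad = rng.sample(parents, 2)
            cut = rng.randint(1, n - 1)
            child = mom[:cut] + dad[cut:]  # mating: one-point crossover
            # Mutation: flip each bit with a small probability.
            child = [1 - bit if rng.random() < mutation_prob else bit for bit in child]
            children.append(child)
        population = children
        gen_best = max(population, key=lambda ind: fitness(ind, data))
        if fitness(gen_best, data) > fitness(best, data):
            best = gen_best
    return fitness(best, data), best
```

Calling `run_ga(POOL)` returns a `(fitness, mask)` pair in the same shape as the output of `ga.best_individual()`. Because infeasible lineups all score 0, the search has nothing to climb until at least one feasible lineup appears in the population, which is consistent with the population-size sensitivity described above.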

In [897]:
"""
Not used in this version of the code

from pyeasyga import pyeasyga

ga = pyeasyga.GeneticAlgorithm(small_data)        # initialise the GA with data
ga.population_size = 400000
ga.mutation_probability = .05

# define a fitness function
def fitness(individual, data):
    salaries, points, pg,sg,g, sf,pf, f, c = 0, 0, 0, 0, 0, 0, 0,0, 0
    for (selected, item) in zip(individual, data):
        if selected:
            salaries += item[0]
            points += item[1]
            pg += item[2]
            sg += item[3]
            g += item[4]
            sf += item[5]
            pf += item[6]
            f += item[7]
            c += item[8]
    if salaries > 50000 or (pg < 1) or (sg < 1) or (sf < 1) or (pf < 1) or (c < 1) or (c + f + g > 8):
        points = 0
    return points

ga.fitness_function = fitness               # set the GA's fitness function
ga.run()                                    # run the GA
print ga.best_individual()                  # print the GA's best solution
"""

(0, [1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0])
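The best individual above has a fitness of 0 and selects far more than eight players, so it violates the lineup constraints. One plausible explanation, which we have not verified, is that the initial population is built by flipping each bit independently, so almost every starting individual picks roughly half the pool and scores 0. The sketch below shows one possible remedy under that assumption: start from masks with exactly eight players and mutate by swapping a player in and out. The function names are hypothetical, and whether they can be plugged into pyeasyga as custom hooks is also an assumption.

```python
import random


def random_lineup_mask(n_players, lineup_size=8, rng=random):
    """Return a 0/1 mask of length n_players with exactly lineup_size ones."""
    picked = set(rng.sample(range(n_players), lineup_size))
    return [1 if i in picked else 0 for i in range(n_players)]


def swap_mutate(mask, rng=random):
    """Mutate in place while keeping lineup size fixed: drop one player, add another."""
    ones = [i for i, bit in enumerate(mask) if bit]
    zeros = [i for i, bit in enumerate(mask) if not bit]
    if ones and zeros:
        mask[rng.choice(ones)] = 0
        mask[rng.choice(zeros)] = 1
    return mask
```

With these two pieces every individual always contains eight players, so the fitness function only has to reject lineups on salary and position grounds.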


In [891]:
#import itertools
#list(itertools.combinations(['pg','sg','sf','pf','c','g','f'],2))

[('pg', 'sg'),
 ('pg', 'sf'),
 ('pg', 'pf'),
 ('pg', 'c'),
 ('pg', 'g'),
 ('pg', 'f'),
 ('sg', 'sf'),
 ('sg', 'pf'),
 ('sg', 'c'),
 ('sg', 'g'),
 ('sg', 'f'),
 ('sf', 'pf'),
 ('sf', 'c'),
 ('sf', 'g'),
 ('sf', 'f'),
 ('pf', 'c'),
 ('pf', 'g'),
 ('pf', 'f'),
 ('c', 'g'),
 ('c', 'f'),
 ('g', 'f')]

In smaller tests and previous iterations of the genetic algorithm, we used the following test code to print the players the optimizer returned.

In [None]:
#_,mask = ga.best_individual()
#mask = np.array(mask) == 1
#singleday[mask]

In [224]:
#xtrain,ytrain,xtest,ytest = timeseries_cv(test_df,['PTS'],"FANTASY_PTS_RESP",7)

NameError: global name 'lcols' is not defined

In [20]:
#ewma_pos_df,_ = make_ewma_pos_df(df85_15,'2009-12-04')

In [25]:
#test_df.head()

AttributeError: 'tuple' object has no attribute 'head'

In [138]:
#df,mask = get_player_seasons('Chris Paul','2015-12-10',ewma_pos_df,'FANTASY_PTS',['PTS'], True)
#test_df = df[mask]

In [88]:
#test_df['FANTASY_PTS_RESP'].shape

(799,)

In [142]:
"""
Implemented a replacement for grid search that works with time series data, not used in this version

import itertools
import operator
from sklearn.svm import LinearSVC, SVR, SVC
from sklearn.linear_model import LogisticRegression, Lasso, Ridge, LinearRegression, SGDClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

#GridSearchCV, authored by David DiCiurcio
def davidsearchcv(X_trainFolds, y_trainFolds, X_testFolds, y_testFolds, parameters, classifier, regression):    
    templist = []
    paramlist = []
    outputs = {}
    clflist = []
    counter = 0
    averageaccuracy = []
    if regression:
        for i in parameters:
            templist.append(parameters[i])
        zlist = list(itertools.product(*templist))

        for i in zlist:
            paramlist.append(dict(zip(parameters.keys(),i)))
        
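        for i in paramlist:
            # Build the keyword-argument string for this parameter combination
            stringexec = ''
            for k in i:
                if isinstance(i[k], basestring):
                    stringexec = stringexec+k+"="+"'"+i[k]+"'"+","
                else:
                    stringexec = stringexec+k+"="+str(i[k])+","
            exec "clf = "+classifier+"("+stringexec[:-1]+")" 
            # Reset the scores for each parameter set so they are not pooled across settings
            foldscores = []
            for j in range(0,len(X_trainFolds)):
                clf.fit(X_trainFolds[j], y_trainFolds[j])
                foldscores.append(clf.score(X_testFolds[j], y_testFolds[j]))
            # One entry per parameter set, so clflist[accmaxindex] lines up with outputs
            clflist.append(clf)
            outputs[counter] = np.mean(foldscores)
            counter = counter + 1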
    else:
        exec "clf = "+classifier+"()"
        for j in range(0,len(X_trainFolds)):
                clf.fit(X_trainFolds[j], y_trainFolds[j])
                preds = clf.predict(X_testFolds[j])
                averageaccuracy.append(mape(preds, y_testFolds[j]))
                clflist.append(clf)
        outputs[counter] = np.mean(averageaccuracy)
        counter = counter + 1

        
    accmaxindex = max(outputs.iteritems(), key=operator.itemgetter(1))[0]
    return clflist, paramlist, accmaxindex, outputs[accmaxindex]
"""
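The search above consumes lists of pre-built folds. The helper that builds them, `timeseries_cv`, is not shown in this section, so the following is only a sketch of one common way to make folds that respect time order: an expanding window, where each training set contains only rows that come before its test set. The function name and splitting rule are assumptions, not the original implementation.

```python
def expanding_window_folds(X, y, kfolds):
    """Split time-ordered rows into kfolds train/test pairs.

    Rows must already be sorted by date, and len(X) must be at least kfolds + 1.
    Fold k trains on the first k blocks and tests on block k + 1.
    """
    n = len(X)
    block = n // (kfolds + 1)
    X_train, y_train, X_test, y_test = [], [], [], []
    for k in range(1, kfolds + 1):
        split = k * block
        # The last fold absorbs any leftover rows at the end.
        end = n if k == kfolds else split + block
        X_train.append(X[:split])
        y_train.append(y[:split])
        X_test.append(X[split:end])
        y_test.append(y[split:end])
    return X_train, y_train, X_test, y_test
```

The four returned lists have the same shape as the `X_trainFolds, y_trainFolds, X_testFolds, y_testFolds` arguments, so no fold ever trains on games played after the ones it is scored on.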

In [114]:
#parameters.keys()

['n_estimators']

In [104]:
"""
from sklearn.ensemble import VotingClassifier

# clflist in form of ('lr', clf1)
# weightlist in form of [1, 2, 4]
# voting in form of 'soft' or 'hard'
def runVotingClassifier(Xtrain,ytrain,Xtest,ytest,clflist,weights,voting):
    vcaverage = []
    for i in range(0,len(Xtrain)):
        eclf = VotingClassifier(estimators=clflist,voting=voting,weights=weights)
        eclf.fit(Xtrain[i], ytrain[i])
        vcaverage.append(eclf.score(Xtest[i], ytest[i]))
    return np.mean(vcaverage)
"""

In [127]:
"""
#non-regression classifiers
x = [0.01, 0.1, 1., 10., 100.]
y = ['rbf','poly','linear']
y2 = [True]
z = [5, 10, 20, 50]
parameters1={'C':x, 'kernel':y, 'probability':y2}
parameters={'n_estimators':z}
#clflist, vala, valb, valc = davidsearchcv(X_trainFolds, y_trainFolds, X_testFolds, y_testFolds, parameters,'ExtraTreesClassifier')
"""

In [178]:
"""
#regression
Lasso, Ridge, LinearRegression, SVR
#SVRc = [0.01, 0.1, 1., 10., 100.]
#SVRc2 = [True]
RidgeAlpha = [0.01, 0.1, 1., 10., 100.]
#SVRc
#LassoAlpha = [0.01, 0.1, 1., 10., 100.]
#LassoP = {'alpha':LassoAlpha}
RidgeP = {'alpha':RidgeAlpha}
#LinearRegressionP = {}
#SVRP = {'C':SVRc}
"""

In [167]:
"""
def ClassifierComp(df, lcols,resp_str,kfolds, searchlst,weights,voting,regression):
    X_trainFolds, y_trainFolds, X_testFolds, y_testFolds = timeseries_cv(df,lcols,resp_str,kfolds)
    VCclfLst = []
    for p,c in searchlst:
        clflist,_,accmaxindex,_ = davidsearchcv(X_trainFolds, y_trainFolds, X_testFolds, y_testFolds,p,c,regression)
        VCclfLst.append((c,clflist[accmaxindex]))
    return VCclfLst, X_trainFolds, y_trainFolds, X_testFolds, y_testFolds#runVotingClassifier(X_trainFolds, y_trainFolds, X_testFolds, y_testFolds,VCclfLst,weights,voting)
"""

In [170]:
#clflist, X_trainFolds, y_trainFolds, X_testFolds, y_testFolds = ClassifierComp(test_df,['EWMA_LOG_FANTASY_PTS','EWMA_LOG_PTS'],'FANTASY_PTS',5,searchlist,[1,1,1,1],'soft',True)

  return column_or_1d(y, warn=True).astype(np.float64)

In [172]:
#clf = clflist[0][1].fit(X_trainFolds[0],y_trainFolds[0])

In [173]:
#clf.predict(X_testFolds[0])

array([ 0.54066115])