## Overview ##

This is a starter notebook inspired by last year's [Logistic Regression on Tournament Seeds by Kasper P. Lauritzen](https://www.kaggle.com/kplauritzen/notebookde27b18258?scriptVersionId=804590) starter kernel. It creates a basic logistic regression model based on the seed differences between teams. 

Note that the predictions for Stage 1's sample submissions file are already based on known outcomes, and the Tourney data this model is trained on includes that data. For Stage 2, you will be predicting future outcomes based on the teams selected for the tournament on March 11.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
pd.set_option('mode.chained_assignment', None)
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
import random
from sklearn.utils import shuffle
from sklearn.model_selection import GridSearchCV


In [2]:
# Input data files are read from the directory below; adjust the path to match your environment.

data_dir = '../datasets/womens-machine-learning-competition-2018/'

# Any results you write to the current directory are saved as output.

## Load the training data ##
I'm building on the starter notebook by including DetailedResults for a "past 10 game" average and season-ending ELO ratings from Liam Kirwin. Only the ELO ratings are loaded in the cells below.

In [4]:
df_seeds = pd.read_csv(data_dir + 'WNCAATourneySeeds.csv')
df_tour = pd.read_csv(data_dir + 'WNCAATourneyCompactResults.csv')
df_elo_ratings = pd.read_csv(data_dir + 'updated_season_elos_women.csv')
df_teams = pd.read_csv(data_dir + 'WTeams.csv')

In [5]:
df_teams.head()

Unnamed: 0,TeamID,TeamName
0,3101,Abilene Chr
1,3102,Air Force
2,3103,Akron
3,3104,Alabama
4,3105,Alabama A&M


In [7]:
df_tour.tail()

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT
1255,2017,147,3163,90,3332,52,H,0
1256,2017,147,3376,71,3199,64,N,0
1257,2017,151,3280,66,3163,64,N,1
1258,2017,151,3376,62,3390,53,N,0
1259,2017,153,3376,67,3280,55,N,0


In [8]:
df_elo_ratings = df_elo_ratings.rename(columns={'team_id':'WTeamID', 'season': 'Season'})
df_elo_ratings.head()

Unnamed: 0,Season,season_elo,WTeamID
0,2014,1466.742705,3101
1,2015,1387.712167,3101
2,2016,1548.082286,3101
3,2017,1561.905894,3101
4,2018,1409.454524,3101


In [9]:
df_elo_ratings.Season.unique()

array([2014, 2015, 2016, 2017, 2018, 1998, 1999, 2000, 2001, 2002, 2003,
       2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013])

First, we'll simplify the datasets to remove the columns we won't be using and convert the seedings to the needed format (stripping the regional abbreviation in front of the seed).

In [10]:
def seed_to_int(seed):
    """Return the numeric part of a seed string (e.g. 'W01' -> 1) as an int."""
    return int(seed[1:3])
df_seeds['seed_int'] = df_seeds.Seed.apply(seed_to_int)
df_seeds.drop(labels=['Seed'], inplace=True, axis=1) # This is the string label
df_seeds.head()

Unnamed: 0,Season,TeamID,seed_int
0,1998,3330,1
1,1998,3163,2
2,1998,3112,3
3,1998,3301,4
4,1998,3272,5
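
As a quick illustration of the seed conversion, the sketch below applies the same slicing to a few made-up seed strings. The sample values, including the one with a trailing letter, are assumptions for illustration and are not rows from `WNCAATourneySeeds.csv`.

```python
import pandas as pd

def seed_to_int(seed):
    # Character 0 is the region letter; characters 1-2 hold the numeric seed
    return int(seed[1:3])

# Hypothetical seed strings, not taken from the competition file
example = pd.DataFrame({'Seed': ['W01', 'X16', 'Y08', 'Z11a']})
example['seed_int'] = example.Seed.apply(seed_to_int)
print(example)
```

Slicing `[1:3]` rather than `[1:]` means any suffix after the two digits is ignored instead of breaking the `int` conversion.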


In [11]:
df_tour.drop(labels=['DayNum', 'WScore', 'LScore', 'WLoc', 'NumOT'], inplace=True, axis=1)
df_tour.head()

Unnamed: 0,Season,WTeamID,LTeamID
0,1998,3104,3422
1,1998,3112,3365
2,1998,3163,3193
3,1998,3198,3266
4,1998,3203,3208


## Merge seed for each team ##
Merge the Seeds and ELO ratings with their corresponding TeamIDs in the compact results dataframe.

In [12]:
df_winseeds = df_seeds.rename(columns={'TeamID':'WTeamID', 'seed_int':'WSeed'})
df_lossseeds = df_seeds.rename(columns={'TeamID':'LTeamID', 'seed_int':'LSeed'})
df_dummy = pd.merge(left=df_tour, right=df_winseeds, how='left', on=['Season', 'WTeamID'])
df_concat = pd.merge(left=df_dummy, right=df_lossseeds, on=['Season', 'LTeamID'])
df_concat['SeedDiff'] = df_concat.WSeed - df_concat.LSeed
df_concat.head()

Unnamed: 0,Season,WTeamID,LTeamID,WSeed,LSeed,SeedDiff
0,1998,3104,3422,2,15,-13
1,1998,3112,3365,3,14,-11
2,1998,3163,3193,2,15,-13
3,1998,3198,3266,7,10,-3
4,1998,3203,3208,10,7,3


In [13]:
df_concat = pd.merge(left=df_concat, right=df_elo_ratings, how='left', on=['Season', 'WTeamID'])

In [14]:
df_concat = df_concat.rename(columns={'season_elo': 'WTeamELO'})
df_elo_ratings = df_elo_ratings.rename(columns={'WTeamID': 'LTeamID'})

In [15]:
df_concat = pd.merge(left=df_concat, right=df_elo_ratings, how='left', on=['Season', 'LTeamID'])

In [16]:
df_concat = df_concat.rename(columns={'season_elo': 'LTeamELO'})

In [17]:
df_concat['ELODiff'] = df_concat.WTeamELO - df_concat.LTeamELO

In [18]:
df_concat.head()

Unnamed: 0,Season,WTeamID,LTeamID,WSeed,LSeed,SeedDiff,WTeamELO,LTeamELO,ELODiff
0,1998,3104,3422,2,15,-13,1713.811438,1588.263795,125.547643
1,1998,3112,3365,3,14,-11,1659.482317,1673.878038,-14.395721
2,1998,3163,3193,2,15,-13,1785.247304,1586.31111,198.936195
3,1998,3198,3266,7,10,-3,1751.701793,1632.97984,118.721953
4,1998,3203,3208,10,7,3,1617.797353,1593.958564,23.838789


In [19]:
df_concat.tail()

Unnamed: 0,Season,WTeamID,LTeamID,WSeed,LSeed,SeedDiff,WTeamELO,LTeamELO,ELODiff
1255,2017,3163,3332,1,10,-9,2565.874293,1811.861657,754.012637
1256,2017,3376,3199,1,3,-2,2282.594911,2088.322098,194.272813
1257,2017,3280,3163,2,1,1,2064.219596,2565.874293,-501.654697
1258,2017,3376,3390,1,2,-1,2282.594911,2114.860666,167.734245
1259,2017,3376,3280,1,2,-1,2282.594911,2064.219596,218.375315
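
Because the ratings are joined with `how='left'`, a game whose team has no rating for that season would quietly end up with `NaN` rather than raising an error. A minimal sketch of one way to check for that, using toy frames with made-up values in place of `df_tour` and `df_elo_ratings`:

```python
import pandas as pd

# Toy stand-ins for df_tour and df_elo_ratings (all values are hypothetical)
games = pd.DataFrame({'Season': [2017, 2017], 'WTeamID': [3163, 3999]})
elos = pd.DataFrame({'Season': [2017], 'WTeamID': [3163], 'season_elo': [2565.9]})

# indicator=True adds a _merge column recording where each row found a match
merged = pd.merge(games, elos, how='left', on=['Season', 'WTeamID'], indicator=True)

# Rows present only on the left side had no rating to join
unmatched = merged[merged['_merge'] == 'left_only']
n_missing = int(merged['season_elo'].isna().sum())
print(unmatched[['Season', 'WTeamID']])
print(n_missing)
```

Running the same check on the real merged frame before computing `ELODiff` would show whether any tournament game is missing a rating.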


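Now we'll create a dataframe that summarizes wins and losses along with their corresponding ELO rating differences. This is the data the model is trained on. Each game appears twice: once from the winner's side (`Result = 1`) and once from the loser's side with the sign of the difference flipped (`Result = 0`).
*First edit:* Use ELO ratings instead of seed difference for the Round 1 submission.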

In [20]:
df_wins = pd.DataFrame()
df_wins['ELODiff'] = df_concat['ELODiff']
df_wins['Result'] = 1

df_losses = pd.DataFrame()
df_losses['ELODiff'] = -df_concat['ELODiff']
df_losses['Result'] = 0

df_predictions = pd.concat((df_wins, df_losses))
df_predictions.head()

Unnamed: 0,ELODiff,Result
0,125.547643,1
1,-14.395721,1
2,198.936195,1
3,118.721953,1
4,23.838789,1


In [21]:
print(df_predictions.shape)

(2520, 2)
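
The mirroring step can be seen on a tiny example. The sketch below uses three hypothetical rating differences and shows that the resulting frame has balanced classes and differences that cancel out. `ignore_index=True` is an addition here (not in the cell above) that avoids repeated index labels after concatenation.

```python
import pandas as pd

# Hypothetical winner-minus-loser rating differences
elo_diff = pd.Series([125.5, -14.4, 198.9])

wins = pd.DataFrame({'ELODiff': elo_diff, 'Result': 1})
losses = pd.DataFrame({'ELODiff': -elo_diff, 'Result': 0})

# ignore_index=True gives the combined frame a fresh 0..n-1 index
mirrored = pd.concat((wins, losses), ignore_index=True)

print(mirrored)
print(mirrored.Result.mean())   # share of rows labelled as wins
print(mirrored.ELODiff.sum())   # mirrored differences cancel
```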


In [22]:
X_train = df_predictions.ELODiff.values.reshape(-1,1)
print(X_train.shape)
y_train = df_predictions.Result.values
print(y_train)
X_train, y_train = shuffle(X_train, y_train)

(2520, 1)
[1 1 1 ..., 0 0 0]


In [23]:
y_train.shape

(2520,)

## Train the model ##
Train a basic logistic regression model. A grid search over 50 values of the regularization parameter `C`, log-spaced from 1e-5 to 1e3, selects the value with the best cross-validated log loss.

In [28]:
logreg = LogisticRegression()
params = {'C': np.logspace(start=-5, stop=3, num=50)}
clf = GridSearchCV(logreg, params, scoring='neg_log_loss', refit=True)
clf.fit(X_train, y_train)
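# scoring='neg_log_loss' reports the negated log loss, so best_score_ is negative and values closer to 0 are better
print('Best log_loss: {:.4}, with best C: {}'.format(clf.best_score_, clf.best_params_['C']))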

Best log_loss: -0.4555, with best C: 4.498432668969444e-05
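
To make the fitted model more concrete: logistic regression turns a rating difference into a win probability through the sigmoid of a linear function. The sketch below fits a model on synthetic data (the generating scale of 150 rating points and all values are assumptions, not estimates from the tournament data) and reproduces `predict_proba` by hand.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic rating differences; the higher-rated side wins more often
diff = rng.normal(0, 200, size=500)
p_true = 1 / (1 + np.exp(-diff / 150))
won = (rng.random(500) < p_true).astype(int)

# Mirror each game, as in the training frame above
X = np.concatenate([diff, -diff]).reshape(-1, 1)
y = np.concatenate([won, 1 - won])

model = LogisticRegression().fit(X, y)
b0, b1 = model.intercept_[0], model.coef_[0, 0]

# P(win) = 1 / (1 + exp(-(b0 + b1 * diff))) for a 100-point advantage
x_new = np.array([[100.0]])
manual = 1 / (1 + np.exp(-(b0 + b1 * x_new[0, 0])))
auto = model.predict_proba(x_new)[0, 1]
print(manual, auto)
```

The same calculation applies to the grid-searched model via `clf.best_estimator_`, which exposes `intercept_` and `coef_` in the same way.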


## Format season and team IDs from the WSampleSubmissionStage2.csv file ##


In [29]:
df_sample_sub = pd.read_csv(data_dir + 'WSampleSubmissionStage2.csv')
n_test_games = len(df_sample_sub)

def get_year_t1_t2(ID):
    """Return a tuple with ints `year`, `team1` and `team2`."""
    return tuple(int(x) for x in ID.split('_'))

In [35]:
df_sample_sub.shape

(2016, 2)

In [30]:
# Confirm that df_elo_ratings still has complete (1998-2018) data
df_elo_ratings.Season.sort_values().unique()

array([1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008,
       2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018])

In [31]:
# Rename team ID column back to the generic form
df_elo_ratings = df_elo_ratings.rename(columns={'LTeamID': 'TeamID'})

In [32]:
df_elo_ratings.tail()

Unnamed: 0,Season,season_elo,TeamID
7008,2014,1440.28913,3464
7009,2015,1508.802926,3464
7010,2016,1487.679329,3464
7011,2017,1316.333043,3464
7012,2018,1428.262022,3464


## Changed function from sample notebook to grab ELO Ratings ##
Build the test features: for each matchup in the sample submission, look up both teams' ELO ratings for that season and take the difference.

In [33]:
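X_test = np.zeros(shape=(n_test_games, 1))
# Assumes df_sample_sub has the default 0..n-1 index, so ii doubles as the row position
for ii, row in df_sample_sub.iterrows():
    year, t1, t2 = get_year_t1_t2(row.ID)
    # Look up each team's season-ending ELO rating for the matchup's season
    t1_elo = df_elo_ratings[(df_elo_ratings.TeamID == t1) & (df_elo_ratings.Season == year)].season_elo.values[0]
    t2_elo = df_elo_ratings[(df_elo_ratings.TeamID == t2) & (df_elo_ratings.Season == year)].season_elo.values[0]
    # Feature is team1 minus team2, matching the sign convention used in training
    diff_elo = t1_elo - t2_elo
    X_test[ii, 0] = diff_elo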

In [34]:
X_test.shape

(2016, 1)
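
The row-by-row lookup above filters the ratings frame twice per matchup. One possible alternative, sketched here on toy data with made-up ratings, splits the `ID` column once and uses two merges instead. The frame names `sub` and `elos` and the columns `T1` and `T2` are hypothetical stand-ins, not names from the notebook.

```python
import pandas as pd

# Toy stand-ins (hypothetical values) for df_sample_sub and df_elo_ratings
sub = pd.DataFrame({'ID': ['2018_3110_3113', '2018_3110_3114'], 'Pred': [0.5, 0.5]})
elos = pd.DataFrame({
    'Season': [2018, 2018, 2018],
    'TeamID': [3110, 3113, 3114],
    'season_elo': [1500.0, 1800.0, 1600.0],
})

# Split 'year_team1_team2' into three integer columns
ids = sub['ID'].str.split('_', expand=True).astype(int)
ids.columns = ['Season', 'T1', 'T2']

# A left merge keeps the matchups in their original order
t1 = ids.merge(elos, how='left', left_on=['Season', 'T1'], right_on=['Season', 'TeamID'])
t2 = ids.merge(elos, how='left', left_on=['Season', 'T2'], right_on=['Season', 'TeamID'])

# team1 rating minus team2 rating, shaped (n_games, 1) for scikit-learn
X_test = (t1['season_elo'] - t2['season_elo']).to_numpy().reshape(-1, 1)
print(X_test)
```

A missing rating shows up as `NaN` in this version, whereas the loop raises an `IndexError` at `.values[0]`.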

## Make Predictions ##
Create predictions using the logistic regression model we trained.

In [36]:
preds = clf.predict_proba(X_test)[:,1]
df_sample_sub['Pred'] = preds  # bracket assignment works whether or not the column already exists
df_sample_sub.head()

Unnamed: 0,ID,Pred
0,2018_3110_3113,0.064879
1,2018_3110_3114,0.304418
2,2018_3110_3124,0.002544
3,2018_3110_3125,0.206841
4,2018_3110_3129,0.334875


In [37]:
df_teams.head()

Unnamed: 0,TeamID,TeamName
0,3101,Abilene Chr
1,3102,Air Force
2,3103,Akron
3,3104,Alabama
4,3105,Alabama A&M


In [41]:
df_teams[(df_teams.TeamName == 'Connecticut')]

Unnamed: 0,TeamID,TeamName
62,3163,Connecticut


Lastly, create your submission file!

In [42]:
df_sample_sub.to_csv('dan_douthit_womens_elo_predictions_2018.csv', index=False)
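
Before uploading, it can help to run a few basic checks on the submission frame. The sketch below is one way to do that. The helper `check_submission` is hypothetical, and the ID pattern of three four-digit fields is an assumption based on the sample rows shown earlier.

```python
import pandas as pd

def check_submission(df):
    """Return a list of problems found; an empty list means the checks passed."""
    if list(df.columns) != ['ID', 'Pred']:
        return ['unexpected columns']
    problems = []
    # IDs are assumed to look like 'year_team1_team2' with four digits each
    if not df['ID'].str.match(r'^\d{4}_\d{4}_\d{4}$').all():
        problems.append('malformed ID')
    if df['ID'].duplicated().any():
        problems.append('duplicate ID')
    # Probabilities must be present and lie in [0, 1]
    if df['Pred'].isna().any() or not df['Pred'].between(0, 1).all():
        problems.append('Pred missing or outside [0, 1]')
    return problems

# Hypothetical rows in the same layout as df_sample_sub
sub = pd.DataFrame({'ID': ['2018_3110_3113', '2018_3110_3114'],
                    'Pred': [0.064879, 0.304418]})
print(check_submission(sub))
```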