## Problem Statement

Let's give it a crack! After reading the introduction, it seems to me that the goal of this project is to predict the result of a match between two teams, based on the 'knowledge' we have learned about the teams. It is therefore important to extract as much information as possible from the teams' past records.

## Data inspection

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are read from the "./input/" directory.

teams = pd.read_csv('./input/teams.csv')
seasons = pd.read_csv('./input/seasons.csv')

In [7]:
# build a team_id -> team_name lookup
team_dict = dict(zip(teams['Team_Id'], teams['Team_Name']))
print('Total number of teams: {}'.format(len(team_dict)))

Total number of teams: 364


In [3]:
regular_season = pd.read_csv('./input/RegularSeasonCompactResults.csv')
regular_season_detail = pd.read_csv('./input/RegularSeasonDetailedResults.csv')

In [4]:
import re
# leading run of 'W' in each column name: 'W' for winner-side columns (and Wloc), '' otherwise
t = list(map(lambda x: re.search('W*', x).group(), regular_season_detail.columns.tolist()))

In [5]:
Tourney = pd.read_csv('./input/TourneyCompactResults.csv')
Tourney_detail = pd.read_csv('./input/TourneyDetailedResults.csv')
Tour_seed = pd.read_csv('./input/TourneySeeds.csv')
Tour_slots = pd.read_csv('./input/TourneySlots.csv')

With all the information loaded, how do we build a sensible model? The ideal model, like a crystal ball, should take in two teams and spit out the winner. This should be the backbone of the model. In reality, however, there are more factors we might want to consider: for example, how many games has each team played prior to their encounter? How fresh are their legs? Are they historical rivals (like Duke and UNC)? Also, although many years of historical data are available, the roster of a college basketball team changes every year, so a record from 10 years ago might not be as useful as last year's.
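One way to act on the "recent seasons matter more" idea is to down-weight older seasons when averaging a statistic. The sketch below assumes an exponential decay per season; the `decay` value, the helper name `recency_weighted_mean`, and the toy numbers are all illustrative assumptions, not something derived from the competition data.

```python
import pandas as pd

# Hypothetical per-season averages for one team (values are made up).
history = pd.DataFrame({
    'Season': [2014, 2015, 2016],
    'ppg':    [60.0, 70.0, 80.0],
})

def recency_weighted_mean(df, value_col, current_season, decay=0.5):
    """Weighted mean where a season k years back gets weight decay**k.
    `decay` is an illustrative knob, not a value tuned on the data."""
    age = current_season - df['Season']
    weights = decay ** age
    return (df[value_col] * weights).sum() / weights.sum()

print(recency_weighted_mean(history, 'ppg', current_season=2017))
```

With `decay=1.0` every season counts equally and the result is the plain mean; smaller values lean harder on the most recent season.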

Well, other than the names of the two teams, what else should we provide to the crystal ball? Most likely, it needs to know more about each team, such as average points scored/allowed per game, the win/loss record for the season, average rebounds per game, etc. These attributes can be computed for each team from the provided training data.

## Feature Engineering

Following the reasoning above, we need to prepare a set of features for each of the two teams and then make a prediction based on them. What features can we distill from the existing data and, more importantly, which of them matter?
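To make "a set of features for each of the two teams" concrete, here is a minimal sketch of one possible encoding: take each team's season-level features and feed the model the difference between them. The table `team_features`, its values, and the helper `matchup_row` are hypothetical; other encodings (for example, concatenating both teams' features) are equally plausible.

```python
import pandas as pd

# Hypothetical season-level features, indexed by team id (values are made up).
team_features = pd.DataFrame(
    {'ppg': [72.5, 68.0], 'oppg': [65.0, 66.5], 'win_pct': [0.75, 0.60]},
    index=pd.Index([1104, 1272], name='team'),
)

def matchup_row(features, team_a, team_b):
    """Feature vector for 'team_a vs team_b': A's stats minus B's stats.
    Swapping the teams flips the sign of every entry."""
    diff = features.loc[team_a] - features.loc[team_b]
    return diff.add_suffix('_diff')

x = matchup_row(team_features, 1104, 1272)
print(x)
```

A difference encoding keeps the feature vector small and antisymmetric, so the model sees the same matchup consistently regardless of which team is listed first.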

The rule of the game is simple: you need to outscore your opponent to win. A good indicator is therefore the average points per game; let's call it *ppg*. If team A has a higher ppg than team B, and this is the only information I have, I would bet on team A to beat team B. Now the question is how to get the ppg. For the first game of the season, where does this number come from? Maybe the average ppg from last season? For the last game of the season, does it make more sense to use the average ppg over the earlier games of that season? The features need to take this timing into account.
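The timing question above can be sketched as a "ppg before this game" feature: for each team and season, average only the games played earlier. The game log below is a made-up toy table with one row per team per game, and the column name `ppg_before` is an assumption for illustration. The first game of a season comes out as NaN, which is where a fallback such as last season's average could be plugged in.

```python
import pandas as pd

# Toy per-team game log (hypothetical values); one row per team per game.
log = pd.DataFrame({
    'Season': [2003, 2003, 2003, 2003, 2003],
    'Daynum': [10, 14, 20, 11, 15],
    'team':   [1104, 1104, 1104, 1272, 1272],
    'pts':    [68, 72, 80, 70, 60],
})

# Chronological order within each team's season.
log = log.sort_values(['Season', 'team', 'Daynum']).reset_index(drop=True)

# Average of points in earlier games only: shift(1) drops the current game
# before the expanding mean, so the feature never sees the game being predicted.
log['ppg_before'] = (
    log.groupby(['Season', 'team'])['pts']
       .transform(lambda s: s.shift(1).expanding().mean())
)
print(log)
```

Shifting before averaging is the important part: without it, the feature would include the score of the very game it is supposed to help predict.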

Let's start building such features for each team and each regular season. Because the statistics from 2003 onward (*RegularSeasonDetailedResults.csv*) carry more information than those for earlier years (*RegularSeasonCompactResults.csv*), let me start from the 2003 season.
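As a rough sketch of "features for each team and each regular season", the winner-side (`W...`) and loser-side (`L...`) columns can be stacked into one row per team per game, so that a team's wins and losses both count toward its averages. The toy table uses only a few of the detailed-results columns with made-up values; the helper `team_view` and the choice to compute field goal percentage from season totals are assumptions, not necessarily the definitions used later in this notebook.

```python
import pandas as pd

# Toy rows with a few of the detailed-results columns (values are made up).
detail = pd.DataFrame({
    'Season': [2003, 2003],
    'Wteam': [1104, 1272], 'Lteam': [1272, 1104],
    'Wscore': [68, 70], 'Lscore': [62, 63],
    'Wfgm': [27, 26], 'Wfga': [58, 62],
    'Lfgm': [22, 24], 'Lfga': [53, 60],
})

def team_view(df, own, opp):
    """Games seen from one side: `own`-prefixed box-score columns lose
    their prefix, and the `opp` score becomes points allowed."""
    skip = {own + 'team', own + 'score', 'Wloc'}  # Wloc describes the game, not a stat
    box_cols = [c for c in df.columns if c.startswith(own) and c not in skip]
    out = df[['Season'] + box_cols].rename(columns={c: c[1:] for c in box_cols})
    out['team'] = df[own + 'team']
    out['pts'] = df[own + 'score']
    out['opp_pts'] = df[opp + 'score']
    return out

# One row per (team, game), covering both wins and losses.
long = pd.concat([team_view(detail, 'W', 'L'), team_view(detail, 'L', 'W')],
                 ignore_index=True)

# Sum makes/attempts per season first, then take the ratio, so the
# percentage is weighted by attempts rather than averaged game by game.
season = long.groupby(['Season', 'team']).agg(
    ppg=('pts', 'mean'), oppg=('opp_pts', 'mean'),
    fgm=('fgm', 'sum'), fga=('fga', 'sum'),
).reset_index()
season['margin'] = season['ppg'] - season['oppg']
season['fgp'] = season['fgm'] / season['fga']
print(season)
```

The same pattern extends to the other box-score columns (3-pointers, free throws, rebounds) by adding them to the aggregation.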


In [25]:
team_stat = teams.copy() # team ids and names; copy so new columns do not modify `teams`
stats = [
         'fgm', # field goals made
         'fga', # field goals attempted
         'fgm3', # 3-pointers made
         'fga3', # 3-pointers attempted
         'ftm', # free throws made
         'fta', # free throws attempted
         'or', # offensive rebounds
         'dr', # defensive rebounds
         'ast', # assists
         'to', # turnovers
         'stl', # steals
         'blk', # blocks
         'pf', # personal fouls
         # ===== below are new features =====
         'ppg', # points per game
         'oppg', # opponent points per game (new feature)
         'margin', # point margin (new feature)
         'fgp', # field goal percentage (new feature)
         'fgp3', # 3-pointer percentage (new feature)
         'ftp', # free-throw percentage (new feature)
         'odr', # offensive-defensive rebound ratio (new feature)
         'otw', # total overtime wins, not averaged! (new feature)
         'otl', # total overtime losses, not averaged! (new feature)
        ]
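# get the averaged stats per winning team
# note: this covers only the games each team won, pooled over all seasons;
# numeric_only=True skips the non-numeric Wloc column
df = regular_season_detail.groupby('Wteam').mean(numeric_only=True)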

In [28]:
regular_season_detail.head()

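|   | Season | Daynum | Wteam | Wscore | Lteam | Lscore | Wloc | Numot | Wfgm | Wfga | ... | Lfga3 | Lftm | Lfta | Lor | Ldr | Last | Lto | Lstl | Lblk | Lpf |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2003 | 10 | 1104 | 68 | 1328 | 62 | N | 0 | 27 | 58 | ... | 10 | 16 | 22 | 10 | 22 | 8 | 18 | 9 | 2 | 20 |
| 1 | 2003 | 10 | 1272 | 70 | 1393 | 63 | N | 0 | 26 | 62 | ... | 24 | 9 | 20 | 20 | 25 | 7 | 12 | 8 | 6 | 16 |
| 2 | 2003 | 11 | 1266 | 73 | 1437 | 61 | N | 0 | 24 | 58 | ... | 26 | 14 | 23 | 31 | 22 | 9 | 12 | 2 | 5 | 23 |
| 3 | 2003 | 11 | 1296 | 56 | 1457 | 50 | N | 0 | 18 | 38 | ... | 22 | 8 | 15 | 17 | 20 | 9 | 19 | 4 | 3 | 23 |
| 4 | 2003 | 11 | 1400 | 77 | 1208 | 71 | N | 0 | 30 | 61 | ... | 16 | 17 | 27 | 21 | 15 | 12 | 10 | 7 | 1 | 14 |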
