## 1.1. LPM and Logit Model

## Self Test - Exploratory Data Analysis Example
- Instruction: Explore the imported data using Python commands. Specifically, you should be able to:
    - Display the raw data 
    - Examine dimensionality of the raw data
    - Obtain the list of variables
    - Explore the data types in the dataset
    - Obtain various summary statistics (e.g., mean, variance, etc.)
    - Explore the missing values within the data frame

In [None]:
#Display data
display(NHL_game[0:10])
#Examine the size of data
NHL_game.shape
#Obtain the list of variables
NHL_game.columns
#Explore the types of variables (data) and missing values in a data frame
NHL_game.info()
#Obtain basic descriptive statistics (Round off the descriptive statistics properly)
NHL_game.describe().round(decimals=3).round({"comp_id":0,"gid":0,"tid":0,"year":0})
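
The instruction list also asks for the variance and for a look at missing values, which `describe()` and `info()` only cover indirectly. A minimal sketch of both checks follows; it uses a small made-up frame (`toy`, with illustrative column names) rather than `NHL_game`, so the numbers are not from the course data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for NHL_game; values are made up.
toy = pd.DataFrame({
    "tid": [1, 1, 2, 2],
    "goals_for": [3.0, np.nan, 2.0, 5.0],
    "win": [1, 0, np.nan, 1],
})

# Number of missing values in each column.
missing_counts = toy.isna().sum()
print(missing_counts)

# Sample variance of each numeric column (pandas uses ddof=1 by default
# and skips missing values).
variances = toy.var(numeric_only=True)
print(variances.round(3))
```

The same two calls should work unchanged on `NHL_game`, assuming it is a regular pandas DataFrame.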

### Self Test: Multiple Logistic Regression Example
#### Fit a multiple logistic regression using Pythagorean winning % and a home dummy variable as independent variables
Instruction 
1. Use a home dummy variable coded 1 for home games and 0 for away games
2. Interpret the coefficients

In [None]:
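# Self-test
# statsmodels imports under their conventional aliases (harmless if already imported earlier in the notebook)
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Logit model: win regressed on Pythagorean winning % and the home dummy
Win_Pyth_hm = 'win~pyth_wpct + home'
model2 = smf.glm(formula=Win_Pyth_hm, data = NHL_reg_2017_WL, family=sm.families.Binomial())
result2 = model2.fit()
print(result2.summary())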

print("Coefficients")
print(result2.params)
print("p-Values")
print(result2.pvalues)
print("Dependent variable")
print(result2.model.endog_names)
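
For the "interpret the coefficients" step, recall that logit coefficients are on the log-odds scale, so exponentiating a coefficient gives the multiplicative change in the odds of winning. The sketch below illustrates this with hypothetical coefficient values (they are not the estimates in `result2`); `win_probability` is a helper introduced only for this illustration.

```python
import math

# Hypothetical log-odds coefficients, NOT the fitted values from result2.
coef = {"Intercept": -2.0, "pyth_wpct": 4.0, "home": 0.2}

# exp(coefficient) is the factor by which the odds of winning change
# when the regressor rises by one unit (here: away game -> home game).
odds_ratio_home = math.exp(coef["home"])

def win_probability(pyth_wpct, home):
    """Logistic transform of the linear predictor."""
    log_odds = (coef["Intercept"]
                + coef["pyth_wpct"] * pyth_wpct
                + coef["home"] * home)
    return 1.0 / (1.0 + math.exp(-log_odds))

# Same team quality (pyth_wpct = 0.5), home versus away.
p_home = win_probability(0.5, 1)
p_away = win_probability(0.5, 0)
print(round(odds_ratio_home, 3), round(p_home, 3), round(p_away, 3))
```

With the fitted model, the analogous quantity would be `np.exp(result2.params)`, assuming NumPy is imported as `np`.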

# Use the fitted model to calculate the predicted probability of winning for each game
# Print the first 10 predicted win probabilities
fittedProbs2 = result2.predict()
print(fittedProbs2[0:10])

# Based on the fitted probabilities of winning, create a binary predicted-outcome variable: 1 indicates a win, 0 indicates a loss
fittedWin2 = [1 if x > .5 else 0 for x in fittedProbs2]
print(fittedWin2[0:10])

from sklearn.metrics import confusion_matrix, classification_report
confusion_matrix(NHL_reg_2017_WL['win'], fittedWin2)

print(classification_report(NHL_reg_2017_WL['win'], fittedWin2, digits=3))

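# Success rate (accuracy): correctly predicted games divided by total games,
# with the counts read off the confusion matrix above
(751+769)/2454 # Model2 worked slightly better than Model1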
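
The success rate above is typed in by hand from the confusion matrix. As a minimal sketch, the same quantity can be computed from any confusion matrix as the sum of the diagonal divided by the total count. The helper name `accuracy_from_confusion` and the 2x2 matrix below are made up for illustration and are not the NHL results.

```python
def accuracy_from_confusion(cm):
    """Share of correct predictions: diagonal sum divided by total count."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

# Hypothetical confusion matrix (rows: actual outcome, columns: predicted outcome).
toy_cm = [[40, 10],
          [5, 45]]

acc = accuracy_from_confusion(toy_cm)
print(acc)
```

Passing `confusion_matrix(...).tolist()` from scikit-learn into the helper should work the same way, which avoids retyping the counts.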

## 1.2. Ordered Logit Regression

## Self Test: Add Pythagorean winning % to the current dataset (Solution)
### Add Pythagorean Winning percentage
1) Sort the dataframe in game order (i.e., by gid)
    (Note: you will need to reset the index after you sort the NHL_reg_2016 data)

2) Calculate cumulative GF and GA for each team

3) Calculate the Pythagorean winning percentage

In [None]:
## Add Pythagorean Winning percentage
#1) Sort the dataframe in game order (i.e., by gid)
    ##Note) reset the index after sorting; drop=True discards the old index instead of keeping it as a column
NHL_reg_2016 = NHL_reg_2016.sort_values(by ='gid').reset_index(drop=True)

#2) Calculate cumulative GF and GA for each team
# Calling cumsum() directly on the groupby keeps the original row index, so the result aligns on assignment
NHL_reg_2016['cumGF'] = NHL_reg_2016.groupby(['tid'])['goals_for'].cumsum()
NHL_reg_2016['cumGA'] = NHL_reg_2016.groupby(['tid'])['goals_against'].cumsum()

display(NHL_reg_2016)

#3) Calculate the Pythagorean winning percentage: cumGF^2 / (cumGF^2 + cumGA^2)
NHL_reg_2016['pyth_wpct'] = NHL_reg_2016['cumGF']**2/(NHL_reg_2016['cumGF']**2 + NHL_reg_2016['cumGA']**2)

display(NHL_reg_2016[0:10])

NHL_reg_2016.shape
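
To see what the three steps produce, here is a small sketch on a made-up four-game frame (`toy`; the values are illustrative and not taken from `NHL_reg_2016`). It follows the same sort, cumulative-sum, and formula steps, with the column names borrowed from the solution above.

```python
import pandas as pd

# Hypothetical games for two teams, deliberately out of game order.
toy = pd.DataFrame({
    "gid": [3, 1, 2, 4],
    "tid": [10, 10, 20, 20],
    "goals_for": [2, 3, 1, 4],
    "goals_against": [2, 1, 3, 0],
})

# 1) Sort in game order and reset the index.
toy = toy.sort_values(by="gid").reset_index(drop=True)

# 2) Running totals of goals for and against within each team.
toy["cumGF"] = toy.groupby("tid")["goals_for"].cumsum()
toy["cumGA"] = toy.groupby("tid")["goals_against"].cumsum()

# 3) Pythagorean winning percentage from the running totals.
toy["pyth_wpct"] = toy["cumGF"] ** 2 / (toy["cumGF"] ** 2 + toy["cumGA"] ** 2)

print(toy)
```

Note that a row where both running totals are zero would give 0/0 and therefore NaN, so such rows may need separate handling.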

## 1.3. Predictive Modeling-Basics of Forecasting

### Self-Test Data preparation for NHL_2016_2nd Solution
Instruction
: In this exercise, you will manipulate the data for the 2nd half of the 2016 NHL regular season. Specifically, you will need to:
1. Generate team-level data containing the total number of wins
2. Obtain the number of games played by each team in the 2nd half of the 2016 regular season and merge it into the data
3. Create a winning percentage variable in the dataset: winning percentage = wins / total games

In [None]:
# 1. Generate team-level data containing the total number of wins
# Select the column with a list ([['win']]) so the result is a DataFrame indexed by tricode
nhl2016_pos = NHL_2016_2nd.groupby(['tricode'])[['win']].sum()
display(nhl2016_pos[0:9])

# 2. Obtain the number of games played by each team in the 2nd half of the 2016 regular season
NHL_pos_GameNum = NHL_2016_2nd.groupby(['tricode']).size().reset_index(name='game_count')
display(NHL_pos_GameNum[0:3])

# Merge "NHL_pos_GameNum" into the "nhl2016_pos" dataset
nhl2016_pos=pd.merge(nhl2016_pos, NHL_pos_GameNum, on=['tricode'])
nhl2016_pos.head()

# 3. Create a winning percentage in the nhl2016_pos dataset: winning percentage = wins / total games
nhl2016_pos['win_pct_pos']=nhl2016_pos['win']/nhl2016_pos['game_count']
display(nhl2016_pos[0:3])

# 4. Drop the columns that are no longer needed
nhl2016_pos.drop(['game_count', 'win'], axis=1, inplace=True)

nhl2016_pos.head()
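
The group, count, merge, and divide steps above can also be collapsed into a single named aggregation. The following is a sketch of that alternative on made-up data (`toy` and its values are hypothetical); it is not the course's solution, only a compact variant that should give the same winning percentage.

```python
import pandas as pd

# Hypothetical game-level data: one row per team per game.
toy = pd.DataFrame({
    "tricode": ["BOS", "BOS", "BOS", "NYR", "NYR"],
    "win": [1, 0, 1, 0, 0],
})

# Total wins and number of games per team in one pass.
summary = (
    toy.groupby("tricode")
       .agg(win=("win", "sum"), game_count=("win", "size"))
       .reset_index()
)

# Winning percentage = wins / total games.
summary["win_pct_pos"] = summary["win"] / summary["game_count"]

print(summary)
```

Because `win` is coded 0/1, `toy.groupby("tricode")["win"].mean()` yields the same percentages directly.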