# Model Trees: The Hart Memorial Trophy
In our previous examples we've written all of our code in a single notebook, but that's actually a pretty rare case. A more authentic workflow is to put specific libraries or functions in files which are named meaningfully for what they do, and then import them into the notebook we are building. We're going to do that here. Now, I can't show you all the ins and outs of building deployable Python packages here, but we'll do a bit of separation of concerns.

In [1]:
# Let's start by bringing in our data science imports
import pandas as pd
import numpy as np

# And add to that the import_ipynb package, which allows us to import another
# Jupyter notebook into this one
import import_ipynb

Now, we need a bunch of different data for this analysis. I'm going to show you how we got the data we needed, but most of this won't work directly on the Coursera system because it doesn't support web scraping. Regardless, this will be useful if you want to replicate the work on your own computer, and of course I've included the data files I pulled down so you can work on the machine learning parts as part of the course.

Let's start by getting information on players, and that's going to come from the NHL API directly. Let's create a new notebook for that called [nhl_api.ipynb](nhl_api.ipynb).

------

Good, now that we have official player stats ready to go from the NHL API, we have to consider our Hart Memorial Trophy voting data. The Hart votes are cast by reporters who cover the NHL, and the voting procedure stayed relatively constant from 1996 up until the 2020-2021 season. That last season (the playoffs are actually going on right now while I'm filming) modified the number of reporters who could vote because of structural changes to the season due to COVID-19. So, will the data from the 2020-2021 season be useful for next year? I don't know; it might be that the league changes the way the trophy is voted on next year too. What about the analysis we're doing now, would that be useful for the 2020-2021 season? I can't answer that either, since the Hart results haven't been shared yet! This would be a great place for you to extend this analysis, though, and see what impact the COVID year has on building models like this.

Regardless, we need to get that Hart voting data. There are several places to get this, but the website hockey-reference.com is a great resource and makes this easy, so let's create another API called [hockey_reference_api.ipynb](hockey_reference_api.ipynb).

-----

With our data now in hand, let's start our analysis!

In [2]:
# If you want to get the data directly from the API, just run the following:
# import hockey_reference_api
# hockey_reference_api.historical_hart_results()

# This takes a while! On the Coursera system I've cached the data files, so you
# can just load them to start our analyses
df_votes=pd.read_csv("historical_hart_results.csv")
df_players=pd.read_csv("historical_player_data.csv")

In [3]:
# The first challenge we run into is one we saw in the previous module:
# sometimes the data isn't quite as clean as we would like it. It's very
# common in manual data systems to find alignment issues, like we did
# with the spelling of the Montreal Canadiens team name. In this case, it
# turns out that two players have different preferred names in the two
# sources, and we need to align them
df_votes.loc[df_votes["Player"]=='Alex Steen','Player'] = 'Alexander Steen'
df_votes.loc[df_votes["Player"]=='Olaf Kolzig','Player'] = "Olie Kolzig"
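
If more mismatches turn up, the same alignment can be kept in one mapping and applied in a single step. This is only a minimal sketch on a toy frame: `toy_votes` and `name_fixes` are hypothetical names, the vote counts are invented, and only the two name pairs come from the cell above.

```python
import pandas as pd

# Toy stand-in for df_votes; the vote counts are invented for illustration.
toy_votes = pd.DataFrame({
    "Player": ["Alex Steen", "Olaf Kolzig", "Joe Sakic"],
    "Votes": [3, 5, 100],
})

# Hypothetical mapping: name in the voting data -> name used in the player data.
name_fixes = {
    "Alex Steen": "Alexander Steen",
    "Olaf Kolzig": "Olie Kolzig",
}

# Series.replace with a dict swaps exact matches and leaves other values alone.
toy_votes["Player"] = toy_votes["Player"].replace(name_fixes)
print(toy_votes["Player"].tolist())
```

Keeping the fixes in a dictionary makes it easy to add a new pair when another mismatch is discovered.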

In [4]:
# We want to predict the number of votes that a given player will get in a
# year. Actually, we don't care about the number of votes per se, which is
# good because as membership in the PHWA changes there are more or fewer votes
# cast each year. Instead, we want to predict the share of the votes that
# a given player will get in a season. We're going to call this the normalized
# vote percentage. This will be our regression target
df_votes["normalized_vote_pct"]=df_votes["Votes"] / \
    df_votes.groupby("season")["Votes"].transform("sum")
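
To see what the normalization does, here is a small worked example on made-up numbers. The frame `toy` and its values are assumptions for illustration; only the column names mirror the cell above.

```python
import pandas as pd

# Invented vote counts for two seasons.
toy = pd.DataFrame({
    "season": [20012002, 20012002, 20022003, 20022003, 20022003],
    "Player": ["A", "B", "C", "D", "E"],
    "Votes": [30, 10, 50, 25, 25],
})

# transform("sum") returns the season total repeated on every row of that
# season, so the division lines up row by row.
season_totals = toy.groupby("season")["Votes"].transform("sum")
toy["normalized_vote_pct"] = toy["Votes"] / season_totals
print(toy)
```

Within each season the shares add up to 1, which is what makes seasons with different numbers of voters comparable.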

In [5]:
# There are some interesting statistics in the data on the amount of time
# each player spent on the ice, on the ice during power plays, which is when
# one team has a penalty and the other has more people on the ice, and so
# forth. These are all strings, but we want to convert these into a single
# integer value of total seconds so they can be used by our model better

def convert_on_ice_time_to_seconds(x:str) -> int:
    '''Convert the string in the format of mm:ss to an integer of seconds
    :param x: the mm:ss string from the NHL api
    :return: the total number of seconds
    '''
    y = x.split(":")
    return int(y[0])*60 + int(y[1])

# We just want to apply this to all of the columns which have an "OnIce"
# component in their name, and if a person has no entry then we will fill
# it with the dummy value of 0:0 which should convert nicely.
on_ice_columns = [c for c in df_players.columns if "OnIce" in c]
df_players.loc[:, on_ice_columns] = df_players[on_ice_columns].fillna("0:0").applymap(convert_on_ice_time_to_seconds).values
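
As an alternative to calling a Python function on every cell, the same mm:ss conversion can be sketched with pandas string methods on a whole column. The helper name `mmss_to_seconds` and the sample values are hypothetical, and the sketch assumes every non-missing entry contains exactly one colon.

```python
import pandas as pd

def mmss_to_seconds(col: pd.Series) -> pd.Series:
    """Convert a column of 'mm:ss' strings to total seconds (missing -> 0)."""
    # Split on the colon into two columns (0 = minutes, 1 = seconds).
    parts = col.fillna("0:0").str.split(":", expand=True).astype(int)
    return parts[0] * 60 + parts[1]

# Invented sample values; season totals can have many minutes.
times = pd.Series(["20:15", None, "1234:05"])
seconds = mmss_to_seconds(times)
print(seconds.tolist())
```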

In [6]:
# Now, let's merge these two files together and have a look at our
# data!
df_full=pd.merge(df_players,df_votes, how="left", left_on=["fullName","season"],right_on=["Player","season"])
df_full.head()

Unnamed: 0,assists,fullName,gameWinningGoals,games,gamesStarted,goalAgainstAverage,goals,goalsAgainst,id,link,...,W,L,T/O,GAA,SV%,OPS,DPS,GPS,PS,normalized_vote_pct
0,34.0,Dave Andreychuk,2.0,82.0,,,27.0,,8445000.0,/api/v1/people/8445000,...,,,,,,,,,,
1,12.0,Neal Broten,2.0,42.0,,,8.0,,8445724.0,/api/v1/people/8445724,...,,,,,,,,,,
2,15.0,Bobby Carpenter,0.0,62.0,,,4.0,,8445977.0,/api/v1/people/8445977,...,,,,,,,,,,
3,17.0,Shawn Chambers,0.0,73.0,,,4.0,,8446042.0,/api/v1/people/8446042,...,,,,,,,,,,
4,7.0,Ken Daneyko,0.0,77.0,,,2.0,,8446309.0,/api/v1/people/8446309,...,,,,,,,,,,
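
Because this is a left merge on names, a spelling mismatch silently turns into a player with no votes. One way to check for that, sketched here on toy frames with invented numbers, is the `indicator=True` option of `pd.merge`, plus a look at vote rows whose name never appears in the player data.

```python
import pandas as pd

# Toy stand-ins for df_players and df_votes; all numbers are invented.
players = pd.DataFrame({
    "fullName": ["Alexander Steen", "Olie Kolzig", "Ken Daneyko"],
    "season": [20012002, 20012002, 20012002],
    "goals": [10, 0, 2],
})
votes = pd.DataFrame({
    "Player": ["Alexander Steen", "Olaf Kolzig"],  # second name is misaligned
    "season": [20012002, 20012002],
    "Votes": [4, 7],
})

# indicator=True adds a _merge column: "both" means the player row found votes.
merged = pd.merge(players, votes, how="left",
                  left_on=["fullName", "season"],
                  right_on=["Player", "season"],
                  indicator=True)
print(merged["_merge"].value_counts())

# Vote rows whose name never appears in the player data point to spelling issues.
unmatched = votes[~votes["Player"].isin(players["fullName"])]
print(unmatched)
```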


In [7]:
# Ok, two quick bits of data cleaning based on our look. We need to fill in
# missing values - I'll just set them to 0 - and we need to make sure anyone
# who didn't get votes gets a 0 as their vote percentage. A single fillna(0)
# covers both, since players without votes have a missing target after the merge.
df_full=df_full.fillna(0)
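
A blanket `fillna(0)` also writes 0 into any text columns that are missing. If you would rather fill only chosen columns, `fillna` accepts a dictionary. A minimal sketch on an invented frame:

```python
import numpy as np
import pandas as pd

# Invented rows: one player has a missing stat, two have no votes.
toy = pd.DataFrame({
    "fullName": ["A", "B", "C"],
    "goals": [27.0, np.nan, 4.0],
    "normalized_vote_pct": [0.6, np.nan, np.nan],
})

# A dict passed to fillna fills only the named columns.
filled = toy.fillna({"goals": 0, "normalized_vote_pct": 0})
print(filled)
```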

In [8]:
# As we explored the dataset one of the students pointed out that a lot of the
# stats might be only relevant for some positions, or that a player position
# might influence a given statistic. This is very common in machine learning, that
# there is a lack of independence between features. For instance, you would expect
# a forward to be in a better position to score a goal than, say, a player on
# defense. In addition, some stats, like the save percentages, are only calculated
# for goalies, and thus are by definition nonexistent for other positions.

# The position code is available in our data, let's explore it
df_full.groupby("position_code").apply(lambda grp: np.mean(grp["normalized_vote_pct"])/len(grp))

position_code
C    2.825837e-07
D    1.517509e-08
G    1.021953e-06
L    1.767933e-07
R    2.627900e-07
dtype: float64
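
Another way to look at position, sketched here on invented numbers, is to report a few plain summaries side by side: the mean vote share, the number of players, and how many of them received any votes. The output column names (`mean_share`, `players`, `received_votes`) are hypothetical.

```python
import pandas as pd

# Invented vote shares for a handful of players.
toy = pd.DataFrame({
    "position_code": ["C", "C", "D", "G", "G"],
    "normalized_vote_pct": [0.4, 0.0, 0.0, 0.2, 0.1],
})

# Named aggregation: each keyword becomes an output column.
summary = toy.groupby("position_code")["normalized_vote_pct"].agg(
    mean_share="mean",
    players="count",
    received_votes=lambda s: (s > 0).sum(),
)
print(summary)
```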

In [9]:
# The position information is not a numeric value though. One way we can incorporate
# this into our model is to change it into dummy indicators - e.g. five different
# features, one for each position, each of which is 1 if the player plays that
# position and 0 otherwise. Pandas makes this easy!
df_full = pd.get_dummies(df_full, columns=['position_code'])

# And we can create our holdout and training datasets
df_holdout=df_full[df_full["season"]==20182019]
df_full=df_full[(df_full["season"]<20182019) & (df_full["season"]>=20012002)]

# Let's build up a list of features we want to use. Here I'm going to choose a number
# which I think might be interesting, you can feel free to explore!
features=['assists', 'gameWinningGoals', 'games', 'gamesStarted', 'goalAgainstAverage', 
'goals', 'goalsAgainst', 'overTimeGoals', 'plusMinus', 'points', 'powerPlayGoals', 
'powerPlayPoints', 'savePercentage', 'saves', 'shortHandedGoals', 'shortHandedPoints', 
'shots', 'shotsAgainst', 'wins', 'blocked', 'evenSaves', 'evenShots', 'evenStrengthSavePercentage', 
'faceOffPct', 'hits', 'powerPlaySavePercentage', 'powerPlaySaves', 'powerPlayShots', 
'shortHandedSavePercentage', 'shortHandedSaves', 'shortHandedShots', 'position_code_C', 
'position_code_D', 'position_code_G', 'position_code_L', 'position_code_R',
'timeOnIce', 'timeOnIcePerGame', 'evenTimeOnIce', 'evenTimeOnIcePerGame', 'powerPlayTimeOnIce', 
'powerPlayTimeOnIcePerGame', 'shortHandedTimeOnIce', 'shortHandedTimeOnIcePerGame',
]
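
To make the two steps in this cell concrete, here is a small sketch on an invented frame: the position column is expanded into indicator columns, and rows are then split by season the same way as above. The frame `toy` and its values are assumptions for illustration.

```python
import pandas as pd

# Invented rows covering a season before the training window, one inside it,
# and two in the holdout season.
toy = pd.DataFrame({
    "season": [20002001, 20172018, 20182019, 20182019],
    "position_code": ["C", "G", "D", "C"],
    "goals": [10, 0, 3, 41],
})

# One indicator column per position value that appears in the data.
encoded = pd.get_dummies(toy, columns=["position_code"])
print(encoded.columns.tolist())

# Same split rule as the cell above: 2018-2019 is held out, and training
# uses the seasons from 2001-2002 up to (not including) the holdout season.
holdout = encoded[encoded["season"] == 20182019]
train = encoded[(encoded["season"] < 20182019) & (encoded["season"] >= 20012002)]
print(len(train), len(holdout))
```

Note that `get_dummies` only creates columns for values it sees, which is why the encoding is done before the split rather than separately on each piece.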

Ok, we're just about at the exciting part: building a model! Before we do though, take a look at the list of the features which are in our model. Which features do you think are most informative for predicting the Hart trophy winner? Number of goals? Amount of time on ice? Shots blocked?

-----

I know I said we were going to build the model, but I have to put one more shout-out up here! Despite being well known, there is no Python implementation of the M5 algorithm in scikit-learn! However, [Sylvain Marie](https://github.com/smarie) in the Analytics and Cloud Platforms group from Schneider Electric has coded up the algorithm and made it available as open source on GitHub. Even better, they're currently pursuing getting this added to scikit-learn, so maybe in the future this model will be available directly to us in sklearn.

In [10]:
# I've put Sylvain's code in a Python file and put that in the Coursera platform
# for you. We can bring it into our current interpreter scope using the %run
# magic function
%run m5p.py

# Now we just build our model as we have previously! Here I'll set a max depth
# of the tree and the minimum number of samples per leaf, but feel free to play
# with the parameters.
m5p=M5Prime(max_depth=6, min_samples_leaf=3, use_smoothing=False)

# We create our X and y
X_train=df_full[features].reset_index(drop=True)
y_train=df_full['normalized_vote_pct'].reset_index(drop=True)

# I'm also going to store our dataframe for a future lecture
df_full[[*features,'normalized_vote_pct']].to_csv("model_tree_data.csv",index=False)
df_holdout[[*features,'normalized_vote_pct','fullName']].to_csv("model_tree_holdout_data.csv",index=False)

from sklearn.model_selection import cross_validate
results = cross_validate(m5p, X_train, y_train, cv=10, scoring='r2')

print(f"The cv score results are {results['test_score']}")
print(f"The average cv score results are {np.mean(results['test_score'])} with a standard deviation of {np.std(results['test_score'])}")

The cv score results are [nan nan nan nan nan nan nan nan nan nan]
The average cv score results are nan with a standard deviation of nan


Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 531, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\Dennis\OneDrive\Documents\Introduction to Machine Learning in Sports Analytics\m5p.py", line 192, in fit
    super(M5Base, self).fit(
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\tree\_classes.py", line 375, in fit
    builder.build(self.tree_, X, y, sample_weight, X_idx_sorted)
TypeError: Argument 'X_idx_sorted' has incorrect type (expected numpy.ndarray, got str)


Well, that's quite the range of $R^2$ values! This suggests to me that there might be a temporal nature to the accuracy of our models, and that we shouldn't put too much stock in the predictive power we've currently got - the standard deviation is really high.

If we wanted to put this model into practice, we would want to look at particular folds in our validation which are particularly bad and consider which features may have led the model astray. I think it would be useful to think about these hyperparameters I just arbitrarily chose -- should the max depth of the tree be limited to 6?
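
One way to start on that question is to repeat the cross-validation for a few depths and compare the spread of fold scores. The sketch below is an assumption-heavy stand-in: it uses scikit-learn's `DecisionTreeRegressor` instead of `M5Prime` (which is not part of scikit-learn) and synthetic data instead of the hockey features. It also passes `error_score="raise"`, so that a fit that fails raises immediately instead of being recorded as `nan`, which is the default behaviour of `cross_validate`.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, invented purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

depth_scores = {}
for depth in (2, 4, 6, 8):
    # Stand-in model: a plain regression tree, NOT the M5 model tree.
    model = DecisionTreeRegressor(max_depth=depth, min_samples_leaf=3,
                                  random_state=0)
    # error_score="raise" surfaces fit errors instead of returning nan scores.
    results = cross_validate(model, X, y, cv=10, scoring="r2",
                             error_score="raise")
    scores = results["test_score"]
    depth_scores[depth] = (scores.mean(), scores.std())
    print(f"max_depth={depth}: mean R^2 {scores.mean():.3f}, "
          f"std {scores.std():.3f}")
```

Looking at both the mean and the standard deviation for each depth gives a first sense of whether a depth limit of 6 is helping or hurting.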

We'll tackle this a bit more in the next lecture where we tune and inspect the model tree.