ADD PLAYER ID

This notebook adds player IDs to the existing prediction datasets that are used in the Forecast Evaluator notebooks.

The main reason for this is that finding the same player in two different datasets can be exceptionally cumbersome. However, if we use the PlayerId column in the actual data, we can write a script to append that player ID to the prediction set.

I will reuse a large amount of code already written for the Forecast Evaluator; this will in turn make the code for the Forecast Evaluator much more straightforward.

To make use of this we will need to specify the following information:

- position (either 'skater' or 'goalie') since data is currently separated for these two
- season in which we want to be working
- prediction data set that we want to append
- actual data set that we will source the PlayerID from

The final output should be a new CSV file with the appended PlayerID (note: the original prediction set is preserved).

In [1]:
import numpy as np
import csv
import pandas as pd

For simplicity, I am going to start off working only with skaters and the 2011 season.

Let's define those variables now for later use.

In [2]:
position = 'skater'
season = 2011

Let's start off with a function to load the actual data set that we will be retrieving the player IDs from.

Since this data contains the actual results for all seasons from 2011 to 2016, we only need to load it once, and we only need to specify the position.

In [3]:
def loadactual (position):
    
    # input required: position
    # two options, either skater or goalie
    # input files are named using the convention 'NHL_XXXX_stats_2011-2016.csv'
    # where XXXX is one of the two positions listed above
    
    # output will be a numpy array of the actual results for all seasons
    # row 0 is a header row, identifying each column
    # the same player may be found in multiple rows, but it will always be for a different season
    # pertinent columns are:
    # column 0 - season (XXXXYYYY) where XXXX is the first half of the season and YYYY is the second half
    # column 8 - PlayerFirstName
    # column 9 - PlayerId
    # column 10 - PlayerLastName
    # column 11 - PlayerName
    # column 12 - PlayerPositionCode
    # column 14 - Points
    # column 22 - TeamAbbrev
    
    filename = 'NHL_%s_stats_2011-2016.csv' % position
    result = np.array(pd.read_csv(filename,header=None))
    
    return result

# test the function out and print out a small snippet for verification
actual = loadactual(position)
print (actual[0:4,:])

    

[['Season' 'Assists' 'FaceoffWinPctg' 'GameWinningGoals' 'GamesPlayed'
  'Goals' 'OTGoals' 'PenaltyMinutes' 'PlayerFirstName' 'PlayerId'
  'PlayerLastName' 'PlayerName' 'PlayerPositionCode' 'PlusMinus' 'Points'
  'PPGoals' 'PPPoints' 'SHGoals' 'SHPoints' 'ShiftsPerGame' 'ShootingPctg'
  'Shots' 'TeamAbbrev' 'TimeOnIcePerGame']
 ['20112012' '19' '0.00' '0' '80' '5' '0' '66' 'Luca' '8474579' 'Sbisa'
  'Luca Sbisa' 'D' '-5' '24' '0' '3' '0' '0' '21.50' '0.0568' '88' nan
  '1075.7625']
 ['20112012' '0' '1.00' '0' '23' '1' '0' '37' 'Aaron' '8475619' 'Volpatti'
  'Aaron Volpatti' 'L' '-2' '1' '0' '0' '0' '0' '13.6086' '0.0588' '17' nan
  '538.1304']
 ['20112012' '31' '0.4375' '4' '92' '43' '2' '34' 'Alex' '8471214'
  'Ovechkin' 'Alex Ovechkin' 'L' '-10' '74' '15' '27' '0' '0' '21.4782'
  '0.1218' '353' nan '1188.4891']]
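
As an aside, the Season column stores each season as an eight-digit XXXXYYYY code, while we specify the season as a single starting year. Here is a minimal sketch of converting between the two. The helper names `season_code` and `season_start` are made up for illustration and are not used elsewhere in this notebook.

```python
def season_code(start_year):
    # 2011 -> 20112012, the format used in column 0 of the actual data
    return start_year * 10000 + start_year + 1

def season_start(code):
    # '20112012' -> 2011; accepts either a string or an integer code
    return int(code) // 10000

print(season_code(2011))         # 20112012
print(season_start('20112012'))  # 2011
```

Keeping the conversion in one place avoids repeating the arithmetic in every function that filters by season.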


Now, let's define a new function to grab the prediction set that we want to append.

In [6]:
def loadpredictions(season,position):

    # input required:
    # 1) position - either skater or goalie
    # 2) season
    # file naming convention of the source files is 'XXXX_Predictions_YYYY.csv'
    # where XXXX is the season and YYYY is the position (either skater or goalie)
    
    # output: numpy array of the prediction results
    # row 0 is a header row, identifying/naming each column
    # each subsequent row is for each different player
    # column 0 - Last Name
    # column 1 - First Name
    # column 2 - Full Name
    # column 3 - Team
    # column 4 - Position
    # column 5 - Hockey News Prediction
    # column 6 - Poolers Prediction
    # column 7 - Forecaster Prediction
    
    filename = '%s_Predictions_%s.csv' % (season,position)
    result = np.array(pd.read_csv(filename,header=None))
    
    return result

#test out the function and print out a small snippet for verification
prediction = loadpredictions(season,position)
print (prediction[0:4,:])
    

[['Last Name' 'First Name' 'Full Name' 'Team' 'Pos' 'Hockey News' 'Poolers'
  'Forecaster']
 ['Timonen' 'Kimmo' 'Kimmo Timonen' 'PHI' 'D' '36' '36' '26']
 ['Corvo' 'Joe' 'Joe Corvo' 'BOS' 'D' '35' '37' '30']
 ['Gonchar' 'Sergei' 'Sergei Gonchar' 'OTT' 'D' '40' '41' '30']]


Let's define a few other functions for use later

In [29]:
def takeout (name):
    
    # removes all periods, dashes and spaces from a given name
    # also makes all characters lower case
    # input: name - a string
    # output: result - a string
    
    # strings are immutable so let's turn the input string into a list
    result = list(name)
    
    while '.' in result: 
        result.remove('.') # remove all periods
    while '-' in result:
        result.remove('-') # remove all dashes
    while ' ' in result:
        result.remove(' ') # remove all spaces
        
    # turn the list of characters back into a string
    result = ''.join(result)
    
    return result.lower()

# test this bad boy out
test01 = 'Martin St Louis'
test02 = 'Martin St-Louis'
test03 = 'Martin St. Louis'
test04 = 'Martin St.Louis'
print (takeout(test01))
print (takeout(test02))
print (takeout(test03))
print (takeout(test04))

martinstlouis
martinstlouis
martinstlouis
martinstlouis
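
For comparison, the same clean-up can probably be written more compactly with `str.translate`, which deletes every listed character in a single pass. This is only a sketch of an alternative; the name `takeout_compact` is hypothetical, and the function is meant to behave like `takeout` above for periods, dashes and spaces.

```python
# Translation table that deletes periods, dashes and spaces
_STRIP_TABLE = str.maketrans('', '', '.- ')

def takeout_compact(name):
    # remove the unwanted characters, then make everything lower case
    return name.translate(_STRIP_TABLE).lower()

for test in ['Martin St Louis', 'Martin St-Louis', 'Martin St. Louis', 'Martin St.Louis']:
    print(takeout_compact(test))  # martinstlouis each time
```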


In [31]:
def missing (season,predicted, actual):
    
    # input required:
    # season - integer
    # predicted - numpy array with predicted data
    # actual - numpy array with actual data
    
    # output will be a numpy array listing each player in the prediction set with no exact match,
    # along with the most likely corresponding name in the actual data set (or 'Not Found')
    
    # initialize a list for the final output
    result = [['Predicted Name','Actual Name']]
    
    # modify our season input to match the formatting of the season column in the actual array
    year = season*10000+season+1
    
    # now let's start looking through the predicted array one at a time
    
    for i in range(1,len(predicted)):
        
        # load the values from the predicted set into some variables
        FullName = predicted[i,2]
        FirstName = predicted[i,1]
        LastName = predicted[i,0]
        Position = predicted[i,4]
        
        # initialize a variable to store name data for potential missing players
        player = []
        
        # switch to determine if a player is found or not 0 = not found, 1 = found
        # this will reset the switch to not found for each new player from the predicted data set
        Found = 0
        
        for j in range(1,len(actual)):
            
            if int(actual[j,0]) == year and FullName == actual[j,11]: # checks for exact match of Full Name
                Found = 1
                
        if Found == 0:  # no exact match of the FullName is found
            
            # remember the following indexing for the actual dataset
            # column 0 - season (XXXXYYYY) where XXXX is the first half of the season and YYYY is the second half
            # column 8 - PlayerFirstName
            # column 10 - PlayerLastName
            # column 11 - PlayerName
            # column 12 - PlayerPositionCode
            
            Lastfound = 0
            Firstfound = 0
            tempName = 'Not Found'
            
            
            # let's look for a match with season / last name / position
            for j in range(1,len(actual)):
                
                # Prediction data only specifies forward (F) or defense (D)
                # Actual data breaks forwards down into left wing (L), right wing (R), and center (C)
                # need to convert those to a value of 'F' for forward for comparison purposes
                if actual[j,12] == 'L' or actual[j,12] == 'R' or actual[j,12] == 'C':
                    tempPos = 'F'
                else:
                    tempPos = actual[j,12] # for goalies (G) and defense (D)
                
                if LastName == actual[j,10]:
                    Lastfound = 1
                
                
                if int(actual[j,0]) == year and LastName == actual[j,10] and Position == tempPos:

                    # let's load the corresponding actual name to a temp variable for possible manipulation
                    tempActual = actual[j,8]

                    # and initialize a variable to check if we found an appropriate match for the first name too
                    #Firstfound = 0

                    # now let's check some common scenarios

                    # Scenario #1 - Shortened names
                    # Alex versus Alexander, Dan versus Daniel versus Danny
                    # we are only going to check the first three letters
                    # so it doesn't matter which set has the shortened version
                    # but we do have a problem if we have initials like PK in PK Subban, so let's avoid those

                    if len(FirstName) >= 3 and len(tempActual) >= 3:
                        count = 0
                        for k in range(0,3):
                            if FirstName[k] == tempActual[k]:
                                count += 1
                        if count == 3:
                            Firstfound = 1

                    # Scenario #2 - First Name is initials (Predicted set without periods)
                    if len(FirstName) < 3:
                        temp = FirstName[0] + '.' + FirstName[1] + '.'
                        if temp == tempActual:
                            Firstfound=1

                    # Scenario #3 - First Name is initials (Actual set without periods)
                    if len(tempActual) < 3:
                        temp = tempActual[0] + '.' + tempActual[1] + '.'
                        if temp == FirstName:
                            Firstfound = 1


                    # Scenario #4 - one is initials, the other is a full name
                    if len(FirstName) < 3:
                        temp = tempActual
                        next = 0
                        secondinit = ''
                        for letter in temp:
                            if next == 1:
                                secondinit = letter
                                next = 0
                            if letter == '-':
                                next = 1
                        if FirstName[0] == temp[0] and FirstName[1] == secondinit:
                            Firstfound = 1
                            
                    if len(tempActual) < 3:
                        temp = FirstName
                        next = 0
                        secondinit = ''
                        for letter in temp:
                            if next == 1:
                                secondinit = letter
                                next = 0
                            if letter == '-':
                                next = 1
                        if tempActual[0] == temp[0] and tempActual[1] == secondinit:
                            Firstfound = 1
                            
                    # should probably expand scenario #4 to account for initial with periods
                    
                    # now if one of our scenarios found a match then let's load the version from the actual data set

                    if Firstfound == 1:
                        tempName = actual[j,11] # version of the name as listed in the actual data set
                    else:
                        tempName = 'Not Found'

                
                # now just in case the error was with a mismatch in the Last Name
                if Lastfound == 0:

                    # Scenario 1 - periods, dashes, spaces cause the problem
                    if int(actual[j,0]) == year and Position == tempPos:
                        
                        temp = FullName #copy of predicted data full name
                        tempa = actual[j,11] # copy of actual data full name
                        
                        # run takeout function to remove punctuation from each full name
                        temp = takeout(temp) 
                        tempa = takeout(tempa) 
                        
                        # compare results to see if they are the same
                        if temp == tempa:
                            Lastfound = 1
                        else:
                            Lastfound = 0

                    if Lastfound == 1:
                        tempName = actual[j,11]
                    else:
                        tempName = 'Not Found'
        
            
            # record the results: either we found a suitable alternate name or we record 'Not Found'
            player.append(FullName) # Full Name as found in the prediction set
            player.append(tempName) # Full Name as found in the actual set OR 'Not Found'
            result.append(player)
            
            
    return np.asarray(result)

# let's test it out

missingplayers = missing(season,prediction,actual)
print (missingplayers)

[['Predicted Name' 'Actual Name']
 ['Brayden Mcnabb' 'Brayden McNabb']
 ['Danny Dekeyser' 'Danny DeKeyser']
 ['Tobias Enstrom' 'Toby Enstrom']
 ['Matthieu Perreault' 'Mathieu Perreault']
 ['Mike Cammalleri' 'Not Found']
 ['PK Subban' 'P.K. Subban']
 ['TJ Oshie' 'T.J. Oshie']
 ['James Van Riemsdyk' 'James van Riemsdyk']
 ['Alexander Ovechkin' 'Alex Ovechkin']]
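
The first-name scenarios inside `missing` could also be pulled out into a small helper, which makes them easier to test on their own. The sketch below is a simplified take on those scenarios, not the exact logic above: it ignores case and periods (the original comparison is case sensitive), and the name `first_names_match` is hypothetical.

```python
def first_names_match(predicted_first, actual_first):
    # Drop periods and case so that 'PK' and 'P.K.' compare equal
    a = predicted_first.replace('.', '').lower()
    b = actual_first.replace('.', '').lower()
    if a == b:
        return True
    # Shortened names: compare only the first three letters
    if len(a) >= 3 and len(b) >= 3:
        return a[:3] == b[:3]
    # Initials versus a hyphenated first name, e.g. 'JM' vs 'John-Michael'
    short, full = (a, b) if len(a) < len(b) else (b, a)
    initials = ''.join(part[0] for part in full.split('-') if part)
    return len(short) == 2 and short == initials

print(first_names_match('Alexander', 'Alex'))   # True
print(first_names_match('PK', 'P.K.'))          # True
print(first_names_match('JM', 'John-Michael'))  # True
print(first_names_match('Dan', 'Alex'))         # False
```

Note that a three-letter prefix check cannot pair names such as 'Mike' and 'Michael', which may be one reason some players still end up as 'Not Found'.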


Now for the real meat and potatoes!!!!

Let's grab those PlayerIDs and add them to the prediction set.

If no PlayerID can be found for a player, the function writes 'N/A' to the new array and also prints a message saying that no player ID could be found for that player.

In [32]:
def addID (season,predicted,actual):
    
    # inputs:
    # season - integer
    # predicted - numpy array
    # actual - numpy array
    
    # output:
    # a new numpy array, identical to the predicted array but with an added column for Player ID
    
    # initialize a header row for our eventual output array
    result = [['Last Name','First Name','Full Name','Team','Pos','Hockey News','Poolers','Forecaster','PlayerID']]
    
    # modify our season input to suit the formatting of the actual results array
    year = season*10000+season+1
    
    # let's grab a list of all the players without an exact match
    missinglist = missing(season,predicted,actual)
    
    # Reminder of column formatting for input arrays
    
    # Predicted array:
    # column 0 - Last Name
    # column 1 - First Name
    # column 2 - Full Name
    # column 3 - Team
    # column 4 - Position
    # column 5 - Hockey News Prediction
    # column 6 - Poolers Prediction
    # column 7 - Forecaster Prediction
    
    # Actual array:
    # column 0 - season (XXXXYYYY) where XXXX is the first half of the season and YYYY is the second half
    # column 8 - PlayerFirstName
    # column 9 - PlayerId
    # column 10 - PlayerLastName
    # column 11 - PlayerName
    # column 12 - PlayerPositionCode
    # column 14 - Points
    # column 22 - TeamAbbrev
    
    for i in range(1,len(predicted)):
        
        # loop to go through each player in the predicted array one at a time
        
        # let's load the data from the current player for use and eventual output to the result array
        LastName = predicted[i,0]
        FirstName = predicted[i,1]
        FullName = predicted[i,2]
        Team = predicted[i,3]
        Position = predicted[i,4]
        Value1 = predicted[i,5]
        Value2 = predicted[i,6]
        Value3 = predicted[i,7]
        
        counter = 0
        
        # now let's look for the match by searching through the actual array
        for j in range(1,len(actual)):
            
            Found = 0  # switch to determine when a match is found
            
            # look for the simple exact match
            if int(actual[j,0]) == year and FullName == actual[j,11]:
                Found = 1
            
            # check for the player on the missing list
            if Found == 0:
                for k in range(1,len(missinglist)):
                    if missinglist[k,0] == FullName and missinglist[k,1] == actual[j,11]:
                        Found = 1
                        break       # stop looking if we found the player
            
            # now let's get the player ID
            if Found == 1:
                PlayerID = actual[j,9]
                result.append([LastName,FirstName,FullName,Team,Position,Value1,Value2,Value3,PlayerID])
                break    
            else:
                PlayerID = 'N/A'
                counter += 1
            
            if counter == len(actual)-1:
                result.append([LastName,FirstName,FullName,Team,Position,Value1,Value2,Value3,PlayerID])
                print ("No Player ID found for: " + FullName + "(" + str(season) + " Season)")
                
    return np.asarray(result)
            

# let's test it out again
Newpred = addID(season,prediction,actual)
print (Newpred[0:5,:])
    
    

No Player ID found for: Mike Cammalleri(2015 Season)
[['Last Name' 'First Name' 'Full Name' 'Team' 'Pos' 'Hockey News' 'Poolers'
  'Forecaster' 'PlayerID']
 ['Weber' 'Yannick' 'Yannick Weber' 'Van' 'D' '14' '26' 'nan' '8474134']
 ['Liles' 'John-Michael' 'John-Michael Liles' 'Car' 'D' '15' '26' 'nan'
  '8468639']
 ['Wiercioch' 'Patrick' 'Patrick Wiercioch' 'Ott' 'D' '17' '25' 'nan'
  '8474605']
 ['Hanifin' 'Noah' 'Noah Hanifin' 'Car' 'D' '19' '20' 'nan' '8478396']]
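
`addID` scans the whole actual array for every predicted player. If that ever gets slow, one alternative (a different technique from the nested loops above, sketched under assumptions) is to build a dictionary from normalized name to PlayerId once per season and then look each player up directly. The names `normalize` and `build_id_lookup` are hypothetical, and the rows are reduced to (season, name, ID) tuples using the three players printed earlier.

```python
def normalize(name):
    # same idea as takeout: drop periods, dashes and spaces, lower case
    return name.translate(str.maketrans('', '', '.- ')).lower()

def build_id_lookup(actual_rows, year):
    # actual_rows: iterable of (season_code, full_name, player_id)
    lookup = {}
    for season_code, full_name, player_id in actual_rows:
        if int(season_code) == year:
            # keep the first ID seen for a given normalized name
            lookup.setdefault(normalize(full_name), player_id)
    return lookup

actual_rows = [
    ('20112012', 'Luca Sbisa', '8474579'),
    ('20112012', 'Aaron Volpatti', '8475619'),
    ('20112012', 'Alex Ovechkin', '8471214'),
]
lookup = build_id_lookup(actual_rows, 20112012)
for name in ['Alex Ovechkin', 'Mike Cammalleri']:
    # fall back to 'N/A' when no ID is available, as addID does
    print(name, lookup.get(normalize(name), 'N/A'))
```

This sketch ignores position, so two players with the same normalized name in one season would collide; the nested-loop version with the missing list handles more of the awkward cases.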


OK, we have added the player ID to the array; now let's make our new files.

***NOTE: 'N/A' is written for any player whose ID cannot be found. These are relatively few, so they can be edited manually in the file after the fact.***

Remember to move the file somewhere else (or rename) after the manual edits, otherwise they will get overwritten if this code is run again.
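
One possible way to guard against accidental overwrites is to pick a fresh filename whenever the target already exists. This is only a sketch: `safe_filename` is a hypothetical helper, it is not used by the code in this notebook, and the `_1`, `_2` suffix scheme is an assumption.

```python
import os

def safe_filename(filename):
    # return the name unchanged if nothing exists at that path yet
    if not os.path.exists(filename):
        return filename
    # otherwise append _1, _2, ... before the extension until a free name is found
    stem, ext = os.path.splitext(filename)
    n = 1
    while os.path.exists('%s_%d%s' % (stem, n, ext)):
        n += 1
    return '%s_%d%s' % (stem, n, ext)

# the result depends on which files are already present in the working directory
print(safe_filename('2011_Predictions_skater_withPlayerID.csv'))
```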

In [33]:
def newfile(season,position,data):
    
    # inputs:
    # season - integer
    # position - string (options are skater or goalie)
    # data - numpy array of predictions with the PlayerID column added
    # writes a new csv file for the new prediction array
    
    filename = '%s_Predictions_%s_withPlayerID.csv' % (season,position)
    
    df = pd.DataFrame(data)
    df.to_csv(filename, header=None,index=False)
    
    return


Let's put it all together.

In [36]:
position = 'skater'
actual = loadactual(position)

for i in range(2011,2016):
    season = i
    prediction = loadpredictions(season,position)
    newPrediction = addID(season,prediction,actual)
    newfile(season,position,newPrediction)
    print ("New file written for " + position + "s " + str(season) + "season")
    print ()
    

New file written for skaters 2011season

No Player ID found for: Jason Garrisson(2012 Season)
No Player ID found for: Viktor Hedman(2012 Season)
No Player ID found for: Vaclav Prospal(2012 Season)
No Player ID found for: Drerrick Brassard(2012 Season)
No Player ID found for: Mike Cammalleri(2012 Season)
New file written for skaters 2012season

No Player ID found for: Sheldon Souray(2013 Season)
No Player ID found for: Jonathan Drouin(2013 Season)
New file written for skaters 2013season

No Player ID found for: Kris Russel(2014 Season)
No Player ID found for: Olli Maata(2014 Season)
No Player ID found for: Nathan Horton(2014 Season)
No Player ID found for: Mike Cammalleri(2014 Season)
No Player ID found for: Niklas Backstrom(2014 Season)
New file written for skaters 2014season

No Player ID found for: Mike Cammalleri(2015 Season)
New file written for skaters 2015season



There we go!!!!!

Now all that remains is to manually edit those files for the players listed above (and move the files somewhere else).
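
Several of the unmatched names look like simple spelling slips, so the standard library's `difflib.get_close_matches` might help suggest candidates during the manual edit. This is a sketch only: `suggest_names` is a hypothetical helper, and the candidate spellings below are made up for illustration; in practice they would come from the PlayerName column for the relevant season.

```python
import difflib

def suggest_names(name, candidates, n=3, cutoff=0.8):
    # return up to n candidates whose similarity to name is at least cutoff
    return difflib.get_close_matches(name, candidates, n=n, cutoff=cutoff)

# illustrative candidate spellings, NOT taken from the actual data set
candidates = ['Kris Russell', 'Olli Maatta', 'Victor Hedman', 'Alex Ovechkin']

for name in ['Kris Russel', 'Olli Maata', 'Viktor Hedman']:
    print(name, '->', suggest_names(name, candidates))
```

Any suggestion should still be checked by eye before the ID is written into the file, since two different players can have very similar names.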