## 1.0 Problem Formulation and Data Acquisition ##

In this notebook, I'll walk through my goals and cover the foundational work needed to reach them: extracting, cleaning, and manipulating the data until it reaches the desired representation.

### 1.1 Introduction to the Problem/ Fantasy Football Primer
<br>


<font size=4>__Key Takeaways__</font>

 - Fantasy football offers many interesting data science problems to explore. One of them is predicting which players a fantasy owner should choose.
 - Choosing the right players is crucial for being competitive in a league.
 - Aside from choosing a player based on performance (which has a more direct relationship with points), modeling the potential for injury using past injured reserve (IR) data could inform owners about players to avoid.
 - One approach is to represent a player's career as a vector, with each index representing a single season and each value recording either that the player was healthy or the injury that caused the player to be placed on IR.


<font size=4>__Problem Formulation__</font>
<br>
<font size=10>F</font>
<font size=4>or this analysis, I'll be focusing on the "Whom should I choose?" question.  My long-term goal is to build a reasonably accurate predictive model to aid in choosing a player for an upcoming season. That simple question becomes very complex when considering the variables needed to predict a player's future success.  You might think that a prior season's performance is a good predictor of the future, but modeling future performance on prior statistics alone is most likely a simplistic representation of a complex real world. So what other factors influence a player's performance?  Here, I would distinguish between intrinsic (player-dependent) and extrinsic (outside of the player) factors. Beyond past performance, intrinsic factors could include age, experience, or even height and weight.  Extrinsic factors could include a player's coach, the talent surrounding them, even the franchise they belong to.  We can't faithfully model all of the real-world factors, but we can try our best to understand which factors have the greatest influence on a player's performance and model those.  This concept is called large-world uncertainty.

<font size=4>
In asking "Whom should I choose?" and brainstorming the intrinsic/extrinsic factors, I found a bias in my own thinking. I was seeking out the obvious: variables that directly generate points (touchdowns, rushing yards, catches, etc.) and are positively correlated with fantasy points. I became a little more sophisticated and reasoned that age might play a role; a widely observed relationship in sports is that age is negatively correlated with production.  Then the most obvious thing finally dawned on me: what about injuries?  If a player is injured, they can't play.  If they can't play, they can't generate points, and that could have just as much of an impact on a fantasy team as any other variable. So, if I choose a high-performing player (we'll call them "A") and they get injured and miss several games, then their higher performances are diluted over the span of the season, because they aren't generating points during the games they miss (i.e., their mean season performance decreases).  Now, the prospect of grabbing a moderate performer ("B") who stays healthy across the season might not seem like such a bad choice.  I'll point out that this is not a black-and-white scenario. It might be that A won your team games for a few weeks, whereas playing B in those same weeks wouldn't have led to the same result, so the risk of selecting A might still be worth it even with a prior belief that they would get injured. In general, though, and especially during a draft, you don't have any certainty about future performance, and therefore you're limited to predicting the future.


<font size=4><p>  
It's reasonable to assume, however, that an owner would want to avoid prolonged injuries as a risk-mitigation strategy in the early rounds, when the very best players are presumed to still be available. Since other owners will be selecting the highest expected performers early in the draft, an owner who chooses players who are both high performers and less at risk of prolonged injury is more likely to stay competitive. Therefore, in asking the question "Whom should I choose?", it's important to ask an intersecting question: "Who is likely to be injured?"  "Injured", or more generally "health", can be multi-faceted: players could have minor injuries that take them out of a game, long-term injuries that require them to sit out a predefined number of weeks with a designation of "Injured Reserve", or they could catch COVID and sit out games until they recover.  

<font size=4>
In this analysis, I explore this question using the presence or absence of a stint on Injured Reserve as a proxy for health, since long-term inactivity (from injury) will have a significant impact on fantasy performance.</font><p>

    

<font size=4> __Data Representation__ </font><p>
For this first attempt, we're going to build a simple model using vector representations of player careers. The resulting model will be fairly simplistic, as it will only look at the probability of injury given a sequence of prior injuries (using Markov chains).  Real life is obviously much more nuanced, but I love thinking about data in terms of the toolsets I've learned from my courses and experimenting with their application. We'll explore alternative representations and techniques in the future.
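
To make the Markov chain idea concrete, here is a minimal sketch of how first-order transition probabilities could be estimated from career vectors. The function name `transition_probabilities` and the toy careers are hypothetical and are not part of this notebook's data; the model built later may differ.

```python
from collections import Counter, defaultdict


def transition_probabilities(careers):
    """Estimate first-order transition probabilities from state sequences.

    `careers` is a list of sequences such as ["Healthy", "Knee", "Healthy"].
    """
    counts = defaultdict(Counter)
    for seq in careers:
        # count every (current season state -> next season state) pair
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    # normalize the counts for each starting state so they sum to 1
    return {
        state: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for state, followers in counts.items()
    }


# Hypothetical toy careers, purely for illustration
toy_careers = [
    ["Healthy", "Healthy", "Knee", "Healthy"],
    ["Healthy", "Knee", "Knee"],
    ["Healthy", "Healthy", "Healthy"],
]
probs = transition_probabilities(toy_careers)
print(probs["Healthy"])
```

Each row of the result is the estimated distribution over next-season states given the current one, which is all a first-order chain needs.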
    
 
With that, let's get our data for our analysis...

In [1]:
# Let's import the important libraries for our work.

import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import time
from tqdm import tqdm

### 1.2 Data Scraping 

Our first step is to get a list of players that we can perform our analysis on.  The best way to do this is to cycle through one or more years of NFL Drafts and retrieve each drafted player and their player page.  This won't give us a complete list of players in the NFL but it should give us a good amount of data to work with. Here's an example of a page we'll be scraping: https://www.footballdb.com/draft/draft.html?yr=2022

In [19]:
def GetDraftees(draft_years):
    
    """
    Yields one dictionary per player selected in an NFL Draft, for each season in the provided list
    
    Parameters:
        draft_years (list): a list of 4-digit years corresponding to NFL Drafts of interest
    Yields:
        dict with the year, round, pick, team, player, page and position of a single draft pick
    """
    
    
    head = {"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.2 Safari/605.1.15"}

   
    for year in tqdm(draft_years):
        for rnd in range(1,8):
            time.sleep(2)
            url = f"https://www.footballdb.com/draft/draft.html?lg=NFL&yr={year}&rnd={rnd}"
            resp = requests.get(url,headers=head)
            soup = BeautifulSoup(resp.content, 'html.parser')
            table = soup.find_all('div', class_='tr')
            for row in tqdm(table[1:]):        
                cell_lst = row.findChildren('div')
                # try to retrieve the player's page so we can scrape transactions
                try:
                    page = cell_lst[2].findChildren('a',href=True)[1]['href']
                # account for instances in which we can't get a page
                except (IndexError, KeyError):
                    page = "unavailable"
                # keep a small memory footprint by yielding out the results as a generator
                yield {'year': year,
                       'round': cell_lst[0].get_text(),
                       'pick' : cell_lst[1].get_text(),
                       'team' : cell_lst[2].findChildren('b')[0].get_text(),
                       'player': cell_lst[3].get_text(),
                       'page' : page,
                       'position': cell_lst[4].get_text()
                      }



In [20]:
# 2013 was the first year that IR data appears to be collected for this website
lst = list(range(2013, 2023))
df = pd.DataFrame(data=GetDraftees(lst))
df.head()




Unnamed: 0,year,round,pick,team,player,page,position
0,2013,1,1,Kansas City Chiefs,Eric Fisher,/players/eric-fisher-fisheer01,OT
1,2013,1,2,Jacksonville Jaguars,Luke Joeckel,/players/luke-joeckel-joecklu01,OT
2,2013,1,3,Miami Dolphins,Dion Jordan,/players/dion-jordan-jordadi01,DE
3,2013,1,4,Philadelphia Eagles,Lane Johnson,/players/lane-johnson-johnsla06,OT
4,2013,1,5,Detroit Lions,Ezekiel Ansah,/players/ezekiel-ansah-ansahez01,DE


In [13]:
df = pd.read_csv("2013_2022_Draft.csv")
df = df[df['year'] < 2022]

Great! We have all of the NFL draftees from 2013 to 2021 (2022 draftees haven't provided us much info at the time of this writing).  To prepare for joining the transaction data, I need to create a row for every season between the time each player was drafted and now.  The transaction data alone only tells us about the seasons in which an IR transaction occurred, so we can't rely on it to account for every season. This is crucial, as we want to represent every season of a player's career (whether there was an injury or not). 

Let me illustrate this point with Derrick Henry's career, represented as a sequence of seasons with or without an IR stint. With only transaction data, we get the following sequence: 

Derrick Henry:  [Foot] 

where <code>Foot</code> was an injury that occurred in 2021.  But Henry didn't play just one season.  So we need to represent his career as several <code>Healthy</code> seasons followed by an IR Injury.  Since he started his career in 2016, we'd want his career vector to look like this:

<code>['Healthy','Healthy','Healthy','Healthy','Healthy','Foot']</code>
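
The career vector described above can be sketched in a few lines of plain Python. The helper `career_vector` and its arguments are hypothetical names used only for illustration; the notebook itself builds this representation with pandas.

```python
def career_vector(draft_year, last_season, ir_by_season):
    """Return one entry per season: the IR injury if there was one, else 'Healthy'.

    `ir_by_season` maps a season year to the injury that led to an IR stint.
    """
    return [
        ir_by_season.get(season, "Healthy")
        for season in range(draft_year, last_season + 1)
    ]


# Derrick Henry: drafted in 2016, placed on IR (Foot) in 2021
henry = career_vector(2016, 2021, {2021: "Foot"})
print(henry)
```

The point of the dictionary lookup with a default is that seasons missing from the IR data are filled in explicitly rather than silently dropped.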


The following code prepares each player to be represented this way (prior to joining the IR data).  Note: run this cell only once, or re-run the prior cell first, to prevent unnecessary duplication of rows.
    
   

In [15]:
# Create a list of seasons
lst_seasons = [2014,2015,2016,2017,2018,2019,2020,2021]


# Create a pandas Series that repeats the list of seasons once per player (the number of rows in df * the list of seasons we care about)
seasons = pd.Series(df.shape[0] *  lst_seasons)


# Duplicate each player row by the length of the number of seasons (length of the list of seasons above).  
# Reset the index so that the series can match up to the df's index
df = df.iloc[df.index.repeat(len(lst_seasons)),:].reset_index()


# Create a new column from the pandas series
df['season'] = seasons




# Let's remove any seasons that occurred prior to a player's draft year, because players don't play prior to being drafted.
# We could also filter out positions we don't care about (namely defensive positions), since my FFB league accrues points
# for QB, RB, WR and TE positions. That filter is left commented out here, so all positions are kept.

# df = df[(df['position'].isin(['QB','RB','WR','TE'])) &
#        (df['year'] <= df['season'])]

df = df[df['year'] <= df['season']]



df.head(15)

Unnamed: 0,index,year,round,pick,team,player,page,position,season
0,0,2013,1,1,Kansas City Chiefs,Eric Fisher,/players/eric-fisher-fisheer01,OT,2014
1,0,2013,1,1,Kansas City Chiefs,Eric Fisher,/players/eric-fisher-fisheer01,OT,2015
2,0,2013,1,1,Kansas City Chiefs,Eric Fisher,/players/eric-fisher-fisheer01,OT,2016
3,0,2013,1,1,Kansas City Chiefs,Eric Fisher,/players/eric-fisher-fisheer01,OT,2017
4,0,2013,1,1,Kansas City Chiefs,Eric Fisher,/players/eric-fisher-fisheer01,OT,2018
5,0,2013,1,1,Kansas City Chiefs,Eric Fisher,/players/eric-fisher-fisheer01,OT,2019
6,0,2013,1,1,Kansas City Chiefs,Eric Fisher,/players/eric-fisher-fisheer01,OT,2020
7,0,2013,1,1,Kansas City Chiefs,Eric Fisher,/players/eric-fisher-fisheer01,OT,2021
8,1,2013,1,2,Jacksonville Jaguars,Luke Joeckel,/players/luke-joeckel-joecklu01,OT,2014
9,1,2013,1,2,Jacksonville Jaguars,Luke Joeckel,/players/luke-joeckel-joecklu01,OT,2015


OK, we have an accounting of the seasons and we've eliminated the seasons prior to each player's draft year. Let's focus on the transaction data now...

## 2.0 Generate Transaction Data ##

Let's write a generator function that scrapes the pages of the players we obtained from each year's draft (all positions, in this version).  It yields one record per transaction, which together make up each player's transaction history over their career.

In [7]:
head = {"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.2 Safari/605.1.15"}

def GetTransactions(pages):
    """Yields one dictionary per transaction found on each player's transactions page."""
    for page in tqdm(pages):
        # pause before every request (not just once) so we don't hammer the site
        time.sleep(2)
        url = f'https://www.footballdb.com/{page}/transactions'
        resp = requests.get(url,headers=head)
        soup = BeautifulSoup(resp.content, 'html.parser')
        bodies = soup.find_all('tbody')
        # skip players without a transactions table (e.g. page == "unavailable")
        if not bodies:
            continue
        for row in bodies[0].find_all('tr'):
            cells = row.find_all('td')
            yield {'page': page,
                   'transdate': cells[0].find_all('span')[0].get_text(),
                   'team': cells[1].find_all('span')[0].get_text(),
                   'transaction': cells[2].get_text()}


pages = df.page.unique()      
trans = pd.DataFrame(GetTransactions(pages))      

In [8]:
# Save the data so I don't have to repeat the scraping all over again
# trans.to_csv('full_transactions.csv',index=False)
trans.to_csv('data/all_pos_full_transactions.csv', index=False)


In [7]:
# Let's read the transaction data back in from the CSV,
# then take Derrick Henry and see what transactions he's generated (and whether we've done a good job collecting them)

trans = pd.read_csv('data/all_pos_full_transactions.csv')
trans[trans['page'].str.contains('derrick-henry', na=False)]


Unnamed: 0,page,transdate,team,transaction
9056,/players/derrick-henry-henryde01,01/21/2022,Tennessee (NFL),Activated from Injured Reserve
9057,/players/derrick-henry-henryde01,01/05/2022,Tennessee (NFL),Designated for return from Injured Reserve
9058,/players/derrick-henry-henryde01,11/01/2021,Tennessee (NFL),Placed on Injured Reserve (Foot)
9059,/players/derrick-henry-henryde01,04/02/2020,Tennessee (NFL),Signed
9060,/players/derrick-henry-henryde01,03/16/2020,Tennessee (NFL),Designated as franchise player
9061,/players/derrick-henry-henryde01,05/09/2016,Tennessee (NFL),Signed


Perfect!  It looks like we've captured all of the transactions when compared to his actual history on footballdb.com (as of this writing). We'll need to do some housecleaning, including extracting the year and month from each transaction date and determining which season number the player was in for a particular IR transaction.  Using season numbers rather than season years makes sense, since players start and end their careers in different years.

In [8]:
trans['transdate'] = pd.to_datetime(trans['transdate'])
trans['trans_year'] = trans['transdate'].dt.year
trans['trans_month'] = trans['transdate'].dt.month

In [9]:
# Let's take a look to see what types of transactions include some mention of retire

print(trans.transaction.unique().reshape(-1,1))
print('Retirement related transactions: ', [x for x in list(trans.transaction.unique()) if 'retire' in x.lower()] )

[['Activated from the Reserve/COVID-19 List']
 ['Placed on COVID-19 List']
 ['Signed']
 ['Released']
 ['Placed on Injured Reserve (Achilles)']
 ['Placed on Injured Reserve (Knee)']
 ['Placed on Injured Reserve (Ankle)']
 ['Signed to the Active Roster from the Practice Squad']
 ['Signed to the Practice Squad']
 ['Re-signed']
 ['Activated from the Reserve/Non-Football Injury List']
 ['Placed on the Non-Football Injury List (Knee)']
 ['Waived']
 ['Reinstated']
 ['Placed on the Reserve/Suspended List']
 ['Activated from the Reserve/Suspended List']
 ['Placed on Injured Reserve (Biceps)']
 ['Placed on Injured Reserve (Shoulder)']
 ['Designated as franchise player']
 ['Traded to the Houston Texans']
 ['Traded to the New England Patriots']
 ['Acquired via waivers (from the New England Patriots)']
 ['Placed on Injured Reserve (Leg)']
 ['Activated from Injured Reserve']
 ['Designated for return from Injured Reserve']
 ['Placed on Injured Reserve (Quadriceps)']
 ['Released from Injured Reserve']

In [10]:
len(trans[trans['transaction']== 'Placed on the Reserve/Retired List'])

69

Hmm...only 69 cases where a player was actually placed on the retired list. Looking through the list and comparing some players, it seems we'd also want to check for cases where a player was released and not picked up by another team.
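
One way to sketch that check is to look at each player's most recent transaction and flag it if it is an exit-type transaction. The toy rows and the `exit_types` set are assumptions for illustration; which labels should count as a career ending is a judgment call.

```python
import pandas as pd

# Hypothetical toy transactions, for illustration only
toy = pd.DataFrame({
    "page": ["/players/a", "/players/a", "/players/b", "/players/b"],
    "transdate": pd.to_datetime(["2019-05-01", "2020-09-01",
                                 "2018-05-01", "2019-07-17"]),
    "transaction": ["Signed", "Released", "Signed", "Signed"],
})

# Assumed set of labels that suggest a player left the league
exit_types = {"Released", "Waived", "Placed on the Reserve/Retired List"}

# Sort chronologically so that .last() returns the most recent transaction per player
last_toy = toy.sort_values("transdate").groupby("page", as_index=False).last()
last_toy["career_ended"] = last_toy["transaction"].isin(exit_types)
print(last_toy)
```

A player who was released and then signed again has "Signed" as the latest row, so only players whose history ends on an exit-type transaction are flagged.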

In [11]:
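trans = trans.sort_values('transdate',ascending=True)
# select the columns with a list (double brackets); tuple-style selection is deprecated
last = trans.groupby('page')[['transdate','transaction']].agg({'transdate':'max','transaction':'last'}).reset_index()
# flag players whose most recent transaction was being waived
last['end'] = last['transaction'].str.contains('Waived', na=False).astype(int)
last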



Unnamed: 0,page,transdate,transaction,end
0,/players/aaron-banks-banksaa01,2021-05-13,Signed,0
1,/players/aaron-burbridge-burbraa01,2019-07-17,Waived,1
2,/players/aaron-colvin-colviaa01,2020-09-06,Signed to the Practice Squad,0
3,/players/aaron-dobson-dobsoaa01,2017-09-06,Released from Injured Reserve,0
4,/players/aaron-donald-donalaa01,2018-09-08,Reinstated,0
...,...,...,...,...
2291,/players/zaven-collins-colliza02,2021-06-09,Signed,0
2292,/players/zaviar-gooden-goodeza01,2017-09-02,Waived,1
2293,/players/zay-jones-jonesis02,2022-03-17,Signed,0
2294,/players/zech-mcphearson-mcpheze01,2021-06-04,Signed,0


In [16]:
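# keep only the transactions that place a player on IR or on the Physically Unable to Perform list
ir = trans[trans['transaction'].str.contains('Placed on Injured Reserve', na=False) |
           trans['transaction'].str.contains('Placed on the Physically Unable to Perform List', na=False)]
# left join so that seasons without an IR transaction are kept (with NaN in the transaction columns)
p_ir = df.merge(ir, how='left',
                left_on=['page','season'],
                right_on=['page','trans_year'])

print(len(p_ir))
# Once again, let's use Derrick Henry and see if everything lines up
p_ir[p_ir['player']=='Derrick Henry']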

11284


Unnamed: 0,index,year,round,pick,team_x,player,page,position,season,transdate,team_y,transaction,trans_year,trans_month
6176,810,2016,2,45,Tennessee Titans,Derrick Henry,/players/derrick-henry-henryde01,RB,2016,NaT,,,,
6177,810,2016,2,45,Tennessee Titans,Derrick Henry,/players/derrick-henry-henryde01,RB,2017,NaT,,,,
6178,810,2016,2,45,Tennessee Titans,Derrick Henry,/players/derrick-henry-henryde01,RB,2018,NaT,,,,
6179,810,2016,2,45,Tennessee Titans,Derrick Henry,/players/derrick-henry-henryde01,RB,2019,NaT,,,,
6180,810,2016,2,45,Tennessee Titans,Derrick Henry,/players/derrick-henry-henryde01,RB,2020,NaT,,,,
6181,810,2016,2,45,Tennessee Titans,Derrick Henry,/players/derrick-henry-henryde01,RB,2021,2021-11-01,Tennessee (NFL),Placed on Injured Reserve (Foot),2021.0,11.0


Merged successfully! 

## 3.0 Season Numbers and Sequencing IR Data ##

The NaN values in the prior output just mean that there was no IR transaction for that particular season. The transaction month is especially important, as it acts as a dividing line between a player's season numbers.  Let's imagine Derrick Henry got injured in January '22 as opposed to November '21.  In reality, this wouldn't have an impact on fantasy, but we want to represent that he was on IR and attribute it to the correct season number of his career. Therefore, we need to make a choice about season attribution: should we associate the IR stint with his 6th season as a player or his 7th?  In this (imaginary) case, since January is still part of the '21 season, we should attribute it to season 6 of his career.  It makes sense, then, to associate IR stints in months 9-12 and 1-2 with the same season, and to treat any later month as part of the next season.  

To make things a bit easier, for the many rows (player seasons) with no transaction data, I'm going to set the default transaction month to 9 (September) to ensure we get the correct season number for the player.
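
The attribution rule described above can be sketched as a small standalone function. The name `season_number` and its signature are hypothetical; the notebook applies the same rule row by row with pandas.

```python
def season_number(draft_year, trans_year, trans_month=9):
    """Map a transaction date to the season number of a player's career.

    January/February transactions belong to the season that started the
    previous calendar year; every other month counts toward the season of
    its own calendar year. The month defaults to September for seasons
    that have no IR transaction.
    """
    seasons_since_draft = trans_year - draft_year
    if trans_month <= 2:
        return seasons_since_draft
    return seasons_since_draft + 1


# Derrick Henry (drafted 2016), placed on IR in November 2021
print(season_number(2016, 2021, 11))
# Zack Moss (drafted 2020), placed on IR in January 2021
print(season_number(2020, 2021, 1))
```

Under this rule Henry's November 2021 stint lands in season 6, and Moss's January 2021 stint lands in season 1, his rookie season.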



In [17]:
# Let's obtain the season number for the player. The transaction month was already broken out into its own column above.
# I'll assume that the NFL season ends in February, and thus will attribute transactions that happen in Jan/Feb to the
# prior year's season (hence no +1 in that branch)

# default the month to September for seasons without an IR transaction
p_ir['tran_month'] = p_ir['trans_month'].fillna(9)

p_ir['season_num'] = p_ir.apply(lambda x: x['season'] - x['year']
                                if x['tran_month'] <= 2
                                else x['season'] - x['year'] + 1,
                                axis=1)

p_ir[p_ir['player']=='Derrick Henry']

Unnamed: 0,index,year,round,pick,team_x,player,page,position,season,transdate,team_y,transaction,trans_year,trans_month,tran_month,season_num
6176,810,2016,2,45,Tennessee Titans,Derrick Henry,/players/derrick-henry-henryde01,RB,2016,NaT,,,,,9.0,1
6177,810,2016,2,45,Tennessee Titans,Derrick Henry,/players/derrick-henry-henryde01,RB,2017,NaT,,,,,9.0,2
6178,810,2016,2,45,Tennessee Titans,Derrick Henry,/players/derrick-henry-henryde01,RB,2018,NaT,,,,,9.0,3
6179,810,2016,2,45,Tennessee Titans,Derrick Henry,/players/derrick-henry-henryde01,RB,2019,NaT,,,,,9.0,4
6180,810,2016,2,45,Tennessee Titans,Derrick Henry,/players/derrick-henry-henryde01,RB,2020,NaT,,,,,9.0,5
6181,810,2016,2,45,Tennessee Titans,Derrick Henry,/players/derrick-henry-henryde01,RB,2021,2021-11-01,Tennessee (NFL),Placed on Injured Reserve (Foot),2021.0,11.0,11.0,6


In [18]:
# Let's check on Zack Moss (who was placed on IR in January of '21) and verify whether the season attribution logic worked
p_ir[p_ir['player']=='Zack Moss']

Unnamed: 0,index,year,round,pick,team_x,player,page,position,season,transdate,team_y,transaction,trans_year,trans_month,tran_month,season_num
10678,1867,2020,3,86,Buffalo Bills,Zack Moss,/players/zack-moss-mossza01,RB,2020,NaT,,,,,9.0,1
10679,1867,2020,3,86,Buffalo Bills,Zack Moss,/players/zack-moss-mossza01,RB,2021,2021-01-12,Buffalo (NFL),Placed on Injured Reserve (Ankle),2021.0,1.0,1.0,1


In [19]:
p_ir.shape

(11284, 16)

In [20]:
# p_ir.to_csv('data/all_ir_transactions.csv', index=False)
p_ir.to_csv('data/all_pos_ir_transactions.csv', index=False)

In [21]:
# Review the unique list of IR-related transactions
p_ir = pd.read_csv('data/all_pos_ir_transactions.csv')
p_ir.transaction.unique()

array([nan, 'Placed on Injured Reserve (Achilles)',
       'Placed on Injured Reserve (Knee)',
       'Placed on Injured Reserve (Ankle)',
       'Placed on Injured Reserve (Shoulder)',
       'Placed on Injured Reserve (Biceps)',
       'Placed on Injured Reserve (Quadriceps)',
       'Placed on Injured Reserve (Designated for Return) (Wrist)',
       'Placed on Injured Reserve (Hamstring)',
       'Placed on Injured Reserve (Finger)',
       'Placed on Injured Reserve (Toe)',
       'Placed on the Physically Unable to Perform List (Foot)',
       'Placed on Injured Reserve (Groin)',
       'Placed on Injured Reserve (Back)',
       'Placed on Injured Reserve (Foot)',
       'Placed on Injured Reserve (Hip)',
       'Placed on the Physically Unable to Perform List (Knee)',
       'Placed on Injured Reserve (Designated for Return) (Elbow)',
       'Placed on Injured Reserve (Forearm)',
       'Placed on Injured Reserve (Designated for Return) (Leg)',
       'Placed on Injured Reserve (

In [22]:
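# Let's extract the injury - we can use regex to extract the word(s) between the parentheses
# Note: str.extract keeps the first match, so '(Designated for Return) (Wrist)' yields 'Designated for Return'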
p_ir['injury'] = p_ir['transaction'].str.extract(r'\(([A-Za-z\s]+)\)')
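
Because the pattern above captures the first parenthetical, transactions such as `Placed on Injured Reserve (Designated for Return) (Wrist)` end up labeled `Designated for Return`. A possible alternative, sketched here with the standard `re` module, is to anchor the pattern to the end of the string so the last parenthetical wins. The helper `extract_injury` is a hypothetical name, and it is only meant for IR-type transactions.

```python
import re

# Match the final "(...)" group at the end of the string
LAST_PAREN = re.compile(r"\(([^()]+)\)\s*$")


def extract_injury(transaction):
    """Return the text of the last parenthetical, or None if there is none."""
    match = LAST_PAREN.search(transaction)
    return match.group(1) if match else None


print(extract_injury("Placed on Injured Reserve (Designated for Return) (Wrist)"))
print(extract_injury("Placed on Injured Reserve (Foot)"))
```

The same pattern could be passed to `str.extract` in pandas, though doing so would change the injury counts shown in the outputs that follow.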

In [23]:
p_ir.injury.value_counts()

Knee                     522
Ankle                    210
Undisclosed              175
Hamstring                155
Shoulder                 140
Foot                     130
Achilles                  59
Concussion                54
Back                      51
Groin                     50
Leg                       45
Pectoral                  42
Neck                      33
Calf                      31
Quadriceps                26
Toe                       26
Hip                       25
Elbow                     24
Designated for Return     22
Wrist                     19
Hand                      17
Biceps                    17
Abdomen                   16
Thumb                     13
Ribs                      11
Chest                      9
Thigh                      8
Forearm                    8
Clavicle                   6
Arm                        6
Finger                     6
Core Muscle                6
Triceps                    5
Hernia                     4
Spine         

In [24]:
test = p_ir.set_index('player')
test.loc['Darrell Henderson']

player,index,year,round,pick,team_x,page,position,season,transdate,team_y,transaction,trans_year,trans_month,tran_month,season_num,injury
Darrell Henderson,1597,2019,3,70,Los Angeles Rams,/players/darrell-henderson-hendeda01,RB,2019,2019-12-28,LA Rams (NFL),Placed on Injured Reserve (Ankle),2019.0,12.0,12.0,1,Ankle
Darrell Henderson,1597,2019,3,70,Los Angeles Rams,/players/darrell-henderson-hendeda01,RB,2020,2020-12-29,LA Rams (NFL),Placed on Injured Reserve (Ankle),2020.0,12.0,12.0,2,Ankle
Darrell Henderson,1597,2019,3,70,Los Angeles Rams,/players/darrell-henderson-hendeda01,RB,2021,2021-12-28,LA Rams (NFL),Placed on Injured Reserve (Knee),2021.0,12.0,12.0,3,Knee


In [25]:
p_ir



Unnamed: 0,index,year,round,pick,team_x,player,page,position,season,transdate,team_y,transaction,trans_year,trans_month,tran_month,season_num,injury
0,0,2013,1,1,Kansas City Chiefs,Eric Fisher,/players/eric-fisher-fisheer01,OT,2014,,,,,,9.0,2,
1,0,2013,1,1,Kansas City Chiefs,Eric Fisher,/players/eric-fisher-fisheer01,OT,2015,,,,,,9.0,3,
2,0,2013,1,1,Kansas City Chiefs,Eric Fisher,/players/eric-fisher-fisheer01,OT,2016,,,,,,9.0,4,
3,0,2013,1,1,Kansas City Chiefs,Eric Fisher,/players/eric-fisher-fisheer01,OT,2017,,,,,,9.0,5,
4,0,2013,1,1,Kansas City Chiefs,Eric Fisher,/players/eric-fisher-fisheer01,OT,2018,,,,,,9.0,6,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11279,2291,2021,7,255,New Orleans Saints,Kawaan Baker,/players/kawaan-baker-bakerka01,WR,2021,,,,,,9.0,1,
11280,2292,2021,7,256,Green Bay Packers,Kylin Hill,/players/kylin-hill-hillky02,RB,2021,2021-11-01,Green Bay (NFL),Placed on Injured Reserve (Knee),2021.0,11.0,11.0,1,Knee
11281,2293,2021,7,257,Detroit Lions,Jermar Jefferson,/players/jermar-jefferson-jeffeje01,RB,2021,,,,,,9.0,1,
11282,2294,2021,7,258,Washington Football Team,Dax Milne,/players/dax-milne-milneda01,WR,2021,,,,,,9.0,1,


In [26]:
career = p_ir.groupby(['player','position','season_num'])['injury'].first().unstack()

An interesting thing happened: for every season where no injury transaction was captured, we see None, and for any season the player didn't play, we see NaN.  This makes a nice division between seasons in which the player was healthy (None) and seasons that fall outside the player's career sequence (NaN).  Though this wasn't intentional, it's certainly appreciated.

I'm going to update each series (column), replacing every None value with Healthy.

In [28]:
for col in career.columns:
    career[col] = career[col].apply(lambda x: "Healthy" if x is None else x)
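
The loop above relies on pandas returning None for healthy seasons and NaN for unplayed ones, which may not hold across pandas versions. A sketch that does not depend on that distinction is to label healthy seasons before pivoting, so that NaN can only come from seasons missing in the long table. The toy frame `long_toy` is hypothetical.

```python
import pandas as pd

# Hypothetical long-format data: one row per player season, injury missing when healthy
long_toy = pd.DataFrame({
    "player": ["A", "A", "A", "B", "B"],
    "season_num": [1, 2, 3, 1, 2],
    "injury": [None, "Knee", None, "Ankle", None],
})

# first() keeps the first non-null injury per player season; seasons that
# exist but have no injury are then labeled explicitly
per_season = (long_toy.groupby(["player", "season_num"])["injury"]
              .first()
              .fillna("Healthy"))

# After unstacking, NaN only marks seasons that were absent from the long table
career_toy = per_season.unstack()
print(career_toy)
```

Here player B has no third season in the input, so that cell stays NaN while every season that was actually played carries either an injury or Healthy.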

In [29]:
# Let's save off our work so we don't go into a tirade when we lose our work/forgot what we did.
#career.to_csv('data/ir_career.csv') # for just the skill positions drafted
career.to_csv('data/all_pos_ir_career.csv') # for all positions

In [30]:
career[1].value_counts()

Healthy                  1638
Knee                      115
Ankle                      48
Shoulder                   35
Foot                       29
Hamstring                  28
Undisclosed                22
Designated for Return      13
Concussion                 12
Back                       12
Groin                      10
Achilles                   10
Leg                         9
Thumb                       7
Neck                        6
Toe                         5
Hip                         5
Quadriceps                  5
Wrist                       4
Hand                        4
Calf                        3
Pectoral                    3
Upper Body                  2
Hernia                      2
Lisfranc                    2
Shin                        2
Elbow                       2
Abdomen                     2
Arm                         2
Core Muscle                 1
Thigh                       1
Forearm                     1
Ribs                        1
Clavicle  

In [29]:
career[(career[1]=='Healthy') &
       (career[2]=='Knee') &
       (career[3]=='Healthy') &
       (career[4]=='Healthy')].shape

(39, 9)

In [31]:
# count the non-null entries in each season column (groupby needs at least one key, so aggregate the frame directly)
career[[1,2,3,4,5,6,7,8]].agg(['count'])

In [32]:
# I see there are cases where "Placed on Injured Reserve" has no injury association. However, this points out another concern: most of these are not NFL teams (see team).
# We could do an analysis on these, but I want to keep it specific to the NFL, so I'll filter out the other leagues.

ir = ir[ir['team'].str.contains('NFL', na=False)]
ir.shape

(2102, 6)

In [33]:
ir
# ir['injury'] = ir['injury'].fillna('Undisclosed')

Unnamed: 0,page,transdate,team,transaction,trans_year,trans_month
2344,/players/nicholas-williams-willini03,2013-08-25,Pittsburgh (NFL),Placed on Injured Reserve (Knee),2013,8
2158,/players/jeremy-harris-harrije04,2013-08-25,Jacksonville (NFL),Placed on Injured Reserve (Back),2013,8
1454,/players/steve-williams-willist08,2013-08-26,San Diego (NFL),Placed on Injured Reserve (Pectoral),2013,8
1352,/players/jesse-williams-willije07,2013-08-26,Seattle (NFL),Placed on Injured Reserve (Knee),2013,8
1213,/players/phillip-thomas-thomaph02,2013-08-26,Washington (NFL),Placed on Injured Reserve (Lisfranc),2013,8
...,...,...,...,...,...,...
16628,/players/joejuan-williams-willijo31,2022-08-16,New England (NFL),Placed on Injured Reserve (Shoulder),2022,8
18239,/players/mekhi-becton-bectome01,2022-08-16,NY Jets (NFL),Placed on Injured Reserve (Knee),2022,8
11837,/players/adam-shaheen-shahead01,2022-08-16,Miami (NFL),Placed on Injured Reserve (Knee),2022,8
15856,/players/cornell-armstrong-armstco01,2022-08-16,Atlanta (NFL),Placed on Injured Reserve (Undisclosed),2022,8


In [118]:
df

Unnamed: 0,year,round,pick,team,player,page,position
0,2014,1,1,Houston Texans,Jadeveon Clowney,/players/jadeveon-clowney-clownja01,DE
1,2014,1,2,St. Louis Rams,Greg Robinson,/players/greg-robinson-robingr05,OT
2,2014,1,3,Jacksonville Jaguars,Blake Bortles,/players/blake-bortles-bortlbl01,QB
3,2014,1,4,Buffalo Bills,Sammy Watkins,/players/sammy-watkins-watkisa01,WR
4,2014,1,5,Oakland Raiders,Khalil Mack,/players/khalil-mack-mackkh01,LB
...,...,...,...,...,...,...,...
2299,2022,7,258,Green Bay Packers,Samori Toure,/players/samori-toure-touresa01,WR
2300,2022,7,259,Kansas City Chiefs,Nazeeh Johnson,/players/nazeeh-johnson-johnsna05,DB
2301,2022,7,260,Los Angeles Chargers,Alexander Horvath,/players/alexander-horvath-horvaal01,RB
2302,2022,7,261,Los Angeles Rams,AJ Arcuri,/players/aj-arcuri-arcuraj01,OT


In [117]:
ir.head()

Unnamed: 0,year,round,pick,draft_team,player,page,position,transdate,team,transaction,season_num,trans_month,trans_year,injury
29,2014,1,4,Buffalo Bills,Sammy Watkins,/players/sammy-watkins-watkisa01,WR,2016-09-30,Buffalo (NFL),Placed on Injured Reserve (Foot),2,9,2016,Foot
33,2014,1,7,Tampa Bay Buccaneers,Mike Evans,/players/mike-evans-evansmi03,WR,2019-12-18,Tampa Bay (NFL),Placed on Injured Reserve (Hamstring),5,12,2019,Hamstring
37,2014,1,10,Detroit Lions,Eric Ebron,/players/eric-ebron-ebroner01,TE,2021-11-27,Pittsburgh (NFL),Placed on Injured Reserve (Knee),7,11,2021,Knee
41,2014,1,10,Detroit Lions,Eric Ebron,/players/eric-ebron-ebroner01,TE,2019-11-25,Indianapolis (NFL),Placed on Injured Reserve (Ankle),5,11,2019,Ankle
49,2014,1,12,New York Giants,Odell Beckham Jr.,/players/odell-beckham-beckhod01,WR,2020-10-27,Cleveland (NFL),Placed on Injured Reserve (Knee),6,10,2020,Knee
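The `injury` and `season_num` columns in this output look like they are derived from `transaction` and from the gap between the draft `year` and `trans_year`. The cell that built them is not shown here, so the following is only a sketch of one way to get the same values, using two rows copied from the table above.

```python
import pandas as pd

# Two rows taken from the output above (Sammy Watkins, Mike Evans).
ir = pd.DataFrame({
    'year': [2014, 2014],
    'transdate': pd.to_datetime(['2016-09-30', '2019-12-18']),
    'transaction': ['Placed on Injured Reserve (Foot)',
                    'Placed on Injured Reserve (Hamstring)'],
})

# Text inside the trailing parentheses; NaN when no parenthetical is present.
ir['injury'] = ir['transaction'].str.extract(r'\(([^)]+)\)\s*$', expand=False)

# Seasons elapsed since the draft year (assumed definition of season_num).
ir['trans_year'] = ir['transdate'].dt.year
ir['season_num'] = ir['trans_year'] - ir['year']
```

Rows without a parenthetical come back as `NaN`, which is where a `fillna('Undisclosed')` step could be applied.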


In [121]:
ircols = ['year','round','pick','draft_team','player','position','injury','season_num'] 
comb = df.merge(ir[ircols], how='left', on=['year','round','pick','player'])

In [122]:
comb

Unnamed: 0,year,round,pick,team,player,page,position_x,draft_team,position_y,injury,season_num
0,2014,1,1,Houston Texans,Jadeveon Clowney,/players/jadeveon-clowney-clownja01,DE,,,,
1,2014,1,2,St. Louis Rams,Greg Robinson,/players/greg-robinson-robingr05,OT,,,,
2,2014,1,3,Jacksonville Jaguars,Blake Bortles,/players/blake-bortles-bortlbl01,QB,,,,
3,2014,1,4,Buffalo Bills,Sammy Watkins,/players/sammy-watkins-watkisa01,WR,Buffalo Bills,WR,Foot,2.0
4,2014,1,5,Oakland Raiders,Khalil Mack,/players/khalil-mack-mackkh01,LB,,,,
...,...,...,...,...,...,...,...,...,...,...,...
2484,2022,7,258,Green Bay Packers,Samori Toure,/players/samori-toure-touresa01,WR,,,,
2485,2022,7,259,Kansas City Chiefs,Nazeeh Johnson,/players/nazeeh-johnson-johnsna05,DB,,,,
2486,2022,7,260,Los Angeles Chargers,Alexander Horvath,/players/alexander-horvath-horvaal01,RB,,,,
2487,2022,7,261,Los Angeles Rams,AJ Arcuri,/players/aj-arcuri-arcuraj01,OT,,,,
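Two things stand out in the merged output: `position` was split into `position_x` and `position_y` because it exists in both frames but is not a join key, and the row count grew because a player with several IR stints matches more than once. A minimal sketch of a merge that avoids the duplicated column and makes the one-to-many expectation explicit, using toy rows based on the tables above:

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2014, 2014], 'round': [1, 1], 'pick': [1, 10],
    'player': ['Jadeveon Clowney', 'Eric Ebron'],
    'position': ['DE', 'TE'],
})
ir = pd.DataFrame({
    'year': [2014, 2014], 'round': [1, 1], 'pick': [10, 10],
    'player': ['Eric Ebron', 'Eric Ebron'],
    'position': ['TE', 'TE'],
    'injury': ['Knee', 'Ankle'], 'season_num': [7, 5],
})

keys = ['year', 'round', 'pick', 'player']

# Bring over only the keys plus the columns that are new to `df`.
# validate raises if a draft pick appears more than once on the left side.
comb = df.merge(ir[keys + ['injury', 'season_num']],
                how='left', on=keys, validate='one_to_many')
```

Here Ebron contributes two rows (one per IR stint) and Clowney keeps a single row with missing `injury` and `season_num`.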


In [115]:
ir_agg = ir.groupby(['season_num', 'injury']).size().reset_index(name='total')
ir_agg

Unnamed: 0,season_num,injury,total
0,0,Abdomen,1
1,0,Achilles,3
2,0,Ankle,20
3,0,Back,5
4,0,Concussion,5
...,...,...,...
122,6,Quadriceps,1
123,7,Concussion,2
124,7,Hamstring,2
125,7,Hip,1
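The long table above can be hard to scan across seasons. One option, sketched here with a few of the rows shown above, is to pivot it so each injury is a row and each season a column. The zero fill for missing combinations is my assumption about how absent pairs should be read.

```python
import pandas as pd

# A handful of rows copied from the aggregated output above.
ir_agg = pd.DataFrame({
    'season_num': [0, 0, 0, 7, 7],
    'injury': ['Abdomen', 'Achilles', 'Ankle', 'Concussion', 'Hamstring'],
    'total': [1, 3, 20, 2, 2],
})

# Wide layout: one row per injury, one column per season, 0 where no IR placement was recorded.
wide = (
    ir_agg.pivot(index='injury', columns='season_num', values='total')
    .fillna(0)
    .astype(int)
)
```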


In [60]:
files = [pd.read_csv(f'{x}_transactions.csv', parse_dates=['transaction_date']) for x in years]

In [61]:
trans = pd.concat(files)

In [62]:
trans.columns

Index(['player', 'transaction_date', 'team', 'transaction'], dtype='object')

In [65]:
trans.drop_duplicates(inplace=True)

In [66]:
trans.transaction_date.dt.year.value_counts()

2021    947
2020    614
2019    408
2018    394
2017    283
2022    251
2016    240
2015    166
2014     79
Name: transaction_date, dtype: int64

In [67]:
trans.to_csv('2014_2022_Draftee_Transactions.csv',index=False)

In [132]:
file = open('Players.csv','r')

In [135]:
new = [ int(x) for x in test]

In [136]:
new = np.reshape(new,-1)

In [137]:
new

array([ 167,  273,  274, ..., 3236, 3247, 3248])

In [129]:
file = open('player_lst.csv','r')


In [139]:
import json


In [147]:
lst = [str(x) for x in range(10)]

In [148]:
",".join(lst)

'0,1,2,3,4,5,6,7,8,9'

In [133]:
years = pd.Series(range(2014, 2023), name='year')

In [140]:
test = df.merge(years,how='cross', on='year')

MergeError: Can not pass on, right_on, left_on or set right_index=True or left_index=True
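As the error message says, a cross join takes no key arguments: it pairs every row on the left with every row on the right. A small sketch of the player-by-season expansion follows. The column name `season` is my choice, so that it does not collide with the draft `year` already in `df`.

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2014, 2014],
    'player': ['Jadeveon Clowney', 'Blake Bortles'],
})

# Named 'season' (an assumption) to keep it distinct from the draft 'year'.
seasons = pd.Series(range(2014, 2023), name='season').to_frame()

# how='cross' with no `on`: every player row is paired with every season.
expanded = df.merge(seasons, how='cross')
```

Filtering `expanded` on a single player should then return nine rows, one per season.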

In [139]:
test[test['player']=='Blake Bortles']

Unnamed: 0,year,round,pick,team,player,page,position
2,2014,1,3,Jacksonville Jaguars,Blake Bortles,/players/blake-bortles-bortlbl01,QB


In [4]:
df.head()

Unnamed: 0,year,round,pick,team,player,page,position
0,2014,1,1,Houston Texans,Jadeveon Clowney,/players/jadeveon-clowney-clownja01,DE
1,2014,1,2,St. Louis Rams,Greg Robinson,/players/greg-robinson-robingr05,OT
2,2014,1,3,Jacksonville Jaguars,Blake Bortles,/players/blake-bortles-bortlbl01,QB
3,2014,1,4,Buffalo Bills,Sammy Watkins,/players/sammy-watkins-watkisa01,WR
4,2014,1,5,Oakland Raiders,Khalil Mack,/players/khalil-mack-mackkh01,LB


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  d['season'] = pd.Series(len(df.player.unique()) *  [2014,2015,2016,2017,2018,2019,2020,2021,2022])


Unnamed: 0,year,round,pick,team,player,page,position,season
0,2014,1,1,Houston Texans,Jadeveon Clowney,/players/jadeveon-clowney-clownja01,DE,2014
0,2014,1,1,Houston Texans,Jadeveon Clowney,/players/jadeveon-clowney-clownja01,DE,2014
0,2014,1,1,Houston Texans,Jadeveon Clowney,/players/jadeveon-clowney-clownja01,DE,2014
0,2014,1,1,Houston Texans,Jadeveon Clowney,/players/jadeveon-clowney-clownja01,DE,2014
0,2014,1,1,Houston Texans,Jadeveon Clowney,/players/jadeveon-clowney-clownja01,DE,2014
...,...,...,...,...,...,...,...,...
2303,2022,7,262,San Francisco 49ers,Brock Purdy,/players/brock-purdy-purdybr01,QB,2022
2303,2022,7,262,San Francisco 49ers,Brock Purdy,/players/brock-purdy-purdybr01,QB,2022
2303,2022,7,262,San Francisco 49ers,Brock Purdy,/players/brock-purdy-purdybr01,QB,2022
2303,2022,7,262,San Francisco 49ers,Brock Purdy,/players/brock-purdy-purdybr01,QB,2022


In [30]:
pd.DataFrame(d)

Unnamed: 0,year,round,pick,team,player,page,position
0,2014,1,1,Houston Texans,Jadeveon Clowney,/players/jadeveon-clowney-clownja01,DE
0,2014,1,1,Houston Texans,Jadeveon Clowney,/players/jadeveon-clowney-clownja01,DE
0,2014,1,1,Houston Texans,Jadeveon Clowney,/players/jadeveon-clowney-clownja01,DE
0,2014,1,1,Houston Texans,Jadeveon Clowney,/players/jadeveon-clowney-clownja01,DE
0,2014,1,1,Houston Texans,Jadeveon Clowney,/players/jadeveon-clowney-clownja01,DE
...,...,...,...,...,...,...,...
2303,2022,7,262,San Francisco 49ers,Brock Purdy,/players/brock-purdy-purdybr01,QB
2303,2022,7,262,San Francisco 49ers,Brock Purdy,/players/brock-purdy-purdybr01,QB
2303,2022,7,262,San Francisco 49ers,Brock Purdy,/players/brock-purdy-purdybr01,QB
2303,2022,7,262,San Francisco 49ers,Brock Purdy,/players/brock-purdy-purdybr01,QB


In [34]:
d['season'] = pd.Series(len(df.player.unique()) *  [2014,2015,2016,2017,2018,2019,2020,2021,2022])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  d['season'] = pd.Series(len(df.player.unique()) *  [2014,2015,2016,2017,2018,2019,2020,2021,2022])
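Apart from the warning, assigning a `pd.Series` here aligns on index labels rather than on position. Since `d` carries repeated index labels (0, 0, 0, ...), every copy of a row receives the same season, which matches the earlier output where all of Clowney's rows show 2014. A sketch of one way around both issues, assuming `d` is meant to hold each `df` row once per season (how `d` was actually built is not shown):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'player': ['Jadeveon Clowney', 'Blake Bortles']})
seasons = [2014, 2015, 2016]

# Repeat each player row once per season, on a fresh copy with a clean index.
d = df.loc[df.index.repeat(len(seasons))].reset_index(drop=True).copy()

# A NumPy array is assigned by position, so no index alignment takes place.
d['season'] = np.tile(seasons, len(df))
```

Using `len(df)` rather than the number of unique player names keeps the lengths in step even if two draftees happen to share a name.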


In [38]:
trans

NameError: name 'trans' is not defined