## 1.0 Problem Formulation and Data Gathering ##

In this notebook, I'll walk through my goals and cover the foundational work needed to reach them: extraction, cleaning, and manipulation, culminating in the desired representation of the data.

### 1.1 Introduction to the Problem / Fantasy Football Primer
<br>


<font size=5>__Key Takeaways__</font>

 - Fantasy Football offers many interesting data science problems to explore. One of them is predicting which players a fantasy owner should choose.
 - Choosing the right players is crucial for being competitive in a league.
 - Aside from choosing a player based on performance (which has a more direct relationship with points), modeling the potential for injury from past player IR data could inform owners about which players to avoid.
 - We can represent a player's career as a vector in which each index is a single season and each value records either that the player was healthy or the injury that caused the player to be placed on IR.


<br>
<font size=5>
    <b>Background</b>
</font>
<br>
<font size=10>M</font><font size=4>y ventures into data science originated (unbeknownst to me at the time) as a desire to formulate a strategy for selecting fantasy football players. If you're unfamiliar with Fantasy Football, it basically boils down to this: you, along with several other people (called "owners"), form a "league" and choose real-life NFL players at various positions to form an imaginary team. Each week, players play in their real-life NFL games and their statistics, such as rushing yards or passing touchdowns, generate "points" which are aggregated into point totals for each imaginary team. There are several different league formats, but the one I've come to know and love is called a Head-to-Head League: you play against another owner's imaginary team, and whoever obtains the most points from the combination of players on their "team" wins the week (and conversely the other owner loses). The "season" coincides with the real NFL season, and your resulting record of wins and losses determines your advancement into a playoff period that eventually leads to a single champion. As the season progresses, owners have to continually make decisions about which players to "start" (capture their performances) vs "sit" (ignore their performances). These decisions can be based on prior performance, strength of opponent, health, etc.</font>


<font size=4>Player selection can occur either prior to the season or in-season. Prior to the season, leagues hold a "draft", an event that allows each owner to build a requisite roster of players. It's assumed you'll be choosing your highest performers in this event (though that may not be the case). The risk here is that you're mostly reliant on past seasons as well as advice from internet/traditional media "experts" who at best are running their own models and at worst are using their non-empirical judgements to rank players for public consumption. Drafts require their own form of strategy, forcing owners to balance "performance" with "cost" in the midst of "hype", which is the media/fandom's inflation of a player's value based on perceived or expected performance. The higher the hype, the higher the price, with the expectation that the player will produce excellent results relative to other players who could've been chosen at that time or for the asking price. A reasonable strategy that can lead to success is finding players who perform better than their average draft position (referred to as ADP) would suggest. The later a player is actually selected relative to their ADP, the greater the implied value.</font>

<font size=4>Alternatively, owners can pick up or trade for players once the season is underway, and the factors that go into these decisions can differ from those in the draft. An owner may select a player (free agent) who has a favorable matchup for a week, or trade for a player they believe will perform better for the remainder of the season than the one they're giving up. Usually, there's a greater emphasis on selection based on matchup, though some factors that go into draft selection come into play during the season as well.</font>


<font size=4>The key idea is that as a fantasy owner, your success depends on choosing high-performing players for your team and playing them when they have a higher potential for success (e.g. perhaps their opponent is not good at defending the player's position) OR when the cost of keeping them inactive (i.e. not playing them) is higher than if they were to play. There are quite a number of considerations that go into "starting" (playing) a player, and all are beyond the scope of this analysis. However, there are two critical questions to keep in mind when managing a fantasy football team: __Who should I choose?__ and __When should I use them?__</font>

<br>
   
<font size=5>__Problem Formulation__</font>
<br>
<font size=10>F</font>
<font size=4>or this analysis, I'll be focusing on the "Who should I choose?" question. My long-term desire is to build a reasonably accurate predictive model to aid in choosing a player for an upcoming season. That simple question becomes very complex when considering the variables that inform a player's future success. You may think that a prior season's performance is a good predictor of the future, but modeling future performance on prior statistics alone is most likely a simplistic representation of the complex real world. So what other factors influence a player's performance? Here, I would make a distinction between intrinsic (player-dependent) and extrinsic (outside of the player) factors. Beyond past performance, intrinsic factors could include age, experience, or even height/weight. Extrinsic factors could include a player's coach, the talent surrounding them, even the franchise they belong to. We can't faithfully model all of the real-world factors, but we can try our best to understand which factors have the greatest influence on a player's performance and model those. This gap between a simplified model and the full complexity of the real world is often called large-world uncertainty.</font>

<font size=4>
In asking "Who should I choose?" and brainstorming the intrinsic/extrinsic factors, I found a bias in my own thinking. I was seeking out the obvious: variables that directly generate points (touchdowns, rushing yards, catches, etc.) and are positively correlated with fantasy points. I got a little more sophisticated and reasoned that age might play a role, as it's been suggested and mostly observed that a player's performance is likely to diminish as they age; age has a negative correlation with production. Then the most obvious thing finally dawned on me: what about injuries? If a player is injured, they can't play. If they can't play, they can't generate points, and that could have as much of an impact on a fantasy team as any other variable. So, if I choose a high-performing player (we'll call them "A") and they get injured and miss several games, then their higher performances are "decayed" over the span of the season, as they aren't generating points during the games they're injured (i.e. their mean season performance is decreased). Now, the prospect of grabbing a moderate performer ("B") who stays healthy across the season might not seem like such a bad choice. I'll point out that this is not a black-and-white scenario. It might be that A won your team games for a few weeks, whereas playing B in those same weeks wouldn't have led to the same result, so the risk of selecting A might still be worth it even with a prior belief that they would get injured. In general, though, especially during a draft, you don't know future performance, and therefore you're limited to predicting it.</font>
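To make the A-versus-B trade-off concrete, here is a small illustrative calculation. The per-game averages, the number of missed games, and the 17-game season are made-up assumptions rather than real player data, and `season_total` is a hypothetical helper.

```python
# Illustrative only: the point averages and game counts below are invented.
GAMES_IN_SEASON = 17  # assumption: a 17-game regular season


def season_total(points_per_game, games_played):
    """Total fantasy points a player contributes over the games actually played."""
    return points_per_game * games_played


# Player A: a high performer who misses 7 games through injury
total_a = season_total(points_per_game=20, games_played=GAMES_IN_SEASON - 7)
# Player B: a moderate performer who plays every game
total_b = season_total(points_per_game=14, games_played=GAMES_IN_SEASON)

# Average over the whole season, counting missed games as zero points
mean_a = total_a / GAMES_IN_SEASON
mean_b = total_b / GAMES_IN_SEASON

print(total_a, total_b)
print(round(mean_a, 2), round(mean_b, 2))
```

Under these assumed numbers, B ends the season ahead of A even though A scores more in every game played. Different assumptions can easily flip the result.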


<font size=4>
It's reasonable to assume, however, that an owner would want to avoid prolonged injuries as a risk-mitigation strategy for early selections, where the very best players are presumed to be available. Since other owners will be selecting the highest expected performers early in the draft, choosing players who are high performers and less at risk of prolonged injury helps an owner stay competitive. Therefore, in asking the question "Who should I choose?", it's important to ask the intersecting question "Who is likely to be injured?". "Injured", or more generally "Health", can be multi-faceted (players could have minor injuries that take them out of a game, long-term injuries that require them to sit out a predefined number of weeks with a designation of "Injured Reserve", or they could catch COVID and sit out games until they recover).</font>

<font size=4>
In this analysis, I explore this question using the presence or absence of a player's stint on Injured Reserve as a proxy for health.</font>

    
    

<font size=5>__Data Representation__</font>










In [422]:
# Let's import the important libraries for our work.

import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import time

### 1.1 

In [421]:
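def GetDraftees(years):
    """Scrape footballdb.com draft results (rounds 1-7) for each year in `years`."""

    head = {"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.2 Safari/605.1.15"}
    # One list per output column (kept separate from the `years` argument)
    draft_year = []
    dround = []
    pick = []
    team = []
    player = []
    ppage = []
    position = []
    school = []

    for year in years:
        for rnd in range(1, 8):
            time.sleep(2)  # pause between requests to be polite to the site
            url = f"https://www.footballdb.com/draft/draft.html?lg=NFL&yr={year}&rnd={rnd}"
            print(url)
            resp = requests.get(url, headers=head)
            # Name the parser explicitly so results don't depend on what is installed
            soup = BeautifulSoup(resp.content, "html.parser")
            table = soup.find_all('div', class_='tr')
            for row in table[1:]:  # skip the header row
                cell_lst = row.findChildren('div')
                draft_year.append(year)
                dround.append(cell_lst[0].get_text())
                pick.append(cell_lst[1].get_text())
                team.append(cell_lst[2].findChildren('b')[0].get_text())
                player.append(cell_lst[3].get_text())
                try:
                    ppage.append(cell_lst[2].findChildren('a', href=True)[1]['href'])
                except (IndexError, KeyError):
                    # some rows carry no link to a player page
                    ppage.append("unavailable")
                position.append(cell_lst[4].get_text())
                school.append(cell_lst[5].get_text())

    # Return the scraped columns instead of leaving them in local lists
    return pd.DataFrame({'year': draft_year, 'round': dround, 'pick': pick,
                         'team': team, 'player': player, 'page': ppage,
                         'position': position, 'school': school})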
 


https://www.footballdb.com/draft/draft.html?lg=NFL&yr=2000&rnd=1
https://www.footballdb.com/draft/draft.html?lg=NFL&yr=2000&rnd=2
https://www.footballdb.com/draft/draft.html?lg=NFL&yr=2000&rnd=3
https://www.footballdb.com/draft/draft.html?lg=NFL&yr=2000&rnd=4
https://www.footballdb.com/draft/draft.html?lg=NFL&yr=2000&rnd=5
https://www.footballdb.com/draft/draft.html?lg=NFL&yr=2000&rnd=6
https://www.footballdb.com/draft/draft.html?lg=NFL&yr=2000&rnd=7
https://www.footballdb.com/draft/draft.html?lg=NFL&yr=2001&rnd=1
https://www.footballdb.com/draft/draft.html?lg=NFL&yr=2001&rnd=2
https://www.footballdb.com/draft/draft.html?lg=NFL&yr=2001&rnd=3
https://www.footballdb.com/draft/draft.html?lg=NFL&yr=2001&rnd=4
https://www.footballdb.com/draft/draft.html?lg=NFL&yr=2001&rnd=5
https://www.footballdb.com/draft/draft.html?lg=NFL&yr=2001&rnd=6
https://www.footballdb.com/draft/draft.html?lg=NFL&yr=2001&rnd=7
https://www.footballdb.com/draft/draft.html?lg=NFL&yr=2002&rnd=1
https://www.footballdb.co

In [90]:
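# The column lists are filled inside GetDraftees, so they are not defined at notebook level.
# Run in a fresh kernel, this cell raises the NameError shown below; the next cell
# reloads the previously saved CSV instead.
df = pd.DataFrame({'year': year_lst,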
              'round':dround,
              'pick': pick,
              'team': team,
              'player': player,
                   'page': ppage,
              'position': position})


df.to_csv("2014_2022_Draft.csv",index=False)

NameError: name 'year_lst' is not defined

In [294]:
df = pd.read_csv("2014_2022_Draft.csv")
df = df[df['year'] < 2022]

Great! We have all of the NFL draftees from 2014 to 2021 (the 2022 draftees haven't given us much information at the time of this writing). To prepare for joining the transaction data, I need to create a row for every season between each player's draft year and now. The transaction data alone only covers the seasons in which an IR transaction occurred, so we can't rely on it to account for every season. This is crucial, as we want to represent every season of a player's career (whether there was an injury or not).

Let me illustrate this point with Derrick Henry's career, represented as a sequence of seasons with or without IR. With only transaction data, we get the following sequence:

Derrick Henry:  [Foot] 

where <code>Foot</code> was an injury that occurred in 2021.  But Henry didn't play just one season.  So we need to represent his career as several <code>Healthy</code> seasons followed by an IR Injury.  Since he started his career in 2016, we'd want his career vector to look like this:

<code>['Healthy','Healthy','Healthy','Healthy','Healthy','Foot']</code>


The following code prepares each player to be represented this way (prior to joining the IR data). Note: run this cell only once, or re-run the prior cell before it, to prevent unnecessary duplication of rows.
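As a minimal sketch of the target representation, the helper below builds a career vector from a draft year, a final season, and a mapping of IR seasons. The function name `career_vector` and its arguments are hypothetical and exist only for illustration; the notebook itself builds the same idea with pandas.

```python
def career_vector(draft_year, last_season, ir_by_season):
    """Return one status per season from draft_year through last_season.

    ir_by_season maps a season year to the injury recorded for an IR stint;
    seasons without an entry are labeled 'Healthy'.
    """
    return [ir_by_season.get(season, "Healthy")
            for season in range(draft_year, last_season + 1)]


# Derrick Henry: drafted in 2016, placed on IR (Foot) in the 2021 season
henry = career_vector(2016, 2021, {2021: "Foot"})
print(henry)
```

This reproduces the six-season sequence described above: five `Healthy` seasons followed by `Foot`.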
    
   

In [295]:
# Create a list of seasons
lst_seasons = [2014,2015,2016,2017,2018,2019,2020,2021]


# Create a pandas Series that cycles through the list of seasons once per player
# (number of rows in df * the list of seasons we care about)
seasons = pd.Series(df.shape[0] *  lst_seasons)


# Repeat each player row once per season (the length of the list of seasons above).
# .loc selects by index label, so this also works if the index is not strictly positional.
# Reset the index so that the Series lines up with the df's index
df = df.loc[df.index.repeat(len(lst_seasons))].reset_index()


# Create a new column from the pandas series
df['season'] = seasons




# Let's remove any seasons that occurred prior to a player's draft year, because players don't play prior to being drafted.
# We should also filter out any positions we don't care about, namely defensive positions. My FFB league accrues points
# for QB, RB, WR and TE positions.

df = df[(df['position'].isin(['QB','RB','WR','TE'])) &
       (df['year'] <= df['season'])]


df.head(15)

Unnamed: 0,index,year,round,pick,team,player,page,position,season
16,2,2014,1,3,Jacksonville Jaguars,Blake Bortles,/players/blake-bortles-bortlbl01,QB,2014
17,2,2014,1,3,Jacksonville Jaguars,Blake Bortles,/players/blake-bortles-bortlbl01,QB,2015
18,2,2014,1,3,Jacksonville Jaguars,Blake Bortles,/players/blake-bortles-bortlbl01,QB,2016
19,2,2014,1,3,Jacksonville Jaguars,Blake Bortles,/players/blake-bortles-bortlbl01,QB,2017
20,2,2014,1,3,Jacksonville Jaguars,Blake Bortles,/players/blake-bortles-bortlbl01,QB,2018
21,2,2014,1,3,Jacksonville Jaguars,Blake Bortles,/players/blake-bortles-bortlbl01,QB,2019
22,2,2014,1,3,Jacksonville Jaguars,Blake Bortles,/players/blake-bortles-bortlbl01,QB,2020
23,2,2014,1,3,Jacksonville Jaguars,Blake Bortles,/players/blake-bortles-bortlbl01,QB,2021
24,3,2014,1,4,Buffalo Bills,Sammy Watkins,/players/sammy-watkins-watkisa01,WR,2014
25,3,2014,1,4,Buffalo Bills,Sammy Watkins,/players/sammy-watkins-watkisa01,WR,2015


OK, we have an accounting of the seasons and we've eliminated seasons prior to each player's draft year. Let's focus on the transaction data now...

## 2.0 Generate Transaction Data ##

Let's write a generator function that scrapes the pages of the players we obtained from each year's draft (offensive positions only). It yields one record per transaction, and together those records form each player's transaction history over their career.
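The pattern of feeding a generator of dicts straight into `pd.DataFrame` can be sketched without touching the network. Everything here is a stand-in: `fake_transactions`, the `SAMPLE` dictionary, and the example page name are hypothetical and only mimic the shape of the records the real scraper yields.

```python
import pandas as pd

# Hypothetical stand-in for scraped pages: page -> list of (date, transaction)
SAMPLE = {
    "/players/example-player": [
        ("01/05/2022", "Signed"),
        ("11/01/2021", "Waived"),
    ],
}


def fake_transactions(pages):
    """Yield one dict per transaction row, like the real scraper would."""
    for page in pages:
        for transdate, transaction in SAMPLE.get(page, []):
            yield {"page": page, "transdate": transdate, "transaction": transaction}


# pd.DataFrame consumes the generator and turns each dict into a row
toy = pd.DataFrame(fake_transactions(["/players/example-player"]))
print(toy)
```

One reason to prefer a generator here is that each record is produced as soon as its row is parsed, so no module-level accumulator lists are needed.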

In [296]:
head = {"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.2 Safari/605.1.15"}

def GetTransactions(pages):
    """Yield one dict per transaction row found on each player's transactions page."""
    for page in pages:
        if page == "unavailable":
            # draft rows without a player link have no page to scrape
            continue
        time.sleep(2)  # pause before every request, not just once per call
        # page already starts with a slash, e.g. /players/derrick-henry-henryde01
        url = f'https://www.footballdb.com{page}/transactions'
        resp = requests.get(url, headers=head)
        soup = BeautifulSoup(resp.content, "html.parser")
        bodies = soup.find_all('tbody')
        if not bodies:
            # no transaction table on this page
            continue
        tr_lst = bodies[0].find_all('tr')
        for row in tr_lst:
            obj = {}
            cells = row.find_all('td')
            obj['page'] = page
            obj['transdate'] = cells[0].find_all('span')[0].get_text()
            obj['team'] = cells[1].find_all('span')[0].get_text()
            obj['transaction'] = cells[2].get_text()
            yield obj


pages = df.page.unique()
trans = pd.DataFrame(GetTransactions(pages))

In [297]:
# Save the data so I don't have to repeat the scraping all over again
trans.to_csv('full_transactions.csv',index=False)

In [298]:
# Let's read the transaction data back in from the csv, then
# take Derrick Henry and see what transactions he's generated (and whether we've done a good job collecting them)

trans = pd.read_csv('full_transactions.csv')
trans[trans['page'].str.contains('derrick-henry', na=False)]


Unnamed: 0,page,transdate,team,transaction
1984,/players/derrick-henry-henryde01,01/21/2022,Tennessee (NFL),Activated from Injured Reserve
1985,/players/derrick-henry-henryde01,01/05/2022,Tennessee (NFL),Designated for return from Injured Reserve
1986,/players/derrick-henry-henryde01,11/01/2021,Tennessee (NFL),Placed on Injured Reserve (Foot)
1987,/players/derrick-henry-henryde01,04/02/2020,Tennessee (NFL),Signed
1988,/players/derrick-henry-henryde01,03/16/2020,Tennessee (NFL),Designated as franchise player
1989,/players/derrick-henry-henryde01,05/09/2016,Tennessee (NFL),Signed


Perfect! It looks like we've captured all of the transactions when compared to his actual history on footballdb.com (as of this writing). We'll need to do some housecleaning, including extracting the year/month from each transaction date and determining which season number the player was in for a particular IR transaction. Using season numbers rather than season years makes sense, since players start and end their careers in different years.

In [299]:
trans['transdate'] = pd.to_datetime(trans['transdate'])
trans['trans_year'] = trans['transdate'].dt.year
trans['trans_month'] = trans['transdate'].dt.month

In [300]:
# Let's take a look to see what types of transactions include some mention of retire

print(trans.transaction.unique().reshape(-1,1))
print('Retirement related transactions: ', [x for x in list(trans.transaction.unique()) if 'retire' in x.lower()] )

[['Released']
 ['Signed to a future contract']
 ['Restored to the Practice Squad']
 ['Placed on Practice Squad/COVID-19 List']
 ['Returned to the Practice Squad']
 ['Signed to the Active Roster from the Practice Squad']
 ['Signed to the Practice Squad']
 ['Released from the Practice Squad']
 ['Signed']
 ['Signed from the Denver Broncos practice squad']
 ['Activated from the Reserve/COVID-19 List']
 ['Placed on COVID-19 List']
 ['Traded to the Los Angeles Rams']
 ['Activated from Injured Reserve']
 ['Placed on Injured Reserve (Foot)']
 ['Placed on Injured Reserve (Hamstring)']
 ['Activated from the Reserve/Suspended List']
 ['Placed on the Reserve/Suspended List']
 ['Placed on Injured Reserve (Knee)']
 ['Placed on Injured Reserve (Ankle)']
 ['Waived']
 ['Traded to the Cleveland Browns']
 ['Traded to the Houston Texans']
 ['Traded to the New England Patriots']
 ['Placed on Injured Reserve (Thumb)']
 ['Placed on the 1-Game Injured List']
 ['Traded to the Montreal Alouettes']
 ['Traded to 

In [301]:
len(trans[trans['transaction']== 'Placed on the Reserve/Retired List'])

10

Hmm... only 10 cases where a player actually retired. Looking through the list and comparing some players, it seems we would also want to check for cases where a player was released and not picked up by another team.
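One way to sketch that check is to take each player's most recent transaction and flag it when it is an exit-type entry. The toy rows and the `EXIT_TRANSACTIONS` set below are assumptions for illustration; which transaction types should count as the end of a career is a judgment call.

```python
import pandas as pd

# Toy transaction log for two hypothetical players
toy = pd.DataFrame({
    "page": ["p1", "p1", "p2", "p2"],
    "transdate": pd.to_datetime(
        ["2019-05-01", "2019-07-17", "2020-03-01", "2021-03-26"]
    ),
    "transaction": ["Signed", "Waived", "Waived", "Re-signed"],
})

# Assumption: these transaction types mark a possible end of a career
EXIT_TRANSACTIONS = {"Waived", "Released", "Placed on the Reserve/Retired List"}

# Sort chronologically, then keep the last row per player
last_toy = (toy.sort_values("transdate")
               .groupby("page", as_index=False)
               .last())
# 1 if the most recent transaction is an exit-type entry, else 0
last_toy["end"] = last_toy["transaction"].isin(EXIT_TRANSACTIONS).astype(int)
print(last_toy)
```

Here `p1` is flagged because being waived is the last thing on record, while `p2` is not, since a later re-signing follows the waiver.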

In [302]:
trans = trans.sort_values('transdate',ascending=True)
# Select both columns with a list; indexing a groupby with a bare tuple of names is deprecated
last = trans.groupby('page')[['transdate','transaction']].agg({'transdate':'max','transaction':'last'}).reset_index()
# Flag players whose most recent transaction was being waived
last['end'] = last['transaction'].str.contains('Waived', na=False).astype(int)
last


Unnamed: 0,page,transdate,transaction,end
0,/players/aaron-burbridge-burbraa01,2019-07-17,Waived,1
1,/players/aaron-jones-jonesaa02,2021-03-26,Re-signed,0
2,/players/aaron-murray-murraaa01,2017-05-11,Waived,1
3,/players/aaron-ripkowski-ripkoaa01,2019-05-03,Waived,1
4,/players/adam-shaheen-shahead01,2021-09-15,Activated from the Reserve/COVID-19 List,0
...,...,...,...,...
639,/players/zach-gentry-gentrza01,2020-11-24,Placed on Injured Reserve (Knee),0
640,/players/zach-mettenberger-metteza01,2019-03-18,Placed on Injured Reserve,0
641,/players/zach-wilson-wilsoza02,2021-07-29,Signed,0
642,/players/zack-moss-mossza01,2021-01-12,Placed on Injured Reserve (Ankle),0


In [303]:
# Keep IR and PUP placements; na=False treats missing transaction text as no match
ir = trans[trans['transaction'].str.contains('Placed on Injured Reserve', na=False) |
           trans['transaction'].str.contains('Placed on the Physically Unable to Perform List', na=False)]
p_ir = df.merge(ir, how='left',
                left_on=['page','season'],
                right_on=['page','trans_year'])

print(len(p_ir))
# Once again, let's use Derrick Henry and see if everything lines up
p_ir[p_ir['player']=='Derrick Henry']

2945


Unnamed: 0,index,year,round,pick,team_x,player,page,position,season,transdate,team_y,transaction,trans_year,trans_month
1283,556,2016,2,45,Tennessee Titans,Derrick Henry,/players/derrick-henry-henryde01,RB,2016,NaT,,,,
1284,556,2016,2,45,Tennessee Titans,Derrick Henry,/players/derrick-henry-henryde01,RB,2017,NaT,,,,
1285,556,2016,2,45,Tennessee Titans,Derrick Henry,/players/derrick-henry-henryde01,RB,2018,NaT,,,,
1286,556,2016,2,45,Tennessee Titans,Derrick Henry,/players/derrick-henry-henryde01,RB,2019,NaT,,,,
1287,556,2016,2,45,Tennessee Titans,Derrick Henry,/players/derrick-henry-henryde01,RB,2020,NaT,,,,
1288,556,2016,2,45,Tennessee Titans,Derrick Henry,/players/derrick-henry-henryde01,RB,2021,2021-11-01,Tennessee (NFL),Placed on Injured Reserve (Foot),2021.0,11.0


Merged successfully! 

## Season Numbers and sequencing IR Data

The NaN values in the prior output just mean that there was no IR data for that particular season. The transaction month is especially important, as it acts as the dividing line between a player's season numbers. Let's imagine Derrick Henry got injured in January '22 as opposed to November '21. In reality, this wouldn't have an impact on fantasy, but we want to represent that he was on IR and attribute it to the correct season number of his career. Therefore, we need to make some choices about season attribution: should we associate the IR stint with his 6th season as a player or his 7th? In this (imaginary) case, since it's still part of the '21 season, we should attribute it to season 6 of his career. It makes sense, then, to associate IR stints in months 9-12 and 1-2 with the same season; any later month starts another season.

To make things a bit easier, for the rows (player seasons) with no data, I'm going to set the default transaction month to 9 (September) to ensure we get the correct season number for the player.
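The attribution rule can be written as a small standalone function. This is only a sketch of the rule as described: `season_number` is a hypothetical helper that works from the calendar year of the transaction, and the default month of 9 mirrors the September default mentioned above.

```python
def season_number(draft_year, trans_year, trans_month=9):
    """Career season (1-based) that a transaction belongs to.

    Transactions in January or February are attributed to the season that
    kicked off the previous fall; every other month counts toward the season
    of its own calendar year.
    """
    season_year = trans_year - 1 if trans_month <= 2 else trans_year
    return season_year - draft_year + 1


# Derrick Henry (drafted 2016): actual IR placement in November 2021
print(season_number(2016, 2021, 11))
# The imaginary case from the text: injured in January 2022 instead
print(season_number(2016, 2022, 1))
```

Both calls land on season 6, which is the behavior argued for above.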



In [304]:
# Let's obtain the season number for the player - to make this easy I've broken out pieces of the transaction date into separate columns
# I'll assume that the NFL season ends in February, and thus will attribute transactions that happen in Jan/Feb to the prior season
# (hence no +1 in that branch)

# Seasons without an IR transaction default to September (month 9)
p_ir['tran_month'] = p_ir['trans_month'].fillna(9)

# Use the filled column so rows without a transaction are handled explicitly
p_ir['season_num'] = p_ir.apply(lambda x: x['season'] - x['year']
                                if x['tran_month'] <= 2
                                else x['season'] - x['year'] + 1,
                                axis=1)

p_ir[p_ir['player']=='Derrick Henry']

Unnamed: 0,index,year,round,pick,team_x,player,page,position,season,transdate,team_y,transaction,trans_year,trans_month,tran_month,season_num
1283,556,2016,2,45,Tennessee Titans,Derrick Henry,/players/derrick-henry-henryde01,RB,2016,NaT,,,,,9.0,1
1284,556,2016,2,45,Tennessee Titans,Derrick Henry,/players/derrick-henry-henryde01,RB,2017,NaT,,,,,9.0,2
1285,556,2016,2,45,Tennessee Titans,Derrick Henry,/players/derrick-henry-henryde01,RB,2018,NaT,,,,,9.0,3
1286,556,2016,2,45,Tennessee Titans,Derrick Henry,/players/derrick-henry-henryde01,RB,2019,NaT,,,,,9.0,4
1287,556,2016,2,45,Tennessee Titans,Derrick Henry,/players/derrick-henry-henryde01,RB,2020,NaT,,,,,9.0,5
1288,556,2016,2,45,Tennessee Titans,Derrick Henry,/players/derrick-henry-henryde01,RB,2021,2021-11-01,Tennessee (NFL),Placed on Injured Reserve (Foot),2021.0,11.0,11.0,6


In [305]:
# Let's check on Zack Moss (who was placed on IR in January of '21) and verify whether the season attribution logic worked
p_ir[p_ir['player']=='Zack Moss']

Unnamed: 0,index,year,round,pick,team_x,player,page,position,season,transdate,team_y,transaction,trans_year,trans_month,tran_month,season_num
2769,1613,2020,3,86,Buffalo Bills,Zack Moss,/players/zack-moss-mossza01,RB,2020,NaT,,,,,9.0,1
2770,1613,2020,3,86,Buffalo Bills,Zack Moss,/players/zack-moss-mossza01,RB,2021,2021-01-12,Buffalo (NFL),Placed on Injured Reserve (Ankle),2021.0,1.0,1.0,1


In [306]:
p_ir.shape

(2945, 16)

In [308]:
p_ir.to_csv('all_ir_transactions.csv', index=False)

In [309]:
# Review the unique list of IR-related transactions
p_ir = pd.read_csv('all_ir_transactions.csv')
p_ir.transaction.unique()

array([nan, 'Placed on Injured Reserve (Foot)',
       'Placed on Injured Reserve (Hamstring)',
       'Placed on Injured Reserve (Ankle)',
       'Placed on Injured Reserve (Knee)',
       'Placed on Injured Reserve (Thumb)',
       'Placed on the Physically Unable to Perform List (Knee)',
       'Placed on Injured Reserve (Back)',
       'Placed on Injured Reserve (Abdomen)',
       'Placed on Injured Reserve (Shoulder)',
       'Placed on Injured Reserve (Wrist)',
       'Placed on Injured Reserve (Leg)',
       'Placed on Injured Reserve (Concussion)',
       'Placed on Injured Reserve (Designated for Return) (Ankle)',
       'Placed on Injured Reserve (Pectoral)',
       'Placed on Injured Reserve', 'Placed on Injured Reserve (Hip)',
       'Placed on the Physically Unable to Perform List (Foot)',
       'Placed on Injured Reserve (Thigh)',
       'Placed on Injured Reserve (Undisclosed)',
       'Placed on the Physically Unable to Perform List (Shoulder)',
       'Placed on Injur

In [None]:
# Let's extract the injury - we can use regex to extract the word(s) between the parentheses
p_ir['injury'] = p_ir['transaction'].str.extract(r'\(([A-Za-z\s]+)\)')

In [313]:
p_ir.injury.value_counts()

Knee                     144
Ankle                     65
Hamstring                 55
Undisclosed               40
Foot                      39
Shoulder                  32
Concussion                18
Achilles                  14
Back                      14
Wrist                     10
Leg                       10
Neck                      10
Calf                       8
Groin                      8
Hip                        7
Thumb                      6
Abdomen                    5
Designated for Return      5
Hand                       5
Elbow                      4
Ribs                       4
Chest                      4
Quadriceps                 4
Clavicle                   3
Thigh                      3
Pectoral                   3
Toe                        3
Shin                       1
Oblique                    1
Finger                     1
Kidney                     1
Arm                        1
Biceps                     1
Lower Body                 1
Hernia        
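The `Designated for Return` entries in the counts above come from transactions such as `Placed on Injured Reserve (Designated for Return) (Ankle)`, where the first parenthetical is a roster designation rather than an injury. A possible refinement, sketched here with the standard `re` module, is to capture only the last parenthetical. The helper name `extract_injury` is hypothetical, and the sketch assumes the injury always comes last when more than one parenthetical is present.

```python
import re

# Match a parenthesized group only when it sits at the very end of the string
LAST_PAREN = re.compile(r"\(([^()]+)\)\s*$")


def extract_injury(transaction):
    """Return the text of the final parenthetical, or None if there is none."""
    match = LAST_PAREN.search(transaction)
    return match.group(1) if match else None


print(extract_injury("Placed on Injured Reserve (Designated for Return) (Ankle)"))
print(extract_injury("Placed on Injured Reserve (Foot)"))
print(extract_injury("Placed on Injured Reserve"))
```

The same anchored pattern could be passed to `Series.str.extract`, though that would change the counts shown above.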

In [337]:
test = p_ir.set_index('player')
test.loc['Darrell Henderson']

Unnamed: 0_level_0,index,year,round,pick,team_x,page,position,season,transdate,team_y,transaction,trans_year,trans_month,tran_month,season_num,injury
player,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
Darrell Henderson,1343,2019,3,70,Los Angeles Rams,/players/darrell-henderson-hendeda01,RB,2019,2019-12-28,LA Rams (NFL),Placed on Injured Reserve (Ankle),2019.0,12.0,12.0,1,Ankle
Darrell Henderson,1343,2019,3,70,Los Angeles Rams,/players/darrell-henderson-hendeda01,RB,2020,2020-12-29,LA Rams (NFL),Placed on Injured Reserve (Ankle),2020.0,12.0,12.0,2,Ankle
Darrell Henderson,1343,2019,3,70,Los Angeles Rams,/players/darrell-henderson-hendeda01,RB,2021,2021-12-28,LA Rams (NFL),Placed on Injured Reserve (Knee),2021.0,12.0,12.0,3,Knee


In [347]:
p_ir



Unnamed: 0,index,year,round,pick,team_x,player,page,position,season,transdate,team_y,transaction,trans_year,trans_month,tran_month,season_num,injury
0,2,2014,1,3,Jacksonville Jaguars,Blake Bortles,/players/blake-bortles-bortlbl01,QB,2014,,,,,,9.0,1,
1,2,2014,1,3,Jacksonville Jaguars,Blake Bortles,/players/blake-bortles-bortlbl01,QB,2015,,,,,,9.0,2,
2,2,2014,1,3,Jacksonville Jaguars,Blake Bortles,/players/blake-bortles-bortlbl01,QB,2016,,,,,,9.0,3,
3,2,2014,1,3,Jacksonville Jaguars,Blake Bortles,/players/blake-bortles-bortlbl01,QB,2017,,,,,,9.0,4,
4,2,2014,1,3,Jacksonville Jaguars,Blake Bortles,/players/blake-bortles-bortlbl01,QB,2018,,,,,,9.0,5,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2940,2031,2021,7,249,Los Angeles Rams,Bennett Skowronek,/players/bennett-skowronek-skowrbe01,WR,2021,,,,,,9.0,1,
2941,2037,2021,7,255,New Orleans Saints,Kawaan Baker,/players/kawaan-baker-bakerka01,WR,2021,,,,,,9.0,1,
2942,2038,2021,7,256,Green Bay Packers,Kylin Hill,/players/kylin-hill-hillky02,RB,2021,2021-11-01,Green Bay (NFL),Placed on Injured Reserve (Knee),2021.0,11.0,11.0,1,Knee
2943,2039,2021,7,257,Detroit Lions,Jermar Jefferson,/players/jermar-jefferson-jeffeje01,RB,2021,,,,,,9.0,1,


In [390]:
career = p_ir.groupby(['player','season_num'])['injury'].first().unstack()

An interesting thing happened: every season in which no injury transaction was captured shows None, while any season the player didn't play shows NaN. This makes a nice division between seasons in which the player was healthy (None) and the point where their career sequence ends (NaN). Though this wasn't intentional, it's certainly appreciated.

I'm going to update each series (column), replacing every None value with Healthy.
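Because the None-versus-NaN split is a side effect of how pandas happened to fill the table, it may not hold across pandas versions or after a round trip through CSV. The sketch below shows one alternative that does not rely on it: track separately which player-seasons exist, then label only those as Healthy. The toy frame and the names `injury`, `played`, and `career_toy` are assumptions for illustration.

```python
import pandas as pd

# Toy player-season rows: player A has three seasons, player B has one
rows = pd.DataFrame({
    "player": ["A", "A", "A", "B"],
    "season_num": [1, 2, 3, 1],
    "injury": [None, "Knee", None, None],
})
keys = ["player", "season_num"]

# First recorded injury per player-season (missing when there was none)
injury = rows.groupby(keys)["injury"].first().unstack()
# True where a row exists for that player-season at all
played = rows.groupby(keys).size().unstack().notna()

# Keep injuries and never-played gaps as they are; everything else is Healthy
career_toy = injury.where(injury.notna() | ~played, "Healthy")
print(career_toy)
```

Player A comes out as Healthy, Knee, Healthy, while player B is Healthy in season 1 and stays missing for seasons 2 and 3.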

In [394]:
for i in range(1,9):
    # Compare to None by identity; NaN (season not played) is left untouched
    career[i] = career[i].apply(lambda x: "Healthy" if x is None else x)

In [396]:
# Let's save off our work so we don't go into a tirade when we lose our work/forgot what we did.
career.to_csv('ir_career.csv')

In [401]:
career[1].value_counts()

Healthy                  518
Knee                      33
Ankle                     19
Hamstring                 12
Shoulder                  10
Foot                       7
Undisclosed                6
Concussion                 5
Designated for Return      5
Thumb                      5
Back                       5
Achilles                   3
Leg                        2
Quadriceps                 2
Neck                       2
Wrist                      2
Groin                      2
Abdomen                    1
Hand                       1
Elbow                      1
Shin                       1
Upper Body                 1
Hernia                     1
Name: 1, dtype: int64

In [414]:
career[(career[1]=='Healthy')  &  
       (career[2]=='Knee') &
      (career[3]=='Healthy') &
      (career[4]=='Healthy')]

season_num,1,2,3,4,5,6,7,8
player,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Aaron Jones,Healthy,Knee,Healthy,Healthy,Healthy,,,
Beau Sandland,Healthy,Knee,Healthy,Healthy,Healthy,Healthy,,
Carson Wentz,Healthy,Knee,Healthy,Healthy,Healthy,Healthy,,
Cooper Kupp,Healthy,Knee,Healthy,Healthy,Healthy,,,
Darius Jackson,Healthy,Knee,Healthy,Healthy,Healthy,Healthy,,
Devin Funchess,Healthy,Knee,Healthy,Healthy,Clavicle,Healthy,Hamstring,
Elijah Hood,Healthy,Knee,Healthy,Healthy,Healthy,,,
Jamal Agnew,Healthy,Knee,Healthy,Healthy,Hip,,,
Jonnu Smith,Healthy,Knee,Healthy,Healthy,Healthy,,,
Kelvin Benjamin,Healthy,Knee,Healthy,Healthy,Healthy,Healthy,Healthy,Healthy
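
Chaining one boolean condition per season gets long as the pattern grows. As a rough sketch, the same lookup can be written as a helper that takes the first few seasons of a career as a list. The helper name `match_career_prefix` and the toy data are hypothetical, not part of the original notebook.

```python
import pandas as pd

# Toy career table: rows are players, columns are season numbers
career = pd.DataFrame(
    {
        1: ["Healthy", "Healthy", "Knee"],
        2: ["Knee", "Healthy", "Healthy"],
        3: ["Healthy", "Healthy", None],
    },
    index=pd.Index(["Player A", "Player B", "Player C"], name="player"),
)

def match_career_prefix(career, pattern):
    """Return players whose first len(pattern) seasons equal `pattern`."""
    seasons = list(range(1, len(pattern) + 1))
    expected = pd.Series(pattern, index=seasons)
    # Compare each season column against its expected label, row by row
    hits = career[seasons].eq(expected, axis="columns").all(axis=1)
    return career[hits]

matches = match_career_prefix(career, ["Healthy", "Knee", "Healthy"])
print(matches)
```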


In [410]:
career.groupby([])[[1,2,3,4,5,6,7,8]].agg(['count'])

KeyError: 'injury'
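
Assuming the goal of the failing cell was a count of each status per season, one possible approach is to reshape the career table to long form and cross-tabulate. This is only a sketch on made-up data; the column name `status` is my own choice.

```python
import pandas as pd

# Toy career table with the same layout as `career` (players x season numbers)
career = pd.DataFrame(
    {1: ["Healthy", "Knee", "Healthy"], 2: ["Knee", "Healthy", None]},
    index=pd.Index(["Player A", "Player B", "Player C"], name="player"),
)
career.columns.name = "season_num"

# Long form: one row per player-season, dropping seasons that were not played
long = (
    career.reset_index()
    .melt(id_vars="player", var_name="season_num", value_name="status")
    .dropna(subset=["status"])
)

# Rows are statuses, columns are seasons, cells are player counts
counts = pd.crosstab(long["status"], long["season_num"])
print(counts)
```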

In [101]:
# I see there are cases where "Placed on Injured Reserve" has no injury associated. However, this points out another concern: most of these are not NFL teams (see team).
# We could do an analysis on these, but I want to keep it specific to the NFL, so I'll filter out the other leagues.

# na=False treats a missing team as "not NFL"; .copy() avoids a SettingWithCopyWarning on later assignments
ir = ir[ir['team'].str.contains('NFL', na=False)].copy()
ir.shape

(526, 14)

In [102]:
ir['injury'] = ir['injury'].fillna('Undisclosed')
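
For reference, here is a small sketch of how the league filter and the injury label could fit together. The extraction step is an assumption on my part: it pulls the text inside the trailing parentheses of `transaction`, which matches rows such as "Placed on Injured Reserve (Foot)" shown in this notebook. The toy rows, including the non-NFL team, are made up.

```python
import pandas as pd

ir = pd.DataFrame({
    "team": ["Buffalo (NFL)", "Toronto (CFL)", None, "Detroit (NFL)"],
    "transaction": [
        "Placed on Injured Reserve (Foot)",
        "Placed on Injured Reserve (Knee)",
        "Placed on Injured Reserve",
        "Placed on Injured Reserve",
    ],
})

# Keep NFL rows only; rows with a missing team are treated as non-NFL
nfl = ir[ir["team"].str.contains("NFL", na=False)].copy()

# Take the text inside the final parentheses as the injury, if present
nfl["injury"] = (
    nfl["transaction"]
    .str.extract(r"\(([^)]+)\)\s*$", expand=False)
    .fillna("Undisclosed")
)
print(nfl)
```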

In [118]:
df

Unnamed: 0,year,round,pick,team,player,page,position
0,2014,1,1,Houston Texans,Jadeveon Clowney,/players/jadeveon-clowney-clownja01,DE
1,2014,1,2,St. Louis Rams,Greg Robinson,/players/greg-robinson-robingr05,OT
2,2014,1,3,Jacksonville Jaguars,Blake Bortles,/players/blake-bortles-bortlbl01,QB
3,2014,1,4,Buffalo Bills,Sammy Watkins,/players/sammy-watkins-watkisa01,WR
4,2014,1,5,Oakland Raiders,Khalil Mack,/players/khalil-mack-mackkh01,LB
...,...,...,...,...,...,...,...
2299,2022,7,258,Green Bay Packers,Samori Toure,/players/samori-toure-touresa01,WR
2300,2022,7,259,Kansas City Chiefs,Nazeeh Johnson,/players/nazeeh-johnson-johnsna05,DB
2301,2022,7,260,Los Angeles Chargers,Alexander Horvath,/players/alexander-horvath-horvaal01,RB
2302,2022,7,261,Los Angeles Rams,AJ Arcuri,/players/aj-arcuri-arcuraj01,OT


In [117]:
ir.head()

Unnamed: 0,year,round,pick,draft_team,player,page,position,transdate,team,transaction,season_num,trans_month,trans_year,injury
29,2014,1,4,Buffalo Bills,Sammy Watkins,/players/sammy-watkins-watkisa01,WR,2016-09-30,Buffalo (NFL),Placed on Injured Reserve (Foot),2,9,2016,Foot
33,2014,1,7,Tampa Bay Buccaneers,Mike Evans,/players/mike-evans-evansmi03,WR,2019-12-18,Tampa Bay (NFL),Placed on Injured Reserve (Hamstring),5,12,2019,Hamstring
37,2014,1,10,Detroit Lions,Eric Ebron,/players/eric-ebron-ebroner01,TE,2021-11-27,Pittsburgh (NFL),Placed on Injured Reserve (Knee),7,11,2021,Knee
41,2014,1,10,Detroit Lions,Eric Ebron,/players/eric-ebron-ebroner01,TE,2019-11-25,Indianapolis (NFL),Placed on Injured Reserve (Ankle),5,11,2019,Ankle
49,2014,1,12,New York Giants,Odell Beckham Jr.,/players/odell-beckham-beckhod01,WR,2020-10-27,Cleveland (NFL),Placed on Injured Reserve (Knee),6,10,2020,Knee


In [121]:
ircols = ['year','round','pick','draft_team','player','position','injury','season_num'] 
comb = df.merge(ir[ircols], how='left', on=['year','round','pick','player'])

In [122]:
comb

Unnamed: 0,year,round,pick,team,player,page,position_x,draft_team,position_y,injury,season_num
0,2014,1,1,Houston Texans,Jadeveon Clowney,/players/jadeveon-clowney-clownja01,DE,,,,
1,2014,1,2,St. Louis Rams,Greg Robinson,/players/greg-robinson-robingr05,OT,,,,
2,2014,1,3,Jacksonville Jaguars,Blake Bortles,/players/blake-bortles-bortlbl01,QB,,,,
3,2014,1,4,Buffalo Bills,Sammy Watkins,/players/sammy-watkins-watkisa01,WR,Buffalo Bills,WR,Foot,2.0
4,2014,1,5,Oakland Raiders,Khalil Mack,/players/khalil-mack-mackkh01,LB,,,,
...,...,...,...,...,...,...,...,...,...,...,...
2484,2022,7,258,Green Bay Packers,Samori Toure,/players/samori-toure-touresa01,WR,,,,
2485,2022,7,259,Kansas City Chiefs,Nazeeh Johnson,/players/nazeeh-johnson-johnsna05,DB,,,,
2486,2022,7,260,Los Angeles Chargers,Alexander Horvath,/players/alexander-horvath-horvaal01,RB,,,,
2487,2022,7,261,Los Angeles Rams,AJ Arcuri,/players/aj-arcuri-arcuraj01,OT,,,,
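
Two things stand out in the merged output: `position` is duplicated as `position_x`/`position_y`, and the last row label grows from 2302 in `df` to 2487 in `comb`, presumably because a player with several IR placements matches more than once. A minimal sketch of a tighter merge, on made-up rows, keeps only the key columns plus the new fields from `ir` and makes the one-to-many relationship explicit.

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2014, 2014, 2014],
    "round": [1, 1, 1],
    "pick": [1, 4, 10],
    "player": ["Player A", "Player B", "Player C"],
    "position": ["DE", "WR", "TE"],
})
ir = pd.DataFrame({
    "year": [2014, 2014, 2014],
    "round": [1, 1, 1],
    "pick": [4, 10, 10],
    "player": ["Player B", "Player C", "Player C"],
    "position": ["WR", "TE", "TE"],
    "injury": ["Foot", "Knee", "Ankle"],
    "season_num": [2, 7, 5],
})

keys = ["year", "round", "pick", "player"]
comb = df.merge(
    ir[keys + ["injury", "season_num"]],  # leave out columns df already has
    how="left",
    on=keys,
    validate="one_to_many",  # each draftee may have many IR rows, not the reverse
    indicator=True,          # adds _merge: 'both' or 'left_only'
)
print(comb)
```

Rows marked `left_only` in `_merge` are draftees with no IR transaction on record.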


In [115]:
# Count IR placements per (season_num, injury) pair and name the count column 'total' directly
ir_agg = ir.groupby(['season_num','injury']).size().reset_index(name='total')

Unnamed: 0,season_num,injury,total
0,0,Abdomen,1
1,0,Achilles,3
2,0,Ankle,20
3,0,Back,5
4,0,Concussion,5
...,...,...,...
122,6,Quadriceps,1
123,7,Concussion,2
124,7,Hamstring,2
125,7,Hip,1


In [60]:
files = [pd.read_csv(f'{x}_transactions.csv', parse_dates=['transaction_date']) for x in years]

In [61]:
trans = pd.concat(files)

In [62]:
trans.columns

Index(['player', 'transaction_date', 'team', 'transaction'], dtype='object')

In [65]:
trans.drop_duplicates(inplace=True)
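
One detail worth noting: `pd.concat` keeps each file's own row labels unless told otherwise, so the combined frame ends up with repeated index values. Below is a small sketch of the load, concatenate and de-duplicate steps with a clean index. It uses in-memory strings as stand-ins for the per-year CSV files, and the rows are made up.

```python
import io
import pandas as pd

# Made-up stand-ins for the per-year transaction files
raw_files = {
    2014: "player,transaction_date,team,transaction\n"
          "Player A,2014-10-01,Buffalo (NFL),Placed on Injured Reserve (Foot)\n",
    2015: "player,transaction_date,team,transaction\n"
          "Player A,2014-10-01,Buffalo (NFL),Placed on Injured Reserve (Foot)\n"
          "Player B,2015-11-15,Detroit (NFL),Placed on Injured Reserve (Knee)\n",
}

frames = [
    pd.read_csv(io.StringIO(text), parse_dates=["transaction_date"])
    for text in raw_files.values()
]

# ignore_index avoids repeated labels; reset again after dropping duplicates
trans = (
    pd.concat(frames, ignore_index=True)
    .drop_duplicates()
    .reset_index(drop=True)
)
print(trans)
```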

In [66]:
trans.transaction_date.dt.year.value_counts()

2021    947
2020    614
2019    408
2018    394
2017    283
2022    251
2016    240
2015    166
2014     79
Name: transaction_date, dtype: int64

In [67]:
trans.to_csv('2014_2022_Draftee_Transactions.csv',index=False)

In [132]:
file = open('Players.csv','r')

In [135]:
new = [ int(x) for x in test]

In [136]:
new = np.reshape(new,-1)

In [137]:
new

array([ 167,  273,  274, ..., 3236, 3247, 3248])

In [129]:
file = open('player_lst.csv','r')


In [139]:
import json


In [147]:
lst = [str(x) for x in range(10)]

In [148]:
",".join(lst)

'0,1,2,3,4,5,6,7,8,9'

In [133]:
years = pd.Series(range(2014, 2023), name='year')

In [140]:
test = df.merge(years,how='cross', on='year')

MergeError: Can not pass on, right_on, left_on or set right_index=True or left_index=True
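
The MergeError comes from combining `how='cross'` with `on`: a cross join pairs every left row with every right row, so it takes no key. A minimal sketch of the cross join follows, assuming the aim is one row per player per season. I name the series `season` rather than `year` so it does not collide with the draft `year` column; the toy rows and the final filter are my own additions.

```python
import pandas as pd

df = pd.DataFrame({"player": ["Player A", "Player B"], "year": [2014, 2015]})
seasons = pd.Series(range(2014, 2017), name="season")

# Every player paired with every season (requires pandas >= 1.2)
player_seasons = df.merge(seasons, how="cross")

# Optional: drop seasons that fall before the player's draft year
active = player_seasons[player_seasons["season"] >= player_seasons["year"]]
print(active)
```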

In [139]:
test[test['player']=='Blake Bortles']

Unnamed: 0,year,round,pick,team,player,page,position
2,2014,1,3,Jacksonville Jaguars,Blake Bortles,/players/blake-bortles-bortlbl01,QB


In [4]:
df.head()

Unnamed: 0,year,round,pick,team,player,page,position
0,2014,1,1,Houston Texans,Jadeveon Clowney,/players/jadeveon-clowney-clownja01,DE
1,2014,1,2,St. Louis Rams,Greg Robinson,/players/greg-robinson-robingr05,OT
2,2014,1,3,Jacksonville Jaguars,Blake Bortles,/players/blake-bortles-bortlbl01,QB
3,2014,1,4,Buffalo Bills,Sammy Watkins,/players/sammy-watkins-watkisa01,WR
4,2014,1,5,Oakland Raiders,Khalil Mack,/players/khalil-mack-mackkh01,LB




Unnamed: 0,year,round,pick,team,player,page,position,season
0,2014,1,1,Houston Texans,Jadeveon Clowney,/players/jadeveon-clowney-clownja01,DE,2014
0,2014,1,1,Houston Texans,Jadeveon Clowney,/players/jadeveon-clowney-clownja01,DE,2014
0,2014,1,1,Houston Texans,Jadeveon Clowney,/players/jadeveon-clowney-clownja01,DE,2014
0,2014,1,1,Houston Texans,Jadeveon Clowney,/players/jadeveon-clowney-clownja01,DE,2014
0,2014,1,1,Houston Texans,Jadeveon Clowney,/players/jadeveon-clowney-clownja01,DE,2014
...,...,...,...,...,...,...,...,...
2303,2022,7,262,San Francisco 49ers,Brock Purdy,/players/brock-purdy-purdybr01,QB,2022
2303,2022,7,262,San Francisco 49ers,Brock Purdy,/players/brock-purdy-purdybr01,QB,2022
2303,2022,7,262,San Francisco 49ers,Brock Purdy,/players/brock-purdy-purdybr01,QB,2022
2303,2022,7,262,San Francisco 49ers,Brock Purdy,/players/brock-purdy-purdybr01,QB,2022


In [30]:
pd.DataFrame(d)

Unnamed: 0,year,round,pick,team,player,page,position
0,2014,1,1,Houston Texans,Jadeveon Clowney,/players/jadeveon-clowney-clownja01,DE
0,2014,1,1,Houston Texans,Jadeveon Clowney,/players/jadeveon-clowney-clownja01,DE
0,2014,1,1,Houston Texans,Jadeveon Clowney,/players/jadeveon-clowney-clownja01,DE
0,2014,1,1,Houston Texans,Jadeveon Clowney,/players/jadeveon-clowney-clownja01,DE
0,2014,1,1,Houston Texans,Jadeveon Clowney,/players/jadeveon-clowney-clownja01,DE
...,...,...,...,...,...,...,...
2303,2022,7,262,San Francisco 49ers,Brock Purdy,/players/brock-purdy-purdybr01,QB
2303,2022,7,262,San Francisco 49ers,Brock Purdy,/players/brock-purdy-purdybr01,QB
2303,2022,7,262,San Francisco 49ers,Brock Purdy,/players/brock-purdy-purdybr01,QB
2303,2022,7,262,San Francisco 49ers,Brock Purdy,/players/brock-purdy-purdybr01,QB


In [34]:
d['season'] = pd.Series(len(df.player.unique()) *  [2014,2015,2016,2017,2018,2019,2020,2021,2022])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  d['season'] = pd.Series(len(df.player.unique()) *  [2014,2015,2016,2017,2018,2019,2020,2021,2022])
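
Beyond the warning itself, assigning a `pd.Series` to a column aligns on index labels rather than on row position. If `d` was built by repeating each row of `df` (an assumption, since that cell is not shown), its index has repeated labels, so every copy of a player receives the same season. That would explain the repeated 2014 rows in the output shown earlier. Here is a sketch of the problem and a positional fix on made-up data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"player": ["Player A", "Player B"]})
seasons = [2014, 2015, 2016]

# Repeat each player once per season; .copy() gives an independent frame
d = df.loc[df.index.repeat(len(seasons))].copy()

# Label-aligned: every row labeled 0 gets the Series value at label 0, and so on
d["season_aligned"] = pd.Series(len(df) * seasons)

# Positional: a plain array is assigned row by row
d["season"] = np.tile(seasons, len(df))
print(d)
```

Working on an explicit `.copy()` is also what keeps the SettingWithCopyWarning from appearing.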


In [38]:
trans

NameError: name 'trans' is not defined