# Collecting Table Draft Data Notebook

This notebook collects data from the Pro-Football-Reference website. Our goal is to build a table of drafted wide receivers (WRs) that includes data from their time in the NFL and at the college level. To do so, we scrape the draft table from the website into **draft_table**, which holds basic information about each player. We then scrape more data from each player's NFL and college webpages and store it in additional columns of the corresponding row (step 0).

We then clean the generated data and verify its validity. Issues arise when webpages are missing or when the scraped NFL/college data is incorrect (hidden tables, disordered data, etc.). To keep track of how and where the additional data was generated, we also add comment columns for each player, with messages describing how the data was scraped.

# Summary

We scrape the Pro-Football-Reference website as follows. There are several sets of potential errors, so we clean the data one step at a time:

### 0) First, generate the data:
- tabledraft scrapes the draft table from the soup document and returns all drafted players, with the columns in list_of_drafts, as table_draft
- cfb_url_fill_in generates the cfb url for the players whose url is missing
- expand_draft_table adds the columns in list_of_added_columns
- using the urls in table_draft, collegestats and nflstats work similarly and generate the tables for college football and the NFL respectively
- row_cfb and row_nfl both pick the aggregate row from the provided table. They handle the cases of a player who played for several teams or who did not play at all
- main puts it all together (this is the only command executed in this notebook). Note that we also record comments for each row: the caption of the nfl table and the method used. If numbers are generated, the method is SUCCESS FROM MAIN, otherwise FAIL (for example if the link is broken)
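
The success/fail bookkeeping described in the last bullet can be pictured roughly as follows. This is only a minimal sketch, not the module's code: `scrape_with_status`, `fetch` and `broken_fetch` are hypothetical names, the example URLs are placeholders, and the status strings are the ones that show up in the 'cfb method' column later in this notebook.

```python
def scrape_with_status(fetch, url):
    """Run a scraper and report how it went (hypothetical helper).

    fetch is any callable that takes a URL and returns a list of stats,
    raising an exception when the page is missing or cannot be parsed.
    """
    try:
        stats = fetch(url)
    except Exception:
        # Broken link, missing table, unexpected layout, ...
        return None, "fail (eg link broken)"
    if not stats:
        # Nothing was generated: treated as a failure too (assumption).
        return None, "fail (eg link broken)"
    return stats, "success"


def broken_fetch(url):
    """Stand-in for a scraper that hits a dead link."""
    raise IOError("page not found")


print(scrape_with_status(lambda url: ["Oklahoma", "Big 12"], "https://example.org/ok"))
print(scrape_with_status(broken_fetch, "https://example.org/missing"))
```

Returning the message next to the data keeps each row self-describing, which is what makes the later cleaning steps able to target only the rows that failed.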

### 1) 1st set of cleaning:
- if the nfl table caption doesn't contain the appropriate keywords, we must reconsider the nfl web page
- we use webdriver.Chrome to find hidden tables
- as soon as we find a table with an appropriate caption, we select it
- we re-run nflstats on it
- we update the method used: SUCCESS FROM MAIN1. If the code failed to capture the appropriate table, we write FAIL AFTER MAIN1
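
The caption check and the "first table that fits" rule above could look something like this. It is a sketch under assumptions: the keywords are guessed from the 'Receiving & Rushing Table' caption seen in this notebook, `has_valid_caption` and `pick_table` are hypothetical names, the lowercase success message is assumed by analogy with 'fail after main1', and the tables are plain placeholders instead of scraped objects.

```python
# Assumed keywords: the table we want is captioned "Receiving & Rushing"
# (or "Rushing & Receiving" when the two halves are inverted).
KEYWORDS = ("receiving", "rushing")


def has_valid_caption(caption):
    """True when the caption mentions both receiving and rushing."""
    text = caption.lower()
    return all(word in text for word in KEYWORDS)


def pick_table(tables):
    """Return (caption, table, method) for the first table with a valid caption.

    tables is a list of (caption, table) pairs, e.g. everything found on the
    page once the hidden tables have been revealed.
    """
    for caption, table in tables:
        if has_valid_caption(caption):
            return caption, table, "success from main1"
    return None, None, "fail after main1"


page_tables = [
    ("Kick & Punt Returns Table", "returns-data"),
    ("Receiving & Rushing Table", "receiving-data"),
]
print(pick_table(page_tables))      # second table is selected
print(pick_table(page_tables[:1]))  # no suitable table on the page
```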

### 2) 2nd set of cleaning:
- if the yards from the original draft table ('receiving yards') don't match those from the nfl table ('nfl yards'), something is wrong. We identify those mismatches using identifier_for_main2 and print the URLs of the corresponding nfl tables
- we manually inspect each nfl table. Very frequently, the mismatch arises because the nfl table has the rushing/receiving stats switched. For those, we run main2, which does the flip automatically. If this resolves the mismatch, the method reads SUCCESS FROM MAIN2, otherwise FAIL AFTER MAIN2
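
As an illustration of the comparison in this step, here is a small sketch that works on a list of dicts instead of the notebook's DataFrame so that it stays self-contained. `find_yard_mismatches` is a hypothetical stand-in (it is not identifier_for_main2), the three players are made up, and skipping rows with missing values is an assumption.

```python
def find_yard_mismatches(rows):
    """Return the indices of rows whose two yardage figures disagree.

    rows is a list of dicts with 'receiving yards' (draft table) and
    'nfl yards' (player page). Rows where either value is missing are
    skipped, which is an assumption of this sketch.
    """
    mismatches = []
    for index, row in enumerate(rows):
        draft_yards, nfl_yards = row["receiving yards"], row["nfl yards"]
        if draft_yards is None or nfl_yards is None:
            continue
        # Scraped values may be strings, so compare them as numbers.
        if float(draft_yards) != float(nfl_yards):
            mismatches.append(index)
    return mismatches


# Made-up rows, for illustration only.
players = [
    {"player": "A", "receiving yards": 1305, "nfl yards": "1305", "nfl url": "url-a"},
    {"player": "B", "receiving yards": 40, "nfl yards": "512", "nfl url": "url-b"},
    {"player": "C", "receiving yards": None, "nfl yards": None, "nfl url": "url-c"},
]
for index in find_yard_mismatches(players):
    print(index, players[index]["nfl url"])
```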

### 3) 3rd set of cleaning:
- we once again compare 'receiving yards' and 'nfl yards'. If they still don't match, we print the url and fix the nfl tables manually. We feed those mismatches and the list of manually edited nfl tables into main3

### Importing the module 
Collecting_Data_TableDraft_Functions.py (to be initialized)

In [1]:
import Collecting_Data_TableDraft_Functions
import importlib
importlib.reload(Collecting_Data_TableDraft_Functions)
from Collecting_Data_TableDraft_Functions import *

# Getting the draft table and exporting it

In [9]:
#url_wr_draft_1 = 'https://www.pro-football-reference.com/play-index/draft-finder.cgi?request=1&year_min=1936&year_max=2019&type=&round_min=1&round_max=30&slot_min=1&slot_max=500&league_id=&team_id=&pos%5B%5D=WR&college_id=all&conference=any&show=all&offset=0'
#table_draft_1 = main(url_wr_draft_1)

#url_wr_draft_2 = 'https://www.pro-football-reference.com/play-index/draft-finder.cgi?request=1&year_min=1936&year_max=2019&type=&round_min=1&round_max=30&slot_min=1&slot_max=500&league_id=&team_id=&pos%5B%5D=WR&college_id=all&conference=any&show=all&offset=300'
#table_draft_2 = main(url_wr_draft_2)

url_wr_draft_3 = 'https://www.pro-football-reference.com/play-index/draft-finder.cgi?request=1&year_min=1936&year_max=2019&type=&round_min=1&round_max=30&slot_min=1&slot_max=500&league_id=&team_id=&pos%5B%5D=WR&college_id=all&conference=any&show=all&offset=600'
table_draft_3 = main(url_wr_draft_3)
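
The three query URLs above differ only in their `offset` parameter (0, 300, 600). A possible way to generate them instead of pasting each one, assuming the same query string throughout; `BASE_URL` and `draft_page_urls` are hypothetical names, not part of the module:

```python
# Same query as in the cell above, with the offset left as a placeholder.
BASE_URL = (
    "https://www.pro-football-reference.com/play-index/draft-finder.cgi"
    "?request=1&year_min=1936&year_max=2019&type=&round_min=1&round_max=30"
    "&slot_min=1&slot_max=500&league_id=&team_id=&pos%5B%5D=WR"
    "&college_id=all&conference=any&show=all&offset={offset}"
)


def draft_page_urls(offsets=(0, 300, 600)):
    """Build one draft-finder URL per result page."""
    return [BASE_URL.format(offset=offset) for offset in offsets]


for page_url in draft_page_urls():
    print(page_url)
```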

In [10]:
#table_draft_1.to_csv(r'data_sets/wr_draft_1_complete.csv',index=False)
#table_draft_2.to_csv(r'data_sets/wr_draft_2_complete.csv',index=False)
table_draft_3.to_csv(r'data_sets/wr_draft_3_complete.csv',index=False)

In [11]:
td1_0 = pd.read_csv('data_sets/wr_draft_3_complete.csv')

## 1st Set of Cleaning: making sure wrong NFL tables are replaced

In [12]:
td1 = main1(td1_0)

In [13]:
td1.to_csv(r'data_sets/wr_draft_3_complete1.csv',index=False)

In [14]:
td1  = pd.read_csv('data_sets/wr_draft_3_complete1.csv')

In [16]:
#wrong_nfl = check_nfl_table(table_draft_1)
#display(table_draft_1.loc[still_wrong_nfl])
still_wrong_nfl = check_nfl_table(td1)

for x in td1.loc[still_wrong_nfl,'nfl url']:
    print(x)

https://www.pro-football-reference.com/players/W/WyatAn20.htm
https://www.pro-football-reference.com/players/G/GaitTo00.htm
https://www.pro-football-reference.com/players/S/SilvNi20.htm
https://www.pro-football-reference.com/players/M/MillEd21.htm
https://www.pro-football-reference.com/players/W/WestRo20.htm
https://www.pro-football-reference.com/players/S/SwanCh20.htm

### Manual Editing:
Most likely, the players in still_wrong_nfl simply have no 'Receiving & Rushing' stats.

In [17]:
#td1.columns.get_loc("nfl target")
for i in still_wrong_nfl:
    # a single NaN is broadcast to every selected column (np.nan*3 is just np.nan)
    td1.loc[i,['nfl age','nfl pos', 'nfl no']] = np.nan
    td1.iloc[i,47:71] = np.nan
display(td1.loc[still_wrong_nfl])

Unnamed: 0,year,round,pick,player,nfl url,pos,draft age,team,entry year,last year,1st team pro select,pro select,weighted career av,years as primary starter,games,games started,rushing attempts,rushing yards,rushing td,receiving attemps,receiving yards,receiving td,college,cfb url,cfb school,cfb conference,cfb class,cfb pos,cfb games,cfb receptions,cfb yards,cfb average,cfb td,cfb attemps rushing,cfb yards rushing,cfb avg rushing,cfb td rushing,cfb scrimmages,cfb yards total,cfb avg total,cfb td total,nfl age,nfl team,nfl pos,nfl no,nfl game,nfl game started,nfl target,nfl receptions,nfl yards,nfl y/r,nfl td,nfl first downs,nfl longest rec,nfl rec per game,nfl yards per game,nfl catch ratio,nfl yards per target,nfl rushes,nfl rush yards,nfl rush td,nfl first downs rush,nfl longest rush,nfl rush yards per attempt,nfl rush yards per game,nfl rush attempt per games,nfl total touches,nfl yards per touch,nfl yards from scrimmage,nfl total td,nfl fumbles,nfl av,nfl table type,cfb method,nfl method
123,1997,6,190,Antwuan Wyatt,https://www.pro-football-reference.com/players...,WR,22.0,PHI,1997.0,1997.0,0,0,0,0.0,1.0,0.0,,,,,,,Bethune-Cookman,https://www.sports-reference.com/cfb/players/a...,Clemson,ACC,,WR,32.0,81,300,3.7,1,76,972,12.8,6,157,1272,8.1,7,,PHI,,,1,,,,,,,,,,,,,,,,,,,,,,,,,,,Kick & Punt Returns Table,success,fail after main1
124,1997,6,192,Tony Gaiter,https://www.pro-football-reference.com/players...,WR,23.0,NWE,1997.0,1998.0,0,0,0,0.0,6.0,0.0,,,,,,,Miami (FL),https://www.sports-reference.com/cfb/players/t...,Miami (FL),Big East,,WR,27.0,35,679,19.4,7,19,81,4.3,0,54,760,14.1,7,,SDG,,,6,,,,,,,,,,,,,,,,,,,,,,,,,,,Kick & Punt Returns Table,success,fail after main1
154,1996,6,180,Nilo Silvan,https://www.pro-football-reference.com/players...,WR,22.0,TAM,1996.0,1996.0,0,0,0,0.0,7.0,0.0,,,,,,,Tennessee,https://www.sports-reference.com/cfb/players/n...,Tennessee,SEC,,WR,44.0,28,691,24.7,0,21,409,19.5,1,,,,,,TAM,,,7,,,,,,,,,,,,,,,,,,,,,,,,,,,Kick & Punt Returns Table,success,fail after main1
271,1992,9,225,Eddie Miller,https://www.pro-football-reference.com/players...,WR,23.0,IND,1992.0,1993.0,0,0,0,0.0,15.0,0.0,,,,,,,South Carolina,https://www.sports-reference.com/cfb/players/e...,South Carolina,Ind,,WR,44.0,76,1464,19.3,10,17,79,4.6,3,93,1543,16.6,13,,IND,,,15,,,,,,,,,,,,,,,,,,,,,,,,,,,Defense & Fumbles Table,success,fail after main1
273,1992,9,237,Ronnie West,https://www.pro-football-reference.com/players...,WR,24.0,MIN,1992.0,1992.0,0,0,0,0.0,12.0,0.0,,,,,,,Pittsburg St.,https://www.sports-reference.com/cfb/players/r...,fail,fail,fail,fail,fail,fail,fail,fail,fail,fail,fail,fail,fail,fail,fail,fail,fail,,MIN,,,12,,,,,,,,,,,,,,,,,,,,,,,,,,,Kick & Punt Returns Table,fail (eg link broken),fail after main1
288,1992,12,323,Charles Swann,https://www.pro-football-reference.com/players...,WR,21.0,NYG,1994.0,1994.0,0,0,0,0.0,13.0,0.0,,,,,,,Indiana St.,https://www.sports-reference.com/cfb/players/c...,fail,fail,fail,fail,fail,fail,fail,fail,fail,fail,fail,fail,fail,fail,fail,fail,fail,,DEN,,,13,,,,,,,,,,,,,,,,,,,,,,,,,,,Kick & Punt Returns Table,fail (eg link broken),fail after main1


## 2nd Set of cleaning: mismatch between NFL numbers

On the NFL table, the rushing stats and the receiving stats may be swapped. This issue shows up as a mismatch between the NFL numbers from the draft table and those from the NFL table.
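
To make the idea of the flip concrete, here is a sketch on a single row stored as a dict. It is not the code of main2: `swap_rushing_receiving` and `fix_if_swapped` are hypothetical names, only three column pairs are swapped for brevity, the example numbers are made up, and the lowercase method strings are assumed.

```python
# Column pairs to exchange (a subset of the 'nfl ...' columns, for brevity).
RECEIVING = ["nfl receptions", "nfl yards", "nfl td"]
RUSHING = ["nfl rushes", "nfl rush yards", "nfl rush td"]


def swap_rushing_receiving(row):
    """Return a copy of row with rushing and receiving values exchanged."""
    fixed = dict(row)
    for rec_col, rush_col in zip(RECEIVING, RUSHING):
        fixed[rec_col], fixed[rush_col] = row[rush_col], row[rec_col]
    return fixed


def fix_if_swapped(row):
    """Swap only when doing so makes 'nfl yards' match the draft table."""
    if row["nfl yards"] == row["receiving yards"]:
        return row, "unchanged"
    candidate = swap_rushing_receiving(row)
    if candidate["nfl yards"] == row["receiving yards"]:
        return candidate, "success from main2"
    return row, "fail after main2"


# Made-up player whose page listed rushing before receiving.
inverted = {
    "receiving yards": 120,
    "nfl receptions": 5, "nfl yards": 22, "nfl td": 0,
    "nfl rushes": 10, "nfl rush yards": 120, "nfl rush td": 1,
}
fixed_row, method = fix_if_swapped(inverted)
print(method, fixed_row["nfl yards"], fixed_row["nfl rush yards"])
```

Checking the swapped yards against the draft table before accepting the flip keeps genuinely different numbers from being silently "fixed".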

In [35]:
mismatch_yards = identifier_for_main2(td1)
[print(x, row['receiving yards'], row['nfl yards'], row['nfl url']) for x, row in td1.loc[mismatch_yards].iterrows()]

[]

In [9]:
recovered_mismatch = [ ]
td1 = main2(td1, recovered_mismatch, mismatch_yards)

In [38]:
mismatch_yards = identifier_for_main2(td1)
[print(x, row['nfl url']) for x, row in td1.loc[mismatch_yards].iterrows()]

[]

In [37]:
td1.to_csv(r'data_sets/wr_draft_3_complete2.csv',index=False)

In [3]:
td1 = pd.read_csv('data_sets/wr_draft_3_complete2.csv')

## 3rd Set of Cleaning: Manual editing

In [6]:
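#examples of pages with a wrong table, or with the correct table inverted
#(uncomment one url below: url must be defined before the rest of this cell runs)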

# url = 'https://www.pro-football-reference.com/players/W/WebbJo00.htm'
# url = 'https://www.pro-football-reference.com/players/L/LawrQu00.htm'
# url = 'https://www.pro-football-reference.com/players/S/SlatMa00.htm'
table_list = generate_all_nfl_tables(url)
table = find_nfl_table(table_list)
table3 = nflstats1(table)

In [7]:
td1 = main3(td1,[],[])

In [8]:
td1.to_csv(r'data_sets/wr_draft_3_complete3.csv',index=False)

# Trying out some particular cases in CFB

In [22]:
#all normal
marquise_brown_cfb_url = 'https://www.sports-reference.com/cfb/players/marquise-brown-1.html'

#multiple schools
jalen_hurd_cfb_url = 'https://www.sports-reference.com/cfb/players/jalen-hurd-1.html'

url_cfb_sample = marquise_brown_cfb_url
html =urlopen(url_cfb_sample)
soup = BeautifulSoup(html,'html.parser')

table=collegestats(soup)

display(table)
print(row_cfb(table))

Unnamed: 0,school,conference,class,pos,games,receptions,yards,average,td,attemps rushing,yards rushing,avg rushing,td rushing,scrimmages,yards total,avg total,td total
0,Oklahoma,Big 12,SO,WR,13.0,57,1095,19.2,7,1,0,0.0,0,58,1095,18.9,7
1,Oklahoma,Big 12,JR,WR,12.0,75,1318,17.6,10,2,0,0.0,0,77,1318,17.1,10
2,Oklahoma,,,,,132,2413,18.3,17,3,0,0.0,0,135,2413,17.9,17


['Oklahoma', 'Big 12', 'JR', 'WR', '25.0', '132', '2413', '18.3', '17', '3', '0', '0.0', '0', '135', '2413', '17.9', '17']
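
The list printed above combines three pieces: the first four fields of the last season, the summed games, and the remaining fields of the career row. A minimal sketch of that combination on plain lists (first nine columns only); `career_row` is a hypothetical name and this is not the code of row_cfb:

```python
def career_row(seasons, career):
    """Build one summary row from per-season rows plus the career row.

    seasons: per-season rows (lists of strings), oldest first
    career:  the totals row, whose first few columns are mostly blank
    """
    last_season = seasons[-1]
    # Column 4 is 'games': the totals row leaves it empty, so add up the seasons.
    games = sum(float(season[4]) for season in seasons)
    return last_season[:4] + [str(games)] + career[5:]


seasons = [
    ["Oklahoma", "Big 12", "SO", "WR", "13.0", "57", "1095", "19.2", "7"],
    ["Oklahoma", "Big 12", "JR", "WR", "12.0", "75", "1318", "17.6", "10"],
]
career = ["Oklahoma", "", "", "", "", "132", "2413", "18.3", "17"]
print(career_row(seasons, career))
# ['Oklahoma', 'Big 12', 'JR', 'WR', '25.0', '132', '2413', '18.3', '17']
```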


# Trying out some particular cases in the NFL

In [31]:
#all normal
mike_thomas_nfl_url = 'https://www.pro-football-reference.com/players/T/ThomMi04.htm'

#no play time
jalen_hurd_nfl_url = 'https://www.pro-football-reference.com/players/H/HurdJa00.htm'

#little play time
juwann_winfree_nfl_url = 'https://www.pro-football-reference.com/players/W/WinfJu00.htm'

#played with multiple teams
odell_beckham_jr_nfl_url = 'https://www.pro-football-reference.com/players/B/BeckOd00.htm'

#has kick and punt return stats
dwayne_harris_nfl_url = 'https://www.pro-football-reference.com/players/H/HarrDw00.htm'

#has nfl stats inverted (rushes before receiving)
ardarius_stewart_nfl_url = 'https://www.pro-football-reference.com/players/S/StewAr00.htm'

#has kick and punt return stats and nfl stats inverted
dri_archer_nfl_url = 'https://www.pro-football-reference.com/players/A/ArchDr00.htm'

In [30]:
url_nfl_sample = odell_beckham_jr_nfl_url
url = url_nfl_sample
html =urlopen(url_nfl_sample)
soup = BeautifulSoup(html,'html.parser')

table=nflstats(soup)

display(table)

#row_nfl(table)

Unnamed: 0,age,team,pos,no,game,game started,target,receptions,yards,y/r,td,first downs,longest rec,rec per game,yards per game,catch ratio,yards per target,rushes,rush yards,rush td,first downs rush,longest rush,rush yards per attempt,rush yards per game,rush attempt per games,total touches,yards per touch,yards from scrimmage,total td,fumbles,av,table type
0,22,NYG,WR,13.0,12,11,130,91,1305,14.3,12,58,80,7.6,108.8,70.0%,10.0,7,35,0,2,13,5.0,2.9,0.6,98,13.7,1340,12,1,11,Receiving & Rushing Table
1,23,NYG,WR,13.0,15,15,158,96,1450,15.1,13,68,87,6.4,96.7,60.8%,9.2,1,3,0,1,3,3.0,0.2,0.1,97,15.0,1453,13,2,13,Receiving & Rushing Table
2,24,NYG,WR,13.0,16,16,169,101,1367,13.5,10,66,75,6.3,85.4,59.8%,8.1,1,9,0,0,9,9.0,0.6,0.1,102,13.5,1376,10,3,10,Receiving & Rushing Table
3,25,NYG,wr,13.0,4,2,41,25,302,12.1,3,14,48,6.3,75.5,61.0%,7.4,1,8,0,0,8,8.0,2.0,0.3,26,11.9,310,3,0,2,Receiving & Rushing Table
4,26,NYG,WR,13.0,12,12,124,77,1052,13.7,6,50,51,6.4,87.7,62.1%,8.5,5,19,0,2,11,3.8,1.6,0.4,82,13.1,1071,6,2,9,Receiving & Rushing Table
5,27,CLE,WR,13.0,16,15,133,74,1035,14.0,4,44,89,4.6,64.7,55.6%,7.8,3,10,0,1,11,3.3,0.6,0.2,77,13.6,1045,4,1,9,Receiving & Rushing Table
6,,,,,75,71,755,464,6511,14.0,48,300,89,6.2,86.8,61.5%,8.6,18,84,0,6,13,4.7,1.1,0.2,482,13.7,6595,48,9,54,Receiving & Rushing Table
7,NYG,NYG,,,59,56,622,390,5476,14.0,44,256,87,6.6,92.8,62.7%,8.8,15,74,0,5,13,4.9,1.3,0.3,405,13.7,5550,44,8,45,Receiving & Rushing Table
8,CLE,CLE,,,16,15,133,74,1035,14.0,4,44,89,4.6,64.7,55.6%,7.8,3,10,0,1,11,3.3,0.6,0.2,77,13.6,1045,4,1,9,Receiving & Rushing Table
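
In the table above, the season rows are followed by one career row (empty age) and then one total row per team (NYG, CLE). One possible way to separate them is sketched below. This is an assumption about how the multiple-teams case could be handled, not the code of row_nfl; `is_number` and `split_rows` are hypothetical names, blank cells are assumed to be empty strings, and only three columns are kept.

```python
def is_number(value):
    """True when value parses as a number (season rows have a numeric age)."""
    try:
        float(value)
    except (TypeError, ValueError):
        return False
    return True


def split_rows(rows):
    """Separate season rows, the career row, and the per-team total rows."""
    seasons = []
    for row in rows:
        if not is_number(row["age"]):
            break  # first non-season row reached
        seasons.append(row)
    rest = rows[len(seasons):]
    career = rest[0] if rest else None
    return seasons, career, rest[1:]


# age / team / yards taken from the table above.
rows = [
    {"age": "22", "team": "NYG", "yards": "1305"},
    {"age": "23", "team": "NYG", "yards": "1450"},
    {"age": "24", "team": "NYG", "yards": "1367"},
    {"age": "25", "team": "NYG", "yards": "302"},
    {"age": "26", "team": "NYG", "yards": "1052"},
    {"age": "27", "team": "CLE", "yards": "1035"},
    {"age": "", "team": "", "yards": "6511"},
    {"age": "NYG", "team": "NYG", "yards": "5476"},
    {"age": "CLE", "team": "CLE", "yards": "1035"},
]
seasons, career, team_totals = split_rows(rows)
print(len(seasons), career["yards"], [row["team"] for row in team_totals])
# 6 6511 ['NYG', 'CLE']
```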


## Code: create the draft table without merging in the cfb and nfl tables

In [28]:
#url_wr_draft = 'https://www.pro-football-reference.com/play-index/draft-finder.cgi?request=1&year_min=1936&year_max=2019&type=&round_min=1&round_max=30&slot_min=1&slot_max=500&league_id=&team_id=&pos%5B%5D=WR&college_id=all&conference=any&show=all&offset=0'
url_wr_draft = 'https://www.pro-football-reference.com/play-index/draft-finder.cgi?request=1&year_min=1936&year_max=2019&type=&round_min=1&round_max=30&slot_min=1&slot_max=500&league_id=&team_id=&pos%5B%5D=WR&college_id=all&conference=any&show=all&offset=300'
html =urlopen(url_wr_draft)
soup = BeautifulSoup(html,'html.parser')

In [29]:
table0=tabledraft(soup)

table1 = cfb_url_fill_in(table0)
table1 = expend_draft_table(table1)

## Code: merging Julio Jones' cfb/nfl data into the draft table

In [469]:
#index 255
table1=table_draft_1
#select row and column in a single .loc call instead of chained indexing
julio_jones_cfb_url = table1.loc[table1['player'] == 'Julio Jones', 'cfb url'].item()
print(julio_jones_cfb_url)
julio_jones_nfl_url = table1.loc[table1['player'] == 'Julio Jones', 'nfl url'].item()
print(julio_jones_nfl_url)

https://www.sports-reference.com/cfb/players/julio-jones-1.html
https://www.pro-football-reference.com/players/J/JoneJu02.htm




In [470]:
url_cfb_sample = julio_jones_cfb_url
html = urlopen(url_cfb_sample)
soup = BeautifulSoup(html,'html.parser')

table2=collegestats(soup)
display(table2)
print(row_cfb(table2))

Unnamed: 0,school,conference,class,pos,games,receptions,yards,average,td,attemps rushing,yards rushing,avg rushing,td rushing,scrimmages,yards total,avg total,td total
0,Alabama,SEC,FR,WR,14.0,58,924,15.9,4,,,,,58,924,15.9,4
1,Alabama,SEC,SO,WR,13.0,43,596,13.9,4,2.0,4.0,2.0,0.0,45,600,13.3,4
2,Alabama,SEC,JR,WR,13.0,78,1133,14.5,7,8.0,135.0,16.9,2.0,86,1268,14.7,9
3,Alabama,,,,,179,2653,14.8,15,10.0,139.0,13.9,2.0,189,2792,14.8,17


['Alabama', 'SEC', 'JR', 'WR', '40.0', '179', '2653', '14.8', '15', '10', '139', '13.9', '2', '189', '2792', '14.8', '17']


In [19]:
url_nfl_sample = julio_jones_nfl_url
html =urlopen(url_nfl_sample)
soup = BeautifulSoup(html,'html.parser')

table3=nflstats(soup)
display(table3)
print(row_nfl(table3))

Unnamed: 0,age,team,pos,no,game,game started,target,receptions,yards,y/r,td,first downs,longest rec,rec per game,yards per game,catch ratio,yards per target,rushes,rush yards,rush td,first downs rush,longest rush,rush yards per attempt,rush yards per game,rush attempt per games,total touches,yards per touch,yards from scrimmage,total td,fumbles,av
0,22.0,ATL,WR,11,13,13,95,54,959.0,17.8,8,36,80.0,4.2,73.8,56.8%,10.1,6.0,56.0,0.0,4.0,19.0,9.3,4.3,0.5,60.0,16.9,1015,8,1,10.0
1,23.0,ATL,WR,11,16,15,128,79,1198.0,15.2,10,56,80.0,4.9,74.9,61.7%,9.4,6.0,30.0,0.0,3.0,18.0,5.0,1.9,0.4,85.0,14.4,1228,10,0,13.0
2,24.0,ATL,wr,11,5,5,59,41,580.0,14.1,2,25,81.0,8.2,116.0,69.5%,9.8,1.0,7.0,0.0,1.0,7.0,7.0,1.4,0.2,42.0,14.0,587,2,2,5.0
3,25.0,ATL,WR,11,15,15,163,104,1593.0,15.3,6,76,79.0,6.9,106.2,63.8%,9.8,1.0,1.0,0.0,1.0,1.0,1.0,0.1,0.1,105.0,15.2,1594,6,2,14.0
4,26.0,ATL,WR,11,16,16,203,136,1871.0,13.8,8,93,70.0,8.5,116.9,67.0%,9.2,,,,,,,,,136.0,13.8,1871,8,3,16.0
5,27.0,ATL,WR,11,14,14,129,83,1409.0,17.0,6,64,75.0,5.9,100.6,64.3%,10.9,,,,,,,,,83.0,17.0,1409,6,0,16.0
6,28.0,ATL,WR,11,16,16,148,88,1444.0,16.4,3,67,53.0,5.5,90.3,59.5%,9.8,1.0,15.0,0.0,1.0,15.0,15.0,0.9,0.1,89.0,16.4,1459,3,0,14.0
7,29.0,ATL,WR,11,16,16,170,113,1677.0,14.8,8,80,58.0,7.1,104.8,66.5%,9.9,2.0,12.0,0.0,1.0,11.0,6.0,0.8,0.1,115.0,14.7,1689,8,2,14.0
8,30.0,ATL,WR,11,15,15,157,99,1394.0,14.1,6,77,54.0,6.6,92.9,63.1%,8.9,2.0,-3.0,0.0,0.0,1.0,-1.5,-0.2,0.1,101.0,13.8,1391,6,1,11.0
9,,,,126,125,1252,797,12125,15.2,57.0,574,81,6.3,96.2,63.7%,9.7,19.0,118.0,0.0,11.0,19.0,6.2,0.9,0.2,816.0,15.0,12243.0,57,11,113,


['30', 'ATL', 'WR', '11', '126', '125', '1252', '797', '12125', '15.2', '57', '574', '81', '6.3', '96.2', '63.7%', '9.7', '19', '118', '0', '11', '19', '6.2', '0.9', '0.2', '816', '15.0', '12243', '57', '11', '113']


In [346]:
#table2.iloc[-2][0:4].tolist()+ [str(table2['games'].apply(pd.to_numeric).iloc[:-1].sum())] + table2.iloc[-1][5:].tolist()
#table3.iloc[-2][0:4].tolist()+ table3.iloc[-1][4:].tolist()


#table1.loc[255,24:72] = table2.iloc[-2][0:4].tolist()+ [str(table2['games'].apply(pd.to_numeric).iloc[:-1].sum())] + table2.iloc[-1][5:].tolist()+table3.iloc[-2][0:4].tolist()+ table3.iloc[-1][4:].tolist()

## Useful commands
Do not run unless initialized

In [5]:
os.getcwd()

url_wr_draft_1 = 'https://www.pro-football-reference.com/play-index/draft-finder.cgi?request=1&year_min=1936&year_max=2019&type=&round_min=1&round_max=30&slot_min=1&slot_max=500&league_id=&team_id=&pos%5B%5D=WR&college_id=all&conference=any&show=all&offset=0'
html =urlopen(url_wr_draft_1)
soup = BeautifulSoup(html,'html.parser')

wb.open_new_tab(url_wr_draft_1)
table_0 = soup.find_all('table')[0]
rowx= table_0.find_all('tr')[4]
columnx = rowx.find_all('td')[3]
linkx = columnx.find_all('a')[0]
print('https://www.pro-football-reference.com'+linkx.get('href'))

#list of added columns
#['cfb '+ x for x in table2.columns.tolist()] + ['nfl '+ x for x in table3.columns.tolist()]

https://www.pro-football-reference.com/players/S/SamuDe00.htm
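
The cell above builds a player URL by prepending the site root to the link's relative `href`. The same step is sketched here with only the standard library (`html.parser` and `urljoin` swapped in for BeautifulSoup and string concatenation), so that it runs without network access. `LinkCollector` is a hypothetical helper and the HTML snippet is a simplified stand-in for a cell of the draft table.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE = "https://www.pro-football-reference.com"


class LinkCollector(HTMLParser):
    """Collect the absolute URL of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # urljoin resolves the relative href against the site root.
                self.links.append(urljoin(BASE, href))


# Simplified stand-in for one <td> of the draft table.
cell_html = '<td><a href="/players/S/SamuDe00.htm">player name</a></td>'
collector = LinkCollector()
collector.feed(cell_html)
print(collector.links[0])
# https://www.pro-football-reference.com/players/S/SamuDe00.htm
```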
