# Collecting Draft Data Notebook

This notebook collects data from the **Pro-Football-Reference website**. Our goal is to build a table of drafted WRs that includes data from their time in the NFL and at the college level (CFB). To do so, we use the package **Beautiful Soup** to scrape the draft table from Pro-Football-Reference, which contains basic information about each player, including the URLs of webpages that hold additional NFL and CFB data. We then scrape these additional NFL and CFB tables for each player and store the results in the columns of the corresponding row. All the data is then handled with **pandas** and **numpy**.

The main difficulty is how non-uniformly the data is laid out in each table. For example, CFB data is usually stored in a 'Receiving & Rushing Table' (the standard case), but for some players the table is labelled 'Rushing & Receiving Table' instead, meaning the rushing and receiving stats are swapped. Additionally, some of our tables of interest are hidden from the Beautiful Soup document; they're located further down the webpages. For those, we use the Chrome Webdriver, which is a heavier scraping method than Beautiful Soup.

### Importing the module
Our user-defined functions are stored in **Collecting_Data_TableDraft_Functions.py**, which needs to be imported (and reloaded after any change to the file).

In [1]:
import Collecting_Data_TableDraft_Functions
import importlib
importlib.reload(Collecting_Data_TableDraft_Functions)
from Collecting_Data_TableDraft_Functions import *

## Scraping Draft, CFB and NFL

Our main function takes the draft url as input and operates as follows:

1) Scrape draft table using BeautifulSoup and store in Pandas dataframe.

2) Store cfb, nfl urls for each player (each row).

3) Scrape cfb table using cfb url. Watch for possibilities of swapped receiving / rushing stats. Watch for hidden tables; use Chrome Webdriver for those.

4) Scrape nfl table using nfl url. Watch for possibilities of swapped receiving / rushing stats. Watch for hidden tables; use Chrome Webdriver for those. Watch for players with missing receiving / rushing tables (they likely never played at WR).

5) Save only the receiving yards per year (stored in a list) and the aggregate yards.

6) Create two additional columns recording the captions of the cfb and nfl tables, so that the user can see how the data was scraped and how complete it may be.

7) Convert all entries to ints and floats. Do not fill in for missing entries (np.nan).
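
The caption checks in steps 3 and 4 can be sketched as a small dispatch function. This is only an illustration: the helper name `choose_table` is hypothetical and not part of **Collecting_Data_TableDraft_Functions.py**, and the two caption strings are the ones quoted in the introduction.

```python
from typing import List, Optional, Tuple

# Captions quoted in this notebook; the helper below is a hypothetical sketch.
STANDARD = "Receiving & Rushing Table"
INVERTED = "Rushing & Receiving Table"


def choose_table(captions: List[str]) -> Optional[Tuple[int, bool]]:
    """Return (table index, inverted?) or None when a webdriver fallback is needed."""
    if STANDARD in captions:
        return captions.index(STANDARD), False
    if INVERTED in captions:
        return captions.index(INVERTED), True
    # Neither caption is visible: the caller would retry with the webdriver.
    return None


print(choose_table(["2020 Games Table", "Receiving & Rushing Table"]))  # (1, False)
print(choose_table(["Rushing & Receiving Table"]))                      # (0, True)
print(choose_table(["Kick & Punt Returns Table"]))                      # None
```

Returning the inverted flag alongside the index lets the caller decide which parsing function to apply without repeating the caption comparison.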

Since the draft table is spread over multiple webpages on Pro-Football-Reference, we process one draft url (corresponding to one draft table) after the other, store each resulting dataframe in a list, and concatenate them all to obtain our final dataframe.
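
As a minimal sketch of the numeric conversion (step 7) combined with the concatenation, the toy example below uses two made-up frames in place of the scraped ones. It assumes the scraped values arrive as strings; the column names and values are illustrative only.

```python
import pandas as pd

# Two toy frames standing in for the per-url results (made-up values).
part_a = pd.DataFrame({"player": ["A", "B"], "rec_yards": ["4516", ""]})
part_b = pd.DataFrame({"player": ["C"], "rec_yards": ["74"]})

# ignore_index=True renumbers the rows 0..n-1 across the pieces.
combined = pd.concat([part_a, part_b], ignore_index=True)

# errors="coerce" turns unparseable strings (here the empty one) into a
# missing value instead of raising, so missing entries stay missing.
combined["rec_yards"] = pd.to_numeric(combined["rec_yards"], errors="coerce")

print(combined)
```

Leaving the missing entries as they are, rather than filling them with 0, keeps "no data scraped" distinguishable from "zero yards".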

In [14]:
#list_of_url_draft = ['https://www.pro-football-reference.com/play-index/draft-finder.cgi?request=1&year_min=2011&year_max=2012&pick_type=overall&pos%5B%5D=wr&conference=any&show=all&order_by=default']

list_of_url_draft = ['https://www.pro-football-reference.com/play-index/draft-finder.cgi?request=1&year_min=2010&year_max=2014&pick_type=overall&pos%5B%5D=wr&conference=any&show=all&order_by=default',
'https://www.pro-football-reference.com/play-index/draft-finder.cgi?request=1&year_min=2005&year_max=2009&pick_type=overall&pos%5B%5D=wr&conference=any&show=all&order_by=default',
'https://www.pro-football-reference.com/play-index/draft-finder.cgi?request=1&year_min=2000&year_max=2004&pick_type=overall&pos%5B%5D=wr&conference=any&show=all&order_by=default']

In [15]:
list_of_td = []
for url_draft in list_of_url_draft:
    list_of_td.append(draft_nfl_cfb_scrap(url_draft))
    
td = pd.concat(list_of_td, ignore_index=True)

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


## Verifying and Fixing

In order to verify the validity of the scraping of the cfb and nfl tables, we compare:
- the school from the draft table with the cfb_school from the cfb data
- the rec_yards from the draft table with the nfl_yards and the sum of the nfl_yards_per_year from the nfl data

There should be no mismatch, except on the labels of the school, for example 'Oklahoma St.' vs 'Oklahoma State'.

In [18]:
mismatch_cfb(td)[:5]

[(3, 'Oregon St.', 'Oregon State'),
 (4, 'Florida St.', 'Florida State'),
 (8, 'Fresno St.', 'Fresno State'),
 (10, 'Penn St.', 'Penn State'),
 (13, 'Mississippi', 'Ole Miss')]
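
Label-only differences like the ones listed above could be filtered out by normalizing school names before comparing them. The sketch below is one possible approach; `normalize_school` and the alias table are hypothetical, and only the pairs shown in the output above are covered.

```python
import re

# Hypothetical alias table; only this pair comes from the output above.
ALIASES = {"Ole Miss": "Mississippi"}


def normalize_school(name: str) -> str:
    """Expand a trailing 'St.' to 'State' and apply known aliases."""
    name = re.sub(r"\bSt\.$", "State", name.strip())
    return ALIASES.get(name, name)


print(normalize_school("Oregon St."))                                  # Oregon State
print(normalize_school("Ole Miss") == normalize_school("Mississippi"))  # True
```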

There are a few rows on which the nfl scraping failed, since the yards don't match:

In [54]:
for x in mismatch_nfl(td):
    print('For player %s, rec_yards = %.f while rec_yards from nfl page = %f and list from page is %s.'
    %(td.loc[x[0],'player'], x[1], x[2], x[3]))
    print('At the same time, nfl_table_name = %s, indicating the scrapping failed for some unknown reason.\n'
    %td.loc[x[0],'nfl_table_name'])

For player Joe Webb, rec_yards = 74 while rec_yards from nfl page = nan and list from page is [nan].
At the same time, nfl_table_name = unexpected error, indicating the scrapping failed for some unknown reason.

For player Matthew Slater, rec_yards = 46 while rec_yards from nfl page = nan and list from page is [nan].
At the same time, nfl_table_name = unexpected error, indicating the scrapping failed for some unknown reason.

For player Maurice Stovall, rec_yards = 668 while rec_yards from nfl page = nan and list from page is [nan].
At the same time, nfl_table_name = unexpected error, indicating the scrapping failed for some unknown reason.



We fix those **manually**:

In [57]:
for x in mismatch_nfl(td):
    print('index', x[0], td.loc[x[0],'nfl_url'])

index 148 https://www.pro-football-reference.com/players/W/WebbJo00.htm
index 209 https://www.pro-football-reference.com/players/S/SlatMa00.htm
index 264 https://www.pro-football-reference.com/players/S/StovMa00.htm


In [63]:
td.loc[148,'nfl_yards_per_year'] = [0, 9, 0, 33, 16, 0, 3, 0, 13]
td.loc[148,'nfl_yards'] = 74
td.loc[209,'nfl_yards_per_year'] = [0, 0, 0, 46, 0, 0]
td.loc[209,'nfl_yards'] = 46
td.loc[264,'nfl_yards_per_year'] = [102, 86, 25, 366, 81, 8, 0]
td.loc[264,'nfl_yards'] = 668
td.loc[[148, 209, 264], 'nfl_table_name'] = 'receiving & rushing table manually'
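
A note on storing a list in a single cell: depending on the pandas version, assigning a list through `.loc` may be interpreted as several values and raise an error. A hedged alternative is `DataFrame.at`, which addresses exactly one cell. The toy frame below is made up for illustration; only the yearly values are the ones entered above for index 148.

```python
import numpy as np
import pandas as pd

# Toy frame; the column holds Python lists, so its dtype is object.
toy = pd.DataFrame(
    {"player": ["A", "B"], "nfl_yards_per_year": [[np.nan], [10.0, 20.0]]},
    index=[148, 209],
)

# .at addresses exactly one cell, so the list is stored as a single object.
toy.at[148, "nfl_yards_per_year"] = [0, 9, 0, 33, 16, 0, 3, 0, 13]

print(toy.loc[148, "nfl_yards_per_year"])
print(sum(toy.loc[148, "nfl_yards_per_year"]))  # 74
```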

In [61]:
mismatch_nfl(td)

[]

In [None]:
# There is also one row with an unexpected error when pulling the cfb data. No data was scraped, as if there were no link:
td.loc[td.cfb_table_name == 'unexpected error','cfb_table_name'] = 'cfb link is missing'

**FIXED!** Ready for export

In [68]:
td.to_csv(r'data_sets/td.csv',index=False)

In [121]:
td = pd.read_csv('data_sets/td.csv')

## A bunch of tests

We test each of our functions separately, in particular the multiple scenarios for each cfb or nfl table.

### Testing Draft url

In [7]:
url_draft = 'https://www.pro-football-reference.com/play-index/draft-finder.cgi?request=1&year_min=2011&year_max=2012&pick_type=overall&pos%5B%5D=wr&conference=any&show=all&order_by=default'

soup_draft = BeautifulSoup(urlopen(url_draft),'html.parser')
table_draft = tabledraft(soup_draft)

### Testing CFB url

In [27]:
# First CFB table is Receiving & Rushing Table
url_cfb = 'https://www.sports-reference.com/cfb/players/mohamed-sanu-1.html'

# First CFB table is Rushing & Receiving Table
#url_cfb = 'https://www.sports-reference.com/cfb/players/greg-little-1.html'

# First CFB table is Kick & Punt Returns Table. Needs webdriver
#url_cfb = 'https://www.sports-reference.com/cfb/players/tj-graham-1.html'

In [28]:
# the soup file
soup_cfb = BeautifulSoup(urlopen(url_cfb),'html.parser')
# all the tables from the soup file
all_tables_cfb = soup_cfb.find_all('table')
# all the captions on the page
captions_cfb = find_captions_from_list_of_tables(all_tables_cfb)

captions_cfb

['Receiving & Rushing Table']

In [207]:
# If none of the captions are Receiving / Rushing, use the webdriver
# to generate all the tables beyond those in the soup document
all_tables = generate_all_tables(url_cfb)
# all the new captions
captions_cfb = find_captions_from_list_of_tables(all_tables)
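
A possible lighter alternative to the webdriver, under a loud assumption: if the hidden tables are shipped inside HTML comments (this notebook does not verify that this is how the pages are built), they can be recovered by re-parsing the comment nodes with Beautiful Soup. The HTML below is made up for illustration.

```python
from bs4 import BeautifulSoup, Comment

# Made-up page: one visible table and one table inside an HTML comment.
html = """
<div><table><caption>Kick &amp; Punt Returns Table</caption></table></div>
<div><!--
<table><caption>Receiving &amp; Rushing Table</caption></table>
--></div>
"""

soup = BeautifulSoup(html, "html.parser")
visible = [t.caption.get_text() for t in soup.find_all("table")]

# Re-parse every comment node and collect any tables found inside it.
hidden = []
for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
    inner = BeautifulSoup(comment, "html.parser")
    hidden.extend(t.caption.get_text() for t in inner.find_all("table"))

print(visible)  # ['Kick & Punt Returns Table']
print(hidden)   # ['Receiving & Rushing Table']
```

If the assumption does not hold for a given page, `hidden` simply stays empty and the webdriver remains the fallback.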

In [29]:
# the cfb table and the data stored in the draft table
table=cfbstats(all_tables_cfb[0])
display(table)
print('Data captured and stored in the draft table:')
print(row_cfb(table))

| | cfb_school | cfb_conference | cfb_class | cfb_pos | cfb_games | cfb_receptions | cfb_rec_yards | cfb_rec_yards_per_reception | cfb_rec_td | cfb_rushes | cfb_rushing_yards | cfb_rushing_yards_per_rush | cfb_rushing_td | cfb_plays | cfb_yards | cfb_yards_per_play | cfb_td |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Rutgers | Big East | FR | WR | 13.0 | 51 | 639 | 12.5 | 3 | 62 | 346 | 5.6 | 5 | 113 | 985 | 8.7 | 8 |
| 1 | Rutgers | Big East | SO | WR | 12.0 | 44 | 418 | 9.5 | 2 | 59 | 309 | 5.2 | 4 | 103 | 727 | 7.1 | 6 |
| 2 | Rutgers | Big East | JR | WR | 13.0 | 115 | 1206 | 10.5 | 7 | 4 | -2 | -0.5 | 0 | 119 | 1204 | 10.1 | 7 |
| 3 | Rutgers | | | | | 210 | 2263 | 10.8 | 12 | 125 | 653 | 5.2 | 9 | 335 | 2916 | 8.7 | 21 |


Data captured and stored in the draft table:
['Rutgers', 'Big East', 'JR', 'WR', '38.0', '210', '2263', '10.8', '12', '125', '653', '5.2', '9', '335', '2916', '8.7', '21']


### Testing NFL url

In [30]:
#1st table isn't Receiving & Rushing Table but can still be found
url_nfl = 'https://www.pro-football-reference.com/players/S/SanuMo00.htm'

#no table at all
#url_nfl = 'https://www.pro-football-reference.com/players/H/HurdJa00.htm'

#one Games table and that's it
#url_nfl = 'https://www.pro-football-reference.com/players/W/WinfJu00.htm'

#played with multiple teams
#url_nfl = 'https://www.pro-football-reference.com/players/B/BeckOd00.htm'

#has nfl stats inverted (rushes before receiving)
#url_nfl = 'https://www.pro-football-reference.com/players/S/StewAr00.htm'

#First table is Kick & Punt Returns. Needs webdriver
#url_nfl = 'https://www.pro-football-reference.com/players/H/HarrDw00.htm'

#has a stat kick and punt returns and nfl stats inverted
#url_nfl = 'https://www.pro-football-reference.com/players/A/ArchDr00.htm'

In [31]:
# the soup file
soup_nfl = BeautifulSoup(urlopen(url_nfl),'html.parser')
# all the tables from the soup file
all_tables_nfl = soup_nfl.find_all('table')
# all the captions
captions_nfl = find_captions_from_list_of_tables(all_tables_nfl)

captions_nfl

['2020 Games Table',
 'Receiving & Rushing Table',
 'Advanced Passing Table',
 'Advanced Passing Table',
 'Advanced Passing Table',
 'Advanced Passing Table']

In [32]:
# Tables visible to Beautiful Soup: try the standard caption, then the inverted one.
# (desired_captions and list_of_nfl are expected to come from the module imported at the top.)
if desired_captions[0] in captions_nfl:
    x, y = scrap_data(all_tables_nfl, captions_nfl, desired_captions[0], [nflstats, row_nfl]), desired_captions[0].lower()
elif desired_captions[1] in captions_nfl:
    x, y = scrap_data(all_tables_nfl, captions_nfl, desired_captions[1], [nflstats_invert, row_nfl]), desired_captions[0].lower() + ' inverted'
else:
    # Neither caption is visible: regenerate the tables with the webdriver and retry.
    all_tables = generate_all_tables(url_nfl)
    captions_nfl = find_captions_from_list_of_tables(all_tables)
    
    if desired_captions[0] in captions_nfl:
        x, y = scrap_data(all_tables, captions_nfl, desired_captions[0], [nflstats, row_nfl]), desired_captions[0].lower() + ' webdriver'
    elif desired_captions[1] in captions_nfl:
        x, y = scrap_data(all_tables, captions_nfl, desired_captions[1], [nflstats_invert, row_nfl]), desired_captions[0].lower() + ' webdriver inverted'
    else:
        # Still nothing usable: record NaNs and flag the failure in the caption.
        x, y = [np.nan]*len(list_of_nfl), 'failed after webdriver'

In [34]:
print('the receiving yards per year list', x[0], 'and its sum', np.sum(x[0]))
print('the total number receiving yards', x[1])
print('the caption of the table', y)

the receiving yards per year list [154.0, 455.0, 790.0, 394.0, 653.0, 703.0, 838.0, 520.0, 9.0] and its sum 4516.0
the total number receiving yards 4516.0
the caption of the table receiving & rushing table


In [35]:
table=nflstats(all_tables_nfl[1])
display(table)
print('Data captured and stored in the draft table:')
row_nfl(table)

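| | age | team | pos | no | games | games_started | targets | receptions | rec_yards | rec_yards_per_reception | rec_td | rec_first_downs | rec_longest | receptions_per_game | rec_yards_per_game | rec_catch_ratio | rec_yards_per_target | rushes | rushing_yards | rushing_td | rushing_first_downs | rushing_longest | rushing_yards_per_rush | rushing_yards_per_game | rushes_per_game | plays | yards_per_play | yards | td | fumbles | approximate_value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 23 | CIN | | 12.0 | 9 | 3 | 25 | 16 | 154 | 9.6 | 4 | 10.0 | 34 | 1.8 | 17.1 | 64.0% | 6.2 | 5.0 | 15.0 | 0.0 | 2.0 | 7.0 | 3.0 | 1.7 | 0.6 | 21 | 8.0 | 169 | 4 | 0 | 2.0 |
| 1 | 24 | CIN | WR | 12.0 | 16 | 14 | 77 | 47 | 455 | 9.7 | 2 | 28.0 | 32 | 2.9 | 28.4 | 61.0% | 5.9 | 4.0 | 16.0 | 0.0 | 0.0 | 9.0 | 4.0 | 1.0 | 0.3 | 51 | 9.2 | 471 | 2 | 1 | 4.0 |
| 2 | 25 | CIN | WR | 12.0 | 16 | 13 | 98 | 56 | 790 | 14.1 | 5 | 38.0 | 76 | 3.5 | 49.4 | 57.1% | 8.1 | 7.0 | 51.0 | 0.0 | 2.0 | 26.0 | 7.3 | 3.2 | 0.4 | 63 | 13.3 | 841 | 5 | 0 | 8.0 |
| 3 | 26 | CIN | | 12.0 | 16 | 4 | 49 | 33 | 394 | 11.9 | 0 | 17.0 | 52 | 2.1 | 24.6 | 67.3% | 8.0 | 10.0 | 71.0 | 2.0 | 4.0 | 25.0 | 7.1 | 4.4 | 0.6 | 43 | 10.8 | 465 | 2 | 2 | 5.0 |
| 4 | 27 | ATL | WR | 12.0 | 15 | 15 | 81 | 59 | 653 | 11.1 | 4 | 33.0 | 59 | 3.9 | 43.5 | 72.8% | 8.1 | 1.0 | 5.0 | 0.0 | 0.0 | 5.0 | 5.0 | 0.3 | 0.1 | 60 | 11.0 | 658 | 4 | 1 | 7.0 |
| 5 | 28 | ATL | WR | 12.0 | 15 | 15 | 96 | 67 | 703 | 10.5 | 5 | 41.0 | 25 | 4.5 | 46.9 | 69.8% | 7.3 | 4.0 | 10.0 | 0.0 | 3.0 | 4.0 | 2.5 | 0.7 | 0.3 | 71 | 10.0 | 713 | 5 | 0 | 7.0 |
| 6 | 29 | ATL | WR | 12.0 | 16 | 16 | 94 | 66 | 838 | 12.7 | 4 | 40.0 | 44 | 4.1 | 52.4 | 70.2% | 8.9 | 7.0 | 44.0 | 0.0 | 5.0 | 24.0 | 6.3 | 2.8 | 0.4 | 73 | 12.1 | 882 | 4 | 2 | 7.0 |
| 7 | 30 | 2TM | | | 15 | 12 | 89 | 59 | 520 | 8.8 | 2 | 30.0 | 28 | 3.9 | 34.7 | 66.3% | 5.8 | 3.0 | 11.0 | 0.0 | 1.0 | 8.0 | 3.7 | 0.7 | 0.2 | 62 | 8.6 | 531 | 2 | 0 | 5.0 |
| 8 | | ATL | wr | 12.0 | 7 | 6 | 42 | 33 | 313 | 9.5 | 1 | | 28 | 4.7 | 44.7 | | 7.5 | 2.0 | 3.0 | 0.0 | | 2.0 | 1.5 | 0.4 | 0.3 | 35 | 9.0 | 316 | 1 | 0 | 3.0 |
| 9 | | NWE | wr | 14.0 | 8 | 6 | 47 | 26 | 207 | 8.0 | 1 | | 22 | 3.3 | 25.9 | | 4.4 | 1.0 | 8.0 | 0.0 | | 8.0 | 8.0 | 1.0 | 0.1 | 27 | 8.0 | 215 | 1 | 0 | 2.0 |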


Data captured and stored in the draft table:


[[154.0, 455.0, 790.0, 394.0, 653.0, 703.0, 838.0, 520.0, 9.0], 4516.0]
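
In the table above, the season played with two teams appears three times: once as a combined `2TM` row and once per team, with the per-team rows showing no age. The yearly list holds 520.0 for that season, so the per-team rows are evidently not counted twice. How `row_nfl` achieves this is not shown here; the sketch below is one possible approach, assuming the per-team breakdown rows are exactly those with a missing age.

```python
import numpy as np
import pandas as pd

# Subset of the table above: one regular season, the 2TM season,
# and its two per-team breakdown rows.
seasons = pd.DataFrame(
    {
        "age": [29, 30, np.nan, np.nan],
        "team": ["ATL", "2TM", "ATL", "NWE"],
        "rec_yards": [838, 520, 313, 207],
    }
)

# Assumption: per-team breakdown rows are the ones with no age value.
full_seasons = seasons[seasons["age"].notna()]
yards_per_year = full_seasons["rec_yards"].astype(float).tolist()

print(yards_per_year)  # [838.0, 520.0]
```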