# Collecting Data, last notebook: merging draft and combine tables

In this notebook, we import the draft tables we generated with Collecting_Dat_TableDraft and the combine table with Collecting_Data_CombineTable.

### 1) Merging the draft tables together

We merge the nfl draft tables and keep only the players who satisfy our condition: wide receivers (WR) drafted between 2000 and 2011, inclusive.

### 2) Add the combine data

We add a set of columns holding the combine numbers to each row.
- Using the combine table, we add the columns by matching either on the player's name/year OR on the player's year/pick/round. If the match succeeds, the program writes 'success from merge' to an additional column named 'combine method'. If not, it writes 'fail after merge'.
- Because a substantial number of draftees have no combine data after the merge, we go back to each player's nfl page
- and use webdriver.Chrome to find hidden tables, particularly the combine one.
- We extract the combine data and make sure its column order matches that of the other players' combine data.
- If this succeeds, the program updates the 'combine method' column to 'success from search'.
- Otherwise, it writes 'fail after search (missing table)' if the table was absent from the nfl page, or 'fail after search (missing link)' if the nfl url was missing in the first place.
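
The labelling rules in the list above can be summarised as a small decision function. This is only a sketch: `combine_status` and its three boolean arguments are hypothetical names, not functions from `Collecting_Data_Final_Functions`. The returned strings are the values that appear in the 'combine method' column.

```python
def combine_status(merged, has_url, table_found):
    """Return the 'combine method' label for one draftee (illustrative only).

    merged      -- True if the combine table matched the player directly
    has_url     -- True if an nfl url is available for the player
    table_found -- True if a combine table was found on the player's nfl page
    """
    if merged:
        return 'success from merge'
    if not has_url:
        return 'fail after search (missing link)'
    if not table_found:
        return 'fail after search (missing table)'
    return 'success from search'


print(combine_status(True, True, False))    # success from merge
print(combine_status(False, False, False))  # fail after search (missing link)
print(combine_status(False, True, False))   # fail after search (missing table)
print(combine_status(False, True, True))    # success from search
```

The order of the checks mirrors the order of the steps: the web search is only attempted when the merge found nothing, and the page can only be searched when a url exists.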

### 3) Counting

We count the number of rows with complete, partially complete and incomplete data. 
Remember that for the 'cfb method' column:
- success: cfb data is good
- fail: cfb data is missing

For the 'nfl method' column:
- success from main, main1, main2, or main3: nfl data is good
- fail from main: there is no explicit nfl data (probably because the url is missing)
- fail from main1: the player never had 'Receiving & Rushing' stats (he was used for something else)
- fail from main2: the player has a mismatch between 'receiving yards' and 'nfl yards'. This should be rare because those cases are handled manually in main3.

And for the 'combine method' column:
- success from merge or success from search: combine data is good
- fail after search (missing table): no combine table was found on the nfl web page
- fail after search (missing link): no url was provided at all

We print out those numbers at the end.
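
As a rough illustration of how the three method columns translate into "complete", "partially complete" and "incomplete" rows, the sketch below classifies a few toy rows by the first word of each label. `status_prefix` and `completeness` are hypothetical helpers written for this example (the notebook itself relies on `first_word` from `Collecting_Data_Final_Functions`), and the three rows are made up.

```python
from collections import Counter


def status_prefix(label):
    """First word of a method label, e.g. 'success from merge' -> 'success'."""
    return label.split()[0]


def completeness(cfb, nfl, combine):
    """Classify one row by how many of its three sources succeeded."""
    n_ok = sum(status_prefix(s) == 'success' for s in (cfb, nfl, combine))
    if n_ok == 3:
        return 'complete'
    return 'partial' if n_ok > 0 else 'incomplete'


# Toy rows: (cfb method, nfl method, combine method)
rows = [
    ('success', 'success from main', 'success from merge'),
    ('success', 'fail after main1', 'success from merge'),
    ('fail (eg link broken)', 'fail (eg link broken)',
     'fail after search (missing link)'),
]

counts = Counter(completeness(*r) for r in rows)
print(counts['complete'], counts['partial'], counts['incomplete'])  # 1 1 1
```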

In [1]:
import Collecting_Data_Final_Functions
import importlib
importlib.reload(Collecting_Data_Final_Functions)
from Collecting_Data_Final_Functions import *

# 1) Merging draft tables

In [6]:
td1 = pd.read_csv('data_sets/wr_draft_1_complete3.csv')
td2 = pd.read_csv('data_sets/wr_draft_2_complete3.csv')
td3 = pd.read_csv('data_sets/wr_draft_3_complete3.csv')

# Stack the three draft tables into one, with a fresh 0..n-1 index
td = pd.concat([td1, td2, td3], ignore_index=True)

In [7]:
td = td[(td['year'] < 2012) & (td['year']>=2000)].reset_index(drop = True)

In [8]:
td.to_csv(r'data_sets/wr_draft_complete.csv',index=False)

# 2) Adding NFL combines

In [3]:
td = pd.read_csv('data_sets/wr_draft_complete.csv')

In [2]:
#tc = pd.read_csv('data_sets/wr_combine.csv')
tc = pd.read_csv('data_sets/combine_data_since_2000.csv')

## Matching by 
- year, round and pick: tdc1
- year, name: tdc2
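
To make the two matching strategies concrete, here is a minimal pandas sketch on toy data. It is not the notebook's `add_combine`: `match_mask`, the column names and the rows are all assumptions made for illustration. It uses a left merge with `indicator=True` to flag which draft rows find a partner in the combine table under each set of keys.

```python
import pandas as pd

# Toy draft table (hypothetical columns, made-up rows)
draft = pd.DataFrame({
    'player': ['A. One', 'B. Two', 'C. Three'],
    'year':   [2005, 2006, 2007],
    'round':  [1, 2, 7],
    'pick':   [10, 40, 230],
})

# Toy combine table: C. Three has no draft position recorded
combine = pd.DataFrame({
    'Player': ['A. One', 'C. Three', 'D. Four'],
    'Year':   [2005, 2007, 2008],
    'Round':  [1.0, float('nan'), float('nan')],
    'Pick':   [10.0, float('nan'), float('nan')],
})


def match_mask(left, right, left_keys, right_keys):
    """True for each row of `left` that has a match in `right` on the keys."""
    keys = right[right_keys].dropna().drop_duplicates()
    merged = left.merge(keys, how='left', left_on=left_keys,
                        right_on=right_keys, indicator=True)
    return (merged['_merge'] == 'both').tolist()


s1 = match_mask(draft, combine, ['year', 'round', 'pick'],
                ['Year', 'Round', 'Pick'])
s2 = match_mask(draft, combine, ['player', 'year'], ['Player', 'Year'])

# A row counts as merged if either strategy found it
labels = ['success from merge' if (a or b) else 'fail after merge'
          for a, b in zip(s1, s2)]
print(s1)  # [True, False, False]
print(s2)  # [True, False, True]
print(labels)
```

Dropping duplicate keys before the merge keeps one output row per draft row, so the masks line up with the draft table's index.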

In [6]:
tdc1 = add_combine(td,tc)
print(tdc1['combine method'].value_counts())
s1 = (tdc1['combine method'] == 'success from merge').tolist()

success from merge    323
fail after merge       70
Name: combine method, dtype: int64


In [8]:
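# Same call as for tdc1: the differing counts below suggest the matching keys
# were switched inside add_combine before this run (an assumption; the call
# itself does not show which keys were used)
tdc2 = add_combine(td,tc)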
print(tdc2['combine method'].value_counts())
s2 = (tdc2['combine method'] == 'success from merge').tolist()

success from merge    313
fail after merge       80
Name: combine method, dtype: int64


In [10]:
s1not = [not i for i in s1]
list_of_matches_using_names_only = []
for i in range(len(s1)):
    if s2[i] and s1not[i]:
        list_of_matches_using_names_only.append(i)
        
list_of_matches_using_names_only

[181, 289, 305, 351, 376, 381, 387, 392]

Eight additional players were found by matching on year + name only.
For these players, the draft numbers or the team were not reported in the combine table.
Many of them also never had an nfl career, although a team is still listed for them.
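
The comparison of the two success masks can also be written with `zip` and a list comprehension. The sketch below uses short toy masks in place of the real `s1` and `s2`; `name_only` and `either` are names chosen for this example.

```python
# Toy masks standing in for s1 (year/round/pick) and s2 (year/name)
s1 = [True, False, False, True, False]
s2 = [True, True, False, False, True]

# Indices matched by year + name but not by year/round/pick
name_only = [i for i, (by_pick, by_name) in enumerate(zip(s1, s2))
             if by_name and not by_pick]

# Rows matched by at least one of the two strategies
either = [a or b for a, b in zip(s1, s2)]

print(name_only)    # [1, 4]
print(sum(either))  # 4
```

On the real data, the 323 matches on year/round/pick plus the 8 name-only matches account for 331 merged rows in total.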

In [12]:
tdc = add_combine(td,tc)

#tdc.loc[list_of_matches_using_names_only]
# for i in list_of_matches_using_names_only:
#     name = tdc.loc[i,'player']
#     display(tc.loc[tc.loc[:,'Player'] == name])

## Finding the missing combine numbers:

After the merge, 62 draftees still have no combine numbers:

In [13]:
tdc.loc[:,'combine method'].value_counts()

success from merge    331
fail after merge       62
Name: combine method, dtype: int64

For those, let's try the same webdriver.Chrome method on the players labelled 'fail after merge' and search through all the tables of their nfl page. Example:

In [95]:
# Case 1: everything is good (the page has a combine table)
html = 'https://www.pro-football-reference.com/players/J/JoneJu02.htm'
# Case 2: the table is missing (no combine data); uncomment to test
#html = 'https://www.pro-football-reference.com/players/D/DurhKr00.htm'

table_list = generate_all_nfl_tables(html)
table0 = find_combine_table(table_list)

table = combine_nfl(table0)
display(table)
display(row_combine(table))

Unnamed: 0,year,pos,height,weight,40yd,bench,broad jump,shuttle,3 cone,vertical
0,Year,Pos,Ht,Wt,40yd,Bench,Broad Jump,Shuttle,3Cone,Vertical
1,2011,WR,75,220,4.34,17,135,4.25,6.66,38.5


(['WR', '75', '220', '4.34', '38.5', '17', '135', '6.66', '4.25', '2011'],
 'success from search')
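
The output above shows that the scraped table and the returned row use different column orders. A minimal sketch of that reordering step is given below. `reorder_combine_row` is a hypothetical helper, not the notebook's `row_combine`; the two orderings are simply read off the displayed table and the returned list.

```python
# Column order of the scraped table, as displayed above
SCRAPED_COLUMNS = ['year', 'pos', 'height', 'weight', '40yd', 'bench',
                   'broad jump', 'shuttle', '3 cone', 'vertical']

# Column order of the returned row, as read off the row_combine output
TARGET_ORDER = ['pos', 'height', 'weight', '40yd', 'vertical', 'bench',
                'broad jump', '3 cone', 'shuttle', 'year']


def reorder_combine_row(values):
    """Map scraped values to their column names, then emit the target order."""
    by_name = dict(zip(SCRAPED_COLUMNS, values))
    return [by_name[name] for name in TARGET_ORDER]


scraped = ['2011', 'WR', '75', '220', '4.34', '17', '135', '4.25', '6.66', '38.5']
print(reorder_combine_row(scraped))
# ['WR', '75', '220', '4.34', '38.5', '17', '135', '6.66', '4.25', '2011']
```

Looking values up by column name rather than by position keeps the shuttle and 3-cone times from being swapped when the two sources order them differently.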

In [14]:
missing_combine = tdc.index[tdc.loc[:,'combine method'] == 'fail after merge'].tolist()

#[print(x) for x in tdc.loc[missing_combine[:10], 'nfl url']]

In [15]:
tdcp = search_combine(tdc)
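
The search step in this notebook relies on webdriver.Chrome. As an alternative illustration only, the sketch below looks for tables in static page source without a browser. It assumes that the hidden tables are wrapped in HTML comments and that the combine table carries `id="combine"`; both are assumptions about the page markup that the notebook does not confirm. `find_tables` and `find_table_by_id` are hypothetical helpers, not the notebook's `generate_all_nfl_tables` or `find_combine_table`, and the page string is made up.

```python
import re

COMMENT_RE = re.compile(r'<!--(.*?)-->', re.DOTALL)
TABLE_RE = re.compile(r'<table\b[^>]*>.*?</table>', re.DOTALL | re.IGNORECASE)


def find_tables(html_text):
    """Return all <table> fragments, including those inside HTML comments."""
    visible = TABLE_RE.findall(COMMENT_RE.sub('', html_text))
    hidden = [table
              for comment in COMMENT_RE.findall(html_text)
              for table in TABLE_RE.findall(comment)]
    return visible + hidden


def find_table_by_id(tables, table_id):
    """Return the first table whose opening tag has the given id, else None."""
    for table in tables:
        opening_tag = table[:table.index('>') + 1]
        if f'id="{table_id}"' in opening_tag:
            return table
    return None


# Made-up page: one visible table and one table hidden in a comment
page = ('<html><table id="stats"><tr><td>1</td></tr></table>'
        '<!-- <div><table class="x" id="combine">'
        '<tr><td>4.34</td></tr></table></div> --></html>')

tables = find_tables(page)
combine_table = find_table_by_id(tables, 'combine')
print(len(tables))                             # 2
print(combine_table is not None)               # True
print(find_table_by_id(tables, 'missing'))     # None
```

A `None` result here would correspond to the 'fail after search (missing table)' label. The regular expressions do not handle nested tables, so this is a sketch rather than a robust parser.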

In [17]:
tdcp.to_csv('data_sets/wr_draft_combine.csv',index=False)

# 3) Counting

### Combine Counts

In [16]:
print(len(tdcp), '\n',
tdcp['combine method'].value_counts())

393 
 success from merge                   331
fail after search (missing table)     44
fail after search (missing link)      18
Name: combine method, dtype: int64


In [10]:
tdc = pd.read_csv('data_sets/wr_draft_combine.csv')

### All the counts

In [11]:
print(tdc['nfl method'].value_counts(), '\n',
tdc['cfb method'].value_counts(), '\n',
tdc['combine method'].value_counts())

success from main        269
success from main1        50
fail (eg link broken)     49
fail after main1          20
success from main3         3
success from main2         2
Name: nfl method, dtype: int64 
 success                  340
fail (eg link broken)     53
Name: cfb method, dtype: int64 
 success from merge                   331
fail after search (missing table)     44
fail after search (missing link)      18
Name: combine method, dtype: int64


In [20]:
cfb_nfl_comb = []
for i, row in tdc.iterrows():
    if (row['cfb method'] == 'success') and (first_word(row['nfl method']) == 'success') and (first_word(row['combine method']) == 'success'):
        cfb_nfl_comb.append(i)
        
cfb_comb = []
for i, row in tdc.iterrows():
    if (row['cfb method'] == 'success') and (first_word(row['combine method']) == 'success'):
        cfb_comb.append(i)
        
nocomb_nocfb = []
for i, row in tdc.iterrows():
    if (first_word(row['cfb method']) == 'fail') and  (first_word(row['combine method']) == 'fail'):
        nocomb_nocfb.append(i)   

nocomb = []
for i, row in tdc.iterrows():
    if (first_word(row['combine method']) == 'fail'):
        nocomb.append(i)

In [21]:
print('total number of rows:',len(tdc),'\n',
'rows with complete data (cfb, nfl, comb):', len(cfb_nfl_comb),'\n',
'rows with cfb and comb:', len(cfb_comb), '\n',
'row with no cfb and no comb:', len(nocomb_nocfb), '\n',
'rows with no comb data:', len(nocomb))

total number of rows: 393 
 rows with complete data (cfb, nfl, comb): 262 
 rows with cfb and comb: 300 
 row with no cfb and no comb: 22 
 rows with no comb data: 62
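
As a sanity check, the printed counts can be cross-checked against each other with a little arithmetic. The numbers below are copied from the outputs above; the variable names are chosen for this example. The check assumes every label starts with either 'success' or 'fail', which holds for the values listed in the counts.

```python
total = 393
cfb_ok, cfb_fail = 340, 53      # 'cfb method' counts
comb_ok, comb_fail = 331, 62    # 'combine method': 62 = 44 + 18
both_ok = 300                   # rows with cfb and comb
both_fail = 22                  # rows with no cfb and no comb

# Each column partitions the full table
assert cfb_ok + cfb_fail == total
assert comb_ok + comb_fail == total
assert 44 + 18 == comb_fail

# Rows with exactly one of the two sources
cfb_only = cfb_ok - both_ok     # cfb data but no combine data
comb_only = comb_ok - both_ok   # combine data but no cfb data

# The four cfb/combine combinations must add back up
assert cfb_only + both_fail == comb_fail
assert comb_only + both_fail == cfb_fail
assert both_ok + cfb_only + comb_only + both_fail == total

print(cfb_only, comb_only)  # 40 31
```

So 40 players have cfb data but no combine data, and 31 have combine data but no cfb data.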
