# Collecting Data, last notebook: merging draft and combine tables

In this notebook, we import the draft tables generated by Collecting_Dat_TableDraft and the combine table generated by Collecting_Data_CombineTable.

### 1) Merging the draft tables together

We merge the NFL draft tables and keep the players that satisfy our conditions: wide receivers (WR) drafted between 2000 and 2011, both years included.

### 2) Add the combine data

We add a set of columns holding the combine numbers to each row.
- Using the combine table, we add the columns by identification, either on the player's name/year OR on the player's year/pick/round. If the match succeeds, the program writes 'success from merge' in an additional column named 'combine method'. If not, it writes 'fail after merge'.
- Given that a substantial number of draftees have no combine data after this merge, we go back to the NFL page
- and use webdriver.Chrome to find hidden tables, particularly the combine one.
- We extract the combine data and make sure its ordering matches the other players' combine columns.
- If it succeeds, the program updates the column 'combine method' with 'success from search'.
- Otherwise, it writes 'fail after search (missing table)' if the table was absent from the NFL page, or 'fail after search (missing link)' if the NFL url was missing in the first place.
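
The merge step can be illustrated with a minimal sketch. The real `add_combine` lives in Collecting_Data_Final_Functions and is not shown in this notebook, so everything below is an assumption made for illustration: the function name `add_combine_sketch`, the toy players, and the single `40yd` stat column. It only shows the idea of trying year/round/pick first and falling back to year/name.

```python
import pandas as pd

# Toy draft and combine tables; all players and values are made up.
draft = pd.DataFrame({
    "player": ["A. Able", "B. Baker", "C. Cole"],
    "year": [2005, 2005, 2006],
    "round": [1, 2, 3],
    "pick": [10, 40, 70],
})
combine = pd.DataFrame({
    "Player": ["A. Able", "B. Baker"],
    "Year": [2005, 2005],
    "Round": [1.0, None],   # B. Baker has no draft numbers in the combine table
    "Pick": [10.0, None],
    "40yd": [4.40, 4.55],
})

def add_combine_sketch(draft, combine, stat_cols=("40yd",)):
    """Attach combine stats by year/round/pick, falling back to year/name."""
    records = combine.to_dict("records")
    by_pick = {(r["Year"], r["Round"], r["Pick"]): r for r in records}
    by_name = {(r["Year"], r["Player"]): r for r in records}

    rows = []
    for d in draft.to_dict("records"):
        # First attempt: year + round + pick.
        match = by_pick.get((d["year"], d["round"], d["pick"]))
        # Second attempt: year + name.
        if match is None:
            match = by_name.get((d["year"], d["player"]))
        for col in stat_cols:
            d[col] = match[col] if match is not None else float("nan")
        d["combine method"] = (
            "success from merge" if match is not None else "fail after merge"
        )
        rows.append(d)
    return pd.DataFrame(rows)

result = add_combine_sketch(draft, combine)
print(result[["player", "40yd", "combine method"]])
```

In this toy case B. Baker is only found through the year/name fallback, and C. Cole ends up as 'fail after merge', which is the situation the search step is meant to handle.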

### 3) Cleaning

We convert all data (except the obvious strings such as player names and teams) to floats.
We also fill in missing data with zeros depending on the "nfl method" column value:
- success from main, main1, main2, or main3: convert to floats
- fail from main: there's no explicit NFL data (probably because the url is missing)
- fail from main1: the player never had 'Receiving & Rushing' stats (he was used for something else)
- fail from main2: the player has a mismatch between 'receiving yards' and 'nfl yards'. This should be rare because those numbers are handled manually in main3.

- Note: even for players with missing data (e.g. fail), main2 filters points out if there's a mismatch between 'receiving yards' and 'nfl yards' (= 0 in this case).
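
As a rough sketch of this cleaning rule, the snippet below converts toy numeric columns to floats and fills zeros only on rows whose "nfl method" starts with "fail". The toy table, the choice of `receiving yards` and `games` as numeric columns, and the decision to treat every fail value the same way are assumptions for illustration, not the notebook's actual cleaning code.

```python
import pandas as pd

# Toy table; the players and numbers are made up.
df = pd.DataFrame({
    "player": ["A. Able", "B. Baker", "C. Cole"],
    "nfl method": ["success from main", "fail after main1", "fail (eg link broken)"],
    "receiving yards": ["1200", None, None],
    "games": ["16", None, None],
})

numeric_cols = ["receiving yards", "games"]

# Convert the non-string columns to floats; missing or unparseable
# entries become NaN.
for col in numeric_cols:
    df[col] = pd.to_numeric(df[col], errors="coerce").astype(float)

# Fill the gaps with zeros only on rows where the NFL scrape failed.
failed = df["nfl method"].str.startswith("fail")
df.loc[failed, numeric_cols] = df.loc[failed, numeric_cols].fillna(0.0)

print(df)
```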

In [185]:
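import importlib

import numpy as np
import pandas as pd

import Collecting_Data_Final_Functions
# Reload so that edits to the helper module are picked up without restarting the kernel
importlib.reload(Collecting_Data_Final_Functions)
from Collecting_Data_Final_Functions import *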

# 1) Merging draft tables

In [6]:
td1 = pd.read_csv('data_sets/wr_draft_1_complete3.csv')
td2 = pd.read_csv('data_sets/wr_draft_2_complete3.csv')
td3 = pd.read_csv('data_sets/wr_draft_3_complete3.csv')

# Stack the three draft tables into one
td = pd.concat([td1, td2, td3], ignore_index=True)

In [7]:
td = td[(td['year'] < 2012) & (td['year']>=2000)].reset_index(drop = True)

In [8]:
td.to_csv(r'data_sets/wr_draft_complete.csv',index=False)

# 2) Adding NFL combines

In [98]:
td = pd.read_csv('data_sets/wr_draft_complete.csv')

In [99]:
tc = pd.read_csv('data_sets/wr_combine.csv')

## Matching by 
- year, round and pick: tdc1
- year, name: tdc2

In [322]:
# Run with add_combine matching on year, round and pick
tdc1 = add_combine(td,tc)
print((tdc1['combine method'] == 'success').value_counts())
s1 = (tdc1['combine method'] == 'success').tolist()

True     318
False     75
Name: combine method, dtype: int64


In [325]:
# Run with add_combine matching on year and name
tdc2 = add_combine(td,tc)
print((tdc2['combine method'] == 'success').value_counts())
s2 = (tdc2['combine method'] == 'success').tolist()

True     325
False     68
Name: combine method, dtype: int64


In [326]:
# Rows matched by year + name (s2) but not by year + round + pick (s1)
list_of_matches_using_names_only = [
    i for i, (by_pick, by_name) in enumerate(zip(s1, s2)) if by_name and not by_pick
]

In [100]:
tdc = add_combine(td,tc)

#tdc.loc[list_of_matches_using_names_only]
# for i in list_of_matches_using_names_only:
#     name = tdc.loc[i,'player']
#     display(tc.loc[tc.loc[:,'Player'] == name])

Matching on year + name finds 7 additional players (325 versus 318).
For these players, the draft numbers or the team weren't reported in the combine table.
Also, 6 of the 7 simply never had an NFL career but still had a team.

In [14]:
# Combine rows whose 'Team' entry is missing (NaN) or an empty string
no_team = tc['Team'].isna() | (tc['Team'] == '')
list_of_combine_w_no_team = tc.index[no_team].tolist()
len(list_of_combine_w_no_team)

211

There are a LOT of combine players with no team (211). Most are assumed to have never been drafted, but we showed above that some of them were still drafted.

## Finding the missing combine numbers:

There are *this many* draftees without combine numbers:

In [101]:
(tdc.loc[:,'combine method'] == 'fail after merge').value_counts()

False    325
True      68
Name: combine method, dtype: int64

For the players with 'fail after merge', let's try the same webdriver.Chrome method and search through all the tables of the NFL page. Example:

In [95]:
#Everything good
html = 'https://www.pro-football-reference.com/players/J/JoneJu02.htm'

#Table is missing (no combine)
#html = 'https://www.pro-football-reference.com/players/D/DurhKr00.htm'

table_list = generate_all_nfl_tables(html)
table0 = find_combine_table(table_list)

table = combine_nfl(table0)
display(table)
display(row_combine(table))

Unnamed: 0,year,pos,height,weight,40yd,bench,broad jump,shuttle,3 cone,vertical
0,Year,Pos,Ht,Wt,40yd,Bench,Broad Jump,Shuttle,3Cone,Vertical
1,2011,WR,75,220,4.34,17,135,4.25,6.66,38.5


(['WR', '75', '220', '4.34', '38.5', '17', '135', '6.66', '4.25', '2011'],
 'success from search')
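
The reordering visible in the output above (page columns on one side, the list returned by `row_combine` on the other) can be sketched as follows. The target order is inferred from that single output, and `reorder_combine_row` is a hypothetical helper, not the actual `row_combine` implementation.

```python
# Column order of the table scraped from the player page (see the output above).
page_order = ["year", "pos", "height", "weight", "40yd", "bench",
              "broad jump", "shuttle", "3 cone", "vertical"]

# Order of the values in the list returned by row_combine (inferred from the output above).
target_order = ["pos", "height", "weight", "40yd", "vertical", "bench",
                "broad jump", "3 cone", "shuttle", "year"]

def reorder_combine_row(values, source=page_order, target=target_order):
    """Rearrange one row of scraped values from `source` order into `target` order."""
    by_name = dict(zip(source, values))
    return [by_name[name] for name in target]

# The data row of the table displayed above.
scraped = ["2011", "WR", "75", "220", "4.34", "17", "135", "4.25", "6.66", "38.5"]
print(reorder_combine_row(scraped))
# ['WR', '75', '220', '4.34', '38.5', '17', '135', '6.66', '4.25', '2011']
```

Looking values up by column name rather than by position keeps the result correct even if a page lists its columns in a different order.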

In [102]:
missing_combine = tdc.index[tdc.loc[:,'combine method'] == 'fail after merge'].tolist()

#[print(x) for x in tdc.loc[missing_combine[:10], 'nfl url']]

For a lot of them, the NFL combine table is simply missing.

In [73]:
tdcp = search_combine(tdc)

In [93]:
print(len(tdcp), '\n',
(tdcp.loc[:,'combine method'] == 'success from merge').value_counts(), '\n',
(tdcp.loc[:,'combine method'] == 'success from search').value_counts(),'\n',
(tdcp.loc[:,'combine method'] == 'fail after search (missing table)').value_counts(),'\n',
(tdcp.loc[:,'combine method'] == 'fail after search (missing link)').value_counts())

393 
 True     325
False     68
Name: combine method, dtype: int64 
 False    389
True       4
Name: combine method, dtype: int64 
 False    349
True      44
Name: combine method, dtype: int64 
 False    373
True      20
Name: combine method, dtype: int64
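
The same breakdown can be read in one table by calling `value_counts()` on the column itself; on the real data that would be `tdcp['combine method'].value_counts()`. The sketch below rebuilds a stand-in column with the counts printed above, since the CSV is not available here.

```python
import pandas as pd

# Stand-in for tdcp['combine method'], rebuilt from the counts printed above.
status = pd.Series(
    ["success from merge"] * 325
    + ["success from search"] * 4
    + ["fail after search (missing table)"] * 44
    + ["fail after search (missing link)"] * 20,
    name="combine method",
)

counts = status.value_counts()
print(counts)
print("total:", counts.sum())
```

The four categories add up to the 393 draftees, so every row carries exactly one of these labels.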


In [123]:
tdcp.to_csv('data_sets/wr_draft_combine.csv',index=False)

# 3) Cleaning

In [124]:
tdc = pd.read_csv('data_sets/wr_draft_combine.csv')

In [107]:
print(tdc['nfl method'].unique(), '\n',
tdc['cfb method'].unique(), '\n',
tdc['combine method'].unique())

['success from main' 'success from main1' 'fail (eg link broken)'
 'success from main2' 'success from main3' 'fail after main1'] 
 ['success' 'fail (eg link broken)'] 
 ['success from merge' 'fail after search (missing table)'
 'success from search' 'fail after search (missing link)']


In [189]:
# Status flags for each data source
cfb_ok = tdc['cfb method'] == 'success'
cfb_fail = tdc['cfb method'].map(first_word) == 'fail'
nfl_ok = tdc['nfl method'].map(first_word) == 'success'
nfl_fail = tdc['nfl method'].map(first_word) == 'fail'
comb_ok = tdc['combine method'].map(first_word) == 'success'
comb_fail = tdc['combine method'].map(first_word) == 'fail'

def index_list(mask):
    """Return the row labels selected by a boolean mask."""
    return tdc.index[mask].tolist()

# The eight combinations of (cfb, nfl, combine) availability
cfb_nfl_comb = index_list(cfb_ok & nfl_ok & comb_ok)
nfl_comb = index_list(cfb_fail & nfl_ok & comb_ok)
cfb_nfl = index_list(cfb_ok & nfl_ok & comb_fail)
cfb_comb = index_list(cfb_ok & nfl_fail & comb_ok)
cfb_only = index_list(cfb_ok & nfl_fail & comb_fail)
nfl_only = index_list(cfb_fail & nfl_ok & comb_fail)
comb_only = index_list(cfb_fail & nfl_fail & comb_ok)
none_only = index_list(cfb_fail & nfl_fail & comb_fail)

# Players with zero games and no NFL data (the column is converted once, not once per row)
games = convert(tdc['games'])
zerogames_nonfl = [i for i in tdc.index if (games[i] == 0) and nfl_fail[i]]

# Players without combine data
nocomb = index_list(comb_fail)




In [190]:
print(len(tdc),'\n',
len(cfb_nfl_comb),'\n',
len(nfl_comb),
len(cfb_nfl),
len(cfb_comb), '\n',
len(cfb_only),
len(nfl_only),
len(comb_only),
len(none_only), '\n',
len(zerogames_nonfl), len(nocomb))

393 
 262 
 25 26 36 
 16 11 6 11 
 49 64


In [177]:
262+25+26+36+16+11+6+11

393
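
The check that the eight groups partition the table can also be done by counting the combinations of three success flags. The sketch below uses a made-up four-row table and `str.startswith("success")` in place of the notebook's `first_word` helper, both of which are assumptions for illustration.

```python
import pandas as pd

# Toy status columns; the rows are made up.
toy = pd.DataFrame({
    "cfb method": ["success", "fail (eg link broken)", "success", "success"],
    "nfl method": ["success from main", "success from main2",
                   "fail after main1", "success from main3"],
    "combine method": ["success from merge", "success from search",
                       "fail after search (missing table)", "success from merge"],
})

# True where the status string starts with "success".
flags = toy.apply(lambda col: col.str.startswith("success"))
flags.columns = ["cfb", "nfl", "combine"]

# One entry per (cfb, nfl, combine) combination that occurs, with its size.
breakdown = flags.value_counts()
print(breakdown)

# Every row falls in exactly one combination, so the sizes sum to the row count.
assert breakdown.sum() == len(toy)
```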

### Cleaning one row after the other

In [None]:
# issues = ['draft age' #reads float, should be int, can be NaN
#           ,'entry age' #reads float, should be int, can be NaN
#           ,'last year' #same
#           ,'years as primary starter'] #same

In [155]:
print(j, tdc.columns[j], tdc.iloc[:,j].dtype)
j = j+1

13 years as primary starter float64
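
One possible way to handle columns that read as float64 but should be integers with missing values is pandas' nullable integer dtype `"Int64"`. The sketch below applies it to a made-up table that borrows two of the column names from the list of issues above; whether the notebook ends up using this dtype is not stated in the source.

```python
import numpy as np
import pandas as pd

# Toy columns that read as float because of the NaN entries.
toy = pd.DataFrame({
    "draft age": [21.0, 22.0, np.nan],
    "years as primary starter": [3.0, np.nan, 0.0],
})

int_like = ["draft age", "years as primary starter"]

# "Int64" (capital I) is the nullable integer dtype: NaN becomes <NA>
# and the remaining values are stored as integers.
toy = toy.astype({col: "Int64" for col in int_like})

print(toy.dtypes)
print(toy)
```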
