# Testing Notebook
## Ideas for this notebook
- Using data from [PFR](https://www.pro-football-reference.com/) and Beautiful Soup, I plan to download the stats from the 2021 NFL season to get a few insights into fantasy football trends
- Some stats I want to see:
  - Percentage of team target share per player (RB, TE, WR)
  - A correlation between target share and fantasy points
  - Team play breakdowns
    - Rushing, Passing, Punting, Kicking
    - Ratio of offensive and defensive plays
  - **TODO:** As ideas come, add them to list
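
The target-share idea above comes down to dividing each player's targets by his team's total targets. Here is a minimal sketch, assuming a dataframe with `Tm` and `Tgt` columns like the ones in the scrimmage table parsed further down. The rows and the `fantasy_pts` column are made-up placeholders, not real stats:

```python
import pandas as pd

# Made-up placeholder rows; the real frame would come from the parsed CSV.
toy = pd.DataFrame({
    "Player": ["A", "B", "C", "D", "E"],
    "Tm": ["XXX", "XXX", "XXX", "YYY", "YYY"],
    "Tgt": [100, 60, 40, 90, 30],
    "fantasy_pts": [250.0, 140.0, 90.0, 210.0, 80.0],  # hypothetical column
})

# Each player's share of his team's total targets.
team_targets = toy.groupby("Tm")["Tgt"].transform("sum")
toy["tgt_share"] = toy["Tgt"] / team_targets

# Pearson correlation between target share and fantasy points.
corr = toy["tgt_share"].corr(toy["fantasy_pts"])
```

`transform("sum")` keeps the team total aligned with each player row, so the division needs no merge step.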

## Code Section

### All Imports

In [33]:
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests
import hashlib

### Parsing Each Major Fantasy Stat From 2021 into CSVs
My goal with all of the major fantasy stat types (passing, rushing, receiving) is to parse them into CSV files for easy importing into pandas dataframes. Eventually I might switch the storage from CSVs to a SQL or NoSQL database, depending on how far I get into this.

#### 2021 Passing
The basic procedure for parsing the passing stats goes as follows:
1. Request the page from pro football reference using `requests.get`
2. Turn the page content into some beautiful soup
3. Get the body and find the table that we are looking for
4. Once we have the table, parse the header for all of the column names
5. Once the column names are parsed, iterate through each row and grab values for each player
   1. To make it easier to tell apart two players with the same name, I am tempted to hash each player's PFR link so that every player gets a unique ID. It would be so sad to hit a hash collision among the few players who have played in the league
6. Write all of the values to a file and save as csv
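
The hashing idea in step 5.1 can be sketched on its own. MD5 produces a 128-bit digest, so an accidental collision among a few thousand player links is vanishingly unlikely. The example link below is hypothetical, and `player_slug` is an alternative that is not used elsewhere in this notebook. It assumes the links end in a file name such as `<slug>.htm`:

```python
import hashlib

def player_id(href: str) -> str:
    """Return a stable hex ID for a player, derived from the profile link."""
    return hashlib.md5(href.encode("utf-8")).hexdigest()

def player_slug(href: str) -> str:
    """Alternative (assumption): use the file name of the link as the ID."""
    return href.rsplit("/", 1)[-1].removesuffix(".htm")

href = "/players/X/XxxxXx00.htm"  # hypothetical link, not a real player
hashed = player_id(href)          # 32 hex characters, same on every run
slug = player_slug(href)          # "XxxxXx00"
```

The same link always hashes to the same ID, so the ID can later be used to match a player across tables.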

##### 1. Request page
##### 2. Turn page into beautiful soup

In [8]:
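page = requests.get("https://www.pro-football-reference.com/years/2021/passing.htm")
page.raise_for_status()  # fail early if the request did not succeed
# Name the parser explicitly so the result does not depend on which parsers are installed
page_soup = BeautifulSoup(page.content, 'html.parser')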


In [10]:
print(page_soup.body)

<body class="pfr">
<!-- Google Tag Manager (noscript) -->
<noscript><iframe height="0" src="https://www.googletagmanager.com/ns.html?id=GTM-PSRGLHM" style="display:none;visibility:hidden" width="0"></iframe></noscript>
<!-- End Google Tag Manager (noscript) -->
<div id="wrap">
<div id="header" role="banner">
<ul class="notranslate" id="subnav">
<li><a href="https://www.sports-reference.com/"><svg height="15px" width="20px"><use xlink:href="#ic-sr-pennant"></use></svg> Sports Reference ®</a></li>
<li><a href="https://www.baseball-reference.com/">Baseball</a></li>
<li class="current"><a href="https://www.pro-football-reference.com/">Football</a> <a href="https://www.sports-reference.com/cfb/">(college)</a></li>
<li><a href="https://www.basketball-reference.com/">Basketball</a> <a href="https://www.sports-reference.com/cbb/">(college)</a></li>
<li><a href="https://www.hockey-reference.com/">Hockey</a></li>
<li><a href="https://fbref.com/it/">Calcio</a></li>
<li><a href="https://www.sports

##### 3. Grab table
##### 4. Parse headers from table

In [65]:
table = page_soup.find(id='passing')
header = table.find("thead")
columns = header.find_all('th')
parsed_lines = []
stat_types = []
for column in columns:
    if column.text == 'Rk':
        continue
    stat_types.append(column.text)
stat_types_str = 'id,' + ','.join(stat_types)
parsed_lines.append(stat_types_str)

##### 5. Parse players

In [62]:
table_body = table.find('tbody')
player_rows = table_body.find_all('tr')
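for player in player_rows:
    td_list = player.find_all('td')
    if not td_list:
        # Rows with no <td> cells (e.g. header rows repeated inside the body)
        # would only produce blank lines, so skip them
        continue
    td_text = []
    for td in td_list:
        if td['data-stat'] == 'player':
            # Prepend an ID hashed from the player's PFR link
            a = td.find('a')
            td_text = [hashlib.md5(a['href'].encode()).hexdigest()] + td_text
        td_text.append(td.text)
    parsed_lines.append(','.join(td_text))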


##### 6. Write to csv file

In [63]:
# The context manager closes the file even if a write fails
with open('data/passing-2021.csv', 'w') as out_file:
    for line in parsed_lines:
        out_file.write(line + '\n')
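
Joining fields with `','` works as long as no value contains a comma. As a hedged alternative, the standard-library `csv` module quotes such values automatically. The rows below are made up, and the in-memory buffer stands in for the real output file:

```python
import csv
import io

# Hypothetical rows: a header plus one player whose name contains a comma.
rows = [
    ["id", "Player", "Tm"],
    ["abc123", "Doe, John", "XXX"],
]

# Stands in for open('data/passing-2021.csv', 'w', newline='')
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerows(rows)

# Reading the text back recovers the original fields, comma included.
parsed = list(csv.reader(io.StringIO(buffer.getvalue())))
```

With this approach each row would be kept as a list of values rather than a pre-joined string.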

In [77]:
df_passing = pd.read_csv('data/passing-2021.csv')
df_passing.head(10)

Unnamed: 0,id,Player,Tm,Age,Pos,G,GS,QBrec,Cmp,Att,...,Y/G,Rate,QBR,Sk,Yds.1,Sk%,NY/A,ANY/A,4QC,GWD
0,cdb2ea417e9eba6b3f5b400c2d6ad5dc,Tom Brady*,TAM,44,QB,17,17,13-4-0,485,719,...,312.7,102.1,68.1,22,144,3.0,6.98,7.41,3.0,5.0
1,cf8559b75a6ca2666444e40301c1f658,Justin Herbert*,LAC,23,QB,17,17,9-8-0,443,672,...,294.9,97.7,65.6,31,214,4.4,6.83,6.95,5.0,5.0
2,441e8d598eebd985904aeb7ed56713de,Matthew Stafford,LAR,33,QB,17,17,12-5-0,404,601,...,287.4,102.9,63.8,30,243,4.8,7.36,7.45,3.0,4.0
3,5405d1bc6735bb6bc461228e11b93b33,Patrick Mahomes*,KAN,26,QB,17,17,12-5-0,436,658,...,284.6,98.5,62.2,28,146,4.1,6.84,7.07,3.0,3.0
4,55d10f9c888c5d2ad0209d4ce0016243,Derek Carr,LVR,30,QB,17,17,10-7-0,428,626,...,282.6,94.0,52.4,40,241,6.0,6.85,6.6,3.0,6.0
5,e2168ca4bcd47632f942a970b61e007b,Joe Burrow,CIN,25,QB,16,16,10-6-0,366,520,...,288.2,108.3,54.3,51,370,8.9,7.43,7.51,2.0,3.0
6,34532b329ccca52d2a94fca0611461f2,Dak Prescott,DAL,28,QB,16,16,11-5-0,410,596,...,278.1,104.2,54.6,30,144,4.8,6.88,7.34,1.0,2.0
7,e3abcd42d9eb213d6a0a91b18bc59c66,Josh Allen,BUF,25,QB,17,17,11-6-0,409,646,...,259.2,92.2,60.7,26,164,3.9,6.31,6.38,,
8,a47cce4d3b46564b75a2873bd42fb6ae,Kirk Cousins*,MIN,33,QB,16,16,8-8-0,372,561,...,263.8,103.1,52.3,28,197,4.8,6.83,7.42,3.0,4.0
9,2a74d26f6a750f33178624b75cd02e07,Aaron Rodgers*+,GNB,38,QB,16,16,13-3-0,366,531,...,257.2,111.9,69.1,30,188,5.3,7.0,8.0,1.0,2.0
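
The preview shows two things worth cleaning before analysis: some names end in `*` or `+`, and `QBrec` packs three numbers into one string. A minimal sketch, assuming the markers are award flags that are not part of the name and that `QBrec` is a win-loss-tie record. The `W`, `L`, and `T` column names are my own:

```python
import pandas as pd

# Three rows copied from the preview above.
sample = pd.DataFrame({
    "Player": ["Tom Brady*", "Matthew Stafford", "Aaron Rodgers*+"],
    "QBrec": ["13-4-0", "12-5-0", "13-3-0"],
})

# Strip the trailing markers from the names.
sample["Player"] = sample["Player"].str.rstrip("*+")

# Split the record string into three integer columns.
sample[["W", "L", "T"]] = sample["QBrec"].str.split("-", expand=True).astype(int)
```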


#### Rushing and Receiving Stats
Thankfully, PFR has a scrimmage stats page that contains a table with rushing, receiving, and total stats.  This will be much more efficient than parsing them individually.

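The process is the same as for the passing table, except that this table's header has two rows and several column names (such as `Yds` and `TD`) appear in both the receiving and rushing groups, so they need a prefix: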
1. Request the page from pro football reference using `requests.get`
2. Turn the page content into some beautiful soup
3. Get the body and find the table that we are looking for
4. Once we have the table, parse the header for all of the column names
5. Once the column names are parsed, iterate through each row and grab values for each player
   1. To make it easier to tell apart two players with the same name, I am tempted to hash each player's PFR link so that every player gets a unique ID. It would be so sad to hit a hash collision among the few players who have played in the league
6. Write all of the values to a file and save as csv

##### 1. Request page
##### 2. Cook up some soup

In [66]:
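non_passing_page = requests.get('https://www.pro-football-reference.com/years/2021/scrimmage.htm')
non_passing_page.raise_for_status()  # fail early if the request did not succeed
# Name the parser explicitly so the result does not depend on which parsers are installed
non_passing_soup = BeautifulSoup(non_passing_page.content, 'html.parser')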

##### 3. Grab Table
##### 4. Parse Headers

In [79]:
non_passing_table = non_passing_soup.find('table')
non_passing_thead = non_passing_table.find('thead')
non_passing_columns = non_passing_thead.find_all('tr')[1].find_all('th')
count = 0
non_passing_stat_types = []
non_passing_parsed_lines = []
repeat_stats = ['Yds', 'TD', '1D', 'Lng', 'Y/G']
for column in non_passing_columns:
    if column.text == 'Rk':
        continue
    if column.text == 'Yds':
        # Each 'Yds' column starts a new stat group: 1 = receiving, 2 = rushing
        count += 1
    if column.text in repeat_stats:
        match count:  # structural pattern matching needs Python 3.10+
            case 1:
                non_passing_stat_types.append('rec_'+column.text)
            case 2:
                non_passing_stat_types.append('rus_'+column.text)
            case _:
                non_passing_stat_types.append(column.text)
    else:
        non_passing_stat_types.append(column.text)
non_passing_stat_str = 'id,' + ','.join(non_passing_stat_types)
print(non_passing_stat_str)
non_passing_parsed_lines.append(non_passing_stat_str)

id,Player,Tm,Age,Pos,G,GS,Tgt,Rec,rec_Yds,Y/R,rec_TD,rec_1D,rec_Lng,R/G,rec_Y/G,Ctch%,Y/Tgt,Att,rus_Yds,rus_TD,rus_1D,rus_Lng,Y/A,rus_Y/G,A/G,Touch,Y/Tch,YScm,RRTD,Fmb
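
The prefixing logic above can also be written as a small reusable function. This is a sketch and not a replacement for the cell. The function name `label_repeats` and the shortened column list are my own, and it assumes, as the cell does, that each `Yds` column marks the start of a new stat group:

```python
def label_repeats(names, repeated, prefixes, marker="Yds"):
    """Prefix repeated column names by the group they fall in.

    A new group starts each time `marker` appears. Names in group i
    (1-based) get prefixes[i - 1] if one is defined, else stay unchanged.
    """
    group = 0
    labeled = []
    for name in names:
        if name == marker:
            group += 1
        if name in repeated and 1 <= group <= len(prefixes):
            labeled.append(prefixes[group - 1] + name)
        else:
            labeled.append(name)
    return labeled

# Shortened version of the scrimmage header for illustration.
raw = ["Player", "Tgt", "Rec", "Yds", "TD", "Att", "Yds", "TD", "Fmb"]
labeled = label_repeats(raw, {"Yds", "TD"}, ["rec_", "rus_"])
```

Keeping the prefixes in a list avoids the `match` statement, so it also runs on Python versions before 3.10.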


##### 5. Parse Players

In [80]:
tbody = non_passing_table.find('tbody')
non_passing_rows = tbody.find_all('tr')
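for player in non_passing_rows:
    td_list = player.find_all('td')
    if not td_list:
        # Rows with no <td> cells (e.g. header rows repeated inside the body)
        # would only produce blank lines, so skip them
        continue
    td_text = []
    for td in td_list:
        if td['data-stat'] == 'player':
            # Prepend an ID hashed from the player's PFR link
            a = td.find('a')
            td_text = [hashlib.md5(a['href'].encode()).hexdigest()] + td_text
        td_text.append(td.text)
    non_passing_parsed_lines.append(','.join(td_text))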

##### 6. Write to CSV File

In [81]:
# The context manager closes the file even if a write fails
with open('data/non-passing-2021.csv', 'w') as out_file:
    for line in non_passing_parsed_lines:
        out_file.write(line + '\n')

In [82]:
df_non_passing = pd.read_csv('data/non-passing-2021.csv')
df_non_passing.head(10)

Unnamed: 0,id,Player,Tm,Age,Pos,G,GS,Tgt,Rec,rec_Yds,...,rus_1D,rus_Lng,Y/A,rus_Y/G,A/G,Touch,Y/Tch,YScm,RRTD,Fmb
0,e580cfd0bf65ec9f4f676f56cf000249,Jonathan Taylor*+,IND,22,RB,17,17,51,40,360,...,107,83.0,5.5,106.5,19.5,372,5.8,2171,20,4
1,6278e5d665dba3ff9ddb5835f2ec6547,Cooper Kupp*+,LAR,28,WR,17,17,191,145,1947,...,1,18.0,4.5,1.1,0.2,149,13.2,1965,16,0
2,460c71e3a7ee131a9728667cb4a97f88,Deebo Samuel*+,SFO,25,WR,16,15,121,77,1405,...,21,49.0,6.2,22.8,3.7,136,13.0,1770,14,4
3,a172489685f49a85038ad5ff45c06e83,Najee Harris*,PIT,23,RB,17,17,94,74,467,...,62,37.0,3.9,70.6,18.1,381,4.4,1667,10,0
4,cefade4d9059c69bd4197c374e655bd9,Justin Jefferson*,MIN,22,WR,17,17,167,108,1616,...,1,11.0,2.3,0.8,0.4,114,14.3,1630,10,1
5,b027a5e91daf591818ba5e84d4e1aee1,Austin Ekeler,LAC,26,RB,16,16,94,70,647,...,53,28.0,4.4,56.9,12.9,276,5.6,1558,20,4
6,2b5bce06820e6d605ed6fd70b85b7baa,Davante Adams*+,GNB,29,WR,16,16,169,123,1553,...,0,,,,,123,12.6,1553,11,0
7,3f18080a9a24b14533379d5969adbb3a,Joe Mixon*,CIN,25,RB,16,16,48,42,314,...,60,32.0,4.1,75.3,18.3,334,4.5,1519,16,2
8,b456c34f16bc928851782b0fe38be90f,Ja'Marr Chase*,CIN,21,WR,17,17,128,81,1455,...,2,10.0,3.0,1.2,0.4,88,16.8,1476,13,2
9,42d95d5cd20bb00b0241eba902ef6359,Nick Chubb*,CLE,26,RB,14,14,25,20,174,...,61,70.0,5.5,89.9,16.3,248,5.8,1433,9,2
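
Because both CSVs carry the same hashed `id`, the two frames could be joined on it rather than on player names. A minimal sketch with made-up placeholder rows standing in for `df_passing` and `df_non_passing`. The suffix names are my own choice:

```python
import pandas as pd

# Made-up placeholder rows, not real stats.
passing = pd.DataFrame({
    "id": ["a1", "b2"],
    "Player": ["QB One", "QB Two"],
    "Yds": [4000, 3500],
})
scrimmage = pd.DataFrame({
    "id": ["a1", "c3"],
    "Player": ["QB One", "RB One"],
    "rus_Yds": [300, 1200],
})

# An outer join keeps players who appear in only one of the tables.
combined = passing.merge(
    scrimmage, on="id", how="outer", suffixes=("_pass", "_scrim")
)
```

Columns that exist in both frames, such as `Player`, come back with the suffixes attached, which makes it easy to check that the names agree for each `id`.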
