# Intro

This notebook was created to pull daily NET rankings from the NCAA website for data aggregation.

The NCAA publishes daily NET rankings by updating a single page, so historical rankings can't be viewed in that easily scrapable form.

The NCAA does have an archive of daily NET rankings, called the 'Nitty Gritty'. However, this archive presents the data in a way that isn't easily digestible: a table embedded in a PDF hosted on the web.

I was determined to find a way to aggregate historical NET ranking data without scraping the NET rankings EVERY day and relying on a scheduled script. Here is a way to scrape the data from the daily PDF files instead!

From a high level, this is accomplished by:
* Understanding the structure of the daily NET ranking PDF URL, then building the PDF file URLs
* Using Requests to download all existing NET ranking PDF files from those URLs
* Checking each downloaded file's size to test whether there is an updated ranking for that date
* Using Tabula's read_pdf to read the locally downloaded NET ranking PDF files in as DataFrames
    * To install Tabula and use read_pdf in a Jupyter Notebook: conda install -c conda-forge tabula-py
    * More on Tabula: https://pypi.org/project/tabula-py/
* Cleaning the DataFrames (sort, slice, format, etc.)

In [3]:
url1 = 'http://extra.ncaa.org/solutions/rpi/Stats%20Library/NET%20Nitty%20Gritty%20-%20Dec.%2017,%202019.pdf'

base = 'http://extra.ncaa.org/solutions/rpi/Stats%20Library/NET%20Nitty%20Gritty%20-%20'
month = 'Dec'
sep1 = '.%20'
day = '17'
sep2 = ',%20'
year = '2019'
end = '.pdf'

# Reassemble the pieces; the result should match url1
url = base + month + sep1 + day + sep2 + year + end
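
The pattern is easier to see once the percent-encoding is decoded: `%20` is an encoded space. A minimal sketch using the standard library's `urllib.parse.unquote` (the names `decoded` and `file_part` are illustrative and not part of the notebook):

```python
from urllib.parse import unquote

url1 = 'http://extra.ncaa.org/solutions/rpi/Stats%20Library/NET%20Nitty%20Gritty%20-%20Dec.%2017,%202019.pdf'

# Decode percent-escapes such as %20 (space) to get the human-readable URL
decoded = unquote(url1)

# Keep only the part after the last slash, i.e. the PDF file name
file_part = decoded.rsplit('/', 1)[-1]
print(file_part)  # NET Nitty Gritty - Dec. 17, 2019.pdf
```

So the only parts of the file name that change from day to day are the month abbreviation, the day, and the year.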

In [99]:
months = ['Dec', 'Jan', 'Feb', 'Mar', 'Apr']
days = list(range(1,32))
years = ['2019', '2020']

# Build PDF file links

In [97]:

url_list = []
year = '2019'
month = 'Dec'
#December (start at the 17th, the date of the first file above)
for x in days[16:]:
    data = base+month+sep1+str(x)+sep2+year+end
    url_list.append(data)

In [102]:
#2020

year = '2020'
month = 'Jan'
#Jan
for x in days[:]:
    data = base+month+sep1+str(x)+sep2+year+end
    url_list.append(data)
        

#Feb (2020 is a leap year, so include the 29th)
month = 'Feb'
for x in days[:29]:
    data = base+month+sep1+str(x)+sep2+year+end
    url_list.append(data)
    
#Mar
month = 'Mar'
for x in days:
    data = base+month+sep1+str(x)+sep2+year+end
    url_list.append(data)
    
#Apr
month = 'Apr'
for x in days[:30]:
    data = base+month+sep1+str(x)+sep2+year+end
    url_list.append(data)
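
As an alternative to slicing `days` by hand for each month, the same list could be generated from real calendar dates, which handles month lengths and leap years automatically. This is only a sketch: `build_url`, `build_url_range`, `MONTH_ABBR`, and `season_urls` are hypothetical names, and it assumes every file follows the pattern seen above (three-letter month abbreviation, a period, and a day with no leading zero).

```python
from datetime import date, timedelta

BASE = 'http://extra.ncaa.org/solutions/rpi/Stats%20Library/NET%20Nitty%20Gritty%20-%20'

# Hard-coded so the result does not depend on the system locale.
# Assumption: months other than Dec-Apr would follow the same abbreviation style.
MONTH_ABBR = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
              'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']


def build_url(d):
    """Build the PDF URL for one date, e.g. '...Dec.%2017,%202019.pdf'."""
    return f"{BASE}{MONTH_ABBR[d.month - 1]}.%20{d.day},%20{d.year}.pdf"


def build_url_range(start, end):
    """Build one URL per calendar day from start to end, inclusive."""
    n_days = (end - start).days + 1
    return [build_url(start + timedelta(days=i)) for i in range(n_days)]


season_urls = build_url_range(date(2019, 12, 17), date(2020, 4, 30))
print(len(season_urls))
print(season_urls[0])
```

Because the dates come from `datetime`, Feb. 29, 2020 is included without any special handling.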

In [130]:
url_list

['http://extra.ncaa.org/solutions/rpi/Stats%20Library/NET%20Nitty%20Gritty%20-%20Dec.%2017,%202019.pdf',
 'http://extra.ncaa.org/solutions/rpi/Stats%20Library/NET%20Nitty%20Gritty%20-%20Dec.%2018,%202019.pdf',
 'http://extra.ncaa.org/solutions/rpi/Stats%20Library/NET%20Nitty%20Gritty%20-%20Dec.%2019,%202019.pdf',
 'http://extra.ncaa.org/solutions/rpi/Stats%20Library/NET%20Nitty%20Gritty%20-%20Dec.%2020,%202019.pdf',
 'http://extra.ncaa.org/solutions/rpi/Stats%20Library/NET%20Nitty%20Gritty%20-%20Dec.%2021,%202019.pdf',
 'http://extra.ncaa.org/solutions/rpi/Stats%20Library/NET%20Nitty%20Gritty%20-%20Dec.%2022,%202019.pdf',
 'http://extra.ncaa.org/solutions/rpi/Stats%20Library/NET%20Nitty%20Gritty%20-%20Dec.%2023,%202019.pdf',
 'http://extra.ncaa.org/solutions/rpi/Stats%20Library/NET%20Nitty%20Gritty%20-%20Dec.%2024,%202019.pdf',
 'http://extra.ncaa.org/solutions/rpi/Stats%20Library/NET%20Nitty%20Gritty%20-%20Dec.%2025,%202019.pdf',
 'http://extra.ncaa.org/solutions/rpi/Stats%20Library/N

# Download PDF locally and define naming convention
###### Specify subset of url_list you want to use - denoted by urls_test

In [136]:
import requests
urls_test = url_list[:6]

file_list = [] 

for x in urls_test:
    response = requests.get(x)
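    # base is 79 characters long, so the month abbreviation is x[79:82].
    # The day runs from index 86 up to the comma, which works for 1- and 2-digit days.
    day = x[86:x.index(',', 86)]
    name = 'NET' + x[79:82] + day + '.pdf'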
    file_list.append(name)
    with open(name, 'wb') as f:
        f.write(response.content)

# Check if the downloaded PDF file is valid. We "create" a filename for every day for simplicity, but not every day is associated with a new NET file.

##### More simply put, the NCAA doesn't upload a new file every day. We can figure out the valid new files by checking the file sizes

In [191]:
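import os
# Files of 788 bytes or less are treated as "no ranking published for that date".
# Build a new list rather than removing items while iterating, which skips elements.
file_list = [x for x in file_list if os.path.getsize(x) > 788]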
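
The size threshold works for these files, but it depends on how large the server's placeholder response happens to be. Another way to test validity is to check the first bytes of the file, since PDF files begin with the signature `%PDF-`. The sketch below is not part of the original workflow; `looks_like_pdf` is a hypothetical helper, and the two demo files are written to a temporary directory purely for illustration.

```python
import os
import tempfile

PDF_MAGIC = b'%PDF-'


def looks_like_pdf(path):
    """Return True if the file starts with the PDF signature."""
    with open(path, 'rb') as f:
        return f.read(len(PDF_MAGIC)) == PDF_MAGIC


# Demo with two throwaway files: one PDF-like, one HTML error page
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, 'good.pdf')
    bad = os.path.join(tmp, 'bad.pdf')
    with open(good, 'wb') as f:
        f.write(b'%PDF-1.4\nrest of file')
    with open(bad, 'wb') as f:
        f.write(b'<html>Not Found</html>')

    # Keep only the files that pass the signature check
    valid = [os.path.basename(p) for p in (good, bad) if looks_like_pdf(p)]

print(valid)
```

The same check could be applied to `file_list` in place of the size comparison.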

# Loop through locally downloaded PDFs, convert to DataFrame, clean and save again as a CSV

In [171]:
import tabula
import pandas

In [172]:
from tabula import read_pdf

In [193]:
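for x in file_list:
    # pages="all" parses every page; header=None keeps the first row as data
    df_temp = read_pdf(input_path = x, pandas_options = ({'header' : None}), pages ="all")
    # Recent tabula-py versions return a list of DataFrames; combine them into one
    if isinstance(df_temp, list):
        df_temp = pandas.concat(df_temp, ignore_index = True)

    df1_temp = df_temp.rename(columns = {0: "School", 1: "Rank", 2: "Avg. Opponent NET",  3: "Avg. Opponent Rank", 4: "Record", 5: "Conf Record", 6: "Non-Conf Record", 7: "Road Record",  8: "SOS", 9: "NC SOS", 10: "Q1 Rec", 11: "Q2 Rec", 12: "Q3 Rec", 13: "Q4 Rec"})

    df1_temp['Rank'] = df1_temp['Rank'].astype(int)
    # Drop the first two characters of each school name
    df1_temp['School'] = df1_temp['School'].str[2:]
    # Save next to the PDF with the same base name, e.g. NETDec17.csv
    save_name = x[:-4] + '.csv'
    df1_temp.to_csv(save_name)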
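
The cleaning steps can be tried without any PDF at all. The sketch below runs them on made-up, header-less "pages" shaped like parsed tables; the rows, the three-column layout, and the `clean_pages` helper are all invented for illustration and are not real NET data.

```python
import pandas as pd

# Two made-up "pages" with integer column labels, as produced when header=None
page1 = pd.DataFrame([['School A', '2', '10-1'], ['School B', '1', '11-0']])
page2 = pd.DataFrame([['School C', '3', '9-2']])

# Shortened column mapping for the toy data (the real files have 14 columns)
COLUMN_NAMES = {0: 'School', 1: 'Rank', 2: 'Record'}


def clean_pages(pages):
    """Stack per-page frames, name the columns, and sort by integer rank."""
    df = pd.concat(pages, ignore_index=True)
    df = df.rename(columns=COLUMN_NAMES)
    df['Rank'] = df['Rank'].astype(int)
    return df.sort_values('Rank').reset_index(drop=True)


cleaned = clean_pages([page1, page2])
print(cleaned)
```

Casting `Rank` to an integer before sorting matters: sorted as strings, a rank of '10' would come before '2'.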