# Scraping the School Team History

### The Goal:
Scrape the Team History page to create a file that contains each historical stat as a DataFrame, with the statistic name as the key. This includes stats such as the number of tournaments won, the winningest seasons, the best Big Ten years, etc.

In [163]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

## Title Dictionary

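This part fetches the URL and scrapes it for the historical titles UIUC has won. Two helper functions are defined to fix the DataFrames after scraping, because the header row is often included in the data; they promote the first row to column names. The loop below stores the raw frames without calling them, so the examples further down still show integer column labels. Each title name is used as a key, and the value is that title's data as a DataFrame.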

In [164]:
url = 'https://fightingillini.com/sports/2021/4/30/mens-basketball-history'


def set_first_row_as_header(df):
    df = df.copy()  # avoid modifying original
    df.columns = df.iloc[0]
    return df[1:].reset_index(drop=True)

def try_fix_header(df):
    if df.shape[0] > 1:
        return set_first_row_as_header(df)
    return df


# Scrape the webpage; fail early on HTTP errors instead of parsing an error page
res = requests.get(url, timeout=30)
res.raise_for_status()
soup = BeautifulSoup(res.text, 'html.parser')

all_tables = soup.find_all("table", class_="release")
titles_dict = {}

for table in all_tables:
    rows = table.find_all("tr")
    if not rows:
        continue

    # Try to extract a title from the first <th>
    header_cells = rows[0].find_all("th")
    if not header_cells:
        continue

    title = header_cells[0].get_text(strip=True).replace('\xa0', ' ')

    # Grab all rows
    raw_rows = []
    for tr in table.find_all("tr"):
        cols = [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
        if cols:
            raw_rows.append(cols)

    # Drop the first row (the section title) whenever the table has more than one row
    if len(raw_rows) > 1:
        raw_rows = raw_rows[1:]

    # Create DataFrame
    df = pd.DataFrame(raw_rows)

    # Save to dictionary
    titles_dict[title] = df
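
The header helpers defined in this cell can be tried on their own. The following is a minimal sketch of their behavior on a small hand-built frame; the header labels `Rank`, `Pct`, `Record` and `Season` are assumptions made for illustration and are not confirmed column names from the page, and the data rows are copied from Example 1 as illustrative strings.

```python
import pandas as pd


def set_first_row_as_header(df):
    df = df.copy()  # avoid modifying the caller's frame
    df.columns = df.iloc[0]  # promote the first row to column labels
    return df[1:].reset_index(drop=True)


def try_fix_header(df):
    # Only promote a header when there is at least one data row left over
    if df.shape[0] > 1:
        return set_first_row_as_header(df)
    return df


# Hypothetical header row (assumed labels) followed by two data rows
raw = pd.DataFrame([
    ["Rank", "Pct", "Record", "Season"],
    ["2", "0.949", "37-2", "2005"],
    ["3", "0.944", "17-1", "1943"],
])
fixed = try_fix_header(raw)

# A single-row frame is returned unchanged, so its only row is not lost
single = try_fix_header(pd.DataFrame([["1915"]]))

print(list(fixed.columns))
print(fixed.shape, single.shape)
```

If this behavior is wanted for the scraped tables, one option would be to store `try_fix_header(df)` instead of `df` in `titles_dict`, but only for tables whose first remaining row really is a header.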


## Example 1
A multi-row DataFrame.

In [165]:
titles_dict['Winningest Seasons – By Win Percentage']

Unnamed: 0,0,1,2,3
0,1.0,1.0,16-0,1915
1,2.0,0.949,37-2,2005
2,3.0,0.944,17-1,1943
3,4.0,0.861,31-5,1989
4,5.0,0.846,22-4,1952
5,6.0,0.84,21-4,1949
6,7.0,0.839,26-5,1984
7,8.0,0.818,18-4,1956
8,,0.818,18-4,1953
9,10.0,0.815,22-5,1951


## Example 2
A single-row DataFrame.

In [166]:
titles_dict['1* National Title']

Unnamed: 0,0
0,1915


## Exporting
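This exports the dictionary with pickle. It cannot be passed directly to a JSON writer because the values are DataFrames. Note that the output file keeps a `.json` name even though its contents are pickled, so it must be read back with `pickle.load`, not a JSON parser.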

In [168]:
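import os
import pickle

# Build the path with os.path.join: "data\processed" contains the invalid
# escape sequence "\p" and only works as a path separator on Windows
output_dir = os.path.join("data", "processed")

os.makedirs(output_dir, exist_ok=True)
# The contents are pickled bytes despite the .json file name
file_path = os.path.join(output_dir, 'mbb_history.json')
with open(file_path, 'wb') as f:
    pickle.dump(titles_dict, f)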
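
If a real JSON file is preferred over a pickle, each DataFrame can first be converted to plain lists and dictionaries. The sketch below is one possible approach, not what the notebook does: the helper names `frames_to_json` and `json_to_frames` are hypothetical, and it relies on pandas' `DataFrame.to_json(orient="split")` layout with `columns`, `index` and `data` entries.

```python
import json

import pandas as pd


def frames_to_json(frames):
    """Serialize a {title: DataFrame} mapping to a JSON string (hypothetical helper)."""
    payload = {
        # to_json handles pandas/NumPy types; json.loads turns it back into plain objects
        title: json.loads(df.to_json(orient="split"))
        for title, df in frames.items()
    }
    # ensure_ascii=False keeps characters such as the en dash in titles readable
    return json.dumps(payload, ensure_ascii=False, indent=2)


def json_to_frames(text):
    """Rebuild the {title: DataFrame} mapping from frames_to_json output."""
    payload = json.loads(text)
    return {
        title: pd.DataFrame(entry["data"], index=entry["index"], columns=entry["columns"])
        for title, entry in payload.items()
    }


# Small stand-in for titles_dict, using the value shown in Example 2
sample = {"1* National Title": pd.DataFrame([["1915"]])}
text = frames_to_json(sample)
restored = json_to_frames(text)
print(restored["1* National Title"])
```

The resulting string could be written with `open(path, "w", encoding="utf-8")`. The trade-off is that JSON is readable by other tools, while pickle preserves the DataFrame objects exactly but should only be loaded from trusted sources.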