# Dataset: All Conference
#### Sources: Wikipedia, Manual Data Aggregation (CSV)
> The All Conference dataset works a bit differently from the others.  Half of the data comes from Wikipedia, but Wikipedia doesn't have standard pages for every conference.  For the remaining conferences, the data was gathered manually from multiple disparate sources and normalized into a single CSV file.  This notebook steps through getting data from both sources, then normalizing and integrating the results before summarizing them for the final linking step.

In [None]:
#hide
import core_constants as cc
import functions as fx
import requests
import lxml
import time
import json
import pandas as pd
import sqlite3 as sql

## Set Notebook Settings

In [None]:
# Load the page titles you are interested in: [conference key, Wikipedia page title]
pageList = [
    ['sec', 'All-SEC_football_team'],
    ['bigtwelve', 'All-Big_12_Conference_football_team'],
    ['bigten', 'All-Big_Ten_Conference_football_team'],
]
years = cc.get_defYears()
headers = cc.get_header()
csvFile = "..//scrapedData//allConf.csv"

## Get & Save the Wikipedia HTML
#### Source: https://en.wikipedia.org/wiki/2020_Big_Ten_Conference_football_season#All-Conference_Teams
> These Wikipedia pages contain the all-conference records.  We cycle through by conference; these three conferences follow the same page layout and URL scheme for each year we care about.

##### To run - convert the below cell back to code

fx.get_WikipediaAllConf(pageList, headers, years)
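
The internals of `fx.get_WikipediaAllConf` are not shown in this notebook. As a rough sketch of the "cycle through by conference and year" idea, the snippet below only builds the page URLs. It assumes each page is titled `<year>_<title>` (for example `2020_All-SEC_football_team`); `WIKI_BASE` and `build_allconf_urls` are hypothetical names, not part of `functions`.

```python
from urllib.parse import quote

# Assumed base URL for English Wikipedia articles.
WIKI_BASE = "https://en.wikipedia.org/wiki/"


def build_allconf_urls(page_list, years):
    """Return (conference, year, url) tuples for every conference/year pair.

    Assumes the page title is the year, an underscore, then the title
    from page_list. The real helper may build its URLs differently.
    """
    urls = []
    for conference, title in page_list:
        for year in years:
            urls.append((conference, year, WIKI_BASE + quote(f"{year}_{title}")))
    return urls


# Small example using one entry in the same shape as pageList.
example_pages = [["sec", "All-SEC_football_team"]]
for conference, year, url in build_allconf_urls(example_pages, [2019, 2020]):
    print(conference, year, url)
```

Each URL could then be fetched with `requests.get(url, headers=headers)` and the response text written to the matching conference folder, ideally with a short `time.sleep` between requests.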

### Process & Save Big Ten/Big 12 Data from Wikipedia
> The HTML files are already saved locally.  This step processes them and saves the results to a JSON file.

In [None]:
teamDir = ['..//html//wikipedia//allconference//bigten//', '..//html//wikipedia//allconference//bigtwelve//']

with open("..//scrapedData//Ten12AllConf.json", "w") as write_file:
    json.dump(fx.process_WikipediaBigTenBigTwelve(teamDir), write_file)

### Process & Save SEC Data from Wikipedia
> The HTML files are already saved locally.  This step processes them and saves the results to a JSON file.

In [None]:
teamDir = '..//html//wikipedia//allconference//sec//'

with open("..//scrapedData//SECAllConf.json", "w") as write_file:
    json.dump(fx.process_WikipediaSEC(teamDir), write_file)

## Get & Process the CSV File for the other Conferences

> The CSV file contains the all-conference records for the Pac-12, the ACC and the Group of Five conferences.

##### Input: ..//scrapedData//allConf.csv
##### Output: ..//scrapedData//allConf.json

In [None]:
with open("..//scrapedData//allConf.json", "w") as write_file:
    write_file.write(json.dumps(fx.process_csvAllConf(fx.get_csvAllConf(csvFile))))
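
For readers without access to `functions`, here is a minimal sketch of the CSV-to-JSON idea using only the standard library. The column names in the sample are hypothetical (the real layout of `allConf.csv` is not shown here), and `csv_to_records` is an illustrative helper, not the notebook's `fx.get_csvAllConf` or `fx.process_csvAllConf`.

```python
import csv
import io
import json


def csv_to_records(handle):
    """Read CSV rows into dicts, trimming whitespace around headers and values."""
    reader = csv.DictReader(handle)
    records = []
    for row in reader:
        records.append(
            {
                key.strip(): (value or "").strip()
                for key, value in row.items()
                if key is not None  # skip overflow columns that have no header
            }
        )
    return records


# Hypothetical sample; the real column names in allConf.csv may differ.
sample = io.StringIO("year,conference,player,team\n2020, acc ,Example Player,Example U\n")
records = csv_to_records(sample)
print(json.dumps(records, indent=2))
```

Passing an open file handle instead of `io.StringIO` would work the same way.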

## Summarize the Dataset

> We don't need repeated fields across the various datasets (i.e., I don't need height coming back from three sources).  This step strips each record down to only the fields I care about for the master printout.

In [None]:
outputDir = '..//summarizedData//'
dataset = 'allConf'

with open(outputDir + dataset + ".json", "w", encoding="utf-8") as write_file:
    write_file.write(json.dumps(fx.summarize_allConf()))
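
The stripping step can be pictured as a simple field filter. The sketch below is an assumption about the general approach, not the body of `fx.summarize_allConf`. The field names and the `summarize_records` helper are hypothetical.

```python
def summarize_records(records, keep_fields):
    """Return copies of the records that contain only the fields in keep_fields.

    Fields missing from a record come back as None.
    """
    return [
        {field: record.get(field) for field in keep_fields}
        for record in records
    ]


# Hypothetical record with a field ("height") that other datasets already provide.
full = [{"player": "Example Player", "team": "Example U", "year": 2020, "height": "6-2"}]
summary = summarize_records(full, ["player", "team", "year"])
print(summary)
```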

## Save to DB

In [None]:
fx.toDB_AllConference()

'DB Write is done'
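
What `fx.toDB_AllConference` does internally is not shown. As a hedged illustration of writing summarized records to SQLite with the `sqlite3` module imported at the top of this notebook, the sketch below uses an in-memory database. The table name, columns and `write_records` helper are all hypothetical.

```python
import sqlite3


def write_records(conn, records):
    """Create the table if needed and insert one row per record."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS all_conference "
        "(player TEXT, team TEXT, year INTEGER)"
    )
    # Named placeholders let each dict map directly onto the columns.
    conn.executemany(
        "INSERT INTO all_conference (player, team, year) "
        "VALUES (:player, :team, :year)",
        records,
    )
    conn.commit()


# In-memory database so the example leaves no file behind.
conn = sqlite3.connect(":memory:")
write_records(conn, [{"player": "Example Player", "team": "Example U", "year": 2020}])
row_count = conn.execute("SELECT COUNT(*) FROM all_conference").fetchone()[0]
print(row_count)
```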