# Dataset: All Conference
#### Sources: Wikipedia, Manual Data Aggregation (CSV)
> The All Conference dataset works a bit differently.  Half of the data comes from Wikipedia, but Wikipedia doesn't have standard representations of all the conferences.  Instead, the rest of the data was manually gathered from multiple disparate sources and normalized into a single CSV file.  This notebook steps through getting data from both sources, then normalizing and integrating the findings before summarizing for the final linking step.

In [12]:
#hide
import core_constants as cc
import functions as fx
import requests
import lxml
import time
import json
import pandas as pd
import sqlite3 as sql
import recordlinkage
import csv

## Set Notebook Settings

In [13]:
# Wikipedia page titles to fetch, as [conference key, page title] pairs
pageList = [
    ['sec', 'All-SEC_football_team'],
    ['bigtwelve', 'All-Big_12_Conference_football_team'],
    ['bigten', 'All-Big_Ten_Conference_football_team'],
]
years = cc.get_defYears()
headers = cc.get_header()
csvFile = "..//scrapedData//allConf.csv"
dataset = 'AllConference'

## Get & Save the Wikipedia HTML
#### Source: https://en.wikipedia.org/wiki/2020_Big_Ten_Conference_football_season#All-Conference_Teams
> These Wiki pages contain the all-conference records.  We cycle through by conference; these three conferences follow the same page layout and URL schema for each year we care about.

##### To run, convert the cell below back to code

In [None]:
fx.get_WikipediaAllConf(pageList, headers, years)
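
The fetch step can be pictured as building one URL per conference and year. The sketch below is not the body of `fx.get_WikipediaAllConf`; it assumes that each page lives at `https://en.wikipedia.org/wiki/<year>_<title>` and that `years` is an iterable of integers. The name `build_allconf_urls` is hypothetical and used only for illustration.

```python
WIKI_BASE = "https://en.wikipedia.org/wiki/"  # assumed base URL


def build_allconf_urls(page_list, years):
    """Return (conference, year, url) tuples for every conference/year pair.

    Assumes the page name is the year, an underscore, then the title
    taken from page_list (for example 2020_All-SEC_football_team).
    """
    urls = []
    for conference, title in page_list:
        for year in years:
            urls.append((conference, year, f"{WIKI_BASE}{year}_{title}"))
    return urls


# Illustrative call with one conference and two seasons
page_list = [["sec", "All-SEC_football_team"]]
for conference, year, url in build_allconf_urls(page_list, [2019, 2020]):
    print(conference, year, url)
```

Keeping URL construction separate from the request loop makes it easy to check the URL schema before any HTML is downloaded.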

## Process B1G, SEC and Big 12 from Wikipedia
> Error outputs are caught exceptions.  The JSON file will still write out, but you can go back to Wikipedia to figure out why a record failed.

In [4]:
teamDir = ['..//html//wikipedia//allconference//bigten//', '..//html//wikipedia//allconference//bigtwelve//', '..//html//wikipedia//allconference//sec//']

with open("..//scrapedData//Wiki_AllConf.json", "w") as write_file:
    json.dump(fx.process_wikiConferences(teamDir), write_file)

ERROR - Couldn"t write record: , Ole Miss (ESPN) Year:2015
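
The "report and keep going" behavior described above can be sketched as a per-record `try`/`except`. This is a minimal illustration, not the logic inside `fx.process_wikiConferences`; the function name `process_records` and the field names `player`, `college` and `year` are assumptions, and the sample rows are placeholders.

```python
import json


def process_records(raw_records):
    """Keep records that parse cleanly; report the rest and carry on."""
    processed = []
    for raw in raw_records:
        try:
            processed.append({
                "player": raw["player"],
                "college": raw["college"],
                "year": int(raw["year"]),
            })
        except (KeyError, ValueError) as err:
            # One bad row is reported but does not stop the run.
            print(f"ERROR - Couldn't write record: {raw} ({err!r})")
    return processed


# Placeholder rows: the second one is missing the player name
sample = [
    {"player": "Example Player", "college": "Example College", "year": "2015"},
    {"college": "Example College", "year": "2015"},
]
print(json.dumps(process_records(sample)))
```

Because failures are caught per record, the JSON output still contains every record that parsed, and the printed errors point to the pages worth rechecking by hand.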


## Get & Process the CSV File for the other Conferences

> The CSV file contains the all-conference records for the Pac-12, the ACC and the Group of Five conferences.

##### Input: //scrapedData/allConf.csv
##### Output: //scrapedData/allConf.json

In [5]:
with open("..//scrapedData//allConf.json", "w") as write_file:
    write_file.write(json.dumps(fx.process_csvAllConf(fx.get_csvAllConf(csvFile))))
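
As a rough picture of the CSV-to-JSON step, the sketch below reads a CSV with a header row into a list of dicts and serializes it. It is not the implementation of `fx.get_csvAllConf` or `fx.process_csvAllConf`; the column names and the sample row are hypothetical, since the real layout of `allConf.csv` is not shown in this notebook.

```python
import csv
import io
import json


def csv_to_records(handle):
    """Read a CSV with a header row into a list of dicts, trimming whitespace."""
    reader = csv.DictReader(handle)
    return [
        {key.strip(): value.strip() for key, value in row.items()}
        for row in reader
    ]


# Hypothetical columns and a placeholder row, held in memory for the example
sample_csv = io.StringIO(
    "Conference,Year,Player,College\n"
    "pac12, 2020 ,Example Player,Example College\n"
)
records = csv_to_records(sample_csv)
print(json.dumps(records, indent=2))
```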

## Clear DB
> Useful for a clean start.  This removes all of the records for this dataset from the following structures: SourcedPlayers, RecordLinks.  All of the Views auto-cleanse themselves.

In [None]:
fx.clearDB(dataset)
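
A clean-up of this kind can be sketched as two parameterized `DELETE` statements. This is an assumption-laden illustration rather than the body of `fx.clearDB`: it supposes that rows in both tables are tagged by a numeric `KeyDataSet` column, uses simplified table schemas, and runs against an in-memory database. The helper name `clear_dataset` is hypothetical.

```python
import sqlite3


def clear_dataset(conn, key_dataset):
    """Delete every row tagged with key_dataset from both tables."""
    with conn:  # commits on success, rolls back on error
        for table in ("SourcedPlayers", "RecordLinks"):
            conn.execute(f"DELETE FROM {table} WHERE KeyDataSet = ?", (key_dataset,))


# Simplified, assumed schemas in a throwaway in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SourcedPlayers (ID TEXT, KeyDataSet INTEGER)")
conn.execute("CREATE TABLE RecordLinks (MasterID TEXT, TargetID TEXT, KeyDataSet INTEGER)")
conn.executemany("INSERT INTO SourcedPlayers VALUES (?, ?)", [("a", 1), ("b", 4)])
conn.executemany("INSERT INTO RecordLinks VALUES (?, ?, ?)", [("m", "a", 1), ("m", "b", 4)])
conn.commit()

clear_dataset(conn, 4)
print(conn.execute("SELECT COUNT(*) FROM SourcedPlayers").fetchone()[0])
```

Views need no separate clean-up because they are recomputed from the underlying tables on every query.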

## Save to DB

In [6]:
fx.toDB_AllConference()

'DB Write is done'

## Strict Matching
> This saves to RecordLinks where ID == ID, but returns IDYR as the matching TargetID.

In [7]:
fx.literalLinking(dataset)

Connected to SQLite
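
The idea of matching on `ID` while storing `IDYR` as the link target can be sketched with a plain SQL join. This is only an illustration of the technique, not the query inside `fx.literalLinking`; the table names `MasterPlayers` and `AllConf`, their columns, and the sample IDs are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MasterPlayers (ID TEXT, IDYR TEXT)")
conn.execute("CREATE TABLE AllConf (ID TEXT)")
conn.executemany(
    "INSERT INTO MasterPlayers VALUES (?, ?)",
    [("exampleplayer_examplecollege", "exampleplayer_examplecollege_2016")],
)
conn.executemany(
    "INSERT INTO AllConf VALUES (?)",
    [("exampleplayer_examplecollege",), ("otherplayer_othercollege",)],
)

# Exact match on ID; the result carries the master row's IDYR as the target.
links = conn.execute(
    """
    SELECT a.ID, m.IDYR
    FROM AllConf AS a
    JOIN MasterPlayers AS m ON a.ID = m.ID
    """
).fetchall()
print(links)
```

Rows without an exact `ID` match simply drop out of the join, which leaves them for the fuzzy-matching step.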


## Fuzzy Matching w/ Threshold
> The intent is to automatically push fuzzy matches above a certain threshold into the DB without the need for review.

> LIES - currently all this does is create a dataframe that is used to create the annotation file.  The thresholds are hardcoded in the source function and need to be cleaned up.

In [14]:
fuzzyDF = fx.doFuzzyMatching(dataset, 'Sports247')
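
To make the threshold idea concrete, here is a minimal sketch that exposes the threshold as a parameter instead of hardcoding it. It swaps in the standard library's `difflib.SequenceMatcher` for the string comparison, which is not necessarily what `fx.doFuzzyMatching` uses. The function name `fuzzy_candidates`, the default threshold of 0.85, and the made-up names are assumptions for illustration.

```python
from difflib import SequenceMatcher


def fuzzy_candidates(names_a, names_b, threshold=0.85):
    """Return (id_a, id_b, score) for every pair scoring at or above threshold."""
    matches = []
    for id_a, name_a in names_a.items():
        for id_b, name_b in names_b.items():
            score = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
            if score >= threshold:
                matches.append((id_a, id_b, round(score, 3)))
    return matches


# Made-up names, keyed by made-up record IDs
names_a = {"a1": "John Smith", "a2": "Mike Jones"}
names_b = {"b1": "Jon Smith", "b2": "Zed Quark"}
print(fuzzy_candidates(names_a, names_b))
```

Comparing every pair is quadratic in the number of records, so a real run would normally block on a field such as college or year first.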

## Create the Annotation File
> This converts the dataframe into the MultiIndex that the annotation function requires.

> Don't forget to change the slice length in `fuzzyMI[0:300]` below, based on how large fuzzyDF is.

In [None]:
conn = sql.connect(cc.databaseName)

# KeyDataSet = 1 rows (labelled "Master" in the annotation file), indexed by IDYR
sql_query = pd.read_sql_query('''
                               SELECT
                               *
                               FROM SourcedPlayers
                               WHERE KeyDataSet = 1
                               ''', conn)

df_247 = pd.DataFrame(sql_query, columns=['IDYR', 'College', 'Year', 'Position'])
df_247.set_index('IDYR', append=False, inplace=True)

# All Conference records that are not linked yet, indexed by ID
sql_query = pd.read_sql_query('''
                               SELECT
                               *
                               FROM UnlinkedAllConference
                               ''', conn)

df_AllConference = pd.DataFrame(sql_query, columns=['ID', 'College'])
df_AllConference.set_index('ID', append=False, inplace=True)
conn.close()

# Candidate pairs from the fuzzy match; adjust the slice to the size of fuzzyDF
fuzzyMI = pd.MultiIndex.from_frame(fuzzyDF)
recordlinkage.write_annotation_file(
    "../Annotations/Annotations/annotation_allconference.json",
    fuzzyMI[0:300],
    df_AllConference,
    df_247,
    dataset_a_name="AllConference",
    dataset_b_name="Master"
)

## Read in the Annotation File
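> After manually annotating with the RecordLinkage annotator tool, this takes the resulting annotation file and flattens the confirmed links into an index of ID pairs for further processing (stored in `annotation_dict`, which despite its name is a flat index of tuples rather than a dict).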

In [8]:
annotation = recordlinkage.read_annotation_file("..//Annotations//Results//allconf_results.json")
try:
    annotation_dict = (annotation.links).to_flat_index()
except Exception as e:
    print(e)

## Insert Annotation dict into RecordLinks

In [9]:
query = '''INSERT INTO RecordLinks(MasterID, TargetID, KeyDataSet, KeyLinkType, LinkConfidence)
    VALUES (?,?,?,?,?)'''

# Open one connection for the whole batch instead of one per record
conn = sql.connect(cc.databaseName)
c = conn.cursor()

for record in annotation_dict:
    #MAKE SURE YOU UPDATE THE THIRD VALUE TO THE CORRECT KEYDATASET!!
    Values = [record[0], record[1], 4, 1, 1]
    c.execute(query, Values)

conn.commit()
conn.close()
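
One way to make the `KeyDataSet` value harder to forget is to name it as a constant and insert the whole batch with `executemany`. The sketch below is a hedged alternative, not the notebook's own method: it runs against an in-memory database with an assumed `RecordLinks` schema based on the column names in the insert statement, and `insert_links`, the constants and the sample pair are hypothetical.

```python
import sqlite3

KEY_DATASET = 4       # assumed ID for this dataset, as in the cell above
KEY_LINK_TYPE = 1     # assumed meaning: manually annotated link
LINK_CONFIDENCE = 1

INSERT_SQL = """
    INSERT INTO RecordLinks (MasterID, TargetID, KeyDataSet, KeyLinkType, LinkConfidence)
    VALUES (?, ?, ?, ?, ?)
"""


def insert_links(conn, pairs):
    """Insert every (master_id, target_id) pair in a single transaction."""
    rows = [
        (master_id, target_id, KEY_DATASET, KEY_LINK_TYPE, LINK_CONFIDENCE)
        for master_id, target_id in pairs
    ]
    with conn:  # commits once for the batch, rolls back on error
        conn.executemany(INSERT_SQL, rows)
    return len(rows)


# Assumed schema in a throwaway in-memory database
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE RecordLinks (MasterID TEXT, TargetID TEXT, "
    "KeyDataSet INTEGER, KeyLinkType INTEGER, LinkConfidence INTEGER)"
)
print(insert_links(conn, [("example_master_id", "example_target_id")]))
```

A single transaction means a failure partway through leaves no partial batch behind.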