# College Football Simulation with a New Pairing System
### By Rodrigo Vargas and Daniel Yedidia
Project for PIC-16B Python with Applications II, UCLA Spring 2023 <br>
Github: https://github.com/dyeds/PIC-16B-Project

Structure: (Tentative)

1. Information & Credits: CollegeFootballData.com, https://www.reddit.com/r/CFB/comments/qq553i/what_would_college_football_look_like_under_a/ Post by u/dethwing_.
2. Explanation of the project and how to use the notebook
3. Data Acquisition and Preprocessing. Working with the CFBD and Bing Maps APIs and storing the data locally in a database.
4. Model creation using TensorFlow. Creating various models and explaining why we use betting lines.
5. Explanation and implementation of the Minimum Weight Matching algorithm using NetworkX to pair teams by distance.
6. SQL database with simulated games and other table-creating functions.
7. Visualizations of Simulation using Plotly.
8. Biases and Future Improvements for the project. 

## Information & Credits

## Project Explanations and Instructions

In [None]:
#List of Imports here:
import DataFunctions
import cfbd
from cfbd.rest import ApiException
import sqlite3
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras import layers
from matplotlib import pyplot as plt
import networkx as nx


## Data Acquisition and Preprocessing

As explained before, the simulation needs college football data and location data, which we obtain using the CFBD and Bing Maps APIs. APIs typically require registering to get a unique key, which lets the provider keep track of who makes calls and how many. Some APIs may ask you to pay after a certain number of calls, but that is not needed for this project. <br>

College football data drives the whole simulation: we extract all FBS teams and their season stats. First install the package with the command `pip install cfbd`. Then register at https://collegefootballdata.com/key to obtain a key. Creating a configuration object with that key lets us use the API.

In [None]:
# import cfbd
# from cfbd.rest import ApiException

configuration = cfbd.Configuration()
# Replace the placeholder with your own key; avoid committing a real key to a shared notebook.
configuration.api_key['Authorization'] = 'YOUR_KEY_HERE'
configuration.api_key_prefix['Authorization'] = 'Bearer'
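One way to keep the key out of the notebook is to read it from an environment variable. The following is a minimal sketch, assuming the key is stored in a variable named `CFBD_API_KEY`; both that name and the helper `load_api_key` are hypothetical choices for illustration, not something the CFBD package requires.

```python
import os

def load_api_key(env_var="CFBD_API_KEY"):
    """Return the API key stored in the given environment variable.

    The variable name is a hypothetical choice made for this sketch.
    """
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your CFBD key.")
    return key

# Usage with the configuration object created above:
# configuration.api_key['Authorization'] = load_api_key()
```

With this approach the notebook can be shared without exposing the key, since each user sets the variable on their own machine.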

Now we have to extract data from 3 classes: `GamesApi` which gives us data for all the games; `BettingApi` which gives us data for all the betting lines; and `StatsApi` which gives us the team season data.<br>

In our predictive modeling, we use a merge of game data and team data as our predictors, and betting-line data as our target. Below is an example of data extraction; the full details are handled by functions in the `DataFunctions.py` file. 

In [None]:
api_instance = cfbd.GamesApi(cfbd.ApiClient(configuration))
year=2022
division="fbs"

try:
    api_response = api_instance.get_games(year=year,division=division)
    print(len(api_response))
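except ApiException as e:
    # print the actual error instead of the exception class
    print(f"Exception when calling GamesApi->get_games: {e}")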

(Explaining how functions are setup on DataFunctions.py from API to DF to SQL)

Before inserting data into our SQL database, we have to create an instance of each sub-API. We also need a list of conferences, which can be obtained directly from the list of games. 

In [None]:
# import sqlite3
# import pandas as pd

# Creating API Instances:
api_instance = cfbd.GamesApi(cfbd.ApiClient(configuration))
api_instance1 = cfbd.BettingApi(cfbd.ApiClient(configuration))
api_instance2 = cfbd.StatsApi(cfbd.ApiClient(configuration))

#creating cols to work on selected stats data
cols = ['team','season','conference','Offensive_ppa','Offensive_success_rate',
        'Offensive_explosiveness','Offensive_power_success',
        'Offensive_stuff_rate','Offensive_line_yards',
        'Defensive_ppa','Defensive_success_rate',
        'Defensive_explosiveness','Defensive_power_success',
        'Defensive_stuff_rate','Defensive_line_yards',
        'Offensive_havoc_total','Offensive_rushing_plays_ppa',
        'Offensive_rushing_plays_success_rate',
        'Offensive_rushing_plays_explosiveness',
        'Offensive_passing_plays_ppa',
        'Offensive_passing_plays_success_rate',
        'Offensive_passing_plays_explosiveness',
        'Defensive_havoc_total','Defensive_rushing_plays_ppa',
        'Defensive_rushing_plays_success_rate',
        'Defensive_rushing_plays_explosiveness',
        'Defensive_passing_plays_ppa',
        'Defensive_passing_plays_success_rate',
        'Defensive_passing_plays_explosiveness']

for year in range(2015,2023):
    #obtaining games data
    gamelist = DataFunctions.get_fbs_games(api_instance=api_instance,year=year)
    games_df = DataFunctions.df_from_games(gamelist=gamelist)
    
    #creating conferences to obtain data from them:
    conferences=[]
    for game in gamelist:
        conferences.append(game.away_conference)
    conferences=set(conferences)
    
    #obtaining betting data
    betting_list=DataFunctions.get_fbs_betting(api_instance=api_instance1,year=year,conferences=conferences)
    betting_df=DataFunctions.df_betting_lines(betting_list)
    
    #obtaining stats data
    teamstats = api_instance2.get_advanced_team_season_stats(year=year)
    stats_df = DataFunctions.df_team_advstats(teamstats=teamstats)
    stats_df = DataFunctions.df_stats_needed(stats_df,cols)
    
    #inserting all dataframes into sql databases.
    conn = sqlite3.connect("CollegeFootball.db")
    games_df.to_sql("games",conn,if_exists="append",index=False)
    betting_df.to_sql("betting_lines",conn,if_exists="append",index=False)
    stats_df.to_sql("stats",conn,if_exists="append",index=False)
    
    conn.close()

Note that we don't need more College Football API calls, as all relevant teams in our simulation and their games are included. Modifications can be made to consider FBS teams or other years.<br>
Now that we have all the information we need for our predictive model in the SQL database, we need to preprocess it before we can create models with it. <br>
The main idea is to have our predictors be a combination of game and team data, so we need to create a dataframe that correctly retrieves the game results together with the home and away team data.

First, the code below builds the column lists for the SQL query and sets up their labels. Since each game pulls two rows from the stats table, we rename the columns to differentiate Home and Away statistics.

In [None]:
gcols = games_df.columns
gstr = ""
for c in gcols:
    gstr += "G."+str(c)+","
gstr

In [None]:
bcols = betting_df.columns
bstr = ""
for b in bcols:
    bstr += "B."+str(b)+","
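# drop the first 5 characters ("B.id,", assuming id is the first betting column)
# so that id is not selected twice in the merged query
bstr = bstr[5:]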
bstr

In [None]:
s1 = ""
for c in cols:
   s1 += "S1." + str(c) +  " AS Home_" + str(c) + ", "
s1 = s1[:-1]
s1

In [None]:
s2 = ""
for c in cols:
   s2 += "S2." + str(c) +  " AS Away_" + str(c) + ", "
s2 = s2[:-2]
s2

In [None]:
cmd=\
f"""
SELECT {str(gstr)} {str(bstr)} {str(s1)} {str(s2)}
FROM games G
INNER JOIN betting_lines B ON G.id=B.id
INNER JOIN stats S1 ON S1.team=G.home_team
INNER JOIN stats S2 ON S2.team=G.away_team
WHERE (S2.season=G.season AND S1.season=G.season)
"""

conn=sqlite3.connect("CollegeFootball.db")
df_merged=pd.read_sql_query(cmd,conn)
conn.close() 
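To illustrate why the stats table is joined twice, here is a minimal, self-contained sketch on an in-memory database. The miniature tables, team names, and values are hypothetical and exist only for this example; the join structure mirrors the query above, with the season conditions moved into the `ON` clauses.

```python
import sqlite3

# Hypothetical miniature tables; the real ones have many more columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE games (id INTEGER, season INTEGER, home_team TEXT, away_team TEXT);
CREATE TABLE stats (team TEXT, season INTEGER, Offensive_ppa REAL);
INSERT INTO games VALUES (1, 2022, 'Team A', 'Team B');
INSERT INTO stats VALUES ('Team A', 2022, 0.5), ('Team B', 2022, 0.25),
                         ('Team A', 2021, 0.75), ('Team B', 2021, 1.0);
""")

# Alias each stats column once for the home team (S1) and once for the away team (S2).
stat_cols = ["Offensive_ppa"]
home = ", ".join(f"S1.{c} AS Home_{c}" for c in stat_cols)
away = ", ".join(f"S2.{c} AS Away_{c}" for c in stat_cols)

query = f"""
SELECT G.id, {home}, {away}
FROM games G
INNER JOIN stats S1 ON S1.team = G.home_team AND S1.season = G.season
INNER JOIN stats S2 ON S2.team = G.away_team AND S2.season = G.season
"""
rows = conn.execute(query).fetchall()
conn.close()
print(rows)  # one row per game, with home and away stats side by side
```

Matching on the season as well as the team name keeps the 2021 rows from being attached to the 2022 game.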

After the merge, we have the following dataframe.

In [None]:
df_merged.head()

We proceed to drop some columns to create our predictors dataframe and our target dataframe using betting lines as target.

In [None]:
#predictors df
parameters_df=df_merged.drop(['id','season', 'home_id', 'home_team',
       'home_conference', 'home_points', 'away_id', 'away_team',
       'away_conference', 'away_points', 'game_spread', 'game_totalpts',
       'av_spread', 'av_total'], axis=1)
parameters_df

In [None]:
#target df
predict_betting_df=df_merged[['av_spread','av_total','id']]

(DISTANCES API)

## Predictive Model using Tensorflow

We create arrays to use with TensorFlow and build our first predictive model, which uses betting lines as the target. We use a train/test split of 70% training data and 30% test data.

In [None]:
# import numpy as np
# from sklearn.model_selection import train_test_split

df = pd.DataFrame()
X = np.array(parameters_df,dtype=np.float32)
y_betting = np.array(predict_betting_df)    #predicting betting info

X_train, X_test, y_train, y_test = train_test_split(X,y_betting,test_size=0.3)

We now create a simple neural network with two hidden layers using TensorFlow. We use the Sequential model, which lets us build a model layer by layer. We compile the model with the 'adam' optimizer, an efficient variation of gradient descent, and with 'mse' (Mean Squared Error) as the loss function to minimize. 

In [None]:
model = tf.keras.models.Sequential([
    layers.Dense(100,input_shape=(X_train.shape[1],),activation='relu'),
    layers.Dense(100,activation='relu'),
    layers.Dense(2)
])

model.compile(optimizer='adam',
              loss='mse',
              metrics=['mae','mse'])

model.summary()
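For reference, the two metrics tracked during training can be written out by hand. This is a minimal NumPy sketch, not part of the training code; the targets and predictions are hypothetical (spread, total) values for two games, and the errors are averaged over all outputs.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error, averaged over all games and both outputs."""
    return float(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    """Mean absolute error, averaged over all games and both outputs."""
    return float(np.mean(np.abs(y_true - y_pred)))

# Hypothetical (spread, total) targets and predictions for two games.
y_true = np.array([[-7.0, 50.0], [3.0, 44.0]])
y_pred = np.array([[-6.0, 50.0], [1.0, 44.0]])

print(mse(y_true, y_pred))  # 1.25
print(mae(y_true, y_pred))  # 0.75
```

Because the errors are squared, MSE penalizes the 2-point miss more heavily than the 1-point miss, while MAE is easier to read in points.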

We proceed to train our model for 100 epochs, which should be enough to fit the model. After that, we can display the performance of the model by plotting the progress of the error function and evaluating the model on the test data.

In [None]:
history = model.fit(X_train,y_train[:,:2],epochs=100,verbose=1)

In [None]:
# from matplotlib import pyplot as plt
plt.plot(history.history["mse"][10:])
plt.gca().set(xlabel="epoch",ylabel="mse")
plt.show()

In [None]:
#evaluating on test data
model.evaluate(X_test,y_test[:,:2],verbose=2)

In [None]:
predictions = model.predict(X)

In [None]:
#Spread distribution
plt.hist(predictions[:,0])

In [None]:
#Total points distribution
plt.hist(predictions[:,1])

In [None]:
#creating boxplot
diff = predictions-y_betting[:,:2]
diffmean = diff.mean(axis=0)
bestdiffstd = diff.std(axis=0)   #renamed variable so it's unique to this model
fig, ax = plt.subplots()
bp = ax.boxplot(diff,showmeans=True)
for i, line in enumerate(bp['medians']):
    x, y = line.get_xydata()[1]
    text = ' μ={:.2f}\n σ={:.2f}'.format(diffmean[i], bestdiffstd[i])
    ax.annotate(text, xy=(x, y))
plt.show()

(Working with Home and Away Pts)

(Working with Actual Spread and Pts)

Explaining Findings and decision on working only on Betting Lines. Run example of a single game prediction. 

## Pairing System using Minimum Weight Matching Algorithm

Using the team locations extracted before, we run the pairing algorithm using Minimum Weight Matching with NetworkX. The algorithm selects pairs of vertices (teams) so that the sum of the weights (distances) of the selected edges is minimized. First we retrieve the distance data from the database, then we use the NetworkX package to create the graph and compute the pairings, and afterwards we simulate the games using our neural network.

In [None]:
conn = sqlite3.connect("CollegeFootball.db")
distances = pd.read_sql_query("SELECT * FROM distances",conn)
conn.close()
distances
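Before building the full graph, here is a minimal sketch of minimum weight matching on four hypothetical teams, using `nx.min_weight_matching`. The team indices and distances are made up for illustration.

```python
import networkx as nx

# Hypothetical distances between four teams, as (team, team, distance) triples.
edges = [(0, 1, 1.0), (2, 3, 1.0),
         (0, 2, 5.0), (1, 3, 5.0),
         (0, 3, 9.0), (1, 2, 9.0)]

G = nx.Graph()
G.add_weighted_edges_from(edges)

# Every team gets exactly one opponent and the total distance is minimized.
matching = nx.min_weight_matching(G)

# The pairs come back as an unordered set, so normalize them for display.
pairs = sorted(tuple(sorted(e)) for e in matching)
total = sum(G[u][v]["weight"] for u, v in pairs)
print(pairs, total)  # [(0, 1), (2, 3)] 2.0
```

The other two possible pairings have total distances of 10 and 18, so the algorithm picks the two short trips.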

In [None]:
# import networkx as nx

m_dist = np.round(np.array(distances),decimals=3)
L = []
for k in range(126):
    for j in range(126):
        if k>j: L.append((k,j,m_dist[k,j]))
        
CollegeGraph = nx.Graph()
CollegeGraph.add_weighted_edges_from(L)
curr_data = np.zeros(shape=(126,3),dtype=int)
curr_data[:,0] = np.arange(126)

for i in range(12):
    DataFunctions.Simulate(g=CollegeGraph,
                           i=i,c=curr_data,
                           y=2022,st_dev=bestdiffstd)
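When pairing over several rounds, one plausible way to avoid rematches is to remove the matched edges from the graph after each round. The sketch below illustrates only that idea on four hypothetical teams; the helper `pair_rounds` is an assumption made for this example and is not necessarily how `DataFunctions.Simulate` works internally.

```python
import networkx as nx

def pair_rounds(graph, n_rounds):
    """Pair teams for several rounds, removing used edges to prevent rematches.

    Hypothetical helper for illustration only.
    """
    g = graph.copy()  # leave the caller's graph untouched
    schedule = []
    for _ in range(n_rounds):
        matching = nx.min_weight_matching(g)
        pairs = sorted(tuple(sorted(e)) for e in matching)
        schedule.append(pairs)
        g.remove_edges_from(pairs)  # these teams cannot meet again
    return schedule

# Hypothetical distances between four teams.
G = nx.Graph()
G.add_weighted_edges_from([(0, 1, 1.0), (2, 3, 1.0),
                           (0, 2, 5.0), (1, 3, 5.0),
                           (0, 3, 9.0), (1, 2, 9.0)])

schedule = pair_rounds(G, n_rounds=3)
for week, pairs in enumerate(schedule, start=1):
    print(week, pairs)
```

Each round picks the cheapest pairing among the games not yet played, so later rounds involve progressively longer trips.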

Note: The code below allows us to remove the previously simulated games from our database. 

In [None]:
# conn = sqlite3.connect("CollegeFootball.db")
# cursor = conn.cursor()
# cursor.execute("DROP TABLE simul_games")
# conn.commit()
# conn.close()

## Simulated Results

(Insert Visualizations Tables)

## Graphical Visualizations of Results

(Insert Plotly visualizations)

## Further Improvements