# <font color= #99C8F5> **SARIMA Model** </font>

_by Isabel Valladolid, Oscar Rocha & Vivienne Toledo_

22/02/2026.

---

# <font color= #99C8F5> **Introduction** </font>

This notebook covers a SARIMA model developed to predict the total goals scored by a specific team in the Eredivisie (Netherlands). Data was obtained from the Football Data API (https://www.football-data.org/) and spans mid-2023 through the end of 2025.

---

# <font color= #99C8F5> **Libraries & Data** </font>


In [2]:
# General
import numpy as np
import pandas as pd
import requests
from dotenv import load_dotenv
import os

# Visualizations
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from plotly.subplots import make_subplots

In [3]:
# Get the API KEY
load_dotenv()
API_KEY = os.getenv('API_KEY')

In [4]:
# # Log into the API 
# BASE_URL = "https://api.football-data.org/v4"
# headers = {"X-Auth-Token": API_KEY}

# # Get data from Eredivisie 
# competition_code = "DED"
# params = {
#     "dateFrom": "2025-07-01",           # Jul 1st
#     "dateTo": "2025-12-31",             # Dec 31st
#     "status": "FINISHED"                # Finished matches
# }

# # Get all matches
# response = requests.get(
#     f"{BASE_URL}/competitions/{competition_code}/matches",
#     headers=headers,
#     params=params
# )
# data = response.json()
# all_matches = data.get("matches", [])

# # Convert to dataframe
# df = pd.DataFrame([{
#     "date": m["utcDate"],
#     "homeTeam": m["homeTeam"]["name"],
#     "awayTeam": m["awayTeam"]["name"],
#     "homeGoals": m["score"]["fullTime"]["home"],
#     "awayGoals": m["score"]["fullTime"]["away"],
#     "winner": m["score"]["winner"]
# } for m in all_matches])

# # Convert date to datetime and sort
# df["date"] = pd.to_datetime(df["date"])
# df.sort_values("date", inplace=True)

# # Save dataframe locally
# df.to_csv("eredivisie_matches_2025.csv", index=False)

# df.head()

In [5]:
# Load the locally saved CSV files (no API key needed)
df_2023 = pd.read_csv('eredivisie_matches_2023.csv')
df_2024 = pd.read_csv('eredivisie_matches_2024.csv')
df_2025 = pd.read_csv('eredivisie_matches_2025.csv')

# Concatenate the yearly dataframes, rebuilding the index to avoid duplicate labels
df = pd.concat([df_2023, df_2024, df_2025], ignore_index=True)

In [6]:
df.head()

|   | date | homeTeam | awayTeam | homeGoals | awayGoals | winner |
|---|------|----------|----------|-----------|-----------|--------|
| 0 | 2023-08-11 18:00:00+00:00 | FC Volendam | SBV Vitesse | 1 | 2 | AWAY_TEAM |
| 1 | 2023-08-12 14:30:00+00:00 | PSV | FC Utrecht | 2 | 0 | HOME_TEAM |
| 2 | 2023-08-12 16:45:00+00:00 | SC Heerenveen | RKC Waalwijk | 3 | 1 | HOME_TEAM |
| 3 | 2023-08-12 18:00:00+00:00 | AFC Ajax | Heracles Almelo | 4 | 1 | HOME_TEAM |
| 4 | 2023-08-12 19:00:00+00:00 | PEC Zwolle | Sparta Rotterdam | 1 | 2 | AWAY_TEAM |


# <font color= #99C8F5> **Filter Dataset** </font>

**_AFC Ajax_** was selected as the team to forecast, so the dataset is filtered to its matches and aggregated into goals scored per match day:

In [7]:
team = 'AFC Ajax'

# Convert date column to datetime
df['date'] = pd.to_datetime(df['date'], utc=True)

# Filter matches by team
ajax_matches = df[(df['homeTeam'] == team) | (df['awayTeam'] == team)].copy()

# Add goals scored by team
ajax_matches['goals_scored'] = ajax_matches.apply(
    lambda row: row['homeGoals'] if row['homeTeam'] == team else row['awayGoals'],
    axis=1
)

# Extract the date
ajax_matches['match_date'] = ajax_matches['date'].dt.date
# Sum goals per day
goals_per_day = ajax_matches.groupby('match_date')['goals_scored'].sum().reset_index()
# Add team column
goals_per_day['team'] = team

# Rename columns for clarity
goals_per_day.rename(columns={'match_date': 'date', 'goals_scored': 'total_goals'}, inplace=True)
goals_per_day.head()

|   | date | total_goals | team |
|---|------|-------------|------|
| 0 | 2023-08-12 | 4 | AFC Ajax |
| 1 | 2023-08-19 | 2 | AFC Ajax |
| 2 | 2023-09-03 | 0 | AFC Ajax |
| 3 | 2023-09-17 | 1 | AFC Ajax |
| 4 | 2023-09-27 | 0 | AFC Ajax |
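
The same filtering can also be written without a row-wise `apply`. The following is a minimal sketch, assuming the column names used in `df`; the helper name `team_goals_per_day` is hypothetical, and `sample` simply re-types the five rows printed by `df.head()` so the block runs on its own.

```python
import numpy as np
import pandas as pd


def team_goals_per_day(matches: pd.DataFrame, team: str) -> pd.DataFrame:
    """Total goals scored by `team` per match date (hypothetical helper)."""
    is_home = matches["homeTeam"] == team
    is_away = matches["awayTeam"] == team
    played = matches[is_home | is_away].copy()

    # Take homeGoals when the team played at home, awayGoals otherwise
    played["total_goals"] = np.where(
        played["homeTeam"] == team, played["homeGoals"], played["awayGoals"]
    )
    # Keep only the calendar date of each match
    played["date"] = pd.to_datetime(played["date"], utc=True).dt.date

    out = played.groupby("date", as_index=False)["total_goals"].sum()
    out["team"] = team
    return out


# The five rows shown by df.head()
sample = pd.DataFrame({
    "date": [
        "2023-08-11 18:00:00+00:00",
        "2023-08-12 14:30:00+00:00",
        "2023-08-12 16:45:00+00:00",
        "2023-08-12 18:00:00+00:00",
        "2023-08-12 19:00:00+00:00",
    ],
    "homeTeam": ["FC Volendam", "PSV", "SC Heerenveen", "AFC Ajax", "PEC Zwolle"],
    "awayTeam": ["SBV Vitesse", "FC Utrecht", "RKC Waalwijk", "Heracles Almelo", "Sparta Rotterdam"],
    "homeGoals": [1, 2, 3, 4, 1],
    "awayGoals": [2, 0, 1, 1, 2],
})

print(team_goals_per_day(sample, "AFC Ajax"))     # home match
print(team_goals_per_day(sample, "SBV Vitesse"))  # away match
```

Wrapping the logic in a function makes it easy to repeat the analysis for another team by changing only the `team` argument.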


# <font color= #99C8F5> **Visualization** </font>

In [12]:
# Time series visualization
fig = go.Figure()
fig.add_trace(go.Scatter(x=goals_per_day.date, y=goals_per_day.total_goals, mode='lines', name='Goals per Day'))

fig.update_layout(
    title=f'Daily Goals by {team}',
    xaxis_title='Date',
    yaxis_title='Total Goals'
)
fig.show()

# <font color= #99C8F5> **Selecting Model Order** </font>

[NOTES] The differencing step and the model order selection go here. The time series contains NaN values, so these need careful handling.
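
As a starting point for this section, here is a minimal sketch of one way to put the series on a regular grid, deal with the resulting gaps, and difference it. It is not the notebook's final method. It assumes a weekly frequency and linear interpolation for weeks without a match, both of which are illustrative choices. The `goals` series re-types the five rows of `goals_per_day` shown earlier so the block runs on its own.

```python
import pandas as pd

# First five rows of goals_per_day, copied from the output shown earlier
goals = pd.Series(
    [4, 2, 0, 1, 0],
    index=pd.to_datetime(
        ["2023-08-12", "2023-08-19", "2023-09-03", "2023-09-17", "2023-09-27"]
    ),
    name="total_goals",
)

# Regular weekly grid (weeks end on Sunday); weeks without a match become NaN
weekly = goals.resample("W").sum(min_count=1)
n_missing = int(weekly.isna().sum())

# Assumption: fill the gaps by linear interpolation (one option among several)
filled = weekly.interpolate(method="linear")

# First-order differencing (d = 1); the first value has no predecessor, so drop it
diffed = filled.diff().dropna()

# Lag-1 autocorrelation of the differenced series as a rough diagnostic
lag1 = diffed.autocorr(lag=1)

print(weekly)
print(f"missing weeks: {n_missing}")
print(diffed)
print(f"lag-1 autocorrelation: {lag1:.3f}")
```

On the full dataset, a seasonal difference would use `filled.diff(s)` for a chosen seasonal period `s`, and a stationarity test such as `adfuller` from `statsmodels.tsa.stattools` could help decide how much differencing is needed. Treating a week without a match as zero goals would distort the series, which is why the gaps are kept as NaN until they are handled explicitly.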