## Implementing Markov chain in predicting English Premier League Table (2023-24)

**Objective**
- Using the markov chain we intend to predict the final league table of the english premier league of the season 2023-24 predicting the probabilities of a team winning the next match based on the outcome of the current match.

In [1]:
import pandas as pd
pd.set_option('display.max_columns', None)

### About dataset

English Premier League matches from 2023/2024 season, will be updated weekly. Data is scraped from https://fbref.com/en/

Unnamed: 0: An index or identifier column.

Date: The date when the match took place.

Time: The kickoff time of the match.

Comp: The competition name, which is the Premier League for the rows displayed.

Round: The matchweek or round of the competition.

Day: The day of the week the match was played.

Venue: Indicates whether the team was playing at home or away.

Result: The outcome of the match from the perspective of the team mentioned at the end (W = Win, D = Draw, L = Loss).

GF (Goals For): The number of goals scored by the team.

GA (Goals Against): The number of goals conceded by the team.

Opponent: The name of the opposing team.

xG: Expected goals for the team.

xGA: Expected goals against the team.

Poss: Possession percentage during the match.

Attendance: The number of spectators present at the venue.

Captain: The name of the team captain.

Formation: The team's formation.

Referee: The name of the match referee.

Match Report: A link or reference to a detailed match report.

Notes: Any additional notes about the match.

Sh (Shots): Total number of shots taken by the team.

SoT (Shots on Target): Number of shots on target.

Dist: Average distance (likely in meters) from which shots were taken.

FK: Number of free kicks taken.

PK (Penalty Kicks): Number of penalty kicks scored.

PKatt (Penalty Kicks Attempted): Number of penalty kicks attempted.

Season: The season year.

Team: The team the data row is about.

In [2]:
df = pd.read_csv('matches.csv')

In [3]:
df.columns

Index(['Unnamed: 0', 'Date', 'Time', 'Comp', 'Round', 'Day', 'Venue', 'Result',
       'GF', 'GA', 'Opponent', 'xG', 'xGA', 'Poss', 'Attendance', 'Captain',
       'Formation', 'Referee', 'Match Report', 'Notes', 'Sh', 'SoT', 'Dist',
       'FK', 'PK', 'PKatt', 'Season', 'Team'],
      dtype='object')

### Removing unnecssary columns

In [4]:
del_cols = ['Unnamed: 0', 'Time', 'Comp', 'Day', 'Match Report', 'Notes', 'Referee', 'Attendance']
df.drop(del_cols, axis=1, inplace=True)

### Renaming the column names

In [5]:
df = df.rename(columns={'Team':'home_team', 'Opponent':'away_team'})

# changing the order of columns in the dataset
df = df[['Date', 'Round', 'Venue', 'home_team', 'away_team',
         'GF', 'GA', 'Result', 'xG', 'xGA', 'Poss', 'Captain','Formation', 
          'Sh', 'SoT', 'Dist','FK', 'PK', 'PKatt', 'Season']]

### Getting the data in ascending order of time

In [6]:
# convert date to datetime
df['Date'] = pd.to_datetime(df['Date'])

In [7]:
# ascending order
df = df.sort_values(by='Date', ascending=True)
df = df.reset_index()
df = df.drop(['index'], axis=1)

### Subsetting datasets into home and away

In [8]:
home = df[df['Venue']=='Home']
away = df[df['Venue']!='Home']

print(f"Home matches: {home.shape[0]} | Away matches: {away.shape[0]}")

Home matches: 380 | Away matches: 380


### Creating new column for previous match outcome

In [9]:
home['Previous_Match_State'] = home.groupby('home_team')['Result'].shift(1)

# Drop any rows where 'Previous Match State' is NaN (i.e., first match of the team)
home = home.dropna(subset=['Previous_Match_State'])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  home['Previous_Match_State'] = home.groupby('home_team')['Result'].shift(1)


### Creating a transition matrix for home games

In [10]:
home['Transition'] = home['Previous_Match_State'] + " -> " + home['Result']

In [11]:
# Step 4: Count Transitions
transition_counts = home.groupby(['Previous_Match_State', 'Result']).size().reset_index(name='Count')

In [12]:
# Pivot the DataFrame to create the transition matrix
transition_matrix = transition_counts.pivot_table(index='Previous_Match_State', columns='Result', values='Count', fill_value=0)

# Normalize to get probabilities
transition_matrix = transition_matrix.div(transition_matrix.sum(axis=1), axis=0)

In [13]:
transition_matrix

Result,D,L,W
Previous_Match_State,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
D,0.2875,0.2875,0.425
L,0.176991,0.415929,0.40708
W,0.215569,0.275449,0.508982


This transition matrix tells us that if a **teams previous match** is **drawn** the probability of the 
- teams next match being a **draw** is 28.7% 
- teams next match being a **loss** is 28.7%
- teams next match being a **win** is 42.5%

### Creating a transition matrix for away games

In [14]:
away['Previous_Match_State'] = away.groupby('away_team')['Result'].shift(1)

# Drop any rows where 'Previous Match State' is NaN (i.e., first match of the team)
away = away.dropna(subset=['Previous_Match_State'])

away['Transition'] = away['Previous_Match_State'] + " -> " + away['Result']

# Count Transitions
transition_counts = away.groupby(['Previous_Match_State', 'Result']).size().reset_index(name='Count')
# Pivot the DataFrame to create the transition matrix
transition_matrix = transition_counts.pivot_table(index='Previous_Match_State', columns='Result', values='Count', fill_value=0)

# Normalize to get probabilities
transition_matrix = transition_matrix.div(transition_matrix.sum(axis=1), axis=0)

transition_matrix

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  away['Previous_Match_State'] = away.groupby('away_team')['Result'].shift(1)


Result,D,L,W
Previous_Match_State,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
D,0.2875,0.425,0.2875
L,0.215569,0.508982,0.275449
W,0.176991,0.40708,0.415929


### Creating transition matrix of single team

In [15]:
team_name = 'ManchesterCity'
team_df = df[(df['home_team'] == team_name)]

# Calculate Match States and Transitions

# Sort by Date to maintain chronological order
team_df = team_df.sort_values(by=['Date'])

# Track Previous Match State
team_df['Previous_Match_State'] = team_df['Result'].shift(1)

# Drop NaN values (first match)
team_df = team_df.dropna(subset=['Previous_Match_State'])

# Count Transitions
transition_counts = team_df.groupby(['Previous_Match_State', 'Result']).size().reset_index(name='Count')

# Pivot to create Transition Matrix
transition_matrix = transition_counts.pivot_table(index='Previous_Match_State', columns='Result', values='Count', fill_value=0)

# Normalize to get probabilities
transition_matrix = transition_matrix.div(transition_matrix.sum(axis=1), axis=0)
transition_matrix

Result,D,L,W
Previous_Match_State,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
D,0.428571,0.142857,0.428571
L,0.0,0.333333,0.666667
W,0.148148,0.037037,0.814815


This transition matrix tells us that if **citys previous match** is **drawn** the probability of the 
- citys next match being a **draw** is 42.8% 
- citys next match being a **loss** is 14.2%
- citys next match being a **win** is 42.8%

### Simulate Remaining Matches

In [16]:
import numpy as np
import random

# Ensure the random seed is set before any random operations
random.seed(42)

def simulate_match(current_state, transition_matrix):
    if current_state in transition_matrix.index:
        next_state = np.random.choice(transition_matrix.columns, p=transition_matrix.loc[current_state])
        return next_state
    else:
        return np.random.choice(transition_matrix.columns)

# Example simulation for Manchester City
current_state = 'W'  # This can be the result of the last known match or a random start
remaining_matches = 18  # Example: Number of matches left in the season

# Function to update points based on match outcome
def update_points(current_state, points):
    if current_state == 'W':
        points += 3
    elif current_state == 'D':
        points += 1
    return points

### Home and away transition matrix

In [17]:
random.seed(42)
def get_home_transition_matrix(team, home):
    # Filter the dataset for a specific team
    team_name = team
    df = home[home['home_team'] == team_name]

    # Sort by Date to maintain chronological order
    df = df.sort_values(by=['Date'])

    df['Previous_Match_State'] = df.groupby('home_team')['Result'].shift(1)

    # Drop any rows where 'Previous Match State' is NaN (i.e., first match of the team)
    df = df.dropna(subset=['Previous_Match_State'])
    df['Transition'] = df['Previous_Match_State'] + " -> " + df['Result']

    # Step 4: Count Transitions
    transition_counts = df.groupby(['Previous_Match_State', 'Result']).size().reset_index(name='Count')

    # Pivot the DataFrame to create the transition matrix
    transition_matrix = transition_counts.pivot_table(index='Previous_Match_State', columns='Result', values='Count', fill_value=0)

    # Normalize to get probabilities
    transition_matrix = transition_matrix.div(transition_matrix.sum(axis=1), axis=0)
    
#     print(f"\nHome Transition Matrix for {team_name}:\n{transition_matrix}")    
    return transition_matrix
    
def get_away_transition_matrix(team, away):
    # Filter the dataset for a specific team
    team_name = team
    df = away[away['home_team'] == team_name]

    # Sort by Date to maintain chronological order
    df = df.sort_values(by=['Date'])

    df['Previous_Match_State'] = df.groupby('home_team')['Result'].shift(1)

    # Drop any rows where 'Previous Match State' is NaN (i.e., first match of the team)
    df = df.dropna(subset=['Previous_Match_State'])
    df['Transition'] = df['Previous_Match_State'] + " -> " + df['Result']

    # Step 4: Count Transitions
    transition_counts = df.groupby(['Previous_Match_State', 'Result']).size().reset_index(name='Count')

    # Pivot the DataFrame to create the transition matrix
    transition_matrix = transition_counts.pivot_table(index='Previous_Match_State', columns='Result', values='Count', fill_value=0)

    # Normalize to get probabilities
    transition_matrix = transition_matrix.div(transition_matrix.sum(axis=1), axis=0)
    
#     print(f"\nAway Transition Matrix for {team_name}:\n{transition_matrix}")    
    return transition_matrix

### Aggregate Results and Create Final League Table

In [18]:
random.seed(42)
def get_results(team):
    home_transition_matrix = get_home_transition_matrix(team, home)  # Assume this is calculated as above
    away_transition_matrix = get_away_transition_matrix(team, away)  # Assume this is calculated similarly for away games

    # Number of remaining home and away matches
    remaining_home_matches = 18
    remaining_away_matches = 19

    points = 0
    current_state = 'W'  # or the last known state

    # Simulate remaining home matches
    for _ in range(remaining_home_matches):
        current_state = simulate_match(current_state, home_transition_matrix)
        points = update_points(current_state, points)

    # Simulate remaining away matches
    current_state = 'W'  # or the last known state
    for _ in range(remaining_away_matches):
        current_state = simulate_match(current_state, away_transition_matrix)
        points = update_points(current_state, points)

    # Store the final points for this team
#     print(f"Final Points for {team}: {points}")
    return team, points

In [19]:
teams = df.home_team.unique().tolist()

# Dictionary to store final points for each team
final_points = {team: 0 for team in teams}

for team in teams:
    random.seed(42)
    team, points = get_results(team)
    
    # Store final points
    final_points[team] = points
    
# Convert the final points dictionary to a DataFrame and rank the teams
final_table = pd.DataFrame(list(final_points.items()), columns=['Team', 'MCPoints'])
final_table = final_table.sort_values(by='MCPoints', ascending=False).reset_index(drop=True)
final_table

Unnamed: 0,Team,MCPoints
0,Arsenal,95
1,ManchesterCity,88
2,Liverpool,79
3,Chelsea,67
4,TottenhamHotspur,63
5,AstonVilla,59
6,NewcastleUnited,54
7,Fulham,51
8,Everton,51
9,WolverhamptonWanderers,50


### Get actual points of the season 23-24

In [20]:
points = df.copy()

points['points'] = np.where(points['Result']=='W', 3, np.where(points['Result']=='D', 1, 0))

# Assuming the points series you have is named `points`
points_df = points.groupby('home_team')['points'].sum().reset_index()

# Rename the columns for clarity
points_df.columns = ['Team', 'EPLPoints']

# Merge the actual points with the predicted points
final_table = pd.merge(final_table, points_df, on='Team')

# Sort the final table based on predicted points or actual points
final_table = final_table.sort_values(by='EPLPoints', ascending=False).reset_index(drop=True)

final_table = final_table[['Team', 'EPLPoints', 'MCPoints']]

In [21]:
final_table

Unnamed: 0,Team,EPLPoints,MCPoints
0,ManchesterCity,91,88
1,Arsenal,89,95
2,Liverpool,82,79
3,AstonVilla,68,59
4,TottenhamHotspur,66,63
5,Chelsea,63,67
6,ManchesterUnited,60,44
7,NewcastleUnited,60,54
8,WestHamUnited,52,39
9,CrystalPalace,49,43


### Model R square

In [23]:
from sklearn.metrics import r2_score
r2_score(final_table['EPLPoints'], final_table['MCPoints'])

0.8613290797293472

The R-squared score for the model is 0.86, indicating a strong correlation between the predicted points and the actual points. This score suggests that **86% of the variability** in the actual points can be explained by the model's predictions.