##PROBLEM STATEMENT:
The goal of this analysis is to build a model that predicts the outcome of a Premier League match (home win, draw, or away win) from information available before kick-off, such as each team's recent form and the betting odds. By identifying the variables that best explain match results, we can gain insight into what it takes to succeed in one of the world's most competitive football leagues.


##Overview of the dataset:
We've loaded the dataset for the 2010-2011 Premier League season. It contains one row per match, with a wealth of information about each one. Let's break down the most important columns for our project:

HomeTeam and AwayTeam: The names of the two teams that played.

FTHG and FTAG: The Full Time Home Goals and Full Time Away Goals. These columns tell us the final score of the match.

FTR: The Full Time Result. This is our target variable, the value we want to predict. It has three possible values:

'H' for a home win

'A' for an away win

'D' for a draw

Match Statistics: We also have a lot of statistics for each match, like:

HS and AS: Home and Away team shots

HST and AST: Home and Away team shots on target

HC and AC: Home and Away team corners

HF and AF: Home and Away team fouls committed

HY, AY, HR, AR: Yellow and Red cards for home and away teams.

Betting Odds: There are also many columns with betting odds from different bookmakers (B365, BW, IW, etc.). These can be very powerful features because they represent the market's prediction of the match outcome.
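
To make the "market's prediction" idea concrete, here is a minimal sketch of how decimal odds can be turned into implied probabilities. The function name `implied_probabilities` is hypothetical, and proportional normalisation is only one simple way to remove the bookmaker's margin; the odds used are the B365 values from the Chelsea vs West Brom match in this dataset.

```python
def implied_probabilities(odds):
    """Convert decimal odds into probabilities that sum to 1.

    Assumption: the bookmaker margin is removed by simple proportional
    normalisation, which is one common approach among several.
    """
    # The inverse of a decimal odd is the raw implied probability.
    raw = {outcome: 1.0 / value for outcome, value in odds.items()}
    # The raw values add up to more than 1 because of the bookmaker margin.
    total = sum(raw.values())
    return {outcome: p / total for outcome, p in raw.items()}


# B365H, B365D, B365A for Chelsea vs West Brom (14 Aug 2010)
odds = {"H": 1.17, "D": 7.00, "A": 17.0}
probs = implied_probabilities(odds)

for outcome, p in probs.items():
    print(f"{outcome}: {p:.1%}")
```

The shorter the odds, the higher the implied probability, so a heavy favourite at home shows up as a large value for 'H'.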

In [6]:
!pip install pandas numpy scikit-learn matplotlib seaborn requests



##Data Cleaning:
We need to make sure our data is clean and consistent. For this dataset, we'll focus on selecting only the columns that are useful for our model. We have 71 columns, and not all of them will be useful. For example, the referee's name is unlikely to have a significant impact on the outcome of a match.


In [8]:
import pandas as pd

# Load the dataset again
df = pd.read_csv('/content/premier-league-predictor/data/E2010.csv')

# --- Data Cleaning part ---
# 1. Select relevant columns
# We are selecting the date, teams, final score, full-time result, and key match stats.
# We are also keeping the B365 betting odds as they are a good indicator of match outcome probability.
relevant_columns = [
    'Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR',
    'HS', 'AS', 'HST', 'AST', 'HC', 'AC',
    'B365H', 'B365D', 'B365A'
]
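# .copy() gives us an independent DataFrame, so the assignments below
# do not trigger pandas' SettingWithCopyWarning.
df_cleaned = df[relevant_columns].copy()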

# 2. Convert 'Date' column to datetime objects
# The 'Date' is currently a string (e.g., '14/08/10'). We need to convert it to a proper date format
# so we can sort the matches chronologically. The format '%d/%m/%y' tells pandas
# to expect the day, then the month, then the 2-digit year.
df_cleaned['Date'] = pd.to_datetime(df_cleaned['Date'], format='%d/%m/%y')

# Sort the DataFrame by date to ensure matches are in chronological order.
df_cleaned = df_cleaned.sort_values(by='Date').reset_index(drop=True)


# --- Display Results ---
print("Cleaned DataFrame Head:")
print(df_cleaned.head())

print("\nCleaned DataFrame Info:")
df_cleaned.info()

Cleaned DataFrame Head:
        Date     HomeTeam    AwayTeam  FTHG  FTAG FTR  HS  AS  HST  AST  HC  \
0 2010-08-14  Aston Villa    West Ham     3     0   H  23  12   11    2  16   
1 2010-08-14    Blackburn     Everton     1     0   H   7  17    2   12   1   
2 2010-08-14       Bolton      Fulham     0     0   D  13  12    9    7   4   
3 2010-08-14      Chelsea   West Brom     6     0   H  18  10   13    4   3   
4 2010-08-14   Sunderland  Birmingham     2     2   D   6  13    2    7   3   

   AC  B365H  B365D  B365A  
0   7   2.00   3.30    4.0  
1   3   2.88   3.25    2.5  
2   8   2.20   3.30    3.4  
3   1   1.17   7.00   17.0  
4   6   2.10   3.30    3.6  

Cleaned DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 380 entries, 0 to 379
Data columns (total 15 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Date      380 non-null    datetime64[ns]
 1   HomeTeam  380 non-null    object        
 2   AwayTeam



##Feature Engineering Part

A team's performance in an upcoming match is mainly influenced by its recent form. A team that has won its last five games is more likely to win than a team that has lost its last five. Our current dataset doesn't have this "form" information directly. It only tells us about individual matches.

So let's create it. We are going to calculate each team's rolling average statistics. For each match, we'll look at the previous 5 matches that team played and calculate the average of its performance metrics.

Let's define the new features we want to create for both the home and away teams for each match:

**Rolling Shots on Target**: Average shots on target over the last 5 games.

**Rolling Goals Scored**: Average goals scored over the last 5 games.

**Rolling Goals Conceded**: Average goals conceded over the last 5 games.

In [10]:
import pandas as pd

# Load and clean the data as before
relevant_columns = [
    'Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR',
    'HS', 'AS', 'HST', 'AST', 'HC', 'AC',
    'B365H', 'B365D', 'B365A'
]
df_cleaned = df[relevant_columns].copy()  # copy to avoid SettingWithCopyWarning
df_cleaned['Date'] = pd.to_datetime(df_cleaned['Date'], format='%d/%m/%y')
df_cleaned = df_cleaned.sort_values(by='Date').reset_index(drop=True)

# A dictionary to store the last 5 match stats for each team
team_stats = {}

def get_last_5_matches_stats(team, date):
    """
    Returns the average stats for a team from their last 5 matches before a given date.
    """
    team_df = team_stats.get(team)
    if team_df is None or team_df.empty:
        return 0, 0, 0 #  zeros if no historical data

    # Filter for matches before the current game's date
    past_matches = team_df[team_df['Date'] < date].tail(5)
    if past_matches.empty:
        return 0, 0, 0

    # Calculate rolling averages
    avg_goals_scored = past_matches['GoalsScored'].mean()
    avg_goals_conceded = past_matches['GoalsConceded'].mean()
    avg_shots_on_target = past_matches['ShotsOnTarget'].mean()

    return avg_goals_scored, avg_goals_conceded, avg_shots_on_target

# Lists to store our new features
home_avg_gs, home_avg_gc, home_avg_st = [], [], []
away_avg_gs, away_avg_gc, away_avg_st = [], [], []

# --- Feature Engineering ---
# This loop is the main part of our feature engineering process.
for index, row in df_cleaned.iterrows():
    home_team = row['HomeTeam']
    away_team = row['AwayTeam']
    game_date = row['Date']

    # --- Home Team Stats ---
    # Get the last 5 matches stats for the home team
    h_gs, h_gc, h_st = get_last_5_matches_stats(home_team, game_date)
    home_avg_gs.append(h_gs)
    home_avg_gc.append(h_gc)
    home_avg_st.append(h_st)

    # --- Away Team Stats ---
    # Get the last 5 matches stats for the away team
    a_gs, a_gc, a_st = get_last_5_matches_stats(away_team, game_date)
    away_avg_gs.append(a_gs)
    away_avg_gc.append(a_gc)
    away_avg_st.append(a_st)

    # --- Update the historical data for both teams ---
    # The ordering is crucial: we calculate the stats *before* the current match,
    # and only then add the current match's stats to our history.
    # Each team's history starts from its first match rather than from an empty
    # DataFrame, because newer pandas versions warn when concatenating empty frames.

    # Update for Home Team
    new_home_stats = pd.DataFrame([{'Date': game_date, 'GoalsScored': row['FTHG'], 'GoalsConceded': row['FTAG'], 'ShotsOnTarget': row['HST']}])
    if home_team not in team_stats:
        team_stats[home_team] = new_home_stats
    else:
        team_stats[home_team] = pd.concat([team_stats[home_team], new_home_stats], ignore_index=True)

    # Update for Away Team
    new_away_stats = pd.DataFrame([{'Date': game_date, 'GoalsScored': row['FTAG'], 'GoalsConceded': row['FTHG'], 'ShotsOnTarget': row['AST']}])
    if away_team not in team_stats:
        team_stats[away_team] = new_away_stats
    else:
        team_stats[away_team] = pd.concat([team_stats[away_team], new_away_stats], ignore_index=True)

# Add the new features to our DataFrame
df_cleaned['Home_Avg_GS'] = home_avg_gs
df_cleaned['Home_Avg_GC'] = home_avg_gc
df_cleaned['Home_Avg_ST'] = home_avg_st
df_cleaned['Away_Avg_GS'] = away_avg_gs
df_cleaned['Away_Avg_GC'] = away_avg_gc
df_cleaned['Away_Avg_ST'] = away_avg_st

# Display the DataFrame with the new features.
# The first few rows will have zeros, as there are no previous matches in the season to average.
# This is expected.
print(df_cleaned.tail())


          Date     HomeTeam   AwayTeam  FTHG  FTAG FTR  HS  AS  HST  AST  ...  \
375 2011-05-22      Everton    Chelsea     1     0   H  11  19    3    9  ...   
376 2011-05-22       Bolton   Man City     0     2   A  16  16    8   10  ...   
377 2011-05-22  Aston Villa  Liverpool     1     0   H  10   9    8    4  ...   
378 2011-05-22       Fulham    Arsenal     2     2   D  10  12    6    6  ...   
379 2011-05-22       Wolves  Blackburn     2     3   A  12  13    5   10  ...   

     AC  B365H  B365D  B365A  Home_Avg_GS  Home_Avg_GC  Home_Avg_ST  \
375   5   3.10   3.25   2.38          1.0          0.8          5.8   
376   5   5.00   3.75   1.70          1.2          2.2          7.4   
377   5   3.20   3.40   2.25          1.4          1.2          6.2   
378   3   3.75   3.60   1.95          2.2          1.2          6.8   
379   3   2.20   3.00   3.75          1.6          1.4          4.4   

     Away_Avg_GS  Away_Avg_GC  Away_Avg_ST  
375          2.2          1.2          9.

The last six columns describe each team's form going into a match. Take **Wolves** vs **Blackburn** on the final day of the season: before the match is played, the model already knows each team's averages over its last 5 matches.

**Wolves** (home) were scoring an average of 1.6 goals, conceding 1.4 goals and having 4.4 shots on target per game.
Similarly, **Blackburn** (away) were scoring an average of 0.6 goals, conceding 1.0 goals and having 4.4 shots on target per game.

This gives the model useful knowledge about each team's underlying performance, not just its results.
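
As a cross-check on the loop above, the same "form before the match" idea can be sketched with pandas' `groupby`, `shift` and `rolling`. This is an alternative sketch, not the code used in this notebook: the toy data and the column names `Team` and `Avg_GS` are hypothetical, and it assumes the data has been reshaped to one row per team per match.

```python
import pandas as pd

# Hypothetical toy data: one row per team per match, in long format.
long_df = pd.DataFrame({
    "Team": ["A"] * 4 + ["B"] * 4,
    "Date": pd.to_datetime(
        ["2010-08-14", "2010-08-21", "2010-08-28", "2010-09-11"] * 2
    ),
    "GoalsScored": [3, 0, 2, 1, 1, 1, 0, 4],
}).sort_values(["Team", "Date"])

# shift(1) drops the current match from its own average, so each row only
# sees matches played earlier. rolling(5, min_periods=1) then averages up to
# the 5 previous matches, using fewer when fewer are available.
long_df["Avg_GS"] = (
    long_df.groupby("Team")["GoalsScored"]
    .transform(lambda s: s.shift(1).rolling(window=5, min_periods=1).mean())
)

print(long_df)
```

One difference from the loop: a team's first match gets `NaN` here rather than 0. Calling `.fillna(0)` on the new column would reproduce the loop's convention.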

##Preparation of dataset:


1.  **Defining the Features (X) and Target (y)**
*   The features are the model's inputs: the new rolling averages, the betting odds, etc.
*   The target is the actual outcome of the match (**FTR**).

2.   **Handling the Categorical Data**
*   The team names (**HomeTeam**, **AwayTeam**) are text, so we convert them into a numerical format using a technique called one-hot encoding.

3. **Splitting into Training and Testing**
*   We train on 80% of the matches and hold out the remaining 20% for testing.
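
To illustrate what one-hot encoding does to the team columns, here is a small sketch on a hypothetical three-match table (the rows and odds are made up for illustration). It uses `pd.get_dummies`, the same function applied to the real data in the next cell.

```python
import pandas as pd

# Hypothetical toy fixtures, only to show the shape of the encoding.
toy = pd.DataFrame({
    "HomeTeam": ["Arsenal", "Chelsea", "Arsenal"],
    "AwayTeam": ["Chelsea", "Fulham", "Fulham"],
    "B365H": [2.10, 1.50, 1.80],
})

# Each distinct team name becomes its own indicator column, prefixed with
# the name of the original column. Numeric columns are left untouched.
encoded = pd.get_dummies(toy, columns=["HomeTeam", "AwayTeam"])

print(encoded.columns.tolist())
print(encoded)
```

With 20 teams in the league, the real data gains 20 home and 20 away indicator columns this way.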


In [16]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load and process the data as we've done before
df = pd.read_csv('/content/premier-league-predictor/data/E2010.csv')
relevant_columns = [
    'Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR',
    'HS', 'AS', 'HST', 'AST', 'HC', 'AC',
    'B365H', 'B365D', 'B365A'
]
df_cleaned = df[relevant_columns].copy()  # copy to avoid SettingWithCopyWarning
df_cleaned['Date'] = pd.to_datetime(df_cleaned['Date'], format='%d/%m/%y')
df_cleaned = df_cleaned.sort_values(by='Date').reset_index(drop=True)

team_stats = {}
def get_last_5_matches_stats(team, date):
    team_df = team_stats.get(team)
    if team_df is None or team_df.empty: return 0, 0, 0
    past_matches = team_df[team_df['Date'] < date].tail(5)
    if past_matches.empty: return 0, 0, 0
    avg_goals_scored = past_matches['GoalsScored'].mean()
    avg_goals_conceded = past_matches['GoalsConceded'].mean()
    avg_shots_on_target = past_matches['ShotsOnTarget'].mean()
    return avg_goals_scored, avg_goals_conceded, avg_shots_on_target

home_avg_gs, home_avg_gc, home_avg_st = [], [], []
away_avg_gs, away_avg_gc, away_avg_st = [], [], []

for index, row in df_cleaned.iterrows():
    home_team, away_team, game_date = row['HomeTeam'], row['AwayTeam'], row['Date']
    h_gs, h_gc, h_st = get_last_5_matches_stats(home_team, game_date)
    home_avg_gs.append(h_gs); home_avg_gc.append(h_gc); home_avg_st.append(h_st)
    a_gs, a_gc, a_st = get_last_5_matches_stats(away_team, game_date)
    away_avg_gs.append(a_gs); away_avg_gc.append(a_gc); away_avg_st.append(a_st)

    # Start each team's history from its first match instead of an empty DataFrame,
    # because newer pandas versions warn when concatenating empty frames.
    new_home_stats = pd.DataFrame([{'Date': game_date, 'GoalsScored': row['FTHG'], 'GoalsConceded': row['FTAG'], 'ShotsOnTarget': row['HST']}])
    if home_team not in team_stats:
        team_stats[home_team] = new_home_stats
    else:
        team_stats[home_team] = pd.concat([team_stats[home_team], new_home_stats], ignore_index=True)

    new_away_stats = pd.DataFrame([{'Date': game_date, 'GoalsScored': row['FTAG'], 'GoalsConceded': row['FTHG'], 'ShotsOnTarget': row['AST']}])
    if away_team not in team_stats:
        team_stats[away_team] = new_away_stats
    else:
        team_stats[away_team] = pd.concat([team_stats[away_team], new_away_stats], ignore_index=True)

df_cleaned['Home_Avg_GS'] = home_avg_gs
df_cleaned['Home_Avg_GC'] = home_avg_gc
df_cleaned['Home_Avg_ST'] = home_avg_st
df_cleaned['Away_Avg_GS'] = away_avg_gs
df_cleaned['Away_Avg_GC'] = away_avg_gc
df_cleaned['Away_Avg_ST'] = away_avg_st

# --- Preparing Data for the Model ---

# 1. Define Features (X) and Target (y)
feature_columns = [
    'HomeTeam', 'AwayTeam', # Categorical features to be encoded
    'B365H', 'B365D', 'B365A', # Betting odds
    'Home_Avg_GS', 'Home_Avg_GC', 'Home_Avg_ST', # Home team form
    'Away_Avg_GS', 'Away_Avg_GC', 'Away_Avg_ST'  # Away team form
]
features = df_cleaned[feature_columns]
target = df_cleaned['FTR']

# 2. Handle Categorical Data (One-Hot Encoding)
# This will convert 'HomeTeam' and 'AwayTeam' into numerical columns
final_features = pd.get_dummies(features, columns=['HomeTeam', 'AwayTeam'])

# We drop the first 50 rows because their rolling averages are not based on enough data
final_features = final_features.iloc[50:].reset_index(drop=True)
target = target.iloc[50:].reset_index(drop=True)


# 3. Splitting the Data (80% for training, 20% for testing)
X_train, X_test, y_train, y_test = train_test_split(
    final_features, target, test_size=0.2, random_state=42, stratify=target
)

# --- Display Results ---
print("Shape of our final feature set (X):", final_features.shape)
print("Shape of our training features (X_train):", X_train.shape)
print("Shape of our testing features (X_test):", X_test.shape)
print("\nAn example of our final features (X) after One-Hot Encoding:")
print(final_features.head())


Shape of our final feature set (X): (330, 49)
Shape of our training features (X_train): (264, 49)
Shape of our testing features (X_test): (66, 49)

An example of our final features (X) after One-Hot Encoding:
   B365H  B365D  B365A  Home_Avg_GS  Home_Avg_GC  Home_Avg_ST  Away_Avg_GS  \
0   3.50   3.40   2.10          0.6          2.6          5.4          1.2   
1   1.44   4.33   7.50          0.8          1.4          5.4          1.0   
2   2.75   3.20   2.63          1.4          1.2          6.4          0.8   
3   3.60   3.30   2.10          1.2          0.4          7.2          4.2   
4   1.83   3.50   4.50          1.4          1.6          5.0          0.4   

   Away_Avg_GC  Away_Avg_ST  HomeTeam_Arsenal  ...  AwayTeam_Man City  \
0          0.8         10.2             False  ...              False   
1          1.0          3.4             False  ...              False   
2          1.4          8.8             False  ...              False   
3          0.2         10.2   
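
Because the matches are ordered in time, a random split lets the model train on some matches played after the ones it is tested on. One alternative, sketched here on hypothetical toy data, is a chronological split that trains on the earliest 80% of matches and tests on the latest 20%. The names `train_chrono` and `test_chrono` are illustrative and this is not the split used in this notebook.

```python
import pandas as pd

# Hypothetical toy matches, one per week, already in date order.
toy = pd.DataFrame({
    "Date": pd.date_range("2010-08-14", periods=10, freq="7D"),
    "Home_Avg_GS": [0.0, 1.0, 1.5, 1.2, 0.8, 2.0, 1.4, 1.0, 0.6, 1.8],
    "FTR": ["H", "D", "A", "H", "H", "D", "A", "H", "D", "A"],
}).sort_values("Date").reset_index(drop=True)

# Cut at the 80% mark: everything before it is training data,
# everything from it onwards is test data.
split_at = int(len(toy) * 0.8)
train_chrono = toy.iloc[:split_at]
test_chrono = toy.iloc[split_at:]

print("Train:", train_chrono["Date"].min().date(), "to", train_chrono["Date"].max().date())
print("Test: ", test_chrono["Date"].min().date(), "to", test_chrono["Date"].max().date())
```

The trade-off is that a chronological split cannot be stratified, so the share of home wins, draws and away wins may differ between the two sets.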

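##Building the model:

We train three classifiers on the same training set and compare their accuracy on the test set:

Logistic Regression: A linear model that estimates the probability of each outcome. It serves as a simple starting point.

Random Forest: An ensemble of decision trees, each trained on a random sample of the data, whose predictions are combined.

Gradient Boosting: An ensemble that builds decision trees one after another, each one correcting the errors of the trees before it.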

In [17]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score



# --- Model Training and Evaluation ---

# 1. Initialize the models
log_reg = LogisticRegression(max_iter=1000)
rand_forest = RandomForestClassifier(random_state=42)
grad_boost = GradientBoostingClassifier(random_state=42)

# 2. Train the models
log_reg.fit(X_train, y_train)
rand_forest.fit(X_train, y_train)
grad_boost.fit(X_train, y_train)

# 3. Make predictions on the test set
lr_preds = log_reg.predict(X_test)
rf_preds = rand_forest.predict(X_test)
gb_preds = grad_boost.predict(X_test)

# 4. Evaluate the models
lr_accuracy = accuracy_score(y_test, lr_preds)
rf_accuracy = accuracy_score(y_test, rf_preds)
gb_accuracy = accuracy_score(y_test, gb_preds)

print("--- Model Performance ---")
print(f"Logistic Regression Accuracy: {lr_accuracy:.2%}")
print(f"Random Forest Accuracy: {rf_accuracy:.2%}")
print(f"Gradient Boosting Accuracy: {gb_accuracy:.2%}")

--- Model Performance ---
Logistic Regression Accuracy: 46.97%
Random Forest Accuracy: 54.55%
Gradient Boosting Accuracy: 48.48%
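
To put accuracy figures like these in context, it helps to compare them with a naive baseline that always predicts the most common outcome. The sketch below uses scikit-learn's `DummyClassifier` on hypothetical toy labels (not the real match data), so the number it prints says nothing about this dataset. Only the pattern is meant to carry over.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: home wins are the most frequent class in training.
y_train_toy = np.array(["H"] * 5 + ["D"] * 3 + ["A"] * 2)
y_test_toy = np.array(["H", "H", "D", "A", "H"])

# The baseline ignores the features, so placeholder zeros are enough.
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((len(y_test_toy), 1))

# strategy="most_frequent" always predicts the majority class seen in training.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_toy, y_train_toy)

baseline_acc = accuracy_score(y_test_toy, baseline.predict(X_test_toy))
print(f"Baseline accuracy: {baseline_acc:.2%}")
```

Fitting the same baseline on `X_train` and `y_train` and scoring it on `X_test` and `y_test` would show how much each model adds over always predicting the most common result.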


##Conclusion

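Random Forest did best, with 54.55% accuracy on the test set, ahead of Gradient Boosting (48.48%) and Logistic Regression (46.97%). For a three-way outcome (home win, draw, away win), where uniform random guessing would score about 33%, that is an encouraging result. Keep in mind that the test set contains only 66 matches, so these figures are rough estimates.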