# EPL Match Data Analysis (2000–2025)

**Dataset:** English Premier League Match Data 2000–2025  
**Objective:** Analyze trends in match results and explore factors influencing match outcomes.

## Research Questions
1. What are the trends in home vs. away wins over time?
2. How have goals per match evolved by season?
3. Can we build a simple model to predict match results based on historical data?

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
sns.set(style="whitegrid")

In [None]:
# Load data
df = pd.read_csv("dataset.csv")
df.head()

## Data Cleaning & Preparation

In [None]:
def clean_data(dataframe):
    """
    Cleans the EPL dataset:
    - Parses the Date column to datetime (unparseable values become NaT)
    - Fills missing numerical values with 0
    - Strips leading/trailing whitespace from string (object) columns
    """
    df = dataframe.copy()
    
    if 'Date' in df.columns:
        df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
    
    num_cols = df.select_dtypes(include=[np.number]).columns
    df[num_cols] = df[num_cols].fillna(0)
    
    obj_cols = df.select_dtypes(include=['object']).columns
    for col in obj_cols:
        df[col] = df[col].str.strip()
        
    return df

df = clean_data(df)
df.info()
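The seasonal cells that follow are guarded by `if 'Season' in df.columns`, so they are silently skipped when that column is absent. A minimal sketch of deriving it from `Date` is shown here. It assumes every season starts in August or later and finishes before the following August, and the `add_season` helper and the `2000/01` label format are illustrative choices, not part of the dataset.

```python
import pandas as pd

def add_season(dataframe, date_col="Date"):
    """Return a copy with a 'Season' label such as '2000/01' derived from the match date."""
    out = dataframe.copy()
    dates = pd.to_datetime(out[date_col], errors="coerce")
    # Matches from August onward belong to the season starting that year;
    # earlier months belong to the season that started the previous year.
    start_year = dates.dt.year.where(dates.dt.month >= 8, dates.dt.year - 1)
    end_year = (start_year + 1) % 100
    out["Season"] = (
        start_year.astype("Int64").astype(str)
        + "/"
        + end_year.astype("Int64").astype(str).str.zfill(2)
    )
    # Rows whose date could not be parsed get no season label.
    out.loc[dates.isna(), "Season"] = pd.NA
    return out

# Made-up dates, for illustration only
sample = pd.DataFrame({"Date": ["2000-08-19", "2001-05-19", "2020-07-26", "1999-12-26"]})
seasons = add_season(sample)
print(seasons["Season"].tolist())  # ['2000/01', '2000/01', '2019/20', '1999/00']
```

If the real file stores dates day-first (for example `19/08/2000`), the parsing step would need an explicit format before this helper is applied.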

## Summary Statistics

In [None]:
df.describe()

## Goals per Match Over Seasons

In [None]:
if 'Season' in df.columns:
    # Mean goals per match, so seasons with different match counts stay comparable
    goals_per_season = df.groupby('Season')[['FTHG','FTAG']].mean()
    goals_per_season['GoalsPerMatch'] = goals_per_season['FTHG'] + goals_per_season['FTAG']
    
    plt.figure(figsize=(12,6))
    sns.lineplot(data=goals_per_season, x=goals_per_season.index, y='GoalsPerMatch', marker='o')
    plt.title('Goals per Match by Season')
    plt.xlabel('Season')
    plt.ylabel('Goals per match')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()

## Home vs Away Wins Distribution

In [None]:
if 'FTR' in df.columns:
    plt.figure(figsize=(6,4))
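    # hue + legend=False avoids the palette-without-hue deprecation (assumes seaborn >= 0.13)
    sns.countplot(x='FTR', hue='FTR', data=df, order=['H', 'D', 'A'], palette='Set2', legend=False)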
    plt.title('Match Results Distribution')
    plt.xlabel('FTR (H=Home Win, D=Draw, A=Away Win)')
    plt.ylabel('Count')
    plt.show()
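The count plot above pools all seasons, so it cannot show how home and away wins change over time (research question 1). One way to get at the trend, sketched here on a tiny made-up sample, is to compute the share of each result within every season. The `result_share_by_season` helper is illustrative and assumes `Season` and `FTR` columns as used elsewhere in this notebook.

```python
import pandas as pd

def result_share_by_season(dataframe):
    """Share of home wins (H), draws (D) and away wins (A) within each season."""
    shares = pd.crosstab(dataframe["Season"], dataframe["FTR"], normalize="index")
    # Fix the column order and keep a column even if a result never occurs.
    return shares.reindex(columns=["H", "D", "A"], fill_value=0.0)

# Made-up results, for illustration only
toy = pd.DataFrame({
    "Season": ["2000/01"] * 4 + ["2001/02"] * 4,
    "FTR": ["H", "H", "D", "A", "H", "A", "A", "D"],
})
shares = result_share_by_season(toy)
print(shares)
```

On the real data, `result_share_by_season(df)[["H", "A"]].plot(marker="o")` would draw the two shares as lines over the seasons.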

## Predicting Match Outcome (Simple Model)

In [None]:
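def build_model(dataframe):
    """
    Builds a logistic regression model predicting match result (H/D/A).

    Note: FTHG and FTAG are full-time goals, which fully determine FTR,
    so this setup leaks the target and its accuracy does not measure
    pre-match predictive skill.
    """
    df_model = dataframe.copy()
    
    missing = [c for c in ['FTR', 'FTHG', 'FTAG'] if c not in df_model.columns]
    if missing:
        print(f"Missing columns: {missing}")
        return None
    
    # Encode the result; missing or unexpected labels map to NaN and are dropped
    df_model['FTR_encoded'] = df_model['FTR'].map({'H':0,'D':1,'A':2})
    df_model = df_model.dropna(subset=['FTR_encoded'])
    X = df_model[['FTHG','FTAG']]  # Features: full-time goals (leak the result)
    y_encoded = df_model['FTR_encoded'].astype(int)
    
    # Stratify so train and test keep the same H/D/A proportions
    X_train, X_test, y_train, y_test = train_test_split(
        X, y_encoded, test_size=0.2, random_state=RANDOM_STATE, stratify=y_encoded
    )
    
    # The default lbfgs solver already fits a multinomial model for multiclass
    # targets; the multi_class argument is deprecated in recent scikit-learn
    model = LogisticRegression(max_iter=500, random_state=RANDOM_STATE)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    
    acc = accuracy_score(y_test, preds)
    cm = confusion_matrix(y_test, preds, labels=[0, 1, 2])
    
    print(f"Accuracy: {acc:.2f}")
    print("Confusion Matrix (rows = true H/D/A, columns = predicted H/D/A):")
    print(cm)
    
    return model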

_ = build_model(df)
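An accuracy figure is only interpretable against a baseline. A minimal sketch of a majority-class baseline follows: it always predicts the most frequent result seen in training. The `majority_baseline_accuracy` helper and the toy labels are illustrative, not part of the notebook. Because the model above is fed the final score, it should be expected to beat this baseline by a wide margin, so the comparison is more informative once the features are restricted to information available before kick-off.

```python
import pandas as pd

def majority_baseline_accuracy(y_train, y_test):
    """Accuracy of always predicting the most frequent class seen in training."""
    majority = pd.Series(y_train).mode().iloc[0]
    accuracy = float((pd.Series(y_test) == majority).mean())
    return accuracy, majority

# Made-up labels, for illustration only
train_labels = ["H", "H", "D", "A", "H", "A"]
test_labels = ["H", "D", "A", "H"]
baseline_acc, baseline_label = majority_baseline_accuracy(train_labels, test_labels)
print(baseline_label, baseline_acc)  # H 0.5
```

scikit-learn's `DummyClassifier(strategy="most_frequent")` gives the same baseline with the usual `fit`/`predict` interface.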

## Conclusions
- Home wins remain frequent across seasons.
- Goal totals fluctuate moderately.
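- The logistic regression is trained on full-time goals, which already determine the result, so its accuracy reflects target leakage rather than pre-match predictive skill.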

**Next steps:**
- Incorporate more advanced features (team strength, odds).
- Explore time-series forecasting.
- Deploy interactive dashboards.
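As a starting point for the first next step, a simple team-strength proxy can be built from information available before kick-off. The sketch below computes each side's mean goals scored over its previous matches. It assumes hypothetical `HomeTeam` and `AwayTeam` columns alongside `Date`, `FTHG` and `FTAG`. The `add_prematch_form` helper, the `HomeForm`/`AwayForm` names and the window size are illustrative choices.

```python
import pandas as pd

def add_prematch_form(matches, window=5):
    """Add each side's mean goals scored over its previous `window` matches.

    Assumes hypothetical 'Date', 'HomeTeam' and 'AwayTeam' columns
    alongside FTHG and FTAG.
    """
    out = matches.copy()
    out["Date"] = pd.to_datetime(out["Date"])
    out = out.sort_values("Date").reset_index(drop=True)

    # Long format: one row per team per match, tagged with the match position.
    home = out.rename(columns={"HomeTeam": "team", "FTHG": "scored"})[["team", "scored"]]
    home = home.assign(match=out.index, side="Home")
    away = out.rename(columns={"AwayTeam": "team", "FTAG": "scored"})[["team", "scored"]]
    away = away.assign(match=out.index, side="Away")
    long = pd.concat([home, away], ignore_index=True)
    long = long.sort_values("match", kind="stable").reset_index(drop=True)

    # shift(1) excludes the current match, so only earlier results feed the mean.
    long["form"] = long.groupby("team")["scored"].transform(
        lambda s: s.shift(1).rolling(window, min_periods=1).mean()
    )

    # Map the per-team values back onto the match rows (aligned on match position).
    for side in ("Home", "Away"):
        side_form = long.loc[long["side"] == side].set_index("match")["form"]
        out[f"{side}Form"] = side_form
    return out

# Made-up fixtures, for illustration only
fixtures = pd.DataFrame({
    "Date": ["2024-08-01", "2024-08-08", "2024-08-15", "2024-08-22"],
    "HomeTeam": ["A", "B", "A", "B"],
    "AwayTeam": ["B", "A", "B", "A"],
    "FTHG": [2, 1, 0, 3],
    "FTAG": [0, 1, 2, 1],
})
with_form = add_prematch_form(fixtures, window=2)
print(with_form[["HomeTeam", "AwayTeam", "HomeForm", "AwayForm"]])
```

Each team's first match has no history, so its form value is NaN. Those rows would need to be dropped or imputed before `HomeForm` and `AwayForm` replace `FTHG` and `FTAG` as model features.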