# ⚽ Premier League Match Predictor (AI & ELO System)

This notebook builds a Machine Learning model to predict Premier League results. It relies on three components:

1.  **Pandas**: for data manipulation.
2.  **Scikit-Learn**: for the ML algorithms.
3.  **ELO System**: a dynamic algorithm that estimates the relative strength of each team.

Imports and Configuration

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Setup visual style
sns.set_style("whitegrid")
%matplotlib inline

## 1. Data Acquisition
We pull real match data from `football-data.co.uk`, loading several consecutive seasons so the model has enough history to learn patterns. The key columns are:

* **FTHG**: Full Time Home Goals
* **FTAG**: Full Time Away Goals
* **FTR**: Full Time Result (H=Home, D=Draw, A=Away)
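
Each season's CSV is addressed by a four-digit season code (for example, the 2019/20 season becomes `1920`) followed by the division code `E0`. The following is a minimal sketch of that formatting step only; `season_code` and `season_url` are hypothetical helper names, and the URL pattern is the one this notebook's loader uses.

```python
def season_code(start_year: int) -> str:
    """Join the two-digit start and end years, e.g. 2019 -> '1920'."""
    return f"{start_year % 100:02d}{(start_year + 1) % 100:02d}"


def season_url(start_year: int, division: str = "E0") -> str:
    """Build the CSV URL for one season (pattern taken from this notebook)."""
    base_url = "https://www.football-data.co.uk/mmz4281/{}/{}.csv"
    return base_url.format(season_code(start_year), division)


print(season_code(2019))  # 1920
print(season_url(2024))
```

Building the URL in a small pure function makes it easy to check the season codes before any network request is made.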

In [None]:
def load_premier_league_data(start_year, end_year):
    base_url = "https://www.football-data.co.uk/mmz4281/{}/{}.csv"
    dfs = []
    
    print(f"Loading data from {start_year} to {end_year}...")
    
    for year in range(start_year, end_year + 1):
        # Format season string (e.g., 2019 -> "1920")
        season_str = f"{str(year)[-2:]}{str(year+1)[-2:]}"
        url = base_url.format(season_str, "E0") 
        
        try:
            df = pd.read_csv(url)
            
            # Select essential columns (copy to avoid chained-assignment warnings)
            cols = ['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR'] 
            available_cols = [c for c in cols if c in df.columns]
            df = df[available_cols].copy()
            
            # Tag the season *after* column selection so the column is not dropped
            df['Season_Start_Year'] = year
            
            # Standardize Date
            df['Date'] = pd.to_datetime(df['Date'], dayfirst=True, errors='coerce')
            dfs.append(df)
        except Exception as e:
            print(f"Error loading {year}: {e}")
            
    full_df = pd.concat(dfs, ignore_index=True)
    full_df = full_df.dropna(subset=['Date', 'FTR'])
    return full_df.sort_values('Date').reset_index(drop=True)

# Load seasons 2018/19 through 2024/25 (7 seasons, since end_year is inclusive)
df = load_premier_league_data(2018, 2024)

# Show the first 5 rows to visualize data structure
display(df.head())
print(f"Total matches loaded: {len(df)}")

## 2. Feature Engineering: The ELO System

The model does not know that "Man City" is strong and "Sheffield" is weak, so we need to turn team names into numbers.
We will implement the **ELO Rating**:
* Every team starts at **1500**.
* Beating a strong team earns many points.
* Beating a weak team earns few points.

This gives us a dynamic measure of "current strength".
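
To make the "many points vs. few points" rule concrete, here is a small worked sketch using the standard Elo expected-score formula with K = 20, the same default as in this notebook. The helper names `expected_home_score` and `elo_gain` are illustrative only.

```python
def expected_home_score(rating_home: float, rating_away: float) -> float:
    """Standard Elo expected score for the home side (between 0 and 1)."""
    return 1 / (1 + 10 ** ((rating_away - rating_home) / 400))


def elo_gain(rating_home, rating_away, actual_result, k_factor=20):
    """Points the home side gains (or loses) for a result of 1, 0.5 or 0."""
    return k_factor * (actual_result - expected_home_score(rating_home, rating_away))


# Two equally rated teams: the expected score is 0.5, so a win is worth K/2.
print(elo_gain(1500, 1500, 1))            # 10.0

# Beating a side rated 200 points higher earns far more than beating one 200 lower.
print(round(elo_gain(1500, 1700, 1), 1))  # 15.2
print(round(elo_gain(1700, 1500, 1), 1))  # 4.8
```

The update is zero-sum: whatever the home side gains, the away side loses.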

In [None]:
def update_elo(rating_home, rating_away, actual_result, k_factor=20):
    # Calculate Expected Score
    expected_home = 1 / (1 + 10 ** ((rating_away - rating_home) / 400))
    
    # Update Ratings
    new_rating_home = rating_home + k_factor * (actual_result - expected_home)
    new_rating_away = rating_away + k_factor * ((1 - actual_result) - (1 - expected_home))
    return new_rating_home, new_rating_away

# Dictionary to track current ratings
current_elo = {}
def get_elo(team):
    return current_elo.get(team, 1500)

# Create columns for the ELO *before* the match starts
df['HomeElo'] = 0.0
df['AwayElo'] = 0.0

# Loop through data chronologically
for index, row in df.iterrows():
    h_team = row['HomeTeam']
    a_team = row['AwayTeam']
    result = row['FTR']
    
    h_elo = get_elo(h_team)
    a_elo = get_elo(a_team)
    
    df.at[index, 'HomeElo'] = h_elo
    df.at[index, 'AwayElo'] = a_elo
    
    # Convert result to number (1=Win, 0.5=Draw, 0=Loss)
    if result == 'H': match_val = 1
    elif result == 'D': match_val = 0.5
    else: match_val = 0
        
    new_h, new_a = update_elo(h_elo, a_elo, match_val)
    current_elo[h_team] = new_h
    current_elo[a_team] = new_a

# Create the Difference Feature (Crucial for the model)
df['EloDiff'] = df['HomeElo'] - df['AwayElo']

# Check the data again
df.tail()

### ELO Visualization
Let's plot how a few teams' ratings evolve over the years. This is a sanity check on the math (e.g., City should rise, and teams that get relegated should fall).

In [None]:
# Let's plot the ELO history of specific teams
teams_to_plot = ['Man City', 'Arsenal', 'Sheffield United']

plt.figure(figsize=(12, 6))

for team in teams_to_plot:
    # Get all matches where the team played home or away
    team_matches = df[(df['HomeTeam'] == team) | (df['AwayTeam'] == team)].copy()
    
    # Use the ELO recorded *before* each match (HomeElo or AwayElo,
    # depending on which side the team played). This lags the true
    # post-match rating by one game, which is close enough for plotting.
    elo_values = []
    dates = []
    
    for idx, row in team_matches.iterrows():
        dates.append(row['Date'])
        if row['HomeTeam'] == team:
            elo_values.append(row['HomeElo'])
        else:
            elo_values.append(row['AwayElo'])
            
    plt.plot(dates, elo_values, label=team)

plt.title("ELO Rating Evolution (2018/19 to 2024/25 seasons)")
plt.ylabel("ELO Rating")
plt.legend()
plt.show()

## 3. Model Preparation and Training
Here we apply the concept you learned in the theory classes: splitting the data into **Train** and **Test** sets.
One caveat: because this is a time series, **we cannot shuffle**. We must not train on a 2024 match and then predict one from 2023.

We will use a **Random Forest**. Think of it as an ensemble of many decision trees (100 in this notebook) that "vote" on the result. It generally copes better with complex, non-linear data than a linear model such as Linear Regression.
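
The no-shuffle rule can be sketched independently of pandas: sort by date, then cut once so that every test match is later than every training match. The function name `chronological_split` and the toy data below are hypothetical, for illustration only.

```python
from datetime import date, timedelta


def chronological_split(matches, train_fraction=0.85):
    """Sort by date and cut once, so the test set holds only the most recent games."""
    ordered = sorted(matches, key=lambda m: m["date"])
    split_index = int(len(ordered) * train_fraction)
    return ordered[:split_index], ordered[split_index:]


# Hypothetical toy data: 20 matches on consecutive days, deliberately listed newest-first
matches = [
    {"date": date(2024, 1, 1) + timedelta(days=i), "result": "HDA"[i % 3]}
    for i in reversed(range(20))
]

train, test = chronological_split(matches)
print(len(train), len(test))                # 17 3
print(train[-1]["date"] < test[0]["date"])  # True
```

A random split would let matches played after a test game leak into training, which tends to make the reported accuracy look better than it would be on genuinely unseen fixtures.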

In [None]:
# 1. Define Features (X) and Target (y)
features = ['HomeElo', 'AwayElo', 'EloDiff']
target_map = {'A': 0, 'D': 1, 'H': 2} # Mapping classes to numbers
df['Target'] = df['FTR'].map(target_map)

# 2. Time-Series Split (Train on past, Test on future)
split_index = int(len(df) * 0.85) # Last 15% of games are for testing

train_df = df.iloc[:split_index]
test_df = df.iloc[split_index:]

X_train = train_df[features]
y_train = train_df['Target']
X_test = test_df[features]
y_test = test_df['Target']

# 3. Initialize and Train Model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# 4. Evaluate
preds = rf_model.predict(X_test)
acc = accuracy_score(y_test, preds)

print(f"Model Accuracy: {acc:.2%}")

### Confusion Matrix
Let's see where the model goes wrong.
* Y axis: what actually happened.
* X axis: what the model predicted.
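
To make the axes concrete, the sketch below counts (actual, predicted) pairs by hand for a few made-up labels, using this notebook's encoding (0 = Away, 1 = Draw, 2 = Home). `confusion_counts` is a hypothetical helper that mirrors the row/column layout described above; it is not the scikit-learn implementation.

```python
def confusion_counts(actual, predicted, n_classes=3):
    """Count label pairs: rows index the actual class, columns the predicted class."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


# Hypothetical labels: 0=Away, 1=Draw, 2=Home
actual = [2, 2, 1, 0, 1, 2]
predicted = [2, 2, 2, 0, 0, 2]

for label, row in zip(["Away", "Draw", "Home"], confusion_counts(actual, predicted)):
    print(label, row)
# Away [1, 0, 0]
# Draw [1, 0, 1]
# Home [0, 0, 3]
```

Correct predictions sit on the diagonal. In this toy example the "Draw" row has nothing on the diagonal, meaning both draws were misclassified.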

In [None]:
cm = confusion_matrix(y_test, preds)
plt.figure(figsize=(6,5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Away', 'Draw', 'Home'], 
            yticklabels=['Away', 'Draw', 'Home'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

## 4. "Real Life" Application
Here is the final function. It uses the `current_elo` dictionary (which holds each team's most recent rating after the last match in the dataset) to make predictions for upcoming matches.
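
When reading the model's probabilities, note that scikit-learn orders the columns of `predict_proba` according to the fitted model's `classes_` attribute. The sketch below pairs each probability with its label explicitly rather than relying on fixed positions. The `classes` and `probs` values are made-up stand-ins for `rf_model.classes_` and one row of `predict_proba`, and `label_probabilities` is a hypothetical helper.

```python
def label_probabilities(classes, probs, names=None):
    """Pair each class code with its probability, keyed by a readable name."""
    names = names or {0: "Away", 1: "Draw", 2: "Home"}
    return {names[c]: p for c, p in zip(classes, probs)}


# Hypothetical values standing in for rf_model.classes_ and predict_proba(...)[0]
classes = [0, 1, 2]
probs = [0.25, 0.30, 0.45]

print(label_probabilities(classes, probs))
# {'Away': 0.25, 'Draw': 0.3, 'Home': 0.45}
```

Looking values up by name guards against a silent mix-up if the training data ever lacks one of the three outcomes.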

In [None]:
def predict_upcoming_match(home_team, away_team):
    # Check if teams exist in our database
    if home_team not in current_elo or away_team not in current_elo:
        print("Error: One of the teams is not in the database.")
        return

    # Get latest ELOs
    h_elo = current_elo[home_team]
    a_elo = current_elo[away_team]
    elo_diff = h_elo - a_elo
    
    # Prepare data for model
    input_data = pd.DataFrame([[h_elo, a_elo, elo_diff]], columns=features)
    
    # Get probabilities
    probs = rf_model.predict_proba(input_data)[0]
    
    print(f"--- {home_team} vs {away_team} ---")
    print(f"Stats: {home_team} ELO: {h_elo:.0f} | {away_team} ELO: {a_elo:.0f}")
    print("Win Probability:")
    print(f"  🏠 Home ({home_team}): {probs[2]*100:.1f}%")
    print(f"  🤝 Draw:             {probs[1]*100:.1f}%")
    print(f"  ✈️ Away ({away_team}): {probs[0]*100:.1f}%")

# Test prediction
predict_upcoming_match("Liverpool", "Man City")
predict_upcoming_match("Fulham", "Chelsea")