## Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis (LDA) is a dimensionality reduction and classification technique used in machine learning. It finds the linear combinations of features that best separate two or more classes. Unlike logistic regression, which models class probabilities directly, LDA models the feature distribution of each class and projects the data onto a lower-dimensional space that maximizes class separability.

### How It Works:
- **Assumption of Normality**: Assumes that, within each class, the features follow a Gaussian distribution, with one covariance matrix shared by all classes.
- **Class Separation**: Finds a linear decision boundary by maximizing the ratio of between-class variance to within-class variance.
- **Bayes' Theorem**: Uses Bayes' theorem to estimate the probability that a data point belongs to each class and assigns it to the class with the highest probability.
- **Dimensionality Reduction**: Projects the data onto at most C - 1 discriminant axes for C classes, retaining the most discriminative information.
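
The steps above can be sketched with scikit-learn on synthetic data. The three Gaussian blobs, the random seed, and all variable names below are illustrative assumptions and have nothing to do with the project data.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic data (an assumption for illustration): three Gaussian classes
# in four dimensions that share the same identity covariance matrix.
rng = np.random.default_rng(0)
means = np.array([[0, 0, 0, 0], [3, 3, 0, 0], [0, 3, 3, 0]], dtype=float)
X = np.vstack([rng.normal(loc=m, scale=1.0, size=(100, 4)) for m in means])
y = np.repeat([0, 1, 2], 100)

lda = LinearDiscriminantAnalysis()  # default solver is "svd"
lda.fit(X, y)

# Bayes' theorem step: posterior probability of each class for every sample.
posteriors = lda.predict_proba(X)
# The predicted class is the one with the highest posterior.
preds = lda.predict(X)

# Dimensionality reduction: at most (n_classes - 1) discriminant axes.
X_proj = lda.transform(X)

print(posteriors.shape)  # (300, 3)
print(X_proj.shape)      # (300, 2)
```

With three classes, `transform` returns two columns regardless of how many input features there are, which is the C - 1 limit mentioned above.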

### Advantages:
✅ **Handles Multi-Class Problems**: LDA handles more than two classes directly, whereas binary logistic regression needs a multinomial or one-vs-rest extension.  
✅ **Effective for Linearly Separable Data**: Works well when the class distributions have distinct means.  
✅ **Reduces Overfitting**: By projecting data onto lower dimensions, it can help prevent overfitting in high-dimensional datasets.  
✅ **Computationally Efficient**: Faster to train and evaluate compared to more complex models like Support Vector Machines (SVMs).  

### Disadvantages:
❌ **Assumption of Normality**: Performance may degrade if the feature distribution is highly non-Gaussian.  
❌ **Sensitive to Outliers**: Outliers can affect the mean and covariance estimates, leading to poor classification.  
❌ **Limited to Linear Boundaries**: Like logistic regression, it struggles with non-linear relationships unless extended with kernel methods.  
❌ **Requires Balanced Classes**: Works best when class sizes are roughly equal; under strong imbalance, the estimated class priors pull predictions toward the majority class.  
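
The class-balance point can be made concrete. By default, scikit-learn's LDA estimates class priors from the class proportions in the training labels; the `priors` parameter overrides them. The sketch below uses synthetic data (the 900/100 split and the class means are assumptions for illustration) to compare the two settings.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic imbalanced data (illustrative assumption): 900 vs. 100 samples.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(900, 2)),  # majority class 0
    rng.normal(loc=1.5, scale=1.0, size=(100, 2)),  # minority class 1
])
y = np.array([0] * 900 + [1] * 100)

# Default: priors are estimated from the class proportions in y.
lda_default = LinearDiscriminantAnalysis().fit(X, y)

# Alternative: state equal priors explicitly.
lda_uniform = LinearDiscriminantAnalysis(priors=[0.5, 0.5]).fit(X, y)

print(lda_default.priors_)  # [0.9 0.1]
print(lda_uniform.priors_)  # [0.5 0.5]

# Compare how often each model predicts the minority class.
n_minority_default = int((lda_default.predict(X) == 1).sum())
n_minority_uniform = int((lda_uniform.predict(X) == 1).sum())
print(n_minority_default, n_minority_uniform)
```

Whether equal priors are appropriate depends on the problem; they trade majority-class accuracy for minority-class recall.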


In [22]:
# Import all the necessary dependencies
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import cross_val_score, GridSearchCV, KFold

In [23]:
%run ../Data/Data_Formatting.ipynb

In [24]:
%run ../Data/Ultimate_Hyperparameters.ipynb

In [25]:
%run ../Data/Parameters.ipynb

In [26]:
%pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [27]:
# Load the training dataset
train_path = Path("../Data/premierleague_team_data.csv")
matches = pd.read_csv(train_path)

# Load the testing dataset
test_path = Path("../Data/premierleague_test_team_data.csv")
test_matches = pd.read_csv(test_path)

In [28]:
# Load the training dataset with rank
train_path = Path("../Data/premierleague_rank_team_data.csv")
new_matches = pd.read_csv(train_path)

# Load the testing dataset with rank
test_path = Path("../Data/premierleague_rank_test_team_data.csv")
new_test_matches = pd.read_csv(test_path)

In [29]:
process_data(matches, test_matches)

In [30]:
process_data(new_matches, new_test_matches)

### LDA using Baseline Predictors (see /Data/Data_Formatting.ipynb)

In [31]:
def make_yearly_predictions_lda_base(Train, Test):
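    # Baseline predictor columns; parameters_base comes from the notebooks
    # loaded with %run above. 'Date' must already be a datetime column,
    # because the loop below uses the .dt accessor.
    static_predictors = parameters_base(Train, Test)

    # Train an LDA model; best_shrinkage also comes from those notebooks
    lda_clf = LinearDiscriminantAnalysis(solver="lsqr", shrinkage=best_shrinkage)
    lda_clf.fit(Train[static_predictors], Train["Target"])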

    results = []
    for year in range(Test['Date'].dt.year.min(), Test['Date'].dt.year.max() + 1):
        test_year = Test[Test['Date'].dt.year == year]
        if not test_year.empty:
            # Predict on test data
            preds = lda_clf.predict(test_year[static_predictors])

            # Weighted precision and accuracy; zero_division=1 scores a
            # class that is never predicted as precision 1
            precision = precision_score(test_year["Target"], preds, average="weighted", zero_division=1)
            accuracy = accuracy_score(test_year["Target"], preds)

            # Append results to list
            results.append({
                "Model": "Linear Discriminant Analysis",
                "Year": year,
                "Precision": precision,
                "Accuracy": accuracy
            })

    # Convert results to DataFrame
    results_df = pd.DataFrame(results)
    
    return results_df
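
`best_shrinkage` is defined in a notebook that is not shown here. As a rough sketch of how such a value could be chosen, and not necessarily how this project does it, the `GridSearchCV` and `KFold` imports above can search a grid of shrinkage values. The synthetic data, the grid, the fold count, and the name `best_shrinkage_demo` are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in for the match data (an assumption for illustration).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, random_state=0,
)

# Candidate shrinkage values in [0, 1]: 0 is the empirical covariance,
# 1 shrinks fully toward a scaled identity matrix.
param_grid = {"shrinkage": [0.0, 0.25, 0.5, 0.75, 1.0]}

search = GridSearchCV(
    LinearDiscriminantAnalysis(solver="lsqr"),  # shrinkage needs lsqr or eigen
    param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
)
search.fit(X, y)

best_shrinkage_demo = search.best_params_["shrinkage"]  # hypothetical name
print(best_shrinkage_demo, round(search.best_score_, 3))
```

scikit-learn also accepts `shrinkage="auto"`, which picks the value analytically with the Ledoit-Wolf lemma instead of cross-validation.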


### LDA using Baseline Predictors + Rolling Predictors (see /Data/Data_Formatting.ipynb)

In [32]:
def make_yearly_predictions_lda_roll(Train, Test):
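    # Baseline + rolling predictor columns; parameters_roll and roll come
    # from the notebooks loaded with %run above. 'Date' must already be a
    # datetime column, because the loop below uses the .dt accessor.
    all_predictors = parameters_roll(Train, Test)
    Train = roll(Train)
    Test = roll(Test)

    # Train an LDA model; best_shrinkage also comes from those notebooks
    lda_clf = LinearDiscriminantAnalysis(solver="lsqr", shrinkage=best_shrinkage)
    lda_clf.fit(Train[all_predictors], Train["Target"])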

    results = []
    for year in range(Test['Date'].dt.year.min(), Test['Date'].dt.year.max() + 1):
        test_year = Test[Test['Date'].dt.year == year]
        if not test_year.empty:
            # Predict on test data
            preds = lda_clf.predict(test_year[all_predictors])

            # Weighted precision and accuracy; zero_division=1 scores a
            # class that is never predicted as precision 1
            precision = precision_score(test_year["Target"], preds, average="weighted", zero_division=1)
            accuracy = accuracy_score(test_year["Target"], preds)

            # Append results to list
            results.append({
                "Model": "Linear Discriminant Analysis",
                "Year": year,
                "Precision": precision,
                "Accuracy": accuracy
            })

    # Convert results to DataFrame
    results_df = pd.DataFrame(results)
    
    return results_df
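
`roll` is defined in a notebook that is not shown here, so its exact behavior is unknown. Purely to illustrate what a rolling predictor is, the sketch below computes each team's average over its previous three matches with pandas. The column names `Team`, `Date`, and `GF`, the window of 3, and the helper name `add_rolling_mean` are all hypothetical.

```python
import pandas as pd

def add_rolling_mean(df, col, window=3):
    """Add '<col>_rolling': each team's mean of `col` over its previous
    `window` matches. The current match is excluded to avoid leakage."""
    df = df.sort_values(["Team", "Date"]).copy()
    df[f"{col}_rolling"] = (
        df.groupby("Team")[col]
          .transform(lambda s: s.shift(1).rolling(window).mean())
    )
    return df

# Toy data (hypothetical values, not taken from the project files).
toy = pd.DataFrame({
    "Team": ["A"] * 5,
    "Date": pd.date_range("2023-01-01", periods=5, freq="7D"),
    "GF": [1, 2, 3, 4, 5],
})
out = add_rolling_mean(toy, "GF")
print(out["GF_rolling"].tolist())  # [nan, nan, nan, 2.0, 3.0]
```

The `shift(1)` keeps the current match out of its own feature, so the first `window` rows of each team are NaN and would need to be dropped or filled before fitting.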


### LDA using Full Feature Set (see /Data/Data_Formatting.ipynb)

In [33]:
def make_yearly_predictions_lda_full(Train, Test):
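    # Full predictor set; parameters_full and roll come from the notebooks
    # loaded with %run above. 'Date' must already be a datetime column,
    # because the loop below uses the .dt accessor.
    all_predictors = parameters_full(Train, Test)
    Train = roll(Train)
    Test = roll(Test)

    # Train an LDA model; best_shrinkage also comes from those notebooks
    lda_clf = LinearDiscriminantAnalysis(solver="lsqr", shrinkage=best_shrinkage)
    lda_clf.fit(Train[all_predictors], Train["Target"])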

    results = []
    for year in range(Test['Date'].dt.year.min(), Test['Date'].dt.year.max() + 1):
        test_year = Test[Test['Date'].dt.year == year]
        if not test_year.empty:
            # Predict on test data
            preds = lda_clf.predict(test_year[all_predictors])

            # Weighted precision and accuracy; zero_division=1 scores a
            # class that is never predicted as precision 1
            precision = precision_score(test_year["Target"], preds, average="weighted", zero_division=1)
            accuracy = accuracy_score(test_year["Target"], preds)

            # Append results to list
            results.append({
                "Model": "Linear Discriminant Analysis",
                "Year": year,
                "Precision": precision,
                "Accuracy": accuracy
            })

    # Convert results to DataFrame
    results_df = pd.DataFrame(results)
    
    return results_df
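
The three functions in this notebook differ only in how the predictor list is built and whether `roll` is applied. One possible refactoring, shown as a sketch and not as the notebook's actual code, moves the shared fit-and-score loop into a single helper that takes the predictor list and shrinkage as arguments. The helper name `evaluate_lda_by_year`, the feature names `f1` and `f2`, and the synthetic data are assumptions; the `Date` and `Target` column names follow the code above.

```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, precision_score

def evaluate_lda_by_year(train, test, predictors, shrinkage="auto"):
    """Fit LDA on `train`, then score it separately on each year of `test`."""
    clf = LinearDiscriminantAnalysis(solver="lsqr", shrinkage=shrinkage)
    clf.fit(train[predictors], train["Target"])

    rows = []
    # groupby only yields years that occur, so no empty-year check is needed.
    for year, test_year in test.groupby(test["Date"].dt.year):
        preds = clf.predict(test_year[predictors])
        rows.append({
            "Model": "Linear Discriminant Analysis",
            "Year": int(year),
            "Precision": precision_score(test_year["Target"], preds,
                                         average="weighted", zero_division=1),
            "Accuracy": accuracy_score(test_year["Target"], preds),
        })
    return pd.DataFrame(rows)

# Synthetic stand-in data (hypothetical): two features, binary target.
rng = np.random.default_rng(0)

def make_frame(start, n):
    X = rng.normal(size=(n, 2))
    return pd.DataFrame({
        "Date": pd.date_range(start, periods=n, freq="7D"),
        "f1": X[:, 0],
        "f2": X[:, 1],
        "Target": (X[:, 0] + X[:, 1] > 0).astype(int),
    })

train_df = make_frame("2020-01-01", 200)
test_df = make_frame("2024-01-01", 80)  # weekly dates spanning 2024 and 2025

yearly = evaluate_lda_by_year(train_df, test_df, ["f1", "f2"])
print(yearly)
```

With a helper like this, each variant would reduce to building its predictor list (and optionally applying `roll`) before one call.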