# Basic Classification Models: LDA, QDA, KNN, and Logistic Regression  

 In this section, we explore four fundamental classification models: **Linear Discriminant Analysis (LDA)**, **Quadratic Discriminant Analysis (QDA)**, **K-Nearest Neighbors (KNN)**, and **Logistic Regression**. These models serve as the foundation for many machine learning classification problems.
 
## 1. Linear Discriminant Analysis (LDA)  
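LDA finds a linear combination of features that best separates the classes. It assumes that the features within each class are normally distributed and that all classes share the same covariance matrix, which is what makes the decision boundary linear.  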

✅ Works well for linearly separable data  
❌ Struggles with non-linear class boundaries  
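
As a minimal sketch of the idea, the snippet below fits scikit-learn's `LinearDiscriminantAnalysis` on synthetic data. The two Gaussian classes, their means, and the shared covariance matrix are assumptions made up for illustration; they are not the Premier League data used later in this notebook.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical data: two Gaussian classes that share one covariance matrix,
# which is exactly the setting LDA assumes.
rng = np.random.default_rng(0)
shared_cov = [[1.0, 0.3], [0.3, 1.0]]
X0 = rng.multivariate_normal([0.0, 0.0], shared_cov, size=200)
X1 = rng.multivariate_normal([3.0, 3.0], shared_cov, size=200)
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

# For two classes the boundary is the line coef_ @ x + intercept_ = 0.
lda_train_accuracy = lda.score(X, y)
print("Coefficients:", lda.coef_)
print("Intercept:", lda.intercept_)
print("Training accuracy:", round(lda_train_accuracy, 3))
```

The fitted `coef_` and `intercept_` describe a single straight line, so however the data are distributed, LDA can only ever draw a linear boundary.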

## 2. Quadratic Discriminant Analysis (QDA)  
QDA extends LDA by allowing each class to have its own covariance matrix, leading to quadratic decision boundaries.  

✅ More flexible than LDA  
❌ Requires more data, prone to overfitting  
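
To illustrate the difference, here is a small sketch on synthetic data where both classes are centered at the origin but have very different spreads. The data and all parameter values are assumptions chosen for illustration. In this setting no straight line separates the classes well, while a quadratic boundary (roughly a circle) can.

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.model_selection import train_test_split

# Hypothetical data: same mean, different covariance per class.
rng = np.random.default_rng(0)
X_tight = rng.normal(0.0, 0.5, size=(300, 2))  # class 0: small spread
X_wide = rng.normal(0.0, 3.0, size=(300, 2))   # class 1: large spread
X = np.vstack([X_tight, X_wide])
y = np.array([0] * 300 + [1] * 300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# LDA pools one covariance matrix; QDA estimates one per class.
lda_acc = LinearDiscriminantAnalysis().fit(X_train, y_train).score(X_test, y_test)
qda_acc = QuadraticDiscriminantAnalysis().fit(X_train, y_train).score(X_test, y_test)

print("LDA test accuracy:", round(lda_acc, 3))
print("QDA test accuracy:", round(qda_acc, 3))
```

The extra flexibility comes at a price: QDA estimates a full covariance matrix for every class, which is why it needs more data than LDA.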

## 3. K-Nearest Neighbors (KNN)  
KNN classifies a point based on the majority class of its **k** closest neighbors using a distance metric.  

✅ No assumptions about data distribution  
❌ Computationally expensive, sensitive to noise  
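
Because KNN relies on distances, features measured on larger scales can dominate the neighbor search. The sketch below, on made-up blob data, standardizes the features before classifying. The choice of `k = 5` and of `StandardScaler` is an assumption for illustration, not necessarily the configuration used later in this notebook.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two well-separated clusters.
X, y = make_blobs(
    n_samples=400, centers=[[0, 0], [4, 4]], cluster_std=1.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scale first, then vote among the 5 nearest training points
# (Euclidean distance is the scikit-learn default).
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)

knn_predictions = knn.predict(X_test)
knn_test_accuracy = knn.score(X_test, y_test)
print("KNN test accuracy:", round(knn_test_accuracy, 3))
```

Wrapping the scaler and the classifier in one pipeline means the scaling statistics are learned from the training split only.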

## 4. Logistic Regression  
Logistic Regression models the probability of the positive class by passing a linear combination of the features through a sigmoid function. It is most naturally suited to binary classification.  

✅ Simple, interpretable, works well for linear problems  
❌ Struggles with non-linearity unless features are transformed  
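
The sigmoid step can be checked by hand. The sketch below fits a binary `LogisticRegression` on synthetic data (an assumption for illustration) and recomputes the predicted probabilities from the fitted coefficients using $\sigma(z) = 1 / (1 + e^{-z})$.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: two overlapping Gaussian classes.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(-1.0, 1.0, size=(150, 2)),
    rng.normal(1.0, 1.0, size=(150, 2)),
])
y = np.array([0] * 150 + [1] * 150)

log_reg = LogisticRegression().fit(X, y)

# Probability of class 1 as reported by scikit-learn.
proba_positive = log_reg.predict_proba(X)[:, 1]

# The same probability computed manually: sigmoid of the linear score.
z = X @ log_reg.coef_.ravel() + log_reg.intercept_[0]
manual_sigmoid = 1.0 / (1.0 + np.exp(-z))

print("Match:", np.allclose(proba_positive, manual_sigmoid))
```

The linear score `z` is what makes the model interpretable: each coefficient is the change in log-odds per unit change in its feature.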

## Summary  
- **LDA/QDA**: Good for normally distributed data; LDA produces linear decision boundaries, QDA quadratic ones.  
- **KNN**: Flexible but slow on large datasets.  
- **Logistic Regression**: Best for simple binary classification with linear relationships.  
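
One way to compare the four models side by side is k-fold cross-validation. The following is a sketch under stated assumptions: the dataset is synthetic, and the hyperparameters (`n_neighbors=5`, `max_iter=1000`, 5 folds) are illustrative defaults rather than the tuned values loaded elsewhere in this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical binary classification data with 5 informative features.
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=5, n_redundant=0, random_state=0
)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "LDA": LinearDiscriminantAnalysis(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "QDA": QuadraticDiscriminantAnalysis(),
}

# Use the same folds for every model so the comparison is like for like.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
rows = []
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    rows.append({"Model": name, "Mean CV Accuracy": scores.mean()})

cv_summary = pd.DataFrame(rows)
print(cv_summary)
```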



In [1]:
# Import all the necessary dependencies
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import cross_val_score, KFold
from sklearn.model_selection import GridSearchCV

In [2]:
pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [3]:
%run ../Data/Data_Formatting.ipynb

In [4]:
%run ../Data/Ultimate_Hyperparameters.ipynb

In [5]:
%run ../Data/Parameters.ipynb

In [6]:
#loading the training dataset 
train_path = Path("../Data/premierleague_team_data.csv")
matches = pd.read_csv(train_path)

#loading the testing data 
test_path = Path("../Data/premierleague_test_team_data.csv")
test_matches = pd.read_csv(test_path)

In [7]:
#loading the training dataset with rank
train_path = Path("../Data/premierleague_rank_team_data.csv")
new_matches = pd.read_csv(train_path)

#loading the testing data with rank
test_path = Path("../Data/premierleague_rank_test_team_data.csv")
new_test_matches = pd.read_csv(test_path)

# Best Model Selection 

## Choosing the Best Overall Model

In our analysis, we have evaluated four different machine learning models: **K-Nearest Neighbors (KNN), Linear Discriminant Analysis (LDA), Logistic Regression, and Quadratic Discriminant Analysis (QDA)**. Each of these models has been tested on multiple years of data, and their performance has been measured using **accuracy** and **precision**.

### Why Do We Need to Choose the Best Model?

1. **Consistency Across Years**  
   Some models may perform well in certain years but not in others. Selecting the best overall model ensures that we choose the one that provides **consistent performance across multiple years** rather than excelling only in specific cases.

2. **Generalization to Future Data**  
   The goal is to make predictions on new, unseen data. A model that performs well across different years is more **robust and reliable** for future predictions.

3. **Maximizing Accuracy and Precision**  
   By comparing the performance of all models, we can **identify the model with the highest accuracy and precision**. This helps in minimizing errors and improving decision-making.

4. **Avoiding Overfitting**  
   Some models might perform exceptionally well on training data but fail on test data. By analyzing results across multiple years, we ensure that the chosen model is **not overfitting** and generalizes well.

### How Will We Choose the Best Model?
We will compare all models based on:
- **Accuracy** (how often the model makes correct predictions)  
- **Precision** (how reliable the model's positive predictions are)  
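
As a small worked example, both metrics can be computed with scikit-learn and checked against their definitions. The label vectors below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score

# Hypothetical true labels and predictions (1 = positive class).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 1, 1, 0])

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)

# The same quantities from their definitions.
true_positives = np.sum((y_pred == 1) & (y_true == 1))
false_positives = np.sum((y_pred == 1) & (y_true == 0))
manual_accuracy = np.mean(y_pred == y_true)  # correct / total
manual_precision = true_positives / (true_positives + false_positives)

print("Accuracy:", accuracy, "Precision:", precision)
```

Here 5 of the 8 predictions are correct, so accuracy is 0.625, and 3 of the 5 positive predictions are correct, so precision is 0.6.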

The model that consistently achieves the highest accuracy and precision across multiple years will be selected as the **best overall model**.
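
The selection logic can be sketched on a toy results table. The column names `Year`, `Model`, `Accuracy`, and `Precision` match those used by the functions in this notebook, but the numbers are invented purely to show the mechanics.

```python
import pandas as pd

# Hypothetical per-year results for two models.
toy_results = pd.DataFrame({
    "Year": [2021, 2021, 2022, 2022],
    "Model": ["KNN", "LDA", "KNN", "LDA"],
    "Accuracy": [0.60, 0.65, 0.70, 0.62],
    "Precision": [0.55, 0.66, 0.58, 0.64],
})

# Best model in each year: take the row with the highest accuracy per year.
best_per_year = toy_results.loc[toy_results.groupby("Year")["Accuracy"].idxmax()]

# Best model overall: average each metric per model, then take the maximum.
avg_results = (
    toy_results.groupby("Model")[["Accuracy", "Precision"]].mean().reset_index()
)
best_by_accuracy = avg_results.loc[avg_results["Accuracy"].idxmax()]
best_by_precision = avg_results.loc[avg_results["Precision"].idxmax()]

print(best_per_year)
print("Best by accuracy:", best_by_accuracy["Model"])
print("Best by precision:", best_by_precision["Model"])
```

In this toy table the two criteria disagree (KNN wins on average accuracy, LDA on average precision), which is why both rankings are reported separately.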

By following this approach, we ensure that our chosen model is the most reliable and effective for making predictions. 🚀


## Models with Baseline Predictors

In [8]:
# Run all models
def best_basic_model_baseline(A, B):
    KNN_results = make_yearly_predictions_knn_base(A, B)
    LDA_results = make_yearly_predictions_lda_base(A, B)
    LR_results = make_yearly_predictions_lr_base(A, B)
    QDA_results = make_yearly_predictions_qda_base(A, B)

    # Combine all results
    all_results = pd.concat([KNN_results, LDA_results, LR_results, QDA_results], ignore_index=True)

    # Find best accuracy & precision for each year
    best_per_year = all_results.loc[all_results.groupby("Year")["Accuracy"].idxmax()]
    best_per_year_precision = all_results.loc[all_results.groupby("Year")["Precision"].idxmax()]

    # Compute average precision and accuracy per model
    avg_results = all_results.groupby("Model")[["Accuracy", "Precision"]].mean().reset_index()

    # Find the best overall model based on highest average accuracy
    best_model = avg_results.loc[avg_results["Accuracy"].idxmax()]

    # Find the best overall model based on highest average precision
    best_precision_model = avg_results.loc[avg_results["Precision"].idxmax()]

    # Display results
    print("Best Model Per Year (by Accuracy):")
    print(best_per_year)

    print("\nBest Model Per Year (by Precision):")
    print(best_per_year_precision)

    print("\nOverall Best Model (by Accuracy):")
    print(f"Model: {best_model['Model']}, Avg Precision: {best_model['Precision']:.4f}, Avg Accuracy: {best_model['Accuracy']:.4f}")

    print("\nOverall Best Model (by Precision):")
    print(f"Model: {best_precision_model['Model']}, Avg Precision: {best_precision_model['Precision']:.4f}, Avg Accuracy: {best_precision_model['Accuracy']:.4f}")


## Models with Baseline Predictors + Rolling Predictors

In [9]:
# Run all models
def best_basic_model_rolling(A, B):
    KNN_results = make_yearly_predictions_knn_roll(A, B)
    LDA_results = make_yearly_predictions_lda_roll(A, B)
    LR_results = make_yearly_predictions_lr_roll(A, B)
    QDA_results = make_yearly_predictions_qda_roll(A, B)

    # Combine all results
    all_results = pd.concat([KNN_results, LDA_results, LR_results, QDA_results], ignore_index=True)

    # Find best accuracy & precision for each year
    best_per_year = all_results.loc[all_results.groupby("Year")["Accuracy"].idxmax()]
    best_per_year_precision = all_results.loc[all_results.groupby("Year")["Precision"].idxmax()]

    # Compute average precision and accuracy per model
    avg_results = all_results.groupby("Model")[["Accuracy", "Precision"]].mean().reset_index()

    # Find the best overall model based on highest average accuracy
    best_model = avg_results.loc[avg_results["Accuracy"].idxmax()]

    # Find the best overall model based on highest average precision
    best_precision_model = avg_results.loc[avg_results["Precision"].idxmax()]

    # Display results
    print("Best Model Per Year (by Accuracy):")
    print(best_per_year)

    print("\nBest Model Per Year (by Precision):")
    print(best_per_year_precision)

    print("\nOverall Best Model (by Accuracy):")
    print(f"Model: {best_model['Model']}, Avg Precision: {best_model['Precision']:.4f}, Avg Accuracy: {best_model['Accuracy']:.4f}")

    print("\nOverall Best Model (by Precision):")
    print(f"Model: {best_precision_model['Model']}, Avg Precision: {best_precision_model['Precision']:.4f}, Avg Accuracy: {best_precision_model['Accuracy']:.4f}")


## Models with Full Set Predictors

In [10]:
# Run all models
def best_basic_model_full(A, B):
    KNN_results = make_yearly_predictions_knn_full(A, B)
    LDA_results = make_yearly_predictions_lda_full(A, B)
    LR_results = make_yearly_predictions_lr_full(A, B)
    QDA_results = make_yearly_predictions_qda_full(A, B)

    # Combine all results
    all_results = pd.concat([KNN_results, LDA_results, LR_results, QDA_results], ignore_index=True)

    # Find best accuracy & precision for each year
    best_per_year = all_results.loc[all_results.groupby("Year")["Accuracy"].idxmax()]
    best_per_year_precision = all_results.loc[all_results.groupby("Year")["Precision"].idxmax()]

    # Compute average precision and accuracy per model
    avg_results = all_results.groupby("Model")[["Accuracy", "Precision"]].mean().reset_index()

    # Find the best overall model based on highest average accuracy
    best_model = avg_results.loc[avg_results["Accuracy"].idxmax()]

    # Find the best overall model based on highest average precision
    best_precision_model = avg_results.loc[avg_results["Precision"].idxmax()]

    # Display results
    print("Best Model Per Year (by Accuracy):")
    print(best_per_year)

    print("\nBest Model Per Year (by Precision):")
    print(best_per_year_precision)

    print("\nOverall Best Model (by Accuracy):")
    print(f"Model: {best_model['Model']}, Avg Precision: {best_model['Precision']:.4f}, Avg Accuracy: {best_model['Accuracy']:.4f}")

    print("\nOverall Best Model (by Precision):")
    print(f"Model: {best_precision_model['Model']}, Avg Precision: {best_precision_model['Precision']:.4f}, Avg Accuracy: {best_precision_model['Accuracy']:.4f}")


## Importance of Training Accuracy 

Training accuracy measures how well a machine learning model fits the training data. It is important to check training accuracy for the following reasons:

1. **Detecting Underfitting**  
   - If the training accuracy is **too low**, it means the model is **not learning enough** patterns from the data.  
   - This could be due to an **overly simple model**, insufficient features, or poor hyperparameters.

2. **Ensuring Model Competency**  
   - A model with **reasonable training accuracy** ensures that it has successfully learned meaningful patterns from the dataset.  
   - If the model cannot achieve high accuracy on the training data, it is unlikely to perform well on new data.

3. **Providing a Baseline for Comparison**  
   - Training accuracy helps us **compare** with testing accuracy to detect **overfitting**.  
   - If training accuracy is significantly higher than testing accuracy, the model might be **memorizing** rather than **generalizing**.

💡 **Key Insight**: While high training accuracy is desirable, it should not be the sole indicator of a good model. We must also check testing accuracy to ensure real-world performance.
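
The gap between training and testing accuracy can be made concrete with KNN. The sketch below uses synthetic, deliberately noisy data, and the two values of `k` are assumptions chosen for illustration. With `k = 1` every training point is its own nearest neighbor, so the model memorizes the training set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical noisy data: flip_y randomly reassigns a share of the labels.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=4, n_redundant=0,
    flip_y=0.2, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Store (training accuracy, testing accuracy) for each k.
gaps = {}
for k in (1, 25):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    gaps[k] = (train_acc, test_acc)
    print(f"k={k}: train={train_acc:.3f}, test={test_acc:.3f}, "
          f"gap={train_acc - test_acc:.3f}")
```

A perfect training score for `k = 1` says nothing about generalization; the size of the gap to the testing score is the informative number.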


## Models with Baseline Predictors

## Models with Baseline Predictors + Rolling Predictors

## Models with Full Set Predictors

## Importance of Checking Testing Accuracy

Testing accuracy measures how well a machine learning model performs on **unseen** data. It is crucial to check testing accuracy for the following reasons:

1. **Evaluating Generalization**  
   - The primary goal of machine learning is to create models that **generalize well** to new data.  
   - A high testing accuracy indicates that the model can make reliable predictions on unseen samples.

2. **Detecting Overfitting**  
   - If the training accuracy is high but the testing accuracy is low, it suggests **overfitting**.  
   - Overfitting occurs when the model learns **specific details** of the training data rather than general patterns, making it unreliable for new data.

3. **Validating Model Performance**  
   - A model is only useful if it performs well on real-world data.  
   - Testing accuracy gives us a **realistic expectation** of how the model will behave when deployed.

4. **Comparing Different Models**  
   - By evaluating testing accuracy across different models, we can select the best model for **real-world applications**.  
   - The model with the highest **testing accuracy and precision** is often the best choice.

💡 **Key Insight**: A good model should have **both high training and testing accuracy**. A balance between these ensures that the model is neither too simple (underfitting) nor too complex (overfitting).


## Models with Baseline Predictors

In [11]:
# best_basic_model_baseline(matches, test_matches)

## Models with Baseline Predictors + Rolling Predictors

In [12]:
# best_basic_model_rolling(matches, test_matches)

## Models with Full Set Predictors

In [13]:
# best_basic_model_full(new_matches, new_test_matches)