In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split 
import time
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
import pickle
import matplotlib.pyplot as plt

## Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis (LDA) is a supervised dimensionality reduction technique: given labeled data, it shrinks the feature space while preserving as much class-discriminatory information as possible. Unlike PCA, which is unsupervised and keeps the directions of greatest variance, LDA uses the class labels and looks for the directions that best separate the classes.
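
The contrast with PCA shows up directly in how the two estimators are called. The sketch below uses the Iris dataset purely as a stand-in (it is not the notebook's `prep.csv`), and the `_demo` variable names are chosen here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Iris is an assumption for illustration only: 150 rows, 4 features, 3 classes.
X_demo, y_demo = load_iris(return_X_y=True)

# PCA is unsupervised: fit_transform receives only the features.
X_pca_demo = PCA(n_components=2).fit_transform(X_demo)

# LDA is supervised: fit_transform also needs the class labels.
X_lda_demo = LinearDiscriminantAnalysis(n_components=2).fit_transform(X_demo, y_demo)

print(X_pca_demo.shape, X_lda_demo.shape)
```

Both calls return a two-column array, but only the LDA projection was chosen with the labels in view.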

### Key Concepts:

**Goal:** 
LDA aims to find a new space that maximizes the separation (or distance) between different classes while minimizing the spread (variance) within each class.

**Discriminants:** 
LDA computes the directions (linear combinations of features) that best separate the classes.
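
One common way to write this goal down is the Fisher criterion. The symbols are notation introduced here rather than taken from the notebook: $S_W$ is the within-class scatter matrix, $S_B$ the between-class scatter matrix, $\boldsymbol{\mu}_c$ and $n_c$ the mean and size of class $c$, and $\boldsymbol{\mu}$ the overall mean.

```latex
J(\mathbf{w}) = \frac{\mathbf{w}^{\top} S_B \, \mathbf{w}}{\mathbf{w}^{\top} S_W \, \mathbf{w}},
\qquad
S_W = \sum_{c} \sum_{i \,:\, y_i = c} (\mathbf{x}_i - \boldsymbol{\mu}_c)(\mathbf{x}_i - \boldsymbol{\mu}_c)^{\top},
\qquad
S_B = \sum_{c} n_c \, (\boldsymbol{\mu}_c - \boldsymbol{\mu})(\boldsymbol{\mu}_c - \boldsymbol{\mu})^{\top}
```

A direction $\mathbf{w}$ with large $J(\mathbf{w})$ spreads the class means far apart (numerator) relative to the spread inside each class (denominator).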

### Steps to Apply LDA:

1. **Compute the within-class scatter matrix and between-class scatter matrix.**
2. **Compute the eigenvalues and eigenvectors of the matrix formed by the inverse of the within-class scatter matrix multiplied by the between-class scatter matrix.**
3. **Sort the eigenvalues in descending order and keep the eigenvectors that belong to the n largest.**
4. **Project the original data onto the new space formed by these n eigenvectors.**
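
The four steps above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not scikit-learn's implementation: the function name `lda_projection`, the toy data, and the use of a pseudo-inverse for the within-class scatter matrix are all choices made here.

```python
import numpy as np


def lda_projection(X, y, n_components):
    """Project X onto the top discriminant directions (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    overall_mean = X.mean(axis=0)
    n_features = X.shape[1]

    # Step 1: within-class (S_W) and between-class (S_B) scatter matrices.
    S_W = np.zeros((n_features, n_features))
    S_B = np.zeros((n_features, n_features))
    for label in np.unique(y):
        X_c = X[y == label]
        mean_c = X_c.mean(axis=0)
        centered = X_c - mean_c
        S_W += centered.T @ centered
        diff = (mean_c - overall_mean).reshape(-1, 1)
        S_B += X_c.shape[0] * (diff @ diff.T)

    # Step 2: eigen-decomposition of pinv(S_W) @ S_B.
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_W) @ S_B)

    # Step 3: sort by eigenvalue (descending) and keep the top directions.
    order = np.argsort(eigvals.real)[::-1][:n_components]
    W = eigvecs[:, order].real

    # Step 4: project the data onto the selected directions.
    return X @ W, eigvals.real[order]


# Toy data (hypothetical): three Gaussian classes in four dimensions.
rng = np.random.default_rng(0)
class_means = ([0.0, 0.0, 0.0, 0.0], [3.0, 0.0, 0.0, 0.0], [0.0, 3.0, 0.0, 0.0])
X_toy = np.vstack([rng.normal(loc=m, size=(50, 4)) for m in class_means])
y_toy = np.repeat([0, 1, 2], 50)

X_proj, top_eigvals = lda_projection(X_toy, y_toy, n_components=2)
print(X_proj.shape)
```

With three classes only two eigenvalues can be meaningfully non-zero, which is why two components is the most this toy problem supports.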

### When to Use LDA:

- **When the number of observations is greater than the number of features, and you want to reduce dimensionality while maintaining class separability.**
- **It requires labeled data, and it is most useful with several classes, because the number of discriminant components is capped at the number of classes minus one.**


In [3]:
# Importing the dataset
dataset = pd.read_csv('prep.csv')

In [4]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Step 1: Load the dataset
dataset = pd.read_csv('prep.csv')

# Step 2: Preprocess the data
# Assuming the target column is the last one (modify if necessary)
y = dataset.iloc[:, -1].values  # Target variable
features = dataset.iloc[:, :-1].copy()  # Feature columns, kept as a DataFrame for now

# Label-encode categorical (object-dtype) feature columns.
# The dtype check is done on the DataFrame: after .values a mixed-type table becomes
# one object array, so a per-column dtype test on the array cannot tell columns apart.
for col in features.select_dtypes(include='object').columns:
    features[col] = LabelEncoder().fit_transform(features[col])
X = features.values  # Features as a NumPy array

# Step 3: Feature scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Scale the features

# Step 4: Apply LDA to reduce dimensionality
lda = LDA(n_components=2)  # n_components must not exceed min(n_features, n_classes - 1)
X_lda = lda.fit_transform(X_scaled, y)

# Step 5: Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_lda, y, test_size=0.2, random_state=42)

# Step 6: Initialize classifiers
models = {
    'Logistic': LogisticRegression(),
    'SVMl': SVC(kernel='linear'),
    'SVMnl': SVC(kernel='rbf'),
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'Naive': GaussianNB(),
    'Decision': DecisionTreeClassifier(),
    'Random': RandomForestClassifier()
}

# Step 7: Evaluate each model
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    results[name] = accuracy

# Step 8: Convert results to a DataFrame and print
results_df = pd.DataFrame([results], index=['LDA'])
print("Model Performance with LDA:")
print(results_df)


ValueError: n_components cannot be larger than min(n_features, n_classes - 1).

The error `ValueError: n_components cannot be larger than min(n_features, n_classes - 1)` is raised because the requested `n_components` exceeds what LDA can produce: at most the smaller of the number of features (`n_features`) and the number of classes minus one (`n_classes - 1`).

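### Explanation of the Error:
LDA is a supervised dimensionality reduction technique, and the number of components it can produce is limited by two quantities:
- The number of features (columns) in the dataset.
- The number of classes (unique values) in the target variable `y`.

The maximum is `min(n_features, n_classes - 1)`. The `n_classes - 1` term arises because the between-class scatter matrix is built from the class means, which are tied together by the overall mean, so its rank is at most `n_classes - 1`. For a binary target this leaves a single discriminant direction.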
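
The limit can be reproduced on synthetic data. The sketch below assumes a recent scikit-learn release in which the check raises a `ValueError` at fit time, as in the traceback above. The data comes from `make_classification` and only mirrors the shape of the problem (24 features, 2 classes); it is not `prep.csv`.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical binary dataset: 200 rows, 24 features, 2 classes.
X_bin, y_bin = make_classification(
    n_samples=200, n_features=24, n_informative=5, n_classes=2, random_state=0
)

# Asking for 2 components with only 2 classes exceeds n_classes - 1 = 1.
try:
    LinearDiscriminantAnalysis(n_components=2).fit(X_bin, y_bin)
    fit_error = None
except ValueError as exc:
    fit_error = str(exc)
print(fit_error)

# A single component is within the limit, so this fit succeeds.
X_bin_1d = LinearDiscriminantAnalysis(n_components=1).fit_transform(X_bin, y_bin)
print(X_bin_1d.shape)
```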

### Solution:
1. **Check the number of classes:** count the unique values in the target variable `y`. With few classes the upper limit on `n_components` is small; a binary target allows only one component.
2. **Adjust the number of components:** set `n_components` to at most `min(n_features, n_classes - 1)`.


In [5]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Step 1: Load the dataset
dataset = pd.read_csv('prep.csv')

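# Step 2: Preprocess the data
y = dataset.iloc[:, -1].values  # Target variable
features = dataset.iloc[:, :-1].copy()  # Feature columns, kept as a DataFrame for now

# Label-encode categorical (object-dtype) feature columns.
# The dtype check is done on the DataFrame: after .values a mixed-type table becomes
# one object array, so a per-column dtype test on the array cannot tell columns apart.
for col in features.select_dtypes(include='object').columns:
    features[col] = LabelEncoder().fit_transform(features[col])
X = features.values  # Features as a NumPy array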

# Step 3: Feature scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Scale the features

# Step 4: Check number of classes and features
n_classes = len(set(y))
n_features = X_scaled.shape[1]
max_components = min(n_features, n_classes - 1)

print(f"Number of features: {n_features}")
print(f"Number of classes: {n_classes}")
print(f"Maximum components for LDA: {max_components}")

# Step 5: Apply LDA to reduce dimensionality
lda = LDA(n_components=max_components)  # Use the calculated max_components
X_lda = lda.fit_transform(X_scaled, y)

# Step 6: Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_lda, y, test_size=0.2, random_state=42)

# Step 7: Initialize classifiers
models = {
    'Logistic': LogisticRegression(),
    'SVMl': SVC(kernel='linear'),
    'SVMnl': SVC(kernel='rbf'),
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'Naive': GaussianNB(),
    'Decision': DecisionTreeClassifier(),
    'Random': RandomForestClassifier()
}

# Step 8: Evaluate each model
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    results[name] = accuracy

# Step 9: Convert results to a DataFrame and print
results_df = pd.DataFrame([results], index=['LDA'])
print("Model Performance with LDA:")
print(results_df)


Number of features: 24
Number of classes: 2
Maximum components for LDA: 1
Model Performance with LDA:
     Logistic   SVMl  SVMnl    KNN  Naive  Decision  Random
LDA     0.975  0.975  0.975  0.975  0.975    0.9625  0.9625


| Quantity                   | Value |
|----------------------------|-------|
| Number of features         | 24    |
| Number of classes          | 2     |
| Maximum components for LDA | 1     |

## Model Performance with LDA:

| Reduction | Logistic Regression | SVM (linear) | SVM (RBF) | KNN    | Naive Bayes | Decision Tree | Random Forest |
|-----------|---------------------|--------------|-----------|--------|-------------|---------------|---------------|
| **LDA**   | 0.975               | 0.975        | 0.975     | 0.975  | 0.975       | 0.9625        | 0.9625        |


## Key Changes:
- Determine the maximum number of components (`max_components`) based on `min(n_features, n_classes - 1)`.
- Set `n_components` dynamically based on the above calculation.

### Example:
If your dataset has 10 features and 3 classes (i.e., the target variable `y` has 3 unique values), the maximum number of components for LDA is `min(10, 3 - 1) = 2`.
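
The same arithmetic can be checked in code. The helper name `max_lda_components` is hypothetical, and the data is a synthetic stand-in with 10 features and 3 classes generated by `make_classification`, not the notebook's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


def max_lda_components(X, y):
    """Upper limit on n_components: min(n_features, n_classes - 1)."""
    return min(X.shape[1], len(set(y)) - 1)


# Synthetic stand-in (assumption): 300 rows, 10 features, 3 classes.
X_ex, y_ex = make_classification(
    n_samples=300, n_features=10, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

k = max_lda_components(X_ex, y_ex)
X_ex_lda = LinearDiscriminantAnalysis(n_components=k).fit_transform(X_ex, y_ex)
print(k, X_ex_lda.shape)
```

Computing the limit from the data, rather than hard-coding it, keeps the call valid if the number of classes or features changes.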


## Important Points:
- If `n_classes - 1` is smaller than `n_features`, which is the usual case, `n_classes - 1` is the upper limit for `n_components`.
- The accuracies above come from a single 80/20 train/test split, so they may vary with the dataset, the split, and the transformations applied (such as LDA).
- With `n_components` capped this way, the `ValueError` no longer occurs and the LDA step runs to completion.
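
One further point concerns how the accuracy is measured. In the cells above, `StandardScaler` and LDA are fitted on the full dataset before `train_test_split`, so the test rows influence the scaling and the projection. A possible alternative, sketched here on synthetic data rather than `prep.csv`, is to wrap the steps in a scikit-learn pipeline and cross-validate, so that each fold fits the scaler and LDA on its training portion only. The choice of five folds and of logistic regression as the final estimator is an assumption made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): 400 rows, 24 features, 2 classes.
X_cv, y_cv = make_classification(
    n_samples=400, n_features=24, n_informative=6, random_state=42
)

# Scaling, LDA and the classifier are refitted inside every fold.
pipeline = make_pipeline(
    StandardScaler(),
    LinearDiscriminantAnalysis(n_components=1),
    LogisticRegression(),
)

scores = cross_val_score(pipeline, X_cv, y_cv, cv=5, scoring="accuracy")
print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

Reporting the mean and spread over folds also gives a sense of how much a single split's accuracy can move.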
