In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split 
import time
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
import pickle
import matplotlib.pyplot as plt

In [2]:
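def selectkbest(indep_X, dep_Y, n):
    # Score each feature against the target with the chi-square test
    test = SelectKBest(score_func=chi2, k=n)
    fit1 = test.fit(indep_X, dep_Y)
    # Keep only the n highest-scoring columns
    selectk_features = fit1.transform(indep_X)
    return selectk_features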
    

Explanation of the Code
The code defines a Python function, selectkbest, that keeps the top n features of a dataset, ranked by their Chi-Square score against the target variable (dep_Y). Here's a detailed breakdown of the function:

def selectkbest(indep_X, dep_Y, n):

indep_X: Input features (independent variables).
dep_Y: Target variable (dependent variable).
n: The number of top features to select.

1) Initialize SelectKBest:

test = SelectKBest(score_func=chi2, k=n)

SelectKBest: A feature selection method from sklearn.feature_selection. It scores each feature individually for relevance to the target.
score_func=chi2: Uses the Chi-Square test to measure the dependence between each feature and the target. It requires non-negative feature values (for example counts, frequencies, or one-hot indicators) and a categorical target.
k=n: Specifies how many of the top features to select.

2) Fit the Model:

fit1 = test.fit(indep_X, dep_Y)

test.fit(): Computes the Chi-Square score of every feature in indep_X against dep_Y.

3) Transform the Data:

   selectk_features = fit1.transform(indep_X)
   
fit1.transform(): Reduces the dataset (indep_X) to only the top n selected features based on the scores computed in the previous step.

4) Return Selected Features:

 return selectk_features
 
Outputs the reduced dataset containing only the selected top n features.


# Function: `selectkbest`

This function performs feature selection using the Chi-Square test to identify and retain the most relevant features from the dataset.

## Parameters:
- `indep_X`: DataFrame or array of independent variables (features).
- `dep_Y`: Series or array of the target variable.
- `n`: Integer specifying the number of top features to select.

## Function Logic:
1. **Initialize SelectKBest**:
   - Uses `SelectKBest` from `sklearn.feature_selection`.
   - The scoring function is `chi2`, which calculates the Chi-Square scores of the features relative to the target variable.
   - `k=n` specifies how many top features to retain.

2. **Fit the Feature Selector**:
   - `fit1 = test.fit(indep_X, dep_Y)` computes Chi-Square scores for all features.

3. **Transform the Dataset**:
   - `selectk_features = fit1.transform(indep_X)` selects the top `n` features from the original dataset.

4. **Return Selected Features**:
   - The function returns the dataset reduced to the `n` most relevant features.

## Example Usage:
```python
from sklearn.feature_selection import SelectKBest, chi2

# Assuming indep_X and dep_Y are already defined
selected_features = selectkbest(indep_X, dep_Y, 5)  # Select top 5 features
print(selected_features)
```


This kind of feature selection is often used as a **preprocessing** step to reduce the dimensionality of a dataset, which can shorten training time and may improve model performance.
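Because `chi2` only accepts non-negative inputs, it can help to see what happens when that requirement is not met. The sketch below uses synthetic data (an assumption, not the notebook's dataset) and shows one possible workaround, rescaling with `MinMaxScaler`; whether rescaling is appropriate depends on the data.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))        # synthetic data containing negative values
y = rng.integers(0, 2, size=100)     # synthetic binary target

try:
    SelectKBest(score_func=chi2, k=2).fit(X, y)
    chi2_accepted_negatives = True
except ValueError as err:
    chi2_accepted_negatives = False
    print("chi2 rejected the input:", err)

# One possible workaround: rescale every feature to [0, 1] before scoring.
X_nonneg = MinMaxScaler().fit_transform(X)
selector = SelectKBest(score_func=chi2, k=2).fit(X_nonneg, y)
X_reduced = selector.transform(X_nonneg)

print(selector.scores_)   # one chi-square score per input column
print(X_reduced.shape)    # two columns remain
```

The notebook's own features are already non-negative, so `selectkbest` can be applied to them directly.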


In [3]:
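def split_scalar(indep_X, dep_Y):
    # Hold out 25% of the rows for testing; random_state makes the split reproducible
    X_train, X_test, y_train, y_test = train_test_split(indep_X, dep_Y, test_size=0.25, random_state=0)
    # Fit the scaler on the training set only, then apply it to both sets
    sc = StandardScaler()
    X_train = sc.fit_transform(X_train)
    X_test = sc.transform(X_test)
    return X_train, X_test, y_train, y_test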

### Explanation of the `split_scalar` Function

The `split_scalar` function is designed to perform two primary tasks: splitting the dataset into training and testing sets, and standardizing the feature data for better model performance.

#### Function Components:

#### Step 1: Splitting the Dataset:

```python
X_train, X_test, y_train, y_test = train_test_split(indep_X, dep_Y, test_size=0.25, random_state=0)
```


This step divides the dataset into training and testing subsets.
- **indep_X (independent variables or features)** and **dep_Y (dependent variable or target)** are split.
- **test_size=0.25** means 25% of the data is allocated to the test set, and the remaining 75% is used for training.
- **random_state=0** ensures the split is reproducible.


#### Step 2: Standardizing the Data:

```python
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
```



- **Standardization** rescales each feature so that, on the training data, it has a mean of 0 and a standard deviation of 1.
- **StandardScaler** is used for this purpose.
- **fit_transform** is applied to the training data: it computes the scaling parameters (per-feature mean and standard deviation) and then scales the training data.
- **transform** applies those same parameters to the test data, so the test set is scaled consistently and contributes nothing to the fitted parameters.




#### Step 3: Return Values

```python
return X_train, X_test, y_train, y_test
```

**Outputs:**
- **X_train:** Scaled training features.
- **X_test:** Scaled testing features.
- **y_train:** Training labels.
- **y_test:** Testing labels.

### Purpose:

**Data Splitting:**
- Provides separate datasets for training and testing, enabling evaluation of model performance on unseen data.

**Feature Scaling:**
- Improves the performance of models sensitive to feature magnitudes, such as logistic regression, SVM, and neural networks.
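As a quick check of the behaviour described above, the following sketch repeats the split-then-scale steps on synthetic data (an assumption for illustration) and inspects the resulting statistics. The training features come out with mean 0 and standard deviation 1, while the test features are only approximately standardized because they reuse the training parameters.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # synthetic features
y = rng.integers(0, 2, size=200)                     # synthetic labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

sc = StandardScaler()
X_train_s = sc.fit_transform(X_train)  # learns mean/std from the training rows
X_test_s = sc.transform(X_test)        # reuses the training mean/std

print(X_train_s.mean(axis=0).round(6), X_train_s.std(axis=0).round(6))
print(X_test_s.mean(axis=0).round(3), X_test_s.std(axis=0).round(3))
```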



In [4]:
def cm_prediction(classifier, X_test):
    # NOTE: y_test is read from the notebook's global scope
    y_pred = classifier.predict(X_test)

    # Making the Confusion Matrix
    from sklearn.metrics import confusion_matrix
    cm = confusion_matrix(y_test, y_pred)

    from sklearn.metrics import accuracy_score
    from sklearn.metrics import classification_report

    Accuracy = accuracy_score(y_test, y_pred)

    report = classification_report(y_test, y_pred)
    return classifier, Accuracy, report, X_test, y_test, cm


### `cm_prediction` Function Details

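The `cm_prediction` function evaluates a trained classifier by making predictions on the test data and generating a confusion matrix, an accuracy score, and a classification report. Note that `y_test` is not a parameter: the function reads it from the notebook's global scope, so `y_test` must already be defined (here, by the `split_scalar` call) before the function is used.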

---

#### Function Code:
```python
def cm_prediction(classifier, X_test):
    y_pred = classifier.predict(X_test)
    
    # Making the Confusion Matrix
    from sklearn.metrics import confusion_matrix
    cm = confusion_matrix(y_test, y_pred)
    
    # Calculating Accuracy and Generating Classification Report
    from sklearn.metrics import accuracy_score 
    from sklearn.metrics import classification_report 

    Accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)
    
    return classifier, Accuracy, report, X_test, y_test, cm
```


### Step 1: Make Predictions

```python
y_pred = classifier.predict(X_test)
```

**Input:**
- **classifier:** A trained machine learning model.
- **X_test:** Test feature set.

**Process:**
- The classifier makes predictions (`y_pred`) for the test features.



### Step 2: Compute the Confusion Matrix

```python
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
```

**Confusion Matrix:**
- A table that describes the performance of the classification model by comparing the actual and predicted labels.

**Input:**
- **y_test:** True labels for the test set.
- **y_pred:** Predicted labels.

**Output:**
- **cm:** Confusion matrix.



### Step 3: Calculate Accuracy

```python
from sklearn.metrics import accuracy_score
Accuracy = accuracy_score(y_test, y_pred)
```


**Accuracy Score:** Measures the fraction of predictions that are correct.

**Input:**
- **y_test:** True labels.
- **y_pred:** Predicted labels.

**Output:**
- **Accuracy:** A float between 0 and 1 representing the model's accuracy.


### Step 4: Generate Classification Report

```python
from sklearn.metrics import classification_report
report = classification_report(y_test, y_pred)
```

**Classification Report:** Summarizes precision, recall, F1-score, and support for each class.

**Output:**
- **report:** A string containing detailed classification metrics.




### Step 5: Return Results


```python
return classifier, Accuracy, report, X_test, y_test, cm
```

**Outputs:**
- **classifier:** The input model.
- **Accuracy:** Model accuracy on the test set.
- **report:** Detailed classification metrics.
- **X_test:** Test features (for reference).
- **y_test:** True test labels (for reference).
- **cm:** Confusion matrix.

    


**Purpose:**

**Evaluation:**
- Provides a comprehensive evaluation of the model's performance.

**Metrics:**
- Confusion matrix for analyzing correct and incorrect predictions.
- Accuracy score for an overall performance snapshot.
- Classification report for class-specific metrics like precision, recall, and F1-score.
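Since `cm_prediction` depends on a global `y_test`, one possible alternative is to pass the true labels in explicitly. The sketch below is a hypothetical variant (the name `cm_prediction_explicit` and the synthetic data are assumptions, not part of the notebook) that computes the same three metrics without relying on notebook state.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

def cm_prediction_explicit(classifier, X_test, y_test):
    """Hypothetical variant of cm_prediction that receives y_test as an argument."""
    y_pred = classifier.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)
    return classifier, accuracy, report, X_test, y_test, cm

# Synthetic stand-in data, used only to exercise the function.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(random_state=0).fit(X_tr, y_tr)
_, accuracy, report, _, _, cm = cm_prediction_explicit(clf, X_te, y_te)

print(cm)
print(accuracy)
print(report)
```

Accuracy equals the sum of the confusion matrix diagonal divided by the number of test samples, which is a handy consistency check between the two outputs.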


In [5]:
def logistic(X_train,y_train,X_test):       
        # Fitting Logistic Regression to the Training set
        from sklearn.linear_model import LogisticRegression
        classifier = LogisticRegression(random_state = 0)
        classifier.fit(X_train, y_train)
        classifier,Accuracy,report,X_test,y_test,cm=cm_prediction(classifier,X_test)
        return  classifier,Accuracy,report,X_test,y_test,cm     

In [6]:
def svm_linear(X_train,y_train,X_test):
                
        from sklearn.svm import SVC
        classifier = SVC(kernel = 'linear', random_state = 0)
        classifier.fit(X_train, y_train)
        classifier,Accuracy,report,X_test,y_test,cm=cm_prediction(classifier,X_test)
        return  classifier,Accuracy,report,X_test,y_test,cm

In [7]:
def svm_NL(X_train,y_train,X_test):
                
        from sklearn.svm import SVC
        classifier = SVC(kernel = 'rbf', random_state = 0)
        classifier.fit(X_train, y_train)
        classifier,Accuracy,report,X_test,y_test,cm=cm_prediction(classifier,X_test)
        return  classifier,Accuracy,report,X_test,y_test,cm

In [8]:
def Navie(X_train,y_train,X_test):       
        # Fitting Gaussian Naive Bayes to the Training set
        from sklearn.naive_bayes import GaussianNB
        classifier = GaussianNB()
        classifier.fit(X_train, y_train)
        classifier,Accuracy,report,X_test,y_test,cm=cm_prediction(classifier,X_test)
        return  classifier,Accuracy,report,X_test,y_test,cm   

In [9]:
def knn(X_train,y_train,X_test):
           
        # Fitting K-NN to the Training set
        from sklearn.neighbors import KNeighborsClassifier
        classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
        classifier.fit(X_train, y_train)
        classifier,Accuracy,report,X_test,y_test,cm=cm_prediction(classifier,X_test)
        return  classifier,Accuracy,report,X_test,y_test,cm

In [10]:
def Decision(X_train,y_train,X_test):
        
        # Fitting a Decision Tree to the Training set
        from sklearn.tree import DecisionTreeClassifier
        classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
        classifier.fit(X_train, y_train)
        classifier,Accuracy,report,X_test,y_test,cm=cm_prediction(classifier,X_test)
        return  classifier,Accuracy,report,X_test,y_test,cm  

In [11]:
def random(X_train,y_train,X_test):
        
        # Fitting a Random Forest to the Training set
        from sklearn.ensemble import RandomForestClassifier
        classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
        classifier.fit(X_train, y_train)
        classifier,Accuracy,report,X_test,y_test,cm=cm_prediction(classifier,X_test)
        return  classifier,Accuracy,report,X_test,y_test,cm

In [12]:
def selectk_Classification(acclog,accsvml,accsvmnl,accknn,accnav,accdes,accrf): 
    
    dataframe=pd.DataFrame(index=['ChiSquare'],columns=['Logistic','SVMl','SVMnl','KNN','Navie','Decision','Random'])
    for number,idex in enumerate(dataframe.index):      
        dataframe['Logistic'][idex]=acclog[number]       
        dataframe['SVMl'][idex]=accsvml[number]
        dataframe['SVMnl'][idex]=accsvmnl[number]
        dataframe['KNN'][idex]=accknn[number]
        dataframe['Navie'][idex]=accnav[number]
        dataframe['Decision'][idex]=accdes[number]
        dataframe['Random'][idex]=accrf[number]
    return dataframe

### `selectk_Classification` Function Details

The `selectk_Classification` function organizes classification accuracy scores for different models into a structured DataFrame for analysis.

---

#### Function Code:
```python
def selectk_Classification(acclog, accsvml, accsvmnl, accknn, accnav, accdes, accrf): 
    
    dataframe = pd.DataFrame(index=['ChiSquare'], columns=['Logistic', 'SVMl', 'SVMnl', 'KNN', 'Navie', 'Decision', 'Random'])
    for number, idex in enumerate(dataframe.index):      
        dataframe['Logistic'][idex] = acclog[number]       
        dataframe['SVMl'][idex] = accsvml[number]
        dataframe['SVMnl'][idex] = accsvmnl[number]
        dataframe['KNN'][idex] = accknn[number]
        dataframe['Navie'][idex] = accnav[number]
        dataframe['Decision'][idex] = accdes[number]
        dataframe['Random'][idex] = accrf[number]
    return dataframe
```


### Explanation: Purpose

This function creates a **DataFrame** to organize classification accuracies of multiple machine learning models (e.g., Logistic Regression, SVM, KNN, etc.) for a specified feature selection method, such as **ChiSquare**.


###  Input Parameters

The function takes one list of accuracy scores per classification model; element `number` of each list fills row `number` of the DataFrame:
- **acclog:** Accuracy of Logistic Regression.
- **accsvml:** Accuracy of Linear SVM.
- **accsvmnl:** Accuracy of Non-Linear SVM.
- **accknn:** Accuracy of KNN.
- **accnav:** Accuracy of Naive Bayes.
- **accdes:** Accuracy of Decision Tree.
- **accrf:** Accuracy of Random Forest.


### Steps in Function:

### Step 1: Create an Empty DataFrame

```python
dataframe = pd.DataFrame(index=['ChiSquare'], columns=['Logistic', 'SVMl', 'SVMnl', 'KNN', 'Navie', 'Decision', 'Random'])
```

- **The index** represents the feature selection method (e.g., ChiSquare).
- **The columns** represent classification models (e.g., Logistic, SVM, KNN, etc.).
- **Initially**, every cell of the DataFrame is empty (NaN).

                                                 


### Step 2: Populate the DataFrame

```python
for number, idex in enumerate(dataframe.index):      
    dataframe['Logistic'][idex] = acclog[number]       
    dataframe['SVMl'][idex] = accsvml[number]
    dataframe['SVMnl'][idex] = accsvmnl[number]
    dataframe['KNN'][idex] = accknn[number]
    dataframe['Navie'][idex] = accnav[number]
    dataframe['Decision'][idex] = accdes[number]
    dataframe['Random'][idex] = accrf[number]
```

- Iterates through the index of the DataFrame.
- Assigns the corresponding accuracy values for each classification model into their respective columns.


### Step 3: Return the DataFrame

```python
return dataframe
```

**Returns:** 
- The filled DataFrame for further analysis or visualization.


### Output:

A **DataFrame** where:

- **Rows** represent feature selection methods (e.g., ChiSquare).  
- **Columns** represent classification models with their respective accuracies.


In [14]:
dataset1=pd.read_csv("prep.csv",index_col=None)

df2=dataset1

df2 = pd.get_dummies(df2, drop_first=True)

indep_X=df2.drop('classification_yes', 1)
dep_Y=df2['classification_yes']


TypeError: DataFrame.drop() takes from 1 to 2 positional arguments but 3 were given

The error occurs because, from pandas 2.0 onward, every argument of `DataFrame.drop()` except `labels` is keyword-only, so the axis can no longer be passed positionally as `1` or `0`. Instead, specify `axis=1` to drop columns or `axis=0` to drop rows.


Here's the corrected version of the above code:

In [36]:
# Load the dataset
dataset1 = pd.read_csv("prep.csv", index_col=None)

# Create a copy of the dataset
df2 = dataset1.copy()

# Perform one-hot encoding with drop_first=True; dtype=int gives 0/1 columns
df2 = pd.get_dummies(df2, dtype=int, drop_first=True)

# Separate independent variables (X) and dependent variable (Y)
indep_X = df2.drop('classification_yes', axis=1)  # Specify axis=1 explicitly
dep_Y = df2['classification_yes']

# Check the results
print(indep_X.head())
print(dep_Y.head())


   age         bp   al   su         bgr         bu        sc         sod  \
0  2.0  76.459948  3.0  0.0  148.112676  57.482105  3.077356  137.528754   
1  3.0  76.459948  2.0  0.0  148.112676  22.000000  0.700000  137.528754   
2  4.0  76.459948  1.0  0.0   99.000000  23.000000  0.600000  138.000000   
3  5.0  76.459948  1.0  0.0  148.112676  16.000000  0.700000  138.000000   
4  5.0  50.000000  0.0  0.0  148.112676  25.000000  0.600000  137.528754   

        pot       hrmo  ...  rbc_normal  pc_normal  pcc_present  ba_present  \
0  4.627244  12.518156  ...           1          0            0           0   
1  4.627244  10.700000  ...           1          1            0           0   
2  4.400000  12.000000  ...           1          1            0           0   
3  3.200000   8.100000  ...           1          1            0           0   
4  4.627244  11.800000  ...           1          1            0           0   

   htn_yes  dm_yes  cad_yes  appet_yes  pe_yes  ane_yes  
0        0

### Explanation of Changes:
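- **axis=1:** Explicitly specifying `axis=1` tells pandas to drop a column. The earlier `1` was passed as a second positional argument, which `drop()` no longer accepts.
- **Keyword-only arguments:** In `DataFrame.drop()`, only `labels` may be passed positionally; `df2.drop(columns='classification_yes')` is an equivalent and more explicit spelling.
- **dtype=int:** Passing `dtype=int` to `pd.get_dummies` makes the dummy columns 0/1 integers, as seen in the output above.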
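The two keyword forms of `drop()` can be compared on a small frame. This is a minimal sketch with made-up values; only the column names echo the dataset.

```python
import pandas as pd

df = pd.DataFrame({"age": [2.0, 3.0], "bp": [76.0, 50.0], "classification_yes": [1, 1]})

by_axis = df.drop("classification_yes", axis=1)      # axis passed by keyword
by_columns = df.drop(columns="classification_yes")   # equivalent, more explicit

print(by_axis.columns.tolist())     # ['age', 'bp']
print(by_axis.equals(by_columns))   # True
```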


In [37]:
df2

Unnamed: 0,age,bp,al,su,bgr,bu,sc,sod,pot,hrmo,...,pc_normal,pcc_present,ba_present,htn_yes,dm_yes,cad_yes,appet_yes,pe_yes,ane_yes,classification_yes
0,2.000000,76.459948,3.0,0.0,148.112676,57.482105,3.077356,137.528754,4.627244,12.518156,...,0,0,0,0,0,0,1,1,0,1
1,3.000000,76.459948,2.0,0.0,148.112676,22.000000,0.700000,137.528754,4.627244,10.700000,...,1,0,0,0,0,0,1,0,0,1
2,4.000000,76.459948,1.0,0.0,99.000000,23.000000,0.600000,138.000000,4.400000,12.000000,...,1,0,0,0,0,0,1,0,0,1
3,5.000000,76.459948,1.0,0.0,148.112676,16.000000,0.700000,138.000000,3.200000,8.100000,...,1,0,0,0,0,0,1,0,1,1
4,5.000000,50.000000,0.0,0.0,148.112676,25.000000,0.600000,137.528754,4.627244,11.800000,...,1,0,0,0,0,0,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
394,51.492308,70.000000,0.0,0.0,219.000000,36.000000,1.300000,139.000000,3.700000,12.500000,...,1,0,0,0,0,0,1,0,0,1
395,51.492308,70.000000,0.0,2.0,220.000000,68.000000,2.800000,137.528754,4.627244,8.700000,...,1,0,0,1,1,0,1,0,1,1
396,51.492308,70.000000,3.0,0.0,110.000000,115.000000,6.000000,134.000000,2.700000,9.100000,...,1,0,0,1,1,0,0,0,0,1
397,51.492308,90.000000,0.0,0.0,207.000000,80.000000,6.800000,142.000000,5.500000,8.500000,...,1,0,0,1,1,0,1,0,1,1


In [38]:
kbest=selectkbest(indep_X,dep_Y,6)       

acclog=[]
accsvml=[]
accsvmnl=[]
accknn=[]
accnav=[]
accdes=[]
accrf=[]

In [39]:
kbest

array([[3.00000000e+00, 1.48112676e+02, 5.74821053e+01, 3.07735602e+00,
        3.88689024e+01, 8.40819113e+03],
       [2.00000000e+00, 1.48112676e+02, 2.20000000e+01, 7.00000000e-01,
        3.40000000e+01, 1.23000000e+04],
       [1.00000000e+00, 9.90000000e+01, 2.30000000e+01, 6.00000000e-01,
        3.40000000e+01, 8.40819113e+03],
       ...,
       [3.00000000e+00, 1.10000000e+02, 1.15000000e+02, 6.00000000e+00,
        2.60000000e+01, 9.20000000e+03],
       [0.00000000e+00, 2.07000000e+02, 8.00000000e+01, 6.80000000e+00,
        3.88689024e+01, 8.40819113e+03],
       [0.00000000e+00, 1.00000000e+02, 4.90000000e+01, 1.00000000e+00,
        5.30000000e+01, 8.50000000e+03]])

In [40]:
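# Matching the values above against indep_X, four of the six columns kept by
# SelectKBest can be identified: "al", "bgr", "bu" and "sc". The remaining two
# fall in the part of the table that the printed preview elides.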
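Rather than matching values by eye, the selected column names can be read from the fitted selector with `get_support()`. The sketch below is a hypothetical variant of `selectkbest` (the name `selectkbest_with_names` and the tiny demo frame are assumptions) that returns the names alongside the reduced array.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

def selectkbest_with_names(indep_X, dep_Y, n):
    """Hypothetical variant of selectkbest that also returns the kept column names."""
    selector = SelectKBest(score_func=chi2, k=n).fit(indep_X, dep_Y)
    kept = indep_X.columns[selector.get_support()]  # boolean mask over the columns
    return selector.transform(indep_X), list(kept)

# Tiny made-up frame: the column names mimic the dataset, the values do not.
demo_X = pd.DataFrame({
    "al":  [0, 0, 0, 3, 4, 3],
    "su":  [1, 0, 1, 0, 1, 0],
    "bgr": [90, 95, 100, 210, 220, 230],
})
demo_y = pd.Series([0, 0, 0, 1, 1, 1])

features, names = selectkbest_with_names(demo_X, demo_y, 2)
print(names)           # ['al', 'bgr']
print(features.shape)  # (6, 2)
```

Applied to the notebook's data, the call would be `selectkbest_with_names(indep_X, dep_Y, 6)`.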


In [41]:
# The next cell passes the k best features to split_scalar(kbest, dep_Y), which splits and scales them before the models are trained.

In [42]:
X_train, X_test, y_train, y_test=split_scalar(kbest,dep_Y)   
    
        
classifier,Accuracy,report,X_test,y_test,cm=logistic(X_train,y_train,X_test)
acclog.append(Accuracy)

classifier,Accuracy,report,X_test,y_test,cm=svm_linear(X_train,y_train,X_test)  
accsvml.append(Accuracy)
    
classifier,Accuracy,report,X_test,y_test,cm=svm_NL(X_train,y_train,X_test)  
accsvmnl.append(Accuracy)
    
classifier,Accuracy,report,X_test,y_test,cm=knn(X_train,y_train,X_test)  
accknn.append(Accuracy)
    
classifier,Accuracy,report,X_test,y_test,cm=Navie(X_train,y_train,X_test)  
accnav.append(Accuracy)
    
classifier,Accuracy,report,X_test,y_test,cm=Decision(X_train,y_train,X_test)  
accdes.append(Accuracy)
    
classifier,Accuracy,report,X_test,y_test,cm=random(X_train,y_train,X_test)  
accrf.append(Accuracy)
    
result=selectk_Classification(acclog,accsvml,accsvmnl,accknn,accnav,accdes,accrf)



You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  dataframe['Logistic'][idex]=acclog[number]
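Following the warning's advice, the assignments can be done in a single step with `.loc`. The sketch below is a hypothetical rewrite (the name `selectk_classification_loc` is an assumption) that keeps the same inputs and output layout as `selectk_Classification`.

```python
import pandas as pd

def selectk_classification_loc(acclog, accsvml, accsvmnl, accknn, accnav, accdes, accrf):
    """Hypothetical rewrite of selectk_Classification using single-step .loc assignment."""
    scores = {
        'Logistic': acclog, 'SVMl': accsvml, 'SVMnl': accsvmnl, 'KNN': accknn,
        'Navie': accnav, 'Decision': accdes, 'Random': accrf,
    }
    dataframe = pd.DataFrame(index=['ChiSquare'], columns=list(scores))
    for number, idex in enumerate(dataframe.index):
        for column, values in scores.items():
            # Row and column are addressed together, so no intermediate copy is involved.
            dataframe.loc[idex, column] = values[number]
    return dataframe

# Accuracies copied from the K=6 run reported in this notebook.
result_loc = selectk_classification_loc([0.95], [0.96], [0.96], [0.93], [0.89], [0.97], [0.97])
print(result_loc)
```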

In [22]:
# This one is for the K value --> 5

result

Unnamed: 0,Logistic,SVMl,SVMnl,KNN,Navie,Decision,Random
ChiSquare,0.94,0.94,0.95,0.89,0.83,0.96,0.95


In [26]:
# This one is for the K value --> 4

result

Unnamed: 0,Logistic,SVMl,SVMnl,KNN,Navie,Decision,Random
ChiSquare,0.85,0.82,0.83,0.86,0.79,0.89,0.89


In [29]:
# This one is for the K value --> 3

result

Unnamed: 0,Logistic,SVMl,SVMnl,KNN,Navie,Decision,Random
ChiSquare,0.82,0.82,0.82,0.85,0.8,0.84,0.83


In [32]:
# This one is for the K value --> 6

result

Unnamed: 0,Logistic,SVMl,SVMnl,KNN,Navie,Decision,Random
ChiSquare,0.95,0.96,0.96,0.93,0.89,0.97,0.97


In [None]:
# From the above results, K=6 gives the best result: 0.97 accuracy for both Decision Tree and Random Forest.

# Further, we can use the mode and ensemble learning to settle on the best model.


# To Determine the Best Model Using Mode and Ensemble Techniques

## Step 1: Analyze with Mode
The **mode** refers to the most frequently occurring value or choice. In this context, the mode can identify the classification model that performs well most consistently across feature selection techniques.

### Process:
- Extract the accuracy scores for all models.
- Identify the model(s) with the highest accuracy for each row (feature selection method).
- Find the **mode** of the models with the highest accuracy.

### For the Provided Output:

| Logistic | SVMl  | SVMnl | KNN   | Navie | Decision | Random |
|----------|-------|-------|-------|-------|----------|--------|
| 0.95     | 0.96  | 0.96  | 0.93  | 0.89  | 0.97     | 0.97   |

The **Decision** and **Random Forest** models have the highest accuracy (0.97), so they are likely candidates for the best models.

### Result:
With a single row (one feature selection method, K=6), the **mode** approach reduces to picking the top scorers: **Decision Tree** and **Random Forest**, tied at 0.97.
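The mode becomes more informative with several rows. As a sketch, the process above can be applied across the four K values reported earlier, treating each K as one row; that substitution for "feature selection method" is an assumption made here for illustration.

```python
import pandas as pd

# Accuracies copied from the four runs reported earlier (K = 3, 4, 5, 6).
results = pd.DataFrame(
    {
        'Logistic': [0.82, 0.85, 0.94, 0.95],
        'SVMl':     [0.82, 0.82, 0.94, 0.96],
        'SVMnl':    [0.82, 0.83, 0.95, 0.96],
        'KNN':      [0.85, 0.86, 0.89, 0.93],
        'Navie':    [0.80, 0.79, 0.83, 0.89],
        'Decision': [0.84, 0.89, 0.96, 0.97],
        'Random':   [0.83, 0.89, 0.95, 0.97],
    },
    index=['K=3', 'K=4', 'K=5', 'K=6'],
)

# For each row, mark every model that ties for the highest accuracy.
is_top = results.eq(results.max(axis=1), axis=0)

# Count how many rows each model tops, then take the most frequent winner(s).
wins = is_top.sum().sort_values(ascending=False)
best_models = wins[wins == wins.max()].index.tolist()

print(wins)
print(best_models)  # ['Decision']
```

On these numbers the Decision Tree tops three of the four rows, Random Forest two, and KNN one.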


## Step 2: Apply Ensemble Techniques

Ensemble techniques combine predictions from multiple models, often giving better or more stable performance than any single member. Two common approaches are:

**Voting Ensemble:**
- Combine predictions of all models (e.g., Logistic Regression, SVM, etc.).
- Use a majority vote or weighted vote based on accuracy scores.

**Example:**
- Logistic = Correct Prediction
- SVMl = Correct Prediction
- SVMnl = Incorrect Prediction
- KNN = Correct Prediction
- Navie = Incorrect Prediction
- Decision = Correct Prediction
- Random = Correct Prediction
- Five of the seven models are correct, so a majority vote returns the correct label for this sample.

**Averaging Ensemble:**
- Compute the average probability predicted by each model for each class.
- Use the class with the highest averaged probability as the final prediction.
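The two approaches can disagree. The sketch below uses made-up class-1 probabilities from three hypothetical models on four samples of a binary problem, and compares a majority vote on labels with an average of probabilities.

```python
import numpy as np

# Hypothetical class-1 probabilities: one row per sample, one column per model.
proba_class1 = np.array([
    [0.90, 0.40, 0.45],   # sample 0
    [0.20, 0.30, 0.10],   # sample 1
    [0.55, 0.60, 0.30],   # sample 2
    [0.45, 0.45, 0.95],   # sample 3
])

# Hard voting: each model casts a label vote, and the majority label wins.
votes = (proba_class1 >= 0.5).astype(int)
hard = (votes.sum(axis=1) > votes.shape[1] / 2).astype(int)

# Soft voting: average the probabilities, then threshold the mean.
soft = (proba_class1.mean(axis=1) >= 0.5).astype(int)

print(hard)  # [0 0 1 0]
print(soft)  # [1 0 0 1]
```

For sample 0, a single very confident model outweighs two mildly doubtful ones under soft voting, while hard voting ignores confidence and sides with the two.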


In [33]:
from sklearn.ensemble import VotingClassifier

# Example: Combine the top-performing models
ensemble_model = VotingClassifier(
    estimators=[
        ('decision', decision_tree_model),
        ('random', random_forest_model)
    ],
    voting='soft'  # 'soft' uses predicted probabilities; 'hard' uses predicted labels
)

# Train ensemble model
ensemble_model.fit(X_train, y_train)

# Evaluate accuracy
ensemble_accuracy = ensemble_model.score(X_test, y_test)
print("Ensemble Model Accuracy:", ensemble_accuracy)


NameError: name 'decision_tree_model' is not defined

The error `NameError: name 'decision_tree_model' is not defined` means that `decision_tree_model` (and likewise `random_forest_model`) was never created in this notebook. The helper functions above return their fitted model as `classifier`, which each later call overwrites, so no variables with these names exist when the `VotingClassifier` is built.

### Here's how you can resolve the issue:

**Steps to Fix the Error:**
1. **Define the Models:**
   - Import the required classifiers and create `decision_tree_model` and `random_forest_model` before passing them to the ensemble. Fitting them beforehand is optional, because `VotingClassifier.fit()` trains its own copies on the training data (`X_train`, `y_train`).


In [34]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.metrics import accuracy_score

# Step 1: Initialize individual models
decision_tree_model = DecisionTreeClassifier(random_state=0)
random_forest_model = RandomForestClassifier(random_state=0)

# Step 2: Train the models individually (optional: VotingClassifier.fit() below fits its own clones)
decision_tree_model.fit(X_train, y_train)
random_forest_model.fit(X_train, y_train)

# Step 3: Create the VotingClassifier (Ensemble)
ensemble_model = VotingClassifier(
    estimators=[
        ('decision', decision_tree_model),
        ('random', random_forest_model)
    ],
    voting='soft'  # Use 'soft' for predicted probabilities, 'hard' for predicted labels
)

# Step 4: Train the ensemble model
ensemble_model.fit(X_train, y_train)

# Step 5: Evaluate the ensemble model
ensemble_accuracy = ensemble_model.score(X_test, y_test)
print("Ensemble Model Accuracy:", ensemble_accuracy)


Ensemble Model Accuracy: 0.97


### The Result: Ensemble Model Accuracy: 0.97

The result **`Ensemble Model Accuracy: 0.97`** indicates that the ensemble model achieved 97% accuracy on the test data.

---

### Explanation of the Result:

#### High Accuracy:
- The ensemble model, combining predictions from both the `DecisionTreeClassifier` and the `RandomForestClassifier`, performed exceptionally well on the test set.
- A 97% accuracy means that the ensemble model correctly predicted the labels for 97% of the test samples.

#### Ensemble Benefits:
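- Combining models through voting can reduce the variance of individual models, although a Random Forest is itself an ensemble of decision trees, so these two members are not fully independent.
- Using `voting='soft'` averages predicted probabilities instead of counting label votes. Here the ensemble's 0.97 matches, rather than exceeds, the best individual accuracies reported earlier for K=6. Note that those earlier models used `criterion='entropy'` (and 10 trees for the forest), while the ensemble members use the default settings, so the comparison is not exactly like for like.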

---

### What Contributed to the High Accuracy:
1. **Dataset Quality:** 
   - The features were standardized by `split_scalar` before training; whether the classes are balanced was not checked in this notebook.
2. **Feature Selection:**
   - `SelectKBest` reduced the inputs to the highest-scoring features, which may have removed noisy columns. In the runs above, the best accuracy rose from 0.85 at K=3 to 0.97 at K=6.
3. **Random Forest's Strength:**
   - A Random Forest averages many trees, which generally helps it generalize to unseen data.
4. **Complementary Nature of Models:**
   - A single Decision Tree and a Random Forest are closely related, so the gain from combining them is likely small; more diverse members (for example Logistic Regression or an SVM) may complement each other better.

---

### Next Steps for Evaluation:
1. **Confusion Matrix:**
   - Check for detailed insights into classification performance, such as false positives and false negatives.
2. **Classification Report:**
   - Generate a report to understand precision, recall, and F1-score for individual classes.
3. **Cross-Validation:**
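   - Validate the model using cross-validation to check that the high accuracy does not depend on one particular train/test split. Because `SelectKBest` was fitted on the full dataset before splitting, refitting it inside each fold gives a less optimistic estimate.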
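One way to carry out the cross-validation step is to wrap feature selection, scaling, and the voting ensemble in a scikit-learn `Pipeline`, so that every step is refit on each training fold. The sketch below runs on synthetic data from `make_classification` (a stand-in, not the notebook's dataset); the `MinMaxScaler` step is an assumption added only because the synthetic features can be negative, which `chi2` does not accept.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for (indep_X, dep_Y).
X, y = make_classification(n_samples=300, n_features=12, n_informative=5, random_state=0)

pipeline = Pipeline([
    ('nonneg', MinMaxScaler()),                       # only needed for the synthetic data
    ('select', SelectKBest(score_func=chi2, k=6)),    # refit on every training fold
    ('scale', StandardScaler()),
    ('vote', VotingClassifier(
        estimators=[
            ('decision', DecisionTreeClassifier(random_state=0)),
            ('random', RandomForestClassifier(random_state=0)),
        ],
        voting='soft',
    )),
])

# Five-fold cross-validated accuracy of the whole pipeline.
scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')
print(scores)
print(scores.mean(), scores.std())
```

With the notebook's data, `X` and `y` would be replaced by `indep_X` and `dep_Y`, and the `nonneg` step could be dropped since those features are already non-negative.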
