In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

#To encode, scale and split the data
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

#The Models we are going to use
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

#To make a print_score function
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score

In [2]:
# Load the data and preprocess it
dataset =  pd.read_csv("agaricus-lepiota.csv")

In [3]:
#Rename the columns so the data is easier to evaluate
dataset.columns = ['class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor', 'gill-attachment',
                   'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape', 'stalk-root', 'stalk-surface-above-ring',
                   'stalk-surface-below-ring', 'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type',
                   'veil-color', 'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat']


In [4]:
#Label encoding: replace each character code in a column with a corresponding integer.
Encoder = LabelEncoder() 
for col in dataset.columns:
    dataset[col] = Encoder.fit_transform(dataset[col])
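
To make the encoding step concrete, here is a minimal sketch on a toy column. The frame `toy` and its values are illustrative assumptions, not rows from the mushroom file. It shows that `LabelEncoder` assigns integer codes in sorted label order. Because the loop above re-fits the same encoder on every column, only the mapping of the last column stays available in `Encoder.classes_`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for one mushroom attribute (values are illustrative only)
toy = pd.DataFrame({"cap-shape": ["x", "b", "x", "f", "b"]})

encoder = LabelEncoder()
toy["cap-shape-encoded"] = encoder.fit_transform(toy["cap-shape"])

# classes_ holds the sorted original labels; each label's position is its code
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
print(toy)
```

If the mappings are needed later, one option is to keep a separate fitted encoder per column, for example in a dictionary keyed by column name.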

In [5]:
dataset.head()
#dataset.info() #Shows that every row has non-null values

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,0,5,2,9,1,0,1,0,0,4,...,2,7,7,0,2,1,4,3,2,1
1,0,0,2,8,1,3,1,0,0,5,...,2,7,7,0,2,1,4,3,2,3
2,1,5,3,8,1,6,1,0,1,5,...,2,7,7,0,2,1,4,2,3,5
3,0,5,2,3,0,5,1,1,0,4,...,2,7,7,0,2,1,0,3,0,1
4,0,5,3,9,1,0,1,0,0,5,...,2,7,7,0,2,1,4,2,2,1


In [6]:
# Quick overview of how many mushrooms are poisonous and how many are edible.
# We can see that the two classes are roughly the same size.
#sns.countplot(x='class', data=dataset)

Then we want to generate a heatmap to visualize the correlation between the attributes.

We want to use the heatmap to identify which features are strongly correlated with the target variable ('class') and with each other. This can help us determine which features we should drop in order to create a more accurate model.

In [7]:
#sns.heatmap(dataset.corr())

As the heatmap above shows, light cells indicate a high correlation. An attribute that is strongly correlated with the target variable 'class' makes the prediction almost trivial, so we treat it as a candidate to drop. We can also see that some attributes are highly correlated with each other; for each such pair we would need to drop one attribute to avoid multicollinearity.
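
The same reading can be done numerically instead of by eye. The sketch below uses a small synthetic frame (`toy`, with the made-up columns `strong`, `noise` and `strong_copy`) rather than the mushroom data, and the 0.9 cut-off is an arbitrary assumption. Note also that Pearson correlation on label-encoded nominal categories is only a rough guide, since the integer codes have no natural order.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
target = rng.integers(0, 2, size=n)

# Synthetic stand-in data (assumption): one feature that nearly equals the
# target, one unrelated feature, and one exact duplicate of the strong feature
toy = pd.DataFrame({
    "class": target,
    "strong": target ^ (rng.random(n) < 0.05),  # flipped in about 5% of rows
    "noise": rng.integers(0, 3, size=n),
})
toy["strong_copy"] = toy["strong"]

corr = toy.corr()

# Absolute correlation of every feature with the target, strongest first
target_corr = corr["class"].drop("class").abs().sort_values(ascending=False)
print(target_corr)

# Feature pairs that are highly correlated with each other (upper triangle only)
features = corr.drop(index="class", columns="class").abs()
upper = features.where(np.triu(np.ones(features.shape, dtype=bool), k=1))
pairs = upper.stack()
high_pairs = pairs[pairs > 0.9]
print(high_pairs)
```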

In [8]:
#We need to drop some columns because of the logical rules in the dataset.
#These are the most indicative features, so we drop them before running the models.
drop_features = ['odor', 'spore-print-color', 'habitat', 'stalk-shape','gill-size','gill-spacing','bruises',
                 'gill-color','stalk-root','ring-type','stalk-surface-below-ring','stalk-surface-above-ring',
                'population', 'cap-color']
dataset = dataset.drop(drop_features, axis=1)

In [9]:
X=dataset.drop('class',axis=1) #Predictors: the rest of the dataset, the data we have to work with
y=dataset['class'] #Response: "class", this is what we are trying to predict

In [10]:
#scalar = StandardScaler()
#X = pd.DataFrame(scalar.fit_transform(X), columns = X.columns)

In [11]:
#Splitting the dataset into the Training set and Test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=38)

We want to use KNN, DecisionTreeClassifier and RandomForestClassifier.
The reason for each is:

We want to use KNN because it is a "lazy" algorithm that is easy to implement and interpret. KNN is efficient and often accurate on small datasets, so we think it is a good fit for our dataset. We are going to use grid search to find the best number of neighbours.

Random Forest is a useful algorithm for the mushroom dataset because it can handle large amounts of data and noisy or missing data. Since the mushroom dataset contains many different attributes, the Random Forest can be used to identify which attributes are most important for classification by creating multiple decision trees and combining their results.
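
As a rough illustration of how a forest can rank attributes, the sketch below reads the `feature_importances_` attribute of a fitted `RandomForestClassifier`. It runs on synthetic data from `make_classification`, and the names `X_demo`, `y_demo` and `feature_0` to `feature_5` are placeholders, not mushroom attributes.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption): with shuffle=False the first two columns
# are the informative ones and the remaining four are pure noise
X_arr, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

# Impurity-based importances, normalised to sum to 1, highest first
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

On the notebook's own data, the same two lines applied to `rfc` and `X.columns` would give the ranking for the remaining mushroom attributes.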

Decision Tree is well-suited for the mushroom dataset since it can handle non-linear classification problems and can capture the complex interactions between the features. Additionally, decision trees are easy to interpret, which can be useful for understanding which features are most important for classification.
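
One way to see why decision trees are easy to interpret is to print the learned rules as text. This is a minimal sketch using scikit-learn's `export_text` on synthetic data; the feature names and the depth limit of 3 are assumptions chosen to keep the output short.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (assumption), not the mushroom dataset
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(4)]

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_demo, y_demo)

# Each indented line is one split; leaves show the predicted class
rules = export_text(tree, feature_names=feature_names)
print(rules)
```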

Overall, the combination of these three algorithms provides a good balance between accuracy, interpretability, and robustness for the classification task.

In [12]:
#Grid search to find best parameter
knn_gs = KNN()
param_grid = {'n_neighbors': range(1, 31)}  # Define the range of neighbors to test
grid_search = GridSearchCV(knn_gs, param_grid, cv=5)
grid_search.fit(X_train,y_train)
best_param = grid_search.best_params_['n_neighbors']# get the best parameter

#KNN
knn = KNN(n_neighbors=best_param)
knn.fit(X_train,y_train)

#DecisionTreeClassifier
dtc = DecisionTreeClassifier(max_depth=5)
dtc.fit(X_train,y_train)

#RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)

In [13]:
#A function to print a classification report on the training data for the different models.
#X_test and y_test are accepted but not used.
def print_score(classifier,X_train,y_train,X_test,y_test, name):
    print("Training results for", name, ":\n")
    print('Classification Report:\n{}\n'.format(classification_report(y_train,classifier.predict(X_train))))
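
Since `print_score` only reports on the training data, a held-out evaluation could look like the sketch below. The helper `score_on_both_sets` is hypothetical (it is not part of the notebook), and the demonstration runs on synthetic data so that it is self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def score_on_both_sets(classifier, X_train, y_train, X_test, y_test, name):
    """Hypothetical helper: print train and test reports, return both accuracies."""
    train_pred = classifier.predict(X_train)
    test_pred = classifier.predict(X_test)
    print("Training results for", name, ":\n")
    print(classification_report(y_train, train_pred))
    print("Test results for", name, ":\n")
    print(classification_report(y_test, test_pred))
    return accuracy_score(y_train, train_pred), accuracy_score(y_test, test_pred)


# Synthetic stand-in data (assumption), split the same way as in the notebook
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=38
)

# An unrestricted tree memorises the training set, so the gap is easy to see
demo_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
train_acc, test_acc = score_on_both_sets(demo_tree, X_tr, y_tr, X_te, y_te, "demo tree")
```

With the notebook's objects, the call would be `score_on_both_sets(rfc, X_train, y_train, X_test, y_test, "RFC")`.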

In [14]:
print_score(knn,X_train,y_train,X_test,y_test,"KNN")
print_score(dtc,X_train,y_train,X_test,y_test,"DTC")
print_score(rfc,X_train,y_train,X_test,y_test,"RFC")

Training results for KNN :

Classification Report:
              precision    recall  f1-score   support

           0       0.88      0.76      0.82      2945
           1       0.78      0.89      0.83      2741

    accuracy                           0.82      5686
   macro avg       0.83      0.83      0.82      5686
weighted avg       0.83      0.82      0.82      5686


Training results for DTC :

Classification Report:
              precision    recall  f1-score   support

           0       0.94      0.52      0.67      2945
           1       0.65      0.96      0.78      2741

    accuracy                           0.73      5686
   macro avg       0.79      0.74      0.72      5686
weighted avg       0.80      0.73      0.72      5686


Training results for RFC :

Classification Report:
              precision    recall  f1-score   support

           0       0.88      0.79      0.83      2945
           1       0.80      0.88      0.84      2741

    accuracy               

c) Our best performing model is the Random Forest classifier, with an average score of 84% on the training data. We would still not trust the model, since about 16% of its predictions are wrong, and that can be dangerous when deciding whether a mushroom is edible. With all columns included, the original dataset was sufficient for a score of 100% from all three models. To really test the models, we needed to reduce the number of columns and thereby avoid the logical rules in the dataset.

The decision tree classifier is a tree-like model that splits the data on the most significant feature at each step until a prediction can be made. We think the random forest classifier performs best because it is an ensemble of decision trees, where each tree makes a prediction and the final prediction is made by combining the predictions of all trees. Therefore, the RFC is more accurate than the DTC.
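
The combining step can be reproduced by hand. The sketch below, on synthetic data, averages the class probabilities of the individual trees in `forest.estimators_` and checks that this matches the forest's own prediction. scikit-learn combines trees by averaging their predicted probabilities rather than by counting one hard vote per tree; the data and variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption), not the mushroom dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# An odd number of trees avoids exact ties in a two-class problem
forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_demo, y_demo)

# Each fitted tree outputs class probabilities; stack them and average
per_tree = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

# The class with the highest averaged probability is the ensemble prediction
manual_pred = forest.classes_[averaged.argmax(axis=1)]
agreement = (manual_pred == forest.predict(X_demo)).mean()
print(agreement)
```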