In [1]:
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

## Importing The Dataframe
The preprocessed dataframe is loaded from the CSV file that we created in the previous assignment.

In [2]:
dfm = pd.read_csv('https://github.com/alu-potato/IS362_Assignment13/raw/master/dfm.csv', index_col = 0)
dfm

      poisonous  odor  habitat
0             1     7        4
1             0     0        0
2             0     1        2
3             1     7        4
4             0     6        0
...         ...   ...      ...
8119          0     6        1
8120          0     6        1
8121          0     6        1
8122          1     3        1


## Processing the Predictor Columns
Scikit-learn estimators treat every input column as numeric, so if we leave the integer-coded predictor columns as they are (see the odor column below), the estimator will read the category codes as ordered quantities rather than as labels. One-hot encoding with OneHotEncoder replaces each predictor column with one dummy (0/1) column per category, which lets us pass the categories to the estimator properly. I test each of the two columns as an individual predictor, and then test how well both columns do together.

In [3]:
dfm.loc[:,'odor']

0       7
1       0
2       1
3       7
4       6
       ..
8119    6
8120    6
8121    6
8122    3
8123    6
Name: odor, Length: 8124, dtype: int64

In [4]:
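# One-hot encode each predictor set; fit_transform refits the encoder on every call
encode = OneHotEncoder()
X = encode.fit_transform(dfm.loc[:, ['odor']])  # odor only
X2 = encode.fit_transform(dfm.loc[:, ['habitat']])  # habitat only
X3 = encode.fit_transform(dfm.drop('poisonous', axis = 'columns'))  # odor and habitat together
y = dfm['poisonous']  # binary target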

In [5]:
print(X)

  (0, 7)	1.0
  (1, 0)	1.0
  (2, 1)	1.0
  (3, 7)	1.0
  (4, 6)	1.0
  (5, 0)	1.0
  (6, 0)	1.0
  (7, 1)	1.0
  (8, 7)	1.0
  (9, 0)	1.0
  (10, 1)	1.0
  (11, 0)	1.0
  (12, 0)	1.0
  (13, 7)	1.0
  (14, 6)	1.0
  (15, 6)	1.0
  (16, 6)	1.0
  (17, 7)	1.0
  (18, 7)	1.0
  (19, 7)	1.0
  (20, 0)	1.0
  (21, 7)	1.0
  (22, 1)	1.0
  (23, 0)	1.0
  (24, 1)	1.0
  :	:
  (8099, 6)	1.0
  (8100, 6)	1.0
  (8101, 8)	1.0
  (8102, 6)	1.0
  (8103, 6)	1.0
  (8104, 6)	1.0
  (8105, 6)	1.0
  (8106, 6)	1.0
  (8107, 6)	1.0
  (8108, 3)	1.0
  (8109, 6)	1.0
  (8110, 6)	1.0
  (8111, 6)	1.0
  (8112, 6)	1.0
  (8113, 3)	1.0
  (8114, 5)	1.0
  (8115, 6)	1.0
  (8116, 8)	1.0
  (8117, 3)	1.0
  (8118, 4)	1.0
  (8119, 6)	1.0
  (8120, 6)	1.0
  (8121, 6)	1.0
  (8122, 3)	1.0
  (8123, 6)	1.0
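The printed sparse matrix lists one `(row, column) value` entry per stored 1.0, which can be hard to read as dummy columns. The following is a minimal sketch on a hypothetical five-row toy column (not the mushroom data) that shows the same encoding in dense form; the names `toy`, `encoder`, and `dense` are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column of integer-coded categories (not the mushroom data).
toy = pd.DataFrame({'odor': [7, 0, 1, 7, 6]})

encoder = OneHotEncoder()
encoded = encoder.fit_transform(toy[['odor']])  # SciPy sparse matrix

# Categories the encoder learned, in sorted order: one dummy column for each.
print(encoder.categories_[0])

# Dense view: every row holds exactly one 1.0, in the column of its category.
dense = encoded.toarray()
print(dense)
```

In this toy case the column index is the position of the category in `encoder.categories_[0]`, so it only matches the original code when every code from 0 upward is present.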


## Estimator/Model Type
We are only predicting whether a mushroom is or is not poisonous, which is a binary (0/1) outcome. That makes logistic regression a natural choice of model.

In [6]:
lr = LogisticRegression()
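To illustrate why a 0/1 target suits this model, here is a small sketch on hypothetical toy data (one numeric feature, six rows, none of it from the mushroom dataset). It shows that a fitted LogisticRegression returns a probability for each class and that the predicted label is the more probable class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one numeric feature and a binary 0/1 target.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X_toy, y_toy)

# predict_proba returns one column per class: P(y=0) and P(y=1) for each row.
proba = model.predict_proba(X_toy)

# predict returns the class with the higher probability.
labels = model.predict(X_toy)

print(proba.round(3))
print(labels)
```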

## Prediction Accuracy
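We use cross_val_score to estimate the accuracy of our logistic regression model. With cv=5 it splits the data into five folds, trains the model on four of them, and scores its predictions on the held-out fold against the actual labels, repeating until each fold has served as the test set once. We report the mean of the five accuracy scores.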

In [7]:
cross_val_score(lr, X, y, cv=5, scoring='accuracy').mean()

0.9852301629405078
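As a rough sketch of what cross_val_score does behind the scenes, the loop below reproduces it by hand on hypothetical random toy data. It assumes the scikit-learn default that an integer `cv` with a classifier resolves to an unshuffled StratifiedKFold; the names `X_toy`, `y_toy`, and `manual_scores` are made up for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical toy data: 20 rows, one feature, balanced binary target.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(20, 1))
y_toy = np.array([0, 1] * 10)

model = LogisticRegression()
folds = StratifiedKFold(n_splits=5)  # assumed equivalent of cv=5 for a classifier

manual_scores = []
for train_idx, test_idx in folds.split(X_toy, y_toy):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(X_toy[train_idx], y_toy[train_idx])
    # score() reports accuracy on the held-out fold
    manual_scores.append(fold_model.score(X_toy[test_idx], y_toy[test_idx]))

helper_scores = cross_val_score(model, X_toy, y_toy, cv=5, scoring='accuracy')
print(np.mean(manual_scores), helper_scores.mean())
```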

In [8]:
cross_val_score(lr, X2, y, cv=5, scoring='accuracy').mean()

0.6667727927245168

In [9]:
cross_val_score(lr, X3, y, cv=5, scoring='accuracy').mean()

0.9572741947707465
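An alternative to encoding the whole dataframe up front is to put the encoder inside a pipeline, so that it is refit on the training folds only during cross-validation. The sketch below is one possible arrangement, not the method used above: it runs on a hypothetical toy frame that merely borrows the column names, and `handle_unknown='ignore'` is my assumption for coping with categories that appear only in a test fold.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame that reuses the notebook's column names.
toy = pd.DataFrame({
    'poisonous': [1, 0, 0, 1, 0, 1, 0, 1, 0, 1] * 3,
    'odor':      [7, 0, 1, 7, 6, 3, 0, 7, 6, 3] * 3,
    'habitat':   [4, 0, 2, 4, 0, 1, 2, 4, 1, 1] * 3,
})
X_toy = toy[['odor', 'habitat']]
y_toy = toy['poisonous']

# One-hot encode both predictor columns; ignore categories unseen in training.
column_trans = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), ['odor', 'habitat'])
)

# The pipeline encodes first and then fits the logistic regression.
pipe = make_pipeline(column_trans, LogisticRegression())

scores = cross_val_score(pipe, X_toy, y_toy, cv=5, scoring='accuracy')
print(scores.mean())
```

The same `pipe` object could also be fit once on the full data and used to predict new rows passed as a dataframe with `odor` and `habitat` columns.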

## Conclusion
As we predicted in the preprocessing step, odor is a very significant predictor of the toxicity of a mushroom. In a logistic regression model it reaches 98.5% cross-validated accuracy, which is very accurate for a single predictor. Habitat could not compete as a predictor, though it still reached 66.7% accuracy. What surprised me the most was that the two predictor variables together (95.7%) were actually less accurate than odor by itself. This may be a case of the significantly lower accuracy of habitat bringing down the very high accuracy of the odor predictor. To take this further, it would be interesting to create an algorithm that takes a picture of a mushroom and extracts the predictor variables from it in order to predict whether the mushroom is toxic. Features such as odor and habitat would still need to be entered manually.
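Two quick checks could put these numbers in context: a majority-class baseline to compare each accuracy against, and cross-validation with shuffled folds in case the row order of the data affects the scores. Whether row order plays any role here is an untested assumption on my part. The sketch below runs both checks on hypothetical toy data, not on the mushroom dataframe.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical toy data: 60 rows of class 0 followed by 40 rows of class 1,
# with one noisy numeric feature related to the class.
rng = np.random.default_rng(0)
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = (y_toy + rng.normal(scale=0.8, size=100)).reshape(-1, 1)

# Baseline: always predict the most frequent class seen in training.
baseline = DummyClassifier(strategy='most_frequent')
baseline_acc = cross_val_score(
    baseline, X_toy, y_toy, cv=5, scoring='accuracy'
).mean()

# Shuffled, stratified folds so that row order cannot drive the result.
shuffled = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model_acc = cross_val_score(
    LogisticRegression(), X_toy, y_toy, cv=shuffled, scoring='accuracy'
).mean()

print(baseline_acc, model_acc)
```

A model is only informative to the extent that its accuracy exceeds the baseline, so the gap between the two printed numbers is the figure to look at.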