# Final Project

### Calvin Warner, DK Yuan

## Progress Report 2


Our research question is: can we use statistical learning methods to accurately predict whether a mushroom is edible or poisonous?

As a primer for the paper, we did some simple data cleaning and organizing to prepare for further analysis. This report collects those preliminary analyses.

In [1]:
import csv
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn import metrics
from sklearn.model_selection import train_test_split as tts
import math

fn = "mushrooms.csv"

df = pd.read_csv(fn)

df.head(6)

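| | class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | p | x | s | n | t | p | f | c | n | k | ... | s | w | w | p | w | o | p | k | s | u |
| 1 | e | x | s | y | t | a | f | c | b | k | ... | s | w | w | p | w | o | p | n | n | g |
| 2 | e | b | s | w | t | l | f | c | b | n | ... | s | w | w | p | w | o | p | n | n | m |
| 3 | p | x | y | w | t | p | f | c | n | n | ... | s | w | w | p | w | o | p | k | s | u |
| 4 | e | x | s | g | f | n | f | w | b | k | ... | s | w | w | p | w | o | e | n | a | g |
| 5 | e | x | y | y | t | a | f | c | b | n | ... | s | w | w | p | w | o | p | k | n | g |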


In [2]:
df.isnull().sum()

class                       0
cap-shape                   0
cap-surface                 0
cap-color                   0
bruises                     0
odor                        0
gill-attachment             0
gill-spacing                0
gill-size                   0
gill-color                  0
stalk-shape                 0
stalk-root                  0
stalk-surface-above-ring    0
stalk-surface-below-ring    0
stalk-color-above-ring      0
stalk-color-below-ring      0
veil-type                   0
veil-color                  0
ring-number                 0
ring-type                   0
spore-print-color           0
population                  0
habitat                     0
dtype: int64

There appear to be no null values, but according to the dataset documentation, missing values are marked with a `?`. We therefore replace every `?` with NaN so that it is not treated as a category of its own.

In [3]:
df = df.replace('?', np.nan)  # mark '?' entries as missing

In [4]:
df['stalk-root'].unique()

array(['e', 'c', 'b', 'r', nan], dtype=object)
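As an aside, the same effect can probably be had at load time. The sketch below is a minimal illustration, assuming the file uses `?` only as a missing-value marker; the three rows in `sample_csv` are made up for the example and are not taken from `mushrooms.csv`.

```python
import io
import pandas as pd

# Tiny stand-in for mushrooms.csv (hypothetical rows, not the real data).
sample_csv = io.StringIO(
    "class,stalk-root,odor\n"
    "p,e,p\n"
    "e,?,a\n"
    "e,c,l\n"
)

# na_values tells pandas to parse '?' as a missing value while reading.
sample = pd.read_csv(sample_csv, na_values="?")

# Count missing entries per column; only stalk-root has one here.
missing_per_column = sample.isnull().sum()
print(missing_per_column)
```

With this approach `df.isnull().sum()` would report the missing `stalk-root` entries directly, without a separate `replace` step.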

**Ok**, now let's delete all the rows that contain a null value.

In [5]:
df = df.dropna()

Let's check to make sure those rows were deleted.

In [6]:
df['stalk-root'].unique()

array(['e', 'c', 'b', 'r'], dtype=object)

Looks good! Now we can encode the labels to be numeric so we can run logistic regression.

In [7]:
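le = LabelEncoder()
# Encode each column's categories as integers (assigned in sorted order, so for class: e -> 0, p -> 1).
for col in df.columns:
    df[col] = le.fit_transform(df[col])

df.head()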

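| | class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 5 | 2 | 4 | 1 | 6 | 1 | 0 | 1 | 2 | ... | 2 | 5 | 5 | 0 | 0 | 1 | 3 | 1 | 3 | 5 |
| 1 | 0 | 5 | 2 | 7 | 1 | 0 | 1 | 0 | 0 | 2 | ... | 2 | 5 | 5 | 0 | 0 | 1 | 3 | 2 | 2 | 1 |
| 2 | 0 | 0 | 2 | 6 | 1 | 3 | 1 | 0 | 0 | 3 | ... | 2 | 5 | 5 | 0 | 0 | 1 | 3 | 2 | 2 | 3 |
| 3 | 1 | 5 | 3 | 6 | 1 | 6 | 1 | 0 | 1 | 3 | ... | 2 | 5 | 5 | 0 | 0 | 1 | 3 | 1 | 3 | 5 |
| 4 | 0 | 5 | 2 | 3 | 0 | 5 | 1 | 1 | 0 | 2 | ... | 2 | 5 | 5 | 0 | 0 | 1 | 0 | 2 | 0 | 1 |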
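One caveat worth noting: `LabelEncoder` assigns integers in sorted order, so a linear model such as logistic regression will treat each nominal feature as if its categories were ordered. A common alternative is one-hot encoding. The sketch below uses `pd.get_dummies` on a small made-up frame (`toy` and its values are hypothetical, not rows from the dataset); it illustrates the idea and is not the encoding used in this report.

```python
import pandas as pd

# Hypothetical toy frame with two nominal columns (not the real data).
toy = pd.DataFrame({
    "odor": ["p", "a", "l", "n"],
    "bruises": ["t", "t", "f", "f"],
})

# One indicator column per category, so no artificial ordering is implied.
toy_onehot = pd.get_dummies(toy, columns=["odor", "bruises"])

print(toy_onehot.columns.tolist())
print(toy_onehot.shape)
```

The trade-off is a wider design matrix: each feature contributes one column per category instead of a single integer column.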


Separating the data into Y and X for analysis.

In [8]:
x = df.iloc[:, 1:23]  # the 22 feature columns
y = df.iloc[:, 0]     # the class label (0 = edible, 1 = poisonous)
x.head()
y.head()

0    1
1    0
2    0
3    1
4    0
Name: class, dtype: int64

In [9]:
x.describe()

| | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | stalk-shape | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5644.0 | 5644.0 | 5644.0 | 5644.0 | 5644.0 | 5644.0 | 5644.0 | 5644.0 | 5644.0 | 5644.0 | ... | 5644.0 | 5644.0 | 5644.0 | 5644.0 | 5644.0 | 5644.0 | 5644.0 | 5644.0 | 5644.0 | 5644.0 |
| mean | 3.420269 | 1.627215 | 4.272856 | 0.564139 | 3.564848 | 0.996811 | 0.181432 | 0.124734 | 3.536853 | 0.510276 | ... | 1.619419 | 3.949681 | 3.932672 | 0.0 | 0.001417 | 1.014883 | 2.096386 | 1.21545 | 3.71297 | 1.236003 |
| std | 1.659641 | 1.336497 | 1.838018 | 0.495913 | 1.765806 | 0.056388 | 0.38541 | 0.330447 | 2.281428 | 0.499939 | ... | 0.72162 | 1.526058 | 1.525402 | 0.0 | 0.037625 | 0.1656 | 1.192716 | 1.059125 | 1.328741 | 1.597981 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 2.0 | 0.0 | 3.0 | 0.0 | 2.0 | 1.0 | 0.0 | 0.0 | 2.0 | 0.0 | ... | 1.0 | 3.0 | 3.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 3.0 | 0.0 |
| 50% | 5.0 | 2.0 | 4.0 | 1.0 | 5.0 | 1.0 | 0.0 | 0.0 | 4.0 | 1.0 | ... | 2.0 | 5.0 | 5.0 | 0.0 | 0.0 | 1.0 | 3.0 | 1.0 | 4.0 | 1.0 |
| 75% | 5.0 | 3.0 | 6.0 | 1.0 | 5.0 | 1.0 | 0.0 | 0.0 | 6.0 | 1.0 | ... | 2.0 | 5.0 | 5.0 | 0.0 | 0.0 | 1.0 | 3.0 | 2.0 | 5.0 | 1.0 |
| max | 5.0 | 3.0 | 7.0 | 1.0 | 6.0 | 1.0 | 1.0 | 1.0 | 8.0 | 1.0 | ... | 3.0 | 6.0 | 6.0 | 0.0 | 1.0 | 2.0 | 3.0 | 5.0 | 5.0 | 5.0 |


Let's take a look at the correlations between the variables.

In [10]:
corr = df.corr()
corr.head(3)

| | class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| class | 1.0 | 0.053155 | 0.046859 | 0.150741 | -0.435562 | -0.455566 | -0.071945 | -0.26416 | 0.215289 | -0.318339 | ... | -0.363604 | -0.317244 | -0.308613 | NaN | 0.047921 | 0.008615 | -0.2152 | -0.507034 | 0.203882 | 0.297412 |
| cap-shape | 0.053155 | 1.0 | -0.068688 | -0.056421 | -0.097782 | 1.1e-05 | 0.002963 | 0.051432 | 0.103443 | -0.006431 | ... | -0.02978 | -0.030272 | -0.0297 | NaN | -0.043596 | -0.118191 | -0.02942 | -0.062007 | 0.048283 | -0.057451 |
| cap-surface | 0.046859 | -0.068688 | 1.0 | 0.093355 | 0.22859 | -0.108697 | -0.058104 | -0.204448 | -0.042356 | 0.073668 | ... | 0.162293 | 0.045018 | 0.046099 | NaN | 0.038702 | 0.044289 | 0.190188 | 0.054747 | -0.00039 | 0.106918 |
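The `veil-type` entries in the correlation output are missing (NaN). That is consistent with the summary statistics, where `veil-type` has a standard deviation of 0: a constant column has no variance, so its Pearson correlation is undefined. A minimal sketch of how such columns could be detected and dropped is below; the frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical encoded frame with one constant column (not the real data).
toy = pd.DataFrame({
    "class": [1, 0, 0, 1],
    "veil-type": [0, 0, 0, 0],
    "odor": [6, 0, 3, 6],
})

# A column with a single unique value carries no information for the model.
constant_cols = [c for c in toy.columns if toy[c].nunique() <= 1]
toy_reduced = toy.drop(columns=constant_cols)

corr_full = toy.corr()             # veil-type row and column are NaN
corr_reduced = toy_reduced.corr()  # no NaN entries remain
print(constant_cols)
```

Keep in mind that these are correlations between arbitrary integer codes for nominal categories, so their magnitudes should be read with caution.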


The number of observations in each class.

In [11]:
print(df.groupby('class').size())

class
0    3488
1    2156
dtype: int64
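These counts give a useful reference point: a model that always predicts the majority class (0, edible) would already be right about 62% of the time. The short sketch below works this out from the counts printed above; the variable names are ours.

```python
import pandas as pd

# Class counts copied from the groupby output above (0 = edible, 1 = poisonous).
class_counts = pd.Series({0: 3488, 1: 2156}, name="count")

# Share of each class among the 5644 remaining observations.
class_share = class_counts / class_counts.sum()

# Accuracy of always predicting the majority class.
baseline_accuracy = class_share.max()

print(class_share.round(3))
print(round(baseline_accuracy, 3))
```

Any classifier we fit should be judged against this baseline rather than against 50%.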


Split the data into training and test sets.

In [12]:
xTrain, xTest, yTrain, yTest = tts(x,y,test_size=0.2,random_state=1)
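Because the classes are somewhat imbalanced, one option is a stratified split, which keeps the class proportions roughly equal in the training and test sets. The sketch below is an illustration, not the split used in this report: `X_demo` is a placeholder feature matrix, and `y_demo` simply reproduces the 3488 / 2156 class counts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels with the same class counts as the cleaned data.
y_demo = np.array([0] * 3488 + [1] * 2156)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder features

# stratify=y_demo preserves the class ratio in both parts of the split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

# Fraction of class 1 in each part; the two should be nearly identical.
print(y_tr.mean(), y_te.mean())
```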

Fit a logistic regression model on the training set, using all of the features.

In [13]:
modelLr = LogisticRegression()

lRegModel = modelLr.fit(xTrain,yTrain)
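The imports include `cross_val_score`, which could complement the single train/test split with a k-fold estimate. The sketch below shows the call on synthetic data from `make_classification`, a stand-in we chose for illustration; the scores it prints say nothing about the mushroom data. Raising `max_iter` is our assumption to help the solver converge.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the 22 encoded features (hypothetical data).
X_demo, y_demo = make_classification(n_samples=500, n_features=22, random_state=1)

# Same model class as above; max_iter raised to help convergence.
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validated accuracy: one score per held-out fold.
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```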

In [14]:
yProb = modelLr.predict_proba(xTest)[:, 1]  # predicted probability of the positive class (1 = poisonous)

yPred = np.where(yProb > 0.5, 1, 0)  # turn the probabilities into class predictions with a 0.5 cutoff
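The 0.5 cutoff is a choice, not a requirement. Since class 1 is poisonous, a lower cutoff would flag more mushrooms as poisonous and trade false negatives for false positives. A minimal sketch with made-up probabilities (`probs` is hypothetical, and 0.25 is an arbitrary example cutoff):

```python
import numpy as np

# Hypothetical predicted probabilities of class 1 (poisonous).
probs = np.array([0.05, 0.30, 0.45, 0.55, 0.90])

# Default cutoff, as in the cell above.
default_pred = np.where(probs > 0.5, 1, 0)

# A more cautious cutoff labels borderline cases as poisonous.
cautious_pred = np.where(probs > 0.25, 1, 0)

print(default_pred)   # [0 0 0 1 1]
print(cautious_pred)  # [0 1 1 1 1]
```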

Confusion matrix for the test-set predictions (rows are true classes, columns are predicted classes).

In [15]:
confusionMatrix = metrics.confusion_matrix(yTest,yPred)
confusionMatrix

array([[703,   6],
       [ 34, 386]])
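To make the matrix easier to read, the sketch below derives accuracy, precision, and recall for the poisonous class (1) directly from the four counts. It only redoes the arithmetic on the matrix printed above; the variable names are ours.

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows are true classes, columns are predicted classes (0 = edible, 1 = poisonous).
cm = np.array([[703, 6],
               [34, 386]])

# Unpack in row-major order: true negatives, false positives,
# false negatives, true positives.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all test mushrooms classified correctly
precision = tp / (tp + fp)        # of those predicted poisonous, share truly poisonous
recall = tp / (tp + fn)           # of the truly poisonous, share caught by the model

print(round(accuracy, 3), round(precision, 3), round(recall, 3))
```

With these counts, accuracy is about 0.965, precision about 0.985, and recall about 0.919. The 34 false negatives are poisonous mushrooms predicted to be edible, which is the more costly kind of error for this problem.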