## Project 4
### Predictive Analysis Using Scikit-learn

In [23]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

We start by importing the mushroom dataset as a pandas DataFrame object:

In [24]:
names = ['poisonous', 'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor',
         'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape',
         'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring', 'stalk-color-above-ring',
         'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number', 'ring-type',
         'spore-print-color', 'population', 'habitat']

df = pd.read_csv('C:/Users/cscam/mushroom/agaricus-lepiota.data', names = names)
df

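| | poisonous | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | p | x | s | n | t | p | f | c | n | k | ... | s | w | w | p | w | o | p | k | s | u |
| 1 | e | x | s | y | t | a | f | c | b | k | ... | s | w | w | p | w | o | p | n | n | g |
| 2 | e | b | s | w | t | l | f | c | b | n | ... | s | w | w | p | w | o | p | n | n | m |
| 3 | p | x | y | w | t | p | f | c | n | n | ... | s | w | w | p | w | o | p | k | s | u |
| 4 | e | x | s | g | f | n | f | w | b | k | ... | s | w | w | p | w | o | e | n | a | g |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8119 | e | k | s | n | f | n | a | c | b | y | ... | s | o | o | p | o | o | p | b | c | l |
| 8120 | e | x | s | n | f | n | a | c | b | y | ... | s | o | o | p | n | o | p | b | v | l |
| 8121 | e | f | s | n | f | n | a | c | b | n | ... | s | o | o | p | o | o | p | b | c | l |
| 8122 | p | k | y | n | f | y | f | c | n | b | ... | k | w | w | p | w | o | e | w | v | l |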


Rather than analyse the entire dataset, we will focus on three columns: edible/poisonous, odor, and cap color.

In [25]:
new_df = df[['poisonous', 'odor', 'cap-color']]
new_df

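| | poisonous | odor | cap-color |
|---|---|---|---|
| 0 | p | p | n |
| 1 | e | a | y |
| 2 | e | l | w |
| 3 | p | p | w |
| 4 | e | n | g |
| ... | ... | ... | ... |
| 8119 | e | n | n |
| 8120 | e | n | n |
| 8121 | e | n | n |
| 8122 | p | y | n |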


Notice that the values are stored as single-character codes. scikit-learn estimators expect numeric input, so we first convert each column to the pandas category dtype and then encode every category as an integer:

In [26]:
new_df = new_df.astype('category')
new_df.dtypes

poisonous    category
odor         category
cap-color    category
dtype: object

In [27]:
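labelencoder = LabelEncoder()

# Map each distinct value in a column to an integer (0, 1, 2, ...),
# assigned in sorted order of the original codes. The encoder is refit per column.
for column in new_df.columns:
    new_df[column] = labelencoder.fit_transform(new_df[column])

new_df.head()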

| | poisonous | odor | cap-color |
|---|---|---|---|
| 0 | 1 | 6 | 4 |
| 1 | 0 | 0 | 9 |
| 2 | 0 | 3 | 8 |
| 3 | 1 | 6 | 8 |
| 4 | 0 | 5 | 3 |
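
Label encoding assigns each category an arbitrary integer, so a linear model will treat odor code 6 as "larger" than odor code 0 even though the categories have no natural order. One-hot encoding is a common alternative for nominal features like these. The sketch below is not the approach used in this notebook; it only illustrates the idea on a hypothetical `toy` DataFrame typed in by hand from the first five rows shown earlier.

```python
import pandas as pd

# Hypothetical stand-in: the first five rows of the three selected columns.
toy = pd.DataFrame({
    'poisonous': ['p', 'e', 'e', 'p', 'e'],
    'odor':      ['p', 'a', 'l', 'p', 'n'],
    'cap-color': ['n', 'y', 'w', 'w', 'g'],
})

# Encode the binary target as 0/1 (e -> 0, p -> 1).
y_toy = (toy['poisonous'] == 'p').astype(int)

# One-hot encode the nominal features: one 0/1 column per category value.
x_toy = pd.get_dummies(toy[['odor', 'cap-color']], dtype=int)

print(x_toy.columns.tolist())
```

Each row of `x_toy` contains exactly two ones, one for its odor and one for its cap color, so no ordering between categories is implied.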


Now that the data is in numeric form, we can split it into training and test sets, holding out 20% of the rows for testing:

In [29]:
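# Target: the encoded poisonous column; features: the remaining columns.
y = new_df['poisonous'].values
x = new_df.drop(['poisonous'], axis=1).values

# Hold out 20% of the rows for testing; random_state makes the split reproducible.
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42, test_size=0.2)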
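
To make the effect of `test_size=0.2` concrete, the sketch below applies the same function to a hypothetical 100-row array and checks the resulting sizes. The `stratify` argument is an addition that the notebook does not use; it keeps the class proportions the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 2 features, 60/40 class balance.
x_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 60 + [1] * 40)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(x_tr.shape, x_te.shape)  # (80, 2) (20, 2)
print(y_te.mean())             # 0.4, the same class ratio as y_demo
```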

We will use a logistic regression model, which estimates the probability that a mushroom is poisonous from its features. First we need to fit the model to the training data:

In [31]:
logreg = LogisticRegression()
logreg.fit(x_train, y_train)

LogisticRegression()

Now we can run the predict method. Here it is applied to the full feature matrix x, so the output covers both training and test rows:

In [21]:
results = logreg.predict(x)

results

array([0, 1, 0, ..., 0, 0, 0])
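
The `predict` method returns hard class labels. The probabilities behind those labels are available from `predict_proba`. A minimal sketch on hypothetical one-feature data (the arrays are illustrative and not taken from the mushroom dataset):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: small values belong to class 0, large values to class 1.
x_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(x_demo, y_demo)

proba = model.predict_proba(x_demo)  # shape (6, 2): P(class 0), P(class 1) per row
labels = model.predict(x_demo)       # the class with the higher probability

print(proba.round(2))
print(labels)
```

Each row of `proba` sums to 1, and `labels` matches the column with the larger probability.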

Finally, we can use the held-out test data to check the accuracy of our model:

In [30]:
logreg.score(x_test, y_test)

0.656
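
For a classifier, `score` reports the fraction of predictions that match the true labels. The sketch below checks this by hand and also builds a confusion matrix, which shows where the errors fall. The data is hypothetical, and the model is scored on the rows it was fitted on purely to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Hypothetical toy data with some overlap between the classes.
x_demo = np.array([[0], [1], [2], [3], [4], [5], [6], [7]], dtype=float)
y_demo = np.array([0, 0, 1, 0, 1, 0, 1, 1])

model = LogisticRegression().fit(x_demo, y_demo)
pred = model.predict(x_demo)

# Accuracy computed by hand: share of predictions equal to the true label.
manual_accuracy = np.mean(pred == y_demo)
assert np.isclose(manual_accuracy, model.score(x_demo, y_demo))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_demo, pred)
print(manual_accuracy)
print(cm)
```

The diagonal of the confusion matrix counts correct predictions, so its sum divided by the total number of rows equals the accuracy.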