## Project 4 - Predictive Analysis using scikit-learn
Adama Konate - IS 362

Task: 
- Start with the mushroom data in the pandas DataFrame that you constructed in your “Assignment – Preprocessing Data with sci-kit learn.”
- Use scikit-learn to determine which of the two predictor columns that you selected (odor and one other column of your choice) most accurately predicts whether or not a mushroom is poisonous. There is an additional challenge here: to use scikit-learn’s predictive classifiers, you’ll want to convert each of your two (numeric categorical) predictor columns into a set of columns. For one approach, see the pandas get_dummies() method.
- Clearly state your conclusions along with any recommendations for further analysis.
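
The column conversion described in the task can be sketched on a toy column as follows. This is a minimal illustration rather than the project data: the `toy` frame and its values are made up, and `prefix` and `dtype=int` are optional `get_dummies` arguments chosen here to get readable 0/1 columns.

```python
import pandas as pd

# Hypothetical toy data: integer codes standing in for a categorical feature.
toy = pd.DataFrame({"Odor": [7, 0, 1, 7, 6]})

# One column per distinct code; each row has a single 1 marking its category.
dummies = pd.get_dummies(toy["Odor"], prefix="Odor", dtype=int)
print(dummies.columns.tolist())
print(dummies.sum(axis=1).tolist())
```

Each distinct code becomes its own column, so the model no longer treats the codes as ordered quantities.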

In [19]:
# Importing required modules
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
# Import the submodules explicitly, since later cells refer to them by their full names
import sklearn.linear_model
import sklearn.metrics
import sklearn.model_selection
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split as ttsplit

%matplotlib inline

Attribute Information:
1. **cap-shape**: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s
2. **cap-surface**: fibrous=f, grooves=g, scaly=y, smooth=s
3. **cap-color**: brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y
4. **bruises?**: bruises=t, no=f
5. **odor**: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s
6. **gill-attachment**: attached=a, descending=d, free=f, notched=n
7. **gill-spacing**: close=c, crowded=w, distant=d
8. **gill-size**: broad=b, narrow=n
9. **gill-color**: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y
10. **stalk-shape**: enlarging=e, tapering=t
11. **stalk-root**: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?
12. **stalk-surface-above-ring**: fibrous=f, scaly=y, silky=k, smooth=s
13. **stalk-surface-below-ring**: fibrous=f, scaly=y, silky=k, smooth=s
14. **stalk-color-above-ring**: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
15. **stalk-color-below-ring**: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
16. **veil-type**: partial=p, universal=u
17. **veil-color**: brown=n, orange=o, white=w, yellow=y
18. **ring-number**: none=n, one=o, two=t
19. **ring-type**: cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z
20. **spore-print-color**: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y
21. **population**: abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y
22. **habitat**: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d
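
The single-letter codes in this list can be turned back into readable labels with a plain dictionary. The sketch below does this for odor only; the names `ODOR_NAMES` and `decode_odor` are illustrative and do not appear in the notebook itself.

```python
# Odor codes copied from item 5 of the attribute list.
ODOR_NAMES = {
    "a": "almond", "l": "anise", "c": "creosote", "y": "fishy", "f": "foul",
    "m": "musty", "n": "none", "p": "pungent", "s": "spicy",
}

def decode_odor(code: str) -> str:
    """Return the readable odor name; raises KeyError for an unknown code."""
    return ODOR_NAMES[code]

print([decode_odor(code) for code in ["p", "a", "l", "p", "n"]])
```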



In [20]:
# Reading the CSV file into a DataFrame.
# The file has no header row; column 0 is the class, column 3 is cap-color and column 5 is odor.
data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data',
                   sep=',',
                   header=None,
                   usecols=[0, 3, 5],
                   names=["Mushroom_Class", "Cap_Color", "Odor"])
data.head(5)

| | Mushroom_Class | Cap_Color | Odor |
|---|---|---|---|
| 0 | p | n | p |
| 1 | e | y | a |
| 2 | e | w | l |
| 3 | p | w | p |
| 4 | e | g | n |


In [21]:
# Converting the category codes to numeric values (poisonous = 1, edible = 0)
data["Mushroom_Class"] = data["Mushroom_Class"].map({'p': 1, 'e': 0})
data["Cap_Color"] = data["Cap_Color"].map({'n': 0, 'b': 1, 'c': 2, 'g': 3, 'r': 4, 'p': 5, 'u': 6, 'e': 7, 'w': 8, 'y': 9})
data["Odor"] = data["Odor"].map({'a': 0, 'l': 1, 'c': 2, 'y': 3, 'f': 4, 'm': 5, 'n': 6, 'p': 7, 's': 8})
data.head(5)

| | Mushroom_Class | Cap_Color | Odor |
|---|---|---|---|
| 0 | 1 | 0 | 7 |
| 1 | 0 | 9 | 0 |
| 2 | 0 | 8 | 1 |
| 3 | 1 | 8 | 7 |
| 4 | 0 | 3 | 6 |


In [22]:
# Counting edible (0) and poisonous (1) mushrooms
count = data['Mushroom_Class'].value_counts()
count

0    4208
1    3916
Name: Mushroom_Class, dtype: int64
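
These counts also give a useful reference point: a model that always predicts the majority class (edible, coded 0) would already be right a little over half the time. A quick sketch of that arithmetic, with variable names chosen here for illustration:

```python
# Class counts taken from the value_counts() output above.
n_edible = 4208      # Mushroom_Class == 0
n_poisonous = 3916   # Mushroom_Class == 1
n_total = n_edible + n_poisonous

# Accuracy of always predicting the majority class.
baseline_accuracy = max(n_edible, n_poisonous) / n_total
print(n_total, round(baseline_accuracy, 3))
```

Any useful predictor should do clearly better than this baseline.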

In [23]:
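# Converting the two predictors into dummy (one-hot) variables.
# The prefix keeps the column names unique; dtype=int gives 0/1 values.
c = pd.get_dummies(data['Cap_Color'], prefix='Cap_Color', dtype=int)
o = pd.get_dummies(data['Odor'], prefix='Odor', dtype=int)

# Combining both sets of dummy columns and the target into a new DataFrame
mushroom_data = pd.concat([c, o, data['Mushroom_Class']], axis=1)

mushroom_data.head(5)

| | Cap_Color_0 | Cap_Color_1 | Cap_Color_2 | Cap_Color_3 | Cap_Color_4 | Cap_Color_5 | Cap_Color_6 | Cap_Color_7 | Cap_Color_8 | Cap_Color_9 | Odor_0 | Odor_1 | Odor_2 | Odor_3 | Odor_4 | Odor_5 | Odor_6 | Odor_7 | Odor_8 | Mushroom_Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 4 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |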


In [24]:
# Checking the shape: 8124 rows and 20 columns (10 Cap_Color dummies, 9 Odor dummies and the target)
mushroom_data.shape

(8124, 20)
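
With 20 columns in play, positional slices are easy to get off by one. An alternative, sketched here on made-up data, is to select each feature's dummy columns by name. This assumes the dummy columns carry a prefix such as `Odor_`; the `toy` and `encoded` frames are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with the same layout: two encoded features and a target.
toy = pd.DataFrame({"Cap_Color": [0, 9, 8, 8], "Odor": [7, 0, 1, 7],
                    "Mushroom_Class": [1, 0, 0, 1]})
encoded = pd.concat(
    [pd.get_dummies(toy["Cap_Color"], prefix="Cap_Color", dtype=int),
     pd.get_dummies(toy["Odor"], prefix="Odor", dtype=int),
     toy["Mushroom_Class"]],
    axis=1,
)

# Select each feature's dummy columns by name instead of by position.
odor_cols = [col for col in encoded.columns if col.startswith("Odor_")]
cap_cols = [col for col in encoded.columns if col.startswith("Cap_Color_")]
X_odor = encoded[odor_cols].values
y = encoded["Mushroom_Class"].values
print(odor_cols, X_odor.shape, y.shape)
```

Selecting by name also makes it harder to include the target among the predictors by accident.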

In [25]:
x = mushroom_data.iloc[:, :-1].values  # predictors: the 19 dummy columns
y = mushroom_data.iloc[:, -1].values   # target: Mushroom_Class, the last column
X_train, X_test, Y_train, Y_test = ttsplit(x, y, random_state=1)

In [26]:
print(X_train.shape)
print(X_test.shape)

(6093, 19)

(2031, 19)


In [27]:
print(Y_train.shape)
print(Y_test.shape)

(6093,)

(2031,)


In [28]:
# Fit a linear regression on both features together
lr = sklearn.linear_model.LinearRegression()
lr.fit(X_train, Y_train)
y_pred = lr.predict(X_test)

# Sanity check of the metric functions: identical vectors give zero error
t = [1,0]
p = [1,0]
print(sklearn.metrics.mean_absolute_error(t,p))
print(sklearn.metrics.mean_squared_error(t, p))
print(np.sqrt(sklearn.metrics.mean_squared_error(t, p)))

0.0

0.0

0.0


In [29]:
# RMSE of the two-feature model on the test set
print(np.sqrt(metrics.mean_squared_error(Y_test, y_pred)))
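
The shapes printed in cells 26 and 27 follow from `train_test_split` using a test fraction of 0.25 when neither `test_size` nor `train_size` is given: the test set gets the ceiling of 25% of the rows and the training set gets the rest. The arithmetic, sketched in plain Python with illustrative variable names:

```python
import math

n_samples = 8124          # rows in mushroom_data
test_fraction = 0.25      # scikit-learn's default when no size is given

n_test = math.ceil(test_fraction * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```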


In [30]:
# Train and test with the "CAP COLOR" feature only
X = mushroom_data.iloc[:, 0:10].values  # the 10 Cap_Color dummy columns
Y = mushroom_data.iloc[:, -1].values    # target: Mushroom_Class, the last column

X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X, Y, random_state=1)
lr.fit(X_train, Y_train)
Y_pred = lr.predict(X_test)

print(np.sqrt(metrics.mean_squared_error(Y_test, Y_pred)))
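
For reference, the quantity these cells print is the root mean squared error: the square root of the mean of the squared differences between actual and predicted values. A small worked sketch with made-up numbers (not results from this notebook):

```python
import math

# Hypothetical actual labels and model outputs, for illustration only.
actual = [1, 0, 0, 1]
predicted = [0.5, 0.5, 0.0, 1.0]

squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
mse = sum(squared_errors) / len(squared_errors)
rmse = math.sqrt(mse)
print(mse, rmse)
```

Here two of the four predictions miss by 0.5, giving a mean squared error of 0.125.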


In [32]:
# Train and test with the "ODOR" feature only
X = mushroom_data.iloc[:, 10:19].values  # the 9 Odor dummy columns
Y = mushroom_data.iloc[:, -1].values     # target: Mushroom_Class, the last column

X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X, Y, random_state=1)
lr.fit(X_train, Y_train)
Y_pred = lr.predict(X_test)

print(np.sqrt(metrics.mean_squared_error(Y_test, Y_pred)))


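We can use the Odor feature to predict whether a mushroom is edible or poisonous, provided its model's root mean squared error (RMSE) on the test set is lower than the Cap_Color model's. A lower error means a more accurate prediction. The comparison only holds when the target is Mushroom_Class (the last column of mushroom_data) and the target is not also among the predictor columns. An RMSE on the order of 1e-15 signals that the target was fed in as a predictor, not that the feature predicts the class perfectly.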
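
As a possible direction for further analysis: because Mushroom_Class is a binary label, a classifier scored by accuracy may be easier to interpret than linear regression scored by RMSE. The sketch below illustrates the idea on a small synthetic data set, not the mushroom data. The frame `toy`, the helper `feature_accuracy`, and the rule that odor code 4 alone marks class 1 are all made up for the example.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data: odor fully determines the class, cap color is unrelated to it.
rows = range(120)
toy = pd.DataFrame({
    "Odor": [[0, 4, 6][i % 3] for i in rows],
    "Cap_Color": [[0, 8][(i // 3) % 2] for i in rows],
})
toy["Mushroom_Class"] = (toy["Odor"] == 4).astype(int)

def feature_accuracy(feature: str) -> float:
    """One-hot encode a single feature, fit a classifier, return test accuracy."""
    X = pd.get_dummies(toy[feature], prefix=feature, dtype=int)
    y = toy["Mushroom_Class"]
    # shuffle=False keeps the split deterministic; the rows are already interleaved.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, shuffle=False
    )
    model = LogisticRegression().fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))

odor_acc = feature_accuracy("Odor")
cap_acc = feature_accuracy("Cap_Color")
print(f"Odor: {odor_acc:.3f}  Cap_Color: {cap_acc:.3f}")
```

On the real data the same comparison could be run with the notebook's own `mushroom_data` columns, and a confusion matrix would show how many poisonous mushrooms each feature misses.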