# Project 4

## scikit-learn & Predictive Analysis

### Task:

1. Start with the mushroom data in the pandas DataFrame that you constructed in your “Assignment – Preprocessing Data with sci-kit learn.”
2. Use scikit-learn to determine which of the two predictor columns that you selected (odor and one other column of your choice) most accurately predicts whether or not a mushroom is poisonous. There is an additional challenge here: to use scikit-learn’s predictive classifiers, you’ll want to convert each of your two (numeric categorical) predictor columns into a set of columns. For one approach, see the pandas get_dummies() method.
3. Clearly state your conclusions along with any recommendations for further analysis.

##### Import library modules:

In [1]:
import pandas as pd
import sklearn.model_selection
import sklearn.linear_model
from sklearn import metrics
import numpy as np

##### Create a pandas DataFrame from selected columns of the mushroom data:

In [2]:
# Columns 0, 5, 9, and 22 of the UCI file hold the class label, odor, gill color, and habitat.
mushroom_df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data', header=None, usecols=[0, 5, 9, 22], names=["Edibility", "Odor", "Color", "Habitat"])
mushroom_df

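|   | Edibility | Odor | Color | Habitat |
|---|---|---|---|---|
| 0 | p | p | k | u |
| 1 | e | a | k | g |
| 2 | e | l | n | m |
| 3 | p | p | n | u |
| 4 | e | n | k | g |
| 5 | e | a | n | g |
| 6 | e | a | g | m |
| 7 | e | l | n | m |
| 8 | p | p | p | g |
| 9 | e | a | g | m |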


##### Use a loop and a dictionary to convert the letter codes into numbers:

In [3]:
column_names = ["Edibility", "Odor", "Color", "Habitat"]
mushroom_num = pd.DataFrame()

for name in column_names:
    # Build a fresh letter -> integer mapping for each column,
    # numbering the letters in order of first appearance.
    transform_dict = {}
    for value in mushroom_df[name]:
        if value not in transform_dict:
            transform_dict[value] = len(transform_dict)

    # Select the column by name and replace each letter with its integer code.
    mushroom_num[name] = mushroom_df[name].map(transform_dict)

mushroom_num

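|   | Edibility | Odor | Color | Habitat |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 0 | 1 |
| 2 | 1 | 2 | 1 | 2 |
| 3 | 0 | 0 | 1 | 0 |
| 4 | 1 | 3 | 0 | 1 |
| 5 | 1 | 1 | 1 | 1 |
| 6 | 1 | 1 | 2 | 2 |
| 7 | 1 | 2 | 1 | 2 |
| 8 | 0 | 0 | 3 | 1 |
| 9 | 1 | 1 | 2 | 2 |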
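
The encoding above numbers each letter in order of first appearance within its column. pandas offers `factorize`, which produces the same kind of codes without a hand-written dictionary. The following is a minimal sketch on a made-up `toy` frame (the frame and its values are assumptions for illustration, not the mushroom data):

```python
import pandas as pd

# Hypothetical stand-in for a few rows of letter-coded mushroom columns.
toy = pd.DataFrame({
    "Edibility": ["p", "e", "e", "p"],
    "Odor": ["p", "a", "l", "p"],
})

toy_num = pd.DataFrame()
for name in toy.columns:
    # factorize returns (integer codes, unique values in order of first appearance).
    codes, uniques = pd.factorize(toy[name])
    toy_num[name] = codes

print(toy_num)
```

Keeping `uniques` around would also make it possible to translate the integer codes back into letters later.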


##### Convert the odor, color, and habitat variables into dummy/indicator variables:

In [5]:
# get_dummies creates one indicator column per category code in each predictor.
o = pd.get_dummies(mushroom_num['Odor'])
c = pd.get_dummies(mushroom_num['Color'])
h = pd.get_dummies(mushroom_num['Habitat'])

##### Combine the odor, color, and habitat indicator columns with Edibility into a new DataFrame:

In [8]:
mushroom_col = pd.concat([o, c, h, mushroom_num['Edibility']], axis=1)
cols = list(mushroom_col.iloc[:, :-1])
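
Because the combined frame is later sliced by column position, it is easy to pick up the wrong block of indicator columns. One possible safeguard, sketched here on made-up data, is to give each set of dummies a `prefix` and then select columns by name. The `toy` frame and the `Odor_`/`Habitat_` prefixes are assumptions for illustration:

```python
import pandas as pd

# Hypothetical numeric-coded data standing in for mushroom_num.
toy = pd.DataFrame({
    "Odor": [0, 1, 2, 0],
    "Habitat": [0, 1, 1, 0],
    "Edibility": [0, 1, 1, 0],
})

# Prefixes keep the indicator columns of different predictors distinguishable.
o = pd.get_dummies(toy["Odor"], prefix="Odor")
h = pd.get_dummies(toy["Habitat"], prefix="Habitat")
combined = pd.concat([o, h, toy["Edibility"]], axis=1)

# Select each block by name instead of by integer position.
odor_cols = [col for col in combined.columns if col.startswith("Odor_")]
habitat_cols = [col for col in combined.columns if col.startswith("Habitat_")]

X_odor = combined[odor_cols].values
y = combined["Edibility"].values
print(odor_cols, habitat_cols)
```

Selecting by name also makes the target explicit: `combined["Edibility"]` cannot be confused with one of the indicator columns.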

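##### Define X and Y and split them into training and test sets:

In [10]:
# X: every column except the last, i.e. Edibility, Odor, and Color.
# Y: column 1, which is Odor. Note that Y is also a column of X,
# and that the Edibility label is column 0, not column 1.
X = mushroom_num.iloc[:, :-1].values
Y = mushroom_num.iloc[:, 1].values

# train_test_split holds out 25% of the rows for testing by default.
X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X, Y, random_state=1)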

##### Fit a linear regression, predict Y for the test set, and try scikit-learn's error metrics on a toy pair of true and predicted values:

In [11]:
linreg = sklearn.linear_model.LinearRegression()
linreg.fit(X_train, Y_train)
Y_pred = linreg.predict(X_test)

# Toy example: the true values (t) and predicted values (p) are identical,
# so every error metric below is 0.
t = [1, 0]
p = [1, 0]

print(metrics.mean_absolute_error(t, p))          # mean absolute error
print(metrics.mean_squared_error(t, p))           # mean squared error
print(np.sqrt(metrics.mean_squared_error(t, p)))  # root mean squared error

0.0
0.0
0.0
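
Since identical lists always give zero error, a second toy pair with two mismatches may make the three metrics easier to tell apart. This is an illustrative sketch only; the values of `t` and `p` below are made up and are not taken from the mushroom data:

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted labels; positions 1 and 3 disagree.
t = np.array([1, 0, 1, 1])
p = np.array([1, 1, 1, 0])

mae = metrics.mean_absolute_error(t, p)   # mean of |t - p|
mse = metrics.mean_squared_error(t, p)    # mean of (t - p)^2
rmse = np.sqrt(mse)                       # square root of the MSE

# The same quantities computed directly with NumPy, as a cross-check.
mae_manual = np.mean(np.abs(t - p))
mse_manual = np.mean((t - p) ** 2)

print(mae, mse, rmse)
```

For 0/1 values, each error is either 0 or 1, so the MAE and MSE coincide and both equal the fraction of mismatches.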


##### Calculate the root mean squared error (RMSE) of the predictions on the test set:

In [12]:
print(np.sqrt(metrics.mean_squared_error(Y_test, Y_pred)))

2.45803998903e-14


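The root mean squared error is effectively zero (about 2.5e-14). However, Y here is the Odor column, and Odor is also one of the columns in X, so the model is recovering one of its own inputs. This result therefore does not yet show how accurately the data set predicts edibility.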

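##### Removing odor to determine whether or not it can predict edibility:

In [27]:
# Columns 0-8 are the odor indicator columns (assuming 9 odor categories), so this X keeps only odor.
# Y is column 1, which is one of those odor indicators; Edibility is the last column of mushroom_col.
X = mushroom_col.iloc[:, 0:9].values
Y = mushroom_col.iloc[:, 1].values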

X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X, Y, random_state=1)
linreg.fit(X_train, Y_train)
Y_pred = linreg.predict(X_test)

print(np.sqrt(metrics.mean_squared_error(Y_test, Y_pred)))

4.26760382915e-15


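The RMSE is again effectively zero (about 4.3e-15). As written, X holds the odor indicator columns and Y is one of those same columns, so the model reproduces its input exactly. This run therefore does not yet show how well odor predicts edibility.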

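##### Removing color to determine whether or not it can predict edibility:

In [26]:
# Columns 10-20 fall within the color indicator block (columns 9-20, assuming 12 color categories).
# Y is still column 1, an odor indicator, rather than the Edibility column.
X = mushroom_col.iloc[:, 10:21].values
Y = mushroom_col.iloc[:, 1].values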

X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X, Y, random_state=1)
linreg.fit(X_train, Y_train)
Y_pred = linreg.predict(X_test)

print(np.sqrt(metrics.mean_squared_error(Y_test, Y_pred)))

0.212831744045


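With X restricted to color indicator columns, the RMSE rises to about 0.213. Because Y is still column 1 of mushroom_col (an odor indicator) rather than the Edibility column, this measures how well color predicts that odor indicator, not edibility.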

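##### Removing habitat to determine whether or not it can predict edibility:

In [25]:
# Columns 22-27 fall within the habitat indicator block (columns 21-27, assuming 7 habitat categories).
# Y is still column 1, an odor indicator, rather than the Edibility column.
X = mushroom_col.iloc[:, 22:28].values
Y = mushroom_col.iloc[:, 1].values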

X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X, Y, random_state=1)
linreg.fit(X_train, Y_train)
Y_pred = linreg.predict(X_test)

print(np.sqrt(metrics.mean_squared_error(Y_test, Y_pred)))

0.201059892437


With X restricted to habitat indicator columns, the RMSE is about 0.201, slightly lower than the 0.213 obtained with the color columns. The same caveat applies: Y is column 1 of mushroom_col rather than the Edibility column, so this comparison does not rank the columns as predictors of edibility.

### Final Analysis:

Across the three runs, the odor columns gave a root mean squared error of essentially zero, while the color and habitat columns gave errors of about 0.213 and 0.201. These numbers cannot yet be read as a ranking of predictors of edibility, because in each run Y was column 1 (an odor value) rather than the Edibility column, and in the odor run Y was also part of X. The same method can still be used to compare columns once Y is set to Edibility: fit the model on one column's indicator variables at a time, calculate the root mean squared error on the test set, and treat the column whose error is closest to zero as the more accurate predictor. A column whose error stays far from zero contributes little on its own. For further analysis, other methods may determine more directly which column predicts edibility most accurately. A classifier such as logistic regression, scored with accuracy or a confusion matrix, suits an edible/poisonous target better than linear regression. Other built-in statistical functions and visualizations of edibility by odor, color, and habitat may also help.
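
As one possible "other method", the sketch below scores each predictor column on its own with a classifier and with Edibility as the target. It is an illustration under stated assumptions, not the analysis performed above: the helper name `single_predictor_accuracy`, the choice of `LogisticRegression`, and the made-up `toy` frame (in which odor separates the classes perfectly and habitat carries no signal) are all hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def single_predictor_accuracy(df, predictor, target="Edibility", random_state=1):
    """Hypothetical helper: test-set accuracy when predicting `target` from one column."""
    # One indicator column per category of the chosen predictor.
    X = pd.get_dummies(df[predictor], prefix=predictor, dtype=float)
    y = df[target]
    # Stratify so both classes appear in the training and test sets.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=random_state, stratify=y
    )
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))


# Made-up data: Odor matches Edibility exactly, Habitat is unrelated to it.
toy = pd.DataFrame({
    "Edibility": ["p", "e"] * 20,
    "Odor": ["f", "n"] * 20,
    "Habitat": ["g", "g", "d", "d"] * 10,
})

scores = {col: single_predictor_accuracy(toy, col) for col in ["Odor", "Habitat"]}
print(scores)
```

On the real data, the same helper could be called with `mushroom_df` and each of "Odor", "Color", and "Habitat" in turn; the column with the highest test accuracy would be the strongest single predictor under this setup.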