# Data Processing with scikit-learn
## A continuation of Week 13 data with machine learning
### Anthony Paveglio
---
This project uses the Python machine learning library _scikit-learn_ to study the mushroom attribute data presented in Week 13. The data includes physical attributes as well as a classification of whether each mushroom is poisonous or edible. In Week 13 I compared mushroom odor, color, and habitat against the classification (poisonous or edible) to look for trends. For example, mushrooms with a specific odor may always be poisonous.

In [52]:
import sklearn
import sklearn.preprocessing
import sklearn.linear_model
import sklearn.model_selection
import pandas
import numpy
import seaborn
import matplotlib

# 1. Importing mushroom data
We must first import the mushroom data file as well as append the correct headers.
_Only the first 10 rows are displayed to conserve visible space_

In [53]:
#All of the column headers found in agaricus-lepiota.names
agaricusLepiotaHeaders = ['class','cap-shape','cap-surface','cap-color','bruises',
                         'odor','gill-attachment','gill-spacing','gill-size', 'gill-color',
                         'stalk-shape','stalk-root','stalk-surface-above-ring','stalk-surface-below-ring', 
                          'stalk-color-above-ring','stalk-color-below-ring','veil-type','veil-color',
                         'ring-number','ring-type','spore-print-color','population','habitat']

#The data in this file is comma separated even though the extension is a generic .data
agaricusLepiotaData = pandas.read_csv('agaricus-lepiota.data', names=agaricusLepiotaHeaders)

agaricusLepiotaData.head(10)

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g
5,e,x,y,y,t,a,f,c,b,n,...,s,w,w,p,w,o,p,k,n,g
6,e,b,s,w,t,a,f,c,b,g,...,s,w,w,p,w,o,p,k,n,m
7,e,b,y,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,s,m
8,p,x,y,w,t,p,f,c,n,p,...,s,w,w,p,w,o,p,k,v,g
9,e,b,s,y,t,a,f,c,b,g,...,s,w,w,p,w,o,p,k,s,m
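
A quick sanity check after loading is to look at the frame's shape and the number of rows per class. The sketch below is illustrative only: it parses a hypothetical three-column excerpt of the first five rows shown above from an in-memory string, so `sampleText`, `sampleHeaders`, and `sampleData` are assumed names rather than part of the original notebook.

```python
import io
import pandas

# Hypothetical three-column excerpt of the first five rows shown above;
# the real file has 23 columns and no header row.
sampleText = "p,x,s\ne,x,s\ne,b,s\np,x,y\ne,x,s\n"
sampleHeaders = ['class', 'cap-shape', 'cap-surface']

# names= supplies the headers, exactly as in the real import
sampleData = pandas.read_csv(io.StringIO(sampleText), names=sampleHeaders)

# Shape as (rows, columns), then how many rows fall into each class
print(sampleData.shape)
classCounts = sampleData['class'].value_counts()
print(classCounts)
```

The same two calls could be run on `agaricusLepiotaData` to confirm that every column received a header and to see how balanced the two classes are.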


# 2. Converting to numeric
## More efficient approach than my Week 13 submission

In [54]:
binaryCodedData = pandas.get_dummies(agaricusLepiotaData)

binaryCodedData.head(10)

Unnamed: 0,class_e,class_p,cap-shape_b,cap-shape_c,cap-shape_f,cap-shape_k,cap-shape_s,cap-shape_x,cap-surface_f,cap-surface_g,...,population_s,population_v,population_y,habitat_d,habitat_g,habitat_l,habitat_m,habitat_p,habitat_u,habitat_w
0,0,1,0,0,0,0,0,1,0,0,...,1,0,0,0,0,0,0,0,1,0
1,1,0,0,0,0,0,0,1,0,0,...,0,0,0,0,1,0,0,0,0,0
2,1,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3,0,1,0,0,0,0,0,1,0,0,...,1,0,0,0,0,0,0,0,1,0
4,1,0,0,0,0,0,0,1,0,0,...,0,0,0,0,1,0,0,0,0,0
5,1,0,0,0,0,0,0,1,0,0,...,0,0,0,0,1,0,0,0,0,0
6,1,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
7,1,0,1,0,0,0,0,0,0,0,...,1,0,0,0,0,0,1,0,0,0
8,0,1,0,0,0,0,0,1,0,0,...,0,1,0,0,1,0,0,0,0,0
9,1,0,1,0,0,0,0,0,0,0,...,1,0,0,0,0,0,1,0,0,0
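
To make the encoding step concrete, here is a minimal sketch of what `pandas.get_dummies` does to a tiny, made-up frame. The `sample` data is hypothetical. Passing `dtype=int` is an assumption added here so the result shows 0/1 integers as in the output above, since recent pandas versions return booleans by default.

```python
import pandas

# Hypothetical two-column sample using letter codes like those in the data set
sample = pandas.DataFrame({
    'class': ['p', 'e', 'e'],
    'odor': ['p', 'a', 'l'],
})

# Each distinct value becomes its own column named <column>_<value>.
# dtype=int requests 0/1 integers instead of booleans.
encoded = pandas.get_dummies(sample, dtype=int)
print(encoded)
```

Every row ends up with exactly one 1 per original column, which is why the two class columns always mirror each other.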


# 3. Linear Regression
## Algorithm training and predictions
This section will train a linear regression model to determine if a mushroom is poisonous or edible based on training and testing data.

In [55]:
linearRegModel = sklearn.linear_model.LinearRegression()

The data will be split into:

- mushroomClass: Contains the columns for the class of mushroom (poisonous or edible).
- mushroomAttributes: Contains all the attribute columns with collected data.

In [56]:
mushroomClass = binaryCodedData.iloc[:, 0:2]
mushroomAttributes = binaryCodedData.iloc[:, 2:]

The easiest way to create a training set of data and a test set of data is to use the following function:

    train_test_split()
    
This creates a training subset containing the attributes and their classes, and a test subset containing attributes for the model to predict from along with the corresponding answers. We can compare the predictions against those answers to get an idea of how accurate the linear regression algorithm is.

In [57]:
mushroomAttributesTrain, mushroomAttributesTest, mushroomClassTrain, mushroomClassTest = sklearn.model_selection.train_test_split( 
    mushroomAttributes,
    mushroomClass,
    test_size=0.33, 
    random_state=42)
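
The effect of the split can be checked on a small stand-in frame. This is a sketch under assumptions: the `toy` frame and its four columns are invented for illustration, and only the slicing and the `train_test_split` arguments mirror the cell above.

```python
import pandas
import sklearn.model_selection

# Hypothetical one-hot frame with 100 rows: two class columns, two attribute columns
toy = pandas.DataFrame({
    'class_e': [1, 0] * 50,
    'class_p': [0, 1] * 50,
    'odor_n': [1, 0] * 50,
    'odor_p': [0, 1] * 50,
})
toyClass = toy.iloc[:, 0:2]
toyAttributes = toy.iloc[:, 2:]

toyAttributesTrain, toyAttributesTest, toyClassTrain, toyClassTest = \
    sklearn.model_selection.train_test_split(
        toyAttributes, toyClass, test_size=0.33, random_state=42)

# Roughly two thirds of the rows go to training and one third to testing
print(toyAttributesTrain.shape, toyAttributesTest.shape)

# Attributes and classes stay aligned because both are split on the same rows
print(toyAttributesTrain.index.equals(toyClassTrain.index))
```

Fixing `random_state` makes the shuffle repeatable, so rerunning the notebook produces the same training and test rows.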

Now that the training and test data sets have been created, the _.fit()_ and _.predict()_ functions of the linear regression model can be used with them. The fit function trains the algorithm on the relationship between the mushroom attributes and the classes. The predict function then attempts to determine each mushroom's class from the attributes passed to it. An effective training set should yield an accurate prediction of whether a mushroom is poisonous or edible.

In [58]:
trainingResults = linearRegModel.fit(mushroomAttributesTrain, mushroomClassTrain)

learningResults = linearRegModel.predict(mushroomAttributesTest)
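
As a minimal sketch of the fit and predict cycle, the example below trains on four rows of made-up one-hot data and predicts the remaining two. The `toyAttributes` and `toyClass` frames are hypothetical, and odor is assumed to decide the class outright so the result is easy to read.

```python
import pandas
import sklearn.linear_model

# Hypothetical one-hot data in which odor alone decides the class
toyAttributes = pandas.DataFrame({'odor_n': [1, 0, 1, 0, 1, 0],
                                  'odor_p': [0, 1, 0, 1, 0, 1]})
toyClass = pandas.DataFrame({'class_e': [1, 0, 1, 0, 1, 0],
                             'class_p': [0, 1, 0, 1, 0, 1]})

# Train on the first four rows and predict the last two
toyModel = sklearn.linear_model.LinearRegression()
toyModel.fit(toyAttributes.iloc[:4], toyClass.iloc[:4])
toyPredictions = toyModel.predict(toyAttributes.iloc[4:])

# One row per mushroom, one column per class column in the training target
print(toyPredictions.shape)
print(toyPredictions.round(3))
```

Because the target has two columns, the model produces two continuous values per mushroom rather than a single label.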

For easy viewing, let's structure this data side by side in a data frame.

- class_e: Actual value from the data set (1 if edible, 0 otherwise).
- class_p: Actual value from the data set (1 if poisonous, 0 otherwise).
- Prediction_Edible: Prediction from the linear regression algorithm.
- Prediction_Poisonous: Prediction from the linear regression algorithm.

In [59]:
columnNames = ['Prediction_Edible', 'Prediction_Poisonous']

learningDataStructured = pandas.DataFrame(learningResults, columns=columnNames)
learningDataStructured['Actual Value: Edible'] = mushroomClassTest['class_e'].values
learningDataStructured['Actual Value: Poisonous'] = mushroomClassTest['class_p'].values

display(learningDataStructured.head(10))

Unnamed: 0,Prediction_Edible,Prediction_Poisonous,Actual Value: Edible,Actual Value: Poisonous
0,1.0,3.330669e-16,1,0
1,1.998401e-15,1.0,0,1
2,2.442491e-15,1.0,0,1
3,1.0,-9.992007e-16,1,0
4,1.998401e-15,1.0,0,1
5,1.776357e-15,1.0,0,1
6,-2.220446e-16,1.0,0,1
7,-2.220446e-16,1.0,0,1
8,1.0,3.330669e-16,1,0
9,1.0,6.661338e-16,1,0


In the table above, a prediction of 1 indicates that the algorithm chose that class; a prediction of 0, or an incredibly small number close to zero, indicates that it did not. The values in the actual value columns should match the prediction columns.
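
The comparison can also be turned into a number. The sketch below copies the ten displayed rows by hand, picks the larger of the two predictions as the chosen class, and measures how often that matches the actual class. It covers only these ten rows, not the full test set, and the variable names are illustrative.

```python
import numpy

# The ten displayed predictions: columns are (edible, poisonous)
shownPredictions = numpy.array([
    [1.0, 3.330669e-16],
    [1.998401e-15, 1.0],
    [2.442491e-15, 1.0],
    [1.0, -9.992007e-16],
    [1.998401e-15, 1.0],
    [1.776357e-15, 1.0],
    [-2.220446e-16, 1.0],
    [-2.220446e-16, 1.0],
    [1.0, 3.330669e-16],
    [1.0, 6.661338e-16],
])
# The matching actual values: columns are (edible, poisonous)
shownActual = numpy.array([
    [1, 0], [0, 1], [0, 1], [1, 0], [0, 1],
    [0, 1], [0, 1], [0, 1], [1, 0], [1, 0],
])

# Column index of the larger value: 0 means edible, 1 means poisonous
predictedLabel = shownPredictions.argmax(axis=1)
actualLabel = shownActual.argmax(axis=1)

# Fraction of rows where the chosen class matches the actual class
shownAccuracy = (predictedLabel == actualLabel).mean()
print(shownAccuracy)  # 1.0 for these ten rows
```

Applying the same steps to `learningResults` and `mushroomClassTest.values` would give the accuracy over the whole test set.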

# 4. Attribute coefficients

The _trainingResults_ returned earlier by _.fit()_ hold further information about the trained model, such as coefficients. These coefficients indicate whether each attribute has a strong positive or negative association with the mushroom being edible or poisonous.

In [60]:
#One row of coefficients per class column, one column per attribute
coefficentData = pandas.DataFrame(
    trainingResults.coef_, columns=mushroomAttributes.columns, index=['coefficients_edible', 'coefficients_poisonous'])

display(coefficentData)

Unnamed: 0,cap-shape_b,cap-shape_c,cap-shape_f,cap-shape_k,cap-shape_s,cap-shape_x,cap-surface_f,cap-surface_g,cap-surface_s,cap-surface_y,...,population_s,population_v,population_y,habitat_d,habitat_g,habitat_l,habitat_m,habitat_p,habitat_u,habitat_w
coefficients_edible,-3.391873e-15,1.31839e-14,-2.969847e-15,-1.44329e-15,-3.774758e-15,-2.164935e-15,-0.010582,-0.010582,-0.010582,-0.010582,...,0.058787,0.058787,0.058787,-0.115716,-0.115716,-0.115716,-0.115716,-0.115716,-0.115716,0.384279
coefficients_poisonous,2.421824e-15,-9.65894e-15,1.776357e-15,1.137979e-15,3.587408e-15,1.44329e-15,0.009724,0.009724,0.009724,0.009724,...,-0.054019,-0.054019,-0.054019,0.11201,0.11201,0.11201,0.11201,0.11201,0.11201,-0.387187
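
The layout of this table follows from how a multi-target `LinearRegression` stores its coefficients: `coef_` has one row per target column and one column per attribute. The sketch below shows this on hypothetical data; the `toyAttributes` and `toyClass` frames are invented, and only the way the coefficient frame is labelled mirrors the cell above.

```python
import pandas
import sklearn.linear_model

# Hypothetical one-hot attributes: odor decides the class, bruises is unrelated
toyAttributes = pandas.DataFrame({'odor_n': [1, 0, 1, 0],
                                  'odor_p': [0, 1, 0, 1],
                                  'bruises_t': [1, 1, 0, 0]})
toyClass = pandas.DataFrame({'class_e': [1, 0, 1, 0],
                             'class_p': [0, 1, 0, 1]})

toyModel = sklearn.linear_model.LinearRegression().fit(toyAttributes, toyClass)

# coef_ has shape (number of targets, number of attributes)
print(toyModel.coef_.shape)

# Label rows by target and columns by attribute name
toyCoefficients = pandas.DataFrame(
    toyModel.coef_,
    columns=toyAttributes.columns,
    index=['coefficients_edible', 'coefficients_poisonous'])
print(toyCoefficients.round(3))
```

Taking the column labels from the same frame that was passed to `.fit()` keeps each coefficient matched to the attribute it belongs to.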


The sign of a coefficient indicates the direction of an attribute's association with a class, and its magnitude indicates the strength. However, many of these numbers are incredibly small, on the order of 10^-15 in scientific notation, which is effectively zero. Let's sort the table to find some stronger coefficients.

### Strongest coefficients for poisonous attributes

In [61]:
display(coefficentData.sort_values('coefficients_poisonous', axis=1, ascending=False))

Unnamed: 0,spore-print-color_r,odor_c,ring-type_l,spore-print-color_w,stalk-root_e,ring-number_o,stalk-root_?,stalk-surface-above-ring_y,bruises_t,stalk-surface-below-ring_y,...,odor_n,spore-print-color_u,spore-print-color_k,spore-print-color_y,spore-print-color_b,spore-print-color_o,spore-print-color_n,ring-type_f,bruises_f,ring-number_t
coefficients_edible,-1.817621,-1.589715,-0.821284,-0.817621,-0.747698,-0.50855,-0.497695,-0.422643,-0.184292,-0.177776,...,0.410285,0.432382,0.432382,0.432382,0.432382,0.432382,0.432382,0.46522,0.565706,0.570276
coefficients_poisonous,1.816965,1.582949,0.820207,0.816965,0.740652,0.509379,0.49025,0.418179,0.19956,0.178864,...,-0.417051,-0.433437,-0.433437,-0.433437,-0.433437,-0.433437,-0.433437,-0.456658,-0.550038,-0.569221


### Strongest coefficients for edible attributes

In [62]:
display(coefficentData.sort_values('coefficients_edible', axis=1, ascending=False))

Unnamed: 0,ring-number_t,bruises_f,ring-type_f,spore-print-color_n,spore-print-color_o,spore-print-color_b,spore-print-color_y,spore-print-color_u,spore-print-color_k,odor_n,...,spore-print-color_h,bruises_t,stalk-surface-above-ring_y,stalk-root_?,ring-number_o,stalk-root_e,spore-print-color_w,ring-type_l,odor_c,spore-print-color_r
coefficients_edible,0.570276,0.565706,0.46522,0.432382,0.432382,0.432382,0.432382,0.432382,0.432382,0.410285,...,-0.18295,-0.184292,-0.422643,-0.497695,-0.50855,-0.747698,-0.817621,-0.821284,-1.589715,-1.817621
coefficients_poisonous,-0.569221,-0.550038,-0.456658,-0.433437,-0.433437,-0.433437,-0.433437,-0.433437,-0.433437,-0.417051,...,0.172429,0.19956,0.418179,0.49025,0.509379,0.740652,0.816965,0.820207,1.582949,1.816965


Judging by the positive coefficients above, there are very strong indicators that a mushroom is **poisonous**:

- spore-print-color_r (+1.81)
- odor_c (+1.58)
- ring-type_l (+0.82)
- spore-print-color_w (+0.81)
- stalk-root_e (+0.74)

It also appears that there are somewhat strong indicators of a mushroom that is **edible**, such as the following:

- ring-number_t (+0.57)
- bruises_f (+0.56)
- ring-type_f (+0.46)
- spore-print-color_n (+0.43)
    - Also spore print colors o, b, y, u, and k (all +0.43).
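
Since both large positive and large negative coefficients carry information, it can help to rank attributes by absolute value rather than by signed value. The sketch below does this for a hand-copied subset of the poisonous coefficients displayed above; the variable names are illustrative, and only eight of the attributes are included.

```python
import pandas

# Subset of the poisonous coefficients, copied from the sorted table above
poisonousCoefficients = pandas.Series({
    'spore-print-color_r': 1.816965,
    'odor_c': 1.582949,
    'ring-type_l': 0.820207,
    'spore-print-color_w': 0.816965,
    'stalk-root_e': 0.740652,
    'ring-type_f': -0.456658,
    'bruises_f': -0.550038,
    'ring-number_t': -0.569221,
})

# Order the attributes by magnitude while keeping each coefficient's sign
magnitudeOrder = poisonousCoefficients.abs().sort_values(ascending=False).index
rankedCoefficients = poisonousCoefficients.reindex(magnitudeOrder)
print(rankedCoefficients)
```

In this ordering the strongest edible indicators, which have negative poisonous coefficients, appear directly after the strongest poisonous ones instead of at the far end of the table.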

### Remarks
The coefficients above show that several properties can indicate a poisonous mushroom, such as certain spore print colors, odors, ring types, and stalk roots. There are also indicators of an edible mushroom, including ring number and type, bruising, and spore print colors, as noted above. Given how accurately the linear model predicted the test set before the coefficients were examined, these should be accurate indicators of whether a mushroom is poisonous or edible, at least within the provided data set.