# **Applied Machine Learning Midterm Project**

GitHub link to Jupyter notebook: https://github.com/evandobler98/ml_midterm/blob/main/ml_midterm.ipynb

GitHub link to my peer review: https://github.com/evandobler98/ml_midterm/blob/main/peer_review.md

## **Author:** Evan Dobler
## **Date:** 4/17/2025

## **Introduction:** 
This project uses the UCI Mushroom dataset to predict whether a mushroom is edible or poisonous based on its physical characteristics. These characteristics include cap-shape, cap-surface, cap-color, bruises, odor, gill-attachment, gill-spacing, gill-size, and gill-color, among others.

In [32]:
# Import Libraries
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
from ucimlrepo import fetch_ucirepo 


#### Section 1: Import and Inspect the Data

1.1 Load the dataset and display the first 10 rows.

In [33]:
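# Load the dataset from the UCI repository; the raw file has no header row
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data"
df = pd.read_csv(url, header=None)
# Show the first 10 rows; columns are numbered 0-22 until names are assigned
print(df.head(10))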



  0  1  2  3  4  5  6  7  8  9   ... 13 14 15 16 17 18 19 20 21 22
0  p  x  s  n  t  p  f  c  n  k  ...  s  w  w  p  w  o  p  k  s  u
1  e  x  s  y  t  a  f  c  b  k  ...  s  w  w  p  w  o  p  n  n  g
2  e  b  s  w  t  l  f  c  b  n  ...  s  w  w  p  w  o  p  n  n  m
3  p  x  y  w  t  p  f  c  n  n  ...  s  w  w  p  w  o  p  k  s  u
4  e  x  s  g  f  n  f  w  b  k  ...  s  w  w  p  w  o  e  n  a  g
5  e  x  y  y  t  a  f  c  b  n  ...  s  w  w  p  w  o  p  k  n  g
6  e  b  s  w  t  a  f  c  b  g  ...  s  w  w  p  w  o  p  k  n  m
7  e  b  y  w  t  l  f  c  b  n  ...  s  w  w  p  w  o  p  n  s  m
8  p  x  y  w  t  p  f  c  n  p  ...  s  w  w  p  w  o  p  k  v  g
9  e  b  s  y  t  a  f  c  b  g  ...  s  w  w  p  w  o  p  k  s  m

[10 rows x 23 columns]


In [34]:
# Add column names to the data 
columns = [
    "class", "cap-shape", "cap-surface", "cap-color", "bruises", "odor",
    "gill-attachment", "gill-spacing", "gill-size", "gill-color", "stalk-shape",
    "stalk-root", "stalk-surface-above-ring", "stalk-surface-below-ring",
    "stalk-color-above-ring", "stalk-color-below-ring", "veil-type", "veil-color",
    "ring-number", "ring-type", "spore-print-color", "population", "habitat"
]
df.columns = columns
df.head(10)

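| | class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | p | x | s | n | t | p | f | c | n | k | ... | s | w | w | p | w | o | p | k | s | u |
| 1 | e | x | s | y | t | a | f | c | b | k | ... | s | w | w | p | w | o | p | n | n | g |
| 2 | e | b | s | w | t | l | f | c | b | n | ... | s | w | w | p | w | o | p | n | n | m |
| 3 | p | x | y | w | t | p | f | c | n | n | ... | s | w | w | p | w | o | p | k | s | u |
| 4 | e | x | s | g | f | n | f | w | b | k | ... | s | w | w | p | w | o | e | n | a | g |
| 5 | e | x | y | y | t | a | f | c | b | n | ... | s | w | w | p | w | o | p | k | n | g |
| 6 | e | b | s | w | t | a | f | c | b | g | ... | s | w | w | p | w | o | p | k | n | m |
| 7 | e | b | y | w | t | l | f | c | b | n | ... | s | w | w | p | w | o | p | n | s | m |
| 8 | p | x | y | w | t | p | f | c | n | p | ... | s | w | w | p | w | o | p | k | v | g |
| 9 | e | b | s | y | t | a | f | c | b | g | ... | s | w | w | p | w | o | p | k | s | m |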
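
Loading the file without a header and then assigning names can also be done in a single `read_csv` call. The following is a minimal sketch, not the notebook's own code: it assumes a small inline sample (`sample`) in place of the real URL and a shortened, hypothetical column list (`short_columns`).

```python
import io

import pandas as pd

# Hypothetical three-row, four-column sample standing in for the real file
sample = io.StringIO("p,x,s,n\ne,x,s,y\ne,b,s,w\n")
short_columns = ["class", "cap-shape", "cap-surface", "cap-color"]

# names= assigns the column labels at read time,
# so no separate df.columns assignment is needed afterwards
sample_df = pd.read_csv(sample, header=None, names=short_columns)
print(sample_df.head())
```

With the real data, the same pattern would pass the notebook's `url` and the full 23-name `columns` list.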


1.2 Check for missing values and display summary statistics.

In [35]:
# Check for missing values
df.isnull().sum()

class                       0
cap-shape                   0
cap-surface                 0
cap-color                   0
bruises                     0
odor                        0
gill-attachment             0
gill-spacing                0
gill-size                   0
gill-color                  0
stalk-shape                 0
stalk-root                  0
stalk-surface-above-ring    0
stalk-surface-below-ring    0
stalk-color-above-ring      0
stalk-color-below-ring      0
veil-type                   0
veil-color                  0
ring-number                 0
ring-type                   0
spore-print-color           0
population                  0
habitat                     0
dtype: int64
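
One caveat on the zero counts above: `isnull()` only detects true `NaN` values. The UCI documentation for this dataset describes missing values coded as the character `?`, so a clean `isnull()` result does not rule them out. Below is a minimal sketch of a placeholder check, using a small hypothetical frame (`demo`) rather than the real data.

```python
import pandas as pd

# Hypothetical mini-frame; in the notebook, df would be the full mushroom DataFrame
demo = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "stalk-root": ["e", "?", "c", "?"],
    "odor": ["p", "a", "l", "p"],
})

# isnull() only sees real NaN values, so the "?" codes go unnoticed here
print(demo.isnull().sum())

# Count "?" placeholders per column instead
placeholder_counts = (demo == "?").sum()
print(placeholder_counts[placeholder_counts > 0])
```

Running the same comparison on `df` would show whether any column in the real data uses the placeholder.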

In [36]:
# Summary of column names, non-null counts, and dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   class                     8124 non-null   object
 1   cap-shape                 8124 non-null   object
 2   cap-surface               8124 non-null   object
 3   cap-color                 8124 non-null   object
 4   bruises                   8124 non-null   object
 5   odor                      8124 non-null   object
 6   gill-attachment           8124 non-null   object
 7   gill-spacing              8124 non-null   object
 8   gill-size                 8124 non-null   object
 9   gill-color                8124 non-null   object
 10  stalk-shape               8124 non-null   object
 11  stalk-root                8124 non-null   object
 12  stalk-surface-above-ring  8124 non-null   object
 13  stalk-surface-below-ring  8124 non-null   object
 14  stalk-color-above-ring  
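
`df.info()` reports structure rather than statistics. For all-categorical data like this, `describe()` gives the count, number of unique values, most frequent value, and its frequency for each column. This is a minimal sketch on a hypothetical three-column frame (`demo`); the values are illustrative, not taken from the real dataset.

```python
import pandas as pd

# Hypothetical mini-frame with only categorical (object) columns
demo = pd.DataFrame({
    "class": ["p", "e", "e", "p", "e"],
    "odor": ["p", "a", "l", "p", "n"],
    "veil-type": ["p", "p", "p", "p", "p"],
})

# For object columns, describe() returns count, unique, top, and freq
summary = demo.describe()
print(summary)

# Columns with a single unique value cannot help a classifier distinguish classes
constant_cols = [col for col in demo.columns if demo[col].nunique() == 1]
print(constant_cols)
```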

Reflection 1: What do you notice about the dataset? Are there any data issues?

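One thing I noticed about this dataset is how many columns there are: 23 in total, the class label plus 22 features, all stored as single-letter categorical codes. It will be tough to decide which ones have the largest impact on the prediction. `isnull()` reports no missing values in any of the 8,124 rows, so the data looks clean at first glance, although the UCI documentation notes that missing values are coded as `?`, which `isnull()` does not detect.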
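
One simple way to get a first impression of which columns matter is to score each feature on its own: predict the majority class within each category of the feature and measure how often that is right. This is a rough screen in the spirit of a one-rule classifier, not the method used later in the notebook. The helper name `single_feature_accuracy` and the `demo` frame are hypothetical.

```python
import pandas as pd

# Hypothetical mini-frame; in the notebook, df would be the full mushroom DataFrame
demo = pd.DataFrame({
    "class": ["p", "e", "e", "p", "e", "p"],
    "odor": ["f", "a", "a", "f", "n", "f"],
    "cap-shape": ["x", "x", "b", "b", "x", "b"],
})


def single_feature_accuracy(frame, feature, target="class"):
    """Accuracy of predicting the majority class within each category of `feature`."""
    counts = pd.crosstab(frame[feature], frame[target])
    # For each category, the majority class gets counts.max(axis=1) rows right
    return counts.max(axis=1).sum() / len(frame)


scores = {
    col: single_feature_accuracy(demo, col)
    for col in demo.columns
    if col != "class"
}
ranking = pd.Series(scores).sort_values(ascending=False)
print(ranking)
```

A score near 1.0 means the feature alone nearly separates the classes, while a score near the majority-class share means it adds little by itself. The scores are computed on the same rows they are evaluated on, so they are only a rough guide.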

#### Section 2: Data Exploration and Preparation

2.1 Explore Data Patterns and Distributions