## **Fundamentals of Machine Learning: Final project**

The dataset contains simulated mushrooms for binary classification into edible and poisonous. It is available at the following link:

https://archive.ics.uci.edu/dataset/848/secondary+mushroom+dataset

## Dataset import and Data cleaning

Import libraries:

In [70]:
import pandas as pd
import numpy as np
import re

In [71]:
# Load CSV file into DataFrame
pd.set_option('display.max_columns', None)
mushrooms_train_df = pd.read_csv('primary_data.csv', delimiter=';')

In [72]:
mushrooms_train_df.head(20)

Unnamed: 0,family,name,class,cap-diameter,cap-shape,Cap-surface,cap-color,does-bruise-or-bleed,gill-attachment,gill-spacing,gill-color,stem-height,stem-width,stem-root,stem-surface,stem-color,veil-type,veil-color,has-ring,ring-type,Spore-print-color,habitat,season
0,Amanita Family,Fly Agaric,p,"[10, 20]","[x, f]","[g, h]","[e, o]",[f],[e],,[w],"[15, 20]","[15, 20]",[s],[y],[w],[u],[w],[t],"[g, p]",,[d],"[u, a, w]"
1,Amanita Family,Panther Cap,p,"[5, 10]","[p, x]",[g],[n],[f],[e],,[w],"[6, 10]","[10, 20]",,[y],[w],[u],[w],[t],[p],,[d],"[u, a]"
2,Amanita Family,False Panther Cap,p,"[10, 15]","[x, f]",,"[g, n]",[f],[e],,[w],"[10, 12]","[10, 20]",,,[w],[u],[w],[t],"[e, g]",,[d],"[u, a]"
3,Amanita Family,The Blusher,e,"[5, 15]","[x, f]",,[n],[t],,,[w],"[7, 15]","[10, 25]",[b],,[w],[u],[w],[t],[g],,[d],"[u, a]"
4,Amanita Family,Death Cap,p,"[5, 12]","[x, f]",[h],[r],[f],,[c],[w],"[10, 12]","[10, 20]",,,[w],[u],[w],[t],"[g, p]",,[d],"[u, a]"
5,Amanita Family,False Death Cap,e,"[4, 9]",[x],,"[w, y]",[f],[e],,[w],"[5, 7]","[10, 15]",[b],,"[w, y]",[u],"[y, w]",[t],[g],,[d],"[u, a]"
6,Amanita Family,Destroying Angel,p,"[5, 10]",[b],[t],[w],[f],[e],[c],[w],"[10, 15]","[10, 15]",,[y],[w],[u],[w],[t],"[l, e]",,[d],"[u, a]"
7,Amanita Family,Tawny Grisette,e,"[4, 8]","[c, x]","[h, t]",[n],[f],[e],,[w],"[10, 15]","[10, 15]",,[s],"[w, n]",[u],[w],[f],[f],,[d],"[u, a]"
8,Lepiota Family,Parasol Mushroom,e,"[10, 25]","[p, f]",[y],"[w, n]",[f],,,[w],"[15, 35]","[15, 25]",[s],,[n],,,[t],[m],,"[m, d]","[u, a]"
9,Lepiota Family,Shaggy Parasol,e,"[12, 18]",[x],"[e, y]",[n],[t],[e],,[w],"[8, 12]","[15, 20]",,,[w],,,[t],,,"[g, d]","[u, a]"


The preview already shows many NaN values, which I explore later.

In [73]:
# dataset dimensions
mushrooms_train_df.shape

(173, 23)

In [74]:
# information about the variables
mushrooms_train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 173 entries, 0 to 172
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   family                173 non-null    object
 1   name                  173 non-null    object
 2   class                 173 non-null    object
 3   cap-diameter          173 non-null    object
 4   cap-shape             173 non-null    object
 5   Cap-surface           133 non-null    object
 6   cap-color             173 non-null    object
 7   does-bruise-or-bleed  173 non-null    object
 8   gill-attachment       145 non-null    object
 9   gill-spacing          102 non-null    object
 10  gill-color            173 non-null    object
 11  stem-height           173 non-null    object
 12  stem-width            173 non-null    object
 13  stem-root             27 non-null     object
 14  stem-surface          65 non-null     object
 15  stem-color            173 non-null    ob

All the variables are stored with dtype `object`, so at this point they are all treated as categorical.


The variables 'cap-diameter', 'stem-height' and 'stem-width' are numeric ranges. I want to convert them to numeric variables by taking the mean of each range, while keeping null values unchanged.


In [75]:
def from_interval_to_value(value):
    if pd.isna(value):  # keep NaN
        return np.nan
    if '[' in value and ']' in value:  # check whether it is an interval
        numbers = [float(x) for x in value.strip('[]').split(',')]  # parse the bounds
        return sum(numbers) / len(numbers)  # mean of the bounds
    return value  # return the original value if it is not an interval

In [76]:
columns_with_intervals = ['cap-diameter', 'stem-height', 'stem-width']
for col in columns_with_intervals:
    mushrooms_train_df[col] = mushrooms_train_df[col].apply(from_interval_to_value)

In [77]:
mushrooms_train_df.head(20).T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
family,Amanita Family,Amanita Family,Amanita Family,Amanita Family,Amanita Family,Amanita Family,Amanita Family,Amanita Family,Lepiota Family,Lepiota Family,Lepiota Family,Tricholoma Family,Tricholoma Family,Tricholoma Family,Tricholoma Family,Tricholoma Family,Tricholoma Family,Tricholoma Family,Tricholoma Family,Tricholoma Family
name,Fly Agaric,Panther Cap,False Panther Cap,The Blusher,Death Cap,False Death Cap,Destroying Angel,Tawny Grisette,Parasol Mushroom,Shaggy Parasol,Stinking Parasol,Saffron Parasol,The Deceiver,Amethyst Deceiver,Wood Blewit,Field Blewit,Clouded Agaric,Club-footed Funnel Cap,Common Funnel Cap,Aniseed Funnel Cap
class,p,p,p,e,p,e,p,e,e,e,p,p,e,e,e,e,e,p,e,e
cap-diameter,15.0,7.5,12.5,10.0,8.5,6.5,7.5,6.0,17.5,15.0,3.5,3.5,2.5,2.5,10.0,8.5,14.0,6.0,6.0,4.5
cap-shape,"[x, f]","[p, x]","[x, f]","[x, f]","[x, f]",[x],[b],"[c, x]","[p, f]",[x],"[b, f]",[x],"[f, s]",[x],"[f, s]","[x, f, s]","[x, f, s]","[x, f]",[s],[x]
Cap-surface,"[g, h]",[g],,,[h],,[t],"[h, t]",[y],"[e, y]",[y],[y],[y],,,,[e],,,
cap-color,"[e, o]",[n],"[g, n]",[n],[r],"[w, y]",[w],[n],"[w, n]",[n],"[e, n, p, w]","[y, n]",[n],"[b, u]","[l, u, g, n]","[g, n]","[g, n]","[g, n]",[n],"[l, r, w]"
does-bruise-or-bleed,[f],[f],[f],[t],[f],[f],[f],[f],[f],[t],[f],[f],[f],[f],[f],[f],[f],[f],[f],[f]
gill-attachment,[e],[e],[e],,,[e],[e],[e],,[e],[e],[a],"[a, d]",,[s],[s],"[a, d]",[d],[d],"[a, d]"
gill-spacing,,,,,[c],,[c],,,,[c],,[d],[d],[c],[c],[c],[c],[c],



I converted the intervals to numerical values to make them easier to work with: each interval, given as a minimum and a maximum, is now a single value, the mean of the two.
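As a quick sanity check of the conversion, here is a minimal sketch that computes the mean of a few interval strings copied from the `cap-diameter` column in the first preview. The helper name `interval_midpoint` is hypothetical and only mirrors the idea of `from_interval_to_value` for string inputs.

```python
def interval_midpoint(cell: str) -> float:
    """Return the mean of the numbers in a bracketed string such as '[10, 20]'."""
    numbers = [float(part) for part in cell.strip("[]").split(",")]
    return sum(numbers) / len(numbers)


# Sample cells copied from the cap-diameter column of the first preview
samples = ["[10, 20]", "[5, 10]", "[10, 15]"]
midpoints = [interval_midpoint(s) for s in samples]
print(midpoints)  # [15.0, 7.5, 12.5]
```

These values match the first three `cap_diameter` entries in the transposed preview. One trade-off of this choice is that the width of each range is lost, since only the midpoint is kept.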



In [78]:
mushrooms_train_df.columns

Index(['family', 'name', 'class', 'cap-diameter', 'cap-shape', 'Cap-surface',
       'cap-color', 'does-bruise-or-bleed', 'gill-attachment', 'gill-spacing',
       'gill-color', 'stem-height', 'stem-width', 'stem-root', 'stem-surface',
       'stem-color', 'veil-type', 'veil-color', 'has-ring', 'ring-type',
       'Spore-print-color', 'habitat', 'season'],
      dtype='object')

`Cap-surface` and `Spore-print-color` start with an upper-case letter, so I convert all column names to lower case. I also replace `-` with `_`.

In [79]:
mushrooms_train_df.columns = mushrooms_train_df.columns.map(lambda x: x.lower().replace('-', '_'))

In [80]:
# check the column names
mushrooms_train_df.columns

Index(['family', 'name', 'class', 'cap_diameter', 'cap_shape', 'cap_surface',
       'cap_color', 'does_bruise_or_bleed', 'gill_attachment', 'gill_spacing',
       'gill_color', 'stem_height', 'stem_width', 'stem_root', 'stem_surface',
       'stem_color', 'veil_type', 'veil_color', 'has_ring', 'ring_type',
       'spore_print_color', 'habitat', 'season'],
      dtype='object')
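The same renaming can also be written with the vectorized string methods of the column index. The sketch below uses a small toy frame (`toy_df` is a hypothetical name, not part of the notebook) with three of the original column names.

```python
import pandas as pd

# Toy frame that only carries a few of the original column names
toy_df = pd.DataFrame(columns=["Cap-surface", "cap-diameter", "Spore-print-color"])

# Lower-case every name, then swap hyphens for underscores (literal replace, no regex)
toy_df.columns = toy_df.columns.str.lower().str.replace("-", "_", regex=False)

print(list(toy_df.columns))  # ['cap_surface', 'cap_diameter', 'spore_print_color']
```

Names without hyphens are also easier to use later, for example in `DataFrame.query` expressions, where a hyphen would otherwise be read as a minus sign.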


In columns where a cell contains only one value inside the square brackets, I remove the brackets and keep only the value. To do this, I define and then apply the following function:


In [81]:
def clean_range_values(x):
    # Check if the value is NaN, if so return it as is
    if pd.isna(x):
        return x

    # Check if the value contains square brackets
    match = re.match(r"\[([^\]]+)\]", str(x))
    if match:
        # If there is only one value inside the brackets, return it without brackets
        values = match.group(1).split(',')  # Split the value by commas if needed
        if len(values) == 1:  # Only remove brackets if there is a single value
            return values[0]

    # Return the value unchanged if no match or if there are multiple values inside the brackets
    return x
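To make the intended behaviour concrete, here is a small standalone variant for plain strings. It is a sketch, not the notebook's function: the name `strip_single_brackets` and the stricter pattern (which rejects commas directly in the regex) are assumptions made for illustration.

```python
import re

# Brackets around exactly one token that contains no comma
SINGLE_VALUE = re.compile(r"\[([^\],]+)\]")


def strip_single_brackets(cell: str) -> str:
    """Return the inner value for '[f]'-style cells, leave everything else as is."""
    match = SINGLE_VALUE.fullmatch(cell)
    return match.group(1).strip() if match else cell


examples = ["[f]", "[g, h]", "Amanita Family"]
cleaned = [strip_single_brackets(c) for c in examples]
print(cleaned)  # ['f', '[g, h]', 'Amanita Family']
```

Single-value cells lose their brackets, while multi-value cells and plain text pass through unchanged.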


I make a copy of the original DataFrame and apply the function to it to see the result:

In [82]:
mushrooms_train_df1 = mushrooms_train_df.copy()
# Note: applymap is deprecated since pandas 2.1; DataFrame.map is its replacement there
mushroom_train_df1_cleaned = mushrooms_train_df1.applymap(clean_range_values)
mushroom_train_df1_cleaned.head(20)



Unnamed: 0,family,name,class,cap_diameter,cap_shape,cap_surface,cap_color,does_bruise_or_bleed,gill_attachment,gill_spacing,gill_color,stem_height,stem_width,stem_root,stem_surface,stem_color,veil_type,veil_color,has_ring,ring_type,spore_print_color,habitat,season
0,Amanita Family,Fly Agaric,p,15.0,"[x, f]","[g, h]","[e, o]",f,e,,w,17.5,17.5,s,y,w,u,w,t,"[g, p]",,d,"[u, a, w]"
1,Amanita Family,Panther Cap,p,7.5,"[p, x]",g,n,f,e,,w,8.0,15.0,,y,w,u,w,t,p,,d,"[u, a]"
2,Amanita Family,False Panther Cap,p,12.5,"[x, f]",,"[g, n]",f,e,,w,11.0,15.0,,,w,u,w,t,"[e, g]",,d,"[u, a]"
3,Amanita Family,The Blusher,e,10.0,"[x, f]",,n,t,,,w,11.0,17.5,b,,w,u,w,t,g,,d,"[u, a]"
4,Amanita Family,Death Cap,p,8.5,"[x, f]",h,r,f,,c,w,11.0,15.0,,,w,u,w,t,"[g, p]",,d,"[u, a]"
5,Amanita Family,False Death Cap,e,6.5,x,,"[w, y]",f,e,,w,6.0,12.5,b,,"[w, y]",u,"[y, w]",t,g,,d,"[u, a]"
6,Amanita Family,Destroying Angel,p,7.5,b,t,w,f,e,c,w,12.5,12.5,,y,w,u,w,t,"[l, e]",,d,"[u, a]"
7,Amanita Family,Tawny Grisette,e,6.0,"[c, x]","[h, t]",n,f,e,,w,12.5,12.5,,s,"[w, n]",u,w,f,f,,d,"[u, a]"
8,Lepiota Family,Parasol Mushroom,e,17.5,"[p, f]",y,"[w, n]",f,,,w,25.0,20.0,s,,n,,,t,m,,"[m, d]","[u, a]"
9,Lepiota Family,Shaggy Parasol,e,15.0,x,"[e, y]",n,t,e,,w,10.0,17.5,,,w,,,t,,,"[g, d]","[u, a]"


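It worked on the copy: single-value cells lose their brackets and multi-value cells are left unchanged. Open question: is this result acceptable? If so, the same transformation can be applied to the original dataset.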

In [83]:
mushrooms_train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 173 entries, 0 to 172
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   family                173 non-null    object 
 1   name                  173 non-null    object 
 2   class                 173 non-null    object 
 3   cap_diameter          173 non-null    float64
 4   cap_shape             173 non-null    object 
 5   cap_surface           133 non-null    object 
 6   cap_color             173 non-null    object 
 7   does_bruise_or_bleed  173 non-null    object 
 8   gill_attachment       145 non-null    object 
 9   gill_spacing          102 non-null    object 
 10  gill_color            173 non-null    object 
 11  stem_height           173 non-null    float64
 12  stem_width            173 non-null    float64
 13  stem_root             27 non-null     object 
 14  stem_surface          65 non-null     object 
 15  stem_color            1

In [84]:
print(mushrooms_train_df.isna().sum().sum())  # total number of missing cells
print(mushrooms_train_df.duplicated().sum())  # number of fully duplicated rows

871
0


There are no duplicates.

Check the null values per column:

In [85]:
mushrooms_train_df.isnull().sum()

Unnamed: 0,0
family,0
name,0
class,0
cap_diameter,0
cap_shape,0
cap_surface,40
cap_color,0
does_bruise_or_bleed,0
gill_attachment,28
gill_spacing,71


The numeric variables have no null values.

Based on this information:<br>
- there are a lot of NaN values in the columns **stem_root**, **stem_surface**, **veil_type**, **veil_color**, **spore_print_color**. Whether to drop them is still an open question, since they may be useful for later analysis
- **replace** the null values in the other object variables with their **mode**, i.e. the value that occurs most often in the variable, so that the distribution does not change much
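One way to inform the drop-or-keep decision is to look at the share of missing values per column. The sketch below uses a toy frame; the 0.5 cut-off (`THRESHOLD`) is a hypothetical value chosen only for illustration, not a recommendation from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with one mostly-empty column and one lightly-affected column
toy_df = pd.DataFrame({
    "stem_root": ["[s]", np.nan, np.nan, np.nan],
    "cap_surface": ["[g]", "[h]", np.nan, "[t]"],
    "cap_color": ["[n]", "[w]", "[r]", "[n]"],
})

# isna() gives booleans, so the column mean is the fraction of missing cells
missing_share = toy_df.isna().mean()

THRESHOLD = 0.5  # hypothetical cut-off, an assumption for this sketch
mostly_missing = missing_share[missing_share > THRESHOLD].index.tolist()

print(missing_share)
print(mostly_missing)  # ['stem_root']
```

Columns above the cut-off are candidates for dropping or for a dedicated "missing" category, while the others can reasonably be imputed.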


In [86]:
cap_surface_mode = mushrooms_train_df['cap_surface'].mode()[0]
mushrooms_train_df['cap_surface'] = mushrooms_train_df['cap_surface'].fillna(cap_surface_mode)
gill_attachment_mode = mushrooms_train_df['gill_attachment'].mode()[0]
mushrooms_train_df['gill_attachment'] = mushrooms_train_df['gill_attachment'].fillna(gill_attachment_mode)
gill_spacing_mode = mushrooms_train_df['gill_spacing'].mode()[0]  # mode of gill_spacing itself, not cap_surface
mushrooms_train_df['gill_spacing'] = mushrooms_train_df['gill_spacing'].fillna(gill_spacing_mode)
ring_type_mode = mushrooms_train_df['ring_type'].mode()[0]
mushrooms_train_df['ring_type'] = mushrooms_train_df['ring_type'].fillna(ring_type_mode)
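Because the same two steps are repeated for every column, a loop is a possible alternative that rules out picking the mode from the wrong column. This is a minimal sketch on a toy frame (`toy_df` and its values are made up for illustration).

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column
toy_df = pd.DataFrame({
    "cap_surface": ["[y]", np.nan, "[y]", "[g]"],
    "gill_spacing": ["[c]", "[c]", np.nan, "[d]"],
})

for col in ["cap_surface", "gill_spacing"]:
    col_mode = toy_df[col].mode()[0]          # most frequent non-null value of this column
    toy_df[col] = toy_df[col].fillna(col_mode)

print(toy_df)
```

Each column is filled with its own most frequent value, so no NaN remains in the listed columns.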


In [87]:
mushrooms_train_df.isnull().sum()

Unnamed: 0,0
family,0
name,0
class,0
cap_diameter,0
cap_shape,0
cap_surface,0
cap_color,0
does_bruise_or_bleed,0
gill_attachment,0
gill_spacing,0
