# Assignment 4: Pipelines and Hyperparameter Tuning (32 total marks)
### Due: November 22 at 11:59pm

### Name: Ehsan Liaqat

### In this assignment, you will be putting together everything you have learned so far. You will need to find your own dataset, do all the appropriate preprocessing, test different supervised learning models and evaluate the results. More details for each step can be found below.

### You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Import Libraries

In [197]:
import numpy as np
import pandas as pd
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix, make_scorer, f1_score

#pd.set_option('display.max_rows',None)
pd.set_option('display.max_columns',None)


## Step 1: Data Input (4 marks)

Import the dataset you will be using. You can download the dataset onto your computer and read it in using pandas, or download it directly from the website. Answer the questions below about the dataset you selected. 

To find a dataset, you can use the resources listed in the notes. The dataset can be numerical, categorical, text-based or mixed. If you want help finding a particular dataset related to your interests, please email the instructor.

**You cannot use a dataset that was used for a previous assignment or in class**

In [198]:
# Import dataset (1 mark)

from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
mushroom = fetch_ucirepo(id=73) 
  
# data (as pandas dataframes) 
X = mushroom.data.features 
y = mushroom.data.targets 
  
# metadata 
#print(mushroom.metadata) 
  
# variable information 
#print(mushroom.variables) 

print(X.shape)
print(y.shape)
print(type(X))
print(type(y))
print("")
print(X.dtypes)
print(y.dtypes)

df = pd.concat([X, y], axis='columns')
df.head()

(8124, 22)
(8124, 1)
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>

cap-shape                   object
cap-surface                 object
cap-color                   object
bruises                     object
odor                        object
gill-attachment             object
gill-spacing                object
gill-size                   object
gill-color                  object
stalk-shape                 object
stalk-root                  object
stalk-surface-above-ring    object
stalk-surface-below-ring    object
stalk-color-above-ring      object
stalk-color-below-ring      object
veil-type                   object
veil-color                  object
ring-number                 object
ring-type                   object
spore-print-color           object
population                  object
habitat                     object
dtype: object
poisonous    object
dtype: object


Unnamed: 0,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,stalk-shape,stalk-root,stalk-surface-above-ring,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat,poisonous
0,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u,p
1,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g,e
2,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m,e
3,x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u,p
4,x,s,g,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g,e


### Questions (3 marks)

1. (1 mark) What is the source of your dataset?
I used UC Irvine website as the source of my dataset. Mushroom. (1987). UCI Machine Learning Repository. https://doi.org/10.24432/C5959T. It corresponds to 23 species of gilled mushrooms.
1. (1 mark) Why did you pick this particular dataset?
The dataset is good training a model on determining which mushrooms are poisonous and which are edible, which can prevent food sickness and injuries to consumers who consume mushrooms in their foods. Develops models can help us detect poisonous mushrooms effectively.
1. (1 mark) Was there anything challenging about finding a dataset that you wanted to use?
Some datasets are not clearly labelled and some have excessive missing values, which makes using them hard.

*ANSWER HERE*

## Step 2: Data Processing (5 marks)

The next step is to process your data. Implement the following steps as needed.

In [199]:
# Clean data (if needed)

df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   cap-shape                 8124 non-null   object
 1   cap-surface               8124 non-null   object
 2   cap-color                 8124 non-null   object
 3   bruises                   8124 non-null   object
 4   odor                      8124 non-null   object
 5   gill-attachment           8124 non-null   object
 6   gill-spacing              8124 non-null   object
 7   gill-size                 8124 non-null   object
 8   gill-color                8124 non-null   object
 9   stalk-shape               8124 non-null   object
 10  stalk-root                5644 non-null   object
 11  stalk-surface-above-ring  8124 non-null   object
 12  stalk-surface-below-ring  8124 non-null   object
 13  stalk-color-above-ring    8124 non-null   object
 14  stalk-color-below-ring  

In [200]:

print("Total Nulls in data: ", df.isnull().sum().sum())

print("Total Null % By Each Column: ")

column_nulls = df.isnull().sum() / len(df)
print(column_nulls*100)

# print(df['stalk-root'])

Total Nulls in data:  2480
Total Null % By Each Column: 
cap-shape                    0.000000
cap-surface                  0.000000
cap-color                    0.000000
bruises                      0.000000
odor                         0.000000
gill-attachment              0.000000
gill-spacing                 0.000000
gill-size                    0.000000
gill-color                   0.000000
stalk-shape                  0.000000
stalk-root                  30.526834
stalk-surface-above-ring     0.000000
stalk-surface-below-ring     0.000000
stalk-color-above-ring       0.000000
stalk-color-below-ring       0.000000
veil-type                    0.000000
veil-color                   0.000000
ring-number                  0.000000
ring-type                    0.000000
spore-print-color            0.000000
population                   0.000000
habitat                      0.000000
poisonous                    0.000000
dtype: float64


In [201]:
# df.drop(columns='stalk-root', inplace=True)
df.head()

Unnamed: 0,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,stalk-shape,stalk-root,stalk-surface-above-ring,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat,poisonous
0,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u,p
1,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g,e
2,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m,e
3,x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u,p
4,x,s,g,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g,e


In [None]:
# Implement preprocessing steps. Remember to use ColumnTransformer if more than one preprocessing method is needed

In [202]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   cap-shape                 8124 non-null   object
 1   cap-surface               8124 non-null   object
 2   cap-color                 8124 non-null   object
 3   bruises                   8124 non-null   object
 4   odor                      8124 non-null   object
 5   gill-attachment           8124 non-null   object
 6   gill-spacing              8124 non-null   object
 7   gill-size                 8124 non-null   object
 8   gill-color                8124 non-null   object
 9   stalk-shape               8124 non-null   object
 10  stalk-root                5644 non-null   object
 11  stalk-surface-above-ring  8124 non-null   object
 12  stalk-surface-below-ring  8124 non-null   object
 13  stalk-color-above-ring    8124 non-null   object
 14  stalk-color-below-ring  

In [203]:
df = df.astype({col: 'category' for col in df.columns})
df.info()

# print(df.columns)

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer,  make_column_selector(dtype_include="category")), 
        ("onehot", OneHotEncoder(sparse_output=False), make_column_selector(dtype_include="category"))])

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   cap-shape                 8124 non-null   category
 1   cap-surface               8124 non-null   category
 2   cap-color                 8124 non-null   category
 3   bruises                   8124 non-null   category
 4   odor                      8124 non-null   category
 5   gill-attachment           8124 non-null   category
 6   gill-spacing              8124 non-null   category
 7   gill-size                 8124 non-null   category
 8   gill-color                8124 non-null   category
 9   stalk-shape               8124 non-null   category
 10  stalk-root                5644 non-null   category
 11  stalk-surface-above-ring  8124 non-null   category
 12  stalk-surface-below-ring  8124 non-null   category
 13  stalk-color-above-ring    8124 non-null   catego

### Questions (2 marks)

1. (1 mark) Were there any missing/null values in your dataset? If yes, how did you replace them and why? If no, describe how you would've replaced them and why.
Yes there were 30% nulls in the stalk-root column. I use imputation with most frequent to populate them. Imputation is a best practice as it preserves the proportion and doesn't introduce new bias into the data.
2. (1 mark) What type of data do you have? What preprocessing methods would you have to apply based on your data types?
I have a categorical data, most columns are categories and I needed to apply one hot enconding to preprocess the data. We have 22 columns that have categorical values. The columns indicate the data about the mushroom in a categorical format with the target being poisonous or edible.

*ANSWER HERE*

## Step 3: Implement Machine Learning Model (11 marks)

In this section, you will implement three different supervised learning models (one linear and two non-linear) of your choice. You will use a pipeline to help you decide which model and hyperparameters work best. It is up to you to select what models to use and what hyperparameters to test. You can use the class examples for guidance. You must print out the best model parameters and results after the grid search.

In [214]:
# Implement pipeline and grid search here. Can add more code blocks if necessary

X = df.drop(columns='poisonous')
y = df['poisonous']
print(X.shape)
print(y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y,
                            test_size=0.4, stratify=y,random_state=42)


(8124, 22)
(8124,)


In [219]:
pipe_LR = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(max_iter=400))])

pipe_RF = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', RandomForestClassifier())])

pipe_SVM = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', SVC())])

param_grid_LR = {'classifier__C': [0.1, 1.0, 10.0],
                 'classifier__solver': ['newton-cholesky','liblinear', 'saga']}

param_grid_RF = {'classifier__n_estimators': [10, 100, 250],
              'classifier__max_depth': [5, 10, 15]}

param_grid_SVM = {'classifier__C': [0.01, 1, 10, 100],
              'classifier__gamma': [1]}


In [220]:
scores = {
    'accuracy': make_scorer(accuracy_score),
    'f1_score': make_scorer(f1_score, pos_label='p')
}

In [221]:
grid_fit_LR = GridSearchCV(pipe_LR, param_grid_LR, cv=5, scoring=scores, refit='f1_score')
grid_fit_RF = GridSearchCV(pipe_RF, param_grid_RF, cv=5, scoring=scores, refit='f1_score')
grid_fit_SVM = GridSearchCV(pipe_SVM, param_grid_SVM, cv=5, scoring=scores, refit='f1_score')

grid_fit_LR.fit(X_train, y_train)
print("\nFinal Logistic Regression Results\n")
print("Accuracy:\n", grid_fit_LR.cv_results_['mean_test_accuracy'])
print("F1:\n", grid_fit_LR.cv_results_['mean_test_f1_score'])
print("\nBest Parameters using F1:\n", grid_fit_LR.best_params_)

grid_fit_RF.fit(X_train, y_train)
print("\nFinal Random Forest Results\n")
print("Accuracy:\n", grid_fit_RF.cv_results_['mean_test_accuracy'])
print("F1:\n", grid_fit_RF.cv_results_['mean_test_f1_score'])
print("\nBest Parameters using F1:\n", grid_fit_RF.best_params_)

grid_fit_SVM.fit(X_train, y_train)
print("\nFinal SVM Results\n")
print("Accuracy:\n", grid_fit_SVM.cv_results_['mean_test_accuracy'])
print("F1:\n", grid_fit_SVM.cv_results_['mean_test_f1_score'])
print("\nBest Parameters using F1:\n", grid_fit_SVM.best_params_)



Final Logistic Regression Results

Accuracy:
 [0.9985641  0.9985641  0.9985641  0.99938462 0.99938462 0.99938462
 0.99958974 0.99979487 0.99979487]
F1:
 [0.99850496 0.99850496 0.99850496 0.99935966 0.99935966 0.99935966
 0.99957356 0.99978701 0.99978701]

Best Parameters using F1:
 {'classifier__C': 10.0, 'classifier__solver': 'liblinear'}

Final Random Forest Results

Accuracy:
 [0.9924092  0.99487074 0.99384489 0.99979487 1.         0.99979487
 0.99958974 1.         1.        ]
F1:
 [0.99205891 0.99464873 0.99357003 0.99978701 1.         0.99978701
 0.99957356 1.         1.        ]

Best Parameters using F1:
 {'classifier__max_depth': 10, 'classifier__n_estimators': 100}

Final SVM Results

Accuracy:
 [0.51805507 0.99425536 0.995281   0.995281  ]
F1:
 [0.         0.99399549 0.99507193 0.99507193]

Best Parameters using F1:
 {'classifier__C': 10, 'classifier__gamma': 1}


### Questions (5 marks)

1. (1 mark) Do you need regression or classification models for your dataset?
I needed to use classification for the data because the target is a category of poisonous or edible mushroom.
1. (2 marks) Which models did you select for testing and why?
I used SVM, linear regression and random forest as these three models are the ones we discussed in class and random forest is effective at modelling, in addition to SVM. The linear regression was compared against the other two. The linear regression outputs the probability using a threshold and it is not sensistive to anomalies. The random forest was used as it creates trees using multiple features of different depths randomly, and it prevents overfitting. SVM model is good for complex models with hyperplanes and great for comparison.
1. (2 marks) Which model worked the best? Does this make sense based on the theory discussed in the course and the context of your dataset?
The linear regression model appear to work the best, and random forest is almost just as good. All models overall had very high f1 and accuracy scores. The linear regression model is good in simple and linear dataset like the mushroom dataset. It is easy to apply and quick and sometimes prone to overfitting.

*ANSWER HERE*

## Step 4: Validate Model (6 marks)

Use the testing set to calculate the testing accuracy for the best model determined in Step 3.

In [222]:
# Calculate testing accuracy (1 mark)

test_accuracy_LR = grid_fit_LR.score(X_test, y_test)
test_accuracy_RF = grid_fit_RF.score(X_test, y_test)
test_accuracy_SVM = grid_fit_SVM.score(X_test, y_test)

results = {'Model': ['Logistic Regression','Random Forest','SVM'],
           'Test Accuracy/f1 score': [test_accuracy_LR, test_accuracy_RF, test_accuracy_SVM],
           'Mean Test Accuracy': [grid_fit_LR.cv_results_['mean_test_accuracy'], grid_fit_RF.cv_results_['mean_test_accuracy'], grid_fit_SVM.cv_results_['mean_test_accuracy']]}

pd.DataFrame(results)

Unnamed: 0,Model,Test Accuracy/f1 score,Mean Test Accuracy
0,Logistic Regression,1.0,"[0.9985641025641027, 0.9985641025641027, 0.998..."
1,Random Forest,1.0,"[0.9924092033907229, 0.9948707418522613, 0.993..."
2,SVM,0.999042,"[0.5180550729216027, 0.9942553572368767, 0.995..."



### Questions (5 marks)

1. (1 mark) Which accuracy metric did you choose? 
I used the f1 accuracy metric because it applies both accuracy and recall in its calculation and it is good to determine model performance in classification.
1. (1 mark) How do these results compare to those in part 3? Did this model generalize well?
The model results are comparable to part 3, it appears that the model can generalize very effectively, because the scores are near 1.
1. (3 marks) Based on your results and the context of your dataset, did the best model perform "well enough" to be used out in the real-world? Why or why not? Do you have any suggestions for how you could improve this analysis?
Based on the model performance using the dataset, the model performed excellently and can be used in real world applications. Before testing, it appeared that the model was overfitting, however after more testing it seems the model can very effectively generalize. The f1 scores are near 1. In order to further improve the analysis, we need to test it on a more diversified data to ensure it still can perform as good, and perhaps a large dataset so we can ensure its robustness.

*ANSWER HERE*

## Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
Mushroom. (1987). UCI Machine Learning Repository. https://doi.org/10.24432/C5959T.
1. In what order did you complete the steps?
As stated in the assignment, I completed them sequentially.
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
I used chatgpt to explain a few concepts like transforming the columns and processing the data before modelling it.
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?
Yes, I had some errors when I process the data but the model wouldn't run. I have to look into error details to see why it wasn't working and figure it out from there. Also my models were initially had many grid search parameters which meant it was very time consuming to get any results so i had to simplify it.

*DESCRIBE YOUR PROCESS HERE*

## Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,

It was easy to follow through assignment.
- found interesting, confusing, challenging, motivating
while working on this assignment.

Hard to visualize when we are doing preprocessing and enconding and a lot of changes happen to the data during that and then we fit.


*ADD YOUR THOUGHTS HERE*