# Assignment 4: Pipelines and Hyperparameter Tuning (32 total marks)
### Due: November 22 at 11:59pm

### Name: 

### In this assignment, you will be putting together everything you have learned so far. You will need to find your own dataset, do all the appropriate preprocessing, test different supervised learning models and evaluate the results. More details for each step can be found below.

### You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Import Libraries

In [1]:
import numpy as np
import pandas as pd

## Step 1: Data Input (4 marks)

Import the dataset you will be using. You can download the dataset onto your computer and read it in using pandas, or download it directly from the website. Answer the questions below about the dataset you selected. 

To find a dataset, you can use the resources listed in the notes. The dataset can be numerical, categorical, text-based or mixed. If you want help finding a particular dataset related to your interests, please email the instructor.

**You cannot use a dataset that was used for a previous assignment or in class**

In [2]:
# Import dataset (1 mark)

from sklearn.model_selection import train_test_split

df = pd.read_csv('train.csv')
print(df.shape)
df.head()


(8693, 14)


Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


### Questions (3 marks)

1. (1 mark) What is the source of your dataset?
1. (1 mark) Why did you pick this particular dataset?
1. (1 mark) Was there anything challenging about finding a dataset that you wanted to use?

*ANSWER HERE*

1. The source of my dataset is Kaggle. The files were downloaded from https://www.kaggle.com/competitions/spaceship-titanic/data?select=test.csv and the training file (`train.csv`) that I loaded with pandas will be provided with the assignment submission. 
2. I picked this dataset because I enjoy classification problems (here the target is true/false). It also mixes feature types, with categorical features (CryoSleep, HomePlanet, VIP, etc.) alongside numerical ones. There are also null values to clean, so the dataset lets me explore several processing steps: null-value handling, encoding, and scaling. 
3. It was a little challenging because most of the class examples were supervised learning, and it was hard to find datasets on Kaggle that were suited to supervised learning. However, I realized that I could use the labelled training set to build a supervised learning model. Since the assignment specifies "test different supervised learning models", this gave me a suitable dataset that I was interested in. 

## Step 2: Data Processing (5 marks)

The next step is to process your data. Implement the following steps as needed.

In [3]:
# Clean data (if needed)

# Check whether 'PassengerId' (the primary key, printed as 'ID' below) has duplicate values
is_duplicate = df['PassengerId'].duplicated() #since there are none, don't need to clean duplicates
duplicate_count = is_duplicate.sum()
print(f"Number of duplicate values in 'ID' column: {duplicate_count}")

print(df.isnull().sum()) #since there are quite a bit of nulls, we will need to clean it up. 

#dropping rows with missing values for the following;
df = df.dropna(subset=['Name', 'Destination', 'HomePlanet', 'CryoSleep', 'Cabin', 'VIP']) 

#dropping columns which provide no useful correlation of being transported, i.e. Name 
df = df.drop('Name', axis=1)

# Filling missing values in specific columns with the mean of each column
columns_to_fill = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

for column in columns_to_fill:
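    # Assign the result back; fillna(inplace=True) on a selected column may not update df in newer pandas
    df[column] = df[column].fillna(df[column].mean())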

print(df.isnull().sum()) #no nulls, cleaned up! 
print(df.shape)  # Shape was originally (8693, 14), so roughly 1,100 rows (and the 'Name' column) were dropped


Number of duplicate values in 'ID' column: 0
PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64
PassengerId     0
HomePlanet      0
CryoSleep       0
Cabin           0
Destination     0
Age             0
VIP             0
RoomService     0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
Transported     0
dtype: int64
(7559, 13)


In [4]:
# Implement preprocessing steps. Remember to use ColumnTransformer if more than one preprocessing method is needed

# 'Cabin' has too many unique values for reliable one-hot encoding, but the cabin may still
# carry signal about whether a passenger is transported, so this column is count-encoded instead:

frequency_map = df['Cabin'].value_counts().to_dict()
df['Cabin_encoded'] = df['Cabin'].map(frequency_map)
df = df.drop('Cabin', axis=1) #dropping Cabin now 

from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import RobustScaler, OneHotEncoder

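# Note: ColumnTransformer drops any column not listed here by default (remainder='drop'),
# so PassengerId and Cabin_encoded are not passed on to the models.
ct = ColumnTransformer(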
    [("scaling", RobustScaler(), ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']),
     ("onehot", OneHotEncoder(), ['Destination', 'HomePlanet', 'CryoSleep', 'VIP'])])

df.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,Cabin_encoded
0,0001_01,Europa,False,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,False,1
1,0002_01,Earth,False,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,True,1
2,0003_01,Europa,False,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False,2
3,0003_02,Europa,False,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False,2
4,0004_01,Earth,False,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,True,1
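
The count encoding applied to `Cabin` in the cell above can be illustrated on a toy column. This is only a minimal sketch: the `toy` frame and its cabin labels are made up for illustration and are not the notebook's data.

```python
import pandas as pd

# Toy stand-in for the 'Cabin' column (hypothetical values, for illustration only)
toy = pd.DataFrame({"Cabin": ["A/0/S", "A/0/S", "F/1/S", "B/0/P", "A/0/S"]})

# Count how many rows share each cabin label
counts = toy["Cabin"].value_counts()

# Replace each label by its count (the same idea as the frequency_map above)
toy["Cabin_encoded"] = toy["Cabin"].map(counts)

print(toy)
```

One possible refinement, stated here as an assumption rather than something the notebook does, is to build the count map from the training rows only and apply it to the test rows, so that the test set does not influence the encoding.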


### Questions (2 marks)

1. (1 mark) Were there any missing/null values in your dataset? If yes, how did you replace them and why? If no, describe how you would've replaced them and why.
2. (1 mark) What type of data do you have? What preprocessing methods would you have to apply based on your data types?

*ANSWER HERE*
1. Yes, there were missing/null values. I dropped the rows with missing values in 'Name', 'Destination', 'HomePlanet', 'CryoSleep', 'Cabin' and 'VIP'. Names are mostly unique, so filling with an average does not make sense. Rows missing Destination, HomePlanet, CryoSleep, Cabin or VIP were also dropped because there isn't a clear majority value for these categorical features, so filling with a typical value does not make sense either. For the columns 'Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa' and 'VRDeck', I filled missing values with the mean of the respective column, since these are continuous numerical values and the mean is a suitable replacement. I also dropped the column 'Name' as it provides no meaningful correlation and would introduce high cardinality for the encoder. Lastly, I used count encoding for the column 'Cabin' because it has too many unique values. The encoded value is the number of times each cabin appears in the dataset, which may correlate with whether people are transported, so I used count encoding and dropped the original 'Cabin' column.

2. I have both numerical and categorical features. The numerical features need scaling, and I chose RobustScaler because, on inspection, these features clearly contain outliers. The categorical features are unordered, so I encode them with OneHotEncoder. 
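
As a hedged sketch of the preprocessing described in these answers, mean imputation can also be placed inside the ColumnTransformer, so that the column means are learned only from the data passed to `fit`. The toy frame, the names `numeric_steps` and `ct_sketch`, and the choices of `SimpleImputer`, `handle_unknown="ignore"` and `remainder="passthrough"` are assumptions for illustration and are not part of the notebook's code.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Toy frame with the same kinds of columns (made-up values, for illustration only)
toy = pd.DataFrame({
    "Age": [39.0, 24.0, np.nan, 33.0],
    "Spa": [0.0, 549.0, 6715.0, np.nan],
    "HomePlanet": ["Europa", "Earth", "Europa", "Earth"],
    "Cabin_encoded": [1, 1, 2, 2],
})

# Numerical columns: fill missing values with the mean, then scale robustly
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", RobustScaler()),
])

ct_sketch = ColumnTransformer(
    [
        ("numeric", numeric_steps, ["Age", "Spa"]),
        # Ignore categories at transform time that were not seen during fit
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["HomePlanet"]),
    ],
    remainder="passthrough",  # keep Cabin_encoded instead of dropping it
)

transformed = ct_sketch.fit_transform(toy)
print(transformed.shape)
```

The `remainder="passthrough"` setting matters here because a ColumnTransformer drops unlisted columns by default, which would silently discard a count-encoded column that is not named in any transformer.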


## Step 3: Implement Machine Learning Model (11 marks)

In this section, you will implement three different supervised learning models (one linear and two non-linear) of your choice. You will use a pipeline to help you decide which model and hyperparameters work best. It is up to you to select what models to use and what hyperparameters to test. You can use the class examples for guidance. You must print out the best model parameters and results after the grid search.

In [5]:


# Implement pipeline and grid search here. Can add more code blocks if necessary
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

logistic_pipeline = Pipeline([
    ('preprocessor', ct),
    ('logistic', LogisticRegression(solver='lbfgs'))  # Specify the solver explicitly
])

random_forest_pipeline = Pipeline([
    ('preprocessor', ct),
    ('random_forest', RandomForestClassifier())
])

svm_pipeline = Pipeline([
    ('preprocessor', ct),
    ('svm', SVC())
])

# Define parameter grids for grid search
logistic_params = {
    'logistic__C': [0.1, .05, 1],
    'logistic__max_iter': [250, 500, 1000]  # Increase the maximum number of iterations
}

random_forest_params = {
    'random_forest__n_estimators': [2, 5, 10],
    'random_forest__max_depth': [None, 2, 5],
    'random_forest__min_samples_split': [2, 5, 10]
}

svm_params = {
    'svm__C': [0.1, 1, 10],
    'svm__kernel': ['linear', 'rbf']
}

# get all columns apart from transported for the features
X = df.drop(columns='Transported')
y = df['Transported']
print(X.shape)
print(y.shape)

# split dataframe and transported
X_train, X_test, y_train, y_test = train_test_split(X, y,
                            test_size=0.3, stratify=y,random_state=0)

print(X_train.shape)
print(y_train.shape)
X_train.head()

# Perform grid search with cross-validation
logistic_grid = GridSearchCV(logistic_pipeline, logistic_params, cv=5)
logistic_grid.fit(X_train, y_train)

rf_grid = GridSearchCV(random_forest_pipeline, random_forest_params, cv=5)
rf_grid.fit(X_train, y_train)

svm_grid = GridSearchCV(svm_pipeline, svm_params, cv=5)
svm_grid.fit(X_train, y_train)


(7559, 12)
(7559,)
(5291, 12)
(5291,)
Best params:
{'logistic__C': 0.05, 'logistic__max_iter': 250}

Best cross-validation score: 0.79
Test-set score: 0.78


In [23]:
print('Logistic Regression')
print("Best params:\n{}\n".format(logistic_grid.best_params_))
print("Best cross-validation score: {:.2f}".format(logistic_grid.best_score_))
print('\n')

print('Random Forests')
print("Best params:\n{}\n".format(rf_grid.best_params_))
print("Best cross-validation score: {:.2f}".format(rf_grid.best_score_))
print('\n')

print("Support Vector Machine (Classification)")
print("Best params:\n{}\n".format(svm_grid.best_params_))
print("Best cross-validation score: {:.2f}".format(svm_grid.best_score_))




Logistic Regression
Best params:
{'logistic__C': 0.05, 'logistic__max_iter': 250}

Best cross-validation score: 0.79


Random Forests
Best params:
{'random_forest__max_depth': None, 'random_forest__min_samples_split': 10, 'random_forest__n_estimators': 10}

Best cross-validation score: 0.78


Support Vector Machine (Classification)
Best params:
{'svm__C': 1, 'svm__kernel': 'rbf'}

Best cross-validation score: 0.79


### Questions (5 marks)

1. (1 mark) Do you need regression or classification models for your dataset?
1. (2 marks) Which models did you select for testing and why?
1. (2 marks) Which model worked the best? Does this make sense based on the theory discussed in the course and the context of your dataset?

*ANSWER HERE*
1. This dataset needs classification models because the target is either a true or false; there is no numerical value associated with the result.
2. I selected logistic regression as the linear classification model, and random forest and SVC as the non-linear classification models. These were chosen because the task is a binary classification problem and all three are classifiers. 
3. The models that worked best were the support vector machine and logistic regression, which tied on cross-validation score (0.79). My dataset is a binary classification problem with around 8,000 entries, which is not large. I think the dataset prefers a simpler, less complex model, as suggested by the best logistic regression parameters: a C value of 0.05 (the strongest regularization in the grid) with 250 iterations, which means more generalization. Because the data prefers less complex models, random forests gave the lowest score (0.78). However, I think SVC also scored well because it works well on a variety of datasets. Based on theory, I believe SVM should be the best performer since we are able to tune its parameters with grid search, but in practice logistic regression kept up because of the simplicity of the data. 
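
The model comparison discussed above can also be run as one grid search, by treating the final pipeline step itself as a parameter. The following is a minimal sketch on synthetic data: `make_classification`, the step name `classifier`, the variable names, and the grid values are all assumptions for illustration, not the notebook's setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC

# Synthetic binary-classification data standing in for the real features
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# The 'classifier' step is a placeholder that the grid replaces with each model
pipe = Pipeline([("scaler", RobustScaler()), ("classifier", LogisticRegression())])

# One dict per model, each with its own hyperparameters
param_grid = [
    {"classifier": [LogisticRegression(max_iter=1000)],
     "classifier__C": [0.05, 0.1, 1]},
    {"classifier": [RandomForestClassifier(random_state=0)],
     "classifier__n_estimators": [10, 50],
     "classifier__max_depth": [None, 5]},
    {"classifier": [SVC()],
     "classifier__C": [0.1, 1, 10],
     "classifier__kernel": ["linear", "rbf"]},
]

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)
print("Best cross-validation score: {:.2f}".format(search.best_score_))
print("Test-set score: {:.2f}".format(search.score(X_te, y_te)))
```

With this layout, `best_params_` reports both the winning model and its hyperparameters, so the three separate searches do not have to be compared by hand.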

## Step 4: Validate Model (6 marks)

Use the testing set to calculate the testing accuracy for the best model determined in Step 3.

In [22]:
# Calculate testing accuracy (1 mark)

from sklearn.metrics import f1_score
# Logistic Regression
logistic_predictions = logistic_grid.predict(X_test)
f1_logistic = f1_score(y_test, logistic_predictions)

# Random Forest
rf_predictions = rf_grid.predict(X_test)
f1_rf = f1_score(y_test, rf_predictions)

# Support Vector Machine
svm_predictions = svm_grid.predict(X_test)
f1_svm = f1_score(y_test, svm_predictions)

print("F1 Score - Logistic Regression: {:.2f}".format(f1_logistic))
print("F1 Score - Random Forest: {:.2f}".format(f1_rf))
print("F1 Score - Support Vector Machine: {:.2f}".format(f1_svm))

F1 Score - Logistic Regression: 0.78
F1 Score - Random Forest: 0.76
F1 Score - Support Vector Machine: 0.80



### Questions (5 marks)

1. (1 mark) Which accuracy metric did you choose? 
1. (1 mark) How do these results compare to those in part 3? Did this model generalize well?
1. (3 marks) Based on your results and the context of your dataset, did the best model perform "well enough" to be used out in the real-world? Why or why not? Do you have any suggestions for how you could improve this analysis?

*ANSWER HERE*
1. I chose the F1 score because it combines precision and recall into a single value, balancing the two. Because this is a binary classification with a true/false result, it is important to account for both false positives and false negatives.

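2. The results are fairly close, differing only by 0.01-0.02 (0.79 vs. 0.78 for logistic regression, 0.78 vs. 0.76 for random forest, and 0.79 vs. 0.80 for SVM). Note that the cross-validation scores use GridSearchCV's default scoring (accuracy for classifiers) while the test scores are F1, so the comparison is approximate. The models therefore appear to generalize well, although there is also the possibility of underfitting, since the cross-validation and test scores are similar. There are no signs of overfitting, as the test scores are not significantly lower than the cross-validation scores. I also believe the models generalized well because, had the grid search favoured the most complex settings, it would have picked the largest C for logistic regression (1 instead of 0.05) and 10 instead of 1 for SVM. 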

3. I believe that, for the context of my dataset, it did perform "well enough". There are no glaring signs of overfitting or underfitting based on the comparison of the test-set F1 score to the cross-validation score on the training set. The grid search did not pick the least regularized settings, indicating that the model is generalizing well. The model reaches a score on unseen data similar to its cross-validation score, with an F1 score of 0.80 using the SVM. If I were worried about being transported to a different planet, I would definitely take my chances on this model! One way to improve this analysis would be to get more computing resources so that I can test more model parameters, as I had to wait ~30 min for the grid search in the notebook to finish, even after reducing the parameters as much as possible. I could also test other models, such as gradient boosted decision trees. All in all, I am quite proud of this model analysis.
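
Since the answers above compare cross-validation scores with test-set F1 scores, it can help to see how accuracy and F1 differ on the same predictions. This is a small sketch with hypothetical labels and predictions (not results from the notebook).

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical true labels and predictions (1 = transported, 0 = not transported)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 1]

# Accuracy: fraction of all predictions that are correct
acc = accuracy_score(y_true, y_pred)

# F1: harmonic mean of precision and recall for the positive class
f1 = f1_score(y_true, y_pred)

print("Accuracy: {:.3f}".format(acc))
print("F1 score: {:.3f}".format(f1))
```

Here there are 3 true positives, 2 false positives and 1 false negative, so precision is 3/5 and recall is 3/4, and the two metrics give different numbers for the same predictions. To make the cross-validation and test scores directly comparable, one option is to pass `scoring="f1"` to `GridSearchCV`.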

## Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*

1. I sourced my code from examples provided in class, such as ApplyPipelines and Imputation Example2, both found in D2L: https://d2l.ucalgary.ca/d2l/le/content/543310/Home?itemIdentifier=TOC . 

2. I completed the steps in the order provided by this assignment, which was laid out in a very organized way. I also did some pre-testing of the dataset split to confirm that it was suitable for supervised learning. 

3. I did not use generative AI to modify the code at all. However, I used it to learn which methods to call, such as RobustScaler(), and about techniques such as count encoding. 

4. One challenge was finding a suitable dataset, and another was getting used to building a full supervised machine learning model on my own. It was definitely a unique assignment where you are left to your own devices, so it was hard to find a dataset and decide which machine learning models I wanted to implement. I think what helped me to be successful was following the Imputation Example2 document on D2L to understand the processes I needed to follow. 

## Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*

I really enjoyed how this assignment gives creative freedom to the student. It was definitely a new challenge to apply everything I have learned to a real-world machine learning model. There is direction to follow, but many of the steps, such as preprocessing, cleaning, dataset selection and model implementation, are left to the student, so it was very enjoyable. I think I learned a lot more from this assignment than from the previous ones, and I really enjoyed it as a result. This was a motivating assignment because it translated the class to the real world, and it opened my eyes to the possibilities of using this class knowledge to implement actual machine learning models on my own!