# Assignment 4: Pipelines and Hyperparameter Tuning (32 total marks)
### Due: November 22 at 11:59pm

### Name: 

### In this assignment, you will be putting together everything you have learned so far. You will need to find your own dataset, do all the appropriate preprocessing, test different supervised learning models and evaluate the results. More details for each step can be found below.

### You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Import Libraries

In [1]:
import numpy as np
import pandas as pd

## Step 1: Data Input (4 marks)

Import the dataset you will be using. You can download the dataset onto your computer and read it in using pandas, or download it directly from the website. Answer the questions below about the dataset you selected. 

To find a dataset, you can use the resources listed in the notes. The dataset can be numerical, categorical, text-based or mixed. If you want help finding a particular dataset related to your interests, please email the instructor.

**You cannot use a dataset that was used for a previous assignment or in class**

In [2]:
# Import dataset (1 mark)
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
congressional_voting_records = fetch_ucirepo(id=105) 
  
# data (as pandas dataframes) 
X = congressional_voting_records.data.features 
y = congressional_voting_records.data.targets 
  
# metadata 
print(congressional_voting_records.metadata) 
  
# variable information 
print(congressional_voting_records.variables) 


{'uci_id': 105, 'name': 'Congressional Voting Records', 'repository_url': 'https://archive.ics.uci.edu/dataset/105/congressional+voting+records', 'data_url': 'https://archive.ics.uci.edu/static/public/105/data.csv', 'abstract': '1984 United Stated Congressional Voting Records; Classify as Republican or Democrat', 'area': 'Social Science', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 435, 'num_features': 16, 'feature_types': ['Categorical'], 'demographics': [], 'target_col': ['Class'], 'index_col': None, 'has_missing_values': 'yes', 'missing_values_symbol': 'NaN', 'year_of_dataset_creation': 1987, 'last_updated': 'Mon Apr 27 1987', 'dataset_doi': '10.24432/C5C01P', 'creators': [], 'intro_paper': None, 'additional_info': {'summary': 'This data set includes votes for each of the U.S. House of Representatives Congressmen on the 16 key votes identified by the CQA.  The CQA lists nine different types of votes: voted for, paired for, and announced for (th

### Questions (3 marks)

1. (1 mark) What is the source of your dataset?

The dataset comes from the UCI Machine Learning Repository, a popular source for machine learning datasets. https://archive.ics.uci.edu/dataset/105/congressional+voting+records

2. (1 mark) Why did you pick this particular dataset?

This dataset was chosen because it provides a clear example of a binary classification problem with real-world political data. It is also relatively simple in structure, making it suitable for learning and demonstrating basic concepts in data preprocessing and machine learning.

3. (1 mark) Was there anything challenging about finding a dataset that you wanted to use?

Challenges include ensuring the dataset is relevant, not overly complex for the learning objectives, and has well-documented features and instances. It's also important that the dataset is free of ethical concerns and privacy issues.


## Step 2: Data Processing (5 marks)

The next step is to process your data. Implement the following steps as needed.

In [3]:
# Clean data (if needed)
from ucimlrepo import fetch_ucirepo

# Fetch the dataset
congressional_voting_records = fetch_ucirepo(id=105)

# Data (assuming it's a DataFrame)
df = congressional_voting_records.data.features

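# Create a copy of the DataFrame for cleaning and encoding.
# Rows are not de-duplicated: different legislators can share an identical
# voting record, so repeated rows are valid observations.
df_cleaned = df.copy()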

# Handling missing values (example: fill with the most frequent value in each column)
for column in df_cleaned.columns:
    most_frequent = df_cleaned[column].mode()[0]
    df_cleaned[column] = df_cleaned[column].fillna(most_frequent)

# Checking for missing values in each column after handling them
missing_values_after = df_cleaned.isnull().sum()
print("Missing values in each column after handling:\n", missing_values_after)

# Encode categorical data (example: Convert 'yes'/'no' votes to 1/0)
yes_no_columns = df_cleaned.columns  # List all columns that need encoding
for column in yes_no_columns:
    df_cleaned[column] = df_cleaned[column].map({'y': 1, 'n': 0})

# Checking the data types of each column after encoding
print("Data types of each column after encoding:\n", df_cleaned.dtypes)

# Further analysis and processing steps...






Missing values in each column after handling:
 handicapped-infants                       0
water-project-cost-sharing                0
adoption-of-the-budget-resolution         0
physician-fee-freeze                      0
el-salvador-aid                           0
religious-groups-in-schools               0
anti-satellite-test-ban                   0
aid-to-nicaraguan-contras                 0
mx-missile                                0
immigration                               0
synfuels-corporation-cutback              0
education-spending                        0
superfund-right-to-sue                    0
crime                                     0
duty-free-exports                         0
export-administration-act-south-africa    0
dtype: int64
Data types of each column after encoding:
 handicapped-infants                       int64
water-project-cost-sharing                int64
adoption-of-the-budget-resolution         int64
physician-fee-freeze                      int64
e

In [4]:
# Implement preprocessing steps. Remember to use ColumnTransformer if more than one preprocessing method is needed

from ucimlrepo import fetch_ucirepo
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder

# Fetch the dataset
congressional_voting_records = fetch_ucirepo(id=105)

# Data (as pandas dataframes)
X = congressional_voting_records.data.features
y = congressional_voting_records.data.targets

# Convert the target DataFrame to a NumPy array and then flatten it to a 1D array
y_array = y.values.ravel()

# Encode the target variable
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y_array)

# Replace 'y' and 'n' with 1 and 0, and use np.nan for missing values in features
X_encoded = X.replace({'y': 1, 'n': 0, '?': np.nan})

# Apply SimpleImputer to handle the missing values in features
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
X_imputed = pd.DataFrame(imputer.fit_transform(X_encoded), columns=X_encoded.columns)

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_imputed, y_encoded, test_size=0.2, random_state=42)

# Print confirmation and some details about the preprocessed data
print("Preprocessing completed.")
print(f"Training set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")


Preprocessing completed.
Training set shape: (348, 16)
Testing set shape: (87, 16)


### Questions (2 marks)

1. (1 mark) Were there any missing/null values in your dataset? If yes, how did you replace them and why? If no, describe how you would've replaced them and why.

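The dataset contains missing values, recorded as '?' in the raw file and reported as NaN in the dataset metadata. These were replaced with the most frequent value in each column, a common practice known as mode imputation. This method suits categorical data because every imputed value is a valid category and no rows are lost, although it slightly inflates the share of the majority vote in each column.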
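As a small illustration of mode imputation, the sketch below fills a toy column of votes. The series `toy_votes` is made up for the example and is not taken from the dataset; it only shows how filling with the mode removes missing values while raising the share of the majority category.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: 3 'y', 2 'n', 2 missing (not real dataset rows)
toy_votes = pd.Series(["y", "n", "y", np.nan, "y", "n", np.nan])

# Most frequent observed value (missing entries are ignored by mode())
mode_value = toy_votes.mode()[0]

# Share of the mode among observed values before imputation
share_before = toy_votes.value_counts(normalize=True)[mode_value]

# Fill every missing entry with the mode
filled = toy_votes.fillna(mode_value)
share_after = filled.value_counts(normalize=True)[mode_value]

print("Mode:", mode_value)
print("Missing after imputation:", int(filled.isna().sum()))
print(f"Share of mode before: {share_before:.3f}, after: {share_after:.3f}")
```

The share of the mode grows after filling, which is the distortion to keep in mind when a column has many missing entries.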

2. (1 mark) What type of data do you have? What preprocessing methods would you have to apply based on your data types?

The data is categorical (votes recorded as 'y', 'n', or missing). Preprocessing included encoding 'y' and 'n' as binary values (1 and 0) and handling missing values. The target (party) was label-encoded as 0 and 1. If numerical data were present, methods like normalization or standardization might be necessary.
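One possible way to express the same two preprocessing steps (imputation, then binary encoding) as a single scikit-learn pipeline is sketched below. This is an alternative to the approach used in the notebook, not a description of it. The column names `vote_a` and `vote_b` and the toy rows are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy votes (not real dataset rows)
toy_features = pd.DataFrame({
    "vote_a": ["y", "n", np.nan, "y"],
    "vote_b": ["n", "n", "y", np.nan],
})

# Step 1: fill missing votes with the column mode.
# Step 2: map 'n' -> 0 and 'y' -> 1 using an explicit category order.
preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OrdinalEncoder(categories=[["n", "y"]] * toy_features.shape[1])),
])

encoded = preprocess.fit_transform(toy_features)
print(encoded)
```

Placing these steps inside a pipeline means that, when the pipeline is passed to a grid search, the mode is learned from the training folds only rather than from the full dataset.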

## Step 3: Implement Machine Learning Model (11 marks)

In this section, you will implement three different supervised learning models (one linear and two non-linear) of your choice. You will use a pipeline to help you decide which model and hyperparameters work best. It is up to you to select what models to use and what hyperparameters to test. You can use the class examples for guidance. You must print out the best model parameters and results after the grid search.

In [5]:
# Implement pipeline and grid search here. Can add more code blocks if necessary

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Define pipelines for each model
pipelines = {
    'rf': Pipeline([('classifier', RandomForestClassifier(random_state=42))]),
    'lr': Pipeline([('classifier', LogisticRegression())]),
    'svc': Pipeline([('classifier', SVC())])
}

# Define parameter grids for each model
param_grids = {
    'rf': {'classifier__n_estimators': [100, 200], 'classifier__max_depth': [None, 10, 20]},
    'lr': {'classifier__C': [0.1, 1, 10]},
    'svc': {'classifier__C': [0.1, 1, 10], 'classifier__kernel': ['linear', 'rbf']}
}

# Implement GridSearchCV for each model and compare results
best_models = {}
for model_name, pipeline in pipelines.items():
    grid_search = GridSearchCV(pipeline, param_grids[model_name], cv=5, verbose=1, n_jobs=-1)
    grid_search.fit(X_train, y_train)
    best_models[model_name] = grid_search

    # Print best parameters and score
    print(f"Best parameters for {model_name}: ", grid_search.best_params_)
    print(f"Best score for {model_name}: ", grid_search.best_score_)

    # Predict and evaluate
    y_pred = grid_search.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)
    class_report = classification_report(y_test, y_pred)
    print(f"Accuracy for {model_name}: ", accuracy)
    print(f"Confusion Matrix for {model_name}:\n", conf_matrix)
    print(f"Classification Report for {model_name}:\n", class_report)


Fitting 5 folds for each of 6 candidates, totalling 30 fits
Best parameters for rf:  {'classifier__max_depth': None, 'classifier__n_estimators': 200}
Best score for rf:  0.9539958592132505
Accuracy for rf:  0.9770114942528736
Confusion Matrix for rf:
 [[55  1]
 [ 1 30]]
Classification Report for rf:
               precision    recall  f1-score   support

           0       0.98      0.98      0.98        56
           1       0.97      0.97      0.97        31

    accuracy                           0.98        87
   macro avg       0.97      0.97      0.97        87
weighted avg       0.98      0.98      0.98        87

Fitting 5 folds for each of 3 candidates, totalling 15 fits
Best parameters for lr:  {'classifier__C': 1}
Best score for lr:  0.9655072463768116
Accuracy for lr:  0.9540229885057471
Confusion Matrix for lr:
 [[54  2]
 [ 2 29]]
Classification Report for lr:
               precision    recall  f1-score   support

           0       0.96      0.96      0.96        56
    

### Questions (5 marks)

1. (1 mark) Do you need regression or classification models for your dataset?

For the Congressional Voting Records dataset, classification models are needed. The target variable in this dataset is categorical, representing political party affiliations (Democrat or Republican). Classification models are designed to predict categorical outcomes, making them suitable for this kind of data.

2. (2 marks) Which models did you select for testing and why?

The models selected for testing were RandomForestClassifier, Logistic Regression, and Support Vector Classifier (SVC).

* RandomForestClassifier: This was chosen due to its ability to handle categorical data effectively and its robustness against overfitting. RandomForest, being an ensemble method, generally performs well on a variety of datasets.

* Logistic Regression: As a linear model, Logistic Regression is straightforward, interpretable, and particularly effective for binary classification problems. It was chosen to compare how a simpler linear approach performs against more complex models.

* SVC: Support Vector Classifier was selected to represent another type of non-linear model. With different kernel options (linear and RBF), it provides a way to test both linear and non-linear decision boundaries.

3. (2 marks) Which model worked the best? Does this make sense based on the theory discussed in the course and the context of your dataset?

* The models were compared on their grid search (cross-validation) scores, with the testing set held back to check the chosen model. Choosing a model by its test accuracy would turn that accuracy into an optimistic estimate.

* Logistic Regression had the highest cross-validation score (0.966, against 0.954 for RandomForestClassifier) and is the model selected by the grid search in Step 4. Being simpler, it performs well when the decision boundary is approximately linear, and these results suggest that is the case here.

* RandomForestClassifier can capture complex patterns in the data without requiring feature scaling or extensive preprocessing, and it had the higher testing accuracy (0.977, against 0.954). With 87 test samples, that difference is only two predictions (2 errors against 4), so it is weak evidence that the more complex model is better.

* SVC's performance can vary significantly based on the choice of kernel and hyperparameters.

## Step 4: Validate Model (6 marks)

Use the testing set to calculate the testing accuracy for the best model determined in Step 3.

In [6]:
# Calculate testing accuracy (1 mark)

from sklearn.metrics import accuracy_score

# Identify the best model from grid search
best_model_name = max(best_models, key=lambda model: best_models[model].best_score_)
best_model = best_models[best_model_name].best_estimator_

print(f"Best model based on grid search: {best_model_name}")

# Predict on the test data using the best model
y_pred = best_model.predict(X_test)

# Calculate the accuracy
test_accuracy = accuracy_score(y_test, y_pred)

print("Testing Accuracy:", test_accuracy)


Best model based on grid search: lr
Testing Accuracy: 0.9540229885057471



### Questions (5 marks)

1. (1 mark) Which accuracy metric did you choose? 

The chosen accuracy metric was the standard accuracy score, which is the proportion of correct predictions out of all predictions. This metric is straightforward and widely used, especially for binary classification tasks.
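As a worked check of this definition, the sketch below recomputes accuracy from the two confusion matrices printed in Step 3. The helper `accuracy_from_confusion` is a hypothetical name introduced for the example; the matrices are copied from the output above.

```python
def accuracy_from_confusion(matrix):
    """Accuracy = correct predictions (the diagonal) / all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Confusion matrices as printed in Step 3
rf_matrix = [[55, 1], [1, 30]]
lr_matrix = [[54, 2], [2, 29]]

rf_accuracy = accuracy_from_confusion(rf_matrix)  # (55 + 30) / 87
lr_accuracy = accuracy_from_confusion(lr_matrix)  # (54 + 29) / 87

print(f"rf accuracy: {rf_accuracy:.4f}")
print(f"lr accuracy: {lr_accuracy:.4f}")
```

Both values agree with the `accuracy_score` results reported in Step 3.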

2. (1 mark) How do these results compare to those in part 3? Did this model generalize well?

* Part 3 ran GridSearchCV for three different models (RandomForestClassifier, Logistic Regression, and SVC) and evaluated each on the testing set.

* Logistic Regression, the model selected by cross-validation score, reached a testing accuracy of 0.954, close to its mean cross-validation accuracy of 0.966. Similar performance on the cross-validation folds and the testing set suggests that the model generalized well.
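The comparison between cross-validation and testing performance can be made explicit with a few lines of arithmetic. The sketch below uses only the scores printed in Step 3; the SVC results are omitted because they do not appear in the captured output, and the dictionary name `reported` is introduced for the example.

```python
# Scores copied from the Step 3 output (SVC omitted: not shown in the output)
reported = {
    "rf": {"cv": 0.9539958592132505, "test": 0.9770114942528736},
    "lr": {"cv": 0.9655072463768116, "test": 0.9540229885057471},
}

# Positive gap: the model did worse on the testing set than in cross-validation
gaps = {name: scores["cv"] - scores["test"] for name, scores in reported.items()}

for name, gap in gaps.items():
    print(f"{name}: cv - test = {gap:+.4f}")
```

A small gap in either direction is consistent with a model that generalizes well; a large positive gap would point to overfitting during the grid search.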

3. (3 marks) Based on your results and the context of your dataset, did the best model perform "well enough" to be used out in the real-world? Why or why not? Do you have any suggestions for how you could improve this analysis?

Real-World Applicability of the Best Model:

* The real-world applicability of the best-performing model depends on its testing accuracy and the context of the dataset. With a testing accuracy of about 95%, the model is likely suitable for uses where occasional errors are not critical (e.g., political analysis, prediction of voting behavior). The data covers only 16 key votes from 1984, so this accuracy should not be assumed to carry over to other settings.

* To further improve the analysis, one could experiment with more diverse hyperparameter tuning, try different preprocessing techniques, or explore ensemble methods that combine the strengths of various models.
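As one possible illustration of the ensemble suggestion, the sketch below combines the three model types from Step 3 with scikit-learn's `VotingClassifier` using majority (hard) voting. It runs on synthetic data from `make_classification`, not on the voting records, so the score it prints says nothing about this dataset; the hyperparameters are arbitrary choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data with 16 features (assumption: not the real dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=16, random_state=42)

# Hard voting: each model casts one vote and the majority class wins
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("svc", SVC()),
    ],
    voting="hard",
)

# 5-fold cross-validated accuracy of the combined model
scores = cross_val_score(ensemble, X_demo, y_demo, cv=5)
print(f"Mean CV accuracy on synthetic data: {scores.mean():.3f}")
```

On the real data, the ensemble would be fitted on `X_train` and compared against the individual models using the same cross-validation setup.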

Reflecting on Part 3:

* This part of the assignment gave practical insight into model selection and hyperparameter tuning. Running GridSearchCV with multiple models showed how different models and their parameters affect performance.

* A potential challenge is the computational cost and time involved in extensive grid searches, especially with models like SVC that have multiple hyperparameters. The iterative process of tuning and evaluating models also requires careful interpretation of results.

* Comparing different models and seeing how theoretical principles translate into practical outcomes is essential for developing a deeper understanding of machine learning algorithms.


## Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:

1. Where did you source your code?

The dataset comes from the UCI Machine Learning Repository, a popular source for machine learning datasets. 
https://archive.ics.uci.edu/dataset/105/congressional+voting+records

2. In what order did you complete the steps?

* Data Loading: The first step involved fetching and loading the dataset using the ucimlrepo package.

* Data Preprocessing: This included handling missing values, encoding categorical variables, and preparing the dataset for modeling.

* Model Selection and Setup: Selection of models (RandomForest, Logistic Regression, and SVC) for testing, along with setting up pipelines for each.

* Hyperparameter Tuning: Implementing GridSearchCV for each model to find the best parameters.

* Model Evaluation: Assessing model performance using accuracy and other metrics like confusion matrix and classification report on the testing set.

3. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?

* Yes, ChatGPT was used to better understand the requirements and the implementation.
* The prompts were the assignment instructions and queries about implementing the machine learning models to meet the requirements.
* Modifications were made to specific details such as the target variable and feature selection, because the generated code was based on assumptions and needed adjustment to fit the exact dataset and task.

4. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

* Lack of Specific Details: The primary challenge was the lack of specific details about the target variable and features in the dataset. Assumptions were made for these, which might not align perfectly with the actual use case.


## Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.

* Liked the implementation of the multiple Machine Learning Models.

* While working on this assignment, I found it interesting and motivating to work on a machine learning task involving code generation.

* Challenges include understanding the best practices for data preprocessing, selecting appropriate models, and tuning parameters for optimal performance.
