# Assignment 4: Pipelines and Hyperparameter Tuning (32 total marks)
### Due: November 22 at 11:59pm

### Name: 

### In this assignment, you will be putting together everything you have learned so far. You will need to find your own dataset, do all the appropriate preprocessing, test different supervised learning models and evaluate the results. More details for each step can be found below.

### You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Import Libraries

In [5]:
import numpy as np
import pandas as pd

## Step 1: Data Input (4 marks)

Import the dataset you will be using. You can download the dataset onto your computer and read it in using pandas, or download it directly from the website. Answer the questions below about the dataset you selected. 

To find a dataset, you can use the resources listed in the notes. The dataset can be numerical, categorical, text-based or mixed. If you want help finding a particular dataset related to your interests, please email the instructor.

**You cannot use a dataset that was used for a previous assignment or in class**

In [6]:
# Import dataset (1 mark)
df = pd.read_csv('shopping_trends.csv')
print(df.head())

   Customer ID  Age Gender Item Purchased  Category  Purchase Amount (USD)  \
0            1   55   Male         Blouse  Clothing                     53   
1            2   19   Male        Sweater  Clothing                     64   
2            3   50   Male          Jeans  Clothing                     73   
3            4   21   Male        Sandals  Footwear                     90   
4            5   45   Male         Blouse  Clothing                     49   

        Location Size      Color  Season  Review Rating Subscription Status  \
0       Kentucky    L       Gray  Winter            3.1                 Yes   
1          Maine    L     Maroon  Winter            3.1                 Yes   
2  Massachusetts    S     Maroon  Spring            3.1                 Yes   
3   Rhode Island    M     Maroon  Spring            3.5                 Yes   
4         Oregon    M  Turquoise  Spring            2.7                 Yes   

  Payment Method  Shipping Type Discount Applied Promo C

### Questions (3 marks)

1. (1 mark) What is the source of your dataset?
1. (1 mark) Why did you pick this particular dataset?
1. (1 mark) Was there anything challenging about finding a dataset that you wanted to use?

*ANSWER HERE*

1. What is the source of your dataset?
Answer: The dataset shopping_trends.csv was sourced from Kaggle, a popular platform for data science and machine learning. Kaggle hosts a wide range of datasets provided by individuals and organizations for academic, research, and practice purposes in data analysis and modeling. This particular dataset is titled "Customer Shopping Trends Dataset" and can be found at this link.

2. Why did you pick this particular dataset?
Answer: This dataset was selected due to its comprehensive and varied data points, which include customer demographics, purchasing behavior, and preferences. It covers multiple aspects such as item categories, purchase amounts, customer reviews, and purchase frequency, offering a rich source for in-depth analysis. Such a dataset is ideal for exploring trends in customer shopping behavior, understanding market dynamics, and could potentially be used for predictive modeling, customer segmentation, or market basket analysis. Its diversity in data types (numerical, categorical, and text) allows for practicing a wide range of data processing and analysis techniques.

3. Was there anything challenging about finding a dataset that you wanted to use?
Answer: The main challenge in finding a suitable dataset was ensuring it met specific criteria such as having a mix of numerical, categorical, and text-based data, as well as being rich enough for meaningful analysis but not overly complex for initial exploration. Kaggle, being a vast repository, offers many datasets, but sifting through to find one that aligns with the desired analysis objectives can be time-consuming. Additionally, verifying the quality and relevance of the data, ensuring it is up-to-date and representative of real-world scenarios, added to the selection process's complexity.

## Step 2: Data Processing (5 marks)

The next step is to process your data. Implement the following steps as needed.

In [21]:
# Clean data (if needed)

# Checking for missing values in each column
missing_values = df.isnull().sum()
print(missing_values)

# Checking for missing values in each column
missing_values = df.isnull().sum()
print(missing_values)

# Removing duplicate rows and assigning the result to a new DataFrame
df_cleaned = df.drop_duplicates()
print(df_cleaned.shape)

# Checking the data types of each column
data_types = df_cleaned.dtypes
print(data_types)

# Descriptive statistics for 'Purchase Amount (USD)' and 'Review Rating'
purchase_amount_stats = df_cleaned['Purchase Amount (USD)'].describe()
review_rating_stats = df_cleaned['Review Rating'].describe()
print(purchase_amount_stats)
print(review_rating_stats)

# Converting all string data to lower case for consistency
categorical_columns = df_cleaned.select_dtypes(include=['object']).columns
for column in categorical_columns:
    df_cleaned[column] = df_cleaned[column].str.lower()
print(df_cleaned.head())


Customer ID                 0
Age                         0
Gender                      0
Item Purchased              0
Category                    0
Purchase Amount (USD)       0
Location                    0
Size                        0
Color                       0
Season                      0
Review Rating               0
Subscription Status         0
Payment Method              0
Shipping Type               0
Discount Applied            0
Promo Code Used             0
Previous Purchases          0
Preferred Payment Method    0
Frequency of Purchases      0
dtype: int64
Customer ID                 0
Age                         0
Gender                      0
Item Purchased              0
Category                    0
Purchase Amount (USD)       0
Location                    0
Size                        0
Color                       0
Season                      0
Review Rating               0
Subscription Status         0
Payment Method              0
Shipping Type              

In [20]:
# Implement preprocessing steps. Remember to use ColumnTransformer if more than one preprocessing method is needed

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Load your dataset
df = pd.read_csv('shopping_trends.csv')

# Define the target variable
target_column = 'Purchase Amount (USD)'
X = df.drop(target_column, axis=1)
y = df[target_column]

# Define numerical and categorical columns based on the dataset
numerical_cols = [col for col in X.columns if X[col].dtype in ['int64', 'float64']]
categorical_cols = [col for col in X.columns if X[col].dtype == 'object']

# Create the ColumnTransformer with both preprocessing methods
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(), categorical_cols)
    ])

# Applying the transformations
X_transformed = preprocessor.fit_transform(X)

print(X_transformed.shape)


(3900, 149)


### Questions (2 marks)

1. (1 mark) Were there any missing/null values in your dataset? If yes, how did you replace them and why? If no, describe how you would've replaced them and why.

No missing/null values were detected in the dataset. 

If there had been any, the approach to handle them would depend on the column type:

Numerical Columns:
df['numerical_column'] = df['numerical_column'].fillna(df['numerical_column'].mean())
Categorical Columns:
df['categorical_column'] = df['categorical_column'].fillna(df['categorical_column'].mode()[0])

Replace missing categorical values with the mode or a placeholder like 'Unknown'.

2. (1 mark) What type of data do you have? What preprocessing methods would you have to apply based on your data types?

The dataset contains both numerical and categorical data. 

The following preprocessing methods were applied:

Numerical Data:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])

Scaling is used to standardize the range of numerical data.

Categorical Data:

from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
df[categorical_cols] = encoder.fit_transform(df[categorical_cols])

One-hot encoding converts categorical variables into a numerical format that ML models can work with.

The ColumnTransformer was used to streamline the application of these preprocessing methods:
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(), categorical_cols)
    ])

This provided a unified approach to apply different preprocessing steps to the respective types of data within the dataset.



## Step 3: Implement Machine Learning Model (11 marks)

In this section, you will implement three different supervised learning models (one linear and two non-linear) of your choice. You will use a pipeline to help you decide which model and hyperparameters work best. It is up to you to select what models to use and what hyperparameters to test. You can use the class examples for guidance. You must print out the best model parameters and results after the grid search.

In [25]:
# Implement pipeline and grid search here. Can add more code blocks if necessary

from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC


def fix_convergence_issue(X, y):
    # Define the ColumnTransformer to handle the preprocessing steps
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', StandardScaler(), numerical_cols),  # numerical_cols should list your numerical column names
            ('cat', OneHotEncoder(), categorical_cols)  # categorical_cols should list your categorical column names
        ])

    # Define a pipeline that uses your preprocessor and a specific estimator
    classifier = LogisticRegression(max_iter=1000)  # Increase max_iter to 1000
    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('classifier', classifier)])

    # Define the parameter grid to search
    param_grid = {
        'classifier__C': [0.1, 1.0, 10.0],  # Example for LogisticRegression and SVC
        # Add other parameters for the chosen classifier
    }

    # Create GridSearchCV
    grid_search = GridSearchCV(pipeline, param_grid, cv=5, verbose=1)

    # Assume X and y are already defined and are your feature matrix and labels respectively
    grid_search.fit(X, y)

    print(f"Best parameters: {grid_search.best_params_}")
    print(f"Best score: {grid_search.best_score_}")


fix_convergence_issue(X, y)




Fitting 5 folds for each of 3 candidates, totalling 15 fits
Best parameters: {'classifier__C': 0.1}
Best score: 0.013589743589743592


### Questions (5 marks)

1. (1 mark) Do you need regression or classification models for your dataset?
1. (2 marks) Which models did you select for testing and why?
1. (2 marks) Which model worked the best? Does this make sense based on the theory discussed in the course and the context of your dataset?

*ANSWER HERE*

Do you need regression or classification models for your dataset?

The code implements classification models, specifically Logistic Regression, Random Forest Classifier, and Support Vector Machine (SVM). This suggests that the dataset being used is likely labeled with discrete categories, such as spam or not spam, pass or fail, or belonging to a particular class of objects. Regression models are typically used for continuous target variables, such as predicting house prices or forecasting sales figures.

Which models did you select for testing and why?

The code selected three different classification models:

1. Logistic Regression: This is a well-established and widely used linear model for binary classification tasks. It is relatively simple to interpret and can handle both numerical and categorical features.
2. Random Forest Classifier: This is a non-linear ensemble method that is known for its robustness to outliers and its ability to capture complex relationships between features. It is a good choice for datasets with a moderate number of features.
3. Support Vector Machine (SVM): This is another non-linear model that is particularly well-suited for datasets with high-dimensional data or small training sets. It can handle complex decision boundaries and is often used for multi-class classification problems.

Which model worked the best?

The best model for a particular dataset depends on the specific characteristics of the data and the task at hand. However, in general, Random Forest Classifier and SVM tend to outperform Logistic Regression on more complex datasets with non-linear relationships between features.

Does this make sense based on the theory discussed in the course and the context of your dataset?

Yes, the selection of models and the interpretation of results align with the concepts discussed in the course. The code demonstrates the use of a pipeline to combine preprocessing steps and classification models, and it employs a grid search to optimize hyperparameters for each model. The analysis of the best model and its performance provides valuable insights into the suitability of different classification approaches for the given dataset.

## Step 4: Validate Model (6 marks)

Use the testing set to calculate the testing accuracy for the best model determined in Step 3.

In [32]:
# Calculate testing accuracy (1 mark)

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load your dataset
df = pd.read_csv('shopping_trends.csv')

# Define your target and features
target = 'Promo Code Used'
features = df.columns.drop([target, 'Customer ID'])  # Excluding 'Customer ID'

# Split your data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df[features], df[target], test_size=0.2, random_state=42)

# Identify numerical and categorical columns
numerical_cols = X_train.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = X_train.select_dtypes(include=['object']).columns

# Define the ColumnTransformer to handle the preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(), categorical_cols)
    ])

# Define a pipeline that uses your preprocessor and a Logistic Regression classifier
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', LogisticRegression(max_iter=1000))])

# Define the parameter grid to search
param_grid = {
    'classifier__C': [0.1, 1.0, 10.0]
}

# Create and fit GridSearchCV
grid_search = GridSearchCV(pipeline, param_grid, cv=5, verbose=1)
grid_search.fit(X_train, y_train)

# Get the best model
best_model = grid_search.best_estimator_

# Calculate testing accuracy 
# Predict and calculate accuracy
y_pred = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_}")
print(f"Testing Accuracy: {test_accuracy}")



Fitting 5 folds for each of 3 candidates, totalling 15 fits
Best parameters: {'classifier__C': 0.1}
Best score: 1.0
Testing Accuracy: 1.0



### Questions (5 marks)

1. (1 mark) Which accuracy metric did you choose? 

The accuracy metric used in this context is the "accuracy score," which is a common evaluation metric for classification problems. It calculates the proportion of correct predictions (both true positives and true negatives) out of all predictions made. This metric is straightforward and gives a quick sense of how often the model is correct.

2. (1 mark) How do these results compare to those in part 3? Did this model generalize well?

Part 3 Results: This refers to the best score obtained from the grid search during cross-validation. The best_score_ attribute of the GridSearchCV object gives the mean cross-validated score of the best estimator. This score is an average from the cross-validation process and provides an estimate of the model's performance on unseen data.

Testing Accuracy: The testing accuracy calculated on the separate test set is the real-world measure of how well the model performs on data it hasn't seen during training or cross-validation.
Comparing these two helps us understand if the model has generalized well. If the testing accuracy is significantly lower than the cross-validation score from part 3, it might indicate overfitting to the training data.

Generalization of the Model
The model's ability to generalize well is indicated if the testing accuracy is close to the cross-validation accuracy from part 3. A large discrepancy would suggest issues like overfitting. Without the specific values, it's hard to make a definitive judgment.

3. (3 marks) Based on your results and the context of your dataset, did the best model perform "well enough" to be used out in the real-world? Why or why not? Do you have any suggestions for how you could improve this analysis?

Real-World Applicability of the Model
Deciding if the model performs "well enough" for real-world application depends on several factors:

* Context of the Problem: For some applications, even a small improvement in accuracy can be significant, while for others, higher accuracy is essential.
* Baseline Performance: How does the model's performance compare to a baseline measure? For example, what would be the accuracy if one were to always predict the most frequent class?
* Cost of Misclassification: In some cases, the consequences of false positives or false negatives can be critical. The acceptability of the model depends on how critical these errors are in the context of your application.

Suggestions for Improvement

1. Feature Engineering: More sophisticated feature engineering could potentially improve the model's performance. This could include creating new features or transforming existing ones in ways more amenable to the model.
2. Model Complexity: Experimenting with more complex models or different algorithms could yield better results, especially if the problem is not linearly separable.
3. Hyperparameter Tuning: Further tuning of hyperparameters, perhaps using a different range of values or different methods like RandomizedSearchCV, might find a better set of parameters.
4. Data Quality and Quantity: More data, if available, can help, especially if the additional data cover a wider range of scenarios. Improving the quality of data, handling missing values more effectively, or more sophisticated handling of outliers can also be beneficial.
5. Consideration of Other Metrics: Depending on the business context, other metrics like Precision, Recall, F1 Score, or ROC-AUC might be more relevant and should be considered alongside accuracy.
6. Model Interpretability: Understanding why the model makes certain predictions can be as important as its accuracy, especially in real-world applications. Techniques like SHAP or LIME can be used for this.

## Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
The code was sourced from various soruces like class notes, previous assignments and external website. 

2. In what order did you complete the steps?

* Understanding the Requirement: First, I reviewed your request to understand the task – implementing a machine learning pipeline to evaluate a model's performance.
* Identifying Key Components: Next, I identified the key components necessary for the code: data preparation (splitting into training and test sets), pipeline creation (including preprocessing and model training), and evaluation (calculating accuracy).
* Code Generation: I then generated the code in a logical sequence: data loading and splitting, pipeline creation with preprocessing and model definition, fitting the model using grid search, and finally evaluating the model on test data.

3. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
* The prompts were the instructions and queries regarding the implementation of the machine learning models depending on the requirements.
* The Modifications were done for specific details like the target variable and feature selection were based on assumptions and might need adjustment according to the exact dataset and task.

4. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?
* Lack of Specific Details: The primary challenge was the lack of specific details about the target variable and features in your dataset. Assumptions were made for these, which might not align perfectly with your actual use case.
* Success Factors: My extensive training across a wide range of contexts and scenarios in machine learning and Python programming was pivotal in generating an accurate and relevant response.

*DESCRIBE YOUR PROCESS HERE*

## Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.


* While working on this assignment, I found it interesting and motivating to help with a machine learning task involving code generation. 
* It's always satisfying to work with practical coding tasks and implementing machine learning pipelines. 
* However, one challenge I encountered was the lack of specific details about the dataset, which made it necessary to make assumptions in the code. 
* Clearer specifications would have been helpful to tailor the code even more accurately to the needs.

