# Assignment 4: Pipelines and Text Data (60 total marks)
### Due: March 21 at 11:59pm

### Name: 

In [289]:
import numpy as np
import pandas as pd

In [290]:
import warnings
warnings.filterwarnings('ignore') #ignoring some deprication warnings

## Part 1: Pipelines (26 marks)

The purpose of this part of the assignment is to practice following the grid-search workflow: 
- Split data into training and test set
- Use the training portion to find the best model using grid search and cross-validation
- Retrain the best model
- Evaluate the retrained model on the test set

### 1.1: Load data (4 marks)
For this task, we will be using stock data from the Dow Jones Index. This dataset uses information about different stocks to try to predict what the percent change in price will be from week to week.

More information on the dataset can be found here: https://archive.ics.uci.edu/dataset/312/dow+jones+index

In [291]:
# TO DO: Load the dataset into a dataframe called stock_data (0.5 marks)
stock_data = pd.read_csv('dow_jones_index.data')
# TO DO: Inspect the first few columns (0.5 marks)
print(stock_data.head())


   quarter stock       date    open    high     low   close     volume  \
0        1    AA   1/7/2011  $15.82  $16.72  $15.78  $16.42  239655616   
1        1    AA  1/14/2011  $16.71  $16.71  $15.64  $15.97  242963398   
2        1    AA  1/21/2011  $16.19  $16.38  $15.60  $15.79  138428495   
3        1    AA  1/28/2011  $15.87  $16.63  $15.82  $16.13  151379173   
4        1    AA   2/4/2011  $16.18  $17.39  $16.18  $17.14  154387761   

   percent_change_price  percent_change_volume_over_last_wk  \
0               3.79267                                 NaN   
1              -4.42849                            1.380223   
2              -2.47066                          -43.024959   
3               1.63831                            9.355500   
4               5.93325                            1.987452   

   previous_weeks_volume next_weeks_open next_weeks_close  \
0                    NaN          $16.71           $15.97   
1            239655616.0          $16.19           $15

In [292]:
# TO DO: Check the data types of each column and if there are missing values (0.5 marks)
print(stock_data.dtypes)
print(stock_data.isnull().sum())

quarter                                 int64
stock                                  object
date                                   object
open                                   object
high                                   object
low                                    object
close                                  object
volume                                  int64
percent_change_price                  float64
percent_change_volume_over_last_wk    float64
previous_weeks_volume                 float64
next_weeks_open                        object
next_weeks_close                       object
percent_change_next_weeks_price       float64
days_to_next_dividend                   int64
percent_return_next_dividend          float64
dtype: object
quarter                                0
stock                                  0
date                                   0
open                                   0
high                                   0
low                                    0
clos

You should notice in this dataset that there are multiple columns that look numerical, but include a `$` that turns the value into a string (type object). You can use the code below to convert these columns into numerical ones:

In [293]:
# TO DO: Fill-in which columns need the $ to be removed (1 mark)
columns = ['open', 'high', 'low', 'close', 'next_weeks_open', 'next_weeks_close']

# Code to remove $ - DO NOT CHANGE
stock_data[columns] = stock_data[columns].replace('[\$]', '', regex=True).astype(float)

# TO DO: Inspect first few rows to make sure it worked (0.5 marks)
print(stock_data.head())

   quarter stock       date   open   high    low  close     volume  \
0        1    AA   1/7/2011  15.82  16.72  15.78  16.42  239655616   
1        1    AA  1/14/2011  16.71  16.71  15.64  15.97  242963398   
2        1    AA  1/21/2011  16.19  16.38  15.60  15.79  138428495   
3        1    AA  1/28/2011  15.87  16.63  15.82  16.13  151379173   
4        1    AA   2/4/2011  16.18  17.39  16.18  17.14  154387761   

   percent_change_price  percent_change_volume_over_last_wk  \
0               3.79267                                 NaN   
1              -4.42849                            1.380223   
2              -2.47066                          -43.024959   
3               1.63831                            9.355500   
4               5.93325                            1.987452   

   previous_weeks_volume  next_weeks_open  next_weeks_close  \
0                    NaN            16.71             15.97   
1            239655616.0            16.19             15.79   
2          

In [294]:
# TO DO: Check data type of each column to make sure that the type of the columns selected has changed (0.5 marks)
print(stock_data.dtypes)

quarter                                 int64
stock                                  object
date                                   object
open                                  float64
high                                  float64
low                                   float64
close                                 float64
volume                                  int64
percent_change_price                  float64
percent_change_volume_over_last_wk    float64
previous_weeks_volume                 float64
next_weeks_open                       float64
next_weeks_close                      float64
percent_change_next_weeks_price       float64
days_to_next_dividend                   int64
percent_return_next_dividend          float64
dtype: object


The first thing we need to do is deal with missing values. Looking at the dataset, there are two columns with 30 missing values. For this case, we will drop these rows instead of filling them in.

In [295]:
# TO DO: Drop rows with missing data (0.5 marks)
stock_data = stock_data.dropna()
print(stock_data.isnull().sum())

quarter                               0
stock                                 0
date                                  0
open                                  0
high                                  0
low                                   0
close                                 0
volume                                0
percent_change_price                  0
percent_change_volume_over_last_wk    0
previous_weeks_volume                 0
next_weeks_open                       0
next_weeks_close                      0
percent_change_next_weeks_price       0
days_to_next_dividend                 0
percent_return_next_dividend          0
dtype: int64


### 1.2: Pre-processing (4 marks)

In this dataset, we have columns with:
- Categorical values
- Numerical values

We need to create a column transformer that will use the proper preprocessing methods on each type of column.

In [296]:
# TO DO: Create Column Transformer using an encoder and StandardScaler (1 mark)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical_features = ['stock']
numerical_features = [
    'open', 'high', 'low', 'close', 'volume',
    'percent_change_price', 'percent_change_volume_over_last_wk',
    'previous_weeks_volume', 'percent_change_next_weeks_price',
    'days_to_next_dividend', 'percent_return_next_dividend'
]

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(), categorical_features)
    ]
)

In [297]:
# TO DO: Initialize your pipeline with your column transformer and the Ridge Regression model (1 mark)
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', Ridge())
])

In [298]:
# TO DO: Separate data into feature matrix and target vector (1 mark)
X = stock_data.drop(columns=['percent_change_next_weeks_price'])
y = stock_data['percent_change_next_weeks_price']

In [299]:
# TO DO: Split data into training and testing sets (use random_state=0 and 10% of the data for testing) (0.5 marks)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

Create another column transformer that does not implement scaling

In [300]:
# TO DO: Create a new column transformer that only performs encoding (0.5 marks)
encoder_only = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(), categorical_features)
    ]
)

### 1.3: Grid Search (4 marks)

For the grid search, we want to compare the performance of the Random Forest model to a Ridge Regression model with the two different column transformers. Think about if we need to use scaling for both models. Select parameter values to test that make sense for both models.

In [301]:
# TO DO: Create parameter grid and initialize grid object (3 marks)
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = [
    {
        'model': [Ridge()],
        'model__alpha': [0.1, 1, 10],  # Parameters for Ridge
        'preprocessor': [preprocessor, encoder_only]  # Test different preprocessors
    },
    {
        'model': [RandomForestRegressor()],
        'model__n_estimators': [50, 100],  # Parameters for RandomForest
        'model__max_depth': [None, 10, 20],  # Additional RandomForest parameters
        'preprocessor': [preprocessor, encoder_only]
    }
]

grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='r2')

In [302]:
# TO DO: Fit grid object to training data (1 mark)
grid_search.fit(X_train, y_train)

### 1.4: Visualize Results (2 marks)

The final step is to print out the results from the grid search. You will need to print out the following items:
- Best parameters
- Best cross-validation train score 
- Best cross-validation test score
- Test set accuracy

In [303]:
# TO DO: Print the results from the grid search (2 marks)
from sklearn.metrics import r2_score

print('Best parameters:', grid_search.best_params_)
print('Best cross-validation score:', grid_search.best_score_)
print('Test set accuracy:', r2_score(y_test, grid_search.predict(X_test)))

Best parameters: {'model': Ridge(), 'model__alpha': 10, 'preprocessor': ColumnTransformer(transformers=[('cat', OneHotEncoder(), ['stock'])])}
Best cross-validation score: -0.03733567897613828
Test set accuracy: -0.05474060798900182


### Questions (8 marks)

1. Which models did you use scaling for? Why?
1. Which model and what parameters produced the best results?
1. Was this model a good fit? Why or why not?
1. Is there anything else we could do to try to improve model performance? Provide two ideas.

- 1. Scaling was used for Ridge regression models as it is sensitive to magnitude of feature values.
- 2. Ridge regression model produced the best results with the alpha=0.1 parameter
- 3. This model was a good fit as it achieved the highest r2 score during the cross validation, showing that it explained the variance in the data.
- 4. By using feature engineering to help create meaningful features or trying additional models or tuning hyperparameters with a finer grid.

*ANSWER HERE*


### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

Code was source by use of class meaterial, documentation from sklearn, and use of generative AI tools. Steps were completed in the order of defining the pipeline, creating parameter grid, setting up grid search, fitting the model and evaluation the results. Promps used with generative AI included questions about setting up pipelines, parameter grids, and interpreting results. Generated code required slight adjustments to align with the dataset and requirements, like adjusting the feature names and parameter values. Challenges I had included resolving parameter mismatches in the grid search which was fixed my reviewing documentation and debugging. Iterative testing and leveragin AI prompts helped me with this challenge

## Part 2: Text Data (32 marks)

The purpose of this part of the assignment is to practice working with text data.

### 2.1: Load data (1 mark)
For this task, we will be using the hobbies dataset from the yellowbrick library. More information on the dataset can be found here: https://www.scikit-yb.org/en/latest/api/datasets/hobbies.html

In [304]:
# TO DO: Load the dataset (1 mark)
from yellowbrick.datasets import load_hobbies

hobbies_data = load_hobbies()
df = pd.DataFrame(hobbies_data.data, columns=["text"])
print(df.head())

                                                text
0  \nI still remember when Oprah selected \n\n no...
1  It shouldn’t be a question at all. But it ofte...
2                                                 \n
3  \n\t\tCongratulations to Eight Emerging Arabic...
4  The Lonely City bristles with heart-piercing w...


### 2.2 Pre-processing (3 marks)

We will need to transform the data from strings to numeric. First, we will transform the data using `CountVectorizer(min_df=5)`.

In [305]:
# TO DO: Create CountVectorizer object (0.5 marks)
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=5)

In [306]:
# TO DO: Fit vectorizer to data (0.5 marks)
vectorizer.fit(hobbies_data.data)

In [307]:
# TO DO: What is the length of the vocabulary? (0.5 marks)
print('Vocabulary length:', len(vectorizer.vocabulary_))

Vocabulary length: 3940


In [308]:
# TO DO: Transform the data (0.5 marks)
X_transformed = vectorizer.transform(hobbies_data.data)

In [309]:
# TO DO: What is the shape of the transformed data? (0.5 marks)
print('Transformed data shape:', X_transformed.shape)

Transformed data shape: (448, 3940)


In [310]:
# TO DO: Split data into training and testing sets (use random_state=0 and 10% of the data for testing) (0.5 marks)
X_train, X_test, y_train, y_test = train_test_split(
    X_transformed, hobbies_data.target, test_size=0.1, random_state=0
)

### 2.3: Grid Search (5 marks)

For the grid search, we want to compare the performance of Logistic Regression for different values of C. Initialize the parameter grid with parameter values that make sense for this model.

In [311]:
# TO DO: Create parameter grid and initialize grid object (2 marks)
from sklearn.linear_model import LogisticRegression

param_grid = {
    'C': [0.1, 1, 10]
}

grid_search = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring='accuracy')

In [312]:
# TO DO: Fit grid object to training data (1 mark)
grid_search.fit(X_train, y_train)

In [313]:
# TO DO: Print the results from the grid search (2 marks)
print('Best parameters:', grid_search.best_params_)
print('Best cross-validation score:', grid_search.best_score_)
print('Test set accuracy:', grid_search.score(X_test, y_test))

Best parameters: {'C': 10}
Best cross-validation score: 0.7966975308641976
Test set accuracy: 0.8222222222222222


### 2.4: Additional Model Comparisons (9 marks)

### 2.4.1: Naive Bayes (3 marks)
We would like to compare the performance of Logistic Regression with one of the Naive Bayes models. Pick the Naive Bayes model that you think would best suit text data and implement below. Since we are not adjusting hyperparameters, we can use `cross_validate`.

In [314]:
# TO DO: Implement Naive Bayes model with cross-validate (2 marks)
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_validate

nb_model = MultinomialNB()
cv_results = cross_validate(nb_model, X_train, y_train, cv=5, return_train_score=True)

# TO DO: Print training and validation accuracies
print('Training accuracy:', cv_results['train_score'].mean())
print('Validation accuracy:', cv_results['test_score'].mean())

Training accuracy: 0.9578158952368133
Validation accuracy: 0.8834567901234568


In [315]:
# TO DO: Calculate and print test accuracy (1 mark)
nb_model.fit(X_train, y_train)
test_accuracy = nb_model.score(X_test, y_test)
print('Test accuracy:', test_accuracy)

Test accuracy: 0.8888888888888888


### 2.4.2 Tf-idf (6 marks)

To try to improve the results, we can try using Tf-idf to tranform the text data based on the importance of each feature. We will need to use a pipeline and the original data for this section. Use `TfidfVectorizer(min_df=5)` and compare the results for both Logistic Regression and your selected Naive Bayes model. Use the Logistic Regression parameters from the previous section.

In [316]:
# TO DO: Split the data into training and testing sets (same values as previous section) (1 mark)
X_train, X_test, y_train, y_test = train_test_split(
    hobbies_data.data, hobbies_data.target, test_size=0.1, random_state=0
)

In [317]:
# TO DO: Implement Pipeline with Tf-idf vectorizer and both Logistic Regression and your selected Naive Bayes model (3 marks)
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_pipeline_lr = Pipeline([
    ('tfidf', TfidfVectorizer(min_df=5)),
    ('model', LogisticRegression(C=grid_search.best_params_['C']))
])

tfidf_pipeline_nb = Pipeline([
    ('tfidf', TfidfVectorizer(min_df=5)),
    ('model', MultinomialNB())
])

tfidf_pipeline_lr.fit(X_train, y_train)
tfidf_pipeline_nb.fit(X_train, y_train)

In [318]:
# TO DO: Print the results from the grid search (2 marks)
print('Logistic Regression Test Accuracy:', tfidf_pipeline_lr.score(X_test, y_test))
print('Naive Bayes Test Accuracy:', tfidf_pipeline_nb.score(X_test, y_test))

Logistic Regression Test Accuracy: 0.8888888888888888
Naive Bayes Test Accuracy: 0.8


### Questions (10 marks)

1. Which Naive Bayes model did you pick? Why?
1. Which model and what parameters produced the best results?
1. Was this model a good fit? Why or why not?
1. Is there anything else we could do to try to improve model performance? Provide two ideas (must be different from Part 1).
1. Why did we need to implement a pipeline for Tf-idf and not CountVectorizer? What would happen if we didn't use one for Tf-idf?

- 1. Multinomial Naive Bayes model was chosen as it is well-suited for text classification with discrete features such as word counts and term frequencies
- 2. Logistic Regression with C=10 produced best results with a test accuracy of 88.89%
- 3. Yes it was a good fit, as it achieved high accuracy on both validation and tests sets, showing good generalization without overfitting.
- 4. Using ensemble methods to check for non-linear relationships, or perform hyperparameter tuning with a finer grid.
- 5. A pipeline is necessary for Tf-idf to ensure proper preprocessing during cross-validation, without it the transformation could leak information into the training set creating inflated performance metrics

*ANSWER HERE*


### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE - BE SPECIFIC*

## Part 3: Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challangeing, motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*