# Pipeline Project

You will be using the provided data to create a machine learning model pipeline.

You must handle the data appropriately in your pipeline to predict whether an
item is recommended by a customer based on their review.
Note the data includes numerical, categorical, and text data.

You should ensure you properly train and evaluate your model.

## The Data

The dataset has been anonymized and cleaned of missing values.

There are 8 features to use to predict whether a customer recommends or does
not recommend a product.
The `Recommended IND` column gives whether a customer recommends the product,
where `1` is recommended and `0` is not recommended.
This is your model's target.

The features can be summarized as the following:

- **Clothing ID**: Integer categorical variable that refers to the specific piece being reviewed.
- **Age**: Positive integer variable of the reviewer's age.
- **Title**: String variable for the title of the review.
- **Review Text**: String variable for the review body.
- **Positive Feedback Count**: Non-negative integer documenting the number of other customers who found this review positive.
- **Division Name**: Categorical name of the product's high-level division.
- **Department Name**: Categorical name of the product's department.
- **Class Name**: Categorical name of the product's class.

The target:
- **Recommended IND**: Binary variable stating whether the customer recommends the product, where 1 is recommended and 0 is not recommended.

## Load Data

In [4]:
import pandas as pd

# Load data
df = pd.read_csv(
    'data/reviews.csv',
)

df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18442 entries, 0 to 18441
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Clothing ID              18442 non-null  int64 
 1   Age                      18442 non-null  int64 
 2   Title                    18442 non-null  object
 3   Review Text              18442 non-null  object
 4   Positive Feedback Count  18442 non-null  int64 
 5   Division Name            18442 non-null  object
 6   Department Name          18442 non-null  object
 7   Class Name               18442 non-null  object
 8   Recommended IND          18442 non-null  int64 
dtypes: int64(4), object(5)
memory usage: 1.3+ MB


|   | Clothing ID | Age | Title | Review Text | Positive Feedback Count | Division Name | Department Name | Class Name | Recommended IND |
| - | ----------- | --- | ----- | ----------- | ----------------------- | ------------- | --------------- | ---------- | --------------- |
| 0 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 0 | General | Dresses | Dresses | 0 |
| 1 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 0 | General Petite | Bottoms | Pants | 1 |
| 2 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 6 | General | Tops | Blouses | 1 |
| 3 | 1080 | 49 | Not for the very petite | I love tracy reese dresses, but this one is no... | 4 | General | Dresses | Dresses | 0 |
| 4 | 858 | 39 | Cagrcoal shimmer fun | I aded this in my basket at hte last mintue to... | 1 | General Petite | Tops | Knits | 1 |


## Preparing features (`X`) & target (`y`)

In [5]:
data = df

# separate features from labels
X = data.drop('Recommended IND', axis=1)
y = data['Recommended IND'].copy()

print('Labels:', y.unique())
print('Features:')
display(X.head())

Labels: [0 1]
Features:


|   | Clothing ID | Age | Title | Review Text | Positive Feedback Count | Division Name | Department Name | Class Name |
| - | ----------- | --- | ----- | ----------- | ----------------------- | ------------- | --------------- | ---------- |
| 0 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 0 | General | Dresses | Dresses |
| 1 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 0 | General Petite | Bottoms | Pants |
| 2 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 6 | General | Tops | Blouses |
| 3 | 1080 | 49 | Not for the very petite | I love tracy reese dresses, but this one is no... | 4 | General | Dresses | Dresses |
| 4 | 858 | 39 | Cagrcoal shimmer fun | I aded this in my basket at hte last mintue to... | 1 | General Petite | Tops | Knits |


In [6]:
# Split data into train and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.1,
    shuffle=True,
    random_state=27,
)

# Your Work

## Data Exploration

In [7]:
X_train.head()

|       | Clothing ID | Age | Title | Review Text | Positive Feedback Count | Division Name | Department Name | Class Name |
| ----- | ----------- | --- | ----- | ----------- | ----------------------- | ------------- | --------------- | ---------- |
| 893 | 1060 | 37 | Super cute. pockets would be nice | Easy and fun jumper. runs slightly large. i or... | 2 | General Petite | Bottoms | Pants |
| 1767 | 1072 | 23 | Great for all seasons | The dress looks great both in winter and summe... | 0 | General Petite | Dresses | Dresses |
| 4491 | 1078 | 41 | Just ok | I wanted to love this dress as it seemed perfe... | 10 | General | Dresses | Dresses |
| 17626 | 862 | 52 | Cute but... | I loved this shirt when i purchased it but it ... | 6 | General Petite | Tops | Knits |
| 11184 | 1083 | 28 | Grandmas draperies dress | I had to review this because i purchased befor... | 3 | General | Dresses | Dresses |


In [8]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16597 entries, 893 to 5139
Data columns (total 8 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Clothing ID              16597 non-null  int64 
 1   Age                      16597 non-null  int64 
 2   Title                    16597 non-null  object
 3   Review Text              16597 non-null  object
 4   Positive Feedback Count  16597 non-null  int64 
 5   Division Name            16597 non-null  object
 6   Department Name          16597 non-null  object
 7   Class Name               16597 non-null  object
dtypes: int64(3), object(5)
memory usage: 1.1+ MB


In [9]:
print("Number of missing values per column:\n")
print(X_train.isna().sum())

Number of missing values per column:

Clothing ID                0
Age                        0
Title                      0
Review Text                0
Positive Feedback Count    0
Division Name              0
Department Name            0
Class Name                 0
dtype: int64


In [10]:
print("Target values (y_train) are: ", y_train.unique())

Target values (y_train) are:  [1 0]
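
Because the target is binary, it can help to look at how the two classes are distributed before settling on an evaluation metric. A minimal sketch, using a small made-up label series (`toy_y` is hypothetical and is not the notebook's `y_train`):

```python
import pandas as pd

# Hypothetical labels for illustration only; in the notebook this would be y_train.
toy_y = pd.Series([1, 1, 1, 0, 1, 0, 1, 1], name='Recommended IND')

# Share of each class; a strong imbalance suggests looking beyond plain accuracy.
class_share = toy_y.value_counts(normalize=True)
print(class_share)
```

If one class dominates, metrics such as precision, recall, or F1 may be more informative than accuracy alone.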


## Building Pipeline

### Splitting Numerical, Categorical and Text Data


| Feature group | Features                                                |
| ------------ | ------------------------------------------------------- |
| Numerical    | Age, Positive Feedback Count                            |
| Categorical  | Clothing ID, Division Name, Department Name, Class Name |
| Text         | Title, Review Text                                      |


In [11]:
num_features = [
    'Age',
    'Positive Feedback Count'
]
print('Numerical features:', num_features)

cat_features = [
    'Clothing ID',
    'Division Name',
    'Department Name',
    'Class Name',
]
print('Categorical features:', cat_features)

text_features = [
    'Title',
    'Review Text',
]
print('Review Text features:', text_features)

Numerical features: ['Age', 'Positive Feedback Count']
Categorical features: ['Clothing ID', 'Division Name', 'Department Name', 'Class Name']
Review Text features: ['Title', 'Review Text']


### Numerical Feature Pipeline

In [12]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])
num_pipeline

0,1,2
,steps,"[('imputer', ...), ('scaler', ...)]"
,transform_input,
,memory,
,verbose,False

0,1,2
,missing_values,
,strategy,'mean'
,fill_value,
,copy,True
,add_indicator,False
,keep_empty_features,False

0,1,2
,copy,True
,with_mean,True
,with_std,True


### Categorical Feature Pipeline

In [13]:
from sklearn.preprocessing import OneHotEncoder

cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(sparse_output=False, handle_unknown='ignore'))
])
cat_pipeline

0,1,2
,steps,"[('imputer', ...), ('encoder', ...)]"
,transform_input,
,memory,
,verbose,False

0,1,2
,missing_values,
,strategy,'most_frequent'
,fill_value,
,copy,True
,add_indicator,False
,keep_empty_features,False

0,1,2
,categories,'auto'
,drop,
,sparse_output,False
,dtype,<class 'numpy.float64'>
,handle_unknown,'ignore'
,min_frequency,
,max_categories,
,feature_name_combiner,'concat'


### Text Feature Pipeline

In [None]:
#! python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [15]:
import spacy

nlp = spacy.load('en_core_web_sm')

#### Character Count

In [16]:
from sklearn.base import BaseEstimator, TransformerMixin


class CountCharacter(BaseEstimator, TransformerMixin):
    def __init__(self, character: str):
        self.character = character

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [[text.count(self.character)] for text in X]

#### Review Length (Word Count)

In [17]:
class WordCount(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        word_counts = [len(str(text).split()) for text in X]
        return np.array(word_counts).reshape(-1, 1)

#### Ratio Uppercase

In [18]:
class UppercaseRatio(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        ratios = []
        for text in X:
            text = str(text)
            total_chars = len(text)
            if total_chars == 0:
                ratios.append(0)
            else:
                uppercase_chars = sum(1 for c in text if c.isupper())
                ratios.append(uppercase_chars / total_chars)
        return np.array(ratios).reshape(-1, 1)
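
A quick sanity check of the two transformers on a few made-up strings can catch shape or edge-case problems early. The sketch below repeats compact versions of the classes so it runs on its own; `toy_reviews` is hypothetical data:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion


class WordCount(BaseEstimator, TransformerMixin):
    """Compact copy of the word-count logic, repeated so this block is self-contained."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([len(str(text).split()) for text in X]).reshape(-1, 1)


class UppercaseRatio(BaseEstimator, TransformerMixin):
    """Compact copy of the uppercase-ratio logic; empty strings map to 0."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        ratios = []
        for text in X:
            text = str(text)
            upper = sum(1 for c in text if c.isupper())
            ratios.append(upper / len(text) if text else 0.0)
        return np.array(ratios).reshape(-1, 1)


# Hypothetical reviews, including an empty string as an edge case.
toy_reviews = ['LOVE it', 'ok', '']

union = FeatureUnion([('wc', WordCount()), ('upper', UppercaseRatio())])
features = union.fit_transform(toy_reviews)
print(features)  # one row per review: [word count, uppercase ratio]
```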

In [19]:
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer
import numpy as np

initial_text_preprocess = Pipeline([
    (
        'dimension_reshaper',
        FunctionTransformer(
            np.reshape,
            kw_args={'newshape':-1},
        ),
    ),
])

feature_engineering = FeatureUnion([
    ('count_spaces', CountCharacter(character=' ')),
    ('count_exclamations', CountCharacter(character='!')),
    ('count_question_marks', CountCharacter(character='?')),
    ('review_length', WordCount()),
    ('uppercase_ratio', UppercaseRatio()),
])

character_counts_pipeline = Pipeline([
    (
        'initial_text_preprocess',
        initial_text_preprocess,
    ),
    (
        'feature_engineering',
        feature_engineering,
    ),
])
character_counts_pipeline

0,1,2
,steps,"[('initial_text_preprocess', ...), ('feature_engineering', ...)]"
,transform_input,
,memory,
,verbose,False

0,1,2
,steps,"[('dimension_reshaper', ...)]"
,transform_input,
,memory,
,verbose,False

0,1,2
,func,<function res...t 0x105debdf0>
,inverse_func,
,validate,False
,accept_sparse,False
,check_inverse,True
,feature_names_out,
,kw_args,{'newshape': -1}
,inv_kw_args,

0,1,2
,transformer_list,"[('count_spaces', ...), ('count_exclamations', ...), ...]"
,n_jobs,
,transformer_weights,
,verbose,False
,verbose_feature_names_out,True

0,1,2
,character,' '

0,1,2
,character,'!'

0,1,2
,character,'?'


#### spaCy and TF-IDF

In [20]:
class SpacyLemmatizer(BaseEstimator, TransformerMixin):
    def __init__(self, nlp):
        self.nlp = nlp

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        lemmatized = [
            ' '.join(
                token.lemma_ for token in doc
                if not token.is_stop
            )
            for doc in self.nlp.pipe(X)
        ]
        return lemmatized

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_pipeline = Pipeline([
    (
        'dimension_reshaper',
        FunctionTransformer(
            np.reshape,
            kw_args={'newshape':-1},
        ),
    ),
    (
        'lemmatizer',
        SpacyLemmatizer(nlp=nlp),
    ),
    (
        'tfidf_vectorizer',
        TfidfVectorizer(
            stop_words='english',
        ),
    ),
])
tfidf_pipeline 

0,1,2
,steps,"[('dimension_reshaper', ...), ('lemmatizer', ...), ...]"
,transform_input,
,memory,
,verbose,False

0,1,2
,func,<function res...t 0x105debdf0>
,inverse_func,
,validate,False
,accept_sparse,False
,check_inverse,True
,feature_names_out,
,kw_args,{'newshape': -1}
,inv_kw_args,

0,1,2
,nlp,<spacy.lang.e...t 0x16ce19850>

0,1,2
,input,'content'
,encoding,'utf-8'
,decode_error,'strict'
,strip_accents,
,lowercase,True
,preprocessor,
,tokenizer,
,analyzer,'word'
,stop_words,'english'
,token_pattern,'(?u)\\b\\w\\w+\\b'


#### Sentiment Analysis

In [22]:
from transformers import pipeline

In [23]:
import torch

In [24]:
class SentimentAnalyzer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.sentiment_pipeline = pipeline("sentiment-analysis")

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        sentiments = self.sentiment_pipeline(list(X))
        scores = [s['score'] if s['label'] == 'POSITIVE' else -s['score'] for s in sentiments]
        return np.array(scores).reshape(-1, 1)
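
The label-to-signed-score mapping in `transform` can be checked without loading a model. A minimal sketch, assuming the pipeline returns a list of dicts with `label` and `score` keys as the code above expects (`toy_sentiments` is made-up output, not a real model prediction):

```python
import numpy as np

# Hypothetical pipeline output in the list-of-dicts shape used above.
toy_sentiments = [
    {'label': 'POSITIVE', 'score': 0.98},
    {'label': 'NEGATIVE', 'score': 0.75},
]

# Positive predictions keep their confidence; negative ones get a minus sign,
# so the feature ranges from -1 (confidently negative) to 1 (confidently positive).
signed_scores = np.array([
    s['score'] if s['label'] == 'POSITIVE' else -s['score']
    for s in toy_sentiments
]).reshape(-1, 1)
print(signed_scores)
```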


In [25]:
sentiment_pipeline = Pipeline([
    ('initial_text_preprocess', initial_text_preprocess),
    ('sentiment_analyzer', SentimentAnalyzer())
])

sentiment_pipeline

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


0,1,2
,steps,"[('initial_text_preprocess', ...), ('sentiment_analyzer', ...)]"
,transform_input,
,memory,
,verbose,False

0,1,2
,steps,"[('dimension_reshaper', ...)]"
,transform_input,
,memory,
,verbose,False

0,1,2
,func,<function res...t 0x105debdf0>
,inverse_func,
,validate,False
,accept_sparse,False
,check_inverse,True
,feature_names_out,
,kw_args,{'newshape': -1}
,inv_kw_args,


### Full Feature Engineering Pipeline

In [27]:
from sklearn.compose import ColumnTransformer

feature_engineering = ColumnTransformer([
        ('num', num_pipeline, num_features),
        ('cat', cat_pipeline, cat_features),
        ('character_counts', character_counts_pipeline, text_features),
        ('tfidf_text', tfidf_pipeline, text_features),
        ('sentiment', sentiment_pipeline, text_features),
])

feature_engineering

0,1,2
,transformers,"[('num', ...), ('cat', ...), ...]"
,remainder,'drop'
,sparse_threshold,0.3
,n_jobs,
,transformer_weights,
,verbose,False
,verbose_feature_names_out,True
,force_int_remainder_cols,'deprecated'

0,1,2
,missing_values,
,strategy,'mean'
,fill_value,
,copy,True
,add_indicator,False
,keep_empty_features,False

0,1,2
,copy,True
,with_mean,True
,with_std,True

0,1,2
,missing_values,
,strategy,'most_frequent'
,fill_value,
,copy,True
,add_indicator,False
,keep_empty_features,False

0,1,2
,categories,'auto'
,drop,
,sparse_output,False
,dtype,<class 'numpy.float64'>
,handle_unknown,'ignore'
,min_frequency,
,max_categories,
,feature_name_combiner,'concat'

0,1,2
,steps,"[('dimension_reshaper', ...)]"
,transform_input,
,memory,
,verbose,False

0,1,2
,func,<function res...t 0x105debdf0>
,inverse_func,
,validate,False
,accept_sparse,False
,check_inverse,True
,feature_names_out,
,kw_args,{'newshape': -1}
,inv_kw_args,

0,1,2
,transformer_list,"[('count_spaces', ...), ('count_exclamations', ...), ...]"
,n_jobs,
,transformer_weights,
,verbose,False
,verbose_feature_names_out,True

0,1,2
,character,' '

0,1,2
,character,'!'

0,1,2
,character,'?'

0,1,2
,func,<function res...t 0x105debdf0>
,inverse_func,
,validate,False
,accept_sparse,False
,check_inverse,True
,feature_names_out,
,kw_args,{'newshape': -1}
,inv_kw_args,

0,1,2
,nlp,<spacy.lang.e...t 0x16ce19850>

0,1,2
,input,'content'
,encoding,'utf-8'
,decode_error,'strict'
,strip_accents,
,lowercase,True
,preprocessor,
,tokenizer,
,analyzer,'word'
,stop_words,'english'
,token_pattern,'(?u)\\b\\w\\w+\\b'

0,1,2
,steps,"[('dimension_reshaper', ...)]"
,transform_input,
,memory,
,verbose,False

0,1,2
,func,<function res...t 0x105debdf0>
,inverse_func,
,validate,False
,accept_sparse,False
,check_inverse,True
,feature_names_out,
,kw_args,{'newshape': -1}
,inv_kw_args,


## Training Pipeline
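
One way this step could look is to chain the preprocessing with a classifier in a single `Pipeline`, fit it on the training split, and score it on the held-out split. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_classification`, a `StandardScaler` standing in for the notebook's `ColumnTransformer`, and `LogisticRegression` as an arbitrary classifier choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the notebook would use X_train / y_train instead.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=27)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, shuffle=True, random_state=27,
)

# 'preprocess' stands in for the feature engineering step built above.
model_pipeline = Pipeline([
    ('preprocess', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000)),
])
model_pipeline.fit(X_tr, y_tr)

# Evaluate only on data the pipeline has not seen during fitting.
y_pred = model_pipeline.predict(X_te)
accuracy = accuracy_score(y_te, y_pred)
print(f'Accuracy: {accuracy:.3f}')
```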

## Fine-Tuning Pipeline
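
Hyperparameters of a fitted pipeline can be searched with cross-validation by addressing each step's parameters as `<step name>__<parameter>`. A minimal sketch, again on synthetic data and with an assumed `LogisticRegression` classifier; the grid values are arbitrary examples, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the notebook would search on X_train / y_train.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=27)

model_pipeline = Pipeline([
    ('preprocess', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000)),
])

# Keys follow the '<step name>__<parameter>' convention.
param_grid = {'classifier__C': [0.1, 1.0, 10.0]}

search = GridSearchCV(model_pipeline, param_grid, cv=3, scoring='accuracy')
search.fit(X_toy, y_toy)

print(search.best_params_)
print(f'Best CV accuracy: {search.best_score_:.3f}')
```

Because the search refits the whole pipeline on each fold, the preprocessing never sees the validation fold during fitting.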