# Pipeline Project

You will be using the provided data to create a machine learning model pipeline.

You must handle the data appropriately in your pipeline to predict whether an
item is recommended by a customer based on their review.
Note the data includes numerical, categorical, and text data.

You should ensure you properly train and evaluate your model.

## The Data

The dataset has been anonymized and cleaned of missing values.

There are 8 features available to predict whether a customer recommends a
product.
The `Recommended IND` column records whether a customer recommends the product:
`1` means recommended and `0` means not recommended.
This is your model's target.

The features can be summarized as the following:

- **Clothing ID**: Integer categorical variable that refers to the specific piece being reviewed.
- **Age**: Positive integer variable of the reviewer's age.
- **Title**: String variable for the title of the review.
- **Review Text**: String variable for the review body.
- **Positive Feedback Count**: Non-negative integer documenting the number of other customers who found this review positive.
- **Division Name**: Categorical name of the product's high-level division.
- **Department Name**: Categorical name of the product's department.
- **Class Name**: Categorical name of the product's class.

The target:
- **Recommended IND**: Binary variable stating whether the customer recommends the product, where 1 is recommended and 0 is not recommended.
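
Because the target is binary, it can help to check how the two classes are distributed before reading any accuracy figure. The following is a minimal sketch on made-up labels (the `y_toy` values are an assumption, not the real data); it computes the share of each class and the accuracy of always predicting the most common one.

```python
import pandas as pd

# Toy labels standing in for the `Recommended IND` column (made up for illustration)
y_toy = pd.Series([1, 1, 1, 0, 1, 0, 1, 1], name='Recommended IND')

# Share of each class, sorted by class label
class_share = y_toy.value_counts(normalize=True).sort_index()
print(class_share)

# Accuracy obtained by always predicting the most common class
baseline_accuracy = class_share.max()
print('Majority-class baseline accuracy:', baseline_accuracy)
```

A trained model is only informative to the extent that it beats this baseline.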

## Load Data

In [1]:
import pandas as pd

# Load data
df = pd.read_csv(
    'data/reviews.csv',
)

df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18442 entries, 0 to 18441
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Clothing ID              18442 non-null  int64 
 1   Age                      18442 non-null  int64 
 2   Title                    18442 non-null  object
 3   Review Text              18442 non-null  object
 4   Positive Feedback Count  18442 non-null  int64 
 5   Division Name            18442 non-null  object
 6   Department Name          18442 non-null  object
 7   Class Name               18442 non-null  object
 8   Recommended IND          18442 non-null  int64 
dtypes: int64(4), object(5)
memory usage: 1.3+ MB


|   | Clothing ID | Age | Title | Review Text | Positive Feedback Count | Division Name | Department Name | Class Name | Recommended IND |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 0 | General | Dresses | Dresses | 0 |
| 1 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 0 | General Petite | Bottoms | Pants | 1 |
| 2 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 6 | General | Tops | Blouses | 1 |
| 3 | 1080 | 49 | Not for the very petite | I love tracy reese dresses, but this one is no... | 4 | General | Dresses | Dresses | 0 |
| 4 | 858 | 39 | Cagrcoal shimmer fun | I aded this in my basket at hte last mintue to... | 1 | General Petite | Tops | Knits | 1 |


## Preparing features (`X`) & target (`y`)

In [2]:
data = df

# separate features from labels
X = data.drop('Recommended IND', axis=1)
y = data['Recommended IND'].copy()

print('Labels:', y.unique())
print('Features:')
display(X.head())

Labels: [0 1]
Features:


|   | Clothing ID | Age | Title | Review Text | Positive Feedback Count | Division Name | Department Name | Class Name |
|---|---|---|---|---|---|---|---|---|
| 0 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 0 | General | Dresses | Dresses |
| 1 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 0 | General Petite | Bottoms | Pants |
| 2 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 6 | General | Tops | Blouses |
| 3 | 1080 | 49 | Not for the very petite | I love tracy reese dresses, but this one is no... | 4 | General | Dresses | Dresses |
| 4 | 858 | 39 | Cagrcoal shimmer fun | I aded this in my basket at hte last mintue to... | 1 | General Petite | Tops | Knits |


In [3]:
# Split data into train and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.1,
    shuffle=True,
    random_state=27,
)
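
The split above shuffles rows without regard to the label. When one class is much rarer than the other, passing `stratify` keeps the class ratio the same in both splits. This is a hedged sketch on made-up data (`X_toy` and `y_toy` are assumptions), not a change to the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 80% positive labels (made up for illustration)
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 80 + [0] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy,
    y_toy,
    test_size=0.1,
    shuffle=True,
    stratify=y_toy,   # keep the class ratio the same in both splits
    random_state=27,
)

print('Train positive rate:', y_tr.mean())
print('Test positive rate:', y_te.mean())
```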

# Your Work

## Data Exploration

In [4]:
# check for missing values
X.isna().sum()

Clothing ID                0
Age                        0
Title                      0
Review Text                0
Positive Feedback Count    0
Division Name              0
Department Name            0
Class Name                 0
dtype: int64

In [5]:
# count reviews per division
X.groupby(['Division Name'])['Clothing ID'].count().sort_index(ascending=True)

Division Name
General           11664
General Petite     6778
Name: Clothing ID, dtype: int64

In [6]:
# count reviews per department
X.groupby(['Department Name'])['Clothing ID'].count().sort_index(ascending=True)

Department Name
Bottoms     3184
Dresses     5371
Intimate     188
Jackets      879
Tops        8713
Trend        107
Name: Clothing ID, dtype: int64

In [7]:
# count reviews per class (note the single 'Casual bottoms' row)
X.groupby(['Class Name'])['Clothing ID'].count().sort_index(ascending=True)

Class Name
Blouses           2587
Casual bottoms       1
Dresses           5371
Fine gauge         927
Jackets            598
Jeans              970
Knits             3981
Lounge             188
Outerwear          281
Pants             1157
Shorts             260
Skirts             796
Sweaters          1218
Trend              107
Name: Clothing ID, dtype: int64

In [8]:
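# group feature columns by the preprocessing each group will receive

# numeric features to be scaled; note that 'Clothing ID' is an integer
# identifier and is treated as a plain number here
num_features = (
    X[[
        'Positive Feedback Count',
        'Clothing ID',
    ]].columns
)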
print('Numeric features:', num_features)

cat_features = (
    X[[
        'Division Name',
        'Department Name',
        'Class Name',
    ]].columns
)
print('Categorical features:', cat_features)

bin_features = (
    X[[
        'Age',
    ]].columns
)
print('Binned features:', bin_features)

review_text_features = (
    X[[
        'Review Text',
    ]].columns
)
print('Review Text features:', review_text_features)

title_text_features = (
    X[[
        'Title',
    ]].columns
)
print('Title Text features:', title_text_features)

Numeric features: Index(['Positive Feedback Count', 'Clothing ID'], dtype='object')
Categorical features: Index(['Division Name', 'Department Name', 'Class Name'], dtype='object')
Binned features: Index(['Age'], dtype='object')
Review Text features: Index(['Review Text'], dtype='object')
Title Text features: Index(['Title'], dtype='object')


## Building Pipeline

In [9]:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer, StandardScaler, OneHotEncoder, OrdinalEncoder

In [10]:
# pipeline for numeric features that are not being binned: scale only
# (the imputer step is disabled; the data has no missing values)
num_pipeline = Pipeline([
#    (
#        'imputer',
#        SimpleImputer(strategy='mean'),
#    ),
    (
        'scaler',
        StandardScaler(),
    ),
])

num_pipeline

In [11]:
# pipeline for binned attributes, impute values + bin into 5 segments
bin_pipeline = Pipeline([
    (
        'imputer',
        SimpleImputer(
            strategy='most_frequent',
        )
    ),
    (
        'binner',
        KBinsDiscretizer(n_bins=5, encode='onehot', strategy='kmeans'),
    ),
])
bin_pipeline
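
To see what the `kmeans` binning strategy does, here is a minimal sketch on made-up ages in three obvious clusters. It uses `encode='ordinal'` rather than the one-hot encoding above, purely so the bin index of each value is easy to read; the toy values and the choice of 3 bins are assumptions.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Toy ages in three obvious clusters (made up for illustration)
ages = np.array([20, 21, 22, 40, 41, 42, 60, 61, 62], dtype=float).reshape(-1, 1)

# encode='ordinal' returns the bin index instead of one-hot columns,
# which makes the assignment easy to read
binner = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='kmeans')
bin_index = binner.fit_transform(ages)

print('Bin edges:', binner.bin_edges_[0])
print('Bin index per age:', bin_index.ravel())
```

Each cluster of ages lands in its own bin, with the inner edges placed midway between neighboring cluster centers.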

In [12]:
# pipeline for categorical features: impute + one-hot encode
cat_pipeline = Pipeline([
    (
        'imputer',
        SimpleImputer(
            strategy='most_frequent',
        )
    ),
    (
        'cat_encoder',
        OneHotEncoder(
            sparse_output=False,
            handle_unknown='ignore',
        )
    ),
])
cat_pipeline

In [13]:
#!pip install textblob
#!pip install spacy-transformers
# !pip install protobuf 

In [14]:
#!python -m spacy download en_core_web_sm

In [15]:
#!python -m spacy download bert-based-uncased

## NLP

In [16]:
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob
from textblob import TextBlob


nlp = spacy.load('en_core_web_sm')

# adding spacytextblob for sentiment scoring with doc._.blob.polarity
nlp.add_pipe('spacytextblob')
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x1cba3375490>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x1cb97acd490>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x1cba3465f50>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x1cba3609250>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x1cba3664450>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x1cba3466030>),
 ('spacytextblob',
  <spacytextblob.spacytextblob.SpacyTextBlob at 0x1cba5238190>)]

In [17]:
# testing polarity 
test_text = ['I love math' , 'I hate xyz', 'I feel neutral about Switzerland', 'generic set of words']
for item in test_text:
    doc = nlp(item)
    print(doc._.blob.polarity)

0.5
-0.8
0.0
0.0


In [18]:
from sklearn.base import BaseEstimator, TransformerMixin

# sentiment: takes X containing text features, returns polarity of each row for each feature
class Sentiment(BaseEstimator, TransformerMixin):
    def __init__(self, nlp):
        self.nlp = nlp

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [[self.nlp(text)._.blob.polarity] for text in X]
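
The transformer above follows the usual scikit-learn contract: `fit` returns `self`, and `transform` returns one row per input with one column per engineered feature. The sketch below shows the same contract without spaCy, so it runs anywhere; `toy_polarity` and `ScoreText` are hypothetical names introduced only for illustration, and the tiny word lists are not a real sentiment model.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


def toy_polarity(text):
    """Hypothetical stand-in for a real sentiment scorer."""
    positive, negative = {'love', 'great'}, {'hate', 'poor'}
    words = text.lower().split()
    score = sum(w in positive for w in words) - sum(w in negative for w in words)
    return float(score)


class ScoreText(BaseEstimator, TransformerMixin):
    """Apply a scoring function to each text and return one column."""

    def __init__(self, scorer):
        self.scorer = scorer

    def fit(self, X, y=None):
        # Stateless: nothing is learned from the data
        return self

    def transform(self, X):
        # One row per text, one column for the score
        return np.array([[self.scorer(text)] for text in X])


scores = ScoreText(toy_polarity).fit_transform(['I love it', 'I hate it', 'It fits'])
print(scores.shape)
print(scores.ravel())
```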

In [19]:
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer
import numpy as np

initial_text_preprocess = Pipeline([
    (
        'dimension_reshaper',
        FunctionTransformer(
            np.reshape,
            kw_args={'newshape':-1},
        ),
    ),
])


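# FeatureUnion with a single member; named separately from the
# ColumnTransformer defined later so the two objects are not confused
sentiment_features = FeatureUnion([
    ('sentiment', Sentiment(nlp)),
])

sentiment_pipeline = Pipeline([
    (
        'initial_text_preprocess',
        initial_text_preprocess,
    ),
    (
        'feature_engineering',
        sentiment_features,
    ),
])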
sentiment_pipeline
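
The `dimension_reshaper` step exists because a column selection given as a list arrives as a two-dimensional, single-column table, while the text transformers iterate over a one-dimensional sequence of strings. Below is a minimal sketch of the same flattening with `np.ravel`, offered as a possible alternative under one assumption: the `newshape` keyword of `np.reshape` appears to be deprecated in recent NumPy releases, and `np.ravel` avoids that keyword. The toy frame is made up.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Toy single-column selection, shaped like what a ColumnTransformer passes on
# (values made up for illustration)
text_frame = pd.DataFrame({'Review Text': ['nice fit', 'too small', 'love it']})
print(text_frame.shape)

# np.ravel flattens the (n, 1) selection into a 1-D array of strings
flattener = FunctionTransformer(np.ravel)
flat = flattener.fit_transform(text_frame)
print(flat.shape)
```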

In [20]:
# SpacyLemmatizer developed in class: lemmatizes each document and drops stop words

class SpacyLemmatizer(BaseEstimator, TransformerMixin):
    def __init__(self, nlp):
        self.nlp = nlp

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        lemmatized = [
            ' '.join(
                token.lemma_ for token in doc
                if not token.is_stop
            )
            for doc in self.nlp.pipe(X)
        ]
        return lemmatized   

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_pipeline = Pipeline([
    (
        'dimension_reshaper',
        FunctionTransformer(
            np.reshape,
            kw_args={'newshape':-1},
        ),
    ),
    (
        'lemmatizer',
        SpacyLemmatizer(nlp=nlp),
    ),
    (
        'tfidf_vectorizer',
        TfidfVectorizer(
            stop_words='english',
        ),
    ),
])
tfidf_pipeline 
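
The last step of this pipeline turns each lemmatized review into a sparse row of TF-IDF weights. This is a small sketch of that step alone, on made-up reviews and without the lemmatizer, to show the shape of the output and the effect of `stop_words='english'`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy reviews (made up for illustration)
reviews = ['love this dress', 'the dress runs small', 'love the fabric']

vectorizer = TfidfVectorizer(stop_words='english')
tfidf = vectorizer.fit_transform(reviews)

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(tfidf.shape)   # (number of reviews, number of distinct kept terms)
```

On the real data the vocabulary is far larger, so the text features make up most of the columns the classifier sees.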

In [22]:
from sklearn.compose import ColumnTransformer

feature_engineering = ColumnTransformer([
        ('num', num_pipeline, num_features),
        ('cat', cat_pipeline, cat_features),
        ('sentiment_review', sentiment_pipeline, review_text_features),
        ('sentiment_title', sentiment_pipeline, title_text_features),
        ('tfidf_text', tfidf_pipeline, review_text_features),
        ('bin', bin_pipeline, bin_features)
])

feature_engineering

In [23]:

from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

model_pipeline = make_pipeline(
    feature_engineering,
    RandomForestClassifier(random_state=42),
    verbose=True
)

model_pipeline.fit(X_train, y_train)

[Pipeline] . (step 1 of 2) Processing columntransformer, total= 6.4min
[Pipeline]  (step 2 of 2) Processing randomforestclassifier, total=   7.8s


## Training Pipeline

In [24]:
from sklearn.metrics import accuracy_score

y_pred_forest_pipeline = model_pipeline.predict(X_test)
accuracy_forest_pipeline = accuracy_score(y_test, y_pred_forest_pipeline)

print('Accuracy:', accuracy_forest_pipeline)

Accuracy: 0.8574525745257453
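
Accuracy alone does not show which class the errors fall on. As a hedged sketch on made-up labels and predictions (`y_true` and `y_hat` are assumptions, not results from this model), a confusion matrix and per-class recall make that visible:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Toy labels and predictions (made up for illustration)
y_true = [1, 1, 1, 1, 0, 0]
y_hat = [1, 1, 1, 1, 1, 0]

print('Accuracy:', accuracy_score(y_true, y_hat))

# Rows are true classes (0, 1); columns are predicted classes (0, 1)
cm = confusion_matrix(y_true, y_hat)
print(cm)

# Recall for the "not recommended" class (label 0)
recall_not_recommended = recall_score(y_true, y_hat, pos_label=0)
print('Recall for class 0:', recall_not_recommended)
```

In this toy case most predictions are right overall, yet only half of the "not recommended" reviews are caught.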


In [25]:
# model_pipeline.get_params().keys()
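
The keys returned by `get_params()` are what a parameter search expects. A short sketch of how they are formed, using a made-up two-step pipeline (`toy_pipeline` is an illustrative name):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy two-step pipeline; make_pipeline names each step after its
# lowercased class name
toy_pipeline = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=42))

step_names = [name for name, _ in toy_pipeline.steps]
print(step_names)

# Nested parameters are addressed as <step name>__<parameter name>
forest_keys = [
    key for key in toy_pipeline.get_params()
    if key.startswith('randomforestclassifier__')
]
print(forest_keys[:5])
```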

## Fine-Tuning Pipeline

In [26]:
from sklearn.model_selection import RandomizedSearchCV

# Parameters to randomly search over:
# two options each for max_features and n_estimators
my_distributions = dict(
    randomforestclassifier__max_features=[
        100,
        250,
    ],
    randomforestclassifier__n_estimators=[
        100,
        200,
    ],
)

param_search = RandomizedSearchCV(
    estimator=model_pipeline,
    param_distributions=my_distributions,
    n_iter=4,     # Try 4 different combinations of parameters
    cv=3,         # Use 3-fold cross-validation
    n_jobs=1,    # Use single processor
    refit=True,   # Refit the model using the best parameters found
    verbose=3,    # Output of parameters, score, time
    random_state=42,
    error_score='raise',
)

param_search.fit(X_train, y_train)

# Retrieve the best parameters
param_search.best_params_

Fitting 3 folds for each of 4 candidates, totalling 12 fits
[Pipeline] . (step 1 of 2) Processing columntransformer, total= 4.3min
[Pipeline]  (step 2 of 2) Processing randomforestclassifier, total=   4.3s
[CV 1/3] END randomforestclassifier__max_features=100, randomforestclassifier__n_estimators=100;, score=0.862 total time= 6.5min
[Pipeline] . (step 1 of 2) Processing columntransformer, total= 4.3min
[Pipeline]  (step 2 of 2) Processing randomforestclassifier, total=   4.3s
[CV 2/3] END randomforestclassifier__max_features=100, randomforestclassifier__n_estimators=100;, score=0.861 total time= 6.5min
[Pipeline] . (step 1 of 2) Processing columntransformer, total= 4.2min
[Pipeline]  (step 2 of 2) Processing randomforestclassifier, total=   4.3s
[CV 3/3] END randomforestclassifier__max_features=100, randomforestclassifier__n_estimators=100;, score=0.858 total time= 6.5min
[Pipeline] . (step 1 of 2) Processing columntransformer, total= 4.3min
[Pipeline]  (step 2 of 2) Processing randomf

{'randomforestclassifier__n_estimators': 200,
 'randomforestclassifier__max_features': 250}
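
Both parameters in this search are given as lists of two values, so there are only four possible combinations. With `n_iter=4` and list-valued parameters, the randomized search samples without replacement and should therefore visit every combination, which makes it equivalent to an exhaustive grid search here. A minimal sketch that enumerates the candidates:

```python
from sklearn.model_selection import ParameterGrid

# Same candidate values as the search above
search_space = dict(
    randomforestclassifier__max_features=[100, 250],
    randomforestclassifier__n_estimators=[100, 200],
)

all_candidates = list(ParameterGrid(search_space))
print(len(all_candidates))
for candidate in all_candidates:
    print(candidate)
```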

In [27]:
model_best = param_search.best_estimator_
model_best

In [28]:
y_pred_forest_pipeline = model_best.predict(X_test)
accuracy_forest_pipeline = accuracy_score(y_test, y_pred_forest_pipeline)

print('Accuracy:', accuracy_forest_pipeline)

Accuracy: 0.870460704607046
