# Week 11 Midterm: NLP

**Brian Roepke**  
*DATA 110*

# Overview

As a clothing retailer with an e-Commerce presence, it's important for us to understand what customers are saying about our products, which products are most popular with our customers, and how we can improve our offerings.  To accomplish this we'll take a look at several different analyses that will help with the following:

1. **Review Trends**: General trends of customer reviews, quantity, distribution, etc.
1. **Sentiment Analysis**:  How customers feel about the products; are they positive or negative generally.
1. **Part of Speech Analysis**: Using different parts of speech (Nouns, Verbs, Adjectives, etc.), we can see the most common positive and negative words used to describe products as well as which product categories are most commonly referenced.
1. **Recommendation Prediction**: We will use this customer sentiment to understand better if a customer will give a positive rating on the clothing items based on their review.
1. **Department Prediction**: Finally, we'll use a multi-label classification model to predict the departments a product belongs to based on the description that's being used.  This might identify cross-selling opportunities or cross-listing opportunities for products in new categories.

![](https://github.com/broepke/DATA110/blob/main/Week%2011/clothing.jpg?raw=true)
<a href='https://www.freepik.com/vectors/woman'>Woman vector created by freepik - www.freepik.com</a>

In [None]:
import numpy as np
import pandas as pd
import re
import itertools
import string
import warnings
warnings.filterwarnings('ignore')

from textblob import TextBlob
from textblob import Word

import sklearn as sk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn import metrics 
# from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, cross_val_score, RepeatedStratifiedKFold
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MinMaxScaler, MultiLabelBinarizer
from sklearn.compose import ColumnTransformer
from sklearn.multiclass import OneVsRestClassifier

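# NLTK Imports and Downloads
import nltk
from nltk import word_tokenize
from nltk.sentiment.util import *
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

# Corpora used below (stop words, lemmatizer, tokenizer); resource names can
# differ between NLTK versions (newer releases may also need 'punkt_tab').
for resource in ('stopwords', 'wordnet', 'punkt'):
    nltk.download(resource, quiet=True)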

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Preprocessing & EDA

Importing our dataset and providing the necessary cleaning and analysis.

In [None]:
df = pd.read_csv("ClothingReviews.csv")
df.head()

## Data Cleaning

In [None]:
df.info()

In [None]:
df.shape

### Null Values

Null values are generally not desirable in a dataset.  In some cases, observations (rows) with nulls in low-count columns are simply dropped; in other cases, the nulls can be filled with other values.

In [None]:
# check for nan/null
df.isnull().values.any()

In [None]:
# count of nulls
df.isnull().sum()

In [None]:
df.dropna(subset=['Department Name', 'Class Name', 'Review Text'], inplace=True)

**Note**: The null values for the lower counts (except `Title`) were dropped from the dataset.

In [None]:
# count of nulls
df.isnull().sum()

In [None]:
# fill the missing titles with empty strings
df['Title'] = df['Title'].fillna('')

**Note**: Any `NULL` values for the title field are filled with blank strings.  Next these will be combined with the `Review Text` field so we have a single text field for analysis.

In [None]:
# count of nulls
df.isnull().sum()

In [None]:
df['Text'] = df['Title'] + ' ' + df['Review Text']

In [None]:
df.drop(columns=['Title', 'Review Text'], inplace=True)

In [None]:
# Add column 'text_len' that counts the length for the derived field
df['text_len'] = df['Text'].str.len()

### Duplicates

A common practice is to review any duplicates.  If there are large quantities, they can skew the results.

In [None]:
len_before = df.shape[0]
# drop duplicates
df.drop_duplicates(inplace=True)
len_after = df.shape[0]

print("Before =", len_before)
print("After =", len_after)
print('')
print("Total Removed =", len_before - len_after)

**Note**: After the prior clean up of `NULL` values, there were just `2` duplicates left.

### Numeric Variables

In [None]:
df.describe()

**Observations:** 

1. Ages in the dataset range from `18` to `99` with a mean of `43`.  Most of our shoppers are middle-aged.
1. Ratings are based on a `1-5` star system.  The mean rating is `4.18`, meaning most people give positive reviews. 
1. Positive feedback count is the number of times that people found a review useful.  The mean is `2.63` with a min of `0` and a max of `122`.
1. Text length will be covered later in EDA.

### Categorical Variables

In [None]:
# get categorical data
cat_data = df.select_dtypes(include=['object'])
cat_data.info()

In [None]:
# show counts values of each categorical variable
for colname in cat_data.columns:
    print (colname)
    print (cat_data[colname].value_counts(), '\n')

**Observations:**  

The categorical values are extremely clean and well labeled.  We can look at the distributions of these better via visualization in the next section.

## EDA

In [None]:
plt.figure(figsize=(8,5))
sns.countplot(x='Rating', data=df, palette="tab20", dodge=False);

**Notes:** As observed in the prior section, the reviews are skewed to the positive.

In [None]:
plt.figure(figsize=(8,5))
sns.countplot(x='Department Name', data=df, palette="tab20", 
              order = df['Department Name'].value_counts().index, dodge=False);

**Notes**:  

1. `Tops` followed by `Dresses` are the largest categories.  
1. There are very few `Trend` and `Jackets` in the product line.  Predictions will be harder on these imbalanced classes.

In [None]:
plt.figure(figsize=(10,5))
ax = sns.countplot(x='Class Name', data=df, palette="tab20", 
                   order = df['Class Name'].value_counts().index, dodge=False);
ax.set_xticklabels(ax.get_xticklabels(), rotation=90);

In [None]:
plt.figure(figsize=(10, 5))
ax = sns.histplot(df, x='Positive Feedback Count', palette="tab20c", binwidth=5);
ax.set(yscale="log");

In [None]:
plt.figure(figsize=(10, 5))
sns.histplot(df, x='text_len', kde=True, palette="tab20c", binwidth=10);
# ax.set(yscale="log");

**Notes**: For the `text_len` attribute, there is a fairly even distribution from `100` to `400` characters, followed by a higher concentration of longer reviews around `500` characters.  Given that there are no large outliers, the length of a review is probably capped at roughly `600` characters.

## Text Cleaning

For **parts** of our analysis, the text needs some basic transformation for our models to work properly.  These are as follows:

1. **Lower**: Convert all characters to lowercase.
1. **Remove Punctuation**: In most cases, punctuation doesn't help NLP and ML models and can be removed.
1. **Stop Word Removal**: Stop words generally don't add context to analysis (unless the text is very short, around `100` - `200` characters) and can be removed.
1. **Lemmatization**: Words will be reduced to their *lemma*, or dictionary root.  This improves the analysis because inflected forms of the same word, such as `dresses` and `dress`, are counted together.

**Note**: The original text will be preserved for other analysis.

In [None]:
df['Text'][2]

In [None]:
def process_string(text, stem="None"):
    '''lowercase, strip punctuation, drop stop words, and optionally
    stem ('Stem') or lemmatize ('Lem') the remaining words'''
    
    text = text.lower()
    
    # Remove punctuation
    translator = str.maketrans('', '', string.punctuation)
    text = text.translate(translator)

    # Remove stop words (a set makes the membership test fast)
    useless_words = set(nltk.corpus.stopwords.words("english"))
    text_filtered = [word for word in text.split() if word not in useless_words]
    
    if stem == 'Stem':
        stemmer = PorterStemmer() 
        text_stemmed = [stemmer.stem(y) for y in text_filtered]
    elif stem == 'Lem':
        lem = WordNetLemmatizer()
        text_stemmed = [lem.lemmatize(y) for y in text_filtered]
    else:
        text_stemmed = text_filtered
    
    return " ".join(text_stemmed)

In [None]:
df['Text_Processed'] = df['Text'].apply(lambda x: process_string(x, stem='Lem'))

In [None]:
df['Text_Processed'][2]
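
One side effect of the step order in `process_string` is worth illustrating: punctuation is stripped before stop words are filtered, so contractions lose their apostrophe and may no longer match the stop word list.  The sketch below uses only the standard library and a tiny, hypothetical stop word set (`STOP_WORDS`, standing in for the NLTK list); the `clean` helper is also hypothetical, so treat it as an illustration of the effect rather than a reproduction of the pipeline.

```python
import string

# Hypothetical, tiny stand-in for the NLTK English stop word list.
STOP_WORDS = {"i", "this", "it", "don't", "the"}


def clean(text, strip_punct_first=True):
    """Lowercase, strip punctuation, and drop stop words in the chosen order."""
    strip = str.maketrans("", "", string.punctuation)
    text = text.lower()
    if strip_punct_first:
        # Same order as process_string: "don't" becomes "dont" before filtering.
        words = [w for w in text.translate(strip).split() if w not in STOP_WORDS]
    else:
        # Filter first, then strip punctuation from the surviving words.
        words = [w.translate(strip) for w in text.split() if w not in STOP_WORDS]
    return " ".join(w for w in words if w)


sample = "I don't love this dress!"
print(clean(sample, strip_punct_first=True))   # dont love dress
print(clean(sample, strip_punct_first=False))  # love dress
```

Whether the leftover `dont` matters depends on the downstream step; it is noted here only because negations can carry sentiment.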

# Sentiment Analysis

For our sentiment analysis section, we will be using the `TextBlob` package to assist in creating `polarity scores` or sentiment scores that range from `-1` to `1` where lower scores are more negative and higher more positive.  Based off of these scores, we'll add a classifier of `1` for positive and `0` for negative to be used later in our prediction model. 

**Note**: `0` is technically neutral sentiment, so we'll verify how many observations were neutral before assuming we can use a binary label.

In [None]:
def get_sentiment(x):
    '''using TextBlob, get the sentiment score for a given body of text'''
    blob = TextBlob(x)
    return blob.sentiment.polarity

In [None]:
# Apply the Polarity Scoring from TextBlob
df['sentiment'] = df['Text_Processed'].apply(lambda x: get_sentiment(x))

In [None]:
# Label non-negative polarity as positive (1) and negative polarity as negative (0)
df['sentiment_label'] = df['sentiment'].apply(lambda x: 1 if x >= 0 else 0)

In [None]:
df[df.columns[-4:]].sample(5, random_state=4)

In [None]:
plt.figure(figsize=(10, 5))
sns.histplot(df, x='sentiment', palette="tab20c", bins=20);

**Observations:**

The distribution of sentiment, similar to the `1-5` star ratings, is left-skewed, with most of the mass on the positive side.  Very few reviews have a polarity score `<0`.

In [None]:
len(df[df['sentiment'] == 0])

**Note**: There is a small number (`83`) of reviews that received a `neutral` sentiment.  Since this number is so low, a polarity of `0` was grouped together with the majority class (`positive`).
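
The labeling rule can be checked on a handful of values.  This is a minimal sketch with hypothetical polarity scores (not taken from the dataset), and `label_polarity` is an illustrative name; it shows how neutral scores fall into the positive class under the `>= 0` threshold.

```python
def label_polarity(score):
    """Map a polarity score in [-1, 1] to 1 (non-negative) or 0 (negative)."""
    return 1 if score >= 0 else 0


# Hypothetical polarity scores, for illustration only.
scores = [0.45, 0.0, -0.2, 0.8, -0.05, 0.0]

labels = [label_polarity(s) for s in scores]
neutral = sum(1 for s in scores if s == 0)
negative = labels.count(0)

print(labels)                                   # [1, 1, 0, 1, 0, 1]
print("neutral:", neutral, "negative:", negative)  # neutral: 2 negative: 2
```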

# Part-of-speech Tagging

- Show word counts for different parts of speech.
- What are popular products? Identify nouns that can be used to tag the product (e.g., dress, jacket, bottom) and show their counts.
- Identify the top adjectives and adverbs for positive vs. negative reviews.

In [None]:
# Tokenize the words
df['Text_Tok'] = df['Text_Processed'].apply(word_tokenize)

In [None]:
def parse_text(x):
    '''using TextBlob, get the full parsed results (POS, etc)'''
    blob = TextBlob(x)
    p = blob.parse()
    p = re.sub(r'^\w+/', '',p)
    return p.split('/')

In [None]:
def build_pos(x):
    '''pass a DataFrame column with tokenized text and return a DF of the Words'''
    # Flatten the per-review token lists into a single list of words
    all_words = list(itertools.chain.from_iterable(x))

    df = pd.DataFrame(all_words, columns=['Word'])
    
    # Add a column for the POS
    df['Parse'] = df['Word'].apply(parse_text)
    
    # Expand the extracted list of POS tags into their own columns, and concat that back to the orig DF
    # https://chrisalbon.com/python/data_wrangling/pandas_expand_cells_containing_lists/
    par = pd.DataFrame(df['Parse'].to_list(), columns=['P1','P2', 'P3', 'P4'])
    df = pd.concat([df, par], axis=1)
    df.drop(columns=['Parse'], inplace=True)

    return df

In [None]:
df_words = build_pos(df['Text_Tok'])

In [None]:
df_words.sample(10)

**Notes:** Rather than using the much simpler approach of the TextBlob `tags` property [1], the `parse` function was used since it provides a more verbose labeling of the text.

The goal was to discover whether there was a better way to identify nouns that represent product features vs. other nouns.  Unfortunately, this didn't end up providing the detail needed. More information on this is presented below.

## Word Counts for Different Parts of Speech

In [None]:
df_top_pos = df_words.groupby('P1')['P1'].count().\
    reset_index(name='count').sort_values(['count'],ascending=False).head(15)

In [None]:
df_top_pos

In [None]:
plt.figure(figsize=(10,5))
sns.barplot(data =df_top_pos, x='P1', y='count', palette="tab20");

## Identify Top Product Nouns

In [None]:
df_nn = df_words[df_words['P1'] == 'NN'].copy()

In [None]:
df_nn.groupby('Word')['Word'].count().reset_index(name='count').\
    sort_values(['count'], ascending=False).head(10)

**Notes**:  When inspecting `nouns` only, there is a mix of different types of words, and some are tagged in a way that doesn't make sense for this dataset.  For example, `love` is tagged as a noun, but in reviews it is most likely used as a verb.  `bit` is most likely part of a modifier such as `a bit`, but it also shows up as a noun.  Because `build_pos` tags each word on its own, the tagger has no sentence context to tell these uses apart.

We can inspect these words directly to see if there is a difference in their POS tags.

In [None]:
print(TextBlob('dress').parse())
print(TextBlob('love').parse())
print(TextBlob('bit').parse())

**Observations:**  When words are tagged individually, the Part of Speech (POS) output does not distinguish between these nouns.  Each of them has exactly the same POS sequence. 

Instead, we can use the `Class Name` values to determine which clothing nouns to count.  

In [None]:
# Extract a list of all the unique class names
noun_types = list(df['Class Name'].unique())

# The words from the categories need to be lemmatized.
lem = WordNetLemmatizer()
for i in range(len(noun_types)):
    noun_types[i] = lem.lemmatize(noun_types[i].lower())
noun_types

In [None]:
# Extract all the text into a huge string and use Text Blobs to get a Dictionary out with counts
all_text = ' '.join(df['Text_Processed'])
all_text_blob = TextBlob(all_text)
all_text_dict = all_text_blob.word_counts

# Turn the dictionary into a Dataframe.  Filter by the word list and then sort for plotting.
df_dict = pd.DataFrame(list(all_text_dict.items()),columns = ['Word','Count']) 
df_products = df_dict[df_dict.Word.isin(noun_types)].sort_values(by=['Count'], ascending=False)
df_products

**Observations:** Based on the top results, we can see that `dresses` are the most mentioned product line, at roughly `4x` the rate of the second, `sweaters`.

In [None]:
plt.figure(figsize=(10,5))
ax = sns.barplot(x='Word', y='Count', data=df_products, palette="tab20", dodge=False)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90);

## Top Adjectives and Adverbs for Positive vs Negative Reviews

Using Part of Speech tags, we can look at which adjectives and adverbs people use most to describe the products.  Below is a table that shows how different parts of speech are encoded in this system.
 
**Part of Speech Codes**

<table align="left" cellpadding="2" cellspacing="2" border="0">
  <tbody>
  <tr bgcolor="#DFDFFF" align="none"> 
    <td align="none"> 
      <div align="left">Number</div>
    </td>
    <td> 
      <div align="left">Tag</div>
    </td>
    <td> 
      <div align="left">Description</div>
    </td>
  </tr>
  <tr bgcolor="#FFFFCA"> 
    <td align="none"> 7. </td>
    <td>JJ </td>
    <td>Adjective </td>
  </tr>
  <tr bgcolor="#FFFFCA"> 
    <td align="none"> 8. </td>
    <td>JJR </td>
    <td>Adjective, comparative </td>
  </tr>
  <tr bgcolor="#FFFFCA"> 
    <td align="none"> 9. </td>
    <td>JJS </td>
    <td>Adjective, superlative </td>
  </tr>
  <tr bgcolor="#FFFFCA"> 
    <td align="none"> 20. </td>
    <td>RB </td>
    <td>Adverb </td>
  </tr>
  <tr bgcolor="#FFFFCA"> 
    <td align="none"> 21. </td>
    <td>RBR </td>
    <td>Adverb, comparative </td>
  </tr>
  <tr bgcolor="#FFFFCA"> 
    <td align="none"> 22. </td>
    <td>RBS </td>
    <td>Adverb, superlative </td>
  </tr>
  </tbody>
</table>
 

In [None]:
# Extract pos and neg reviews based on sentiment into their own DFs
df_pos = df[df['sentiment_label'] == 1]
df_neg = df[df['sentiment_label'] == 0]

# Drop the rest of the columns after separating
df_pos = df_pos[['Text_Tok']]
df_neg = df_neg[['Text_Tok']]

In [None]:
def get_top_mods(df_all_words):
    ''' this function will return a dataframe of the top adjectives and 
    adverbs grouped together and counted'''
    
    df_mods = df_all_words[(df_all_words.P1.str.startswith('JJ')) | (df_all_words.P1.str.startswith('RB'))]

    # Groupby, count, sort in order to get the counts of the words
    df_grouped = df_mods.groupby(['P1', 'Word'])['Word'].count().\
        reset_index(name='count').sort_values(['count'],ascending=False)

    # Convert to Multi-Level Index
    df_grouped.set_index(['P1', 'Word'], inplace=True)

    # Finally, just display the top 5 per tag (if there are 5)
    return df_grouped.groupby(level=0).head(5)

In [None]:
# Build the Dataframe via the Function
df_all_words_pos = build_pos(df_pos['Text_Tok'])

# Get the top words
get_top_mods(df_all_words_pos)

In [None]:
# Build the Dataframe via the Function
df_all_words_neg = build_pos(df_neg['Text_Tok'])

# Get the top words
get_top_mods(df_all_words_neg)

**Notes**:

Above are the most frequently occurring positive and negative words per Adjective and Adverb.  

1. **Positive**: Top words are `top`, `great`, `perfect`, `really`, and `pretty`
1. **Negative**: Top words are `small`, `little`, `thin`, `tight`, and `short`

None of the positive words are really surprising, but the negative words suggest a **sizing issue**: most of them point to products fitting smaller than people expect from the sizes claimed. 

# Sentiment-Based Prediction Model

Next we'll create a supervised ML model to predict whether a product will be recommended, based on the text of the review, the sentiment of that text, and the length of the review.  In the code below, the `sentiment_label` derived above is used as the target, standing in for a positive recommendation.

To create our model we will be mixing both text and numeric values.  There are multiple ways to accomplish this, but we will be using a `ColumnTransformer` in a Pipeline [2].

## Model Selection

In [None]:
X = df[['Text', 'sentiment', 'text_len']]
y = df['sentiment_label']

In [None]:
print(X.shape)
print(y.shape)

In [None]:
def col_trans():
    column_trans = ColumnTransformer(
            [('Text', TfidfVectorizer(stop_words='english'), 'Text'),
             ('Text Length', MinMaxScaler(), ['text_len']),
             ('Sentiment', MinMaxScaler(), ['sentiment'])],
            remainder='drop') 
    
    return column_trans
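
The numeric columns are rescaled because TF-IDF weights lie between `0` and `1`, while `text_len` runs into the hundreds and `sentiment` can be negative.  The sketch below shows the min-max formula, `(x - min) / (max - min)`, in plain Python with hypothetical values; `min_max_scale` is an illustrative helper, not the scikit-learn implementation.

```python
def min_max_scale(values):
    """Rescale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# Hypothetical values, for illustration only.
text_len = [100, 250, 400, 500]
sentiment = [-0.5, 0.0, 0.25, 1.0]

print(min_max_scale(text_len))   # [0.0, 0.375, 0.75, 1.0]
print([round(v, 3) for v in min_max_scale(sentiment)])
```

A useful side effect is that the scaled `sentiment` column becomes non-negative, which the Naive Bayes classifiers tried below require of their inputs.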

In [None]:
def create_pipe(clf):
    '''Create a pipeline for a given classifier.  The classifier needs to be an instance
    of the classifier with all parmeters needed specified.'''
    
    # Each pipeline uses the same column transformer.  
    column_trans = col_trans()
    
    pipeline = Pipeline([('prep',column_trans),
                         ('over', SMOTE(random_state=42)),
                         ('under', RandomUnderSampler(random_state=42)),
                         ('clf', clf)])
     
    return pipeline

In [None]:
models = {'ComplementNB' : ComplementNB(),
          'SVC' : SVC(class_weight='balanced', random_state=42),
          'LogReg' : LogisticRegression(random_state=42, class_weight='balanced', max_iter=500),
          'RandomForest' : RandomForestClassifier(class_weight='balanced', random_state=42)}

for name, model in models.items():
    pipeline = create_pipe(model)
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(pipeline, X, y, scoring='f1_macro', cv=cv, n_jobs=-1, error_score='raise')
    print(name, ': Mean f1 Macro: %.3f and Standard Deviation: (%.3f)' % (np.mean(scores), np.std(scores)))

**Observations**:

The **Support Vector Machine** Classifier performed the best with the **Random Forest** and **Logistic Regression** behind it.  Complement Naive Bayes performed the worst. 

`SVC` is a fairly computationally expensive algorithm [5], so it might be advantageous to use **Logistic Regression** if performance were the top priority.
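
The models are compared on `f1_macro` rather than accuracy because the classes are imbalanced.  The sketch below, using hypothetical labels and illustrative helper names, shows why: a classifier that always predicts the majority class gets a high accuracy but a low macro F1, since each class contributes equally to the average.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Hypothetical labels: 8 positive reviews, 2 negative.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
always_positive = [1] * 10

accuracy = sum(t == p for t, p in zip(y_true, always_positive)) / len(y_true)
print(accuracy)                                       # 0.8
print(round(macro_f1(y_true, always_positive), 3))    # 0.444
```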

## Model Building

In [None]:
# Make training and test sets 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=53)

In [None]:
print(y_train.shape)
print(X_train.shape)

In [None]:
def get_params(parameters, X, y, pipeline):
    ''' implements the GridSearch cross-validation for a given model and set of parameters'''
    
    cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=3, random_state=1)
    grid = GridSearchCV(pipeline, parameters, scoring='f1_macro', n_jobs=-1, cv=cv, error_score='raise')
    grid.fit(X, y)

    return grid

In [None]:
parameters = [{'clf__C': np.linspace(.1, 2 ,5), 
               'clf__gamma': [.01, .1, .5], 
               'clf__class_weight' : ['balanced']}]

clf = SVC()
pipeline = create_pipe(clf)
grid = get_params(parameters, X_train, y_train, pipeline)

print("Best cross-validation f1 macro: {:.3f}".format(grid.best_score_))
print("Test set f1 macro: {:.3f}".format(grid.score(X_test, y_test))) 
print("Best parameters: {}".format(grid.best_params_))

svc_C = grid.best_params_['clf__C']
svc_gamma = grid.best_params_['clf__gamma']
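
To get a feel for the cost of this search, the sketch below enumerates the grid in plain Python.  The `linspace` helper is a hypothetical stand-in for `np.linspace`, and the fit count assumes one fit per parameter combination per fold, which is how the search above is configured.

```python
import itertools


def linspace(start, stop, num):
    """Evenly spaced values from start to stop, inclusive (stand-in for np.linspace)."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


c_values = linspace(0.1, 2, 5)
gamma_values = [0.01, 0.1, 0.5]
grid_points = list(itertools.product(c_values, gamma_values))

# RepeatedStratifiedKFold(n_splits=3, n_repeats=3) yields 9 train/validation folds.
n_splits, n_repeats = 3, 3
total_fits = len(grid_points) * n_splits * n_repeats

print([round(c, 3) for c in c_values])   # [0.1, 0.575, 1.05, 1.525, 2.0]
print(len(grid_points), total_fits)      # 15 135
```

With the default `refit=True`, the search also fits the best combination once more on the full training set.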

## Model Validation

In [None]:
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    See full source and example: 
    http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
    
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    # Normalize before drawing so the colors and the cell text use the same values
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [None]:
def fit_and_print(pipeline, name):
    ''' take a supplied pipeline, fit it on the train-test split,
    and print scoring results.'''
    
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    score = metrics.f1_score(y_test, y_pred, average='macro')

    print(name, ': f1 Macro: %.3f' % score)
    print(metrics.classification_report(y_test, y_pred, digits=3))

    cm = metrics.confusion_matrix(y_test, y_pred, labels=[0,1])
    plot_confusion_matrix(cm, classes=[0,1])

In [None]:
clf = SVC(C=svc_C, gamma=svc_gamma, class_weight='balanced', random_state=42)
pipeline = create_pipe(clf)
fit_and_print(pipeline, 'SVC')

**Observations**:

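This model performed extremely well across our dataset with an `f1 macro` score of `0.915`.  The dataset is imbalanced, which was corrected for with SMOTE (oversampling combined with undersampling).  The end result was a very strong predictor model based on `text`, `sentiment`, and `text_len`.  Because `sentiment_label` is derived by thresholding the `sentiment` feature, part of this performance reflects the model recovering that threshold, and the score should be read with that in mind.

Synthetic Minority Oversampling Technique (SMOTE) uses a nearest-neighbor approach to generate new minority class samples.  Because the resampling steps sit inside the pipeline, they are applied only to the training data, and the model is then tested on the original, untouched test partition.  The approach chosen here is to first oversample the minority class and then undersample the majority class, which helps bring balance without bloating the dataset [4].  Note that with the default `sampling_strategy` settings used above, SMOTE already brings the classes to parity, so the undersampling step only has an effect when explicit ratios are supplied.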
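
The interpolation step at the heart of SMOTE can be sketched in a few lines.  This is a simplified illustration, not the `imblearn` implementation: it skips the nearest-neighbor search and assumes a neighbor has already been chosen, and the feature values and the `synthesize` name are hypothetical.

```python
import random


def synthesize(sample, neighbor, rng):
    """Return a point on the segment between a minority sample and one of its neighbors."""
    lam = rng.random()  # interpolation weight in [0, 1)
    return [s + lam * (n - s) for s, n in zip(sample, neighbor)]


rng = random.Random(42)

# Hypothetical two-feature minority samples, for illustration only.
minority_sample = [0.2, 0.9]
minority_neighbor = [0.4, 0.5]

new_point = synthesize(minority_sample, minority_neighbor, rng)
print(new_point)
```

Each synthetic point lies between two existing minority samples, which is why the method adds variety instead of duplicating rows.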

## Test on Custom Data

In [None]:
def create_test_data(x):
    '''build the model inputs for a custom review, mirroring the training features:
    raw text, sentiment of the lemmatized text, and length of the raw text.'''
    
    sent = get_sentiment(process_string(x, stem='Lem'))
    length = len(x)
    
    d = {'Text' : x,
         'sentiment' : sent,
         'text_len' : length}

    df = pd.DataFrame(d, index=[0])
    
    return df

In [None]:
revs = ['This dress is gorgeous and I love it and would gladly reccomend it to all of my friends.',
        'This skirt has really horible quality and I hate it!',
        'A super cute top with the perfect fit.',
        'The most gorgeous pair of jeans I have seen.',
        'this item is too little and tight.']

In [None]:
print('The classifier will return 1 for Positive reviews and 0 for Negative reviews:','\n')
for rev in revs:
    c_res = pipeline.predict(create_test_data(rev))
    print(rev, '=', c_res)

**Notes**:

Based on our custom strings, each one produced **Correct** classifications with our model.

# Text Classification for Departments

Next we'll attempt to create a Supervised Machine Learning model to classify products by Department.  This will look at the text that a user wrote in a review and determine what department the item came from.

An interesting opportunity is to use this information to **cross sell** or **cross list** products.  If there is  a strong enough probability that an item could be in multiple departments from our analysis, could we **increase sales** with cross marketing?

## Model Selection

In [None]:
# Tokenize the words
df['Department Name'] = df['Department Name'].apply(word_tokenize)

In [None]:
X = df[['Text', 'Department Name']]
y = df['Department Name']

In [None]:
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(y)

In [None]:
print(X.shape)
print(y.shape)
print(mlb.classes_)
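
The binarization step turns each list of department labels into a row of `0`/`1` indicators, one column per class.  The following is a plain-Python sketch of that idea using the department names from this dataset; `binarize` and `inverse` are illustrative helpers, not the scikit-learn API, and the multi-label row is hypothetical.

```python
CLASSES = ["Bottoms", "Dresses", "Intimate", "Jackets", "Tops", "Trend"]


def binarize(label_lists, classes):
    """One indicator row per sample, one column per class."""
    return [[1 if c in labels else 0 for c in classes] for labels in label_lists]


def inverse(rows, classes):
    """Map indicator rows back to tuples of class names."""
    return [tuple(c for c, flag in zip(classes, row) if flag) for row in rows]


# The last row is a hypothetical multi-label example.
y_raw = [["Dresses"], ["Tops"], ["Bottoms", "Tops"]]
y_bin = binarize(y_raw, CLASSES)

print(y_bin)
print(inverse(y_bin, CLASSES))
```

Because a prediction is also a row of indicators, a classifier trained on this target can switch on more than one department for a single review.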

In [None]:
def col_trans():
    column_trans = ColumnTransformer(
            [('Text', TfidfVectorizer(stop_words='english'), 'Text')],
            remainder='drop') 
    
    return column_trans
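
For the department model, the review text is the only feature, so it helps to see what TF-IDF weighting does.  The sketch below is a simplified, standard-library version: it tokenizes on whitespace and skips stop word removal, but uses the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1` with L2 row normalization, which mirrors scikit-learn's documented defaults.  The documents and the `tfidf` helper are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return the sorted vocabulary and one L2-normalized TF-IDF row per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        weights = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocab, rows


# Hypothetical mini-corpus, for illustration only.
docs = ["great dress great fit", "great top", "tight dress"]
vocab, rows = tfidf(docs)

print(vocab)
for doc, row in zip(docs, rows):
    print(doc, [round(w, 3) for w in row])
```

In the second document, the rarer word `top` receives a higher weight than the common word `great`, which is what lets department-specific vocabulary stand out.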

In [None]:
def create_pipe(clf):
    '''Create a pipeline for a given classifier.  The classifier needs to be an instance
    of the classifier with all parmeters needed specified.'''
    
    # Each pipeline uses the same column transformer.  
    column_trans = col_trans()
    
    pipeline = Pipeline([('prep',column_trans), 
                         ('over', SMOTE(random_state=42)),
                         ('under', RandomUnderSampler(random_state=42)),
                         ('clf', clf)])
     
    return pipeline

In [None]:
models = {'SVC' : OneVsRestClassifier(SVC(kernel='linear'), n_jobs=-1),
          'RF' : OneVsRestClassifier(RandomForestClassifier(), n_jobs=-1),
          'LogReg' : OneVsRestClassifier(LogisticRegression(), n_jobs=-1),
          'Bayes' : OneVsRestClassifier(MultinomialNB(), n_jobs=-1)}

for name, model in models.items():
    pipeline = create_pipe(model)
    scores = cross_val_score(pipeline, X, y, scoring='f1_macro', cv=3, n_jobs=-1, error_score='raise')
    print(name, ': Mean f1 Macro: %.3f and Standard Deviation: (%.3f)' % (np.mean(scores), np.std(scores)))

**Notes**:

Again the `SVC` classifier performed the best, with `LogisticRegression` coming out second best.   `MultinomialNB` performed the worst out of these.

## Model Building & Validation

In [None]:
# Make training and test sets 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

In [None]:
print(y_train.shape)
print(X_train.shape)

In [None]:
# Note: hyperparameters were selected in a prior optimization run
# (gamma is ignored by the linear kernel)
clf = OneVsRestClassifier(SVC(C=.5, gamma=.1, kernel='linear', 
                              class_weight='balanced', random_state=42))
pipeline = create_pipe(clf)

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
score = metrics.f1_score(y_test, y_pred, average='macro')

print('f1 Macro: %.3f' % score)
print(metrics.classification_report(y_test, y_pred, digits=3))

**Observations**:

 - There are very few observations in the `5` class overall in the dataset.  You can see in the support column for this class only 38 samples are present.  Through synthetic oversampling (SMOTE), we could get a small precision and recall score, but it's difficult without more observations.
 - The `2` and `3` classes are also smaller and therefore do not have as high an `f1` score.
 - The remainder of the classes are all performing very well.  
 - The `f1 macro` score is `~.6`, which normally is not considered a good score, but in the case of multi-label classification, it doesn't necessarily mean it's a poor-performing model.  We can test this through inspection of the actual predictions.

In [None]:
# Retrieve the text labels from the MultiLabelBinarizer
pred_labels = mlb.inverse_transform(y_pred)

# Append them to the DataFrame
X_test['Predicted Labels'] = pred_labels

In [None]:
# Display a random sample of them
pd.set_option('display.max_colwidth', None)
X_test.sample(10, random_state=60)

**Observations:**

In many cases our classifier was very accurate in determining the correct Department.  Since we used a multi-label classifier, the algorithm suggested more than one label in some cases.  Upon inspecting these, the suggestion of an additional Department is often quite logical.  

It's possible to use these multi-label classes to investigate **cross marketing / cross listing** opportunities for these products.  
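
One way to turn these predictions into a shortlist is to count how often a correctly identified department comes with an extra suggested one.  The sketch below uses hypothetical `(true department, predicted labels)` pairs and an illustrative helper name; with the real data, the pairs would come from the test set and the predicted labels.

```python
from collections import Counter

# Hypothetical (true department, predicted labels) pairs.
predictions = [
    ("Dresses", ("Dresses",)),
    ("Tops", ("Dresses", "Tops")),
    ("Bottoms", ("Bottoms",)),
    ("Tops", ("Dresses", "Tops")),
    ("Intimate", ("Bottoms", "Intimate")),
    ("Jackets", ()),
]


def cross_list_candidates(pairs):
    """Count (listed department, suggested extra department) pairs."""
    counts = Counter()
    for true_dept, predicted in pairs:
        if true_dept not in predicted:
            continue  # skip outright misses; keep only "correct plus extra" rows
        for extra in predicted:
            if extra != true_dept:
                counts[(true_dept, extra)] += 1
    return counts


for (listed, suggested), n in cross_list_candidates(predictions).most_common():
    print(listed, "->", suggested, n)
```

Pairs that recur often are the more plausible cross-listing candidates; one-off pairs are more likely to be noise.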

## Test on New Data

In [None]:
def create_test_data(x):
    '''build a one-row DataFrame from a (review text, [department]) tuple,
    applying the same text cleaning function used earlier.
    Department classes: ['Bottoms' 'Dresses' 'Intimate' 'Jackets' 'Tops' 'Trend']'''
    
    s = process_string(x[0])
    
    d = {'Text' : s,
         'Department Name' : x[1]}

    df = pd.DataFrame(d, index=[0])
    
    return df

In [None]:
revs = [('This dress is gorgeous and would gladly reccomend it to all of my friends.', ['Dresses']),
     ('These pants have really horible quality and I hate them!', ['Bottoms']),
     ('A super cute blouse with a great fit.', ['Tops']),
     ('The most gorgeous pair of jeans I have seen.', ['Bottoms']),
     ('This bra is silky smooth material and fits perfectly.', ['Intimate'])]

In [None]:
for rev in revs:
    c_res = pipeline.predict(create_test_data(rev))
    print(rev[0], '\n', rev[1], '\n' ,mlb.inverse_transform(c_res), '\n')

**Observations**:  

- The first three sample sentences are classified without a problem.  They contain keywords that are highly related to the Department names: `Dress`, `Pants`, and `Blouse`.  
- The fourth sentence was misclassified, matching both `tops` and `bottoms`.
- The fifth sentence was also misclassified: it got one label correct but also matched `bottoms` and `dresses`.

# Conclusion

**Parts of Speech**  

The most interesting takeaway from this portion of the exercise was the negative adjectives and adverbs.  These descriptors of the products help point us in a direction we might not have been aware of.  For example, the following top words identified above:

 - **Positive**: Top words are `top`, `great`, `perfect`, `really`, and `pretty`
 - **Negative**: Top words are `small`, `little`, `thin`, `tight`, and `short`

The negative words suggest a **size accuracy issue**: most of them point to products fitting smaller than people expect from the sizes claimed. 

**Sentiment Analysis**

 - Our model to predict recommendations based on sentiment performed very well.  Using the tuned `SVC` classifier, we achieved an `f1 macro` score of about `0.915` on our test data set.  This suggests that the text of a review is a strong indicator of how a person will rate our products.

**Department Classification**

- This exercise proved interesting as well.  While the overall performance (as measured by `f1 macro`) was not as strong as the sentiment-based model, it provided a point of view we could not achieve otherwise.  In the sampled predictions, the extra departments suggested alongside the correct one were often quite relevant to the product.  Therefore it might be possible to **cross-list/market** products in these categories.



# References

1. https://textblob.readthedocs.io/en/dev/api_reference.html#textblob.blob.TextBlob.tags
1. https://stackoverflow.com/questions/55604249/featureunion-vs-columntransformer
1. https://chrisalbon.com/python/data_wrangling/pandas_expand_cells_containing_lists/
1. https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/
1. https://sklearn.org/modules/svm.html#complexity