# Data Science Training

## Workshop - Women's E-Commerce Clothing Reviews

### Introduction 

In these times of quarantine, online shopping has proven to be one of the greatest stress-busters for women!
While all of us are on our journeys to becoming shopaholics or broke, let us analyze the importance of CUSTOMER REVIEWS while shopping online.

Reportedly, 61% of customers read online reviews before making a purchase decision, which makes reviews essential for e-commerce sites. According to Reevoo, reviews also produce an average 18% uplift in sales. USER REVIEWS are therefore sales drivers, and something most customers will want to see before deciding to make a purchase.

#### Overall goal: 
You are a data scientist and have been asked to perform topic modeling with LDA. Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents. Latent Dirichlet Allocation (LDA) is one example of a topic model and is used to assign the text in a document to particular topics. It builds a topics-per-document model and a words-per-topic model, both modeled as Dirichlet distributions.

#### Section 01: Exploratory Data Analysis
* Are there any null values or outliers? How will you wrangle/handle them?
* Are there any variables that warrant transformations?
* Are there any useful variables that you can engineer with the given data?
* Do you notice any patterns or anomalies in the data? Can you plot them?

#### Section 02: Data Analysis
* Is it a regression or a classification problem?
* Is it a supervised or an unsupervised problem?
* Which model parameters should we take care of?


### Importing data and libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string
import seaborn as sns
import time

# nltk
import nltk
from nltk.corpus import stopwords # python3 -m nltk.downloader stopwords
stoplist= stopwords.words('english')
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer= WordNetLemmatizer()
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer

import warnings
warnings.filterwarnings('ignore')

# Enable logging
import logging
logging.basicConfig(level= logging.INFO)

#### Section 01: Exploratory Data Analysis

Import the data from the data folder. The file is called Womens Clothing E-Commerce Reviews.csv.

In [2]:
df= pd.read_csv("data/Womens Clothing E-Commerce Reviews.csv", index_col=0)
df.columns= df.columns.str.replace(" ", "_")
print(df.shape)

(23486, 10)


Let's check the column names

In [3]:
# your code here

Index(['Clothing_ID', 'Age', 'Title', 'Review_Text', 'Rating',
       'Recommended_IND', 'Positive_Feedback_Count', 'Division_Name',
       'Department_Name', 'Class_Name'],
      dtype='object')

How do the first 5 rows of data look?

In [4]:
# your code here

Unnamed: 0,Clothing_ID,Age,Title,Review_Text,Rating,Recommended_IND,Positive_Feedback_Count,Division_Name,Department_Name,Class_Name
0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


How about the last 5 rows of data?

In [5]:
# your code here

Unnamed: 0,Clothing_ID,Age,Title,Review_Text,Rating,Recommended_IND,Positive_Feedback_Count,Division_Name,Department_Name,Class_Name
23481,1104,34,Great dress for many occasions,I was very happy to snag this dress at such a ...,5,1,0,General Petite,Dresses,Dresses
23482,862,48,Wish it was made of cotton,"It reminds me of maternity clothes. soft, stre...",3,1,0,General Petite,Tops,Knits
23483,1104,31,"Cute, but see through","This fit well, but the top was very see throug...",3,0,1,General Petite,Dresses,Dresses
23484,1084,28,"Very cute dress, perfect for summer parties an...",I bought this dress for a wedding i have this ...,3,1,2,General,Dresses,Dresses
23485,1104,52,Please make more like this one!,This dress in a lovely platinum is feminine an...,5,1,22,General Petite,Dresses,Dresses


Let's describe our data

In [6]:
# your code here

Unnamed: 0,Clothing_ID,Age,Rating,Recommended_IND,Positive_Feedback_Count
count,23486.0,23486.0,23486.0,23486.0,23486.0
mean,918.118709,43.198544,4.196032,0.822362,2.535936
std,203.29898,12.279544,1.110031,0.382216,5.702202
min,0.0,18.0,1.0,0.0,0.0
25%,861.0,34.0,4.0,1.0,0.0
50%,936.0,41.0,5.0,1.0,1.0
75%,1078.0,52.0,5.0,1.0,3.0
max,1205.0,99.0,5.0,1.0,122.0


In [7]:
# your code here

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23486 entries, 0 to 23485
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Clothing_ID              23486 non-null  int64 
 1   Age                      23486 non-null  int64 
 2   Title                    19676 non-null  object
 3   Review_Text              22641 non-null  object
 4   Rating                   23486 non-null  int64 
 5   Recommended_IND          23486 non-null  int64 
 6   Positive_Feedback_Count  23486 non-null  int64 
 7   Division_Name            23472 non-null  object
 8   Department_Name          23472 non-null  object
 9   Class_Name               23472 non-null  object
dtypes: int64(5), object(5)
memory usage: 2.0+ MB


Let's check the missing values

In [8]:
# Check missing values
# your code here

Missing values in data : 
 Clothing_ID                   0
Age                           0
Title                      3810
Review_Text                 845
Rating                        0
Recommended_IND               0
Positive_Feedback_Count       0
Division_Name                14
Department_Name              14
Class_Name                   14
dtype: int64


Since the reviews are our main content, drop the rows where 'Review Text' is null.

In [9]:
# your code here

(22641, 10)
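
One possible way to carry out the two steps above (counting missing values, then dropping rows without review text) is sketched here on a small made-up frame rather than on the workshop data. The toy rows and the names `toy`, `missing` and `toy_clean` are assumptions for illustration only; `isnull`, `sum` and `dropna(subset=...)` are standard pandas calls.

```python
import pandas as pd

# Toy frame standing in for the reviews data (all values are made up)
toy = pd.DataFrame({
    "Title": ["Nice", None, "Too small", None],
    "Review_Text": ["Love it", "Runs large", None, "Soft fabric"],
    "Rating": [5, 4, 2, 5],
})

# Count missing values per column
missing = toy.isnull().sum()
print("Missing values in data :\n", missing)

# Keep only the rows that have review text; other columns may still contain nulls
toy_clean = toy.dropna(subset=["Review_Text"])
print(toy_clean.shape)
```

Passing `subset` restricts the check to the review column, so rows that only lack a title are kept.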

Let's check the missing values after dropping rows where 'Review Text' is null

In [10]:
# your code here

Missing values in data : 
 Clothing_ID                   0
Age                           0
Title                      2966
Review_Text                   0
Rating                        0
Recommended_IND               0
Positive_Feedback_Count       0
Division_Name                13
Department_Name              13
Class_Name                   13
dtype: int64


In [18]:
# Golden rule: keep a copy of the original dataframe
df_orig= df.copy()
df_orig.shape 

# df= df_orig.copy()

(22641, 10)

#### Section 02: Data Analysis

For topic modeling, we will use the well-known LDA (Latent Dirichlet Allocation) algorithm. In LDA, each document can be described by a distribution of topics and each topic can be described by a distribution of words.
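
The generative story behind that description can be sketched in a few lines. This is a toy illustration using only the standard library, not the way gensim fits the model: the vocabulary, the number of topics and the concentration value 0.5 are all made up, and `sample_dirichlet` and `generate_document` are hypothetical helper names.

```python
import random

random.seed(0)

def sample_dirichlet(alpha):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

# Hypothetical toy setup: 2 topics over a 4-word vocabulary
vocab = ["dress", "size", "fabric", "small"]
num_topics = 2

# Each topic is a probability distribution over the words
topic_word = [sample_dirichlet([0.5] * len(vocab)) for _ in range(num_topics)]

def generate_document(num_words, alpha=0.5):
    # Each document has its own probability distribution over the topics
    doc_topic = sample_dirichlet([alpha] * num_topics)
    words = []
    for _ in range(num_words):
        # Pick a topic for this word position, then a word from that topic
        topic = random.choices(range(num_topics), weights=doc_topic)[0]
        word = random.choices(vocab, weights=topic_word[topic])[0]
        words.append(word)
    return doc_topic, words

doc_topic, words = generate_document(8)
print(doc_topic)
print(words)
```

Training LDA runs this story in reverse: given only the words, it estimates the topic mixtures per document and the word distributions per topic.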

STEP 1: Preprocessing text - Tokenizing sentences, stopwords removal and lemmatization

* Sentence tokenization is the process of splitting text into individual sentences
* Stop words are available in abundance in any human language. By removing these words, we remove the low-level information from our text in order to give more focus to the important information.
* Lemmatization uses a vocabulary and morphological analysis of words to remove inflectional endings only and return the base or dictionary form of a word, which is known as the lemma.
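
The first two ideas in the list above can be sketched with the standard library alone. This is a simplified stand-in, not the NLTK pipeline used in this notebook: the sentence splitter is a naive regular expression, `toy_stopwords` is a made-up mini list, and `simple_preprocess` is a hypothetical name. Lemmatization is left out because it needs a vocabulary such as WordNet.

```python
import re

# Hypothetical mini stopword list; the notebook uses NLTK's English list instead
toy_stopwords = {"this", "that", "with", "very", "have"}

def simple_preprocess(text, min_len=4):
    """Split into sentences, tokenize, lowercase, drop short words and stopwords."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    tokens = []
    for sentence in sentences:
        for word in re.findall(r"[A-Za-z']+", sentence):
            word = word.lower()
            if len(word) >= min_len and word not in toy_stopwords:
                tokens.append(word)
    return sentences, tokens

sentences, tokens = simple_preprocess("Love this dress! It fits very well with heels.")
print(sentences)  # ['Love this dress!', 'It fits very well with heels.']
print(tokens)     # ['love', 'dress', 'fits', 'well', 'heels']
```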

In [19]:
def get_pos_tag(tag):
    """This function is used to get the part-of-speech(POS) for lemmatization"""
    
    if tag.startswith('N') or tag.startswith('J'):
        return wordnet.NOUN
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN #default case

In [20]:
# python3 -m nltk.downloader punkt
# python3 -m nltk.downloader averaged_perceptron_tagger
# python3 -m nltk.downloader wordnet
import re
def preprocess(text):
    """ 1. Removes punctuation
        2. Removes words of 3 characters or fewer
        3. Converts to lowercase
        4. Lemmatizes words
        5. Removes stopwords
    """
    punctuation= list(string.punctuation)
    doc_tokens= nltk.word_tokenize(text)
    word_tokens= [word.lower() for word in doc_tokens if not (word in punctuation or len(word)<=3)]
    
    # Lemmatize    
    pos_tags=nltk.pos_tag(word_tokens)
#     print(pos_tags)
    doc_words=[wordnet_lemmatizer.lemmatize(word, pos=get_pos_tag(tag)) for word, tag in pos_tags]
    doc_words= [word for word in doc_words if word not in stoplist]
    
    return doc_words

0    [absolutely, wonderful, silky, sexy, comfortable]
1    [love, dress, sooo, pretty, happen, find, stor...
2    [high, hope, dress, really, want, work, initia...
3    [love, love, love, jumpsuit, flirty, fabulous,...
4    [shirt, flatter, adjustable, front, perfect, l...
Name: Review_Text, dtype: object

In [None]:
# We are interested in the column "Review_Text"
# your code here

STEP 2: Data cleaning - keep only nouns and adjectives to obtain meaningful topics!

In [22]:
# Tried multiple parts of speech and obtained best topic results using Nouns and Adjectives!
def get_nouns_adjs(series):
    
    " Topic Modeling using only nouns and adjectives"
    
    pos_tags= nltk.pos_tag(series)
    all_adj_nouns= [word for (word, tag) in pos_tags if (tag=="NN" or tag=="NNS" or tag=="JJ")] 
    return all_adj_nouns

In [None]:
# your code here hint: use apply

Step 3: Add bigrams to your corpus using gensim's Phrases model, then explore them with Word2Vec

Importing gensim related libraries

In [23]:
# Importing gensim related libraries
import gensim
from gensim.models.ldamulticore import LdaMulticore
from gensim.corpora import Dictionary
from gensim.models import Phrases
from collections import Counter
from gensim.models import Word2Vec

In [24]:
docs= list(df_nouns_adj)
phrases = gensim.models.Phrases(docs, min_count=10, threshold=20)
bigram_model = gensim.models.phrases.Phraser(phrases)

INFO:gensim.models.phrases:collecting all words and their counts
INFO:gensim.models.phrases:PROGRESS: at sentence #0, processed 0 words and 0 word types
INFO:gensim.models.phrases:PROGRESS: at sentence #10000, processed 174273 words and 96030 word types
INFO:gensim.models.phrases:PROGRESS: at sentence #20000, processed 349830 words and 162208 word types
INFO:gensim.models.phrases:collected 177651 token types (unigram + bigrams) from a corpus of 396320 words and 22641 sentences
INFO:gensim.models.phrases:merged Phrases<177651 vocab, min_count=10, threshold=20, max_vocab_size=40000000>
INFO:gensim.utils:Phrases lifecycle event {'msg': 'built Phrases<177651 vocab, min_count=10, threshold=20, max_vocab_size=40000000> in 0.97s', 'datetime': '2021-11-05T09:32:23.303883', 'gensim': '4.1.2', 'python': '3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 19:20:46) \n[GCC 9.4.0]', 'platform': 'Linux-4.15.0-159-generic-x86_64-with-glibc2.31', 'event': 'created'}
INFO:gensim.models.phrases:ex
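
To get a feel for what `min_count` and `threshold` control, here is a rough standard-library sketch of the score that, to my understanding, gensim's default phrase scorer computes: `(count(a b) - min_count) * len_vocab / (count(a) * count(b))`, with a pair merged when the score exceeds `threshold`. The toy corpus, the small parameter values and the name `phrase_score` are assumptions for illustration; this is not gensim's implementation.

```python
from collections import Counter

# Toy tokenized corpus (made up); the notebook uses the review documents instead
toy_docs = [
    ["true", "size", "fit"],
    ["true", "size", "soft"],
    ["true", "size", "dress"],
    ["soft", "dress", "fit"],
]

unigrams = Counter(w for doc in toy_docs for w in doc)
bigrams = Counter(pair for doc in toy_docs for pair in zip(doc, doc[1:]))
# Unigrams and bigrams are counted in one shared vocabulary
len_vocab = len(unigrams) + len(bigrams)

def phrase_score(a, b, min_count):
    """(count(a b) - min_count) * len_vocab / (count(a) * count(b))."""
    return (bigrams[(a, b)] - min_count) * len_vocab / (unigrams[a] * unigrams[b])

min_count, threshold = 1, 2.0
for (a, b), n in bigrams.items():
    score = phrase_score(a, b, min_count)
    verdict = "merge" if score > threshold else "keep separate"
    print(f"{a}_{b}: count={n}, score={score:.2f} -> {verdict}")
```

In this toy corpus only `true_size` passes the threshold: it occurs three times, while every other pair occurs once and scores zero after `min_count` is subtracted.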

In [25]:
def make_bigrams(texts):
    return [bigram_model[doc] for doc in texts]

# Form Bigrams
data_words_bigrams = make_bigrams(docs)

In [26]:
# Check out the most frequent bigrams:
english_stopwords= set(stopwords.words('english')) # build the set once instead of on every iteration
bigram_counter1= Counter()
for key in phrases.vocab.keys():
    if key not in english_stopwords:
        if len(str(key).split('_'))>1:
            bigram_counter1[key]+=phrases.vocab[key]

for key, counts in bigram_counter1.most_common(20):
    print(key,">>>>", counts)

true_size >>>> 1317
look_great >>>> 735
size_small >>>> 730
order_size >>>> 628
size_size >>>> 524
usual_size >>>> 496
fabric_soft >>>> 391
many_compliment >>>> 364
order_small >>>> 349
soft_comfortable >>>> 349
love_dress >>>> 349
skinny_jean >>>> 339
regular_size >>>> 333
wear_size >>>> 318
material_soft >>>> 316
super_soft >>>> 313
size_large >>>> 301
dress_look >>>> 292
petite_size >>>> 286
fit_true >>>> 278


Feeding the bigrams into a Word2Vec model produces more meaningful bigrams

In [21]:
# Adding business stopwords to exclude
common_terms= ["wear","look","ordered","color","purchase","order"]

stoplist= stoplist+ common_terms


In [27]:
w2vmodel = Word2Vec(bigram_model[docs], vector_size=100, sg=1, hs= 1, seed=33, epochs=33)
bigram_counter = Counter()

for key in w2vmodel.wv.index_to_key:
    if key not in stoplist:
        if len(str(key).split("_")) > 1:
            bigram_counter[key] += w2vmodel.wv.get_vecattr(key, "count")

for key, counts in bigram_counter.most_common(30):
    print(key,">>>>> " ,counts)

INFO:gensim.models.word2vec:collecting all words and their counts
INFO:gensim.models.word2vec:PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO:gensim.models.word2vec:PROGRESS: at sentence #10000, processed 170511 words, keeping 8158 word types
INFO:gensim.models.word2vec:PROGRESS: at sentence #20000, processed 342158 words, keeping 11817 word types
INFO:gensim.models.word2vec:collected 12697 word types from a corpus of 387611 raw words and 22641 sentences
INFO:gensim.models.word2vec:Creating a fresh vocabulary
INFO:gensim.utils:Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 3280 unique words (25.83287390722218%% of original 12697, drops 9417)', 'datetime': '2021-11-05T09:32:51.290279', 'gensim': '4.1.2', 'python': '3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 19:20:46) \n[GCC 9.4.0]', 'platform': 'Linux-4.15.0-159-generic-x86_64-with-glibc2.31', 'event': 'prepare_vocab'}
INFO:gensim.utils:Word2Vec lifecycle event {'msg': 'effective_min

many_compliment >>>>>  364
skinny_jean >>>>>  339
body_type >>>>>  231
sale_price >>>>>  220
full_price >>>>>  209
first_time >>>>>  204
local_retailer >>>>>  169
broad_shoulder >>>>>  157
local_store >>>>>  149
base_review >>>>>  137
light_weight >>>>>  131
worth_price >>>>>  128
cami_underneath >>>>>  119
previous_review >>>>>  110
spring_summer >>>>>  108
last_year >>>>>  101
lot_compliment >>>>>  97
read_review >>>>>  97
pencil_skirt >>>>>  90
right_place >>>>>  89
athletic_build >>>>>  89
denim_jacket >>>>>  88
fall_winter >>>>>  88
price_point >>>>>  79
right_amount >>>>>  79
real_life >>>>>  76
tank_underneath >>>>>  75
hand_wash >>>>>  72
ton_compliment >>>>>  71
hourglass_figure >>>>>  63


Check out some cool stuff from the Word2Vec model!

In [28]:
# Words most similar to 'pregnant' (used in similar contexts)
w2vmodel.wv.most_similar(positive= ['pregnant'])

[('baby_bump', 0.5819551944732666),
 ('pregnancy', 0.5376015901565552),
 ('month_pregnant', 0.5323860049247742),
 ('maternity', 0.5232221484184265),
 ('twin', 0.49868470430374146),
 ('sack', 0.4877704679965973),
 ('trimester', 0.48121994733810425),
 ('potato', 0.4309651255607605),
 ('busty', 0.4304738938808441),
 ('tent', 0.42559415102005005)]

In [29]:
# Which color is to 'work' as 'white' is to 'wedding'
# your code here

[('black', 0.5721330642700195),
 ('stark', 0.5108568072319031),
 ('navy', 0.5096889138221741),
 ('cute', 0.4732438027858734),
 ('cream', 0.45449140667915344)]
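
The similarity scores and the analogy question above both come down to vector arithmetic plus cosine similarity. The sketch below shows the idea on made-up 3-dimensional vectors; the numbers in `toy_vectors` and the helper names `cosine` and `analogy` are assumptions for illustration, and gensim's own `most_similar` (which accepts `positive` and `negative` word lists) normalizes and averages vectors somewhat differently.

```python
import math

# Made-up 3-d vectors; real Word2Vec vectors have vector_size dimensions
toy_vectors = {
    "wedding": [0.9, 0.1, 0.8],
    "white":   [0.8, 0.2, 0.1],
    "work":    [0.1, 0.9, 0.8],
    "black":   [0.1, 0.9, 0.1],
    "cute":    [0.5, 0.5, 0.5],
}

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(positive, negative, topn=2):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(toy_vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, toy_vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, toy_vectors[word])]
    # Exclude the query words themselves from the candidates
    candidates = [w for w in toy_vectors if w not in positive + negative]
    ranked = sorted(candidates, key=lambda w: cosine(target, toy_vectors[w]), reverse=True)
    return [(w, cosine(target, toy_vectors[w])) for w in ranked[:topn]]

# 'white' is to 'wedding' as ? is to 'work'
result = analogy(positive=["work", "white"], negative=["wedding"])
print(result)
```

With these toy vectors, `black` ranks first because `work + white - wedding` points almost exactly in its direction.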

In [30]:
# your code here

[('sale_price', 0.43278151750564575),
 ('steep', 0.39156848192214966),
 ('sweater/coat', 0.39007094502449036),
 ('expensive', 0.3879004120826721),
 ('tracy_reese', 0.379138320684433)]

In [31]:
# What is a 'deal_breaker', if 'quality' is 'worth_penny'?
# your code here


[('exterior', 0.3735155463218689),
 ('execution', 0.3655067980289459),
 ('hop', 0.352865070104599)]

Step 4: Create a dictionary and corpus for input to our LDA model. Filter out the most common and uncommon words.

In [32]:
dictionary= Dictionary(data_words_bigrams)

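# Filter out words that occur in fewer than 20 documents, or in more than 60% of the documents.
dictionary.filter_extremes(no_below=20, no_above=0.6)
# Note: docs holds the documents before bigram merging; use data_words_bigrams here if the bigram tokens should be counted
corpus = [dictionary.doc2bow(doc) for doc in docs]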

print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:adding document #10000 to Dictionary(8158 unique tokens: ['comfortable', 'sexy', 'silky', 'wonderful', 'dress']...)
INFO:gensim.corpora.dictionary:adding document #20000 to Dictionary(11817 unique tokens: ['comfortable', 'sexy', 'silky', 'wonderful', 'dress']...)
INFO:gensim.corpora.dictionary:built Dictionary(12697 unique tokens: ['comfortable', 'sexy', 'silky', 'wonderful', 'dress']...) from 22641 documents (total 387611 corpus positions)
INFO:gensim.utils:Dictionary lifecycle event {'msg': "built Dictionary(12697 unique tokens: ['comfortable', 'sexy', 'silky', 'wonderful', 'dress']...) from 22641 documents (total 387611 corpus positions)", 'datetime': '2021-11-05T09:34:17.410803', 'gensim': '4.1.2', 'python': '3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 19:20:46) \n[GCC 9.4.0]', 'platform': 'Linux-4.15.0-159-generic-x86_64-with-glibc2.31', 'event': 'create

Number of unique tokens: 1530
Number of documents: 22641
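
What the filtering and the bag-of-words conversion do can be imitated with the standard library. This is a simplified sketch on a made-up corpus, not gensim's implementation: `filter_tokens` and `to_bow` are hypothetical helpers that only mimic the `no_below` / `no_above` idea and the `(token_id, count)` output format.

```python
from collections import Counter

# Toy tokenized corpus (made up)
toy_docs = [
    ["dress", "size", "soft"],
    ["dress", "size", "small"],
    ["dress", "fabric", "soft"],
    ["dress", "size", "size"],
]

# Document frequency: in how many documents does each token appear?
doc_freq = Counter(token for doc in toy_docs for token in set(doc))

def filter_tokens(doc_freq, num_docs, no_below, no_above):
    """Keep tokens found in at least no_below docs and at most no_above * num_docs docs."""
    return sorted(t for t, df in doc_freq.items() if no_below <= df <= no_above * num_docs)

kept = filter_tokens(doc_freq, len(toy_docs), no_below=2, no_above=0.75)
token2id = {token: i for i, token in enumerate(kept)}

def to_bow(doc):
    """Bag-of-words: sorted (token_id, count) pairs, ignoring filtered-out tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

print(kept)                                # ['size', 'soft']
print([to_bow(doc) for doc in toy_docs])   # [[(0, 1), (1, 1)], [(0, 1)], [(1, 1)], [(0, 2)]]
```

Here `dress` is dropped because it appears in every document, and `small` and `fabric` are dropped because they appear in only one, which is the same reasoning behind removing the most common and the rarest words before training LDA.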


Step 5: Train your LDA model- Topic Modeling

In [33]:
from gensim.models.ldamulticore import LdaMulticore

t0= time.time()
passes= 150
np.random.seed(1) # setting up random seed to get the same results
ldamodel= LdaMulticore(corpus, 
                    id2word=dictionary, 
                    num_topics=4, 
#                   alpha='asymmetric', 
                    chunksize= 4000, 
                    batch= True,
                    minimum_probability=0.001,
                    iterations=350,
                    passes=passes)                    

t1= time.time()
print("time for",passes," passes: ",(t1-t0)," seconds")


INFO:gensim.models.ldamodel:using symmetric alpha at 0.25
INFO:gensim.models.ldamodel:using symmetric eta at 0.25
INFO:gensim.models.ldamodel:using serial LDA version on this node
INFO:gensim.models.ldamulticore:running batch LDA training, 4 topics, 150 passes over the supplied corpus of 22641 documents, updating every 22641 documents, evaluating every ~22641 documents, iterating 350x with a convergence threshold of 0.001000
INFO:gensim.models.ldamulticore:training LDA model using 31 processes
INFO:gensim.models.ldamulticore:PROGRESS: pass 0, dispatched chunk #0 = documents up to #4000/22641, outstanding queue size 1
INFO:gensim.models.ldamulticore:PROGRESS: pass 0, dispatched chunk #1 = documents up to #8000/22641, outstanding queue size 2
INFO:gensim.models.ldamulticore:PROGRESS: pass 0, dispatched chunk #2 = documents up to #12000/22641, outstanding queue size 3
INFO:gensim.models.ldamulticore:PROGRESS: pass 0, dispatched chunk #3 = documents up to #16000/22641, outstanding queue si

time for 150  passes:  767.9923477172852  seconds


STEP 6: Here are your Topics!

In [34]:
ldamodel.show_topics(num_words=25, formatted=False)

[(0,
  [('dress', 0.083370134),
   ('waist', 0.02299345),
   ('look', 0.022960795),
   ('fabric', 0.022639794),
   ('skirt', 0.014640253),
   ('length', 0.012020286),
   ('beautiful', 0.010689758),
   ('material', 0.010224576),
   ('flatter', 0.010110618),
   ('short', 0.010081764),
   ('little', 0.010009438),
   ('nice', 0.008944226),
   ('body', 0.0086883195),
   ('work', 0.008556451),
   ('shape', 0.008435519),
   ('petite', 0.0084021995),
   ('model', 0.008249083),
   ('hip', 0.008227384),
   ('bottom', 0.008093334),
   ('color', 0.008057289),
   ('much', 0.007919191),
   ('line', 0.0074790856),
   ('great', 0.0074631018),
   ('right', 0.0073345983),
   ('fit', 0.0073066726)]),
 (1,
  [('size', 0.12714909),
   ('small', 0.054531682),
   ('order', 0.047720738),
   ('large', 0.031019436),
   ('medium', 0.027167108),
   ('true', 0.020845152),
   ('fit', 0.01683209),
   ('petite', 0.016092978),
   ('store', 0.014172953),
   ('retailer', 0.013511704),
   ('wear', 0.01227326),
   ('color