# Analysis of Reviews on Olist

🎯 Now that you are familiar with NLP, let's analyze the reviews of Olist.

👇 Run the following cell to load the reviews dataset.

In [1]:
import pandas as pd

url = "https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ml_olist_nlp_reviews.csv"
df = pd.read_csv(url, low_memory = False)

df.head()

Unnamed: 0.1,Unnamed: 0,review_id,length_review,review_score,order_id,product_category_name,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
0,0,7bc2406110b926393aa56f80a40eba40,0,4,73fc7af87114b39712e6da79b0a377eb,esporte_lazer,,,2018-01-18 00:00:00,2018-01-18 21:46:59,41dcb106f807e993532d446263290104,delivered,2018-01-11 15:30:49,2018-01-11 15:47:59,2018-01-12 21:57:22,2018-01-17 18:42:41,2018-02-02 00:00:00
1,1,80e641a11e56f04c1ad469d5645fdfde,0,5,a548910a1c6147796b98fdf73dbeba33,informatica_acessorios,,,2018-03-10 00:00:00,2018-03-11 03:05:13,8a2e7ef9053dea531e4dc76bd6d853e6,delivered,2018-02-28 12:25:19,2018-02-28 12:48:39,2018-03-02 19:08:15,2018-03-09 23:17:20,2018-03-14 00:00:00
2,2,228ce5500dc1d8e020d8d1322874b6f0,0,5,f9e4b658b201a9f2ecdecbb34bed034b,informatica_acessorios,,,2018-02-17 00:00:00,2018-02-18 14:36:24,e226dfed6544df5b7b87a48208690feb,delivered,2018-02-03 09:56:22,2018-02-03 10:33:41,2018-02-06 16:18:28,2018-02-16 17:28:48,2018-03-09 00:00:00
3,3,e64fb393e7b32834bb789ff8bb30750e,37,5,658677c97b385a9be170737859d3511b,ferramentas_jardim,,Recebi bem antes do prazo estipulado.,2017-04-21 00:00:00,2017-04-21 22:02:06,de6dff97e5f1ba84a3cd9a3bc97df5f6,delivered,2017-04-09 17:41:13,2017-04-09 17:55:19,2017-04-10 14:24:47,2017-04-20 09:08:35,2017-05-10 00:00:00
4,4,f7c4243c7fe1938f181bec41a392bdeb,100,5,8e6bfb81e283fa7e4f11123a3fb894f1,esporte_lazer,,Parabéns lojas lannister adorei comprar pela I...,2018-03-01 00:00:00,2018-03-02 10:26:53,5986b333ca0d44534a156a52a8e33a83,delivered,2018-02-10 10:59:03,2018-02-10 15:48:21,2018-02-15 19:36:14,2018-02-28 16:33:35,2018-03-09 00:00:00


In [2]:
df.shape

(98657, 17)

❓ **Question: Analyse the reviews to understand what could be the causes of the bad review scores** ❓

This challenge is not as guided as the previous ones. But here are some questions to ask yourself:

- Are all the reviews relevant ? 
- What about combining the title and the body of a review ?
- What cleaning operations would you apply to the reviews ?

🇧🇷 Some Brazilian expressions and their translations:

- `producto errado` = wrong product
- `ainda nao` = not yet
- `nao entregue` = not delivered
- `nao veio` = did not come
- `nao gostei` = did not like it
- `produto defeito` = defective product
- `nao functiona` = not working
- `produto diferente` = different product
- `pessima qualidade` = poor quality
- `veio defeito` = came defect
- `veio faltando` = came missing
- `veio errado` = came wrong

In [3]:
df.isnull().sum().sort_values(ascending=False)/len(df)

review_comment_title             0.883576
review_comment_message           0.590105
order_delivered_customer_date    0.021266
order_delivered_carrier_date     0.009984
order_approved_at                0.000132
Unnamed: 0                       0.000000
customer_id                      0.000000
order_purchase_timestamp         0.000000
order_status                     0.000000
review_creation_date             0.000000
review_answer_timestamp          0.000000
review_id                        0.000000
product_category_name            0.000000
order_id                         0.000000
review_score                     0.000000
length_review                    0.000000
order_estimated_delivery_date    0.000000
dtype: float64

In [4]:
df.duplicated().sum()

0

In [43]:
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import recall_score
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
from nltk.stem import WordNetLemmatizer

In [44]:
def cleaning(sentence):
    for punctuation in string.punctuation:
        sentence = sentence.replace(punctuation, '')
    
    # Basic cleaning
    sentence = sentence.strip() ## remove whitespaces
    sentence = sentence.lower() ## lowercase
    sentence = ''.join(char for char in sentence if not char.isdigit()) ## remove numbers
    
    # Advanced cleaning
    for punctuation in string.punctuation:
        sentence = sentence.replace(punctuation, '') ## remove punctuation
    
    tokenized_sentence = word_tokenize(sentence) ## tokenize 
    stop_words = set(stopwords.words('portuguese')) ## define stopwords
    
    tokenized_sentence_cleaned = [ ## remove stopwords
        w for w in tokenized_sentence if not w in stop_words
    ]

    lemmatized = [
        WordNetLemmatizer().lemmatize(word, pos = "v") 
        for word in tokenized_sentence_cleaned
    ]
    lemmatized = [
        WordNetLemmatizer().lemmatize(word, pos = "a") 
        for word in tokenized_sentence_cleaned
    ]
    lemmatized = [
        WordNetLemmatizer().lemmatize(word, pos = "n") 
        for word in tokenized_sentence_cleaned
    ]
    lemmatized = [
        WordNetLemmatizer().lemmatize(word, pos = "s") 
        for word in tokenized_sentence_cleaned
    ]
    lemmatized = [
        WordNetLemmatizer().lemmatize(word, pos = "r") 
        for word in tokenized_sentence_cleaned
    ]
    
    cleaned_sentence = ' '.join(word for word in lemmatized)
    
    return cleaned_sentence

In [45]:
df.dropna(inplace=True)

In [46]:
df.review_comment_message = df.review_comment_message.apply(cleaning)

In [47]:
df

Unnamed: 0.1,Unnamed: 0,review_id,length_review,review_score,order_id,product_category_name,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
9,9,8670d52e15e00043ae7de4c01cc2fe06,174,4,b9bf720beb4ab3728760088589c62129,eletroportateis,recomendo,aparelho eficiente site marca aparelho impress...,2018-05-22 00:00:00,2018-05-23 16:45:47,a5224bdc7685fd39cd7a23404415493d,delivered,2018-05-14 10:29:02,2018-05-15 10:37:47,2018-05-15 13:29:00,2018-05-21 17:52:12,2018-06-06 00:00:00
15,15,3948b09f7c818e2d86c9a546758b2335,56,5,e51478e7e277a83743b6f9991dbfa3fb,informatica_acessorios,Super recomendo,vendedor confiável produto ok entrega antes prazo,2018-05-23 00:00:00,2018-05-24 03:00:01,659ded3e9b43aaf51cf9586d03033b46,delivered,2018-05-18 18:20:45,2018-05-18 18:35:28,2018-05-19 09:27:00,2018-05-22 14:58:47,2018-06-07 00:00:00
22,22,d21bbc789670eab777d27372ab9094cc,12,5,4fc44d78867142c627497b60a7e0228a,beleza_saude,Ótimo,loja nota,2018-07-10 00:00:00,2018-07-11 14:10:25,e494ff798e6549f9ba9747f00f5681c2,delivered,2018-07-04 20:34:57,2018-07-05 16:33:00,2018-07-05 15:55:00,2018-07-09 20:27:50,2018-07-23 00:00:00
36,36,c92cdd7dd544a01aa35137f901669cdf,112,4,37e7875cdce5a9e5b3a692971f370151,esporte_lazer,Muito bom.,recebi exatamente esperava demais encomendas o...,2018-06-07 00:00:00,2018-06-09 18:44:02,3fecd6727aed19735e06945b7c3e49c9,delivered,2018-05-18 12:15:11,2018-05-18 13:05:53,2018-05-21 16:13:00,2018-06-06 18:22:40,2018-06-14 00:00:00
38,38,08c9d79ec0eba1d252e3f52f14b8e6a9,11,5,e029f708df3cc108b3264558771605c6,pet_shop,Bom,recomendo,2018-06-13 00:00:00,2018-06-13 22:54:44,d2aa6bef2582c7482ab992fa89f965bd,delivered,2018-06-01 14:12:09,2018-06-01 14:32:30,2018-06-05 14:48:00,2018-06-12 19:41:53,2018-06-29 00:00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98622,98622,47e0954e156dac6512c25c6d2ecc1c66,160,5,16cbf959cfdb88c47ee2a29303547ec2,relogios_presentes,Nota máxima!,obrigado excelente atendimentobaratheon fornec...,2018-05-22 00:00:00,2018-05-23 00:51:43,0888ceb9767c8f38f24962e72e7c32fa,delivered,2018-05-15 19:52:47,2018-05-15 20:17:07,2018-05-16 07:54:00,2018-05-21 20:34:50,2018-06-06 00:00:00
98627,98627,0e7bc73fde6782891898ea71443f9904,9,4,bd78f91afbb1ecbc6124974c5e813043,cama_mesa_banho,👍,aprovado,2018-07-04 00:00:00,2018-07-05 00:25:13,b503b349a145e75d7b6fb083b221a5cb,delivered,2018-06-28 21:26:30,2018-06-28 21:51:05,2018-06-29 12:07:00,2018-07-03 20:52:04,2018-07-20 00:00:00
98631,98631,58be140ccdc12e8908ff7fd2ba5c7cb0,103,5,0ebf8e35b9807ee2d717922d5663ccdb,papelaria,muito bom produto,ficamos satisfeitos produto atende necessidade...,2018-06-30 00:00:00,2018-07-02 23:09:35,9126539aa02befb9271bed176c06c637,delivered,2018-06-21 22:43:32,2018-06-21 22:57:28,2018-06-22 15:18:00,2018-06-29 19:21:05,2018-07-20 00:00:00
98632,98632,51de4e06a6b701cb2be47ea0e689437b,83,3,b7467ae483dbe956fe9acdf0b1e6e3f4,relogios_presentes,Não foi entregue o pedido,bom dia unidades compradas recebi unidades agu...,2018-06-05 00:00:00,2018-06-06 10:52:19,424bb0b58d2a1f8f1a185d44a8116aae,delivered,2018-05-21 14:14:35,2018-05-21 16:14:55,2018-05-23 12:57:00,2018-06-04 18:52:15,2018-06-08 00:00:00


In [51]:
df['target'] = pd.cut(x = df['review_score'],
                       bins=[df['review_score'].min()-1,
                             df['review_score'].mean(),
                             df['review_score'].max()+1], 
                       labels=['bad', 'good'])

df.tail()

Unnamed: 0.1,Unnamed: 0,review_id,length_review,review_score,order_id,product_category_name,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date,target
98622,98622,47e0954e156dac6512c25c6d2ecc1c66,160,5,16cbf959cfdb88c47ee2a29303547ec2,relogios_presentes,Nota máxima!,obrigado excelente atendimentobaratheon fornec...,2018-05-22 00:00:00,2018-05-23 00:51:43,0888ceb9767c8f38f24962e72e7c32fa,delivered,2018-05-15 19:52:47,2018-05-15 20:17:07,2018-05-16 07:54:00,2018-05-21 20:34:50,2018-06-06 00:00:00,good
98627,98627,0e7bc73fde6782891898ea71443f9904,9,4,bd78f91afbb1ecbc6124974c5e813043,cama_mesa_banho,👍,aprovado,2018-07-04 00:00:00,2018-07-05 00:25:13,b503b349a145e75d7b6fb083b221a5cb,delivered,2018-06-28 21:26:30,2018-06-28 21:51:05,2018-06-29 12:07:00,2018-07-03 20:52:04,2018-07-20 00:00:00,good
98631,98631,58be140ccdc12e8908ff7fd2ba5c7cb0,103,5,0ebf8e35b9807ee2d717922d5663ccdb,papelaria,muito bom produto,ficamos satisfeitos produto atende necessidade...,2018-06-30 00:00:00,2018-07-02 23:09:35,9126539aa02befb9271bed176c06c637,delivered,2018-06-21 22:43:32,2018-06-21 22:57:28,2018-06-22 15:18:00,2018-06-29 19:21:05,2018-07-20 00:00:00,good
98632,98632,51de4e06a6b701cb2be47ea0e689437b,83,3,b7467ae483dbe956fe9acdf0b1e6e3f4,relogios_presentes,Não foi entregue o pedido,bom dia unidades compradas recebi unidades agu...,2018-06-05 00:00:00,2018-06-06 10:52:19,424bb0b58d2a1f8f1a185d44a8116aae,delivered,2018-05-21 14:14:35,2018-05-21 16:14:55,2018-05-23 12:57:00,2018-06-04 18:52:15,2018-06-08 00:00:00,bad
98634,98634,2ee221b28e5b6fceffac59487ed39348,87,2,f2d12dd37eaef72ed7b1186b2edefbcd,pet_shop,Foto enganosa,foto diferente principalmente graninha sintéti...,2018-03-28 00:00:00,2018-05-25 01:23:26,75b5d720874f58a6f6e2863e378c8575,delivered,2018-03-25 18:01:37,2018-03-25 18:15:29,2018-03-26 20:03:43,2018-03-27 13:48:59,2018-04-06 00:00:00,bad


In [52]:
vectorizer = TfidfVectorizer()
vectorized_documents = vectorizer.fit_transform(df.review_comment_message)
vectorized_documents = pd.DataFrame(vectorized_documents.toarray(), 
                                    columns = vectorizer.get_feature_names_out())

vectorized_documents

Unnamed: 0,aa,aaa,aaprelho,ab,abaixada,abaixo,abajur,abaulada,abdominal,abençoe,...,ônibus,última,últimas,último,única,único,únicos,úteis,útil,ünica
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9501,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9502,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9503,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9504,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [53]:
from sklearn.decomposition import LatentDirichletAllocation

# Instantiate the LDA 
n_components = 2
lda_model = LatentDirichletAllocation(n_components=n_components, max_iter = 100)

# Fit the LDA on the vectorized documents
lda_model.fit(vectorized_documents)

In [54]:
document_topic_mixture = lda_model.transform(vectorized_documents)

In [60]:
document_topic_mixture

array([[0.80512229, 0.19487771],
       [0.14724799, 0.85275201],
       [0.62006946, 0.37993054],
       ...,
       [0.14979393, 0.85020607],
       [0.79669434, 0.20330566],
       [0.84339321, 0.15660679]])

In [61]:
topic_word_mixture = pd.DataFrame(lda_model.components_, 
                                 columns = vectorizer.get_feature_names_out())
topic_word_mixture

Unnamed: 0,aa,aaa,aaprelho,ab,abaixada,abaixo,abajur,abaulada,abdominal,abençoe,...,ônibus,última,últimas,último,única,único,únicos,úteis,útil,ünica
0,0.730096,0.730096,0.511947,0.774827,0.813198,2.29474,0.508122,0.74498,0.967935,0.541508,...,0.901943,2.025268,0.522495,0.947717,5.449309,4.680402,0.766456,3.288745,0.695891,0.89561
1,0.53092,0.53092,0.951752,0.506284,0.552896,0.547763,1.201533,0.50624,0.510831,2.343291,...,0.509376,0.509739,0.82249,0.56577,0.53994,0.533057,0.547574,2.371054,8.820441,0.516481


In [62]:
def print_topics(lda_model, vectorizer, top_words):
    # 1. TOPIC MIXTURE OF WORDS FOR EACH TOPIC
    topic_mixture = pd.DataFrame(lda_model.components_,
                                 columns = vectorizer.get_feature_names_out())
    
    # 2. FINDING THE TOP WORDS FOR EACH TOPIC
    ## Number of topics
    n_components = topic_mixture.shape[0]
    ## Top words for each topic
    for topic in range(n_components):
        print("-"*10)
        print(f"For topic {topic}, here are the the top {top_words} words with weights:")
        topic_df = topic_mixture.iloc[topic]\
                             .sort_values(ascending = False).head(top_words)
        
        print(round(topic_df,3))

In [63]:
print_topics(lda_model, vectorizer, 5)

----------
For topic 0, here are the the top 5 words with weights:
recebi     201.859
produto    155.731
veio       138.080
ainda      124.615
comprei    117.272
Name: 0, dtype: float64
----------
For topic 1, here are the the top 5 words with weights:
bom        532.905
produto    525.276
prazo      474.761
antes      382.684
entrega    379.673
Name: 1, dtype: float64


🏁 Congratulations. Instead of reading 90K+ reviews, you were able to detect the main reasons of dissatisfactions on Olist.

💾 Don't forget to `git add/commit/push`