# Analysis of Reviews on Olist

🎯 Now that you are familiar with NLP, let's analyze the reviews of Olist.

👇 Run the following cell to load the reviews dataset and install `unidecode`

In [1]:
!pip install -q unidecode

import pandas as pd

url = "https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ml_olist_nlp_reviews.csv"
df = pd.read_csv(url, low_memory = False)

df.head()


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.3.1[0m[39;49m -> [0m[32;49m23.3.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


Unnamed: 0.1,Unnamed: 0,review_id,length_review,review_score,order_id,product_category_name,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
0,0,7bc2406110b926393aa56f80a40eba40,0,4,73fc7af87114b39712e6da79b0a377eb,esporte_lazer,,,2018-01-18 00:00:00,2018-01-18 21:46:59,41dcb106f807e993532d446263290104,delivered,2018-01-11 15:30:49,2018-01-11 15:47:59,2018-01-12 21:57:22,2018-01-17 18:42:41,2018-02-02 00:00:00
1,1,80e641a11e56f04c1ad469d5645fdfde,0,5,a548910a1c6147796b98fdf73dbeba33,informatica_acessorios,,,2018-03-10 00:00:00,2018-03-11 03:05:13,8a2e7ef9053dea531e4dc76bd6d853e6,delivered,2018-02-28 12:25:19,2018-02-28 12:48:39,2018-03-02 19:08:15,2018-03-09 23:17:20,2018-03-14 00:00:00
2,2,228ce5500dc1d8e020d8d1322874b6f0,0,5,f9e4b658b201a9f2ecdecbb34bed034b,informatica_acessorios,,,2018-02-17 00:00:00,2018-02-18 14:36:24,e226dfed6544df5b7b87a48208690feb,delivered,2018-02-03 09:56:22,2018-02-03 10:33:41,2018-02-06 16:18:28,2018-02-16 17:28:48,2018-03-09 00:00:00
3,3,e64fb393e7b32834bb789ff8bb30750e,37,5,658677c97b385a9be170737859d3511b,ferramentas_jardim,,Recebi bem antes do prazo estipulado.,2017-04-21 00:00:00,2017-04-21 22:02:06,de6dff97e5f1ba84a3cd9a3bc97df5f6,delivered,2017-04-09 17:41:13,2017-04-09 17:55:19,2017-04-10 14:24:47,2017-04-20 09:08:35,2017-05-10 00:00:00
4,4,f7c4243c7fe1938f181bec41a392bdeb,100,5,8e6bfb81e283fa7e4f11123a3fb894f1,esporte_lazer,,Parabéns lojas lannister adorei comprar pela I...,2018-03-01 00:00:00,2018-03-02 10:26:53,5986b333ca0d44534a156a52a8e33a83,delivered,2018-02-10 10:59:03,2018-02-10 15:48:21,2018-02-15 19:36:14,2018-02-28 16:33:35,2018-03-09 00:00:00


In [2]:
df.shape

(98657, 17)

❓ **Question: Analyse the reviews to understand what could be causing the bad review scores** ❓

This challenge is less guided than the previous ones, but here are some questions to ask yourself:

- Are all the reviews relevant?
- What about combining the title and the body of a review?
- What cleaning operations would you apply to the reviews?

🇧🇷 Some Brazilian Portuguese expressions (written without accents, as they appear after cleaning) and their translations:

- `produto errado` = wrong product
- `ainda nao` = not yet
- `nao entregue` = not delivered
- `nao veio` = did not arrive
- `nao gostei` = did not like it
- `produto defeito` = defective product
- `nao funciona` = not working
- `produto diferente` = different product
- `pessima qualidade` = very poor quality
- `veio defeito` = arrived defective
- `veio faltando` = arrived with something missing
- `veio errado` = arrived wrong
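
As a rough sketch, expressions like these can be counted directly in a column of cleaned text. The toy `reviews` Series and the `count_expressions` helper below are hypothetical and only there for illustration; on the real data you would pass the cleaned review column instead.

```python
import pandas as pd

# A few expressions taken from the list above (accents already removed)
EXPRESSIONS = ["ainda nao", "nao entregue", "nao veio", "nao gostei", "veio errado"]

def count_expressions(texts: pd.Series, expressions=EXPRESSIONS) -> pd.Series:
    """Return, for each expression, the number of texts that contain it."""
    counts = {
        expr: int(texts.str.contains(expr, regex=False).sum())
        for expr in expressions
    }
    return pd.Series(counts).sort_values(ascending=False)

# Hypothetical toy data standing in for cleaned reviews
reviews = pd.Series([
    "ainda nao recebi produto",
    "nao gostei produto veio errado",
    "produto errado nao veio completo",
    "otimo produto",
])

counts = count_expressions(reviews)
print(counts)
```

Plain substring matching is enough here because the expressions are fixed strings; `regex=False` avoids any surprise with special characters.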

In [3]:
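import string

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

import unidecode

# word_tokenize and stopwords rely on NLTK data. If it is missing, download it once:
# import nltk; nltk.download('punkt'); nltk.download('stopwords')
# (recent NLTK releases may ask for 'punkt_tab' instead of 'punkt')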


In [4]:
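# Keep only the reviews written after the customer received the order.
# Both columns hold ISO-formatted strings (YYYY-MM-DD HH:MM:SS),
# so comparing them as text follows chronological order.

df = df[df['review_creation_date'] >= df['order_delivered_customer_date']]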
df.shape

(88005, 17)
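
The filter above compares the two date columns as text. A minimal sketch of the same filter on real datetimes, assuming a hypothetical toy frame `toy` in place of `df`, could look like this:

```python
import pandas as pd

# Hypothetical toy frame with the same two date columns as the dataset
toy = pd.DataFrame({
    "review_creation_date": [
        "2018-01-18 00:00:00",
        "2018-03-01 00:00:00",
        "2018-02-01 00:00:00",
    ],
    "order_delivered_customer_date": [
        "2018-01-17 18:42:41",
        "2018-03-05 10:00:00",
        None,  # order never delivered
    ],
})

# Parse both columns; unparseable or missing values become NaT
date_cols = ["review_creation_date", "order_delivered_customer_date"]
for col in date_cols:
    toy[col] = pd.to_datetime(toy[col], errors="coerce")

# A comparison against NaT is False, so undelivered orders are dropped too
after_delivery = toy[toy["review_creation_date"] >= toy["order_delivered_customer_date"]]
print(after_delivery.shape)
```

Parsing first makes the comparison independent of the string format and makes the handling of missing delivery dates explicit.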

In [5]:
df.columns

Index(['Unnamed: 0', 'review_id', 'length_review', 'review_score', 'order_id',
       'product_category_name', 'review_comment_title',
       'review_comment_message', 'review_creation_date',
       'review_answer_timestamp', 'customer_id', 'order_status',
       'order_purchase_timestamp', 'order_approved_at',
       'order_delivered_carrier_date', 'order_delivered_customer_date',
       'order_estimated_delivery_date'],
      dtype='object')

In [6]:
df = df[['order_id','product_category_name','review_comment_title','review_comment_message','review_score']]
df.columns

Index(['order_id', 'product_category_name', 'review_comment_title',
       'review_comment_message', 'review_score'],
      dtype='object')

In [7]:
df.isnull().sum()

order_id                      0
product_category_name         0
review_comment_title      77777
review_comment_message    53440
review_score                  0
dtype: int64

In [8]:
df[['review_comment_title', 'review_comment_message']] = df[['review_comment_title', 'review_comment_message']].fillna('')

In [9]:
df.isnull().sum()

order_id                  0
product_category_name     0
review_comment_title      0
review_comment_message    0
review_score              0
dtype: int64

In [10]:
df["review"] = df["review_comment_title"].str.cat(df["review_comment_message"], sep="-")
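
One thing to keep in mind with the `"-"` separator: if punctuation is later deleted outright, a title with no trailing space ends up glued to the first word of the message. A possible variant, shown here on a hypothetical toy frame and not the choice made in this notebook, is to join with a space and strip the result:

```python
import pandas as pd

# Hypothetical toy frame with the same two text columns as the dataset
toy = pd.DataFrame({
    "review_comment_title": ["Foto enganosa", "", None],
    "review_comment_message": ["Foto muito diferente", "Recebi bem antes do prazo", None],
})

text_cols = ["review_comment_title", "review_comment_message"]
toy[text_cols] = toy[text_cols].fillna("")

# Join title and message with a space, then trim the space left by empty parts
toy["review"] = (
    toy["review_comment_title"]
    .str.cat(toy["review_comment_message"], sep=" ")
    .str.strip()
)
print(toy["review"].tolist())
```

With this variant, rows without any text become empty strings rather than `"-"`.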

In [11]:
df.drop(columns=['review_comment_title', 'review_comment_message'], inplace = True)

In [12]:
df

Unnamed: 0,order_id,product_category_name,review_score,review
0,73fc7af87114b39712e6da79b0a377eb,esporte_lazer,4,-
1,a548910a1c6147796b98fdf73dbeba33,informatica_acessorios,5,-
2,f9e4b658b201a9f2ecdecbb34bed034b,informatica_acessorios,5,-
3,658677c97b385a9be170737859d3511b,ferramentas_jardim,5,-Recebi bem antes do prazo estipulado.
4,8e6bfb81e283fa7e4f11123a3fb894f1,esporte_lazer,5,-Parabéns lojas lannister adorei comprar pela ...
...,...,...,...,...
98652,2a8c23fee101d4d5662fa670396eb8da,moveis_decoracao,5,-
98653,22ec9f0669f784db00fa86d035cf8602,brinquedos,5,-
98654,55d4004744368f5571d1f590031933e4,papelaria,5,"-Excelente mochila, entrega super rápida. Supe..."
98655,7725825d039fc1f0ceb7635e3f7d9206,esporte_lazer,4,-


In [13]:
def preprocessing(sentence):
    # Basic cleaning
    sentence = sentence.strip()  # remove leading and trailing whitespace
    sentence = sentence.lower()  # lowercase
    sentence = ''.join(char for char in sentence if not char.isdigit())  # remove digits

    # Advanced cleaning
    for punctuation in string.punctuation:
        sentence = sentence.replace(punctuation, '')  # remove punctuation

    sentence = unidecode.unidecode(sentence)  # remove accents

    tokenized_sentence = word_tokenize(sentence)  # tokenize
    stop_words = set(stopwords.words('portuguese'))  # Portuguese stopwords

    # Remove stopwords. Accents were stripped above while NLTK's list keeps them,
    # so accented stopwords such as "não" no longer match and "nao" is kept.
    tokenized_sentence_cleaned = [w for w in tokenized_sentence if w not in stop_words]

    cleaned_sentence = ' '.join(tokenized_sentence_cleaned)

    return cleaned_sentence
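
To experiment with the cleaning steps without any NLTK download, here is a dependency-free sketch of the same idea. It is an illustration and not a drop-in replacement: `simple_clean` and `TOY_STOPWORDS` are hypothetical names, the stopword list is deliberately tiny, accents are stripped with the standard library instead of `unidecode`, and punctuation is replaced by spaces instead of being deleted.

```python
import string
import unicodedata

# Hypothetical, deliberately tiny stopword list (the notebook uses NLTK's Portuguese list)
TOY_STOPWORDS = {"o", "a", "de", "do", "e", "que", "muito"}

# Map every punctuation character to a space so "qualidade.Produto" stays two words
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def simple_clean(sentence: str) -> str:
    sentence = sentence.strip().lower()
    sentence = "".join(ch for ch in sentence if not ch.isdigit())  # remove digits
    sentence = sentence.translate(PUNCT_TO_SPACE)                  # punctuation -> space
    # Strip accents: decompose characters, then drop the non-ASCII combining marks
    sentence = unicodedata.normalize("NFKD", sentence)
    sentence = sentence.encode("ascii", "ignore").decode("ascii")
    # Whitespace tokenization, then stopword removal
    tokens = [w for w in sentence.split() if w not in TOY_STOPWORDS]
    return " ".join(tokens)

cleaned = simple_clean("Péssima qualidade.Produto não funciona, veio com 2 defeitos!")
print(cleaned)
```

Note that `nao` survives the cleaning, which matters here: negations carry most of the signal in bad reviews.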

In [14]:
df['clean_review'] = df['review'].apply(preprocessing)
df

Unnamed: 0,order_id,product_category_name,review_score,review,clean_review
0,73fc7af87114b39712e6da79b0a377eb,esporte_lazer,4,-,
1,a548910a1c6147796b98fdf73dbeba33,informatica_acessorios,5,-,
2,f9e4b658b201a9f2ecdecbb34bed034b,informatica_acessorios,5,-,
3,658677c97b385a9be170737859d3511b,ferramentas_jardim,5,-Recebi bem antes do prazo estipulado.,recebi bem antes prazo estipulado
4,8e6bfb81e283fa7e4f11123a3fb894f1,esporte_lazer,5,-Parabéns lojas lannister adorei comprar pela ...,parabens lojas lannister adorei comprar intern...
...,...,...,...,...,...
98652,2a8c23fee101d4d5662fa670396eb8da,moveis_decoracao,5,-,
98653,22ec9f0669f784db00fa86d035cf8602,brinquedos,5,-,
98654,55d4004744368f5571d1f590031933e4,papelaria,5,"-Excelente mochila, entrega super rápida. Supe...",excelente mochila entrega super rapida super r...
98655,7725825d039fc1f0ceb7635e3f7d9206,esporte_lazer,4,-,


In [15]:
df['review_score'].value_counts(normalize=True)

5    0.619738
4    0.203773
3    0.082200
1    0.067212
2    0.027078
Name: review_score, dtype: float64
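
In the output above, scores 1 to 3 add up to roughly 17.6% of the reviews. A minimal sketch of how that share can be computed directly, using a hypothetical toy Series of scores instead of `df['review_score']`:

```python
import pandas as pd

# Hypothetical toy scores standing in for df['review_score']
scores = pd.Series([5, 5, 4, 3, 1, 5, 2, 4, 5, 5], name="review_score")

# The mean of a boolean mask is the fraction of True values
bad_share = (scores < 4).mean()
print(f"{bad_share:.1%}")
```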

In [16]:
bad_review_filter = df['review_score'] < 4
df_badreview = df[bad_review_filter]
df_badreview

Unnamed: 0,order_id,product_category_name,review_score,review,clean_review
5,b18dcdf73be66366873cd26c5724d1dc,cama_mesa_banho,1,-,
14,d7bd0e4afdf94846eb73642b4e3e75c3,pet_shop,3,-,
18,70a752414a13d09cc1f2b437b914b28e,bebes,3,-,
42,aad1dcbe4c9fe2e3486e5e04c6649097,relogios_presentes,2,-,
53,548df2c6e5f089574614894bca78acf5,eletronicos,1,-recebi somente 1 controle Midea Split ESTILO....,recebi somente controle midea split estilo fal...
...,...,...,...,...,...
98634,f2d12dd37eaef72ed7b1186b2edefbcd,pet_shop,2,Foto enganosa -Foto muito diferente principalm...,foto enganosa foto diferente principalmente gr...
98637,18ed848509774f56cc8c1c0a1903ad7f,construcao_ferramentas_construcao,2,-Tive um problema na entrega em que o correio ...,problema entrega correio colocou site entregue...
98648,d5cb12269711bd1eaf7eed8fd32a7c95,telefonia,3,"-O produto não foi enviado com NF, não existe ...",produto nao enviado nf nao existe venda nf cer...
98649,acd45245723df7cb52772a34416b41b1,malas_acessorios,3,-,


In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Bigrams only, keeping those found in at least 1% and at most 10% of the bad reviews
vectorizer = TfidfVectorizer(ngram_range=(2, 2), min_df=0.01, max_df=0.1)

X = df_badreview['clean_review']

# Fit on the bad reviews and wrap the TF-IDF matrix in a DataFrame with one column per bigram
vectorized_text = vectorizer.fit_transform(X)
vectorized_text = pd.DataFrame(
    vectorized_text.toarray(),
    columns=vectorizer.get_feature_names_out())

vectorized_text

Unnamed: 0,ainda nao,ate agora,nao chegou,nao entregue,nao gostei,nao recebi,nao recomendo,nao veio,nota fiscal,produto chegou,produto entregue,produto nao,produto veio,recebi apenas,recebi produto,so recebi
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15527,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15528,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15529,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
15530,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [18]:
vectorized_text.sum(axis=0)

ainda nao           257.700094
ate agora           161.740341
nao chegou          133.562673
nao entregue        162.951762
nao gostei          212.685935
nao recebi          501.862193
nao recomendo       175.837714
nao veio            190.547070
nota fiscal         195.264978
produto chegou      210.407191
produto entregue    204.834622
produto nao         392.694975
produto veio        371.157351
recebi apenas       203.977642
recebi produto      390.398325
so recebi           222.774940
dtype: float64

🏁 Congratulations! Instead of reading 90K+ reviews, you were able to detect the main reasons for dissatisfaction on Olist.

💾 Don't forget to `git add/commit/push`