<a href="https://colab.research.google.com/github/cintia-shinoda/Birdie/blob/master/Teste_T%C3%A9cnico_Birdie.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


**Instructions for running:**
Before running the Jupyter Notebook, upload the file tech_test.tsv.



**Objective:**
Extract aspects from a well-structured set of reviews and present insights about your findings. Optionally, create or propose charts to visualize those insights.

**Details:**
We use information from users to shorten the bridge between consumer and brand, creating targeted observations about their products to make them easier to understand. One part of our information-enrichment process is aspect extraction: finding the words within a text that encode a characteristic of the product's operation, its structure, or the purchase process (delivery, customer service, repairs, and problems). For example:

> This phone has a great battery life and a slick screen, but I'm not a fan of the lack of a headphone jack.

> The store attendant was such a dear! She helped me lots with the return process and even gave me a future discount.

Using this information, we discover and answer questions relevant to our clients about what is being said in the sales channels for their products. Our test involves using a dataset of well-structured reviews, with various pieces of information we collected from American retailers, to:

- Find a way to extract these aspects;
- Explore this information to generate insights (for example, which aspects are most related to positive reviews?);
- Optionally, create charts and visualization proposals for your observations.

Here, we want to analyze your thinking and your process for developing and testing ideas: we do not want only the best method you can find. Showing your reasoning matters more than being correct! Put your hypotheses for solving these problems in your code or in a separate presentation, and go through them one by one.
Since we use Python daily here at the company, we recommend that language for this test. A Jupyter Notebook with your development process well documented and visual is the best channel for presenting your results to us.

**Keywords** to help your research:
nlp, named entity recognition, syntax pattern matching, feature extraction, topic representation, word embedding, sentence embedding

**Inputs:**
- User reviews of refrigerators collected in 2019/2020. There may be duplicates across different SKUs: two colors of the same refrigerator can have the same review.

### Loading the dataset

In [16]:
import pandas as pd

In [17]:
# despite the .tsv extension, the default comma separator yields the 25 columns shown below
df = pd.read_csv('tech_test.tsv')
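
The file extension suggests tab-separated values, while the call above relies on the default comma separator. One quick way to check which delimiter a file really uses is to compare delimiter counts in its header line. The following is a minimal sketch: `guess_delimiter` is a hypothetical helper, and the sample header lines are made up for illustration rather than read from tech_test.tsv.

```python
def guess_delimiter(header_line, candidates=(",", "\t")):
    """Return the candidate delimiter that occurs most often in the header line."""
    return max(candidates, key=header_line.count)


# Toy header lines (illustrative, not read from tech_test.tsv)
comma_header = "retailer,category,brand"
tab_header = "retailer\tcategory\tbrand"

print(repr(guess_delimiter(comma_header)))
print(repr(guess_delimiter(tab_header)))
```

The detected value could then be passed to `pd.read_csv` through its `sep` parameter.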

### EDA

In [18]:
df.shape

(20473, 25)

In [19]:
df.head(10)

Unnamed: 0,retailer,category,breadcrumb,brand,offer_url,offer_sku,offer_retailer,offer_title,title_keywords,price,specs,offer_last_update_at,review_id,review_title,review_body,review_user_rating,review_posted_at,review_year,review_month,review_week,review_day,review_collected_at,locale,original_offer,variant
0,lowes,Refrigerators,"[""Appliances"", ""Refrigerators"", ""Side-by-Side ...",General Electric,https://www.lowes.com/pd/GE-25-3-cu-ft-Side-by...,1000859768,lowes,GE 25.3-cu ft Side-by-Side Refrigerator with I...,'25.3':2 'by':7 'cu':3 'fingerprint':14 'finge...,,"{'brand': ['GE', 'Ge'], 'model': ['GSS25IYNFS'...",2020-05-26 21:51:27.805520,218183104,Functional,Pros: fingerprint resistant so you don't have ...,3.0,2020-03-23,2020,3,13,23,2020-04-24 15:58:56.293182,us,True,"['1000859768', '1000859852']"
1,lowes,Refrigerators,"[""Appliances"", ""Refrigerators"", ""Side-by-Side ...",General Electric,https://www.lowes.com/pd/GE-25-1-cu-ft-Side-by...,1000859852,lowes,GE 25.1-cu ft Side-by-Side Refrigerator with I...,'25.1':2 'black':13 'by':7 'cu':3 'ft':4 'ge':...,,"{'brand': ['GE', 'Ge'], 'model': ['GSS25IBNTS'...",2020-05-26 21:51:28.597592,218183104,Functional,Pros: fingerprint resistant so you don't have ...,3.0,2020-03-23,2020,3,13,23,2020-04-24 14:55:13.485647,us,False,"['1000859768', '1000859852']"
2,lowes,Refrigerators,"[""Appliances"", ""Refrigerators"", ""French Door R...",Frigidaire,https://www.lowes.com/pd/Frigidaire-Gallery-21...,1000289721,lowes,Frigidaire Gallery 21.7-cu ft Counter-depth Fr...,'21.7':3 'counter':7 'counter-depth':6 'cu':4 ...,,"{'brand': ['Frigidaire', 'Frigidaire'], 'model...",2020-05-26 21:50:49.940754,190217370,Ample Door Storage User Friendly Visibility,Feels solid and âupscaleâ. Excellent desig...,5.0,2019-09-28,2019,9,39,28,2020-03-30 23:53:02.331711,us,True,['1000289721']
3,bestbuy_us,Refrigerators,"[""Best Buy"", ""Appliances"", ""Refrigerators"", ""B...",Whirlpool,https://www.bestbuy.com/site/whirlpool-21-9-cu...,3928039,bestbuy_us,Whirlpool - 21.9 Cu. Ft. Bottom-Freezer Refrig...,'21.9':2 'bottom':6 'bottom-freezer':5 'cu':3 ...,,"{'brand': ['whirlpool'], 'Other_UPC': ['883049...",2020-06-05 18:35:55.990040,c407068f-f900-3478-a983-ad74754c1460,So much room,I love this fridge. So much room over having a...,5.0,2019-12-13,2019,12,50,13,2020-04-28 14:09:38.255158,us,True,"['3928039', '3928048', '3979801', '6112639']"
4,bestbuy_us,Refrigerators,"[""Best Buy"", ""Appliances"", ""Refrigerators"", ""B...",Whirlpool,https://www.bestbuy.com/site/whirlpool-21-9-cu...,3979801,bestbuy_us,Whirlpool - 21.9 Cu. Ft. Bottom-Freezer Refrig...,'21.9':2 'black':9 'bottom':6 'bottom-freezer'...,,"{'brand': ['whirlpool'], 'Other_UPC': ['883049...",2020-06-05 17:55:04.244327,c407068f-f900-3478-a983-ad74754c1460,So much room,I love this fridge. So much room over having a...,5.0,2019-12-13,2019,12,50,13,2020-04-28 14:09:41.960342,us,False,"['3928039', '3928048', '3979801', '6112639']"
5,bestbuy_us,Refrigerators,"[""Best Buy"", ""Appliances"", ""Refrigerators"", ""B...",Whirlpool,https://www.bestbuy.com/site/whirlpool-22-1-cu...,6112639,bestbuy_us,Whirlpool - 22.1 Cu. Ft. Bottom-Freezer Refrig...,'22.1':2 'black':9 'bottom':6 'bottom-freezer'...,,"{'brand': ['whirlpool'], 'Other_UPC': ['883049...",2020-06-05 18:40:24.418263,c407068f-f900-3478-a983-ad74754c1460,So much room,I love this fridge. So much room over having a...,5.0,2019-12-13,2019,12,50,13,2020-04-27 19:49:13.756253,us,False,"['3928039', '3928048', '3979801', '6112639']"
6,bestbuy_us,Refrigerators,"[""Best Buy"", ""Appliances"", ""Refrigerators"", ""B...",Whirlpool,https://www.bestbuy.com/site/whirlpool-21-9-cu...,3928048,bestbuy_us,Whirlpool - 21.9 Cu. Ft. Bottom-Freezer Refrig...,'21.9':2 'bottom':6 'bottom-freezer':5 'cu':3 ...,,"{'brand': ['whirlpool'], 'Other_UPC': ['883049...",2020-06-05 17:54:46.970540,c407068f-f900-3478-a983-ad74754c1460,So much room,I love this fridge. So much room over having a...,5.0,2019-12-13,2019,12,50,13,2020-04-28 14:12:47.787123,us,False,"['3928039', '3928048', '3979801', '6112639']"
7,lowes,Refrigerators,"[""Appliances"", ""Refrigerators"", ""Side-by-Side ...",Frigidaire,https://www.lowes.com/pd/Frigidaire-Gallery-22...,1000502823,lowes,Frigidaire Gallery 22-cu ft Counter-depth Side...,'22':3 'by':11 'counter':7 'counter-depth':6 '...,,"{'brand': ['Frigidaire', 'Frigidaire'], 'model...",2020-05-26 21:51:12.317173,189887995,My refrigerator stopped working in less than a...,We were away from the weekend and the frig wen...,2.0,2019-09-09,2019,9,37,9,2020-03-30 21:44:19.988966,us,False,"['1000368269', '1000502823']"
8,lowes,Refrigerators,"[""Appliances"", ""Refrigerators"", ""Side-by-Side ...",Frigidaire,https://www.lowes.com/pd/Frigidaire-Gallery-22...,1000368269,lowes,Frigidaire Gallery 22-cu ft Counter-depth Side...,'22':3 'black':20 'by':11 'counter':7 'counter...,,"{'brand': ['Frigidaire', 'Frigidaire'], 'model...",2020-05-26 21:51:01.809169,189887995,My refrigerator stopped working in less than a...,We were away from the weekend and the frig wen...,2.0,2019-09-09,2019,9,37,9,2020-03-30 20:57:14.319254,us,True,"['1000368269', '1000502823']"
9,homedepot,Refrigerators,,Frigidaire Gallery,https://www.homedepot.com//p/FRIGIDAIRE-GALLER...,303015062,homedepot,22.1 cu. ft. Side by Side Refrigerator in Stai...,'22.1':1 'by':5 'counter':11 'cu':2 'depth':12...,,{'brand': ['frigidaire gallery']},2020-04-03 12:23:21.158517,189887995,My refrigerator stopped working in less than a...,We were away from the weekend and the frig wen...,2.0,2019-09-09,2019,9,37,9,2020-04-24 15:42:15.725709,us,True,['303015062']
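
Several rows above share a `review_id` across different SKUs (and, for `189887995`, across retailers), which matches the duplication mentioned in the inputs. If each review should be counted once, one option is to keep a single row per `review_id`. This is a sketch on a toy frame built from a few of the values shown above; whether dropping duplicates is appropriate depends on the question, since per-SKU analyses may want to keep them.

```python
import pandas as pd

# Toy frame mimicking the pattern above: one review repeated across SKUs
toy = pd.DataFrame({
    "offer_sku": ["1000859768", "1000859852", "1000289721"],
    "review_id": ["218183104", "218183104", "190217370"],
    "review_user_rating": [3.0, 3.0, 5.0],
})

# Keep the first row for each review_id
unique_reviews = toy.drop_duplicates(subset="review_id", keep="first")

print(len(toy), len(unique_reviews))
```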


In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20473 entries, 0 to 20472
Data columns (total 25 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   retailer              20473 non-null  object 
 1   category              20473 non-null  object 
 2   breadcrumb            15677 non-null  object 
 3   brand                 20473 non-null  object 
 4   offer_url             20473 non-null  object 
 5   offer_sku             20473 non-null  object 
 6   offer_retailer        20473 non-null  object 
 7   offer_title           20473 non-null  object 
 8   title_keywords        20473 non-null  object 
 9   price                 993 non-null    float64
 10  specs                 20473 non-null  object 
 11  offer_last_update_at  16619 non-null  object 
 12  review_id             20473 non-null  object 
 13  review_title          20356 non-null  object 
 14  review_body           20473 non-null  object 
 15  review_user_rating 

In [21]:
# null values per column
df.isnull().sum()

retailer                    0
category                    0
breadcrumb               4796
brand                       0
offer_url                   0
offer_sku                   0
offer_retailer              0
offer_title                 0
title_keywords              0
price                   19480
specs                       0
offer_last_update_at     3854
review_id                   0
review_title              117
review_body                 0
review_user_rating          0
review_posted_at            0
review_year                 0
review_month                0
review_week                 0
review_day                  0
review_collected_at         0
locale                      0
original_offer              0
variant                     0
dtype: int64
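
Raw null counts are easier to judge as a share of the 20,473 rows. The sketch below redoes that arithmetic with the non-zero counts copied from the output above; `price`, for instance, is missing in roughly 95% of rows.

```python
n_rows = 20473  # from df.shape above

# Null counts copied from the output above (non-zero columns only)
null_counts = {
    "breadcrumb": 4796,
    "price": 19480,
    "offer_last_update_at": 3854,
    "review_title": 117,
}

null_share = {col: count / n_rows for col, count in null_counts.items()}

# Print columns from most to least incomplete
for col, share in sorted(null_share.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col:<22}{share:6.1%}")
```

On the dataframe itself, `df.isnull().mean()` gives the same fractions directly.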

In [None]:
# number of reviews per refrigerator model (SKU)
df.offer_sku.value_counts()

In [None]:
# number of reviews per brand
df.brand.value_counts()

In [46]:
# mean of review_user_rating
df.review_user_rating.mean()

4.229179895472085

### Cleaning and Standardization

In [25]:
# remove punctuation (regex=True is required in pandas >= 2.0, where the default is a literal match)
df.review_body = df.review_body.str.replace(r"[^\w\s]", "", regex=True)

In [27]:
# lowercase
df.review_body = df.review_body.str.lower()

In [29]:
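# replace a mis-encoded character sequence (literal match)
# note: punctuation removal strips "‰", so this pattern only matches if it runs before that step
df.review_body = df.review_body.str.replace("‰Ûª", "''", regex=False)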
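
The order of the cleaning steps matters: once punctuation has been removed and the text lowercased, the sequence `‰Ûª` no longer appears in its original form. The sketch below applies the three steps in an order where each one can still take effect. `clean_review` is a hypothetical helper, the sample sentence is made up, and replacing the sequence with a single apostrophe is an assumption about what it stands for.

```python
import re


def clean_review(text):
    """Repair a known mis-encoded apostrophe, lowercase, then strip punctuation."""
    text = text.replace("‰Ûª", "'")      # must run first: "‰" counts as punctuation
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # keep only word characters and whitespace
    return text


print(clean_review("It‰Ûªs GREAT, but the ice-maker is loud!"))
```

Note that removing the hyphen merges "ice-maker" into a single token, which may or may not be desirable for aspect extraction.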

In [None]:
# tokenization
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')

[word_tokenize(t) for t in df.review_body]

In [38]:
# stopwords
import nltk
nltk.download('stopwords')

stops = set(nltk.corpus.stopwords.words('english'))  # a set makes membership checks fast

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [39]:
df.review_body = df.review_body.apply(lambda x: ' '.join([word for word in x.split() if word not in (stops)]))

In [None]:
# unigram
# from sklearn.feature_extraction.text import CountVectorizer

# vect = CountVectorizer(ngram_range=(1,1))
# vect.fit(df.review_body)
# text_vect = vect.transform(df.review_body)

# print(pd.DataFrame(text_vect.toarray(), columns=vect.get_feature_names_out()).to_string())

In [None]:
# bigram
# from sklearn.feature_extraction.text import CountVectorizer

# vect = CountVectorizer(ngram_range=(2,2))
# vect.fit(df.review_body)
# text_vect = vect.transform(df.review_body)

# print(pd.DataFrame(text_vect.toarray(), columns=vect.get_feature_names_out()).to_string())

In [None]:
# trigram
# from sklearn.feature_extraction.text import CountVectorizer

# vect = CountVectorizer(ngram_range=(3,3))
# vect.fit(df.review_body)
# text_vect = vect.transform(df.review_body)

# print(pd.DataFrame(text_vect.toarray(), columns=vect.get_feature_names_out()).T.to_string())
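
The commented-out cells build a dense document-term matrix and print it in full, which is likely impractical for 20,473 reviews. A lighter alternative is to count n-grams directly with `collections.Counter`. This swaps in a different technique from `CountVectorizer` and is only a sketch: `ngrams` is a hypothetical helper and the reviews are toy data.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) from a token list."""
    return list(zip(*(tokens[i:] for i in range(n))))


# Toy, already-cleaned reviews (illustrative)
reviews = [
    "love ice maker",
    "ice maker stopped working",
    "love door storage",
]

bigram_counts = Counter()
for review in reviews:
    bigram_counts.update(ngrams(review.split(), 2))

print(bigram_counts.most_common(3))
```

Changing the second argument of `ngrams` to 1 or 3 gives the unigram and trigram counts.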

In [41]:
from collections import Counter
c = Counter()

In [44]:
# 30 most frequent words
# note: `c` accumulates across runs; re-create it with Counter() before re-running this cell
df.review_body.str.lower().str.split(" ").apply(c.update)
c.most_common(30)

[('fridge', 26672),
 ('ice', 20004),
 ('part', 17896),
 ('refrigerator', 17448),
 ('review', 17126),
 ('love', 17012),
 ('door', 16914),
 ('promotion', 16634),
 ('collected', 16632),
 ('freezer', 16020),
 ('space', 11880),
 ('great', 11842),
 ('water', 11022),
 ('one', 10668),
 ('like', 9646),
 ('bought', 8656),
 ('side', 7944),
 ('maker', 7626),
 ('room', 7208),
 ('new', 6670),
 ('good', 6046),
 ('shelves', 5880),
 ('would', 5784),
 ('much', 5638),
 ('inside', 5272),
 ('old', 5258),
 ('well', 5202),
 ('ago', 4936),
 ('drawer', 4650),
 ('doors', 4544)]
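
Every count in the list above is even, which is what one would expect if the counting cell had been executed twice against the same `Counter` (an inference, not something the notebook states). Building the counter inside a function avoids that kind of accumulation. A minimal sketch on toy data, with `word_counts` as a hypothetical helper:

```python
from collections import Counter
from itertools import chain

# Toy cleaned reviews (illustrative)
reviews = ["love fridge", "fridge ice maker", "ice maker broke"]


def word_counts(texts):
    """Build a new Counter on each call, so repeated runs give identical totals."""
    return Counter(chain.from_iterable(text.split() for text in texts))


first = word_counts(reviews)
second = word_counts(reviews)  # re-running does not double the counts

print(first.most_common(2))
```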

In [51]:
# Named Entity Recognition (NER)
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()
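
General-purpose NER models are trained mainly on entities such as people, organizations, and dates, so product aspects like "ice maker" may not surface as entities. As a complementary baseline, aspects can be matched against a seed lexicon and related to ratings, which speaks to the question of which aspects go with positive reviews. The sketch below is only an illustration: the `ASPECTS` list, the `extract_aspects` helper, and the reviews are all hypothetical, and plain substring matching will also hit words such as "outdoor".

```python
from collections import defaultdict

# Hypothetical seed lexicon of aspect terms (not taken from the dataset)
ASPECTS = ["ice maker", "door", "freezer", "shelves", "water dispenser"]


def extract_aspects(text, aspects=ASPECTS):
    """Return the seed aspects that occur in a lowercased review."""
    text = text.lower()
    return [aspect for aspect in aspects if aspect in text]


# Toy (review, rating) pairs, made up for illustration
reviews = [
    ("Love the ice maker and the adjustable shelves", 5.0),
    ("The ice maker stopped working after a month", 1.0),
    ("Freezer is roomy and the shelves are sturdy", 4.0),
]

# Collect the ratings of every review that mentions each aspect
ratings_by_aspect = defaultdict(list)
for body, rating in reviews:
    for aspect in extract_aspects(body):
        ratings_by_aspect[aspect].append(rating)

mean_rating = {a: sum(r) / len(r) for a, r in ratings_by_aspect.items()}
print(mean_rating)
```

On the real data, the same loop could run over `review_body` and `review_user_rating`, with the lexicon seeded from the frequent nouns found earlier.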