<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-necessary-modules" data-toc-modified-id="Import-necessary-modules-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import necessary modules</a></span></li><li><span><a href="#Import-the-dataset" data-toc-modified-id="Import-the-dataset-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Import the dataset</a></span><ul class="toc-item"><li><span><a href="#Creating-a-copy-of-the-original-dataframe-(in-case-things-need-to-restarts!)" data-toc-modified-id="Creating-a-copy-of-the-original-dataframe-(in-case-things-need-to-restarts!)-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Creating a copy of the original dataframe (in case things need to restarts!)</a></span></li></ul></li><li><span><a href="#Grouping-all-the-reviews-of-a-particular-product-together" data-toc-modified-id="Grouping-all-the-reviews-of-a-particular-product-together-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Grouping all the reviews of a particular product together</a></span><ul class="toc-item"><li><span><a href="#Before-grouping-dropping-all-the-brands-that-have-less-than-4-reviews-(because-1-or-2-reviews-for-a-brand-is-not-sufficient-to-determine-something-about-it)" data-toc-modified-id="Before-grouping-dropping-all-the-brands-that-have-less-than-4-reviews-(because-1-or-2-reviews-for-a-brand-is-not-sufficient-to-determine-something-about-it)-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Before grouping dropping all the brands that have less than 4 reviews (because 1 or 2 reviews for a brand is not sufficient to determine something about it)</a></span></li></ul></li><li><span><a href="#Remove-all-unnecessary-(stop-words)-from-all-the-combined-reviews" data-toc-modified-id="Remove-all-unnecessary-(stop-words)-from-all-the-combined-reviews-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Remove all unnecessary (stop words) from all the combined reviews</a></span></li><li><span><a 
href="#Apply-Spacy-NER-tagging-to-get-the-nouns-and-the-adjectives-from-all-the-reviews" data-toc-modified-id="Apply-Spacy-NER-tagging-to-get-the-nouns-and-the-adjectives-from-all-the-reviews-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Apply Spacy NER tagging to get the nouns and the adjectives from all the reviews</a></span></li></ul></div>

### Import necessary modules

In [59]:
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

import spacy

# Medium English pipeline; its part-of-speech tagger is used below to pick out nouns
nlp = spacy.load("en_core_web_md")

from tqdm.notebook import tqdm
# Initialize TQDM
tqdm.pandas(desc="progress bar")

import warnings
warnings.filterwarnings('ignore')

### Import the dataset

In [2]:
df_raw = pd.read_csv("input/raw_data.csv")

In [3]:
df_raw.head()

Unnamed: 0,id,content,date,product,brand,rating
0,6254,"12 Nov 2021 Sweet , dark roast malts , sweet o...",2022-01-15,anchorage-gutted,Anchorage,4.0
1,6347,"Bottle . Hazy golden body with a large , froth...",2022-01-09 00:00:00,samuel-adams-cold-snap-2020-spring,Samuel Adams,3.0
2,8149,Pours an opaque deep dark brown with a decent ...,2022-01-03,sweetwater-festive-ale,Sweetwater,3.6
3,5240,Canned . Dark black with undetermined clarity ...,2021-12-27,athletic-all-out,Athletic,4.0
4,8500,"Light gold , finger of white froth atop . Nose...",2021-12-22,bell-s-rind-over-matter,Bell's,3.4


#### Creating a copy of the original dataframe (in case things need to restart!)

In [4]:
df = df_raw.copy(deep=True)

> Some brands map to a single product (1:1), while others map to several unique products (1:many).
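
One way to check the brand-to-product mapping is to count distinct products per brand. The sketch below runs on a small made-up frame (`toy`, including the invented product name `sweetwater-420`) rather than the real `df`; only the column names `brand` and `product` are taken from the notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for `df`; values are illustrative only.
toy = pd.DataFrame({
    "brand": ["Anchorage", "Sweetwater", "Sweetwater", "Sweetwater", "Athletic"],
    "product": [
        "anchorage-gutted",
        "sweetwater-festive-ale",
        "sweetwater-420",            # invented product name
        "sweetwater-festive-ale",
        "athletic-all-out",
    ],
})

# Number of distinct products reviewed for each brand.
products_per_brand = toy.groupby("brand")["product"].nunique()

# Brands with exactly one product (1:1) versus more than one (1:many).
one_to_one = products_per_brand[products_per_brand == 1].index.tolist()
one_to_many = products_per_brand[products_per_brand > 1].index.tolist()

print(one_to_one)
print(one_to_many)
```

Running the same two lines on `df` instead of `toy` should show how many brands fall into each group.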

### Grouping all the reviews of a particular brand together

#### Before grouping, drop all the brands that have fewer than 4 unique reviews (a handful of reviews is not sufficient to determine anything about a brand)

In [32]:
# Count distinct review texts per brand
df_low_review_count_df = df.groupby('brand')['content'].nunique().reset_index()
df_low_review_count_df.columns = ['brand', 'unique_review_counts']
# Keep only the brands with fewer than 4 unique reviews (these will be dropped)
df_low_review_count_df = df_low_review_count_df[df_low_review_count_df['unique_review_counts'] < 4]
df_low_review_count_df.shape

(614, 2)

In [35]:
brands_to_remove = df_low_review_count_df.brand.unique().tolist()

In [38]:
df = df[~df['brand'].isin(brands_to_remove)]
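
The count, list, and filter steps above can also be expressed as a single boolean mask. This is a minimal sketch on made-up data, not a replacement for the cells above; the frame `toy` and the constant `MIN_UNIQUE_REVIEWS` are names assumed here for illustration.

```python
import pandas as pd

# Hypothetical toy data: brand A has 4 unique reviews, B has 2, C has 3 (with duplicates).
toy = pd.DataFrame({
    "brand": ["A"] * 4 + ["B"] * 2 + ["C"] * 5,
    "content": ["a1", "a2", "a3", "a4", "b1", "b2", "c1", "c1", "c2", "c3", "c3"],
})

MIN_UNIQUE_REVIEWS = 4  # assumed name; mirrors the "< 4" threshold used above

# For every row, the number of distinct review texts its brand has.
unique_counts = toy.groupby("brand")["content"].transform("nunique")

# Keep only rows whose brand reaches the threshold.
filtered = toy[unique_counts >= MIN_UNIQUE_REVIEWS]

print(filtered["brand"].unique().tolist())
```

Because `transform` returns one value per original row, the mask lines up with the frame directly and no intermediate list of brands is needed.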

In [43]:
df_brand_wise_grouped = df.groupby('brand')['content'].agg(' '.join).reset_index()

In [44]:
df_brand_wise_grouped.to_csv("outputs/preprocessed/brand_wise_grouped_reviews.csv", index=False)

### Remove all unnecessary words (stop words) from the combined reviews

In [45]:
# A set gives constant-time membership checks when filtering tokens
stop_words = set(stopwords.words("english"))

In [46]:
def remove_stop_words(text):
    """Tokenize `text` and drop NLTK English stop words (case-sensitive match)."""
    tokens = word_tokenize(text)
    return " ".join(tok for tok in tokens if tok not in stop_words)

In [47]:
df_brand_wise_grouped['content_preprocessed'] = df_brand_wise_grouped['content'].progress_apply(remove_stop_words)

progress bar:   0%|          | 0/264 [00:00<?, ?it/s]

In [48]:
df_brand_wise_grouped.head()

Unnamed: 0,brand,content,content_preprocessed
0,3 Fonteinen,"16 / 17 bottle . Citrusy , tart , yoghurt like...","16 / 17 bottle . Citrusy , tart , yoghurt like..."
1,AC Golden,Pours an effervescent amber with 2 + fingers o...,Pours effervescent amber 2 + fingers pearl col...
2,Against the Grain,Can: Poured a pitch - black color stout with a...,Can : Poured pitch - black color stout nice la...
3,Alaskan Brewing,"Black and opaque color , has fine cream - colo...","Black opaque color , fine cream - colored foam..."
4,Alesmith,Big sweet caramel flavours as this one hits th...,Big sweet caramel flavours one hits tongue . M...
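
In the output above, "Can" survives in the "Against the Grain" row even though "can" appears to be on the stop-word list, because tokens are compared in their original case. The sketch below shows a case-insensitive variant. To stay self-contained it uses a small hard-coded stop-word set (`STOP_WORDS`, a hypothetical stand-in for the NLTK list, which is assumed to be lowercase) and a plain whitespace split instead of `word_tokenize`.

```python
# Hypothetical stand-in for NLTK's English stop-word list (assumed to be lowercase).
STOP_WORDS = {"a", "an", "the", "with", "of", "and", "this", "can"}


def remove_stop_words_ci(text, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is a stop word; keep the casing of the rest."""
    tokens = text.split()  # whitespace split; the notebook itself uses word_tokenize
    return " ".join(tok for tok in tokens if tok.lower() not in stop_words)


print(remove_stop_words_ci("The beer pours an opaque brown with a thin head"))
```

Whether this is an improvement depends on the data: in beer reviews "Can" often names the container, so a case-insensitive match would remove a word that the case-sensitive version happens to keep.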


In [49]:
# saving the processed dataframe
df_brand_wise_grouped.to_csv("outputs/preprocessed/brand_wise_grouped_reviews_without_stop_words.csv", index=False)

### Apply spaCy part-of-speech tagging to get the most frequent nouns from all the reviews

In [50]:
def get_aspects(x):
    doc = nlp(x)  # run the spaCy pipeline (tokenization and part-of-speech tagging)
    # keep nouns that are not stop words, normalized to lower case
    nouns = [tok.text.lower() for tok in doc if tok.text not in stop_words and tok.pos_ == "NOUN"]
    # return the 5 most frequent nouns
    return pd.Series(nouns).value_counts().head().index.tolist()

In [51]:
df_brand_wise_grouped['top_nouns'] = df_brand_wise_grouped['content_preprocessed'].progress_apply(get_aspects)

progress bar:   0%|          | 0/264 [00:00<?, ?it/s]

In [52]:
df_brand_wise_grouped.head()

Unnamed: 0,brand,content,content_preprocessed,top_nouns
0,3 Fonteinen,"16 / 17 bottle . Citrusy , tart , yoghurt like...","16 / 17 bottle . Citrusy , tart , yoghurt like...","[bottle, head, beer, aroma, funk]"
1,AC Golden,Pours an effervescent amber with 2 + fingers o...,Pours effervescent amber 2 + fingers pearl col...,"[head, lemon, lacing, taste, aroma]"
2,Against the Grain,Can: Poured a pitch - black color stout with a...,Can : Poured pitch - black color stout nice la...,"[notes, chocolate, stout, head, malt]"
3,Alaskan Brewing,"Black and opaque color , has fine cream - colo...","Black opaque color , fine cream - colored foam...","[aroma, head, coffee, chocolate, smoke]"
4,Alesmith,Big sweet caramel flavours as this one hits th...,Big sweet caramel flavours one hits tongue . M...,"[malt, head, bottle, aroma, bit]"
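
The counting step inside `get_aspects` can be illustrated without loading a spaCy model. The sketch below takes `(text, pos)` pairs, a simplified stand-in assumed here for spaCy tokens and their `pos_` tags, and uses `collections.Counter` in place of `value_counts`. The function name `top_nouns` and the sample data are hypothetical.

```python
from collections import Counter


def top_nouns(tagged_tokens, k=5):
    """Return the k most frequent lowercase nouns from (text, pos) pairs."""
    counts = Counter(text.lower() for text, pos in tagged_tokens if pos == "NOUN")
    return [word for word, _ in counts.most_common(k)]


# Made-up tagged tokens; in the notebook these would come from `nlp(text)`.
tagged = [
    ("Head", "NOUN"), ("frothy", "ADJ"), ("head", "NOUN"), ("aroma", "NOUN"),
    ("pours", "VERB"), ("aroma", "NOUN"), ("head", "NOUN"), ("malt", "NOUN"),
]

print(top_nouns(tagged, k=2))
```

Lowercasing before counting means "Head" and "head" are tallied together, which matches the normalization step in `get_aspects`.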


In [70]:
df_brand_wise_grouped[df_brand_wise_grouped['brand'] == 'Samuel Adams']

Unnamed: 0,brand,content,content_preprocessed,top_nouns
200,Samuel Adams,"Bottle . Hazy golden body with a large , froth...","Bottle . Hazy golden body large , frothy white...","[head, aroma, beer, bottle, flavor]"


In [71]:
# saving the file with the top 5 nouns
df_brand_wise_grouped.to_csv("outputs/preprocessed/brand_wise_grouped_reviews_with_top_5_nouns.csv", index=False)