### About

This notebook is a playground for trying out techniques that extract relevant words from beer reviews. The extracted words are embedded in a later step.

In [37]:
import pandas as pd
from tqdm import tqdm

# NLP processing
import spacy

nlp = spacy.load("en_core_web_sm")

### Load the data

In [47]:
!wc -l ../data/BeerAdvocate/reviews.txt

 44022962 ../data/BeerAdvocate/reviews.txt


In [46]:
!wc -l ../data/BeerAdvocate/ratings.txt

 151074576 ../data/BeerAdvocate/ratings.txt


In [44]:
!head -20 ../data/BeerAdvocate/ratings.txt

beer_name: Régab
beer_id: 142544
brewery_name: Societe des Brasseries du Gabon (SOBRAGA)
brewery_id: 37262
style: Euro Pale Lager
abv: 4.5
date: 1440064800
user_name: nmann08
user_id: nmann08.184925
appearance: 3.25
aroma: 2.75
palate: 3.25
taste: 2.75
overall: 3.0
rating: 2.88
text: From a bottle, pours a piss yellow color with a fizzy white head.  This is carbonated similar to soda.The nose is basic.. malt, corn, a little floral, some earthy straw.  The flavor is boring, not offensive, just boring.  Tastes a little like corn and grain.  Hard to write a review on something so simple.Its ok, could be way worse.
review: True

beer_name: Barelegs Brew
beer_id: 19590


In [45]:
!head -20 ../data/BeerAdvocate/reviews.txt

beer_name: Régab
beer_id: 142544
brewery_name: Societe des Brasseries du Gabon (SOBRAGA)
brewery_id: 37262
style: Euro Pale Lager
abv: 4.5
date: 1440064800
user_name: nmann08
user_id: nmann08.184925
appearance: 3.25
aroma: 2.75
palate: 3.25
taste: 2.75
overall: 3.0
rating: 2.88
text: From a bottle, pours a piss yellow color with a fizzy white head.  This is carbonated similar to soda.The nose is basic.. malt, corn, a little floral, some earthy straw.  The flavor is boring, not offensive, just boring.  Tastes a little like corn and grain.  Hard to write a review on something so simple.Its ok, could be way worse.

beer_name: Barelegs Brew
beer_id: 19590
brewery_name: Strangford Lough Brewing Company Ltd


In [39]:
file_path = "../data/BeerAdvocate/reviews.txt"
n_lines = 100  # despite the name, this caps the number of reviews parsed, not the lines read

# Initialize an empty list to store each beer's information
beer_data = []

# Open the text file for reading
with open(file_path, "r", encoding="utf-8") as file:  # the data contains non-ASCII names such as "Régab"
    # Initialize an empty dictionary for the current beer
    current_beer = {}
    total = 0  # number of complete beer entries collected so far
    # Iterate over each line in the file
    for line in file:
        # Strip the line of leading/trailing whitespace
        line = line.strip()
        # A blank line separates two beer entries; malformed lines are reported below
        if not line:
            if current_beer:
                # Add the current beer's dictionary to the list and reset it
                beer_data.append(current_beer)
                current_beer = {}
                total += 1
                if total == n_lines:
                    break
        else:
            # Split the line into key and value, if possible
            parts = line.split(": ", 1)
            if len(parts) == 2:
                key, value = parts
                # Special handling for 'date' field to convert it into a readable format
                if key == "date":
                    value = pd.to_datetime(int(value), unit="s")
                # Convert boolean string 'True'/'False' to a Python boolean
                elif value == "True":
                    value = True
                elif value == "False":
                    value = False
                # Add the key-value pair to the current beer's dictionary
                current_beer[key] = value
            else:
                print(f"Line skipped: {line}")

    # Add the last beer's data if the file doesn't end with a blank line
    if current_beer:
        beer_data.append(current_beer)

# Create a DataFrame from the list of dictionaries
df = pd.DataFrame(beer_data)
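
The parsing loop above can also be written as a generator, which keeps the record handling separate from the decision of how many reviews to read. The following is a minimal sketch, not the notebook's own code: the name `iter_reviews` and the in-memory `sample` lines are hypothetical, and the date is converted with the standard library (timezone-aware, UTC) instead of `pd.to_datetime`.

```python
from datetime import datetime, timezone
from itertools import islice


def iter_reviews(lines):
    """Yield one dict per blank-line-separated block of 'key: value' lines."""
    record = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current record, if there is one
            if record:
                yield record
                record = {}
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a key-value line; ignore it
        value = value.strip()
        if key == "date":
            # Unix timestamp in seconds, interpreted as UTC
            value = datetime.fromtimestamp(int(value), tz=timezone.utc)
        elif value in ("True", "False"):
            value = value == "True"
        record[key] = value
    if record:
        # The input may end without a trailing blank line
        yield record


# Hypothetical sample in the same layout as reviews.txt
sample = [
    "beer_name: Régab\n",
    "date: 1440064800\n",
    "review: True\n",
    "\n",
    "beer_name: Barelegs Brew\n",
    "beer_id: 19590\n",
]

# islice caps the number of records, not the number of lines
records = list(islice(iter_reviews(sample), 100))
```

With the real file, the same call would take an open file handle in place of `sample`, and the resulting list could be passed to `pd.DataFrame` as before.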

In [40]:
df.head()

Unnamed: 0,beer_name,beer_id,brewery_name,brewery_id,style,abv,date,user_name,user_id,appearance,aroma,palate,taste,overall,rating,text
0,Régab,142544,Societe des Brasseries du Gabon (SOBRAGA),37262,Euro Pale Lager,4.5,2015-08-20 10:00:00,nmann08,nmann08.184925,3.25,2.75,3.25,2.75,3.0,2.88,"From a bottle, pours a piss yellow color with ..."
1,Barelegs Brew,19590,Strangford Lough Brewing Company Ltd,10093,English Pale Ale,4.5,2009-02-20 11:00:00,StJamesGate,stjamesgate.163714,3.0,3.5,3.5,4.0,3.5,3.67,Pours pale copper with a thin head that quickl...
2,Barelegs Brew,19590,Strangford Lough Brewing Company Ltd,10093,English Pale Ale,4.5,2006-03-13 11:00:00,mdagnew,mdagnew.19527,4.0,3.5,3.5,4.0,3.5,3.73,"500ml Bottle bought from The Vintage, Antrim....."
3,Barelegs Brew,19590,Strangford Lough Brewing Company Ltd,10093,English Pale Ale,4.5,2004-12-01 11:00:00,helloloser12345,helloloser12345.10867,4.0,3.5,4.0,4.0,4.5,3.98,Serving: 500ml brown bottlePour: Good head wit...
4,Barelegs Brew,19590,Strangford Lough Brewing Company Ltd,10093,English Pale Ale,4.5,2004-08-30 10:00:00,cypressbob,cypressbob.3708,4.0,4.0,4.0,4.0,4.0,4.0,"500ml bottlePours with a light, slightly hazy ..."


In [41]:
def process_text(text):
    """Return the lemmatized text and the adjectives found in a review."""
    # Run the spaCy pipeline (tokenization, tagging, lemmatization)
    doc = nlp(text)

    # Choose here which features to keep from the processed text
    lemmas = " ".join(token.lemma_ for token in doc)
    adjectives = " ".join(token.text for token in doc if token.pos_ == "ADJ")

    return pd.Series({"lemmas": lemmas, "adjectives": adjectives})

In [42]:
extracted_data = df["text"].apply(process_text)
df = pd.concat([df, extracted_data], axis=1)
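
Calling `nlp` once per row is fine for 100 reviews but may become slow on the full file. One possible alternative is spaCy's `nlp.pipe`, which processes texts in batches. The sketch below assumes the `nlp` object loaded at the top of the notebook; the function name `extract_features` and the batch size of 64 are illustrative choices, not part of the original code.

```python
def extract_features(texts, nlp, batch_size=64):
    """Return one {'lemmas', 'adjectives'} dict per input text."""
    rows = []
    # nlp.pipe processes the texts in batches and yields Doc objects in input order
    for doc in nlp.pipe(texts, batch_size=batch_size):
        rows.append(
            {
                "lemmas": " ".join(token.lemma_ for token in doc),
                "adjectives": " ".join(
                    token.text for token in doc if token.pos_ == "ADJ"
                ),
            }
        )
    return rows
```

In the notebook, the result of `extract_features(df["text"], nlp)` could be wrapped in `pd.DataFrame(..., index=df.index)` and concatenated to `df` in the same way as the output of `apply`.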

In [43]:
df[["text", "adjectives"]].head()

Unnamed: 0,text,adjectives
0,"From a bottle, pours a piss yellow color with ...",piss yellow fizzy white similar basic little e...
1,Pours pale copper with a thin head that quickl...,pale thin golden big grassy dark more Brave mo...
2,"500ml Bottle bought from The Vintage, Antrim.....",golden yellow White thick foamy thin light spi...
3,Serving: 500ml brown bottlePour: Good head wit...,Good excellent slight cloudy golden subtle swe...
4,"500ml bottlePours with a light, slightly hazy ...",light hazy golden light Slight slight balanced...
