The goal of this notebook is to show the NLP process used to extract the main flavor and appearance descriptors of the different clusters. Those words are used to create the word clouds shown in the main results notebook.

Note that the variables haven't been renamed: ipa_reviews corresponds to cluster_reviews and non_ipa_reviews corresponds to other_reviews.

In [21]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk import pos_tag
import pandas as pd
import numpy as np
import os
import sys
import json

current_dir = os.getcwd()  # Get current working directory
sys.path.append(os.path.abspath(os.path.join(current_dir, '..', '..')))
from src.data.beerdata_loader import BeerDataLoader

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()


Here are the flavor and appearance descriptors that we selected manually. We first ran extract_top_words, which gave us the top words for both cluster_reviews and other_reviews. From those words, we picked the appearance and flavor descriptors based on our knowledge of beer.

In [3]:
flavor_descriptors = [
    'hoppy', 'bitter', 'citrus', 'pine', 'floral', 'malty', 'sweet', 'roasted', 'caramel',
    'chocolate', 'coffee', 'fruit', 'spicy', 'herbal', 'earthy', 'toffee', 'tropical', 'yeast',
    'banana', 'clove', 'smoke', 'oak', 'vanilla', 'nutty', 'wheat', 'grain', 'sour', 'tart', 'milk', 'coriander'
]
appearance_descriptors = [
    'head', 'dark', 'brown', 'light', 'white', 'golden', 'straw', 'orange', 'yellow', 'pale', 'hazy', 'opaque', 'amber'
]
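As a rough illustration of how a list of top words can be obtained before picking descriptors by hand, here is a minimal sketch using collections.Counter. The helper top_words and the toy token lists are hypothetical and are not part of this notebook.

```python
from collections import Counter

def top_words(tokens_list, n=10):
    """Return the n most frequent tokens across a list of token lists."""
    counts = Counter(token for tokens in tokens_list for token in tokens)
    return counts.most_common(n)

# Toy token lists standing in for preprocessed reviews (hypothetical data)
toy_tokens = [
    ["pour", "hazy", "orange", "citrus"],
    ["citrus", "pine", "bitter"],
    ["hazy", "citrus", "head"],
]
top = top_words(toy_tokens, n=2)
print(top)
```

Descriptors such as 'citrus' or 'hazy' would then be kept by hand, while generic words such as 'pour' would be dropped.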

This function selects reviews from both ipa_beers_reviews and non_ipa_beers_reviews. It keeps only beers with a minimum number of reviews (so that each beer has been reviewed enough; the minimum is lowered for small clusters) and caps the number of reviews kept per beer, so that the retained reviews represent each cluster. For efficiency, we then sample at most 100000 reviews for the cluster and for the others.

In [27]:
def filter_beer_reviews(ipa_reviews_df, non_ipa_reviews_df):
    # Set parameters
    review_threshold = 100  # Minimum number of reviews a beer must have
    max_reviews_per_beer = 1000  # Maximum number of reviews to process per beer
    sample_size = 100000  # Total reviews to process per category

    ipa_size = ipa_reviews_df.shape[0]
    # Relax the per-beer threshold for small clusters
    if 40000 <= ipa_size < 100000:
        review_threshold = 10
    elif ipa_size < 40000:
        review_threshold = 1

    # For IPA reviews
    ipa_review_counts = ipa_reviews_df['beer_id'].value_counts()
    ipa_beers_selected = ipa_review_counts[ipa_review_counts >= review_threshold].index.tolist()

    ipa_reviews_filtered = ipa_reviews_df[ipa_reviews_df['beer_id'].isin(ipa_beers_selected)]
    ipa_reviews_filtered = ipa_reviews_filtered.groupby('beer_id').head(max_reviews_per_beer).reset_index(drop=True)

    # For Non-IPA reviews
    non_ipa_review_counts = non_ipa_reviews_df['beer_id'].value_counts()
    non_ipa_beers_selected = non_ipa_review_counts[non_ipa_review_counts >= review_threshold].index.tolist()

    non_ipa_reviews_filtered = non_ipa_reviews_df[non_ipa_reviews_df['beer_id'].isin(non_ipa_beers_selected)]
    non_ipa_reviews_filtered = non_ipa_reviews_filtered.groupby('beer_id').head(max_reviews_per_beer).reset_index(drop=True)

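    # Sample to reduce data size; cap at the smaller group so that .sample() never asks for more rows than exist
    sample_size = min(sample_size, ipa_reviews_filtered.shape[0], non_ipa_reviews_filtered.shape[0])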
    ipa_reviews_filtered = ipa_reviews_filtered.sample(n=sample_size, random_state=42)
    non_ipa_reviews_filtered = non_ipa_reviews_filtered.sample(n=sample_size, random_state=42)

    print("Total number of selected IPA reviews:", len(ipa_reviews_filtered))
    print("Total number of selected Non-IPA reviews:", len(non_ipa_reviews_filtered))
    return ipa_reviews_filtered, non_ipa_reviews_filtered
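To make the two filtering steps concrete, here is a minimal sketch on a toy DataFrame. The data and the small values of min_reviews and max_per_beer are hypothetical stand-ins for review_threshold and max_reviews_per_beer.

```python
import pandas as pd

# Toy reviews: beer 1 has 4 reviews, beer 2 has 2, beer 3 has 1
toy_reviews = pd.DataFrame({
    "beer_id": [1, 1, 1, 1, 2, 2, 3],
    "text": ["a", "b", "c", "d", "e", "f", "g"],
})

min_reviews = 2   # hypothetical stand-in for review_threshold
max_per_beer = 3  # hypothetical stand-in for max_reviews_per_beer

# Keep only beers that have at least min_reviews reviews
counts = toy_reviews["beer_id"].value_counts()
kept_ids = counts[counts >= min_reviews].index
filtered = toy_reviews[toy_reviews["beer_id"].isin(kept_ids)]

# Keep at most max_per_beer reviews for each remaining beer
capped = filtered.groupby("beer_id").head(max_per_beer).reset_index(drop=True)
print(capped["beer_id"].tolist())
```

Beer 3 is dropped because it has a single review, and beer 1 is cut down from four reviews to three.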

This function takes the text of a review, lowercases and tokenizes it, then removes punctuation and stop words from the tokens and lemmatizes them (treating each word as a verb, pos='v').

In [28]:
def preprocess_text(text):
    if not isinstance(text, str):
        return []
    # Lowercase
    text = text.lower()
    # Tokenize with preserve_line=True to avoid sent_tokenize
    tokens = word_tokenize(text, preserve_line=True)
    # Remove punctuation and non-alphabetic tokens
    tokens = [word for word in tokens if word.isalpha()]
    # Remove stop words
    tokens = [word for word in tokens if word not in stop_words]
    # Lemmatize
    tokens = [lemmatizer.lemmatize(word, pos='v') for word in tokens]
    return tokens
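To show what the first steps do to a review without requiring the NLTK data files, here is a simplified stand-in. It is only a sketch: it uses a regular expression instead of word_tokenize, a tiny hypothetical stop-word set instead of NLTK's English list, and it omits lemmatization.

```python
import re

# Tiny hypothetical stop-word list; the notebook uses NLTK's English list
TOY_STOP_WORDS = {"a", "the", "with", "and", "of", "is"}

def simple_preprocess(text):
    """Lowercase, split into alphabetic tokens, and drop stop words."""
    if not isinstance(text, str):
        return []
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in TOY_STOP_WORDS]

tokens = simple_preprocess("Pours a hazy orange, with a thick white head!")
print(tokens)
```

Because lemmatization is left out here, 'pours' stays as is; in preprocess_text the lemmatizer is applied as a final step.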

Tokenize both cluster_reviews and other_reviews, and keep only the reviews that have at least one token.

In [41]:
def tokenize(ipa_reviews_filtered, non_ipa_reviews_filtered):
    # Apply preprocessing
    #print("Preprocessing IPA reviews...")
    ipa_reviews_filtered['tokens'] = ipa_reviews_filtered['text'].apply(preprocess_text)

    #print("Preprocessing Non-IPA reviews...")
    non_ipa_reviews_filtered['tokens'] = non_ipa_reviews_filtered['text'].apply(preprocess_text)

    # Remove entries with empty tokens
    ipa_reviews_filtered = ipa_reviews_filtered[ipa_reviews_filtered['tokens'].str.len() > 0].reset_index(drop=True)
    non_ipa_reviews_filtered = non_ipa_reviews_filtered[non_ipa_reviews_filtered['tokens'].str.len() > 0].reset_index(drop=True)

    #print(f"Total IPA reviews after preprocessing: {len(ipa_reviews_filtered)}")
    #print(f"Total Non-IPA reviews after preprocessing: {len(non_ipa_reviews_filtered)}")
    return ipa_reviews_filtered, non_ipa_reviews_filtered

This function converts the tokens column of each DataFrame into a list of token lists.

In [30]:
def extract_top_words(ipa_reviews_filtered, non_ipa_reviews_filtered):
    
    ipa_tokens_list = ipa_reviews_filtered['tokens'].tolist()
    non_ipa_tokens_list = non_ipa_reviews_filtered['tokens'].tolist()
    return ipa_tokens_list, non_ipa_tokens_list

For each appearance or flavor descriptor, this function counts the number of times the word appears in cluster_reviews and in other_reviews and outputs the difference. A positive difference means that the word appears more often in the cluster's reviews than in the reviews of the other clusters.

In [46]:
def compute_criterium(ipa_tokens_list, non_ipa_tokens_list, descriptor):
     
    def count_appearance(tokens_list, flavor_list):
        flavor_counts = dict.fromkeys(flavor_list, 0)
        for tokens in tokens_list:
            for token in tokens:
                if token in flavor_list:
                    flavor_counts[token] += 1
        return flavor_counts

    # Count flavors in IPA and Non-IPA reviews
    ipa_appearance_counts = count_appearance(ipa_tokens_list, descriptor)
    non_ipa_appearance_counts = count_appearance(non_ipa_tokens_list, descriptor)

    # Convert to DataFrame
    ipa_appearance_df = pd.DataFrame(list(ipa_appearance_counts.items()), columns=['Appear', 'Count_IPA'])
    non_ipa_appearance_df = pd.DataFrame(list(non_ipa_appearance_counts.items()), columns=['Appear', 'Count_Non_IPA'])
    comparison_df = ipa_appearance_df.merge(non_ipa_appearance_df, on='Appear')
    comparison_df['Difference'] = comparison_df['Count_IPA'] - comparison_df['Count_Non_IPA']
    return comparison_df
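The counting logic can be checked by hand on a toy example. The sketch below uses collections.Counter rather than the nested loops above; count_descriptors and the toy token lists are hypothetical.

```python
from collections import Counter

def count_descriptors(tokens_list, descriptors):
    """Count occurrences of each descriptor across a list of token lists."""
    wanted = set(descriptors)
    counts = Counter(t for tokens in tokens_list for t in tokens if t in wanted)
    # Counter returns 0 for descriptors that never appear
    return {d: counts[d] for d in descriptors}

descriptors = ["hazy", "dark"]
cluster_tokens = [["hazy", "orange"], ["hazy", "head"]]
other_tokens = [["dark", "brown"], ["dark", "hazy"], ["dark"]]

cluster_counts = count_descriptors(cluster_tokens, descriptors)
other_counts = count_descriptors(other_tokens, descriptors)
difference = {d: cluster_counts[d] - other_counts[d] for d in descriptors}
print(difference)
```

Here 'hazy' gets a positive difference (more frequent in the cluster) and 'dark' a negative one. Raw count differences are only comparable when both groups contain a similar number of reviews, which is what the sampling in filter_beer_reviews aims for.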

Apply all the previous steps sequentially

In [44]:
def extract_features(cluster_reviews_df, other_reviews_df):

    #Filter our subsets and balance
    cluster_reviews_filtered, other_reviews_filtered = filter_beer_reviews(cluster_reviews_df, other_reviews_df)

    #Tokenize
    cluster_reviews_filtered, other_reviews_filtered = tokenize(cluster_reviews_filtered, other_reviews_filtered)

    #Change tokens into a list
    cluster_top_words, other_top_words = extract_top_words(cluster_reviews_filtered, other_reviews_filtered)

    #Check what appearance terms our cluster correspond to
    appearance_comparison_df = compute_criterium(cluster_top_words, other_top_words, appearance_descriptors)

    #Check what flavor terms our cluster correspond to
    flavor_comparison_df = compute_criterium(cluster_top_words, other_top_words, flavor_descriptors)

    return appearance_comparison_df, flavor_comparison_df

The next four cells are only used to load the data

In [None]:
data_loader = BeerDataLoader(data_dir="../../../ada-2024-project-data-crusadas/src/data/BeerAdvocate", force_process=False)

ba_reviews_df, ba_ratings_df, ba_beers_df, ba_breweries_df, ba_users_df = data_loader.load_all_data()

Processed file '../../../ada-2024-project-data-crusadas/src/data/BeerAdvocate\reviews_processed.csv' already exists. Skipping processing.
Processed file '../../../ada-2024-project-data-crusadas/src/data/BeerAdvocate\ratings_processed.csv' already exists. Skipping processing.


In [23]:
result_df = pd.read_csv('../../../ada-2024-project-data-crusadas/src/data/beer_word_counts2.csv')
with open('../../../ada-2024-project-data-crusadas/src/data/partition.json', 'r') as f:
    partition = json.load(f)

In [24]:
result_df['beer_id'] = result_df['beer_id'].astype(str)
result_df['cluster'] = result_df['beer_id'].map(partition)
result_df['beer_id'] = result_df['beer_id'].astype(int)

In [25]:
interesting_rating = ba_ratings_df[['beer_id', 'appearance', 'aroma', 'palate', 'taste', 'rating']]
mean_ratings = interesting_rating.groupby('beer_id').mean().reset_index()
merged_df_beers = pd.merge(ba_beers_df, mean_ratings, on='beer_id', how='right')
assigned_cluster = pd.merge(result_df[['beer_id', 'cluster']], merged_df_beers, on='beer_id', how='inner')

In this cell we separate the reviews into the selected cluster's reviews and the reviews of all the other clusters

In [38]:
cluster_number = 2
cluster_reviews_df = ba_reviews_df[ba_reviews_df['beer_id'].isin(assigned_cluster[assigned_cluster['cluster'] == cluster_number]['beer_id'])].copy()
cluster_reviews_df.reset_index(drop=True, inplace=True)
other_reviews_df = ba_reviews_df[~ba_reviews_df['beer_id'].isin(assigned_cluster[assigned_cluster['cluster'] == cluster_number]['beer_id'])].copy()
other_reviews_df.reset_index(drop=True, inplace=True)

Here is how to call the function

In [47]:
appearance_comparison_df, flavor_comparison_df = extract_features(cluster_reviews_df, other_reviews_df)

Total number of selected IPA reviews: 100000
Total number of selected Non-IPA reviews: 100000


In [48]:
print(appearance_comparison_df)

    Appear  Count_IPA  Count_Non_IPA  Difference
0     head      66395          67995       -1600
1     dark      11684          54813      -43129
2    brown       7229          34458      -27229
3    light      44252          56920      -12668
4    white      36484          25940       10544
5   golden      17819          11324        6495
6    straw       2294           3918       -1624
7   orange      37225          13035       24190
8   yellow       5889           7717       -1828
9     pale      18423           9795        8628
10    hazy      14367           8743        5624
11  opaque        948           3472       -2524
12   amber      23541          11909       11632
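One possible way to turn such a table into a ranked list of distinctive descriptors is to keep the rows with a positive Difference and sort them. This is only a sketch, not necessarily how the word clouds were built; toy_df holds a few rows copied from the printed table above.

```python
import pandas as pd

# A few rows copied from the printed table above
toy_df = pd.DataFrame({
    "Appear": ["head", "dark", "white", "orange", "amber"],
    "Count_IPA": [66395, 11684, 36484, 37225, 23541],
    "Count_Non_IPA": [67995, 54813, 25940, 13035, 11909],
})
toy_df["Difference"] = toy_df["Count_IPA"] - toy_df["Count_Non_IPA"]

# Keep descriptors that are more frequent in the cluster, strongest first
distinctive = (
    toy_df[toy_df["Difference"] > 0]
    .sort_values("Difference", ascending=False)["Appear"]
    .tolist()
)
print(distinctive)
```

On these rows, 'orange', 'amber' and 'white' come out as the appearance terms that most distinguish the cluster, while 'dark' and 'head' are filtered out.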
