# Preprocessing for Content Model

**Description:**  
This notebook takes the combined and cleaned beer reviews and brewery metadata from `final_beers_reviews_breweries.csv` and aggregates all review texts for each beer into a single field. After aggregation, KeyBERT (backed by a compact BERT sentence encoder) extracts the top keyphrases from each beer’s review corpus, reducing the feature space so that the downstream content autoencoder is not overwhelmed by a very high-dimensional vocabulary. The resulting file, `beer_content.csv`, containing `beer_id`, `name`, and `all_text`, is used as input to the content recommendation model.

---

## Overview

- **Data Source:**  
  `final_beers_reviews_breweries.csv` – contains user reviews joined with beer and brewery information.


In [12]:
import pandas as pd

try:
    df = pd.read_csv('final_beers_reviews_breweries.csv')
    print("\nFinal Data Sample:")
    print(df.head())
except Exception as e:
    print(f"Error loading final_beers_reviews_breweries.csv: {e}")


Final Data Sample:
              name state country                    style availability   abv  \
0  Older Viscosity    CA      US  American Imperial Stout     Rotating  12.0   
1  Older Viscosity    CA      US  American Imperial Stout     Rotating  12.0   
2  Older Viscosity    CA      US  American Imperial Stout     Rotating  12.0   
3  Older Viscosity    CA      US  American Imperial Stout     Rotating  12.0   
4  Older Viscosity    CA      US  American Imperial Stout     Rotating  12.0   

                                               notes  beer_id     username  \
0  Imperial Stout aged for 12 months in new bourb...    34094        Sazz9   
1  Imperial Stout aged for 12 months in new bourb...    34094  Amguerra305   
2  Imperial Stout aged for 12 months in new bourb...    34094      TheGent   
3  Imperial Stout aged for 12 months in new bourb...    34094         bobv   
4  Imperial Stout aged for 12 months in new bourb...    34094      Tony210   

         date  ...  look  smell

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614525 entries, 0 to 614524
Data columns (total 21 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   name           614525 non-null  object 
 1   state          614525 non-null  object 
 2   country        614525 non-null  object 
 3   style          614525 non-null  object 
 4   availability   614525 non-null  object 
 5   abv            614525 non-null  float64
 6   notes          614525 non-null  object 
 7   beer_id        614525 non-null  int64  
 8   username       614525 non-null  object 
 9   date           614525 non-null  object 
 10  text           614525 non-null  object 
 11  look           614525 non-null  float64
 12  smell          614525 non-null  float64
 13  taste          614525 non-null  float64
 14  feel           614525 non-null  float64
 15  overall        614525 non-null  float64
 16  score          614525 non-null  float64
 17  name_brewery   614525 non-nul

## Aggregate Beer Metadata and Reviews

This cell groups the data by `beer_id`, collecting core metadata (name, state, country, style, availability, ABV, brewery name, city) and computing the average score, while concatenating all individual review texts into a single `text` field for each beer, resulting in the consolidated DataFrame `dfbeers`.


In [3]:
dfbeers = df.groupby('beer_id').agg({
    'name': 'first',               # Beer name
    'state': 'first',              # State
    'country': 'first',            # Country
    'style': 'first',              # Beer style
    'availability': 'first',       # Availability info
    'abv': 'first',                # ABV (assumes consistency)
    'notes': 'first',              # Beer notes (repeated on every review row)
    'text': lambda x: ' '.join(x), # Join all reviews together
    'score': 'mean',               # Average score
    'name_brewery': 'first',       # Brewery name
    'city': 'first',               # City
}).reset_index()
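
The aggregation pattern above can be illustrated on a toy frame. This is a minimal sketch, not the notebook's code: the `reviews` data, the `n_reviews` column, and the `dropna()` guard are assumptions added for illustration. The real dataset has no null `text` values according to `df.info()`, so the guard only matters if the same step is reused on less clean data.

```python
import pandas as pd

# Hypothetical toy data: two beers, one review with missing text.
reviews = pd.DataFrame({
    "beer_id": [1, 1, 2, 2],
    "name": ["Stout A", "Stout A", "Lager B", "Lager B"],
    "text": ["Roasty and thick.", "Lots of bourbon.", "Crisp and clean.", None],
    "score": [4.5, 4.0, 3.5, 3.0],
})

per_beer = reviews.groupby("beer_id").agg(
    name=("name", "first"),                                      # metadata is constant per beer
    text=("text", lambda s: " ".join(s.dropna().astype(str))),   # join reviews, skip missing ones
    score=("score", "mean"),                                     # average score
    n_reviews=("text", "count"),                                 # non-null reviews per beer
).reset_index()

print(per_beer)
```

Keeping a review count next to the joined text makes it easy to spot beers whose aggregated text rests on very few reviews.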

## Categorize ABV Levels

This cell defines ABV bins of (0, 5], (5, 10], and (10, ∞) with corresponding labels (`'low abv'`, `'medium abv'`, `'high abv'`) and uses `pd.cut` to create a new `abv_category` column in `dfbeers`. Because `pd.cut` intervals are right-inclusive by default, a beer at exactly 5.0% ABV is labeled `'low abv'`.


In [14]:
import numpy as np
bins = [0, 5, 10, np.inf]
labels = ['low abv', 'medium abv', 'high abv']

dfbeers['abv_category'] = pd.cut(dfbeers['abv'], bins=bins, labels=labels)
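
The bin edges deserve a quick check, since `pd.cut` excludes the left edge of the first bin by default. The sketch below uses made-up ABV values to show how boundary values are assigned, and how `include_lowest=True` would keep a 0.0% beer from becoming `NaN`. Whether the dataset contains such beers is not stated in the notebook; the `dfbeers.info()` output shows no nulls in `abv_category`, which suggests it does not.

```python
import numpy as np
import pandas as pd

# Hypothetical ABV values chosen to sit on and around the bin edges.
abv = pd.Series([0.0, 4.8, 5.0, 5.1, 10.0, 12.0])
bins = [0, 5, 10, np.inf]
labels = ["low abv", "medium abv", "high abv"]

# Default behaviour: intervals are (0, 5], (5, 10], (10, inf).
default = pd.cut(abv, bins=bins, labels=labels)

# With include_lowest=True the first interval also contains 0.
inclusive = pd.cut(abv, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"abv": abv, "default": default, "inclusive": inclusive}))
```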

In [5]:
dfbeers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   beer_id       500 non-null    int64   
 1   name          500 non-null    object  
 2   state         500 non-null    object  
 3   country       500 non-null    object  
 4   style         500 non-null    object  
 5   availability  500 non-null    object  
 6   abv           500 non-null    float64 
 7   notes         500 non-null    object  
 8   text          500 non-null    object  
 9   score         500 non-null    float64 
 10  name_brewery  500 non-null    object  
 11  city          500 non-null    object  
 12  abv_category  500 non-null    category
dtypes: category(1), float64(2), int64(1), object(9)
memory usage: 47.6+ KB


## Create Combined Text Feature

This cell concatenates key metadata fields (`name`, `style`, `country`, `notes`, `abv_category`) and the aggregated review text (`text`) into a single `all_text` column on `dfbeers`, producing a unified text representation for each beer.


In [6]:
colsConcat = ['name','style', 'country', 'notes', 'abv_category','text']
dfbeers['all_text'] = dfbeers[colsConcat].astype(str).agg(' '.join, axis=1)
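
As a hedged variant of the concatenation above, the sketch below joins the same kind of fields row by row while skipping missing values. The toy `beers` frame and the `safe_join` helper are hypothetical; the notebook's own data has no nulls in these columns, so this only illustrates what a more defensive version might look like.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing 'notes' value.
beers = pd.DataFrame({
    "name": ["Stout A", "Lager B"],
    "style": ["American Imperial Stout", "German Pilsner"],
    "notes": ["Aged in bourbon barrels", np.nan],
    "text": ["Roasty and thick.", "Crisp and clean."],
})
cols = ["name", "style", "notes", "text"]

def safe_join(row):
    """Join the non-missing fields of one row with single spaces."""
    return " ".join(row.dropna().astype(str))

beers["all_text"] = beers[cols].apply(safe_join, axis=1)
print(beers["all_text"].tolist())
```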


In [7]:
dfbeers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   beer_id       500 non-null    int64   
 1   name          500 non-null    object  
 2   state         500 non-null    object  
 3   country       500 non-null    object  
 4   style         500 non-null    object  
 5   availability  500 non-null    object  
 6   abv           500 non-null    float64 
 7   notes         500 non-null    object  
 8   text          500 non-null    object  
 9   score         500 non-null    float64 
 10  name_brewery  500 non-null    object  
 11  city          500 non-null    object  
 12  abv_category  500 non-null    category
 13  all_text      500 non-null    object  
dtypes: category(1), float64(2), int64(1), object(10)
memory usage: 51.5+ KB


## Detect and Report Compute Device

This cell checks for CUDA or MPS GPU support (falling back to CPU), sets the `device`, and prints the selected device along with GPU details if available.


In [8]:
import torch

device = torch.device(
    "cuda" if torch.cuda.is_available()
    else "mps"  if torch.backends.mps.is_available()
    else "cpu"
)

print(f"Using device: {device}")

if device.type == "cuda":
    idx = device.index or 0
    print("GPU name: ", torch.cuda.get_device_name(idx))
elif device.type == "mps":
    print("Running on Apple Silicon GPU via MPS")
else:
    print("Running on CPU")

Using device: mps
Running on Apple Silicon GPU via MPS


## Initialize KeyBERT and Embedding Model

This cell installs (if necessary) and imports KeyBERT and SentenceTransformer, loads the “all-MiniLM-L6-v2” BERT encoder onto the chosen `device`, and instantiates `kw_model` for keyword extraction.


In [9]:
# installation step if necessary
# pip install keybert
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

# compact BERT model
embedder = SentenceTransformer("all-MiniLM-L6-v2", device=device)
kw_model = KeyBERT(model=embedder)



## Extract Top BERT Keywords to Compact Text

This cell merges standard English stop words with beer‑specific terms, defines `bertTopTerms` using KeyBERT to pull the top 150 keyphrases from each beer’s aggregated text, and applies it to `dfbeers["all_text"]`, replacing the full text with a concise set of BERT‑derived keywords. 

<b>Do not re-run this cell: it takes a few hours. Use the attached CSV output as input to the autoencoder model instead.</b>


In [10]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

non_taste_stop = {
    "us", "ca", "beer", "head", "taste", "pours", "like", "s", "t", "year", "round", "abv",'compared','10','11', '12',
    "availability", "yearround", "us","ca","just","really","overall","bottle","good", "nice", 'brewery',
    'brewing', 'brews', 'brew', 'style','drink', 'drinks', 'drinking', 'tastes', 'tasted', 'enjoy', 'enjoyable', 'okay',
    'recommended','appearance', 'looking','great', 'decent', 'pretty', 'amazing', 'excellent', 'best', 'favorite',
    'interesting', 'quality', 'look', 'beautiful', 'looks', 'look', 'exceptional', 'outstanding', 'slightly','expected',
    'definitely', 'pleasant', 'recommend', 'slightly', "awesome", "fantastic", "beautiful", "solid", "make", "review",
    'delightful','fine','quite','mix','makes','type','curious','liked','average','perfect','way','does','try','buy',
    'following','enjoyed','somewhat','definitely','wonderful','wonderfully','making','experience','specific','pure','odd',
    'impression','love', 'try', 'tried', 'extremely', 'alcoholic', 'appreciate', 'attractive', 'bad', 'better','compare',
    'especially', 'bold', 'characteristics', 'choice', 'color', 'colored','totally', 'unusual', 'usual', 'particular',
    'generally', 'different', 'drinker', 'finest', 'flavoring', 'flavours', 'kind','flavored', 'flavour', 
    'general', 'normal', 'opinion', 'pleasantly', 'prefer', 'regular','sample','gives','usually', 'brewed','honestly',
    'slight', 'smell', 'smells', 'smoothest', 'styles', 'traditional', 'typical', 'unique', 'worthy','breweries',
    'concerned', 'consider', 'considering', '12oz', '22oz', '2x', '750ml', '22','actually','absolutely', 'amazingly', 
    'exceptionally', 'extraordinary','fabulously', 'fantastically', 'incredibly', 'perfectly','remarkably', 'spectacular',
    'stunning', 'truly', 'totally','wonderfully', 'delightfully', 'deliciously', 'fascinating','exciting', 'fantastic',
    'disappointing','phenomenal','satisfying','enjoying','experienced', 'bottled', 'bottles', 'highly', 'surprisingly',
    'intriguing', 'noticeable', 'liking', 'lovely', 'impressed', 'impressive','intriguing', 'similar','remarkable',
    'reviewed','reviews', 'fairly','qualities','beautifully','drank','tastey', 'terrific','flavoured','surprisingly',
    'incredible','brewer','apparent'
}

stopwords = list(ENGLISH_STOP_WORDS.union(non_taste_stop))

def bertTopTerms(doc, top_n = 150):
    kws = kw_model.extract_keywords(
        doc,
        keyphrase_ngram_range=(1,2),
        stop_words=stopwords,
        top_n=top_n
    )
    return " ".join(term for term, score in kws)

dfbeers["all_text"] = dfbeers["all_text"].apply(lambda txt: bertTopTerms(txt, top_n=150))
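
Because the extraction step takes hours, it may be worth caching results per beer so that an interrupted or repeated run does not redo finished work. The following is only a sketch under stated assumptions: `extract_with_cache`, the `keywords` column, and the cache file name are hypothetical, and `fake_extract` stands in for `bertTopTerms` so the example runs without KeyBERT.

```python
import os
import tempfile
import pandas as pd

def extract_with_cache(df, extract_fn, cache_path):
    """Apply extract_fn to df['all_text'], reusing results stored in cache_path."""
    if os.path.exists(cache_path):
        cached = pd.read_csv(cache_path, index_col="beer_id")["keywords"].to_dict()
    else:
        cached = {}
    for beer_id, text in zip(df["beer_id"], df["all_text"]):
        if beer_id not in cached:          # only process beers not seen before
            cached[beer_id] = extract_fn(text)
    # Persist everything computed so far for the next run.
    pd.Series(cached, name="keywords").rename_axis("beer_id").to_csv(cache_path)
    return df["beer_id"].map(cached)

calls = []

def fake_extract(text):
    """Stand-in for the slow keyword extractor (assumption for this demo)."""
    calls.append(text)
    return text.upper()

demo = pd.DataFrame({"beer_id": [1, 2], "all_text": ["hoppy citrus", "roasty coffee"]})
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "kw_cache.csv")
    first = extract_with_cache(demo, fake_extract, path)
    second = extract_with_cache(demo, fake_extract, path)  # served from the cache

print(second.tolist(), "extractor calls:", len(calls))
```

In the real notebook the same idea would wrap `bertTopTerms`; writing the cache inside the loop (for example every N beers) would additionally protect against crashes mid-run.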

## Experimental TF‑IDF Keyword Extraction (Unused)

This cell experiments with a `TfidfVectorizer` as a faster alternative to KeyBERT for selecting top terms from `all_text`. The approach was not adopted in the final content preprocessing pipeline because the vocabulary is selected across the whole corpus rather than per beer, which made the resulting beer descriptions too similar to one another.


In [None]:
# Faster alternative, but it produces too many similar descriptions, hence UNUSED in the final model
from sklearn.feature_extraction.text import TfidfVectorizer

token_pattern = r'\b[a-z]{4,}\b'
vec = TfidfVectorizer(stop_words=list(stopwords),
                      token_pattern=token_pattern,
                      max_features=2000,
                      max_df=0.8)
tf = vec.fit_transform(dfbeers['all_text'])
feat = vec.get_feature_names_out()

def top_tfidf_terms(row, N=100): 
    arr = row.toarray().ravel()
    idxs = arr.argsort()[-N:][::-1]
    return " ".join(feat[i] for i in idxs)

# UNUSED
# dfbeers['all_text'] = [top_tfidf_terms(r, 150) for r in tf]
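
One detail of per-row top-N selection is worth illustrating: when a row has fewer than N non-zero weights, a plain `argsort` also returns zero-weight terms. The sketch below uses a small hand-written weight matrix (an assumption, standing in for the TF-IDF output) and a hypothetical `top_terms` helper that drops those zero-weight entries.

```python
import numpy as np

# Hypothetical vocabulary and per-beer weights standing in for TF-IDF output.
feat = np.array(["citrus", "coffee", "hoppy", "roasty", "vanilla"])
weights = np.array([
    [0.7, 0.0, 0.6, 0.0, 0.0],   # beer 0: only two terms actually occur
    [0.0, 0.5, 0.0, 0.8, 0.3],   # beer 1: three terms occur
])

def top_terms(row, n):
    """Return up to n terms with the highest weights, skipping zero-weight terms."""
    order = np.argsort(row)[::-1][:n]
    return [str(feat[i]) for i in order if row[i] > 0]

per_beer_terms = [top_terms(row, 3) for row in weights]
print(per_beer_terms)
```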

## Summarize Token Counts per Beer

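This cell splits each beer’s `all_text` into whitespace-separated tokens and prints the average, median, and 90th percentile token counts to assess the length distribution of the processed text. Since each of the 150 keyphrases contains at most two words, a beer can have at most 300 tokens.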


In [15]:
# How many tokens (words) per beer on average?
import numpy as np
token_counts = dfbeers['all_text'].str.split().str.len()
print("Average tokens per beer:", token_counts.mean().round(1))
print("Median tokens per beer: ", token_counts.median())
print("90th percentile tokens: ", np.percentile(token_counts, 90))

Average tokens per beer: 299.5
Median tokens per beer:  300.0
90th percentile tokens:  300.0
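
Since two-word keyphrases can share words, the raw token count may overstate how much distinct vocabulary each beer keeps. As an additional diagnostic (not part of the original notebook), the sketch below compares total and unique token counts on made-up keyword strings.

```python
import pandas as pd

# Hypothetical joined keyphrases for two beers.
keyphrases = pd.Series([
    "roasted coffee coffee notes bourbon",
    "crisp pils",
])

tokens = keyphrases.str.split()
total_counts = tokens.str.len()                        # all tokens, repeats included
unique_counts = tokens.apply(lambda t: len(set(t)))    # distinct tokens only

print(pd.DataFrame({"total": total_counts, "unique": unique_counts}))
```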


## Export Final Content DataFrame

This cell selects the `beer_id`, `name`, and compressed `all_text` columns from `dfbeers` and writes them to `beer_content.csv`, producing the final dataset for the content recommendation model.


In [17]:
output = dfbeers[['beer_id', 'name', 'all_text']]
output.to_csv("beer_content.csv", index=False)
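
A quick round-trip check can confirm that the exported file reads back with the expected columns and values. This sketch writes to an in-memory buffer instead of `beer_content.csv`, and the second row of the `content` frame is invented for illustration.

```python
import io
import pandas as pd

# Hypothetical miniature of the exported frame; the second row is invented.
content = pd.DataFrame({
    "beer_id": [34094, 7],
    "name": ["Older Viscosity", "Lager, Light"],
    "all_text": ["bourbon stout", "crisp pils"],
})

buffer = io.StringIO()
content.to_csv(buffer, index=False)   # same call as the export cell, different target
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored)
```

Names containing commas are quoted automatically by `to_csv`, so they survive the round trip unchanged.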