# Preprocessing for User Autoencoder

**Description:**  
This notebook takes the combined and cleaned beer reviews and brewery metadata from `final_beers_reviews_breweries.csv` and builds a binary user–beer interaction matrix. Each review's `score` is thresholded into a “like” (score ≥ 4 → 1, else 0), and the reviews are then pivoted so that each row is a user and each column is a beer.

The resulting file, `user_beer_likes.csv`, contains one row per user and one column per beer. An entry of 1 means the user liked that beer; 0 means the user either scored it below the threshold or never reviewed it. This is the format used to train the user autoencoder.

---

## Overview

- **Data Source:**  
  `final_beers_reviews_breweries.csv` – contains user reviews joined with beer and brewery information.


In [1]:
import pandas as pd

try:
    df = pd.read_csv('final_beers_reviews_breweries.csv')
    print("\nFinal Data Sample:")
    print(df.head())
except Exception as e:
    print(f"Error loading final_beers_reviews_breweries.csv: {e}")

Final Data Sample:
              name state country                    style availability   abv  \
0  Older Viscosity    CA      US  American Imperial Stout     Rotating  12.0   
1  Older Viscosity    CA      US  American Imperial Stout     Rotating  12.0   
2  Older Viscosity    CA      US  American Imperial Stout     Rotating  12.0   
3  Older Viscosity    CA      US  American Imperial Stout     Rotating  12.0   
4  Older Viscosity    CA      US  American Imperial Stout     Rotating  12.0   

                                               notes  beer_id     username  \
0  Imperial Stout aged for 12 months in new bourb...    34094        Sazz9   
1  Imperial Stout aged for 12 months in new bourb...    34094  Amguerra305   
2  Imperial Stout aged for 12 months in new bourb...    34094      TheGent   
3  Imperial Stout aged for 12 months in new bourb...    34094         bobv   
4  Imperial Stout aged for 12 months in new bourb...    34094      Tony210   

         date  ...  look  smell

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614525 entries, 0 to 614524
Data columns (total 21 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   name           614525 non-null  object 
 1   state          614525 non-null  object 
 2   country        614525 non-null  object 
 3   style          614525 non-null  object 
 4   availability   614525 non-null  object 
 5   abv            614525 non-null  float64
 6   notes          614525 non-null  object 
 7   beer_id        614525 non-null  int64  
 8   username       614525 non-null  object 
 9   date           614525 non-null  object 
 10  text           614525 non-null  object 
 11  look           614525 non-null  float64
 12  smell          614525 non-null  float64
 13  taste          614525 non-null  float64
 14  feel           614525 non-null  float64
 15  overall        614525 non-null  float64
 16  score          614525 non-null  float64
 17  name_brewery   614525 non-nul
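
Only `username`, `beer_id`, and `score` feed into the interaction matrix, so the other columns listed above do not have to be loaded for this step. A minimal sketch of reading just those three, assuming the column names shown in the `info()` output; the in-memory CSV and its score values are made up for illustration:

```python
import io
import pandas as pd

# Tiny stand-in for final_beers_reviews_breweries.csv (made-up rows; the
# real file has 21 columns).
csv_text = (
    "name,style,beer_id,username,score\n"
    "Older Viscosity,American Imperial Stout,34094,Sazz9,4.5\n"
    "Older Viscosity,American Imperial Stout,34094,bobv,3.75\n"
)

# usecols skips every column that the interaction matrix never touches.
slim = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["username", "beer_id", "score"],
    dtype={"beer_id": "int64", "score": "float64"},
)
print(slim.columns.tolist())
```

With the real file, the same `usecols` argument can be passed to the `pd.read_csv` call in the first cell.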

## User vs Liked Beer Matrix
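This cell converts review scores into “likes” (score ≥ 4 → 1, else 0), then pivots the reviews so that each row is a user and each column is a beer. If a user reviewed the same beer more than once, `aggfunc='max'` marks it as liked when any of those reviews meets the threshold; user–beer pairs with no review are filled with 0.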


In [7]:
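# A review counts as a "like" when its score is at least 4.0.
df['liked'] = (df['score'] >= 4.0).astype(int)

# One row per user, one column per beer.
user_beer = df.pivot_table(
    index='username',
    columns='beer_id',
    values='liked',
    aggfunc='max',   # repeated reviews: liked if any review is a like
    fill_value=0     # user-beer pairs with no review become 0
)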
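
To see what the threshold and pivot do on a small scale, the same two steps can be run on a toy frame. This is an illustrative sketch only: the usernames, beer IDs, and scores below are made up, and `toy` / `toy_matrix` are hypothetical names that are not part of the notebook.

```python
import pandas as pd

# Toy reviews (made-up data): one row per review, mirroring the three
# columns the pivot actually uses.
toy = pd.DataFrame({
    "username": ["ann", "ann", "ann", "bob", "bob"],
    "beer_id":  [10, 10, 20, 20, 30],
    "score":    [3.5, 4.25, 4.0, 2.0, 4.5],
})

# Same rule as the notebook cell: a score of 4.0 or higher is a like.
toy["liked"] = (toy["score"] >= 4.0).astype(int)

toy_matrix = toy.pivot_table(
    index="username",
    columns="beer_id",
    values="liked",
    aggfunc="max",   # ann reviewed beer 10 twice; one like is enough
    fill_value=0,    # ann never reviewed beer 30, bob never reviewed beer 10
)
print(toy_matrix)
```

Here `ann` ends up with likes for beers 10 and 20, and `bob` only for beer 30. Note that bob's 0 for beer 20 (reviewed, scored 2.0) is indistinguishable from his 0 for beer 10 (never reviewed).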


In [8]:
user_beer.info()

<class 'pandas.core.frame.DataFrame'>
Index: 15894 entries, --Dom-- to zymurgy4all
Columns: 500 entries, 6 to 148052
dtypes: int64(500)
memory usage: 60.8+ MB
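
The matrix is stored as `int64` even though every entry is 0 or 1. The sketch below, run on a made-up stand-in rather than the real matrix, shows one way to downcast to `int8`, measure how dense the likes are, and spot users with no likes at all. The names `user_beer_demo`, `compact`, `density`, and `users_without_likes` are hypothetical.

```python
import pandas as pd

# Stand-in for the real 15,894 x 500 matrix (values are made up).
user_beer_demo = pd.DataFrame(
    [[1, 0, 0, 1], [0, 0, 0, 0], [1, 1, 0, 0]],
    index=pd.Index(["ann", "bob", "cy"], name="username"),
    columns=pd.Index([6, 10, 20, 30], name="beer_id"),
)

# Every entry is 0 or 1, so one byte per cell is enough.
compact = user_beer_demo.astype("int8")

# Fraction of user-beer pairs marked as a like.
density = compact.to_numpy().mean()

# Users with no likes at all produce an all-zero row.
users_without_likes = compact.index[compact.sum(axis=1) == 0].tolist()

print(f"density: {density:.3f}")
print("users without likes:", users_without_likes)
```

Applied to `user_beer`, the `int8` cast would store the values in roughly one eighth of the space `int64` needs. Whether all-zero users should be kept for training is a modelling decision this notebook does not address.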


## Export User–Beer Likes Matrix


In [9]:
user_beer.to_csv("user_beer_likes.csv", index=True)
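
A downstream training notebook would need to read this file back in. The following is a rough sketch of that round trip, assuming the layout written above (usernames in the first column, beer IDs in the header); it uses a small in-memory CSV with made-up values in place of `user_beer_likes.csv`, and `likes` and `X` are hypothetical names.

```python
import io
import pandas as pd

# Stand-in for user_beer_likes.csv: same layout, made-up values.
csv_text = "username,6,10,20\nann,1,0,1\nbob,0,0,1\n"

# index_col=0 restores the username index written by to_csv(index=True).
likes = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Column labels read from a CSV header are typically strings; convert them
# back to integer beer IDs so lookups match the original matrix.
likes.columns = likes.columns.astype(int)

# Dense float32 array, a common input dtype for neural-network libraries.
X = likes.to_numpy(dtype="float32")
print(X.shape)
```

Keeping `likes.index` and `likes.columns` alongside `X` makes it possible to map rows and columns of the array back to usernames and beer IDs.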