# Introduction 

The goal of M2 is to explore the hidden associations between specific book genres and user-defined shelves. We also want to identify distinct Reader Personas from interaction metrics by clustering users on their "average rating variance", rating frequency, and review length, so that readers can be classified as, for example, a "critical reviewer" or a "casual consumer".
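As a rough illustration of the per-user metrics named above, the sketch below aggregates a handful of made-up interactions. The field names `user_id`, `rating`, and `review_text` are assumptions, and "average rating variance" is read here as the population variance of each user's ratings, which may not be the definition the project settles on.

```python
from collections import defaultdict
from statistics import mean, pvariance


def user_metrics(interactions):
    """Aggregate per-user metrics from an iterable of interaction dicts.

    Field names (user_id, rating, review_text) are assumed for illustration.
    """
    by_user = defaultdict(list)
    for row in interactions:
        by_user[row["user_id"]].append(row)

    metrics = {}
    for user_id, rows in by_user.items():
        ratings = [r["rating"] for r in rows]
        lengths = [len(r.get("review_text", "")) for r in rows]
        metrics[user_id] = {
            # Spread of a user's ratings: 0 means they always give the same score.
            "rating_variance": pvariance(ratings),
            # Number of rated interactions for this user.
            "rating_frequency": len(ratings),
            # Mean review length in characters.
            "mean_review_length": mean(lengths),
        }
    return metrics


# Toy data, not taken from the Goodreads files.
sample = [
    {"user_id": "u1", "rating": 5, "review_text": "Loved it"},
    {"user_id": "u1", "rating": 1, "review_text": "Not for me at all"},
    {"user_id": "u2", "rating": 4, "review_text": "Fine"},
]
print(user_metrics(sample))
```

The resulting dictionary can be turned into a feature table and passed to whichever clustering method is chosen later.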

What type of patterns emerge from highly-voted reviews compared to low-voted ones?

# Data Loading & Subsetting

1. Load `romantasay_books_subset.json` - Created from scripts/main.py
2. Stream and filter `goodreads_reviews_dedup.json` to match the book subset
3. Load a chunk of `goodreads_interactions.csv` to calculate user-level metrics
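The streaming step in item 2 could look roughly like the sketch below. It assumes the reviews file is gzip-compressed JSON Lines (one review object per line) and that each record carries a `book_id` field; the function names are hypothetical and are not taken from the project's scripts.

```python
import gzip
import json


def filter_reviews(lines, book_ids):
    """Yield parsed reviews whose book_id is in book_ids.

    `lines` is any iterable of JSON strings, so the file is never
    loaded into memory all at once.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        review = json.loads(line)
        # Compare as strings in case IDs are stored as ints in one file
        # and as strings in the other (an assumption about the data).
        if str(review.get("book_id")) in book_ids:
            yield review


def stream_reviews(path, book_ids):
    """Open a gzip-compressed JSON Lines file and filter it lazily."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from filter_reviews(handle, book_ids)
```

Keeping the filter separate from the file handling makes it easy to try the logic on a few in-memory lines before pointing it at the full dump.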

# Data Quality & Cleaning

- Use `langid` to keep only reviews whose text is English
- Remove "to-read", "owned", and "kindle" tags, which skew the association rules
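A minimal sketch of the shelf-cleaning step follows. It assumes each book's shelves arrive as a list of strings; the lowercasing and de-duplication are extra choices made here for illustration, and the language filter is not covered.

```python
# Shelves named in the cleaning plan above.
NOISE_SHELVES = {"to-read", "owned", "kindle"}


def clean_shelves(shelves, noise=NOISE_SHELVES):
    """Normalize shelf names and drop noise tags, keeping first-seen order."""
    seen = set()
    cleaned = []
    for name in shelves:
        tag = name.strip().lower()
        if tag and tag not in noise and tag not in seen:
            seen.add(tag)
            cleaned.append(tag)
    return cleaned


print(clean_shelves(["To-Read", "fantasy", "kindle", "romance", "fantasy "]))
```

Each cleaned list can then serve as one transaction for association rule mining.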

# EDA Visualizations

- Genre/Shelf Analysis
- Persona Distributions: Histogram of rating variance to see if most readers are "nice" (consistently high ratings) or "critical"
  - Scatter plot: `review_length` vs `rating_frequency`
- Vote Analysis: Box plots comparing `review_length` across different `n_votes` tiers
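Before drawing the vote-analysis box plots, reviews need to be bucketed into `n_votes` tiers. The sketch below shows one way to do that; the tier names and cut points (0 and 10) are placeholders rather than values from the analysis, and the median is used only as a quick numeric summary of each tier.

```python
from statistics import median


def vote_tier(n_votes):
    """Map a vote count to a tier label. Cut points are hypothetical."""
    if n_votes <= 0:
        return "none"
    if n_votes < 10:
        return "low"
    return "high"


def median_length_by_tier(reviews):
    """Group review lengths by vote tier and return each tier's median."""
    tiers = {}
    for review in reviews:
        tier = vote_tier(review["n_votes"])
        tiers.setdefault(tier, []).append(review["review_length"])
    return {tier: median(values) for tier, values in tiers.items()}


# Toy records, not taken from the dataset.
toy = [
    {"n_votes": 0, "review_length": 40},
    {"n_votes": 3, "review_length": 120},
    {"n_votes": 5, "review_length": 200},
    {"n_votes": 25, "review_length": 900},
]
print(median_length_by_tier(toy))
```

The same tier label can be used as the grouping column when the box plots are produced.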



1. Create output.json by filtering the dataset down to books that match Romantasy

In [1]:
%run scripts/01_filterbooks.py data/goodreads_books.json.gz data/01_filteredbooks.json 

Starting filtration of data/goodreads_books.json.gz...
Processed 100000 books... Saved 910 matches.
Processed 200000 books... Saved 1854 matches.
Processed 300000 books... Saved 2816 matches.
Processed 400000 books... Saved 3747 matches.
Processed 500000 books... Saved 4654 matches.
Processed 600000 books... Saved 5594 matches.
Processed 700000 books... Saved 6485 matches.
Processed 800000 books... Saved 7440 matches.
Processed 900000 books... Saved 8388 matches.
Processed 1000000 books... Saved 9300 matches.
Processed 1100000 books... Saved 10168 matches.
Processed 1200000 books... Saved 11093 matches.
Processed 1300000 books... Saved 12000 matches.
Processed 1400000 books... Saved 12948 matches.
Processed 1500000 books... Saved 13861 matches.
Processed 1600000 books... Saved 14758 matches.
Processed 1700000 books... Saved 15716 matches.
Processed 1800000 books... Saved 16646 matches.
Processed 1900000 books... Saved 17545 matches.
Processed 2000000 books... Saved 18455 matches.
Proce

In [2]:
%run scripts/02_extractbooks.py data/01_filteredbooks.json data/goodreads_books.json.gz data/02_extractedbooks.json data/02_extractedbooks.csv 

1. Loading book ids from data/01_filteredbooks.json
Loaded 21741 unique book IDs
2. Filtering books from data/goodreads_books.json.gz
Processed 50000 books... Saved 428 books
Processed 100000 books... Saved 910 books
Processed 150000 books... Saved 1368 books
Processed 200000 books... Saved 1854 books
Processed 250000 books... Saved 2307 books
Processed 300000 books... Saved 2816 books
Processed 350000 books... Saved 3280 books
Processed 400000 books... Saved 3747 books
Processed 450000 books... Saved 4205 books
Processed 500000 books... Saved 4654 books
Processed 550000 books... Saved 5152 books
Processed 600000 books... Saved 5594 books
Processed 650000 books... Saved 6023 books
Processed 700000 books... Saved 6485 books
Processed 750000 books... Saved 6966 books
Processed 800000 books... Saved 7440 books
Processed 850000 books... Saved 7900 books
Processed 900000 books... Saved 8388 books
Processed 950000 books... Saved 8854 books
Processed 1000000 books... Saved 9300 books
Processe

## 3. Extract Reviews

In [5]:
%run scripts/03_extractreviews.py data/01_filteredbooks.json data/goodreads_reviews_dedup.json.gz data/03_extractedreviews.json data/03_extractedreviews.csv 

1. Loading book ids from data/01_filteredbooks.json
Loaded 21741 unique book IDs
2. Filtering reviews from data/goodreads_reviews_dedup.json.gz
Processed 50000 review... Saved 1218 reviews
Processed 100000 review... Saved 2138 reviews
Processed 150000 review... Saved 3598 reviews
Processed 200000 review... Saved 5336 reviews
Processed 250000 review... Saved 6420 reviews
Processed 300000 review... Saved 7638 reviews
Processed 350000 review... Saved 8863 reviews
Processed 400000 review... Saved 10167 reviews
Processed 450000 review... Saved 11448 reviews
Processed 500000 review... Saved 12772 reviews
Processed 550000 review... Saved 14026 reviews
Processed 600000 review... Saved 15449 reviews
Processed 650000 review... Saved 16354 reviews
Processed 700000 review... Saved 17314 reviews
Processed 750000 review... Saved 18673 reviews
Processed 800000 review... Saved 19876 reviews
Processed 850000 review... Saved 21231 reviews
Processed 900000 review... Saved 22436 reviews
Processed 950000 r

## 4. Run 04_association.py

In [4]:
%run scripts/04_association.py data/01_filteredbooks.json data/04_temp.csv data/04_association.csv

Filtering English for datasets
Done! Processed 21741 total records. Saved 15620 English-only records.
Extracting transactions from data/01_filteredbooks.json...
Done! Saved 15239 transactions to data/04_temp.csv.
Loading transactions from data/04_temp.csv into memory...
Encoding 15239 transactions into a Sparse Matrix...
Running FP-Growth (min_support=0.08)...
Generating association rules...

--- Top 10 Discovered Patterns (by Lift) ---
                               antecedents  \
69587              frozenset({ya-fiction})   
69582        frozenset({teen, ya-fantasy})   
69585                    frozenset({teen})   
69584  frozenset({ya-fiction, ya-fantasy})   
69581              frozenset({ya-fiction})   
69578           frozenset({teen, romance})   
69579     frozenset({romance, ya-fiction})   
69580                    frozenset({teen})   
69489              frozenset({ya-fiction})   
69488                    frozenset({teen})   

                               consequents      lift
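For readers unfamiliar with the `lift` column, the sketch below computes it by hand from the standard definition, lift(A → C) = support(A ∪ C) / (support(A) × support(C)). The transactions are invented for illustration and are not drawn from `data/04_temp.csv`, and the helper is not the code used inside `04_association.py`.

```python
def lift(transactions, antecedent, consequent):
    """Compute lift for the rule antecedent -> consequent.

    transactions: list of sets of shelf names.
    """
    n = len(transactions)
    a, c = set(antecedent), set(consequent)
    # Support = fraction of transactions containing the itemset.
    support_a = sum(a <= t for t in transactions) / n
    support_c = sum(c <= t for t in transactions) / n
    support_ac = sum((a | c) <= t for t in transactions) / n
    if support_a == 0 or support_c == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    return support_ac / (support_a * support_c)


# Invented shelf transactions.
toy_transactions = [
    {"teen", "ya-fiction"},
    {"teen", "ya-fiction"},
    {"romance"},
    {"romance", "teen"},
]
print(lift(toy_transactions, {"teen"}, {"ya-fiction"}))
```

A lift above 1 means the two shelves co-occur more often than they would if they were independent, which is why the rules above are ranked by it.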

## EDA

Now that the data is finalized, we can begin the EDA.