# Chapter 3 :: Yelp Example

The book is amazing, but the code in its examples is barely organized.

## Requirements

This notebook requires:

1. [Yelp Dataset](https://www.goodreads.com/book/show/34691713-natural-language-processing-with-pytorch?ac=1&from_search=true&qid=XshWEHfZ6z&rank=1) unpacked into the `./data` folder.
1. Python 3.8
1. Packages from `requirements.txt`

In [23]:
import collections
import json
import typing
import re
import numpy as np
import pandas as pd
    

In [9]:
train_proportion = 0.7
test_proportion = 0.15
val_proportion = 0.15
seed = 32167

`read_reviews` reads up to `limit` reviews from a JSON Lines file (one JSON object per line) and groups them by star rating.

In [2]:

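def read_reviews(path: str, limit: int) -> typing.DefaultDict:
    """Read up to `limit` JSON-lines reviews and group them by star rating."""
    by_rating = collections.defaultdict(list)

    with open(path) as reviews:
        # zip() stops early if the file has fewer than `limit` lines,
        # instead of raising StopIteration as a bare next() would.
        for _, line in zip(range(limit), reviews):
            review = json.loads(line)
            by_rating[review['stars']].append(review)

    return by_rating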

In [3]:
by_rating = read_reviews("data/yelp_academic_dataset_review.json", 300)
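
To make the input format concrete, here is a minimal sketch of the same grouping run against a tiny made-up file. The sample records and the `group_by_stars` helper are hypothetical and exist only for illustration; real Yelp reviews carry more fields than `stars` and `text`.

```python
import collections
import json
import os
import tempfile

# Hypothetical sample records; real Yelp reviews carry more fields.
sample = [
    {"stars": 5.0, "text": "Great tacos!"},
    {"stars": 1.0, "text": "Cold food."},
    {"stars": 5.0, "text": "Would come back."},
]

def group_by_stars(path, limit):
    """Group up to `limit` JSON-lines records by their `stars` value."""
    grouped = collections.defaultdict(list)
    with open(path) as lines:
        # zip() stops at whichever runs out first: the limit or the file.
        for _, line in zip(range(limit), lines):
            record = json.loads(line)
            grouped[record["stars"]].append(record)
    return grouped

# Write one JSON object per line, the layout the reader expects.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    for record in sample:
        tmp.write(json.dumps(record) + "\n")

demo = group_by_stars(tmp.name, limit=10)
os.remove(tmp.name)

print({stars: len(items) for stars, items in demo.items()})
# {5.0: 2, 1.0: 1}
```

Asking for 10 records from a 3-line file simply returns all 3, which is why the sketch pairs the line iterator with `range(limit)` rather than calling `next()` a fixed number of times.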

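Within each star rating, shuffle the reviews and mark each one as `train`, `val`, or `test` according to the proportions above, so every split keeps the same rating balance. Then convert the result to a pandas DataFrame.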

In [19]:

# Create split data
final_list = []
np.random.seed(seed)

for _, item_list in sorted(by_rating.items()):
    np.random.shuffle(item_list)
    
    n_total = len(item_list)
    n_train = int(train_proportion * n_total)
    n_val = int(val_proportion * n_total)

    # Give each data point a split attribute
    for item in item_list[:n_train]:
        item['split'] = 'train'

    for item in item_list[n_train:n_train+n_val]:
        item['split'] = 'val'

    # int() rounds down, so the test slice runs to the end of the list;
    # otherwise leftover items would end up with no split at all.
    for item in item_list[n_train+n_val:]:
        item['split'] = 'test'

    # Add to final list
    final_list.extend(item_list)

final_reviews = pd.DataFrame(final_list)
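
One way to check a per-rating split is to run the same idea on toy data and count the labels in each group. The following is a sketch, not the notebook's code: `assign_splits` and the toy records are hypothetical, it uses NumPy's `default_rng` instead of the global seed, and it assumes that any rounding leftovers go to `test` so no row is left unlabelled.

```python
import numpy as np
import pandas as pd

def assign_splits(items, train=0.7, val=0.15, seed=32167):
    """Return shuffled copies of `items`, each labelled train, val, or test."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(items))
    n_train = int(train * len(items))
    n_val = int(val * len(items))
    labelled = []
    for position, index in enumerate(order):
        if position < n_train:
            split = "train"
        elif position < n_train + n_val:
            split = "val"
        else:
            # Assumption: the rounding remainder goes to test.
            split = "test"
        labelled.append({**items[index], "split": split})
    return labelled

# Toy data: 20 one-star and 20 five-star reviews.
toy = {stars: [{"stars": stars, "text": f"review {i}"} for i in range(20)]
       for stars in (1, 5)}

rows = []
for stars, items in sorted(toy.items()):
    rows.extend(assign_splits(items))

frame = pd.DataFrame(rows)

# Every row should carry a label, and each rating should be split the same way.
print(frame["split"].isna().sum())
print(frame.groupby(["stars", "split"]).size())
```

If the `isna()` count is ever above zero, some reviews fell outside all three slices and would silently drop out of training and evaluation.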


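Lowercase the text, add spaces around the punctuation marks `.`, `,`, `!`, and `?`, and replace every run of other non-letter characters (digits, symbols, whitespace) with a single space.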

In [24]:
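def clean_punctuation(text):
    # Pad . , ! ? with spaces so each one becomes its own token
    text = re.sub(r"([.,!?])", r" \1 ", text)
    # Collapse every run of other non-letter characters into one space
    text = re.sub(r"[^a-zA-Z.,!?]+", r" ", text)
    return text

def to_lower(text):
    return text.lower()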

final_reviews.text = final_reviews.text.apply(to_lower).apply(clean_punctuation)
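
To see the effect of the two substitutions, here is a small sketch that applies them to one made-up sentence. `clean_punctuation` is copied from the cell above so the snippet runs on its own; the sample string is hypothetical.

```python
import re

def clean_punctuation(text):
    # Pad . , ! ? with spaces so each one becomes its own token
    text = re.sub(r"([.,!?])", r" \1 ", text)
    # Collapse every run of other non-letter characters into one space
    text = re.sub(r"[^a-zA-Z.,!?]+", r" ", text)
    return text

sample = "Great food!! 5/5, would return."
cleaned = clean_punctuation(sample.lower())
tokens = cleaned.split()

print(repr(cleaned))
# 'great food ! ! , would return . '
print(tokens)
# ['great', 'food', '!', '!', ',', 'would', 'return', '.']
```

Note that digits are discarded along with other symbols, so a rating such as `5/5` written inside the review text disappears entirely, and the cleaned string can keep a trailing space.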