## **02 Process Reviews**
This notebook processes all of the review data scraped in the previous notebook and outputs a summary dataframe containing the brewery name, review rating, and review text for each individual review.

### **Notebook Objectives**
1. Extract the review data for each brewery from the directory of txt files
2. Wrangle the data into a consistent form (scrapes from 'Attraction' and 'Restaurant' pages are formatted slightly differently)
3. Generate a dataframe containing the processed review data and export a csv for later notebooks

In [1]:
import random
import re
from pathlib import Path

import numpy as np
import pandas as pd
import spacy

# Seed both random number generators for reproducibility
random.seed(11)
np.random.seed(11)

In [2]:
def process_rating(raw):
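    # Attraction case: rating is a decimal at the start of the line (e.g. '4.5')
    match1 = re.findall(r'^\d\.\d', raw)
    # Restaurant case: rating is two adjacent digits (e.g. '45' means 4.5)
    match2 = re.findall(r'\d\d', raw)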
    if match1:
        rating = float(match1[0])
        return rating
    elif match2:
        rating = float(match2[0][0] + '.' + match2[0][1])
        return rating
    else:
        raise ValueError(f'Rating processing error: {raw}')

def process_date():
    # Placeholder: dates are read from the files but not yet processed
    raise NotImplementedError

def process_review(title, text):
    title = title.strip('\n')
    text = text.strip('\n')
    review = title + '. ' + text
    return review

def append_df(df, id, rating, review):
    # Build a frame for this brewery's reviews, repeating the id on every row
    new_rows = pd.DataFrame({'id': [id] * len(rating),
                             'rating': rating,
                             'review': review})
    return pd.concat([df, new_rows], ignore_index=True)

def append_brewery_data(df, path):
    id = re.findall(r'text\/([0-9a-z_-]+)-[A-Z]', str(path))[0]
    # Extract data from review text
    rating = []
    review = []
    # Explicit encoding, assuming the scraper wrote the files as UTF-8
    with open(path, 'r', encoding='utf-8') as f:
        while True:
            try:
                raw_date = next(f)
                raw_rating = next(f)
                title = next(f)
                text = next(f)
                rating.append(process_rating(raw_rating))
                review.append(process_review(title, text))
            except StopIteration:
                break
    # Append new data to dataframe
    return append_df(df, id, rating, review)
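
Each txt file is read as repeating four-line records (date, rating, title, text), and the rating line is normalized to a float. The sketch below illustrates that flow on an in-memory sample. The sample lines, including the `'4.0 of 5 bubbles'` and `'bubble_45'` rating strings, are invented for illustration, and `parse_rating` and `parse_reviews` are hypothetical names, not functions from this notebook.

```python
import io
import re


def parse_rating(raw):
    """Normalize a raw rating line to a float such as 4.0 or 4.5."""
    # Assumed 'Attraction' style: the line starts with a decimal, e.g. '4.0 of 5 bubbles'
    decimal = re.match(r'\d\.\d', raw)
    if decimal:
        return float(decimal.group())
    # Assumed 'Restaurant' style: two adjacent digits, e.g. 'bubble_45' means 4.5
    packed = re.search(r'\d\d', raw)
    if packed:
        digits = packed.group()
        return float(f'{digits[0]}.{digits[1]}')
    raise ValueError(f'Rating processing error: {raw!r}')


def parse_reviews(lines):
    """Yield (rating, review) pairs from an iterable of four-line records."""
    it = iter(lines)
    # Zipping one iterator with itself four times consumes date, rating,
    # title, and text together; an incomplete trailing record is dropped
    for _date, raw_rating, title, text in zip(it, it, it, it):
        review = title.strip('\n') + '. ' + text.strip('\n')
        yield parse_rating(raw_rating), review


# Hypothetical file contents, invented for illustration only
sample = io.StringIO(
    'June 2021\n'
    '4.0 of 5 bubbles\n'
    'Great stop\n'
    'Friendly staff and a good flight.\n'
    'May 2021\n'
    'bubble_45\n'
    'Nice patio\n'
    'Would visit again.\n'
)
records = list(parse_reviews(sample))
print(records)
```

Grouping with `zip` keeps the four reads in step without an explicit `try`/`except StopIteration`, at the cost of silently skipping a truncated final record.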

In [3]:
# Initialize df
review_text_df = pd.DataFrame(columns=['id', 'rating', 'review'])

# Extract review data from all txt files in directory
dir_path = '../assets/text/'
paths = Path(dir_path).glob('**/*.txt')
for path in paths:
    review_text_df = append_brewery_data(review_text_df, path)

# Sort alphabetically
review_text_df = review_text_df.sort_values(by=['id'], ignore_index=True)
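
The brewery id is recovered from each file path with a regular expression: everything after `text/` up to a hyphen followed by a capital letter. The sketch below shows how that pattern behaves. The file names are hypothetical (the `-Reviews` suffix is an assumption), and `extract_id` is an illustrative helper that returns `None` on a non-matching path rather than raising `IndexError`.

```python
import re
from pathlib import PurePath

# Same pattern as in append_brewery_data, without the unnecessary escape on '/'
ID_PATTERN = re.compile(r'text/([0-9a-z_-]+)-[A-Z]')


def extract_id(path):
    """Return the brewery id embedded in a review file path, or None if absent."""
    # as_posix() normalizes separators to '/', so the pattern also works on Windows paths
    match = ID_PATTERN.search(PurePath(path).as_posix())
    return match.group(1) if match else None


# Hypothetical file names: the capitalized suffix after the id is an assumption
print(extract_id('../assets/text/zero-gravity-craft-brewery-burlington-2-Reviews.txt'))
print(extract_id('../assets/text/notes.txt'))
```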

In [6]:
# Inspect df
print(review_text_df.shape)
review_text_df.tail(5)

(5985, 3)


| | id | rating | review |
|---|---|---|---|
| 5980 | zero-gravity-craft-brewery-burlington-2 | 4.0 | Eat good food and drink good beer!. We haven’t... |
| 5981 | zero-gravity-craft-brewery-burlington-2 | 4.0 | When u feel like a drink at 8am. One of the ve... |
| 5982 | zero-gravity-craft-brewery-burlington-2 | 5.0 | Nice Vibe At Zero Gravity. I had discovered an... |
| 5983 | zero-gravity-craft-brewery-burlington-2 | 3.0 | Fun place; so-so beer. Really fun atmosphere w... |
| 5984 | zero-gravity-craft-brewery-burlington-2 | 4.0 | Great stop. Was in Burlington and decided to g... |


In [12]:
# Spot check random reviews
# randint includes both endpoints, so cap at the last valid index
index = random.randint(0, len(review_text_df) - 1)
print(f'Index: {index}')
print(review_text_df['id'][index])
print(review_text_df['rating'][index])
print(review_text_df['review'][index])

Index: 4811
sons-of-liberty-beer-and-spirits-co-wakefield
5.0
Fun place. Well done! Went to Sip and Shuck last night, a monthly event. Live music, nice crowd, great way to sample Son's whiskey and other brews. They have some games and food vendors at these events. Good way to sample local food paired with Son's cocktails. I did a whiskey flight which allowed me to sample four whiskeys. All were good but Foolproof Porter was the favorite followed by Uprising. We also had American Pilsner which was very refreshing. Got to sample some local oysters too. Great time. Want to go for a tasting next time. Prices are very reasonable. 


In [20]:
# Import brewery info table for merging
filepath = Path('../assets/breweries_clean_address.csv') 
brewery_info_df = pd.read_csv(filepath)
print(f'Shape of main brewery listing: {brewery_info_df.shape}')

# Merge info and reviews tables
brewery_reviews_df = brewery_info_df.merge(review_text_df, left_on='obdb_id', right_on='id', how='inner').drop('id', axis=1)
brewery_reviews_df = brewery_reviews_df[['obdb_id', 'name', 'state', 'city', 'street', 
    'longitude', 'latitude', 'website_url', 'rating', 'review']]
print(f'Shape of review table: {brewery_reviews_df.shape}')
brewery_reviews_df.head(3)

Shape of main brewery listing: (8170, 17)
Shape of review table: (5985, 10)


| | obdb_id | name | state | city | street | longitude | latitude | website_url | rating | review |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10th-district-brewing-company-abington | 10th District Brewing Company | Massachusetts | Abington | 491 Washington St | -70.945941 | 42.105918 | http://www.10thdistrictbrewing.com | 4.0 | Nice local SE MASS Brewery. Definitely worth t... |
| 1 | 10th-district-brewing-company-abington | 10th District Brewing Company | Massachusetts | Abington | 491 Washington St | -70.945941 | 42.105918 | http://www.10thdistrictbrewing.com | 4.0 | Tasty, fresh Brew. Went for a quick taste of t... |
| 2 | 10th-district-brewing-company-abington | 10th District Brewing Company | Massachusetts | Abington | 491 Washington St | -70.945941 | 42.105918 | http://www.10thdistrictbrewing.com | 5.0 | Great Beer. I come here often for some good be... |
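
The review table has 5985 rows both before and after the inner merge, which suggests every review id matched a brewery in the info table. A minimal sketch of how that could be checked explicitly uses the `indicator` and `validate` options of `DataFrame.merge`. The two miniature tables are invented for illustration and are not drawn from the real data.

```python
import pandas as pd

# Hypothetical miniature tables, invented for illustration
info = pd.DataFrame({'obdb_id': ['a-brewery', 'b-brewery'],
                     'name': ['A Brewery', 'B Brewery']})
reviews = pd.DataFrame({'id': ['a-brewery', 'a-brewery', 'c-brewery'],
                        'rating': [4.0, 5.0, 3.0]})

# An outer merge with indicator=True labels each row by where its key was found;
# validate='one_to_many' raises MergeError if obdb_id is duplicated in `info`
check = info.merge(reviews, left_on='obdb_id', right_on='id',
                   how='outer', indicator=True, validate='one_to_many')

# Reviews whose id has no match in the info table would be dropped by an inner merge
orphans = check.loc[check['_merge'] == 'right_only', 'id'].tolist()
print(orphans)
```

An empty `orphans` list on the real tables would confirm that the inner merge dropped no reviews.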


In [10]:
# Export dataframe to CSV for processing in later notebooks
filepath = Path('../assets/brewery_reviews.csv') 
brewery_reviews_df.to_csv(filepath, index=False)