## **02 Process Reviews**
This notebook processes all of the review data scraped in the previous notebook and outputs a summary dataframe containing the brewery name, review rating, and review text for each individual review.

### **Notebook Objectives**
1. Extract the review data for each brewery from the directory of txt files
2. Wrangle the data into a consistent form (scrapes from 'Attraction' and 'Restaurant' pages are formatted slightly differently)
3. Generate a dataframe containing the processed review data and export a csv for later notebooks
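
For objective 2, most of the formatting difference sits in the rating field. The snippet below is a minimal sketch of normalizing both styles to a float. The function name `parse_rating` and the two sample strings are assumptions made for illustration; the actual raw strings in the scraped files may look different.

```python
import re

def parse_rating(raw: str) -> float:
    """Hypothetical helper: normalize a raw rating string to a float."""
    # Attraction-style (assumed): the string starts with a decimal rating
    decimal = re.match(r'\d\.\d', raw)
    if decimal:
        return float(decimal.group())
    # Restaurant-style (assumed): two adjacent digits encode the rating
    digits = re.search(r'\d\d', raw)
    if digits:
        pair = digits.group()
        return float(pair[0] + '.' + pair[1])
    raise ValueError(f'Rating processing error: {raw}')

# Sample strings are assumed formats, not copied from the scraped files
attraction_rating = parse_rating('4.5 of 5 bubbles')
restaurant_rating = parse_rating('bubble_45')
print(attraction_rating, restaurant_rating)
```

Both calls should yield the same numeric rating, which is the property the rest of the notebook relies on.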

In [11]:
import pandas as pd
import re
from pathlib import Path
import random
random.seed(11)
import numpy as np
np.random.seed(11)
import spacy

In [2]:
def process_rating(raw):
    """Convert a raw rating string to a float such as 4.5."""
    # Attraction case: string starts with a decimal rating (digit, literal '.', digit)
    match1 = re.findall(r'^\d\.\d', raw)
    # Restaurant case: two adjacent digits encode the rating, e.g. '45' -> 4.5
    match2 = re.findall(r'\d\d', raw)
    if match1:
        return float(match1[0])
    elif match2:
        return float(match2[0][0] + '.' + match2[0][1])
    else:
        raise ValueError(f'Rating processing error: {raw}')

def process_date():
    # Placeholder: date parsing is not implemented yet
    raise NotImplementedError

def process_review(title, text):
    title = title.strip('\n')
    text = text.strip('\n')
    review = title + '. ' + text
    return review

def append_df(df, id, rating, review):
    # Repeat the brewery id once per review so all columns have equal length
    new_rows = pd.DataFrame({'id': [id] * len(rating),
                             'rating': rating,
                             'review': review})
    return pd.concat([df, new_rows], ignore_index=True)

def append_brewery_data(df, path):
    id = re.findall(r'text\/([0-9a-z_-]+)-[A-Z]', str(path))[0]
    # Extract data from review text; each review spans four lines:
    # date, rating, title, text
    rating = []
    review = []
    with open(path, 'r') as f:
        while True:
            try:
                raw_date = next(f)
                raw_rating = next(f)
                title = next(f)
                text = next(f)
                rating.append(process_rating(raw_rating))
                review.append(process_review(title, text))
            except StopIteration:
                break
    # Append new data to dataframe
    return append_df(df, id, rating, review)
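
The reading loop in `append_brewery_data` consumes the file four lines at a time and stops when the lines run out. A minimal sketch of the same idea using `itertools.islice` is shown below. The helper name `iter_records` and the sample file contents are hypothetical and only illustrate the assumed date, rating, title, text layout.

```python
import io
from itertools import islice

def iter_records(lines, size=4):
    """Yield tuples of `size` consecutive lines, dropping a trailing partial record."""
    it = iter(lines)
    while True:
        record = tuple(islice(it, size))
        if len(record) < size:
            return
        yield record

# Hypothetical file contents: date, rating, title, text repeated per review
sample = io.StringIO(
    "January 2020\n"
    "5.0 of 5 bubbles\n"
    "Great beers\n"
    "Solid lineup and friendly staff.\n"
    "March 2020\n"
    "bubble_40\n"
    "Fun place\n"
    "Nice spot on a Sunday afternoon.\n"
    "Orphan line from a truncated record\n"
)

records = list(iter_records(sample))
for raw_date, raw_rating, title, text in records:
    # Same title + text join as process_review
    print(title.strip('\n') + '. ' + text.strip('\n'))
```

As in the original loop, an incomplete record at the end of a file is silently dropped rather than raising an error.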

In [5]:
# Initialize df
review_text_df = pd.DataFrame(columns=['id', 'rating', 'review'])

# Extract review data from all txt files in directory
dir_path = '../assets/text/'
paths = Path(dir_path).glob('**/*.txt')
for path in paths:
    review_text_df = append_brewery_data(review_text_df, path)

# Sort alphabetically
review_text_df = review_text_df.sort_values(by=['id'], ignore_index=True)
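
Calling `pd.concat` inside the loop copies the accumulated dataframe on every iteration. For a few thousand reviews this is fine, but a common alternative is to collect one small frame per file and concatenate once at the end. The sketch below assumes a hypothetical `parsed` dictionary standing in for the per-file results; it is not the notebook's actual data.

```python
import pandas as pd

# Hypothetical per-brewery results standing in for the parsed txt files
parsed = {
    'brewery-b': ([5.0, 4.0], ['Fun place. Good beer.', 'Nice. Worth a stop.']),
    'brewery-a': ([3.0], ['Okay. Small taproom.']),
}

frames = []
for brewery_id, (ratings, reviews) in parsed.items():
    # The scalar id is broadcast across the rows of each small frame
    frames.append(pd.DataFrame({'id': brewery_id, 'rating': ratings, 'review': reviews}))

# Concatenate once, then sort alphabetically by id
combined = (
    pd.concat(frames, ignore_index=True)
    .sort_values(by='id', ignore_index=True)
)
print(combined.shape)
print(combined['rating'].dtype)
```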

In [6]:
# Inspect df
print(review_text_df.shape)
review_text_df.tail(5)

(2576, 3)


| | id | rating | review |
|---|---|---|---|
| 2571 | wormtown-brewery-worcester | 4.0 | Great beers. Wormtown makes some solid beers. ... |
| 2572 | wormtown-brewery-worcester | 5.0 | Fun Place. Fun place on a Sunday afternoon. L... |
| 2573 | wormtown-brewery-worcester | 5.0 | Small, but convivial taproom. Many craft brewe... |
| 2574 | wormtown-brewery-worcester | 5.0 | Good Brewery. Wormtown Brewery is worth the tr... |
| 2575 | wormtown-brewery-worcester | 5.0 | Loved the Atmosphere. We enjoyed our visit to ... |


In [7]:
# Spot check random reviews
# randrange excludes the stop value; randint(0, n) could return n, which is out of range
index = random.randrange(len(review_text_df))
print(f'Index: {index}')
print(review_text_df.loc[index, 'rating'])
print(review_text_df.loc[index, 'review'])

Index: 1852
5.0
Brewery Tour - Salem, Mass. This was our very first stop in Salem while waiting to check into our room.  The staff was friendly and knowledgeable and the beer, according to my husband was extremely tasty.  Down fall, we cannot buy this beer in Canada where we live.  The atmosphere...More
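
When picking a random row, note that `random.randint` includes both endpoints, whereas valid row positions run from 0 to `len(df) - 1`. The short sketch below illustrates the difference with `random.randrange`; the value of `n` is a hypothetical row count chosen to keep the example small.

```python
import random

n = 5  # hypothetical number of rows

random.seed(11)
# randint includes both endpoints, so n itself can be drawn
inclusive = {random.randint(0, n) for _ in range(1000)}
# randrange excludes the stop value, so every draw is a valid position
exclusive = {random.randrange(n) for _ in range(1000)}

print(max(inclusive))
print(max(exclusive))
```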


In [8]:
# Import brewery info table for merging
filepath = Path('../assets/breweries_subset_CLEAN_ADDRESS.csv') 
brewery_info_df = pd.read_csv(filepath)
brewery_info_df.head()

| | obdb_id | name | state | city | street | longitude | latitude | website_url |
|---|---|---|---|---|---|---|---|---|
| 0 | 10th-district-brewing-company-abington | 10th District Brewing Company | Massachusetts | Abington | 491 Washington St | -70.945941 | 42.105918 | http://www.10thdistrictbrewing.com |
| 1 | 3-beards-beer-company-williamsburg | 3 Beards Beer Company | Massachusetts | Williamsburg | 4 Main St | -72.730506 | 42.392366 | http://www.3beardsbeer.com |
| 2 | 3cross-fermentation-cooperative-worcester | 3cross Fermentation Cooperative | Massachusetts | Worcester | 4 Knowlton Ave | -71.830576 | 42.243649 | http://www.3cross.coop |
| 3 | 7th-wave-brewing-medfield | 7th Wave Brewing | Massachusetts | Medfield | 120 N Meadows Rd | | | http://www.7thwavebrewing.com |
| 4 | abandoned-building-brewery-easthampton | Abandoned Building Brewery | Massachusetts | Easthampton | 142 Pleasant St Unit | | | http://www.abandonedbuildingbrewery.com |


In [9]:
# Merge info and reviews tables
brewery_reviews_df = brewery_info_df.merge(review_text_df, left_on='obdb_id', right_on='id', how='inner').drop('id', axis=1)
print(f'Shape: {brewery_reviews_df.shape}')
brewery_reviews_df.head()

Shape: (2576, 10)


| | obdb_id | name | state | city | street | longitude | latitude | website_url | rating | review |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10th-district-brewing-company-abington | 10th District Brewing Company | Massachusetts | Abington | 491 Washington St | -70.945941 | 42.105918 | http://www.10thdistrictbrewing.com | 4.0 | Tasty, fresh Brew. Went for a quick taste of t... |
| 1 | 10th-district-brewing-company-abington | 10th District Brewing Company | Massachusetts | Abington | 491 Washington St | -70.945941 | 42.105918 | http://www.10thdistrictbrewing.com | 4.0 | Great local micro brewery. The beer here is ve... |
| 2 | 10th-district-brewing-company-abington | 10th District Brewing Company | Massachusetts | Abington | 491 Washington St | -70.945941 | 42.105918 | http://www.10thdistrictbrewing.com | 4.0 | Nice local SE MASS Brewery. Definitely worth t... |
| 3 | 10th-district-brewing-company-abington | 10th District Brewing Company | Massachusetts | Abington | 491 Washington St | -70.945941 | 42.105918 | http://www.10thdistrictbrewing.com | 5.0 | Great Beer. I come here often for some good be... |
| 4 | 10th-district-brewing-company-abington | 10th District Brewing Company | Massachusetts | Abington | 491 Washington St | -70.945941 | 42.105918 | http://www.10thdistrictbrewing.com | 5.0 | Micro Brewery with diverse collections . This ... |
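
An inner merge silently drops reviews whose `id` has no match in the info table. Here the row count is unchanged (2576), so nothing was lost, but a quick check can make that explicit. The sketch below uses the `validate` and `indicator` parameters of `DataFrame.merge` on small hypothetical frames; the ids and names are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the info and review tables
info = pd.DataFrame({
    'obdb_id': ['brewery-a', 'brewery-b'],
    'name': ['Brewery A', 'Brewery B'],
})
reviews = pd.DataFrame({
    'id': ['brewery-a', 'brewery-a', 'brewery-c'],
    'rating': [4.0, 5.0, 3.0],
})

checked = info.merge(
    reviews,
    left_on='obdb_id',
    right_on='id',
    how='outer',             # keep unmatched rows from both sides
    validate='one_to_many',  # raises if obdb_id is duplicated in info
    indicator=True,          # adds a '_merge' column describing each row's origin
)

# Rows that an inner merge would have dropped
unmatched = checked[checked['_merge'] != 'both']
print(unmatched[['obdb_id', 'id', '_merge']])
```

An empty `unmatched` frame would indicate that every review found its brewery and every brewery has at least one review.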


In [10]:
# Export dataframe to csv for processing in later notebooks
filepath = Path('../assets/brewery_reviews.csv') 
brewery_reviews_df.to_csv(filepath, index=False)
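
Passing `index=False` keeps the row index out of the csv, so later notebooks do not pick up an extra `Unnamed: 0` column when reading it back. The sketch below shows the difference using in-memory buffers and a small hypothetical dataframe rather than the real export file.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for brewery_reviews_df
df = pd.DataFrame({'rating': [4.0, 5.0], 'review': ['Great beers.', 'Fun place.']})

with_index = io.StringIO()
df.to_csv(with_index)                  # default index=True writes the index column
with_index.seek(0)

without_index = io.StringIO()
df.to_csv(without_index, index=False)  # index omitted
without_index.seek(0)

cols_with = list(pd.read_csv(with_index).columns)
cols_without = list(pd.read_csv(without_index).columns)
print(cols_with)
print(cols_without)
```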