## **02 Process Reviews**
This notebook processes all of the review data scraped in the previous notebook and outputs a summary dataframe containing the brewery name, review rating, and review text for each individual review.

### **Notebook Objectives**
1. Extract the review data for each brewery from the directory of txt files
2. Wrangle the data into a consistent form (reviews scraped from 'Attraction' and 'Restaurant' pages are formatted slightly differently)
3. Output a dataframe containing the processed review data
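
The second objective can be sketched as a small normaliser that accepts either rating layout. This is only an illustration under stated assumptions: the sample strings `'4.0 of 5 bubbles'` and `'bubble_45'` are hypothetical stand-ins for the two page formats, and `normalize_rating` is a name invented for this sketch rather than a function from the notebook.

```python
import re

def normalize_rating(raw):
    """Return the rating as a float, whichever of the two layouts `raw` uses."""
    # Layout A (assumed): the string starts with a decimal rating such as '4.0'
    decimal = re.match(r'\d\.\d', raw)
    if decimal:
        return float(decimal.group())
    # Layout B (assumed): two adjacent digits that encode tenths, e.g. '45'
    packed = re.search(r'\d\d', raw)
    if packed:
        return int(packed.group()) / 10
    raise ValueError(f'Unrecognised rating: {raw!r}')

# Hypothetical raw strings, one per layout
print(normalize_rating('4.0 of 5 bubbles'))
print(normalize_rating('bubble_45'))
```

Checking the decimal layout first matters, because a string such as `'4.0'` contains no pair of adjacent digits, while a longer string could contain both patterns.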

In [164]:
import pandas as pd
import re
from pathlib import Path
import random
random.seed(10)

In [160]:
def process_rating(raw):
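    # Attraction case: rating starts with a decimal number (digit, dot, digit)
    match1 = re.findall(r'^\d\.\d', raw)
    # Restaurant case: rating is encoded as two adjacent digits, read as tenths
    match2 = re.findall(r'\d\d', raw)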
    if match1:
        rating = float(match1[0])
        return rating
    elif match2:
        rating = float(match2[0][0] + '.' + match2[0][1])
        return rating
    else:
        raise ValueError(f'Rating processing error: {raw}')

def process_date():
    # Placeholder: date parsing is not implemented yet
    raise NotImplementedError

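def process_review(title, text):
    title = title.strip('\n')
    text = text.strip('\n')
    # Drop the trailing 'More' link text. str.strip('More') would instead
    # remove any of the characters M, o, r, e from both ends of the text.
    if text.endswith('More'):
        text = text[:-len('More')]
    review = title + '. ' + text
    return review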

def append_df(df, brewery, rating, review):
    name = [brewery] * len(rating)
    reviews_df = pd.concat([df, 
    pd.DataFrame({'name' : name,
                'rating' : rating,
                'review' : review})], ignore_index=True)
    return reviews_df

def append_brewery_data(df, path):
    brewery = re.findall(r'^.+\/([A-Za-z\d_]+)', str(path))[0]
    # Extract data from review text
    rating = []
    review = []
    with open(path, 'r') as f:
        while True:
            try:
                raw_date = next(f)
                raw_rating = next(f)
                title = next(f)
                text = next(f)
                rating.append(process_rating(raw_rating))
                review.append(process_review(title, text))
            except StopIteration:
                break
    # Append new data to dataframe
    return append_df(df, brewery, rating, review)
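
`append_brewery_data` relies on each review occupying four consecutive lines: date, rating, title, and text. As a rough sketch of that layout, the snippet below groups lines in fours from an in-memory file. The sample content and the `iter_records` helper are invented for illustration, so the real scraped files may differ in detail.

```python
import io

def iter_records(lines, size=4):
    """Yield tuples of `size` consecutive lines, dropping an incomplete tail."""
    it = iter(lines)
    # Zipping the same iterator `size` times pulls `size` lines per tuple
    # and stops as soon as a full group can no longer be formed
    for group in zip(*[it] * size):
        yield tuple(line.rstrip('\n') for line in group)

# Hypothetical file content: two complete reviews plus a dangling line
sample = io.StringIO(
    'June 2021\n5.0 of 5 bubbles\nGreat stop\nLoved the IPA.\n'
    'May 2021\n4.0 of 5 bubbles\nSolid\nGood lagers.\n'
    'April 2021\n'
)

for date, rating, title, text in iter_records(sample):
    print(rating, '|', title, '|', text)
```

Like the `try`/`except StopIteration` loop above, this drops a partial record at the end of the file without raising, so a truncated scrape would go unnoticed unless record counts are checked separately.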

In [154]:
# Initialize df
reviews_df = pd.DataFrame(columns=['name', 'rating', 'review'])

# Extract review data from all txt files in directory
dir_path = '../assets/text/'
paths = Path(dir_path).glob('**/*.txt')
for path in paths:
    reviews_df = append_brewery_data(reviews_df, path)

# Sort alphabetically
reviews_df = reviews_df.sort_values(by=['name'], ignore_index=True)

In [163]:
# Check df
reviews_df.tail(5)

| | name | rating | review |
|---|---|---|---|
| 2584 | Wormtown_Brewery | 4.0 | Fun Brewery on Shrewsbury Street . We enjoyed ... |
| 2585 | Wormtown_Brewery | 5.0 | Great local brewery. Impromptu stop while driv... |
| 2586 | Wormtown_Brewery | 4.0 | Good beer, small place. The brewery is situate... |
| 2587 | Wormtown_Brewery | 5.0 | Terrific. Love this brewery. Yummy yummy. Sat ... |
| 2588 | Wormtown_Brewery | 5.0 | Just Good Beer. Stopped by for first time to h... |


In [179]:
# Spot check random reviews
# randrange excludes the upper bound, so the index is always valid
index = random.randrange(len(reviews_df))
print(f'Index: {index}')
print(reviews_df['rating'][index])
print(reviews_df['review'][index])

Index: 2132
4.0
The Jacks Abbey of Hopkinton . Great atmosphere and beer selection. After renovations it’s spacious and serve food as well. You order your food separate from your beer. Two separate stations. Food is OK. My fish and chips had dry-over-fried chips, and “healthy” tasting coleslaw but fish was good. 
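
When drawing a random position for a spot check like this, it helps to remember that `random.randint(a, b)` includes both endpoints, while valid positions run from `0` to `len - 1`. A small sketch, with a plain list standing in for the dataframe (the list contents are made up):

```python
import random

random.seed(10)
reviews = ['first review', 'second review', 'third review']  # stand-in data

# randrange(n) draws from 0..n-1, so every result is a valid position
index = random.randrange(len(reviews))
print(f'Index: {index}')
print(reviews[index])

# randint(0, n) could also return n itself, one past the last position;
# randrange never does, as this repeated draw confirms
assert all(0 <= random.randrange(len(reviews)) < len(reviews) for _ in range(1000))
```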


In [176]:
# Export dataframe to csv for processing in later notebooks
filepath = Path('../assets/brewery_reviews_R1.csv') 
reviews_df.to_csv(filepath)