# Title

In [1]:
## Load libraries

In [2]:
import pandas as pd
import requests
import gzip
import json
import io
import numpy as np
import re

## Data Collection and Preprocessing

### Importing Data

First, we import the Google reviews data for Vermont, USA, and convert it into a dataframe named `vt`.

In [3]:
url = 'https://mcauleylab.ucsd.edu/public_datasets/gdrive/googlelocal/review-Vermont_10.json.gz'

response = requests.get(url, stream = True)
response.raise_for_status() 

with gzip.GzipFile(fileobj = io.BytesIO(response.content), mode = 'rb') as gz_file:
    data_list = [json.loads(line) for line in gz_file]

vt = pd.DataFrame(data_list)

In [4]:
print(vt.shape) # 324725 observations and 8 columns.
print(vt.columns)

(324725, 8)
Index(['user_id', 'name', 'time', 'rating', 'text', 'pics', 'resp', 'gmap_id'], dtype='object')
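
The source file is gzip-compressed JSON Lines, meaning one JSON object per line. As a minimal sketch, the same parsing pattern can be tried on an in-memory sample without any download. The two sample records are made up for illustration and are not taken from the dataset.

```python
import gzip
import io
import json

# Hypothetical sample records, serialised as gzip-compressed JSON Lines
sample_records = [
    {"user_id": "u1", "rating": 5, "text": "Great  service"},
    {"user_id": "u2", "rating": 3, "text": None},
]
raw_bytes = gzip.compress(
    "\n".join(json.dumps(r) for r in sample_records).encode("utf-8")
)

# Same pattern as the loading cell: wrap the bytes, decompress, parse line by line
with gzip.GzipFile(fileobj=io.BytesIO(raw_bytes), mode="rb") as gz_file:
    parsed = [json.loads(line) for line in gz_file]

print(parsed == sample_records)
```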


Next, we import the local business metadata for Vermont, USA, and convert it into a dataframe named `vt_metadata`.

In [5]:
url_metadata = 'https://mcauleylab.ucsd.edu/public_datasets/gdrive/googlelocal/meta-Vermont.json.gz'

response_metadata = requests.get(url_metadata, stream = True)
response_metadata.raise_for_status() 

with gzip.GzipFile(fileobj = io.BytesIO(response_metadata.content), mode = 'rb') as gz_file:
    data_list1 = [json.loads(line) for line in gz_file]

vt_metadata = pd.DataFrame(data_list1)

In [6]:
print(vt_metadata.shape) # 11291 observations and 15 columns
print(vt_metadata.columns)

(11291, 15)
Index(['name', 'address', 'gmap_id', 'description', 'latitude', 'longitude',
       'category', 'avg_rating', 'num_of_reviews', 'price', 'hours', 'MISC',
       'state', 'relative_results', 'url'],
      dtype='object')


### Data Cleaning
Next, we clean and prepare the two datasets for analysis. We start with the reviews dataset before moving on to the business metadata dataset.

#### Reviews dataset

We'll perform the following steps on the reviews dataset:

1. Drop the irrelevant `name` column to avoid redundancy, as users can be identified by `user_id`
2. Convert all column names to lower case
3. Clean `text` by replacing each run of whitespace with a single space
4. Remove duplicate reviews based on user id, review text, business location (`gmap_id`), and time uploaded, to ensure each review is unique

In [7]:
vt = vt.drop('name', axis = 1) # drop 'name' column as it's not relevant to our analysis
vt.columns = vt.columns.str.lower() # convert all column names to lower case
vt['text'] = vt['text'].str.replace(r'\s+', ' ', regex = True) # collapse runs of whitespace into one space
vt = vt.drop_duplicates(subset = ['user_id', 'text', 'gmap_id', 'time']) # drop duplicate reviews
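
The whitespace and de-duplication steps can be illustrated on a toy dataframe. This is only a sketch: the values below are made up, and the point is to show that `\s+` also collapses tabs and newlines, and that a row is dropped only when all four key columns match.

```python
import pandas as pd

# Toy reviews (made-up values) with irregular whitespace and one exact duplicate
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "text": ["Great   place\n\nto eat", "Great   place\n\nto eat", None],
    "gmap_id": ["g1", "g1", "g1"],
    "time": [100, 100, 200],
})

# \s+ matches any run of whitespace (spaces, tabs, newlines)
toy["text"] = toy["text"].str.replace(r"\s+", " ", regex=True)

# Rows count as duplicates only if all four key columns match
toy = toy.drop_duplicates(subset=["user_id", "text", "gmap_id", "time"])

print(toy["text"].tolist())
```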

Next, because the picture URLs in the `pics` column are nested inside a list of dictionaries, we write a function to collapse all URLs into a single Python list. We name the new column `pics_collapsed` and drop the old `pics` column.

In [8]:
def collapse_pics(pic_list):
    if not pic_list:
        return []  
    urls = []
    for pic_dict in pic_list:
        urls.extend(pic_dict.get('url', []))
    return urls

vt['pics_collapsed'] = vt['pics'].apply(collapse_pics)
vt = vt.drop('pics', axis = 1)
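
To make the expected input shape concrete, here is the same function applied to a hypothetical `pics` entry. The example URLs are placeholders, and the structure (a list of dictionaries, each holding a list under `'url'`) is inferred from the function itself.

```python
def collapse_pics(pic_list):
    """Flatten a list of {'url': [...]} dicts into one list of URLs."""
    if not pic_list:  # None or empty list means no pictures
        return []
    urls = []
    for pic_dict in pic_list:
        urls.extend(pic_dict.get("url", []))
    return urls

# Hypothetical entry shaped like one cell of the `pics` column
example = [
    {"url": ["https://example.com/a.jpg"]},
    {"url": ["https://example.com/b.jpg", "https://example.com/c.jpg"]},
]

print(collapse_pics(example))  # three URLs in one flat list
print(collapse_pics(None))     # []
```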

Similarly, we write a function to collapse business responses to reviews in the `resp` column, naming the new column as `resp_collapsed` and dropping the old `resp` column.

In [9]:
def extract_texts(resp_entry):
    if isinstance(resp_entry, dict):
        # single response dict
        return resp_entry.get("text", "")
    elif isinstance(resp_entry, list):
        # list of response dicts
        return " ".join([d.get("text", "") for d in resp_entry if isinstance(d, dict)])
    elif isinstance(resp_entry, str):
        # fallback: extract with regex if it's a string
        texts = re.findall(r'"text":\s*"([^"]*)"', resp_entry)
        return " ".join(texts)
    else:
        return ""
    
vt["resp_collapsed"] = vt["resp"].apply(extract_texts)
vt = vt.drop("resp", axis = 1)
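
The function handles three input shapes plus a fallback. A short sketch with made-up responses shows what each branch returns. Note that the regex branch stops at the first `"` it meets, so a response containing an escaped quote would be cut short.

```python
import re

def extract_texts(resp_entry):
    if isinstance(resp_entry, dict):
        return resp_entry.get("text", "")
    elif isinstance(resp_entry, list):
        return " ".join(d.get("text", "") for d in resp_entry if isinstance(d, dict))
    elif isinstance(resp_entry, str):
        return " ".join(re.findall(r'"text":\s*"([^"]*)"', resp_entry))
    return ""

# Hypothetical inputs, one per branch
as_dict = {"time": 1620000000000, "text": "Thanks for visiting!"}
as_list = [{"text": "Thank you."}, {"text": "See you soon."}]
as_str = '{"time": 1, "text": "Much appreciated"}'

results = [extract_texts(x) for x in (as_dict, as_list, as_str, None)]
print(results)
```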

The first 3 observations of the reviews dataframe are shown below.

In [10]:
print(vt.head(3))

                 user_id           time  rating  \
0  118026874392842649478  1620085852324       5   
1  101532740754036204131  1580309946474       5   
2  115404122636203550540  1605195974445       5   

                                                text  \
0      Always done right from wood stove to screens!   
1  A great company to work with. Their sales and ...   
2  Great place to do business with staff was grea...   

                                 gmap_id pics_collapsed  \
0  0x89e02445cb9db457:0x37f42bff4edf7a43             []   
1  0x89e02445cb9db457:0x37f42bff4edf7a43             []   
2  0x89e02445cb9db457:0x37f42bff4edf7a43             []   

                                      resp_collapsed  
0  Good Evening, Rebecca! Thanks SO much for the ...  
1  Good Afternoon, Peter - Really appreciate the ...  
2  Hi Chad!\n\nThank you so much for the 5-Star r...  


#### Metadata dataset

We'll perform the following steps on the metadata dataset:

1. Drop the irrelevant columns that will not be needed for analysis
2. Convert all column names to lower case
3. Clean `description` and `category` columns by replacing each run of whitespace with a single space
4. Remove duplicate businesses based on business name and Google Maps id to ensure each observation is unique

In [11]:
vt_metadata = vt_metadata.drop(['address', 'latitude', 'longitude', 'avg_rating', 'num_of_reviews', 'price', 'hours', 'MISC', 'state', 'relative_results'], axis = 1)
vt_metadata.columns = vt_metadata.columns.str.lower() 
vt_metadata['description'] = vt_metadata['description'].str.replace(r'\s+', ' ', regex = True)
vt_metadata['category'] = vt_metadata['category'].str.replace(r'\s+', ' ', regex = True)
vt_metadata = vt_metadata.drop_duplicates(subset = ['name','gmap_id'])
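
One caveat: `Series.str.replace` only transforms string elements, and elements of any other type come back as `NaN`. If `category` stores a list of labels per business (an assumption about the raw data, which can be checked with `vt_metadata['category'].map(type).value_counts()` before cleaning), a per-element helper keeps the labels. The helper name `clean_category` and the sample values below are hypothetical.

```python
import re
import pandas as pd

def clean_category(value):
    """Collapse whitespace in a string or in each string of a list."""
    if isinstance(value, list):
        return [re.sub(r"\s+", " ", item) for item in value]
    if isinstance(value, str):
        return re.sub(r"\s+", " ", value)
    return value  # leave None / NaN untouched

toy = pd.Series([["Gift  shop", "Souvenir store"], "Farm", None], dtype=object)

via_str = toy.str.replace(r"\s+", " ", regex=True)  # list element becomes NaN
via_helper = toy.apply(clean_category)              # list element is preserved

print(via_str.isna().tolist())
print(via_helper.tolist())
```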

The first 3 observations of the metadata dataframe are shown below.

In [12]:
print(vt_metadata.head(3))

                       name                                gmap_id  \
0               Royal Group  0x89e02445cb9db457:0x37f42bff4edf7a43   
1  Foxglove Farm and Forest  0x4cb549e8877cf0d7:0xe8f003e6d73392ae   
2              Carr's Gifts  0x4cb54a301f3518f7:0x39af4eda1efb9117   

  description  category                                                url  
0        None       NaN  https://www.google.com/maps/place//data=!4m2!3...  
1        None       NaN  https://www.google.com/maps/place//data=!4m2!3...  
2        None       NaN  https://www.google.com/maps/place//data=!4m2!3...  


In [14]:
vt_metadata.to_csv('vt_metadata.csv')
vt.to_csv('vt.csv')

### Data Scraping for Business Information

Scraping Process ...
- scrape our own data to fill in blanks for description
- scrape our own data to fill in blanks for category

### Merging Reviews and Metadata

In [None]:
### Importing Data

First, we import the newly scraped data into a dataframe named `vt_metadata` and drop the index column added by default.

In [None]:
# need to rename these after section 2 ^ is filled
# vt_metadata = pd.read_csv('2_vt_metadata_scraped.csv')
# vt_metadata = vt_metadata.drop(['Unnamed: 0'], axis = 1)

Next, we merge the reviews and metadata dataframes on the common key `gmap_id` and save the merged dataframe as `vt_merged`. We then drop the now irrelevant `gmap_id` column.

In [None]:
vt_merged = pd.merge(vt, vt_metadata, on = 'gmap_id', how = 'inner')
vt_merged = vt_merged.drop('gmap_id', axis = 1)
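
An inner join silently drops reviews whose `gmap_id` has no match in the metadata, and it would duplicate reviews if a `gmap_id` appeared more than once in the metadata. The sketch below, on made-up toy frames, checks both conditions with the `validate` and `indicator` options of `pd.merge`. It is one possible sanity check, not part of the original pipeline.

```python
import pandas as pd

reviews = pd.DataFrame({
    "gmap_id": ["g1", "g1", "g2", "g9"],  # g9 has no metadata row
    "rating": [5, 4, 3, 1],
})
metadata = pd.DataFrame({
    "gmap_id": ["g1", "g2"],
    "name": ["Shop A", "Cafe B"],
})

# validate="many_to_one" raises MergeError if gmap_id repeats in metadata
checked = pd.merge(reviews, metadata, on="gmap_id", how="left",
                   validate="many_to_one", indicator=True)

unmatched = int((checked["_merge"] == "left_only").sum())
print("reviews without metadata:", unmatched)

# Keeping only matched rows reproduces the inner join
inner = checked[checked["_merge"] == "both"].drop(columns="_merge")
print(inner.shape)
```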

We insert a new column called `review_id` so that each review has a unique identifier.

In [None]:
vt_merged['review_id'] = range(len(vt_merged)) 
id_column = vt_merged.pop('review_id')
vt_merged.insert(0, 'review_id', id_column)

### Manual Labelling of Data

We add 7 new boolean columns:
1. `is_image_ad`: TRUE if the uploaded image is labelled as an advertisement
2. `is_image_irrelevant`: TRUE if the uploaded image is labelled as irrelevant
3. `is_text_ad`: TRUE if the text comment is labelled as an advertisement
4. `is_text_irrelevant`: TRUE if the text comment is labelled as irrelevant
5. `is_text_rant`: TRUE if the text comment is labelled as a rant from a non-visitor
6. `is_review_ad`: TRUE if either the image or the text comment is labelled as an advertisement
7. `is_review_irrelevant`: TRUE if either the image or the text comment is labelled as irrelevant

In [None]:
# Placeholder label columns, initialised to False until the reviews are manually labelled
# (assigning None and casting to bool also yields False)
label_columns = ['is_image_ad', 'is_image_irrelevant', 'is_text_ad', 'is_text_irrelevant', 'is_text_rant']
for col in label_columns:
    vt_merged[col] = False

# Review-level flags: TRUE if either the image or the text comment is flagged
vt_merged['is_review_ad'] = vt_merged['is_text_ad'] | vt_merged['is_image_ad']
vt_merged['is_review_irrelevant'] = vt_merged['is_image_irrelevant'] | vt_merged['is_text_irrelevant']
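
Because the placeholder columns start as `False`, an unlabelled row looks the same as a row labelled `False`. One alternative, offered as a sketch and not as the notebook's method, is pandas' nullable `boolean` dtype, where unlabelled rows stay `<NA>` and `|` follows three-valued logic. The toy frame and labels are made up.

```python
import pandas as pd

toy = pd.DataFrame({"review_id": [0, 1, 2, 3]})

for col in ["is_text_ad", "is_image_ad"]:
    # All rows start as <NA>, meaning "not labelled yet"
    toy[col] = pd.array([pd.NA] * len(toy), dtype="boolean")

# Simulated manual labels
toy.loc[0, "is_text_ad"] = True
toy.loc[1, "is_text_ad"] = False
toy.loc[1, "is_image_ad"] = False
toy.loc[2, "is_image_ad"] = True

# True | NA gives True, False | False gives False, NA | NA stays NA
toy["is_review_ad"] = toy["is_text_ad"] | toy["is_image_ad"]

print(toy["is_review_ad"].tolist())
```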

Next, we add 2 columns for quality checks on each review.
1. `helpfulness`
    - `not_helpful` (review was relevant but did not add information to the reader)
    - `helpful` (review provided some helpful information to make decisions about the visit)
    - `very_helpful` (review gave crucial or new information that can significantly impact visit decisions)

2. `sensibility` TRUE if the numerical star rating corresponds to the sentiments present in text review

In [None]:
categories = ["not_helpful", "helpful", "very_helpful"]
vt_merged["helpfulness"] = pd.Categorical(
    values = [None] * len(vt_merged), # unlabelled for now; stored as NaN
    categories = categories,
    ordered = True
)

vt_merged['sensibility'] = False # placeholder until manually labelled
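
Values outside `categories` are stored as `NaN` by `pd.Categorical`, so the `helpfulness` column starts out unlabelled. A small sketch of how labels might then be assigned and compared, using made-up rows:

```python
import pandas as pd

categories = ["not_helpful", "helpful", "very_helpful"]
toy = pd.DataFrame({"review_id": [0, 1, 2]})
toy["helpfulness"] = pd.Categorical(
    [None] * len(toy), categories=categories, ordered=True
)

# Labels must be one of the declared categories; anything else raises an error
toy.loc[0, "helpfulness"] = "very_helpful"
toy.loc[1, "helpfulness"] = "not_helpful"

# ordered=True enables comparisons; unlabelled (NaN) rows compare as False
at_least_helpful = toy["helpfulness"] >= "helpful"
print(at_least_helpful.tolist())
```

Keeping the column ordered means thresholds such as "helpful or better" can be expressed directly instead of listing the matching labels.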

The columns of the dataframe are as such.

In [None]:
print(vt_merged.columns)

### Filtering for Reviews Containing Text and/or Pictures

We assume that a numerical star rating alone does not provide sufficient justification for deciding whether a review violates any of the three policies. Hence, in this project we focus only on reviews containing text comments and/or pictures. This filters out around 45% of all reviews.

In [None]:
# Reviews with neither a text comment nor any pictures (star rating only)
vt_rating_only = vt_merged[
    vt_merged["text"].isna() & (vt_merged["pics_collapsed"].str.len() == 0) # empty list means no pictures
]

vt_merged = vt_merged.drop(vt_rating_only.index) # contains text and/or pic

print("Rating-only shape:", vt_rating_only.shape)
print("With text and/or pictures shape:", vt_merged.shape)
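
While `pics_collapsed` holds Python lists, comparing it with the string `"[]"` matches nothing. That comparison only works after a round trip through CSV, where lists are read back as strings. A sketch on made-up rows that tests the list length instead and reports the share of rating-only reviews:

```python
import pandas as pd

toy = pd.DataFrame({
    "text": ["Nice trails", None, None, "Friendly staff"],
    "pics_collapsed": [[], [], ["https://example.com/a.jpg"], []],
})

no_text = toy["text"].isna()
no_pics = toy["pics_collapsed"].str.len() == 0  # .str.len() also works on lists

rating_only = toy[no_text & no_pics]
with_content = toy.drop(rating_only.index)

# Comparing lists with the string "[]" never matches
string_matches = int((toy["pics_collapsed"] == "[]").sum())

share_removed = len(rating_only) / len(toy)
print(len(rating_only), len(with_content), share_removed, string_matches)
```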

We save the final dataframe as `vt_merged.csv`.