Merge the reviews and metadata dataframes on the common key 'gmap_id', save the result as vt_merged, and drop the 'gmap_id' column. Then insert a new first column called 'review_id' so that each review has a unique id.

In [1]:
import pandas as pd

In [2]:
vt_metadata = pd.read_csv('2_vt_metadata_scraped.csv')
vt = pd.read_csv('3_vt_review.csv')

In [3]:
# Inner join: keep only reviews whose gmap_id also appears in the metadata.
vt_merged = pd.merge(vt, vt_metadata, on='gmap_id', how='inner')
vt_merged = vt_merged.drop(columns='gmap_id')
# Insert a sequential id as the first column.
vt_merged.insert(0, 'review_id', range(len(vt_merged)))
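
The merge step can be sketched on toy data as follows. The frames `reviews` and `metadata` and their values are made up for illustration, and the `validate="many_to_one"` check is an optional safeguard that the notebook itself does not use: it raises an error if a 'gmap_id' is duplicated in the metadata, which would otherwise silently duplicate reviews.

```python
import pandas as pd

# Toy stand-ins for the review and metadata tables (values are invented).
reviews = pd.DataFrame({
    "gmap_id": ["a", "a", "b"],
    "rating": [5, 3, 4],
})
metadata = pd.DataFrame({
    "gmap_id": ["a", "b"],
    "name": ["Cafe A", "Diner B"],
})

# Many reviews map to one business; validate raises MergeError otherwise.
merged = pd.merge(
    reviews, metadata, on="gmap_id", how="inner", validate="many_to_one"
)
merged = merged.drop(columns="gmap_id")

# Put a sequential, unique id in the first column.
merged.insert(0, "review_id", list(range(len(merged))))

print(merged.columns.tolist())
```

Because the inner join drops reviews without matching metadata, comparing `len(reviews)` with `len(merged)` is a quick way to see how many rows were lost.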

Add new columns for policy violations (all flags start as False until a review is labelled):

In [4]:
# Per-modality flags. Assigning None and casting to bool yields False,
# so the columns are initialised to False directly.
flag_columns = [
    'is_image_ad', 'is_image_irrelevant',
    'is_text_ad', 'is_text_irrelevant', 'is_text_rant',
]
for col in flag_columns:
    vt_merged[col] = False

# Review-level flags: a review violates a policy if its text or its image does.
vt_merged['is_review_ad'] = vt_merged['is_text_ad'] | vt_merged['is_image_ad']
vt_merged['is_review_irrelevant'] = vt_merged['is_image_irrelevant'] | vt_merged['is_text_irrelevant']
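
One consequence of bool flags is that "not yet labelled" and "labelled as no violation" both read as False. If that distinction matters, a possible alternative (an assumption, not what this notebook does) is pandas' nullable "boolean" dtype, where unlabelled rows stay missing. The sketch below uses a toy frame and also shows that the combined flag has to be computed after labelling, since it is a snapshot and does not update on its own.

```python
import pandas as pd

df = pd.DataFrame({"review_id": [0, 1, 2]})

# Nullable boolean columns: <NA> means "not labelled yet".
for col in ["is_text_ad", "is_image_ad"]:
    df[col] = pd.array([pd.NA] * len(df), dtype="boolean")

# Hypothetical labelling: review 0 has an ad in its text, review 1 is clean.
df.loc[0, "is_text_ad"] = True
df.loc[1, "is_text_ad"] = False
df.loc[1, "is_image_ad"] = False

# Combine after labelling. With nullable booleans, True | NA is True,
# False | False is False, and NA | NA stays NA.
df["is_review_ad"] = df["is_text_ad"] | df["is_image_ad"]

print(df)
```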

Add new columns for quality checks:
- helpfulness: is the review helpful?
  - not helpful: the review is relevant but adds no information for the reader
  - helpful: provides some information that helps the reader decide about a visit
  - very helpful: gives crucial or new information that can significantly affect visit decisions
- sensibility: is the review sensible?

In [5]:
categories = ["not helpful", "helpful", "very helpful"]
# Every review starts unlabelled (NaN); only the three levels above are valid values.
vt_merged["helpfulness"] = pd.Categorical(
    values=[None] * len(vt_merged),
    categories=categories,
    ordered=True
)

# Defaults to False until a review is labelled.
vt_merged['sensibility'] = False
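
Declaring the helpfulness levels as an ordered categorical allows threshold comparisons once labels exist. A minimal sketch on a standalone Series, with made-up labels:

```python
import pandas as pd

categories = ["not helpful", "helpful", "very helpful"]

# Toy labels; None stands for a review that has not been labelled yet.
helpfulness = pd.Series(
    pd.Categorical(
        [None, "helpful", "very helpful", "not helpful"],
        categories=categories,
        ordered=True,
    )
)

# Ordered categories support comparisons against a level name.
# Unlabelled (NaN) entries compare as False.
at_least_helpful = helpfulness >= "helpful"

print(at_least_helpful.tolist())
```

Assigning a value outside the three declared levels to such a column raises an error, which helps catch typos during labelling.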

In [6]:
print(vt_merged.columns)
# print(vt_merged.head())

Index(['review_id', 'Unnamed: 0', 'user_id', 'time', 'rating', 'text',
       'pics_collapsed', 'resp_collapsed', 'unnamed: 0', 'name', 'description',
       'category', 'url', 'image', 'is_image_ad', 'is_image_irrelevant',
       'is_text_ad', 'is_text_irrelevant', 'is_text_rant', 'is_review_ad',
       'is_review_irrelevant', 'helpfulness', 'sensibility'],
      dtype='object')


In [7]:
vt_merged = vt_merged.drop(columns=["Unnamed: 0", "unnamed: 0"])
print(vt_merged.columns)
# print(vt_merged.head())

Index(['review_id', 'user_id', 'time', 'rating', 'text', 'pics_collapsed',
       'resp_collapsed', 'name', 'description', 'category', 'url', 'image',
       'is_image_ad', 'is_image_irrelevant', 'is_text_ad',
       'is_text_irrelevant', 'is_text_rant', 'is_review_ad',
       'is_review_irrelevant', 'helpfulness', 'sensibility'],
      dtype='object')


Split the dataframe into two:
- rating-only reviews (i.e. no text and no pictures)
- everything else (reviews with text or at least one picture)

In [8]:
vt_rating_only = vt_merged[
    vt_merged["text"].isna() & (vt_merged["pics_collapsed"] == "[]")
]

vt_with_image_or_review = vt_merged.drop(vt_rating_only.index)

print("Rating-only shape:", vt_rating_only.shape)
print("With image or review shape:", vt_with_image_or_review.shape)


Rating-only shape: (145565, 21)
With image or review shape: (175918, 21)
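
The same split can be sketched with a single boolean mask and its negation, which does not depend on the index labels being unique. The toy frame below is made up, and it assumes, as in the notebook, that an empty picture list is stored as the string "[]".

```python
import pandas as pd

df = pd.DataFrame({
    "rating": [5, 4, 1, 3],
    "text": [None, "Nice place", None, "Slow service"],
    "pics_collapsed": ["[]", "[]", "['url']", "['url']"],
})

# A review is rating-only if it has neither text nor pictures.
rating_only_mask = df["text"].isna() & (df["pics_collapsed"] == "[]")

rating_only = df[rating_only_mask]
with_content = df[~rating_only_mask]

# The two parts should partition the original frame.
print(len(rating_only), len(with_content), len(df))
```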


We assume that a rating alone does not provide enough evidence to judge whether a review violates any of the three policies (advertisement, irrelevance, rant). Hence we focus only on the dataframe of reviews that contain text or pictures.

In [9]:
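# index=False avoids writing the row index, which would come back as an
# extra 'Unnamed: 0' column the next time the file is read.
vt_with_image_or_review.to_csv('vt_merged.csv', index=False)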
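
The 'Unnamed: 0' columns dropped earlier are typical of CSV files that were written with their row index. A small round trip on an in-memory buffer, with toy data, illustrates the effect of the `index` argument:

```python
import io

import pandas as pd

df = pd.DataFrame({"rating": [5, 4]})

# Default: the index is written as an unnamed first column.
buffer_with_index = io.StringIO()
df.to_csv(buffer_with_index)
buffer_with_index.seek(0)
cols_with_index = pd.read_csv(buffer_with_index).columns.tolist()

# index=False: only the data columns are written.
buffer_without_index = io.StringIO()
df.to_csv(buffer_without_index, index=False)
buffer_without_index.seek(0)
cols_without_index = pd.read_csv(buffer_without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```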