## 1. Load Data

Load all reviews, keeping only the business ID, star rating, and text columns

In [1]:
import pandas as pd

reviews = pd.read_csv("~/Downloads/yelp-2019-dataset/review-005.csv", usecols=['business_id', 'text', 'stars'])
reviews

Unnamed: 0,business_id,stars,text
0,ujmEBvifdJM6h6RLv4wQIg,1.0,Total bill for this horrible service? Over $8G...
1,NZnhc2sEQy3RmzKTZnqtwQ,5.0,I *adore* Travis at the Hard Rock's new Kelly ...
2,WTqjgwHlXbSFevF32_DJVw,5.0,I have to say that this office really has it t...
3,ikCg8xy5JIg_NGPx-MSIDA,5.0,Went in for a lunch. Steak sandwich was delici...
4,b1b1eb3uo-w561D0ZfCEiQ,1.0,Today was my second out of three sessions I ha...
...,...,...,...
6685897,RXBFk3tVBxiTf3uOt9KExQ,5.0,I have been coming here for years and this pla...
6685898,yA6dKNm_zl1ucZCnwW8ZCg,1.0,I think this owner and the owner of Amy's Baki...
6685899,a192hdM0_UVCYLwPJv1Qwg,5.0,"Off the grid Mexican in Vegas. Very tasty, qua..."
6685900,kOo4ZY2UQAX4j312mzQ8mA,5.0,We hired Taco Naco to cater our family party a...
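
The output above lists the columns as `business_id`, `stars`, `text`, although `usecols` names them in a different order. The sketch below, which uses a made-up in-memory CSV rather than the Yelp file, illustrates that `usecols` only selects columns and that they come back in file order.

```python
import io
import pandas as pd

# Toy CSV (hypothetical rows) with more columns than the analysis needs.
csv_text = (
    "review_id,business_id,stars,useful,text\n"
    "rv1,b1,5.0,0,Great tacos\n"
    "rv2,b2,1.0,3,Cold fries\n"
)

# usecols limits parsing to the named columns; the others are skipped.
toy = pd.read_csv(io.StringIO(csv_text), usecols=["business_id", "text", "stars"])

# Columns are returned in file order, not in the order given to usecols.
print(list(toy.columns))  # ['business_id', 'stars', 'text']
```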


Load all businesses, keeping only the business ID and categories columns

In [2]:
businesses = pd.read_csv('~/Downloads/yelp-2019-dataset/business.csv', usecols=['business_id', 'categories'])
businesses

Unnamed: 0,business_id,categories
0,1SWheh84yJXfytovILXOAQ,"Golf, Active Life"
1,QXAEGFB4oINsVuTFxEYKFQ,"Specialty Food, Restaurants, Dim Sum, Imported..."
2,gnKjwL_1w79qoiV3IC_xQQ,"Sushi Bars, Restaurants, Japanese"
3,xvX2CttrVhyG2z1dFg_0xw,"Insurance, Financial Services"
4,HhyxOkGAM07SRYtlQ4wMFQ,"Plumbing, Shopping, Local Services, Home Servi..."
...,...,...
192604,nqb4kWcOwp8bFxzfvaDpZQ,"Water Purification Services, Water Heater Inst..."
192605,vY2nLU5K20Pee-FdG0br1g,"Books, Mags, Music & Video, Shopping"
192606,MiEyUDKTjeci5TMfxVZPpg,"Home Services, Contractors, Landscaping, Mason..."
192607,zNMupayB2jEHVDOji8sxoQ,"Beauty & Spas, Barbers"


## 2. Filter / Clean Data

Keep only businesses whose categories include "Restaurants"

In [3]:
# Drop businesses with no categories; copy so the assignment below
# modifies an independent DataFrame rather than a slice of the original.
businesses = businesses[businesses['categories'].notnull()].copy()
# Split the comma-separated string into a list of category labels.
businesses["categories"] = businesses["categories"].str.split(", ")
# Keep only businesses that carry the exact label 'Restaurants'.
businesses = businesses[businesses['categories'].map(lambda categories: 'Restaurants' in categories)]

businesses


Unnamed: 0,business_id,categories
1,QXAEGFB4oINsVuTFxEYKFQ,"[Specialty Food, Restaurants, Dim Sum, Importe..."
2,gnKjwL_1w79qoiV3IC_xQQ,"[Sushi Bars, Restaurants, Japanese]"
11,1Dfx3zM-rW4n-31KeC8sJg,"[Restaurants, Breakfast & Brunch, Mexican, Tac..."
13,fweCYi8FmbJXHCqLnwuk8w,"[Italian, Restaurants, Pizza, Chicken Wings]"
17,PZ-LZzSlhSe9utkQYU8pFg,"[Restaurants, Italian]"
...,...,...
192587,oS0CnUbyv0GUoD3L8_3UPQ,"[Restaurants, Thai]"
192589,ghovD5ZTGDQ5Q2U4ERddWw,"[Burgers, Restaurants, Fast Food, American (New)]"
192595,h3QErqS3OZgLJ5Tb6-sLyQ,"[Restaurants, Soup, Chinese, Caribbean]"
192596,KnafX7T6qSAmSrLhd709vA,"[Vietnamese, Soup, Restaurants]"
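
Splitting the category string and testing list membership matches the exact label, whereas a plain substring test would also match any label that merely contains the word. The sketch below shows the difference on toy data; the IDs and the label "Restaurants Supplies" are made up for illustration and are not taken from the Yelp dataset.

```python
import pandas as pd

# Toy data (hypothetical, not from the Yelp dataset).
toy = pd.DataFrame({
    "business_id": ["a", "b", "c", "d"],
    "categories": [
        "Sushi Bars, Restaurants, Japanese",
        "Restaurants Supplies, Shopping",  # made-up label containing the word
        "Golf, Active Life",
        None,                               # missing categories
    ],
})

# Drop missing categories and take an explicit copy before assigning to it.
toy = toy[toy["categories"].notnull()].copy()

# Substring match: also catches the made-up "Restaurants Supplies" label.
substring_ids = toy.loc[toy["categories"].str.contains("Restaurants"), "business_id"].tolist()

# Exact match on the split list: only rows with the label "Restaurants".
toy["categories"] = toy["categories"].str.split(", ")
exact_ids = toy.loc[toy["categories"].map(lambda cats: "Restaurants" in cats), "business_id"].tolist()

print(substring_ids)  # ['a', 'b']
print(exact_ids)      # ['a']
```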


Keep only the reviews written for restaurant businesses

In [4]:
reviews = reviews.merge(businesses, on='business_id', how='inner')
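
An inner merge on `business_id` acts as a filter: a review survives only if its business appears in the restaurant table, and it picks up that business's `categories`. A minimal sketch on toy frames (hypothetical IDs and text) follows; the `validate` argument is an optional extra check that is not used in the cell above.

```python
import pandas as pd

# Toy frames (hypothetical IDs and text).
toy_reviews = pd.DataFrame({
    "business_id": ["r1", "r1", "x9", "r2"],
    "stars": [5.0, 1.0, 3.0, 4.0],
    "text": ["great", "bad", "ok haircut", "good"],
})
toy_restaurants = pd.DataFrame({
    "business_id": ["r1", "r2", "r3"],
    "categories": [["Restaurants", "Thai"], ["Restaurants"], ["Restaurants", "Pizza"]],
})

# Inner join: keep only reviews whose business_id appears in toy_restaurants.
# validate="many_to_one" raises if a business_id is duplicated on the right,
# which would otherwise silently duplicate reviews.
merged = toy_reviews.merge(
    toy_restaurants, on="business_id", how="inner", validate="many_to_one"
)

print(len(merged))                              # 3
print(sorted(merged["business_id"].unique()))   # ['r1', 'r2']
```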

Shuffle the filtered reviews

In [5]:
reviews = reviews.sample(frac=1, random_state=42).reset_index(drop=True)
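
`sample(frac=1)` returns every row in random order, `random_state` makes that order reproducible, and `reset_index(drop=True)` renumbers the rows from 0. The toy example below is a sketch of that behavior on five made-up rows.

```python
import pandas as pd

toy = pd.DataFrame({"stars": [1.0, 2.0, 3.0, 4.0, 5.0]})

# frac=1 returns all rows in random order; random_state fixes the permutation.
shuffled_a = toy.sample(frac=1, random_state=42).reset_index(drop=True)
shuffled_b = toy.sample(frac=1, random_state=42).reset_index(drop=True)

# Same seed, same order.
print(shuffled_a.equals(shuffled_b))                          # True
# The rows themselves are unchanged, only their order differs.
print(sorted(shuffled_a["stars"]) == sorted(toy["stars"]))    # True
# reset_index(drop=True) replaces the shuffled labels with 0..n-1.
print(list(shuffled_a.index))                                 # [0, 1, 2, 3, 4]
```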

Drop reviews with missing values, such as reviews without text

In [6]:
reviews = reviews.dropna()
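
Called without arguments, `dropna()` removes a row when any column is missing, not only `text`. If the intent is to drop strictly on missing text, `subset` narrows the check. The sketch below contrasts the two on made-up rows.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical): one complete, one without stars, one without text.
toy = pd.DataFrame({
    "business_id": ["r1", "r2", "r3"],
    "stars": [5.0, np.nan, 3.0],
    "text": ["great", "fine", None],
})

# No arguments: a row is removed if ANY column is missing.
any_missing_dropped = toy.dropna()
# subset=["text"]: a row is removed only if its text is missing.
text_missing_dropped = toy.dropna(subset=["text"])

print(len(any_missing_dropped))   # 1
print(len(text_missing_dropped))  # 2
```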

Reviews after filtering

In [7]:
reviews

Unnamed: 0,business_id,stars,text,categories
0,3uC7Lbc3RgUDTWQlBu4PqQ,5.0,Three words: Damn good pastries.\n\nA few mor...,"[Desserts, Food, French, Sandwiches, Bakeries,..."
1,c-NXKTJ0jrrusTPxJAUwvA,1.0,Easily one of the worst Red Robin locations. T...,"[American (Traditional), Restaurants, Burgers]"
2,j3csEfGzkwnXATdRoZDT-A,2.0,Maybe I am just spoiled with good Mexican food...,"[Mexican, Restaurants]"
3,Q0EZmATxDphzRMszNV2LVg,5.0,This Wildflower is always kept clean and the e...,"[Food, American (New), Restaurants, Breakfast ..."
4,25c15dEPrBrWr4tR1r6sTg,5.0,Favorite bibimbap in the valley! They also hav...,"[Korean, Japanese, Restaurants]"
...,...,...,...,...
4201679,GFoJkebYoK2sigk3H8lUlg,5.0,"Since the bar has changed ownership, there has...","[Brewpubs, Bars, Barbeque, Food, Pubs, Breweri..."
4201680,Bf2fuqWbHd3L-X69FSMvmg,3.0,Not great. Fillings are pretty bland and is no...,"[Restaurants, Mexican]"
4201681,MQXFdfDb1UjrZngnjslmhA,1.0,"I honestly can't review the food here, as we d...","[Steakhouses, Restaurants]"
4201682,Hqs4YNST_ZHbshwyi4bnsQ,4.0,"Amazing pizza, great quality! From the same fa...","[Restaurants, Pizza, American (Traditional), I..."


Statistics of review star ratings

In [8]:
reviews.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
stars,4201683.0,3.717257,1.374043,1.0,3.0,4.0,5.0,5.0


Distribution of review star ratings as proportions

In [9]:
reviews.stars.value_counts(normalize=True).sort_index()

1.0    0.118640
2.0    0.093374
3.0    0.133374
4.0    0.261310
5.0    0.393301
Name: stars, dtype: float64

Distribution of review star ratings as counts

In [10]:
reviews.stars.value_counts().sort_index()

1.0     498489
2.0     392330
3.0     560397
4.0    1097941
5.0    1652526
Name: stars, dtype: int64
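
The counts show a skew toward high ratings. One simple way to quantify it, sketched below with the counts copied from the output above, is the ratio between the most and the least frequent rating.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series(
    [498489, 392330, 560397, 1097941, 1652526],
    index=[1.0, 2.0, 3.0, 4.0, 5.0],
    name="stars",
)

# Ratio of the most frequent rating to the least frequent one.
imbalance_ratio = counts.max() / counts.min()

print(counts.idxmax(), counts.idxmin())  # 5.0 2.0
print(round(imbalance_ratio, 2))         # 4.21
```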

## 3. Data Sampling

Take the first 500,000 rows of the shuffled reviews as a random sample

In [11]:
# Positional slice: the first 500,000 rows of the shuffled frame.
reviews = reviews.iloc[:500000]

reviews

Unnamed: 0,business_id,stars,text,categories
0,3uC7Lbc3RgUDTWQlBu4PqQ,5.0,Three words: Damn good pastries.\n\nA few mor...,"[Desserts, Food, French, Sandwiches, Bakeries,..."
1,c-NXKTJ0jrrusTPxJAUwvA,1.0,Easily one of the worst Red Robin locations. T...,"[American (Traditional), Restaurants, Burgers]"
2,j3csEfGzkwnXATdRoZDT-A,2.0,Maybe I am just spoiled with good Mexican food...,"[Mexican, Restaurants]"
3,Q0EZmATxDphzRMszNV2LVg,5.0,This Wildflower is always kept clean and the e...,"[Food, American (New), Restaurants, Breakfast ..."
4,25c15dEPrBrWr4tR1r6sTg,5.0,Favorite bibimbap in the valley! They also hav...,"[Korean, Japanese, Restaurants]"
...,...,...,...,...
499996,oov-v0b15bJnqI2qRLrDSg,5.0,"New Nak Won is amazing!\n\nFirst off, super aw...","[Korean, Restaurants]"
499997,wR71mnXAJMVmgMBrHevndQ,3.0,I came here for lunch last Sunday. We ordered...,"[Asian Fusion, Chinese, Restaurants]"
499998,5MNYCmCtpBboglFmrjU6yw,4.0,We just tried Rkidds for the first time tonigh...,"[Pizza, Salad, Restaurants, Italian, Chicken W..."
499999,TOi6KY8b7e1Au-TqC-I7lw,5.0,"Yesterday I was served Kobe hot dogs, chipotle...","[Specialty Food, Food, Restaurants, Meat Shops..."


Statistics of review star ratings (sampled)

In [12]:
reviews.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
stars,500000.0,3.7205,1.372392,1.0,3.0,4.0,5.0,5.0


Distribution of review star ratings as proportions (sampled)

In [13]:
reviews.stars.value_counts(normalize=True).sort_index()

1.0    0.118050
2.0    0.092874
3.0    0.133616
4.0    0.261446
5.0    0.394014
Name: stars, dtype: float64

Distribution of review star ratings as counts (sampled)

In [14]:
reviews.stars.value_counts().sort_index()

1.0     59025
2.0     46437
3.0     66808
4.0    130723
5.0    197007
Name: stars, dtype: int64
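
To check that the 500,000-row sample keeps the rating distribution of the full filtered set, the two sets of counts can be compared directly. The sketch below uses the counts printed in cells 10 and 14 and reports the largest per-rating gap in proportion.

```python
import pandas as pd

stars = [1.0, 2.0, 3.0, 4.0, 5.0]
# Counts copied from the outputs of cells 10 (full) and 14 (sampled).
full_counts = pd.Series([498489, 392330, 560397, 1097941, 1652526], index=stars)
sample_counts = pd.Series([59025, 46437, 66808, 130723, 197007], index=stars)

# Convert both to proportions so they are comparable.
full_prop = full_counts / full_counts.sum()
sample_prop = sample_counts / sample_counts.sum()

# Largest absolute difference in proportion across the five ratings.
max_gap = (full_prop - sample_prop).abs().max()
print(f"largest gap in proportion: {max_gap:.6f}")
```

With these counts the largest gap stays below 0.001, so the sample tracks the full distribution closely.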

## 4. Save Data

In [15]:
reviews.to_csv('./data/reviews_500k_imba.csv', index=False)
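
Because `categories` holds Python lists, `to_csv` writes each one as its string representation, and reading the file back yields strings rather than lists. The sketch below shows one way to restore them with `ast.literal_eval`; it uses a made-up row and an in-memory buffer in place of `./data/reviews_500k_imba.csv`.

```python
import ast
import io
import pandas as pd

# Toy row (hypothetical) with a list-valued categories column.
toy = pd.DataFrame({
    "stars": [5.0],
    "text": ["Favorite bibimbap"],
    "categories": [["Korean", "Japanese", "Restaurants"]],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# After the round trip, the list has become a plain string.
restored = pd.read_csv(buffer)
print(type(restored.loc[0, "categories"]).__name__)  # str

# Parse the string representation back into a Python list.
restored["categories"] = restored["categories"].map(ast.literal_eval)
print(restored.loc[0, "categories"])  # ['Korean', 'Japanese', 'Restaurants']
```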