# Project 5: Group Project
#### Authors: Adam Pardo, Brandon Bergeron, Eric Bayless, Ramesh Babu

### 01 - Data Collection and Data Cleaning

##### Instructions: 

Before running this notebook, download the data files from [Kaggle](https://www.kaggle.com/yelp-dataset/yelp-dataset) and place them in the data folder. Our original sample of 400 restaurants is included in the data folder and can be used with the [Reviews](#Reviews) portion of this notebook to pull the corresponding reviews for those businesses. To create your own sample, uncomment the code in the cell below the `city_restaurants()` function. To pull reviews for all restaurants that have between 100 and 300 reviews, run the cells in this notebook from top to bottom as is.
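
As a quick pre-flight check, the sketch below verifies that the downloaded files are where the later cells expect them. The file names are taken from the paths used in this notebook; the helper name `missing_data_files` is hypothetical and not part of the original workflow.

```python
import os

# Paths assumed from the cells in this notebook; adjust if your data folder differs.
DATA_DIR = os.path.join('..', 'data')
REQUIRED_FILES = [
    'yelp_academic_dataset_business.json',
    'yelp_academic_dataset_review.json',
]

def missing_data_files(data_dir=DATA_DIR, filenames=REQUIRED_FILES):
    """Return the expected file names that are not present in data_dir."""
    return [name for name in filenames
            if not os.path.isfile(os.path.join(data_dir, name))]

missing = missing_data_files()
if missing:
    print(f"Missing from {DATA_DIR}: {', '.join(missing)}")
else:
    print("All Yelp data files found.")
```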

In [1]:
# import libraries here 
import pandas as pd
import os

## Businesses

In [2]:
# import Business.json
business_json_path = '../data/yelp_academic_dataset_business.json'
df_business = pd.read_json(business_json_path, lines=True)

In [3]:
# function to pull businesses by city, with option to sample

def city_restaurants(df, city_name, samples=None):
    """
    Takes in the dataframe of Yelp businesses and returns only the restaurants from a given city.
    
    ARGS:
    
        df: dataframe of businesses
        city_name (string): Name of city to pull restaurants from
        samples (int): number of restaurants to sample from df (default=None)
        
    """
    #--load businesses in city_name that list at least one category
    df_city = df[df['city'] == city_name]
    df_city = df_city.dropna(subset=['categories'])
    
    #--keep restaurants only
    restaurants = df_city[df_city['categories'].str.contains('Restaurant')]
    #--keep restaurants with more than 100 and fewer than 300 reviews
    restaurants_reduced = restaurants[(restaurants['review_count'] > 100) & (restaurants['review_count'] < 300)]
    restaurants_reduced = restaurants_reduced.reset_index(drop=True)
    
    #--saves and returns all matching restaurants if no sample size is given
    if samples is None:
        restaurants_reduced.to_csv(f'../data/{city_name.replace(" ", "_")}_restaurants.csv', index=False)
        return restaurants_reduced
    
    else:
        #--balanced sample: half open restaurants, half closed
        open_sample = restaurants_reduced[restaurants_reduced['is_open'] == 1].sample(samples // 2)
        closed_sample = restaurants_reduced[restaurants_reduced['is_open'] == 0].sample(samples // 2)
        restaurant_samples = pd.concat([open_sample, closed_sample])
        restaurant_samples = restaurant_samples.reset_index(drop=True)
    
        #--write the sample to the data folder
        restaurant_samples.to_csv(f'../data/{city_name.replace(" ", "_")}_{samples}.csv', index=False)
        return restaurant_samples

In [4]:
#--Creation of 400 samples 
#city_restaurants(df_business, 'Las Vegas', 400)
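
The sampling branch of `city_restaurants()` draws `samples // 2` open and `samples // 2` closed restaurants, and `DataFrame.sample` raises a `ValueError` if a group has fewer rows than requested. The sketch below shows the same balanced-sampling idea on a toy frame with an explicit size check. The helper `balanced_sample` and its `random_state` argument are illustrative additions, not part of the original function.

```python
import pandas as pd

def balanced_sample(df, n, label_col='is_open', random_state=None):
    """Sample n // 2 rows from each of the two label values (1 and 0)."""
    half = n // 2
    parts = []
    for label in (1, 0):
        group = df[df[label_col] == label]
        # Fail early with a readable message instead of pandas' generic error.
        if len(group) < half:
            raise ValueError(
                f"Only {len(group)} rows with {label_col}={label}; need {half}.")
        parts.append(group.sample(half, random_state=random_state))
    return pd.concat(parts).reset_index(drop=True)

# Toy data: five open and three closed businesses (made up for illustration).
toy = pd.DataFrame({
    'business_id': list('abcdefgh'),
    'is_open': [1, 1, 1, 1, 1, 0, 0, 0],
})
sample = balanced_sample(toy, 4, random_state=42)
print(sample['is_open'].value_counts().to_dict())
```

Passing a fixed `random_state` makes the sample reproducible, which may be useful if the sampled CSV needs to be regenerated.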

In [5]:
restaurants = city_restaurants(df_business, 'Las Vegas')
restaurants.head()

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
0,Yr_w9lakJrKMyEG_hI6zbA,Fat Moe's Pizza & Wings,"6125 W Tropicana Ave, Ste F",Las Vegas,NV,89103,36.099361,-115.226636,4.0,141,1,"{'RestaurantsAttire': 'u'casual'', 'Restaurant...","Pizza, Salad, Burgers, Restaurants","{'Monday': '11:0-22:0', 'Tuesday': '11:0-22:0'..."
1,AN0bWhisCf6LN9eHZ7DQ3w,Los Olivos Ristorante,3759 E Desert Inn Rd,Las Vegas,NV,89121,36.129178,-115.092483,5.0,222,1,"{'WiFi': 'u'free'', 'RestaurantsPriceRange2': ...","Restaurants, Italian","{'Monday': '0:0-0:0', 'Tuesday': '16:0-21:0', ..."
2,oUX2bYbqjqST-urKbOHG6w,Loftti Cafe,"7729 S Rainbow Blvd, Ste 9B",Las Vegas,NV,89139,36.047942,-115.244167,4.5,284,1,"{'OutdoorSeating': 'True', 'BusinessParking': ...","Sandwiches, Shaved Ice, Coffee & Tea, Desserts...","{'Monday': '0:0-0:0', 'Tuesday': '8:0-3:0', 'W..."
3,FiW6w5nmhlUoJAyNofb4jg,Fruits and Roots,5020 Blue Diamond Rd,Las Vegas,NV,89139,36.032122,-115.210267,4.5,106,1,"{'BusinessAcceptsCreditCards': 'True', 'DogsAl...","Coffee & Tea, Food Stands, Food, Restaurants, ...",{'Monday': '0:0-0:0'}
4,T0NKethAB-FFR05EeZCzuA,Burger King,6780 N Durango Dr,Las Vegas,NV,89149,36.284274,-115.287331,1.5,127,1,"{'BusinessAcceptsCreditCards': 'True', 'Restau...","Burgers, Fast Food, Restaurants","{'Monday': '6:0-22:0', 'Tuesday': '6:0-22:0', ..."


## Reviews

In [6]:
# creates JsonReader object for iteration over total Reviews

review_json_path = '../data/yelp_academic_dataset_review.json'
size = 500_000

review_reader = pd.read_json(review_json_path, lines=True,
                      dtype={'review_id':str,'user_id':str,
                             'business_id':str,'stars':int,
                             'date':str,'text':str,'useful':int,
                             'funny':int,'cool':int},
                      chunksize=size)
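
With `lines=True` and a `chunksize`, `pd.read_json` returns an iterator of DataFrames rather than a single DataFrame, so the full review file never has to fit in memory. The reader is consumed as it is iterated, so the cell above needs to be re-run before looping over the reviews a second time. A minimal sketch on three made-up reviews held in memory:

```python
import io
import pandas as pd

# Three toy reviews in JSON Lines format (one JSON object per line).
json_lines = io.StringIO(
    '{"business_id": "a", "stars": 5, "text": "Great"}\n'
    '{"business_id": "b", "stars": 2, "text": "Meh"}\n'
    '{"business_id": "a", "stars": 4, "text": "Good"}\n'
)

# chunksize=2 yields DataFrames of at most two rows each.
reader = pd.read_json(json_lines, lines=True, chunksize=2)

chunk_sizes = [len(chunk) for chunk in reader]
print(chunk_sizes)
```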

In [7]:
#---Matches restaurant reviews with df of businesses

def business_reviews(reviews, business_filepath, output_filepath):
    """
    Iterates over the total reviews and extracts reviews that match the businesses given
    
    ARGS:
    
        reviews: JsonReader for iterating over total reviews
        business_filepath (string.csv): .csv filepath with desired businesses
        output_filepath (string.csv): .csv filename for reviews to be saved in  
    
    """
    
    #--reads in .csv file of desired businesses
    df_business = pd.read_csv(business_filepath)
    
    #--stores matches on business_id
    chunk_list = []
    
    #--iterates over JsonReader chunks and matches reviews on business_id's
    for chunk_review in reviews:
        chunk_review = chunk_review.drop(['review_id','useful','funny','cool'], axis=1)
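        #--renames review 'stars' so it does not clash with the business 'stars' column
        chunk_review = chunk_review.rename(columns={'stars': 'review_stars'})
        #--left merge keeps every business row; businesses with no review in this chunk get NaN review fields (dropped below)
        chunk_merged = pd.merge(df_business, chunk_review, on='business_id', how='left')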
        print(f"{chunk_merged.shape[0]} out of {size:,} related reviews")
        chunk_list.append(chunk_merged)
    
    #--combining all saved reviews, dropping missing reviews, and resetting index for output
    df_reviews = pd.concat(chunk_list, ignore_index=True, join='outer', axis=0)
    df_reviews.dropna(subset=['text'], inplace=True)
    df_reviews.reset_index(drop=True, inplace=True)
    
    #--write df to output_filepath, and return df of reviews for inspection
    df_reviews.to_csv(f'{output_filepath}', index = False)
    return df_reviews

# https://towardsdatascience.com/converting-yelp-dataset-to-csv-using-pandas-2a4c8f03bd88
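
Because `business_reviews()` uses a left merge, each chunk keeps a placeholder row for every business that has no review in that chunk, and those rows are included in the printed per-chunk counts until `dropna` removes them at the end. The toy example below (made-up data) illustrates this, and shows that a left merge followed by `dropna(subset=['text'])` leaves the same reviews as an inner merge would here.

```python
import pandas as pd

businesses = pd.DataFrame({'business_id': ['a', 'b', 'c'],
                           'name': ['Alpha', 'Beta', 'Gamma']})
chunk = pd.DataFrame({'business_id': ['a', 'a', 'x'],
                      'text': ['Great', 'Good', 'Unrelated']})

left = pd.merge(businesses, chunk, on='business_id', how='left')
inner = pd.merge(businesses, chunk, on='business_id', how='inner')

# Left merge: two matches for 'a' plus NaN placeholder rows for 'b' and 'c'.
print(len(left))
# Inner merge: only the two matching reviews.
print(len(inner))

# Dropping the placeholder rows recovers the same reviews as the inner merge.
left_clean = left.dropna(subset=['text']).reset_index(drop=True)
print(left_clean['text'].tolist() == inner['text'].tolist())
```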

In [8]:
# pulls reviews for the 400 sample restaurants (uncomment to use the sample instead)
#business_reviews(review_reader, '../data/Las_Vegas_400.csv', 'Las_Vegas_400_reviews.csv')

reviews = business_reviews(review_reader, '../data/Las_Vegas_restaurants.csv', '../data/Las_Vegas_reviews.csv')
reviews.head()

19366 out of 500,000 related reviews
18667 out of 500,000 related reviews
19372 out of 500,000 related reviews
16770 out of 500,000 related reviews
15948 out of 500,000 related reviews
16773 out of 500,000 related reviews
17880 out of 500,000 related reviews
17706 out of 500,000 related reviews
18096 out of 500,000 related reviews
17037 out of 500,000 related reviews
17147 out of 500,000 related reviews
17603 out of 500,000 related reviews
19437 out of 500,000 related reviews
17159 out of 500,000 related reviews
17051 out of 500,000 related reviews
17942 out of 500,000 related reviews
2024 out of 500,000 related reviews


Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours,user_id,review_stars,text,date
0,Yr_w9lakJrKMyEG_hI6zbA,Fat Moe's Pizza & Wings,"6125 W Tropicana Ave, Ste F",Las Vegas,NV,89103,36.099361,-115.226636,4.0,141,1,"{'RestaurantsAttire': ""u'casual'"", 'Restaurant...","Pizza, Salad, Burgers, Restaurants","{'Monday': '11:0-22:0', 'Tuesday': '11:0-22:0'...",flaLsGwkBgWb_o_UVhXbTA,5.0,Love their food and their prices. The owners a...,2017-03-03 21:46:36
1,Yr_w9lakJrKMyEG_hI6zbA,Fat Moe's Pizza & Wings,"6125 W Tropicana Ave, Ste F",Las Vegas,NV,89103,36.099361,-115.226636,4.0,141,1,"{'RestaurantsAttire': ""u'casual'"", 'Restaurant...","Pizza, Salad, Burgers, Restaurants","{'Monday': '11:0-22:0', 'Tuesday': '11:0-22:0'...",yXhy8F43qDmSHuY9ubadtw,5.0,This place has definitely improved since the n...,2015-03-05 15:01:11
2,Yr_w9lakJrKMyEG_hI6zbA,Fat Moe's Pizza & Wings,"6125 W Tropicana Ave, Ste F",Las Vegas,NV,89103,36.099361,-115.226636,4.0,141,1,"{'RestaurantsAttire': ""u'casual'"", 'Restaurant...","Pizza, Salad, Burgers, Restaurants","{'Monday': '11:0-22:0', 'Tuesday': '11:0-22:0'...",Unhr6Ut9xQowzzohevNdxg,4.0,Ordered from here just awhile ago as I was cra...,2015-08-07 23:34:12
3,Yr_w9lakJrKMyEG_hI6zbA,Fat Moe's Pizza & Wings,"6125 W Tropicana Ave, Ste F",Las Vegas,NV,89103,36.099361,-115.226636,4.0,141,1,"{'RestaurantsAttire': ""u'casual'"", 'Restaurant...","Pizza, Salad, Burgers, Restaurants","{'Monday': '11:0-22:0', 'Tuesday': '11:0-22:0'...",LgkNGmjMijSiBlaWCfWtZw,5.0,Really yummy food that is made on the spot and...,2017-09-10 20:57:11
4,Yr_w9lakJrKMyEG_hI6zbA,Fat Moe's Pizza & Wings,"6125 W Tropicana Ave, Ste F",Las Vegas,NV,89103,36.099361,-115.226636,4.0,141,1,"{'RestaurantsAttire': ""u'casual'"", 'Restaurant...","Pizza, Salad, Burgers, Restaurants","{'Monday': '11:0-22:0', 'Tuesday': '11:0-22:0'...",FXB9CtInMisA1cET-Mhg5Q,5.0,We got here by accident since we were looking ...,2017-05-14 04:06:27


## Confirming Matches

In [9]:
def business_id_matcher(business_filepath, review_filepath):
    """
    Checks if reviews were correctly collected for desired businesses
    
    ARGS:
    
        business_filepath (str.csv): filepath of businesses
        review_filepath (str.csv): filepath of reviews
    
    """
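    #--number of distinct businesses requested vs. number of distinct businesses with reviews
    business_ids = pd.read_csv(business_filepath)['business_id']
    review_ids = pd.read_csv(review_filepath)['business_id']
    return business_ids.nunique() == review_ids.nunique()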


In [11]:
business_id_matcher('../data/Las_Vegas_restaurants.csv', '../data/Las_Vegas_reviews.csv')

True
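
The check above compares counts only. If it ever returns `False`, a set difference can show which businesses are missing reviews. The sketch below works on in-memory DataFrames with made-up IDs; the helper name `missing_business_ids` is hypothetical.

```python
import pandas as pd

def missing_business_ids(df_business, df_reviews):
    """Return business_ids in df_business that have no row in df_reviews."""
    wanted = set(df_business['business_id'])
    found = set(df_reviews['business_id'])
    return sorted(wanted - found)

toy_business = pd.DataFrame({'business_id': ['a', 'b', 'c']})
toy_reviews = pd.DataFrame({'business_id': ['a', 'a', 'c']})

missing_ids = missing_business_ids(toy_business, toy_reviews)
print(missing_ids)
```

An empty list would mean every requested business has at least one review in the output file.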