# Extra Examples - Merging

Here's a dataset dumped directly from a database, so we need to stitch it together ourselves:
https://www.kaggle.com/uciml/restaurant-data-with-consumer-ratings

The dataset comes with a README file that describes where each table comes from, which might help.

Let's try to:

1. Merge all restaurant data
2. Merge all user data
3. Merge restaurant data and user data together using user ratings
4. Realise that we've merged too much, and instead merge only user ratings + user profile + geoplaces
5. Use some groupby power and determine the top five restaurants in the dataset

In [1]:
import pandas as pd
import os

files = [f for f in os.listdir() if f.endswith(".csv")]
print(files)

['chefmozaccepts.csv', 'chefmozcuisine.csv', 'chefmozhours4.csv', 'chefmozparking.csv', 'geoplaces2.csv', 'rating_final.csv', 'usercuisine.csv', 'userpayment.csv', 'userprofile.csv']


## Merging restaurant data

In [2]:
# Load every CSV into a variable named after the file (e.g. geoplaces2.csv -> geoplaces2)
for f in files:
    name = f.split('.')[0]
    globals()[name] = pd.read_csv(f)

In [3]:
df_restaurant1 = pd.merge(chefmozaccepts, chefmozcuisine, on='placeID', how='left')
df_restaurant2 = df_restaurant1.merge(chefmozhours4, on='placeID', how='left')
df_restaurant3 = df_restaurant2.merge(chefmozparking, on='placeID', how='left')
df_restaurant4 = df_restaurant3.merge(geoplaces2, on='placeID', how='left')

In [4]:
print(f'chefmozaccepts: {chefmozaccepts.shape}')
print(f'chefmozcuisine: {chefmozcuisine.shape}')
print(f'df_restaurant1: {df_restaurant1.shape}')
print(f'chefmozhours4: {chefmozhours4.shape}')
print(f'df_restaurant2: {df_restaurant2.shape}')
print(f'chefmozparking: {chefmozparking.shape}')
print(f'df_restaurant3: {df_restaurant3.shape}')
print(f'geoplaces2: {geoplaces2.shape}')
print(f'df_restaurant4: {df_restaurant4.shape}')

chefmozaccepts: (1314, 2)
chefmozcuisine: (916, 2)
df_restaurant1: (1631, 3)
chefmozhours4: (2339, 3)
df_restaurant2: (5165, 5)
chefmozparking: (702, 2)
df_restaurant3: (5715, 6)
geoplaces2: (130, 21)
df_restaurant4: (5715, 26)
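
The row counts above grow with each merge, which is what we would expect if each file holds several rows per `placeID` (one per payment type, cuisine, opening-hours entry and so on). The following is a minimal sketch on made-up toy data, not the real files; the column names `Rpayment` and `Rcuisine` and all values are assumptions for illustration. It shows how a many-to-many merge multiplies rows, and one possible way to avoid that by collapsing each table to one row per `placeID` first.

```python
import pandas as pd

# Toy stand-ins for two "one row per attribute" tables (values are made up).
accepts = pd.DataFrame({
    "placeID": [1, 1, 2],
    "Rpayment": ["cash", "VISA", "cash"],
})
cuisine = pd.DataFrame({
    "placeID": [1, 1, 1, 2],
    "Rcuisine": ["Mexican", "Bar", "Cafe", "Pizzeria"],
})

# Many-to-many merge: every payment row pairs with every cuisine row.
merged = accepts.merge(cuisine, on="placeID", how="left")
# placeID 1: 2 payment rows x 3 cuisine rows = 6 rows; placeID 2: 1 x 1 = 1 row.
print(merged.shape)  # (7, 3)

# Collapse each table to one row per placeID before merging.
accepts_wide = accepts.groupby("placeID")["Rpayment"].agg(", ".join).reset_index()
cuisine_wide = cuisine.groupby("placeID")["Rcuisine"].agg(", ".join).reset_index()

# validate="one_to_one" raises an error if either side still has duplicate keys.
compact = accepts_wide.merge(
    cuisine_wide, on="placeID", how="left", validate="one_to_one"
)
print(compact.shape)  # (2, 3)
```

Whether joining values into a single string is the right way to collapse a table depends on what the merged data is for; it is used here only to keep the sketch short.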


## Merging User data

In [5]:
df_user = None
for f in files:
    if f.startswith('user'):
        df = pd.read_csv(f)
        if df_user is None:
            df_user = df
        else:
            df_user = df_user.merge(df, on='userID')
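
One thing to keep in mind with the loop above: `merge` defaults to `how='inner'`, so a user who is missing from any one of the user files drops out of `df_user`. The sketch below uses small made-up frames (the column names `budget`, `Rcuisine` and `Upayment` and all values are assumptions) to compare the default inner join with a left join, chaining the merges with `functools.reduce` instead of an explicit loop.

```python
from functools import reduce

import pandas as pd

# Made-up stand-ins for the three user tables.
profile = pd.DataFrame({"userID": ["U1", "U2", "U3"],
                        "budget": ["low", "medium", "high"]})
cuisine = pd.DataFrame({"userID": ["U1", "U1", "U2"],
                        "Rcuisine": ["Mexican", "Cafe", "Bar"]})
payment = pd.DataFrame({"userID": ["U1", "U3"],
                        "Upayment": ["cash", "VISA"]})

frames = [profile, cuisine, payment]

# Default inner join: only users present in every table survive.
inner = reduce(lambda left, right: left.merge(right, on="userID"), frames)

# Left join starting from the profile: every profiled user is kept,
# with NaN where a table has no row for that user.
left_joined = reduce(
    lambda left, right: left.merge(right, on="userID", how="left"), frames
)

print(sorted(inner["userID"].unique()))        # ['U1']
print(sorted(left_joined["userID"].unique()))  # ['U1', 'U2', 'U3']
```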

## Merging User ratings as well

In [11]:
all_data = pd.merge(df_restaurant4, rating_final, on='placeID')
all_data = pd.merge(all_data, df_user, on='userID')

## Merge Subsets

In [17]:
smart_data = pd.merge(userprofile, rating_final, on='userID')
smart_data = pd.merge(smart_data, geoplaces2, on='placeID')

## Top 5 restaurants based off rating

Note that answering this didn't actually require the user profile data. We might still use it to remove votes from users who don't satisfy certain criteria: for example, we might want to make sure the user has been to multiple restaurants, is a certain age, or doesn't show suspicious voting trends (such as giving every restaurant a one).
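
As a rough illustration of that kind of vote filtering, here is a minimal sketch on made-up ratings. The threshold `MIN_RATINGS` and the "more than one distinct rating" rule are hypothetical choices, not something the dataset prescribes.

```python
import pandas as pd

# Made-up ratings: U1 rates three places with varied scores,
# U2 rates only one place, U3 gives every place the same score.
ratings = pd.DataFrame({
    "userID":  ["U1", "U1", "U1", "U2", "U3", "U3", "U3"],
    "placeID": [10, 11, 12, 10, 10, 11, 12],
    "rating":  [2, 1, 0, 2, 1, 1, 1],
})

MIN_RATINGS = 3  # hypothetical threshold for "has been to multiple restaurants"

# Per-user summary: how many ratings, and how many distinct rating values.
per_user = ratings.groupby("userID")["rating"].agg(
    n_ratings="count", n_distinct="nunique"
)

# Keep users with enough ratings who did not give everyone the same score.
trusted = per_user[
    (per_user["n_ratings"] >= MIN_RATINGS) & (per_user["n_distinct"] > 1)
].index

filtered = ratings[ratings["userID"].isin(trusted)]
print(list(trusted))  # ['U1']
```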

In [20]:
smart_data.columns

Index(['userID', 'latitude_x', 'longitude_x', 'smoker', 'drink_level',
       'dress_preference', 'ambience', 'transport', 'marital_status', 'hijos',
       'birth_year', 'interest', 'personality', 'religion', 'activity',
       'color', 'weight', 'budget', 'height', 'placeID', 'rating',
       'food_rating', 'service_rating', 'latitude_y', 'longitude_y',
       'the_geom_meter', 'name', 'address', 'city', 'state', 'country', 'fax',
       'zip', 'alcohol', 'smoking_area', 'dress_code', 'accessibility',
       'price', 'url', 'Rambience', 'franchise', 'area', 'other_services'],
      dtype='object')
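
The `latitude_x` / `latitude_y` columns above come from pandas' default suffixes, which are applied when both frames have a non-key column with the same name. A small sketch with made-up coordinates shows how the `suffixes` argument of `merge` can give these columns more descriptive names; the suffixes `_user` and `_place` are just an illustrative choice.

```python
import pandas as pd

# Made-up single-row stand-ins for userprofile, rating_final and geoplaces2.
users = pd.DataFrame({"userID": ["U1"], "latitude": [22.14], "longitude": [-100.98]})
ratings = pd.DataFrame({"userID": ["U1"], "placeID": [10], "rating": [2]})
places = pd.DataFrame({"placeID": [10], "latitude": [22.15],
                       "longitude": [-100.97], "name": ["Example Cafe"]})

merged = (
    users.merge(ratings, on="userID")
    # Overlapping column names get these suffixes instead of _x / _y.
    .merge(places, on="placeID", suffixes=("_user", "_place"))
)
print(list(merged.columns))
```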

In [33]:
(
    smart_data.groupby(['placeID', 'name', 'address', 'city'])
    .rating.mean()
    .reset_index()
    .sort_values('rating', ascending=False)
    .head()
)

|    | placeID | name | address | city | rating |
|----|---------|------|---------|------|--------|
| 57 | 132955 | emilianos | venustiano carranza | san luis potos | 2.0 |
| 82 | 135034 | Michiko Restaurant Japones | Cordillera de Los Alpes 160 Lomas 2 Seccion | San Luis Potosi | 2.0 |
| 62 | 134986 | Restaurant Las Mananitas | Ricardo Linares 107 | Cuernavaca | 2.0 |
| 52 | 132922 | cafe punta del cielo | ? | ? | 1.833333 |
| 26 | 132755 | La Estrella de Dimas | Av. de los Pintores | San Luis Potosi | 1.8 |
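
A mean rating on its own can be dominated by restaurants with only a handful of votes. One possible refinement, sketched here on made-up data, is to compute the number of ratings alongside the mean and require a minimum count before ranking. The threshold `MIN_VOTES` is a hypothetical value, not one derived from the dataset.

```python
import pandas as pd

# Made-up ratings: place 11 has a perfect score but only a single vote.
ratings = pd.DataFrame({
    "placeID": [10, 10, 10, 11, 12, 12],
    "name":    ["A", "A", "A", "B", "C", "C"],
    "rating":  [2, 1, 2, 2, 1, 1],
})

MIN_VOTES = 2  # hypothetical minimum number of ratings to be ranked

# Mean rating and vote count per restaurant.
summary = (
    ratings.groupby(["placeID", "name"])["rating"]
    .agg(mean_rating="mean", n_ratings="count")
    .reset_index()
)

# Drop sparsely rated places, then rank by mean (ties broken by vote count).
top = (
    summary[summary["n_ratings"] >= MIN_VOTES]
    .sort_values(["mean_rating", "n_ratings"], ascending=False)
    .head(5)
)
print(top)
```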
