In [13]:
# Import the following libraries to get started

import pandas as pd
import numpy as np
import pprint
import seaborn as sns

# Data Exploration for Restaurant Data

## Problem Background

At first glance, I don't know what these datasets are for, what they can tell me, or what I can do with them. Let's use some exploratory data analysis to figure out what we can do with this treasure trove of data.

## What's The Data Set About?

First, let's look at which files are available and what data attributes they contain. For each file, we want to know:

1. What are the data attributes?
2. What are their data types? (This helps us determine whether an attribute is categorical or numerical. If categorical, is it ordered or unordered? If numerical, is it continuous or discrete?)

Answering the above questions allows us to **guess** at the relationships between the attributes. After forming an intuition, we can use exploratory data analysis techniques to examine those relationships further. This information could help us build more accurate predictive models later on.
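As a rough sketch of how the two questions above could be answered for any DataFrame, the helper below lists each column's dtype, its number of unique values, and a first guess at whether it is numerical or categorical. The function name `summarize_columns` and the tiny sample frame are assumptions made for illustration, not part of the dataset or of pandas.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, unique-value count, and a rough kind."""
    rows = []
    for name in df.columns:
        col = df[name]
        # First guess only: numeric dtypes are "numerical", everything else "categorical"
        kind = "numerical" if pd.api.types.is_numeric_dtype(col) else "categorical"
        rows.append({
            "column": name,
            "dtype": str(col.dtype),
            "n_unique": col.nunique(),
            "kind": kind,
        })
    return pd.DataFrame(rows).set_index("column")


# Tiny illustrative frame (assumed values), standing in for a file loaded with pd.read_csv
sample_df = pd.DataFrame({
    "placeID": [135111, 135110, 135109],
    "parking_lot": ["public", "none", "none"],
})
summary = summarize_columns(sample_df)
print(summary)
```

Note that the guess is based on dtype alone. An identifier such as placeID is stored as an integer and so shows up as "numerical", even though it behaves like a label, so the result still needs a human look.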

## Available Files


In [14]:
import os

# List every file in the data directory, numbered from 1
files = os.listdir('./data')
for num, file in enumerate(files, start=1):
    print(f"{num}: {file}")

1: chefmozparking.csv
2: chefmozaccepts.csv
3: userpayment.csv
4: geoplaces2.csv
5: rating_final.csv
6: usercuisine.csv
7: chefmozcuisine.csv
8: chefmozhours4.csv
9: userprofile.csv


From the above, we can see that there are 9 files in the data directory.
Let's look at what each of them contains and what it could tell us.

### chefmozparking.csv - Restaurants And The Parking They Have

In [26]:
chefmozparking_df = pd.read_csv('./data/chefmozparking.csv')
print(chefmozparking_df.dtypes)
print("\n")

# Only the last expression in a cell is displayed, so head() is called once, at the end
chefmozparking_df.head()

placeID         int64
parking_lot    object
dtype: object




   placeID parking_lot
0   135111      public
1   135110        none
2   135109        none
3   135108        none
4   135107        none


From the above, we can see that there are only 2 variables:
1. placeID (Integer)
2. parking_lot (Object)

pandas reports **parking_lot** as `object`, the dtype it typically uses for columns of strings, so this looks like a String variable. Let's see what possible values it has.

In [23]:
print(pd.unique(chefmozparking_df.parking_lot))

['public' 'none' 'yes' 'valet parking' 'fee' 'street' 'validated parking']


We can see from the above that all the possible values of the variable **parking_lot** are Strings. We can make an educated guess at what they mean:

- **public** refers to **public parking**
- **none** refers to **no parking available**
- **yes** refers to **there is parking available!**
- **valet parking** refers to **there are valets to park for you**
- **fee** refers to **there is a fee-based carpark**
- **street** refers to **parking is available on the street**
- **validated parking** most likely refers to **parking where the restaurant validates (stamps) your ticket for free or discounted parking**
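To check guesses like these against the data, it helps to see how often each value occurs. The sketch below uses only the five rows shown in the `head()` output above, so the counts are not those of the full file. The `has_parking` column is a hypothetical helper that treats every value except 'none' as some form of parking.

```python
import pandas as pd

# Five rows copied from the head() output above; the real file contains more rows
sample_parking = pd.DataFrame({
    "placeID": [135111, 135110, 135109, 135108, 135107],
    "parking_lot": ["public", "none", "none", "none", "none"],
})

# Frequency of each parking label, most common first
counts = sample_parking["parking_lot"].value_counts()
print(counts)

# Hypothetical helper column: True when any kind of parking is listed
sample_parking["has_parking"] = sample_parking["parking_lot"] != "none"
print(sample_parking)
```

On the full file, the same two lines could be run on `chefmozparking_df` directly.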

### chefmozaccepts.csv - Restaurants & The Payment Types They Accept

In [38]:
chefmozaccepts_df = pd.read_csv('./data/chefmozaccepts.csv')
print(chefmozaccepts_df.dtypes)
print("\n")

# Count and list the distinct payment labels
payment_types = pd.unique(chefmozaccepts_df.Rpayment)
print(f"{len(payment_types)} types of payment options:\n")
print(payment_types)

chefmozaccepts_df.head()

placeID      int64
Rpayment    object
dtype: object


12 types of payment options:

['cash' 'VISA' 'MasterCard-Eurocard' 'American_Express' 'bank_debit_cards'
 'checks' 'Discover' 'Carte_Blanche' 'Diners_Club' 'Visa'
 'Japan_Credit_Bureau' 'gift_certificates']


   placeID             Rpayment
0   135110                 cash
1   135110                 VISA
2   135110  MasterCard-Eurocard
3   135110     American_Express
4   135110     bank_debit_cards


From the above, we can see that each **placeID** can have **many payment types**. This is represented by **multiple rows** with the **same placeID** and a different **Rpayment** value. Counting the unique values of **Rpayment** gives 12 payment options. However, 'VISA' and 'Visa' appear as separate labels, so there are really only 11 distinct options.
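The unique values printed above include both 'VISA' and 'Visa', which differ only in letter case. One possible way to handle this, sketched below, is to lowercase the labels before counting. The list is copied from the output above. Lowercasing is just one assumed normalization, and the original notebook does not apply it.

```python
import pandas as pd

# The 12 labels exactly as printed by pd.unique() above
payment_labels = pd.Series([
    "cash", "VISA", "MasterCard-Eurocard", "American_Express",
    "bank_debit_cards", "checks", "Discover", "Carte_Blanche",
    "Diners_Club", "Visa", "Japan_Credit_Bureau", "gift_certificates",
])

# Lowercase every label so that 'VISA' and 'Visa' collapse into one value
normalized = payment_labels.str.lower()

print(payment_labels.nunique())  # 12 raw labels
print(normalized.nunique())      # 11 after normalizing case
```

On the real data this would be `chefmozaccepts_df['Rpayment'].str.lower()`.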

### userpayment.csv - Users & Their Modes of Payments

In [82]:
userpayment_df = pd.read_csv('./data/userpayment.csv')
print(userpayment_df.dtypes)
print("\n")

# Only the last expression in a cell is displayed, so head() is called once, at the end
userpayment_df.head()

userID      object
Upayment    object
dtype: object




  userID          Upayment
0  U1001              cash
1  U1002              cash
2  U1003              cash
3  U1004              cash
4  U1004  bank_debit_cards


From the above, we can see that **userpayment.csv** lists users and their modes of payment. User U1004, for example, has **2 types of payment:** cash and bank_debit_cards.
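To find every user like U1004 who has more than one mode of payment, we could count the distinct payment values per userID. This is a minimal sketch using only the five rows shown above. The variable names are assumptions for illustration.

```python
import pandas as pd

# Five rows copied from the head() output above; the real file contains more rows
sample_userpayment = pd.DataFrame({
    "userID": ["U1001", "U1002", "U1003", "U1004", "U1004"],
    "Upayment": ["cash", "cash", "cash", "cash", "bank_debit_cards"],
})

# Number of distinct payment methods for each user
methods_per_user = sample_userpayment.groupby("userID")["Upayment"].nunique()

# Keep only the users with more than one method
multi_method_users = methods_per_user[methods_per_user > 1]
print(multi_method_users)
```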

In [83]:
# Collect each user's payment methods into a single list per userID
groupedPayments = userpayment_df.groupby('userID')
userpayment_df = groupedPayments.agg(list)
print(userpayment_df.dtypes)
userpayment_df.head()

Upayment    object
dtype: object


                        Upayment
userID
U1001                     [cash]
U1002                     [cash]
U1003                     [cash]
U1004   [cash, bank_debit_cards]
U1005                     [cash]
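Lists stored inside cells are easy to read but awkward to filter or to feed into a model. As an alternative sketch (not what the notebook above does), `pd.crosstab` can build one row per user and one column per payment type. The sample rows are the five shown earlier in the userpayment.csv `head()` output.

```python
import pandas as pd

# Five rows copied from the earlier head() output of userpayment.csv
sample_userpayment = pd.DataFrame({
    "userID": ["U1001", "U1002", "U1003", "U1004", "U1004"],
    "Upayment": ["cash", "cash", "cash", "cash", "bank_debit_cards"],
})

# One row per user, one column per payment type, cell = number of matching rows
payment_matrix = pd.crosstab(sample_userpayment["userID"], sample_userpayment["Upayment"])
print(payment_matrix)
```

Each cell is a count, so a value of 1 means the user has that payment type and 0 means they do not.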
