### Exploratory analysis of restaurant data

Importing libraries and loading data

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

def split_column(col):
    s_col = col.split(" ")
    return pd.Series([s_col] + s_col)

def get_str_id(i):
    return str(1000+i)[1:]

raw_df = pd.read_csv('city_hotel_features.txt', delimiter='\t')
feature_df = pd.read_csv('features.txt', delimiter='\t')
df = raw_df.join(raw_df["Features"].apply(split_column))
del df['Features']
column_names = ["hotel_name", "city", "features"]
for i in range(30):
    column_names.append("feature_" + str(i))
df.columns = column_names
feature_df.columns = ['cuisine_id', 'feature']
feature_df["cuisine_id"] = feature_df["cuisine_id"].apply(get_str_id)
df.head(3)

Unnamed: 0,hotel_name,city,features,feature_0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,...,feature_20,feature_21,feature_22,feature_23,feature_24,feature_25,feature_26,feature_27,feature_28,feature_29
0,Tanner's,atlanta,"[100, 253, 250, 178, 174, 063, 059, 036, 008, ...",100,253,250,178,174,63,59,...,,,,,,,,,,
1,Frijoleros,atlanta,"[250, 062, 132, 174, 063, 197, 071, 142, 234, ...",250,62,132,174,63,197,71,...,,,,,,,,,,
2,Indian Delights,atlanta,"[253, 250, 150, 174, 083, 059, 036, 117, 243, ...",253,250,150,174,83,59,36,...,,,,,,,,,,
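
The `get_str_id` helper zero-pads ids to three characters so that they match the tokens in the `Features` column (for example `063`). A minimal sketch of an equivalent using `str.zfill`, assuming ids are non-negative integers below 1000; `pad_feature_id` is a hypothetical name, not part of the notebook:

```python
def get_str_id(i):
    # Helper from the notebook: 8 -> "008", 63 -> "063"
    return str(1000 + i)[1:]

def pad_feature_id(i, width=3):
    # Hypothetical alternative: left-pad the decimal string with zeros
    return str(i).zfill(width)

for i in (8, 63, 250):
    print(i, get_str_id(i), pad_feature_id(i))
```

Both forms agree for ids in the range 0 to 999; `zfill` also leaves longer ids intact instead of truncating them.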


Distribution of restaurants by city:

In [2]:
df["city"].value_counts()

new_york         1200
chicago           676
los_angeles       447
boston            438
san_francisco     414
washington_dc     391
new_orleans       327
atlanta           267
Name: city, dtype: int64

Suppose a user requests the following features:

In [3]:
features_req = ["113", "075", "008", "053", "167", "125"]

Score each restaurant by how many of the requested features it has, and select the top 10:

In [4]:
def get_score(features):
    score = 0
    for f in features_req:
        if f in features:
            score += 1
    return score

df["features"].apply(get_score).sort_values(
                    ascending=False)[:10].reset_index().join(
                    df.reset_index(), lsuffix="l", on="index")[["index", "hotel_name"]]

Unnamed: 0,index,hotel_name
0,2332,Il Gattopardo
1,812,Saloon
2,2229,Caffe Cielo
3,3025,Andiamo
4,3030,Bull & Bear
5,3529,Cafe 222
6,2315,Brio
7,3091,Barocco
8,2174,Mezzogiorno
9,2621,Petaluma
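
The scoring chain above can be hard to follow. Here is a minimal sketch of the same idea on a toy frame, assuming a `features` column that holds lists of feature-id strings as in `df`. The toy rows and the name `score_features` are illustrative, not taken from the data:

```python
import pandas as pd

features_req = ["113", "075", "008", "053", "167", "125"]

# Toy stand-in for df: hypothetical rows with the same column layout
toy_df = pd.DataFrame({
    "hotel_name": ["A", "B", "C"],
    "features": [["113", "075", "250"], ["008"], ["113", "075", "008", "053"]],
})

def score_features(features, requested=features_req):
    # Number of requested feature ids present in this restaurant's list
    return len(set(requested) & set(features))

toy_df["score"] = toy_df["features"].apply(score_features)

# nlargest sorts by score and keeps the top rows in one step
top = toy_df.nlargest(2, "score")[["hotel_name", "score"]]
print(top)
```

Keeping the score as a column avoids the `reset_index` / `join` round trip and makes the score itself visible next to the name.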


Create a wide ("horizontal") dataframe with one column per feature id, aggregate it by city, and then transpose it so that each row is a feature:

In [5]:
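# Keep the first three columns (hotel_name, city, features) and add one column
# per feature id. .iloc replaces .ix, which was removed in pandas 1.0.
df_hor = df.iloc[:, 0:3].join(df["features"].apply(lambda x: pd.Series(x).value_counts()))

def get_city_dets(df):
    return df.sum()

city_sum = df_hor.groupby("city").apply(get_city_dets)
city_sum_transpose = city_sum.iloc[:, 3:].transpose().reset_index()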
city_sum_transpose.columns =  ["cuisine_id"] + list(city_sum_transpose.columns[1:])
city_sum_transpose = pd.merge(city_sum_transpose, feature_df, on='cuisine_id', how='left')

For a given city, say:

In [6]:
city = 'new_orleans'

We can now see the top N (here 5) features:

In [7]:
city_sum_transpose.sort_values(city, ascending = False)[0:5][["feature"]]

Unnamed: 0,feature
157,Open on Mondays
158,Open on Sundays
253,Wheelchair Access
205,Excellent Service
75,Excellent Food


For a given feature, we can see city counts too:

In [8]:
feature_id = "008"

df_hor[df_hor[feature_id].notnull()]["city"].value_counts()

new_york         130
chicago           64
washington_dc     60
boston            55
san_francisco     44
atlanta           33
Name: city, dtype: int64

#### Collaborative filtering for similar restaurants

Build a feature vector for each restaurant, then find all restaurants whose vector makes a small angle with that of a given restaurant (equivalently, a high cosine similarity): $$\theta < \text{threshold}$$
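
One way to sketch this, assuming each restaurant is treated as a binary vector over feature ids (1 if the restaurant has the feature). For binary vectors the cosine similarity reduces to |A ∩ B| / sqrt(|A| · |B|), and a small angle θ corresponds to a cosine close to 1. The restaurant names, the function names and the 0.5 cutoff are illustrative assumptions:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity of two binary feature vectors, given as id collections
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def similar_restaurants(target, restaurants, min_similarity=0.5):
    # Return (name, similarity) pairs at or above the cutoff, best first
    base = restaurants[target]
    scored = [
        (name, cosine_similarity(base, feats))
        for name, feats in restaurants.items()
        if name != target
    ]
    kept = [pair for pair in scored if pair[1] >= min_similarity]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy data: restaurant name -> list of feature ids
restaurants = {
    "A": ["250", "253", "174"],
    "B": ["250", "253", "063"],
    "C": ["008"],
}
result = similar_restaurants("A", restaurants)
print(result)
```

Here `B` shares two of three features with `A` (similarity 2/3) and is kept, while `C` shares none and is dropped.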

#### Similar features

Find the features that most often co-occur with a given feature, for example:

In [9]:
feature = "250"
df_hor[df_hor[feature].notnull()].sum()[3:].sort_values(ascending=False)[1:6].reset_index()["index"]

0    205
1    253
2    075
3    191
4    192
Name: index, dtype: object

This gives the 5 features that co-occur most often with the given feature.
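
The same co-occurrence count can be sketched without the wide dataframe, assuming each restaurant is just a list of feature ids. The toy lists and the name `co_occurring_features` are hypothetical:

```python
from collections import Counter

def co_occurring_features(feature, feature_lists, top_n=5):
    # Count how often every other feature appears alongside `feature`
    counts = Counter()
    for feats in feature_lists:
        unique = set(feats)
        if feature in unique:
            counts.update(unique - {feature})
    return counts.most_common(top_n)

# Hypothetical toy data: one list of feature ids per restaurant
toy = [
    ["250", "205", "253"],
    ["250", "205", "253"],
    ["250", "075", "205"],
    ["008", "253"],
]
pairs = co_occurring_features("250", toy, top_n=2)
print(pairs)
```

Excluding the feature itself inside the function plays the role of the `[1:6]` slice in the cell above, which skips the first (self) entry.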

#### Scaling
This data is fairly simple, should scale easily, and fits comfortably in a single machine's RAM.

In [10]:
import sys
print("Current size: " + str(sys.getsizeof(df_hor) / (1024 * 1024)) + " MB")

Current size: 9.32396125793457 MB


This data could grow to about 1000 times its current size. Even in that extreme case, all the computations are easily partitionable by hotel or city. Using these as partition keys in Spark RDDs or DataFrames, the system can handle more load through horizontal scaling.
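
To illustrate why partitioning by city works, here is a plain-Python sketch (not Spark itself): each city's rows are counted independently, and the partial results are merged afterwards. The rows and function names are hypothetical:

```python
from collections import Counter, defaultdict

# Hypothetical rows: (city, list of feature ids)
rows = [
    ("new_york", ["008", "250"]),
    ("chicago", ["008"]),
    ("new_york", ["250"]),
]

# "Partition" step: group rows by city
partitions = defaultdict(list)
for city, feats in rows:
    partitions[city].append(feats)

def count_features(feature_lists):
    # Work done independently inside one partition
    counts = Counter()
    for feats in feature_lists:
        counts.update(feats)
    return counts

# "Map" step: one partial result per partition
partial = {city: count_features(lists) for city, lists in partitions.items()}

# "Reduce" step: merge the partial counts into a global count
total = sum(partial.values(), Counter())
print(partial["new_york"]["250"], total["008"])
```

Because the merge is a simple sum, the per-city work can run on separate machines without any coordination until the final step.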