# Restaurant project

## 1. Introduction
Indian cities host an incredibly diverse collection of restaurants catering to different palates and appetites. A large part of marketing for a modern restaurant (or any company) is social media, where the number of "likes" a company receives shapes its brand and image in the eyes of the general public.

For a new business owner (or an existing company) planning to open a restaurant in a major Indian city, knowing ahead of time what social media image to expect would reduce the ever-present business problem of uncertainty. In this case, the uncertainty concerns the performance of the restaurant's social media presence.

We can mitigate this uncertainty by leveraging data gathered from FourSquare's API. Specifically, we can pull "likes" data for different restaurants directly from the API, along with their location and category of cuisine. The question we will try to address is: how accurately can we predict the number of "likes" a new restaurant opening in this region can expect, based on the type of cuisine it will serve and the Indian city it will open in? (For the purposes of this analysis, we restrict the geographical scope to three heavily populated cities in India, namely Delhi, Mumbai, and Bangalore.)

This data addresses the problem because it allows the new business owner (or existing company) to make business decisions before opening the restaurant: whether it is feasible to open one in this region and expect a good social media presence, and which type of cuisine and which of the three cities would be best. This project will analyze and model the data with machine learning, comparing linear and logistic regression to see which method yields better predictive capability after training and testing.



In [2]:
#Let us begin by importing the necessary packages.


import numpy as np 

import pandas as pd 
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json

from geopy.geocoders import Nominatim 

import requests
from pandas.io.json import json_normalize 


import matplotlib.cm as cm
import matplotlib.colors as colors


import folium 

from urllib.request import urlopen
from bs4 import BeautifulSoup


import matplotlib.pyplot as plt
import pylab as pl

from sklearn import linear_model
from sklearn.metrics import jaccard_similarity_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import log_loss
from sklearn.metrics import mean_squared_error, r2_score
import itertools


print('Libraries imported.')

Libraries imported.


## 2. Data

### 2.1 Data Scraping and Cleaning

In this section we will first retrieve the geographical coordinates of the three cities (Delhi, Mumbai and Bangalore). Then, we will leverage the FourSquare API to build URLs that return the raw data in JSON form. We will separately pull the raw data from these URLs to retrieve the following columns for each city: "name", "categories", "latitude", "longitude", and "id". We also add another column ("city") to indicate which city each restaurant is from.

It is important to note that the extracts do not cover every restaurant in those cities, only venues returned within a 2,000 m (2 km) radius of the coordinates that the geolocator provided for each city. Moreover, the FourSquare API returns venue data, so the extract includes venues other than restaurants, such as concert halls, stores, and libraries. The data therefore needs to be cleaned further, somewhat manually, by removing the non-restaurant rows. Once this is complete, we have a shortened but cleaned list for which to pull "likes" data. The cleaning comes first mainly because pulling the "likes" data is the slowest step in this project, so we want to avoid pulling information that will end up being dropped anyway.

The "id" column is important because it allows us to pull the "likes" from the API. We can retrieve the "likes" for each restaurant "id" and then append them to the data frame. Once this is complete, we name the dataframe 'raw_dataset', as it is the most complete compiled form before any processing for analysis via machine learning.
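The extraction described above can be tried out without calling the API by flattening a small hand-made payload. This is only a sketch: `sample_response` is invented data that mimics the nesting this notebook reads from the explore endpoint, and `venues_to_frame` is a hypothetical helper name, not part of any library.

```python
import pandas as pd

# Hand-made stand-in for one city's explore response; all values are invented
# and only mimic the nesting that this notebook reads.
sample_response = {
    "response": {"groups": [{"items": [
        {"venue": {"id": "a1", "name": "Example Dhaba",
                   "categories": [{"name": "Indian Restaurant"}],
                   "location": {"lat": 28.65, "lng": 77.22}}},
        {"venue": {"id": "b2", "name": "Example Hall",
                   "categories": [],
                   "location": {"lat": 28.66, "lng": 77.23}}},
    ]}]}
}

def venues_to_frame(response, city):
    """Flatten an explore-style payload into one row per venue (hypothetical helper)."""
    items = response["response"]["groups"][0]["items"]
    frame = pd.json_normalize(items)  # nested dicts become dotted column names
    frame = frame.loc[:, ["venue.name", "venue.categories",
                          "venue.location.lat", "venue.location.lng", "venue.id"]]
    # keep the name of the first listed category, or None when the list is empty
    frame["venue.categories"] = frame["venue.categories"].apply(
        lambda cats: cats[0]["name"] if cats else None)
    # drop the "venue." and "location." prefixes from the column names
    frame.columns = [col.split(".")[-1] for col in frame.columns]
    frame["city"] = city
    return frame

demo = venues_to_frame(sample_response, "Delhi")
print(demo)
```

Wrapping the steps in one function means the same logic can be applied to each of the three cities instead of being repeated per city.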

In [4]:
address1 = 'Delhi, India'

# recent geopy releases require a user_agent; the value used here is a placeholder
geolocator = Nominatim(user_agent='restaurant_project')
location1 = geolocator.geocode(address1)
latitude1 = location1.latitude
longitude1 = location1.longitude
print('The geographical coordinates of {} are {}, {}.'.format(address1, latitude1, longitude1))


address2 = 'Mumbai, India'

location2 = geolocator.geocode(address2)
latitude2 = location2.latitude
longitude2 = location2.longitude
print('The geographical coordinates of {} are {}, {}.'.format(address2, latitude2, longitude2))

address3 = 'Bangalore, India'

location3 = geolocator.geocode(address3)
latitude3 = location3.latitude
longitude3 = location3.longitude
print('The geographical coordinates of {} are {}, {}.'.format(address3, latitude3, longitude3))

The geographical coordinates of Delhi, India are 28.6517178, 77.2219388.
The geographical coordinates of Mumbai, India are 18.9387711, 72.8353355.
The geographical coordinates of Bangalore, India are 12.9791198, 77.5912997.


In [45]:
CLIENT_ID = 'WF2HLZPWOXQ0BDJ4DIO4ESR4QBDE5Q3NFM1Y4KQUQEG2WCR2' # Foursquare ID
CLIENT_SECRET = 'YCZKUFTVKWSJZMIU3P4WZA0ZXPE0R3SEKKSYQXCGB3RE1KZC' # Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)


LIMIT = 200 # limit of number of venues returned by Foursquare API
radius = 2000 # search radius in metres

# create the URL for Delhi
url1 = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude1, 
    longitude1, 
    radius, 
    LIMIT)


# create the URL for Mumbai
url2 = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude2, 
    longitude2, 
    radius, 
    LIMIT)


# create the URL for Bangalore
url3 = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude3, 
    longitude3, 
    radius, 
    LIMIT)

print(url1, url2, url3)

Your credentials:
CLIENT_ID: WF2HLZPWOXQ0BDJ4DIO4ESR4QBDE5Q3NFM1Y4KQUQEG2WCR2
CLIENT_SECRET:YCZKUFTVKWSJZMIU3P4WZA0ZXPE0R3SEKKSYQXCGB3RE1KZC
https://api.foursquare.com/v2/venues/explore?&client_id=WF2HLZPWOXQ0BDJ4DIO4ESR4QBDE5Q3NFM1Y4KQUQEG2WCR2&client_secret=YCZKUFTVKWSJZMIU3P4WZA0ZXPE0R3SEKKSYQXCGB3RE1KZC&v=20180605&ll=28.6517178,77.2219388&radius=2000&limit=200 https://api.foursquare.com/v2/venues/explore?&client_id=WF2HLZPWOXQ0BDJ4DIO4ESR4QBDE5Q3NFM1Y4KQUQEG2WCR2&client_secret=YCZKUFTVKWSJZMIU3P4WZA0ZXPE0R3SEKKSYQXCGB3RE1KZC&v=20180605&ll=18.9387711,72.8353355&radius=2000&limit=200 https://api.foursquare.com/v2/venues/explore?&client_id=WF2HLZPWOXQ0BDJ4DIO4ESR4QBDE5Q3NFM1Y4KQUQEG2WCR2&client_secret=YCZKUFTVKWSJZMIU3P4WZA0ZXPE0R3SEKKSYQXCGB3RE1KZC&v=20180605&ll=12.9791198,77.5912997&radius=2000&limit=200


In [46]:
# pull the data from the generated URLs

results1 = requests.get(url1).json()
results2 = requests.get(url2).json()
results3 = requests.get(url3).json()

# function that extracts the category of the venue
def get_category_type(row):
    # the column is named 'venue.categories' until the columns are cleaned
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    # venues without any category are returned as None
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
    

### first city ###    
    
venues1 = results1['response']['groups'][0]['items']
nearby_venues1 = json_normalize(venues1) # flatten JSON

# filter columns
filtered_columns1 = ['venue.name', 'venue.categories', 'venue.location.lat', 
                    'venue.location.lng', 'venue.id']
nearby_venues1 = nearby_venues1.loc[:, filtered_columns1]

# filter the category for each row
nearby_venues1['venue.categories'] = nearby_venues1.apply(get_category_type, axis=1)

# clean columns
nearby_venues1.columns = [col.split(".")[-1] for col in nearby_venues1.columns]


### second city ###

venues2 = results2['response']['groups'][0]['items']
nearby_venues2 = json_normalize(venues2) # flatten JSON

# filter columns
filtered_columns2 = ['venue.name', 'venue.categories', 'venue.location.lat', 
                    'venue.location.lng', 'venue.id']
nearby_venues2 = nearby_venues2.loc[:, filtered_columns2]

# filter the category for each row
nearby_venues2['venue.categories'] = nearby_venues2.apply(get_category_type, axis=1)

# clean columns
nearby_venues2.columns = [col.split(".")[-1] for col in nearby_venues2.columns]


### third city ###

venues3 = results3['response']['groups'][0]['items']
nearby_venues3 = json_normalize(venues3) # flatten JSON

# filter columns
filtered_columns3 = ['venue.name', 'venue.categories', 'venue.location.lat', 
                    'venue.location.lng', 'venue.id']
nearby_venues3 = nearby_venues3.loc[:, filtered_columns3]

# filter the category for each row
nearby_venues3['venue.categories'] = nearby_venues3.apply(get_category_type, axis=1)

# clean columns
nearby_venues3.columns = [col.split(".")[-1] for col in nearby_venues3.columns]





print('{} venues were returned by Foursquare.'.format(nearby_venues1.shape[0]))
print('{} venues were returned by Foursquare.'.format(nearby_venues2.shape[0]))
print('{} venues were returned by Foursquare.'.format(nearby_venues3.shape[0]))

89 venues were returned by Foursquare.
100 venues were returned by Foursquare.
100 venues were returned by Foursquare.


In [49]:
# add locations data to the data sets of each city

nearby_venues1['city'] = 'Delhi'
nearby_venues2['city'] = 'Mumbai'
nearby_venues3['city'] = 'Bangalore'

In [50]:
# combine the three cities into one data set

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
# and keeps each city's original row index
nearby_venues = pd.concat([nearby_venues1, nearby_venues2, nearby_venues3])

In [51]:
# check list and manually remove all non-restaurant data

nearby_venues['categories'].unique()

removal_list = ['Concert Hall', 'Opera House', 'Dance Studio',
                'Performing Arts Venue', 'Art Museum', 'Park',
                'Massage Studio', 'Music Venue', 'Bookstore', 'Clothing Store',
                'Boutique', 'Furniture/Home Store', 'Jazz Club',
                'Theater', 'Optical Shop', "Men's Store", 'Rock Club',
                'Gym / Fitness Center', 'Wine Shop', 'Indie Movie Theater',
                'Chocolate Shop', 'Dessert Shop', 'Recreation Center',
                'Plaza', 'Hotel', 'Luggage Store', 'Farmers Market', 'Gym',
                'Jewelry Store', 'Furniture / Home Store', 'Butcher',
                'Bakery', 'Marijuana Dispensary', 'Ice Cream Shop',
                'Comic Shop', 'Bagel Shop', 'Spa', 'Liquor Store', 'Bike Shop',
                'Yoga Studio', 'Pedestrian Plaza', 'Candy Store',
                'Art Gallery', 'Supermarket', 'Museum', 'Building',
                'Historic Site', 'Pharmacy', 'Market', 'Movie Theater',
                'Cheese Shop', 'School', 'Gift Shop', 'Athletics & Sports',
                'Shoe Repair', 'General Entertainment', 'Stationery Store',
                'Toy / Game Store', 'Brewery', 'Business Service',
                'Donut Shop', 'Beer Store', 'Lounge', 'Health Food Store',
                'Lingerie Store', 'Mobile Phone Shop', 'Hostel',
                'Convenience Store', 'Cosmetics Shop', 'Piano Bar',
                'Nightclub', 'Comedy Club']

nearby_venues = nearby_venues[~nearby_venues['categories'].isin(removal_list)]

nearby_venues['categories'].unique()

array(['Snack Place', 'Indian Restaurant', 'Food & Drink Shop', 'Mosque',
       'Tibetan Restaurant', 'Restaurant', 'Korean Restaurant',
       'Hardware Store', 'Paper / Office Supplies Store', 'Bar', 'Café',
       'Fast Food Restaurant', 'Food', 'Indian Chinese Restaurant',
       'Motel', 'Pizza Place', 'Flea Market', 'Breakfast Spot',
       'Coffee Shop', 'Sandwich Place', 'Light Rail Station',
       'Smoke Shop', 'Italian Restaurant', 'Road', 'Falafel Restaurant',
       'Chinese Restaurant', 'Parsi Restaurant', 'Cricket Ground',
       'Seafood Restaurant', 'Scenic Lookout', 'Beach',
       'Asian Restaurant', 'History Museum', 'Japanese Restaurant',
       'Music Store', 'Train Station', 'Pub', 'Middle Eastern Restaurant',
       'Gastropub', 'Diner', 'New American Restaurant', 'Multiplex',
       'College Academic Building', 'Stadium', 'Cocktail Bar',
       'Monument / Landmark', 'Hockey Arena', 'Mediterranean Restaurant',
       'Chaat Place', 'BBQ Joint', 'Tea Room', 'Sh
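The output above still lists non-food categories such as 'Mosque', 'Hardware Store', and 'Road', because a removal list only drops what has been named explicitly. One alternative, offered here as a sketch and not as the approach this notebook uses, is to keep only categories that match food-related keywords. The keyword tuple and the sample rows are assumptions chosen for illustration.

```python
import pandas as pd

# Illustrative keywords; which categories count as food venues is a judgment call.
FOOD_KEYWORDS = ("Restaurant", "Place", "Joint", "Café", "Coffee", "Diner",
                 "Bar", "Pub", "Breakfast", "Tea Room")

# Small made-up sample using category names that appear in the output above.
sample = pd.DataFrame({"categories": ["Indian Restaurant", "Mosque", "Pizza Place",
                                      "Hardware Store", "Café", "Road", None]})

# Build one regular expression that matches any keyword, ignoring case;
# rows with a missing category are treated as non-matches.
pattern = "|".join(FOOD_KEYWORDS)
mask = sample["categories"].str.contains(pattern, case=False, na=False)
food_only = sample[mask]
print(food_only["categories"].tolist())
```

Substring matching is coarse (a short keyword such as "Bar" would also match an unrelated category that happens to contain those letters), so the surviving categories still deserve a manual check.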

In [52]:
# set up to pull the likes from the API based on venue ID

url_list = []
like_list = []
json_list = []

for i in list(nearby_venues.id):
    venue_url = 'https://api.foursquare.com/v2/venues/{}/likes?client_id={}&client_secret={}&v={}'.format(i, CLIENT_ID, CLIENT_SECRET, VERSION)
    url_list.append(venue_url)
for link in url_list:
    result = requests.get(link).json()
    likes = result['response']['likes']['count']
    like_list.append(likes)
print(like_list)


nearby_venues['likes'] = like_list
nearby_venues.head()

[6, 34, 36, 316, 177, 15, 6, 38, 6, 10, 49, 10, 7, 89, 53, 8, 90, 31, 8, 6, 7, 30, 5, 5, 6, 5, 18, 22, 13, 24, 23, 8, 24, 7, 1, 2, 24, 8, 2, 3, 2, 6, 5, 6, 15, 5, 2, 16, 14, 0, 12, 0, 60, 3, 0, 36, 16, 18, 127, 585, 52, 230, 21, 117, 16, 681, 8, 79, 146, 42, 45, 10, 143, 120, 162, 16, 33, 504, 284, 10, 4, 12, 88, 9, 266, 161, 87, 22, 19, 92, 71, 5, 6, 94, 18, 103, 164, 21, 8, 21, 23, 46, 7, 20, 81, 23, 434, 10, 19, 21, 24, 12, 19, 60, 104, 121, 14, 23, 7, 10, 17, 16, 22, 30, 21, 126, 207, 127, 657, 163, 140, 43, 16, 111, 484, 11, 15, 267, 231, 85, 54, 12, 48, 126, 4, 14, 365, 99, 15, 17, 14, 32, 11, 60, 5, 15, 26, 29, 39, 15, 37, 24, 29, 24, 11, 133, 29, 327, 18, 23, 28, 6, 7, 13, 14, 13, 8, 17, 19, 14, 10, 12, 39, 9, 13, 5, 36, 6]


Unnamed: 0,name,categories,lat,lng,id,city,likes
0,Amritsari Lassi Wala,Snack Place,28.657325,77.224138,5662936e498e19a9801a663f,Delhi,6
2,Kake Di Hatti | काके दी हट्टी,Indian Restaurant,28.65805,77.223377,4d9d759348b6224b70c2249f,Delhi,34
5,Spice Market,Food & Drink Shop,28.657287,77.222595,5280a63211d26b82c4ba65c7,Delhi,36
7,Karim's | करीम | کریم (Karim's),Indian Restaurant,28.649498,77.233691,4b42e3c7f964a520b5da25e3,Delhi,316
8,Jama Masjid |जामा मस्जिद | جامع مسجد (Jama Ma...,Mosque,28.650136,77.233541,4b529b68f964a520b48327e3,Delhi,177
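The loop above assumes every response contains `response.likes.count`; a single failed or incomplete response would raise a `KeyError` and discard the counts collected so far. A more defensive version might look like the sketch below. The function name `fetch_likes`, the injected `get_json` callable, and the fake payloads are all assumptions made so the example runs without network access.

```python
def fetch_likes(venue_ids, get_json):
    """Return one like count per venue id, or None when the payload lacks one.

    `get_json` is any callable that maps a venue id to the decoded JSON
    response; injecting it keeps this sketch runnable offline.
    """
    likes = []
    for venue_id in venue_ids:
        payload = get_json(venue_id)
        try:
            likes.append(payload["response"]["likes"]["count"])
        except (KeyError, TypeError):
            # missing keys or a None payload: record a gap instead of aborting
            likes.append(None)
    return likes

# Made-up payloads standing in for the API; the second one imitates a
# response that carries no likes block.
fake_payloads = {
    "a1": {"response": {"likes": {"count": 6}}},
    "b2": {"meta": {"code": 429}, "response": {}},
}
print(fetch_likes(["a1", "b2"], fake_payloads.get))  # [6, None]
```

In the notebook, `get_json` could be a small function that formats the likes URL for a venue id and returns `requests.get(...).json()`. Rows with `None` can then be retried or dropped before modelling.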


In [53]:
# Now let us rename this raw dataset

raw_dataset = nearby_venues
raw_dataset.head()

Unnamed: 0,name,categories,lat,lng,id,city,likes
0,Amritsari Lassi Wala,Snack Place,28.657325,77.224138,5662936e498e19a9801a663f,Delhi,6
2,Kake Di Hatti | काके दी हट्टी,Indian Restaurant,28.65805,77.223377,4d9d759348b6224b70c2249f,Delhi,34
5,Spice Market,Food & Drink Shop,28.657287,77.222595,5280a63211d26b82c4ba65c7,Delhi,36
7,Karim's | करीम | کریم (Karim's),Indian Restaurant,28.649498,77.233691,4b42e3c7f964a520b5da25e3,Delhi,316
8,Jama Masjid |जामा मस्जिद | جامع مسجد (Jama Ma...,Mosque,28.650136,77.233541,4b529b68f964a520b48327e3,Delhi,177


### 2.2 Data Preparation
The data needs more processing before it is suitable for model training and testing. Mainly, the "categories" column contains too many different types of cuisine for a model to yield meaningful results. However, the different cuisines fall into natural groupings based on conventionally accepted cultural categories. Broadly speaking, all of them can be reclassified as European, Latin American, Asian, North American, drinking establishments (bars), or casual establishments such as coffee shops or ice cream parlours. We can classify them manually, as there are not that many different types of cuisine.

As this project compares linear and logistic regression, it makes sense to have "likes" as both a continuous and a categorical (ordinal) variable. To turn it into a categorical variable, we can bin the data based on percentiles and assign each restaurant to an ordinal percentile category. I tried different ways of binning, but in the end splitting the sample into three bins yielded the best classification results from a prediction standpoint.

As the last stage of data preparation, note that the regressors are categorical variables (3 different cities and 6 different categories of cuisine). Hence, they require dummy-variable encoding for meaningful analysis. We can accomplish this via one-hot encoding.

In [54]:
# inspecting the raw dataset shows that there may be too many different types of cuisines
raw_dataset['categories'].unique()

array(['Snack Place', 'Indian Restaurant', 'Food & Drink Shop', 'Mosque',
       'Tibetan Restaurant', 'Restaurant', 'Korean Restaurant',
       'Hardware Store', 'Paper / Office Supplies Store', 'Bar', 'Café',
       'Fast Food Restaurant', 'Food', 'Indian Chinese Restaurant',
       'Motel', 'Pizza Place', 'Flea Market', 'Breakfast Spot',
       'Coffee Shop', 'Sandwich Place', 'Light Rail Station',
       'Smoke Shop', 'Italian Restaurant', 'Road', 'Falafel Restaurant',
       'Chinese Restaurant', 'Parsi Restaurant', 'Cricket Ground',
       'Seafood Restaurant', 'Scenic Lookout', 'Beach',
       'Asian Restaurant', 'History Museum', 'Japanese Restaurant',
       'Music Store', 'Train Station', 'Pub', 'Middle Eastern Restaurant',
       'Gastropub', 'Diner', 'New American Restaurant', 'Multiplex',
       'College Academic Building', 'Stadium', 'Cocktail Bar',
       'Monument / Landmark', 'Hockey Arena', 'Mediterranean Restaurant',
       'Chaat Place', 'BBQ Joint', 'Tea Room', 'Sh

In [55]:

euro = ['French Restaurant', 'Scandinavian Restaurant', 'Souvlaki Shop', 
       'Mediterranean Restaurant', 'Italian Restaurant', 'Pizza Place']

latino = ['Mexican Restaurant', 'Latin American Restaurant', 
          'Brazilian Restaurant', 'Taco Place']

bar = ['Beer Bar', 'Cocktail Bar', 'Tiki Bar', 'Wine Bar', 'Hotel Bar',
       'Beer Garden', 'Speakeasy', 'Brewery', 'Pub', 'Bar', 'Gastropub',
       'Hookah Bar']

asian = ['Ramen Restaurant', 'Sushi Restaurant', 'Vietnamese Restaurant',
         'Thai Restaurant', 'Poke Place', 'Indian Restaurant', 
         'Japanese Curry Restaurant', 'Japanese Restaurant', 
         'Indonesian Restaurant', 'Udon Restaurant', 'Noodle House',
         'Falafel Restaurant', 'Filipino Restaurant', 'Turkish Restaurant',
         'Yoshoku Restaurant']

casual = ['Coffee Shop', 'Café', 'Sandwich Place', 'Food Truck',
          'Juice Bar', 'Frozen Yogurt Shop', 'Deli / Bodega', 'Dessert Shop',
          'Hot Dog Joint', 'Burger Joint', 'Breakfast Spot', 
          'Fondue Restaurant']

american = ['Southern / Soul Food Restaurant', 'Food & Drink Shop', 
            'Restaurant', 'American Restaurant', 'BBQ Joint', 
            'Theme Restaurant', 'New American Restaurant',
            'Vegetarian / Vegan Restaurant', 'Seafood Restaurant']

def conditions(s):
    if s['categories'] in euro:
        return 'euro'
    if s['categories'] in latino:
        return 'latino'
    if s['categories'] in asian:
        return 'asian'
    if s['categories'] in casual:
        return 'casual'
    if s['categories'] in american:
        return 'american'
    if s['categories'] in bar:
        return 'bar'
    # categories that appear in none of the lists above stay unclassified
    return None


raw_dataset['categories_classified']=raw_dataset.apply(conditions, axis=1)
raw_dataset

Unnamed: 0,name,categories,lat,lng,id,city,likes,categories_classified
0,Amritsari Lassi Wala,Snack Place,28.657325,77.224138,5662936e498e19a9801a663f,Delhi,6,
2,Kake Di Hatti | काके दी हट्टी,Indian Restaurant,28.65805,77.223377,4d9d759348b6224b70c2249f,Delhi,34,asian
5,Spice Market,Food & Drink Shop,28.657287,77.222595,5280a63211d26b82c4ba65c7,Delhi,36,american
7,Karim's | करीम | کریم (Karim's),Indian Restaurant,28.649498,77.233691,4b42e3c7f964a520b5da25e3,Delhi,316,asian
8,Jama Masjid |जामा मस्जिद | جامع مسجد (Jama Ma...,Mosque,28.650136,77.233541,4b529b68f964a520b48327e3,Delhi,177,
9,The Drunkyard Cafe,Tibetan Restaurant,28.641451,77.215506,5654478f498e405c3e4e0d6a,Delhi,15,
11,The Indian Grill Restaurant,Restaurant,28.646141,77.215133,537dcc4b498ec171ba24b0d0,Delhi,6,american
14,Sagar Ratna,Indian Restaurant,28.635487,77.22065,4cb876d7f50e224bd2d6e6fb,Delhi,38,asian
15,Shyam Sweets,Snack Place,28.65053,77.230303,4e744f38fa760597007d45e4,Delhi,6,
17,쉼터,Korean Restaurant,28.641495,77.213152,4ef092caa69ddc7bcc2172fe,Delhi,10,
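The chain of `if` statements in `conditions` can also be expressed as a single lookup table. The sketch below uses shortened, illustrative versions of the group lists and a made-up three-row sample; it is an alternative formulation, not the code that produced the table above.

```python
import pandas as pd

# Shortened, illustrative versions of the group lists defined in the notebook.
groups = {
    "euro": ["Italian Restaurant", "Pizza Place"],
    "asian": ["Indian Restaurant", "Japanese Restaurant"],
    "casual": ["Coffee Shop", "Café"],
}

# Invert the mapping so that each category points at its group.
category_to_group = {cat: group for group, cats in groups.items() for cat in cats}

sample = pd.DataFrame({"categories": ["Indian Restaurant", "Snack Place", "Café"]})
# Series.map leaves categories without an entry as missing values.
sample["categories_classified"] = sample["categories"].map(category_to_group)
print(sample)
```

A dictionary keeps each category in exactly one group and avoids a row-wise `apply`, which tends to matter more as the dataset grows.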


In [56]:
# double check to make sure categories_classified has been created correctly

pd.crosstab(index=raw_dataset["categories_classified"],
            columns="count")


categories_classified,count
american,15
asian,47
bar,15
casual,26
euro,13
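The crosstab counts only rows that received a label, so venues left unclassified do not appear in it. A quick way to see how many rows fall outside every group is to count missing values explicitly. The sketch below uses a made-up label column, not the notebook's data.

```python
import pandas as pd

# Made-up labels; None stands for a venue that matched no group.
labels = pd.Series(["asian", None, "euro", "asian", None],
                   name="categories_classified")

# dropna=False keeps the missing label as its own row in the tally.
counts = labels.value_counts(dropna=False)
print(counts)
print("unclassified rows:", int(labels.isna().sum()))
```

Applied to `raw_dataset['categories_classified']`, the same two calls would show how many venues carry no cuisine group before the one-hot encoding step.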


In [58]:
# classify the likes into different ranking levels
# let's first see where to bin the data
# we can try different ways of binning the data; I find they yield substantially different results

#print(np.percentile(raw_dataset['likes'], 20))
#print(np.percentile(raw_dataset['likes'], 40))
#print(np.percentile(raw_dataset['likes'], 60))
print(np.percentile(raw_dataset['likes'], 80))


#print(np.percentile(raw_dataset['likes'], 33))
#print(np.percentile(raw_dataset['likes'], 66))


print(np.percentile(raw_dataset['likes'], 76))

89.6
61.32000000000005


In [81]:
# create a function to bin for us

# def rankings(s):
#     if s['likes']<=34:
#         return 1
#     if s['likes']<=78:
#         return 2
#     if s['likes']<=154.6:
#         return 3
#     if s['likes']<=326.4:
#         return 4
#     if s['likes']>326.4:
#         return 5

def rankings(s):
    if s['likes']<=20:
        return 1
    if s['likes']<=70:
        return 2
    if s['likes']>70:
        return 3
    
    
# def rankings(s):
#     if s['likes']<=111.5:
#         return 0
#     if s['likes']>111.5:
#         return 1

raw_dataset['ranking']=raw_dataset.apply(rankings, axis=1)
raw_dataset.head()

Unnamed: 0,name,categories,lat,lng,id,city,likes,categories_classified,ranking
0,Amritsari Lassi Wala,Snack Place,28.657325,77.224138,5662936e498e19a9801a663f,Delhi,6,,1
2,Kake Di Hatti | काके दी हट्टी,Indian Restaurant,28.65805,77.223377,4d9d759348b6224b70c2249f,Delhi,34,asian,2
5,Spice Market,Food & Drink Shop,28.657287,77.222595,5280a63211d26b82c4ba65c7,Delhi,36,american,2
7,Karim's | करीम | کریم (Karim's),Indian Restaurant,28.649498,77.233691,4b42e3c7f964a520b5da25e3,Delhi,316,asian,3
8,Jama Masjid |जामा मस्जिद | جامع مسجد (Jama Ma...,Mosque,28.650136,77.233541,4b529b68f964a520b48327e3,Delhi,177,,3
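The active `rankings` function uses the cut points 20 and 70. As a sketch, the same binning can be written with `pd.cut`, which assigns every value to a right-closed interval in one vectorised call. The sample values below are made up to exercise both edges.

```python
import numpy as np
import pandas as pd

# Made-up like counts, including the two edge values 20 and 70.
likes = pd.Series([6, 20, 34, 70, 316])

# Intervals are (-inf, 20], (20, 70], (70, inf), matching the <= tests in rankings().
ranking = pd.cut(likes, bins=[-np.inf, 20, 70, np.inf], labels=[1, 2, 3]).astype(int)
print(ranking.tolist())  # [1, 1, 2, 2, 3]
```

When the edges should come from percentiles instead of fixed numbers, `pd.qcut` takes the number of quantile bins in place of explicit edges.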


In [82]:
# create dummies for linear regression modelling

# one hot encoding
reg_dataset = pd.get_dummies(raw_dataset[['categories_classified', 
                                          'city',]], 
                               prefix="", 
                               prefix_sep="")

# add name, ranking, and likes columns back to dataframe
reg_dataset['ranking'] = raw_dataset['ranking']
reg_dataset['likes'] = raw_dataset['likes']
reg_dataset['name'] = raw_dataset['name']

# move name column to the first column
reg_columns = [reg_dataset.columns[-1]] + list(reg_dataset.columns[:-1])
reg_dataset = reg_dataset[reg_columns]


reg_dataset.head()

Unnamed: 0,name,american,asian,bar,casual,euro,Bangalore,Delhi,Mumbai,ranking,likes
0,Amritsari Lassi Wala,0,0,0,0,0,0,1,0,1,6
2,Kake Di Hatti | काके दी हट्टी,0,1,0,0,0,0,1,0,2,34
5,Spice Market,1,0,0,0,0,0,1,0,2,36
7,Karim's | करीम | کریم (Karim's),0,1,0,0,0,0,1,0,3,316
8,Jama Masjid |जामा मस्जिद | جامع مسجد (Jama Ma...,0,0,0,0,0,0,1,0,3,177
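In the table above, the three city columns sum to one in every row, so together they duplicate the information carried by an intercept term. Whether that matters depends on the model fitted later, but a common precaution for linear regression is to drop one level per variable. The sketch below shows the effect of `drop_first=True` on a small made-up sample.

```python
import pandas as pd

# Made-up sample with the same two categorical regressors as the notebook.
sample = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Bangalore", "Delhi"],
    "categories_classified": ["asian", "euro", "asian", "bar"],
})

# Full encoding: one column per level of each variable.
full = pd.get_dummies(sample, prefix="", prefix_sep="")
# Reduced encoding: the first level of each variable becomes the baseline.
reduced = pd.get_dummies(sample, prefix="", prefix_sep="", drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

With the reduced encoding, each remaining coefficient is read relative to the dropped baseline level (here Bangalore and asian).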
