# Assignment for Week 4
# Find the best place to open a restaurant

The client is an entrepreneur who would like to open a restaurant in Florence. He is considering two options: an Italian restaurant in CAP (postal code) area 50127 or a pizza restaurant in CAP area 50145. He is asking which of these two restaurants would be more appreciated and would therefore be the better business.

## Building DataFrame
Since the postal codes (CAP) of Florence range from 50121 to 50145, I build a DataFrame with the latitude and longitude of each postal code.

In [1]:
import sys
#!{sys.executable} -m pip install geocoder
import geocoder

def latlon(postal_code):
    global latitude, longitude
    lat_lng_coords = None

    while(lat_lng_coords is None):
      g = geocoder.arcgis('{}, Firenze, Italy'.format(postal_code))
      lat_lng_coords = g.latlng

    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]

    
    return latitude,longitude
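
The `while` loop above retries until the geocoder returns coordinates, so a postal code that never resolves would block the notebook indefinitely. A minimal sketch of a bounded retry follows. The names `geocode_with_retry`, `lookup`, `max_tries` and `pause` are hypothetical, and `lookup` stands in for any callable that returns a `[lat, lng]` pair or `None` (for instance a thin wrapper around the `geocoder.arcgis(...).latlng` call used above).

```python
import time

def geocode_with_retry(query, lookup, max_tries=5, pause=0.0):
    """Call lookup(query) until it returns coordinates or max_tries is reached.

    lookup is a placeholder for the real geocoding call; it must return a
    (latitude, longitude) pair or None.
    """
    for attempt in range(max_tries):
        coords = lookup(query)
        if coords is not None:
            return coords[0], coords[1]
        time.sleep(pause)  # wait a little before asking again
    raise RuntimeError('no coordinates for {!r} after {} tries'.format(query, max_tries))


# Usage with a stand-in lookup that fails twice before answering.
answers = iter([None, None, [43.7744, 11.2645]])
print(geocode_with_retry('50121, Firenze, Italy', lambda q: next(answers)))
```

Returning the pair, rather than writing to global variables, also lets the caller unpack the result directly.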

In [272]:
import pandas as pd
import sys
df_cap=pd.DataFrame(columns=['CAP','Latitude','Longitude'])

# NOTE: range(50121,50145) stops at 50144, so CAP 50145 itself is not geocoded here.
for cp,i in zip(range(50121,50145),range(0,50145-50121)):
    latlon(cp)
    df_cap.loc[i,'CAP']=cp
    df_cap.loc[i,'Latitude']=latitude
    df_cap.loc[i,'Longitude']=longitude
    sys.stdout.write('\r'+'Done '+str(i+1)+' of '+str(50145-50121))
    sys.stdout.flush()
print('\nAll data were scraped!')
df_cap.head()

Done 24 of 24
All data were scraped!


| | CAP | Latitude | Longitude |
|---|---|---|---|
| 0 | 50121 | 43.7744 | 11.2645 |
| 1 | 50122 | 43.7714 | 11.2616 |
| 2 | 50123 | 43.7743 | 11.2472 |
| 3 | 50124 | 43.7497 | 11.2241 |
| 4 | 50125 | 43.7479 | 11.2585 |


Now I plot the data using the folium library:

In [273]:
import sys
#!{sys.executable} -m pip install folium


import folium
map_florence = folium.Map(location=[43.7696, 11.2558], zoom_start=12)

# add markers to map
for lat, lng,cp in zip(df_cap['Latitude'], df_cap['Longitude'],df_cap['CAP']):
    label = '{}'.format(cp)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_florence)  
    
map_florence

## Get nearby venues from FourSquare
Now it is time to retrieve data from Foursquare. In this section I'll get 500 nearby venues for each point plotted on the map above.

In [4]:
import requests
CLIENT_ID = 'AMQOC0PAD201KRERHODMBXYHTVPMF1UMLLO5UDZVUB4NWOPH' # your Foursquare ID
CLIENT_SECRET = 'HDTMQKXWJD3ISPTMZPT0KL1GUYYFT1ABL4GVVIXBWL0BDDX4' # your Foursquare Secret
VERSION = '20180604'
intent='food'
LIMIT=10000
price='1'
def getNearbyVenues(names, latitudes, longitudes, radius=10000):
    
    venues_list=[]
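    # NOTE: zip() stops at the shortest input and range(1,len(names)) has len(names)-1
    # items, so the last CAP in names is skipped (hence "23 of 23" for 24 CAPs).
    for name, lat, lng, i in zip(names, latitudes, longitudes, range(1,len(names))):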
        
        sys.stdout.write('\r'+'Getting nearby values for '+str(i)+' of '+str(len(names)-1))
        sys.stdout.flush()
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&intent={}&price={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT,
            intent,
            price
            )
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['CAP', 
                  'CAP Latitude', 
                  'CAP Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
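
In `getNearbyVenues`, pairing the inputs with `range(1,len(names))` inside `zip` makes the loop one item shorter than the list of CAPs. Below is a small sketch of how `enumerate` can provide the progress counter without shortening the iteration. The function name `iter_with_progress` and the sample lists are illustrative only.

```python
def iter_with_progress(names, latitudes, longitudes):
    """Yield (position, total, name, lat, lng) for every input row."""
    rows = list(zip(names, latitudes, longitudes))
    total = len(rows)
    # enumerate() counts the rows itself, so no row is dropped
    for position, (name, lat, lng) in enumerate(rows, start=1):
        yield position, total, name, lat, lng


sample_caps = [50121, 50122, 50123]
sample_lats = [43.7744, 43.7714, 43.7743]
sample_lngs = [11.2645, 11.2616, 11.2472]
for position, total, name, lat, lng in iter_with_progress(sample_caps, sample_lats, sample_lngs):
    print('Getting nearby venues for {} of {} (CAP {})'.format(position, total, name))
```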

In [274]:
Florence_venues = getNearbyVenues(names=df_cap['CAP'],
                                   latitudes=df_cap['Latitude'],
                                   longitudes=df_cap['Longitude']
                                  )
Florence_venues.head()

Getting nearby values for 23 of 23

| | CAP | CAP Latitude | CAP Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | 50121 | 43.774415 | 11.264521 | Panini Toscani | 43.77311 | 11.25739 | Sandwich Place |
| 1 | 50121 | 43.774415 | 11.264521 | All'Antico Vinaio | 43.768511 | 11.257318 | Sandwich Place |
| 2 | 50121 | 43.774415 | 11.264521 | Pino's Sandwiches | 43.770501 | 11.261799 | Sandwich Place |
| 3 | 50121 | 43.774415 | 11.264521 | Sandwichic | 43.777179 | 11.256242 | Sandwich Place |
| 4 | 50121 | 43.774415 | 11.264521 | Vecchio Forno | 43.77713 | 11.256177 | Bakery |


In [275]:
Florence_venues=Florence_venues[Florence_venues['Venue Category'].str.contains('estaurant')]
Florence_venues.shape

(34, 7)

Only 34 restaurants are present in the data I downloaded with the Foursquare API. This is a very small number compared with the real situation.
I therefore wrote a Python script to scrape data from Tripadvisor. With this script (it took a while to run) I was able to collect a lot of useful data for this project, such as price, type of cuisine, ranking and reviews.
The script I wrote and ran is linked below (in red).

<font size="5" color="red"><a href='https://drive.google.com/file/d/1qc3kheIpHTFvGofYXJ7q1OoCn0lhQqlL/view?usp=sharing'>Click to download the scraper py file!</a></font>

Loading the scraped Tripadvisor data:

In [268]:
import pandas as pd
import requests
from io import StringIO

dwn_url='https://raw.githubusercontent.com/fpannacci/Coursera_Capstone/main/Florence_T.A._data.csv'
url = requests.get(dwn_url).text
csv_raw = StringIO(url)
data=pd.read_csv(csv_raw).iloc[:,1:]
data.head()

Unnamed: 0,Restaurant,Ranking,Reviews,Price,Range_price,Cousine_type,Address,Phone,Site,Area,Near,Url
0,Caffè Italiano,40,1.083 recensioni,€€-€€€,,"Italiana, Pizza, Mediterranea, Toscana, Italia...","Via Isola delle Stinche, 11R-13R, 50122, Firen...",+39 055 289080,,Duomo,"0,3 km da Piazza del Duomo",https://www.tripadvisor.it/Restaurant_Review-g...
1,Gustarium,50,1.990 recensioni,€,,"Pizza, Mediterranea, Italiana, Fast food, Salu...","Via Dei Cimatori 24/r, 50122, Firenze Italia",+39 055 283469,,Duomo,"0,1 km da Palazzo Vecchio",https://www.tripadvisor.it/Restaurant_Review-g...
2,Degusteria Italiana agli Uffizi,50,689 recensioni,€€-€€€,,"Mediterranea, Salutistica, Wine Bar, Italiana,...",Via Lambertesca 7r A Pochi Metri Dal Museo Uff...,+39 055 493 9867,,Centro storico di Firenze,"0,1 km da Gallerie Degli Uffizi",https://www.tripadvisor.it/Restaurant_Review-g...
3,180 Grammi Burgers & Beers,50,80 recensioni,,,Italiana,"Via Pietrapiana 11/ R Via Pietrapiana 11/ R, 5...",+39 055 386 0476,,Santa Croce,"0,6 km da Piazza del Duomo",https://www.tripadvisor.it/Restaurant_Review-g...
4,Pitti Express,50,85 recensioni,€,,"Caffè, Wine Bar, Toscana, Salutistica, Fast fo...","Via Romana 40/r, 50125, Firenze Italia",+39 055 627 8178,,Oltrarno,"0,6 km da Ponte Vecchio",https://www.tripadvisor.it/Restaurant_Review-g...


Finding Latitude and Longitude for each venue:

In [265]:
def latlon_comp_addr(complete_address):
    global latitude, longitude
    lat_lng_coords = None

    while(lat_lng_coords is None):
        #g = geocoder.mapquest('{}'.format(complete_address), key='Wxgd9xJHn8bLtY8h7xuEANsfhGN7n0GH')
        g = geocoder.arcgis('{}'.format(complete_address))
        lat_lng_coords = g.latlng

    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]

    
    return latitude,longitude

columns=list(data.columns)
columns.extend(['Latitude','Longitude'])

Since the dataset is large, I left the original geocoding code commented out so that the notebook runs faster: after the first run I exported the result to CSV, and here I re-import it.

In [280]:
'''
df=data.iloc[:,:]
df['Latitude']=''
df['Longitude']=''

for addr,i in zip(data['Address'],data.index):
    latlon_comp_addr(addr)
    df.loc[i,'Latitude']=latitude
    df.loc[i,'Longitude']=longitude
    sys.stdout.write('\r'+'Done '+str(i+1)+' of '+str(len(data.index)))
    sys.stdout.flush()
print('\nAll data were scraped!')
'''


dwn_url='https://raw.githubusercontent.com/fpannacci/Coursera_Capstone/main/data_Lon_Lat.csv'
url = requests.get(dwn_url).text
csv_raw = StringIO(url)
df=pd.read_csv(csv_raw).iloc[:,1:]

df.head()


Unnamed: 0,Restaurant,Ranking,Reviews,Price,Range_price,Cousine_type,Address,Phone,Site,Area,Near,Url,Latitude,Longitude
0,CaffÃ¨ Italiano,40,1.083 recensioni,â¬â¬-â¬â¬â¬,,"Italiana, Pizza, Mediterranea, Toscana, Italia...","Via Isola delle Stinche, 11R-13R, 50122, Firen...",+39 055 289080,,Duomo,"0,3 km da Piazza del Duomo",https://www.tripadvisor.it/Restaurant_Review-g...,43.770131,11.260304
1,Gustarium,50,1.990 recensioni,â¬,,"Pizza, Mediterranea, Italiana, Fast food, Salu...","Via Dei Cimatori 24/r, 50122, Firenze Italia",+39 055 283469,,Duomo,"0,1 km da Palazzo Vecchio",https://www.tripadvisor.it/Restaurant_Review-g...,43.770654,11.25557
2,Degusteria Italiana agli Uffizi,50,689 recensioni,â¬â¬-â¬â¬â¬,,"Mediterranea, Salutistica, Wine Bar, Italiana,...",Via Lambertesca 7r A Pochi Metri Dal Museo Uff...,+39 055 493 9867,,Centro storico di Firenze,"0,1 km da Gallerie Degli Uffizi",https://www.tripadvisor.it/Restaurant_Review-g...,43.768834,11.254358
3,180 Grammi Burgers & Beers,50,80 recensioni,,,Italiana,"Via Pietrapiana 11/ R Via Pietrapiana 11/ R, 5...",+39 055 386 0476,,Santa Croce,"0,6 km da Piazza del Duomo",https://www.tripadvisor.it/Restaurant_Review-g...,43.77141,11.26581
4,Pitti Express,50,85 recensioni,â¬,,"CaffÃ¨, Wine Bar, Toscana, Salutistica, Fast f...","Via Romana 40/r, 50125, Firenze Italia",+39 055 627 8178,,Oltrarno,"0,6 km da Ponte Vecchio",https://www.tripadvisor.it/Restaurant_Review-g...,43.76319,11.24422


In [281]:
venues_map = folium.Map(location=[43.7696, 11.2558], zoom_start=12)

# add markers to map
for lat, lng,cp in zip(df['Latitude'], df['Longitude'],df['Restaurant']):
    label = '{}'.format(cp)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='#A79AFF',
        fill=True,
        fill_color='#ECD4FF',
        fill_opacity=0.7,
        parse_html=False).add_to(venues_map)  
    
venues_map

# Data Wrangling
In this section I clean the data.

In [283]:
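# Extract the postal code (CAP) from the comma-separated address.
# The matching token keeps its leading space (e.g. ' 50122').
df['CAP']=''
for i,n in zip(df['Address'],df.index):
    for s in i.split(','):
        if s.find('50')>0:
            df.loc[n,'CAP']=s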

df['Price']=df['Price'].astype(str).str.replace(u"â¬", u"€")
df['Restaurant']=df['Restaurant'].str.replace(u"ÃÂ¨", u"è")
df.drop(columns=['Range_price','Site'], axis=1, inplace=True)
df['Price'].fillna(str(df['Price'].mode()[0]),inplace=True)
df.head()

Unnamed: 0,Restaurant,Ranking,Reviews,Price,Cousine_type,Address,Phone,Area,Near,Url,Latitude,Longitude,CAP
0,CaffÃ¨ Italiano,40,1.083 recensioni,€€-€€€,"Italiana, Pizza, Mediterranea, Toscana, Italia...","Via Isola delle Stinche, 11R-13R, 50122, Firen...",+39 055 289080,Duomo,"0,3 km da Piazza del Duomo",https://www.tripadvisor.it/Restaurant_Review-g...,43.770131,11.260304,50122
1,Gustarium,50,1.990 recensioni,€,"Pizza, Mediterranea, Italiana, Fast food, Salu...","Via Dei Cimatori 24/r, 50122, Firenze Italia",+39 055 283469,Duomo,"0,1 km da Palazzo Vecchio",https://www.tripadvisor.it/Restaurant_Review-g...,43.770654,11.25557,50122
2,Degusteria Italiana agli Uffizi,50,689 recensioni,€€-€€€,"Mediterranea, Salutistica, Wine Bar, Italiana,...",Via Lambertesca 7r A Pochi Metri Dal Museo Uff...,+39 055 493 9867,Centro storico di Firenze,"0,1 km da Gallerie Degli Uffizi",https://www.tripadvisor.it/Restaurant_Review-g...,43.768834,11.254358,50122
3,180 Grammi Burgers & Beers,50,80 recensioni,,Italiana,"Via Pietrapiana 11/ R Via Pietrapiana 11/ R, 5...",+39 055 386 0476,Santa Croce,"0,6 km da Piazza del Duomo",https://www.tripadvisor.it/Restaurant_Review-g...,43.77141,11.26581,50121
4,Pitti Express,50,85 recensioni,€,"CaffÃ¨, Wine Bar, Toscana, Salutistica, Fast f...","Via Romana 40/r, 50125, Firenze Italia",+39 055 627 8178,Oltrarno,"0,6 km da Ponte Vecchio",https://www.tripadvisor.it/Restaurant_Review-g...,43.76319,11.24422,50125
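
The loop above keeps the whole comma-separated token that contains `50`, which is why values such as `' 50122'` (with a leading space) or `'50135 Settignano'` show up later. As a hedged alternative, a regular expression can pull out just the five digits. `extract_cap` is a hypothetical helper, and the pattern assumes that every postal code of interest starts with `50`.

```python
import re

# Five digits starting with 50, not embedded in a longer number.
CAP_PATTERN = re.compile(r'(?<!\d)50\d{3}(?!\d)')

def extract_cap(address):
    """Return the first 50xxx postal code found in address, or None."""
    match = CAP_PATTERN.search(address)
    return match.group(0) if match else None


print(extract_cap('Via Dei Cimatori 24/r, 50122, Firenze Italia'))  # 50122
print(extract_cap('Via Romana 40/r, 50125, Firenze Italia'))        # 50125
print(extract_cap('Via senza codice, Firenze Italia'))              # None
```

Returning `None` for addresses without a code makes the later `dropna(subset=['CAP'])` step meaningful, whereas an empty string would survive it.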


I drop duplicate rows and the restaurants for which no cuisine type could be scraped:

In [284]:
df_bak=df
df=df.drop_duplicates()
df=df.dropna(subset=['Cousine_type'])
df=df.dropna(subset=['CAP'])
df['CAP']=df['CAP'].str.replace('50135 Settignano','50135')
df['CAP'].value_counts()[:10]

 50122    249
 50123    233
 50125     89
 50124     59
 50129     57
           54
 50121     53
 50127     34
 50100     30
 50144     22
Name: CAP, dtype: int64

Cleaning data:

In [287]:
dic={'€':1,'€-€€':2,'€€':3,'€€-€€€':4,'€€€':5,'€€€-€€€€':6,'€€€€':7,'€€€€-€€€€€':8,'€€€€€':9}
df['Price']=df['Price'].map(dic)
df['Reviews']=df['Reviews'].astype(str).str.replace(' recensioni','')
df=df[['Restaurant','Ranking','Reviews','Price','Cousine_type','Latitude','Longitude','CAP']]
df.Ranking=df.Ranking.astype(str).str.replace(',','.').astype(float)
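# regex=False makes '.' a literal dot (thousands separator) rather than a regex wildcard
df.Reviews=df.Reviews.astype(str).str.replace('.','',regex=False).astype(int)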
df['Price']=df['Price'].fillna(round(df['Price'].mean()))
df.Price=df.Price.astype(int)
df.head()

Unnamed: 0,Restaurant,Ranking,Reviews,Price,Cousine_type,Latitude,Longitude,CAP
0,CaffÃ¨ Italiano,4.0,1083,4,"Italiana, Pizza, Mediterranea, Toscana, Italia...",43.770131,11.260304,50122
1,Gustarium,5.0,1990,1,"Pizza, Mediterranea, Italiana, Fast food, Salu...",43.770654,11.25557,50122
2,Degusteria Italiana agli Uffizi,5.0,689,4,"Mediterranea, Salutistica, Wine Bar, Italiana,...",43.768834,11.254358,50122
3,180 Grammi Burgers & Beers,5.0,80,3,Italiana,43.77141,11.26581,50121
4,Pitti Express,5.0,85,1,"CaffÃ¨, Wine Bar, Toscana, Salutistica, Fast f...",43.76319,11.24422,50125
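
The cleaning step turns strings such as `1.083 recensioni` and `€€-€€€` into numbers. The same conversions on plain Python strings are sketched below. `parse_reviews` and `price_level` are hypothetical helper names, while the price scale repeats the `dic` mapping used above.

```python
PRICE_SCALE = {'€': 1, '€-€€': 2, '€€': 3, '€€-€€€': 4, '€€€': 5,
               '€€€-€€€€': 6, '€€€€': 7, '€€€€-€€€€€': 8, '€€€€€': 9}

def parse_reviews(text):
    """'1.083 recensioni' -> 1083 (the dot is a thousands separator)."""
    return int(text.replace(' recensioni', '').replace('.', ''))

def price_level(symbols, default=None):
    """Map a price string to its level on the 1-9 scale, or default if unknown."""
    return PRICE_SCALE.get(symbols, default)


print(parse_reviews('1.083 recensioni'))  # 1083
print(parse_reviews('85 recensioni'))     # 85
print(price_level('€€-€€€'))              # 4
```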


In [288]:
df['Ranking'].value_counts()

4.5    544
4.0    332
5.0    116
3.5     24
Name: Ranking, dtype: int64

The analysis of the ranking shows that the minimum value is 3.5, so no negative rating (rank < 2.5) is present in the dataset.
Below I build %qual, an indicator defined as the product of "Ranking" and "Reviews", min-max normalized to the range 0-100. I assume that "Reviews" is a reliable indicator of the turnout of a restaurant and "Ranking" a reliable indicator of its quality. %qual will be the dependent variable y in the machine learning steps that follow.

In [289]:
point=df['Ranking']*df['Reviews'].astype(float)
norm_point=(point-point.min())/(point.max()-point.min())*100
df['%qual']=norm_point
df.sort_values(by='%qual',ascending=False)
df.head()



Unnamed: 0,Restaurant,Ranking,Reviews,Price,Cousine_type,Latitude,Longitude,CAP,%qual
0,CaffÃ¨ Italiano,4.0,1083,4,"Italiana, Pizza, Mediterranea, Toscana, Italia...",43.770131,11.260304,50122,7.206921
1,Gustarium,5.0,1990,1,"Pizza, Mediterranea, Italiana, Fast food, Salu...",43.770654,11.25557,50122,16.607543
2,Degusteria Italiana agli Uffizi,5.0,689,4,"Mediterranea, Salutistica, Wine Bar, Italiana,...",43.768834,11.254358,50122,5.7227
3,180 Grammi Burgers & Beers,5.0,80,3,Italiana,43.77141,11.26581,50121,0.627489
4,Pitti Express,5.0,85,1,"CaffÃ¨, Wine Bar, Toscana, Salutistica, Fast f...",43.76319,11.24422,50125,0.669322
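
The `%qual` values in the table come from a min-max normalization of `Ranking * Reviews`. Here is a dependency-free sketch on three made-up restaurants (the numbers are illustrative, not taken from the dataset). `min_max_percent` is a hypothetical name.

```python
def min_max_percent(values):
    """Rescale values linearly so that the minimum maps to 0 and the maximum to 100."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError('all values are equal, the range is zero')
    return [(v - low) / (high - low) * 100 for v in values]


# (ranking, reviews) pairs, invented for the example
restaurants = [(5.0, 20), (4.0, 500), (4.5, 1000)]
points = [ranking * reviews for ranking, reviews in restaurants]
print(points)                   # [100.0, 2000.0, 4500.0]
print(min_max_percent(points))
```

Because the scale is anchored to the single largest product, one restaurant with a very large number of reviews compresses all the others towards 0.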


Finally, I assign an integer class to each CAP value:

In [291]:
caps=df['CAP'].unique()
# Map each distinct CAP to an integer class, numbered from 1 in order of first appearance.
di = {cap: i for i, cap in enumerate(caps, start=1)}
df['CAP_class']=df['CAP'].map(di)
df



Unnamed: 0,Restaurant,Ranking,Reviews,Price,Cousine_type,Latitude,Longitude,CAP,%qual,CAP_class
0,CaffÃ¨ Italiano,4.0,1083,4,"Italiana, Pizza, Mediterranea, Toscana, Italia...",43.770131,11.260304,50122,7.206921,1
1,Gustarium,5.0,1990,1,"Pizza, Mediterranea, Italiana, Fast food, Salu...",43.770654,11.255570,50122,16.607543,1
2,Degusteria Italiana agli Uffizi,5.0,689,4,"Mediterranea, Salutistica, Wine Bar, Italiana,...",43.768834,11.254358,50122,5.722700,1
3,180 Grammi Burgers & Beers,5.0,80,3,Italiana,43.771410,11.265810,50121,0.627489,2
4,Pitti Express,5.0,85,1,"CaffÃ¨, Wine Bar, Toscana, Salutistica, Fast f...",43.763190,11.244220,50125,0.669322,3
5,Mangiapepe,5.0,347,4,"Italiana, Mediterranea, Europea, Toscana, Ital...",43.770810,11.266070,50122,2.861350,1
6,Cibleo,4.5,82,7,Asiatica,43.771138,11.266512,50122,0.575617,1
7,I' Girone De' Ghiotti,5.0,3293,1,"Italiana, Fast food, Mediterranea, Europea, Ga...",43.770489,11.255685,50122,27.509120,1
8,Pizzagnolo - Pizza e Sfizi,5.0,76,4,"Pizza, Italiana",43.770570,11.262670,50122,0.594023,1
10,SENZ'ALTRO Bistrot,5.0,1094,4,Salutistica,43.770850,11.270300,50121,9.111141,2


# Analyzing venues
In this section I try to answer the question: "Which type of cuisine most influences the %qual parameter?"
I apply one-hot encoding to the cuisine types and then train a classification model (SVM).

In [292]:
df_cous_type_onehot=df.copy()
df_cous_type_onehot['Cousine_type']=df_cous_type_onehot['Cousine_type'].str.split(',')
for index, row in df_cous_type_onehot.iterrows(): 
    for types in row['Cousine_type']:
        df_cous_type_onehot.at[index, types] = 1
df_cous_type_onehot = df_cous_type_onehot.fillna(0) 
df_cous_type_onehot.head()
#df_cous_type_onehot.columns.unique()

Unnamed: 0,Restaurant,Ranking,Reviews,Price,Cousine_type,Latitude,Longitude,CAP,%qual,CAP_class,...,Israeliana,Pub,Peruviana,Francese,Hawaiana,Napoletana,Persiana,Australiana,Tedesca,Araba
0,CaffÃ¨ Italiano,4.0,1083,4,"[Italiana, Pizza, Mediterranea, Toscana, I...",43.770131,11.260304,50122,7.206921,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Gustarium,5.0,1990,1,"[Pizza, Mediterranea, Italiana, Fast food, ...",43.770654,11.25557,50122,16.607543,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Degusteria Italiana agli Uffizi,5.0,689,4,"[Mediterranea, Salutistica, Wine Bar, Itali...",43.768834,11.254358,50122,5.7227,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,180 Grammi Burgers & Beers,5.0,80,3,[Italiana],43.77141,11.26581,50121,0.627489,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Pitti Express,5.0,85,1,"[CaffÃ¨, Wine Bar, Toscana, Salutistica, F...",43.76319,11.24422,50125,0.669322,3,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
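
Because the cuisine string is split on `,` alone, every label after the first keeps a leading space, so the encoding ends up with separate columns for `Pizza` and ` Pizza` (both appear in the feature list used later for prediction). Below is a sketch of the same encoding with the labels stripped first. `one_hot_cuisines` is a hypothetical helper and the sample strings are short cuisine lists of the kind found in the data.

```python
def one_hot_cuisines(cuisine_strings):
    """Return (labels, rows): sorted distinct labels and one 0/1 row per input string."""
    split_rows = [[label.strip() for label in text.split(',') if label.strip()]
                  for text in cuisine_strings]
    labels = sorted({label for row in split_rows for label in row})
    rows = [[1 if label in row else 0 for label in labels] for row in split_rows]
    return labels, rows


labels, rows = one_hot_cuisines(['Pizza, Italiana', 'Italiana', 'Asiatica'])
print(labels)  # ['Asiatica', 'Italiana', 'Pizza']
for row in rows:
    print(row)
```

Stripping would merge the duplicated columns, so any feature list built from the column names would need to be rebuilt to match.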


The table above has one indicator column per cuisine type for each restaurant.
The table below groups the rows by %qual and averages the remaining numeric columns:

In [293]:
# numeric_only=True skips text columns such as Restaurant and Cousine_type
florence_grouped = df_cous_type_onehot.groupby('%qual').mean(numeric_only=True).reset_index()[1:].reset_index(drop=True)
florence_grouped.head()

Unnamed: 0,%qual,Ranking,Reviews,Price,Latitude,Longitude,CAP_class,Italiana,Pizza,Mediterranea,...,Israeliana,Pub,Peruviana,Francese,Hawaiana,Napoletana,Persiana,Australiana,Tedesca,Araba
0,0.125498,5.0,20.0,4.0,43.772967,11.265307,2.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.133864,5.0,21.0,4.0,43.76941,11.205815,18.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.142231,5.0,22.0,1.0,43.78335,11.23324,15.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.146414,4.5,25.0,7.0,43.769645,11.242646,7.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.150597,5.0,23.0,4.0,43.77324,11.25261,4.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [294]:
qual_class=[]
for i in df_cous_type_onehot['%qual']:
    if i<=10.0:
        qual_class.append(1)
    elif i>10.0 and i<=20.0:
        qual_class.append(2)
    elif i>20.0 and i<=30.0:
        qual_class.append(3)
    elif i>30.0 and i<=40.0:
        qual_class.append(4)
    elif i>40.0 and i<=50.0:
        qual_class.append(5)
    elif i>50.0 and i<=60.0:
        qual_class.append(6)
    elif i>60.0 and i<=70.0:
        qual_class.append(7)
    elif i>70.0 and i<=80.0:
        qual_class.append(8)
    elif i>80.0 and i<=90.0:
        qual_class.append(9)
    elif i>=90.0:
        qual_class.append(10)
    else:
        qual_class.append(0)
df_cous_type_onehot['%qual_class']=qual_class
df_cous_type_onehot.sort_values(by='%qual_class', ascending=True).head()
df_cous_type_onehot[df_cous_type_onehot['%qual_class']==10]

Unnamed: 0,Restaurant,Ranking,Reviews,Price,Cousine_type,Latitude,Longitude,CAP,%qual,CAP_class,...,Pub,Peruviana,Francese,Hawaiana,Napoletana,Persiana,Australiana,Tedesca,Araba,%qual_class
69,Trattoria ZÃ ZÃ,4.5,13286,4,"[Steakhouse, Mediterranea, Italiana, Toscan...",43.776421,11.254459,50123,100.0,4,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10
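
The `if`/`elif` chain above assigns one class per 10-point band of `%qual`. A compact sketch of the same banding with the standard `bisect` module is shown here. `qual_to_class` is a hypothetical name, and missing values are mapped to 0 as in the `else` branch.

```python
import math
from bisect import bisect_left

# Upper edges of classes 1..9; anything above 90 falls into class 10.
UPPER_EDGES = [10, 20, 30, 40, 50, 60, 70, 80, 90]

def qual_to_class(value):
    """Return the 1-10 class of a %qual value, or 0 if the value is missing (NaN)."""
    if math.isnan(value):
        return 0
    # bisect_left counts the edges strictly below value, so a value equal
    # to an edge stays in the lower class (10.0 -> 1, 10.1 -> 2).
    return bisect_left(UPPER_EDGES, value) + 1


for sample in (0.627489, 10.0, 16.607543, 27.50912, 90.0, 100.0):
    print(sample, qual_to_class(sample))
```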


Preparing data for SVM prediction

In [297]:
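import numpy as np
# Features: CAP_class plus the one-hot cuisine columns (column 9 up to, but excluding, the last one).
X=np.asarray(df_cous_type_onehot.iloc[:,9:-1])
# Target: the %qual_class label stored in the last column.
y=np.asarray(df_cous_type_onehot.iloc[:,-1])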

Splitting the data into training and test sets:

In [298]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test=train_test_split(X,y,test_size=0.2,random_state=4)

In [299]:
from sklearn import svm
from sklearn.metrics import f1_score

clf=svm.SVC(kernel='rbf',gamma='auto')
clf.fit(X_train,y_train)
yhat=clf.predict(X_test)
yhat

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1])

Finding f1-score for the model

In [300]:
print('F1-score is: '+str(f1_score(y_test, yhat, average='weighted')))

F1-score is: 0.8768115942028986
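
Since the first model predicts class 1 for every test row, it helps to see how a weighted F1-score behaves on imbalanced labels. The sketch below computes it by hand for a constant prediction. `weighted_f1` is a hypothetical helper (not the scikit-learn implementation) and the labels are invented. A class with no true positives is given an F1 of 0.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Average of the per-class F1 scores, weighted by each class's share of y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when there are no true positives
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        score += f1 * count / total
    return score


y_true = [1] * 9 + [2]   # nine restaurants in class 1, one in class 2
y_pred = [1] * 10        # a model that always answers class 1
print(round(weighted_f1(y_true, y_pred), 4))  # 0.8526
```

A constant prediction can therefore score highly when one class dominates, so the score is best read together with the class distribution of the test set.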


Finding best C-value for the model:

In [191]:
from sklearn import svm
from sklearn.metrics import f1_score

cval=[800,900,1000,2000,5000,10000,15000,20000,30000,40000]
scores=[]
for c in cval:
    clf=svm.SVC(kernel='rbf',gamma='auto',C=c)
    clf.fit(X_train,y_train)
    yhat=clf.predict(X_test)
    scores.append(f1_score(y_test, yhat, average='weighted'))
score_df=pd.DataFrame(columns=['C-value','F1_score'])
score_df['C-value']=cval
score_df['F1_score']=scores

# idxmax() returns the index of the first row with the highest F1 score
best_C=score_df.loc[score_df['F1_score'].idxmax(),'C-value']
print('Best C-Value for the model is: '+str(best_C))
score_df

Best C-Value for the model is: 20000


| | C-value | F1_score |
|---|---|---|
| 0 | 800 | 0.878349 |
| 1 | 900 | 0.878349 |
| 2 | 1000 | 0.878349 |
| 3 | 2000 | 0.878349 |
| 4 | 5000 | 0.87451 |
| 5 | 10000 | 0.87451 |
| 6 | 15000 | 0.868539 |
| 7 | 20000 | 0.881244 |
| 8 | 30000 | 0.881244 |
| 9 | 40000 | 0.877845 |


Modeling with a C-value of 20000

In [301]:
from sklearn import svm
from sklearn.metrics import f1_score

clf=svm.SVC(kernel='rbf',gamma='auto',C=20000)
clf.fit(X_train,y_train)
yhat=clf.predict(X_test)
yhat

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 1, 3, 1, 1, 2, 1, 1, 1, 1,
       1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 2, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1])

Now that I have the model, I can predict whether it is better to open a "Toscana" restaurant in CAP 50127 or a "Pizza" restaurant in CAP 50145.

In [302]:
print('Class for 50127 is: '+str(di[' 50127']))
print('Class for 50145 is: '+str(di[' 50145']))

Class for 50127 is: 10
Class for 50145 is: 24


Defining function to build arrays:

In [337]:
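def array_cal(CAP,cous_type):
    # NOTE: CAP is formatted here but never used afterwards, so the 'CAP_class'
    # feature (first entry of ls) is 0 unless 'CAP_class' is passed in cous_type.
    CAP=' '+str(CAP)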
    ls=['CAP_class','Italiana',' Pizza',' Mediterranea',' Toscana',' Italiana (centro)','Pizza',' Italiana',' Fast food',' Salutistica',' Birreria',' Internazionale','Mediterranea',' Wine Bar','CaffÃ¨',' Cibo di strada',' Europea','Asiatica',' Gastronomia','Salutistica','Toscana','Steakhouse',' Contemporanea',' Steakhouse',' Grill',' Birrerie con ristorante',' Gastropub','Pesce',' Pesce',' Bar',' Barbecue',' Ristoranti con bar',' CaffÃ¨','Indiana',' Napoletana',' Campana',' Italiana (sud)',' Asiatica','Birreria','Cinese','Contemporanea','Ristoranti con bar',' Giapponese',' Sushi','Americana','Europea','Internazionale',' Americana del sud-ovest',' Centro americana',' Fusion',' Pub',' Turca',' Romana',' Laziale',' Mediorientale','Giapponese',' Zuppe',' Vietnamita',' Tailandese','Birrerie con ristorante','Fast food','Cibo di strada','Italiana (nord)','Persiana',' Argentina','Messicana',' Americana','Bar',' Spagnola',' Catalana',' Greca',' Italiana (nord)',' Siciliana','Barbecue',' Latino americana',' Venezuelana',' Sudamericana','Armena',' Caucasica',' Georgiana','Italiana (centro)',' Ligure',' Irlandese','Spagnola',' Araba','Wine Bar',' Cinese','Libanese','Tailandese','Etiope',' Africana',' Cubana',' Caraibica','Irlandese',' Coreana',' Diner','Peruviana',' Cingalese','Brasiliana',' Israeliana','Pub',' Peruviana',' Francese',' Hawaiana','Napoletana',' Persiana',' Australiana',' Tedesca','Araba']
    arr=[]
    for i in ls:
        if i in cous_type:
            arr.append(1)
        else:
            arr.append(0)
    return np.array([arr])
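
In `array_cal` the `CAP` argument does not reach the returned array: `'CAP_class'` is tested against `cous_type` like any cuisine label. Here is a hedged sketch of how the CAP class could be placed in the first feature instead. `build_features`, `cap_to_class` and the shortened feature list are hypothetical, and `cap_to_class` plays the role of the `di` dictionary built earlier. Predictions obtained with such a vector may differ from the ones reported in this notebook.

```python
def build_features(cap, cuisines, cap_to_class, feature_names):
    """Return one feature row: the CAP class first, then 0/1 flags per cuisine column."""
    row = []
    for name in feature_names:
        if name == 'CAP_class':
            row.append(cap_to_class[cap])   # numeric class of the postal code
        else:
            row.append(1 if name in cuisines else 0)
    return [row]


# Shortened, illustrative inputs; the real list has one entry per model column.
feature_names = ['CAP_class', 'Italiana', 'Pizza', 'Toscana']
cap_to_class = {' 50127': 10, ' 50145': 24}   # classes printed by the notebook above
print(build_features(' 50127', ['Toscana'], cap_to_class, feature_names))  # [[10, 0, 0, 1]]
print(build_features(' 50145', ['Pizza'], cap_to_class, feature_names))    # [[24, 0, 1, 0]]
```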

Below I predict the quality class for the two options proposed by the client: a "Toscana" restaurant in the 50127 area or a "Pizza" restaurant in the 50145 area.

In [338]:
Toscana=array_cal(50127,['Toscana'])
Pizza=array_cal(50145,['Pizza'])
print('Predicted class for "Toscana" restaurant is: '+str(clf.predict(Toscana)[0]))
print('Predicted class for "Pizza" restaurant is: '+str(clf.predict(Pizza)[0]))

Predicted class for "Toscana" restaurant is: 2
Predicted class for "Pizza" restaurant is: 1


The analysis indicates that it is better to open a "Toscana" restaurant in the 50127 area, since its predicted quality class (2) is higher than that of the "Pizza" restaurant in the 50145 area (1).