# Travel around the world from Berlin

This project asks whether it is possible to 'travel' around the world without leaving Berlin. In other words: does Berlin have at least one restaurant from every country in the world?

A key part of the project is getting data from the Foursquare API. Before that, we need a list of all countries and their denominations (demonyms), e.g. Italy - Italian. This list is used to build the search queries sent to the API.

In [1]:
import pandas as pd
import requests
import json
import pycountry_convert as pc
import getpass
pd.set_option('display.max_rows', 6000)

### Creating a list of countries using the library pycountry_convert

In [2]:
countries=[val['alpha_2'] for key,val in pc.map_countries().items()]
countries=list(dict.fromkeys(countries))
countries=[pc.country_alpha2_to_country_name(c,cn_name_format="default") for c in countries]
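
The de-duplication step in the cell above relies on `dict.fromkeys`, which keeps the first occurrence of each item in its original order. A minimal sketch with made-up alpha-2 codes (the sample list is illustrative and not taken from `pycountry_convert`):

```python
# Hypothetical sample of alpha-2 codes with repeats, since several
# name variants can map to the same code.
codes = ["AW", "AF", "AW", "AO", "AF"]

# dict keys are unique and keep insertion order (Python 3.7+),
# so this removes duplicates without reordering the list.
unique_codes = list(dict.fromkeys(codes))
print(unique_codes)
```

Using `set(codes)` would also remove duplicates, but it would not guarantee the original order.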

In [3]:
countries=countries+['Malvinas',"Macao","Kurdistan"]

In [4]:
countries=pd.DataFrame(countries,columns=['country'])

##### Changing some country names

In [5]:
to_change_country={'Bahamas, The':'Bahamas',
                   'Brunei Darussalam':'Brunei',
                   'Moldova, Republic of':'Moldova',
                   'Bolivia, Plurinational State of':'Bolivia',
                   'Åland Islands':'Aland Islands',
                   'Falkland Islands (Malvinas)':'Malvinas',
                   'Micronesia, Federated States of':'Micronesia',
                   'Cocos (Keeling) Islands':'Cocos Islands',
                   'Congo, The Democratic Republic of the':'Dem. Rep. Congo',
                   'Egypt, Arab Rep.':'Egypt',
                   'Micronesia, Fed. Sts.':'Micronesia',
                   'Gambia, The':'Gambia',
                   'Hong Kong SAR, China':'Hong Kong',
                   'Hong Kong, China':'Hong Kong',
                   'Iran, Islamic Republic of':'Iran',
                   "Lao People's Democratic Republic":"Laos",
                   'Kyrgyz Republic':'Kyrgyzstan',
                   'Syrian Arab Republic':'Syria',
                   'Korea, Rep.':'South Korea',
                   'Saint Martin (French part)':'St. Martin',
                   'Sint Maarten (Dutch part)':'St. Martin',
                   'Macedonia, FYR':'Macedonia',
                   'Korea, Republic of':'Korea',
                   'Korea, Dem. Rep.':'North Korea',
                   "Korea, Democratic People's Republic of":"North Korea",
                   'Palestine, State of':'Palestine',
                   'Russian Federation':'Russia',
                   'Slovak Republic':'Slovakia',
                   'Syrian Arab Republic':'Syria',
                   'Czechia':'Czech Republic',
                   'Taiwan, Province of China':"Taiwan",
                   'Venezuela, Bolivarian Republic of':'Venezuela',
                   'Viet Nam':'Vietnam',
                   'Réunion':'Reunion',
                   'Virgin Islands, British':'Virgin Islands',
                   'Virgin Islands, U.S.':'Virgin Islands',
                   'Tanzania, United Republic of':'Tanzania',
                   'Yemen, Rep.':'Yemen'
                  }

##### Replacing old names with new names

In [6]:
countries.replace({"country": to_change_country},inplace=True)
countries["new"]=1
countries.head()

Unnamed: 0,country,new
0,Aruba,1
1,Afghanistan,1
2,Angola,1
3,Anguilla,1
4,Aland Islands,1


### Importing a list of denominations by country, based on the work done by knowitall on GitHub
https://github.com/knowitall/chunkedextractor/blob/master/src/main/resources/edu/knowitall/chunkedextractor/demonyms.csv

In [7]:
deno=pd.read_csv("input/denonyms.csv",sep="\t",header=None,names=["deno","country"])
deno.head()

Unnamed: 0,deno,country
0,Aalborgenser,Aalborg
1,Aberdonian,Aberdeen
2,Abkhaz,Abkhazia
3,Abkhazian,Abkhazia
4,Abrenian,Abra


### Merging both lists while keeping countries that have no match in the deno df

In [8]:
list_countries=countries.merge(deno,how="left",left_on="country",right_on="country")
list_countries.head()

Unnamed: 0,country,new,deno
0,Aruba,1,Aruba
1,Afghanistan,1,Afghan
2,Angola,1,Angolan
3,Anguilla,1,Anguilla
4,Aland Islands,1,Aland Islands
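
A left merge keeps every row of the left dataframe, whether or not a match exists on the right. The sketch below uses toy stand-ins for `countries` and `deno` (the values are illustrative) and adds `indicator=True`, which is not used in the notebook, as one possible way to list the countries that found no demonym:

```python
import pandas as pd

# Toy stand-ins for the `countries` and `deno` dataframes.
countries_demo = pd.DataFrame({"country": ["Afghanistan", "Angola", "Aruba"]})
deno_demo = pd.DataFrame({"deno": ["Afghan", "Angolan"],
                          "country": ["Afghanistan", "Angola"]})

# how="left" keeps every country; indicator=True adds a `_merge` column
# telling whether the row was found in both frames or only in the left one.
merged = countries_demo.merge(deno_demo, how="left", on="country", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "country"].tolist()
print(unmatched)
```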


### Importing a list of the categories of restaurants defined by Foursquare

In [9]:
rest_categories=pd.read_csv("input/rest_categories.csv",header=None,names=["type","food","id"])
rest_categories.head()

Unnamed: 0,type,food,id
0,country,Food,4d4b7105d754a06374d81259
1,country,Afghan,503288ae91d4c4b30a586d67
2,country,African,4bf58dd8d48988d1c8941735
3,country,Ethiopian,4bf58dd8d48988d10a941735
4,country,American,4bf58dd8d48988d14e941735


#### Merging the restaurant categories df with the list of countries df

In [10]:
restaurants=list_countries.merge(rest_categories,how="outer",left_on="deno",right_on="food")
restaurants.fillna("empty",inplace=True)
restaurants=restaurants.drop_duplicates().reset_index(drop=True)
restaurants.head()

Unnamed: 0,country,new,deno,type,food,id
0,Aruba,1,Aruba,empty,empty,empty
1,Afghanistan,1,Afghan,country,Afghan,503288ae91d4c4b30a586d67
2,Angola,1,Angolan,empty,empty,empty
3,Anguilla,1,Anguilla,empty,empty,empty
4,Aland Islands,1,Aland Islands,empty,empty,empty


##### We export the restaurants df to pickle because we might need it later to check how many countries are not represented

In [11]:
restaurants.to_pickle("data/countries_complete.pkl")

## Calling the API with a query by country and by denomination

In [103]:
client_id=getpass.getpass()

········


In [104]:
client_secret=getpass.getpass()

········


In [111]:
url = 'https://api.foursquare.com/v2/venues/search'

new_df=pd.DataFrame(columns=["name","lat","lng",'query',"category"],dtype=object)
for rest in list(range(248)):
    if (restaurants["country"].iloc[rest]==restaurants["deno"].iloc[rest])|(restaurants["deno"].iloc[rest]=="empty"):
        query=restaurants["country"].iloc[rest]
        params = dict(
            client_id=client_id,
            client_secret=client_secret,
            v='20191203',
            near="Berlin",
            categoryId="4d4b7105d754a06374d81259",
            sw="13.0284,52.3712",
            ne="13.7141,52.6504",
            query=query,
            limit=50,
            offset=50)
        resp = requests.get(url=url, params=params)
        data = json.loads(resp.text)
        
        for x in range(len(data["response"]["venues"])):
            name=data["response"]["venues"][x]["name"]
            lat=data["response"]["venues"][x]["location"]["lat"]
            lng=data["response"]["venues"][x]["location"]["lng"]
            cat=data["response"]["venues"][x]["categories"][0]["name"]
            data_df=pd.DataFrame(data=[[name,lat,lng,query,cat]],columns=["name","lat","lng",'query',"category"],dtype=object)
            new_df=new_df.append(data_df,ignore_index=True)
    
    else:
        #looking by country, e.g. Italy
        query=restaurants["country"].iloc[rest]
        params = dict(
            client_id=client_id,
            client_secret=client_secret,
            v='20191203',
            near="Berlin",
            categoryId="4d4b7105d754a06374d81259",
            sw="13.0284,52.3712",
            ne="13.7141,52.6504",
            query=query,
            limit=50,
            offset=50)
        resp = requests.get(url=url, params=params)
        data = json.loads(resp.text)
        
        for x in range(len(data["response"]["venues"])):
            name=data["response"]["venues"][x]["name"]
            lat=data["response"]["venues"][x]["location"]["lat"]
            lng=data["response"]["venues"][x]["location"]["lng"]
            cat=data["response"]["venues"][x]["categories"][0]["name"]
            data_df=pd.DataFrame(data=[[name,lat,lng,query,cat]],columns=["name","lat","lng",'query',"category"],dtype=object)
            new_df=new_df.append(data_df,ignore_index=True)
        
        #looking by nationality, e.g. Italian
        query2=restaurants["deno"].iloc[rest]
        params = dict(
            client_id=client_id,
            client_secret=client_secret,
            v='20191203',
            near="Berlin",
            categoryId="4d4b7105d754a06374d81259",
            sw="13.0284,52.3712",
            ne="13.7141,52.6504",
            query=query2,
            limit=50,
            offset=50)
        resp = requests.get(url=url, params=params)
        data = json.loads(resp.text)
        try:
            for x in range(len(data["response"]["venues"])):
                name=data["response"]["venues"][x]["name"]
                lat=data["response"]["venues"][x]["location"]["lat"]
                lng=data["response"]["venues"][x]["location"]["lng"]
                cat=data["response"]["venues"][x]["categories"][0]["name"]
                data_df=pd.DataFrame(data=[[name,lat,lng,query,cat]],columns=["name","lat","lng",'query',"category"],dtype=object)
                new_df=new_df.append(data_df,ignore_index=True)
        except (KeyError,IndexError):
            #skip this query if the response has no venues or a venue has no category
            continue

KeyError: 'venues'

In [112]:
data

{'meta': {'code': 429,
  'errorType': 'quota_exceeded',
  'errorDetail': 'Quota exceeded',
  'requestId': '5dea6d5729ce6a001b847590'},
 'response': {}}
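
The payload above explains the `KeyError: 'venues'`: once the quota is exhausted, `response` comes back empty. A minimal sketch of a guard that checks the status before reading the venues, assuming successful responses carry `meta.code == 200`; the helper name `extract_venues` and the successful payload are hypothetical:

```python
def extract_venues(data):
    """Return the venue list, or an empty list when the call failed."""
    meta = data.get("meta", {})
    if meta.get("code") != 200:
        # e.g. 429 / quota_exceeded: report it and carry on instead of raising KeyError
        print(f"request failed: {meta.get('code')} {meta.get('errorType')}")
        return []
    return data.get("response", {}).get("venues", [])

# The failed payload shown above.
quota_error = {"meta": {"code": 429, "errorType": "quota_exceeded",
                        "errorDetail": "Quota exceeded",
                        "requestId": "5dea6d5729ce6a001b847590"},
               "response": {}}
# Hypothetical successful payload, reduced to the fields the loop reads.
ok = {"meta": {"code": 200},
      "response": {"venues": [{"name": "Kabul Bistro"}]}}

print(extract_venues(quota_error))
print(len(extract_venues(ok)))
```

With a guard like this the loop could stop as soon as the quota is hit and save what has been collected so far.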

### Saving the resulting df to pkl so we don't need to call the API again after restarting the kernel

In [15]:
new_df.to_pickle("data/data_restaurants_api.pkl")

In [11]:
ber_rest=pd.read_pickle("data/data_restaurants_api.pkl")
ber_rest.head()

Unnamed: 0,name,lat,lng,query,category
0,Safrani Afghan Und Persian Halal Food,52.5103,13.3055,Afghanistan,Afghan Restaurant
1,Al-Ándalus,52.5335,13.4283,Aland Islands,Tapas Restaurant
2,Al-andalous,52.4849,13.4342,Aland Islands,Falafel Restaurant
3,Al Andalos,52.4857,13.4313,Aland Islands,Falafel Restaurant
4,American Diner,52.5346,13.4193,American Samoa,Diner


## Calling the API using the ids of the different categories defined by Foursquare

In [19]:
url = 'https://api.foursquare.com/v2/venues/search'

cat_new_df=pd.DataFrame(columns=["name","lat","lng",'query',"category"],dtype=object)

for rest in range(len(restaurants)):
    if (restaurants["id"].iloc[rest]!="empty"):
        categoryId=restaurants["id"].iloc[rest]
        query=restaurants["food"].iloc[rest]
        params = dict(
                client_id=client_id,
                client_secret=client_secret,
                v='20191203',
                near="Berlin",
                categoryId=categoryId,
                sw="13.0284,52.3712",
                ne="13.7141,52.6504",
                limit=50,
                offset=50)
        resp = requests.get(url=url, params=params)
        data = json.loads(resp.text)
            
        for x in range(len(data["response"]["venues"])):
            name=data["response"]["venues"][x]["name"]
            lat=data["response"]["venues"][x]["location"]["lat"]
            lng=data["response"]["venues"][x]["location"]["lng"]
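            try:
                cat=data["response"]["venues"][x]["categories"][0]["name"]
            except (KeyError,IndexError):
                #venue has no category: use an empty label instead of reusing the previous venue's category
                cat=""
                print(f"not possible capturing category for {rest},{x}")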
            data_df=pd.DataFrame(data=[[name,lat,lng,query,cat]],columns=["name","lat","lng",'query',"category"],dtype=object)
            cat_new_df=cat_new_df.append(data_df,ignore_index=True)

not possible capturing category for 274,4
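
The message above means that a venue came back without any category. A small helper can make the fallback explicit; this is a sketch, with `first_category_name` and both sample venues being hypothetical:

```python
def first_category_name(venue, default=""):
    """Name of the venue's first category, or `default` if it has none."""
    categories = venue.get("categories") or []
    return categories[0].get("name", default) if categories else default

# Sample venues reduced to the fields that matter here.
with_cat = {"name": "Kabul Bistro", "categories": [{"name": "Afghan Restaurant"}]}
without_cat = {"name": "Unnamed venue", "categories": []}

print(first_category_name(with_cat))
print(repr(first_category_name(without_cat)))
```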


#### Saving the resulting df to pkl so we don't need to call the API again after restarting the kernel

In [22]:
cat_new_df.to_pickle("data/cat_data_restaurants_api.pkl")

In [12]:
cat_ber_rest=pd.read_pickle("data/cat_data_restaurants_api.pkl")
cat_ber_rest.tail()

Unnamed: 0,name,lat,lng,query,category
2585,Little Green Rabbit,52.5142,13.3955,Soup Place,Salad Place
2586,Suppengrün,52.512,13.4127,Soup Place,Soup Place
2587,Saigoncomnieu,52.4987,13.3571,Soup Place,Vietnamese Restaurant
2588,Suppen-Cult,52.536,13.4224,Soup Place,Soup Place
2589,Die Löffelei,52.5028,13.3655,Soup Place,Soup Place


#### Combining both dfs into a single df

In [13]:
#DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
ber_rest_complete=pd.concat([ber_rest,cat_ber_rest],ignore_index=True)
ber_rest_complete.head()

Unnamed: 0,name,lat,lng,query,category
0,Safrani Afghan Und Persian Halal Food,52.5103,13.3055,Afghanistan,Afghan Restaurant
1,Al-Ándalus,52.5335,13.4283,Aland Islands,Tapas Restaurant
2,Al-andalous,52.4849,13.4342,Aland Islands,Falafel Restaurant
3,Al Andalos,52.4857,13.4313,Aland Islands,Falafel Restaurant
4,American Diner,52.5346,13.4193,American Samoa,Diner


# Cleaning data

In [14]:
countries_dict=restaurants[["country","deno"]].set_index("deno").to_dict()["country"]

In [15]:
countries_dict={key:val for key,val in countries_dict.items() if key!=val}

## Cleaning dfs separately

### First df

In [16]:
ber_rest.head()

Unnamed: 0,name,lat,lng,query,category
0,Safrani Afghan Und Persian Halal Food,52.5103,13.3055,Afghanistan,Afghan Restaurant
1,Al-Ándalus,52.5335,13.4283,Aland Islands,Tapas Restaurant
2,Al-andalous,52.4849,13.4342,Aland Islands,Falafel Restaurant
3,Al Andalos,52.4857,13.4313,Aland Islands,Falafel Restaurant
4,American Diner,52.5346,13.4193,American Samoa,Diner


In [18]:
countries_dict_reverse={val:key for key,val in countries_dict.items()}
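
Inverting a demonym-to-country dictionary is lossy when several demonyms point to the same country: only the last one survives. A small sketch with pairs in the shape of `countries_dict` (the two Abkhazia entries appear in the demonym list above; the variable names are illustrative):

```python
# Demonym -> country pairs, in the shape of `countries_dict`.
demo_dict = {"Afghan": "Afghanistan", "Abkhaz": "Abkhazia", "Abkhazian": "Abkhazia"}

# Inverting keeps only the last demonym seen for each country.
demo_reverse = {val: key for key, val in demo_dict.items()}
print(demo_reverse)
```

This matters for the name matching that follows, since only one demonym per country is checked against the restaurant name.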

In [19]:
ber_rest_clean=pd.DataFrame(columns=ber_rest.columns,dtype=object)
for x in range(len(ber_rest)):
    if ber_rest["query"].iloc[x] in list(countries_dict.values())+list(countries_dict.keys()):
        name_split=ber_rest["name"].iloc[x].replace("-"," ").replace("'"," ").lower().split(" ")
        country=ber_rest["query"].iloc[x]
        a=countries_dict_reverse[country]
        if a.lower() in name_split or country.lower() in name_split:
            ber_rest_clean=ber_rest_clean.append(ber_rest.iloc[x].to_frame().transpose(),ignore_index=True)
        else:
            name=ber_rest["name"].iloc[x]
            lat=ber_rest["lat"].iloc[x]
            lng=ber_rest["lng"].iloc[x]
            query="other"
            cat=ber_rest["category"].iloc[x]
            temp_df=pd.DataFrame(data=[[name,lat,lng,query,cat]],columns=["name","lat","lng",'query',"category"],dtype=object)
            ber_rest_clean=ber_rest_clean.append(temp_df, ignore_index=True)
    else:
        name_split=ber_rest["name"].iloc[x].replace("-"," ").replace("'"," ").lower().split(" ")
        country=ber_rest["query"].iloc[x]
        if country.lower() in name_split:
            ber_rest_clean=ber_rest_clean.append(ber_rest.iloc[x].to_frame().transpose(),ignore_index=True)
        else:
            name=ber_rest["name"].iloc[x]
            lat=ber_rest["lat"].iloc[x]
            lng=ber_rest["lng"].iloc[x]
            query="other"
            cat=ber_rest["category"].iloc[x]
            temp_df=pd.DataFrame(data=[[name,lat,lng,query,cat]],columns=["name","lat","lng",'query',"category"],dtype=object)
            ber_rest_clean=ber_rest_clean.append(temp_df, ignore_index=True)
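
The check at the heart of the loop above is whether the country or its demonym appears as a word in the venue name. A sketch of that test as a standalone function, with a normalization similar to the loop's; `name_mentions` is a hypothetical helper, not part of the notebook:

```python
def name_mentions(name, terms):
    """True if any term appears as a whole word in the venue name."""
    words = name.replace("-", " ").replace("'", " ").lower().split()
    return any(term.lower() in words for term in terms)

print(name_mentions("Safrani Afghan Und Persian Halal Food", ["Afghanistan", "Afghan"]))
print(name_mentions("Al-Ándalus", ["Aland Islands"]))
```

Because the name is split into single words, a multi-word term such as "Aland Islands" can never match, which is why such rows end up labelled "other".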
            

In [20]:
ber_rest_clean=ber_rest_clean.drop_duplicates().reset_index(drop=True)

In [21]:
ber_rest_clean=ber_rest_clean.assign(new_category="")

In [22]:
for x in range(len(ber_rest_clean)): 
    if ber_rest_clean["query"].iloc[x]!="other":
        if ber_rest_clean["query"].iloc[x] in list(countries_dict.values())+list(countries_dict.keys()):
            country=ber_rest_clean["query"].iloc[x]
            ber_rest_clean["new_category"].iloc[x]=countries_dict_reverse[country]+" Restaurant"
        else:
            country=ber_rest_clean["query"].iloc[x]
            ber_rest_clean["new_category"].iloc[x]=country+" Restaurant"
    else:
        ber_rest_clean["new_category"].iloc[x]=ber_rest_clean["category"].iloc[x]+" Restaurant"

In [23]:
ber_rest_clean.head()

Unnamed: 0,name,lat,lng,query,category,new_category
0,Safrani Afghan Und Persian Halal Food,52.5103,13.3055,Afghanistan,Afghan Restaurant,Afghan Restaurant
1,Al-Ándalus,52.5335,13.4283,other,Tapas Restaurant,Tapas Restaurant Restaurant
2,Al-andalous,52.4849,13.4342,other,Falafel Restaurant,Falafel Restaurant Restaurant
3,Al Andalos,52.4857,13.4313,other,Falafel Restaurant,Falafel Restaurant Restaurant
4,American Diner,52.5346,13.4193,other,Diner,Diner Restaurant


### Second df

In [24]:
cat_ber_rest.head()

Unnamed: 0,name,lat,lng,query,category
0,Kabul Bistro,52.4694,13.3328,Afghan,Afghan Restaurant
1,Restorant Macedonia (Berlin),52.5047,13.3036,Afghan,Afghan Restaurant
2,chili chutney,52.502,13.4322,Afghan,Afghan Restaurant
3,Safrani Afghan Und Persian Halal Food,52.5103,13.3055,Afghan,Afghan Restaurant
4,Fruhstuckbar,52.524,13.3683,Afghan,Afghan Restaurant


In [25]:
cat_ber_rest=cat_ber_rest.replace({"query":countries_dict})
cat_ber_rest.head()

Unnamed: 0,name,lat,lng,query,category
0,Kabul Bistro,52.4694,13.3328,Afghanistan,Afghan Restaurant
1,Restorant Macedonia (Berlin),52.5047,13.3036,Afghanistan,Afghan Restaurant
2,chili chutney,52.502,13.4322,Afghanistan,Afghan Restaurant
3,Safrani Afghan Und Persian Halal Food,52.5103,13.3055,Afghanistan,Afghan Restaurant
4,Fruhstuckbar,52.524,13.3683,Afghanistan,Afghan Restaurant


In [26]:
cat_ber_rest=cat_ber_rest.assign(new_category="")

In [27]:
for x in range(len(cat_ber_rest)): 
    if cat_ber_rest["query"].iloc[x] in list(countries_dict.values())+list(countries_dict.keys()):
        country=cat_ber_rest["query"].iloc[x]
        cat_ber_rest["new_category"].iloc[x]=countries_dict_reverse[country]+" Restaurant"
    else:
        country=cat_ber_rest["category"].iloc[x]
        cat_ber_rest["new_category"].iloc[x]=country+" Restaurant"

## Final merging of the restaurant dfs

In [124]:
#DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
final_rest=pd.concat([ber_rest_clean,cat_ber_rest],ignore_index=True)

In [125]:
final_rest.head()

Unnamed: 0,name,lat,lng,query,category,new_category
0,Safrani Afghan Und Persian Halal Food,52.5103,13.3055,Afghanistan,Afghan Restaurant,Afghan Restaurant
1,Al-Ándalus,52.5335,13.4283,other,Tapas Restaurant,Tapas Restaurant Restaurant
2,Al-andalous,52.4849,13.4342,other,Falafel Restaurant,Falafel Restaurant Restaurant
3,Al Andalos,52.4857,13.4313,other,Falafel Restaurant,Falafel Restaurant Restaurant
4,American Diner,52.5346,13.4193,other,Diner,Diner Restaurant


In [126]:
final_rest["new_category"]=final_rest["new_category"].str.replace("Restaurant","").str.strip()

In [127]:
final_rest.dtypes

name            object
lat             object
lng             object
query           object
category        object
new_category    object
dtype: object

In [128]:
final_rest['lat']=final_rest['lat'].astype(float)
final_rest['lng']=final_rest['lng'].astype(float)

In [129]:
final_rest.dtypes

name             object
lat             float64
lng             float64
query            object
category         object
new_category     object
dtype: object

In [130]:
dict_to_change={
    'Burrito Place':'Mexican',
    'Bavarian':'German',
    'Irish Pub':'Irish',
    'Ramen':'Japanese',
    'Sushi':'Japanese',
    'Dim Sum':'Chinese',
    'Cantonese':'Hongkongese',
    'Noodle House':'Thai',
    'Caucasian':'Georgian',
    'Empanada':'Argentinian',
    'Himalayan':'Nepali',
    'South American':'Latin American',
    'Jewish':'Israeli',
    'Fondue':'Swiss',
    'Falafel':'Turkish',
    'Shawarma Place':'Lebanese',
    'Trattoria/Osteria':'Italian',
    'Poke Place':'Hawaiian',
    'Persian':'Iranian',
    'Kebab':'Turkish',
    'Doner':'Turkish',
    'Cajun / Creole':"American",
    'Pizza Place':'Italian',
    '':'Other'
}

In [131]:
final_rest=final_rest.replace({'new_category':dict_to_change})
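
The two steps above (stripping "Restaurant", then applying the manual mapping) can be traced on single labels. A plain-Python sketch using a subset of `dict_to_change`; the function name `normalize_label` is illustrative:

```python
# Subset of the `dict_to_change` mapping defined above.
mapping = {"Falafel": "Turkish", "Pizza Place": "Italian", "": "Other"}

def normalize_label(label):
    # Drop every "Restaurant" token, trim whitespace, then apply the manual mapping.
    cleaned = label.replace("Restaurant", "").strip()
    return mapping.get(cleaned, cleaned)

for raw in ["Falafel Restaurant Restaurant", "Tapas Restaurant Restaurant",
            "Pizza Place Restaurant", " Restaurant"]:
    print(repr(raw), "->", normalize_label(raw))
```

Labels without an entry in the mapping, such as "Tapas", pass through unchanged, and an empty label falls into "Other".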

In [132]:
final_rest.head()

Unnamed: 0,name,lat,lng,query,category,new_category
0,Safrani Afghan Und Persian Halal Food,52.510331,13.305543,Afghanistan,Afghan Restaurant,Afghan
1,Al-Ándalus,52.533491,13.428347,other,Tapas Restaurant,Tapas
2,Al-andalous,52.484928,13.434201,other,Falafel Restaurant,Turkish
3,Al Andalos,52.485735,13.431296,other,Falafel Restaurant,Turkish
4,American Diner,52.534623,13.419337,other,Diner,Diner


### Reassigning categories that are not specific enough, so that country information is not lost

Cleaning East Europe

In [133]:
mace=final_rest[final_rest["name"].str.contains('Macedonia')].index
for x in mace:   
    final_rest["new_category"].iloc[x]='Macedonian'

croat=final_rest[final_rest["name"].str.contains('Dubrov')|final_rest["name"].str.contains('Kroatis')|final_rest["name"].str.contains('Dalma')|final_rest["name"].str.contains('Zagreb')|final_rest["name"].str.contains('Borik')|final_rest["name"].str.contains('Dorfaue')|final_rest["name"].str.contains('Kadena')|final_rest["name"].str.contains('Konoba')|final_rest["name"].str.contains('Marjan')].index
for x in croat:
    final_rest["new_category"].iloc[x]='Croatian'

bulg=final_rest[final_rest["name"].str.contains('Mehana')|final_rest["name"].str.contains('Sofia')].index
for x in bulg:
    final_rest["new_category"].iloc[x]='Bulgarian'
    
rus=final_rest[final_rest["name"].str.contains('4YOU')|final_rest["name"].str.contains('Avantg')].index
for x in rus:
    final_rest["new_category"].iloc[x]='Russian'
    
pol=final_rest[final_rest["name"].str.contains('Polnische')].index
for x in pol:   
    final_rest["new_category"].iloc[x]='Polish'
    
hun=final_rest[final_rest["name"].str.contains('Goulash')].index
for x in hun:   
    final_rest["new_category"].iloc[x]='Hungarian'

aze=final_rest[final_rest["name"].str.contains('Baku')].index
for x in aze:   
    final_rest["new_category"].iloc[x]='Azerbaijani'
    
latv=final_rest[final_rest["name"].str.contains('KIRSONS')].index
for x in latv:   
    final_rest["new_category"].iloc[x]='Latvian'
    
ser=final_rest[final_rest["name"].str.contains('Zadar')].index
for x in ser:   
    final_rest["new_category"].iloc[x]='Serbian'
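
The same kind of name-based relabelling can be written without per-row loops by keeping the rules in a dictionary and assigning through `.loc` with a boolean mask. This is a sketch of an alternative, not the notebook's method; the toy frame, the `rules` dictionary and the venue "Konoba Dalmacija" are illustrative:

```python
import re
import pandas as pd

# Toy frame; starting categories are illustrative.
demo = pd.DataFrame({
    "name": ["Restorant Macedonia (Berlin)", "Konoba Dalmacija", "Suppen-Cult"],
    "new_category": ["Afghan", "Eastern European", "Soup Place"],
})

# Category -> name fragments, mirroring a few of the rules above.
rules = {
    "Macedonian": ["Macedonia"],
    "Croatian": ["Dubrov", "Kroatis", "Dalma", "Zagreb", "Konoba"],
}

for category, fragments in rules.items():
    # re.escape keeps each fragment literal; "|" joins them into one alternation.
    pattern = "|".join(re.escape(f) for f in fragments)
    mask = demo["name"].str.contains(pattern, regex=True)
    # .loc with a boolean mask assigns in place and avoids chained indexing.
    demo.loc[mask, "new_category"] = category

print(demo["new_category"].tolist())
```

Keeping the rules in one dictionary also makes it easier to review which name fragment maps to which cuisine.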

Cleaning Latin America

In [134]:
chil=final_rest[final_rest["name"].str.contains('La tía rica')].index
for x in chil:   
    final_rest["new_category"].iloc[x]='Chilean'

mex=final_rest[final_rest["name"].str.contains('El Sabor')].index
for x in mex:   
    final_rest["new_category"].iloc[x]='Mexican'
    
uru=final_rest[final_rest["name"].str.contains('Pecados')|final_rest["name"].str.contains('New Montevideo')].index
for x in uru:   
    final_rest["new_category"].iloc[x]='Uruguayan'
    
arg=final_rest[final_rest["name"].str.contains('La Despensa')].index
for x in arg:   
    final_rest["new_category"].iloc[x]='Argentinian'
    
peru=final_rest[final_rest["name"].str.contains('Peruanische')|final_rest["name"].str.contains('Serrano')|final_rest["name"].str.contains('Paracas')].index
for x in peru:
    final_rest["new_category"].iloc[x]='Peruvian'

chile=final_rest[final_rest["name"]=='El Chilenito'].index
final_rest["new_category"].iloc[chile]="Chilean"

Cleaning Scandinavian

In [135]:
swe=final_rest[final_rest["name"].str.contains('IKEA')].index
for x in swe:   
    final_rest["new_category"].iloc[x]='Swedish'
    
dan=final_rest[final_rest["name"].str.contains('Saeson Nordic Cuisine')].index
for x in dan:   
    final_rest["new_category"].iloc[x]='Danish'
    
nor=final_rest[final_rest["name"].str.contains('Munch')|final_rest["name"].str.contains('Felleshus')].index
for x in nor:
    final_rest["new_category"].iloc[x]='Norwegian'

Cleaning Asia

In [136]:
hk=final_rest[final_rest["name"].str.contains('Hong Kong')|final_rest["name"].str.contains('HongKong')|final_rest["name"].str.contains('Hongkong')].index
for x in hk:
    final_rest["new_category"].iloc[x]='Hongkongese'
    
thai=final_rest[final_rest["name"].str.contains('Cocos')|final_rest["name"].str.contains('cocos')|final_rest["name"].str.contains('Goodtime')].index
for x in thai:
    final_rest["new_category"].iloc[x]='Thai'
    
viet=final_rest[final_rest["name"].str.contains('Viet')|final_rest["name"].str.contains('viet')|final_rest["name"].str.contains('Ngoc Han')|final_rest["name"].str.contains('New Friends')].index
for x in viet:
    final_rest["new_category"].iloc[x]='Vietnamese'
    
chin=final_rest[final_rest["name"].str.contains('Hongfu')|final_rest["name"].str.contains('Phan Gia')|final_rest["name"].str.contains('New Asia Food')|final_rest["name"].str.contains('Wok to go')].index
for x in chin:
    final_rest["new_category"].iloc[x]='Chinese'
    
taiw=final_rest[final_rest["name"].str.contains('Taiwanesische')].index
final_rest["new_category"].iloc[taiw]="Taiwanese"

Cleaning Caribbean

In [137]:
cub=final_rest[final_rest["name"].str.contains('Habana')].index
for x in cub:
    final_rest["new_category"].iloc[x]="Cuban"
    
jam=final_rest[final_rest["name"].str.contains('Ya-Man')|final_rest["name"].str.contains('Rosa Caleta')].index
for x in jam:
    final_rest["new_category"].iloc[x]="Jamaican"


Cleaning Mediterranean

In [138]:
turk=final_rest[final_rest["name"].str.contains('Meyan')|final_rest["name"].str.contains('No Bananas')].index
for x in turk:
    final_rest["new_category"].iloc[x]="Turkish"

isr=final_rest[final_rest["name"].str.contains('Layla')|final_rest["name"].str.contains('Nanoosh')].index
for x in isr:
    final_rest["new_category"].iloc[x]="Israeli"

Cleaning African

In [139]:
eg=final_rest[final_rest["name"].str.contains('Ägyptische')|final_rest["name"].str.contains('Maroush')].index
for x in eg:
    final_rest["new_category"].iloc[x]="Egyptian"
    
gh=final_rest[final_rest["name"].str.contains('African Kingdom')|final_rest["name"].str.contains('Dedeede')|final_rest["name"].str.contains('Ghana')].index
for x in gh:
    final_rest["new_category"].iloc[x]="Ghanaian"
    
kn=final_rest[final_rest["name"].str.contains('Tembo African Kitchen')].index
for x in kn:
    final_rest["new_category"].iloc[x]="Kenyan"
    
ng=final_rest[final_rest["name"].str.contains('Makenene')].index
for x in ng:
    final_rest["new_category"].iloc[x]="Nigerian"

sudan=final_rest[final_rest["name"].str.contains('Sudanesische')|final_rest["name"].str.contains('Sahara')|final_rest["name"].str.contains('Kordofan')|final_rest["name"].str.contains('2 Cool')|final_rest["name"].str.contains('Basmah')|final_rest["name"].str.contains('Khartoum')|final_rest["name"].str.contains('Darfur')].index
for x in sudan:
    final_rest["new_category"].iloc[x]="Sudanese"
    
eri=final_rest[final_rest["name"].str.contains('Savanna')|final_rest["name"].str.contains('TZOM')].index
for x in eri:
    final_rest["new_category"].iloc[x]="Eritrean"

sen=final_rest[final_rest["name"].str.contains('Senegambia')].index
for x in sen:
    final_rest["new_category"].iloc[x]="Senegalese"

eti=final_rest[final_rest["name"].str.contains('Blue Nile')].index
for x in eti:
    final_rest["new_category"].iloc[x]="Ethiopian"

Cleaning Middle East

In [140]:
leb=final_rest[final_rest["name"].str.contains('Qadmous')|final_rest["name"].str.contains('Beirut')|final_rest["name"].str.contains('Tyrus')|final_rest["name"].str.contains('Al Safa')|final_rest["name"].str.contains('Fatoush')|final_rest["name"].str.contains("King's Chicken")].index
for x in leb:
    final_rest["new_category"].iloc[x]="Lebanese"
    
sir=final_rest[final_rest["name"].str.contains('Gilgamesch')|final_rest["name"].str.contains('Shaam Restaurant')].index
for x in sir:
    final_rest["new_category"].iloc[x]="Syrian"

tur=final_rest[final_rest["name"].str.contains('Esra')|final_rest["name"].str.contains('Unka Köfte Burger')].index
for x in tur:
    final_rest["new_category"].iloc[x]="Turkish"
    
mar=final_rest[final_rest["name"].str.contains('Casalot')].index
for x in mar:
    final_rest["new_category"].iloc[x]="Moroccan"



Cleaning other

In [141]:
other=final_rest[final_rest["name"].str.contains('gehts wieder?')|final_rest["new_category"].str.contains('Scandinavian')|final_rest["name"].str.contains("Jamal's Kuche")|final_rest["name"].str.contains('Outback')|final_rest["name"].str.contains('Bosphorus Cafe')|final_rest["name"].str.contains('#60')|final_rest["name"].str.contains('Ok Girl')|final_rest["name"].str.contains('Betrif')|final_rest["name"].str.contains('Malta Grill')].index
for x in other:
    final_rest["new_category"].iloc[x]="Other"
    
viet=final_rest[final_rest["name"].str.contains('YaMe NumNums')].index
for x in viet:
    final_rest["new_category"].iloc[x]="Vietnamese"
    
pal=final_rest[final_rest["name"].str.contains('Palestine Berlin')].index
for x in pal:
    final_rest["name"].iloc[x]='Al Hamra'
    final_rest["new_category"].iloc[x]="Palestinian"
    final_rest["lat"].iloc[x]=52.541788
    final_rest["lng"].iloc[x]=13.421064


In [142]:
list_areas=["Latin American","Asian","Mediterranean","Caribbean","African",'Eastern European','Middle Eastern']

In [143]:
final_rest=final_rest.assign(country=0)
final_rest=final_rest.assign(area=0)
for x in range(len(final_rest)): 
    if final_rest["new_category"].iloc[x] in list(countries_dict.values())+list(countries_dict.keys()):
        final_rest["country"].iloc[x]=1
    elif final_rest["new_category"].iloc[x] in list_areas:
        final_rest["area"].iloc[x]=1

In [144]:
final_rest.head()

Unnamed: 0,name,lat,lng,query,category,new_category,country,area
0,Safrani Afghan Und Persian Halal Food,52.510331,13.305543,Afghanistan,Afghan Restaurant,Afghan,1,0
1,Al-Ándalus,52.533491,13.428347,other,Tapas Restaurant,Tapas,0,0
2,Al-andalous,52.484928,13.434201,other,Falafel Restaurant,Turkish,1,0
3,Al Andalos,52.485735,13.431296,other,Falafel Restaurant,Turkish,1,0
4,American Diner,52.534623,13.419337,other,Diner,Diner,0,0


### Adding the continent to the dataframe

In [145]:
def continent(country):
    #regional labels (e.g. "Asian") are returned unchanged and mapped to a continent later
    if country in list_areas:
        return country
    #labels that need a manually assigned country
    special={"Korean":"South Korea","Macao":"China","Malvinas":"Argentina","Kurdish":"Turkey"}
    #translate the demonym into a country name; None if the label is not a known demonym
    country2=special.get(country,countries_dict.get(country))
    if country2 is None:
        return "no_continent"
    try:
        return pc.country_alpha2_to_continent_code(pc.country_name_to_country_alpha2(country2))
    except Exception:
        #country name not recognised by pycountry_convert
        return "no_continent"

In [146]:
final_rest=final_rest.assign(continent=final_rest["new_category"].apply(lambda x: continent(x)))

In [147]:
final_rest["continent"].value_counts()

no_continent        1227
AS                  1070
EU                   662
NA                   166
SA                   137
AF                    78
Eastern European      21
African               18
Asian                 11
Caribbean              7
OC                     5
Middle Eastern         4
Latin American         2
Name: continent, dtype: int64

In [148]:
continent_change={'AS':'Asia','EU':'Europe','NA':'North America','SA':'South America',
                  'African':'Africa','Eastern European':'Europe','Asian':'Asia','AF':'Africa',
                  'OC':'Oceania','Middle Eastern':'Asia','Latin American':'South America',
                  'Scandinavian':'Europe','Caribbean':'South America','Mediterranean':'Asia'}

In [149]:
final_rest=final_rest.replace({'continent':continent_change})

In [150]:
final_rest["continent"].value_counts()

no_continent     1227
Asia             1085
Europe            683
North America     166
South America     146
Africa             96
Oceania             5
Name: continent, dtype: int64
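
The totals above can be checked by hand from the earlier counts and the `continent_change` mapping, for example Asia = 1070 (AS) + 11 (Asian) + 4 (Middle Eastern) = 1085. The sketch below redoes that arithmetic in plain Python, with the counts copied from the first `value_counts()` output:

```python
from collections import Counter

# Counts copied from the first value_counts() output.
before = {"no_continent": 1227, "AS": 1070, "EU": 662, "NA": 166, "SA": 137,
          "AF": 78, "Eastern European": 21, "African": 18, "Asian": 11,
          "Caribbean": 7, "OC": 5, "Middle Eastern": 4, "Latin American": 2}

# The mapping defined in the notebook.
continent_change = {"AS": "Asia", "EU": "Europe", "NA": "North America", "SA": "South America",
                    "African": "Africa", "Eastern European": "Europe", "Asian": "Asia", "AF": "Africa",
                    "OC": "Oceania", "Middle Eastern": "Asia", "Latin American": "South America",
                    "Scandinavian": "Europe", "Caribbean": "South America", "Mediterranean": "Asia"}

after = Counter()
for label, n in before.items():
    # Labels without a mapping (here only "no_continent") are kept as they are.
    after[continent_change.get(label, label)] += n

print(dict(after))
```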

In [151]:
final_rest.tail()

Unnamed: 0,name,lat,lng,query,category,new_category,country,area,continent
3403,Little Green Rabbit,52.514194,13.395463,Soup Place,Salad Place,Salad Place,0,0,no_continent
3404,Suppengrün,52.511992,13.412739,Soup Place,Soup Place,Soup Place,0,0,no_continent
3405,Saigoncomnieu,52.49871,13.357113,Soup Place,Vietnamese Restaurant,Vietnamese,1,0,Asia
3406,Suppen-Cult,52.535983,13.422372,Soup Place,Soup Place,Soup Place,0,0,no_continent
3407,Die Löffelei,52.502833,13.365484,Soup Place,Soup Place,Soup Place,0,0,no_continent


### Dropping duplicates

In [152]:
final_rest_clean=final_rest.copy()

In [153]:
final_rest_clean.drop_duplicates(["name","lat", "lng"],keep='first',inplace=True)
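
Duplicates are identified by name and coordinates only, so a venue returned by several queries is kept once, and the first row in the frame decides which `query` value survives. A minimal sketch on toy rows (the shortened names and the `demo` frame are illustrative):

```python
import pandas as pd

# Toy rows: the same venue returned by two different queries.
demo = pd.DataFrame({
    "name": ["Safrani", "Safrani", "Kabul Bistro"],
    "lat": [52.5103, 52.5103, 52.4694],
    "lng": [13.3055, 13.3055, 13.3328],
    "query": ["Afghanistan", "Afghan", "Afghan"],
})

# Rows count as duplicates when name, lat and lng all match;
# keep="first" retains the earliest row of each group.
deduped = demo.drop_duplicates(["name", "lat", "lng"], keep="first").reset_index(drop=True)
print(deduped)
```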

In [154]:
final_rest_clean=final_rest_clean.reset_index(drop=True)

### Exporting the final df with all the restaurants to pickle

In [155]:
final_rest_clean.to_pickle("data/clean_restaurants.pkl")