# Applied Data Science: Final Project
This notebook is used for the final project of the "Applied Data Science" course offered on Coursera.

## Introduction

Toronto and New York are two internationally renowned cities with countless tourist attractions. Our client wants to take advantage of this flow of visitors by opening a food-service establishment, but is unsure which specific subcategory to pursue and in which part of the city to set up. The goal of this report is to use two data sets to identify the most suitable categories and locations to meet that need.

## Data
For this purpose, two data sets describing the districts of New York and Toronto are available, together with access to the Foursquare database to retrieve the venues offered in both cities.

In [1]:
import numpy as np
import pandas as pd
import requests
import re
from bs4 import BeautifulSoup

I download the full HTML of the page so that I can locate the markup I need.

In [2]:
web_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(web_url)

I extract the relevant fragment (the first table on the page) and convert it into a data frame.

In [3]:
soup = BeautifulSoup(response.text, 'html.parser')
table= soup.findAll('table')[0]
df = pd.read_html(str(table))[0]

Because the cells of the data frame above are strings, I split them to extract the information I need. The values are stored in dictionaries, taking care to assign np.nan to cells marked "Not assigned".

In [4]:
j=0
Postal_Code = {}
Borough = {}
Neighborhood = {}
for i in range(len(df.iloc[0,:])):
    for place in df.iloc[:,i]:
        Postal_Code[j] = place[0:3]
        if place[3:] == "Not assigned":
            Borough[j] = np.nan
            Neighborhood[j] = np.nan
        else:
            S = place[0:place.find(')')].replace(" / ", ", ")
            Borough[j] = S[3:].split("(")[0]
            Neighborhood[j] = S.split("(")[1]
        j = j+1
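
The same split can also be expressed with a regular expression. The sketch below is only an illustration: the helper name `parse_cell` and the pattern are assumptions, and the cell layout (three-character postal code, borough, neighborhoods in parentheses separated by " / ") is inferred from the slicing in the loop above.

```python
import re

# Assumed cell layout: postal code, borough, then neighborhoods in parentheses
CELL_PATTERN = re.compile(r"^(?P<code>[A-Z]\d[A-Z])(?P<borough>[^(]+)\((?P<hoods>[^)]*)\)")


def parse_cell(cell):
    """Return (postal_code, borough, neighborhoods) for one table cell.

    Borough and neighborhoods are None when the cell has no parenthesised
    part, which is the case for "Not assigned" cells.
    """
    code = cell[:3]
    match = CELL_PATTERN.match(cell)
    if match is None:
        return code, None, None
    borough = match.group("borough").strip()
    hoods = match.group("hoods").replace(" / ", ", ")
    return code, borough, hoods


print(parse_cell("M3ANorth York(Parkwoods)"))
print(parse_cell("M1ANot assigned"))
```

Returning `None` instead of `np.nan` keeps the helper free of dependencies; pandas treats both as missing values when `dropna` is called.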

I build a new data frame from the dictionaries for postal code, borough, and neighborhood, and drop the rows that contain np.nan.

In [5]:
P = list(Postal_Code.values())
B = list(Borough.values())
N = list(Neighborhood.values())
table_df = pd.DataFrame(P)
table_df.columns = ["PostalCode"]
table_df['Borough'] = B
table_df['Neighborhood'] = N
table_df.dropna(subset=['Borough'], inplace=True)
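
A data frame like this one can also be built in a single call from a dictionary of columns. The sketch below uses toy values rather than the scraped table, and the variable names are placeholders. It also resets the index after dropping rows, which matters if other columns are later attached by position.

```python
import numpy as np
import pandas as pd

# Toy values standing in for the parsed columns (not the real table)
postal_codes = ["M1A", "M3A", "M4A"]
boroughs = [np.nan, "North York", "North York"]
neighborhoods = [np.nan, "Parkwoods", "Victoria Village"]

toy_df = pd.DataFrame({
    "PostalCode": postal_codes,
    "Borough": boroughs,
    "Neighborhood": neighborhoods,
})

# Drop unassigned rows, then renumber the index from 0
toy_df = toy_df.dropna(subset=["Borough"]).reset_index(drop=True)
print(toy_df)
```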

In [6]:
table_df.shape

(103, 3)

The `io` package is used to read the file with the latitude and longitude data (reading it directly with pandas did not work for me). I print the shape of the file to confirm that it matches the shape of my original data frame.

In [7]:
import io

In [8]:
URL = 'http://cocl.us/Geospatial_data'
r = requests.get(URL, allow_redirects=True)
data_file = io.StringIO(r.text)
ll = pd.read_csv(data_file)
ll.shape

(103, 3)
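
The reason this works is that `pd.read_csv` accepts any file-like object, so text that has already been downloaded can be wrapped in `io.StringIO`. A minimal sketch with toy CSV text follows; the header names and values are assumptions, not the contents of the real file.

```python
import io

import pandas as pd

# Toy CSV text standing in for r.text; header names and values are made up
csv_text = "Postal Code,Latitude,Longitude\nA1A,10.0,-20.0\nB2B,11.5,-21.5\n"

# Wrap the string in a file-like object so read_csv can parse it
toy_ll = pd.read_csv(io.StringIO(csv_text))
print(toy_ll.shape)
```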

I add the latitude and longitude columns.

In [9]:
table_df["Latitude"] = ll["Latitude"]
table_df["Longitude"] = ll["Longitude"]
table_df.head(10)

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 1 | M1B | Scarborough | Malvern, Rouge | 43.784535 | -79.160497 |
| 2 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek | 43.763573 | -79.188711 |
| 3 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.770992 | -79.216917 |
| 4 | M1G | Scarborough | Woburn | 43.773136 | -79.239476 |
| 5 | M1H | Scarborough | Cedarbrae | 43.744734 | -79.239476 |
| 6 | M1J | Scarborough | Scarborough Village | 43.727929 | -79.262029 |
| 7 | M1K | Scarborough | Kennedy Park, Ionview, East Birchmount Park | 43.711112 | -79.284577 |
| 8 | M1L | Scarborough | Golden Mile, Clairlea, Oakridge | 43.716316 | -79.239476 |
| 9 | M1M | Scarborough | Cliffside, Cliffcrest, Scarborough Village West | 43.692657 | -79.264848 |
| 10 | M1N | Scarborough | Birch Cliff, Cliffside West | 43.75741 | -79.273304 |
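
Assigning a column from one data frame to another aligns rows by index label, not by position. Since `table_df` keeps its original index after `dropna` (the first row above is labelled 1), a join keyed on the postal code is a safer way to attach coordinates. The sketch below uses toy values, and the key column name `Postal Code` in the coordinates file is an assumption.

```python
import pandas as pd

# Toy frames: `left` has an index starting at 1, like a frame after dropna
left = pd.DataFrame(
    {"PostalCode": ["M1B", "M1C"], "Borough": ["Scarborough", "Scarborough"]},
    index=[1, 2],
)
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],  # assumed key column name
    "Latitude": [1.0, 2.0],         # made-up values
    "Longitude": [-1.0, -2.0],
})

# Index-aligned assignment: label 1 picks up the second coordinate row,
# and label 2 finds no match at all
by_index = left.assign(Latitude=coords["Latitude"])

# Key-based join: every postal code gets its own coordinates
merged = left.merge(
    coords, left_on="PostalCode", right_on="Postal Code", how="left"
).drop(columns="Postal Code")

print(by_index)
print(merged)
```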


In [10]:
def fill_city(response, city, ll):
    # Add one row per venue in a Foursquare response to the `city` data frame.
    # Missing fields become np.nan so the rows can be removed later with dropna().
    # A response without 'results' yields a single all-NaN row tagged with `ll`.
    rows = []
    for venue in response.get('results', [{}]):
        category = (venue.get('categories') or [{}])[0]
        location = venue.get('location') or {}
        rows.append({
            'fsq_id': venue.get('fsq_id', np.nan),
            'Category_id': category.get('id', np.nan),
            'Category': category.get('name', np.nan),
            'Borough': location.get('locality', np.nan),
            'Distance': venue.get('distance', np.nan),
            'Ref': ll,
        })

    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
    return pd.concat([city, pd.DataFrame(rows)], ignore_index=True)
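
The pattern in `fill_city`, reading a nested field and falling back to `np.nan` when it is missing, can be factored into a small helper. The sketch below is one possible version; the name `dig` and the sample venue are hypothetical and are not part of the Foursquare API.

```python
import numpy as np


def dig(data, *path, default=np.nan):
    """Follow keys or list indices in `path` through nested data.

    Returns `default` as soon as one step is missing.
    """
    current = data
    for step in path:
        try:
            current = current[step]
        except (KeyError, IndexError, TypeError):
            return default
    return current


# Hypothetical venue with a category but no locality or distance
venue = {
    "fsq_id": "abc",
    "categories": [{"id": 13065, "name": "Restaurant"}],
    "location": {},
}

print(dig(venue, "categories", 0, "name"))
print(dig(venue, "location", "locality"))
```

Catching only `KeyError`, `IndexError`, and `TypeError` keeps unrelated errors visible, which a bare `except` would hide.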

In [11]:
import os

columns = ['fsq_id','Category_id','Category','Borough','Distance','Ref']
Toronto = pd.DataFrame(columns=columns)

LIMIT = 50 # limit of number of venues returned by Foursquare API
radius = 500 # search radius in meters
url = "https://api.foursquare.com/v3/places/search"
headers = {
    "Accept": "application/json",
    # API key read from an environment variable instead of being hard-coded;
    # the variable name FOURSQUARE_API_KEY is an assumption
    "Authorization": os.environ["FOURSQUARE_API_KEY"]
}

for j in range(len(table_df["Latitude"])):
    neighborhood_latitude = table_df.iloc[j,3]
    neighborhood_longitude = table_df.iloc[j,4]
    ll = str(neighborhood_latitude) + "," + str(neighborhood_longitude)

    params = {
      "ll":ll,
      "radius":radius,
      "sort":"DISTANCE",
      "limit":LIMIT
    }

    response = requests.get(url, params=params, headers=headers).json()
    Toronto = fill_city(response,Toronto,ll)

# Remove incomplete rows and venues returned for more than one search point
Toronto.dropna(subset=['fsq_id','Category_id','Category','Borough','Distance','Ref'], inplace=True)
Toronto = Toronto.drop_duplicates(subset=['fsq_id'])
# Keep only category IDs in the 13000-13999 range
Toronto = Toronto[(Toronto['Category_id']>=13000) & (Toronto['Category_id']<14000)].reset_index()
Toronto.head()

| | index | fsq_id | Category_id | Category | Borough | Distance | Ref |
|---|---|---|---|---|---|---|---|
| 0 | 30 | 10103b2b0f34456fabf3ae6c | 13064 | Pizzeria | Scarborough | 211 | 43.7635726,-79.1887115 |
| 1 | 34 | 7485f175aa1e4b9f71ad0467 | 13065 | Restaurant | Scarborough | 249 | 43.7635726,-79.1887115 |
| 2 | 36 | 6174b5dd763d48164d49f4d9 | 13065 | Restaurant | Scarborough | 254 | 43.7635726,-79.1887115 |
| 3 | 39 | 31b2cdeda7504c3015f9294f | 13032 | Cafes, Coffee, and Tea Houses | Scarborough | 264 | 43.7635726,-79.1887115 |
| 4 | 41 | 13a8fb2bfdff4e077e161df3 | 13145 | Fast Food Restaurant | Scarborough | 291 | 43.7635726,-79.1887115 |
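
The category filter keeps only IDs in the 13000 to 13999 range. A minimal sketch of the same filter on toy data follows, assuming the IDs may arrive as a mix of integers and strings; the sample rows are made up, and 11000 is just an arbitrary ID outside the range.

```python
import pandas as pd

# Toy venues; the last ID is deliberately a string
venues = pd.DataFrame({
    "Category": ["Pizzeria", "Bank", "Cafe"],
    "Category_id": [13064, 11000, "13032"],
})

# Coerce to numbers first so the comparison behaves the same for every row
ids = pd.to_numeric(venues["Category_id"], errors="coerce")

# between() is inclusive on both ends by default
food = venues[ids.between(13000, 13999)].reset_index(drop=True)
print(food)
```

Using `reset_index(drop=True)` avoids the extra `index` column that appears in the output above.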


In [30]:
District_Toronto = Toronto[Toronto['Borough'] == 'Toronto']
District_Toronto['Category'].value_counts(normalize=True)[:20]

Café                             0.086420
Restaurant                       0.061728
Bakery                           0.054321
Pizzeria                         0.051852
Deli                             0.044444
Cafes, Coffee, and Tea Houses    0.039506
Diner                            0.037037
Coffee Shop                      0.034568
Bar                              0.029630
Burger Joint                     0.024691
Chinese Restaurant               0.022222
Italian Restaurant               0.022222
American Restaurant              0.019753
Pub                              0.019753
Sushi Restaurant                 0.019753
Thai Restaurant                  0.017284
Ice Cream Parlor                 0.017284
Fried Chicken Joint              0.017284
Cocktail Bar                     0.017284
Asian Restaurant                 0.014815
Name: Category, dtype: float64
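
The figures above are shares rather than counts, because `normalize=True` divides each count by the total number of rows. A small sketch with made-up categories shows the difference:

```python
import pandas as pd

# Made-up category labels
categories = pd.Series(["Café", "Café", "Bakery", "Pizzeria"])

counts = categories.value_counts()                # absolute counts
shares = categories.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares)
```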

In [13]:
import json
NY_url = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs/newyork_data.json'
NYresponse = requests.get(NY_url).json()
NY_data = NYresponse['features']
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']
neighborhoods = pd.DataFrame(columns=column_names)

In [14]:
rows = []
for data in NY_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

# DataFrame.append was removed in pandas 2.0, so the rows are collected first
neighborhoods = pd.DataFrame(rows, columns=column_names)

In [22]:
neighborhoods.head()

| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Bronx | Wakefield | 40.894705 | -73.847201 |
| 1 | Bronx | Co-op City | 40.874294 | -73.829939 |
| 2 | Bronx | Eastchester | 40.887556 | -73.827806 |
| 3 | Bronx | Fieldston | 40.895437 | -73.905643 |
| 4 | Bronx | Riverdale | 40.890834 | -73.912585 |


In [23]:
import os

columns = ['fsq_id','Category_id','Category','Borough','Distance','Ref']
NY = pd.DataFrame(columns=columns)

LIMIT = 50 # limit of number of venues returned by Foursquare API
radius = 500 # search radius in meters
url = "https://api.foursquare.com/v3/places/search"
headers = {
    "Accept": "application/json",
    # API key read from an environment variable instead of being hard-coded;
    # the variable name FOURSQUARE_API_KEY is an assumption
    "Authorization": os.environ["FOURSQUARE_API_KEY"]
}

for j in range(len(neighborhoods['Latitude'])):
    neighborhood_latitude = neighborhoods.iloc[j,2]
    neighborhood_longitude = neighborhoods.iloc[j,3]
    ll = str(neighborhood_latitude) + "," + str(neighborhood_longitude)

    params = {
      "ll":ll,
      "radius":radius,
      "sort":"DISTANCE",
      "limit":LIMIT
    }

    response = requests.get(url, params=params, headers=headers).json()
    NY = fill_city(response,NY,ll)

# Remove incomplete rows and venues returned for more than one search point
NY.dropna(subset=['fsq_id','Category_id','Category','Borough','Distance','Ref'], inplace=True)
NY = NY.drop_duplicates(subset=['fsq_id'])
# Keep only category IDs in the 13000-13999 range
NY = NY[(NY['Category_id']>=13000) & (NY['Category_id']<14000)].reset_index()
NY.head()

| | index | fsq_id | Category_id | Category | Borough | Distance | Ref |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 4c537892fd2ea593cb077a28 | 13046 | Ice Cream Parlor | Bronx | 127 | 40.89470517661,-73.84720052054902 |
| 1 | 21 | 4f32458019836c91c7c734ff | 13039 | Deli | Bronx | 326 | 40.89470517661,-73.84720052054902 |
| 2 | 23 | 6f93e0c92a9b44cf1c61f9c2 | 13065 | Restaurant | New York | 331 | 40.89470517661,-73.84720052054902 |
| 3 | 40 | 537cb882438c4dd02a4eddbe | 13065 | Restaurant | Bronx | 417 | 40.89470517661,-73.84720052054902 |
| 4 | 49 | 763c9059e609468d24d2e6ab | 13064 | Pizzeria | Bronx | 445 | 40.89470517661,-73.84720052054902 |


In [25]:
NY = NY.drop('index',axis=1)
NY.head()

| | fsq_id | Category_id | Category | Borough | Distance | Ref |
|---|---|---|---|---|---|---|
| 0 | 4c537892fd2ea593cb077a28 | 13046 | Ice Cream Parlor | Bronx | 127 | 40.89470517661,-73.84720052054902 |
| 1 | 4f32458019836c91c7c734ff | 13039 | Deli | Bronx | 326 | 40.89470517661,-73.84720052054902 |
| 2 | 6f93e0c92a9b44cf1c61f9c2 | 13065 | Restaurant | New York | 331 | 40.89470517661,-73.84720052054902 |
| 3 | 537cb882438c4dd02a4eddbe | 13065 | Restaurant | Bronx | 417 | 40.89470517661,-73.84720052054902 |
| 4 | 763c9059e609468d24d2e6ab | 13064 | Pizzeria | Bronx | 445 | 40.89470517661,-73.84720052054902 |


In [27]:
NY.groupby(['Borough']).count()

| Borough | fsq_id | Category_id | Category | Distance | Ref |
|---|---|---|---|---|---|
| Arverne | 9 | 9 | 9 | 9 | 9 |
| Astoria | 52 | 52 | 52 | 52 | 52 |
| Bayside | 20 | 20 | 20 | 20 | 20 |
| Bedford-Stuyvesant | 1 | 1 | 1 | 1 | 1 |
| Beechhurst | 1 | 1 | 1 | 1 | 1 |
| Belle Harbor | 4 | 4 | 4 | 4 | 4 |
| Bellerose | 3 | 3 | 3 | 3 | 3 |
| Breezy Point | 5 | 5 | 5 | 5 | 5 |
| Briarwood | 2 | 2 | 2 | 2 | 2 |
| Broad Channel | 7 | 7 | 7 | 7 | 7 |
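
Note that the `Borough` column holds the `locality` field returned by Foursquare, which is why values such as Astoria appear next to actual boroughs. Since `count()` repeats the same number in every column, a single count per group sorted in descending order may be easier to read. A sketch on toy rows:

```python
import pandas as pd

# Toy venues; the real data frame has one row per Foursquare venue
venues = pd.DataFrame({
    "Borough": ["Bronx", "Bronx", "Astoria"],
    "Category": ["Deli", "Pizzeria", "Bakery"],
})

# size() returns one number per group instead of one per column
per_borough = venues.groupby("Borough").size().sort_values(ascending=False)
print(per_borough)
```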


In [31]:
District_Brooklyn = NY[NY['Borough'] == 'Brooklyn']
District_Brooklyn['Category'].value_counts(normalize=True)[:20]

Pizzeria                0.092272
Chinese Restaurant      0.071511
Deli                    0.070358
Bakery                  0.059977
Restaurant              0.044983
Café                    0.043829
Coffee Shop             0.041522
American Restaurant     0.041522
Bar                     0.039216
Bagel Shop              0.039216
Cocktail Bar            0.038062
Caribbean Restaurant    0.027682
Burger Joint            0.027682
Ice Cream Parlor        0.021915
Mexican Restaurant      0.020761
Sushi Restaurant        0.019608
Asian Restaurant        0.018454
Italian Restaurant      0.017301
Fast Food Restaurant    0.016148
Wine Bar                0.012687
Name: Category, dtype: float64

In [32]:
District_New_York = NY[NY['Borough'] == 'New York']
District_New_York['Category'].value_counts(normalize=True)[:20]

Restaurant              0.090555
Bakery                  0.059396
Coffee Shop             0.053554
Deli                    0.051607
Pizzeria                0.050633
Cocktail Bar            0.043817
American Restaurant     0.042843
Café                    0.037975
Asian Restaurant        0.033106
Bar                     0.033106
Bagel Shop              0.030185
Burger Joint            0.030185
Italian Restaurant      0.027264
Chinese Restaurant      0.026290
Sushi Restaurant        0.022395
Ice Cream Parlor        0.021422
Wine Bar                0.019474
BBQ Joint               0.015579
Mexican Restaurant      0.014606
Fast Food Restaurant    0.013632
Name: Category, dtype: float64
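
To compare districts directly, the shares can be placed side by side. The sketch below copies four categories from the Toronto and Brooklyn outputs above into two Series; the layout and the `Difference` column are only one possible way to present the comparison.

```python
import pandas as pd

# Shares copied from the Toronto and Brooklyn outputs above
toronto = pd.Series({
    "Café": 0.086420, "Restaurant": 0.061728,
    "Bakery": 0.054321, "Pizzeria": 0.051852,
})
brooklyn = pd.Series({
    "Café": 0.043829, "Restaurant": 0.044983,
    "Bakery": 0.059977, "Pizzeria": 0.092272,
})

# One column per district, aligned on the category name
comparison = pd.concat([toronto, brooklyn], axis=1, keys=["Toronto", "Brooklyn"])

# Positive values mean the category has a larger share in Toronto
comparison["Difference"] = comparison["Toronto"] - comparison["Brooklyn"]
print(comparison.sort_values("Difference"))
```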