# Capstone Project - The Battle of Neighborhoods

Now that you have the skills and the tools to use location data to explore a geographical location, you will have two weeks to be as creative as you want: come up with an idea that leverages the Foursquare location data to explore or compare neighborhoods or cities of your choice, or define a problem that the Foursquare location data can help solve. If you cannot think of an idea or a problem, here are some ideas to get you started:

- In Module 3, we explored New York City and the city of Toronto and segmented and clustered their neighborhoods. Both cities are very diverse and are the financial capitals of their respective countries. One interesting idea would be to compare the neighborhoods of the two cities and determine how similar or dissimilar they are. Is New York City more like Toronto or Paris or some other multicultural city? I will leave it to you to refine this idea.
- In a city of your choice, if someone is looking to open a restaurant, where would you recommend that they open it? Similarly, if a contractor is trying to start their own business, where would you recommend that they set up their office?

These are just a couple of the many ideas and problems that can be solved using location data in addition to other datasets. No matter what you decide to do, make sure to provide sufficient justification of why what you want to do or solve is important and why a client or a group of people would be interested in your project.

## Review criteria

This capstone project will be graded by your peers and is worth 70% of your total grade. The project will be completed over the course of 2 weeks. Week 1 submissions will be worth 30%, whereas week 2 submissions will be worth 40% of your total grade.

__For this week, you will be required to submit the following:__

- A description of the problem and a discussion of the background. (15 marks)
- A description of the data and how it will be used to solve the problem. (15 marks)

__For the second week, the final deliverables of the project will be:__

1. A link to your Notebook on your GitHub repository, showing your code. (15 marks)  _&rarr; Jupyter Notebook_

2. A full report consisting of all of the following components (15 marks):  _&rarr; Text as .pdf_

    __Introduction__ where you discuss the business problem and who would be interested in this project.
    
    __Data__ where you describe the data that will be used to solve the problem and the source of the data.
    
    __Methodology section__ which represents the main component of the report, where you discuss and describe any exploratory data analysis that you did, any inferential statistical testing that you performed, and which machine learning techniques were used and why.
    
    __Results section__ where you discuss the results.
    
    __Discussion section__ where you discuss any observations you noted and any recommendations you can make based on the results.
    
    __Conclusion section__ where you conclude the report.

3. Your choice of a presentation or blogpost. (10 marks)  _&rarr; PowerPoint as .pdf_

***
# WEEK 4

Project Title: "Give your project a descriptive title"

Clearly define a problem or an idea of your choice, where you would need to leverage the Foursquare location data to solve or execute. Remember that data science problems always target an audience and are meant to help a group of stakeholders solve a problem, so make sure that you explicitly describe your audience and why they would care about your problem.

This submission will eventually become the __Introduction/Business Problem__ section of your final report, so I recommend that you push the report (containing only the Introduction/Business Problem section for now) to your GitHub repository and submit a link to it.

URL

Describe the data that you will be using to solve the problem or execute your idea. Remember that you will need to use the Foursquare location data to solve the problem or execute your idea. You can absolutely use other datasets in combination with the Foursquare location data. So make sure that you provide adequate explanation and discussion, with examples, of the data that you will be using, even if it is only Foursquare location data.

This submission will eventually become the __Data__ section of your final report, so I recommend that you push the report (now including the Data section) to your GitHub repository and submit a link to it.

URL
***

# WEEK 5

This week, you will continue working on your capstone project. Please remember that by the end of this week, you will need to submit the following:

1. A full report consisting of all of the following components (15 marks):

    - Introduction where you discuss the business problem and who would be interested in this project.
    - Data where you describe the data that will be used to solve the problem and the source of the data.
    - Methodology section which represents the main component of the report, where you discuss and describe any exploratory data analysis that you did, any inferential statistical testing that you performed, and which machine learning techniques were used and why.
    - Results section where you discuss the results.
    - Discussion section where you discuss any observations you noted and any recommendations you can make based on the results.
    - Conclusion section where you conclude the report.

2. A link to your Notebook on your GitHub repository, showing your code. (15 marks)

3. Your choice of a presentation or blogpost. (10 marks)

Here are examples of outstanding previous submissions that should give you an idea of what your report, your notebook (in terms of clean, clear, and well-commented code), and your presentation or blogpost should look like:

- Report: Report Template
- Notebook: Notebook Template
- Presentation: Presentation Template
- Blogpost: Blogpost Template


Project Title

Please submit a link to your Notebook.
URL

Please submit a link to your report.
URL

Please submit a link to either your presentation or blogpost.
URL



***

In [None]:
# Sample code from Week 3 >>> @@@@@@@@@@@@@@@@@@@@ REMOVE LATER!!!
import pandas as pd
import numpy as np
import requests

from bs4 import BeautifulSoup


source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

soup = BeautifulSoup(source, 'html5lib')

postal_codes_dict = {} # initialize an empty dictionary to save the data in
for table_cell in soup.find_all('td'):
    try:
        postal_code = table_cell.p.b.text # get the postal code
        neighborhoods_data = table_cell.span.text # get the rest of the data in the cell
        borough = neighborhoods_data.split('(')[0] # get the borough in the cell
        
        # if the cell is not assigned then ignore it
        if neighborhoods_data == 'Not assigned':
            neighborhoods = []
        # else process the data and add it to the dictionary
        else:
            postal_codes_dict[postal_code] = {}
            
            try:
                neighborhoods = neighborhoods_data.split('(')[1]
            
                # remove parentheses from neighborhoods string
                neighborhoods = neighborhoods.replace('(', ' ')
                neighborhoods = neighborhoods.replace(')', ' ')

                neighborhoods_names = neighborhoods.split('/')
                neighborhoods_clean = ', '.join([name.strip() for name in neighborhoods_names])
            except IndexError: # no '(' in the cell, so it holds only a borough name
                borough = borough.strip('\n')
                neighborhoods_clean = borough
 
            # add borough and neighborhood to dictionary
            postal_codes_dict[postal_code]['borough'] = borough
            postal_codes_dict[postal_code]['neighborhoods'] = neighborhoods_clean
    except AttributeError: # the cell lacks the expected <p><b> or <span> markup
        pass
    
# build the dataframe from the dictionary in one step
# (DataFrame.append was removed in pandas 2.0)
columns = ['PostalCode', 'Borough', 'Neighborhood']
toronto_data = pd.DataFrame(
    [{'PostalCode': postal_code,
      'Borough': data['borough'],
      'Neighborhood': data['neighborhoods']}
     for postal_code, data in postal_codes_dict.items()],
    columns=columns)

# print number of rows of dataframe
toronto_data.shape[0]

# Sample code from Week 3 >>> @@@@@@@@@@@@@@@@@@@@ REMOVE LATER!!!
...so we will use the Geocoder Python package instead: https://geocoder.readthedocs.io/index.html.

The problem with this package is that you sometimes have to be persistent to get the geographical coordinates of a given postal code: a call for the latitude and longitude of a postal code can return None, and the same call made again can return the coordinates. So, to make sure that you get the coordinates for all of our neighborhoods, you can run a while loop for each postal code. Taking postal code M5G as an example, your code would look something like this:

In [None]:
# Sample code from Week 3 >>> @@@@@@@@@@@@@@@@@@@@ REMOVE LATER!!!
import geocoder # import geocoder

# initialize your variable to None
lat_lng_coords = None

# loop until you get the coordinates
while lat_lng_coords is None:
    g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
    lat_lng_coords = g.latlng

latitude = lat_lng_coords[0]
longitude = lat_lng_coords[1]
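
The while loop above never exits if the geocoding service keeps returning None. A minimal sketch of a bounded retry follows. The helper name `geocode_with_retry`, its parameters, and the stand-in `flaky_lookup` function are illustrative assumptions and are not part of the Geocoder package; in the notebook, the `lookup` argument would wrap the `geocoder.google(...)` call and return `g.latlng`.

```python
import time


def geocode_with_retry(lookup, query, max_attempts=5, delay_seconds=0.0):
    """Call lookup(query) until it returns coordinates or attempts run out.

    lookup is any callable that returns [lat, lng] or None (a hypothetical
    interface; in the notebook it would wrap the geocoder call).
    """
    for attempt in range(1, max_attempts + 1):
        coords = lookup(query)
        if coords is not None:
            return coords
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # brief pause before trying again
    return None  # give up instead of looping forever


# Stand-in for a flaky geocoder: fails twice, then succeeds (made-up coordinates).
calls = {'count': 0}


def flaky_lookup(query):
    calls['count'] += 1
    return None if calls['count'] < 3 else [43.65, -79.38]


result = geocode_with_retry(flaky_lookup, 'M5G, Toronto, Ontario')
print(result, 'after', calls['count'], 'attempts')
```

Returning None after a fixed number of attempts lets the caller decide what to do with postal codes that could not be resolved, for example by logging them and moving on.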

Name: Rosario

Postal codes: S2000, S2001, S2002, S2003, S2004, S2005, S2006, S2007, S2008, S2009, S2010, S2011, S2012, S2013

https://www.azcodigopostal.com/argentina/pplace-rosario-82084/

https://nuestraciudad.info/portal/Rosario.SF/Barrios

https://en.wikipedia.org/wiki/Rosario,_Santa_Fe

https://en.wikipedia.org/wiki/Districts_of_Rosario

https://en.wikipedia.org/wiki/Postal_codes_in_Argentina

# MY WORK STARTS HERE:

In [42]:
import requests
import os
import pandas as pd

import numpy as np
import json

from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import since pandas 1.0)

import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans

import folium # map rendering library

print('Libraries imported.')

Libraries imported.


In [45]:
# create dataframe with coordinates data for Rosario's neighborhoods
neighborhoods = pd.read_csv('Barrios_de_Rosario.csv')
neighborhoods

Unnamed: 0,Neighborhood,District,Latitude,Longitude
0,14 de Octubre,Distrito Sudoeste,-32.996944,-60.663056
1,17 de Agosto,Distrito Sudoeste,-33.011944,-60.66
2,Abasto,Distrito Centro,-32.961667,-60.646944
3,Alberdi,Distrito Norte,-32.897222,-60.698333
4,Alberto Olmedo,Distrito Centro,-32.933611,-60.658056
5,Alvear,Distrito Sudoeste,-32.986944,-60.677778
6,Antártida Argentina,Distrito Noroeste,-32.936944,-60.763056
7,Azcuénaga,"Distrito Noroeste, Distrito Oeste",-32.9425,-60.698333
8,Belgrano,Distrito Noroeste,-32.940833,-60.713333
9,Bella Vista,"Distrito Oeste, Distrito Sudoeste",-32.966667,-60.680833


In [37]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

In [38]:
address = 'Rosario, Santa Fe, Argentina'

geolocator = Nominatim(user_agent="rosario_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of the city of Rosario are {}, {}.'.format(latitude, longitude))

The geographical coordinates of the city of Rosario are -32.9595004, -60.6615415.


In [46]:
print('The dataframe has {} districts and {} neighborhoods.'.format(
        len(neighborhoods['District'].unique()),
        len(neighborhoods['Neighborhood'].unique())
    )
)

The dataframe has 14 districts and 46 neighborhoods.


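Rosario has 6 districts, so the 14 reported above likely comes from rows that list several districts in one string (e.g. "Distrito Noroeste, Distrito Oeste") __&rarr; data cleaning >>>>>>>>>>> ToDo!!!__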

Once I have this ready &rarr; create map showing Rosario's 6 districts
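
One possible way to approach the cleaning step: some rows list several districts in a single string (for example "Distrito Noroeste, Distrito Oeste"), which inflates the count of unique values. The sketch below splits those strings so that each row holds exactly one district. It uses the ten rows shown in the table above as inline sample data rather than the full CSV, and the names `sample` and `districts_long` are assumptions made for illustration.

```python
import pandas as pd

# Sample rows copied from the table above (not the full CSV)
sample = pd.DataFrame({
    'Neighborhood': ['14 de Octubre', '17 de Agosto', 'Abasto', 'Alberdi',
                     'Alberto Olmedo', 'Alvear', 'Antártida Argentina',
                     'Azcuénaga', 'Belgrano', 'Bella Vista'],
    'District': ['Distrito Sudoeste', 'Distrito Sudoeste', 'Distrito Centro',
                 'Distrito Norte', 'Distrito Centro', 'Distrito Sudoeste',
                 'Distrito Noroeste', 'Distrito Noroeste, Distrito Oeste',
                 'Distrito Noroeste', 'Distrito Oeste, Distrito Sudoeste'],
})

# Split multi-district strings into lists, then give each district its own row
districts_long = (
    sample.assign(District=sample['District'].str.split(','))
          .explode('District')
)
districts_long['District'] = districts_long['District'].str.strip()

raw_count = sample['District'].nunique()            # combined strings count as distinct values
clean_count = districts_long['District'].nunique()  # one value per actual district
print(raw_count, clean_count)
print(sorted(districts_long['District'].unique()))
```

With this layout, a neighborhood that spans two districts appears once per district, so any later per-neighborhood step should account for the repeated rows.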

In [None]:
# map of Rosario's Districts

In [None]:
# I'm here!!!

In [None]:
# check out example notebook from Germany

### Explore Neighborhoods in Rosario

In [39]:
# Define Foursquare Credentials and Version
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

In [40]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    """ names, latitudes, longitudes are columns from a dataframe. """
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        # return only relevant information for each nearby venue
        venues_list.append([(name, lat, lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                  'Neighborhood Latitude',
                  'Neighborhood Longitude',
                  'Venue',
                  'Venue Latitude',
                  'Venue Longitude',
                  'Venue Category']
    
    return nearby_venues
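
The function above assumes that every response contains at least one group and that every venue has at least one category; if either is missing, the indexing raises an error. The sketch below isolates the parsing step and falls back to defaults instead. The helper name `parse_explore_items`, the `default_category` label, and the sample payload are hypothetical; the payload only mimics the keys that `getNearbyVenues` already reads.

```python
def parse_explore_items(payload, default_category='Uncategorized'):
    """Extract (name, lat, lng, category) tuples from an explore-style response."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # fall back to a default label when the venue has no category
        category = categories[0]['name'] if categories else default_category
        rows.append((venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows


# Hand-written payload for illustration only (not real API output)
sample_payload = {
    'response': {'groups': [{'items': [
        {'venue': {'name': 'Le Gula',
                   'location': {'lat': -33.010215, 'lng': -60.66241},
                   'categories': [{'name': 'Restaurant'}]}},
        {'venue': {'name': 'Unnamed kiosk',
                   'location': {'lat': -33.0, 'lng': -60.66},
                   'categories': []}},
    ]}]}
}

rows = parse_explore_items(sample_payload)
print(rows)
print(parse_explore_items({'response': {}}))  # empty response yields an empty list
```

Keeping the parsing separate from the HTTP request also makes it possible to check the logic without spending API calls.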

In [43]:
# Run the above function on each neighborhood and create a new dataframe called rosario_venues.
rosario_venues = getNearbyVenues(names=neighborhoods['Neighborhood'],
                                 latitudes=neighborhoods['Latitude'],
                                 longitudes=neighborhoods['Longitude'])

14 de Octubre 
17 de Agosto
Abasto
Alberdi
Alberto Olmedo
Alvear
Antártida Argentina
Azcuénaga
Belgrano
Bella Vista
Celedonio Escalada
Cinco Esquinas
Domingo Faustino Sarmiento
Domingo Matheu
España y Hospitales
Esteban Echeverría
Fisherton
General Las Heras
Godoy
Jorge Cura
La Cerámica y Cuyo
La Guardia
Larrea y Empalme Graneros
Las Flores
Las Malvinas
Latinoamérica
Lisandro de la Torre
Ludueña
Luis Agote
Mercedes de San Martín
Nuestra Sra. de Lourdes
Parque Casado
Parque Field
Presidente Roque Saenz Peña
Remedios de Escalada de San Martín
República de la Sexta
Rosario Centro
Rucci
Saladillo Sur
San Martín
Tango
Tiro Suizo
Triángulo y Moderno
Victoria Walsh
Villa Manuelita
Villa Urquiza


In [44]:
print(rosario_venues.shape)
rosario_venues.head()

(212, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,14 de Octubre,-32.996944,-60.663056,Cafe La Fazenda Planta Industrial,-32.998745,-60.661983,Breakfast Spot
1,14 de Octubre,-32.996944,-60.663056,Roque Vicente Gioiosa SRL,-32.999975,-60.665331,Construction & Landscaping
2,14 de Octubre,-32.996944,-60.663056,Super 27,-33.000357,-60.664901,Supermarket
3,14 de Octubre,-32.996944,-60.663056,Productos Nutrirte - Mariana,-32.992881,-60.663724,Health Food Store
4,17 de Agosto,-33.011944,-60.66,Le Gula,-33.010215,-60.66241,Restaurant


### &rarr; My goal is to find the neighborhood that is most attractive to tourists, young people, and students, for potential real estate investors who wish to buy apartments and rent them to:
- Tourists (through a service like Airbnb)
- Young single professionals
- College students

In [None]:
# one hot encoding
rosario_onehot = pd.get_dummies(rosario_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
rosario_onehot['Neighborhood'] = rosario_venues['Neighborhood']

# move neighborhood column to the first column
fixed_columns = [rosario_onehot.columns[-1]] + list(rosario_onehot.columns[:-1])
rosario_onehot = rosario_onehot[fixed_columns]

rosario_onehot.head()

In [None]:
# Let's see the new dataframe size.
rosario_onehot.shape

In [None]:
# Next, let's group rows by neighborhood and take the mean of the frequency of occurrence of each category
pd.set_option('display.max_rows', 10)

rosario_grouped = rosario_onehot.groupby('Neighborhood').mean().reset_index()
rosario_grouped
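
To make the grouped frame concrete, here is a small worked example on the five venue rows shown in the output above. The variable names (`sample_venues`, `sample_onehot`, `sample_grouped`) are illustrative, and the steps are a simplified version of the cells above rather than a replacement for them.

```python
import pandas as pd

# The five venue rows from the output above, reduced to the two columns needed
sample_venues = pd.DataFrame({
    'Neighborhood': ['14 de Octubre'] * 4 + ['17 de Agosto'],
    'Venue Category': ['Breakfast Spot', 'Construction & Landscaping',
                       'Supermarket', 'Health Food Store', 'Restaurant'],
})

# One indicator column per category, one row per venue
sample_onehot = pd.get_dummies(sample_venues['Venue Category'], dtype=float)
sample_onehot.insert(0, 'Neighborhood', sample_venues['Neighborhood'])

# Mean of each indicator = share of the neighborhood's venues in that category
sample_grouped = sample_onehot.groupby('Neighborhood').mean().reset_index()
print(sample_grouped)
```

Each row of the grouped frame sums to 1 across the category columns, so neighborhoods with very different venue counts become comparable. A neighborhood with only one venue gets a single category at 1.0, which is worth keeping in mind when clustering.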

### Cluster Neighborhoods

In [None]:
# elbow method to find k!!!
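
A minimal sketch of the elbow method, assuming scikit-learn's `KMeans` (already imported above) and synthetic two-dimensional points in place of the grouped venue frequencies, which are not built yet. The range of k values, the blob layout, and the random seeds are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: three well-separated blobs of 30 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

k_values = list(range(1, 7))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)  # within-cluster sum of squared distances

for k, inertia in zip(k_values, inertias):
    print(k, round(inertia, 1))
```

Plotting inertia against k and looking for the point where the curve flattens gives a candidate k. For the real data, `X` would be the category-frequency columns of the grouped dataframe (without the `Neighborhood` column).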