<h1 align=center><font size = 7> Correlation between Neighborhoods' Economic Situation and Popular Venues in New York City</font></h1>

# 1. Introduction

Neighborhoods are complex systems in which many variables are at play. The goal of this study is to gather data on socioeconomic variables and examine how they correlate.  
For this purpose, median household income, median rent (using a 2-bedroom apartment as the reference) and the most popular venues will be taken into account.  
This study focuses on New York City.  
Our goal is to address two main problems:
1. Analyze the affordability of comparable neighborhoods (comparable by the venues they offer) to better inform a person interested in moving.
2. Establish whether there is a correlation between median income and popular venues, to help a prospective entrepreneur decide on the best location for their business.

# 2. Methodology 

## 2.1 Data sources

* The geography of NYC neighborhoods is based on the dataset at https://geo.nyu.edu/catalog/nyu_2451_34572, a copy of which is hosted at https://cocl.us/new_york_dataset and can be downloaded directly (for example with wget).
* Median income and rent (for a 2BR apartment) by neighborhood are gathered from renthop.com (https://www.renthop.com/study/assets/new-york-city-cost-of-living-2017/nyc-2br-median-rent-and-income-table.html and https://www.renthop.com/study/assets/new-york-city-cost-of-living-2017/nyc_2br.json).  
* Venue data is retrieved from Foursquare through its API.  

## 2.2 Data retrieval 

Let's proceed with the necessary imports.

In [None]:
# uncomment the following line if geopy isn't installed
#!conda install -c conda-forge geopy --yes 

# uncomment the following line if folium isn't installed
#!conda install -c conda-forge folium=0.5.0 --yes 


# install BeautifulSoup, requests and lxml
!pip install beautifulsoup4
!pip install requests
!pip install lxml

# library to handle data in a vectorized manner
import numpy as np 

# library for data analysis
import pandas as pd 
from pandas import json_normalize # transform a JSON file into a pandas dataframe

# import BeautifulSoup
from bs4 import BeautifulSoup

# library to handle JSON files
import json

# convert an address into latitude and longitude values
from geopy.geocoders import Nominatim 

# library to handle http requests
import requests 

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means for the clustering stage
from sklearn.cluster import KMeans

# map rendering library
import folium 


print('Libraries imported.')

Let's proceed with downloading the New York neighborhoods geography dataset, which comes as a JSON file.

In [None]:
url = 'https://cocl.us/new_york_dataset'
filename = 'newyork_data.json'
response = requests.get(url)
# write inside a context manager so the file is flushed and closed before it is read back
with open(filename, 'w') as file:
    file.write(response.text)
with open(filename) as json_data:
    newyork_data = json.load(json_data)
newyork_data

Inspecting the content, we can see that a lot of information is available in this set. For this study, we need to retrieve from this JSON the borough and neighborhood name of each entry along with its latitude and longitude.  
This information is available in each feature instance at __properties.borough__, __properties.name__ and __geometry.coordinates__ respectively (coordinates are stored as longitude, latitude).  
Let's create the pandas DataFrame and fill it with the relevant information.

In [27]:
neighborhoods_data = newyork_data['features']

# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# collect one row per neighborhood; DataFrame.append was removed in pandas 2.0
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # coordinates are stored as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

# build the dataframe in one step
neighborhoods = pd.DataFrame(rows, columns=column_names)

neighborhoods.head()

,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585
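
As an alternative to the explicit loop, the same flattening could be sketched with `json_normalize`, which turns nested keys into dotted column names. The snippet below is only an illustration: `sample_features` is a hypothetical one-element list written by hand to mirror the feature structure described above (using the Wakefield values from the output), not the real dataset.

```python
import pandas as pd

# Hypothetical single-feature sample mirroring the structure of newyork_data['features']
sample_features = [
    {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [-73.847201, 40.894705]},
        "properties": {"name": "Wakefield", "borough": "Bronx"},
    },
]

# Nested keys become dotted column names such as 'properties.borough'
flat = pd.json_normalize(sample_features)

# 'geometry.coordinates' holds [longitude, latitude] lists; split them into two columns
coords = pd.DataFrame(flat["geometry.coordinates"].tolist(),
                      columns=["Longitude", "Latitude"])

sample_neighborhoods = pd.DataFrame({
    "Borough": flat["properties.borough"],
    "Neighborhood": flat["properties.name"],
    "Latitude": coords["Latitude"],
    "Longitude": coords["Longitude"],
})
print(sample_neighborhoods)
```

With the real data, `sample_features` would be replaced by `newyork_data['features']`; the loop version has the advantage of making the longitude/latitude order explicit.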


We can now proceed to retrieve the data on neighborhood median income and rent price.  

The content of the table is generated dynamically, but luckily a JSON file with all the information is available at https://www.renthop.com/study/assets/new-york-city-cost-of-living-2017/nyc_2br.json, so we will parse that JSON and store it in a DataFrame.

In [None]:
url = 'https://www.renthop.com/study/assets/new-york-city-cost-of-living-2017/nyc_2br.json'
filename = 'newyork_income_data.json'
response = requests.get(url)
# write inside a context manager so the file is flushed and closed before it is read back
with open(filename, 'w') as file:
    file.write(response.text)
with open(filename) as json_income_data:
    newyork_income_data = json.load(json_income_data)
newyork_income_data
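
Both downloads follow the same pattern (request, save to disk, parse), so they could be wrapped in a small helper. The function below is a minimal sketch, not part of the original notebook: the name `download_json`, the `get` parameter (which allows swapping in a stub for `requests.get`) and the 30-second timeout are all assumptions chosen for illustration.

```python
import json
from pathlib import Path

import requests


def download_json(url, filename, get=requests.get, timeout=30):
    """Download `url`, save the raw text to `filename`, and return the parsed JSON.

    `download_json`, `get` and `timeout` are illustrative names, not from the notebook.
    """
    response = get(url, timeout=timeout)
    # Fail loudly on HTTP errors instead of silently saving an error page
    response.raise_for_status()
    # write_text opens, writes and closes the file in one call
    Path(filename).write_text(response.text, encoding="utf-8")
    with open(filename, encoding="utf-8") as json_file:
        return json.load(json_file)
```

With such a helper, the income data could be loaded as `newyork_income_data = download_json(url, 'newyork_income_data.json')`, and the geography data in the same way.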

This one is much more straightforward than the previous JSON we parsed: we can simply access the 'data' node and let json_normalize create the DataFrame for us.

In [29]:
neighborhoods_income_data = newyork_income_data['data']

neighborhoods_income = json_normalize(neighborhoods_income_data)
neighborhoods_income.head()

,Neighborhood,Borough,house_income,median2,perc_income,income4_2Br
0,Upper East Side-Carnegie Hill,Manhattan,155213,3555.0,0.274848,142200
1,Great Kills,Staten Island,88868,2050.0,0.276815,82000
2,Whitestone,Queens,80546,1950.0,0.290517,78000
3,New Dorp-Midland Beach,Staten Island,78100,1900.0,0.291933,76000
4,Bayside-Bayside Hills,Queens,79120,2176.0,0.33003,87040
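
The derived columns in this table appear to follow from the rent and income columns: in the rows shown above, `perc_income` matches the annual rent (12 × `median2`) divided by `house_income`, and `income4_2Br` matches 40 × `median2`. This reading is inferred from the displayed values and is not documented by the source, so it should be treated as an assumption. A minimal sketch that checks it on the first two rows:

```python
import pandas as pd

# First two rows copied from the output above
sample = pd.DataFrame({
    "Neighborhood": ["Upper East Side-Carnegie Hill", "Great Kills"],
    "house_income": [155213, 88868],
    "median2": [3555.0, 2050.0],
    "perc_income": [0.274848, 0.276815],
    "income4_2Br": [142200, 82000],
})

# Assumed relationships (inferred from the data, not documented by the source)
rent_share = 12 * sample["median2"] / sample["house_income"]  # yearly rent / yearly income
required_income = 40 * sample["median2"]                      # 40 times the monthly rent

# Compare against the published columns, allowing for rounding in the displayed values
share_matches = (rent_share - sample["perc_income"]).abs().max() < 1e-5
income_matches = (required_income == sample["income4_2Br"]).all()
print(share_matches, income_matches)
```

If the assumption holds for the whole dataset, `perc_income` can be read directly as the share of median household income spent on a median 2-bedroom rent, which is the affordability measure the first goal of this study relies on.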
