![alt text](https://miro.medium.com/max/755/1*Aydiurid-v3wwHQbTZ5Kaw.jpeg "The cities that form the triangle, North Carolina")

<a name="top"></a>
# GUIDE TO LIVING IN RESEARCH TRIANGLE PARK, NC

*This project builds a **guide to neighborhoods for visitors and recent movers to the Research Triangle Park area of North Carolina (including Durham, Raleigh, and Chapel Hill)**. It uses K-means clustering, along with other Python data analysis libraries, to group similar neighborhoods across the cities of the region.*

## Table of Contents

### 2.  [Data](#section2)
  2b. [Data collection](#section2b)  




In [1]:
### ALL LIBRARIES ARE IMPORTED HERE IN ONE PLACE FOR EASIER MANAGEMENT

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors
plt.style.use('ggplot')
%matplotlib inline

import seaborn as sns
import folium

from bs4 import BeautifulSoup
import requests
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values
import re # import regular expression to help with web scraping

from sklearn.cluster import KMeans

print('All libraries imported successfully!')

All libraries imported successfully!


<a name='section2b'></a>
#### 2b. Data collection
##### Retrieving neighborhoods
To place the neighborhoods on a map, we first need a list of them. The website **[City-Data.com](http://www.city-data.com/)** compiles neighborhood-level geographic data for cities across the United States. Each city page includes a list of neighborhoods that can be scraped with Beautiful Soup. I will therefore scrape the neighborhood list from each city page and combine the results into a single dataframe.

In [4]:
# Scrape the web for neighborhood names for each city.

# First, put the city names and URL containing the data into a dictionary to iterate:
city_dict = {'Durham':'http://www.city-data.com/nbmaps/neigh-Durham-North-Carolina.html',
             'Raleigh':'http://www.city-data.com/nbmaps/neigh-Raleigh-North-Carolina.html',
             'Cary':'http://www.city-data.com/nbmaps/neigh-Cary-North-Carolina.html',
             'Chapel Hill':'http://www.city-data.com/nbmaps/neigh-Chapel-Hill-North-Carolina.html',
             'Apex':'http://www.city-data.com/nbmaps/neigh-Apex-North-Carolina.html'}


# After that, define the function used to extract neighborhood names from the website for each city
def neighborhood_scrape(city, url):
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'html.parser')
    a_tag = soup.find_all('a', href=re.compile('#N'))
    nbh_list = [item.string for item in a_tag]
    df = pd.DataFrame({'City':[str(city) for i in range(len(nbh_list))], 'Neighborhood':nbh_list})
    return df


# Finally, loop over the dictionary and add the results to a combined dataframe
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
frames = [neighborhood_scrape(city, url) for city, url in city_dict.items()]
df = pd.concat(frames, ignore_index=True)

print('{} neighborhoods retrieved for the top 5 most populous cities in the Triangle area.'.format(df.shape[0]))

df.head()


| | City | Neighborhood |
|---|---|---|
| 0 | Durham | Albright Community |
| 1 | Durham | American Tobacco District |
| 2 | Durham | Auburn |
| 3 | Durham | Breedlove |
| 4 | Durham | Brightleaf District |
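
One detail of the scraping function worth knowing: when Beautiful Soup is given a compiled regular expression as a filter, it tests it with `search()`, so `re.compile('#N')` matches any `href` that contains `#N` anywhere. The sketch below reproduces that matching rule with the standard `re` module only. The `href` values are invented for illustration and are not taken from City-Data.com, and the stricter pattern assumes the anchors end in digits, which should be checked against the real page.

```python
import re

# Hypothetical href values, made up for this illustration
hrefs = ['#N1', '#N25', '/city/Durham-North-Carolina.html', 'index.html#Notes']

# Same pattern as in neighborhood_scrape; search() looks anywhere in the string
loose = re.compile('#N')
loose_matches = [h for h in hrefs if loose.search(h)]

# A stricter pattern, assuming anchors look like '#N' followed only by digits
strict = re.compile(r'^#N\d+$')
strict_matches = [h for h in hrefs if strict.search(h)]

print(loose_matches)
print(strict_matches)
```

If unexpected entries show up in the scraped list, tightening the pattern in this way is one option to try.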


In [5]:
# Let's see how many neighborhoods we retrieved for each city
df['Neighborhood'].groupby(df['City']).count().sort_values(ascending=False)

City
Raleigh        200
Cary           200
Apex           200
Durham          67
Chapel Hill     58
Name: Neighborhood, dtype: int64

##### Getting neighborhood coordinates from geocoding services
Google, Bing (Microsoft), and several other mapping providers also offer geocoding services. However, they usually require registering for an account and may charge for usage. For this project, I will therefore use the ***Nominatim*** geocoder from the **Geopy** library to obtain the coordinates of each neighborhood. Nominatim is built on OpenStreetMap data, which is open and free to use.

In [6]:
# Define a function to retrieve latitude and longitudes of the neighborhoods
def get_coordinates(neighborhood, city):
    address = str(neighborhood) + ', ' + str(city) + ', NC'

    location = Nominatim(user_agent="hkhuu@elon.edu").geocode(address)
    if location is None:  # no match found for this address
        return [np.nan, np.nan]
    else:
        return [location.latitude, location.longitude]

    
# Try out the function to obtain the coordinates of Research Triangle Park area in Durham
rtp_coordinates = get_coordinates('Research Triangle Park', 'Durham')
LATITUDE, LONGITUDE = rtp_coordinates[0], rtp_coordinates[1]
print(LATITUDE, LONGITUDE)

35.89212155 -78.87154641285423


The function is working! I will now proceed to apply the function to each neighborhood in our dataframe and add the location data.

In [7]:
df['Latitude/Longitude'] = df.apply(lambda x: get_coordinates(x['Neighborhood'], x['City']), axis=1)
df.head()

| | City | Neighborhood | Latitude/Longitude |
|---|---|---|---|
| 0 | Durham | Albright Community | [nan, nan] |
| 1 | Durham | American Tobacco District | [35.99479205, -78.90463781810374] |
| 2 | Durham | Auburn | [35.9152775, -78.91324753679288] |
| 3 | Durham | Breedlove | [35.98605795, -78.8316742822716] |
| 4 | Durham | Brightleaf District | [nan, nan] |
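
The `apply` call above sends one request per row. The public Nominatim service asks clients to stay at or below one request per second, so spacing out the calls and caching repeated addresses may be worthwhile. The following is a minimal sketch of that idea using only the standard library; `make_throttled` is a hypothetical helper, not part of Geopy, and the one-second default is an assumption based on that usage policy.

```python
import time

def make_throttled(geocode, min_delay=1.0, sleep=time.sleep, clock=time.monotonic):
    """Wrap `geocode` so results are cached and real calls are spaced out.

    `geocode` is any callable that takes an address string.
    `sleep` and `clock` are injectable so the wrapper can be tested without waiting.
    """
    cache = {}
    last_call = [None]  # time of the previous real request, None before the first

    def wrapper(address):
        # Repeated addresses are answered from the cache without a new request
        if address in cache:
            return cache[address]
        # Wait until at least `min_delay` seconds have passed since the last request
        if last_call[0] is not None:
            wait = min_delay - (clock() - last_call[0])
            if wait > 0:
                sleep(wait)
        result = geocode(address)
        last_call[0] = clock()
        cache[address] = result
        return result

    return wrapper
```

It could be used as `geocode = make_throttled(Nominatim(user_agent=...).geocode)` and then called inside `get_coordinates`. Geopy also ships its own `RateLimiter` class in `geopy.extra.rate_limiter`, which covers the delay part of this pattern.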


Let's clean up the dataframe. I will split the latitudes and longitudes into their own columns, then drop every row with missing coordinates. After that, we will check how many neighborhoods remain.

In [11]:
# Replacing 'Latitude/Longitude' column with separated latitudes and longitudes
# Note: this adds the 'Latitude' and 'Longitude' columns to df itself
df[['Latitude','Longitude']] = pd.DataFrame(df['Latitude/Longitude'].tolist(), index=df.index)
# df_nbh is a new dataframe without the combined column
df_nbh = df.drop('Latitude/Longitude', axis=1)


# Drop missing values
df_nbh = df_nbh.dropna(subset=['Latitude', 'Longitude'], axis=0)

print('The resulting data set has {} neighborhoods.'.format(df_nbh.shape[0]))

The resulting data set has 412 neighborhoods.
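
To make the cleaning step easier to follow, here is a small self-contained sketch of the same idea on three toy rows. The coordinates are copied from the output shown earlier; the variable names `toy`, `coords`, and `toy_clean` are illustrative only, and this variant builds a new frame with `join` rather than modifying the original.

```python
import numpy as np
import pandas as pd

# Toy rows mimicking the geocoded frame
toy = pd.DataFrame({
    'City': ['Durham', 'Durham', 'Durham'],
    'Neighborhood': ['Albright Community', 'American Tobacco District', 'Auburn'],
    'Latitude/Longitude': [[np.nan, np.nan],
                           [35.99479205, -78.90463781810374],
                           [35.9152775, -78.91324753679288]],
})

# Expand each [lat, lng] pair into two float columns, aligned on the original index
coords = pd.DataFrame(toy['Latitude/Longitude'].tolist(),
                      columns=['Latitude', 'Longitude'],
                      index=toy.index)

# Swap the combined column for the two new ones, then drop rows without coordinates
toy_clean = (toy.drop(columns='Latitude/Longitude')
                .join(coords)
                .dropna(subset=['Latitude', 'Longitude']))

print(toy_clean)
```

The row for Albright Community disappears because the geocoder returned no match for it.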


##### Get venues data using Foursquare API

The final piece of data we need is the set of venues around each of these neighborhoods. This will be retrieved with the Foursquare API. The template for the explore endpoint is as follows:

```text
https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}
```
Because of the number of neighborhoods we are pulling data for, I will limit each request to 50 venues within a 500-meter radius of the neighborhood.
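
As a rough illustration of how the template is filled in, the sketch below builds one request URL with the standard library. `build_explore_url` is a hypothetical helper, `'ID'` and `'SECRET'` are placeholder credentials, and the default version date is the one used later in this notebook. `urlencode` percent-encodes the comma in the `ll` value, which is the standard encoding for query strings.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, lat, lng,
                      version='20180605', radius=500, limit=50):
    """Fill in the explore template and return the full request URL."""
    query = urlencode({
        'client_id': client_id,
        'client_secret': client_secret,
        'll': '{},{}'.format(lat, lng),  # latitude,longitude of the search center
        'v': version,                    # API version date
        'radius': radius,                # search radius in meters
        'limit': limit,                  # maximum number of venues to return
    })
    return 'https://api.foursquare.com/v2/venues/explore?' + query

# Placeholder credentials; coordinates of one neighborhood from the geocoding step
example_url = build_explore_url('ID', 'SECRET', 35.99479205, -78.90463781810374)
print(example_url)
```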

***First***, let's define a function we can use to automate the API call.

In [17]:
# Replace the placeholders with your own Foursquare credentials; never publish real keys
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID'
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET'
VERSION = '20180605'


def getNearbyVenues(names, latitudes, longitudes, radius=500, limit=50):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request; requests builds and URL-encodes the query string
        url = 'https://api.foursquare.com/v2/venues/explore'
        params = {
            'client_id': CLIENT_ID,
            'client_secret': CLIENT_SECRET,
            'v': VERSION,
            'll': '{},{}'.format(lat, lng),
            'radius': radius,
            'limit': limit}
            
        # make the GET request and keep only the list of recommended items
        results = requests.get(url, params=params).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                             'Neighborhood Latitude', 
                             'Neighborhood Longitude', 
                             'Venue', 
                             'Venue Latitude', 
                             'Venue Longitude', 
                             'Venue Category']
    
    return nearby_venues

***Now***, let's make the API call to get the recommended venues.

In [None]:
triangle_venues = getNearbyVenues(names=df_nbh['Neighborhood'],
                                  latitudes=df_nbh['Latitude'],
                                  longitudes=df_nbh['Longitude']
                                 )

In [19]:
print(triangle_venues.shape)
triangle_venues.head()

(2515, 7)


| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | American Tobacco District | 35.994792 | -78.904638 | American Tobacco Campus | 35.993042 | -78.905091 | Neighborhood |
| 1 | American Tobacco District | 35.994792 | -78.904638 | Durham Performing Arts Center (DPAC) | 35.993701 | -78.90219 | Concert Hall |
| 2 | American Tobacco District | 35.994792 | -78.904638 | Lucky's Delicatessen | 35.997015 | -78.904597 | Deli / Bodega |
| 3 | American Tobacco District | 35.994792 | -78.904638 | Pizzeria Toro | 35.996998 | -78.903716 | Pizza Place |
| 4 | American Tobacco District | 35.994792 | -78.904638 | Viceroy | 35.996744 | -78.90365 | Gastropub |
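
Since the request used `radius=500`, each venue should lie within about 500 meters of its neighborhood's coordinates. One way to sanity-check this is to compute the great-circle distance between the two points. The sketch below uses the haversine formula with a mean Earth radius of 6,371 km; `haversine_m` is a hypothetical helper name, and the coordinates are copied from the DPAC row of the table above.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_M = 6371000  # mean Earth radius in meters (an approximation)

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two (lat, lng) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

# American Tobacco District (neighborhood) vs. Durham Performing Arts Center (venue)
distance = haversine_m(35.994792, -78.904638, 35.993701, -78.90219)
within_radius = distance <= 500

print(round(distance), within_radius)
```

The same function could be applied row by row to `triangle_venues` to flag any venue that falls outside the requested radius.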


**Because this data collection step relies on web scraping and API loops that take a significant amount of time to run, I will save all relevant datasets as .csv files so they can be loaded quickly for analysis.**

In [21]:
df.to_csv('main_df.csv', index=False)
df_nbh.to_csv('neighborhood_df.csv', index=False)
triangle_venues.to_csv('triangle_venues.csv', index=False)
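
With the files saved, later sessions can skip the slow scraping and API steps. A minimal sketch of a load-or-build pattern follows; `load_or_build` is a hypothetical helper, not something defined elsewhere in this notebook, and it assumes the cached file was written with `index=False` as above.

```python
import os
import pandas as pd

def load_or_build(path, build):
    """Return the DataFrame cached at `path`, or call `build()` and cache its result.

    `build` is any zero-argument callable that returns a DataFrame.
    """
    if os.path.exists(path):
        # Cached copy found: read it instead of recomputing
        return pd.read_csv(path)
    frame = build()
    frame.to_csv(path, index=False)
    return frame
```

For example, `load_or_build('triangle_venues.csv', lambda: getNearbyVenues(df_nbh['Neighborhood'], df_nbh['Latitude'], df_nbh['Longitude']))` would only call the Foursquare API when the file does not exist yet.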