# Capstone Project – The Battle of Neighborhoods (Week 1)

## A description of the problem and a discussion of the background

##### 1 - Description of the problem: 

People starting a new business have different requirements to meet and studies to conduct before launching their project. To start a restaurant, they need to find the right place. The definition of the right place incorporates several aspects: a high density of people, an area not already flooded with restaurants, and other facilities nearby that might lead people to think about food. For instance, malls, schools and companies host people doing different activities, yet all of them might need a meal at some point in their day. Doing the proper research can help the business owner shortlist candidate boroughs in a city where the business has a better chance of succeeding.

##### 2 - Business understanding:

Given the nature of the targeted business and its success requirements, can we provide a short list of boroughs to help the business owner decide where to open?

##### 3 - Interested people: 

This problem is relevant to anyone deciding to start their own restaurant. The same approach can also be adopted for other kinds of business, such as gyms and entertainment services.


## A description of the data and how it will be used to solve the problem

##### Data is the most important element of any data science project. As a rough rule of thumb, almost 60% of the relevance of a machine learning model depends on the relevance of the dataset and the features it contains.

###### For our use case, the data needed can be summarized as follows:

1- For a targeted city, the boroughs and neighborhoods and their corresponding locations.

2- List of schools and companies for each borough and neighborhood.

3- List of entertainment venues for each borough and neighborhood (malls, gyms, gaming rooms …)

4- List of restaurants for each borough and neighborhood.

Assuming the business owner wants to open a restaurant in Toronto, we will scrape the postal codes of the city of Toronto and get the location of the different boroughs. For each one, we will use the Foursquare API to explore the location and get the venues in it. Later, we will filter the venues to keep those that help our analysis, such as restaurants, stations, malls and other entertainment facilities that usually attract people.

Processing this data will help us identify the subset of boroughs with the most facilities that might lead people to go and eat. This should be tempered by the number of restaurants in the borough, so that the business does not open in a place already crowded with restaurants.
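
One way to express this trade-off is a simple ratio score. The sketch below is an assumption made for illustration, not a method defined in this notebook: the function name `prospect_score` and the formula (attractors divided by one plus the number of restaurants) are hypothetical choices, and the counts are made up.

```python
def prospect_score(attractor_count, restaurant_count):
    """Hypothetical score: more attractors raise it, more restaurants lower it."""
    if attractor_count < 0 or restaurant_count < 0:
        raise ValueError("counts must be non-negative")
    # The +1 avoids division by zero when an area has no restaurants yet.
    return attractor_count / (1 + restaurant_count)


# Made-up counts, for illustration only.
busy_but_crowded = prospect_score(attractor_count=30, restaurant_count=29)
busy_and_open = prospect_score(attractor_count=30, restaurant_count=4)
print(busy_but_crowded, busy_and_open)  # 1.0 6.0
```

With the same number of attractors, the area with fewer restaurants scores higher, which is the behaviour the paragraph above asks for. Other formulas (for example a weighted difference) would serve equally well.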

###### Data sources: 
1 - Toronto postal codes: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

2 - Foursquare API

###### Data usage steps:
1 - Locate the targeted city and visualize it on a map using folium

2 - Use BeautifulSoup to scrape data on the boroughs and neighborhoods of the targeted city

3 - Use an open dataset to fetch the location of each neighborhood

4 - Use Foursquare to explore each neighborhood

5 - Extract useful venues in each neighborhood, e.g., restaurants, hotels, gyms, parks, stores

6 - Aggregate venues by neighborhood and derive scores to highlight the relevance of each neighborhood

7 - Derive the subset of neighborhoods per borough with the highest prospects
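
The last step of the list, picking the best neighborhoods per borough, could look roughly like the sketch below. It assumes each neighborhood has already been given a numeric score; the helper name `top_neighborhoods_per_borough`, the `top_n` parameter and the scores are hypothetical and only illustrate the grouping and sorting.

```python
from collections import defaultdict


def top_neighborhoods_per_borough(scored_rows, top_n=3):
    """Return {borough: [(neighborhood, score), ...]} keeping the top_n scores."""
    by_borough = defaultdict(list)
    for borough, neighborhood, score in scored_rows:
        by_borough[borough].append((neighborhood, score))
    return {
        borough: sorted(rows, key=lambda row: row[1], reverse=True)[:top_n]
        for borough, rows in by_borough.items()
    }


# Made-up scores, for illustration only.
example_rows = [
    ("North York", "Lawrence Manor", 2.5),
    ("North York", "Parkwoods", 0.8),
    ("Scarborough", "Malvern", 1.2),
]
shortlist = top_neighborhoods_per_borough(example_rows, top_n=1)
print(shortlist)
```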


In [None]:
import numpy as np 
import pandas as pd
import bs4
import requests
#!pip install geocoder
#import geocoder 
!conda install -c conda-forge folium=0.5.0 --yes
import folium
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors

###### Let's first see the targeted city on the map

In [None]:
# Toronto city centre coordinates
latitude = 43.653226
longitude = -79.3831843
venues_map = folium.Map(location=[latitude, longitude], zoom_start=13)  # generate map centred on Toronto

venues_map

###### Let's get the postal codes for the different boroughs and neighborhoods

In [None]:
# assign the URL of the wiki page and instantiate a BeautifulSoup object from the GET request output
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
toronto_wiki = requests.get(url).text
soup_object_toronto = bs4.BeautifulSoup(toronto_wiki, 'lxml')
# first we locate the table
table = soup_object_toronto.find('table')

# we parse the table and extract only three variables - 'PostalCode', 'Borough', 'Neighborhood'

Toronto_post_bor_neigh = []
for row in table.findAll('td'):
    # skip cells without a <span> and cells whose postal code is not assigned
    if row.span is None or row.span.text == 'Not assigned':
        continue
    cell = {}
    cell['PostalCode'] = row.p.text[:3]
    cell['Borough'] = (row.span.text).split('(')[0]
    cell['Neighborhood'] = (((((row.span.text).split('(')[1]).strip(')')).replace(' /',',')).replace(')',' ')).strip(' ')
    Toronto_post_bor_neigh.append(cell)
# we create a dataframe from the extracted list of dicts
df = pd.DataFrame(Toronto_post_bor_neigh)
# preprocessing
df1 = df[df.Borough != 'Not assigned']
df2 = df1.groupby(['PostalCode','Borough'], sort=False).agg(', '.join)
df2.reset_index(inplace=True)

df2['Neighborhood'] = np.where(df2['Neighborhood'] == 'Not assigned',df2['Borough'], df2['Neighborhood'])
df2.head(10)
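
The chained `split`/`strip`/`replace` calls used for the `Neighborhood` field can be hard to follow. As a minimal sketch, the same idea can be written as a small helper. It assumes the cell text has the form `Borough(Neighborhood A / Neighborhood B)`, which is what the parsing code above implies; the function name `split_borough_and_neighborhoods` is hypothetical, and falling back to the borough name when there is no parenthesis is a design choice of this sketch.

```python
def split_borough_and_neighborhoods(cell_text):
    """Split 'Borough(Hood A / Hood B)' into ('Borough', 'Hood A, Hood B')."""
    borough, sep, rest = cell_text.partition("(")
    borough = borough.strip()
    if not sep:
        # No parenthesis: fall back to using the borough as the neighborhood.
        return borough, borough
    neighborhoods = [part.strip() for part in rest.rstrip(")").split("/")]
    return borough, ", ".join(part for part in neighborhoods if part)


example = split_borough_and_neighborhoods("North York(Lawrence Manor / Lawrence Heights)")
print(example)  # ('North York', 'Lawrence Manor, Lawrence Heights')
```

Unlike indexing with `split('(')[1]`, this version does not raise an `IndexError` when a cell has no parenthesised neighborhood list.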

###### Let's add the location of each neighborhood

In [None]:
# use the CSV file of geospatial data
# load the CSV into a dataframe
la_lo_csv = pd.read_csv('https://cocl.us/Geospatial_data')

# now match 'PostalCode' with 'Postal Code' and get a merged dataframe
df_lo_la_toronto = pd.merge(df2, la_lo_csv, left_on='PostalCode', right_on='Postal Code')
df_lo_la_toronto.head(10)
# drop 'Postal Code', since it just duplicates 'PostalCode'
df_lo_la_toronto = df_lo_la_toronto.drop('Postal Code', axis=1)
df_lo_la_toronto.head(10)

###### Let's use Foursquare and explore the different neighborhoods

In [None]:
import os

# Credentials are read from environment variables rather than hard-coded in the notebook;
# the variable names below are assumed, so set them to match your own environment.
CLIENT_ID = os.environ.get('FOURSQUARE_CLIENT_ID', '')  # your Foursquare ID
CLIENT_SECRET = os.environ.get('FOURSQUARE_CLIENT_SECRET', '')  # your Foursquare Secret
ACCESS_TOKEN = os.environ.get('FOURSQUARE_ACCESS_TOKEN', '')  # your Foursquare Access Token
VERSION = '20180604'
LIMIT = 100
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
# the secret is deliberately not printed

In [None]:
latitude = 43.667856
longitude = -79.532242
radius=1000
neighborhood_url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, radius, LIMIT)
neighborhood_url
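
The URL above is assembled with string formatting. A possible alternative, sketched here with the standard library only, is to let `urllib.parse.urlencode` build and escape the query string. The helper name `build_explore_url` and the placeholder credentials are hypothetical; the endpoint and parameter names are the ones used in the cell above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, lat, lng, version, radius, limit):
    """Build the explore URL with properly encoded query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "ll": f"{lat},{lng}",
        "v": version,
        "radius": radius,
        "limit": limit,
    }
    return f"{EXPLORE_ENDPOINT}?{urlencode(params)}"


# Placeholder credentials, not real ones.
example_url = build_explore_url("MY_ID", "MY_SECRET", 43.667856, -79.532242, "20180604", 1000, 100)
print(example_url)

# Round-trip check: the query string decodes back to the original values.
decoded = parse_qs(urlsplit(example_url).query)
print(decoded["ll"], decoded["radius"])
```

Keeping the parameters in a dictionary also makes it harder to mix up the order of the positional arguments passed to `format`.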


In [None]:
results = requests.get(neighborhood_url).json()
'There are {} venues in the neighborhood.'.format(len(results['response']['groups'][0]['items']))

In [None]:
items = results['response']['groups'][0]['items']
items[0]

###### Let's process the JSON response

In [None]:
import json  # library to handle JSON files
dataframe = pd.json_normalize(items)  # flatten JSON (pandas >= 1.0; replaces pandas.io.json.json_normalize)

# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

# filter columns
filtered_columns = ['venue.name', 'venue.categories'] + [col for col in dataframe.columns if col.startswith('venue.location.')] + ['venue.id']
dataframe_filtered = dataframe.loc[:, filtered_columns].copy()  # work on a copy so the assignments below do not touch `dataframe`

# filter the category for each row
dataframe_filtered['venue.categories'] = dataframe_filtered.apply(get_category_type, axis=1)

# clean columns
dataframe_filtered.columns = [col.split('.')[-1] for col in dataframe_filtered.columns]

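# hard-coded labels for this single example; they should match the neighborhood whose coordinates were queried above
dataframe_filtered['Borough'] = 'North York'
dataframe_filtered['Neighborhood'] = 'Lawrence Manor, Lawrence Heights'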

###### After getting all the venues for each neighborhood, let's remove the unrelated features and keep only the interesting information

In [None]:
data_pure = dataframe_filtered[['name','categories','lat','lng','distance','Borough','Neighborhood']]
data_pure.head()

In [None]:
latitude=43.653226
longitude=-79.3831843
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)


# add markers to the map
for lat, lon, poi in zip(data_pure['lat'], data_pure['lng'], data_pure['Neighborhood']):
    # no clustering has been run yet, so the popup shows only the neighborhood name
    label = folium.Popup(str(poi), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        fill=True,
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [None]:
#get venues for each neighborhood
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
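
The list comprehension in `getNearbyVenues` indexes `categories[0]` directly, which raises an `IndexError` if a venue comes back without any category. As a minimal sketch, a defensive variant of the flattening step could look like the block below. The helper name `venue_rows` is hypothetical, and the sample items are hand-written to mimic the response shape used above; they are not real API output.

```python
def venue_rows(neighborhood, lat, lng, items):
    """Flatten explore items into tuples, tolerating venues with no category."""
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((neighborhood, lat, lng, venue.get("name"),
                     location.get("lat"), location.get("lng"), category))
    return rows


# Hand-written items and illustrative coordinates (not real API output).
sample_items = [
    {"venue": {"name": "Cafe A", "location": {"lat": 43.70, "lng": -79.40},
               "categories": [{"name": "Café"}]}},
    {"venue": {"name": "Unnamed Spot", "location": {"lat": 43.71, "lng": -79.41},
               "categories": []}},
]
rows = venue_rows("Parkwoods", 43.75, -79.33, sample_items)
print(rows[1][-1])  # None, because the second venue has no category
```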

In [None]:
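# for each neighborhood, get the most frequent venue categories

# fetch the venues of every neighborhood
# ('Latitude' and 'Longitude' are assumed to be the column names in the geospatial CSV)
toronto_venues = getNearbyVenues(names=df_lo_la_toronto['Neighborhood'],
                                 latitudes=df_lo_la_toronto['Latitude'],
                                 longitudes=df_lo_la_toronto['Longitude'])

# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood']

# move neighborhood column to the first column
fixed_columns = ['Neighborhood'] + [col for col in toronto_onehot.columns if col != 'Neighborhood']
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

# mean frequency of each venue category per neighborhood
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()

num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')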
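
Besides the most frequent categories, the analysis described at the start needs two counts per neighborhood: restaurants, and venues that attract people. The sketch below shows one possible way to get them with simple keyword matching. The keyword lists and the function names `classify_category` and `count_by_neighborhood` are assumptions for illustration and would need tuning against the categories actually returned (for example, a substring match on "park" also catches "Parking").

```python
from collections import Counter

# Hypothetical keyword lists; adjust them to the categories actually returned.
RESTAURANT_KEYWORDS = ("restaurant", "diner", "pizza", "burger")
ATTRACTOR_KEYWORDS = ("mall", "gym", "school", "park", "station", "theater")


def classify_category(category):
    """Label a venue category as 'restaurant', 'attractor', or 'other'."""
    name = (category or "").lower()
    if any(keyword in name for keyword in RESTAURANT_KEYWORDS):
        return "restaurant"
    if any(keyword in name for keyword in ATTRACTOR_KEYWORDS):
        return "attractor"
    return "other"


def count_by_neighborhood(venues):
    """Count venue groups per neighborhood from (neighborhood, category) pairs."""
    counts = {}
    for neighborhood, category in venues:
        counts.setdefault(neighborhood, Counter())[classify_category(category)] += 1
    return counts


# Made-up venue list, for illustration only.
sample = [
    ("Parkwoods", "Park"),
    ("Parkwoods", "Fast Food Restaurant"),
    ("Parkwoods", "Shopping Mall"),
    ("Malvern", "Pizza Place"),
]
counts = count_by_neighborhood(sample)
print(counts["Parkwoods"]["attractor"], counts["Parkwoods"]["restaurant"])  # 2 1
```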

In [None]:
# mean frequency of each category