# Coursera Capstone Project

## Table of contents
-  WEEK 1
    -  Introduction
    -  Business problem and targeted audience
    -  Data sources
-  WEEK 2
    -  Coding
    -  <a href="#agglomerative_clustering">Link to Report</a>
    -  <a href="#agglomerative_clustering">Link to Presentations</a>  

# WEEK 1. 
## Introduction
My Capstone Project focuses on developing ML models that recommend the best-suited LA locations as expansion opportunities for Patsy's Pizzeria, one of the most iconic pizzerias in New York City.

Since opening in 1933, Patsy’s has built a loyal client base and become a regular hangout for many celebrities, including Frank Sinatra, Dean Martin and Tony Bennett. Despite its popularity, Patsy’s still operates at a single location at 2287 First Avenue, New York, NY 10035.

The project's ML models compare neighborhoods (at the zip code level) on a common set of features and recommend the ones that best match the current successful location.


## Business problem and targeted audience.
The model helps resolve a common problem: identifying the optimal location for expanding a successful business model.

While large corporations employ a team of data engineers and scientists to address the problem, there is a large underserved market of small businesses that are deprived of those analytical capabilities. 

Although the focus of the project is Patsy's Pizzeria, the model could be leveraged by other businesses to address similar expansion challenges.


## Data sources.

### Open source data and feature engineering. 
The following sources were used to engineer features describing the total number of businesses and the income distribution of the population, both for the zip code of Patsy's current location and for all zip codes of Los Angeles:
-  Lat/long for all US zip codes (https://gist.github.com/erichurst/7882666)
-  Operating US businesses based on zip codes (https://www2.census.gov/programs-surveys/cbp/datasets/2016/zbp16detail.zip?#)
-  IRS tax return (household) stats based on zip codes (https://www.irs.gov/statistics/soi-tax-stats-individual-income-tax-statistics-2016-zip-code-data-soi).
    
### Venue and location data from FourSquare
FourSquare was used to obtain common business categories that are relevant to the identified geographic location. 

# WEEK 2. 
## Coding

In [None]:
import numpy as np 
import pandas as pd
from scipy import ndimage 
from scipy.cluster import hierarchy 
from scipy.spatial import distance_matrix 
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors 
from matplotlib import pyplot as plt 
from sklearn import manifold, datasets 
from sklearn.cluster import AgglomerativeClustering 
from sklearn.datasets import make_blobs 
# import k-means from clustering stage
from sklearn.cluster import KMeans
import json # library to handle JSON files
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)
import requests # library to handle requests

!conda install -c conda-forge folium=0.5.0 --yes # comment out this line if folium is already installed
import folium # map rendering library

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
# from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
%matplotlib inline
print('Libraries imported.')

In [None]:
# The code was removed by Watson Studio for sharing.

#### body variable below points to the uploaded file zip_code_data.csv

<h2 id="data_cleaning">Data Cleaning</h2>

In [116]:
df_data = df_data.dropna()
df_data = df_data.reset_index(drop=True)

### Feature selection

In [117]:
#featureset = df_data_1[['under_25','between_25_50','between_50_75','between_75_100','between_100_200','over_200','bus_cnt']]
featureset = df_data[['IND_INCOME_25', 'IND_INCOME_25_50', 'IND_INCOME_50_75', 'IND_INCOME_75_100', 
                        'IND_INCOME_100_200', 'IND_INCOME_OVER_200', 'BUS_EMPL_1_4', 'BUS_EMPL_5_9', 
                        'BUS_EMPL_10_19', 'BUS_EMPL_20_49', 'BUS_EMPL_50_99', 'BUS_EMPL_100_249', 
                        'BUS_EMPL_250_499', 'BUS_EMPL_500_999', 'BUS_EMPL_OVER_1000']]

# featureset = df_data[['IND_INCOME_25', 'IND_INCOME_25_50', 'IND_INCOME_50_75', 'IND_INCOME_75_100', 
#                         'IND_INCOME_100_200', 'IND_INCOME_OVER_200', 'BUS_TOTAL']]

### Normalization

In [118]:
from sklearn.preprocessing import MinMaxScaler
x = featureset.values #returns a numpy array
min_max_scaler = MinMaxScaler()
feature_mtx = min_max_scaler.fit_transform(x)
feature_mtx [0:5]



array([[ 0.62705667,  0.50961538,  0.52161383,  0.39252336,  0.27826087,
         0.13333333,  0.25164864,  0.3277027 ,  0.34946237,  0.36024845,
         0.42372881,  0.30952381,  0.14285714,  0.07692308,  0.09090909],
       [ 1.        ,  0.8282967 ,  0.47550432,  0.21962617,  0.04927536,
         0.        ,  0.03323661,  0.08783784,  0.07526882,  0.08074534,
         0.03389831,  0.        ,  0.07142857,  0.        ,  0.        ],
       [ 0.68372943,  0.62087912,  0.42363112,  0.31775701,  0.19710145,
         0.1625    ,  0.35267739,  0.25337838,  0.32795699,  0.2484472 ,
         0.22033898,  0.16666667,  0.07142857,  0.07692308,  0.        ],
       [ 0.63711152,  0.4739011 ,  0.26801153,  0.14485981,  0.07826087,
         0.025     ,  0.34159852,  0.52027027,  0.66666667,  0.62111801,
         0.49152542,  0.35714286,  0.64285714,  0.07692308,  0.18181818],
       [ 0.55210238,  0.51236264,  0.52737752,  0.46261682,  0.4       ,
         0.18333333,  0.10630441,  0.1722973 , 
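
As a quick sanity check on what the scaler does, the sketch below reproduces min-max scaling by hand on a small made-up array (the values are illustrative and not taken from `zip_code_data.csv`). Each column is mapped to the range [0, 1] via `(x - min) / (max - min)`, which is what `MinMaxScaler` computes with its default `feature_range`.

```python
import numpy as np

# Hypothetical toy feature matrix: 3 zip codes x 2 features (illustrative values only)
toy = np.array([[10.0, 200.0],
                [20.0, 400.0],
                [40.0, 300.0]])

col_min = toy.min(axis=0)
col_max = toy.max(axis=0)

# Column-wise min-max scaling: every feature ends up in the range [0, 1]
toy_scaled = (toy - col_min) / (col_max - col_min)
print(toy_scaled)
```

Because each column is scaled independently, features with very different magnitudes (household counts versus business counts) contribute on a comparable scale to the Euclidean distances computed next.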

<h2 id="clustering_using_scipy">Clustering using Scipy</h2>

In [119]:
from scipy.spatial.distance import euclidean
leng = feature_mtx.shape[0]
D = np.zeros([leng, leng])  # pairwise Euclidean distance matrix (scipy.zeros is no longer available in recent SciPy)
for i in range(leng):
    for j in range(leng):
        D[i, j] = euclidean(feature_mtx[i], feature_mtx[j])

In [120]:
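import pylab
import scipy.cluster.hierarchy
# D is a square distance matrix, so linkage treats its rows as observation vectors
# (SciPy warns about this); pass squareform(D) or feature_mtx to cluster on the
# distances themselves, which would change the labels shown below.
Z = hierarchy.linkage(D, 'complete')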



In [121]:
from scipy.cluster.hierarchy import fcluster
max_d = 3
clusters = fcluster(Z, max_d, criterion='distance')
clusters

array([ 11,  27,  85, 114,  79, 111, 105,  21,  26,   8,  85,  85,   9,
        18,  25,  15,  69,  29,  20,  87,  84,  86,  26,  90,  36,  35,
       111,  86,  71,   6,  13,  78,  33,  54,  52,  23,  81,  82,  83,
        40,  36,   4,  93,  14,  76,  70,  29,  10,  80,  12,  48,  78,
        65,  76,  92,  40,  44,  97,  89,  21, 116,   4,   8,  23,   5,
        47,  75,  51,  47,  43,  63,  94, 106,  68,  37,  64,  70, 110,
       112,  11,  94,  51,  49, 115,  29, 100,  17,  17,  19,  41,  17,
        53,  49,  97,  52,  16,  64,  47, 104, 116,  67,   6,  70,   5,
        94,  10,  64,  71, 115,  97,  14, 100, 117, 116,  81,  23,  20,
        47,  17, 101,  16,  79,  11,  48,  47,  48,  96,  81,  95,  77,
        51,  46,  32,  73,  33,  47,  24,   5,  91,  94,  29,  46,  98,
        92,  30,  71,   2,  77,  55,  41,   6,  87,  86,  40,  15,  66,
        91,  55,  41,  83,  80,  80,  74,  82, 118,  40,  44,  40,  81,
        11,  83, 103,  18,  54,   3,   6,  96,  43,  84,  73,   
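
To make the linkage and cut-off steps easier to follow, here is a minimal sketch on made-up 2-D points (not the project data). It uses `pdist` to build the condensed distance vector that `linkage` expects, then cuts the tree with `fcluster` at a distance threshold, which plays the same role as `max_d` above. The point coordinates and the threshold of 1.0 are assumptions chosen for illustration.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist

# Hypothetical points: two tight groups far apart from each other
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0]])

# Condensed pairwise Euclidean distances (n*(n-1)/2 entries)
condensed = pdist(points, metric='euclidean')

# Complete linkage measures cluster distance by the farthest pair of members
Z_toy = hierarchy.linkage(condensed, method='complete')

# Cut the tree: points merged at a distance <= 1.0 share a cluster label
labels = hierarchy.fcluster(Z_toy, t=1.0, criterion='distance')
print(labels)
```

A lower threshold yields more, smaller clusters; a higher one merges more zip codes into the same group.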

In [122]:
cluster_output = pd.DataFrame({'ZIPCODE':df_data.ZIP_CODE.tolist() , 'cluster':clusters})
# cluster_output
# cluster 11 is the cluster that contains the current location (zip code 10035)
cluster_6 = cluster_output['cluster'] == 11
df_6 = cluster_output[cluster_6]
df_6.head(500)

Unnamed: 0,ZIPCODE,cluster
0,10035,11
79,91205,11
122,91803,11
169,92395,11
257,93637,11
279,93955,11
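
The hard-coded label 11 is simply the cluster that contains the current location (zip code 10035). A hedged alternative, sketched below on a small stand-in table, looks the label up from the reference zip code instead, so the selection keeps working if the labels change. The helper name `same_cluster_as` and the extra row are assumptions for illustration; in the real data the zip codes may be stored as integers rather than strings.

```python
import pandas as pd

# Stand-in for cluster_output: three rows copied from the output above,
# plus one made-up row (zip '00000') that belongs to a different cluster
toy_output = pd.DataFrame({
    'ZIPCODE': ['10035', '91205', '91803', '00000'],
    'cluster': [11, 11, 11, 27],
})

def same_cluster_as(df, reference_zip):
    """Return the rows that share a cluster label with reference_zip."""
    label = df.loc[df['ZIPCODE'] == reference_zip, 'cluster'].iloc[0]
    return df[df['cluster'] == label]

matches = same_cluster_as(toy_output, '10035')
print(matches)
```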


In [None]:
cluster_output

In [None]:
fig = pylab.figure(figsize=(18,50))
def llf(id):
    return '[%s]' % (df_data['ZIP_CODE'][id])
    
dendro = hierarchy.dendrogram(Z, 
                              leaf_label_func=llf, 
                              leaf_rotation=0, 
                              leaf_font_size =12,
#                               truncate_mode='lastp',  # show only the last p merged clusters
#                               p=22,  # show only the last p merged clusters                              
                              orientation = 'top')

Let's focus on the 5 zip codes that share a cluster with the current business location (zip code 10035).

In [210]:
body = client_932e651001094171ad8152ab24eb579c.get_object(Bucket='courseramlassignment-donotdelete-pr-sul542raczimxp',Key='tbl_final_list.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df_final_list = pd.read_csv(body)

df_final_list['LAT'] = df_final_list['LAT'].astype(float)
df_final_list['LONG'] = df_final_list['LONG'].astype(float)
df_final_list['ZIPCODE'] = df_final_list['ZIPCODE'].astype('str')
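# drop only a trailing '.0'; rstrip('.0') would also eat trailing zeros of the zip code itself
df_final_list['ZIPCODE'] = df_final_list['ZIPCODE'].str.replace(r'\.0$', '', regex=True)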


# df_final_list['ZIPCODE'] = df_final_list['ZIPCODE'].astype('str')

# Eliminate current NY location. this dataframe will be used later to produce map 
df_final_list1 = df_final_list.drop(df_final_list.index[[0]])
df_final_list1.dropna(how='any', inplace=True)
df_final_list1

Unnamed: 0,ZIPCODE,LAT,LONG,IND_INCOME_25,IND_INCOME_25_50,IND_INCOME_50_75,IND_INCOME_75_100,IND_INCOME_100_200,IND_INCOME_OVER_200,BUS_TOTAL,BUS_EMPL_1_4,BUS_EMPL_5_9,BUS_EMPL_10_19,BUS_EMPL_20_49,BUS_EMPL_50_99,BUS_EMPL_100_249,BUS_EMPL_250_499,BUS_EMPL_500_999,BUS_EMPL_OVER_1000
1,91205,34.13658,-118.245839,7350.0,4240.0,1980.0,1090.0,1220.0,240.0,4488.0,2832.0,696.0,468.0,348.0,102.0,30.0,12.0,0.0,0.0
2,91803,34.074736,-118.145959,5910.0,3770.0,1900.0,1180.0,1560.0,310.0,4682.0,2954.0,708.0,504.0,342.0,108.0,48.0,18.0,0.0,0.0
3,92395,34.501472,-117.292048,7660.0,4110.0,2040.0,1100.0,1160.0,220.0,3608.0,1790.0,780.0,516.0,354.0,84.0,54.0,18.0,12.0,0.0
4,93637,36.918079,-120.185933,6120.0,4520.0,2110.0,1090.0,1210.0,270.0,3790.0,1738.0,876.0,540.0,408.0,138.0,60.0,18.0,6.0,6.0
5,93955,36.61441,-121.786901,5470.0,4470.0,2210.0,1150.0,1270.0,200.0,3546.0,1828.0,770.0,426.0,342.0,132.0,36.0,6.0,6.0,0.0
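
When a float-typed zip code is cast to a string it picks up a trailing `.0`. Removing it with `str.rstrip('.0')` is fragile, because `rstrip` removes any trailing run of the characters `.` and `0` rather than the literal suffix. The five zip codes here happen to survive, but a zip code that itself ends in zero would be truncated. The short sketch below shows the difference on example strings, using a regular expression that only drops a trailing `.0` (the helper name `strip_float_suffix` is hypothetical).

```python
import re

def strip_float_suffix(zip_text):
    """Drop a trailing '.0' (left over from a float cast) and nothing else."""
    return re.sub(r'\.0$', '', zip_text)

# '91205.0' is safe either way; '90210.0' is a made-up example ending in zero
print('91205.0'.rstrip('.0'), strip_float_suffix('91205.0'))
print('90210.0'.rstrip('.0'), strip_float_suffix('90210.0'))
```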


In [211]:
CLIENT_ID = 'your-client-ID' # your Foursquare ID
CLIENT_SECRET = 'your-client-secret' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: your-client-ID
CLIENT_SECRET:your-client-secret


In [212]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius
current_zip = 10035

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&near={}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    current_zip, 
    radius, 
    LIMIT)
url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=your-client-ID&client_secret=your-client-secret&v=20180605&near=10035&radius=500&limit=100'

In [228]:
def getNearbyVenues (zipcode):
    """Build the Foursquare explore URL for a zip code; the caller sends the request."""
    LIMIT = 200 # limit of number of venues returned by Foursquare API
    radius = 1000 # define radius
    
    url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&near={}&radius={}&limit={}'.format(
        CLIENT_ID, 
        CLIENT_SECRET, 
        VERSION, 
        zipcode, 
        radius, 
        LIMIT)
    
    return url
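
As an alternative to formatting the query string by hand, the sketch below builds the same kind of URL with `urllib.parse.urlencode`, which also takes care of escaping. The function name `build_explore_url` and the placeholder credentials are assumptions; the endpoint and parameter names are the ones used above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://api.foursquare.com/v2/venues/explore'

def build_explore_url(zipcode, client_id, client_secret,
                      version='20180605', radius=1000, limit=200):
    """Return the explore URL for one zip code (hypothetical helper)."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'near': zipcode,
        'radius': radius,
        'limit': limit,
    }
    return BASE_URL + '?' + urlencode(params)

# Placeholder credentials, for illustration only
example_url = build_explore_url('91205', 'your-client-ID', 'your-client-secret')
print(example_url)
```

Passing the credentials as arguments rather than reading module-level globals also makes it easier to keep real keys out of a shared notebook.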

In [229]:
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [230]:
venues_list=[]
nearby_venues_all = pd.DataFrame([])
for index, row in df_final_list1.iterrows():
    
    zipcode=row['ZIPCODE'] 
#     print(zipcode)
    url = getNearbyVenues(zipcode)
    results = requests.get(url).json()
#     print(results)  
    venues = results['response']['groups'][0]['items']    

    nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
    filtered_columns = ['venue.name', 'venue.categories', 'venue.location.postalCode']
#     filtered_columns = ['venue.name', 'venue.categories', zipcode]
    
    nearby_venues =nearby_venues.loc[:, filtered_columns]
    nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)
    nearby_venues['target_zipcode'] = zipcode
    nearby_venues_all = pd.concat([nearby_venues_all, nearby_venues]) # DataFrame.append was removed in pandas 2.0

#     venues_list.append([(
#         v['venue']['name'],
#         v['venue']['categories'][0]['name'],
#         v['venue']['location']['postalCode']) for v in nearby_venues])     
    
# # filter the category for each row
# nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# # clean columns
# nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
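
To see what the loop does to each response without calling the API, the sketch below flattens a small hand-written list shaped like the `items` entries used above and reduces each category list to its first name. The venue dictionaries are made up, and the real payload carries many more fields; `first_category` is a hypothetical helper that mirrors `get_category_type`.

```python
import pandas as pd

# Made-up items mimicking results['response']['groups'][0]['items']
sample_items = [
    {'venue': {'name': 'Example Park',
               'categories': [{'name': 'Park'}],
               'location': {'postalCode': '91205'}}},
    {'venue': {'name': 'Example Kiosk',
               'categories': [],
               'location': {'postalCode': '91205'}}},
]

# Nested keys become dotted column names such as 'venue.location.postalCode'
flat = pd.json_normalize(sample_items)

def first_category(categories):
    """Return the first category name, or None when the list is empty."""
    return categories[0]['name'] if categories else None

flat['venue.categories'] = flat['venue.categories'].apply(first_category)
print(flat[['venue.name', 'venue.categories', 'venue.location.postalCode']])
```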




In [236]:
nearby_venues_all.columns = ['venue_name', 'venue_category', 'venue_zipcode', 'target_zipcode']
# nearby_venues_all.columns = ['venue_name', 'venue_category', 'target_zipcode']
venues_by_zip = nearby_venues_all[['venue_name', 'venue_category', 'target_zipcode']]

print(venues_by_zip.shape)
venues_by_zip.head(200)

(120, 3)


Unnamed: 0,venue_name,venue_category,target_zipcode
0,Maple Park,Park,91205
1,Red Ribbon Bakeshop,Bakery,91205
2,Kanpai Ramen,Ramen Restaurant,91205
3,Kim's Kitchen,Korean Restaurant,91205
4,Mission Wine & Spirit,Wine Shop,91205
5,Karina's Cake House,Bakery,91205
6,Tarme Mediterranean Grill,Mediterranean Restaurant,91205
7,Indra Thai Restaurant,Thai Restaurant,91205
8,Art's Bakery,Bakery,91205
9,Mario's Italian Deli & Market,Italian Restaurant,91205


## Analyze Each Neighborhood

In [237]:
venues_by_zip = venues_by_zip.dropna(how='any') # reassign instead of inplace to avoid SettingWithCopyWarning
venues_by_zip.groupby('venue_category').count()


venue_category,venue_name,target_zipcode
American Restaurant,1,1
Argentinian Restaurant,1,1
Asian Restaurant,5,5
Bakery,4,4
Bank,1,1
Basketball Court,1,1
Bowling Alley,1,1
Burger Joint,2,2
Business Service,3,3
Café,1,1


In [239]:
# one hot encoding
onehot = pd.get_dummies(venues_by_zip[['venue_category']], prefix="", prefix_sep="")

# add target zip code column back to dataframe
onehot['target_zipcode'] = venues_by_zip['target_zipcode'] 

# move target zip code column to the first column
fixed_columns = [onehot.columns[-1]] + list(onehot.columns[:-1])
onehot = onehot[fixed_columns]

onehot.head()

Unnamed: 0,target_zipcode,American Restaurant,Argentinian Restaurant,Asian Restaurant,Bakery,Bank,Basketball Court,Bowling Alley,Burger Joint,Business Service,Café,Cajun / Creole Restaurant,Chinese Restaurant,Chiropractor,Cocktail Bar,Coffee Shop,Construction & Landscaping,Convenience Store,Cosmetics Shop,Cuban Restaurant,Department Store,Donut Shop,Electronics Store,Farm,Fast Food Restaurant,Fried Chicken Joint,Garden,Gift Shop,Golf Course,Greek Restaurant,Grocery Store,Gun Range,Gym / Fitness Center,Health & Beauty Service,Hookah Bar,Hot Dog Joint,Indonesian Restaurant,Intersection,Italian Restaurant,Karaoke Bar,Kebab Restaurant,Korean Restaurant,Market,Martial Arts Dojo,Mediterranean Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Mobile Phone Shop,Motel,Motorcycle Shop,Noodle House,Optical Shop,Park,Persian Restaurant,Pharmacy,Pilates Studio,Pizza Place,Ramen Restaurant,Rental Car Location,Restaurant,Sandwich Place,Spa,Supermarket,Szechuan Restaurant,Taco Place,Thai Restaurant,Trail,Train Station,Video Store,Vietnamese Restaurant,Wine Shop,Wings Joint
0,91205,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,91205,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,91205,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,91205,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,91205,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
5,91205,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,91205,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7,91205,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
8,91205,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9,91205,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [241]:
zipcode_grouped = onehot.groupby('target_zipcode').mean().reset_index()
zipcode_grouped

Unnamed: 0,target_zipcode,American Restaurant,Argentinian Restaurant,Asian Restaurant,Bakery,Bank,Basketball Court,Bowling Alley,Burger Joint,Business Service,Café,Cajun / Creole Restaurant,Chinese Restaurant,Chiropractor,Cocktail Bar,Coffee Shop,Construction & Landscaping,Convenience Store,Cosmetics Shop,Cuban Restaurant,Department Store,Donut Shop,Electronics Store,Farm,Fast Food Restaurant,Fried Chicken Joint,Garden,Gift Shop,Golf Course,Greek Restaurant,Grocery Store,Gun Range,Gym / Fitness Center,Health & Beauty Service,Hookah Bar,Hot Dog Joint,Indonesian Restaurant,Intersection,Italian Restaurant,Karaoke Bar,Kebab Restaurant,Korean Restaurant,Market,Martial Arts Dojo,Mediterranean Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Mobile Phone Shop,Motel,Motorcycle Shop,Noodle House,Optical Shop,Park,Persian Restaurant,Pharmacy,Pilates Studio,Pizza Place,Ramen Restaurant,Rental Car Location,Restaurant,Sandwich Place,Spa,Supermarket,Szechuan Restaurant,Taco Place,Thai Restaurant,Trail,Train Station,Video Store,Vietnamese Restaurant,Wine Shop,Wings Joint
0,91205,0.0,0.017544,0.035088,0.052632,0.0,0.0,0.017544,0.017544,0.0,0.0,0.0,0.017544,0.017544,0.017544,0.017544,0.0,0.035088,0.017544,0.017544,0.0,0.035088,0.0,0.0,0.070175,0.0,0.0,0.0,0.0,0.017544,0.017544,0.017544,0.0,0.017544,0.017544,0.0,0.0,0.0,0.017544,0.0,0.017544,0.017544,0.0,0.017544,0.035088,0.0,0.017544,0.0,0.017544,0.017544,0.0,0.0,0.035088,0.017544,0.017544,0.017544,0.035088,0.017544,0.017544,0.017544,0.035088,0.017544,0.017544,0.0,0.0,0.052632,0.017544,0.0,0.017544,0.0,0.017544,0.017544
1,91803,0.0,0.0,0.068182,0.022727,0.022727,0.0,0.0,0.022727,0.0,0.0,0.022727,0.022727,0.0,0.0,0.022727,0.0,0.022727,0.0,0.0,0.022727,0.022727,0.0,0.0,0.045455,0.045455,0.022727,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.022727,0.022727,0.022727,0.045455,0.0,0.0,0.0,0.022727,0.0,0.0,0.068182,0.0,0.022727,0.0,0.0,0.022727,0.022727,0.022727,0.0,0.022727,0.0,0.068182,0.0,0.0,0.0,0.022727,0.045455,0.0,0.045455,0.022727,0.0,0.0,0.022727,0.022727,0.045455,0.0,0.0
2,92395,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.142857,0.285714,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,93637,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,93955,0.125,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.375,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0


In [242]:
zipcode_grouped.shape

(5, 72)
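
The two steps above (one-hot encoding, then averaging per zip code) turn a list of venues into a per-zip category frequency. A toy version with made-up venues makes the arithmetic visible: a zip code with three venues, two of them bakeries, gets a Bakery frequency of 2/3. The zip labels `zip_A` and `zip_B` are placeholders.

```python
import pandas as pd

# Made-up venues for two zip codes (illustration only)
toy_venues = pd.DataFrame({
    'target_zipcode': ['zip_A', 'zip_A', 'zip_A', 'zip_B'],
    'venue_category': ['Bakery', 'Bakery', 'Park', 'Park'],
})

# One indicator column per category
toy_onehot = pd.get_dummies(toy_venues['venue_category']).astype(float)
toy_onehot.insert(0, 'target_zipcode', toy_venues['target_zipcode'])

# The mean of the 0/1 indicators is the share of venues in each category
toy_grouped = toy_onehot.groupby('target_zipcode').mean().reset_index()
print(toy_grouped)
```

Note that these frequencies are relative to the number of venues returned for each zip code, so a zip code with only a handful of venues can show large shares for a single category.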

#### Identify the top 5 most common venues for every zipcode

In [243]:
num_top_venues = 5

for hood in zipcode_grouped['target_zipcode']:
    print("----"+hood+"----")
    temp = zipcode_grouped[zipcode_grouped['target_zipcode'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----91205----
                  venue  freq
0  Fast Food Restaurant  0.07
1                Bakery  0.05
2       Thai Restaurant  0.05
3     Convenience Store  0.04
4      Asian Restaurant  0.04


----91803----
                 venue  freq
0     Asian Restaurant  0.07
1   Mexican Restaurant  0.07
2          Pizza Place  0.07
3                  Spa  0.05
4  Szechuan Restaurant  0.05


----92395----
                        venue  freq
0           Convenience Store  0.29
1                 Pizza Place  0.14
2                 Karaoke Bar  0.14
3  Construction & Landscaping  0.14
4                        Café  0.14


----93637----
                  venue  freq
0                  Farm  0.25
1  Gym / Fitness Center  0.25
2    Mexican Restaurant  0.25
3             Gift Shop  0.25
4   American Restaurant  0.00


----93955----
                 venue  freq
0     Business Service  0.38
1  American Restaurant  0.12
2    Korean Restaurant  0.12
3     Basketball Court  0.12
4                Trail  0.12

### Identify the top 10 most common venues

In [245]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [246]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['target_zipcode']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
venues_sorted = pd.DataFrame(columns=columns)
venues_sorted['target_zipcode'] = zipcode_grouped['target_zipcode']

for ind in np.arange(zipcode_grouped.shape[0]):
    venues_sorted.iloc[ind, 1:] = return_most_common_venues(zipcode_grouped.iloc[ind, :], num_top_venues)

venues_sorted.head()

Unnamed: 0,target_zipcode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,91205,Fast Food Restaurant,Bakery,Thai Restaurant,Convenience Store,Pizza Place,Mediterranean Restaurant,Sandwich Place,Park,Asian Restaurant,Donut Shop
1,91803,Asian Restaurant,Mexican Restaurant,Pizza Place,Fried Chicken Joint,Italian Restaurant,Szechuan Restaurant,Spa,Fast Food Restaurant,Vietnamese Restaurant,Hot Dog Joint
2,92395,Convenience Store,Café,Construction & Landscaping,Karaoke Bar,Pizza Place,Electronics Store,Donut Shop,Farm,Fast Food Restaurant,Fried Chicken Joint
3,93637,Mexican Restaurant,Gym / Fitness Center,Farm,Gift Shop,Wings Joint,Department Store,Donut Shop,Electronics Store,Fast Food Restaurant,Fried Chicken Joint
4,93955,Business Service,American Restaurant,Trail,Korean Restaurant,Basketball Court,Golf Course,Greek Restaurant,Grocery Store,Gun Range,Cuban Restaurant
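
A compact way to check the ranking logic is to run it on a single made-up frequency row. The sketch below uses `Series.nlargest`, a swapped-in alternative to the `sort_values` call above, and builds the ordinal column labels with a small helper (`ordinal` is a hypothetical name, not part of the notebook).

```python
import pandas as pd

def ordinal(n):
    """Return '1st', '2nd', '3rd', '4th', ... (11th-13th handled explicitly)."""
    if 10 <= n % 100 <= 13:
        return '{}th'.format(n)
    return '{}{}'.format(n, {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th'))

# Made-up category frequencies for one zip code
toy_row = pd.Series({'Bakery': 0.05, 'Park': 0.04, 'Pizza Place': 0.07, 'Spa': 0.01})

# Categories ordered from most to least frequent, truncated to the top 3
top = toy_row.nlargest(3).index.tolist()
labels = ['{} Most Common Venue'.format(ordinal(i + 1)) for i in range(len(top))]
print(dict(zip(labels, top)))
```

When several categories share the same frequency, as they do for the sparser zip codes above, the order among the tied entries depends on the sorting method, so the lower ranks should be read with that in mind.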


In [None]:
print(view_venues.shape)
view_venues.head()

In [247]:
# The code was removed by Watson Studio for sharing.

(804, 2)

In [248]:
import folium
body = client_932e651001094171ad8152ab24eb579c.get_object(Bucket='courseramlassignment-donotdelete-pr-sul542raczimxp',Key='tbl_final_list.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df_final_list = pd.read_csv(body)

df_final_list['LAT'] = df_final_list['LAT'].astype(float)
df_final_list['LONG'] = df_final_list['LONG'].astype(float)
# eliminate current NY location
df_final_list1 = df_final_list.drop(df_final_list.index[[0]])
df_final_list1.dropna(how='any', inplace=True)
# df_final_list1.head()

# center the map on the first candidate zip code (91205)
mapit = folium.Map( location=[34.136580,-118.245839], zoom_start=6 )
# 'width=int' and 'height=int' can also be added to the map

# add one marker per candidate zip code
for index, row in df_final_list1.iterrows():
    folium.Marker(location=[ row['LAT'], row['LONG']]).add_to(mapit) 
mapit.save( 'map.html')
mapit



In [143]:
df_final_list1

Unnamed: 0,ZIPCODE,LAT,LONG,IND_INCOME_25,IND_INCOME_25_50,IND_INCOME_50_75,IND_INCOME_75_100,IND_INCOME_100_200,IND_INCOME_OVER_200,BUS_TOTAL,BUS_EMPL_1_4,BUS_EMPL_5_9,BUS_EMPL_10_19,BUS_EMPL_20_49,BUS_EMPL_50_99,BUS_EMPL_100_249,BUS_EMPL_250_499,BUS_EMPL_500_999,BUS_EMPL_OVER_1000
1,91205.0,34.13658,-118.245839,7350.0,4240.0,1980.0,1090.0,1220.0,240.0,4488.0,2832.0,696.0,468.0,348.0,102.0,30.0,12.0,0.0,0.0
2,91803.0,34.074736,-118.145959,5910.0,3770.0,1900.0,1180.0,1560.0,310.0,4682.0,2954.0,708.0,504.0,342.0,108.0,48.0,18.0,0.0,0.0
3,92395.0,34.501472,-117.292048,7660.0,4110.0,2040.0,1100.0,1160.0,220.0,3608.0,1790.0,780.0,516.0,354.0,84.0,54.0,18.0,12.0,0.0
4,93637.0,36.918079,-120.185933,6120.0,4520.0,2110.0,1090.0,1210.0,270.0,3790.0,1738.0,876.0,540.0,408.0,138.0,60.0,18.0,6.0,6.0
5,93955.0,36.61441,-121.786901,5470.0,4470.0,2210.0,1150.0,1270.0,200.0,3546.0,1828.0,770.0,426.0,342.0,132.0,36.0,6.0,6.0,0.0
