/
Applied Data Science Capstone - Select the best neighborhood for a new restaurant in SFV-F.Uzcategui.Submitted.py
1020 lines (572 loc) · 31.2 KB
/
Applied Data Science Capstone - Select the best neighborhood for a new restaurant in SFV-F.Uzcategui.Submitted.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
#!/usr/bin/env python
# coding: utf-8
# <h1>Applied Data Science Capstone : Select the best neighborhood for a new restaurant in SFV </h1>
#
# <h3>Autor: Francihelena Uzcategui</h3>
#
# ## Table of Contents
#
# <ul>
# <li><a href="#wrangling">I.Introduction
# <li><a href="#wrangling">II.Data
# <li><a href="#wrangling">1. Data Wrangling: 1.1, 1.2, and 1.3</a></li>
# <li><a href="#cleaning">2. Data Cleaning: 2.1 , 2.2, and 2.3</a></li>
# <li><a href="#methodology">III.Methodology
# <li><a href="#eda">3. Exploratory Data Analysis: 3.1, 3.2, and 3.3 </a></li>
# <li><a href="#eda">4. Getting the latitude and the longitude coordinates of each neighborhood</a>
# <li><a href="#eda">5. Explore Los Angeles county, Los Angeles city and San Fernando Valley </a></li>
# <li><a href="#eda">5.1 Foursquare API to explore the San Fernando Valley neighborhoods and segment them </a></li>
# <li><a href="#eda">6. Explore Neighborhoods in San Fernando Valley</a></li>
# <li><a href="#eda">7. Analyze each neighborhoods of San Fernando Valley</a></li>
# <li><a href="#eda">8. Machine Learning Algorithms</a></li>
# <li><a href="#eda">8.1 Clusters in San Fernando Valley neighborhoods</a></li>
# <li><a href="#results">IV.Results
# <li><a href="#eda">9. Examine Clusters of San Fernando Valley</a></li>
# <li><a href="#eda"> V. Discussion</a></li>
# <li><a href="#eda"> VI. Conclusion</a></li>
# <li><a href="#eda"> 10. Limitations</a></li>
# </ul>
#
#
# ## I. Introduction
#
# <p>A client is looking to open a casual restaurant in San Fernando Valley, California.
# The San Fernando Valley is an urbanized valley in Los Angeles County, California [1].
# Nearly two-thirds of the Valley's land area is part of the city of Los Angeles.
# The Valley as well called, is surrounded by many touristic and iconic places of Los Angeles city and Los Angeles county.
# For example, San Fernando Valley is 18 miles north of Downtown Los Angeles. It is also about 17 miles from Santa Monica beach. These are just two of the numerous iconic's locations close to the Valley.
# Also, the entertainment industry's headquartered are here, such a Disney, Warner Bros., Universal Studios, Dreamworks Animation, Cartoon Network, and Motion Picture Association of America.
# In the last years, doing business in the San Fernando Valley has been going flexibly. Its cities earned this reputation, being a stimulus to the economic growth of the region [2]. Thus, it is going to be promissory and prosper launch a new business here, in your case, a Restaurant.</p>
#
# <p>For this analysis, we require data of the borough, neighborhood, ZIP Code, longitude, and latitude. But, the county of Los Angeles does not have boroughs; it has unincorporated communities, incorporated cities, and neighborhoods of the city of Los Angeles - such as San Fernando Valley.
# Thus, to deal with the lack of boroughs, we defined two categories San Fernando Valley and Los Angeles. The map below shows San Fernando Valley, our target.</p>
#
# ![alt text](SFV_MAP.png "MAP")
# <p><center>Source: http://maps.latimes.com/neighborhoods/region/san-fernando-valley/</center></p>
#
#
# Data used:
# 1. Congressional Districts Los Angeles County - By Zip Code --> To get zip codes by districs. For the sake of the data management, this link shows the zip codes and communities compacted, without losing information or sense.
# Source: http://www.laalmanac.com/government/gu02a.php
#
# 2. Los Angeles Zip Codes --> Getting the latitude and the longitude coordinates of each neighborhood in Los Angeles county.
# Source: https://data.lacounty.gov/GIS-Data/ZIP-Codes-and-Postal-Cities/wft9-k7e3
#
# 3. San Fernando Valley (target region) : Incorporated/Uncorporated cities and City of Los Angeles neighborhoods --> Target data.
# Source : https://en.wikipedia.org/wiki/San_Fernando_Valley
#
# Thus, with these three sets of information, we built a dataset with 5 columns: region, neighborhood, ZIP code, longitude, and latitude.
#
# To wrangle the data, we required to scrape the Wikipedia page and official pages, then cleaning it, joining between them, and creating a structured data frame.
#
# Once the data is in a structured format, we explored, visualized, segmented, and clustered the neighborhoods by the region of San Fernando Valley.
# <p>Source:
# [1] https://en.wikipedia.org/wiki/San_Fernando_Valley,
# [2] https://laedc.org/wtc/chooselacounty/regions-of-la-county/san-fernando-valley/</p>
# Download all the dependencies that we will need
# In[1]:
import matplotlib.pyplot as plt
from pywaffle import Waffle
import emoji
from functools import reduce
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analsysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import json # library to handle JSON files
get_ipython().system("conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab")
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
# import k-means from clustering stage
from sklearn.cluster import KMeans
#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library
print('Libraries imported.')
# ## II. Data : 1. Data Wrangling
# ## *1.1 First Data : Congressional Districts Los Angeles County - By Zip Code*
# Source:<a> http://www.laalmanac.com/government/gu02a.php</a> --> Direct dowloaded from the web page to Excel
# <p>We selected this data because of many zip code areas are split among two or more communities, being similar to the Borough organization to NYC or Toronto (previous exercise). </p>
# In[2]:
df1= pd.read_excel("C:/Users/Franchi/Documents/IBM/8th course_Applied Data Sciences Capstone Project/2nd project/data/Congressional Districts.LA COUNTY and Zip Codes.xlsx")
df1.head()
# We selected only Los Angeles, and it's communities or neighborhoods.
# ## II. Data : 2. Data Cleaning
# ## *2.1 First Data : Congressional Districts Los Angeles County - By Zip Code*
# In[3]:
df1= df1[df1['Zip Code – City/Community'].str.contains('Los Angeles')]
df1.head()
# Removing unwanted columns
# In[4]:
df1= df1.drop(['District(s)'],axis=1)
df1.head()
# Splitting 'Zip Code – City/Community' column to obtain two columns, one for Zip Code and another for City/Community, this last one requires more splitting(more steps below). Then renaming the columns.
# In[5]:
df1= df1['Zip Code – City/Community'].str.split("-", n=1, expand=True)
df1.columns = ["ZIP Code", "City/Community"] #renaming columns, but the .rename is not working
df1.head()
# <p>With the next splitting process, we cannot keep the ZIP Code column, but we require it, and we'll recover it by inner join, in the following steps. Thus, creating a new column index1 will be the matching criterion with the next data frame df2.<p/>
# In[6]:
df1['index1'] = df1.index
df1.head()
# Splitting 'City/Community' column to separate Los Angeles city from it's communities. Then renaming the columns
# In[7]:
df2=df1["City/Community"].str.split("(", n=1, expand=True)
df2.columns = ["City","Community"]
df2.head()
# To remove unwanted characters as the parenthesis )
# In[8]:
df2['Community']=df2['Community'].str.replace(r")"," ")
df2.head()
# Creating a new column 'index2', that have common values with the previous data frame df1['index1'], to apply the inner join.
# In[9]:
df2['index2'] = df2.index
df2.head()
# Applying inner join for the two data frames: df with df2, to consolidate into one data frame the ZIP Code, City, and Community.
# In[10]:
df3=df1.join(df2.set_index('index2'), on='index1', how='inner')
df3.head()
# Dropping unwanted column "index1"
# In[11]:
df3 = df3.drop(["City/Community","index1"], axis=1)
df3.head()
# ## III. Methodology : 3. Exploratory Data Analysis
#
# ## *3.1 First Data : Congressional Districts Los Angeles County - By Zip Code*
#
# Checking the data frame structure and data type
# In[12]:
df3.shape
# In[13]:
df3.isnull().sum(axis = 0)
# In[14]:
df3.dtypes
# Converting data type of ZIP Code column to avoid future error on the join method. Then, verify that the data type changed.
# In[15]:
df3['ZIP Code']=df3['ZIP Code'].astype(int)
df3.dtypes
# In[16]:
print(emoji.emojize(':pushpin: Voila! Above the ready First dataset(df3) with the structure format wanted --> Community: South Los Angeles , Florence-Graham. Several communities by each zip code'))
# ## II. Data : 1. Data Wrangling
# ## *1.2 Second Data : Los Angeles Zip Codes and Latitude/Longitude by each neighborhood in Los Angeles county*
# <p>Source: https://data.lacounty.gov/GIS-Data/ZIP-Codes-and-Postal-Cities/wft9-k7e3 --> Direct dowloaded from the web page to Excel</p>
#
# <p>Getting the latitude and the longitude coordinates of each neighborhood </p>
# In[17]:
df4=pd.read_csv("C:/Users/Franchi/Documents/IBM/8th course_Applied Data Sciences Capstone Project/2nd project/ZIP_Codes_and_Postal_Cities.DataLAcounty.csv")
df4.head()
# ## II. Data : 2. Data Cleaning
# ## *2.2 Second Data : Los Angeles Zip Codes and Latitude/Longitude by each neighborhood in Los Angeles county*
# Dropping unwanted columns
# In[18]:
df4=df4.drop(['Postal City 2','Postal City 3','Not Acceptable 1','Not Acceptable 2','Not Acceptable 3'], axis=1)
#df=df.drop([0,0])
df4=df4.rename(columns={"Postal City 1": "Neighborhoods"})
df4.head()
# Location column mixed information, and it contains the geographic coordinates and zip codes in the same row. Thus, it should be fixed by splitting the values and then erasing the duplicate zip codes.
# In[19]:
df4[['zipcode-duplicate','Latitude','Longitude']] =df4.Location.str.split(expand=True,) #https://cmdlinetips.com/2018/11/how-to-split-a-text-column-in-pandas/
df4.head()
# Dropping duplicates columns
# In[20]:
df4=df4.drop(['Location','zipcode-duplicate'], axis=1)
df4['Latitude']=df4['Latitude'].str.replace(r","," ")
df4.head()
# ## III. Methodology : 3. Exploratory Data Analysis
# ## *3.2 Second Data : Los Angeles Zip Codes and Latitude/Longitude by each neighborhood in Los Angeles county*
# Checking shape
# In[21]:
df4.shape
# Checking null values
# In[22]:
df4.isnull().sum(axis = 0)
# In[23]:
df4.dtypes
# In[24]:
print(emoji.emojize(':pushpin: Voila! Above is the ready Second dataset(df4) with the structure format wanted --> Columns: ZIP Code, Neighborhoods, Latitude, Longitude'))
# <p>Joining the two dataframes df3 and df4 to create a consolidated data frame with the required information:</p>
# <p> - Columns: ZIP Code, Latitude, Longitude, City, and Community</p>
# <p> - Several communities by each zip code</p>
# In[25]:
df_LA=df4.join(df3.set_index('ZIP Code'), on='ZIP Code', how='inner')
pd.set_option('display.max_rows', None)
df_LA#.head()
# Dropping unwanted column - Neighborhoods, because it contains general and duplicate information. Thus, we kept the Community column because it has more details of the neighborhoods and the format required.
# In[26]:
df_LA=df_LA.drop(['Neighborhoods'], axis=1)
df_LA.head()
# In[27]:
df_LA.shape
# Checking null values
# In[28]:
df_LA.isnull().sum(axis = 0)
# Dropping null values and verifying it
# In[29]:
df_LA = df_LA.dropna()
df_LA.isnull().sum(axis = 0)
# In[30]:
print(emoji.emojize(':pushpin: Voila! Above the ready Third dataset (df_LA) with the structure format required--> Columns: ZIP Code, Latitude, Longitude, City, Community'))
# As we mentioned in the Introduction section, our target is the San Fernando Valley rather than Los Angeles city; remember that San Fernando Valley is a conglomerated of towns within Los Angeles County. Hence, we already have a list with all the communities in San Fernando Valley, and it will join to the data frame df_LA.
# ## II. Data : 1. Data Wrangling
# ## *1.3 Third Data : San Fernando Valley --> Target data*
# Source: Municipalities and neighborhoods - <a> https://en.wikipedia.org/wiki/San_Fernando_Valley</a>
# In[31]:
df5=pd.read_excel("C:/Users/Franchi/Documents/IBM/8th course_Applied Data Sciences Capstone Project/2nd project/data/San Fernado Valley - Municipalities and neighborhoods.xlsx")
df5=df5.rename(columns={"Municipalities and neighborhoods": "Place"})
df5.head()
# ## II. Data : 2. Data Cleaning
# ## *2.3 Third Data : San Fernando Valley --> Target data*
# Convert data frame to list to then search these words within the data frame df_LA
# In[32]:
list_of_SFV_communities = df5['Place'].to_list()
list_of_SFV_communities
# Searching the list of words within the data frame, to get True or False values.
# In[33]:
#How to test if a string contains one of the substrings in a list, in pandas?
#Source: https://stackoverflow.com/questions/26577516/how-to-test-if-a-string-contains-one-of-the-substrings-in-a-list-in-pandas- Answer for Grant Shannon
df_LA["TrueFalse"] = df_LA['Community'].apply(lambda x: 1 if any(i in x for i in list_of_SFV_communities) else 0)
# In[34]:
df_LA
# Re defining the values of True or False. Next, creating a new column with these new values.
# In[35]:
#Add new column based on boolean values in a different column
#Source:https://stackoverflow.com/questions/25570147/add-new-column-based-on-boolean-values-in-a-different-column/25570219 - Answer for EdChum
temp = {True:'San Fernando Valley', False:'Los Angeles'}
df_LA['Cityy'] = df_LA['TrueFalse'].map(temp)
df_LA
# Dropping unwanted columns
# In[36]:
df_LA = df_LA.drop(['City', 'TrueFalse'], axis=1)
df_LA.head()
# Renaming a column
# In[37]:
df_LA = df_LA.rename(columns={'Cityy':'Region'})
df_LA.head()
# ## III.Methodology : 3 Exploratory Data Analysis
# ## *3.3 Third Data : San Fernando Valley --> Target data*
# Checking the dataframe structure
# In[38]:
df_LA.shape
# In[39]:
df_LA.dtypes
# In[40]:
df_LA.isnull().sum()
# ## 4. Getting the latitude and the longitude coordinates of each neighborhood
# In[41]:
print(emoji.emojize(':pushpin: Voila! Above the data frame (df_LA) has the latitude and the longitude coordinates of each neighborhood In Los Angeles County, including San Fernando Valley.'))
# In[42]:
df_LA.head()
# ## 5. Explore Los Angeles county, Los Angeles city and San Fernando Valley.
#
# First, we processed **Los Angeles county** (whole dataset) and visualized its geographic coordinates.
# In[43]:
address = 'Los Angeles, CA'
geolocator = Nominatim(user_agent="LA_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Los Angeles are {}, {}.'.format(latitude, longitude))
# In[44]:
# create map of New York using latitude and longitude values
map_LA = folium.Map(location=[34.0536909, -118.2427666], zoom_start=10)
# add markers to map
for lat, lng, region, community in zip(df_LA['Latitude'], df_LA['Longitude'], df_LA['Region'], df_LA['Community']):
folium.CircleMarker(
[lat, lng],
radius=5,
color='blue',
fill_color='#3186cc',
fill_opacity=0.7).add_to(map_LA)
map_LA
# Then, processing **San Fernando Valley** and visualized its geographic coordinates
# In[45]:
SFV_data = df_LA[df_LA['Region'] == 'San Fernando Valley'].reset_index(drop=True)
SFV_data.head()
# In[46]:
SFV_data.dtypes
# In[47]:
address = 'San Fernando Valley, CA'
geolocator = Nominatim(user_agent="San Fernando Valley_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinate of San Fernando Valley are {}, {}.'.format(latitude, longitude))
# In[48]:
# create map of San fernado Valley using latitude and longitude values
map_SFV = folium.Map(location=[34.2148853, -118.4998204], zoom_start=11)
# add markers to map
for lat, lng, region, community in zip(SFV_data['Latitude'], SFV_data['Longitude'], SFV_data['Region'], SFV_data['Community']):
folium.CircleMarker(
[lat, lng],
radius=5,
color='blue',
fill_color='#3186cc',
fill_opacity=0.7).add_to(map_SFV)
map_SFV
# Next, processing **Los Angeles City** and visualized its geographic coordinates
# To avoid mess up with too many cities, we defined two categories to analyze; our Target: San Fernando Valley, and, the Not target: =! San Fernando Valley.
# In[49]:
LA_City_data = df_LA[df_LA['Region'] != 'San Fernando Valley'].reset_index(drop=True) # if use == 'los Angeles' is empty
LA_City_data.head()
# In[50]:
address = 'Los Angeles city, CA'
geolocator = Nominatim(user_agent="LA_City_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Los Angeles city are {}, {}.'.format(latitude, longitude))
# In[51]:
# create map of East Toronto using latitude and longitude values
map_LA_City = folium.Map(location=[34.0536909, -118.2427666], zoom_start=11)
# add markers to map
for lat, lng, region, community in zip(LA_City_data['Latitude'], LA_City_data['Longitude'], LA_City_data['Region'], LA_City_data['Community']):
folium.CircleMarker(
[lat, lng],
radius=5,
color='blue',
fill_color='#3186cc',
fill_opacity=0.7).add_to(map_LA_City)
map_LA_City
# <h3>5.1 Foursquare API to explore the San Fernando Valley neighborhoods and segment them</h3>
#
# <p> Define Foursquare Credentials and Version</p>
#
# Remember that San Fernando Valley is our target
# In[52]:
CLIENT_ID = 'STODIITDYK4OHL2CWWSGTGUDWEEUYQH1RLWIRCD4CT3MGZP4'#'your-client-ID' # your Foursquare ID
CLIENT_SECRET = 'JJCMU1YNWJRADLOAMADR0V4NRGE45RRU0O0T0XBRRYCDQNYZ' #'your-client-secret' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)
# ## 6. Explore Neighborhoods in San Fernando Valley
# In[53]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT=100):
venues_list=[]
for name, lat, lng in zip(names, latitudes, longitudes):
print(name)
# create the API request URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
CLIENT_ID,
CLIENT_SECRET,
VERSION,
lat,
lng,
radius,
LIMIT)
# make the GET request
results = requests.get(url).json()["response"]['groups'][0]['items']
# return only relevant information for each nearby venue
venues_list.append([(
name,
lat,
lng,
v['venue']['name'],
v['venue']['location']['lat'],
v['venue']['location']['lng'],
v['venue']['categories'][0]['name']) for v in results])
nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
nearby_venues.columns = ['Community',
'Community Latitude',
'Community Longitude',
'Venue',
'Venue Latitude',
'Venue Longitude',
'Venue Category']
return(nearby_venues)
# Now write the code to run the above function on each neighborhood and create a new data frame called SFV_venues. SFV is the acronym of San Fernando Valley.
# In[54]:
SFV_venues = getNearbyVenues(names=SFV_data['Community'],
latitudes=SFV_data['Latitude'],
longitudes=SFV_data['Longitude']
)
# In[55]:
print(SFV_venues.shape)
SFV_venues.head()
# ## 7. Analyze each neighborhood of San Fernando Valley
# In[56]:
SFV_venues.groupby('Community').count()
# In[57]:
print('There are {} uniques categories.'.format(len(SFV_venues['Venue Category'].unique())))
# In[58]:
# one hot encoding
SFV_onehot = pd.get_dummies(SFV_venues[['Venue Category']], prefix="", prefix_sep="")
# add neighborhood column back to dataframe
SFV_onehot['Community'] = SFV_venues['Community']
# move neighborhood column to the first column
fixed_columns = [SFV_onehot.columns[-1]] + list(SFV_onehot.columns[:-1])
SFV_onehot = SFV_onehot[fixed_columns]
SFV_onehot.head()
# In[59]:
#And let's examine the new dataframe size
SFV_onehot.shape
# Grouping rows by neighborhood and by taking the mean of the frequency of occurrence of each category
# In[60]:
SFV_grouped = SFV_onehot.groupby('Community').mean().reset_index()
SFV_grouped
# In[61]:
#confirm the new size
SFV_grouped.shape
# Writing a function to sort the venues in descending order.
# In[62]:
def return_most_common_venues(row, num_top_venues):
row_categories = row.iloc[1:]
row_categories_sorted = row_categories.sort_values(ascending=False)
return row_categories_sorted.index.values[0:num_top_venues]
# Creating the new data frame and displaying each neighborhood with the top 10 most common venues
# In[63]:
num_top_venues = 10
indicators = ['st', 'nd', 'rd']
# create columns according to number of top venues
columns = ['Community']
for ind in np.arange(num_top_venues):
try:
columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
except:
columns.append('{}th Most Common Venue'.format(ind+1))
# create a new dataframe
community_venues_sorted = pd.DataFrame(columns=columns)
community_venues_sorted['Community'] = SFV_grouped['Community']
for ind in np.arange(SFV_grouped.shape[0]):
community_venues_sorted.iloc[ind, 1:] = return_most_common_venues(SFV_grouped.iloc[ind, :], num_top_venues)
community_venues_sorted.head()
# ## 8. Machine Learning Algorithms
#
# Community venues dataset has two dimensions, the typical venues, and the communities. Furthermore, finding common patterns is a simple and efficient way to deal with it. K - means algorithms help to cluster these features. We set the number of clusters: two. This algorithm looks for similar group venues within each neighborhood of San Fernando Valley.
# <h2> 8.1 Clusters in San Fernando Valley neighborhoods</h2>
# In[64]:
#Run k-means to cluster the neighborhood into 2 clusters.
# set number of clusters
kclusters = 2
SFV_grouped_clustering = SFV_grouped.drop('Community', 1)
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(SFV_grouped_clustering)
# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]
# Let's create a new data frame that includes the clusters and the top 10 venues for each neighborhood.
# In[65]:
# add clustering labels
community_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
SFV_merged = SFV_data
# merge LA_grouped with LA_data to add latitude/longitude for each neighborhood
SFV_merged = SFV_merged.join(community_venues_sorted.set_index('Community'), on='Community')
SFV_merged # check the last columns!
# In[66]:
SFV_merged.shape
# Verifying null values
# In[67]:
SFV_merged['Cluster Labels'].isnull().sum()
# Then, we removed the null value and verified it.
# In[68]:
SFV_merged=SFV_merged.dropna()
# In[69]:
SFV_merged['Cluster Labels'].isnull().sum()
# We must convert Cluster Labels type to an integer because it'll generate an error on Visualizing the results clusters--> TypeError: list indices must be integers or slices, not floats.
# In[70]:
SFV_merged.dtypes
# SFV_merged= SFV_merged['Cluster Labels'].astype(int)
# SFV_merged
# SFV_merged
# SFV_merged=SFV_merged['Cluster Labels'].astype(int)
# Finally, let's visualize the resulting clusters
# In[71]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)
# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(SFV_merged['Latitude'], SFV_merged['Longitude'], SFV_merged['Community'], SFV_merged['Cluster Labels']):
label = folium.Popup(str(poi) + ' Cluster ' + str(cluster))
folium.CircleMarker(
[lat, lon],
radius=5,
color=rainbow[cluster-1],
fill_color=rainbow[cluster-1],
fill_opacity=0.7).add_to(map_clusters)
map_clusters
# ## IV. Results
# ## 9. Examine Clusters: San Fernando Valley
#
# To determine the discriminating venue categories that distinguish each cluster. Based on the defining categories, we assigned a name to each cluster.
# <h3>Cluster 1 </h3>
# In[72]:
# Cluster 1 ==> 'Cluster Labels'] == 0
# SFV_merged.columns[[4] ==> It is the column wanted to show-index[4]
cluster_1=SFV_merged.loc[SFV_merged['Cluster Labels'] == 0, SFV_merged.columns[[3] + list(range(5, SFV_merged.shape[1]))]]
cluster_1
# <h3>Cluster 2 </h3>
# In[73]:
# # Cluster 2 ==> 'Cluster Labels'] == 1
cluster_2=SFV_merged.loc[SFV_merged['Cluster Labels'] == 1, SFV_merged.columns[[3] + list(range(5, SFV_merged.shape[1]))]]
cluster_2
# <h3>Results : Most common Restaurants and Places to eat within Cluster 1 and 2 </h3>
# <p>The type of food for the restaurant is not defined yet, either the location. Thus, it is early to suggest a definitive decision, although at this stage we have the information required, the different areas and the variety of Restaurants along the San Fernando Valley.</p>
#
# The results of the most common venues help us to know more about the Restaurant industry and customer preferences. In our case, we focused on **Restaurants and Places to eat within both clusters**. The most common venues are International restaurants, such as Oriental, Italian, French, Middle East, Mediterranean, Mexican, and South American.
# **Most common Restaurants and Places to eat within Cluster 1**
# In[74]:
#Source: https://thispointer.com/pandas-get-unique-values-in-single-or-multiple-columns-of-a-dataframe-in-python/
#uniqueValues = (empDfObj['Name'].append(empDfObj['Age'])).unique()
# Get unique elements in multiple columns
uniqueValues = (cluster_1['1st Most Common Venue'].append(cluster_1['2nd Most Common Venue']).append(cluster_1['3rd Most Common Venue']).append(cluster_1['4th Most Common Venue']).append(cluster_1['5th Most Common Venue']).append(cluster_1['6th Most Common Venue']).append(cluster_1['7th Most Common Venue']).append(cluster_1['8th Most Common Venue']).append(cluster_1['9th Most Common Venue']).append(cluster_1['10th Most Common Venue'])).unique()
print('Unique elements in the 10 columns of Most Common Venues :')
print(uniqueValues)
#In case of required convert one-dimensional NumPy Array to List
#list1 = uniqueValues.tolist()
#print(f'List: {list1}')
# In[75]:
# processed by hand
Cluster1_most_common_venues_Restaurant_and_Places_in_SFV = ['Mexican Restaurant', 'Sandwich Place', 'Pizza Place', 'Food Truck', 'Vietnamese Restaurant', 'Sushi Restaurant', 'Chinese Restaurant', 'Indian Restaurant', 'Brazilian Restaurant', 'Fast Food Restaurant', 'American Restaurant','South American Restaurant','Japanese Restaurant', 'Middle Eastern Restaurant', 'Thai Restaurant', 'French Restaurant', 'Fried Chicken Joint', 'Mediterranean Restaurant', 'Salad Place', 'Filipino Restaurant', 'Diner', 'Italian Restaurant', 'Seafood Restaurant', 'Korean Restaurant', 'Taco Place', 'Greek Restaurant', 'Eastern European Restaurant', 'Cajun / Creole Restaurant', 'Shabu-Shabu Restaurant', 'Falafel Restaurant', 'Kosher Restaurant']
Cluster1_most_common_venues_Restaurant_and_Places_in_SFV
# <li>Cluster 1: along all the San Fernando Valley, the leader of the type of Restaurant is the Oriental, including Chinese, Japanese, Thai, Korean, etc. Follow by Mexican food, next, the Italian/ Pizza food. Finally, similar preference for Fast food restaurants, and Sandwich places.</li>
# In[76]:
# processed by hand
Cluster1_most_commom_venues_top_5_restaurant = {'Italian_restaurant_and_Pizza_place': [16], 'Oriental_restaurant': [39], 'Mexican_Restaurant':[29], 'Fast Food':[12], 'Sandwich place':[12] }
Cluster1_most_commom_venues_top_5_restaurant
# In[77]:
plt.figure(
FigureClass=Waffle,
rows=5,
columns=10,
figsize=(11,5),
values={'Italian_restaurant_and_Pizza_place':16, 'Oriental_restaurant':39, 'Mexican_Restaurant':29, 'Fast Food':12, 'Sandwich place':12},
title={'label': 'Top 5 most common Restaurant and Places to eat of Cluster 1', 'fontdict': {'fontsize': 20}},
legend={'loc': 'upper left', 'bbox_to_anchor': (1, 1), 'fontsize': 12},
plot_anchor='C')
plt.show
# **Most commom Restaurants and Places to eat in Cluster 2**
# <li>Cluster 2: Despite it is a small cluster, yet there is a balanced tendency for Mediterranean food (Falafel), French restaurant, Food truck, and Filipino restaurant.</li>
# In[78]:
# processed by hand
Cluster2_most_commom_venues_Restaurant_and_Places_in_SFV=['Falafel Restaurant','French Restaurant','Food Truck','Filipino Restaurant']
Cluster2_most_commom_venues_Restaurant_and_Places_in_SFV
# In[80]:
plt.figure(
FigureClass=Waffle,
rows=5,
columns=10,
figsize=(10,5),
values={'Falafel Restaurant':1,'French Restaurant':1,'Food Truck':1,'Filipino Restaurant':1},
title={'label': 'Top 4 most common Restaurant and Places to eat of Cluster 2', 'fontdict': {'fontsize': 20}},
legend={'loc': 'upper left', 'bbox_to_anchor': (1, 1), 'fontsize': 12},
plot_anchor='C')
plt.show