# Introduction and Business Problem

For my Capstone project, I will compare the neighborhoods of the island of Oahu, Hawaii. I will address the following question:

*Can Foursquare data help incoming Oahu residents decide where on the island they would like to live?*

Thousands of people move to Oahu each year from the U.S. mainland and from abroad (http://dbedt.hawaii.gov/blog/19-40/). I myself moved to Oahu from the mainland six years ago. When hunting for an apartment or house, many people must choose which part of the island to live in. Some incoming residents may even intend to open a business on Oahu. Others move to Hawaii with their families and are interested in neighborhood schools and crime rates. I would like to provide useful insight to help incoming Oahu residents decide between neighborhoods on the island.

I will use Foursquare data to analyze differences in the types and frequencies of businesses in each neighborhood. Where possible, I will also analyze relationships between neighborhood crime and school performance.

In addition to general analysis of the island, two specific examples will be considered:
* Mary is moving to Oahu and is considering opening either a bakery or an Italian restaurant. Are there neighborhoods that favor one over the other?
* John is moving to Oahu with his family. He would like a neighborhood with nearby parks, but also low crime and good high schools. 


# Data

Foursquare data will be obtained through the use of a Foursquare Developer Account and the Foursquare API. I'll use the Oahu neighborhoods and postal codes found at https://www.kimicorrea.com/oahu-zip-codes/. 

For data on Oahu public schools, I will use the Strive HI Performance System results for the 2017-2018 school year. This data was released by the Hawaii State Department of Education and is available for download as an Excel sheet at http://www.hawaiipublicschools.org/VisionForSuccess/AdvancingEducation/StriveHIPerformanceSystem/Pages/2017-18-results.aspx.
To simplify this analysis, I will generate a combined math, language, and science score for public high schools only.

For Oahu crime data, I will utilize the Honolulu Police Department's 2018 Annual Report (http://www.honolulupd.org/downloads/HPD2018annualreport.pdf).

To join the different data sets, I will have to retrieve the postal code for each Oahu school and determine which police district each neighborhood falls in (the Oahu crime data is organized into eight separate districts).

I plan to use a k-means clustering algorithm to group the zip codes according to the most frequent types of business found within a radius of each zip code. I will also look into the feasibility of including the crime and school data in the clustering. Lastly, I will use NumPy to look for a possible correlation between the crime and school data.
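
As a rough illustration of the clustering step, the sketch below one-hot encodes venue categories, averages them per zip code, and runs k-means on the resulting frequency table. The venue rows, the column name `Venue Category`, and the choice of `n_clusters=2` are made-up placeholders for illustration, not data or settings from this project.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical venue rows; real rows would come from the Foursquare API.
venues = pd.DataFrame({
    'Zip Code': [96701, 96701, 96706, 96706, 96712, 96712],
    'Venue Category': ['Bakery', 'Park', 'Bakery', 'Park', 'Beach', 'Surf Spot'],
})

# One-hot encode the categories, then average per zip code to get
# the relative frequency of each category around that zip code.
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'Zip Code', venues['Zip Code'])
freq = onehot.groupby('Zip Code').mean()

# Cluster zip codes by category frequency (k is chosen arbitrarily here).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(freq)
clusters = pd.Series(kmeans.labels_, index=freq.index, name='Cluster')
print(clusters)
```

In this toy input, the two zip codes with identical category mixes end up in the same cluster. With real data, the number of clusters would need to be chosen by inspection or with a method such as the elbow heuristic.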

### Foursquare Data


In [68]:
# The code was removed by Watson Studio for sharing.

Unnamed: 0,Neighborhood,Zip Code
0,Aiea,96701
1,Ewa Beach,96706
2,Kapolei,96707
3,Kapolei,96709
4,Haleiwa,96712


In [19]:
#Import necessary libraries
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation
!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
# import k-means from clustering stage
from sklearn.cluster import KMeans
!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library
import json # library to handle JSON files
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe
print('Libraries imported.')


In [48]:
# The code was removed by Watson Studio for sharing.

In [35]:
#Install pgeocode to lookup lat/lon with zip code
!pip install pgeocode
import pgeocode




In [69]:
# Use geopy's Nominatim geocoder to get Honolulu's location for use in map creation later on
address = 'Honolulu, HI'
geolocator = Nominatim(user_agent="explorer")
location = geolocator.geocode(address)
honolulu_latitude = location.latitude
honolulu_longitude = location.longitude

# Use pgeocode to look up the latitude and longitude of each postal code in df_zip
nomi = pgeocode.Nominatim('us')
rows = []
for index, row in df_zip.iterrows():
    location = nomi.query_postal_code(row['Zip Code'])
    rows.append({'Neighborhood': row['Neighborhood'],
                 'Zip Code': row['Zip Code'],
                 'Latitude': location.latitude,
                 'Longitude': location.longitude})
# Build the dataframe in one step; DataFrame.append was removed in pandas 2.0
column_names = ['Neighborhood', 'Zip Code', 'Latitude', 'Longitude']
df_latlon = pd.DataFrame(rows, columns=column_names)
df_latlon.head()


Unnamed: 0,Neighborhood,Zip Code,Latitude,Longitude
0,Aiea,96701,21.3908,-157.9332
1,Ewa Beach,96706,21.3274,-158.0103
2,Kapolei,96707,21.3453,-158.087
3,Kapolei,96709,21.3233,-158.0058
4,Haleiwa,96712,21.6312,-158.0693
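
With coordinates available for each zip code, nearby venues can be requested from Foursquare. The sketch below only builds the request URL. It assumes the legacy v2 `venues/explore` endpoint, which may no longer be available to new developer accounts. The helper name `build_explore_url`, the credentials, the version date, the radius, and the limit are all placeholders rather than values used in this notebook.

```python
from urllib.parse import urlencode

def build_explore_url(lat, lon, client_id, client_secret,
                      version='20180605', radius=1000, limit=100):
    """Build a request URL for the legacy Foursquare v2 venues/explore endpoint."""
    params = {
        'client_id': client_id,          # placeholder credential
        'client_secret': client_secret,  # placeholder credential
        'v': version,                    # API version date (assumed value)
        'll': f'{lat},{lon}',            # latitude,longitude of the search center
        'radius': radius,                # search radius in meters (assumed value)
        'limit': limit,                  # maximum number of venues (assumed value)
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)

# Example using the Aiea (96701) coordinates from the table above.
url = build_explore_url(21.3908, -157.9332, 'YOUR_CLIENT_ID', 'YOUR_CLIENT_SECRET')
print(url)
```

The resulting URL could then be passed to `requests.get(url).json()`. The layout of the response depends on the API version, so parsing is not shown here.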


### Public High School Data

In [67]:

body = client_d2cf653ced7f4ce7bdabe9c0e97f1701.get_object(Bucket='datascienceproject-donotdelete-pr-8nunocffzcnlco',Key='oahu_hs.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df_schools = pd.read_csv(body)
#Now sum the math, language, and science scores into one total score
sum_score=df_schools['Math Proficiency']+df_schools['Language Proficiency']+df_schools['Science Proficiency']
df_schools['Sum Proficiency']=sum_score
df_schools.head()

Unnamed: 0,School,Graduation,Math Proficiency,Language Proficiency,Science Proficiency,Neighborhood/District,Zip,District,Sum Proficiency
0,Farrington High,73,26,54,22,Kalihi,96817,5,102
1,Kaimuki High,70,12,57,9,Kaimuki,96816,7,78
2,Kalani High,91,53,67,54,East Honolulu,96821,7,174
3,McKinley High,74,43,58,28,Makiki,96814,1,129
4,Roosevelt High,87,46,73,42,Central Honolulu,96822,1,161


### Oahu Crime Data

In [63]:
body = client_d2cf653ced7f4ce7bdabe9c0e97f1701.get_object(Bucket='datascienceproject-donotdelete-pr-8nunocffzcnlco',Key='2018 Crime Data.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df_crime = pd.read_csv(body)
df_crime.head()


Unnamed: 0,District,Total Crimes
0,1,6808
1,2,2472
2,3,4525
3,4,3104
4,5,4917


### Combine All Data

In [70]:
# Average the high school scores within the same zip code
df_schools2 = df_schools.groupby(['Zip', 'District'])['Sum Proficiency'].mean().reset_index()
# Add the school names back, comma-separated where a zip code has more than one school
df_schools3 = df_schools.groupby(['Zip'])['School'].apply(lambda x: ','.join(x)).reset_index()
df_schools_combined = df_schools2.join(df_schools3.set_index('Zip'), on='Zip')
# Add total crimes by district
df_schools_crime = df_schools_combined.join(df_crime.set_index('District'), on='District')
# Add latitude and longitude
df_combined = df_schools_crime.join(df_latlon.set_index('Zip Code'), on='Zip')
df_combined

Unnamed: 0,Zip,District,Sum Proficiency,School,Total Crimes,Neighborhood,Latitude,Longitude
0,96701,3,133.0,Aiea High,4525,Aiea,21.3908,-157.9332
1,96706,8,142.0,Campbell High,4723,Ewa Beach,21.3274,-158.0103
2,96707,8,95.0,Kapolei High,4723,Kapolei,21.3453,-158.087
3,96731,4,114.0,Kahuku H&I,3104,Kahuku,21.675,-157.9725
4,96734,4,140.5,"Kailua High,Kalaheo High",3104,Kailua,21.4063,-157.7448
5,96744,4,129.0,Castle High,3104,Kaneohe,21.4228,-157.8115
6,96782,3,159.0,Pearl City High,4525,Pearl City,21.4084,-157.9652
7,96786,2,122.0,Leilehua High,2472,Wahiawa,21.5006,-158.0435
8,96789,2,175.0,Mililani High,2472,Mililani,21.4531,-158.0174
9,96791,2,123.0,Waialua H&I,2472,Waialua,21.5766,-158.1267
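
The combined table makes it possible to check the relationship between crime and school performance mentioned in the Data section. As a minimal sketch, the values below are copied from the `Sum Proficiency` and `Total Crimes` columns of the ten rows shown above, and `np.corrcoef` gives the Pearson correlation between them. Crime totals are reported per district, so zip codes in the same district share a value, and any result from so few points should be read with caution.

```python
import numpy as np

# Values copied from the ten rows of df_combined shown above.
sum_proficiency = np.array([133.0, 142.0, 95.0, 114.0, 140.5,
                            129.0, 159.0, 122.0, 175.0, 123.0])
total_crimes = np.array([4525, 4723, 4723, 3104, 3104,
                         3104, 4525, 2472, 2472, 2472])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the Pearson correlation between the two variables.
r = np.corrcoef(sum_proficiency, total_crimes)[0, 1]
print(f'Pearson r = {r:.3f}')
```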
