# Introduction

This is the Data Science Capstone Project for Coursera. The goal is to provide supporting information for a family choosing a neighborhood based on its educational and child care needs. To do so, a given set of neighborhoods is clustered by the presence of day cares and preschools and by the ratings of nearby schools. The work involves the following steps:

### Data Gathering: 
1. Getting latitude and longitude coordinates for a predefined set of neighborhoods
2. Getting nearby schools for these neighborhoods using the Foursquare API
3. Obtaining ratings of the schools from greatschools.org

### Data Preparation:
4. Massaging the data for clustering

### Clustering:
5. Clustering the data using KMeans Algorithm
6. Analyzing resulting clusters


In [1]:
import pandas as pd # library for data analysis
from pandas.io.json import json_normalize
import requests # library to handle requests
from lxml import etree #xml to parse school ratings
import math

#!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

#!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library

import numpy as np # library to handle data in a vectorized manner
import matplotlib.cm as cm
import matplotlib.colors as colors

#Libraries for reading credentials and school dictionary
import pickle
import base64

#Library to match school names
#!conda install -c conda-forge python-levenshtein
import Levenshtein

print('Libraries imported.')

Libraries imported.


In [2]:
# Python program to store constants required for this project.
%run Constants.ipynb

#### Read the list of neighborhoods from a CSV file. Note that these are neighborhoods in Greater Houston, TX

In [3]:
# neighborhoods file name is taken from Constants file
data = pd.read_csv(neighborhoodsFile)
data.head()

Unnamed: 0,Neighborhood,City
0,Acacia Park,"Spring, TX"
1,Albury Trails Estates,"The Woodlands, TX"
2,Alden Bridge,"Spring, TX"
3,Alden Bridge Hollylaurel,"Spring, TX"
4,Alden Trace,"Spring, TX"


#### Get the geographical coordinates for the list of neighborhoods

In [4]:
latlong_list = []
# Create the geocoder once instead of once per row
geolocator = Nominatim(user_agent="city_explorer")

for index, row in data.iterrows():
    city = row['City']
    neighborhood = row['Neighborhood']
    try:
        location = geolocator.geocode(neighborhood+','+city)
    except Exception:
        location = None  # lookup failed (for example, a timeout)
    if location is None:
        continue  # skip neighborhoods that could not be resolved
    latlong_list.append({'Neighborhood': neighborhood, 'City': city, 'Resolved Name': location,
                         'Latitude': location.latitude, 'Longitude': location.longitude})

neigh_latlong = pd.DataFrame(latlong_list, columns=['Neighborhood', 'City', 'Resolved Name', 'Latitude', 'Longitude'])
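
The loop above gives up on a neighborhood after the first failed lookup. A minimal sketch of bounded retries is shown below. `geocode_with_retry` and its parameters are hypothetical names introduced here for illustration, and the `geocode` argument stands in for any callable such as `geolocator.geocode`; the flaky geocoder is a stand-in so the example runs without network access.

```python
import time

def geocode_with_retry(geocode, query, attempts=3, delay_seconds=1.0):
    """Call geocode(query) up to `attempts` times; return None if all fail.

    Hypothetical helper: `geocode` is any callable that returns a location,
    returns None when nothing matches, or raises on a transient error.
    """
    for attempt in range(attempts):
        try:
            # A None result means "no match", so it is returned without retrying.
            return geocode(query)
        except Exception:
            if attempt < attempts - 1:
                time.sleep(delay_seconds)  # brief pause before the next try
    return None


# Stand-in geocoder for illustration: fails once, then succeeds.
calls = {'count': 0}

def flaky_geocode(query):
    calls['count'] += 1
    if calls['count'] == 1:
        raise TimeoutError('simulated timeout')
    return (30.218422, -95.529759)

result = geocode_with_retry(flaky_geocode, 'Acacia Park,Spring, TX', delay_seconds=0)
print(result)
```

Retrying only on exceptions keeps genuine "no match" results from being requested repeatedly.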

In [5]:
print(neigh_latlong.shape)
neigh_latlong.head()

(184, 5)


Unnamed: 0,Neighborhood,City,Resolved Name,Latitude,Longitude
0,Acacia Park,"Spring, TX","(Acacia Park, Alden Bridge, The Woodlands, Mon...",30.218422,-95.529759
1,Albury Trails Estates,"The Woodlands, TX","(Albury Trails Estates, Harris County, Texas, ...",30.069389,-95.580927
2,Alden Bridge,"Spring, TX","(Alden Bridge, The Woodlands, Montgomery Count...",30.213204,-95.517991
3,Alden Bridge Hollylaurel,"Spring, TX","(Hollylaurel, Alden Bridge, The Woodlands, Mon...",30.223754,-95.518385
4,Alden Trace,"Spring, TX","(Alden Trace, Alden Bridge, The Woodlands, Mon...",30.208315,-95.519652


#### Plot neighborhoods on map

In [6]:
#city is taken from constants file
geolocator = Nominatim(user_agent="city_explorer")
location = None
while(location is None):
    try:
        location = geolocator.geocode(city)
    except:
        break
print(location)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates are {}, {}.'.format(latitude, longitude))

The Woodlands, Montgomery County, Texas, USA
The geographical coordinates are 30.1734194, -95.504686.


In [7]:
def showMap(latitude, longitude, neighborhood_data):
    # create a map centered on the given latitude and longitude
    map = folium.Map(location=[latitude, longitude], zoom_start=10)

    # add markers to map
    for lat, lng, neighborhood in zip(neighborhood_data['Latitude'], neighborhood_data['Longitude'], neighborhood_data['Neighborhood']):
        label = '{}'.format(neighborhood)
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color='blue',
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7,
            parse_html=False).add_to(map)  
    
    return map

In [8]:
map = showMap(latitude, longitude, neigh_latlong)
map

#### Foursquare credentials and the GreatSchools key are encoded and stored in a file by separate code. They are retrieved here.

In [9]:

# Use a context manager so the file handle is closed after loading
with open("secrets.pkl", "rb") as secrets_file:
    stored_creds = pickle.load(secrets_file)
foursquare_id = base64.b64decode(stored_creds['foursquare_clientid']).decode("utf-8", "ignore")
foursqure_secret = base64.b64decode(stored_creds['foursquare_secret']).decode("utf-8", "ignore")
greatschools_key = base64.b64decode(stored_creds['greatschool_key']).decode("utf-8", "ignore")

#VERSION, LIMIT, category (foursquare categoryid for school), radius, foursquareUrl are taken from Constants file
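
The code that writes `secrets.pkl` is not part of this notebook. As a rough sketch of what such a round trip might look like, the example below Base64-encodes placeholder values, pickles them to an in-memory buffer (standing in for the file), and decodes them the same way the cell above does. The credential values are placeholders, and the writer side is an assumption. Note that Base64 only obscures the values; it does not encrypt them.

```python
import base64
import io
import pickle

# Placeholder values; real keys would come from the API providers.
plain_creds = {
    'foursquare_clientid': 'example-client-id',
    'foursquare_secret': 'example-client-secret',
    'greatschool_key': 'example-greatschools-key',
}

# Encode each value as Base64 bytes (obfuscation only, not encryption).
encoded_creds = {key: base64.b64encode(value.encode('utf-8'))
                 for key, value in plain_creds.items()}

# Pickle into an in-memory buffer; a real script would write to "secrets.pkl".
buffer = io.BytesIO()
pickle.dump(encoded_creds, buffer)

# Read back and decode, mirroring the notebook cell.
buffer.seek(0)
stored_creds_example = pickle.load(buffer)
decoded_id = base64.b64decode(stored_creds_example['foursquare_clientid']).decode('utf-8')
print(decoded_id)
```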

#### Use the Foursquare API to get nearby schools for each neighborhood. The search endpoint is used with the category set to 'school', since only schools are of interest here

In [10]:
def getNearbyVenues(names, latitudes, longitudes):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = foursquareUrl.format(
            foursquare_id, 
            foursqure_secret, 
            lat, 
            lng, 
            VERSION, 
            category, 
            radius, 
            LIMIT)
        
            
        # make the GET request
        results = requests.get(url).json()
        venues = results['response']['venues']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['name'], 
            v['location']['lat'], 
            v['location']['lng'],  
            v['categories'][0]['name']) for v in venues])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'School', 
                  'School Latitude', 
                  'School Longitude', 
                  'School Category']
    
    return(nearby_venues)
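
In the function above, `v['categories'][0]` raises an `IndexError` for a venue that has no categories, and a response without a `response` key raises a `KeyError`. The sketch below shows one way to flatten a response defensively. `extract_venue_rows` is a hypothetical helper, and the sample response contains only the fields the notebook reads; it is not a full Foursquare payload.

```python
def extract_venue_rows(name, lat, lng, results):
    """Flatten one search response into rows, skipping malformed venues."""
    venues = results.get('response', {}).get('venues', [])
    rows = []
    for v in venues:
        categories = v.get('categories') or []
        location = v.get('location', {})
        if not categories or 'lat' not in location or 'lng' not in location:
            continue  # cannot classify or place this venue
        rows.append((name, lat, lng, v.get('name'),
                     location['lat'], location['lng'], categories[0]['name']))
    return rows


# Illustrative response: one usable venue and one without a category.
sample = {'response': {'venues': [
    {'name': 'Barbara Pierce Bush Elementary',
     'location': {'lat': 30.211636, 'lng': -95.517427},
     'categories': [{'name': 'Elementary School'}]},
    {'name': 'Venue without a category',
     'location': {'lat': 30.2, 'lng': -95.5},
     'categories': []},
]}}

rows = extract_venue_rows('Acacia Park', 30.218422, -95.529759, sample)
print(rows)
```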

In [11]:
schools = getNearbyVenues(names=neigh_latlong['Neighborhood'],
                                   latitudes=neigh_latlong['Latitude'],
                                   longitudes=neigh_latlong['Longitude']
                                  )
schools.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,School,School Latitude,School Longitude,School Category
0,Acacia Park,30.218422,-95.529759,Woodlands Civic Ballet,30.207763,-95.527541,School
1,Acacia Park,30.218422,-95.529759,Bush Elementary School,30.211384,-95.517843,School
2,Acacia Park,30.218422,-95.529759,Barbara Pierce Bush Elementary,30.211636,-95.517427,Elementary School
3,Acacia Park,30.218422,-95.529759,Kumon Math and Reading Center of The Woodlands...,30.208512,-95.529289,School
4,Acacia Park,30.218422,-95.529759,Legacy Preparatory Christian Academy,30.225233,-95.545745,High School


In [12]:
print('There are {} unique schools.'.format(len(schools['School'].unique())))

There are 348 unique schools.


#### Separate day cares and preschools (referred to as childcare centers from here on). They are merged back later, after the school ratings are obtained and the data is cleaned up

In [13]:
#daycareCategories constant is taken from Constants file
daycares = schools[schools['School Category'].isin(daycareCategories)].reset_index(drop=True)
print('There are {} unique daycares or preschools.'.format(len(daycares['School'].unique())))
daycares.head()

There are 23 unique daycares or preschools.


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,School,School Latitude,School Longitude,School Category
0,Altwood,30.180225,-95.554421,Primrose School of The Woodlands at Sterling R...,30.183106,-95.538375,Daycare
1,Augusta Pines,30.125041,-95.537935,Primrose School of The Woodlands at Creekside ...,30.13973,-95.547159,Daycare
2,Augusta Pines,30.125041,-95.537935,The Goddard School,30.123991,-95.555491,Daycare
3,Belcourte,30.187361,-95.538795,Primrose School of The Woodlands at Sterling R...,30.183106,-95.538375,Daycare
4,Bethany Bend,30.211766,-95.54126,Little Sunshine's Playhouse & Preschool,30.209921,-95.549726,Preschool


#### One-hot encoding of childcare centers. The intention here is only to check whether such a facility exists in the vicinity of a neighborhood

In [14]:

# one hot encoding
onehot = pd.get_dummies(daycares[['School Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
onehot['Neighborhood'] = daycares['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [onehot.columns[-1]] + list(onehot.columns[:-1])
onehot = onehot[fixed_columns]

onehot.head(10)

Unnamed: 0,Neighborhood,Daycare,Preschool
0,Altwood,1,0
1,Augusta Pines,1,0
2,Augusta Pines,1,0
3,Belcourte,1,0
4,Bethany Bend,0,1
5,Bonny Branch,1,0
6,Bridgestone,1,0
7,Canoe Bend,1,0
8,Canoe Bend,0,1
9,Canyon Gate At Northpointe,1,0


In [15]:
# Grouping the data to eliminate duplicate rows and combine daycare and preschool information per neighborhood
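# A list of columns is required here; selecting with a bare tuple fails on recent pandas versions
onehot = onehot.groupby(['Neighborhood'])[['Daycare','Preschool']].max().reset_index()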
onehot.head(10)

Unnamed: 0,Neighborhood,Daycare,Preschool
0,Altwood,1,0
1,Augusta Pines,1,0
2,Belcourte,1,0
3,Bethany Bend,0,1
4,Bonny Branch,1,0
5,Bridgestone,1,0
6,Canoe Bend,1,1
7,Canyon Gate At Northpointe,1,0
8,Carlton Woods Dr,1,0
9,Cascade Canyon,1,0


#### Filtering out schools so that we have only Elementary, Middle and High schools

In [16]:
#schoolCategories and schoolFilter are taken from Constants file
schools = schools[schools['School Category'].isin(schoolCategories)].reset_index(drop=True)
filtered_schools = schools[schools['School'].str.contains(schoolFilter)].reset_index(drop=True)
print('There are {} unique schools.'.format(len(filtered_schools['School'].unique())))
filtered_schools.head(10)

There are 104 unique schools.


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,School,School Latitude,School Longitude,School Category
0,Acacia Park,30.218422,-95.529759,Barbara Pierce Bush Elementary,30.211636,-95.517427,Elementary School
1,Alden Bridge,30.213204,-95.517991,Barbara Pierce Bush Elementary,30.211636,-95.517427,Elementary School
2,Alden Bridge Hollylaurel,30.223754,-95.518385,Barbara Pierce Bush Elementary,30.211636,-95.517427,Elementary School
3,Alden Trace,30.208315,-95.519652,Barbara Pierce Bush Elementary,30.211636,-95.517427,Elementary School
4,Alden Trace,30.208315,-95.519652,Galatas Elementary,30.196711,-95.527682,Elementary School
5,Alden Trace,30.208315,-95.519652,Powell Elementary,30.19912,-95.511201,Elementary School
6,Altwood,30.180225,-95.554421,Coulson Tough Elementary School,30.183184,-95.571458,Elementary School
7,Altwood,30.180225,-95.554421,The Woodlands Christian High School,30.177607,-95.535643,High School
8,Ashford Place,30.038838,-95.586603,Blackshear Elementary,30.036122,-95.584608,Elementary School
9,Ashford Place,30.038838,-95.586603,Kohrville Elementary,30.045361,-95.592062,Elementary School


#### The school names returned by the Foursquare API don't always match a GreatSchools entry or return any results there, so a dictionary of school names is used to map them. A more elegant solution is possible, but it is beyond the scope of this project.

In [17]:
schools_dict = pickle.load( open( "schools.pkl", "rb" ))
filtered_schools = filtered_schools.replace({'School': schools_dict})
filtered_schools.head(10)

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,School,School Latitude,School Longitude,School Category
0,Acacia Park,30.218422,-95.529759,Bush Elementary School,30.211636,-95.517427,Elementary School
1,Alden Bridge,30.213204,-95.517991,Bush Elementary School,30.211636,-95.517427,Elementary School
2,Alden Bridge Hollylaurel,30.223754,-95.518385,Bush Elementary School,30.211636,-95.517427,Elementary School
3,Alden Trace,30.208315,-95.519652,Bush Elementary School,30.211636,-95.517427,Elementary School
4,Alden Trace,30.208315,-95.519652,Galatas Elementary,30.196711,-95.527682,Elementary School
5,Alden Trace,30.208315,-95.519652,Powell Elementary,30.19912,-95.511201,Elementary School
6,Altwood,30.180225,-95.554421,Tough Elementary School,30.183184,-95.571458,Elementary School
7,Altwood,30.180225,-95.554421,The Woodlands Christian High School,30.177607,-95.535643,High School
8,Ashford Place,30.038838,-95.586603,Blackshear Elementary,30.036122,-95.584608,Elementary School
9,Ashford Place,30.038838,-95.586603,Kohrville Elementary,30.045361,-95.592062,Elementary School
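
The contents of `schools.pkl` are not shown. Comparing the two tables above suggests it holds at least the two mappings below; the rest of the dictionary is unknown. As a sketch, the lookup amounts to an exact-match alias table in which unlisted names pass through unchanged. `canonical_name` is a hypothetical helper used only for illustration.

```python
# Two aliases inferred from the tables above; the real schools.pkl is not shown.
schools_dict_example = {
    'Barbara Pierce Bush Elementary': 'Bush Elementary School',
    'Coulson Tough Elementary School': 'Tough Elementary School',
}

def canonical_name(name, aliases):
    """Return the alias for `name` if one exists, otherwise `name` unchanged."""
    return aliases.get(name, name)

names = ['Barbara Pierce Bush Elementary', 'Galatas Elementary',
         'Coulson Tough Elementary School']
canonical = [canonical_name(n, schools_dict_example) for n in names]
print(canonical)
```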


#### Get the list of unique schools in order to look up their ratings

In [18]:
unique_schools = filtered_schools.loc[:,['School', 'School Category']].drop_duplicates().reset_index()
print('There are {} unique schools.'.format(len(unique_schools['School'].unique())))
unique_schools.head()

There are 101 unique schools.


Unnamed: 0,index,School,School Category
0,0,Bush Elementary School,Elementary School
1,4,Galatas Elementary,Elementary School
2,5,Powell Elementary,Elementary School
3,6,Tough Elementary School,Elementary School
4,7,The Woodlands Christian High School,High School


#### Get Ratings of schools from Greatschools.org

In [19]:
#state and schoolsUrl are taken from Constants file

In [20]:
def getSchoolRatings(school, levelCode):
    url = schoolsUrl.format(
            greatschools_key, state, school.lower(), 'alpha', levelCode, 10)
    results = requests.get(url)
    tree = etree.fromstring(results.content)
    
    # GreatSchools returns more than one result per query, so the school that was queried for has to be picked out
    # Two criteria are used:
    # 1. the school name must be similar enough (Levenshtein ratio) and 2. the city must be in scope of this project
    # The list of accepted cities is taken from the Constants file
    for element in tree:
        # 'record' avoids shadowing the built-in dict type
        record = {'School': school}
        for child in element:
            record[child.tag] = child.text
        nameSimilarity = Levenshtein.ratio((record.get('name') or '').lower(), school.lower())
        if record.get('city') in cities and nameSimilarity > 0.6:
            return record
    return None  # no candidate matched both criteria
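
To make the matching rule concrete without calling the API, here is a self-contained sketch under several assumptions. The XML shape is guessed from the tags the notebook reads (`name`, `city`, `gsRating`), the Dallas entry is made up to show the city filter, and the accepted cities are assumed from the output tables. It also swaps in `difflib.SequenceMatcher` from the standard library for `Levenshtein.ratio` and `xml.etree.ElementTree` for `lxml`, so the similarity scores will not be identical to the notebook's.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher

# Assumed response shape; only tags the notebook reads are included.
SAMPLE_XML = """
<schools>
  <school><name>Bush Elementary School</name><city>Dallas</city></school>
  <school><name>Bush Elementary School</name><city>Houston</city><gsRating>3</gsRating></school>
</schools>
"""

def best_match(xml_text, school, accepted_cities, threshold=0.6):
    """Return the first candidate in an accepted city whose name is similar enough."""
    for element in ET.fromstring(xml_text):
        record = {child.tag: child.text for child in element}
        name = (record.get('name') or '').lower()
        # Stand-in similarity measure, not the Levenshtein ratio itself.
        similarity = SequenceMatcher(None, name, school.lower()).ratio()
        if record.get('city') in accepted_cities and similarity > threshold:
            return record
    return None

accepted = {'Houston', 'Spring', 'The Woodlands'}  # assumed subset of the Constants list
match = best_match(SAMPLE_XML, 'Bush Elementary School', accepted)
print(match)
```

Because the first acceptable candidate wins, the order of results matters; picking the candidate with the highest similarity would be a possible refinement.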

In [21]:
# Getting ratings of schools from Greatschools.org
no_rating_schools = []
dict_list =[]
#levelCodes is dictionary declared in Constants file that maps the school levels to the literals understood by greatschools

for school, level in zip(unique_schools['School'], unique_schools['School Category']):  
    school_list = getSchoolRatings(school,levelCodes[level.strip()])
    if not school_list:
        no_rating_schools.append(school)
    else:    
        dict_list.append(school_list)
# There are 2 ratings given - one is rating by greatSchools itself and another is the aggregation of parent ratings 
df = pd.DataFrame(dict_list,columns=['School','gsRating','parentRating','city','state'])
df.head()


Unnamed: 0,School,gsRating,parentRating,city,state
0,Bush Elementary School,3.0,3,Houston,TX
1,Galatas Elementary,10.0,4,Spring,TX
2,Powell Elementary,9.0,3,Spring,TX
3,Tough Elementary School,10.0,4,The Woodlands,TX
4,The Woodlands Christian High School,,5,The Woodlands,TX


#### Note that there are two ratings for each school: gsRating, given by GreatSchools.org, and parentRating, aggregated from ratings given by parents. Private schools have no gsRating, and have a parentRating only if parents happened to rate them


#### Merge the schools data back with neighborhood information

In [22]:
schools_with_rating = filtered_schools.join ( df.set_index( [ 'School' ], verify_integrity=True ),
               on=[ 'School' ], how='left' )
schools_with_rating.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,School,School Latitude,School Longitude,School Category,gsRating,parentRating,city,state
0,Acacia Park,30.218422,-95.529759,Bush Elementary School,30.211636,-95.517427,Elementary School,3,3,Houston,TX
1,Alden Bridge,30.213204,-95.517991,Bush Elementary School,30.211636,-95.517427,Elementary School,3,3,Houston,TX
2,Alden Bridge Hollylaurel,30.223754,-95.518385,Bush Elementary School,30.211636,-95.517427,Elementary School,3,3,Houston,TX
3,Alden Trace,30.208315,-95.519652,Bush Elementary School,30.211636,-95.517427,Elementary School,3,3,Houston,TX
4,Alden Trace,30.208315,-95.519652,Galatas Elementary,30.196711,-95.527682,Elementary School,10,4,Spring,TX


#### Pick only relevant columns for clustering

In [23]:
school_data = schools_with_rating.loc[:,['Neighborhood', 'School Category','gsRating','parentRating']]
school_data.head()

Unnamed: 0,Neighborhood,School Category,gsRating,parentRating
0,Acacia Park,Elementary School,3,3
1,Alden Bridge,Elementary School,3,3
2,Alden Bridge Hollylaurel,Elementary School,3,3
3,Alden Trace,Elementary School,3,3
4,Alden Trace,Elementary School,10,4


#### Clean up the NaN values in the ratings, normalize the ratings (gsRating is out of 10 and parentRating is out of 5) and group to remove duplicate rows

In [24]:
school_data['gsRating'] = school_data['gsRating'].astype("float").apply(lambda x: 0 if math.isnan(x) else x/10)
school_data['parentRating'] = school_data['parentRating'].astype("float").apply(lambda x: 0 if math.isnan(x) else x/5)
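# A list of columns is required here; selecting with a bare tuple fails on recent pandas versions
school_data1 = school_data.groupby(['Neighborhood','School Category'])[['gsRating','parentRating']].max().reset_index()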
school_data1.head()

Unnamed: 0,Neighborhood,School Category,gsRating,parentRating
0,Acacia Park,Elementary School,0.3,0.6
1,Alden Bridge,Elementary School,0.3,0.6
2,Alden Bridge Hollylaurel,Elementary School,0.3,0.6
3,Alden Trace,Elementary School,1.0,0.8
4,Altwood,Elementary School,1.0,0.8
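
The normalization above can be read as a small function: divide by the scale, and map a missing rating to 0. The sketch below restates it in plain Python; `normalize_rating` is a hypothetical name. One consequence worth keeping in mind is that a school with no rating (for example, a private school without a gsRating) becomes indistinguishable from a school rated 0.

```python
import math

def normalize_rating(raw, scale):
    """Map a raw rating (string, number, or None) to [0, 1]; missing becomes 0."""
    if raw is None:
        return 0.0
    value = float(raw)
    return 0.0 if math.isnan(value) else value / scale

print(normalize_rating('10', 10))         # gsRating of 10 out of 10
print(normalize_rating('3', 5))           # parentRating of 3 out of 5
print(normalize_rating(None, 10))         # missing gsRating
print(normalize_rating(float('nan'), 5))  # NaN parentRating
```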


#### The row data needs to be pivoted so that the school ratings per category appear as columns. A direct transform did not seem to work, so somewhat less elegant code is used instead.

In [25]:

school_data1['gsElementaryRating'] = school_data1.apply(lambda row:row['gsRating']  if row['School Category']=='Elementary School' else 0,axis=1)
school_data1['parentElementaryRating'] = school_data1.apply(lambda row:row['parentRating']  if row['School Category']=='Elementary School' else 0,axis=1)
school_data1['gsMiddleRating'] = school_data1.apply(lambda row:row['gsRating']  if row['School Category']=='Middle School' else 0,axis=1)
school_data1['parentMiddleRating'] = school_data1.apply(lambda row:row['parentRating']  if row['School Category']=='Middle School' else 0,axis=1)
school_data1['gsHighschoolRating'] = school_data1.apply(lambda row:row['gsRating']  if row['School Category']=='High School' else 0,axis=1)
school_data1['parentHighschoolRating'] = school_data1.apply(lambda row:row['parentRating']  if row['School Category']=='High School' else 0,axis=1)
# A list of columns is required here; selecting with a bare tuple fails on recent pandas versions
school_data2 = school_data1.groupby(['Neighborhood'])[['gsElementaryRating','parentElementaryRating','gsMiddleRating','parentMiddleRating','gsHighschoolRating','parentHighschoolRating']].max().reset_index()
school_data2


Unnamed: 0,Neighborhood,gsElementaryRating,parentElementaryRating,gsMiddleRating,parentMiddleRating,gsHighschoolRating,parentHighschoolRating
0,Acacia Park,0.3,0.6,0.0,0.0,0.0,0.0
1,Alden Bridge,0.3,0.6,0.0,0.0,0.0,0.0
2,Alden Bridge Hollylaurel,0.3,0.6,0.0,0.0,0.0,0.0
3,Alden Trace,1.0,0.8,0.0,0.0,0.0,0.0
4,Altwood,1.0,0.8,0.0,0.0,0.0,1.0
5,Ashford Place,0.7,0.8,0.0,0.0,0.0,0.0
6,Auburn,0.7,0.8,0.0,0.0,0.0,0.0
7,Auburn Lakes,1.0,0.0,0.0,0.0,0.0,0.0
8,Augusta Pines,1.0,0.8,0.0,0.0,0.0,0.0
9,Bacopa Bay,1.0,0.8,0.0,0.0,0.0,0.0
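
As a possible alternative to the per-column `apply` calls, pandas `pivot_table` can do this reshaping in one call. The sketch below uses three rows taken from the tables above; the flattened column names differ from the notebook's (`gsRating (High School)` instead of `gsHighschoolRating`), so it is an illustration and not a drop-in replacement.

```python
import pandas as pd

# Rows for two neighborhoods, with values as shown in the tables above.
ratings = pd.DataFrame({
    'Neighborhood': ['Alden Trace', 'Altwood', 'Altwood'],
    'School Category': ['Elementary School', 'Elementary School', 'High School'],
    'gsRating': [1.0, 1.0, 0.0],
    'parentRating': [0.8, 0.8, 1.0],
})

# One row per neighborhood, one column per (rating, category) pair.
wide = ratings.pivot_table(
    index='Neighborhood',
    columns='School Category',
    values=['gsRating', 'parentRating'],
    aggfunc='max',
    fill_value=0,  # a category absent near a neighborhood becomes 0
)

# Flatten the two-level column index into plain strings.
wide.columns = ['{} ({})'.format(rating, category) for rating, category in wide.columns]
wide = wide.reset_index()
print(wide)
```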


#### Join the school data with childcare data and remove any NaNs

In [26]:
school_data3 = school_data2.join ( onehot.set_index( [ 'Neighborhood' ], verify_integrity=True ),
               on=[ 'Neighborhood' ], how='left' )
school_data3['Daycare'] = school_data3['Daycare'].astype("float").apply(lambda x: 0 if math.isnan(x) else x)
school_data3['Preschool'] = school_data3['Preschool'].astype("float").apply(lambda x: 0 if math.isnan(x) else x)
school_data3.head()

Unnamed: 0,Neighborhood,gsElementaryRating,parentElementaryRating,gsMiddleRating,parentMiddleRating,gsHighschoolRating,parentHighschoolRating,Daycare,Preschool
0,Acacia Park,0.3,0.6,0.0,0.0,0.0,0.0,0.0,0.0
1,Alden Bridge,0.3,0.6,0.0,0.0,0.0,0.0,0.0,0.0
2,Alden Bridge Hollylaurel,0.3,0.6,0.0,0.0,0.0,0.0,0.0,0.0
3,Alden Trace,1.0,0.8,0.0,0.0,0.0,0.0,0.0,0.0
4,Altwood,1.0,0.8,0.0,0.0,0.0,1.0,1.0,0.0


#### Cluster the data using k-means

In [27]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

# axis must be passed by keyword on recent pandas versions
school_data_clustering = school_data3.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(school_data_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 1, 1, 1, 2, 1, 1, 1, 2, 1])
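
For readers unfamiliar with what the k-means fit does, the sketch below implements the basic Lloyd iteration in plain Python: assign each point to its nearest centroid, then move each centroid to the mean of its members. It is not scikit-learn's implementation (which, among other things, defaults to k-means++ initialization); the starting centroids are fixed by hand and the toy points are made up for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def lloyd_kmeans(points, centroids, iterations=10):
    """Plain Lloyd iteration from the given starting centroids."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda k: squared_distance(p, centroids[k]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids


# Toy rows of (elementary rating, daycare flag), made up for illustration.
points = [(0.3, 0.0), (0.4, 0.0), (1.0, 1.0), (0.9, 1.0)]
labels, centroids = lloyd_kmeans(points, centroids=[points[0], points[2]])
print(labels)
print(centroids)
```

Since the binary childcare flags span the same 0 to 1 range as the normalized ratings, all features contribute on a comparable scale to the distance.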

In [28]:
# add clustering labels
school_data3.insert(0, 'Cluster Labels', kmeans.labels_)
school_data3.head()

Unnamed: 0,Cluster Labels,Neighborhood,gsElementaryRating,parentElementaryRating,gsMiddleRating,parentMiddleRating,gsHighschoolRating,parentHighschoolRating,Daycare,Preschool
0,1,Acacia Park,0.3,0.6,0.0,0.0,0.0,0.0,0.0,0.0
1,1,Alden Bridge,0.3,0.6,0.0,0.0,0.0,0.0,0.0,0.0
2,1,Alden Bridge Hollylaurel,0.3,0.6,0.0,0.0,0.0,0.0,0.0,0.0
3,1,Alden Trace,1.0,0.8,0.0,0.0,0.0,0.0,0.0,0.0
4,2,Altwood,1.0,0.8,0.0,0.0,0.0,1.0,1.0,0.0
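
A compact way to support the cluster descriptions later in this notebook is to compute the mean of each feature per cluster, along with the cluster sizes. The sketch below does this on five rows copied from the table above, with only three of the feature columns kept for brevity; on the full data the same two calls would apply to `school_data3`.

```python
import pandas as pd

# A few rows and columns copied from the labelled table above.
labelled = pd.DataFrame({
    'Cluster Labels': [1, 1, 1, 1, 2],
    'Neighborhood': ['Acacia Park', 'Alden Bridge', 'Alden Bridge Hollylaurel',
                     'Alden Trace', 'Altwood'],
    'gsElementaryRating': [0.3, 0.3, 0.3, 1.0, 1.0],
    'parentHighschoolRating': [0.0, 0.0, 0.0, 0.0, 1.0],
    'Daycare': [0.0, 0.0, 0.0, 0.0, 1.0],
})

# Mean of every numeric feature within each cluster.
profile = (labelled
           .drop(columns='Neighborhood')
           .groupby('Cluster Labels')
           .mean())

# Number of neighborhoods in each cluster.
sizes = labelled['Cluster Labels'].value_counts().sort_index()

print(profile)
print(sizes)
```

For the binary `Daycare` and `Preschool` columns, the mean is the share of neighborhoods in the cluster that have such a facility nearby.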


#### Join the cluster data with latlong data for plotting on map

In [29]:
school_merged = school_data3.join(neigh_latlong.set_index('Neighborhood'), on='Neighborhood')
school_merged = school_merged.drop(['City', 'Resolved Name'], axis=1)
school_merged.head(10) 

Unnamed: 0,Cluster Labels,Neighborhood,gsElementaryRating,parentElementaryRating,gsMiddleRating,parentMiddleRating,gsHighschoolRating,parentHighschoolRating,Daycare,Preschool,Latitude,Longitude
0,1,Acacia Park,0.3,0.6,0.0,0.0,0.0,0.0,0.0,0.0,30.218422,-95.529759
1,1,Alden Bridge,0.3,0.6,0.0,0.0,0.0,0.0,0.0,0.0,30.213204,-95.517991
2,1,Alden Bridge Hollylaurel,0.3,0.6,0.0,0.0,0.0,0.0,0.0,0.0,30.223754,-95.518385
3,1,Alden Trace,1.0,0.8,0.0,0.0,0.0,0.0,0.0,0.0,30.208315,-95.519652
4,2,Altwood,1.0,0.8,0.0,0.0,0.0,1.0,1.0,0.0,30.180225,-95.554421
5,1,Ashford Place,0.7,0.8,0.0,0.0,0.0,0.0,0.0,0.0,30.038838,-95.586603
6,1,Auburn,0.7,0.8,0.0,0.0,0.0,0.0,0.0,0.0,30.089772,-95.544337
7,1,Auburn Lakes,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,30.12864,-95.524525
8,2,Augusta Pines,1.0,0.8,0.0,0.0,0.0,0.0,1.0,0.0,30.125041,-95.537935
9,1,Bacopa Bay,1.0,0.8,0.0,0.0,0.0,0.0,0.0,0.0,30.144632,-95.521402


#### Plot clusters on map

In [30]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(school_merged['Latitude'], school_merged['Longitude'], school_merged['Neighborhood'], school_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  # labels run from 0 to kclusters-1, so index directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

#### See each cluster below along with my analysis

### Cluster 0: Neighborhoods near:
1. Elementary schools with good ratings
2. Middle schools
3. Childcare facilities such as preschools
### - Suits families with young kids up to middle school age who also have childcare needs

In [31]:
school_merged.loc[school_merged['Cluster Labels'] == 0, school_merged.columns[[1] + list(range(2, school_merged.shape[1]-2))]]

Unnamed: 0,Neighborhood,gsElementaryRating,parentElementaryRating,gsMiddleRating,parentMiddleRating,gsHighschoolRating,parentHighschoolRating,Daycare,Preschool
16,Canoe Bend,1.0,0.6,0.9,0.0,0.0,0.0,1.0,1.0
17,Canyon Gate At Northpointe,1.0,0.8,0.9,0.6,0.0,0.0,1.0,0.0
23,Champions Trail,0.9,0.8,0.0,0.0,0.0,0.0,1.0,1.0
26,Cokeberry Forest,0.7,1.0,0.8,0.6,0.0,0.0,1.0,1.0
27,Colony Creek Village,0.9,0.8,0.9,0.6,0.0,0.0,1.0,1.0
32,Creekside Park,1.0,0.6,0.0,0.0,0.0,0.0,1.0,1.0
33,Creekside Park,1.0,0.6,0.0,0.0,0.0,0.0,1.0,1.0
47,Glen Mill,0.5,1.0,0.0,0.0,0.0,0.0,1.0,1.0
53,Haven Lake Estates,1.0,0.6,0.9,0.0,0.0,0.0,1.0,1.0
61,Jagged Ridge,1.0,0.6,0.9,0.0,0.0,0.0,1.0,1.0


### Cluster 1: Neighborhoods near:
1. Elementary schools with average to good ratings
2. Few middle school assignments
3. Almost no childcare facilities
### - Suits families with elementary-school-age kids and no childcare needs

In [32]:
school_merged.loc[school_merged['Cluster Labels'] == 1, school_merged.columns[[1] + list(range(2, school_merged.shape[1]-2))]]

Unnamed: 0,Neighborhood,gsElementaryRating,parentElementaryRating,gsMiddleRating,parentMiddleRating,gsHighschoolRating,parentHighschoolRating,Daycare,Preschool
0,Acacia Park,0.3,0.6,0.0,0.0,0.0,0.0,0.0,0.0
1,Alden Bridge,0.3,0.6,0.0,0.0,0.0,0.0,0.0,0.0
2,Alden Bridge Hollylaurel,0.3,0.6,0.0,0.0,0.0,0.0,0.0,0.0
3,Alden Trace,1.0,0.8,0.0,0.0,0.0,0.0,0.0,0.0
5,Ashford Place,0.7,0.8,0.0,0.0,0.0,0.0,0.0,0.0
6,Auburn,0.7,0.8,0.0,0.0,0.0,0.0,0.0,0.0
7,Auburn Lakes,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Bacopa Bay,1.0,0.8,0.0,0.0,0.0,0.0,0.0,0.0
11,Bending Bough,0.7,0.8,0.0,0.0,0.0,0.0,0.0,0.0
14,Boudreaux Rd,0.8,1.0,0.0,0.0,0.0,0.0,0.0,0.0


### Cluster 2: Neighborhoods near:
1. Elementary schools with average to good ratings
2. A choice of public and private high schools
3. Childcare facilities such as day cares
### - Suits families with teenage kids as well as young kids with childcare needs

In [33]:
school_merged.loc[school_merged['Cluster Labels'] == 2, school_merged.columns[[1] + list(range(2, school_merged.shape[1]-2))]]

Unnamed: 0,Neighborhood,gsElementaryRating,parentElementaryRating,gsMiddleRating,parentMiddleRating,gsHighschoolRating,parentHighschoolRating,Daycare,Preschool
4,Altwood,1.0,0.8,0.0,0.0,0.0,1.0,1.0,0.0
8,Augusta Pines,1.0,0.8,0.0,0.0,0.0,0.0,1.0,0.0
10,Belcourte,1.0,0.8,0.0,0.0,0.0,1.0,1.0,0.0
13,Bonny Branch,1.0,0.8,0.0,0.0,0.0,1.0,1.0,0.0
15,Bridgestone,0.7,0.8,0.0,0.0,0.6,0.8,1.0,0.0
28,Copper Sage,1.0,0.8,0.0,0.0,0.0,1.0,1.0,0.0
31,Covington Bridge,0.7,0.8,0.0,0.0,0.6,0.8,0.0,0.0
43,Estates Creek,0.3,0.8,0.0,0.0,0.4,0.8,0.0,0.0
50,Grogan's Mill,0.7,1.0,0.0,0.0,0.0,0.0,1.0,0.0
65,Lakewood Forest,0.6,1.0,0.0,0.0,0.6,0.6,0.0,0.0


### Cluster 3: Neighborhoods near:
1. Mostly private high schools with good parent ratings
2. Childcare facilities such as day cares
### - Suits families with teenage kids and a preference for private schools, as well as young kids with childcare needs

In [34]:
school_merged.loc[school_merged['Cluster Labels'] == 3, school_merged.columns[[1] + list(range(2, school_merged.shape[1]-2))]]

Unnamed: 0,Neighborhood,gsElementaryRating,parentElementaryRating,gsMiddleRating,parentMiddleRating,gsHighschoolRating,parentHighschoolRating,Daycare,Preschool
19,Carlton Woods Dr,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
20,Cascade Canyon,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
24,Chandler Creek,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
30,Country Meadows,0.0,0.0,0.0,0.0,0.0,0.6,0.0,0.0
48,Gosling Road,0.0,0.0,0.0,0.0,0.8,0.8,0.0,1.0
56,Hazelcrest,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
60,Indian Springs,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
69,Legacy Point,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
71,Lenox Hill,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
114,Shawnee Ridge,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0


### Cluster 4: Neighborhoods near:
1. Elementary schools with good ratings
2. Childcare facilities such as preschools
### - Suits families with young kids who also have childcare needs

In [35]:
school_merged.loc[school_merged['Cluster Labels'] == 4, school_merged.columns[[1] + list(range(2, school_merged.shape[1]-2))]]

Unnamed: 0,Neighborhood,gsElementaryRating,parentElementaryRating,gsMiddleRating,parentMiddleRating,gsHighschoolRating,parentHighschoolRating,Daycare,Preschool
12,Bethany Bend,1.0,0.8,0.0,0.0,0.0,0.0,0.0,1.0
22,Champion springs,0.5,0.6,0.0,0.0,0.3,0.6,0.0,1.0
41,Eagle Mead,1.0,0.8,0.0,0.0,0.0,0.0,0.0,1.0
49,Greenvine,1.0,0.8,0.0,0.0,0.0,0.0,0.0,1.0
59,Hunterwood,0.9,0.8,0.9,0.8,0.0,0.0,0.0,1.0
73,Linton Ridge,1.0,0.8,0.0,0.0,0.0,0.0,0.0,1.0
79,Maple Glade,1.0,0.8,0.0,0.0,0.0,0.0,0.0,1.0
82,Memorial Chase,0.9,1.0,0.0,0.0,0.0,0.0,0.0,1.0
95,Oakhurst Dr,0.8,1.0,0.0,0.0,0.0,0.0,0.0,1.0
108,Rayford,0.8,0.8,0.0,0.0,0.0,0.0,0.0,1.0
