# Table Of Contents
<font size = 3>

<a href="#intro">Introduction</a>

0. <a href="#item0">Imports</a>

1. <a href="#item1">Data Acquisition and Preparation</a>

2. <a href="#item2">Data Cleaning</a>

3. <a href="#item3">Using Google Places API</a>

4. <a href="#item4">Exploratory Data Analysis</a>

5. <a href="#item5">FourSquare Places API</a>
    
6. <a href="#item6">Clustering</a>

</font>
</div>

<a id='intro'></a> 
# Introduction to problem:

## Neighbourhood Similarity near Colleges 
<p>
Every year, many undergraduates and graduates go abroad for higher studies. The top destinations are the USA, the UK, Canada, Australia, and New Zealand. Money constraints aside, many students also want a suitable neighbourhood: once you find a good college, you will want to know what lies near it. We'll use the FourSquare API to explore the neighbourhood of each college and find colleges with similar neighbourhoods. If you are not accepted by your first-choice college, you will know which other colleges have a similar neighbourhood, which makes it easier to decide where to apply based on your neighbourhood preferences. For example, suppose I want a safe neighbourhood with many cafes and gyms; I would explore neighbourhoods based on my preferences, and so can everyone else. This is useful for any student who wants to go to a college with a similar neighbourhood, in the same country or in another one.
</p>


<br><br><br>
<a id='item0'></a> 
# Imports

In [15]:
import requests  # get requests
import json  # to parse the json file
import os.path

import numpy as np
import pandas as pd  # because arrays are oldschool
from pandas import json_normalize  # for handling nested json (top-level import since pandas 1.0)

import seaborn as sns

import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors

import folium
from folium.plugins import MarkerCluster

import googlemaps
import configparser

from progressbar import ProgressBar  # tests your patience
import time  # for delay
# to clear notebook output cell via code
from IPython.display import clear_output

import pprint
pp = pprint.PrettyPrinter(indent=4)


print('Libraries Imported')

Libraries Imported


<br><br><br>
<a id='item1'></a> 
# 1. Data Acquisition and Preparation


## Get University Data

In [3]:
csv_path = "./CSVs/world-universities.csv"
url = "https://raw.githubusercontent.com/deepratnaawale/world-universities-csv/master/world-universities.csv"
if not os.path.isfile(csv_path):
    # create the target folder if needed, then download the CSV
    os.makedirs(os.path.dirname(csv_path), exist_ok=True)
    res = requests.get(url, allow_redirects=True)
    res.raise_for_status()  # stop here rather than saving an error page
    with open(csv_path, 'wb') as file:
        file.write(res.content)
    print("File Downloaded")
else:
    print("File Already Exists.")

File Already Exists.


<br><br><br><br>
<a id='item2'></a> 
# 2. Data Cleaning

In [16]:
df = pd.read_csv(csv_path)

In [17]:
df.shape

(9363, 3)

In [18]:
df.drop(columns='Website_Link')

Unnamed: 0,Country_Code,University_Name
0,AD,University of Andorra
1,AE,Abu Dhabi University
2,AE,Ajman University of Science & Technology
3,AE,Alain University of Science and Technology
4,AE,Al Ghurair University
...,...,...
9358,ZW,Solusi University
9359,ZW,University of Zimbabwe
9360,ZW,Women's University in Africa
9361,ZW,Zimbabwe Ezekiel Guti University


In [21]:
country_codes = ['US', 'UK', 'CA', 'AU', 'NZ']
# for country_code in country_codes:
df = df[df['Country_Code'].isin(country_codes)]
df

Unnamed: 0,Country_Code,University_Name,Website_Link
251,AU,Australian Catholic University,http://www.acu.edu.au/
252,AU,Australian Correspondence Schools,http://www.acs.edu.au/
253,AU,Australian Defence Force Academy,http://www.adfa.oz.au/
254,AU,Australian Lutheran College,http://www.alc.edu.au/
255,AU,Australian Maritime College,http://www.amc.edu.au/
...,...,...,...
9179,US,York College Nebraska,http://www.york.edu/
9180,US,York College of Pennsylvania,http://www.yorkcol.edu/
9181,US,Yorker International University,http://www.nyuniversity.net/
9182,US,York University,http://www.yorkuniversity.us/


<br><br><br>
<a id='item3'></a> 
# 3. Using Google Places API

In [22]:
def getGoogleAPIKey():
    config = configparser.ConfigParser()
    config.read('local-config.ini')
    return config['google maps api']['key']
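
The function above expects a `local-config.ini` file with a `[google maps api]` section that holds a `key` entry. As a minimal sketch of that layout, the snippet below parses the same structure from a string instead of a file; `SAMPLE_CONFIG`, `read_google_api_key`, and the placeholder key value are hypothetical and only serve the illustration.

```python
import configparser

# Hypothetical contents of local-config.ini; the key value is a placeholder.
SAMPLE_CONFIG = """
[google maps api]
key = YOUR_API_KEY_HERE
"""


def read_google_api_key(config_text):
    """Return the value stored under [google maps api] -> key."""
    config = configparser.ConfigParser()
    config.read_string(config_text)
    return config['google maps api']['key']


api_key = read_google_api_key(SAMPLE_CONFIG)
print(api_key)
```

Keeping the key in a separate file like this means it does not have to be pasted into the notebook itself.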

In [23]:
show_progress = ProgressBar()
gmaps = googlemaps.Client(key=getGoogleAPIKey())

In [32]:
req = gmaps.places(query='Australian Catholic University, AU', region='au')

In [34]:
lat_long = req['results'][0]['geometry']['location']

places_nearby = gmaps.places_nearby(
    location=lat_long, radius=3000, type='point_of_interest', )

for place in places_nearby['results']:
    print(place['name'], place['types'])

Park Hyatt Melbourne ['lodging', 'point_of_interest', 'establishment']
Hotel Grand Chancellor ['lodging', 'point_of_interest', 'establishment']
Mantra on Russell Melbourne ['lodging', 'point_of_interest', 'establishment']
Grand Hyatt Melbourne ['lodging', 'point_of_interest', 'establishment']
The Victoria Hotel ['lodging', 'point_of_interest', 'establishment']
The Park Hotel Melbourne ['lodging', 'point_of_interest', 'establishment']
ibis Melbourne - Hotel & Apartments ['lodging', 'point_of_interest', 'establishment']
Novotel Melbourne on Collins ['lodging', 'point_of_interest', 'establishment']
DoubleTree by Hilton Hotel Melbourne - Flinders Street ['lodging', 'point_of_interest', 'establishment']
Causeway 353 Hotel ['lodging', 'point_of_interest', 'establishment']
The Langham, Melbourne ['lodging', 'point_of_interest', 'establishment']
Mantra on Little Bourke Melbourne ['lodging', 'point_of_interest', 'establishment']
Oaks Melbourne on Market Hotel ['lodging', 'point_of_interest', 'e

<br><br><br>
# Importing latitudes and longitudes from json

In [None]:
with open('G-lat-long-new.json') as jsonFile:  # refers to the json we created earlier
    ll_data = json.load(jsonFile)  # load data to a python var
print('Lat-Long Imported.')

In [None]:
ll = json_normalize(ll_data, record_path='candidates', meta=['status'])

In [None]:
ll.drop('formatted_address', axis=1, inplace=True)
# rename() returns a new frame, so assign the result back
ll = ll.rename(columns={'geometry.location.lat': 'latitude',
                        'geometry.location.lng': 'longitude'})
ll.shape
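
To make the flattening step concrete, here is a small sketch of what `json_normalize` does with `record_path='candidates'` and `meta=['status']`. The payload is made up and only assumes the keys used in the cells above (`candidates`, `formatted_address`, `geometry.location`, `status`); the real `G-lat-long-new.json` may contain more fields.

```python
import pandas as pd

# Made-up responses, one per university; values are illustrative only.
sample_ll_data = [
    {"candidates": [{"formatted_address": "1 Example St",
                     "geometry": {"location": {"lat": -37.8, "lng": 144.9}}}],
     "status": "OK"},
    {"candidates": [], "status": "ZERO_RESULTS"},
]

# Each candidate becomes one row; nested keys are joined with dots.
flat = pd.json_normalize(sample_ll_data, record_path="candidates", meta=["status"])

# drop() and rename() return new frames here, so the result is assigned back.
flat = flat.drop(columns="formatted_address").rename(
    columns={"geometry.location.lat": "latitude",
             "geometry.location.lng": "longitude"})
print(flat)
```

In this sketch the second response has an empty `candidates` list and therefore contributes no row, which is worth keeping in mind when the result is later joined to another frame by row position.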

<br>

<br>

## Making new columns in df from ll dataframe

<br>

In [None]:
df = df.join(ll)

In [None]:
df.shape

In [None]:
df = df[df['status'] == "OK"]

In [None]:
df.drop(['status'], axis=1, inplace=True)
df = df.rename(columns={'geometry.location.lat': 'latitude',
               'geometry.location.lng': 'longitude'})
df.head()

In [None]:
df.to_csv('college-dataset.csv')

<br>

## We have saved the data to college-dataset.csv

<br>


In [None]:
df = pd.read_csv("college-dataset.csv")

In [None]:
df.head()

In [None]:
df.drop(['Unnamed: 0'], axis=1, inplace=True)

In [None]:
df['country'].value_counts()

In [None]:
df = df[df['country'].str.contains("United States|United Kingdom|Canada")]
# keeping only the locations with location-country as US, UK, Canada

In [None]:
# manually checked lat long on google maps
# use .loc so the values are written to df itself, not to a temporary copy
berkeley = df['college_name'] == 'University of California, Berkeley'
df.loc[berkeley, 'latitude'] = 37.8718992
df.loc[berkeley, 'longitude'] = -122.2607286

johnson_wales = df['college_name'] == 'Johnson & Wales University'
df.loc[johnson_wales, 'latitude'] = 41.8197902
df.loc[johnson_wales, 'longitude'] = -71.415209

In [None]:
# These entries were placed in India and could not be found on Google Maps
df.drop(df[df['college_name'] ==
        'Engineering and Technology College'].index, inplace=True)
df.drop(df[df['college_name'] ==
        'College of Nursing and Public Health'].index, inplace=True)

In [None]:
df[df['fees'] > 50000]

In [None]:
# Manually searched college fees
corrected_fees = {
    'University of Colorado at Boulder': 48570,
    'University of Pennsylvania': 7134,
    'University of Nebraska Omaha': 28564,
    'Johns Hopkins University': 45350,
    'University of Illinois at Urbana Champaign': 53437,
    'Carnegie Mellon University': 38940,
    'Northwestern University': 42000,
    'Washington University in St. Louis': 63000,
    'University of Massachusetts Amherst': 62000,
    'Bentley University': 68640,
}
# use .loc so the values are written to df itself, not to a temporary copy
for college, fee in corrected_fees.items():
    df.loc[df['college_name'] == college, 'fees'] = fee

In [None]:
df.reset_index(drop=True, inplace=True)

In [None]:
df.to_csv('final-college-dataset.csv')

<br><br><br>

<a id='item4'></a> 
# 4. Exploratory Data Analysis

Now that we have our data, it's time to explore it. Let's see the number of colleges per country. Splitting the data frame by country makes this easy.

In [None]:
df = pd.read_csv("final-college-dataset.csv")

In [None]:
df.drop(['Unnamed: 0'], axis=1, inplace=True)

In [None]:
df_usa = df[df['country'].str.contains("United States")]
df_uk = df[df['country'].str.contains("United Kingdom")]
df_canada = df[df['country'].str.contains("Canada")]

<br><br>
## Let's look at the geospatial data

In [None]:
usa_coordinates = [37.0902, -100]
canada_coordinates = [54.6959279, -90]
uk_coordinates = [54.2186138, -5]

<br><br>
## USA Map

In [None]:
usa_map = folium.Map(location=usa_coordinates, zoom_start=4)
mc = MarkerCluster().add_to(usa_map)

for row in df_usa.itertuples():
    folium.Marker(
        location=[row.latitude, row.longitude],
        icon=None,
        popup=row.college_name
    ).add_to(mc)

usa_map

In case the map is not showing: https://github.com/xtreme0021/Capstone/blob/master/Images/USA_marker_cluster.png

<br><br>
## Canada Map

In [None]:
canada_map = folium.Map(location=canada_coordinates, zoom_start=4)
mc = MarkerCluster().add_to(canada_map)

for row in df_canada.itertuples():
    folium.Marker(
        location=[row.latitude, row.longitude],
        icon=None,
        popup=row.college_name
    ).add_to(mc)

canada_map

In case the map is not showing: https://github.com/xtreme0021/Capstone/blob/master/Images/Canada_marker_cluster.png

<br><br>
## UK Map

In [None]:
uk_map = folium.Map(location=uk_coordinates, zoom_start=6)
mc = MarkerCluster().add_to(uk_map)

for row in df_uk.itertuples():
    folium.Marker(
        location=[row.latitude, row.longitude],
        icon=None,
        popup=row.college_name
    ).add_to(mc)

uk_map

In case the map is not showing: https://github.com/xtreme0021/Capstone/blob/master/Images/UK_marker_cluster.png

<br><br>
# Let's take a look at the fees

In [None]:
df.describe()

In [None]:
sns.set(rc={'figure.figsize': (3, 6)})
boxplot = sns.boxplot(data=df['fees'])

## Let's look at average fees by country

In [None]:
print('Average fees per country')
print("USA : {0:.2f}".format(df_usa['fees'].mean()))
print("UK : {0:.2f}".format(df_uk['fees'].mean()))
print("Canada : {0:.2f}".format(df_canada['fees'].mean()))

In [None]:
sns.set(rc={'figure.figsize': (7, 6)})
boxplot = sns.boxplot(
    data=df,
    x='country',
    y='fees'
)

# Note: 

## The box plot of Canada lies within the interquartile range (the middle 50%) of the USA, so it won't be surprising if we get more <font color = 037ffc>Canadian colleges</font> when looking for similar <font color = 037ffc>low-fee</font> colleges with money as a criterion.

## Similarly, the <font color = 037ffc>UK</font> would be a preferred choice over the USA when choosing <font color = 037ffc>mid- to high-fee </font>colleges with money as a criterion.

## It is obvious, but let me point it out specifically: the <font color = 'red'>USA has the highest fees</font> of all 3 nations.
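
The reading of the box plots above can also be checked numerically. The sketch below computes the quartiles and the interquartile range per country with `groupby` and `quantile`; the fee values are made up purely for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up fees for illustration only; not taken from the college dataset.
toy = pd.DataFrame({
    "country": ["United States"] * 4 + ["Canada"] * 4,
    "fees": [20000, 30000, 40000, 50000, 28000, 30000, 32000, 34000],
})

# One row per country, one column per quantile (0.25, 0.5, 0.75).
quartiles = toy.groupby("country")["fees"].quantile([0.25, 0.5, 0.75]).unstack()
quartiles["iqr"] = quartiles[0.75] - quartiles[0.25]
print(quartiles)
```

If one country's 0.25 and 0.75 values both fall between the other country's 0.25 and 0.75 values, its box sits inside the other's box, which is the situation described for Canada and the USA.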

<br><br><br>
<a id='item5'></a> 
# 5. FourSquare Places API

In [None]:
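# NOTE: `fs` is not imported anywhere in this notebook; it is assumed to be
# a local helper module that returns the Foursquare credentials.
CLIENT_ID = fs.get_client_id()  # your Foursquare ID
CLIENT_SECRET = fs.get_client_secret()  # your Foursquare Secret
VERSION = '20181102'  # Foursquare API version
RADIUS = 10000  # Radius to search in
LIMIT = 20  # Limit to no. of search results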

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=10000):
    """Return, for each college, a list of nearby venue tuples."""
    url = 'https://api.foursquare.com/v2/venues/explore'
    venues_list = []
    for count, (name, lat, lng) in enumerate(zip(names, latitudes, longitudes)):
        print(count, name)

        # query parameters for the API request
        params = {
            'client_id': CLIENT_ID,
            'client_secret': CLIENT_SECRET,
            'v': VERSION,
            'll': '{},{}'.format(lat, lng),
            'radius': radius,
            'limit': LIMIT,
        }

        # make the GET request
        try:
            results = requests.get(url, params=params).json()[
                "response"]['groups'][0]['items']
            venues_list.append([(
                name,
                lat,
                lng,
                v['venue']['name'],
                v['venue']['location']['lat'],
                v['venue']['location']['lng'],
                v['venue']['categories'][0]['shortName']) for v in results])

        except (KeyError, IndexError):
            # missing keys or a venue without categories:
            # keep a flat placeholder row for this college
            print('Lookup failed: replacing with None.')
            venues_list.append([name, lat, lng, None, None, None, None])

    return venues_list
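
The parsing step inside the function above can be tried without calling the API. The following sketch flattens one explore-style payload into tuples and tolerates missing keys; `extract_venue_rows` and the sample payload are hypothetical, and the payload mirrors only the keys that the function above reads (`response`, `groups`, `items`, `venue`, `location`, `categories`, `shortName`).

```python
def extract_venue_rows(name, lat, lng, payload):
    """Flatten one explore-style payload into one tuple per venue."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        # A venue may come back without categories; fall back to None.
        categories = venue.get("categories") or [{}]
        rows.append((
            name, lat, lng,
            venue.get("name"),
            location.get("lat"),
            location.get("lng"),
            categories[0].get("shortName"),
        ))
    return rows


# Made-up payload that mirrors only the keys read by the function above.
sample_payload = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Example Cafe",
                   "location": {"lat": 1.0, "lng": 2.0},
                   "categories": [{"shortName": "Cafe"}]}},
        {"venue": {"name": "Unlabelled Spot",
                   "location": {"lat": 1.1, "lng": 2.1},
                   "categories": []}},
    ]}]}
}

rows = extract_venue_rows("Example College", 1.05, 2.05, sample_payload)
for row in rows:
    print(row)
```

Handling each venue separately means that a single venue without a category does not discard every other venue found for the same college.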

In [None]:
college_venues = getNearbyVenues(names=df['college_name'],
                                 latitudes=df['latitude'],
                                 longitudes=df['longitude']
                                 )

In [None]:
# indices of colleges whose lookup failed (flat placeholder row ending in None)
none_list = [i for i, venues in enumerate(college_venues)
             if venues and venues[-1] is None]
none_list

In [None]:
# keep only the colleges whose lookup succeeded
not_none = [venues for i, venues in enumerate(college_venues)
            if i not in none_list]

In [None]:
nearby_venues = pd.DataFrame(
    [item for venue_list in not_none for item in venue_list])
nearby_venues.columns = ['college_name',
                         'c_latitude',
                         'c_longitude',
                         'venue',
                         'v_latitude',
                         'v_longitude',
                         'v_category']

In [None]:
df.drop(none_list)

In [None]:
nearby_venues.groupby('college_name').count()

In [None]:
print('There are {} unique categories.'.format(
    len(nearby_venues['v_category'].unique())))

<br><br><br>
<a id='item6'></a> 
# 6. Clustering

In [None]:
nearby_onehot = pd.get_dummies(
    nearby_venues[['v_category']], prefix="", prefix_sep="")

# add college_name column back to dataframe
nearby_onehot['college_name'] = nearby_venues['college_name']

# move college_name column to the first column
fixed_columns = [nearby_onehot.columns[-1]] + list(nearby_onehot.columns[:-1])
nearby_onehot = nearby_onehot[fixed_columns]

nearby_onehot.head()

In [None]:
nearby_grouped = nearby_onehot.groupby('college_name').mean().reset_index()
nearby_grouped
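
The one-hot encoding followed by a grouped mean turns the venue list into one row per college, where each value is the share of that college's venues falling in a category. A minimal sketch on made-up data (the college and category names are hypothetical) shows the effect:

```python
import pandas as pd

# Made-up venues for two hypothetical colleges.
toy_venues = pd.DataFrame({
    "college_name": ["College A"] * 4 + ["College B"] * 2,
    "v_category": ["Cafe", "Cafe", "Gym", "Park", "Pub", "Gym"],
})

# One indicator column per category, then the college name in front.
onehot = pd.get_dummies(toy_venues["v_category"], dtype=float)
onehot.insert(0, "college_name", toy_venues["college_name"])

# The mean of the indicators is the frequency of each category per college.
freq = onehot.groupby("college_name").mean()
print(freq)
```

Each row sums to 1, so colleges with different numbers of returned venues remain comparable when they are clustered.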

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)

    return row_categories_sorted.index.values[0:num_top_venues]

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['college_name']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
nearby_venues_sorted = pd.DataFrame(columns=columns)
nearby_venues_sorted['college_name'] = nearby_grouped['college_name']

for ind in np.arange(nearby_grouped.shape[0]):
    nearby_venues_sorted.iloc[ind, 1:] = return_most_common_venues(
        nearby_grouped.iloc[ind, :], num_top_venues)

nearby_venues_sorted.head()
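
The ranking logic above sorts each college's category frequencies and keeps the first few labels. The sketch below shows the same idea on a single made-up row; `ordinal` and `top_categories` are hypothetical helper names, and `ordinal` is an alternative way to build the column suffixes that also covers numbers such as 11 or 22.

```python
import pandas as pd


def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1st, 2nd, 11th."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


def top_categories(freq_row, n):
    """Return the n category labels with the highest frequency."""
    return freq_row.sort_values(ascending=False).index[:n].tolist()


# Made-up frequencies for one hypothetical college.
toy_row = pd.Series({"Gym": 0.3, "Cafe": 0.5, "Park": 0.2})
labels = ["{} Most Common Venue".format(ordinal(i + 1)) for i in range(2)]
print(dict(zip(labels, top_categories(toy_row, 2))))
```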

In [None]:
from sklearn.cluster import KMeans

In [None]:
# set number of clusters
kclusters = 5

nearby_grouped_clustering = nearby_grouped.drop(columns='college_name')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(
    nearby_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]
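
To see what k-means does with category frequencies, the following sketch clusters six made-up rows that fall into two obvious groups. The numbers are illustrative only; `n_init=10` is set explicitly because its default value has changed between scikit-learn releases.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up category frequencies (each row sums to 1); not real venue data.
toy_freq = np.array([
    [0.8, 0.2, 0.0],
    [0.7, 0.3, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.2, 0.8],
    [0.0, 0.3, 0.7],
    [0.1, 0.1, 0.8],
])

toy_kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(toy_freq)
toy_labels = toy_kmeans.labels_
print(toy_labels)
```

Rows with similar category mixes receive the same label. The label numbers themselves carry no meaning beyond identifying the group.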

In [None]:
# add clustering labels
nearby_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

nearby_merged = df

# join the cluster labels and top venues onto the college data (latitude/longitude per college)
nearby_merged = nearby_merged.join(
    nearby_venues_sorted.set_index('college_name'), on='college_name')

nearby_merged.head()  # check the last columns!

In [None]:
nearby_merged[nearby_merged['Cluster Labels'].isnull()]

In [None]:
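# drop colleges that received no cluster label instead of hard-coding a row index
nearby_merged = nearby_merged.dropna(subset=['Cluster Labels'])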

In [None]:
# create map
map_clusters = folium.Map(location=usa_coordinates, zoom_start=3)

# set color scheme for the clusters: one color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(nearby_merged['latitude'], nearby_merged['longitude'], nearby_merged['college_name'], nearby_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' +
                         str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)],  # labels run from 0 to kclusters - 1
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters

# in case map not showing: https://github.com/xtreme0021/Capstone/blob/master/Images/Cluster.png
<br><br><br>

In [None]:
nearby_merged.loc[nearby_merged['Cluster Labels'] == 0,
                  nearby_merged.columns[[1] + list(range(5, nearby_merged.shape[1]))]]

In [None]:
nearby_merged.loc[nearby_merged['Cluster Labels'] == 1,
                  nearby_merged.columns[[1] + list(range(5, nearby_merged.shape[1]))]]

In [None]:
nearby_merged.loc[nearby_merged['Cluster Labels'] == 2,
                  nearby_merged.columns[[1] + list(range(5, nearby_merged.shape[1]))]]

In [None]:
nearby_merged.loc[nearby_merged['Cluster Labels'] == 3,
                  nearby_merged.columns[[1] + list(range(5, nearby_merged.shape[1]))]]

In [None]:
nearby_merged.loc[nearby_merged['Cluster Labels'] == 4,
                  nearby_merged.columns[[1] + list(range(5, nearby_merged.shape[1]))]]

<br><br>
# Conclusion:
The colleges have been clustered on the basis of their neighbourhoods, and the clusters show no definite trend. From the clusters it is apparent that neighbourhoods in the UK differ considerably from those in the USA or Canada. I’ve clustered the college neighbourhoods into the following categories:
1.	American Eats
2.	Exotic Eats
3.	Tour/ Outgoing
4.	Night Life (Pub) and Fitness
5.	Art Prone/ Mature Audience


## Credits
World Universities Data set: https://github.com/endSly/