# Table Of Contents
<font size = 3>

<a href="#intro">Introduction</a>

0. <a href="#item0">Imports</a>

1. <a href="#item1">Data Acquisition and Preparation</a>

2. <a href="#item2">Data Cleaning</a>

3. <a href="#item3">Using Google Places API</a>

4. <a href="#item4">Exploratory Data Analysis</a>

5. <a href="#item5">FourSquare Places API</a>
    
6. <a href="#item6">Clustering</a>

</font>
</div>

<a id='intro'></a> 
# Introduction to problem:

## Neighbourhood Similarity near Colleges 
<p>
Every year many graduates go on to pursue a master's degree. Among those of us taking this course, many would choose the field of "Computer Science & Information Technology", and the top destinations would be the USA, the UK and Canada (Germany is intentionally left out because German is compulsory there). Setting money constraints aside, what many people want is a suitable neighbourhood. Once you find a good college, you will want to know what lies near it, so we use the FourSquare API to explore each college's neighbourhood and find colleges whose neighbourhoods are alike. If you are not selected by your first-choice college, you will know which other colleges have a similar neighbourhood, which makes it easier to decide where to send applications based on your neighbourhood preference. For example, suppose I want a safe neighbourhood with many cafes and gyms: I can explore neighbourhoods based on my preferences, and so can everyone else. This is useful for any student who wants to go to a college with a similar neighbourhood, in the same country or in another one.
</p>

## Data Used

We are using www.mastersportal.com to mine our data.
We are searching for Canadian, Australian, United States and United Kingdom colleges/universities that provide a full-time Master's degree programme in the field of Computer Science and IT.
We have the following details available:
Degree,
Density,
Full Time Duration,
ID,
level,
listing_type,
logo,
organization,
organization_id,
summary,
tuition fee,
Address: area, city, country.

Data we'll be requiring:
organization_id: College id<br> 
organization: College name<br> 
tuition fee: College Fees<br>
Address: area, city, country of College<br>
latitude, longitude: obtained from the college name and city using the Google Places API<br>

## Usage of Data (used for)
organization_id: Primary Key (in case colleges have same name)<br> 
organization: Name to refer to college<br> 
tuition fee: Find colleges with acceptable fee range<br>
Address: Find Lat-Long of college<br>
latitude, longitude: For clustering and exploring with FourSquare Places API<br>



<br><br><br>
<a id='item0'></a> 
# Imports

In [None]:

import requests # get requests
import json # to parse the json file


import numpy as np 
import pandas as pd # because arrays are oldschool 
from pandas import json_normalize # for handling nested json (top-level import, available in pandas >= 1.0)


import seaborn as sns

import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors

import folium
from folium.plugins import MarkerCluster


from GoogleMapsApiKey import get_key
import FoursquareApiCredentials as fs


from progressbar import ProgressBar # tests your patience
import time # for delay
from IPython.display import clear_output # to clear notebook output cell via code

print('Libraries Imported')

<br><br><br>
<a id='item1'></a> 
# 1. Data Acquisition and Preparation


# Importing data by web scraping
## We are using <a href= "https://www.mastersportal.com/"> www.mastersportal.com</a> to mine our data.
<p>We are searching for Canadian, Australian, United States and United Kingdom colleges/universities that provide a full-time Master's degree programme in the field of Computer Science and IT.</p>

## Initial look at the data.

#### The search looked like this: <a href = "https://github.com/xtreme0021/Capstone/blob/master/Images/search_tags_smallext.png">https://github.com/xtreme0021/Capstone/blob/master/Images/search_tags_smallext.png</a>

#### We inspect-element the search results: <a href = "https://github.com/xtreme0021/Capstone/blob/master/Images/rmbMenu.png">https://github.com/xtreme0021/Capstone/blob/master/Images/rmbMenu.png</a>

#### The inspector shows the following: <a href = "https://github.com/xtreme0021/Capstone/blob/master/Images/inspectElement.png">https://github.com/xtreme0021/Capstone/blob/master/Images/inspectElement.png</a>

<p>So the Information is contained in the span tag, class = 'Location'. But, if we 'View Page Source' of the HTML and search for 'Location' this is what it shows:
<br><br>
<textarea rows="7" cols="50">
    <span class="Location"> 
        <span class="Fact LocationFact">
               {{organisation}}
        </span> 
        <span class="Fact LocationFact">
            {{{venue}}}
        </span>
    </span>
</textarea>
<br><br>
The <font color="red">{{organisation}}, {{{venue}}}</font> tags imply that the data is being dynamically inserted into the HTML via a JSON file, so let's get the JSON. 
</p>

## Getting the JSON

<p>
1. Go to inspect element.<br>
2. Choose the Network tab.<br>
3. Go to the 2nd page of the search to refresh the network activity.<br>
4. We are looking for a JSON response that fills in the search results, so the Type column would be json, and since the page is requesting data, the Cause column is fetch.<br>
5. One of the domains fulfils our needs: the 'search.prtl.co' domain. Double click it. Inspector-NetworkActivity:
<a href = "https://github.com/xtreme0021/Capstone/blob/master/Images/NetworkActivity.png">https://github.com/xtreme0021/Capstone/blob/master/Images/NetworkActivity.png</a><br>
6. Voila! You have the link of the JSON string that generates the results.<br>
7. We could save the JSON from there, but good luck with 214 files :). (Don't worry, I have a solution.)<br>
</p>

## Notice that we were on the second page and the URL for our JSON was: 

<a href="https://search.prtl.co/2018-07-23/?start=10&q=ci-30%2C56%2C202%2C82%7Cdg-msc%7Cde-fulltime%7Cdi-24%7Cen-413%7Clv-master%7Ctc-EUR"> https://search.prtl.co/2018-07-23/?start=10&q=ci-30%2C56%2C202%2C82%7Cdg-msc%7Cde-fulltime%7Cdi-24%7Cen-413%7Clv-master%7Ctc-EUR</a><br></code>

<p>
So '?start=10' is the index of the first college to return from the search list. If we put 20 in place of 10, we get the data of the 3rd page, so we can fetch all pages by changing the 'start' value. The maximum start value should be 2130, i.e. colleges no. 2131 to 2137 (there were 2137 search results on the day I searched, as seen in the search image).
</p>
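
The page arithmetic above can be checked with a short sketch. It assumes 10 results per page and the 2137 total quoted above; `page_offsets` is a hypothetical helper name, not something the notebook defines.

```python
import math

def page_offsets(total_results, page_size=10):
    """Return the `start` offsets needed to cover every search result."""
    n_pages = math.ceil(total_results / page_size)  # a partial last page still counts
    return [page * page_size for page in range(n_pages)]

offsets = page_offsets(2137)
print(len(offsets), offsets[0], offsets[-1])  # prints: 214 0 2130
```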

In [None]:
# There are 2137 colleges at 10 per page; range(0, 2140, 10) yields start = 0, 10, ..., 2130.
jsonExportFileName = 'college-list.json'
with open('Capstone/JSON/' + jsonExportFileName, 'a+') as jsonFile:
    for i in range(0, 2140, 10): # (start_value, approx total colleges, no_of_college_per_page)
        # build the search URL with ?start=i to fetch results i+1 .. i+10
        url = "https://search.prtl.co/2018-07-23/?start={}&q=ci-30%2C56%2C202%2C82%7Cdg-msc%7Cde-fulltime%7Cdi-24%7Cen-413%7Clv-master%7Ctc-EUR".format(i)
        webPage = requests.get(url) # connect and get the response
        webPage.raise_for_status() # stop early if a page could not be fetched

        json.dump(webPage.json(), jsonFile) # append this page's JSON array to jsonFile
print(jsonExportFileName+" has been created.")

#### Since the data comes from 214 pages, the file contains 214 JSON arrays back to back, i.e. there is a '][' (end of one array and start of the next) at 213 places where there should be a ',' (comma to continue the array).
##### So I manually replaced them using the editor. A simple find and replace does the job.
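
The manual find-and-replace could also be done in code. The sketch below is one possible approach, not what this notebook ran: it walks through a string of back-to-back JSON documents with `json.JSONDecoder.raw_decode` and merges them into a single list. The helper name `merge_concatenated_json` is hypothetical, and it assumes every page is a JSON array.

```python
import json

def merge_concatenated_json(text):
    """Merge back-to-back JSON arrays (e.g. '[...][...]') into one Python list."""
    decoder = json.JSONDecoder()
    text = text.strip()
    merged, pos = [], 0
    while pos < len(text):
        page, pos = decoder.raw_decode(text, pos)  # parse one array starting at pos
        merged.extend(page)
        # raw_decode does not skip whitespace, so step over any gap between arrays
        while pos < len(text) and text[pos].isspace():
            pos += 1
    return merged

sample = '[{"id": 1}, {"id": 2}][{"id": 3}]'
print(merge_concatenated_json(sample))  # prints: [{'id': 1}, {'id': 2}, {'id': 3}]
```

The merged list can then be written once with `json.dump`, which yields a file that `json.load` reads without manual editing.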

In [None]:
with open('Capstone/JSON/college-list.json') as jsonFile: # refers to the json we created while scraping the website
    raw_data = json.load(jsonFile) # load data to a python variable
print('Data Imported.')

In [None]:
df = json_normalize(raw_data) # normalizing the data using pandas library function

In [None]:
df

### 'venues' is nested, so we normalize it to a different dataframe

In [None]:
venues = json_normalize(data = raw_data, record_path = 'venues')
venues.drop('display_area', axis = 1, inplace = True) # this column serves no purpose whatsoever

### Let's check if we successfully extracted the venues data

In [None]:
df.columns

In [None]:
venues

In [None]:
df = df.join(venues) # Merging df and venues

<br><br><br><br>
<a id='item2'></a> 
# 2. Data Cleaning

In [None]:
df.columns # lets revise the columns we have

In [None]:
# We don't need these columns; they are just clutter from the json we parsed
columns_to_drop = ['degree', 'density.fulltime', 'density.parttime', 
    'enhanced', 'organisation_id', 'level', 'listing_type', 'logo', 'methods.blended',
    'methods.face2face', 'methods.online','parttime_duration.unit', 
    'parttime_duration.value', 'summary', 'title', 'venues', 'fulltime_duration.value', 'fulltime_duration.unit']
df.drop(columns_to_drop, axis = 1,inplace = True)

In [None]:
df.isna().sum()

In [None]:
df[df['tuition_fee.currency'] != 'EUR'].count()

In [None]:
df = df[df['tuition_fee.currency'].notnull()] #new df from current df where tuition_fee.currency is not null 
df = df[df['area'].notnull()]

According to https://www.geteducated.com/career-center/detail/what-is-a-masters-degree,
To earn a master’s degree you usually need to complete from 36 to 54 semester credits of study (or 60 to 90 quarter-credits). This equals 12 to 18 college courses. 

45 is the average of 36 and 54!
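
As a worked example of the conversion applied in the next cell: a per-credit fee is multiplied by the assumed 45 credits and divided by 2. The division by 2 presumably reflects a two-year programme; that reading, the function name and the sample fee below are illustrative assumptions.

```python
AVERAGE_CREDITS = (36 + 54) / 2   # 45 semester credits for a typical master's degree
PROGRAMME_YEARS = 2               # assumption: the credits are spread over two years

def per_credit_to_annual(fee_per_credit):
    """Approximate yearly fee for a programme priced per credit."""
    return fee_per_credit * AVERAGE_CREDITS / PROGRAMME_YEARS

print(per_credit_to_annual(1000))  # prints: 22500.0
```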

In [None]:
df.loc[df['tuition_fee.unit'] == 'credit', 'tuition_fee.value'] = (df['tuition_fee.value']*45)/2 
# Multiplying tuition_fee.value by 45 credits when tuition_fee.unit is 'credit' gives the fee for the whole degree;
# dividing by 2 (presumably a two-year programme) gives an approximate per-year fee,
# so that all fees are on a uniform per-year scale

In [None]:
df.drop(['tuition_fee.currency', 'tuition_fee.unit'], axis=1, inplace=True) 
# since the values are now uniform, we don't need the currency and unit, so we drop them

In [None]:
df = df.rename(columns = {'tuition_fee.value': 'fees', 'organisation': 'college_name'})

In [None]:
# rearranging the columns
df = df[['id','college_name', 'fees', 'area', 'city', 'country']] # removed location from here on date 20191023

In [None]:
df

In [None]:
df = df.reset_index(drop=True) # renumber the rows from 0 without keeping the old index

<br><br><br>

In [None]:
df_test = pd.DataFrame()
df_test['clg_city'] = df['college_name'].map(str)+', '+df['city']

In [None]:
clg_names = df['college_name'].to_list()
clg_city = df_test['clg_city'].to_list()

<br><br><br>
<a id='item3'></a> 
# 3. Using Google Places API

In [None]:
show_progress = ProgressBar()

In [None]:
API_KEY = get_key()

In [None]:
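def findPlace(query, key):
    url = 'https://maps.googleapis.com/maps/api/place/findplacefromtext/json'
    # pass the parameters via `params` so that requests URL-encodes them
    # (names containing '&' or spaces would otherwise break the query string)
    req = requests.get(
        url,
        params={
            'input': query,
            'inputtype': 'textquery',
            'fields': 'geometry/location',
            'key': key
        }
    )
    time.sleep(0.02) # short delay between consecutive requests
    return req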

In [None]:
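def jdump_latlongG(filename, query_list, key):
    # query_list[0]: 'college, city' strings; query_list[1]: plain college names (fallback)
    results = []
    for i in show_progress(range(len(query_list[0]))):
        req = findPlace(query_list[0][i], key)
        if req.json()['status'] != 'OK':
            req = findPlace(query_list[1][i], key) # retry with the college name only
        results.append(req.json())
    # write all responses as one JSON list so that json.load can read the file back
    with open(filename, 'w') as jsonFile:
        json.dump(results, jsonFile)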

In [None]:
jdump_latlongG('G-lat-long-new.json', [clg_city, clg_names], API_KEY)

<br><br><br>
# Importing latitudes and longitudes from json

In [None]:
with open('G-lat-long-new.json') as jsonFile: # refers to the json we created earlier
    ll_data = json.load(jsonFile) # load data to a python var
print('Lat-Long Imported.')

In [None]:
ll = json_normalize(ll_data, record_path='candidates', meta =['status'])

In [None]:
ll.drop('formatted_address', axis=1, inplace=True, errors='ignore') # present only if it was requested in 'fields'
ll = ll.rename(columns={'geometry.location.lat': 'latitude', 'geometry.location.lng': 'longitude'}) # rename returns a new frame
ll.shape

<br>

<br>

## Making new columns in df from ll dataframe
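
Joining `ll` to `df` by row position only lines up if every query produced exactly one candidate. A possible safeguard is to build one row per response, taking the first candidate and leaving the coordinates empty when there is none. `responses_to_latlong` is a hypothetical helper, and the sample responses only imitate the `status`/`candidates` shape used above.

```python
import pandas as pd

def responses_to_latlong(responses):
    """Return one row per Places response so the rows stay aligned with the queries."""
    rows = []
    for resp in responses:
        candidates = resp.get('candidates', [])
        # use the first candidate if there is one, otherwise leave the coordinates empty
        loc = candidates[0]['geometry']['location'] if candidates else {}
        rows.append({
            'status': resp.get('status'),
            'latitude': loc.get('lat'),
            'longitude': loc.get('lng'),
        })
    return pd.DataFrame(rows)

# made-up responses, for illustration only
sample = [
    {'status': 'OK', 'candidates': [{'geometry': {'location': {'lat': 1.5, 'lng': 2.5}}}]},
    {'status': 'ZERO_RESULTS', 'candidates': []},
]
ll_aligned = responses_to_latlong(sample)
print(ll_aligned)
```

With this shape, the later `status == "OK"` filter removes exactly the colleges that could not be located.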

<br>

In [None]:
df = df.join(ll)

In [None]:
df.shape

In [None]:
df = df[df['status']=="OK"]

In [None]:
df.drop(['status'], axis=1, inplace=True)
df = df.rename(columns={'geometry.location.lat': 'latitude', 'geometry.location.lng': 'longitude'})
df.head()

In [None]:
df.to_csv('college-dataset.csv')

<br>

## We have saved the data to college-dataset.csv 

<br>


In [None]:
df = pd.read_csv("college-dataset.csv") 

In [None]:
df.head()

In [None]:
df.drop(['Unnamed: 0'], axis=1, inplace=True)

In [None]:
df['country'].value_counts()

In [None]:
df = df[df['country'].str.contains("United States|United Kingdom|Canada")] 
# keeping only the locations with location-country as US, UK, Canada

In [None]:
# manually checked lat long on google maps
df.loc[df.college_name=='University of California, Berkeley', 'latitude'] = 37.8718992
df.loc[df.college_name=='University of California, Berkeley', 'longitude'] = -122.2607286

df.loc[df.college_name=='Johnson & Wales University', 'latitude'] = 41.8197902
df.loc[df.college_name=='Johnson & Wales University', 'longitude'] = -71.415209

In [None]:
# These were placed in India and could not be found on Google Maps
df.drop(df[df['college_name']=='Engineering and Technology College'].index, inplace = True)
df.drop(df[df['college_name']=='College of Nursing and Public Health'].index, inplace = True)

In [None]:
df[df['fees']>50000]

In [None]:
# Manually searched college fees
df.loc[df.college_name=='University of Colorado at Boulder', 'fees'] = 48570
df.loc[df.college_name=='University of Pennsylvania', 'fees'] = 7134
df.loc[df.college_name=='University of Nebraska Omaha', 'fees'] = 28564
df.loc[df.college_name=='Johns Hopkins University', 'fees'] = 45350
df.loc[df.college_name=='University of Illinois at Urbana Champaign', 'fees'] = 53437
df.loc[df.college_name=='Carnegie Mellon University', 'fees'] = 38940
df.loc[df.college_name=='Northwestern University', 'fees'] = 42000
df.loc[df.college_name=='Washington University in St. Louis', 'fees'] = 63000
df.loc[df.college_name=='University of Massachusetts Amherst', 'fees'] = 62000
df.loc[df.college_name=='Bentley University', 'fees'] = 68640

In [None]:
df.reset_index(drop=True, inplace=True)

In [None]:
df.to_csv('final-college-dataset.csv')

<br><br><br>

<a id='item4'></a> 
# 4. Exploratory Data Analysis

So now that we have our data, it's time to explore it. Let's look at the colleges per country; splitting the data frame by country makes this easy.

In [None]:
df = pd.read_csv("final-college-dataset.csv") # same file name as written by to_csv above

In [None]:
df.drop(['Unnamed: 0'], axis=1, inplace=True)

In [None]:
df_usa=df[df['country'].str.contains("United States")]
df_uk=df[df['country'].str.contains("United Kingdom")]
df_canada=df[df['country'].str.contains("Canada")]

<br><br>
## Let's look at the geospatial data

In [None]:
usa_coordinates = [37.0902, -100]
canada_coordinates = [54.6959279, -90]
uk_coordinates = [54.2186138, -5]

<br><br>
## USA Map

In [None]:
usa_map = folium.Map(location = usa_coordinates, zoom_start = 4)
mc = MarkerCluster().add_to(usa_map)    

for row in df_usa.itertuples():
    folium.Marker(
        location=[row.latitude,row.longitude],
        icon = None,
        popup=row.college_name
    ).add_to(mc)

usa_map

In case the map is not showing: https://github.com/xtreme0021/Capstone/blob/master/Images/USA_marker_cluster.png

<br><br>
## Canada Map

In [None]:
canada_map = folium.Map(location = canada_coordinates, zoom_start = 4)
mc = MarkerCluster().add_to(canada_map)    

for row in df_canada.itertuples():
    folium.Marker(
        location=[row.latitude,row.longitude],
        icon = None,
        popup=row.college_name
    ).add_to(mc)

canada_map

In case the map is not showing: https://github.com/xtreme0021/Capstone/blob/master/Images/Canada_marker_cluster.png

<br><br>
## UK Map

In [None]:
uk_map = folium.Map(location = uk_coordinates, zoom_start = 6)
mc = MarkerCluster().add_to(uk_map)    

for row in df_uk.itertuples():
    folium.Marker(
        location=[row.latitude,row.longitude],
        icon = None,
        popup=row.college_name
    ).add_to(mc)

uk_map

In case the map is not showing: https://github.com/xtreme0021/Capstone/blob/master/Images/UK_marker_cluster.png

<br><br>
# Let's take a look at the fees

In [None]:
df.describe()

In [None]:
sns.set(rc={'figure.figsize':(3,6)})
boxplot = sns.boxplot(data=df['fees'])

## Let's look at average fees by country

In [None]:
print('Average fees per country')
print("USA : {0:.2f}".format(df_usa['fees'].mean()))
print("UK : {0:.2f}".format(df_uk['fees'].mean()))
print("Canada : {0:.2f}".format(df_canada['fees'].mean()))

In [None]:
sns.set(rc={'figure.figsize':(7,6)})
boxplot = sns.boxplot(
    data = df,
    x = 'country',
    y = 'fees'
)

# Note: 

## The box of Canada lies within the interquartile range (middle 50%) of the USA, so it won't be surprising if we get more <font color = 037ffc>Canadian colleges</font> when looking for similar <font color = 037ffc>low fees</font> colleges with money as a criterion.

## Similarly, the <font color = 037ffc>UK</font> would be the preferred choice over the USA when choosing <font color = 037ffc>mid to high fees </font>colleges with money as a criterion.

## It is obvious, but let me point it out specifically: the <font color = 'red'>USA has the highest fees</font> of all 3 nations.
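
The quartile comparison in the notes above can also be checked numerically. A minimal sketch, using made-up fees for two placeholder countries rather than the scraped data; with the real data, `toy` would be replaced by `df`.

```python
import pandas as pd

# invented values, for illustration only
toy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
    'fees': [10, 20, 30, 40, 50, 22, 24, 26, 28, 30],
})

# quartiles of the fees per country, one column per quantile
quartiles = toy.groupby('country')['fees'].quantile([0.25, 0.5, 0.75]).unstack()
quartiles['iqr'] = quartiles[0.75] - quartiles[0.25]  # width of the box in the box plot
print(quartiles)
```

If one country's 0.25 and 0.75 values both fall between another country's, its box lies inside the other's interquartile range, as with B inside A here.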

<br><br><br>
<a id='item5'></a> 
# 5. FourSquare Places API

In [None]:
CLIENT_ID = fs.get_client_id() # your Foursquare ID
CLIENT_SECRET = fs.get_client_secret() # your Foursquare Secret
VERSION = '20181102' # Foursquare API version
RADIUS = 10000 # Radius to search in
LIMIT = 20 # Limit to no. of search results

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=RADIUS):
    venues_list=[]
    for count, (name, lat, lng) in enumerate(zip(names, latitudes, longitudes)):
        print(count, name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        try:
            results = requests.get(url).json()["response"]['groups'][0]['items']
            # one tuple per venue: college details followed by venue details
            venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['shortName']) for v in results])
        
        except (KeyError, IndexError):
            # response without the expected fields: keep a flat placeholder ending in None
            print('Unexpected response: replacing with None.')
            venues_list.append([name, lat, lng, None, None, None, None])
    
    return venues_list

In [None]:
college_venues = getNearbyVenues(names=df['college_name'],
                                   latitudes=df['latitude'],
                                   longitudes=df['longitude']
                                  )

In [None]:
# indices of colleges whose request failed (flat placeholder ending in None)
none_list = [i for i, venues in enumerate(college_venues) if venues and venues[-1] is None]
none_list

In [None]:
# keep only the colleges for which the request succeeded
not_none = [venues for i, venues in enumerate(college_venues) if i not in none_list]
   

In [None]:
nearby_venues = pd.DataFrame([item for venue_list in not_none for item in venue_list])
nearby_venues.columns = ['college_name', 
                  'c_latitude', 
                  'c_longitude', 
                  'venue', 
                  'v_latitude', 
                  'v_longitude', 
                  'v_category']

In [None]:
df.drop(none_list) # preview only: drop() returns a new frame, df itself is not modified

In [None]:
nearby_venues.groupby('college_name').count()

In [None]:
print('There are {} unique categories.'.format(len(nearby_venues['v_category'].unique())))

<br><br><br>
<a id='item6'></a> 
# 6. Clustering

In [None]:
nearby_onehot = pd.get_dummies(nearby_venues[['v_category']], prefix="", prefix_sep="")

# add college_name column back to dataframe
nearby_onehot['college_name'] = nearby_venues['college_name'] 

# move college_name column to the first column
fixed_columns = [nearby_onehot.columns[-1]] + list(nearby_onehot.columns[:-1])
nearby_onehot = nearby_onehot[fixed_columns]

nearby_onehot.head()

In [None]:
nearby_grouped = nearby_onehot.groupby('college_name').mean().reset_index()
nearby_grouped

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['college_name']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError: # 4th onwards: no entry in indicators
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
nearby_venues_sorted = pd.DataFrame(columns=columns)
nearby_venues_sorted['college_name'] = nearby_grouped['college_name']

for ind in np.arange(nearby_grouped.shape[0]):
    nearby_venues_sorted.iloc[ind, 1:] = return_most_common_venues(nearby_grouped.iloc[ind, :], num_top_venues)

nearby_venues_sorted.head()

In [None]:
from sklearn.cluster import KMeans

In [None]:
# set number of clusters
kclusters = 5

nearby_grouped_clustering = nearby_grouped.drop(columns='college_name') # keep only the category frequencies

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(nearby_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

In [None]:
# add clustering labels
nearby_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

nearby_merged = df

# join the cluster labels and top venues onto the college data (which holds latitude/longitude)
nearby_merged = nearby_merged.join(nearby_venues_sorted.set_index('college_name'), on='college_name')

nearby_merged.head() # check the last columns!

In [None]:
nearby_merged[nearby_merged['Cluster Labels'].isnull()]

In [None]:
nearby_merged = nearby_merged.dropna(subset=['Cluster Labels']) # drop colleges without a cluster label (row 619 in this run)

In [None]:
# create map
map_clusters = folium.Map(location=usa_coordinates, zoom_start=3)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(nearby_merged['latitude'], nearby_merged['longitude'], nearby_merged['college_name'], nearby_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)],
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In case the map is not showing: https://github.com/xtreme0021/Capstone/blob/master/Images/Cluster.png
<br><br><br>

In [None]:
nearby_merged.loc[nearby_merged['Cluster Labels'] == 0, nearby_merged.columns[[1] + list(range(5, nearby_merged.shape[1]))]]

In [None]:
nearby_merged.loc[nearby_merged['Cluster Labels'] == 1, nearby_merged.columns[[1] + list(range(5, nearby_merged.shape[1]))]]

In [None]:
nearby_merged.loc[nearby_merged['Cluster Labels'] == 2, nearby_merged.columns[[1] + list(range(5, nearby_merged.shape[1]))]]

In [None]:
nearby_merged.loc[nearby_merged['Cluster Labels'] == 3, nearby_merged.columns[[1] + list(range(5, nearby_merged.shape[1]))]]

In [None]:
nearby_merged.loc[nearby_merged['Cluster Labels'] == 4, nearby_merged.columns[[1] + list(range(5, nearby_merged.shape[1]))]]

<br><br><br>
# Conclusion:
The colleges have been clustered on the basis of their neighbourhoods, and the clusters show no definite trend. From the clusters it is apparent that neighbourhoods in the UK largely differ from those in the USA or Canada. Hence, I've clustered the college neighbourhoods into the following categories:
1.	American Eats
2.	Exotic Eats
3.	Tour/ Outgoing
4.	Night Life (Pub) and Fitness
5.	Art Prone/ Mature Audience
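
One way to check how the clusters split across countries is a cross-tabulation of country against cluster label. The sketch below uses a small invented frame with the same column names as `nearby_merged`; the values are placeholders, not results from this analysis.

```python
import pandas as pd

# invented rows, for illustration only
toy_merged = pd.DataFrame({
    'country': ['United States', 'United States', 'United Kingdom', 'Canada', 'United Kingdom'],
    'Cluster Labels': [0, 1, 2, 0, 2],
})

# rows: countries, columns: cluster labels, cells: number of colleges
cluster_by_country = pd.crosstab(toy_merged['country'], toy_merged['Cluster Labels'])
print(cluster_by_country)
```

Running the same call on `nearby_merged` would show whether the UK colleges concentrate in clusters different from those of the USA and Canada.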
