# Capstone Project - The Battle of the Neighborhoods
### Applied Data Science Capstone by IBM/Coursera
### by Emil Fuerstenberg Haegg

## Table of contents
* [Introduction](#intro)
* [Initialisation of notebook environment](#environment)
* [Visualising similarity of Scottish cities and towns by nearby venues and population](#pop_venues)
* [Visualising GDP per capita for areas in Scotland using a choropleth map](#gdp_areas)

## Introduction <a name="intro"></a>

### Background

Scotland is a country that is part of the United Kingdom and makes up the northern part of the island known as Great Britain. Scotland has a varied landscape: the southern and eastern parts consist mostly of rural lowlands, while the north-western parts are aptly known as the Highlands. The Highlands have a varied geology with mountains, countless islands, forests and bogs, but are sparsely populated. Most of the country's population is instead concentrated in and around the two larger cities in the south, Glasgow and Edinburgh.

In this project we aim to find a suitable area to relocate to, considering some of the factors that would influence the choice. A relative is planning to relocate from the town of Stornoway, located on the island of Lewis and Harris at the north-western edge of the country. For work reasons they are moving to the area around Glasgow and Edinburgh.

### Business problem

The problem consists of narrowing down which data to use when comparing the areas, depending on what the stakeholder, in this case our relative, considers most important. The number of possible comparisons is vast, and input from our stakeholder will guide which direction to take. Our stakeholder is looking to move to a town that is similar to their hometown. As a starting point, we compare the cities and towns on the following features: venues, population size and GDP per capita. We aim to present this comparison visually, so that our stakeholder gets an initial overview of similarities and differences as a basis for further exploration.

### Data

Data on the 51 largest cities and towns of Scotland and their population was found on Wikipedia: 
https://en.wikipedia.org/wiki/List_of_towns_and_cities_in_Scotland_by_population 
    
Data on the hometown, Stornoway, can also be found on Wikipedia. For a single entry like this, adding the data manually can be quicker than other methods:
https://en.wikipedia.org/wiki/Stornoway 
    
GDP per capita for the different areas of Scotland can also be found on a Wikipedia page: 
https://en.wikipedia.org/wiki/Economy_of_Scotland 
    
Data could also be gathered from official registers, but using Wikipedia as a source lets us draw on its vast collection of knowledge while demonstrating commonly used tools for handling unstructured data. Using the BeautifulSoup package we can access the HTML directly and scrape it for the data of interest.

The Foursquare Places API can be used to explore venues near a given city or town. By connecting to this API we can request up-to-date information on the locations and categories of venues within a given radius of a coordinate. We aim to arrange the towns and cities into clusters based on the most common venues in each neighbourhood.
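
A request of this kind might be assembled as in the sketch below. The endpoint and query parameter names follow the legacy v2 Places API as we understand it and should be checked against the current Foursquare documentation. The helper name `build_explore_url`, the placeholder credentials and the approximate Stornoway coordinates are illustrative assumptions, not part of the notebook.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumed legacy v2 endpoint; verify against the current Foursquare documentation.
EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Return a venue-exploration URL for a point and a search radius in metres."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    return f"{EXPLORE_ENDPOINT}?{urlencode(params)}"


# Placeholder credentials and approximate coordinates for Stornoway.
example_url = build_explore_url("MY_ID", "MY_SECRET", "20200416", 58.209, -6.389, 1000, 100)
print(example_url)
```

The resulting string could then be passed to `requests.get`, and the JSON response inspected for venue names and categories.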

To create a choropleth map, we need a GeoJSON file representing the borders of the different areas, which can then be layered on top of the map and coloured according to GDP per capita. A GeoJSON file representing Local Authority Districts can be found at the following GitHub repository. The file has not been updated for several years, but it represents the current council areas, which were created by the Local Government etc. (Scotland) Act 1994 and have been in place since 1996.
https://github.com/martinjc/UK-GeoJSON/blob/master/json/administrative/sco/lad.json 
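
The colouring step relies on matching each polygon in the GeoJSON file to a value by name. The sketch below illustrates that matching on a tiny stand-in for the file. The property name `LAD13NM` is the one read from the file later in this notebook; the mapping from the GDP area "Angus & Dundee" to the districts "Angus" and "Dundee City" is our assumption, and the figure of 24,500 euros is taken from the GDP table scraped further down.

```python
# Minimal stand-in for the GeoJSON file: two features carrying the LAD13NM property.
geo = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"LAD13NM": "Angus"}, "geometry": None},
        {"type": "Feature", "properties": {"LAD13NM": "Dundee City"}, "geometry": None},
    ],
}

# Assumed mapping from a GDP table area to the districts it covers.
area_to_districts = {"Angus & Dundee": ["Angus", "Dundee City"]}
gdp_by_area = {"Angus & Dundee": 24500}  # euros per capita

# Expand to one value per district so that each polygon can be coloured.
gdp_by_district = {
    district: gdp_by_area[area]
    for area, districts in area_to_districts.items()
    for district in districts
}

for feature in geo["features"]:
    name = feature["properties"]["LAD13NM"]
    print(name, gdp_by_district.get(name))
```

A mapping like this is needed because the GDP table groups some districts together, so its area names do not match the district names one to one.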

## Initialisation of notebook environment <a name="environment"></a>

### Import libraries

In [30]:
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import re # For regular expression manipulation
from math import sqrt # square root function
from math import pi

import json # module for handling json-files
import requests # library to handle requests
            
#!pip install bs4  # uncomment if not already installed
from bs4 import BeautifulSoup

print('Libraries imported.')

Libraries imported.


### Credentials and default settings for Foursquare API

In [31]:
import configparser

config = configparser.ConfigParser()
config.read("config.txt")

CLIENT_ID = config.get("configuration", "api_key")
CLIENT_SECRET = config.get("configuration","api_secret")

VERSION = '20200416' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value
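
The cell above expects a file named `config.txt` with a `configuration` section holding the keys `api_key` and `api_secret`. A minimal sketch of that layout follows, parsed from a string so that it runs without the file; the values are placeholders and the variable names are our own.

```python
import configparser

# Hypothetical contents of config.txt; the values are placeholders.
SAMPLE_CONFIG = """
[configuration]
api_key = YOUR_CLIENT_ID
api_secret = YOUR_CLIENT_SECRET
"""

sample_config = configparser.ConfigParser()
sample_config.read_string(SAMPLE_CONFIG)  # parses text instead of reading a file

sample_id = sample_config.get("configuration", "api_key")
sample_secret = sample_config.get("configuration", "api_secret")
print(sample_id, sample_secret)
```

Note that `ConfigParser.read` silently skips files it cannot find, so a missing `config.txt` only shows up as a `NoSectionError` when `get` is called. Keeping the credentials in a separate file also keeps them out of the notebook itself.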

## Visualising similarity of Scottish cities and towns by nearby venues and population <a name="pop_venues"></a>

### Data sourcing

Webpage URL

In [32]:
url = "https://en.wikipedia.org/wiki/List_of_towns_and_cities_in_Scotland_by_population"

Get the content in text format and store it in the variable 'data'

In [33]:
data = requests.get(url).text

Create a BeautifulSoup object from 'data'

In [34]:
soup = BeautifulSoup(data,"html.parser")

Find all html-tables in the webpage

In [35]:
tables = soup.find_all('table') # HTML tables are marked up with the <table> tag

Check how many tables were found

In [36]:
len(tables)

4

Indices for tables found  

In [37]:
for index, _ in enumerate(tables):
    print(index)

0
1
2
3


Check whether the tables have captions

In [38]:
for table in tables:
    print(table.caption)

None
None
None
None


Find all wikitables in the webpage

In [39]:
# Inspecting the webpage shows that the table of interest has the class "wikitable",
# so we search for that class directly rather than looking through all the tables above
wikitables = soup.find_all("table",{"class":"wikitable"})


Number of wikitables found

In [40]:
print(len(wikitables))

2


Indices for tables found 

In [41]:
for index, _ in enumerate(wikitables):
    print(index)

0
1


Table of interest

In [42]:
tablen = wikitables[0]

Create a dataframe and fill it with the data of interest from the fetched table

In [43]:
# Collect one dict per table row; header rows contain only <th> cells and are skipped
rows = []
for row in tablen.tbody.find_all("tr"):
    col = row.find_all("td")
    if col:
        rows.append({
            "City": col[0].text.strip(),
            "Population": col[1].text.strip(),
            "Status": col[2].text.strip(),
        })

# Build the frame in one step; DataFrame.append was removed in pandas 2.0
cities = pd.DataFrame(rows, columns=["City", "Population", "Status"])

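Add the hometown to the dataframe, using figures from
https://en.wikipedia.org/wiki/Stornoway

In [44]:
new_row = {'City': 'Stornoway', 'Population': 5070, 'Status': 'Town'}
# pd.concat is used because DataFrame.append was removed in pandas 2.0
cities = pd.concat([cities, pd.DataFrame([new_row])], ignore_index=True)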

In [45]:
display(cities)

Unnamed: 0,City,Population,Status
0,Glasgow,612040,City
1,Edinburgh,488050,City
2,Aberdeen,200680,City
3,Dundee,148280,City
4,Paisley,77220,Town[4]
5,East Kilbride,75120,Town
6,Livingston,57030,Town
7,Hamilton,54080,Town
8,Dunfermline,53100,Town
9,Cumbernauld,50920,Town
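
The scraped values are text, and the Paisley row still carries a footnote marker ("Town[4]"). A minimal clean-up sketch follows; the helper names `clean_status` and `parse_population` are hypothetical and not part of the notebook, and the comma-separated input is only there to show that thousands separators would be handled too.

```python
import re


def clean_status(status):
    """Remove bracketed footnote markers such as '[4]' from a status string."""
    return re.sub(r"\[\d+\]", "", status).strip()


def parse_population(value):
    """Convert a population given as text or number to int, ignoring commas."""
    return int(str(value).replace(",", ""))


print(clean_status("Town[4]"))
print(parse_population("77,220"))
```

With pandas, these helpers could be applied column-wise, for example through `cities["Status"].map(clean_status)`, so that Paisley is grouped with the other towns and populations can be compared numerically.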


## Visualising GDP per capita for areas in Scotland using a choropleth map <a name="gdp_areas"></a>

### Data sourcing

The GeoJSON file for Scotland is listed on the page below; we download the raw file using wget.

https://github.com/martinjc/UK-GeoJSON/blob/master/json/administrative/sco/lad.json

In [46]:
!wget --quiet https://raw.githubusercontent.com/martinjc/UK-GeoJSON/master/json/administrative/sco/lad.json

Read the geodata and list the names of the areas. Glasgow City is not printed, most likely because the loop below stops after the first 31 features. It is still included in scot_geo, so it can be keyed on as usual when drawing the choropleth map.

In [47]:
# Read the GeoJSON file
scot_geo = r'lad.json'

with open(scot_geo, 'r') as j:
    d = json.load(j)

# Print the area name of the first 31 features
lista_json = []
for i in range(0, 31):
    lad_names = d['features'][i]['properties']['LAD13NM']
    lista_json.append(lad_names)
    print(lad_names)

Clackmannanshire
Dumfries and Galloway
East Ayrshire
East Lothian
East Renfrewshire
Eilean Siar
Falkirk
Fife
Highland
Inverclyde
Midlothian
Moray
North Ayrshire
Orkney Islands
Perth and Kinross
Scottish Borders
Shetland Islands
South Ayrshire
South Lanarkshire
Stirling
Aberdeen City
Aberdeenshire
Argyll and Bute
City of Edinburgh
Renfrewshire
West Dunbartonshire
West Lothian
Angus
Dundee City
North Lanarkshire
East Dunbartonshire
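
One way to avoid hard-coding the number of features is to iterate over the feature list itself. The sketch below does this on a small in-memory stand-in for `lad.json`; the three features are illustrative only, and the real file holds one feature per district.

```python
import json

# Tiny in-memory stand-in for lad.json.
sample = json.loads("""
{"type": "FeatureCollection", "features": [
  {"type": "Feature", "properties": {"LAD13NM": "Fife"}, "geometry": null},
  {"type": "Feature", "properties": {"LAD13NM": "Highland"}, "geometry": null},
  {"type": "Feature", "properties": {"LAD13NM": "Glasgow City"}, "geometry": null}
]}
""")

# Iterate over the list itself so that no feature is skipped.
names = [feature["properties"]["LAD13NM"] for feature in sample["features"]]
print(len(names), names)
```

Applied to the dictionary `d` loaded above, the same comprehension would pick up every district regardless of how many the file contains.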


Webpage URL

In [48]:
url = "https://en.wikipedia.org/wiki/Economy_of_Scotland"

Get the content in text format and store it in the variable 'data'

In [49]:
data = requests.get(url).text

Create a BeautifulSoup object from 'data'

In [50]:
soup = BeautifulSoup(data,"html.parser")

Find all html-tables in the webpage

In [51]:
tables = soup.find_all('table')

Check how many tables were found

In [52]:
len(tables)

20

Check whether the tables have captions

In [53]:
for table in tables:
    print(table.caption)

None
<caption class="infobox-title adr">Economy of <span class="country-name">Scotland</span></caption>
<caption>Top export destinations (excluding oil and gas)
</caption>
<caption>Income Tax Rates and Bands 2018-21<sup class="reference" id="cite_ref-113"><a href="#cite_note-113">[110]</a></sup>
</caption>
<caption>GDP by area
</caption>
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None


Inspecting the webpage shows that the table of interest is a wikitable,
so we search for that class directly rather than looking through all the tables above

In [54]:
wikitables = soup.find_all("table",{"class":"wikitable"})
# number of tables
len(wikitables)

3

Check the captions of the wikitables

In [55]:
for table in wikitables:
    print(table.caption)

<caption>Top export destinations (excluding oil and gas)
</caption>
<caption>Income Tax Rates and Bands 2018-21<sup class="reference" id="cite_ref-113"><a href="#cite_note-113">[110]</a></sup>
</caption>
<caption>GDP by area
</caption>


Table of interest

In [56]:
tablen = wikitables[2]

Create a dataframe and fill it with the data of interest from the fetched table

In [57]:
# Collect one dict per table row; rows without <td> cells (headers) are skipped
rows = []
for row in tablen.tbody.find_all("tr"):
    col = row.find_all("td")
    if col:
        rows.append({
            "Area": col[0].text,
            "GDP per capita": col[2].text.strip(),
        })

# Build the frame in one step; DataFrame.append was removed in pandas 2.0
area_gdp = pd.DataFrame(rows, columns=["Area", "GDP per capita"])

area_gdp

Unnamed: 0,Area,GDP per capita
0,Tayside,"€25,950"
1,Angus & Dundee,"€24,500"
2,Perth & Kinross & Stirling,"€27,400"
3,Dumfries & Galloway,"€20,500"
4,Scottish Borders,"€20,300"
5,Clackmann. & Fife,"€19,900"
6,Falkirk,"€21,800"
7,Edinburgh & Lothian,"€31,766"
8,Edinburgh,"€50,400"
9,West Lothian,"€26,200"
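
The GDP figures are scraped as strings such as "€25,950", so they need converting to numbers before they can drive a colour scale. A minimal sketch, with `parse_euro` as a hypothetical helper name:

```python
def parse_euro(text):
    """Convert a string such as '€25,950' to an integer number of euros."""
    return int(text.replace("€", "").replace(",", "").strip())


sample_values = ["€25,950", "€50,400"]  # taken from the table above
print([parse_euro(value) for value in sample_values])
```

With pandas, the helper could be applied to the whole column through `area_gdp["GDP per capita"].map(parse_euro)`.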
