# Analysis of London spending with Python

In this post I want to discuss how you can use Python to fetch data from the internet, 
put it in a readable format and gain some interesting insights.

This exercise is motivated by ["Using SQL for Lightweight Data Analysis"](http://schoolofdata.org/2013/03/26/using-sql-for-lightweight-data-analysis/) by Rufus Pollock. Here, I extend Rufus' analysis to a larger dataset and I use different analysis tools.

## The data

The data come from the ["London GLA spending"](https://www.london.gov.uk/about-us/greater-london-authority-gla/spending-money-wisely/our-spending) website, where GLA stands for Greater London Authority. Every month the GLA publishes its spending on Housing Services, Developing, Communities & Intelligence, etc. At the time of writing, the GLA webpage lists 38 csv files with inconsistent formatting: there are empty columns and irregularly spaced data. To complicate things, the GLA website keeps changing its root address and html design, so I cannot guarantee that the code described below will still work two years from now.

The webpage looks like this:

<img src="https://raw.githubusercontent.com/vincepota/GLA/master/notebook/web_screenshot.png" width="600">

where we are interested in the content of the `CSV file` column. 

The strategy is straightforward: 
- scrape the html code of the GLA webpage;
- extract the links to the `.csv` files;
- download all the data and append the results to a `pandas` dataframe;
- clean the data;
- have some fun with the data.

## The code

Let's import some libraries. The most important one is `BeautifulSoup`, which lets us parse the html code behind web pages. If you do not have `BeautifulSoup` installed, you can get it via `pip install beautifulsoup4`; the `html5lib` parser used below is a separate package (`pip install html5lib`).

In [19]:
import pandas as pd
from bs4 import BeautifulSoup
import urllib2
import matplotlib.pylab as plt
import re
import sqlite3

Let's extract the `html` code from the GLA webpage:

In [23]:
wpage = 'https://www.london.gov.uk/about-us/greater-london-authority-gla/spending-money-wisely/our-spending'

req = urllib2.Request(wpage)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page, 'html5lib')
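
The cells in this post are Python 2, and `urllib2` no longer exists in Python 3. A rough Python 3 equivalent of the fetch step might look like the sketch below. Note that `fetch_html`, its `timeout` default and the `User-Agent` string are illustrative choices of mine, not part of the original notebook.

```python
import urllib.request


def fetch_html(url, timeout=30):
    """Download `url` and return its body as text (hypothetical helper)."""
    # The User-Agent value is an arbitrary, made-up identifier.
    request = urllib.request.Request(
        url, headers={"User-Agent": "gla-spending-notebook"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Usage (needs network access, so it is left commented out):
# html = fetch_html(wpage)
# soup = BeautifulSoup(html, "html5lib")
```

The returned string can be handed to `BeautifulSoup` exactly as `page` is in the cell above.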

The links to the `csv` files that we need are contained in `<td>` tags, which are nested inside `<table>` tags.
Some `<td>` tags contain the direct link to the `csv` file, while others contain a link to another webpage which in turn links to the `csv` file. It is rather confusing, but it can be handled easily with Python:

In [21]:
table = soup.find_all('table')     # Find all <table> tags
thelist = []                       # Will hold the direct links to the csv files

for t in table:
    if len(t.find_all('th')) > 0:  # Only select tables with csv files
        for a in t.find_all('a', href=True):   # Find all hyperlinks in the table
            thelink = 'https:' + a['href']     # Prepend the scheme to the href
            if len(thelink) < 40:        # If True, thelink points to another webpage
                                         # containing the csv file
                req = urllib2.Request(thelink)    # Scrape that webpage
                page = urllib2.urlopen(req)
                subsoup = BeautifulSoup(page, 'html5lib')   # Do not overwrite soup

                # Take the first link whose href contains '.csv' (dot escaped)
                aa = subsoup.find_all(href=re.compile(r'\.csv'))[0]
                thelink = aa['href']
            thelist.append(thelink)      # In both cases, store the direct csv link

`thelist` is a list which contains all the direct links to the `csv` files. Note that we have not downloaded the data yet.
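
The loop above builds each URL by prepending `'https:'` and tells the two kinds of link apart by their length. A less brittle variant, sketched here in Python 3 with the standard library only, is to resolve every `href` against the page address and to test the file extension instead. The helper names `resolve` and `is_csv` are mine, and the landing-page `href` in the usage lines is made up for illustration.

```python
from urllib.parse import urljoin, urlparse

BASE = ("https://www.london.gov.uk/about-us/greater-london-authority-gla/"
        "spending-money-wisely/our-spending")


def resolve(href, base=BASE):
    """Turn a relative or scheme-less href into an absolute URL."""
    return urljoin(base, href)


def is_csv(url):
    """True when the path part of the URL ends in .csv (query string ignored)."""
    return urlparse(url).path.lower().endswith(".csv")


# A scheme-less href, as the loop above assumes:
direct = resolve("//www.london.gov.uk/sites/default/files/"
                 "mayors_250_report_-_2015-16_-_p12_-_combined.csv")
# A hypothetical link to an intermediate page:
indirect = resolve("//www.london.gov.uk/some-landing-page")

print(direct, is_csv(direct))
print(indirect, is_csv(indirect))
```

With these two helpers the `len(thelink) < 40` test could be replaced by `not is_csv(thelink)`, which does not depend on how long the GLA's URLs happen to be.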

In [22]:
print 'the list contains', len(thelist), 'csv files'
thelist[0:5]

the list contains 38 csv files


[u'https://www.london.gov.uk/sites/default/files/mayors_250_report_-_2015-16_-_p12_-_combined.csv',
 u'https://www.london.gov.uk/sites/default/files/mayors_250_report_-_2015-16_-_p11_-_combined_fn.csv',
 u'https://www.london.gov.uk/sites/default/files/mayors_250_report_-_2015-16_-_p10_-_combined.csv',
 u'https://www.london.gov.uk/sites/default/files/mayors_250_report_-_2015-16_-_p09_-_combined.csv',
 u'https://www.london.gov.uk/sites/default/files/copy_of_mayors_250_report_-_2015-16_-_p08_-.csv']

We can now download the data. Instead of saving every `csv` file to disk, we can use `pandas`' ability to read `csv` files straight from the internet. Before we do that, let's see what the data look like:

In [25]:
example = urllib2.urlopen(thelist[0]).read(20000)
example = example.split("\n")
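
Two details of these files matter for the cleaning step: the pound sign is stored as the single byte `0xA3`, which suggests a Latin-1 (or cp1252) encoding rather than UTF-8, and each file opens with a few report-summary rows before the real column header (the row containing `Vendor ID`). The Python 3 sketch below decodes a shortened sample, copied from the first file, and locates that header row with the standard `csv` module. The function name `find_header_row` and the choice of `latin-1` are my assumptions, not something the GLA documents.

```python
import csv
import io

# Shortened sample of the raw bytes of the first GLA file.
RAW = (
    b",,,,,,,,,,\r\n"
    b",Reporting Period : ,12,,,,,,,,\r\n"
    b",,,Amount \xa3,,,,,,,\r\n"
    b",Vendor ID,Vendor Name,Cost Element,Expenditure Account Code Description,"
    b"Document No,Amount,Clearing Date,Directorate,Service Expenditure Analysis,\r\n"
    b",10016524,TRANSPORT FOR LONDON,544071,FUNCTIONAL BODY GRANT PAYMENT,"
    b'CHAPS649,"66,253,087.00",24 Feb  2016,RESOURCES,'
    b"Highways and transport services,\r\n"
)


def find_header_row(raw_bytes, marker="Vendor ID", encoding="latin-1"):
    """Return (index of the header row, all parsed rows)."""
    text = raw_bytes.decode(encoding)          # assumed encoding
    rows = list(csv.reader(io.StringIO(text, newline="")))
    for i, row in enumerate(rows):
        if marker in row:                      # exact match on one cell
            return i, rows
    raise ValueError("no row contains %r" % marker)


header_index, rows = find_header_row(RAW)
header = rows[header_index]
first = dict(zip(header, rows[header_index + 1]))

# Amounts use thousands separators, so strip the commas before converting.
amount = float(first["Amount"].replace(",", ""))
print(header_index, first["Vendor Name"], amount)
```

The index found this way is the number of rows to skip before the header, so it could be passed to `pandas.read_csv` through its `skiprows` argument, together with `encoding='latin-1'`.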

In [28]:
example[0:20]

[',,,,,,,,,,\r',
 ',Report of all payments made by GLA & GLA Land & Property for value equal to or greater than \xa3 250.00 Excl. VAT,,,,,,,,,\r',
 ',Reporting Period : ,12,,,,,,,,\r',
 ',Start Date:,07 February 2016,,,,,,,,\r',
 ',End Date:,05 March 2016,,,,,,,,\r',
 ',Financial Year :,2015 / 16,,,,,,,,\r',
 ',,,Amount \xa3,,,,,,,\r',
 ',Total Remuneration for the month,,"3,086,689.45",,,,,,,\r',
 ',Irrecoverable VAT,,0.00,,,,,,,\r',
 ',,,,,,,,,,\r',
 ',Vendor ID,Vendor Name,Cost Element,Expenditure Account Code Description,Document No,Amount,Clearing Date,Directorate,Service Expenditure Analysis,\r',
 ',10016524,TRANSPORT FOR LONDON,544071,FUNCTIONAL BODY GRANT PAYMENT,CHAPS649,"66,253,087.00",24 Feb  2016,RESOURCES,Highways and transport services,\r',
 ',10016524,TRANSPORT FOR LONDON,544093,NLE - GRANT PMT TO TFL,CHAPS627,"20,945,312.00",15 Feb  2016,RESOURCES,Highways and transport services,\r',
 ',NC,DCLG,544073,BUSINESS RATE RETENTION-CLG,CHAPS643,"17,926,156.00",22 Feb  2016,RES