# CEDAS-NORBIS PhD Summer School

Here we use the London borough profiles data to see how interactive visualisations can help us examine and compare multiple "spaces" generated by embedding algorithms popular in the machine learning literature and practice.

In [44]:
import pandas as pd
import altair as alt

df = pd.read_csv('london-borough-profiles.csv')

In [45]:
df.head()

Unnamed: 0,Code,Area/INDICATOR,Inner/ Outer London,GLA Population Estimate 2013,GLA Household Estimate 2013,Inland Area (Hectares),Population density (per hectare) 2013,"Average Age, 2013","Proportion of population aged 0-15, 2013","Proportion of population of working-age, 2013",...,Teenage conception rate (2012),Life satisfaction score 2012-13 (out of 10),Worthwhileness score 2012-13 (out of 10),Happiness score 2012-13 (out of 10),Anxiety score 2012-13 (out of 10),Political control in council,Proportion of seats won by Conservatives in 2014 election,Proportion of seats won by Labour in 2014 election,Proportion of seats won by Lib Dems in 2014 election,Turnout at 2014 local elections
0,E09000001,City of London,Inner London,8000,4514.371383,290.4,27.525868,41.303887,7.948036,77.541617,...,.,8.1,8.23,7.44,x,,,,,
1,E09000002,Barking and Dagenham,Outer London,195600,73261.40858,3610.8,54.160527,33.228935,26.072939,63.835021,...,35.4,7.06,7.57,6.97,3.3,Lab,0.0,100.0,0.0,38.16
2,E09000003,Barnet,Outer London,370000,141385.7949,8674.8,42.651374,36.896246,20.886408,65.505593,...,14.7,7.35,7.79,7.27,2.63,Cons,50.793651,42.857143,1.587302,41.1
3,E09000004,Bexley,Outer London,236500,94701.2264,6058.1,39.044243,38.883039,20.28283,63.14645,...,25.8,7.47,7.75,7.21,3.22,Cons,71.428571,23.809524,0.0,not avail
4,E09000005,Brent,Outer London,320200,114318.5539,4323.3,74.06367,35.262694,20.462585,68.714872,...,19.6,7.23,7.32,7.09,3.33,Lab,9.52381,88.888889,1.587302,33


There are lots of different features. Some cells also hold odd placeholders for missing values, such as "x", "." and "not avail". We can try to turn these into proper NaN values.

In [46]:
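def isnumber(x):
    """Return True for values worth keeping: numbers, or text that is not a placeholder."""
    try:
        float(x)
        return True
    except (TypeError, ValueError):
        # single-character strings ("x", ".") and "not avail" mark missing data
        return isinstance(x, str) and len(x) > 1 and "not avail" not in x

# apply isnumber to every element; entries that fail the test become NaN
# (in pandas >= 2.1, DataFrame.applymap is deprecated in favour of DataFrame.map)
df = df[df.applymap(isnumber)]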
df.head()

Unnamed: 0,Code,Area/INDICATOR,Inner/ Outer London,GLA Population Estimate 2013,GLA Household Estimate 2013,Inland Area (Hectares),Population density (per hectare) 2013,"Average Age, 2013","Proportion of population aged 0-15, 2013","Proportion of population of working-age, 2013",...,Teenage conception rate (2012),Life satisfaction score 2012-13 (out of 10),Worthwhileness score 2012-13 (out of 10),Happiness score 2012-13 (out of 10),Anxiety score 2012-13 (out of 10),Political control in council,Proportion of seats won by Conservatives in 2014 election,Proportion of seats won by Labour in 2014 election,Proportion of seats won by Lib Dems in 2014 election,Turnout at 2014 local elections
0,E09000001,City of London,Inner London,8000,4514.371383,290.4,27.525868,41.303887,7.948036,77.541617,...,,8.1,8.23,7.44,,,,,,
1,E09000002,Barking and Dagenham,Outer London,195600,73261.40858,3610.8,54.160527,33.228935,26.072939,63.835021,...,35.4,7.06,7.57,6.97,3.3,Lab,0.0,100.0,0.0,38.16
2,E09000003,Barnet,Outer London,370000,141385.7949,8674.8,42.651374,36.896246,20.886408,65.505593,...,14.7,7.35,7.79,7.27,2.63,Cons,50.793651,42.857143,1.587302,41.1
3,E09000004,Bexley,Outer London,236500,94701.2264,6058.1,39.044243,38.883039,20.28283,63.14645,...,25.8,7.47,7.75,7.21,3.22,Cons,71.428571,23.809524,0.0,
4,E09000005,Brent,Outer London,320200,114318.5539,4323.3,74.06367,35.262694,20.462585,68.714872,...,19.6,7.23,7.32,7.09,3.33,Lab,9.52381,88.888889,1.587302,33.0


That looks much cleaner.

Replace the NaN values in numeric columns with the mean.

In [48]:
# get only numeric columns
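# select_dtypes is the public way to do this (_get_numeric_data is a private pandas method)
numericColumns = df.select_dtypes(include="number")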
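
One caveat worth checking: a column that originally contained a placeholder such as "x" is typically read as text, and masking the placeholder does not by itself change the column's dtype, so such a column may be silently left out of the numeric selection. The sketch below uses a small made-up frame (`toy`, `candidate_cols` and the values are hypothetical, not taken from the borough file) to show one way to coerce such columns with `pd.to_numeric`.

```python
import pandas as pd

# Toy frame standing in for the borough data (values are made up for illustration).
toy = pd.DataFrame({
    "Area": ["A", "B", "C"],
    "Anxiety": ["3.3", "x", "2.63"],
    "Turnout": ["38.16", "not avail", "33"],
})

# Coerce the candidate columns: anything that cannot be parsed becomes NaN.
candidate_cols = ["Anxiety", "Turnout"]
toy[candidate_cols] = toy[candidate_cols].apply(pd.to_numeric, errors="coerce")

# The coerced columns now have a numeric dtype and are picked up by the selection.
numeric_toy = toy.select_dtypes(include="number")
```

On the real data you would need to decide which columns to coerce; running `df.dtypes` is a quick way to spot columns that are still stored as text.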

In [None]:
from sklearn.metrics import euclidean_distances

# keep place names and store them in a variable
placeNames = df["Area/INDICATOR"]

# let's fill the missing values with mean()
numericColumns = numericColumns.fillna(numericColumns.mean())

# let's centre the data by subtracting each column's mean
numericColumns -= numericColumns.mean()

# now we compute the euclidean distances between the rows (the boroughs) by passing the same data twice
# the resulting matrix holds the pairwise distances between the boroughs.
# CAUTION: note that we are now building a distance matrix in a high-dimensional data space
# remember the Curse of Dimensionality -- we need to be cautious with the distance values
distMatrix = euclidean_distances(numericColumns, numericColumns)
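
Note that centring shifts each column but does not change its scale, so columns with large values (such as population counts) can dominate the Euclidean distance. As a hedged illustration, not part of the original workflow, the sketch below compares distances on centred data with distances on standardised data. The two-column array `X` is a toy stand-in, and using `StandardScaler` is an assumption about what you might want here.

```python
import numpy as np
from sklearn.metrics import euclidean_distances
from sklearn.preprocessing import StandardScaler

# Toy data: column 0 is on a population-like scale, column 1 on a score-like scale.
X = np.array([[195600.0, 7.06],
              [370000.0, 7.35],
              [236500.0, 7.47]])

# Centring only: the distance is almost entirely the difference in column 0.
raw_dist = euclidean_distances(X - X.mean(axis=0))

# Standardising (zero mean, unit variance per column) lets both columns contribute.
scaled_dist = euclidean_distances(StandardScaler().fit_transform(X))
```

Whether to standardise depends on the question you are asking: if the raw magnitudes matter, centring alone may be what you want.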

Check to make sure everything looks ok.

In [None]:
numericColumns.head()

If we try and build a visualisation of this data, we will struggle (the code to produce a scatterplot matrix will take a while, but you can give it a try yourself). Dimension reduction will help us here!

We could apply various different types of dimension reduction here. We are specifically going to capture the dissimilarity in the data using [multidimensional scaling](https://scikit-learn.org/stable/modules/manifold.html#multidimensional-scaling). Our distance matrix will come in useful here.

In [None]:
from sklearn import manifold
# for instance, typing distMatrix.shape on the console gives:
# Out[115]: (38, 38) # i.e., one row and one column for each row of the data

# first we generate an MDS object and extract the projections
mds = manifold.MDS(n_components = 2, max_iter=3000, n_init=1, dissimilarity="precomputed")
Y = mds.fit_transform(distMatrix)
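
A two-dimensional layout can only approximate high-dimensional distances, so it is worth checking how faithful the embedding is. The following is a minimal sketch on random toy data (`points`, `mds_toy` and `corr` are hypothetical names, and `random_state=0` is added only to make the run repeatable); it reads the fitted model's `stress_` attribute and correlates the original distances with the distances in the embedding.

```python
import numpy as np
from sklearn import manifold
from sklearn.metrics import euclidean_distances

# Toy stand-in for the borough data: 10 points in a 5-dimensional space.
rng = np.random.default_rng(0)
points = rng.normal(size=(10, 5))
dist = euclidean_distances(points)

mds_toy = manifold.MDS(n_components=2, max_iter=3000, n_init=1,
                       dissimilarity="precomputed", random_state=0)
embedding = mds_toy.fit_transform(dist)

# Compare each pair's original distance with its distance in the 2-D layout.
embedded_dist = euclidean_distances(embedding)
upper = np.triu_indices_from(dist, k=1)  # each pair once, diagonal excluded
corr = np.corrcoef(dist[upper], embedded_dist[upper])[0, 1]

print("stress:", mds_toy.stress_, "distance correlation:", corr)
```

A lower stress and a correlation closer to 1 suggest the layout preserves the original distances better; neither number has a universal threshold, so treat them as relative indicators.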

In [None]:
## ALTERNATIVE
## NOTE: You can also try different projection methods here, for example t-SNE, which has become very popular for analysing proximities and structures in multivariate spaces

#from sklearn.manifold import TSNE

#TSNE_Model = TSNE(n_components=2)
#tSNE_embedded = TSNE_Model.fit_transform(numericColumns)
#tSNE_embedded

In [None]:
# The coordinates of the points in this new "space" are stored in the Y array
Y

In [None]:
dfWithMDS1 = pd.DataFrame.from_records(Y)
dfWithMDS1 = dfWithMDS1.rename(columns={0: "MDS_1_X", 1: "MDS_1_Y"})

In [None]:
alt.Chart(dfWithMDS1).mark_point().encode(
    x='MDS_1_X:Q',
    y='MDS_1_Y:Q'
)

<img src="tools-hammer.svg" width="60">  **GIVE IT A TRY!** 

Can you make the above more useful by adding the names of the boroughs as labels? An example is here: https://altair-viz.github.io/gallery/scatter_with_labels.html

Note that the borough names are in the full dataframe called **df** so you will need to start by merging these "new" dimensions with the full dataset and take it from there.

In [None]:
#.....


One key idea that we want to follow up is to explore multiple projections and relate and compare them visually.

Our data also include happiness metrics. We can pull these out and carry out a further multidimensional scaling operation to help us see how the boroughs differ in happiness.

In [None]:
# get the data columns relating to emotions and feelings
dataOnEmotions = numericColumns[["Life satisfaction score 2012-13 (out of 10)", "Worthwhileness score 2012-13 (out of 10)","Happiness score 2012-13 (out of 10)"]]

# a new distance matrix to represent "emotional distance"s
distMatrix2 = euclidean_distances(dataOnEmotions, dataOnEmotions)

# compute a new "embedding" (machine learners' word for projection)
Y2 = mds.fit_transform(distMatrix2)



The array Y2 now holds the location of each borough in the two-dimensional multidimensional scaling space derived from the happiness metrics.

We may also want to check whether the general happiness rating is reflected in where the boroughs sit in this space. To do this, we need to bin the happiness score and assign colours to the points based on the bin.
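
A minimal sketch of the binning step, assuming three equal-width bins are acceptable. The frame `scores`, its values and the bin labels are made up for illustration; in the notebook you would start from the happiness column of **df** (the values in numericColumns have already been centred).

```python
import pandas as pd

# Made-up happiness scores (out of 10) for five hypothetical areas.
scores = pd.DataFrame({
    "Borough": ["A", "B", "C", "D", "E"],
    "Happiness": [6.9, 7.0, 7.3, 7.4, 7.8],
})

# Three equal-width bins over the observed range, with readable labels.
# astype(str) turns the categorical result into plain strings for plotting.
scores["Happiness bin"] = pd.cut(
    scores["Happiness"], bins=3, labels=["low", "medium", "high"]
).astype(str)

print(scores)
```

The new column can then be mapped to the colour channel of the scatterplot once it has been merged with the embedding coordinates. Quantile-based bins (`pd.qcut`) are an alternative if you prefer groups of similar size.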

In [None]:
# get the data columns relating to age structure and diversity
dataOnDiversity = numericColumns[["Proportion of population aged 0-15, 2013", "Proportion of population of working-age, 2013", "Proportion of population aged 65 and over, 2013", "% of population from BAME groups (2013)", "% people aged 3+ whose main language is not English (2011 census)"]]

# a new distance matrix to represent demographic "distance"s
distMatrix3 = euclidean_distances(dataOnDiversity, dataOnDiversity)

mds = manifold.MDS(n_components = 2, max_iter=3000, n_init=1, dissimilarity="precomputed")
Y3 = mds.fit_transform(distMatrix3)



Generate a visualisation that puts all of these together in a shared plot.

### A small TODO for you:

Q: Can you think of other maps that you can produce with this data? Have a look at the variables once again and try to produce new "perspectives" to the data and see what they have to say.