### Introduction

The following datasets are used in this demonstration:
- 2011 Census data - population and household estimates, available [here](https://www.ons.gov.uk/census/2011census/2011censusdata/2011censusdatacatalogue/populationandhouseholdestimates).
- ONS - House price to residence-based earnings ratio, available [here](https://www.ons.gov.uk/peoplepopulationandcommunity/housing/datasets/ratioofhousepricetoresidencebasedearningslowerquartileandmedian).
- Kaggle - COVID-19 related tweets, available [here](https://www.kaggle.com/gpreda/covid19-tweets).

These datasets are also available together [here](https://github.com/bz-dev/ox-interview/tree/main/data) in the GitHub repository for this demonstration.

### Table of contents

1. [Population and household estimates (**univariate**)](#section1)

    1.1 [Gender ratio](#section1_1)

    <small>Basic pie chart</small>

    1.2 [Population and gender ratio by region](#section1_2)

    <small>How to use subplots to group charts together, and how to use annotations to add more details to the plot.</small>

    1.3 [Population by outward postcode using choropleth map](#section1_3)

    <small>How to plot location-related data on a map using GeoJSON.</small>

    1.4 [Population by postcode parent area using choropleth map](#section1_4)

    <small>How to merge granular geographic areas and plot them on the map.</small>

2. [House prices and earnings (**multivariate**)](#section2)

    2.1 [Correlation between median house price and median household earning](#section2_1)

    <small>How to use a combination of scatter plot, trendline, box plot and rug plot to show distributions and correlations between variables.</small>

    2.2 [Median and lower quartile house prices from 2002 to 2020 (animated)](#section2_2)

    <small>How to use an animated chart to show how a trend evolves as one variable (here, the year) changes.</small>

3. [Text analysis and visualisation](#section3)

    3.1 [Text tokenization and word cloud](#section3_1)

    <small>How to perform a simple text tokenization and use word cloud to show word frequencies.</small>

    3.2 [Basic sentiment analysis with VADER](#section3_2)

    <small>How to perform a simple sentiment analysis and use grouped line charts.</small>

<hr>

### Preparation
Please run this code block before running any other blocks in this notebook.

In [None]:
# Install packages using Jupyter's built-in %pip magic
%pip install pandas numpy plotly geojson shapely openpyxl scipy statsmodels wordcloud nltk

# Import required packages
import pandas as pd
import numpy as np
import json
from pathlib import Path
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import re
import shapely.geometry
from shapely.ops import unary_union
import geojson
import itertools
import wordcloud
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Set default pandas plotting backend to Plotly
pd.options.plotting.backend = "plotly"

# Pass paths to resources to variables
file_population = Path("../data/ons/population_household.csv")
dir_geojson = Path("../data/geojson")
file_house_price = Path("../data/ons/house_price_earning.xlsx")
file_tweet = Path("../data/tweet/covid19_tweets.csv")

print("✅ Notebook preparation completed.")

<a id='section1'></a>
### 1. Population and household estimates

In [None]:
# Declare section parent variable
var1 = dict()

# Load data
var1["df"] = pd.read_csv(file_population)

# Split postcode into outward and inward, e.g. LE3 9QP => LE3 and 9QP
var1["df"]["postcode_out"] = var1["df"]["Postcode"].apply(lambda x: x[0:4].strip())
var1["df"]["postcode_in"] = var1["df"]["Postcode"].apply(lambda x: x[4:].strip())

# Keep only non-numeric part in outward as the parent region
var1["df"]["postcode_region"] = var1["df"]["postcode_out"].apply(lambda x: re.split(r"\d+", x)[0])

# Remove column [Postcode] to save memory
var1["df"] = var1["df"].drop(["Postcode"], axis=1)

print("✅ Section 1 preparation completed.")
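
The split above relies on fixed character positions in the `Postcode` column. As a hedged illustration only, the sketch below splits on the last three characters instead, since the inward part of a UK postcode is always three characters long. The helper name `split_postcode` and the sample postcodes are hypothetical, and whether this variant suits the census file depends on how its `Postcode` column is padded.

```python
import re


def split_postcode(postcode):
    """Split a UK postcode into (outward, inward, region).

    Hypothetical helper for illustration; assumes the inward code is the
    last three non-space characters of the postcode.
    """
    compact = postcode.replace(" ", "").upper()
    outward, inward = compact[:-3], compact[-3:]
    # Region = leading letters of the outward code, e.g. "LE3" -> "LE"
    region = re.split(r"\d+", outward)[0]
    return outward, inward, region


# Made-up sample postcodes with different outward lengths
for sample in ["LE3 9QP", "B1 1AA", "SW1A 2AA"]:
    print(sample, "->", split_postcode(sample))
```

Because it counts from the end of the string, this variant does not depend on the outward code having a fixed width.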

<a id='section1_1'></a>
#### 1.1 Gender ratio

In [None]:
# Declare section parent variable
sec1_1 = dict()

# Count total number of males and females
sec1_1["df"] = pd.DataFrame(
    dict(Gender=["Male", "Female"], Count=[var1["df"]["Males"].sum(), var1["df"]["Females"].sum()]))

# Generate pie chart
sec1_1["fig"] = px.pie(sec1_1["df"], values="Count", names="Gender", title="National gender ratio (England and Wales)")
sec1_1["fig"].show()

# Remove variable to save memory
del sec1_1

<a id='section1_2'></a>
#### 1.2 Population and gender ratio by region

In [None]:
# Declare section parent variable
sec1_2 = dict()

# Sum up all numbers within same postcode region, descending order by column [Total]
sec1_2["df"] = var1["df"].groupby(["postcode_region"]).sum(numeric_only=True).sort_values(by=['Total'], ascending=False).reset_index()

# Generate plots side by side
sec1_2["fig"] = make_subplots(rows=1, cols=2, specs=[[{}, {}]], shared_xaxes=True,
                              shared_yaxes=False, vertical_spacing=0.001)

# Add plot 1: bar chart for top 10 population regions
sec1_2["fig"].append_trace(go.Bar(
    x=sec1_2["df"].head(10)["Total"],
    y=sec1_2["df"].head(10)["postcode_region"],
    marker=dict(color='rgba(50, 171, 96, 0.6)', line=dict(
        color='rgba(50, 171, 96, 1.0)',
        width=1)),
    name='Population',
    orientation='h',
), 1, 1)

# Add plot 2: scatter chart for gender ratio in top 10 population regions
sec1_2["fig"].append_trace(
    go.Scatter(
        x=sec1_2["df"].head(10)["Males"] / sec1_2["df"].head(10)["Females"],
        y=sec1_2["df"].head(10)["postcode_region"],
        mode='lines+markers', line_color='rgb(128, 0, 128)', name='M/F ratio',
    ), 1, 2)

# Update layout changes on plot title, axes, color and legends
sec1_2["fig"].update_layout(
    title='Male/Female ratio in top 10 population regions',
    yaxis=dict(showgrid=False, showline=False, showticklabels=True, domain=[0, 0.85]),
    yaxis2=dict(showgrid=False, showline=True, showticklabels=False,
                linecolor='rgba(102, 102, 102, 0.8)', linewidth=2, domain=[0, 0.85]),
    xaxis=dict(zeroline=False, showline=False, showticklabels=True,
               showgrid=True, domain=[0, 0.42]),
    xaxis2=dict(zeroline=False, showline=False, showticklabels=True, showgrid=True,
                domain=[0.47, 1], side='top'),
    legend=dict(x=0.029, y=1.038, font_size=10),
    margin=dict(l=100, r=20, t=70, b=70),
    paper_bgcolor='rgb(248, 248, 255)',
    plot_bgcolor='rgb(248, 248, 255)',
)

# Add annotations to the plot
annotations = []

# Adding labels to charts
for a_mf, a_pop, x_pcr in zip(np.round(sec1_2["df"].head(10)["Males"] / sec1_2["df"].head(10)["Females"], decimals=2),
                              sec1_2["df"].head(10)["Total"],
                              sec1_2["df"].head(10)["postcode_region"]):
    # Add label to M/F ratio scatter plot
    annotations.append(dict(xref='x2', yref='y2', y=x_pcr, x=a_mf,
                            text=a_mf, xshift=50, showarrow=False))

    # Add label to population bar chart
    annotations.append(dict(xref='x1', yref='y1', y=x_pcr, x=a_pop,
                            text=f"{np.round(a_pop / 1000000, 3)}M",
                            xshift=25, showarrow=False))

sec1_2["fig"].update_layout(annotations=annotations)
sec1_2["fig"].show()

# Remove variables to save memory
del sec1_2

<a id='section1_3'></a>
#### 1.3 Population by outward postcode using choropleth map
Concatenate the individual postcode GeoJSON files into a single feature collection, stored in `sec1_3["geojson"]`.

In [None]:
# Declare section parent variable
sec1_3 = dict()

# Create parent geojson collection object
sec1_3["geojson"] = dict(type="FeatureCollection", features=list())

# Load all geojson files
for f_geojson in list(dir_geojson.glob("*.geojson")):
    with open(f_geojson) as f:
        geojson_data = json.load(f)
        for feature in geojson_data["features"]:
            # Add feature id using properties.name, which is the outward postcode
            feature["id"] = feature["properties"]["name"]

            # Add feature to parent geojson collection
            sec1_3["geojson"]["features"].append(feature)

# Sum up all numbers within same postcode outward area
sec1_3["df"] = var1["df"].groupby(["postcode_out"]).sum(numeric_only=True).reset_index()

# Get max population number among areas
sec1_3["p_max"] = sec1_3["df"]["Total"].max()

# Plot figure
sec1_3["fig"] = px.choropleth_mapbox(sec1_3["df"],
                                     geojson=sec1_3["geojson"],
                                     locations='postcode_out', color='Total',
                                     color_continuous_scale="jet",
                                     range_color=(0, sec1_3["p_max"]),
                                     mapbox_style="carto-positron",
                                     zoom=4.5, center={"lat": 52.5, "lon": -1.6},
                                     opacity=0.5)
sec1_3["fig"].update_layout(margin=dict(r=0, t=0, l=0, b=0))
sec1_3["fig"].show()

# Remove variables to save memory
del sec1_3

<small>The open-source geographical data used here comes from Wikipedia and does not cover all regions of England, which is why some areas appear white on the map.</small>
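
One way to check which areas are affected is to compare the outward codes in the data with the feature ids in the GeoJSON collection. The sketch below is a minimal illustration using plain dictionaries; the `features` and `data_postcodes` values are made-up stand-ins, not the real files.

```python
# Toy stand-ins for the loaded GeoJSON features and the census outward codes;
# all values here are made up for illustration.
features = [
    {"type": "Feature", "properties": {"name": "LE1"}, "geometry": None},
    {"type": "Feature", "properties": {"name": "LE3"}, "geometry": None},
]
data_postcodes = ["LE1", "LE3", "LE99"]

# Same idea as in the cell above: use properties.name as the feature id
for feature in features:
    feature["id"] = feature["properties"]["name"]

# Outward codes present in the data but absent from the GeoJSON collection
mapped = {feature["id"] for feature in features}
unmapped = sorted(set(data_postcodes) - mapped)
print(unmapped)
```

Rows whose outward code has no matching feature id cannot be drawn, so a list like `unmapped` gives a rough measure of how much of the data is missing from the map.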

<a id='section1_4'></a>
#### 1.4 Population by postcode parent area using choropleth map
In the plot above, the areas are probably too granular to reveal an obvious trend. Let's merge them into their parent regions.

In [None]:
# Declare section parent variable
sec1_4 = dict()

# Create parent geojson collection object
sec1_4["geojson"] = dict(type="FeatureCollection", features=list())

# Load all geojson files
for f_geojson in list(dir_geojson.glob("*.geojson")):
    with open(f_geojson) as f:
        geojson_data = json.load(f)
        # Merge granulated geometry areas into parent region
        merged = unary_union(
            list(map(
                (lambda x: shapely.geometry.shape(x["geometry"])),
                geojson_data["features"])
            ))
        # Create new geojson feature from merged geometry
        geojson_merged = geojson.Feature(geometry=merged,
                                         properties={"name": f_geojson.stem},
                                         id=f_geojson.stem)
        # Add feature to parent geojson collection
        sec1_4["geojson"]["features"].append(geojson_merged)

# Sum up all numbers within same postcode region
sec1_4["df"] = var1["df"].groupby(["postcode_region"]).sum(numeric_only=True).reset_index()

# Get max population number among regions
sec1_4["p_max"] = sec1_4["df"]["Total"].max()

# Plot figure
sec1_4["fig"] = px.choropleth_mapbox(sec1_4["df"],
                                     geojson=sec1_4["geojson"],
                                     locations='postcode_region',
                                     color='Total',
                                     color_continuous_scale="jet",
                                     range_color=(0, sec1_4["p_max"]),
                                     mapbox_style="carto-positron",
                                     zoom=4.5, center={"lat": 52.5, "lon": -1.6},
                                     opacity=0.5)
sec1_4["fig"].update_layout(margin={"r": 0, "t": 0, "l": 0, "b": 0})
sec1_4["fig"].show()

# Remove variables to save memory
del sec1_4

In [None]:
# Remove section 1 variable
del var1
print("⚠ You have just deleted all variables for section 1. "
      "If you would like to rerun any block in section 1, "
      "please make sure that you run the section preparation block first.")

<a id='section2'></a>
### 2. House prices and earnings

Define a reusable function to read different sheets from the source data file.

In [None]:
# Define a function to read different sheets from source
def read_df2(source, sheet, skip_top, column_name):
    """
    A reusable function to read from the source file and process into a DataFrame in required format.
    :param source: str or path object, where the source excel file located
    :param sheet: str, name of the sheet in excel file to be read
    :param skip_top: int, how many rows at the top of the sheet to be skipped
    :param column_name: str, new column name to be given to the variable
    :return: DataFrame
    """

    # Define a function to transpose DataFrame row
    def transpose_df2(r, c_name):
        """
        To transpose a single row into DataFrame with specific format.
        :param r: Series, a row from the DataFrame
        :param c_name: str, new name to be given to transposed column
        :return: DataFrame
        """
        # row -> column -> new DataFrame
        new_df = r[year_columns].T.to_frame().reset_index(drop=True)
        # Rename the column
        new_df.columns = [c_name]
        # Add column [Region] and populate the whole column with value from cell [Region] in the row
        new_df["Region"] = r["Region"].strip()
        # Add column [Year] and populate the column from 2002 to 2020
        new_df["Year"] = range(2002, 2021)
        return new_df

    # Read target sheet from source file into DataFrame -> drop column [Code] -> drop row if any cell in column [Name] is empty
    df = pd.read_excel(source, sheet_name=sheet, skiprows=skip_top).drop(columns=["Code"]).dropna(subset=["Name"])

    # Drop empty columns
    df.drop([col for col in df.columns if df[col].isnull().all()], axis=1, inplace=True)

    # Standardise column names
    year_columns = [str(x) for x in range(2002, 2021)]
    df.columns = ["Region"] + year_columns

    # Transpose each row and concatenate the pieces into the final DataFrame
    new_df = pd.concat([transpose_df2(row, column_name) for _, row in df.iterrows()], ignore_index=True)
    new_df = new_df[["Region", "Year", column_name]]

    # Set numeric column data type to integer
    return new_df.astype({column_name: 'int32'})

# Declare section parent variable
var2, temp = dict(), dict()

# Read sheet [1a] for the median house price by country and region, England and Wales, 2002 - 2020
temp["df1a"] = read_df2(file_house_price, "1a", 6, "Median house price")

# Read sheet [1b] for the median gross annual earnings by country and region, England and Wales, 2002 - 2020
temp["df1b"] = read_df2(file_house_price, "1b", 6, "Median earn")

# Read sheet [2a] for the lower quartile house price by country and region, England and Wales, 2002 - 2020
temp["df2a"] = read_df2(file_house_price, "2a", 6, "Lower house price")

# Read sheet [2b] for the lower quartile gross annual earnings by country and region, England and Wales, 2002 - 2020
temp["df2b"] = read_df2(file_house_price, "2b", 6, "Lower earn")

# Merge the four DataFrames above into a new DataFrame
var2["df"] = pd.merge(temp["df1a"], temp["df1b"], on=["Region", "Year"])
var2["df"] = pd.merge(var2["df"], temp["df2a"], on=["Region", "Year"])
var2["df"] = pd.merge(var2["df"], temp["df2b"], on=["Region", "Year"])
# Show top 5 rows of processed DataFrame
# var2["df"].head(5)
# Delete variables to save memory
del temp

print("✅ Section 2 preparation completed.")
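
The core of `read_df2` is a wide-to-long reshape: each row holds one region with one column per year, and it becomes one record per year. The sketch below shows the same idea on a plain dictionary. The helper name `row_to_records` and the numbers are invented for illustration and are not taken from the ONS workbook.

```python
def row_to_records(row, value_name):
    """Turn one wide row (one key per year) into a list of long records.

    Hypothetical helper; `row` is a plain dict standing in for a DataFrame row.
    """
    region = row["Region"].strip()
    return [
        {"Region": region, "Year": int(year), value_name: value}
        for year, value in row.items()
        if year != "Region"
    ]


# Made-up numbers, not taken from the ONS workbook
wide_row = {"Region": " Wales ", "2002": 100, "2003": 110, "2004": 125}
records = row_to_records(wide_row, "Median house price")
for record in records:
    print(record)
```

Each record carries its own `Region` and `Year`, which is what allows the four sheets to be merged on those two keys afterwards.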

<a id='section2_1'></a>
#### 2.1 Correlation between median house price and median household earning

- The scatter plot and trendlines show the correlation between median house price and median earnings.
- The box plot at the top shows the distribution of median house price.
- The rug plot on the right shows the distribution of median earnings.

In [None]:
# Declare section parent variable
sec2_1 = dict()

# Plot correlation
sec2_1["fig"] = px.scatter(var2["df"][var2["df"].Region.isin(["England", "Wales", "London"])],
                           x="Median house price", y="Median earn", color="Region",
                           marginal_x="box", marginal_y="rug", trendline="ols")
sec2_1["fig"].show()

# Delete variables to save memory
del sec2_1
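
`trendline="ols"` fits an ordinary least-squares line for each colour group, and Plotly delegates the fitting to statsmodels. The arithmetic behind a single-variable fit can be sketched by hand, as below. The function name `ols_fit` and the data points are invented for illustration and are not ONS values.

```python
def ols_fit(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel out
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points that lie exactly on the line y = 2x + 1
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = ols_fit(xs, ys)
print(f"slope={slope}, intercept={intercept}")
```

In the chart, a steeper slope for one region means earnings rise faster per pound of house price there than in the other regions.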

<a id='section2_2'></a>
#### 2.2 Median and lower quartile house prices from 2002 to 2020 (animated)

In [None]:
# Declare section parent variable
sec2_2 = dict()

# Reshape the DataFrame from wide to long format for plotting
sec2_2["df"] = var2["df"].melt(id_vars=["Region", "Year"],
                               value_vars=["Median house price", "Median earn",
                                           "Lower house price", "Lower earn"],
                               var_name="Variable", value_name="Value")

# Generate grouped bar chart with animation based on year changes
sec2_2["fig"] = px.bar(sec2_2["df"][sec2_2["df"]["Variable"].isin(["Median house price", "Lower house price"])], x="Region",
                  y="Value", text="Value", color="Variable", barmode="group", animation_frame="Year",
                  animation_group="Region", range_y=[0, 500000])
sec2_2["fig"].update_yaxes(title_text="Price")
sec2_2["fig"].show()

# Delete variables to save memory
del sec2_2

In [None]:
# Remove section 2 variable
del var2
print("⚠ You have just deleted all variables for section 2. "
      "If you would like to rerun any block in section 2, "
      "please make sure that you run the section preparation block first.")

<a id='section3'></a>
### 3. Text analysis and visualisation

In [None]:
# Declare section parent variable
var3 = dict()

# Load data and keep only selected columns
var3["df"] = pd.read_csv(file_tweet)[["user_name", "text", "date"]].sort_values(by='date', ascending=False)

# Convert date column to datetime type
var3["df"]["date"] = pd.to_datetime(var3["df"]["date"], format="%Y-%m-%d %H:%M:%S")

# Add new day column with only date from the date column, i.e. remove the time %H:%M:%S
var3["df"]["day"] = var3["df"]["date"].apply(lambda x: x.date())

# Use basic WhitespaceTokenizer, which will segment text into individual words simply using whitespace
var3["tokenizer"] = nltk.WhitespaceTokenizer()

# Create a new column tokens
var3["df"]["tokens"] = var3["df"]["text"].apply(lambda x: var3["tokenizer"].tokenize(x.lower()))

print("✅ Section 3 preparation completed.")

<a id='section3_1'></a>
#### 3.1 Text tokenization and word cloud

In [None]:
# Declare section parent variable
sec3_1 = dict()

# Flatten the tokens into one list and remove some noise. In a real project this could be combined with
# part-of-speech (POS) tagging; here we only use a short hand-picked list of articles, prepositions,
# conjunctions, auxiliary verbs and demonstratives, plus a few stray symbols.
sec3_1["noise"] = ["the", "a", "an", "from", "to", "at", "of", "on", "in", "and", "or", "is", "are", "was", "am",
                          "were", "been", "have", "has", "had", "for", "with", "this", "that", "&amp;", "be", "as", "-"]
sec3_1["tokens"] = [x for x in list(itertools.chain.from_iterable(var3["df"]["tokens"].tolist())) if
              not x.startswith("@") and not x.startswith("http") and x not in sec3_1["noise"] and len(x) >= 3]

# Calculate frequencies of each token
sec3_1["freq"] = nltk.FreqDist(sec3_1["tokens"])

# Create a word cloud
sec3_1["cloud"] = wordcloud.WordCloud(background_color='white', width=1200, height=600).generate_from_frequencies(
    sec3_1["freq"])
sec3_1["fig"] = px.imshow(sec3_1["cloud"].to_array())
sec3_1["fig"].update_layout(xaxis={"visible": False}, yaxis={"visible": False})
sec3_1["fig"].show()

# Delete variables to save memory
del sec3_1
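
The filtering and counting steps can be reproduced with the standard library alone, which makes it easier to see what ends up in the word cloud. This is a minimal sketch on made-up tweets. `str.split()` is used here as a rough stand-in for the whitespace tokenizer, and the `noise` set is deliberately shorter than the one in the cell above.

```python
from collections import Counter

# Made-up tweets for illustration only
tweets = [
    "Stay safe and wear a mask https://example.com",
    "@someone masks are working, stay home",
    "The mask is on",
]
noise = {"the", "a", "and", "is", "are", "on"}

# Whitespace tokenization: lower-case the text and split on whitespace
tokens = [token for tweet in tweets for token in tweet.lower().split()]

# Drop mentions, links, noise words and tokens shorter than 3 characters
kept = [
    token for token in tokens
    if not token.startswith("@")
    and not token.startswith("http")
    and token not in noise
    and len(token) >= 3
]

# Count how often each remaining token occurs
freq = Counter(kept)
print(freq.most_common(3))
```

Note that whitespace tokenization leaves punctuation attached, so `"working,"` is counted separately from `"working"`, and `"mask"` separately from `"masks"`. Stripping punctuation or stemming would merge such variants.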

<a id='section3_2'></a>
#### 3.2 Basic sentiment analysis with VADER
Use nltk's built-in [VADER](https://ojs.aaai.org/index.php/ICWSM/article/view/14550) model to perform a basic sentiment analysis on these tweets without the need for training.

In [None]:
# Declare section parent variable
sec3_2 = dict()

# Download lexicon resource required for using the VADER model
nltk.download('vader_lexicon')

# Create an analyzer instance
sec3_2["analyzer"] = SentimentIntensityAnalyzer()

# Create a function that returns a tuple of sentiment analysis result from the analyzer
def sentiment_analysis(text):
    result = sec3_2["analyzer"].polarity_scores(text)
    return result["neg"], result["neu"], result["pos"], result["compound"], \
           "Positive" if result["compound"] > 0.05 else "Negative" if result["compound"] < -0.05 else "Neutral"

# Create new columns with sentiment analysis result
var3["df"]["Negative"], var3["df"]["Neutral"], var3["df"]["Positive"], var3["df"]["Compound"], var3["df"]["Overall"] = zip(
    *var3["df"]["text"].map(sentiment_analysis))
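
The chained conditional in `sentiment_analysis` maps the compound score to one of three labels. Written out as a standalone function, the rule looks like the sketch below. `label_from_compound` and its `threshold` parameter are hypothetical names, and the scores are invented rather than real analyzer output.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a coarse sentiment label."""
    if compound > threshold:
        return "Positive"
    if compound < -threshold:
        return "Negative"
    return "Neutral"


# Made-up scores, not real analyzer output
for score in [0.62, 0.05, 0.0, -0.05, -0.4]:
    print(score, label_from_compound(score))
```

With strict inequalities, as in the cell above, a score of exactly 0.05 or -0.05 is labelled Neutral.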

In [None]:
# Create pie chart of overall sentiment analysis result from the compound value
sec3_2["fig1"] = px.pie(var3["df"].groupby("Overall").count().reset_index(), values="user_name", names="Overall",
                   title="Sentiment analysis results")
sec3_2["fig1"].show()



In [None]:
# Unpivot the dataframe
sec3_2["df1"] = pd.melt(var3["df"], id_vars=["user_name", "text", "tokens", "date", "Overall", "day"],
                     value_vars=["Negative", "Neutral", "Positive", "Compound"])

# Create box plot for the distribution of negative scores across days
sec3_2["fig2"] = px.box(sec3_2["df1"][(sec3_2["df1"]["Overall"] == "Negative") & (sec3_2["df1"]["variable"] == "Negative")], x="day",
             y="value", labels={
        "value": "Negative score"
    })
sec3_2["fig2"].show()

In [None]:
# Count tweets per sentiment label and day
sec3_2["df2"] = var3["df"].groupby(["Overall", "day"]).count().reset_index().sort_values(by="day")

# Create line chart of daily tweet counts, one line per sentiment label
sec3_2["fig3"] = px.line(sec3_2["df2"], x="day", y="user_name", color="Overall",
              title="Positive/Negative/Neutral tweet counts over days", labels={
        "user_name": "Count"
    })
sec3_2["fig3"].show()

# Delete variables to save memory
del sec3_2

In [None]:
# Remove section 3 variable
del var3
print("⚠ You have just deleted all variables for section 3. "
      "If you would like to rerun any block in section 3, "
      "please make sure that you run the section preparation block first.")