# Data viz project

### Import useful libraries

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np

### Read the data and clean it

Read the data and convert the date format

In [2]:
df = pd.read_csv("UK-HPI-full-file-2023-02.csv")  # load the data
df_reduced = df[['Date', 'RegionName', 'AreaCode', 'AveragePrice']].copy()  # keep only the columns needed for the analysis
df_reduced['Date'] = pd.to_datetime(df['Date'], dayfirst=True)  # the date is imported as object by default, so convert it to datetime (the day comes first in the source strings)
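
As a quick check of what `dayfirst=True` changes, here is a minimal sketch on a made-up two-row frame. The sample strings are an assumption about the day/month/year layout and are not taken from the real file.

```python
import pandas as pd

# Hypothetical sample in day/month/year layout (not copied from the HPI file)
sample = pd.DataFrame({"Date": ["01/02/2023", "01/01/2023"]})

# dayfirst=True reads "01/02/2023" as 1 February 2023, not 2 January 2023
sample["Date"] = pd.to_datetime(sample["Date"], dayfirst=True)

print(sample["Date"].dtype)              # a datetime64 dtype instead of object
print(sample["Date"].dt.month.tolist())  # month numbers after parsing
```

Without `dayfirst=True`, an ambiguous string such as `01/02/2023` would likely be read month-first, which would silently shift the dates.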

Extract the data matching a specific date

In [3]:
time_analysis = np.datetime64('2023-01-01')  # in this first phase, work only with the data for a given date
df_grouped = df_reduced[df_reduced.Date == time_analysis].copy()  # keep only the rows for that date

Discard any macro-region (e.g. England) from the dataset

In [20]:
regions_to_discard = ['England', 'Scotland', 'Wales', 'Northern Ireland']  # we work with local authorities, so list the high-level geographical areas to discard
df_final = df_grouped[~df_grouped.RegionName.isin(regions_to_discard)].copy()  # isin() returns True where the region name is in the list; the ~ operator inverts the mask, so only the other rows are kept
df_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 401 entries, 228 to 135313
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   Date          401 non-null    datetime64[ns]
 1   RegionName    401 non-null    object        
 2   AreaCode      401 non-null    object        
 3   AveragePrice  401 non-null    float64       
dtypes: datetime64[ns](1), float64(1), object(2)
memory usage: 15.7+ KB
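
The `~isin(...)` filter can be tried on a toy frame first. This is only a sketch: the region names and prices below are placeholders, not values from the dataset.

```python
import pandas as pd

# Placeholder rows mixing macro-regions and local authorities
toy = pd.DataFrame({
    "RegionName": ["England", "Camden", "Wales", "Cardiff"],
    "AveragePrice": [1.0, 2.0, 3.0, 4.0],  # dummy numbers
})
to_discard = ["England", "Wales"]

mask = toy["RegionName"].isin(to_discard)  # True for the rows to drop
kept = toy[~mask].copy()                   # ~ inverts the mask, keeping the rest

print(kept["RegionName"].tolist())
```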


Export the data to a JSON file

In [5]:
df_final['Date'] = df_final['Date'].dt.strftime('%Y-%m-%d')
df_final.to_json(path_or_buf='HPI.json', orient='records')
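
To see what `orient='records'` produces, the sketch below serialises a one-row toy frame to a string instead of a file and parses it back. The row is made up. The date is converted to text first, as in the cell above; as far as I know, `to_json` would otherwise write datetime columns as epoch timestamps by default.

```python
import json
import pandas as pd

# Made-up single row, only to inspect the JSON layout
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2023-01-01"]),
    "RegionName": ["Camden"],
})

# Store the date as a plain 'YYYY-MM-DD' string before exporting
toy["Date"] = toy["Date"].dt.strftime("%Y-%m-%d")

# With no path given, to_json returns the JSON text; parse it to inspect it
records = json.loads(toy.to_json(orient="records"))
print(records)
```

Each row becomes one JSON object in a list, which is a convenient shape for joining with map features later on.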

### Read the JSON files with the coordinates of each area

In [146]:
import json
json_england = json.load(open('England.json'))  # read the data for England
json_scotland = json.load(open('Scotland.json'))  # read the data for Scotland
json_wales = json.load(open('Wales.json'))  # read the data for Wales
json_northernireland = json.load(open('NortherIrland.json'))  # read the data for Northern Ireland
df_england = pd.json_normalize(json_england['features'])  # flatten the features into a dataframe
df_scotland = pd.json_normalize(json_scotland['features'])  # flatten the features into a dataframe
df_wales = pd.json_normalize(json_wales['features'])  # flatten the features into a dataframe
df_northernireland = pd.json_normalize(json_northernireland['features'])  # flatten the features into a dataframe
df_greatbritain = pd.concat([df_england, df_scotland, df_wales, df_northernireland])  # combine all the dataframes together
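
For reference, this is roughly what `pd.json_normalize` does with a list of GeoJSON-style features. The feature below is hand-written for illustration; its code, name and coordinates are hypothetical and only mimic the structure the boundary files are assumed to have.

```python
import pandas as pd

# Hand-written feature collection (hypothetical values)
geojson = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"LAD13CD": "X0001", "LAD13NM": "Exampleton"},
            "geometry": {
                "type": "Polygon",
                "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 0]]],
            },
        }
    ],
}

# Nested dictionaries become dotted column names; lists are kept as single values
flat = pd.json_normalize(geojson["features"])
print(sorted(flat.columns))
```

This is why the later cells refer to columns such as `properties.LAD13NM` and `geometry.coordinates`.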

The Northern Ireland file uses different columns for the region name and the region code. Fill the gaps in the shared columns with those values so that everything sits under the same names.

In [147]:
df_greatbritain['properties.LAD13NM']=df_greatbritain['properties.LAD13NM'].fillna(df_greatbritain['properties.LGDNAME']);
df_greatbritain['properties.LAD13CD']=df_greatbritain['properties.LAD13CD'].fillna(df_greatbritain['properties.LGDCode']);
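
The same `fillna` pattern on a two-row toy frame, as a sketch: where the first column is missing, the value is taken from the second. The place names are invented.

```python
import numpy as np
import pandas as pd

# Invented rows: the first has the LAD13NM name, the second only LGDNAME
toy = pd.DataFrame({
    "properties.LAD13NM": ["Exampleton", np.nan],
    "properties.LGDNAME": [np.nan, "Sampleford"],
})

# Fill the missing names from the alternative column, row by row
toy["properties.LAD13NM"] = toy["properties.LAD13NM"].fillna(toy["properties.LGDNAME"])

print(toy["properties.LAD13NM"].tolist())
```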

In [148]:
df_greatbritain=df_greatbritain[['properties.LAD13CD','properties.LAD13NM','geometry.type','geometry.coordinates']]; # retain only some columns
df_greatbritain=df_greatbritain.rename(columns={ "properties.LAD13CD": "AreaCode", "properties.LAD13NM": "RegionName",
                                     "geometry.type":"geometry_type","geometry.coordinates":"geometry_coordinates"}); # rename the columns to easier names

In [150]:
df_greatbritain['Averageprice']=np.nan
df_greatbritain.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 695 entries, 0 to 10
Data columns (total 5 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   AreaCode              695 non-null    object 
 1   RegionName            695 non-null    object 
 2   geometry_type         695 non-null    object 
 3   geometry_coordinates  695 non-null    object 
 4   Averageprice          0 non-null      float64
dtypes: float64(1), object(4)
memory usage: 32.6+ KB


In [151]:
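# copy the average price of each area code from df_final into the matching rows of df_greatbritain
for code, price in zip(df_final.AreaCode, df_final.AveragePrice):
    df_greatbritain.loc[df_greatbritain.AreaCode == code, 'Averageprice'] = price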
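
A possible alternative to the loop, shown here on toy data, is to build a lookup Series and use `Series.map`. This is a sketch and not the method used in this notebook; it assumes each `AreaCode` appears only once in the price table, and the codes and prices are placeholders.

```python
import numpy as np
import pandas as pd

# Placeholder tables: prices for two areas, shapes for three
prices = pd.DataFrame({"AreaCode": ["A1", "A2"], "AveragePrice": [100.0, 200.0]})
shapes = pd.DataFrame({"AreaCode": ["A1", "A2", "A3"]})

# Series indexed by area code, used as a lookup table
lookup = prices.set_index("AreaCode")["AveragePrice"]

# Codes without a price (here "A3") are left as NaN
shapes["Averageprice"] = shapes["AreaCode"].map(lookup)

print(shapes)
```

Areas that have a boundary but no price stay empty, which mirrors the rows left as null in the real table.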

In [152]:
df_greatbritain.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 695 entries, 0 to 10
Data columns (total 5 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   AreaCode              695 non-null    object 
 1   RegionName            695 non-null    object 
 2   geometry_type         695 non-null    object 
 3   geometry_coordinates  695 non-null    object 
 4   Averageprice          607 non-null    float64
dtypes: float64(1), object(4)
memory usage: 32.6+ KB
