# 01_pulling_cleaning_data
A notebook to ingest various local election data, standardise it and export it. 

The 2022 data is only available at the council level, not the ward level. To standardise it against the other years, we may want to consider imputing ward-level scores based on previous results.

Data: 
- 2018 local election results, ward-level
- 2021 local election results, ward-level
- 2022 local election results, council-level

NOTE: Using Python 3.10

NL, 10/07/22

## IMPORTS

In [4]:
import pandas as pd
import numpy as np

from tqdm import tqdm

## PATHS

Add paths to data, URLs, etc. below as required.

In [16]:
DATA_PATH = '../data/'

In [67]:
WARD_RESULTS_2018_RAW = DATA_PATH+'raw/2018_LocalElectionResults_AndrewTealLeap.csv'
WARD_RESULTS_2021_RAW = DATA_PATH+'raw/2021_LocalElectionResullts_BritainElectis.xlsx'
COUNCIL_RESULTS_2022_RAW = DATA_PATH+'raw/2022_LocalElectionResults_CouncilLevel.xlsx'

In [59]:
CENSUS_2021_2011_POP_CHANGE = DATA_PATH+'raw/2021_2011_Census_population_change.xlsx'
CENSUS_2021_TOP_LEVEL_FINDINGS = DATA_PATH+'raw/2021_Census_TopLevelSummaryStats.xlsx'

## INIT / DATA IN

In [108]:
colnames_2018 = ['local_authority', 'local_authority_ons_code', 'ward', 'ward_ons_code', 'candidate_name', 'party', 'n_votes', 'candidate_won']

In [109]:
colnames_2021_base = ['ward_ons_code', 'local_authority', 'ward', 'year_last_election', 'winner_last_election', 'winner_2021']
party_names = ['con', 'lab', 'ldem', 'ukip', 'grn', 'snp', 'pc', 'ind', 'indgrp', 'reg', 'resass', 'oth']
suffixes = ['_2021_vote', '_previous_vote', '_2021_perc', '_previous_perc']

In [110]:
# One column per (suffix, party) pair, e.g. 'con_2021_vote'; suffixes vary slowest
party_columns = [party + suffix for suffix in suffixes for party in party_names]

colnames_2021 = colnames_2021_base+party_columns
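
As a quick sanity check on the generated names, the sketch below rebuilds the list and verifies its length and uniqueness. The expected count follows from the lists defined above (6 base columns plus 12 parties times 4 suffixes); whether the `results` sheet really has that many columns is an assumption that should be checked against the file.

```python
# Rebuild the 2021 column names exactly as defined in this notebook.
colnames_2021_base = ['ward_ons_code', 'local_authority', 'ward',
                      'year_last_election', 'winner_last_election', 'winner_2021']
party_names = ['con', 'lab', 'ldem', 'ukip', 'grn', 'snp', 'pc',
               'ind', 'indgrp', 'reg', 'resass', 'oth']
suffixes = ['_2021_vote', '_previous_vote', '_2021_perc', '_previous_perc']

party_columns = [party + suffix for suffix in suffixes for party in party_names]
colnames_2021 = colnames_2021_base + party_columns

# Every (party, suffix) pair should appear once, after the base columns.
expected_n = len(colnames_2021_base) + len(party_names) * len(suffixes)
assert len(colnames_2021) == expected_n
assert len(set(colnames_2021)) == len(colnames_2021)
```

If the sheet has a different number of columns, passing `names=colnames_2021` to `pd.read_excel` will either raise or silently misalign labels, so comparing against the raw header first is worthwhile.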

In [113]:
raw_results_2018_df = pd.read_csv(WARD_RESULTS_2018_RAW, header=0, names=colnames_2018)
raw_results_2021_df = pd.read_excel(WARD_RESULTS_2021_RAW, sheet_name='results', header=None, skiprows=2, names=colnames_2021)

Note: our 2022 results are only at the council level. For comparability, it might make sense to collapse our other results data to the council level as well.

In [71]:
colnames_2022 = [
    'ons_code', 
    'council_name', 
    'boundary_change', 
    'new_council', 
    'council_type', 
    'seats', 
    'region', 
    'con_pre', 
    'lab_pre', 
    'ldem_pre', 
    'grn_pre', 
    'oth_pre', 
    'total_pre', 
    'control_pre', 
    'con_post', 
    'lab_post', 
    'ldem_post', 
    'grn_post', 
    'oth_post', 
    'vacant_post', 
    'total_post', 
    'control_post', 
    'con_net', 
    'lab_net', 
    'ldem_net', 
    'grn_net', 
    'oth_net']

In [87]:
raw_results_2022_df = pd.read_excel(COUNCIL_RESULTS_2022_RAW, skiprows=1, names=colnames_2022)

In [90]:
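# '-' marks missing values in the source sheet; np.nan is used because the np.NaN alias was removed in NumPy 2.0
raw_results_2022_df = raw_results_2022_df.replace('-', np.nan)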
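
After replacing the `'-'` placeholders, the affected columns may still have `object` dtype (this depends on the pandas version), so an explicit numeric conversion is likely needed before any arithmetic. A minimal sketch on toy data; the values are made up for illustration and the choice of columns is an assumption:

```python
import numpy as np
import pandas as pd

# Toy stand-in for raw_results_2022_df; values are illustrative only.
toy_df = pd.DataFrame({
    'council_name': ['Council A', 'Council B'],
    'con_pre': [10, '-'],
    'lab_pre': ['-', 7],
})

# Same placeholder replacement as in the cell above.
toy_df = toy_df.replace('-', np.nan)

# Force the seat-count columns to a numeric dtype; unparseable values become NaN.
seat_cols = ['con_pre', 'lab_pre']
toy_df[seat_cols] = toy_df[seat_cols].apply(pd.to_numeric, errors='coerce')
```

The columns end up as floats rather than integers because NaN has no integer representation in the default NumPy-backed dtypes.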

For the census files, the data appears to start at row 6, so we skip the first five rows on import.

In [64]:
census_2021_11_pop_change_df = pd.read_excel(CENSUS_2021_2011_POP_CHANGE, sheet_name='Population change', skiprows=5)
census_2021_11_pop_change_df = census_2021_11_pop_change_df.rename(columns={
    'Area code [note 2]': 'area_code',
    'Area name': 'area_name',
    'All persons, 2011': 'n_people_2011',
    'All persons, 2021': 'n_people_2021',
    'Percentage change': 'perc_change',
})

## DATA WRANGLING / TRANSFORMATION

Let's create council-level files for our 2018 and 2021 data, so that they are comparable with the council-level 2022 results.
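
One way to collapse ward-level results to the council level is to sum votes per local authority and party. The sketch below uses toy data with the column names from `colnames_2018`; the rows are invented for illustration, and summing `n_votes` is only one possible aggregation (it ignores, for example, multi-member wards where voters cast several votes).

```python
import pandas as pd

# Toy ward-level rows shaped like the 2018 data; values are illustrative only.
toy_ward_df = pd.DataFrame({
    'local_authority': ['Council A', 'Council A', 'Council A', 'Council B'],
    'local_authority_ons_code': ['X1', 'X1', 'X1', 'X2'],
    'ward': ['North', 'North', 'South', 'East'],
    'party': ['con', 'lab', 'con', 'lab'],
    'n_votes': [100, 80, 50, 120],
})

# Sum votes for each party within each council.
council_df = (
    toy_ward_df
    .groupby(['local_authority_ons_code', 'local_authority', 'party'],
             as_index=False)['n_votes']
    .sum()
)

# Each party's share of its council's total vote, in percent.
council_totals = council_df.groupby('local_authority_ons_code')['n_votes'].transform('sum')
council_df['perc_votes'] = 100 * council_df['n_votes'] / council_totals
```

Grouping on the ONS code as well as the name guards against two councils sharing a display name.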

Let's also save these files for future use.
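
Saving could look like the sketch below. The `processed/` subfolder, the file name and the `save_processed` helper are all assumptions rather than names taken from this project, and a temporary directory stands in for `DATA_PATH` so that the example is self-contained.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_processed(df, data_dir, name):
    """Hypothetical helper: write df to <data_dir>/processed/<name>.csv."""
    out_dir = Path(data_dir) / 'processed'
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    out_path = out_dir / f'{name}.csv'
    df.to_csv(out_path, index=False)  # drop the index so it round-trips cleanly
    return out_path


# Toy council-level frame; values are illustrative only.
toy_council_df = pd.DataFrame({'local_authority': ['Council A'], 'n_votes': [230]})

with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_processed(toy_council_df, tmp_dir, 'council_results_2018')
    reloaded_df = pd.read_csv(saved_path)
```

In the notebook itself, `DATA_PATH` would replace the temporary directory.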