# ASHE Table 7

-----

### Requirements

"Annual Survey of Hours and Earnings"


#### Observations & Dimensions

The `observations` are the numbers in the percentile columns.

The required dimensions are:

* **Geography** - in the `Code` column, one letter followed by 8 digits
* **Percentiles** - 10, 20, 25, 30, etc.
* **Time** - year, 4 digits
* **Gender** - Male, Female, All
* **Working Pattern** - Full time, Part time, All
* **Statistics** - The "topic" of the dataset, e.g. "monthly pay net", taken from the filename
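
The geography rule above (one letter followed by 8 digits) is easy to check before loading anything. A minimal sketch, assuming the letter is always uppercase; `is_geography_code` is a hypothetical helper, not part of the transformation itself:

```python
import re

# Assumed pattern: one uppercase letter followed by exactly 8 digits, e.g. "K02000001".
GEOGRAPHY_CODE = re.compile(r"[A-Z]\d{8}")


def is_geography_code(value):
    """Return True if value looks like a geography code (hypothetical helper)."""
    return isinstance(value, str) and GEOGRAPHY_CODE.fullmatch(value) is not None


# Mix of real-looking codes and values that should be rejected.
candidates = ["K02000001", "E92000001", "S12000029", "Description", "K0200001", None]
valid = [c for c in candidates if is_geography_code(c)]
print(valid)
```

Rows whose `Code` fails this check are most likely header, footer or note rows rather than observations.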

-----
    
Notes:

The "Statistics" dimension looks redundant here because we're only looking at one file, but in production there are 24 files per year for each ASHE table.

It's always worth opening the file in `sources/` and looking it over before writing any code.
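
With 24 files per year, the "Statistics" and "Time" values would probably be read from the filename rather than typed in by hand. The sketch below assumes every file follows the same naming pattern as the single example used in this notebook, which is unverified; `parse_ashe_filename` and the regular expression are hypothetical.

```python
import re
from pathlib import PurePosixPath

# Assumed layout: "... Table <number> <statistic> <year>.xls", inferred from one filename only.
FILENAME_PATTERN = re.compile(
    r"Table\s+(?P<table>\d+\.\d+[a-z]?)\s+(?P<statistic>.+?)\s+(?P<year>\d{4})\.xlsx?$"
)


def parse_ashe_filename(path):
    """Pull table number, statistic and year out of a filename (hypothetical helper)."""
    name = PurePosixPath(path).name
    match = FILENAME_PATTERN.search(name)
    if match is None:
        raise ValueError(f"Unrecognised ASHE filename: {name!r}")
    return {
        "table": match.group("table"),
        "statistic": match.group("statistic"),
        "year": int(match.group("year")),
    }


info = parse_ashe_filename(
    "sources/PROV - Work Geography Table 7.1a   Weekly pay - Gross 2018.xls"
)
print(info)
```

The statistic comes back exactly as written in the filename, so a lookup to a tidier label would still be needed afterwards.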

In [1]:
%cd mock-transformations/

/workspace/mock-transformations


In [2]:
import pandas as pd
import numpy as np
import collections

excel_path = 'sources/PROV - Work Geography Table 7.1a   Weekly pay - Gross 2018.xls'

In [3]:
# Ripping through this dataset in pandas is much faster: every worksheet has the same shape,
# but the split header cells mean the columns need renaming
# Dictionary of Excel contents
#   Worksheet Name      Gender          Working Pattern
excel_ws = {
    'All':              ('All',         'All'),
    'Male':             ('Males',       'All'),
    'Female':           ('Females',     'All'),
    'Full-Time':        ('All',         'Full-Time'),
    'Part-Time':        ('All',         'Part-Time'),
    'Male Full-Time':   ('Males',       'Full-Time'),
    'Male Part-Time':   ('Males',       'Part-Time'),
    'Female Full-Time': ('Females',     'Full-Time'),
    'Female Part-Time': ('Females',     'Part-Time')
}

In [4]:
# Worksheet Definitions
#   Column Name         Units
# A plain dict keeps insertion order (Python 3.7+), so no OrderedDict is needed
columns = {
    'Number of Jobs':                   'Thousands',
    'Median':                           'GBP/week',
    'Annual Percentage change Median':  'Percent',
    'Mean':                             'GBP/week',
    'Annual Percentage change Mean':    'Percent',
    '10th Percentile':                  'GBP/week',                     
    '20th Percentile':                  'GBP/week',
    '25th Percentile':                  'GBP/week',
    '30th Percentile':                  'GBP/week',
    '40th Percentile':                  'GBP/week',
    '60th Percentile':                  'GBP/week',
    '70th Percentile':                  'GBP/week',
    '75th Percentile':                  'GBP/week',
    '80th Percentile':                  'GBP/week',
    '90th Percentile':                  'GBP/week'
}



In [5]:
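work = []
for ws_name, (gender, work_hours) in excel_ws.items():
    # The column headers sit on the fifth row (header=4); the second column (Code) becomes the index
    df = pd.read_excel(io=excel_path, sheet_name=ws_name, header=4, index_col=1)
    # Drop the area description and the unnamed trailing columns
    df = df.drop(columns=['Description', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19'])
    # Label every column with (gender, working pattern, measure)
    df.columns = pd.MultiIndex.from_tuples([(gender, work_hours, x) for x in columns])
    # Keep only rows that have a geography code and treat the 'x' marker as missing
    df = df[df.index.notnull()].replace('x', np.nan)
    work.append(df)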
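
The relabelling and clean-up inside the loop can be tried without the Excel file. A minimal sketch on a toy frame: the values are made up, and the final `astype(float)` is an addition that is not in the loop above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for one worksheet; all values are made up for illustration.
raw = pd.DataFrame(
    {"Median": [100.0, None, "x"], "Mean": [200.0, None, 300.0]},
    index=pd.Index(["K02000001", None, "E12000001"], name="Code"),
)

# Label each column with (gender, working pattern, measure).
measures = ["Median", "Mean"]
raw.columns = pd.MultiIndex.from_tuples([("Males", "All", m) for m in measures])

# Drop rows without a geography code, turn the 'x' marker into NaN,
# then make the columns numeric (this last step is an assumption, not in the loop).
clean = raw[raw.index.notnull()].replace("x", np.nan).astype(float)
print(clean)
```

The row with no code disappears, and the `x` cell becomes `NaN` so that the column can hold floats.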

In [7]:
work[1]

Column levels 0 and 1 are `Males` and `All` for every column; level 2 is shown as the header.

| Code | Number of Jobs | Median | Annual Percentage change Median | Mean | Annual Percentage change Mean | 10th Percentile | 20th Percentile | 25th Percentile | 30th Percentile | 40th Percentile | 60th Percentile | 70th Percentile | 75th Percentile | 80th Percentile | 90th Percentile |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| K02000001 | 13347 | 555.0 | 2.7 | 666.8 | 3.1 | 224.2 | 342.7 | 376.3 | 410.6 | 479.1 | 639.2 | 745.3 | 808.2 | 888.8 | 1166.3 |
| K03000001 | 12950 | 556.5 | 2.4 | 670.0 | 3.1 | 225.0 | 344.3 | 378.2 | 412.5 | 481.1 | 642.2 | 748.4 | 812.3 | 893.0 | 1172.9 |
| K04000001 | 11851 | 557.9 | 2.4 | 673.4 | 3.2 | 225.6 | 344.5 | 378.7 | 413.3 | 481.5 | 643.9 | 751.6 | 817.9 | 900.0 | 1184.4 |
| E92000001 | 11300 | 562.2 | 2.6 | 678.9 | 3.3 | 226.8 | 345.0 | 380.6 | 416.1 | 485.2 | 648.8 | 759.5 | 824.1 | 907.0 | 1195.5 |
| E12000001 | 466 | 506.6 | 1.0 | 590.1 | 2.1 | 234.9 | 335.4 | 360.0 | 387.7 | 442.9 | 577.2 | 669.3 | 718.6 | 775.2 | 969.1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| S12000029 | 56 | 574.9 | -0.5 | 621.1 | 1.6 | 234.6 | 361.9 | 390.8 | 420.6 | 490.6 | 644.1 | 736.2 | 788.8 | 827.7 | NaN |
| S12000030 | 20 | 590.5 | 6.3 | 653.2 | 2.6 | 243.1 | 348.3 | 402.3 | 441.2 | 520.6 | 667.4 | 780.2 | 845.8 | NaN | NaN |
| S12000039 | 17 | 531.2 | 2.5 | 620.7 | 3.4 | NaN | 344.1 | 387.6 | 436.9 | 492.1 | 623.9 | 686.5 | NaN | NaN | NaN |
| S12000040 | 48 | 570.0 | 6.4 | 720.2 | 18.9 | 323.6 | 393.7 | 435.9 | 459.3 | 519.9 | 634.2 | 739.7 | 776.9 | 912.5 | NaN |


In [8]:
# Join the dataframes, unstack, and reset index
output = pd.concat(work, axis=1, join='inner').unstack().reset_index()
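
To see what the concat/unstack/reset_index chain does, it helps to run it on two tiny frames. A sketch with made-up values; `sheet` is a hypothetical helper that mimics one relabelled worksheet.

```python
import pandas as pd

index = pd.Index(["K02000001", "E92000001"], name="Code")


def sheet(gender, pattern, medians):
    """Build a one-measure frame with 3-level column labels (toy data)."""
    df = pd.DataFrame({"Median": medians}, index=index)
    df.columns = pd.MultiIndex.from_tuples([(gender, pattern, "Median")])
    return df


work = [sheet("All", "All", [1.0, 2.0]), sheet("Males", "All", [3.0, 4.0])]

# Side-by-side join on the geography code; 'inner' keeps only codes found in every frame.
wide = pd.concat(work, axis=1, join="inner")

# unstack() moves the column levels into the index, giving one row per observation.
long = wide.unstack().reset_index()
long.columns = ["Gender", "Working Pattern", "Measure", "Geography", "OBS"]
print(long)
```

Because the join is `inner`, a geography code missing from any one worksheet is dropped from all of them. `join='outer'` would keep it and fill the gaps with `NaN`.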

In [10]:
# Name those columns
output.columns = ['Gender', 'Working Pattern', 'Measure', 'Geography', 'OBS']

In [12]:
output['Time'] = 2018
output['Statistic'] = 'Gross weekly pay'
output['Units'] = output['Measure'].replace(columns)
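
`Series.replace` leaves any value it cannot find in the dictionary unchanged, so a misspelt measure would silently become its own unit. A hedged alternative is `Series.map`, which returns `NaN` for unknown keys and makes them easy to spot. The frame and the shortened `units` dictionary below are toy data.

```python
import pandas as pd

# Shortened copy of the measure-to-units lookup, for illustration only.
units = {
    "Number of Jobs": "Thousands",
    "Median": "GBP/week",
    "Annual Percentage change Median": "Percent",
}

output = pd.DataFrame(
    {"Measure": ["Number of Jobs", "Median", "Annual Percentage change Median", "Median"]}
)

# map() yields NaN for measures missing from the lookup instead of passing them through.
output["Units"] = output["Measure"].map(units)

# Any measure listed here has no unit defined and needs attention.
unmapped = output.loc[output["Units"].isna(), "Measure"].unique()
print(output)
print(list(unmapped))
```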

In [13]:
output

| | Gender | Working Pattern | Measure | Geography | OBS | Time | Statistic | Units |
|---|---|---|---|---|---|---|---|---|
| 0 | All | All | Number of Jobs | K02000001 | 26417 | 2018 | Gross weekly pay | Thousands |
| 1 | All | All | Number of Jobs | K03000001 | 25633 | 2018 | Gross weekly pay | Thousands |
| 2 | All | All | Number of Jobs | K04000001 | 23324 | 2018 | Gross weekly pay | Thousands |
| 3 | All | All | Number of Jobs | E92000001 | 22204 | 2018 | Gross weekly pay | Thousands |
| 4 | All | All | Number of Jobs | E12000001 | 984 | 2018 | Gross weekly pay | Thousands |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 58180 | Females | Part-Time | 90th Percentile | S12000029 | NaN | 2018 | Gross weekly pay | GBP/week |
| 58181 | Females | Part-Time | 90th Percentile | S12000030 | NaN | 2018 | Gross weekly pay | GBP/week |
| 58182 | Females | Part-Time | 90th Percentile | S12000039 | NaN | 2018 | Gross weekly pay | GBP/week |
| 58183 | Females | Part-Time | 90th Percentile | S12000040 | NaN | 2018 | Gross weekly pay | GBP/week |
