# Population Estimates

-----

### Requirements

The source workbook contains a single table called 'data'.

#### Observations & Dimensions

The `observations` are the numeric population figures in the body of the table.

The required dimensions are:

* **Geography** - take this from the "title" row at the top and strip everything that is not part of the place name.
* **Time** - use the dates in the left-hand column.
* **CDID** - use the codes running across the top.

-----
Notes:

* This should be fairly straightforward in databaker, but consider redoing it in pure pandas once you're done; we'll need to do that from time to time.

To do it with pandas, replace the load cell with:
```
import pandas as pd
tidy_sheet = pd.read_excel('./sources/pop.xlsx', sheet_name='Sheet1')
```
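
As a rough sketch of where the pure pandas route could go, the snippet below builds a small in-memory frame laid out like the sheet (titles in the first row, CDIDs in the second, years down the first column) and reshapes it with `melt`. It assumes the sheet would be read with `header=None` so that columns are numbered from 0. The toy `raw` frame, the `first_data_row` argument and the helper name `tidy_population` are illustrative assumptions, not part of the exercise.

```python
import pandas as pd

def tidy_population(raw, first_data_row):
    """Reshape a header-less sheet into OBS / CDID / Geography / Year rows."""
    titles = raw.iloc[0, 1:].to_dict()   # column position -> title text
    cdids = raw.iloc[1, 1:].to_dict()    # column position -> CDID code
    body = raw.iloc[first_data_row:].rename(columns={0: "Year"})

    # One row per (year, column) pair; blank cells are dropped
    long = body.melt(id_vars="Year", var_name="col", value_name="OBS")
    long = long.dropna(subset=["OBS"])

    long["CDID"] = long["col"].map(cdids)
    long["Geography"] = long["col"].map(titles).str.replace(
        " population mid-year estimate", "", regex=False)
    long["Year"] = long["Year"].astype(int)
    long["OBS"] = long["OBS"].astype(int)
    return long[["OBS", "CDID", "Geography", "Year"]].reset_index(drop=True)

# Toy layout (assumed): one metadata row, one data row, one blank row
raw = pd.DataFrame([
    ["Title", "Scotland population mid-year estimate",
     "England population mid-year estimate"],
    ["CDID", "SCPOP", "ENPOP"],
    ["Unit", "Persons", "Persons"],
    [1971, 5235600, 46411700],
    [1972, None, None],
])
tidy = tidy_population(raw, first_data_row=3)
```

With a real workbook, `first_data_row` would have to match wherever the year rows actually start.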


In [1]:
%cd mock-transformations/

/workspace/mock-transformations


In [2]:
from databaker.framework import *

# Load every tab from the source workbook
tidy_sheet = loadxlstabs("./sources/pop.xls")

Loading ./sources/pop.xls which has size 12288 bytes
Table names: ['data']


In [3]:
# Geography range is B1:H1
geography = tidy_sheet[0].excel_ref("B1").expand(RIGHT).is_not_blank()

# The titles are trimmed down to place names later, once the data is in a dataframe

In [4]:
# Time range is A8:A?
time = tidy_sheet[0].excel_ref("A8").expand(DOWN).is_not_blank()

In [5]:
# CDID range is B2:H2
cdid = tidy_sheet[0].excel_ref("B2").expand(RIGHT).is_not_blank()

In [6]:
# Observation range is B8:H?
observations = tidy_sheet[0].excel_ref("B8").expand(DOWN).expand(RIGHT).is_not_blank()

In [7]:
# Dimension assignment
dimensions = [HDim(geography, "Geography", DIRECTLY, ABOVE),
              HDim(time, "Year", DIRECTLY, LEFT),
              HDim(cdid, "CDID", DIRECTLY, ABOVE)]
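
To make the dimension assignment more concrete, here is a plain-Python sketch of what a `DIRECTLY, ABOVE` or `DIRECTLY, LEFT` lookup amounts to: each observation takes the header cell that shares its column (or its row). This only illustrates the idea and is not databaker's implementation. The `cells` dictionary, the column positions in it and the `lookup_directly` helper are hypothetical.

```python
# Toy sheet keyed by (column letter, row number); column positions are assumed
cells = {
    ("B", 1): "Scotland population mid-year estimate",
    ("C", 1): "England population mid-year estimate",
    ("B", 2): "SCPOP",
    ("C", 2): "ENPOP",
    ("A", 8): 1971,
    ("B", 8): 5235600,
    ("C", 8): 46411700,
}

def lookup_directly(obs, headers, direction):
    """Return the header value aligned with obs: same column (ABOVE) or same row (LEFT)."""
    col, row = obs
    for h_col, h_row in headers:
        if direction == "ABOVE" and h_col == col and h_row < row:
            return cells[(h_col, h_row)]
        if direction == "LEFT" and h_row == row and h_col < col:
            return cells[(h_col, h_row)]
    raise LookupError(f"no header {direction} of {obs}")

geography = [("B", 1), ("C", 1)]
cdid = [("B", 2), ("C", 2)]
time = [("A", 8)]
observations = [("B", 8), ("C", 8)]

# One record per observation, with each dimension resolved by position
records = [
    {"OBS": cells[obs],
     "CDID": lookup_directly(obs, cdid, "ABOVE"),
     "Geography": lookup_directly(obs, geography, "ABOVE"),
     "Year": lookup_directly(obs, time, "LEFT")}
    for obs in observations
]
```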

In [8]:
# Conversion segment
cs = ConversionSegment(tidy_sheet[0], dimensions, observations)

In [9]:
# Convert to dataframe
df = cs.topandas()
df.head()




| | OBS | CDID | Geography | Year |
|---|---|---|---|---|
| 0 | 5235600.0 | SCPOP | Scotland population mid-year estimate | 1971 |
| 1 | 54387600.0 | GBPOP | Great Britain population mid-year estimate | 1971 |
| 2 | 46411700.0 | ENPOP | England population mid-year estimate | 1971 |
| 3 | 55928000.0 | UKPOP | United Kingdom population mid-year estimate | 1971 |
| 4 | 49152000.0 | EWPOP | England and Wales population mid-year estimate | 1971 |


In [10]:
# Trim the geography name
df["Geography"] = df["Geography"].str.replace(" population mid-year estimate", "")
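
The replacement above removes the phrase wherever it occurs in the title. If only a trailing occurrence should be stripped, one option is an anchored regular expression, sketched here in plain Python. The helper name `place_name` and the sample `titles` list are assumptions for illustration.

```python
import re

# Match the phrase only at the end of the title, with its leading whitespace
SUFFIX = re.compile(r"\s+population mid-year estimate$")

def place_name(title):
    """Strip the trailing descriptive phrase, leaving just the place name."""
    return SUFFIX.sub("", title).strip()

titles = [
    "Scotland population mid-year estimate",
    "Great Britain population mid-year estimate",
    "England and Wales population mid-year estimate",
]
names = [place_name(t) for t in titles]
```

On the dataframe, the same helper could be applied with `df["Geography"].map(place_name)`.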

In [11]:
# Observations are integer
df["OBS"] = df["OBS"].astype(int)
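
Note that `astype(int)` truncates any fractional part without warning and raises an error if a value is missing. A minimal guard, assuming the observations arrive as floats as in the output above, could check both conditions before casting. The helper name `to_whole_numbers` is hypothetical.

```python
import pandas as pd

def to_whole_numbers(series):
    """Cast to int only if every value is present and has no fractional part."""
    if series.isna().any():
        raise ValueError("blank observations present")
    if not (series % 1 == 0).all():
        raise ValueError("non-integer observations present")
    return series.astype(int)

# Sample values taken from the first rows of the output above
obs = pd.Series([5235600.0, 54387600.0, 46411700.0])
whole = to_whole_numbers(obs)
```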

In [12]:
# Done
df

| | OBS | CDID | Geography | Year |
|---|---|---|---|---|
| 0 | 5235600 | SCPOP | Scotland | 1971 |
| 1 | 54387600 | GBPOP | Great Britain | 1971 |
| 2 | 46411700 | ENPOP | England | 1971 |
| 3 | 55928000 | UKPOP | United Kingdom | 1971 |
| 4 | 49152000 | EWPOP | England and Wales | 1971 |
| ... | ... | ... | ... | ... |
| 331 | 55977200 | ENPOP | England | 2018 |
| 332 | 66435600 | UKPOP | United Kingdom | 2018 |
| 333 | 59115800 | EWPOP | England and Wales | 2018 |
| 334 | 1881600 | NIPOP | Northern Ireland | 2018 |
