# Assignment 05 - Population
By: Daniel Finnerty

This notebook analyses how the population of Ireland differs between the sexes by age. The data comes from the Central Statistics Office (CSO) dataset FY006A, available at:

https://ws.cso.ie/public/api.restful/PxStat.Data.Cube_API.ReadDataset/FY006A/CSV/1.0/en



## Process

Import the package the notebook requires.

In [47]:
# Import Pandas for data frames
import pandas as pd

Read the CSV data file from the CSO into a data frame.

In [48]:
# Data location
url = "https://ws.cso.ie/public/api.restful/PxStat.Data.Cube_API.ReadDataset/FY006A/CSV/1.0/en"

# Read CSV into a data frame
df = pd.read_csv(url)

# Show first 3 rows
df.head(3)

Unnamed: 0,STATISTIC,Statistic Label,TLIST(A1),CensusYear,C02199V02655,Sex,C02076V03371,Single Year of Age,C03789V04537,Administrative Counties,UNIT,VALUE
0,FY006AC01,Population,2022,2022,-,Both sexes,-,All ages,IE0,Ireland,Number,5149139
1,FY006AC01,Population,2022,2022,-,Both sexes,-,All ages,2ae19629-1492-13a3-e055-000000000001,Carlow County Council,Number,61968
2,FY006AC01,Population,2022,2022,-,Both sexes,-,All ages,2ae19629-1433-13a3-e055-000000000001,Dublin City Council,Number,592713


In [49]:
# Show last 3 rows
df.tail(3)

Unnamed: 0,STATISTIC,Statistic Label,TLIST(A1),CensusYear,C02199V02655,Sex,C02076V03371,Single Year of Age,C03789V04537,Administrative Counties,UNIT,VALUE
9789,FY006AC01,Population,2022,2022,2,Female,650,100 years and over,2ae19629-149d-13a3-e055-000000000001,Cavan County Council,Number,12
9790,FY006AC01,Population,2022,2022,2,Female,650,100 years and over,2ae19629-14a4-13a3-e055-000000000001,Donegal County Council,Number,31
9791,FY006AC01,Population,2022,2022,2,Female,650,100 years and over,2ae19629-1495-13a3-e055-000000000001,Monaghan County Council,Number,7


To determine the weighted mean age by sex, the aggregate 'Both sexes' and 'All ages' rows can be removed, and only the national ('Ireland') rows kept.

In [50]:
# Keep only rows that do not have 'Both sexes' in the Sex column
df = df[df["Sex"] != "Both sexes"]

# And rows that do not have 'All ages' in the Single Year of Age column
df = df[df["Single Year of Age"] != "All ages"]

# Keep only the national totals; copy so later column assignments act on an independent frame
df = df[df["Administrative Counties"] == "Ireland"].copy()

# Show first 3 rows to confirm
df.head(3)

Unnamed: 0,STATISTIC,Statistic Label,TLIST(A1),CensusYear,C02199V02655,Sex,C02076V03371,Single Year of Age,C03789V04537,Administrative Counties,UNIT,VALUE
3296,FY006AC01,Population,2022,2022,1,Male,200,Under 1 year,IE0,Ireland,Number,29610
3328,FY006AC01,Population,2022,2022,1,Male,1,1 year,IE0,Ireland,Number,28875
3360,FY006AC01,Population,2022,2022,1,Male,2,2 years,IE0,Ireland,Number,30236
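
The three filters above could also be combined into a single boolean mask. The following is a minimal sketch on a small hypothetical sample (the `sample` frame and its values are made up for illustration; only the column names match the CSO data):

```python
import pandas as pd

# Hypothetical miniature of the CSO table, using the same column names
sample = pd.DataFrame({
    "Sex": ["Both sexes", "Male", "Male", "Female", "Female"],
    "Single Year of Age": ["All ages", "Under 1 year", "All ages", "1 year", "1 year"],
    "Administrative Counties": ["Ireland", "Ireland", "Ireland", "Ireland", "Carlow County Council"],
    "VALUE": [100, 10, 50, 9, 1],
})

# Combine the three conditions into one boolean mask
mask = (
    (sample["Sex"] != "Both sexes")
    & (sample["Single Year of Age"] != "All ages")
    & (sample["Administrative Counties"] == "Ireland")
)

# .copy() makes the result an independent frame rather than a view of sample
filtered = sample[mask].copy()
print(filtered)
```

Only the male 'Under 1 year' row and the female '1 year' row for Ireland survive the mask in this sample.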


The age labels now need to be converted to integers.

In [51]:
# Change 'Under 1 year' to '0' (literal match, no regex needed)
df["Single Year of Age"] = df["Single Year of Age"].str.replace("Under 1 year", "0", regex=False)

# Remove all non-digit characters, leaving only the number
# (raw string so that \D is not treated as an invalid escape sequence)
df["Single Year of Age"] = df["Single Year of Age"].str.replace(r"\D", "", regex=True)

# Convert all age values to 64-bit integers
df["Single Year of Age"] = df["Single Year of Age"].astype("int64")

# Convert all 'VALUE' numbers to 64-bit integers
df["VALUE"] = df["VALUE"].astype("int64")

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 202 entries, 3296 to 9760
Data columns (total 12 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   STATISTIC                202 non-null    object
 1   Statistic Label          202 non-null    object
 2   TLIST(A1)                202 non-null    int64 
 3   CensusYear               202 non-null    int64 
 4   C02199V02655             202 non-null    object
 5   Sex                      202 non-null    object
 6   C02076V03371             202 non-null    object
 7   Single Year of Age       202 non-null    int64 
 8   C03789V04537             202 non-null    object
 9   Administrative Counties  202 non-null    object
 10  UNIT                     202 non-null    object
 11  VALUE                    202 non-null    int64 
dtypes: int64(4), object(8)
memory usage: 20.5+ KB
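
To illustrate what the age conversion does to each kind of label, here is a minimal sketch on a short hypothetical Series (the `ages` values mirror labels seen in the data, but the Series itself is made up for illustration). Note that '100 years and over' becomes exactly 100, which is an approximation.

```python
import pandas as pd

# Hypothetical sample of age labels in the same style as the CSO data
ages = pd.Series(["Under 1 year", "1 year", "2 years", "100 years and over"])

cleaned = (
    ages.str.replace("Under 1 year", "0", regex=False)  # literal replacement
        .str.replace(r"\D", "", regex=True)             # strip every non-digit character
        .astype("int64")                                # digit strings to integers
)
print(cleaned.tolist())  # [0, 1, 2, 100]
```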




With this done, the data can be simplified by removing unnecessary columns.

In [52]:
# Confirm all column names
headers = df.columns.tolist()

# Show list
headers

['STATISTIC',
 'Statistic Label',
 'TLIST(A1)',
 'CensusYear',
 'C02199V02655',
 'Sex',
 'C02076V03371',
 'Single Year of Age',
 'C03789V04537',
 'Administrative Counties',
 'UNIT',
 'VALUE']

In [53]:
# Remove columns that are not required for the analysis
drop_col_list = ['STATISTIC', 'Statistic Label', 'TLIST(A1)', 'CensusYear', 'C02199V02655', 'C02076V03371', 'C03789V04537', 'UNIT']
df.drop(columns=drop_col_list, inplace=True)

#print (df.head(3))
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 202 entries, 3296 to 9760
Data columns (total 4 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Sex                      202 non-null    object
 1   Single Year of Age       202 non-null    int64 
 2   Administrative Counties  202 non-null    object
 3   VALUE                    202 non-null    int64 
dtypes: int64(2), object(2)
memory usage: 7.9+ KB
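
An alternative to listing the columns to drop is to list the columns to keep. A minimal sketch, assuming a small hypothetical frame (`sample`, `keep_cols` and `slim` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Hypothetical frame with a subset of the CSO columns
sample = pd.DataFrame({
    "STATISTIC": ["FY006AC01", "FY006AC01"],
    "Sex": ["Male", "Female"],
    "Single Year of Age": [0, 0],
    "Administrative Counties": ["Ireland", "Ireland"],
    "UNIT": ["Number", "Number"],
    "VALUE": [29610, 28186],
})

# Select only the columns needed for the analysis
keep_cols = ["Sex", "Single Year of Age", "Administrative Counties", "VALUE"]
slim = sample[keep_cols].copy()
print(slim.columns.tolist())
```

Selecting by name keeps the cell working if the source table gains extra columns, whereas a drop list would let them through.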


In [54]:
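# Pivot so that each row is an age and each column a sex
# (the default aggregation is the mean, so single values are unchanged but returned as floats)
df_anal = pd.pivot_table(df, values="VALUE", index="Single Year of Age", columns="Sex")
print(df_anal.head(3))
# Write the entire table out to the local machine
#df_anal.to_csv("population_for_analysis.csv")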

Sex                  Female     Male
Single Year of Age                  
0                   28186.0  29610.0
1                   27545.0  28875.0
2                   28974.0  30236.0
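
With the data in this shape, the weighted mean age mentioned earlier is the sum of age times population divided by the total population, computed separately for each sex. The sketch below is one possible way to do this, not necessarily the notebook's own method. It uses only the three rows printed above (ages 0 to 2), so the result is illustrative and is not the national mean age; `pivot` and `weighted_mean_age` are hypothetical names.

```python
import pandas as pd

# The three rows printed above, rebuilt by hand (ages 0 to 2 only)
pivot = pd.DataFrame(
    {"Female": [28186.0, 27545.0, 28974.0], "Male": [29610.0, 28875.0, 30236.0]},
    index=pd.Index([0, 1, 2], name="Single Year of Age"),
)

# Ages as a Series aligned with the pivot table's index
ages = pivot.index.to_series()

# Weighted mean per sex: sum(age * count) / sum(count)
weighted_mean_age = pivot.mul(ages, axis=0).sum() / pivot.sum()
print(weighted_mean_age)
```

Applied to the full `df_anal` table, the same expression would give one weighted mean age per sex.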


In [55]:
df.head()

Unnamed: 0,Sex,Single Year of Age,Administrative Counties,VALUE
3296,Male,0,Ireland,29610
3328,Male,1,Ireland,28875
3360,Male,2,Ireland,30236
3392,Male,3,Ireland,31001
3424,Male,4,Ireland,31686
