## BUILD AGE DISTRIBUTION AND AGE SPECIFIC FERTILITY DATASETS

**DATA SOURCE**: 
- **Age Distribution**: https://population.un.org/wpp/Download/Standard/CSV/; Population by Age and Sex ;   
    Population interpolated by single age and single year.
       - PopMale: Male population for the individual age (thousands)
       - PopFemale: Female population for the individual age (thousands)
       - PopTotal: Total population for the individual age (thousands)
    
    The original dataset contains the age distributions for all countries from 1950 to 2019. It has a size of around 240KB.   
    The dataset is downloaded and saved in the base directory (./Age&Fertility_Distribution_Source) with the name: 
       - WPP2019_PopulationBySingleAgeSex_1950-2019.csv  
       
     Given its size and the amount of information it contains, the dataset is first filtered to a shorter time range to make it more manageable. Only the years 2015 to 2019 are kept, and this 2015-2019 dataset is then used for all subsequent transformations.   
     The output is saved in the same base directory with the name: 
       - WPP2019_PopulationBySingleAgeSex_2015-2019.csv.   
 
    
       
- **Fertility Rate By Age Distribution** : https://population.un.org/wpp/Download/Standard/CSV/; Fertility Indicators;   
    Fertility indicators by age, annually and for 5-year periods, from 1950-1955 to 2095-2100.
       - ASFR: Age-specific fertility rate (births per 1,000 women)
       - PASFR: Percentage age-specific fertility rate
       - Births: Number of births, both sexes combined (thousands)  
       
**Note**: This script builds the Age Distribution and Age-Specific Fertility datasets, filtered for the desired countries and information. The Jupyter Notebook can be run on a personal computer. The output generated by the script is saved in the chosen target directory.   

**OUTPUT**: 
- World_Age_2019.csv
- AgeSpecificFertility.csv

### TABLE OF CONTENTS     
[1. Build Age-Distribution Dataset](#age_distr)       
[2. Build Mother-Age-Distribution Dataset](#mother_distr)     

In [1]:
# import libraries
import numpy as np
import pandas as pd
import csv
import math
import os 

# set Source and Target directories to save results 
# b_dir >> base directory where the downloaded source csv files are saved 
b_dir = './Age&Fertility_Distribution_Source'
# target_dir >> target directory where to store the results
target_dir = './'

### 1.  Build Age-Distribution Dataset 
<a id="age_distr"></a>

Convert the following cell to code mode to obtain the 2015-2019 Age Dataset from the 1950-2019 Age Dataset.  

Specifically, the 1950-2019 dataset should be downloaded from the source (see above) and saved in the base directory with the name *WPP2019_PopulationBySingleAgeSex_1950-2019.csv*.   

The method will then return the 2015-2019 csv File and save the output in the base directory as *WPP2019_PopulationBySingleAgeSex_2015-2019.csv*.  
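
The cell that performs this reduction is not reproduced here, so the following is only a minimal sketch of how the 1950-2019 file could be cut down to 2015-2019. It assumes the year is stored in a `Time` column (the column the code below filters on); the helper name `filter_years` and the demo values are hypothetical. Rows are streamed one at a time with the standard `csv` module, so the full file never has to be held in memory.

```python
import csv
import io

def filter_years(src, dst, first_year=2015, last_year=2019, time_col="Time"):
    """Copy the rows whose year lies in [first_year, last_year] from src to dst.

    src and dst are open text file objects; returns the number of rows written.
    """
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    kept = 0
    for row in reader:
        # assumption: the time column holds a plain integer year
        if first_year <= int(row[time_col]) <= last_year:
            writer.writerow(row)
            kept += 1
    return kept

# Tiny in-memory demo with made-up values (not real UN figures)
sample = io.StringIO(
    "Location,Time,AgeGrp,PopTotal\n"
    "Italy,2014,0,1.0\n"
    "Italy,2015,0,2.0\n"
    "Italy,2019,0,3.0\n"
)
out = io.StringIO()
n_kept = filter_years(sample, out)
```

With the real files, `src` and `dst` would be opened with `open(path, newline="")` and `open(path, "w", newline="")` on the 1950-2019 and 2015-2019 file names given above.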


In [2]:
################################
### BUILD AGE DISTRIBUTION DATASET 

def age_distribution_dataset(country_list=None):
    '''Function to build the suitable dataset for age distribution for the countries of interest
    Input = list of strings of countries of interest (e.g. ['Italy','Spain']); if None generate for all countries;
    Output = DataFrame containing the 2019 population by single age for the countries of interest '''
    
    #data = pd.read_csv(os.path.join(b_dir,'WPP2019_PopulationBySingleAgeSex_1950-2019.csv')) # total original dataset
    data = pd.read_csv(os.path.join(b_dir,'WPP2019_PopulationBySingleAgeSex_2015-2019.csv')) # reduced dataset
    data = data[data['Time'] == 2019] # select the year 2019
    data = data[['Location','PopTotal','AgeGrp']] # keep and order only the relevant columns
    # column names: Location followed by the single ages 0..100
    c = ['Location']
    c.extend([str(i) for i in range(101)])
    if country_list is None:
        # use all locations except these three regional aggregates
        excluded = ['Europe', 'Latin America and the Caribbean', 'Northern America']
        country_list = [loc for loc in data['Location'].unique() if loc not in excluded]
    rows = []
    for country in country_list: # iterate over the countries of interest
        population = data[data['Location'] == country]['PopTotal'].to_list()
        if len(population) != len(c) - 1: # one value per single age is required
            raise ValueError(f'expected {len(c) - 1} age rows for {country}, found {len(population)}')
        rows.append([str(country)] + population)
    # DataFrame.append was removed in pandas 2.0, so the rows are collected first
    return pd.DataFrame(rows, columns=c)
 
#country_list = ['Italy','Spain','China','France','Germany',] ### INDICATE THE COUNTRIES OF INTEREST
country_list = None # extract for all countries 
df = age_distribution_dataset(country_list)
df.head()
df.dtypes # the Location is a string whereas the other are all floats

# SAVE FINAL OUTPUT IN A .CSV FILE IN THE TARGET DIRECTORY
df.to_csv(os.path.join(target_dir,'World_Age_2019.csv'),index=False)

### 2.  Build Mother-Age-Distribution Dataset 
<a id="mother_distr"></a>

In [3]:
#######################################
### BUILD MOTHER AGE DISTRIBUTION DATASET 

def build_fertility_age_dataset(country_list=None): 
    '''Function to build the suitable dataset for fertility age distribution for the countries of interest
    Input = list of strings of countries of interest (e.g. ['Italy','Afghanistan']); if None generate for all countries;
    Output = DataFrame containing the 2015-2020 fertility data for the countries of interest '''
    
    # general preprocessing of the source data
    data = pd.read_csv(os.path.join(b_dir,'WPP2019_Fertility_by_Age.csv')) # load the dataset
    data['Births000'] = data['Births']*1000 # 'Births' is given in thousands: convert to the absolute number of births
    data = data[data['Time'] == '2015-2020'] # select the time range 2015-2020
    data = data[['Location','Births000','AgeGrp','ASFR']] # order the columns keeping only the relevant ones
    #data # all countries from A to Z for years 2015-2020 with Location, Births000, AgeGrp, ASFR
    
    # column names: Location followed by the seven 5-year age groups
    c = ['Location','15-19','20-24','25-29','30-34','35-39','40-44','45-49']
    if country_list is None:
        # use all locations except these three regional aggregates
        excluded = ['Europe', 'Latin America and the Caribbean', 'Northern America']
        country_list = [loc for loc in data['Location'].unique() if loc not in excluded]
    rows = []
    for country in country_list: # iterate over the countries of interest
        asfr = data[data['Location'] == country]['ASFR'].to_list()
        if len(asfr) != len(c) - 1: # one value per age group is required
            raise ValueError(f'expected {len(c) - 1} age groups for {country}, found {len(asfr)}')
        rows.append([str(country)] + asfr)
    # DataFrame.append was removed in pandas 2.0, so the rows are collected first
    return pd.DataFrame(rows, columns=c)

#country_list = ['Italy','Spain'] ### INDICATE THE COUNTRIES OF INTEREST
country_list = None 
df = build_fertility_age_dataset(country_list)
df.head(2)
df.dtypes # the Location is a string whereas the other are all floats

# SAVE FINAL OUTPUT IN A .CSV FILE IN THE TARGET DIRECTORY
df.to_csv(os.path.join(target_dir,'AgeSpecificFertility.csv'), index=False)
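
`build_fertility_age_dataset` fills the seven age-group columns by position, so it relies on each location's rows arriving in the order 15-19 to 45-49. As a hedged alternative, the sketch below assigns each ASFR value by its `AgeGrp` label instead, so row order no longer matters and a missing group shows up as `None`. The helper name `asfr_by_age_group` and the demo rows are hypothetical; the input is assumed to be an iterable of dictionaries such as `csv.DictReader` yields.

```python
AGE_GROUPS = ['15-19', '20-24', '25-29', '30-34', '35-39', '40-44', '45-49']

def asfr_by_age_group(rows):
    """Return {location: [ASFR for each label in AGE_GROUPS]}; None marks a missing group."""
    by_location = {}
    for row in rows:
        # index each rate by its age-group label rather than by row position
        by_location.setdefault(row['Location'], {})[row['AgeGrp']] = float(row['ASFR'])
    return {loc: [rates.get(group) for group in AGE_GROUPS]
            for loc, rates in by_location.items()}

# Made-up demo rows, deliberately out of order and incomplete
demo_rows = [
    {'Location': 'A', 'AgeGrp': '20-24', 'ASFR': '50.0'},
    {'Location': 'A', 'AgeGrp': '15-19', 'ASFR': '10.0'},
]
wide = asfr_by_age_group(demo_rows)
```

Each list in `wide` lines up with the age-group columns of *AgeSpecificFertility.csv*, which makes it usable as a cross-check of the positional result.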