### To do

1. Extract data from xlsx file and create raw data frame - **DONE**
2. Remove empty columns - **DONE**
3. Remove unnecessary columns - **DONE**
   1. Check for distinct data values - **DONE**
4. Rename columns - **DONE**
5. Reorder columns - **DONE**
6. Check for empty data - **DONE**
7. Verify data types - **DONE**
8. Create new dataframes by dimensions, indicators, and subgroups.

In [103]:
# Import necessary libraries
import pandas as pd
import numpy as np

### Extract data from the data.xlsx file

In [104]:
# Read the data into a Pandas DataFrame
raw_data_df = pd.read_excel('data/data.xlsx')
raw_data_df.head()

Unnamed: 0,setting,date,source,indicator_abbr,indicator_name,dimension,subgroup,estimate,se,ci_lb,...,iso3,favourable_indicator,indicator_scale,ordered_dimension,subgroup_order,reference_subgroup,whoreg6,wbincome2024,dataset_id,update
0,Afghanistan,2004,NNS,overweight,Overweight prevalence in children aged < 5 yea...,Child's age (6 groups) (0-59m),0-5 months,,,,...,AFG,0,100,1,1,0,Eastern Mediterranean,Low income,rep_nut,17 June 2024
1,Afghanistan,2004,NNS,overweight,Overweight prevalence in children aged < 5 yea...,Child's age (6 groups) (0-59m),12-23 months,4.3,,,...,AFG,0,100,1,3,0,Eastern Mediterranean,Low income,rep_nut,17 June 2024
2,Afghanistan,2004,NNS,overweight,Overweight prevalence in children aged < 5 yea...,Child's age (6 groups) (0-59m),24-35 months,3.0,,,...,AFG,0,100,1,4,0,Eastern Mediterranean,Low income,rep_nut,17 June 2024
3,Afghanistan,2004,NNS,overweight,Overweight prevalence in children aged < 5 yea...,Child's age (6 groups) (0-59m),36-47 months,5.6,,,...,AFG,0,100,1,5,0,Eastern Mediterranean,Low income,rep_nut,17 June 2024
4,Afghanistan,2004,NNS,overweight,Overweight prevalence in children aged < 5 yea...,Child's age (6 groups) (0-59m),48-59 months,6.4,,,...,AFG,0,100,1,6,0,Eastern Mediterranean,Low income,rep_nut,17 June 2024


In [105]:
# Get a brief summary of the raw_data DataFrame.
raw_data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157095 entries, 0 to 157094
Data columns (total 24 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   setting               157095 non-null  object 
 1   date                  157095 non-null  int64  
 2   source                157095 non-null  object 
 3   indicator_abbr        157095 non-null  object 
 4   indicator_name        157095 non-null  object 
 5   dimension             157095 non-null  object 
 6   subgroup              157095 non-null  object 
 7   estimate              153334 non-null  float64
 8   se                    136679 non-null  float64
 9   ci_lb                 142754 non-null  float64
 10  ci_ub                 142754 non-null  float64
 11  population            147089 non-null  float64
 12  flag                  157095 non-null  object 
 13  setting_average       157095 non-null  float64
 14  iso3                  157095 non-null  object 
 15  

There are 157,095 records and 24 columns of data in our dataframe.

Check for any columns that are empty and drop them from the dataframe.

In [106]:
# Check for empty columns
empty_columns = raw_data_df.columns[raw_data_df.isnull().all()]
print("Empty columns:", empty_columns)

# Remove empty columns
cleaned_df = raw_data_df.dropna(axis=1, how='all')

# Verify data frame info
print("DataFrame info after removing empty columns:")
cleaned_df.info()

Empty columns: Index([], dtype='object')
DataFrame info after removing empty columns:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157095 entries, 0 to 157094
Data columns (total 24 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   setting               157095 non-null  object 
 1   date                  157095 non-null  int64  
 2   source                157095 non-null  object 
 3   indicator_abbr        157095 non-null  object 
 4   indicator_name        157095 non-null  object 
 5   dimension             157095 non-null  object 
 6   subgroup              157095 non-null  object 
 7   estimate              153334 non-null  float64
 8   se                    136679 non-null  float64
 9   ci_lb                 142754 non-null  float64
 10  ci_ub                 142754 non-null  float64
 11  population            147089 non-null  float64
 12  flag                  157095 non-null  object 
 13  setting_average   

There are still 24 columns, so no empty columns were found.
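Because the real dataset has no empty columns, the check above never shows a column being removed. The following is a minimal sketch on a made-up frame (`toy_df` and its columns are hypothetical, not part of the dataset) illustrating how `isnull().all()` and `dropna(axis=1, how='all')` behave when a column is entirely missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 'b' is entirely missing, 'c' is only partly missing
toy_df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [np.nan, np.nan, np.nan],
    "c": [1.0, np.nan, 3.0],
})

# isnull().all() is True only for columns in which every value is missing
empty_columns = toy_df.columns[toy_df.isnull().all()].tolist()
print("Empty columns:", empty_columns)

# how='all' drops a column only when all of its values are missing,
# so the partly filled column 'c' is kept
toy_cleaned = toy_df.dropna(axis=1, how="all")
print("Remaining columns:", toy_cleaned.columns.tolist())
```

Only the fully empty column is dropped; a partly missing column such as `c` survives, which matches how `se` and `ci_lb` remain in the real dataframe.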

Count the unique values in each column. Any column that holds a single value provides no meaningful insight, so we can drop it from the dataframe.

In [107]:
# Count the number of unique values in each column
unique_counts = cleaned_df.nunique()

# Create a new DataFrame to store the counts
unique_counts_df = pd.DataFrame(unique_counts, columns=['Unique Count'])

# Display the new DataFrame
print("DataFrame with the count of unique values in each column:")
unique_counts_df

DataFrame with the count of unique values in each column:


Unnamed: 0,Unique Count
setting,156
date,34
source,19
indicator_abbr,15
indicator_name,15
dimension,6
subgroup,4517
estimate,130841
se,135364
ci_lb,136109


Columns 'favourable_indicator', 'indicator_scale', 'dataset_id', and 'update' each have a single value, so we can drop them.

In [108]:
# Remove columns with a single value from the DataFrame
cleaned_df = cleaned_df.drop(columns=['favourable_indicator', 'indicator_scale', 'dataset_id', 'update'])

# Display the updated DataFrame
print("DataFrame info after removing single value columns:\n")
cleaned_df.info()

DataFrame info after removing single value columns:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157095 entries, 0 to 157094
Data columns (total 20 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   setting             157095 non-null  object 
 1   date                157095 non-null  int64  
 2   source              157095 non-null  object 
 3   indicator_abbr      157095 non-null  object 
 4   indicator_name      157095 non-null  object 
 5   dimension           157095 non-null  object 
 6   subgroup            157095 non-null  object 
 7   estimate            153334 non-null  float64
 8   se                  136679 non-null  float64
 9   ci_lb               142754 non-null  float64
 10  ci_ub               142754 non-null  float64
 11  population          147089 non-null  float64
 12  flag                157095 non-null  object 
 13  setting_average     157095 non-null  float64
 14  iso3                157095 non-

The four single value columns were correctly dropped.  We now have 20 columns in the dataframe.
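The single-value columns were identified above by reading the unique counts and typing the names by hand. As a hedged alternative, the same selection could be derived from `nunique()` directly. The sketch below uses a small made-up frame (`toy_df` is hypothetical; only the column names are borrowed from the dataset).

```python
import pandas as pd

# Hypothetical toy frame with two constant columns
toy_df = pd.DataFrame({
    "setting": ["Afghanistan", "Albania", "Algeria"],
    "indicator_scale": [100, 100, 100],
    "dataset_id": ["rep_nut", "rep_nut", "rep_nut"],
    "estimate": [4.3, 3.0, 5.6],
})

# nunique() counts distinct non-missing values per column
unique_counts = toy_df.nunique()

# Keep the names of columns that hold exactly one distinct value
single_value_columns = unique_counts[unique_counts == 1].index.tolist()
print("Single-value columns:", single_value_columns)

# Drop them in one step
toy_cleaned = toy_df.drop(columns=single_value_columns)
print("Remaining columns:", toy_cleaned.columns.tolist())
```

Note that `nunique()` ignores missing values by default, so a column with one real value plus some NaNs would also be selected; whether that is desirable depends on the analysis.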

In [109]:
'''
***************
DELETE LATER
***************
# Get the unique values for each column in separate lists so that we can determine if any columns have unnecessary data.  
# Such columns will not be applicable to our analysis, so we will drop them from the dataframe.

# Create an empty dictionary to store the lists of unique values
unique_values = {}

# Iterate through each column in the cleaned DataFrame
for column in cleaned_df.columns:
    # Extract unique values and convert to a list
    unique_values[column] = cleaned_df[column].unique().tolist()

# Display the lists of unique values for each column
print("Unique values in each column:\n")
for column, values in unique_values.items():
    print(f"{column}: {values}")
'''


1. The *flag* column includes notes, author, and reference title.
2. The *source* column includes a code representing the source data type.
3. The meaning of the *reference_subgroup* column is unknown.
4. The *se* column refers to the standard error in the prevalence estimate.
5. The *ci_lb* and *ci_ub* columns refer to the lower and upper bounds of the confidence interval, respectively.
6. The *ordered_dimension* column is a boolean flag that indicates if the *dimension* column is a nominal or ordinal variable.  In this dataset, *0* represents a nominal variable and *1* represents an ordinal variable.  Nominal variables are for the dimensions of *sex*, *place of residence*, and *subnational region*.  Ordinal variables are for the dimensions of *age*, *economic status*, and *education level*. Ordinal dimensions use the *subgroup_order* column to denote the specific ordering.
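The role of *subgroup_order* for ordinal dimensions can be illustrated with a small sketch. The frame below is made up: the subgroup labels come from the dataset preview, but the order value assigned to "6-11 months" is an assumption, since that row is not shown in the preview.

```python
import pandas as pd

# Hypothetical toy frame for the age dimension
toy_df = pd.DataFrame({
    "subgroup": ["12-23 months", "0-5 months", "6-11 months", "24-35 months"],
    # Order for "6-11 months" (2) is assumed; the others match the preview
    "subgroup_order": [3, 1, 2, 4],
})

# Sorting the labels as text puts "6-11 months" last
alphabetical = toy_df.sort_values("subgroup")["subgroup"].tolist()
print(alphabetical)

# Sorting by subgroup_order recovers the intended age ordering
by_order = toy_df.sort_values("subgroup_order")["subgroup"].tolist()
print(by_order)
```

This is why an explicit ordering column is useful for ordinal dimensions: the text labels alone do not sort in a meaningful way.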

None of these columns is needed for our analysis, so we will drop them from the dataframe. The only exception is *subgroup_order*, which we keep.

In [110]:
# Remove the columns that are not needed for the analysis
cleaned_df = cleaned_df.drop(columns=['flag', 'source', 'reference_subgroup', 'se', 'ci_lb', 'ci_ub', 'ordered_dimension'])

# Display the updated DataFrame
print("DataFrame info after removing the unnecessary columns:\n")
cleaned_df.info()

DataFrame info after removing the unnecessary columns:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157095 entries, 0 to 157094
Data columns (total 13 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   setting          157095 non-null  object 
 1   date             157095 non-null  int64  
 2   indicator_abbr   157095 non-null  object 
 3   indicator_name   157095 non-null  object 
 4   dimension        157095 non-null  object 
 5   subgroup         157095 non-null  object 
 6   estimate         153334 non-null  float64
 7   population       147089 non-null  float64
 8   setting_average  157095 non-null  float64
 9   iso3             157095 non-null  object 
 10  subgroup_order   157095 non-null  int64  
 11  whoreg6          157095 non-null  object 
 12  wbincome2024     157095 non-null  object 
dtypes: float64(3), int64(2), object(8)
memory usage: 15.6+ MB


1. The *population* column represents the number of children under five years of age.
2. The *whoreg6* column is a regional classification provided by the WHO.
3. The *wbincome2024* column is an income group classification provided by The World Bank.

Rename the columns with more intuitive titles and reorder them.

In [111]:
# Rename the columns
cleaned_df = cleaned_df.rename(columns={
    'setting': 'Country',
    'date': 'Year',
    'indicator_abbr': 'Anthropometric Indicator',
    'indicator_name': 'Indicator Description',
    'dimension': 'Dimension',
    'subgroup': 'Dimension Value',
    'subgroup_order': 'Dimension Value Order',
    'setting_average': 'Country Avg',
    'iso3': 'Country ISO-3 Code',
    'whoreg6': 'Region',
    'wbincome2024': 'Income Group',
    'estimate': 'Prevalence Estimate %',
    'population': 'Under-Five Population (million)'
})

# Reorder the columns
cleaned_df = cleaned_df[
    [
        'Region',
        'Country ISO-3 Code',
        'Country',
        'Year',
        'Dimension',
        'Dimension Value',
        'Dimension Value Order',
        'Income Group',
        'Anthropometric Indicator',
        'Indicator Description',
        'Prevalence Estimate %',
        'Under-Five Population (million)',
        'Country Avg'
    ]
]

# Check the DataFrame information
cleaned_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157095 entries, 0 to 157094
Data columns (total 13 columns):
 #   Column                           Non-Null Count   Dtype  
---  ------                           --------------   -----  
 0   Region                           157095 non-null  object 
 1   Country ISO-3 Code               157095 non-null  object 
 2   Country                          157095 non-null  object 
 3   Year                             157095 non-null  int64  
 4   Dimension                        157095 non-null  object 
 5   Dimension Value                  157095 non-null  object 
 6   Dimension Value Order            157095 non-null  int64  
 7   Income Group                     157095 non-null  object 
 8   Anthropometric Indicator         157095 non-null  object 
 9   Indicator Description            157095 non-null  object 
 10  Prevalence Estimate %            153334 non-null  float64
 11  Under-Five Population (million)  147089 non-null  float64
 12  Co

In [112]:
# Remove rows where 'Under-Five Population (million)' and 'Prevalence Estimate %' are empty
cleaned_df = cleaned_df.dropna(subset=['Under-Five Population (million)', 'Prevalence Estimate %'])

# Check the DataFrame after dropping rows
cleaned_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 147089 entries, 1 to 157094
Data columns (total 13 columns):
 #   Column                           Non-Null Count   Dtype  
---  ------                           --------------   -----  
 0   Region                           147089 non-null  object 
 1   Country ISO-3 Code               147089 non-null  object 
 2   Country                          147089 non-null  object 
 3   Year                             147089 non-null  int64  
 4   Dimension                        147089 non-null  object 
 5   Dimension Value                  147089 non-null  object 
 6   Dimension Value Order            147089 non-null  int64  
 7   Income Group                     147089 non-null  object 
 8   Anthropometric Indicator         147089 non-null  object 
 9   Indicator Description            147089 non-null  object 
 10  Prevalence Estimate %            147089 non-null  float64
 11  Under-Five Population (million)  147089 non-null  float64
 12  Country

After removing unnecessary columns and rows with missing values, we are left with 147,089 entries and 13 columns.
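The row-dropping step can be sketched on a small made-up frame (`toy_df` and its values are hypothetical; only the column names follow the renamed dataframe). It shows the missing-value counts before the drop and the effect of `dropna(subset=...)`, which removes a row if either listed column is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row B lacks an estimate, row C lacks a population
toy_df = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Prevalence Estimate %": [4.3, np.nan, 5.6, 6.4],
    "Under-Five Population (million)": [1.2, 0.8, np.nan, 2.1],
})

# Count missing values per column before dropping anything
missing_before = toy_df.isna().sum()
print(missing_before)

# Drop rows where any of the listed columns is missing (default how='any')
toy_cleaned = toy_df.dropna(
    subset=["Under-Five Population (million)", "Prevalence Estimate %"]
)
print(toy_cleaned)

# Optional: renumber the index, since dropna keeps the original row labels
toy_reindexed = toy_cleaned.reset_index(drop=True)
print(toy_reindexed.index.tolist())
```

The notebook itself does not reset the index, which is why the cleaned dataframe reports `Index: 147089 entries, 1 to 157094` rather than a `RangeIndex`.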

## To do
1. Create dataframes for
   1. Country and Codes
   2. Dimension
   3. Income Group
   4. Anthropometric Indicator and Indicator Description

In [152]:
# Create a new DataFrame with unique values from 'Country ISO-3 Code' and 'Country' columns
country_df = cleaned_df[['Country ISO-3 Code', 'Country']].drop_duplicates()

# Display the new DataFrame
print(country_df)

       Country ISO-3 Code                         Country
1                     AFG                     Afghanistan
1005                  ALB                         Albania
1555                  DZA                         Algeria
2600                  AGO                          Angola
3164                  ARG                       Argentina
...                   ...                             ...
144521                VNM                        Viet Nam
151646                YEM                           Yemen
153070                ZMB                          Zambia
154420                ZWE                        Zimbabwe
156020                PSE  occupied Palestinian territory

[154 rows x 2 columns]


In [168]:
# Create a NumPy array of sequential IDs, one per country
country_ids = np.arange(1, len(country_df) + 1)
print(country_ids)

[  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108
 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
 145 146 147 148 149 150 151 152 153 154]


In [169]:
# Use a list comprehension to add "ctry" to each country_id.
ctry_ids = ["ctry" + str(ctry_id) for ctry_id in country_ids]

print(ctry_ids)

['ctry1', 'ctry2', 'ctry3', 'ctry4', 'ctry5', 'ctry6', 'ctry7', 'ctry8', 'ctry9', 'ctry10', 'ctry11', 'ctry12', 'ctry13', 'ctry14', 'ctry15', 'ctry16', 'ctry17', 'ctry18', 'ctry19', 'ctry20', 'ctry21', 'ctry22', 'ctry23', 'ctry24', 'ctry25', 'ctry26', 'ctry27', 'ctry28', 'ctry29', 'ctry30', 'ctry31', 'ctry32', 'ctry33', 'ctry34', 'ctry35', 'ctry36', 'ctry37', 'ctry38', 'ctry39', 'ctry40', 'ctry41', 'ctry42', 'ctry43', 'ctry44', 'ctry45', 'ctry46', 'ctry47', 'ctry48', 'ctry49', 'ctry50', 'ctry51', 'ctry52', 'ctry53', 'ctry54', 'ctry55', 'ctry56', 'ctry57', 'ctry58', 'ctry59', 'ctry60', 'ctry61', 'ctry62', 'ctry63', 'ctry64', 'ctry65', 'ctry66', 'ctry67', 'ctry68', 'ctry69', 'ctry70', 'ctry71', 'ctry72', 'ctry73', 'ctry74', 'ctry75', 'ctry76', 'ctry77', 'ctry78', 'ctry79', 'ctry80', 'ctry81', 'ctry82', 'ctry83', 'ctry84', 'ctry85', 'ctry86', 'ctry87', 'ctry88', 'ctry89', 'ctry90', 'ctry91', 'ctry92', 'ctry93', 'ctry94', 'ctry95', 'ctry96', 'ctry97', 'ctry98', 'ctry99', 'ctry100', 'ctry10

In [170]:
# Add the ctry_ids list as a new column
country_df['Country ID'] = ctry_ids
print(country_df)

       Country ISO-3 Code                         Country Country ID
1                     AFG                     Afghanistan      ctry1
1005                  ALB                         Albania      ctry2
1555                  DZA                         Algeria      ctry3
2600                  AGO                          Angola      ctry4
3164                  ARG                       Argentina      ctry5
...                   ...                             ...        ...
144521                VNM                        Viet Nam    ctry150
151646                YEM                           Yemen    ctry151
153070                ZMB                          Zambia    ctry152
154420                ZWE                        Zimbabwe    ctry153
156020                PSE  occupied Palestinian territory    ctry154

[154 rows x 3 columns]


In [171]:
# Reorder the columns so that 'Country ID' is first
country_df = country_df[['Country ID', 'Country ISO-3 Code', 'Country']]

# Display the updated DataFrame
country_df

Unnamed: 0,Country ID,Country ISO-3 Code,Country
1,ctry1,AFG,Afghanistan
1005,ctry2,ALB,Albania
1555,ctry3,DZA,Algeria
2600,ctry4,AGO,Angola
3164,ctry5,ARG,Argentina
...,...,...,...
144521,ctry150,VNM,Viet Nam
151646,ctry151,YEM,Yemen
153070,ctry152,ZMB,Zambia
154420,ctry153,ZWE,Zimbabwe


In [173]:
# Export country_df as CSV file.
country_df.to_csv("data/country.csv", index=False)
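The steps used for the country table (deduplicate, number the rows, prefix the numbers, move the ID column to the front) follow a pattern that could be wrapped in a small helper. The sketch below is one possible way to do that; `build_lookup` is a hypothetical function name, not something defined in this notebook, and the toy data is made up.

```python
import pandas as pd


def build_lookup(df, columns, id_column, prefix):
    """Return the unique rows of `columns` with a prefixed ID column in front.

    Hypothetical helper: unlike the cells above, it resets the index so the
    lookup table is numbered 0..n-1 instead of keeping the source row labels.
    """
    lookup = df[columns].drop_duplicates().reset_index(drop=True)
    ids = [f"{prefix}{i}" for i in range(1, len(lookup) + 1)]
    # insert() places the ID column at position 0, so no reordering is needed
    lookup.insert(0, id_column, ids)
    return lookup


# Made-up sample with repeated countries
toy_df = pd.DataFrame({
    "Country ISO-3 Code": ["AFG", "AFG", "ALB", "DZA", "ALB"],
    "Country": ["Afghanistan", "Afghanistan", "Albania", "Algeria", "Albania"],
})

country_lookup = build_lookup(
    toy_df, ["Country ISO-3 Code", "Country"], "Country ID", "ctry"
)
print(country_lookup)
```

Building the table on a fresh frame and using `insert()` also avoids assigning a new column to a frame derived from a slice of `cleaned_df`.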

In [153]:
# Create a new DataFrame with unique values from 'Anthropometric Indicator' and 'Indicator Description' columns
indicator_df = cleaned_df[['Anthropometric Indicator', 'Indicator Description']].drop_duplicates()

# Display the new DataFrame
print(indicator_df)

   Anthropometric Indicator                              Indicator Description
1                overweight  Overweight prevalence in children aged < 5 yea...
9              overweight_F  Overweight prevalence in children aged < 5 yea...
15             overweight_M  Overweight prevalence in children aged < 5 yea...
21                 stunting  Stunting prevalence in children aged < 5 years...
29               stunting_F  Stunting prevalence in children aged < 5 years...
35               stunting_M  Stunting prevalence in children aged < 5 years...
41              underweight  Underweight prevalence in children aged < 5 ye...
49            underweight_F  Underweight prevalence in children aged < 5 ye...
55            underweight_M  Underweight prevalence in children aged < 5 ye...
61                  wasting  Wasting prevalence in children aged < 5 years (%)
69                wasting_F  Wasting prevalence in children aged < 5 years ...
75                wasting_M  Wasting prevalence in c

In [175]:
# Create a NumPy array of sequential IDs, one per indicator
indicator_ids = np.arange(1, len(indicator_df) + 1)
print(indicator_ids)

[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15]


In [176]:
# Use a list comprehension to add "ind" to each ind_id.
ind_ids = ["ind" + str(ind_id) for ind_id in indicator_ids]

print(ind_ids)

['ind1', 'ind2', 'ind3', 'ind4', 'ind5', 'ind6', 'ind7', 'ind8', 'ind9', 'ind10', 'ind11', 'ind12', 'ind13', 'ind14', 'ind15']


In [179]:
# Add the ind_ids list as a new column
indicator_df['Indicator ID'] = ind_ids

print(indicator_df)

   Anthropometric Indicator  \
1                overweight   
9              overweight_F   
15             overweight_M   
21                 stunting   
29               stunting_F   
35               stunting_M   
41              underweight   
49            underweight_F   
55            underweight_M   
61                  wasting   
69                wasting_F   
75                wasting_M   
81               wastingsev   
89             wastingsev_F   
95             wastingsev_M   

                                Indicator Description Indicator ID  
1   Overweight prevalence in children aged < 5 yea...         ind1  
9   Overweight prevalence in children aged < 5 yea...         ind2  
15  Overweight prevalence in children aged < 5 yea...         ind3  
21  Stunting prevalence in children aged < 5 years...         ind4  
29  Stunting prevalence in children aged < 5 years...         ind5  
35  Stunting prevalence in children aged < 5 years...         ind6  
41  Underweight prev

In [180]:
# Reorder the columns so that 'Indicator ID' is first
indicator_df = indicator_df[['Indicator ID', 'Anthropometric Indicator', 'Indicator Description']]

# Display the updated DataFrame
indicator_df

Unnamed: 0,Indicator ID,Anthropometric Indicator,Indicator Description
1,ind1,overweight,Overweight prevalence in children aged < 5 yea...
9,ind2,overweight_F,Overweight prevalence in children aged < 5 yea...
15,ind3,overweight_M,Overweight prevalence in children aged < 5 yea...
21,ind4,stunting,Stunting prevalence in children aged < 5 years...
29,ind5,stunting_F,Stunting prevalence in children aged < 5 years...
35,ind6,stunting_M,Stunting prevalence in children aged < 5 years...
41,ind7,underweight,Underweight prevalence in children aged < 5 ye...
49,ind8,underweight_F,Underweight prevalence in children aged < 5 ye...
55,ind9,underweight_M,Underweight prevalence in children aged < 5 ye...
61,ind10,wasting,Wasting prevalence in children aged < 5 years (%)


In [181]:
# Export indicator_df as CSV file.
indicator_df.to_csv("data/indicator.csv", index=False)

In [163]:
# Create a new DataFrame with unique values from the 'Dimension' column
dimension_df = cleaned_df[['Dimension']].drop_duplicates()

# Display the new DataFrame
print(dimension_df)

Unnamed: 0,Dimension
1,Child's age (6 groups) (0-59m)
6,Sex
106,Place of residence
110,Subnational region
386,Economic status (wealth quintile)
391,Education (3 groups)


In [164]:
# Create numpy array for each of the Dimensions
dimension_ids = np.arange(1, len(dimension_df) + 1)
print(dimension_ids)

[1 2 3 4 5 6]


In [165]:
# Use a list comprehension to add "dim" to each dimension_id.
dim_ids = ["dim" + str(dim_id) for dim_id in dimension_ids]

print(dim_ids)

['dim1', 'dim2', 'dim3', 'dim4', 'dim5', 'dim6']


In [166]:
# Add the numeric dimension_ids array as a new column (the prefixed dim_ids list is not used here)
dimension_df['Dimension ID'] = dimension_ids

print(dimension_df)

Unnamed: 0,Dimension,Dimension ID
1,Child's age (6 groups) (0-59m),1
6,Sex,2
106,Place of residence,3
110,Subnational region,4
386,Economic status (wealth quintile),5
391,Education (3 groups),6


In [167]:
# Reorder the columns so that 'Dimension ID' is first, then 'Dimension'
dimension_df = dimension_df[['Dimension ID', 'Dimension']]

# Display the updated DataFrame
dimension_df

Unnamed: 0,Dimension ID,Dimension
1,1,Child's age (6 groups) (0-59m)
6,2,Sex
106,3,Place of residence
110,4,Subnational region
386,5,Economic status (wealth quintile)
391,6,Education (3 groups)


In [187]:
# Export dimension_df as CSV file.
dimension_df.to_csv("data/dimension.csv", index=False)

In [117]:
# Create a new DataFrame with unique values from the 'Dimension Value' column
dimension_value_df = cleaned_df[['Dimension Value']].drop_duplicates()

# Display the new DataFrame
print(dimension_value_df)

           Dimension Value
1             12-23 months
2             24-35 months
3             36-47 months
4             48-59 months
5              6-11 months
...                    ...
155443    Mashonaland West
155444            Masvingo
155445  Matabeleland North
155446  Matabeleland South
155447            Midlands

[4214 rows x 1 columns]


In [182]:
# Create numpy array for each of the dimension values
dimension_value_ids = np.arange(1, len(dimension_value_df) + 1)
print(dimension_value_ids)

[   1    2    3 ... 4212 4213 4214]


In [183]:
# Use a list comprehension to add "dimval" to each dimension value ID.
dimval_ids = ["dimval" + str(dimval_id) for dimval_id in dimension_value_ids]

print(dimval_ids)

['dimval1', 'dimval2', 'dimval3', 'dimval4', 'dimval5', 'dimval6', 'dimval7', 'dimval8', 'dimval9', 'dimval10', 'dimval11', 'dimval12', 'dimval13', 'dimval14', 'dimval15', 'dimval16', 'dimval17', 'dimval18', 'dimval19', 'dimval20', 'dimval21', 'dimval22', 'dimval23', 'dimval24', 'dimval25', 'dimval26', 'dimval27', 'dimval28', 'dimval29', 'dimval30', 'dimval31', 'dimval32', 'dimval33', 'dimval34', 'dimval35', 'dimval36', 'dimval37', 'dimval38', 'dimval39', 'dimval40', 'dimval41', 'dimval42', 'dimval43', 'dimval44', 'dimval45', 'dimval46', 'dimval47', 'dimval48', 'dimval49', 'dimval50', 'dimval51', 'dimval52', 'dimval53', 'dimval54', 'dimval55', 'dimval56', 'dimval57', 'dimval58', 'dimval59', 'dimval60', 'dimval61', 'dimval62', 'dimval63', 'dimval64', 'dimval65', 'dimval66', 'dimval67', 'dimval68', 'dimval69', 'dimval70', 'dimval71', 'dimval72', 'dimval73', 'dimval74', 'dimval75', 'dimval76', 'dimval77', 'dimval78', 'dimval79', 'dimval80', 'dimval81', 'dimval82', 'dimval83', 'dimval84', 

In [184]:
# Add the numeric dimension_value_ids array as a new column (the prefixed dimval_ids list is not used here)
dimension_value_df['Dimension Value ID'] = dimension_value_ids

print(dimension_value_df)

           Dimension Value  Dimension Value ID
1             12-23 months                   1
2             24-35 months                   2
3             36-47 months                   3
4             48-59 months                   4
5              6-11 months                   5
...                    ...                 ...
155443    Mashonaland West                4210
155444            Masvingo                4211
155445  Matabeleland North                4212
155446  Matabeleland South                4213
155447            Midlands                4214

[4214 rows x 2 columns]


In [185]:
# Reorder the columns so that 'Dimension Value ID' is first
dimension_value_df = dimension_value_df[['Dimension Value ID', 'Dimension Value']]

# Display the updated DataFrame
dimension_value_df

Unnamed: 0,Dimension Value ID,Dimension Value
1,1,12-23 months
2,2,24-35 months
3,3,36-47 months
4,4,48-59 months
5,5,6-11 months
...,...,...
155443,4210,Mashonaland West
155444,4211,Masvingo
155445,4212,Matabeleland North
155446,4213,Matabeleland South


In [186]:
# Export dimension_value_df as CSV file.
dimension_value_df.to_csv("data/dimension_value.csv", index=False)

In [118]:
# Create a new DataFrame with unique values from the 'Income Group' column
income_group_df = cleaned_df[['Income Group']].drop_duplicates()

# Display the new DataFrame
print(income_group_df)

             Income Group
1              Low income
1005  Upper middle income
2600  Lower middle income
4525          High income


In [188]:
# Create numpy array for each of the income groups
income_ids = np.arange(1, len(income_group_df) + 1)
print(income_ids)

[1 2 3 4]


In [189]:
# Use a list comprehension to add "inc" to each income_id.
inc_ids = ["inc" + str(inc_id) for inc_id in income_ids]

print(inc_ids)

['inc1', 'inc2', 'inc3', 'inc4']


In [190]:
# Add the inc_ids list as a new column
income_group_df['Income ID'] = inc_ids

print(income_group_df)

             Income Group Income ID
1              Low income      inc1
1005  Upper middle income      inc2
2600  Lower middle income      inc3
4525          High income      inc4


In [191]:
# Reorder the columns so that 'Income ID' is first
income_group_df = income_group_df[['Income ID', 'Income Group']]

# Display the updated DataFrame
income_group_df

Unnamed: 0,Income ID,Income Group
1,inc1,Low income
1005,inc2,Upper middle income
2600,inc3,Lower middle income
4525,inc4,High income


In [206]:
# Export income_group_df as CSV file.
income_group_df.to_csv("data/income_group.csv", index=False)

In [200]:
# Create a new DataFrame with unique values from the 'Region' column
region_df = cleaned_df[['Region']].drop_duplicates()

# Display the new DataFrame
print(region_df)

                     Region
1     Eastern Mediterranean
1005               European
1555                African
3164               Americas
4525        Western Pacific
5308        South-East Asia


In [201]:
# Create numpy array for each of the regions
region_ids = np.arange(1, len(region_df) + 1)
print(region_ids)

[1 2 3 4 5 6]


In [202]:
# Use a list comprehension to add "reg" to each region ID.
reg_ids = ["reg" + str(reg_id) for reg_id in region_ids]

print(reg_ids)

['reg1', 'reg2', 'reg3', 'reg4', 'reg5', 'reg6']


In [203]:
# Add the reg_ids list as a new column
region_df['Region ID'] = reg_ids

print(region_df)

                     Region Region ID
1     Eastern Mediterranean      reg1
1005               European      reg2
1555                African      reg3
3164               Americas      reg4
4525        Western Pacific      reg5
5308        South-East Asia      reg6


In [204]:
# Reorder the columns so that 'Region ID' is first
region_df = region_df[['Region ID', 'Region']]

# Display the updated DataFrame
region_df

Unnamed: 0,Region ID,Region
1,reg1,Eastern Mediterranean
1005,reg2,European
1555,reg3,African
3164,reg4,Americas
4525,reg5,Western Pacific
5308,reg6,South-East Asia


In [205]:
# Export region_df as CSV file.
region_df.to_csv("data/region.csv", index=False)
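A possible next step, not carried out in this notebook, is to replace the text columns in the main dataframe with the IDs from these lookup tables. The following is a minimal sketch under that assumption, using a made-up fact table (`fact_df` and its estimate values are hypothetical) and two of the region IDs generated above.

```python
import pandas as pd

# Hypothetical slice of the cleaned data; estimate values are illustrative
fact_df = pd.DataFrame({
    "Region": ["African", "European", "African"],
    "Country": ["Angola", "Albania", "Algeria"],
    "Prevalence Estimate %": [3.1, 2.2, 4.0],
})

# Two rows of the region lookup, matching the IDs assigned above
region_lookup = pd.DataFrame({
    "Region ID": ["reg2", "reg3"],
    "Region": ["European", "African"],
})

# A left merge keeps every fact row; validate raises an error if a region
# name appears more than once in the lookup table
merged = fact_df.merge(
    region_lookup, on="Region", how="left", validate="many_to_one"
)

# Keep only the ID and drop the original text column
fact_with_ids = merged.drop(columns=["Region"])
print(fact_with_ids)
```

The same approach would apply to the country, indicator, dimension, and income group tables, one merge per lookup.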