David Culhane
<br>
DSC540-T301
<br>
Milestone 1: Cleaning/Formatting Flat File Source
<br>
<br>
**Global Inflation Data Cleaning and Formatting**
<br>
<br>
The data being cleaned and formatted here comes from the World Bank at https://datacatalog.worldbank.org/search/dataset/0065307/Inflation-in-Emerging-and-Developing-Economies. 
<br>
The Excel workbook found at the link has eighteen sheets covering inflation metrics at monthly, quarterly, and annual frequencies. There are five groups of sheets, split by those time increments, for the headline consumer price index (HCPI), food price index (FCPI), energy price index (ECPI), official core consumer price index (CCPI), and producer price index (PPI). The sub-indices help break down the overall index by component. Two sheets cover the GDP deflator index and inflation rates (DEF), and an aggregate sheet (AGGREGATE) reports various statistics across all of the data for various subgroups.
<br>
<br>
The Consumer Price Index is a measure of the cost of a "basket of goods". Each value in the spreadsheet is therefore what the same basket of goods would cost in the local currency. The values in the sheet are not inflation rates. Listing the inflation rates would require finding the percent change from one value to the next.
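<br>
<br>
As a short worked formula, and assuming the notation $I_t$ for the index value in period $t$ and $\pi_t$ for the inflation rate in percent (neither symbol appears in the source data), the percent change between consecutive values can be written as:

$$\pi_t = \left(\frac{I_t}{I_{t-1}} - 1\right) \times 100$$

For example, an index that moves from 100 to 102 between two quarters corresponds to $\pi_t = (102/100 - 1) \times 100 = 2$ percent for that quarter.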
<br>
<br>
Though each spreadsheet covers inflation-oriented data across the world, the number of countries varies from sheet to sheet. This likely has to do with how often an individual country reports data to the World Bank and which statistics it chooses to report.
<br>
<br>
The data in the spreadsheets cover a time period from 1970 through 2024. The world changed a lot during this time period. The Cold War was still ongoing at the start and ended during this period. Yugoslavia and the Soviet Union broke up, Germany went from divided to unified, and other lands that had been colonies of European countries gained independence throughout the world. For these reasons, a number of countries may have data that starts later than others.
<br>
<br>
***Loading the Data***
<br>
<br>
The data is stored in the same folder as this ipynb file as Inflation-data.xlsx. The final product from this data cleaning is a dataset that will cleanly merge with a table scraped from online as well as the results for textual analysis from the final dataset. The online table has annual World Bank and UN data, so this data preparation stage will use the quarterly headline consumer price index data.

In [2]:
# Loading the Quarterly HCPI Sheets
import pandas as pd
hcpiq = pd.read_excel('Inflation-data.xlsx', sheet_name='hcpi_q') # Headline Consumer Price Index - Quarterly
hcpiq

Unnamed: 0,Country Code,IMF Country Code,Country,Indicator Type,Series Name,19701,19702,19703,19704,19711,...,Data source,Note,Unnamed: 223,20231.1,20232.1,20233.1,20234.1,Unnamed: 228,Unnamed: 229,Unnamed: 230
0,ABW,314.0,Aruba,index,Headline Consumer Price Index,,,,,,...,IFS,,,,,,,,ABW,
1,AFG,512.0,Afghanistan,index,Headline Consumer Price Index,,,,,,...,IFS,,,,,,,,AFG,
2,AGO,614.0,Angola,index,Headline Consumer Price Index,,,,,,...,IFS,,,11.621373,10.821176,13.566033,18.268832,,AGO,13.569354
3,AIA,264.0,Anguilla,index,Headline Consumer Price Index,,,,,,...,IFS,,,8.151830,5.389115,1.198717,,,AIA,4.913221
4,ALB,914.0,Albania,index,Headline Consumer Price Index,,,,,,...,IFS,,,6.517382,4.613066,,,,ALB,5.565224
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
199,ZAF,199.0,South Africa,index,Headline Consumer Price Index,1.747645,1.793636,1.824296,1.839626,1.854957,...,OECDstat,,,7.263119,6.451678,5.020374,5.633413,,ZAF,6.092146
200,ZMB,754.0,Zambia,index,Headline Consumer Price Index,,,,,,...,IFS,,,,,,,,ZMB,
201,ZWE,698.0,Zimbabwe,index,Headline Consumer Price Index,,,,,,...,IFS,,,93.683514,117.120450,,,,ZWE,105.401982
202,,,,,,,,,,,...,,,,,,,,,,


***Transformation 1: Column Removal***
<br>
Each of the different indices has the same columns describing the country and the report. Each spreadsheet is structured to have some columns for country and report information, followed by the reported data, with each time period being a single column. The annual spreadsheets have one column per year, and the quarterly column headers are of the format YYYYQ. For each sheet, we only need the columns that contain the country name and the data for each time period. The other columns can be dropped.

In [4]:
# Dropping from the quarterly dataframes
hcpiq = hcpiq.drop(columns=['Country Code', 'IMF Country Code', 'Indicator Type', 'Series Name', 'Data source', 'Note', '20231.1', '20232.1',
                           '20233.1', '20234.1', 'Unnamed: 223', 'Unnamed: 228', 'Unnamed: 229', 'Unnamed: 230'])
hcpiq.head()

Unnamed: 0,Country,19701,19702,19703,19704,19711,19712,19713,19714,19721,...,20213,20214,20221,20222,20223,20224,20231,20232,20233,20234
0,Aruba,,,,,,,,,,...,,,,,,,,,,
1,Afghanistan,,,,,,,,,,...,,,,,,,,,,
2,Angola,,,,,,,,,,...,600.280514,638.913942,676.609997,701.518441,718.886138,736.271167,755.24137,777.430983,816.410472,870.779307
3,Anguilla,,,,,,,,,,...,108.569873,108.599569,109.530055,111.678092,117.261006,119.765399,118.458759,117.696553,118.666634,
4,Albania,,,,,,,,,,...,122.98862,124.221785,128.667652,131.710696,132.643743,133.981248,137.053414,137.786597,,
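
Listing every unwanted column by name works for this sheet, but it has to be redone if the sheet layout changes. A minimal sketch of an alternative, assuming the quarterly headers can always be read as a four-digit year followed by a quarter digit, selects the columns to keep by pattern instead. The toy frame `raw` and its values are made up for illustration and only mimic the layout of the real sheet.

```python
import re

import pandas as pd

# Toy frame mimicking the raw sheet layout; names and values are made up
raw = pd.DataFrame({
    "Country Code": ["AAA", "BBB"],
    "Country": ["Country A", "Country B"],
    "19701": [1.0, 2.0],
    "19702": [1.1, 2.1],
    "Note": [None, None],
    "19701.1": [5.0, 6.0],      # duplicated-period column, like '20231.1'
    "Unnamed: 7": [None, None],
})

# A quarterly column is exactly four year digits followed by a quarter digit 1-4
quarter_pattern = re.compile(r"\d{4}[1-4]")
keep = ["Country"] + [c for c in raw.columns if quarter_pattern.fullmatch(str(c))]

trimmed = raw[keep]
print(trimmed.columns.tolist())
```

Because `fullmatch` requires the whole header to match, duplicated columns such as `19701.1` and the `Unnamed` columns are left out without being named.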


***Transformation 2: Fixing the Column Headers***
<br>
The formatting for the column headers is slightly problematic for the quarterly inflation data. The dated column headers follow the pattern YYYYQ, so the third quarter of 1986 has a header of 19863. We as people know what this stands for, but a computer could read the header as the number 19,863 if it were converted to a numeric type. If we convert these to a year/month format, the data for each country can be treated as a time series.
<br>
The pattern we want is YYYY/M, where the month is the last month of each quarter: 3, 6, 9, or 12. The months are written without zero-padding, so the first quarter of 1970 becomes 1970/3. A double loop can generate each year and quarter combination in order and collect them in a list of column headers for the quarterly dataframe. In that list, country needs to be added as the first header.

In [6]:
# Making the column headers
headers = ['country']
qmonths = [3, 6, 9, 12]
for year in range(1970, 2024):  # 1970 through 2023 inclusive
    for month in qmonths:
        headers.append(f"{year}/{month}")

hcpiq.columns = headers
hcpiq.head()

Unnamed: 0,country,1970/3,1970/6,1970/9,1970/12,1971/3,1971/6,1971/9,1971/12,1972/3,...,2021/9,2021/12,2022/3,2022/6,2022/9,2022/12,2023/3,2023/6,2023/9,2023/12
0,Aruba,,,,,,,,,,...,,,,,,,,,,
1,Afghanistan,,,,,,,,,,...,,,,,,,,,,
2,Angola,,,,,,,,,,...,600.280514,638.913942,676.609997,701.518441,718.886138,736.271167,755.24137,777.430983,816.410472,870.779307
3,Anguilla,,,,,,,,,,...,108.569873,108.599569,109.530055,111.678092,117.261006,119.765399,118.458759,117.696553,118.666634,
4,Albania,,,,,,,,,,...,122.98862,124.221785,128.667652,131.710696,132.643743,133.981248,137.053414,137.786597,,
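
The hard-coded year range works as long as the sheet covers exactly 1970 through 2023. A hedged sketch of an alternative derives each new label from the existing YYYYQ header, so the list always has the same length as the columns. `quarter_to_label` is a hypothetical helper written for this illustration, and the short `old_headers` list stands in for the real column names.

```python
import pandas as pd


def quarter_to_label(code):
    """Turn a 'YYYYQ' code such as '19863' into 'YYYY/M' such as '1986/9'."""
    code = str(code)
    year, quarter = code[:4], int(code[4])
    return f"{year}/{quarter * 3}"  # last month of the quarter: 3, 6, 9 or 12


# Stand-in for the real column names after the column removal step
old_headers = ["Country", "19701", "19702", "19703", "19704"]
new_headers = ["country"] + [quarter_to_label(c) for c in old_headers[1:]]
print(new_headers)

# The labels can be parsed as dates when a time series is needed
print(pd.to_datetime("1986/9", format="%Y/%m"))
```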


***Transformation 3: Row Drops***
<br>
<br>
At the bottom of each spreadsheet, in the first column (one that was dropped in an earlier step), there was a note left for information purposes. Since this note sat below all of the data, it was read in with the initial load and created two extra rows without data in them. They need to be removed.

In [8]:
hcpiq = hcpiq.drop(axis=0, index=[len(hcpiq)-2, len(hcpiq)-1])
hcpiq

Unnamed: 0,country,1970/3,1970/6,1970/9,1970/12,1971/3,1971/6,1971/9,1971/12,1972/3,...,2021/9,2021/12,2022/3,2022/6,2022/9,2022/12,2023/3,2023/6,2023/9,2023/12
0,Aruba,,,,,,,,,,...,,,,,,,,,,
1,Afghanistan,,,,,,,,,,...,,,,,,,,,,
2,Angola,,,,,,,,,,...,600.280514,638.913942,676.609997,701.518441,718.886138,736.271167,755.241370,777.430983,816.410472,870.779307
3,Anguilla,,,,,,,,,,...,108.569873,108.599569,109.530055,111.678092,117.261006,119.765399,118.458759,117.696553,118.666634,
4,Albania,,,,,,,,,,...,122.988620,124.221785,128.667652,131.710696,132.643743,133.981248,137.053414,137.786597,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
197,Kosovo,,,,,,,,,,...,122.664273,125.689202,129.805846,135.995353,138.995595,140.908187,142.585545,141.715646,143.519994,144.864690
198,"Yemen, Rep.",,,,,,,,,,...,,,,,,,,,,
199,South Africa,1.747645,1.793636,1.824296,1.839626,1.854957,1.885617,1.916277,1.977598,1.977598,...,132.913200,134.308500,136.334000,139.529600,143.445500,144.615700,146.236100,148.531600,150.647000,152.762500
200,Zambia,,,,,,,,,,...,306.364687,308.950004,323.367999,331.983668,336.605808,339.157230,,,,
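
Dropping by position assumes the two empty rows are always the last two. A minimal sketch of a position-independent alternative, assuming the leftover rows have no country name, drops any row where `country` is missing. The toy frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: two data rows followed by two empty rows left behind by a footnote
df = pd.DataFrame({
    "country": ["Country A", "Country B", np.nan, np.nan],
    "1970/3": [1.0, np.nan, np.nan, np.nan],
    "1970/6": [1.1, np.nan, np.nan, np.nan],
})

# Rows without a country name cannot be data rows, so drop them
cleaned = df.dropna(subset=["country"]).reset_index(drop=True)
print(len(cleaned))
```

Country B is kept even though it has no values yet, since only the `country` column is checked at this stage.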


***Transformation 4: Adding Regions***
<br>
<br>
The world is a large and dynamic place. Being able to eventually subset the data by region would be handy. Unfortunately, the data in hcpiq does not have regions listed, and adding regions by hand for all 202 countries listed in hcpiq would be a massive chore. So I found a dataset that pairs countries with the regions they belong to.
<br>
<br>
The dataset was found at: https://www.kaggle.com/datasets/andradaolteanu/country-mapping-iso-continent-region/data
<br>
<br>
This dataset comes with region, sub-region, and intermediate region labels for each country. We only want one region for each country listed, so the sub-region will be used. We can then check which countries are in both sets and move forward using those countries.

In [10]:
# Loading the data containing the regions
continents2 = pd.read_csv('continents2.csv')
continents2.head()

Unnamed: 0,name,alpha-2,alpha-3,country-code,iso_3166-2,region,sub-region,intermediate-region,region-code,sub-region-code,intermediate-region-code
0,Afghanistan,AF,AFG,4,ISO 3166-2:AF,Asia,Southern Asia,,142.0,34.0,
1,Åland Islands,AX,ALA,248,ISO 3166-2:AX,Europe,Northern Europe,,150.0,154.0,
2,Albania,AL,ALB,8,ISO 3166-2:AL,Europe,Southern Europe,,150.0,39.0,
3,Algeria,DZ,DZA,12,ISO 3166-2:DZ,Africa,Northern Africa,,2.0,15.0,
4,American Samoa,AS,ASM,16,ISO 3166-2:AS,Oceania,Polynesia,,9.0,61.0,


In [11]:
# Making a smaller version of continents2, filtered to contain only countries present in hcpiq
regiondf = continents2[continents2['name'].isin(hcpiq['country'])]
regiondf = regiondf.drop(columns=['alpha-2', 'alpha-3', 'country-code', 'iso_3166-2', 'region', 'intermediate-region',
              'region-code', 'sub-region-code', 'intermediate-region-code'])
regiondf.columns = ['country', 'region']
regiondf.head()

Unnamed: 0,country,region
0,Afghanistan,Southern Asia
2,Albania,Southern Europe
3,Algeria,Northern Africa
6,Angola,Sub-Saharan Africa
7,Anguilla,Latin America and the Caribbean


Now, we just need to merge the dataframes into a single, tidier object.

In [13]:
hcpiqr = regiondf.merge(hcpiq, on='country', how='inner')
hcpiqr

Unnamed: 0,country,region,1970/3,1970/6,1970/9,1970/12,1971/3,1971/6,1971/9,1971/12,...,2021/9,2021/12,2022/3,2022/6,2022/9,2022/12,2023/3,2023/6,2023/9,2023/12
0,Afghanistan,Southern Asia,,,,,,,,,...,,,,,,,,,,
1,Albania,Southern Europe,,,,,,,,,...,122.988620,124.221785,128.667652,131.710696,132.643743,133.981248,137.053414,137.786597,,
2,Algeria,Northern Africa,,,,,,,,,...,167.797108,172.267489,175.005505,181.161761,183.378600,186.933862,192.089359,198.774132,200.861289,202.478651
3,Angola,Sub-Saharan Africa,,,,,,,,,...,600.280514,638.913942,676.609997,701.518441,718.886138,736.271167,755.241370,777.430983,816.410472,870.779307
4,Anguilla,Latin America and the Caribbean,,,,,,,,,...,108.569873,108.599569,109.530055,111.678092,117.261006,119.765399,118.458759,117.696553,118.666634,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
164,Uruguay,Latin America and the Caribbean,0.000136,0.000141,0.000145,0.000152,0.000162,0.000169,0.00018,0.000202,...,242.106705,246.398904,254.239959,259.919500,265.544232,267.593708,273.669277,277.838248,276.846351,280.410492
165,Vanuatu,Melanesia,,,,,,,,,...,126.170806,127.275629,129.264310,130.590098,137.366347,141.564675,144.216251,149.445747,,
166,Vietnam,South-eastern Asia,,,,,,,,,...,172.889624,172.234562,174.100240,176.661726,178.634683,179.826715,181.375684,180.921669,183.803421,186.192866
167,Zambia,Sub-Saharan Africa,,,,,,,,,...,306.364687,308.950004,323.367999,331.983668,336.605808,339.157230,,,,
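
Because the join is on country names, any country spelled differently in the two sources is silently dropped by the inner merge; the row count here went from 202 to 169. A hedged sketch of one way to list the names that found no match uses a left merge with `indicator=True`. The two small frames and the name mismatch in them are illustrative only, not taken from the real files.

```python
import pandas as pd

# Toy data; the names and the mismatch are illustrative only
inflation = pd.DataFrame({"country": ["Albania", "Yemen, Rep.", "Zambia"]})
regions = pd.DataFrame({
    "country": ["Albania", "Yemen", "Zambia"],
    "region": ["Southern Europe", "Western Asia", "Sub-Saharan Africa"],
})

# A left merge with indicator=True adds a '_merge' column: 'both' or 'left_only'
check = inflation.merge(regions, on="country", how="left", indicator=True)
unmatched = check.loc[check["_merge"] == "left_only", "country"].tolist()
print(unmatched)
```

The unmatched names could then be renamed by hand in one of the two frames before the final merge.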


***Transformation 5: Removal of Largely Blank Data***
<br>
<br>
While there are currently 169 countries represented in the data, a number of them are missing data. Some countries did not exist for the whole period, while others may not have reported or published data. Dropping every country that has any NaN value would leave the dataset with only 56 countries, a drastic cut in the data. It would also remove the countries that resulted from the aforementioned breakups and those that gained independence or began reporting data after 1970, since it would require that data be reported for every quarter from 1970 to 2023. That would be an unreasonable thing to do to the data.
<br>
<br>
A more reasonable approach is to cull the data lightly based on how many NaN values a country has. The Soviet Union broke up in 1991, and Yugoslavia broke up between 1990 and 1992Q1. The span 1992Q2-2023Q4 covers 127 quarters. Adding two to that count for the country and region fields gives 129 non-NaN fields as a semi-arbitrary threshold. We can perform a final culling on this basis: if a row has fewer than 129 non-NaN values, it is removed from the dataframe.
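<br>
<br>
The quarter count can be checked with a few lines of arithmetic. `quarters_inclusive` is a hypothetical helper written for this check, not part of the notebook's pipeline.

```python
def quarters_inclusive(start_year, start_q, end_year, end_q):
    """Number of quarters from start to end, counting both endpoints."""
    return (end_year - start_year) * 4 + (end_q - start_q) + 1


n_quarters = quarters_inclusive(1992, 2, 2023, 4)
threshold = n_quarters + 2  # plus the country and region fields
print(n_quarters, threshold)
```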

In [15]:
# Keeping only the rows with at least 129 non-NaN fields (country + region + 127 quarters)

# Boolean mask for indexing hcpiqr: count(axis=1) gives the non-NaN count per row
atleast129 = hcpiqr.count(axis=1) >= 129

hcpiqr_129 = hcpiqr[atleast129]
hcpiqr_129 = hcpiqr_129.round(2)  # Rounding the numeric columns to two decimal places for legibility's sake
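
The same kind of filter can also be written with `DataFrame.dropna` and its `thresh` parameter, which keeps rows that have at least the given number of non-NaN values. The sketch below uses a made-up toy frame and a threshold of 4 in place of 129 so the result is easy to check by eye.

```python
import numpy as np
import pandas as pd

# Toy frame with 2 label columns and 3 data columns; values are made up
toy = pd.DataFrame({
    "country": ["Country A", "Country B", "Country C"],
    "region": ["Region X", "Region X", "Region Y"],
    "2023/3": [1.0, np.nan, 3.0],
    "2023/6": [1.1, np.nan, np.nan],
    "2023/9": [1.2, 2.2, 3.2],
})

# thresh=4 keeps rows with at least 4 non-NaN fields (129 in the real data)
kept = toy.dropna(thresh=4)
print(kept["country"].tolist())
```

Country B has only three non-NaN fields (country, region, and one value), so it is the only row removed.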

***Implications from Wrangling the Data***
<br>
<br>
The changes made to the data in this wrangling included dropping rows that had no inflation data at all, dropping countries that did not report enough data, modifying the column names to align more closely with datetime objects should they be needed for further analysis, and adding a region column to make future analysis by subsetting easier.
<br>
<br>
To my knowledge, there are no legal or regulatory guidelines restricting analysis of data on inflation around the world. Specifically, the data is listed as public with a Creative Commons license on the site listed above. The data was sourced by economists who submitted their reports to the World Bank.
<br>
<br>
One issue that could arise with the data is its interpretability. The current form of the data has each value listed as the cost of a basket of goods in the local currency. If the data is not interpreted correctly, it could lead to misinformation. Creating a new dataframe that contains the percent change from one cell to the next would likely clear that up. It was not performed here since this wrangling was not looking to change any of the data themselves.
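<br>
<br>
A minimal sketch of how that follow-up step might look on a wide-format frame, assuming one row per country and one column per quarter as in hcpiqr_129. The toy frame `wide` and its values are made up for illustration, and the result is kept in a separate frame so the original values stay untouched.

```python
import pandas as pd

# Toy wide-format frame shaped like hcpiqr_129; the index values are made up
wide = pd.DataFrame(
    {
        "country": ["Country A"],
        "2023/3": [100.0],
        "2023/6": [102.0],
        "2023/9": [105.06],
    }
).set_index("country")

# Percent change from the previous column (quarter) in the same row
pct = (wide / wide.shift(1, axis=1) - 1) * 100
print(pct.round(2))
```

The first quarter has no earlier value to compare against, so it comes out as NaN. Shifting by four columns instead of one would compare each quarter with the same quarter a year earlier.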

In [17]:
# Writing the hcpiqr_129 dataframe to a CSV file.
hcpiqr_129.to_csv('hcpiqr_129.csv', index=False)