# Inflation - Consumer Price Index - Cleaning

Summary:
1. We read the dataset from inflation_cpi.csv.
2. We then check all columns for null values. There are none in the Value column (the CPI values); only the Flag Codes column contains nulls.
3. Since the Flag Codes column provides no relevant information and contains only 72 non-null objects, we drop that column.
4. Then we reset the datatype of ['LOCATION','SUBJECT','MEASURE','TIME'] from object to string.
5. We perform a unique value analysis for all the columns. The column ['INDICATOR'] has only a single value, so we drop this column as well.
6. Next we analyse the column ['FREQUENCY']. It has three unique values ['A','Q','M'], but since we intend to perform the dataset joins only on annual values, we keep only the rows where 'FREQUENCY' == 'A'.
7. The annual data is reported in two different 'MEASURE' values ['IDX2015','AGRWTH'] and four different 'SUBJECT' values ['TOT','FOOD','ENRG','TOT_FOODENRG']. We need only the total, not the 'FOOD' or 'ENRG' subcategories, so we compare the row counts for both measures and finally choose the rows where 'MEASURE' == 'IDX2015' and 'SUBJECT' == 'TOT'.
8. We check the final values and null values, then write the final dataset to the data/temp and data/cleaned folders.
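
The steps above can be condensed into a single function. The following is only a minimal sketch: `clean_cpi` is a hypothetical helper name, and the small `sample` frame is made-up data that merely mimics the columns of inflation_cpi.csv.

```python
import pandas as pd


def clean_cpi(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: apply the cleaning steps from the summary in one pass."""
    # Drop the mostly-empty flag column and the single-valued indicator column
    out = df.drop(columns=["Flag Codes", "INDICATOR"])
    # Keep annual rows of the total CPI index (base year 2015)
    mask = (
        (out["FREQUENCY"] == "A")
        & (out["MEASURE"] == "IDX2015")
        & (out["SUBJECT"] == "TOT")
    )
    # Only the join key columns and the value remain
    return out.loc[mask, ["LOCATION", "TIME", "Value"]].reset_index(drop=True)


# Made-up rows for illustration only
sample = pd.DataFrame({
    "LOCATION": ["AUS", "AUS", "AUS", "AUS"],
    "INDICATOR": ["CPI"] * 4,
    "SUBJECT": ["TOT", "TOT", "FOOD", "TOT"],
    "MEASURE": ["IDX2015", "AGRWTH", "IDX2015", "IDX2015"],
    "FREQUENCY": ["A", "A", "A", "M"],
    "TIME": ["2015", "2015", "2015", "2015-01"],
    "Value": [100.0, 1.5, 100.0, 99.2],
    "Flag Codes": [None, None, None, None],
})

cleaned = clean_cpi(sample)
print(cleaned)
```

Only the first sample row satisfies all three filters, so `cleaned` holds a single row with the columns LOCATION, TIME and Value.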

In [1]:
import pandas as pd

In [2]:
df_cpi = pd.read_csv('../data/uncleaned/inflation_cpi.csv')
df_cpi.head()

Unnamed: 0,LOCATION,INDICATOR,SUBJECT,MEASURE,FREQUENCY,TIME,Value,Flag Codes
0,AUS,CPI,ENRG,AGRWTH,A,1972,4.91007,
1,AUS,CPI,ENRG,AGRWTH,A,1973,3.762801,
2,AUS,CPI,ENRG,AGRWTH,A,1974,13.17354,
3,AUS,CPI,ENRG,AGRWTH,A,1975,19.42247,
4,AUS,CPI,ENRG,AGRWTH,A,1976,8.833195,


In [3]:
print(df_cpi.shape)

(294281, 8)


## Null Value Analysis

In [4]:
print("Analysis of Null values:")
df_cpi.info()

Analysis of Null values:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294281 entries, 0 to 294280
Data columns (total 8 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   LOCATION    294281 non-null  object 
 1   INDICATOR   294281 non-null  object 
 2   SUBJECT     294281 non-null  object 
 3   MEASURE     294281 non-null  object 
 4   FREQUENCY   294281 non-null  object 
 5   TIME        294281 non-null  object 
 6   Value       294281 non-null  float64
 7   Flag Codes  72 non-null      object 
dtypes: float64(1), object(7)
memory usage: 18.0+ MB


### Hence we see that there are only 72 non-null values in Flag Codes, so we drop it.

In [5]:
df_cpi = df_cpi.drop(columns=['Flag Codes'])
df_cpi.head()

Unnamed: 0,LOCATION,INDICATOR,SUBJECT,MEASURE,FREQUENCY,TIME,Value
0,AUS,CPI,ENRG,AGRWTH,A,1972,4.91007
1,AUS,CPI,ENRG,AGRWTH,A,1973,3.762801
2,AUS,CPI,ENRG,AGRWTH,A,1974,13.17354
3,AUS,CPI,ENRG,AGRWTH,A,1975,19.42247
4,AUS,CPI,ENRG,AGRWTH,A,1976,8.833195
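
The null counts read off `info()` above can also be computed directly, which makes the decision to drop a column easier to automate. This is a sketch on made-up data; the 0.5 threshold is an arbitrary assumption, not a value used in this notebook.

```python
import pandas as pd

# Made-up rows for illustration only
sample = pd.DataFrame({
    "LOCATION": ["AUS", "AUS", "AUT"],
    "Value": [4.9, 3.8, 2.1],
    "Flag Codes": [None, "E", None],
})

# Number and share of missing entries per column
null_counts = sample.isna().sum()
null_share = sample.isna().mean()

# Columns where more than half of the entries are missing (arbitrary threshold)
mostly_empty = null_share[null_share > 0.5].index.tolist()

print(null_counts)
print(mostly_empty)
```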


## Datatype Reset

In [6]:
df_cpi['LOCATION']=df_cpi['LOCATION'].astype('string')
df_cpi['TIME']=df_cpi['TIME'].astype('string')
df_cpi['MEASURE']=df_cpi['MEASURE'].astype('string')
df_cpi['SUBJECT']=df_cpi['SUBJECT'].astype('string')

In [7]:
df_cpi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294281 entries, 0 to 294280
Data columns (total 7 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   LOCATION   294281 non-null  string 
 1   INDICATOR  294281 non-null  object 
 2   SUBJECT    294281 non-null  string 
 3   MEASURE    294281 non-null  string 
 4   FREQUENCY  294281 non-null  object 
 5   TIME       294281 non-null  string 
 6   Value      294281 non-null  float64
dtypes: float64(1), object(2), string(4)
memory usage: 15.7+ MB
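
The four separate `astype` calls above could also be written as one call with a column-to-dtype mapping. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up rows for illustration only
sample = pd.DataFrame({
    "LOCATION": ["AUS", "AUT"],
    "SUBJECT": ["TOT", "FOOD"],
    "MEASURE": ["IDX2015", "AGRWTH"],
    "TIME": ["1972", "1973"],
    "Value": [4.9, 3.8],
})

string_cols = ["LOCATION", "SUBJECT", "MEASURE", "TIME"]

# DataFrame.astype accepts a dict mapping column names to target dtypes
sample = sample.astype({col: "string" for col in string_cols})

print(sample.dtypes)
```

Columns not named in the mapping, such as `Value`, keep their original dtype.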


## Unique Value Analysis

In [8]:
df_cpi.nunique()

LOCATION         52
INDICATOR         1
SUBJECT           4
MEASURE           2
FREQUENCY         3
TIME           1855
Value        247327
dtype: int64

#### Hence we can drop the INDICATOR column

In [9]:
df_cpi = df_cpi.drop(columns="INDICATOR")
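
Instead of reading the single-valued column off the `nunique()` output by eye, it can be detected programmatically. A sketch on made-up data:

```python
import pandas as pd

# Made-up rows for illustration only
sample = pd.DataFrame({
    "LOCATION": ["AUS", "AUT"],
    "INDICATOR": ["CPI", "CPI"],
    "Value": [1.0, 2.0],
})

# A column with exactly one distinct value carries no information
constant_cols = [c for c in sample.columns if sample[c].nunique(dropna=False) == 1]
sample = sample.drop(columns=constant_cols)

print(constant_cols)
```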

### FREQUENCY Unique Value Analysis

In [10]:
df_cpi['FREQUENCY'].unique()

array(['A', 'Q', 'M'], dtype=object)

We need to analyse the FREQUENCY column to see whether all countries have consistent data in the annual "A" format. We first set the range of years over which we will check.

In [11]:
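# Use only annual rows for both bounds: quarterly and monthly TIME values
# are not plain years and cannot be converted with int()
annual_time = df_cpi.loc[df_cpi["FREQUENCY"] == "A", "TIME"]
start_year = int(annual_time.min())
end_year = int(annual_time.max())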
print("Start Year",start_year)
print("End Year",end_year)

Start Year 1914
End Year 2022


In [12]:
# Collect every country that lacks an annual value for at least one year in the range
countries = []
for c in df_cpi["LOCATION"].unique():
    # Years for which this country has an annual observation
    is_annual = (df_cpi["LOCATION"] == c) & (df_cpi["FREQUENCY"] == "A")
    annual_years = set(df_cpi.loc[is_annual, "TIME"])
    if any(str(i) not in annual_years for i in range(start_year, end_year + 1)):
        countries.append(c)
print("Countries with missing values between",start_year,"and",end_year,":",len(countries))

Countries with missing values between 1914 and 2022 : 51
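
The count above says how many countries have gaps but not which years are missing. A possible variant of the check, sketched here on made-up data, records the missing years per country. The year range 2000 to 2002 and the name `missing` are illustrative assumptions.

```python
import pandas as pd

# Made-up rows for illustration only; BRA has no annual data at all
sample = pd.DataFrame({
    "LOCATION": ["AUS", "AUS", "AUS", "AUT", "AUT", "AUT", "BRA"],
    "FREQUENCY": ["A", "A", "A", "A", "A", "M", "M"],
    "TIME": ["2000", "2001", "2002", "2000", "2002", "2001-01", "2000-01"],
})

expected_years = {str(y) for y in range(2000, 2003)}

# Set of annual years observed for each country
annual = sample[sample["FREQUENCY"] == "A"]
years_by_country = annual.groupby("LOCATION")["TIME"].apply(set)

# Iterate over all locations so countries without any annual row are included
missing = {}
for loc in sample["LOCATION"].unique():
    gap = sorted(expected_years - years_by_country.get(loc, set()))
    if gap:
        missing[loc] = gap

print(missing)
```

Countries with complete coverage, such as AUS in the sample, do not appear in `missing`.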


In [13]:
df_cpi = df_cpi.loc[df_cpi["FREQUENCY"]=='A']

In [14]:
df_cpi["FREQUENCY"].unique()

array(['A'], dtype=object)

### MEASURE Analysis

In [15]:
df_cpi['MEASURE'].value_counts()

IDX2015    9068
AGRWTH     8942
Name: MEASURE, dtype: Int64

#### Let us now split this into two separate tables to perform our further analysis

In [16]:
df_cpi_idx = df_cpi.loc[df_cpi["MEASURE"] == "IDX2015"]
df_cpi_idx.head()

Unnamed: 0,LOCATION,SUBJECT,MEASURE,FREQUENCY,TIME,Value
144706,AUS,ENRG,IDX2015,A,1972,6.649913
144707,AUS,ENRG,IDX2015,A,1973,6.900136
144708,AUS,ENRG,IDX2015,A,1974,7.809128
144709,AUS,ENRG,IDX2015,A,1975,9.325853
144710,AUS,ENRG,IDX2015,A,1976,10.14962


In [17]:
df_cpi_agrowth = df_cpi.loc[df_cpi["MEASURE"] == "AGRWTH"]
df_cpi_agrowth.head()

Unnamed: 0,LOCATION,SUBJECT,MEASURE,FREQUENCY,TIME,Value
0,AUS,ENRG,AGRWTH,A,1972,4.91007
1,AUS,ENRG,AGRWTH,A,1973,3.762801
2,AUS,ENRG,AGRWTH,A,1974,13.17354
3,AUS,ENRG,AGRWTH,A,1975,19.42247
4,AUS,ENRG,AGRWTH,A,1976,8.833195


With the data split by measure, we now break each table down further by subject.

### SUBJECT Analysis

In [18]:
x = df_cpi_idx["SUBJECT"].value_counts()
x

TOT             2840
FOOD            2285
ENRG            1988
TOT_FOODENRG    1955
Name: SUBJECT, dtype: Int64

In [19]:
y = df_cpi_agrowth["SUBJECT"].value_counts()
y

TOT             2792
FOOD            2249
ENRG            1967
TOT_FOODENRG    1934
Name: SUBJECT, dtype: Int64

In [20]:
print("Difference in data (IDX-AGROWTH):")
x-y

Difference in data (IDX-AGROWTH):


TOT             48
FOOD            36
ENRG            21
TOT_FOODENRG    21
Name: SUBJECT, dtype: Int64
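
The same comparison can be produced in one table with `pd.crosstab`, without splitting the frame first. This is a sketch on made-up data; the `diff` column name is an assumption. One design difference: subtracting two `value_counts()` results yields a missing value for a subject that is absent from one measure, whereas `crosstab` fills such cells with 0.

```python
import pandas as pd

# Made-up rows for illustration only
sample = pd.DataFrame({
    "SUBJECT": ["TOT", "TOT", "TOT", "FOOD", "FOOD", "ENRG"],
    "MEASURE": ["IDX2015", "IDX2015", "AGRWTH", "IDX2015", "AGRWTH", "IDX2015"],
})

# Row counts per subject (rows) and measure (columns)
counts = pd.crosstab(sample["SUBJECT"], sample["MEASURE"])

# Positive values mean IDX2015 has more rows than AGRWTH for that subject
counts["diff"] = counts["IDX2015"] - counts["AGRWTH"]

print(counts)
```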

#### IDX2015 has more rows than AGRWTH for every subject, so we choose the IDX2015 measure to minimise data loss. Next we choose the subject to keep

In [21]:
df_cpi_idx['SUBJECT'].value_counts()

TOT             2840
FOOD            2285
ENRG            1988
TOT_FOODENRG    1955
Name: SUBJECT, dtype: Int64

#### We keep only the rows where SUBJECT is TOT, since we need only the total CPI

In [22]:
df_cpi_idx = df_cpi_idx.loc[df_cpi_idx["SUBJECT"]=="TOT"]

In [23]:
df_cpi_idx.nunique()

LOCATION       51
SUBJECT         1
MEASURE         1
FREQUENCY       1
TIME          109
Value        2787
dtype: int64

In [24]:
df_cpi_idx = df_cpi_idx.drop(columns=["SUBJECT","MEASURE","FREQUENCY"])
df_cpi_idx = df_cpi_idx.reset_index(drop=True)
df_cpi_idx.shape

(2840, 3)

In [25]:
df_cpi_idx.head()

Unnamed: 0,LOCATION,TIME,Value
0,AUS,1949,3.738101
1,AUS,1950,4.063153
2,AUS,1951,4.852566
3,AUS,1952,5.688414
4,AUS,1953,5.943812
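
Since the cleaned table is meant to be joined with other datasets on country and year, a few sanity checks before writing may be useful. The following is a sketch on made-up data. Casting TIME to int is an optional assumption that this notebook does not apply; it only works once non-annual values have been filtered out.

```python
import pandas as pd

# Made-up rows for illustration only
final = pd.DataFrame({
    "LOCATION": ["AUS", "AUS", "AUT"],
    "TIME": ["1949", "1950", "1949"],
    "Value": [3.74, 4.06, 5.10],
})

# True if any cell in the frame is missing
has_nulls = bool(final.isna().any().any())

# Each (LOCATION, TIME) pair should occur once to serve as a join key
n_duplicates = int(final.duplicated(subset=["LOCATION", "TIME"]).sum())

# Optional: store annual TIME values as integers
final["TIME"] = final["TIME"].astype(int)

print(has_nulls, n_duplicates)
```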


In [26]:
# Write the cleaned data to both output folders using the default text mode
df_cpi_idx.to_csv('../data/temp/inflation_cpi_cleaned.csv',index=False)
df_cpi_idx.to_csv('../data/cleaned/inflation_cpi_cleaned.csv',index=False)