In [1]:
import numpy as np
import pandas as pd

import constants as consts

In [2]:
# Allows for .py files to automatically reload.
%reload_ext autoreload
%autoreload 2

## Inflation Cleaning

In this notebook, we create a dataframe that shows the price of each fuel product after adjusting for inflation. The inflation data comes from [in2013dollars.com](https://www.in2013dollars.com/brazil/inflation/2004?endYear=2019&amount=1) and is expressed relative to 2004, so every adjusted price is stated in 2004 terms.
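
As a rough sketch of the adjustment itself: each nominal price is divided by the cumulative inflation factor for its year. The dictionary below is a hypothetical stand-in for `consts.inf_rates`, whose full contents are not shown in this notebook. Only the 2005 factor (1.07) appears in a later cell, and the 2004 factor of 1.0 is an assumption that follows from 2004 being the base year. The helper name `adjust_for_inflation` is also illustrative.

```python
# Hypothetical stand-in for consts.inf_rates: cumulative inflation factors keyed by year.
# 2005 -> 1.07 is the value shown later in this notebook; 2004 -> 1.0 is assumed (base year).
inf_rates = {2004: 1.0, 2005: 1.07}


def adjust_for_inflation(nominal_price, year, rates=inf_rates):
    """Express a nominal price in base-year (2004) terms."""
    return nominal_price / rates[year]


# A price of R$ 1.07 in 2005 corresponds to R$ 1.00 in 2004 terms.
price_in_2004_terms = adjust_for_inflation(1.07, 2005)
```

Prices from the base year are unchanged by this calculation, because their factor is 1.0.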

In [3]:
df = pd.read_csv('../../data/gas_prices_brazil/brazil_gas_cleaned.csv')

In [4]:
df.head()

Unnamed: 0.1,Unnamed: 0,First Day of Week,Last Day of Week,Macro Region,State,Type of Product,Number of Stations,Unit of Measurement,Mean Market Value,Std Dev,...,Mean Distribution Price,Distribution Standard Deviation,Distribution Min Price,Distribution Max Price,Distribution Variation Coefficient,Month,Year,Weeks Since First Day,Percent of Total Population in 2020,Population
0,0,2004-05-09,2004-05-15,CENTRO OESTE,DISTRITO FEDERAL,ETANOL HIDRATADO,127,R$/l,1.288,0.016,...,0.825,0.11,0.4201,0.9666,0.133,5,2004,1,1.4%,2051146
1,1,2004-05-16,2004-05-22,CENTRO OESTE,DISTRITO FEDERAL,ETANOL HIDRATADO,144,R$/l,1.271,0.039,...,0.823,0.111,0.4094,1.1931,0.135,5,2004,2,1.4%,2051146
2,2,2004-05-23,2004-05-29,CENTRO OESTE,DISTRITO FEDERAL,ETANOL HIDRATADO,129,R$/l,1.282,0.024,...,0.818,0.137,0.3879,1.0336,0.167,5,2004,3,1.4%,2051146
3,3,2004-05-30,2004-06-05,CENTRO OESTE,DISTRITO FEDERAL,ETANOL HIDRATADO,144,R$/l,1.373,0.051,...,0.894,0.147,0.4094,1.4206,0.164,5,2004,4,1.4%,2051146
4,4,2004-06-06,2004-06-12,CENTRO OESTE,DISTRITO FEDERAL,ETANOL HIDRATADO,129,R$/l,1.373,0.059,...,0.951,0.125,0.5169,1.115,0.131,6,2004,5,1.4%,2051146


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 106823 entries, 0 to 106822
Data columns (total 24 columns):
 #   Column                               Non-Null Count   Dtype  
---  ------                               --------------   -----  
 0   Unnamed: 0                           106823 non-null  int64  
 1   First Day of Week                    106823 non-null  object 
 2   Last Day of Week                     106823 non-null  object 
 3   Macro Region                         106823 non-null  object 
 4   State                                106823 non-null  object 
 5   Type of Product                      106823 non-null  object 
 6   Number of Stations                   106823 non-null  int64  
 7   Unit of Measurement                  106823 non-null  object 
 8   Mean Market Value                    106823 non-null  float64
 9   Std Dev                              106823 non-null  float64
 10  Min Price Observed                   106823 non-null  float64
 11  Max Price Obs

In [6]:
df.columns

Index(['Unnamed: 0', 'First Day of Week', 'Last Day of Week', 'Macro Region',
       'State', 'Type of Product', 'Number of Stations', 'Unit of Measurement',
       'Mean Market Value', 'Std Dev', 'Min Price Observed',
       'Max Price Observed', 'Mean Price Margin', 'Variation Coefficient',
       'Mean Distribution Price', 'Distribution Standard Deviation',
       'Distribution Min Price', 'Distribution Max Price',
       'Distribution Variation Coefficient', 'Month', 'Year',
       'Weeks Since First Day', 'Percent of Total Population in 2020',
       'Population'],
      dtype='object')

In [7]:
# Keeps all of the columns in the list.

df.drop(df.columns.difference([
    'Last Day of Week', 'Macro Region', 'State', 'Type of Product', 'Mean Distribution Price', 'Month', 'Year', 'Weeks Since First Day', 'Percent of Total Population in 2020'
]), axis = 1, inplace = True)

In [8]:
df.head()

Unnamed: 0,Last Day of Week,Macro Region,State,Type of Product,Mean Distribution Price,Month,Year,Weeks Since First Day,Percent of Total Population in 2020
0,2004-05-15,CENTRO OESTE,DISTRITO FEDERAL,ETANOL HIDRATADO,0.825,5,2004,1,1.4%
1,2004-05-22,CENTRO OESTE,DISTRITO FEDERAL,ETANOL HIDRATADO,0.823,5,2004,2,1.4%
2,2004-05-29,CENTRO OESTE,DISTRITO FEDERAL,ETANOL HIDRATADO,0.818,5,2004,3,1.4%
3,2004-06-05,CENTRO OESTE,DISTRITO FEDERAL,ETANOL HIDRATADO,0.894,5,2004,4,1.4%
4,2004-06-12,CENTRO OESTE,DISTRITO FEDERAL,ETANOL HIDRATADO,0.951,6,2004,5,1.4%


In [9]:
consts.inf_rates[2005]

1.07

In [10]:
df['Year'].map(type).value_counts()

<class 'int'>    106823
Name: Year, dtype: int64

In [11]:
df['Mean Distribution Price'].map(type).value_counts()

<class 'str'>    106823
Name: Mean Distribution Price, dtype: int64

Because the values in the Mean Distribution Price column are stored as strings, we need to convert the column to floats before we can perform any calculations on it.
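
A minimal sketch of this kind of conversion on a toy Series (the three values are illustrative, not taken from the dataset). It assumes that a lone `'-'` is the only non-numeric placeholder. `Series.replace` is used here because it matches whole cell values, whereas `Series.str.replace` substitutes substrings and would also alter a string such as `'1e-05'`.

```python
import pandas as pd

# Toy column: numeric strings plus a '-' placeholder for a missing price.
prices = pd.Series(['0.825', '-', '1.373'])

# Replace cells that are exactly '-' with '0', then cast the column to float.
as_float = prices.replace('-', '0').astype(float)
```

After the cast, the placeholder row holds `0.0` and the column has a floating-point dtype.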

In [12]:
df[df['Mean Distribution Price'] == '-']

Unnamed: 0,Last Day of Week,Macro Region,State,Type of Product,Mean Distribution Price,Month,Year,Weeks Since First Day,Percent of Total Population in 2020
291,2010-01-09,CENTRO OESTE,DISTRITO FEDERAL,ETANOL HIDRATADO,-,1,2010,296,1.4%
292,2010-01-16,CENTRO OESTE,DISTRITO FEDERAL,ETANOL HIDRATADO,-,1,2010,297,1.4%
294,2010-01-30,CENTRO OESTE,DISTRITO FEDERAL,ETANOL HIDRATADO,-,1,2010,299,1.4%
305,2010-04-17,CENTRO OESTE,DISTRITO FEDERAL,ETANOL HIDRATADO,-,4,2010,310,1.4%
306,2010-04-24,CENTRO OESTE,DISTRITO FEDERAL,ETANOL HIDRATADO,-,4,2010,311,1.4%
...,...,...,...,...,...,...,...,...,...
106612,2018-10-27,SUL,SANTA CATARINA,GNV,-,10,2018,755,3.4%
106636,2018-11-24,SUL,SANTA CATARINA,GNV,-,11,2018,759,3.4%
106642,2018-12-01,SUL,SANTA CATARINA,GNV,-,11,2018,760,3.4%
106660,2018-12-22,SUL,SANTA CATARINA,GNV,-,12,2018,763,3.4%


In [13]:
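# Converting the values in the 'Mean Distribution Price' column to floats.
# Missing prices are recorded as '-', so they are replaced with '0' before the cast.

df['Mean Distribution Price'] = df['Mean Distribution Price'].str.replace('-', '0').fillna(0).astype(float)
# downcast = 'float' stores the column as float32, which is why values such as 25.823999 appear later.
df['Mean Distribution Price'] = pd.to_numeric(df['Mean Distribution Price'], downcast = 'float')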
df['Mean Distribution Price'].map(type).value_counts()

<class 'float'>    106823
Name: Mean Distribution Price, dtype: int64

In [14]:
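# Divide each price by the inflation factor for its year.
# Series.map looks up every year in consts.inf_rates (assumed to be a dict-like mapping of year to factor).
inf_adj_dist_price = df['Mean Distribution Price'] / df['Year'].map(consts.inf_rates)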

In [15]:
df['Adjusted Mean Distribution Price'] = inf_adj_dist_price

In [16]:
# Making sure that the replaced values match the dataframe prior to the type conversion.

df[df['Mean Distribution Price'] == 0]

Unnamed: 0,Last Day of Week,Macro Region,State,Type of Product,Mean Distribution Price,Month,Year,Weeks Since First Day,Percent of Total Population in 2020,Adjusted Mean Distribution Price
291,2010-01-09,CENTRO OESTE,DISTRITO FEDERAL,ETANOL HIDRATADO,0.0,1,2010,296,1.4%,0.0
292,2010-01-16,CENTRO OESTE,DISTRITO FEDERAL,ETANOL HIDRATADO,0.0,1,2010,297,1.4%,0.0
294,2010-01-30,CENTRO OESTE,DISTRITO FEDERAL,ETANOL HIDRATADO,0.0,1,2010,299,1.4%,0.0
305,2010-04-17,CENTRO OESTE,DISTRITO FEDERAL,ETANOL HIDRATADO,0.0,4,2010,310,1.4%,0.0
306,2010-04-24,CENTRO OESTE,DISTRITO FEDERAL,ETANOL HIDRATADO,0.0,4,2010,311,1.4%,0.0
...,...,...,...,...,...,...,...,...,...,...
106612,2018-10-27,SUL,SANTA CATARINA,GNV,0.0,10,2018,755,3.4%,0.0
106636,2018-11-24,SUL,SANTA CATARINA,GNV,0.0,11,2018,759,3.4%,0.0
106642,2018-12-01,SUL,SANTA CATARINA,GNV,0.0,11,2018,760,3.4%,0.0
106660,2018-12-22,SUL,SANTA CATARINA,GNV,0.0,12,2018,763,3.4%,0.0


In [17]:
# Counting, for each product type, the number of (month, year) periods that contain at least one missing price.

df[df['Mean Distribution Price'] == 0].groupby(['Month', 'Year', 'Type of Product']).count().reset_index()['Type of Product'].value_counts()

GNV                 173
ETANOL HIDRATADO     85
GLP                  83
ÓLEO DIESEL S10      40
ÓLEO DIESEL          14
GASOLINA COMUM       11
Name: Type of Product, dtype: int64

## Addressing the Issue of Replacing NA Values with Zero

Because missing prices were replaced with zero, any average that includes those rows is pulled below the true price, and the inflation-adjusted price inherits the same error. The value counts in the cell above indicate which product types are most affected: GNV, ETANOL HIDRATADO, and GLP top the list.
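
The effect can be illustrated with a small, made-up example. The column names `price_zero_filled` and `price_nan` and the three prices are hypothetical. The sketch compares a group mean when a missing price is stored as 0 with the same mean when it is stored as `NaN`, which pandas skips by default.

```python
import numpy as np
import pandas as pd

# Three weekly prices for one product; the middle week is missing.
toy = pd.DataFrame({
    'Type of Product': ['GNV', 'GNV', 'GNV'],
    'price_zero_filled': [1.5, 0.0, 1.5],   # missing week stored as 0
    'price_nan': [1.5, np.nan, 1.5],        # missing week stored as NaN
})

# The zero is averaged in as if it were a real price; the NaN is ignored.
means = toy.groupby('Type of Product').mean()
```

Here the zero-filled mean is 1.0 while the `NaN`-based mean is 1.5, so the size of the bias grows with the share of missing weeks in each group.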

In [18]:
zero_adj_df = df.groupby(['Last Day of Week', 'Year', 'Month', 'Macro Region', 'State', 'Type of Product', 'Percent of Total Population in 2020']).mean().reset_index()

In [19]:
# Randomly select ten rows to check that our calculation is working as intended.

zero_adj_df.sample(10)

Unnamed: 0,Last Day of Week,Year,Month,Macro Region,State,Type of Product,Percent of Total Population in 2020,Mean Distribution Price,Weeks Since First Day,Adjusted Mean Distribution Price
73848,2015-04-18,2015,4,NORTE,RORAIMA,ETANOL HIDRATADO,0.3%,2.577,571.0,1.385484
1217,2004-07-17,2004,7,SUL,PARANA,GLP,5.4%,25.823999,10.0,25.823999
29306,2008-12-06,2008,11,CENTRO OESTE,MATO GROSSO,GLP,1.7%,34.592999,239.0,28.354917
78164,2015-11-07,2015,11,SUL,SANTA CATARINA,GLP,3.4%,43.0,600.0,23.11828
14096,2006-07-29,2006,7,SUDESTE,RIO DE JANEIRO,GLP,8.2%,26.384001,116.0,23.76937
95089,2017-12-30,2017,12,SUL,SANTA CATARINA,ETANOL HIDRATADO,3.4%,3.142,712.0,1.503349
72735,2015-02-28,2015,2,NORDESTE,PERNAMBUCO,GNV,4.5%,1.422,564.0,0.764516
99897,2018-08-11,2018,8,SUDESTE,RIO DE JANEIRO,ÓLEO DIESEL,8.2%,3.03,744.0,1.402778
72066,2015-01-24,2015,1,SUL,SANTA CATARINA,GNV,3.4%,1.6,559.0,0.860215
76063,2015-08-01,2015,7,NORDESTE,ALAGOAS,GNV,1.6%,1.43,586.0,0.768817


In [20]:
# zero_adj_df.to_csv(path_or_buf = '../../data/gas_prices_brazil/brazil_gas_inflation.csv')

## Note About Tableau Visualization Using this Table

Because the 0 values distorted the plot of 'Adjusted Mean Distribution Price' over time, the following calculated field was created in Tableau so that they are treated as missing values rather than as prices.

IF [Adjusted Mean Distribution Price] != 0 THEN [Adjusted Mean Distribution Price]
ELSE NULL
END
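
For reference, a comparable step could be done in pandas before export. This is only a sketch on a toy Series with illustrative values and is not part of the notebook's pipeline. `Series.mask` replaces every entry where the condition holds with `NaN`.

```python
import pandas as pd

# Toy stand-in for the 'Adjusted Mean Distribution Price' column.
adjusted = pd.Series([1.385, 0.0, 0.765], name='Adjusted Mean Distribution Price')

# Turn zeros into NaN so they are not treated as real prices.
masked = adjusted.mask(adjusted == 0)
```

With this approach the missing weeks are written to the CSV as empty cells, which Tableau reads as nulls.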