## Tax Revenue

In [1]:
import pandas as pd
import numpy as np

This data covers tax revenue for all OECD countries. The CSV file was downloaded by the `driver.ipynb` notebook, which fetches the data and places it in the `data/uncleaned` folder. In this notebook, we open the file, clean the data and write a new CSV to the `data/temp` folder.

In [2]:
tax_revenue = pd.read_csv('../data/uncleaned/tax_revenue.csv')
print(tax_revenue.shape)
tax_revenue.head()

(3549, 8)


Unnamed: 0,LOCATION,INDICATOR,SUBJECT,MEASURE,FREQUENCY,TIME,Value,Flag Codes
0,AUS,TAXREV,TOT,PC_GDP,A,1965,20.609,
1,AUS,TAXREV,TOT,PC_GDP,A,1966,19.829,
2,AUS,TAXREV,TOT,PC_GDP,A,1967,20.409,
3,AUS,TAXREV,TOT,PC_GDP,A,1968,20.34,
4,AUS,TAXREV,TOT,PC_GDP,A,1969,20.681,


In [3]:
tax_revenue.FREQUENCY.unique()

array(['A'], dtype=object)

In [4]:
tax_revenue.SUBJECT.unique()

array(['TOT'], dtype=object)

The `Flag Codes` column is not needed for our purposes, and since all of the data is annual, we can drop `FREQUENCY` as well. We already know that this data is tax revenue, so `INDICATOR` can go too. Finally, `SUBJECT` has only one value (`TOT`), so we drop that column as well.

In [5]:
tax_revenue = tax_revenue.drop(['Flag Codes','FREQUENCY','INDICATOR','SUBJECT'], axis=1)
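The manual `unique()` checks above can also be automated. The following is a minimal sketch on a small made-up frame (`demo` and its values are hypothetical, not the OECD data) that finds every column holding a single distinct value and drops those columns in one step:

```python
import pandas as pd

# Hypothetical toy frame standing in for the raw extract
demo = pd.DataFrame({
    'LOCATION': ['AUS', 'AUS', 'AUT'],
    'SUBJECT': ['TOT', 'TOT', 'TOT'],
    'FREQUENCY': ['A', 'A', 'A'],
    'Value': [20.609, 19.829, 33.1],
})

# nunique() counts the distinct non-null values in each column
counts = demo.nunique()
constant_cols = counts[counts == 1].index.tolist()
print(constant_cols)

# Drop the constant columns by name
trimmed = demo.drop(columns=constant_cols)
print(trimmed.columns.tolist())
```

Note that `nunique()` ignores nulls by default, so a column that is entirely null reports 0 distinct values and would not be caught by the `== 1` test.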

In [6]:
tax_revenue.MEASURE.unique()

array(['PC_GDP', 'MLN_USD'], dtype=object)

Here, we are only concerned with the measure `MLN_USD`, so we keep only those rows.

In [7]:
tax_revenue = tax_revenue[(tax_revenue['MEASURE']=='MLN_USD')]

We reset the index of this dataframe, since the filter removed many rows and left gaps in the index labels.

In [8]:
tax_revenue.reset_index(drop=True, inplace=True)
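To see why the reset matters, here is a small sketch on made-up rows (the frame `demo` and its values are assumptions for illustration). A boolean filter keeps the original index labels, and `reset_index(drop=True)` renumbers them from zero:

```python
import pandas as pd

# Hypothetical rows alternating between the two measures
demo = pd.DataFrame({
    'LOCATION': ['AUS', 'AUS', 'AUT', 'AUT'],
    'MEASURE': ['PC_GDP', 'MLN_USD', 'PC_GDP', 'MLN_USD'],
    'Value': [20.609, 5.608, 33.0, 7.1],
})

# The filtered frame keeps the labels of the rows that survived
filtered = demo[demo['MEASURE'] == 'MLN_USD']
print(filtered.index.tolist())

# drop=True discards the old labels instead of adding them as a column
usd_only = filtered.reset_index(drop=True)
print(usd_only.index.tolist())
```

Returning a new frame rather than using `inplace=True` is a style choice in this sketch; both approaches give the same result.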

In [9]:
tax_revenue.head()

Unnamed: 0,LOCATION,MEASURE,TIME,Value
0,AUS,MLN_USD,1965,5.608
1,AUS,MLN_USD,1966,5.996
2,AUS,MLN_USD,1967,6.631
3,AUS,MLN_USD,1968,7.405
4,AUS,MLN_USD,1969,8.488


Checking the total number of unique elements per column:

In [10]:
tax_revenue.nunique()

LOCATION      39
MEASURE        1
TIME          57
Value       1684
dtype: int64

Now, we can check the datatypes of our columns.

In [11]:
tax_revenue.dtypes

LOCATION     object
MEASURE      object
TIME          int64
Value       float64
dtype: object

`LOCATION` and `MEASURE` are stored as generic `object` columns and `TIME` as `int64`. We want all three as pandas `string` columns, so we convert them as shown below:

In [12]:
tax_revenue['LOCATION']= tax_revenue['LOCATION'].astype('string')
tax_revenue['MEASURE']= tax_revenue['MEASURE'].astype('string')
tax_revenue['TIME']= tax_revenue['TIME'].astype('string')

When we recheck the datatypes, we can see that they have been changed.

In [13]:
tax_revenue.dtypes

LOCATION     string
MEASURE      string
TIME         string
Value       float64
dtype: object
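The same conversion can be expressed as a single `astype` call with a column-to-dtype mapping. This is a sketch on a hypothetical two-row frame, not the notebook's data:

```python
import pandas as pd

# Hypothetical frame with the same column layout as the cleaned data
demo = pd.DataFrame({
    'LOCATION': ['AUS', 'AUT'],
    'MEASURE': ['MLN_USD', 'MLN_USD'],
    'TIME': [1965, 1979],
    'Value': [5.608, 7.1],
})

# A dict converts several columns at once and leaves the others alone
converted = demo.astype({
    'LOCATION': 'string',
    'MEASURE': 'string',
    'TIME': 'string',
})
print(converted.dtypes)
print(converted['TIME'].tolist())
```

Once `TIME` is a string, `min()` and `max()` compare lexicographically. That agrees with numeric order here only because every year has four digits.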

To be thorough, we check if any values in our dataframe are null, in case we need to handle them.

In [14]:
tax_revenue.isna().sum()

LOCATION    0
MEASURE     0
TIME        0
Value       0
dtype: int64

No column in the dataframe is missing any values, so we don't need to handle anything here.

Finally, our table is shaped as below:

In [15]:
tax_revenue.shape

(1697, 4)

We now need to check whether we are missing any data. For each unique value of `LOCATION` in our dataframe, we find the first year for which data is available, the last year for which data is available, and any missing years within that range.

In [16]:
# Track the latest first year and the earliest last year across all countries
highestmin = 1
highestmincountry = ''
lowestmax = 9999
lowestmaxcountry = ''

for country in tax_revenue.LOCATION.unique():
    tempdf = tax_revenue[tax_revenue['LOCATION'] == country]

    # TIME is stored as a string, so convert to int before comparing
    minyear = int(tempdf.TIME.min())
    if highestmin < minyear:
        highestmin = minyear
        highestmincountry = country

    maxyear = int(tempdf.TIME.max())
    if lowestmax > maxyear:
        lowestmax = maxyear
        lowestmaxcountry = country

    # Collect every year between the first and last year that has no row
    missingyear = pd.Series([], dtype='string')
    for i in range(minyear, maxyear):
        if not (tempdf['TIME'] == str(i)).any():
            tempseries = pd.Series([str(i)])
            missingyear = pd.concat([missingyear, tempseries])
            missingyear.reset_index(inplace=True, drop=True)

    print('Country = ', country, 'Minimum Year = ', minyear, 'Maximum Year = ', maxyear, 'Missing Years = ', missingyear)

print('Highest Min = ', highestmin, ' Country = ', highestmincountry)
print('Lowest Max = ', lowestmax, ' Country = ', lowestmaxcountry)

Country =  AUS Minimum Year =  1965 Maximum Year =  2020 Missing Years =  Series([], dtype: string)
Country =  AUT Minimum Year =  1979 Maximum Year =  2021 Missing Years =  Series([], dtype: string)
Country =  BEL Minimum Year =  1979 Maximum Year =  2021 Missing Years =  Series([], dtype: string)
Country =  CAN Minimum Year =  1965 Maximum Year =  2021 Missing Years =  Series([], dtype: string)
Country =  CZE Minimum Year =  1993 Maximum Year =  2021 Missing Years =  Series([], dtype: string)
Country =  DNK Minimum Year =  1965 Maximum Year =  2021 Missing Years =  Series([], dtype: string)
Country =  FIN Minimum Year =  1979 Maximum Year =  2021 Missing Years =  Series([], dtype: string)
Country =  FRA Minimum Year =  1979 Maximum Year =  2021 Missing Years =  Series([], dtype: string)
Country =  DEU Minimum Year =  1979 Maximum Year =  2021 Missing Years =  Series([], dtype: string)
Country =  GRC Minimum Year =  1979 Maximum Year =  2021 Missing Years =  Series([], dtype: string)
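The same coverage check can be written more compactly with `groupby` and set arithmetic. The sketch below is an alternative formulation, not the notebook's method; the helper `year_coverage`, the frame `demo` and the country codes `AAA` and `BBB` are all hypothetical:

```python
import pandas as pd

# Hypothetical rows: 'BBB' has a gap at 2002
demo = pd.DataFrame({
    'LOCATION': ['AAA', 'AAA', 'AAA', 'BBB', 'BBB', 'BBB'],
    'TIME': ['2000', '2001', '2002', '2001', '2003', '2004'],
})

def year_coverage(df):
    """Return {country: (first_year, last_year, missing_years)}."""
    coverage = {}
    for country, group in df.groupby('LOCATION'):
        # TIME holds strings, so convert each year to a plain int
        years = {int(y) for y in group['TIME']}
        first, last = min(years), max(years)
        # Years expected in the range minus the years actually present
        missing = sorted(set(range(first, last + 1)) - years)
        coverage[country] = (first, last, missing)
    return coverage

summary = year_coverage(demo)
for country, (first, last, missing) in summary.items():
    print(country, first, last, missing)

# The window covered by every country: latest start, earliest end
common_start = max(first for first, _, _ in summary.values())
common_end = min(last for _, last, _ in summary.values())
print(common_start, common_end)
```

The `common_start` and `common_end` values play the same role as the "Highest Min" and "Lowest Max" figures printed by the loop above.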


Now that we have cleaned our data, we can output it to our final CSV file.

In [None]:
from pathlib import Path

# `tax_revenue` is the cleaned dataframe built above
filepath = Path('../data/temp/tax_revenue_cleaned.csv')
tax_revenue.to_csv(filepath)
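One thing to keep in mind when reading the cleaned file back: a CSV does not store dtypes, so pandas infers them again on load. The sketch below illustrates this with a hypothetical two-row frame written to a temporary directory; the file name `demo_cleaned.csv` and the use of `index=False` are choices made for this example only:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical cleaned frame with string-typed LOCATION and TIME
demo = pd.DataFrame({
    'LOCATION': ['AUS', 'AUT'],
    'TIME': ['1965', '1979'],
    'Value': [5.608, 7.1],
}).astype({'LOCATION': 'string', 'TIME': 'string'})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'demo_cleaned.csv'
    demo.to_csv(path, index=False)

    # Without hints, the year column is inferred as an integer again
    plain = pd.read_csv(path)
    # Passing dtype restores the string columns on load
    typed = pd.read_csv(path, dtype={'LOCATION': 'string', 'TIME': 'string'})

print(plain.dtypes)
print(typed.dtypes)
```

Any notebook that consumes the cleaned CSV would therefore need to pass `dtype` to `read_csv`, or repeat the `astype` step, if it relies on `TIME` being a string.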
