In [4]:
import numpy as np
import pandas as pd
import os
import urllib.request  # needed by the PDF download cell below

In [5]:
pd.options.display.max_columns = None

# Tax Data

Here we download, convert, clean and organize **Tax Data** from the Pakistan Federal Board of Revenue (FBR). The text files that we manipulate in this notebook can be found in the "taxpayerData" folder. The original PDFs can be found here: https://www.fbr.gov.pk/Categ/income-tax-directory/742.

Pakistan first started releasing tax data in 2014 in the form of PDFs listing each individual's name, registration / CNIC / NTN number (depending on the year), and amount of tax paid. The country has only publicly released tax data for tax years 2013 - 2017. Note that the tax year in Pakistan runs from July 1st to June 30th.

First, I run a script to download the original PDFs from the website above so you have them locally. Second, I convert each PDF to text. Finally, I scrape and organize the data from the text files.

The names of the relevant data are listed below. See the bottom of this section for code to download these tables and notes for others to build on this.

**Final Data**

**Final Graphs and Tables**

## PDF Data Download

The original tax PDF files are too large to ship with this notebook, so we download them locally here. Files that have already been downloaded under the expected names are skipped.

WARNING: Each PDF is about 20,000 pages and takes approximately 10 minutes to download. By the end we will have downloaded more than 88,000 pages.

In [6]:
folder_name = "taxpayerData"
tax_links = {
    "2013_ParliamentarianTax": "http://download1.fbr.gov.pk/Docs/201469962916404PARLIMENTARIANSTAXDIRECTORY2013Dt.09.06.2014.pdf",
    "2014_ParliamentarianTax": "http://download1.fbr.gov.pk/Docs/201569963615881PARLIMENTARIANSTAXDIRECTORY2014Dt10042015.pdf",
    "2015_ParliamentarianTax": "http://download1.fbr.gov.pk/Docs/2016991991929997TaxDirectory-Parliamentarians2015.pdf",
    "2016_ParliamentarianTax": "http://download1.fbr.gov.pk/Docs/2017727974535733TaxDirectory-Parliamentarians2016.pdf",
    "2017_ParliamentarianTax": "http://download1.fbr.gov.pk/Docs/201922213121521755Parliamentarians.pdf",
    "2013_Tax": "http://download1.fbr.gov.pk/Docs/AllTaxpayersDirectory2013-2.pdf",
    "2014_Tax": "http://download1.fbr.gov.pk/Docs/20154101641715153TaxDirectoryofallTaxpayersforTaxYear2014.pdf",
    "2015_Tax": "http://download1.fbr.gov.pk/Docs/2016991591713355ALLTAXPAYERSTAXDIRECTORY2015.pdf",
    "2016_Tax": "http://download1.fbr.gov.pk/Docs/201781188488931ALL-Taxpayer-Directory2016.pdf",
    "2017_Tax": "http://download1.fbr.gov.pk/Docs/201922215123116259AllTaxpayerDirectoryJune2017.pdf"
}

if not os.path.exists("./" + folder_name + "/"):
    os.makedirs("./" + folder_name + "/")
    
for k in tax_links.keys():
    if os.path.exists("./" + folder_name + "/" + k + ".pdf"):
        print("file already exists:", k)
        continue
    link = tax_links[k]
    response = urllib.request.urlopen(str(link))
    with open("./" + folder_name + "/" + k + ".pdf", 'wb') as f:
        f.write(response.read())
    print("file downloaded:", k)

file already exists: 2017_ParliamentarianTax
file already exists: 2017_Tax
file already exists: 2013_ParliamentarianTax
file already exists: 2013_Tax
file already exists: 2016_Tax
file already exists: 2015_Tax
file already exists: 2016_ParliamentarianTax
file already exists: 2015_ParliamentarianTax
file already exists: 2014_ParliamentarianTax
file already exists: 2014_Tax
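Because each directory is a very large PDF, reading the whole response into memory with `response.read()` can be costly. Below is a minimal sketch of a streaming alternative. The helper name `download_if_missing` and its `opener` parameter are illustrative assumptions, not part of the original notebook.

```python
import os
import shutil
import urllib.request


def download_if_missing(url, dest_path, opener=urllib.request.urlopen):
    """Stream `url` to `dest_path` unless the file already exists.

    Returns True if a download happened, False if it was skipped.
    `opener` is a hypothetical hook so the function can be tried offline.
    """
    if os.path.exists(dest_path):
        return False
    folder = os.path.dirname(dest_path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    # Write to a temporary name first so an interrupted download
    # is not mistaken for a complete file on the next run.
    tmp_path = dest_path + ".part"
    with opener(url) as response, open(tmp_path, "wb") as out:
        shutil.copyfileobj(response, out)  # copies in chunks, not all at once
    os.replace(tmp_path, dest_path)
    return True
```

It could be called once per entry, for example `download_if_missing(tax_links[k], "./taxpayerData/" + k + ".pdf")`.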


## PDF to Text Data Conversion

The text files should have been downloaded with this notebook, but if not you can run the converter here. Most PDF-to-text converters take far too long to process this much data (PDFMinerSix took about 5 hours for one PDF, Tabula never completed). I found an open source implementation and adjusted it to work in under 4 minutes per PDF.

Drag all taxpayer PDFs for which you do not have a .txt file into taxpayerData/txtCreateFolder.

Uncomment the script below and run it to create the .txt files. Once done, drag the .txt files back into the taxpayerData folder.

In [7]:
# !"./tools/convertmyfiles.sh"
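To decide which PDFs need to go into taxpayerData/txtCreateFolder, it may help to list the PDFs that have no matching .txt file. This is a small sketch under the assumption that each text file shares its PDF's base name (as with 2013_Tax.pdf and 2013_Tax.txt). The function name `pdfs_missing_text` is hypothetical.

```python
import os


def pdfs_missing_text(folder):
    """Return sorted PDF filenames in `folder` that have no matching .txt file."""
    names = os.listdir(folder)
    # Base names (without extension) of every text file already present.
    txt_stems = {os.path.splitext(n)[0] for n in names if n.lower().endswith(".txt")}
    return sorted(
        n for n in names
        if n.lower().endswith(".pdf") and os.path.splitext(n)[0] not in txt_stems
    )
```

Calling `pdfs_missing_text("./taxpayerData")` would then give the files that still need converting.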

## Parsing the Full Tax Data

Now we parse the .txt files, creating lists of the taxes each taxpayer paid (from which we later back-calculate income). Note that parliamentarians' records are included in the yearly tax files.

I have also downloaded FBR Tax Yearbooks in taxpayerData/supplemental_tax_information. They hold important summary statistics for data that hasn't been publicly released (like the percent of direct taxes that are withholding taxes).

**Helper Functions**

In [9]:
def is_comma_separated_number(w):
    if "," in w:
        potential_nums = w.split(",")
        for num in potential_nums:
            if not num.isdigit():
                return False
        return True
    else:
        return False
    
def three_digit_or_less_number(w):
    # True for a string of one to three digits, e.g. "0" or "750"; False otherwise.
    return 1 <= len(w) <= 3 and w.isdigit()
    
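def calculate_taxable_income(tax_paid, tax_levels, top_rate):
    """Back out taxable income from the amount of tax paid.

    tax_levels is a list of (upper income threshold, maximum tax payable
    within the bracket, marginal rate) tuples in ascending order; top_rate
    applies to income above the last threshold.
    """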
    i = 0
    while tax_paid > 0 and i < len(tax_levels):
        tax_paid -= tax_levels[i][1]
        i += 1
    i -= 1
    if tax_paid < 0:
        to_subtract_from_top = int(abs(tax_paid) / tax_levels[i][2])
        return tax_levels[i][0] - to_subtract_from_top
    elif tax_paid == 0:
        return tax_levels[i][0]
    else:
        to_add_to_top = tax_paid / top_rate
        return tax_levels[len(tax_levels) - 1][0] + to_add_to_top
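To see how `calculate_taxable_income` inverts a bracket schedule, here is a small worked example. The function is repeated only so the snippet runs on its own, and the brackets are the 2012 - 2013 non-salaried schedule used later in this notebook. The three tax amounts are made-up inputs chosen to hit each branch.

```python
def calculate_taxable_income(tax_paid, tax_levels, top_rate):
    # Same logic as the helper above: peel off one bracket's maximum tax at a time.
    i = 0
    while tax_paid > 0 and i < len(tax_levels):
        tax_paid -= tax_levels[i][1]
        i += 1
    i -= 1
    if tax_paid < 0:
        # Overshot: the filer sits inside bracket i, below its upper threshold.
        to_subtract_from_top = int(abs(tax_paid) / tax_levels[i][2])
        return tax_levels[i][0] - to_subtract_from_top
    elif tax_paid == 0:
        return tax_levels[i][0]
    else:
        # Tax left over after every bracket is taxed at the top rate.
        to_add_to_top = tax_paid / top_rate
        return tax_levels[len(tax_levels) - 1][0] + to_add_to_top


brackets = [
    (500000, 7500, 0.05),
    (750000, 25000, 0.10),
    (1000000, 37500, 0.15),
    (1500000, 100000, 0.20),
]

# 32,500 = 7,500 + 25,000 exactly exhausts the first two brackets.
print(calculate_taxable_income(32500, brackets, 0.25))   # 750000

# 270,000 exceeds all four brackets (170,000 in total) by 100,000,
# which at the 25% top rate corresponds to 400,000 of extra income.
print(calculate_taxable_income(270000, brackets, 0.25))  # 1900000.0

# 20,000 ends partway through the 10% bracket: 12,500 short of the bracket
# maximum, i.e. roughly 125,000 of income below the 750,000 threshold.
partial = calculate_taxable_income(20000, brackets, 0.25)
print(partial)
```

By hand, the last case works out to about 625,000 PKR. The code truncates the division with `int`, so the result can differ from the hand calculation by a rupee in some cases.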

**Relevant Tables for Statistics**

In [10]:
labor_force_total = pd.read_excel("./miscData/labor_force_total.xls")
national_income = pd.read_excel("./miscData/national_income.xlsx")

### 2012 - 2013

First we collect the amount of taxes paid and NTN numbers of the taxpayers. I've built custom parsers for each year as the format changes.

In [11]:
f = open('./taxpayerData/2013_Tax.txt', 'r')

in_individuals_section = False

taxes_paid_2012_13 = []
registration_numbers_2012_13 = set()

line_number = 0
for line in f:
    if "INDIVIDUAL" in line and len(line) == 11:
        in_individuals_section = True
        continue
    if in_individuals_section:
        line = line.strip().split(" ")
        if line_number < 5:
            line_number += 1
            continue
        
#         print(line)
        for w in line:
            if w == "0":  # tokens are strings, so compare against "0"
                continue
            elif "-" in w:
                registration_numbers_2012_13.add(w)
            elif is_comma_separated_number(w) or three_digit_or_less_number(w):
                tax_paid = int(w.replace(',', ''))
                taxes_paid_2012_13.append(tax_paid)
        
        line_number += 1
#         if line_number % 5000 == 0:
#                 print("%.2f%%" % (len(registration_numbers_2012_13)  * 100 / 728346))
        
#         if line_number == 10:
#             break
f.close()
taxes_paid_2012_13 = sorted(x for x in taxes_paid_2012_13 if x != 0)
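To make the token rules in the 2012 - 2013 parsing loop above concrete, the sketch below applies the same branch order to a single line. The sample line is hypothetical and is not copied from the PDF. `classify_token` is an illustrative name, and the two helpers are repeated so the snippet runs on its own.

```python
def is_comma_separated_number(w):
    # True for tokens such as "12,500" where every comma-separated part is digits.
    return "," in w and all(part.isdigit() for part in w.split(","))


def three_digit_or_less_number(w):
    return 1 <= len(w) <= 3 and w.isdigit()


def classify_token(w):
    """Mirror the branch order of the 2012 - 2013 parsing loop."""
    if "-" in w:
        return "registration"
    if is_comma_separated_number(w) or three_digit_or_less_number(w):
        return "tax_paid"
    return "ignored"


sample_line = "MUHAMMAD ALI 1234567-8 12,500"  # hypothetical line, for illustration only
labels = [(w, classify_token(w)) for w in sample_line.strip().split(" ")]
print(labels)
# [('MUHAMMAD', 'ignored'), ('ALI', 'ignored'),
#  ('1234567-8', 'registration'), ('12,500', 'tax_paid')]
```

Two consequences of these rules are worth keeping in mind when checking the counts. Any token containing a hyphen is treated as a registration number. Any bare number of three digits or fewer (`classify_token("42")` returns `"tax_paid"`) is treated as a tax amount.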

Now we calculate some quick statistics related to this year's taxes. Note that the sample we have is all individual filers who paid "voluntary payments" or "withholding tax". Those who filed and paid more than 0 PKR in taxes are taxpayers.

In [12]:
num_filers_2012_13 = len(registration_numbers_2012_13)
num_taxpayers_2012_13 = len(taxes_paid_2012_13)

proportion_filers_2012_13 = num_filers_2012_13 / ((labor_force_total["2012"][1] + labor_force_total["2013"][1]) / 2)
proportion_taxpayers_2012_13 = num_taxpayers_2012_13 / ((labor_force_total["2012"][1] + labor_force_total["2013"][1]) / 2)

Now, from the "taxes_paid_2012_13" list that we just scraped, we can back-calculate the taxable income of each individual, assuming all of their income was either salaried or non-salaried taxable income. We generate this for later analysis in the "CombiningData" notebook, where we will analyze whether this is a robust assumption. We do some analysis in the "Understanding the Tax Data" section, under "Assessing Income Calculations", to understand how we can improve our estimates of taxable and actual income.

In [13]:
# top_rate = 0.20
tax_levels_salaried_2012_13 = [
        (400000, 750, 0.015),
        (450000, 1250, 0.025),
        (550000, 3500, 0.035),
        (650000, 4500, 0.045),
        (750000, 6000, 0.06),
        (900000, 11250, 0.075),
        (1050000, 13500, 0.09),
        (1200000, 15000, 0.10),
        (1450000, 27500, 0.11),
        (1700000, 31250, 0.125),
        (1950000, 35000, 0.14),
        (2250000, 45000, 0.15),
        (2850000, 96000, 0.16),
        (3550000, 122500, 0.175),
        (4550000, 185000, 0.185),
    ]

# top_rate = 0.25
tax_levels_nonsalaried_2012_13 = [
        (500000, 7500, 0.05),
        (750000, 25000, 0.10),
        (1000000, 37500, 0.15),
        (1500000, 100000, 0.20)
    ]

The above rates and threshold sizes are for this tax year. You can find them in taxpayerData/supplemental_tax_information, where I have downloaded EY's Pakistan tax profile for these years.
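The middle element of each tuple is the tax owed on a full bracket, that is, the bracket width times its rate. As a cross-check on hand-entered tables, the sketch below derives that element from the thresholds and rates. `build_tax_levels` is a hypothetical helper. The 350,000 PKR lower bound is inferred from the table itself (7,500 / 0.05 = 150,000 PKR below the 500,000 threshold) rather than read from the EY profile.

```python
def build_tax_levels(exemption, brackets):
    """Turn (upper_threshold, rate) pairs into (upper_threshold, max_tax, rate) tuples."""
    levels = []
    lower = exemption
    for upper, rate in brackets:
        # Tax owed by someone whose income spans the whole bracket.
        max_tax = round((upper - lower) * rate)
        levels.append((upper, max_tax, rate))
        lower = upper
    return levels


nonsalaried_2012_13 = build_tax_levels(
    350000,  # assumed lower bound of the first bracket (inferred, see above)
    [(500000, 0.05), (750000, 0.10), (1000000, 0.15), (1500000, 0.20)],
)
print(nonsalaried_2012_13)
# [(500000, 7500, 0.05), (750000, 25000, 0.1),
#  (1000000, 37500, 0.15), (1500000, 100000, 0.2)]
```

The result matches `tax_levels_nonsalaried_2012_13` as entered above, so the same check could be run on the other schedules.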

In [14]:
taxable_incomes_salaried_2012_13 = [calculate_taxable_income(i, tax_levels_salaried_2012_13, 0.2) for i in taxes_paid_2012_13]
taxable_incomes_nonsalaried_2012_13 = [calculate_taxable_income(i, tax_levels_nonsalaried_2012_13, 0.25) for i in taxes_paid_2012_13]

### 2013 - 2014

We now repeat the process for the rest of the available tax years.

In [15]:
f = open('./taxpayerData/2014_Tax.txt', 'r')

in_individuals_section = False
incomes_begin = False

taxes_paid_2013_14 = []
registration_numbers_2013_14 = set()

line_number = 0
for line in f:
    if "INDIVIDUAL" in line and len(line) == 11:
        in_individuals_section = True
        continue
    if in_individuals_section:
        line = line.strip().split(" ")
        if line_number < 2:
            line_number += 1
            continue
            
        if "174,783" in line:
            incomes_begin = True
#         print(line)
        for w in line:
            if w == "0":  # tokens are strings, so compare against "0"
                continue
            elif w.isdigit() and len(w) > 3:
                registration_numbers_2013_14.add(w)
            elif (is_comma_separated_number(w) or three_digit_or_less_number(w)) and incomes_begin:
                tax_paid = int(w.replace(',', ''))
                taxes_paid_2013_14.append(tax_paid)
        
        line_number += 1
#         if line_number % 5000 == 0:
#             print("%.2f%%" % (len(registration_numbers_2013_14)  * 100 / 780462))
        
#         if line_number == 10:
#             break
f.close()
taxes_paid_2013_14 = sorted(x for x in taxes_paid_2013_14 if x != 0)

In [16]:
num_filers_2013_14 = len(registration_numbers_2013_14)
num_taxpayers_2013_14 = len(taxes_paid_2013_14)

proportion_filers_2013_14 = num_filers_2013_14 / ((labor_force_total["2013"][1] + labor_force_total["2014"][1]) / 2)
proportion_taxpayers_2013_14 = num_taxpayers_2013_14 / ((labor_force_total["2013"][1] + labor_force_total["2014"][1]) / 2)

In [17]:
# top_rate = 0.30
tax_levels_salaried_2013_14 = [
        (750000, 17500, 0.05),
        (1400000, 65000, 0.10),
        (1500000, 12500, 0.125),
        (1800000, 45000, 0.15),
        (2500000, 122500, 0.175),
        (3000000, 100000, 0.20),
        (3500000, 112500, 0.225),
        (4000000, 125000, 0.25),
        (7000000, 825000, 0.275)
    ]

# top_rate = 0.35
tax_levels_nonsalaried_2013_14 = [
        (750000, 35000, 0.10),
        (1500000, 112500, 0.15),
        (2500000, 200000, 0.20),
        (4000000, 375000, 0.25),
        (6000000, 600000, 0.30)
    ]

In [18]:
taxable_incomes_salaried_2013_14 = [calculate_taxable_income(i, tax_levels_salaried_2013_14, 0.3) for i in taxes_paid_2013_14]
taxable_incomes_nonsalaried_2013_14 = [calculate_taxable_income(i, tax_levels_nonsalaried_2013_14, 0.35) for i in taxes_paid_2013_14]

### 2014 - 2015

In [19]:
f = open('./taxpayerData/2015_Tax.txt', 'r')

in_individuals_section = False

taxes_paid_2014_15 = []
registration_numbers_2014_15 = set()

print_number = 0
for line in f:
    if "INDIVIDUAL" in line and len(line) == 11:
        in_individuals_section = True
        continue
    
    if in_individuals_section:
        line = line.strip().split(" ")
        
#         print(line)
#         print_number += 1
        for w in line:
            if w == "0":  # tokens are strings, so compare against "0"
                continue
            elif w.isdigit() and len(w) > 3:
                registration_numbers_2014_15.add(w)
            elif is_comma_separated_number(w) or three_digit_or_less_number(w):
                tax_paid = int(w.replace(',', ''))
                taxes_paid_2014_15.append(tax_paid) 
            else:
                continue
        
        print_number += 1
#         if print_number % 5000 == 0:
#             print("%.2f%%" % (len(registration_numbers_2014_15)  * 100 / 1000718))
        
#         if print_number == 10:
#             break
f.close()
taxes_paid_2014_15 = sorted(x for x in taxes_paid_2014_15 if x != 0)

In [20]:
num_filers_2014_15 = len(registration_numbers_2014_15)
num_taxpayers_2014_15 = len(taxes_paid_2014_15)

proportion_filers_2014_15 = num_filers_2014_15 / ((labor_force_total["2014"][1] + labor_force_total["2015"][1]) / 2)
proportion_taxpayers_2014_15 = num_taxpayers_2014_15 / ((labor_force_total["2014"][1] + labor_force_total["2015"][1]) / 2)

In [21]:
# Note this is the same as last year
# top_rate = 0.30
tax_levels_salaried_2014_15 = [
        (750000, 17500, 0.05),
        (1400000, 65000, 0.10),
        (1500000, 12500, 0.125),
        (1800000, 45000, 0.15),
        (2500000, 122500, 0.175),
        (3000000, 100000, 0.20),
        (3500000, 112500, 0.225),
        (4000000, 125000, 0.25),
        (7000000, 825000, 0.275)
    ]

# top_rate = 0.35
tax_levels_nonsalaried_2014_15 = [
        (750000, 35000, 0.10),
        (1500000, 112500, 0.15),
        (2500000, 200000, 0.20),
        (4000000, 375000, 0.25),
        (6000000, 600000, 0.30)
    ]

In [22]:
taxable_incomes_salaried_2014_15 = [calculate_taxable_income(i, tax_levels_salaried_2014_15, 0.3) for i in taxes_paid_2014_15]
taxable_incomes_nonsalaried_2014_15 = [calculate_taxable_income(i, tax_levels_nonsalaried_2014_15, 0.35) for i in taxes_paid_2014_15]

### 2015 - 2016

In [23]:
f = open('./taxpayerData/2016_Tax.txt', 'r')

in_individuals_section = False
incomes_begin = False
registration_numbers_begin = False

taxes_paid_2015_16 = []
registration_numbers_2015_16 = set()

line_number = 0
for line in f:
    if "INDIVIDUAL" in line and len(line) == 11:
        in_individuals_section = True
        continue
    
    if in_individuals_section:
        line = line.strip().split(" ")
            
        if "19,000" in line:
            incomes_begin = True
        for w in line:
            if w == "19,000":
                incomes_begin = True
            if w == "3520254679721":
                registration_numbers_begin = True
                
                
            if w == "0":  # tokens are strings, so compare against "0"
                continue
            elif registration_numbers_begin and w.isdigit() and len(w) > 3:
                registration_numbers_2015_16.add(w)
            elif incomes_begin and (is_comma_separated_number(w) or three_digit_or_less_number(w)):
                tax_paid = int(w.replace(',', ''))
                taxes_paid_2015_16.append(tax_paid)
        
        line_number += 1
#         if line_number % 5000 == 0:
#             print("%.2f%%" % (len(registration_numbers_2015_16)  * 100 / 1135764))

f.close()
taxes_paid_2015_16 = sorted(x for x in taxes_paid_2015_16 if x != 0)

In [24]:
num_filers_2015_16 = len(registration_numbers_2015_16)
num_taxpayers_2015_16 = len(taxes_paid_2015_16)

proportion_filers_2015_16 = num_filers_2015_16 / ((labor_force_total["2015"][1] + labor_force_total["2016"][1]) / 2)
proportion_taxpayers_2015_16 = num_taxpayers_2015_16 / ((labor_force_total["2015"][1] + labor_force_total["2016"][1]) / 2)

In [25]:
# top_rate = 0.30
tax_levels_salaried_2015_16 = [
        (500000, 2000, 0.02),
        (750000, 12500, 0.05),
        (1400000, 65000, 0.10),
        (1500000, 12500, 0.125),
        (1800000, 45000, 0.15),
        (2500000, 122500, 0.175),
        (3000000, 100000, 0.20),
        (3500000, 112500, 0.225),
        (4000000, 125000, 0.25),
        (7000000, 825000, 0.275),
    ]

# top_rate = 0.35
tax_levels_nonsalaried_2015_16 = [
        (500000, 7000, 0.07),
        (750000, 25000, 0.10),
        (1500000, 112500, 0.15),
        (2500000, 200000, 0.20),
        (4000000, 375000, 0.25),
        (6000000, 600000, 0.30)
    ]

In [26]:
taxable_incomes_salaried_2015_16 = [calculate_taxable_income(i, tax_levels_salaried_2015_16, 0.3) for i in taxes_paid_2015_16]
taxable_incomes_nonsalaried_2015_16 = [calculate_taxable_income(i, tax_levels_nonsalaried_2015_16, 0.35) for i in taxes_paid_2015_16]

### 2016 - 2017

In [27]:
f = open('./taxpayerData/2017_Tax.txt', 'r')

in_individuals_section = False
in_tax_paid = False
in_registration = False

taxes_paid_2016_17 = []
registration_numbers_2016_17 = set()

print_number = 0
for line in f:
    if "INDIVIDUALS" in line:
        in_individuals_section = True
        continue
        
    if in_individuals_section:
        line = line.strip().split(" ")
        if "Tax" in line and "Paid" in line:
            in_tax_paid = True
            in_registration = False
        elif "Registration" in line and "No." in line:
            in_tax_paid = False
            in_registration = True
        elif "Sr." in line:
            in_tax_paid = False
            in_registration = False
        
        if in_registration:
#             print(line)
#             print_number += 1
            for w in line:
                if w.isdigit() and len(w) > 3:
                    registration_numbers_2016_17.add(w)
        elif in_tax_paid:
#             print(line)
#             print_number += 1
            for w in line:
                if is_comma_separated_number(w) or three_digit_or_less_number(w):
                    tax_paid = int(w.replace(',', ''))
                    taxes_paid_2016_17.append(tax_paid) 
        else:
            continue
        
#         if print_number == 4:
#             break

f.close()
taxes_paid_2016_17 = sorted(x for x in taxes_paid_2016_17 if x != 0)

In [28]:
num_filers_2016_17 = len(registration_numbers_2016_17)
num_taxpayers_2016_17 = len(taxes_paid_2016_17)

proportion_filers_2016_17 = num_filers_2016_17 / ((labor_force_total["2016"][1] + labor_force_total["2017"][1]) / 2)
proportion_taxpayers_2016_17 = num_taxpayers_2016_17 / ((labor_force_total["2016"][1] + labor_force_total["2017"][1]) / 2)

In [29]:
# Note that this is the same as last year
# top_rate = 0.30
tax_levels_salaried_2016_17 = [
        (500000, 2000, 0.02),
        (750000, 12500, 0.05),
        (1400000, 65000, 0.10),
        (1500000, 12500, 0.125),
        (1800000, 45000, 0.15),
        (2500000, 122500, 0.175),
        (3000000, 100000, 0.20),
        (3500000, 112500, 0.225),
        (4000000, 125000, 0.25),
        (7000000, 825000, 0.275),
    ]

# top_rate = 0.35
tax_levels_nonsalaried_2016_17 = [
        (500000, 7000, 0.07),
        (750000, 25000, 0.10),
        (1500000, 112500, 0.15),
        (2500000, 200000, 0.20),
        (4000000, 375000, 0.25),
        (6000000, 600000, 0.30)
    ]

In [30]:
taxable_incomes_salaried_2016_17 = [calculate_taxable_income(i, tax_levels_salaried_2016_17, 0.3) for i in taxes_paid_2016_17]
taxable_incomes_nonsalaried_2016_17 = [calculate_taxable_income(i, tax_levels_nonsalaried_2016_17, 0.35) for i in taxes_paid_2016_17]

### Corrections

Let's take a look at the top taxes paid for the last few years as that's usually where the data entry errors occur. Also we'll look at taxes_paid because most other data derives from it.

In [31]:
taxes_paid_2012_13[-8:]

[192509499,
 204700230,
 210332864,
 240639499,
 275009499,
 300009499,
 423039499,
 749008253]

In [32]:
taxes_paid_2013_14[-8:]

[143208224,
 143271006,
 163076683,
 212457880,
 249977620,
 298689781,
 304672293,
 485758739]

In [33]:
taxes_paid_2014_15[-8:]

[215161877,
 221062820,
 224273240,
 246943190,
 322592659,
 1980419800,
 30105506000,
 4220151942873]

In [34]:
# Page 9809 | MUHAMMAD HUSSAIN | 1,980,419,800 (nearly 2 Billion PKR in tax)
# Page 5040 | HABIB ULLAH S O HASHMAT ULLAH | 30,105,506,000 (over 30 Billion PKR in tax)
# Page 12,449 | NADEM AHMED MIR | 4,220,151,942,873 (over 4 Trillion PKR in tax)

In [35]:
taxes_paid_2015_16[-8:]

[263740827,
 280565417,
 293528571,
 295523097,
 311335339,
 402745414,
 420422332,
 2830640001]

In [36]:
# Page 3,217 | ASIM RAZZAQ | 2,830,640,001 (over 2 Billion PKR in tax)

In [37]:
taxes_paid_2016_17[-8:]

[254245709,
 259367593,
 294884853,
 314146096,
 389634689,
 408852429,
 411422476,
 716026507]

Looks like the highest tax amount paid is usually somewhere between roughly 500 and 750 million PKR. For tax years 2014 - 2015 and 2015 - 2016 there are outliers that are several times to several orders of magnitude larger than that. I've added comments above: those numbers exist in the original PDFs, but I am going to assume that they were wrongly entered.

Just for a sanity check, let's compare those outliers to national income.

In [38]:
ni2014 = (list(national_income[national_income["Year"] == 2014]["National Income"])[0] + \
          list(national_income[national_income["Year"] == 2015]["National Income"])[0]) / 2
ni2015 = (list(national_income[national_income["Year"] == 2015]["National Income"])[0] + \
          list(national_income[national_income["Year"] == 2016]["National Income"])[0]) / 2

In [39]:
print("2014 - 2015: 1,980,419,800 (nearly 2 Billion PKR in tax)")
print("Percent of National Income:", str(round(taxes_paid_2014_15[-3] / ni2014 * 100, 3)) + "%")

2014 - 2015: 1,980,419,800 (nearly 2 Billion PKR in tax)
Percent of National Income: 0.008%


In [40]:
print("2014 - 2015: 30,105,506,000 (over 30 Billion PKR in tax)")
print("Percent of National Income:", str(round(taxes_paid_2014_15[-2] / ni2014 * 100, 3)) + "%")

2014 - 2015: 30,105,506,000 (over 30 Billion PKR in tax)
Percent of National Income: 0.116%


In [41]:
print("2014 - 2015: 4,220,151,942,873 (over 4 Trillion PKR in tax)")
print("Percent of National Income:", str(round(taxes_paid_2014_15[-1] / ni2014 * 100, 3)) + "%")

2014 - 2015: 4,220,151,942,873 (over 4 Trillion PKR in tax)
Percent of National Income: 16.26%


In [42]:
print("2015 - 2016: 2,830,640,001 (over 2 Billion PKR in tax)")
print("Percent of National Income:", str(round(taxes_paid_2015_16[-1] / ni2015 * 100, 3)) + "%")

2015 - 2016: 2,830,640,001 (over 2 Billion PKR in tax)
Percent of National Income: 0.01%


Considering these numbers are far out of line with the rest of the sample (no single taxpayer plausibly pays tax equal to 16% of national income, for example), I remove them from the sample and re-run the calculations for those years. I do this below.

In [43]:
taxes_paid_2014_15 = taxes_paid_2014_15[:-3]

taxable_incomes_salaried_2014_15 = [calculate_taxable_income(i, tax_levels_salaried_2014_15, 0.3) for i in taxes_paid_2014_15]
taxable_incomes_nonsalaried_2014_15 = [calculate_taxable_income(i, tax_levels_nonsalaried_2014_15, 0.35) for i in taxes_paid_2014_15]

num_filers_2014_15 = len(registration_numbers_2014_15)
num_taxpayers_2014_15 = len(taxes_paid_2014_15)

proportion_filers_2014_15 = num_filers_2014_15 / ((labor_force_total["2014"][1] + labor_force_total["2015"][1]) / 2)
proportion_taxpayers_2014_15 = num_taxpayers_2014_15 / ((labor_force_total["2014"][1] + labor_force_total["2015"][1]) / 2)

In [44]:
taxes_paid_2015_16 = taxes_paid_2015_16[:-1]

taxable_incomes_salaried_2015_16 = [calculate_taxable_income(i, tax_levels_salaried_2015_16, 0.3) for i in taxes_paid_2015_16]
taxable_incomes_nonsalaried_2015_16 = [calculate_taxable_income(i, tax_levels_nonsalaried_2015_16, 0.35) for i in taxes_paid_2015_16]

num_filers_2015_16 = len(registration_numbers_2015_16)
num_taxpayers_2015_16 = len(taxes_paid_2015_16)

proportion_filers_2015_16 = num_filers_2015_16 / ((labor_force_total["2015"][1] + labor_force_total["2016"][1]) / 2)
proportion_taxpayers_2015_16 = num_taxpayers_2015_16 / ((labor_force_total["2015"][1] + labor_force_total["2016"][1]) / 2)
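The slices above (`[:-3]` and `[:-1]`) remove a fixed number of entries, so running those cells a second time would drop legitimate records as well. A sketch of an alternative that can be re-run safely is to filter by a cutoff. The 1 billion PKR cutoff here is an assumption chosen only because it separates the flagged values from the rest of the lists shown above. It is not a figure from FBR.

```python
# Top eight values for 2014 - 2015, copied from the output above.
tail_2014_15 = [
    215161877, 221062820, 224273240, 246943190, 322592659,
    1980419800, 30105506000, 4220151942873,
]

OUTLIER_CUTOFF_PKR = 1_000_000_000  # assumed cutoff, for illustration only


def drop_outliers(taxes_paid, cutoff=OUTLIER_CUTOFF_PKR):
    """Keep only tax amounts strictly below `cutoff`."""
    return [t for t in taxes_paid if t < cutoff]


cleaned = drop_outliers(tail_2014_15)
print(cleaned)
# [215161877, 221062820, 224273240, 246943190, 322592659]

# Applying the filter again changes nothing, unlike repeated slicing.
print(drop_outliers(cleaned) == cleaned)  # True
```

With this cutoff the single 2015 - 2016 outlier (2,830,640,001) would also be removed, while the largest values in the other years (all below 750 million PKR) would be kept.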

## Parsing the Parliamentarian Tax Data

Interesting space for analysis that should be further explored! Once I found that their returns were included in the main return file, I stopped doing analysis (I was under time constraints). I will put what I know about how to parse these PDFs here.

1) Download Tabula (it will be easier than parsing all of the text files) from here: https://github.com/tabulapdf/tabula. Tabula allows you to convert PDFs with tables into Excel tables.

2) Use Tabula to create Excel files of the Parliamentarian Tax for each year. Put them in the taxpayerData folder (next to the PDFs and txt files).

3) Load the tables as dataframes into the cells below, run analysis and generate statistics and tables. Start by creating the visualizations and tables in these sections. Once you can construct those statistics across years (create a time series), move that time series and write a description under the "Understanding Parliamentarian Tax Data" section.

In [45]:
# code to load tables from the taxpayerData folder
# parliamentarian_tax = pd.read_excel("./taxpayerData/2017_ParliamentarianTax.xlsx")
# parliamentarian_tax.head()

### 2012 -  2013

### 2013 -  2014

### 2014 -  2015

### 2015 -  2016

### 2016 -  2017

## Understanding the Full Tax Data

In [46]:
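# Requires plotly < 4; from plotly 4 onward this module lives in the separate
# chart_studio package (import chart_studio.plotly as py).
import plotly.plotly as py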
import plotly.graph_objs as go

**Helper Functions**

In [47]:
def format_number(value):
    return "{:,}".format(value)

def format_percent(value, sigfigs):
    return str(round(value * 100, sigfigs)) + "%"

def int_average(val1, val2):
    return int((val1 + val2) / 2)

**Relevant Tables**

In [48]:
pakistan_cpi = pd.read_excel("./miscData/pakistan_cpi.xls")
gdp = pd.read_excel("./miscData/pakistan_gdp.xlsx")
income_tax_comparison = pd.read_excel("./miscData/personal_income_tax_comparison.xlsx")
grd_pak_direct_tax = pd.read_excel("./miscData/grd_pak_taxes.xlsx")
india_gdp = pd.read_excel("./miscData/india_gdp.xlsx")
india_taxes = pd.read_excel("./miscData/india_taxes2.xls")
lfp = pd.read_excel("./miscData/lfp.xls")
lfp_female = pd.read_excel("./miscData/lfp_female.xls")

**GDP Deflator Calculations**

In [49]:
gdpd_2012 = np.mean([list(gdp[gdp["year"] == 2012]["gdp_index_2017"])[0],
          list(gdp[gdp["year"] == 2013]["gdp_index_2017"])[0]])
gdpd_2013 = np.mean([list(gdp[gdp["year"] == 2013]["gdp_index_2017"])[0],
         list(gdp[gdp["year"] == 2014]["gdp_index_2017"])[0]])
gdpd_2014 = np.mean([list(gdp[gdp["year"] == 2014]["gdp_index_2017"])[0],
         list(gdp[gdp["year"] == 2015]["gdp_index_2017"])[0]])
gdpd_2015 = np.mean([list(gdp[gdp["year"] == 2015]["gdp_index_2017"])[0],
          list(gdp[gdp["year"] == 2016]["gdp_index_2017"])[0]])
gdpd_2016 = np.mean([list(gdp[gdp["year"] == 2016]["gdp_index_2017"])[0],
          list(gdp[gdp["year"] == 2017]["gdp_index_2017"])[0]])

Let the graphs begin!

### Tax Directory in Perspective

#### Tax Returns (% of Labor Force)

In [50]:
lf2012 = format_number(int_average(labor_force_total["2012"][1], labor_force_total["2013"][1]))
lf2013 = format_number(int_average(labor_force_total["2013"][1], labor_force_total["2014"][1]))
lf2014 = format_number(int_average(labor_force_total["2014"][1], labor_force_total["2015"][1]))
lf2015 = format_number(int_average(labor_force_total["2015"][1], labor_force_total["2016"][1]))
lf2016 = format_number(int_average(labor_force_total["2016"][1], labor_force_total["2017"][1]))

pf2012 = format_percent(proportion_filers_2012_13, 2)
pf2013 = format_percent(proportion_filers_2013_14, 2)
pf2014 = format_percent(proportion_filers_2014_15, 2)
pf2015 = format_percent(proportion_filers_2015_16, 2)
pf2016 = format_percent(proportion_filers_2016_17, 2)

tx2012 = format_percent(proportion_taxpayers_2012_13, 2)
tx2013 = format_percent(proportion_taxpayers_2013_14, 2)
tx2014 = format_percent(proportion_taxpayers_2014_15, 2)
tx2015 = format_percent(proportion_taxpayers_2015_16, 2)
tx2016 = format_percent(proportion_taxpayers_2016_17, 2)

pt2012 = format_percent(num_taxpayers_2012_13 / num_filers_2012_13, 1)
pt2013 = format_percent(num_taxpayers_2013_14 / num_filers_2013_14, 1)
pt2014 = format_percent(num_taxpayers_2014_15 / num_filers_2014_15, 1)
pt2015 = format_percent(num_taxpayers_2015_16 / num_filers_2015_16, 1)
pt2016 = format_percent(num_taxpayers_2016_17 / num_filers_2016_17, 1)

d = {
    'Tax Year': ["2012 - 2013", "2013 - 2014", "2014 - 2015", "2015 - 2016", "2016 - 2017"], 
    'Labor Force': [lf2012, lf2013, lf2014, lf2015, lf2016], 
    'Proportion of Tax Filers (% of Labor Force)': [pf2012, pf2013, pf2014, pf2015, pf2016], 
    'Proportion of Tax Payers (% of Labor Force)': [tx2012, tx2013, tx2014, tx2015, tx2016],
    'Proportion of Filers who pay tax (% of Filers)': [pt2012, pt2013, pt2014, pt2015, pt2016]
    }

tax_returns_summary_statistics = pd.DataFrame(data=d)
tax_returns_summary_statistics = tax_returns_summary_statistics[["Tax Year", "Labor Force", "Proportion of Tax Filers (% of Labor Force)", "Proportion of Tax Payers (% of Labor Force)", "Proportion of Filers who pay tax (% of Filers)"]]
tax_returns_summary_statistics

| | Tax Year | Labor Force | Proportion of Tax Filers (% of Labor Force) | Proportion of Tax Payers (% of Labor Force) | Proportion of Filers who pay tax (% of Filers) |
|---|---|---|---|---|---|
| 0 | 2012 - 2013 | 60357706 | 1.21% | 0.7% | 58.0% |
| 1 | 2013 - 2014 | 62025233 | 1.26% | 0.8% | 63.3% |
| 2 | 2014 - 2015 | 64070380 | 1.56% | 1.08% | 69.2% |
| 3 | 2015 - 2016 | 66149979 | 1.72% | 1.21% | 70.7% |
| 4 | 2016 - 2017 | 67602890 | 2.49% | 1.61% | 64.7% |


Note that the tax year in Pakistan is from July 1st - June 30th of the next year. 
As a result, Labor Force is calculated as the average of the measured labor force between the two years.
Proportion of Tax Filers is taken as the number of filers divided by the derived number for labor force above.
Proportion of Tax Payers is calculated in a similar way.
Proportion of Filers who pay tax is taken as the number of taxpayers over the number of filers.
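The calculation described in these notes can be written as one small function. This is a sketch with made-up inputs. `share_of_labor_force` is a hypothetical name, and the counts below are not taken from the FBR directories.

```python
def share_of_labor_force(count, labor_force_first_year, labor_force_second_year):
    """Share of the tax-year labor force represented by `count` people.

    The tax year spans two calendar years, so the labor force is taken as
    the average of the two annual figures.
    """
    average_labor_force = (labor_force_first_year + labor_force_second_year) / 2
    return count / average_labor_force


# Illustrative numbers only: 1,000,000 filers against labor forces of
# 60,000,000 and 62,000,000 in the two calendar years.
share = share_of_labor_force(1_000_000, 60_000_000, 62_000_000)
print(f"{share:.2%}")  # 1.64%
```

The same function covers both the filer and the taxpayer proportions; only the count passed in changes.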

Note that the government states that many of the manually filed returns could not be included in the tax directory. As a result, we may underestimate the number of filers and derived statistics on government tax revenue as a percent of national income or GDP.

The fact that this number (Proportion of Income Tax Payers) has more than doubled in the past 4 years is impressive. Yet, comparatively, the current figure is similar to the levels observed in France and in the USA in the mid 1910s, and much lower than the levels observed in the interwar period (about 10-15%) and in the decades following World War 2 (50% or more) in these two countries (Piketty, 2001; Piketty and Saez, 2003). It is similar to India in the mid-1990s (the country is nearer 7% today), and slightly below China (Chancel and Piketty, 2017; Piketty, Yang and Zucman, 2018; CIA World Factbook).

Part of this explosive growth has been a result of government reforms aimed at increasing the ease of paying taxes and cracking down on tax evasion. This includes the elimination of tax exemptions, introducing self-assessment for taxable income and differential taxation that rewards compliance and penalizes noncompliance. It also includes integrating the National Tax Number (NTN) system with the Computerized National Identity Card (CNIC) database. The latter is much more comprehensive, and can help the government track down evaders (Pakistan Selected Issues Paper, IMF, 2016).

Now let's explore a more realistic statistic that gives us a relative size of our taxpaying population. The exemption level for paying taxes is 350,000 PKR in the 2013 taxyear, and 400,000 PKR in the following tax years. This means that the population of eligible taxpayers is much smaller than the labor force. According to the survey data, it's around the top 10% (see the "Combining Data" notebook). 

Let's use those percentiles to estimate what proportion of eligible taxpayers actually pay their taxes. This will give us an idea of the size of our sample.

#### Tax Returns (% of Eligible Labor Force)

In [51]:
lf2012_t10 = int(int_average(labor_force_total["2012"][1], labor_force_total["2013"][1])*(1 - 0.900109657512))
lf2013_t10 = int(int_average(labor_force_total["2013"][1], labor_force_total["2014"][1])*(1 - 0.938917499403))
lf2014_t10 = int(int_average(labor_force_total["2014"][1], labor_force_total["2015"][1])*(1 - 0.920433052892))
lf2015_t10 = int(int_average(labor_force_total["2015"][1], labor_force_total["2016"][1])*(1 - 0.882081344677))
lf2016_t10 = int(int_average(labor_force_total["2016"][1], labor_force_total["2017"][1])*(1 - 0.87)) # estimate


tx2012_t10 = format_percent(len(taxes_paid_2012_13) / lf2012_t10, 2)
tx2013_t10 = format_percent(len(taxes_paid_2013_14) / lf2013_t10, 2)
tx2014_t10 = format_percent(len(taxes_paid_2014_15) / lf2014_t10, 2)
tx2015_t10 = format_percent(len(taxes_paid_2015_16) / lf2015_t10, 2)
tx2016_t10 = format_percent(len(taxes_paid_2016_17) / lf2016_t10, 2)

d = {
    'Tax Year': ["2012 - 2013", "2013 - 2014", "2014 - 2015", "2015 - 2016", "2016 - 2017"], 
    'Population of Potential Taxpayers': [format_number(lf2012_t10), format_number(lf2013_t10), format_number(lf2014_t10), format_number(lf2015_t10), format_number(lf2016_t10)], 
    'Proportion of Tax Payers (% of Eligible Labor Force)': [tx2012_t10, tx2013_t10, tx2014_t10, tx2015_t10, tx2016_t10],
    }

tax_returns_summary_statistics_eligible = pd.DataFrame(data=d)
tax_returns_summary_statistics_eligible = tax_returns_summary_statistics_eligible[["Tax Year", "Population of Potential Taxpayers", "Proportion of Tax Payers (% of Eligible Labor Force)"]]
tax_returns_summary_statistics_eligible

Unnamed: 0,Tax Year,Population of Potential Taxpayers,Proportion of Tax Payers (% of Eligible Labor Force)
0,2012 - 2013,6029151,7.01%
1,2013 - 2014,3788656,13.05%
2,2014 - 2015,5097884,13.58%
3,2015 - 2016,7800316,10.3%
4,2016 - 2017,8788375,12.37%


This is a strikingly large share, and it lends support to the plausibility of our results. Although there are not many reports on the labor force population for recent years, past Ministry of Finance reports show that the unemployed make up a very small part of the labor force (regularly less than 10%), meaning that this percentage is relatively accurate (see miscData/labor_force_pakistan_brochure_finance_ministry.pdf).
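
The calculation behind the table above can be sketched as a pair of small helpers. This is only an illustration: `eligible_population` and `share_paying` are hypothetical names, the inputs are made-up round numbers, and I am assuming that `int_average` (defined elsewhere in this notebook) is a plain two-value average.

```python
def eligible_population(labor_force_start, labor_force_end, junction_percentile):
    """Estimate how many people earn above the exemption threshold in a tax year."""
    # A tax year spans two calendar years, so average the two labor-force figures.
    average_labor_force = (labor_force_start + labor_force_end) / 2
    # Everyone above the junction percentile is treated as a potential taxpayer.
    return int(average_labor_force * (1 - junction_percentile))


def share_paying(number_of_taxpayers, eligible):
    """Taxpayers as a fraction of the eligible population."""
    return number_of_taxpayers / eligible


# Hypothetical inputs, chosen only to keep the arithmetic easy to follow.
eligible = eligible_population(60_000_000, 62_000_000, 0.875)
print(eligible)                         # 7625000
print(share_paying(762_500, eligible))  # 0.1
```

Keeping the junction percentile as an argument makes it easy to see how sensitive the final proportion is to that one estimate, which matters for 2016 - 2017, where the percentile is itself an estimate.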

### Comparing Personal Income Tax Revenue Across Countries

#### Personal Income Tax Revenue (% of GDP)

In [52]:
x = np.arange(2013, 2018, 1)

gdp_2012_13 = int_average(list(gdp[gdp["year"] == 2012]["gdp"])[0], list(gdp[gdp["year"] == 2013]["gdp"])[0])
gdp_2013_14 = int_average(list(gdp[gdp["year"] == 2013]["gdp"])[0], list(gdp[gdp["year"] == 2014]["gdp"])[0])
gdp_2014_15 = int_average(list(gdp[gdp["year"] == 2014]["gdp"])[0], list(gdp[gdp["year"] == 2015]["gdp"])[0])
gdp_2015_16 = int_average(list(gdp[gdp["year"] == 2015]["gdp"])[0], list(gdp[gdp["year"] == 2016]["gdp"])[0])
gdp_2016_17 = int_average(list(gdp[gdp["year"] == 2016]["gdp"])[0], list(gdp[gdp["year"] == 2017]["gdp"])[0])

pak_tr_gdp_2012 = sum(taxes_paid_2012_13) / gdpd_2012 / gdp_2012_13
pak_tr_gdp_2013 = sum(taxes_paid_2013_14) / gdpd_2013 / gdp_2013_14
pak_tr_gdp_2014 = sum(taxes_paid_2014_15) / gdpd_2014 / gdp_2014_15
pak_tr_gdp_2015 = sum(taxes_paid_2015_16) / gdpd_2015 / gdp_2015_16
pak_tr_gdp_2016 = sum(taxes_paid_2016_17) / gdpd_2016 / gdp_2016_17

fra_tr_gdp_2012 = np.mean([list(income_tax_comparison[np.logical_and(income_tax_comparison["loc"] == "FRA", income_tax_comparison["time"] == 2012)]["val"])[0],
                          list(income_tax_comparison[np.logical_and(income_tax_comparison["loc"] == "FRA", income_tax_comparison["time"] == 2013)]["val"])[0]])
fra_tr_gdp_2013 = np.mean([list(income_tax_comparison[np.logical_and(income_tax_comparison["loc"] == "FRA", income_tax_comparison["time"] == 2013)]["val"])[0],
                          list(income_tax_comparison[np.logical_and(income_tax_comparison["loc"] == "FRA", income_tax_comparison["time"] == 2014)]["val"])[0]])
fra_tr_gdp_2014 = np.mean([list(income_tax_comparison[np.logical_and(income_tax_comparison["loc"] == "FRA", income_tax_comparison["time"] == 2014)]["val"])[0],
                          list(income_tax_comparison[np.logical_and(income_tax_comparison["loc"] == "FRA", income_tax_comparison["time"] == 2015)]["val"])[0]])
fra_tr_gdp_2015 = np.mean([list(income_tax_comparison[np.logical_and(income_tax_comparison["loc"] == "FRA", income_tax_comparison["time"] == 2015)]["val"])[0],
                          list(income_tax_comparison[np.logical_and(income_tax_comparison["loc"] == "FRA", income_tax_comparison["time"] == 2016)]["val"])[0]])
fra_tr_gdp_2016 = np.mean([list(income_tax_comparison[np.logical_and(income_tax_comparison["loc"] == "FRA", income_tax_comparison["time"] == 2016)]["val"])[0],
                          list(income_tax_comparison[np.logical_and(income_tax_comparison["loc"] == "FRA", income_tax_comparison["time"] == 2017)]["val"])[0]])

usa_tr_gdp_2012 = np.mean([list(income_tax_comparison[np.logical_and(income_tax_comparison["loc"] == "USA", income_tax_comparison["time"] == 2012)]["val"])[0],
                          list(income_tax_comparison[np.logical_and(income_tax_comparison["loc"] == "USA", income_tax_comparison["time"] == 2013)]["val"])[0]])
usa_tr_gdp_2013 = np.mean([list(income_tax_comparison[np.logical_and(income_tax_comparison["loc"] == "USA", income_tax_comparison["time"] == 2013)]["val"])[0],
                          list(income_tax_comparison[np.logical_and(income_tax_comparison["loc"] == "USA", income_tax_comparison["time"] == 2014)]["val"])[0]])
usa_tr_gdp_2014 = np.mean([list(income_tax_comparison[np.logical_and(income_tax_comparison["loc"] == "USA", income_tax_comparison["time"] == 2014)]["val"])[0],
                          list(income_tax_comparison[np.logical_and(income_tax_comparison["loc"] == "USA", income_tax_comparison["time"] == 2015)]["val"])[0]])
usa_tr_gdp_2015 = np.mean([list(income_tax_comparison[np.logical_and(income_tax_comparison["loc"] == "USA", income_tax_comparison["time"] == 2015)]["val"])[0],
                          list(income_tax_comparison[np.logical_and(income_tax_comparison["loc"] == "USA", income_tax_comparison["time"] == 2016)]["val"])[0]])
usa_tr_gdp_2016 = np.mean([list(income_tax_comparison[np.logical_and(income_tax_comparison["loc"] == "USA", income_tax_comparison["time"] == 2016)]["val"])[0],
                          list(income_tax_comparison[np.logical_and(income_tax_comparison["loc"] == "USA", income_tax_comparison["time"] == 2017)]["val"])[0]])

turk_tr_gdp_2012 = np.mean([list(income_tax_comparison[np.logical_and(income_tax_comparison["loc"] == "TUR", income_tax_comparison["time"] == 2012)]["val"])[0],
                          list(income_tax_comparison[np.logical_and(income_tax_comparison["loc"] == "TUR", income_tax_comparison["time"] == 2013)]["val"])[0]])
turk_tr_gdp_2013 = np.mean([list(income_tax_comparison[np.logical_and(income_tax_comparison["loc"] == "TUR", income_tax_comparison["time"] == 2013)]["val"])[0],
                          list(income_tax_comparison[np.logical_and(income_tax_comparison["loc"] == "TUR", income_tax_comparison["time"] == 2014)]["val"])[0]])
turk_tr_gdp_2014 = np.mean([list(income_tax_comparison[np.logical_and(income_tax_comparison["loc"] == "TUR", income_tax_comparison["time"] == 2014)]["val"])[0],
                          list(income_tax_comparison[np.logical_and(income_tax_comparison["loc"] == "TUR", income_tax_comparison["time"] == 2015)]["val"])[0]])
turk_tr_gdp_2015 = np.mean([list(income_tax_comparison[np.logical_and(income_tax_comparison["loc"] == "TUR", income_tax_comparison["time"] == 2015)]["val"])[0],
                          list(income_tax_comparison[np.logical_and(income_tax_comparison["loc"] == "TUR", income_tax_comparison["time"] == 2016)]["val"])[0]])
turk_tr_gdp_2016 = np.mean([list(income_tax_comparison[np.logical_and(income_tax_comparison["loc"] == "TUR", income_tax_comparison["time"] == 2016)]["val"])[0],
                          list(income_tax_comparison[np.logical_and(income_tax_comparison["loc"] == "TUR", income_tax_comparison["time"] == 2017)]["val"])[0]])

india_gdp_2012 = np.mean([list(india_gdp[india_gdp["year"] == 2012]["gdp_local"])[0],
                          list(india_gdp[india_gdp["year"] == 2013]["gdp_local"])[0]])
india_gdp_2013 = np.mean([list(india_gdp[india_gdp["year"] == 2013]["gdp_local"])[0],
                          list(india_gdp[india_gdp["year"] == 2014]["gdp_local"])[0]])
india_gdp_2014 = np.mean([list(india_gdp[india_gdp["year"] == 2014]["gdp_local"])[0],
                          list(india_gdp[india_gdp["year"] == 2015]["gdp_local"])[0]])
india_gdp_2015 = np.mean([list(india_gdp[india_gdp["year"] == 2015]["gdp_local"])[0],
                          list(india_gdp[india_gdp["year"] == 2016]["gdp_local"])[0]])
india_gdp_2016 = np.mean([list(india_gdp[india_gdp["year"] == 2016]["gdp_local"])[0],
                          list(india_gdp[india_gdp["year"] == 2017]["gdp_local"])[0]])

india_tr_gdp_2012 = (list(india_taxes[india_taxes["year"] == "2012-13"]["income_revenue"])[0] * 10000000) / india_gdp_2012
india_tr_gdp_2013 = (list(india_taxes[india_taxes["year"] == "2013-14"]["income_revenue"])[0] * 10000000) / india_gdp_2013
india_tr_gdp_2014 = (list(india_taxes[india_taxes["year"] == "2014-15(P)"]["income_revenue"])[0] * 10000000) / india_gdp_2014
india_tr_gdp_2015 = (288717 * 10000000) / india_gdp_2015 # Numbers are from Ministry of Finance Press Release in miscData
india_tr_gdp_2016 = (346789 * 10000000) / india_gdp_2016 # Numbers are from Ministry of Finance Press Release in miscData

pak_gdp_trace = go.Scatter(
    x=x,
    y=np.round(np.array([pak_tr_gdp_2012, pak_tr_gdp_2013, pak_tr_gdp_2014, pak_tr_gdp_2015, pak_tr_gdp_2016]) * 100, 2),
    name="Pakistan",
    line = dict(color = ('rgb(0,100,80)'))
)

fra_gdp_trace = go.Scatter(
    x=x,
    y=np.round(np.array([fra_tr_gdp_2012, fra_tr_gdp_2013, fra_tr_gdp_2014, fra_tr_gdp_2015, fra_tr_gdp_2016]) * 100, 2),
    name="France"
)

usa_gdp_trace = go.Scatter(
    x=x,
    y=np.round(np.array([usa_tr_gdp_2012, usa_tr_gdp_2013, usa_tr_gdp_2014, usa_tr_gdp_2015, usa_tr_gdp_2016]) * 100, 2),
    name="USA"
)

turk_gdp_trace = go.Scatter(
    x=x,
    y=np.round(np.array([turk_tr_gdp_2012, turk_tr_gdp_2013, turk_tr_gdp_2014, turk_tr_gdp_2015, turk_tr_gdp_2016]) * 100, 2),
    name="Turkey"
)

india_gdp_trace = go.Scatter(
    x=x,
    y=np.round(np.array([india_tr_gdp_2012, india_tr_gdp_2013, india_tr_gdp_2014, india_tr_gdp_2015, india_tr_gdp_2016]) * 100, 2),
    name="India",
    line = dict(color = ('rgb(255, 127, 14)'))
)

data = [usa_gdp_trace, india_gdp_trace, pak_gdp_trace, turk_gdp_trace, fra_gdp_trace]

layout = dict(title = 'Personal Income Tax Revenue (% of GDP)',
              xaxis = dict(
                            title = 'Tax Year End (July 1st - June 30th)',
                            tickmode = 'linear',
                            tick0 = 2012,
                            dtick = 1
                        ),
              yaxis = dict(title = 'Percent of GDP')
        )
fig = go.Figure(data=data, layout=layout)
comparative_personal_income_tax_revenue_graph = py.iplot(fig, filename='styled-line')
comparative_personal_income_tax_revenue_graph

With revenues from income tax equivalent to approximately 1% of GDP, Pakistan receives around the same revenue as China (1%), but less than other emerging countries such as India (2%), Brazil and Russia (4%), and South Africa and the OECD countries (9%) (Chancel and Piketty, 2017; OECD, 2017; World Bank Databank). Still, the rate of increase is impressive, especially compared to neighboring India, which has a similar institutional history of tax collection.
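
The comparison cell above repeats the same two-calendar-year average for every country and year. A possible way to condense that pattern is sketched below. The helper name `tax_year_average` and the toy values are hypothetical; only the column names (`loc`, `time`, `val`) are taken from the cell above.

```python
import pandas as pd


def tax_year_average(frame, country, start_year,
                     loc_col="loc", time_col="time", val_col="val"):
    """Mean of a country's value over the two calendar years a tax year spans."""
    years = [start_year, start_year + 1]
    rows = frame[(frame[loc_col] == country) & (frame[time_col].isin(years))]
    # Fail loudly rather than silently averaging a single year.
    if len(rows) != 2:
        raise ValueError(f"expected 2 rows for {country} {years}, found {len(rows)}")
    return float(rows[val_col].mean())


# Toy data with made-up values, only to show the call.
toy = pd.DataFrame({
    "loc": ["FRA", "FRA", "USA", "USA"],
    "time": [2012, 2013, 2012, 2013],
    "val": [8.0, 9.0, 9.5, 10.5],
})
print(tax_year_average(toy, "FRA", 2012))  # 8.5
print(tax_year_average(toy, "USA", 2012))  # 10.0
```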

#### Personal Income Tax Revenue, Pakistan and India

In [53]:
data = [pak_gdp_trace, india_gdp_trace]

layout = dict(title = 'Personal Income Tax Revenue (% of GDP)',
              xaxis = dict(
                            title = 'Tax Year End (July 1st - June 30th)',
                            tickmode = 'linear',
                            tick0 = 2012,
                            dtick = 1
                        ),
              yaxis = dict(title = 'Percent of GDP')
        )
fig = go.Figure(data=data, layout=layout)
india_pak_personal_income_tax_revenue_graph = py.iplot(fig, filename='styled-line')
india_pak_personal_income_tax_revenue_graph

#### Personal Income Tax Revenue Growth, Pakistan and India

In [54]:
x = np.arange(2014, 2018, 1)

pak_gr_2014 = (pak_tr_gdp_2013 - pak_tr_gdp_2012) / pak_tr_gdp_2012
pak_gr_2015 = (pak_tr_gdp_2014 - pak_tr_gdp_2013) / pak_tr_gdp_2013
pak_gr_2016 = (pak_tr_gdp_2015 - pak_tr_gdp_2014) / pak_tr_gdp_2014
pak_gr_2017 = (pak_tr_gdp_2016 - pak_tr_gdp_2015) / pak_tr_gdp_2015

india_gr_2014 = (india_tr_gdp_2013 - india_tr_gdp_2012) / india_tr_gdp_2012
india_gr_2015 = (india_tr_gdp_2014 - india_tr_gdp_2013) / india_tr_gdp_2013
india_gr_2016 = (india_tr_gdp_2015 - india_tr_gdp_2014) / india_tr_gdp_2014
india_gr_2017 = (india_tr_gdp_2016 - india_tr_gdp_2015) / india_tr_gdp_2015

pak_gdp_trace = go.Scatter(
    x=x,
    y=np.round(np.array([pak_gr_2014, pak_gr_2015, pak_gr_2016, pak_gr_2017]) * 100, 2),
    name="Pakistan",
    line = dict(color = ('rgb(0,100,80)'))
)

india_gdp_trace = go.Scatter(
    x=x,
    y=np.round(np.array([india_gr_2014, india_gr_2015, india_gr_2016, india_gr_2017]) * 100, 2),
    name="India",
    line = dict(color = ('rgb(255, 127, 14)'))
)

data = [pak_gdp_trace, india_gdp_trace]

layout = dict(title = 'Personal Income Tax Revenue Growth',
              xaxis = dict(
                            title = 'Tax Year End (July 1st - June 30th)',
                            tickmode = 'linear',
                            tick0 = 2012,
                            dtick = 1
                        ),
              yaxis = dict(title = 'Real income tax revenue growth (%)')
        )
fig = go.Figure(data=data, layout=layout)
india_pak_personal_income_tax_growth_graph = py.iplot(fig, filename='styled-line')
india_pak_personal_income_tax_growth_graph
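
Note that the growth rates in the cell above are computed on the revenue-to-GDP ratio, so they measure how fast the ratio changes from one tax year to the next. The same year-over-year calculation can be sketched in vectorized form. The function name `year_over_year_growth` and the input ratios are hypothetical.

```python
import numpy as np


def year_over_year_growth(series):
    """Relative change between consecutive entries: (x[t] - x[t-1]) / x[t-1]."""
    values = np.asarray(series, dtype=float)
    return (values[1:] - values[:-1]) / values[:-1]


# Made-up revenue-to-GDP ratios for three consecutive tax years.
ratios = [0.0050, 0.0055, 0.0066]
growth = year_over_year_growth(ratios)
print(np.round(growth * 100, 2))  # growth in percent, one value per year after the first
```

The result has one fewer entry than the input, which is why the x axis of the growth chart starts a year later than the level charts.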

### Assessing Income Calculations

Ultimately, we are trying to back-calculate the income earned by all individuals who are above the exemption limit for taxes. Because such a small percentage of the labor force pays taxes, we are only getting a biased sample, and we do not know whether the bias is towards the richer or poorer end of the population.

Given this sample, however, we can make reasonable estimates for taxable income (after deductions), and then total income. From the amount paid, we have calculated taxable income under two alternative assumptions: that everyone is a salaried worker (with all of their income coming from salary and not capital gains, which are subject to another tax), or that everyone is a non-salaried worker. We can then further assume a deduction level.
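
To make the deduction step concrete: if deductions are a fixed fraction of total income, then taxable income equals total income times (1 - deduction rate), so total income is recovered by dividing taxable income by (1 - deduction rate). Below is a minimal sketch of that step. The function name and the 950,000 PKR figure are hypothetical, and the flat deduction rate is an assumption.

```python
def total_income_from_taxable(taxable_income, deduction_rate):
    """Gross up taxable income, assuming deductions are a flat share of total income."""
    if not 0 <= deduction_rate < 1:
        raise ValueError("deduction_rate must be in [0, 1)")
    # taxable = total * (1 - rate)  =>  total = taxable / (1 - rate)
    return taxable_income / (1 - deduction_rate)


# Hypothetical filer with 950,000 PKR of taxable income and a 5% deduction.
total = total_income_from_taxable(950_000, 0.05)
print(f"{total:,.0f} PKR")  # 1,000,000 PKR
```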

Let's compare these numbers to national income to see if we have a reasonable estimate for the actual total income of the sample. We will make assumptions about how this income is distributed compared to the total population of potential taxpayers later.

#### Comparing Taxes and Real Income Calculations (% of National Income)

In [55]:
s_deduction_rate = 0.05
ns_deduction_rate = 0.05

ni2012 = np.mean([list(national_income[national_income["Year"] == 2012]["National Income"])[0],
          list(national_income[national_income["Year"] == 2013]["National Income"])[0]])
ni2013 = np.mean([list(national_income[national_income["Year"] == 2013]["National Income"])[0],
          list(national_income[national_income["Year"] == 2014]["National Income"])[0]])
ni2014 = np.mean([list(national_income[national_income["Year"] == 2014]["National Income"])[0],
          list(national_income[national_income["Year"] == 2015]["National Income"])[0]])
ni2015 = np.mean([list(national_income[national_income["Year"] == 2015]["National Income"])[0],
          list(national_income[national_income["Year"] == 2016]["National Income"])[0]])
ni2016 = np.mean([list(national_income[national_income["Year"] == 2016]["National Income"])[0],
          list(national_income[national_income["Year"] == 2017]["National Income"])[0]])

pnis2012 = sum(taxable_incomes_salaried_2012_13) / gdpd_2012 / ni2012
pnis2013 = sum(taxable_incomes_salaried_2013_14) / gdpd_2013 / ni2013
pnis2014 = sum(taxable_incomes_salaried_2014_15) / gdpd_2014 / ni2014
pnis2015 = sum(taxable_incomes_salaried_2015_16) / gdpd_2015 / ni2015
pnis2016 = sum(taxable_incomes_salaried_2016_17) / gdpd_2016 / ni2016

r_pnis2012 = sum(taxable_incomes_salaried_2012_13) / gdpd_2012 / (1 - s_deduction_rate) / ni2012
r_pnis2013 = sum(taxable_incomes_salaried_2013_14) / gdpd_2013 / (1 - s_deduction_rate) / ni2013
r_pnis2014 = sum(taxable_incomes_salaried_2014_15) / gdpd_2014 / (1 - s_deduction_rate) / ni2014
r_pnis2015 = sum(taxable_incomes_salaried_2015_16) / gdpd_2015 / (1 - s_deduction_rate) / ni2015
r_pnis2016 = sum(taxable_incomes_salaried_2016_17) / gdpd_2016 / (1 - s_deduction_rate) / ni2016

pnins2012 = sum(taxable_incomes_nonsalaried_2012_13) / gdpd_2012 / ni2012
pnins2013 = sum(taxable_incomes_nonsalaried_2013_14) / gdpd_2013 / ni2013
pnins2014 = sum(taxable_incomes_nonsalaried_2014_15) / gdpd_2014 / ni2014
pnins2015 = sum(taxable_incomes_nonsalaried_2015_16) / gdpd_2015 / ni2015
pnins2016 = sum(taxable_incomes_nonsalaried_2016_17) / gdpd_2016 / ni2016

r_pnins2012 = sum(taxable_incomes_nonsalaried_2012_13) / gdpd_2012 / (1 - ns_deduction_rate) / ni2012
r_pnins2013 = sum(taxable_incomes_nonsalaried_2013_14) / gdpd_2013 / (1 - ns_deduction_rate) / ni2013
r_pnins2014 = sum(taxable_incomes_nonsalaried_2014_15) / gdpd_2014 / (1 - ns_deduction_rate) / ni2014
r_pnins2015 = sum(taxable_incomes_nonsalaried_2015_16) / gdpd_2015 / (1 - ns_deduction_rate) / ni2015
r_pnins2016 = sum(taxable_incomes_nonsalaried_2016_17) / gdpd_2016 / (1 - ns_deduction_rate) / ni2016

dt_percent_2012 = np.mean([list(grd_pak_direct_tax[grd_pak_direct_tax["YEAR"] == 2012]["DIRECT TAXES"])[0],
                      list(grd_pak_direct_tax[grd_pak_direct_tax["YEAR"] == 2013]["DIRECT TAXES"])[0]])
dt_percent_2013 = np.mean([list(grd_pak_direct_tax[grd_pak_direct_tax["YEAR"] == 2013]["DIRECT TAXES"])[0],
                      list(grd_pak_direct_tax[grd_pak_direct_tax["YEAR"] == 2014]["DIRECT TAXES"])[0]])
dt_percent_2014 = np.mean([list(grd_pak_direct_tax[grd_pak_direct_tax["YEAR"] == 2014]["DIRECT TAXES"])[0],
                      list(grd_pak_direct_tax[grd_pak_direct_tax["YEAR"] == 2015]["DIRECT TAXES"])[0]])
dt_percent_2015 = list(grd_pak_direct_tax[grd_pak_direct_tax["YEAR"] == 2015]["DIRECT TAXES"])[0]

dt_calc_2012 = sum(taxes_paid_2012_13) / gdpd_2012 / ni2012
dt_calc_2013 = sum(taxes_paid_2013_14) / gdpd_2013 / ni2013
dt_calc_2014 = sum(taxes_paid_2014_15) / gdpd_2014 / ni2014
dt_calc_2015 = sum(taxes_paid_2015_16) / gdpd_2015 / ni2015
dt_calc_2016 = sum(taxes_paid_2016_17) / gdpd_2016 / ni2016

d = {
    'Tax Year': ["2012 - 2013", "2013 - 2014", "2014 - 2015", "2015 - 2016", "2016 - 2017"], 
    'National Income (PKR, Constant 2017)': [format_number(int(ni2012)), format_number(int(ni2013)), format_number(int(ni2014)), format_number(int(ni2015)), format_number(int(ni2016))], 
    "Salaried Workers Real Taxable Income (% of National Income)": [format_percent(pnis2012, 1), format_percent(pnis2013, 1), format_percent(pnis2014, 1), format_percent(pnis2015, 1), format_percent(pnis2016, 1)], 
    "Salaried Workers estimated Real Income (Taxable Income with 5% Deduction, % of National Income)": [format_percent(r_pnis2012, 1), format_percent(r_pnis2013, 1), format_percent(r_pnis2014, 1), format_percent(r_pnis2015, 1), format_percent(r_pnis2016, 1)],
    "Non-salaried workers' Real Taxable Income (% of National Income)": [format_percent(pnins2012, 1), format_percent(pnins2013, 1), format_percent(pnins2014, 1), format_percent(pnins2015, 1), format_percent(pnins2016, 1)], 
    "Non-salaried workers' estimated Real Income (Taxable Income with 5% Deduction, % of National Income)": [format_percent(r_pnins2012, 1), format_percent(r_pnins2013, 1), format_percent(r_pnins2014, 1), format_percent(r_pnins2015, 1), format_percent(r_pnins2016, 1)], 
    "Taxes Collected for Individuals (% of National Income)*": [format_percent(dt_calc_2012, 1), format_percent(dt_calc_2013, 1), format_percent(dt_calc_2014, 1), format_percent(dt_calc_2015, 1), format_percent(dt_calc_2016, 1)]
    }

income_calculations = pd.DataFrame(data=d)
income_calculations = income_calculations[["Tax Year", 
         "National Income (PKR, Constant 2017)", 
         "Taxes Collected for Individuals (% of National Income)*",
         "Non-salaried workers' Real Taxable Income (% of National Income)", 
         "Non-salaried workers' estimated Real Income (Taxable Income with 5% Deduction, % of National Income)", 
         "Salaried Workers Real Taxable Income (% of National Income)", 
         "Salaried Workers estimated Real Income (Taxable Income with 5% Deduction, % of National Income)"]]
income_calculations

Unnamed: 0,Tax Year,"National Income (PKR, Constant 2017)",Taxes Collected for Individuals (% of National Income)*,Non-salaried workers' Real Taxable Income (% of National Income),"Non-salaried workers' estimated Real Income (Taxable Income with 5% Deduction, % of National Income)",Salaried Workers Real Taxable Income (% of National Income),"Salaried Workers estimated Real Income (Taxable Income with 5% Deduction, % of National Income)"
0,2012 - 2013,23699811991552,0.5%,3.1%,3.3%,4.1%,4.3%
1,2013 - 2014,24764663988224,0.5%,3.3%,3.5%,3.9%,4.1%
2,2014 - 2015,25953806843904,0.8%,4.5%,4.7%,5.3%,5.5%
3,2015 - 2016,27178023518208,0.8%,4.9%,5.2%,5.8%,6.1%
4,2016 - 2017,28476922920960,1.0%,5.8%,6.1%,6.9%,7.2%


\*Note that this is the number we have collected from the tax directory. It is an underestimate, as the FBR states that not all manually filed returns were able to be digitized and included in the sample.

Some quick, objective facts to put these numbers in perspective:

**1)** Any individual eligible to pay taxes (a potential taxpayer who earns an amount above the tax exemption threshold) is in approximately the top 10% of earners in Pakistan. This is calculated in the "Combining Data" notebook as the junction percentile; we use those numbers in an earlier table.

**2)** We have the population of tax filers (which is a subset of the population above). We have an underestimate of that population as not all manually filed returns were able to be added to the directory.

**3)** Direct taxes are made up of Withholding tax, Voluntary payments, Collection on Demand, and other (a small proportion).

Taking a look at the quick statistics, we can learn a bit more about this population and why we are calculating based on salaried and non-salaried workers' tax rates. Given facts 2 and 3 above, we simply need to find out what type of taxes our sample is paying to correctly back-calculate their taxable income. Let's look at some quick statistics to understand more.

#### Quick Statistics

We're going to look at 2014. Let's reconcile our sample with the Federal Board of Revenue (FBR) Taxpayer Directory, which has summary statistics. Let's start with total direct taxes as a percent of national income.

In [56]:
# Direct taxes as a percent of national income, 2014
# Src: UNU WIDER GRD dataset, which comes from the Federal Board of Revenue Taxpayer Directory (confirmed manually)
print(891000000000 / ni2014 * 100, "% [DIRECT TAXES]")

3.43302238997 % [DIRECT TAXES]


Let's look at the split of Voluntary Payments and Withholding Tax collected from all companies, associations of persons (AOPs), and individuals. Collection on Demand is when a tax authority official shows up and does an assessment. It is not a voluntary return and therefore wouldn't be reflected in our directory (FBR TaxPayer Directory, 2014).

In [57]:
# All Companies, AOP, Individuals [ VP, WHT ] collected by FBR, 2014
print(891000000000  / ni2014 * 100 * (1 - 0.115), "% of national income [DIRECT TAXES]")

3.03822481512 % of national income [DIRECT TAXES]


It seems unlikely that you would file a return but not pay anything. If you know your tax liability is higher than the withholding amount, why would you file? Behaviorally, it doesn't make sense to file a return when you know you're not going to pay the tax anyway. Let's look at the amount of voluntary payments from companies, associations of persons, and individuals.

In [58]:
# All Companies, AOP, Individuals [ VP ] collected by FBR, 2014
print(891000000000  / ni2014 * 100 * (0.284), "% of national income [VOLUNTARY PAYMENTS]")

0.974978358751 % of national income [VOLUNTARY PAYMENTS]


Let's compare that to our sample of taxes paid by individuals.

In [59]:
# Individuals in our sample
print(sum(taxes_paid_2013_14) / ni2014 * 100, "% of national income [OUR SAMPLE]")

0.442131994948 % of national income [OUR SAMPLE]


It's about half the previous number for all companies, AOPs and individuals. Considering AOPs and companies pay a much larger amount of voluntary tax than individuals, it makes sense that our sample is voluntary payments of individuals. In expanding this research, scraping / extracting the amount of taxes paid by companies and AOPs by modifying the scraper could confirm this.
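
The chain of comparisons above can be reproduced in one place. This is just a sketch that restates the numbers already printed: the national income figure is the 2014 - 2015 row of the income table, the shares 0.115 and 0.284 are read off the cells above (I am assuming 0.115 is the part of direct taxes that is neither voluntary payment nor withholding), and the variable names are hypothetical.

```python
national_income_2014 = 25_953_806_843_904  # PKR, from the 2014 - 2015 row of the table above
direct_taxes_2014 = 891_000_000_000        # PKR, total direct taxes used in the cells above
non_vp_wht_share = 0.115                   # assumed: share that is neither VP nor WHT
voluntary_share = 0.284                    # share of direct taxes paid voluntarily
sample_pct = 0.442131994948                # our sample, % of national income (printed above)

direct_pct = direct_taxes_2014 / national_income_2014 * 100
vp_wht_pct = direct_pct * (1 - non_vp_wht_share)
vp_pct = direct_pct * voluntary_share

print(f"Direct taxes:        {direct_pct:.2f}% of national income")
print(f"VP + WHT:            {vp_wht_pct:.2f}% of national income")
print(f"Voluntary payments:  {vp_pct:.2f}% of national income")
print(f"Sample / voluntary:  {sample_pct / vp_pct:.2f}")
```

The last line is the ratio behind the "about half" statement: our sample of individuals is a little under half of all voluntary payments.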

**At this point, given this analysis, I am making the assumption that we have all voluntary payments made by individuals.** This means I am not assuming withholding taxes are in the sample (collection on demand would presumably follow the same tax schedule as the voluntary payment, plus a negligible fine).

We still need to know the distribution of non-salaried and salaried workers in the sample. There are also a few other taxes that could be part of this, including taxes on dividends and on the sale of stocks or land held for less than a year. In the "Combining Data" section, we will deal with this and provide a range of final tables (lower bound, upper bound and middle estimates).

In [60]:
# SOME REFERENCE DATA, SEE THE FBR YEARBOOK FOR MORE
# 796805000000 collected by FBR (direct amount in that year's currency, see headbook in supplemental_tax_information)
# 947712000000 collected by FBR (direct amount in that year's currency, see headbook in supplemental_tax_information)
# 1096046000000 collected by FBR (direct amount in that year's currency, see headbook in supplemental_tax_information)
# 1217500000000 collected by FBR (direct amount in that year's currency, see headbook in supplemental_tax_information)
# 1344200000000 collected by FBR (direct amount in that year's currency, see headbook in supplemental_tax_information)

# 43.1% CoD and VP (share of direct taxes, see headbook in supplemental_tax_information)
# 37.5% CoD and VP (share of direct taxes, see headbook in supplemental_tax_information)
# 36.9% CoD and VP (share of direct taxes, see headbook in supplemental_tax_information)
# 39% CoD and VP (share of direct taxes, see headbook in supplemental_tax_information)
# 34.5% CoD and VP (share of direct taxes, see headbook in supplemental_tax_information)

Again, I believe it's safe to assume that all individual filings are voluntary payments. Those who file and pay 0 in taxes already had it deducted from their salary at source, had it collected on demand, or are below the income threshold.

#### Real Income Calculations (% of National Income)

In [61]:
x = np.arange(2013, 2018, 1)

pnins_trace = go.Scatter(
    x=x,
    y=np.round(np.array([pnins2012, pnins2013, pnins2014, pnins2015, pnins2016]) * 100, 2),
    name = 'Non-Salaried Workers: Real Taxable Income',
)

r_pnins_trace = go.Scatter(
    x=x,
    y=np.round(np.array([r_pnins2012, r_pnins2013, r_pnins2014, r_pnins2015, r_pnins2016]) * 100, 2), 
    name = 'Non-Salaried Workers: Estimated Real Income (5% Deduction)',
)

pnis_trace = go.Scatter(
    x=x,
    y=np.round(np.array([pnis2012, pnis2013, pnis2014, pnis2015, pnis2016]) * 100, 2),
    name = 'Salaried Workers: Real Taxable Income',
)

r_pnis_trace = go.Scatter(
    x=x,
    y=np.round(np.array([r_pnis2012, r_pnis2013, r_pnis2014, r_pnis2015, r_pnis2016]) * 100, 2),
    name = 'Salaried Workers: Estimated Real Income (5% Deduction)',
)

data = [r_pnins_trace, pnis_trace, r_pnis_trace, pnins_trace]

layout = dict(title = 'Real Income Calculations as a Percent of National Income',
              xaxis = dict(title = 'Tax Year (July 1st - June 30th)'),
              yaxis = dict(title = 'Percent of National Income'),
              )
fig = go.Figure(data=data, layout=layout)
real_income_graph = py.iplot(fig, filename='styled-line')
real_income_graph

The markers for the x axis indicate the end of the tax year. Note that the tax code changes from tax years 2013 - 2014, and also 2015 - 2016 (causing the spread / potential amount of coverage that we can expect from the data to change). 

### Comparative Summary Statistics, Tax Collection

In [62]:
pop = pd.read_excel("./miscData/population/pop.xls")

pop0_14 = pd.read_excel("./miscData/population/pop0_14.xls")

pop_male = pd.read_excel("./miscData/population/pop_male.xls")
pop_female = pd.read_excel("./miscData/population/pop_female.xls")
pop15_19_male = pd.read_excel("./miscData/population/pop15_19_male.xls")
pop15_19_female = pd.read_excel("./miscData/population/pop15_19_female.xls")
pop20_24_male = pd.read_excel("./miscData/population/pop20_24_male.xls")
pop20_24_female = pd.read_excel("./miscData/population/pop20_24_female.xls")
pop25_29_male = pd.read_excel("./miscData/population/pop25_29_male.xls")
pop25_29_female = pd.read_excel("./miscData/population/pop25_29_female.xls")

In [63]:
pak_pop_female = pop_female[pop_female["Country Code"] == "PAK"]["2017"].values[0]
pak_pop_male = pop_male[pop_male["Country Code"] == "PAK"]["2017"].values[0]
pak_pop = pop[pop["Country Code"] == "PAK"]["2017"].values[0]
pak_under_30 = pop0_14[pop0_14["Country Code"] == "PAK"]["2017"].values[0] + \
(pop15_19_male[pop15_19_male["Country Code"] == "PAK"]["2017"].values[0] / 100 * pak_pop_male) + \
(pop15_19_female[pop15_19_female["Country Code"] == "PAK"]["2017"].values[0] / 100 * pak_pop_female) + \
(pop20_24_male[pop20_24_male["Country Code"] == "PAK"]["2017"].values[0] / 100 * pak_pop_male) + \
(pop20_24_female[pop20_24_female["Country Code"] == "PAK"]["2017"].values[0] / 100 * pak_pop_female) + \
(pop25_29_male[pop25_29_male["Country Code"] == "PAK"]["2017"].values[0] / 100 * pak_pop_male) + \
(pop25_29_female[pop25_29_female["Country Code"] == "PAK"]["2017"].values[0] / 100 * pak_pop_female)

india_pop_female = pop_female[pop_female["Country Code"] == "IND"]["2017"].values[0]
india_pop_male = pop_male[pop_male["Country Code"] == "IND"]["2017"].values[0]
india_pop = pop[pop["Country Code"] == "IND"]["2017"].values[0]
india_under_30 = pop0_14[pop0_14["Country Code"] == "IND"]["2017"].values[0] + \
(pop15_19_male[pop15_19_male["Country Code"] == "IND"]["2017"].values[0] / 100 * india_pop_male) + \
(pop15_19_female[pop15_19_female["Country Code"] == "IND"]["2017"].values[0] / 100 * india_pop_female) + \
(pop20_24_male[pop20_24_male["Country Code"] == "IND"]["2017"].values[0] / 100 * india_pop_male) + \
(pop20_24_female[pop20_24_female["Country Code"] == "IND"]["2017"].values[0] / 100 * india_pop_female) + \
(pop25_29_male[pop25_29_male["Country Code"] == "IND"]["2017"].values[0] / 100 * india_pop_male) + \
(pop25_29_female[pop25_29_female["Country Code"] == "IND"]["2017"].values[0] / 100 * india_pop_female)

turk_pop_female = pop_female[pop_female["Country Code"] == "TUR"]["2017"].values[0]
turk_pop_male = pop_male[pop_male["Country Code"] == "TUR"]["2017"].values[0]
turk_pop = pop[pop["Country Code"] == "TUR"]["2017"].values[0]
turk_under_30 = pop0_14[pop0_14["Country Code"] == "TUR"]["2017"].values[0] + \
(pop15_19_male[pop15_19_male["Country Code"] == "TUR"]["2017"].values[0] / 100 * turk_pop_male) + \
(pop15_19_female[pop15_19_female["Country Code"] == "TUR"]["2017"].values[0] / 100 * turk_pop_female) + \
(pop20_24_male[pop20_24_male["Country Code"] == "TUR"]["2017"].values[0] / 100 * turk_pop_male) + \
(pop20_24_female[pop20_24_female["Country Code"] == "TUR"]["2017"].values[0] / 100 * turk_pop_female) + \
(pop25_29_male[pop25_29_male["Country Code"] == "TUR"]["2017"].values[0] / 100 * turk_pop_male) + \
(pop25_29_female[pop25_29_female["Country Code"] == "TUR"]["2017"].values[0] / 100 * turk_pop_female)

usa_pop_female = pop_female[pop_female["Country Code"] == "USA"]["2017"].values[0]
usa_pop_male = pop_male[pop_male["Country Code"] == "USA"]["2017"].values[0]
usa_pop = pop[pop["Country Code"] == "USA"]["2017"].values[0]
usa_under_30 = pop0_14[pop0_14["Country Code"] == "USA"]["2017"].values[0] + \
(pop15_19_male[pop15_19_male["Country Code"] == "USA"]["2017"].values[0] / 100 * usa_pop_male) + \
(pop15_19_female[pop15_19_female["Country Code"] == "USA"]["2017"].values[0] / 100 * usa_pop_female) + \
(pop20_24_male[pop20_24_male["Country Code"] == "USA"]["2017"].values[0] / 100 * usa_pop_male) + \
(pop20_24_female[pop20_24_female["Country Code"] == "USA"]["2017"].values[0] / 100 * usa_pop_female) + \
(pop25_29_male[pop25_29_male["Country Code"] == "USA"]["2017"].values[0] / 100 * usa_pop_male) + \
(pop25_29_female[pop25_29_female["Country Code"] == "USA"]["2017"].values[0] / 100 * usa_pop_female)

fra_pop_female = pop_female[pop_female["Country Code"] == "FRA"]["2017"].values[0]
fra_pop_male = pop_male[pop_male["Country Code"] == "FRA"]["2017"].values[0]
fra_pop = pop[pop["Country Code"] == "FRA"]["2017"].values[0]
fra_under_30 = pop0_14[pop0_14["Country Code"] == "FRA"]["2017"].values[0] + \
(pop15_19_male[pop15_19_male["Country Code"] == "FRA"]["2017"].values[0] / 100 * fra_pop_male) + \
(pop15_19_female[pop15_19_female["Country Code"] == "FRA"]["2017"].values[0] / 100 * fra_pop_female) + \
(pop20_24_male[pop20_24_male["Country Code"] == "FRA"]["2017"].values[0] / 100 * fra_pop_male) + \
(pop20_24_female[pop20_24_female["Country Code"] == "FRA"]["2017"].values[0] / 100 * fra_pop_female) + \
(pop25_29_male[pop25_29_male["Country Code"] == "FRA"]["2017"].values[0] / 100 * fra_pop_male) + \
(pop25_29_female[pop25_29_female["Country Code"] == "FRA"]["2017"].values[0] / 100 * fra_pop_female)

chn_pop_female = pop_female[pop_female["Country Code"] == "CHN"]["2017"].values[0]
chn_pop_male = pop_male[pop_male["Country Code"] == "CHN"]["2017"].values[0]
chn_pop = pop[pop["Country Code"] == "CHN"]["2017"].values[0]
chn_under_30 = pop0_14[pop0_14["Country Code"] == "CHN"]["2017"].values[0] + \
(pop15_19_male[pop15_19_male["Country Code"] == "CHN"]["2017"].values[0] / 100 * chn_pop_male) + \
(pop15_19_female[pop15_19_female["Country Code"] == "CHN"]["2017"].values[0] / 100 * chn_pop_female) + \
(pop20_24_male[pop20_24_male["Country Code"] == "CHN"]["2017"].values[0] / 100 * chn_pop_male) + \
(pop20_24_female[pop20_24_female["Country Code"] == "CHN"]["2017"].values[0] / 100 * chn_pop_female) + \
(pop25_29_male[pop25_29_male["Country Code"] == "CHN"]["2017"].values[0] / 100 * chn_pop_male) + \
(pop25_29_female[pop25_29_female["Country Code"] == "CHN"]["2017"].values[0] / 100 * chn_pop_female)

chn_tr_gdp_2016 = 0.08 * 0.09157385536 # Taken from Economist article and World Bank data, see miscData files

d = {
    'Pakistan': [format_percent(pak_tr_gdp_2016, 2), format_percent(list(lfp[lfp["Country Name"] == "Pakistan"]["2017"])[0] / 100, 1), format_percent(list(lfp_female[lfp_female["Country Name"] == "Pakistan"]["2017"])[0] / 100, 1), format_percent(pak_under_30 / pak_pop, 1)], 
    'India': [format_percent(india_tr_gdp_2016, 2), format_percent(list(lfp[lfp["Country Name"] == "India"]["2017"])[0] / 100, 1), format_percent(list(lfp_female[lfp_female["Country Name"] == "India"]["2017"])[0] / 100, 1), format_percent(india_under_30 / india_pop, 1)], 
    'Turkey': [format_percent(turk_tr_gdp_2016, 2), format_percent(list(lfp[lfp["Country Name"] == "Turkey"]["2017"])[0] / 100, 1), format_percent(list(lfp_female[lfp_female["Country Name"] == "Turkey"]["2017"])[0] / 100, 1), format_percent(turk_under_30 / turk_pop, 1)], 
    'China': [format_percent(chn_tr_gdp_2016, 2), format_percent(list(lfp[lfp["Country Name"] == "China"]["2017"])[0] / 100, 1), format_percent(list(lfp_female[lfp_female["Country Name"] == "China"]["2017"])[0] / 100, 1), format_percent(chn_under_30 / chn_pop, 1)], 
    'France': [format_percent(fra_tr_gdp_2016, 2), format_percent(list(lfp[lfp["Country Name"] == "France"]["2017"])[0] / 100, 1), format_percent(list(lfp_female[lfp_female["Country Name"] == "France"]["2017"])[0] / 100, 1), format_percent(fra_under_30 / fra_pop, 1)], 
    'USA': [format_percent(usa_tr_gdp_2016, 2), format_percent(list(lfp[lfp["Country Code"] == "USA"]["2017"])[0] / 100, 1), format_percent(list(lfp_female[lfp_female["Country Code"] == "USA"]["2017"])[0] / 100, 1), format_percent(usa_under_30 / usa_pop, 1)], 
    }

comparative_summary_tax_statistics = pd.DataFrame(data=d)
comparative_summary_tax_statistics = comparative_summary_tax_statistics[["Pakistan", "India", "Turkey", "China", "USA", "France"]]
comparative_summary_tax_statistics.rename(index={0:'Income Tax Revenue (% of GDP)',1:'Labor Force Participation Rate', 2: 'Female Labor Force Participation Rate (% of female population)', 3:'Population under 30 (% of Total Population)'}, inplace=True)
print("Tax and Labor Data in Perspective, 2017")
comparative_summary_tax_statistics

Tax and Labor Data in Perspective, 2017


| | Pakistan | India | Turkey | China | USA | France |
|---|---|---|---|---|---|---|
| Income Tax Revenue (% of GDP) | 0.84% | 2.23% | 3.66% | 0.73% | 10.46% | 8.57% |
| Labor Force Participation Rate | 53.2% | 52.0% | 52.6% | 69.2% | 62.3% | 55.3% |
| Female Labor Force Participation Rate (% of female population) | 23.7% | 23.8% | 33.6% | 61.8% | 56.3% | 50.4% |
| Population under 30 (% of Total Population) | 62.9% | 54.6% | 49.2% | 38.3% | 39.6% | 35.7% |


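These summary statistics put Pakistan's tax and labor figures in perspective by comparing them with similar and more advanced countries. Note that the income tax revenue row is computed from 2016 values, while the labor force and population rows use 2017 values.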
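
The under-30 share in the table is built by adding the population aged 0-14 (a head count) to the 15-19, 20-24 and 25-29 cohorts, which are stored as percentages of the male and female populations. The same arithmetic can be written once and reused per country. The following is only a sketch: the helper names `value_for` and `under_30_share`, the country code `XXX`, and all numbers are made up for illustration and are not part of the notebook.

```python
import pandas as pd


def value_for(frame, code, year="2017"):
    """Return the single value for one country code in one year column."""
    return frame.loc[frame["Country Code"] == code, year].values[0]


def under_30_share(code, pop, pop_male, pop_female, pop0_14,
                   cohorts_male, cohorts_female, year="2017"):
    """Share of the population under 30.

    pop0_14 holds a head count; each cohort frame holds a percentage of
    the male (or female) population, so it is scaled by that population.
    """
    total = value_for(pop, code, year)
    male = value_for(pop_male, code, year)
    female = value_for(pop_female, code, year)

    under_30 = value_for(pop0_14, code, year)
    for frame in cohorts_male:
        under_30 += value_for(frame, code, year) / 100 * male
    for frame in cohorts_female:
        under_30 += value_for(frame, code, year) / 100 * female
    return under_30 / total


def make(value):
    """Build a one-row frame shaped like the World Bank tables (made-up data)."""
    return pd.DataFrame({"Country Code": ["XXX"], "2017": [value]})


# Hypothetical country: 1000 people, 300 aged 0-14, and each of the three
# 15-29 cohorts equal to 10% of the male and of the female population.
share = under_30_share(
    "XXX",
    pop=make(1000.0), pop_male=make(500.0), pop_female=make(500.0),
    pop0_14=make(300.0),
    cohorts_male=[make(10.0)] * 3,
    cohorts_female=[make(10.0)] * 3,
)
print(f"{share:.1%}")
```

With the real frames, a call such as `under_30_share("PAK", pop, pop_male, pop_female, pop0_14, [pop15_19_male, pop20_24_male, pop25_29_male], [pop15_19_female, pop20_24_female, pop25_29_female])` should reproduce `pak_under_30 / pak_pop`, assuming those frames are loaded as in the cell above.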

## Understanding the Parliamentarian Tax Data

A time series analysis of the parliamentarian tax data can be added here.

# Export Data

In [64]:
import pickle

We will now export these final numbers and graphs to a separate folder so we can use them in other notebooks.

In [65]:
'''============================ 2012 - 2013 ============================'''

with open('./finalData/taxes_paid_2012_13.pkl', 'wb') as f: pickle.dump(taxes_paid_2012_13, f)
with open('./finalData/taxable_incomes_salaried_2012_13.pkl', 'wb') as f: pickle.dump(taxable_incomes_salaried_2012_13, f)
with open('./finalData/taxable_incomes_nonsalaried_2012_13.pkl', 'wb') as f: pickle.dump(taxable_incomes_nonsalaried_2012_13, f)

with open('./finalData/num_filers_2012_13.pkl', 'wb') as f: pickle.dump(num_filers_2012_13, f)
with open('./finalData/num_taxpayers_2012_13.pkl', 'wb') as f: pickle.dump(num_taxpayers_2012_13, f)
with open('./finalData/proportion_filers_2012_13.pkl', 'wb') as f: pickle.dump(proportion_filers_2012_13, f)
with open('./finalData/proportion_taxpayers_2012_13.pkl', 'wb') as f: pickle.dump(proportion_taxpayers_2012_13, f)

'''============================ 2013 - 2014 ============================'''

with open('./finalData/taxes_paid_2013_14.pkl', 'wb') as f: pickle.dump(taxes_paid_2013_14, f)
with open('./finalData/taxable_incomes_salaried_2013_14.pkl', 'wb') as f: pickle.dump(taxable_incomes_salaried_2013_14, f)
with open('./finalData/taxable_incomes_nonsalaried_2013_14.pkl', 'wb') as f: pickle.dump(taxable_incomes_nonsalaried_2013_14, f)

with open('./finalData/num_filers_2013_14.pkl', 'wb') as f: pickle.dump(num_filers_2013_14, f)
with open('./finalData/num_taxpayers_2013_14.pkl', 'wb') as f: pickle.dump(num_taxpayers_2013_14, f)
with open('./finalData/proportion_filers_2013_14.pkl', 'wb') as f: pickle.dump(proportion_filers_2013_14, f)
with open('./finalData/proportion_taxpayers_2013_14.pkl', 'wb') as f: pickle.dump(proportion_taxpayers_2013_14, f)
    
'''============================ 2014 - 2015 ============================'''

with open('./finalData/taxes_paid_2014_15.pkl', 'wb') as f: pickle.dump(taxes_paid_2014_15, f)
with open('./finalData/taxable_incomes_salaried_2014_15.pkl', 'wb') as f: pickle.dump(taxable_incomes_salaried_2014_15, f)
with open('./finalData/taxable_incomes_nonsalaried_2014_15.pkl', 'wb') as f: pickle.dump(taxable_incomes_nonsalaried_2014_15, f)

with open('./finalData/num_filers_2014_15.pkl', 'wb') as f: pickle.dump(num_filers_2014_15, f)
with open('./finalData/num_taxpayers_2014_15.pkl', 'wb') as f: pickle.dump(num_taxpayers_2014_15, f)
with open('./finalData/proportion_filers_2014_15.pkl', 'wb') as f: pickle.dump(proportion_filers_2014_15, f)
with open('./finalData/proportion_taxpayers_2014_15.pkl', 'wb') as f: pickle.dump(proportion_taxpayers_2014_15, f)

'''============================ 2015 - 2016 ============================'''

with open('./finalData/taxes_paid_2015_16.pkl', 'wb') as f: pickle.dump(taxes_paid_2015_16, f)
with open('./finalData/taxable_incomes_salaried_2015_16.pkl', 'wb') as f: pickle.dump(taxable_incomes_salaried_2015_16, f)
with open('./finalData/taxable_incomes_nonsalaried_2015_16.pkl', 'wb') as f: pickle.dump(taxable_incomes_nonsalaried_2015_16, f)

with open('./finalData/num_filers_2015_16.pkl', 'wb') as f: pickle.dump(num_filers_2015_16, f)
with open('./finalData/num_taxpayers_2015_16.pkl', 'wb') as f: pickle.dump(num_taxpayers_2015_16, f)
with open('./finalData/proportion_filers_2015_16.pkl', 'wb') as f: pickle.dump(proportion_filers_2015_16, f)
with open('./finalData/proportion_taxpayers_2015_16.pkl', 'wb') as f: pickle.dump(proportion_taxpayers_2015_16, f)

'''============================ 2016 - 2017 ============================'''

with open('./finalData/taxes_paid_2016_17.pkl', 'wb') as f: pickle.dump(taxes_paid_2016_17, f)
with open('./finalData/taxable_incomes_salaried_2016_17.pkl', 'wb') as f: pickle.dump(taxable_incomes_salaried_2016_17, f)
with open('./finalData/taxable_incomes_nonsalaried_2016_17.pkl', 'wb') as f: pickle.dump(taxable_incomes_nonsalaried_2016_17, f)

with open('./finalData/num_filers_2016_17.pkl', 'wb') as f: pickle.dump(num_filers_2016_17, f)
with open('./finalData/num_taxpayers_2016_17.pkl', 'wb') as f: pickle.dump(num_taxpayers_2016_17, f)
with open('./finalData/proportion_filers_2016_17.pkl', 'wb') as f: pickle.dump(proportion_filers_2016_17, f)
with open('./finalData/proportion_taxpayers_2016_17.pkl', 'wb') as f: pickle.dump(proportion_taxpayers_2016_17, f)

In [68]:
with open('./finalData/finalGraphsAndTables/tax_returns_summary_statistics.pkl', 'wb') as f: pickle.dump(tax_returns_summary_statistics, f)
with open('./finalData/finalGraphsAndTables/tax_returns_summary_statistics_eligible.pkl', 'wb') as f: pickle.dump(tax_returns_summary_statistics_eligible, f)
with open('./finalData/finalGraphsAndTables/comparative_personal_income_tax_revenue_graph.pkl', 'wb') as f: pickle.dump(comparative_personal_income_tax_revenue_graph, f)
with open('./finalData/finalGraphsAndTables/india_pak_personal_income_tax_revenue_graph.pkl', 'wb') as f: pickle.dump(india_pak_personal_income_tax_revenue_graph, f)
with open('./finalData/finalGraphsAndTables/india_pak_personal_income_tax_growth_graph.pkl', 'wb') as f: pickle.dump(india_pak_personal_income_tax_growth_graph, f)
with open('./finalData/finalGraphsAndTables/income_calculations.pkl', 'wb') as f: pickle.dump(income_calculations, f)
with open('./finalData/finalGraphsAndTables/real_income_graph.pkl', 'wb') as f: pickle.dump(real_income_graph, f)
with open('./finalData/finalGraphsAndTables/comparative_summary_tax_statistics.pkl', 'wb') as f: pickle.dump(comparative_summary_tax_statistics, f)
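
The export cells above assume that `./finalData` and `./finalData/finalGraphsAndTables` already exist; `open(..., 'wb')` raises `FileNotFoundError` otherwise. One possible way to guard against that is a small wrapper that creates the folder before writing. This is a sketch only: `dump_pickle` and `example_taxes_paid` are hypothetical names, and the example writes to a temporary directory rather than the notebook's real folders.

```python
import os
import pickle
import tempfile


def dump_pickle(obj, path):
    """Pickle obj to path, creating the parent folder if it is missing."""
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(obj, f)


# Hypothetical stand-in for one of the exported lists.
example_taxes_paid = [1200, 0, 53000]

# Write into a fresh temporary directory so the sketch has no side effects.
out_dir = os.path.join(tempfile.mkdtemp(), "finalData")
out_path = os.path.join(out_dir, "example_taxes_paid.pkl")
dump_pickle(example_taxes_paid, out_path)

# Read the file back to confirm the round trip.
with open(out_path, "rb") as f:
    restored = pickle.load(f)
print(restored)
```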

To import the data into another notebook, run the code below. Note that the import statement may differ depending on whether the file holds a list or a table.

In [67]:
# with open('./finalData/taxes_paid_2012_13.pkl', 'rb') as f: taxes_paid_2012_13 = pickle.load(f)
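
As an illustration of the two cases, the sketch below round-trips a list with `pickle.load` and a table with `pd.read_pickle`. The file names and contents are made up, and the files live in a temporary directory. Treat the split as one reasonable convention rather than a requirement, since `pickle.load` can also restore a DataFrame when pandas is installed.

```python
import os
import pickle
import tempfile

import pandas as pd

tmp = tempfile.mkdtemp()
list_path = os.path.join(tmp, "example_list.pkl")
table_path = os.path.join(tmp, "example_table.pkl")

# Write one list and one table, the same way the export cells do.
with open(list_path, "wb") as f:
    pickle.dump([1, 2, 3], f)

example_table = pd.DataFrame({"Pakistan": ["0.84%"], "India": ["2.23%"]})
with open(table_path, "wb") as f:
    pickle.dump(example_table, f)

# A list: load it with the pickle module.
with open(list_path, "rb") as f:
    loaded_list = pickle.load(f)

# A table: pandas can read the pickled DataFrame directly.
loaded_table = pd.read_pickle(table_path)

print(loaded_list)
print(loaded_table)
```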

Now on to combining the tax data with the survey data to generate our final tables!