# Data Science Homework 1

For this homework we'll be looking at the data released by the State Bank of Pakistan. The data is the daily bank-wise and donor-wise receipts of the fund for the Diamer Bhasha and Mohmand Dam. You can find it at the following link: http://www.sbp.org.pk/notifications/FD/DamFund/Damfund.htm. Take a moment to look around the data and try to figure out what the possible challenges could be.

The main purpose of this homework is to teach how to scrape data from the web, clean it, and import it into Pandas for data analysis purposes. There are however some things to note:
1. As you can tell, the data is in PDF form. PDF is one of the most difficult data formats to handle, so if you end up with badly broken CSV files there is no need to worry; that is where the cleaning part comes in.
2. We'll be using an API to convert the data from PDF to CSV, and then load the CSV into Pandas. There are other ways to do this, but we chose this method for two reasons:
    * It will teach you how to communicate with APIs using Python, which is a useful skill when you want to deploy your data models as an API so that they can work with other APIs that need those models. Moreover, a lot of the data you get in the real world comes from APIs.
    * The CSV will be extremely inconsistent, so it will give you plenty of practice with regular expressions, which are extremely important in the Data Science toolkit.
    
Submit the notebooks in a similar format to the Labs: print the relevant output in each cell **only if it has an output. The initial scraping and converting does not have any output**, and name the notebooks as:
**rollnumber_HW1.ipynb**, e.g. **20100237_HW1.ipynb**

Please make sure you complete full parts (denoted by a Header each in this notebook) as the grading will be based on parts. Needless to say, do not copy someone else's code. In most Data Science careers, the main skill is not how good you are at coding, but how well you are able to use the tools at your disposal and what inferences you are able to make with the information that you have. Thus, while you might be able to do the HW by looking at someone else's code, unless you go through the actual thought process, you won't learn a lot.

We'll be using a lot of libraries in this tutorial. Make sure you go through them so you understand what they are used for.

**NOTE: If you are more comfortable doing so, as I am, you can do the assignment on your preferred text editor on simple Python and then write the code neatly in a notebook.** Personally, I find Sublime/Vim easier to use than Jupyter, mostly since a lot of shortcuts there make coding much easier, while here the shortcuts are more about navigation and controlling your cells.

**The homework is to be done in pairs.**

**Naming convention: rollnumber1_rollnumber2_HW1.ipynb**

Total Marks: 100

## Part 0: Getting the Data

You can have a look at the data through the link given above. Download a few PDF files and go through the data to see what it looks like. How many columns are there in each PDF file? Are there any inconsistencies? Are there any particular values that stand out and would need to be taken care of later in your cleaning? Keep all these questions in mind when going through the initial PDFs, because they will prove really helpful when you cannot figure out why there are so many "NaN" values in your final DataFrame.

## Part 1: Data Scraping              
Marks: 20

We'll be using the *requests* module to get an HTML page, and then use *BeautifulSoup* to parse that HTML page so that we are able to derive the appropriate information from it. I recommend you go through the documentation of each to learn more about how to use the libraries.

* [Requests Documentation](http://docs.python-requests.org/en/master/)
* [BeautifulSoup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
* [BeautifulSoup + Requests Tutorial](https://www.pythonforbeginners.com/python-on-the-web/web-scraping-with-beautifulsoup)
* [BeautifulSoup](https://medium.freecodecamp.org/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe) Note that this tutorial is more detailed. I would highly recommend you go through this as well even though the library used here is urllib2 instead of requests (which you can do as well!). It also links to more web-scraping libraries like Scrapy for more complicated scraping.

In [141]:
# Import libraries
import requests
from bs4 import BeautifulSoup
import os
import re
import urllib.request

# os is being imported so you can make a new directory. 

In [None]:
## Write code here that will:
    # Open each PDF link
    # Save the PDF in a directory in the same folder 

r = requests.get('http://www.sbp.org.pk/notifications/FD/DamFund/Damfund.htm')
data = r.text

soup = BeautifulSoup(data, "html.parser")
a_tags = soup.find_all('a')
# Skip anchors without an href attribute; a.get('href') returns None for those
hrefs = [a.get('href') for a in a_tags if a.get('href') and a.get('href').endswith('.pdf')]

# Create the output directory if it does not exist yet
os.makedirs('./all_pdfs', exist_ok=True)

# Download at most the first 59 PDFs; slicing avoids an IndexError if fewer links are found
for href in hrefs[:59]:
    url = 'http://www.sbp.org.pk/notifications/FD/DamFund/' + href
    file_name = href.split('/')[-1]
    r = requests.get(url, allow_redirects=True)
    with open('./all_pdfs/' + file_name, 'wb') as f:
        print('Beginning download ' + file_name)
        f.write(r.content)
        print('Ended download ' + file_name)

# Your code goes here #
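
The download loop above builds each URL by prefixing a fixed base path to the href. As a hedged alternative, the standard library's `urljoin` can resolve an href against the page URL, which also copes with hrefs that are root-relative. The sample hrefs below are made up for illustration; only the page URL comes from this notebook, and `absolute_pdf_links` is a hypothetical helper name.

```python
from urllib.parse import urljoin

# Page URL taken from this notebook; the hrefs further down are invented examples.
PAGE_URL = 'http://www.sbp.org.pk/notifications/FD/DamFund/Damfund.htm'


def absolute_pdf_links(hrefs, page_url=PAGE_URL):
    """Return absolute URLs for every href that points at a PDF, skipping missing hrefs."""
    return [urljoin(page_url, h) for h in hrefs if h and h.lower().endswith('.pdf')]


# Hypothetical hrefs: a relative link, a missing href, a non-PDF link, a root-relative link
sample_hrefs = ['2018/01-08-2018.pdf', None, 'index.htm', '/notifications/other.PDF']
links = absolute_pdf_links(sample_hrefs)
print(links)
```

Filtering on `h and ...` keeps the list comprehension from failing on anchors that have no `href` at all.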

## Part 2: Converting from PDF to CSV
Marks: 15

You have two options for which API to use for the conversion task.

The first option is communicating with an API called [Zamzar](https://www.zamzar.com) to send each PDF, ask them to convert it into CSV, and then download the converted CSV. They provide sample code to do everything from generating a simple request to starting a conversion job, checking for completion, and then downloading the finished file. You can find this information on the [Zamzar Documentation](https://developers.zamzar.com/docs) page.

**Important Information:**

The API only provides 100 points of free conversion, and each PDF to CSV conversion costs 3 points, so with one account you can only convert **33** PDFs. This leaves you very little room to experiment with the API unless you have an extra email address, so you need to be very careful when writing the code that communicates with it.

Moreover, the API only keeps the converted files for one day with a free account, so make sure you do this part in one go.
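
The point budget above can be checked with a few lines of arithmetic. This is only a sketch: the figure of 59 PDFs is an assumption taken from the download loop in Part 1, and the variable names are illustrative.

```python
import math

FREE_POINTS = 100   # free points per account, as stated above
COST_PER_PDF = 3    # points per PDF-to-CSV conversion, as stated above
N_PDFS = 59         # assumption: number of PDFs downloaded in Part 1

# Whole conversions that fit into one account's free budget
pdfs_per_account = FREE_POINTS // COST_PER_PDF
# Accounts needed if every PDF is converted exactly once, with no retries
accounts_needed = math.ceil(N_PDFS / pdfs_per_account)

print(pdfs_per_account, accounts_needed)
```

Any failed or repeated conversion eats into the same budget, which is why it pays to test the request code on a single file first.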

**Note: Using the Zamzar API grants a bonus of 10 marks. This will help if you are not able to complete this assignment, or it can be used up in a later assignment if you get 110/100 marks in this one.**

Another option is the [PDF Tables](https://pdftables.com) API, which is much simpler to use than the Zamzar API but does not allow you to check the job for completion or for any intermediate steps. It also requires the installation of a library. Once again, they allow only 50 conversions for free, but that is enough for us. This [blog post](https://pdftables.com/blog/pdf-to-excel-with-python) will help you figure out how to convert the PDF to CSV using Python.

The downside of this API is that it will not really teach you any proper API communication through requests, since you do not have to work with any requests directly.

In [142]:
from requests.auth import HTTPBasicAuth
#import config
import glob

In [143]:
pdfs_folder = './all_pdfs/*.pdf'

# You need a list to store all the job_ids from the response of posting the conversion job, 
# if you are using the Zamzar API

job_ids = []

# This piece of code shows you what glob does
for file_name in glob.glob(pdfs_folder):
    print(file_name)
    
    ## Write code here to post a job, and append each job's id into job_ids ##

./all_pdfs\01-08-2018.pdf
./all_pdfs\01-10-2018.pdf
./all_pdfs\02-08-2018.pdf
./all_pdfs\02-10-2018.pdf
./all_pdfs\03-08-2018.pdf
./all_pdfs\03-09-2018.pdf
./all_pdfs\03-10-2018.pdf
./all_pdfs\04-09-2018.pdf
./all_pdfs\04-10-2018.pdf
./all_pdfs\05-09-2018.pdf
./all_pdfs\05-10-2018.pdf
./all_pdfs\06-08-2018.pdf
./all_pdfs\06-09-2018.pdf
./all_pdfs\07-08-2018.pdf
./all_pdfs\07-09-2018.pdf
./all_pdfs\08-08-2018.pdf
./all_pdfs\08-10-2018.pdf
./all_pdfs\09-07-2018.pdf
./all_pdfs\09-08-2018.pdf
./all_pdfs\10-07-2018.pdf
./all_pdfs\10-08-2018.pdf
./all_pdfs\10-09-2018.pdf
./all_pdfs\11-07-2018.pdf
./all_pdfs\11-09-2018.pdf
./all_pdfs\12-07-2018.pdf
./all_pdfs\12-09-2018.pdf
./all_pdfs\13-07-2018.pdf
./all_pdfs\13-08-2018.pdf
./all_pdfs\13-09-2018.pdf
./all_pdfs\14-09-2018.pdf
./all_pdfs\15-08-2018.pdf
./all_pdfs\16-07-2018.pdf
./all_pdfs\16-08-2018.pdf
./all_pdfs\17-07-2018.pdf
./all_pdfs\17-08-2018.pdf
./all_pdfs\17-09-2018.pdf
./all_pdfs\18-07-2018.pdf
./all_pdfs\18-09-2018.pdf
./all_pdfs\1

Below this cell write the code to download the completed files. First check if a job_id's status is completed and wait until it is. After it has been completed, download the file and save it.

The exact code required here is all in the documentation. The only additional task you have to do on your own is to figure out which file has just been received for a given job_id, and to name the local file accordingly.

**Please look at the Example JSON response in the [documentation](https://developers.zamzar.com/docs) to learn how to figure out the filenames, job status etc**
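
As a rough sketch of the step described above, the helper below pulls the job status and the converted file's name and id out of a job response. The two sample dictionaries are hypothetical; they only mirror the fields that this notebook's code reads (`status`, `target_files`, `name`, `id`) and are not copied from the Zamzar documentation. `summarise_job` is an illustrative name.

```python
def summarise_job(job):
    """Return (status, target_name, target_file_id) from a job response dict.

    The target fields are None until the conversion has produced a file.
    """
    targets = job.get('target_files') or []
    first = targets[0] if targets else {}
    return job.get('status'), first.get('name'), first.get('id')


# Hypothetical responses, shaped like the fields used elsewhere in this notebook
pending = {'id': 101, 'status': 'initialising', 'target_files': []}
done = {'id': 101, 'status': 'successful',
        'target_files': [{'id': 202, 'name': '01-08-2018.csv'}]}

print(summarise_job(pending))
print(summarise_job(done))
```

Because the converted file keeps the date in its name, the returned `target_name` can be used directly as the local filename.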

#### Here we came back after discovering that some files had not been converted, and identified those files to figure out what went wrong

In [144]:
csvs_folder = './all_csvs/*.csv'

all_csvs = [csv[11:-4] for csv in glob.glob(csvs_folder)]
all_pdfs = [pdf[11:-4] for pdf in glob.glob(pdfs_folder)]

set_pdfs = set(all_pdfs)
set_csvs = set(all_csvs)

missed_files = set_pdfs.difference(set_csvs)

In [145]:
## Your code goes here ##
import json
import requests

api_key = '35eb69fb9d8d674dc9b5cc5c0f6d251badc2cd0f';
endpoint = "https://sandbox.zamzar.com/v1/jobs"

target_format = "csv"

for idx, my_file in enumerate(glob.glob(pdfs_folder)):
    if my_file[11:-4] in missed_files:
        data_content = {'target_format': target_format}
        # Open the PDF in a with-block so the file handle is closed after the upload
        with open(my_file, 'rb') as source_file:
            response = requests.post(endpoint, data=data_content,
                                     files={'source_file': source_file},
                                     auth=HTTPBasicAuth(api_key, ''))
        data = response.json()
        print(data)
        if "id" in data:
            # The job was accepted: remember its id and stop treating the file as missed
            missed_files.remove(my_file[11:-4])
            job_ids.append(data["id"])
        else:
            print("failed at: ", idx, my_file)
    else:
        print("file already converted")

file already converted
file already converted
file already converted
file already converted
file already converted
file already converted
file already converted
file already converted
file already converted
file already converted
file already converted
file already converted
{'errors': [{'code': 20, 'message': 'API key was missing or invalid'}]}
failed at:  12 ./all_pdfs\06-09-2018.pdf
file already converted
file already converted
file already converted
{'errors': [{'code': 20, 'message': 'API key was missing or invalid'}]}
failed at:  16 ./all_pdfs\08-10-2018.pdf
file already converted
file already converted
file already converted
file already converted
{'errors': [{'code': 20, 'message': 'API key was missing or invalid'}]}
failed at:  21 ./all_pdfs\10-09-2018.pdf
file already converted
file already converted
file already converted
file already converted
file already converted
file already converted
{'errors': [{'code': 20, 'message': 'API key was missing or invalid'}]}
failed at:  28

In [146]:
def check_response(job_ids):
    """Download the files of finished jobs and yield the ids of jobs that are still pending."""
    for job_id in job_ids:
        endpoint = "https://sandbox.zamzar.com/v1/jobs/{}".format(job_id)
        response = requests.get(endpoint, auth=HTTPBasicAuth(api_key, ''))
        data = response.json()
        print(data["status"])
        # Jobs that are still being processed are yielded so the caller can retry them later
        if data["status"] in ("initialising", "converting"):
            yield job_id
        elif data["status"] == "successful":
            filename = './all_csvs/' + data["target_files"][0]["name"]
            file_id = data["target_files"][0]["id"]
            endpoint = "https://sandbox.zamzar.com/v1/files/{}/content".format(file_id)
            response = requests.get(endpoint, stream=True, auth=HTTPBasicAuth(api_key, ''))
            try:
                # Stream the converted file to disk in 1 KB chunks
                with open(filename, 'wb') as f:
                    for chunk in response.iter_content(chunk_size=1024):
                        if chunk:
                            f.write(chunk)
                            f.flush()
                    print("File downloaded")

            except IOError:
                print("Error")

#### Now we will implement a loop that calls check_response and, if some files have not been downloaded yet, waits for 5 seconds and then tries again

In [147]:
import time


retry_files = list(check_response(job_ids))

# Keep polling until no job is left pending
while retry_files:
    print("trying again\n")
    time.sleep(5)
    retry_files = list(check_response(retry_files))
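
One possible refinement of the retry loop above is to cap the number of rounds, so that a job which never finishes cannot keep the cell running forever. The sketch below is not tied to Zamzar: `poll_until_done`, `fake_check` and the `max_rounds` default are all illustrative names and values, and `fake_check` merely stands in for `check_response`.

```python
import time


def poll_until_done(pending, check, delay=5, max_rounds=10, sleep=time.sleep):
    """Re-check pending ids until none remain or max_rounds is reached.

    `check` takes a list of ids and returns an iterable of ids that are still pending.
    The ids left over after the final round are returned to the caller.
    """
    rounds = 0
    while pending and rounds < max_rounds:
        sleep(delay)
        pending = list(check(pending))
        rounds += 1
    return pending


calls = {'n': 0}


def fake_check(ids):
    """Stand-in for check_response: pretend one job finishes per round."""
    calls['n'] += 1
    return ids[1:]


# sleep is replaced with a no-op so the example finishes instantly
left = poll_until_done([1, 2, 3], fake_check, sleep=lambda seconds: None)
print(left, calls['n'])
```

Passing `sleep` as a parameter keeps the waiting behaviour swappable, which makes the loop easy to try out without real delays.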

#### Since the file at position 26 of job_ids was too large to convert using Zamzar, we will try using the PDF Tables API. The name of the file is 2018_27-08-2018.pdf

## Part 3: Parsing the CSV File
Marks: 35

This is perhaps the most difficult part of the assignment. You have to follow a strategy similar to what you did in the Udacity Lab 1. You cannot simply use Pandas read_csv, since the conversion is not perfect and there will be rows with different numbers of columns, which Pandas does not take care of.

### **Main Task:**
* Write a function that parses a CSV into a Pandas DataFrame
* Each DataFrame should consist of three columns with headers Bank, Donor_Name, and Amount
* The date should be retrieved from the given filename 
* The Donor_Name can be NaN, as it is in a lot of cases. But try to retrieve as much information as possible
* Remove all "Page of" rows
* Don't include the header rows (e.g. "SUPREME COURT FUND....") into the DataFrame
* The Amount should be converted into a Pandas numeric at the end
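
For the requirement that the date comes from the filename, one hedged approach is to search the path for a `dd-mm-yyyy` pattern instead of slicing at fixed positions. This assumes every file name contains its date in that form, as the names listed in this notebook do; `date_from_filename` is an illustrative name.

```python
import re
from datetime import datetime

DATE_PATTERN = re.compile(r'\d{2}-\d{2}-\d{4}')


def date_from_filename(path):
    """Return the first dd-mm-yyyy date found in the path, or None if there is none."""
    match = DATE_PATTERN.search(path)
    if match is None:
        return None
    return datetime.strptime(match.group(0), '%d-%m-%Y').date()


print(date_from_filename('./all_csvs/01-10-2018.csv'))
print(date_from_filename('./all_csvs/2018_27-08-2018.csv'))
```

Searching for the pattern also handles names with a prefix and paths that use backslashes, which fixed slicing does not.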

### Other info:
Some important resources for this part are (you can choose any one tutorial that you feel is easy to understand, they all cover roughly the same content):
* [RegEx Tutorial 1](https://www.regular-expressions.info/)
* [RegEx Tutorial 2](https://regexone.com/lesson/introduction_abcs)
* [RegEx Tutorial 3](https://www.rexegg.com/)
* [RegEx Cheatsheet](https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285)
* This [RegEx Editor](https://regex101.com/) is your best friend since you can test your expression separately on this

You will probably have to use the CSV reader in order to get all the rows of the file. You can learn more about it using this [tutorial](https://www.alexkras.com/how-to-read-csv-file-in-python/).
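
As a small illustration of using the CSV reader to see how ragged a file is, the sketch below counts how many rows have each length. The sample text is invented and only imitates the kind of rows a converted PDF might contain; `row_length_counts` is an illustrative name.

```python
import csv
import io
from collections import Counter

# Invented sample: one clean row, one row with an extra cell, one page-number row
sample = (
    'Bank A,ALI,"1,000.00"\n'
    'Bank A,ALI,HOUSE 5,"2,500.00"\n'
    'Page 1 of 3\n'
)


def row_length_counts(handle):
    """Count how many rows of each length the CSV reader returns."""
    return Counter(len(row) for row in csv.reader(handle))


counts = row_length_counts(io.StringIO(sample))
print(counts)
```

With a real file, pass the open file handle instead of the `io.StringIO` wrapper.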

Some tips:
* First find out how many columns are in each row
* Print out rows which are longer than they should be (they should all be of length 3)
* Try to find patterns in how the data is spread, and what common problems exist in all rows
* Write some regex to try to extract the amount from the problem row and then:
    * Put the amount as the third column
    * Merge the rest of the string as a name of the donor in the 2nd column
* Also check if the rows with 3 columns are correctly formatted or not, many of them would probably not be.
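
The tips above can be sketched roughly as follows. This is one possible approach under simple assumptions, not the required solution: the first non-empty cell is taken to be the bank, the last cell made up of digits and commas is taken to be the amount, and everything else is merged into the donor name. `normalise_row` and the sample rows are illustrative.

```python
import re

# Digits with optional commas and an optional decimal part, e.g. 2,500.00 or 1000
AMOUNT_RE = re.compile(r'\d[\d,]*(?:\.\d+)?')


def normalise_row(cells):
    """Collapse a row of any length into [bank, donor, amount].

    Returns None for a row with no non-empty cells. Donor and amount are None
    when they cannot be found.
    """
    cells = [c.strip() for c in cells if c.strip()]
    if not cells:
        return None
    amount_idx = None
    # Search from the end so that numbers inside the donor text are not mistaken for the amount
    for i in range(len(cells) - 1, 0, -1):
        if AMOUNT_RE.fullmatch(cells[i]):
            amount_idx = i
            break
    if amount_idx is None:
        return [cells[0], ' '.join(cells[1:]) or None, None]
    donor = ' '.join(cells[1:amount_idx] + cells[amount_idx + 1:])
    amount = float(cells[amount_idx].replace(',', ''))
    return [cells[0], donor or None, amount]


print(normalise_row(['Bank A', 'ALI', 'HOUSE 5', '2,500.00']))
print(normalise_row(['Bank A', 'ALI', '0117', '30,000.00', '']))
print(normalise_row(['Bank A', '1,000.00']))
```

A pattern this loose will also accept digit strings that are not amounts, so the results still need the integrity checks in Part 5.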

In [170]:
import csv
import re # To use regular expressions
import pandas as pd
import numpy as np
example_file = './all_csvs/01-10-2018.csv' # Assuming the file is in the folder all_csvs and is named appropriately
# This is one of the most problematic files which is why I have included this in the example

In [171]:
def parser(filename):
    # Complete this function
    
    with open(filename) as f:
        reader = csv.reader(f)
        data = [row for row in reader]
        ## Just an example of one way to use the CSV module
        return data

In [172]:
# Remember, remove headers and convert all amounts to Numeric; if it can't be converted it needs to be NaN
def is_header_or_page_num(entry):
    p = re.compile('(?:[Tt]otal)|(?:[Pp]age.*of)|(?:[Dd]epositor [Nn]ame)', re.IGNORECASE)
    if re.search(p, entry) is not None:
        return True
    return False
    
def convert_numeric(entry):
    # Join every run of digits in the entry into a single integer;
    # return the entry unchanged if it contains no digits
    digits = re.findall(r'(\d+)', entry)
    if not digits:
        return entry
    return int(''.join(digits))
        
def clean_data(raw_data):
    cleaned_data = []
        
    for data_row in raw_data:
        last_numeric_entry = 'NaN'
        last_numeric_entry_indx = -1
        row = []
        
        #removing empty entries, headers and Pages_of entries
        for entry in data_row:
            #in an attempt to address the problem with National Bank of Pakistan Overseas branches, of
            #grouping their Donor name with Bank Name, and leaving Donor Name empty and other such cases
            match = re.search(r'(?:(.*[Bb]ranches) ?(.+))|(?:(.*[Ll]td)(?![ ][tT]) ?(.+))', entry)
            if (match):
                data_row[0:1] = [match for match in match.groups() if match is not None]
            if is_header_or_page_num(entry):
                row.clear()
                
                continue
            if (entry != ""):
                row.append(entry)
                
        #finding the Amount entry, by looking at the numeric entry nearest the end which is in 
        #digits-and-commas format for each row
        
        for indx, entry in enumerate(row):
            try:
                last_numeric_entry = (float(entry.replace(',','')))
                last_numeric_entry_indx = indx 
            except ValueError:
                continue

        
        #taking care of rows longer than 3 by making the numeric row nearest the end of the row the Amount
        #and joining the remaining strings together and dumping at index = 1
        
        #also filling the Donor_name with NaN if no Donor info found
        if len(row)>1:
            row[1:-1] = [' '.join(row[1:last_numeric_entry_indx]+row[last_numeric_entry_indx+1:-1])]
            row[2] = last_numeric_entry
            if(row[1] == ''):
                row[1] = 'n/a'
        
        #finally appending the cleaned row to cleaned_data array
        if len(row)>1: #don't append empty rows, or size one rows
            cleaned_data.append(row)
          
    return cleaned_data

def read_csv(filename):
    headers = ['Bank', 'Donor_Name','Amount']
    raw_data = parser(filename)
    cleaned_data = clean_data(raw_data)
        
    #Ignore this (for debugging purposes):
            
    df = pd.DataFrame(cleaned_data,
                      columns = ['Bank', 'Donor_Name', 'Amount'])
    df['Date'] = filename[11:-4]
    #making the Amount column numeric, and filling the non-numeric data with NaN
    df.iloc[:,2]= df.iloc[:,2].apply(pd.to_numeric, errors='coerce')
    #to fill eventually with relevant 3 columns from the cleaned_data
    return df
    
df = read_csv(example_file)
#print(df)

The n/a Donor_Name entries and NaN Amount entries seem valid, and even expected given that the data was entered manually.
On the whole these seem to be missing values rather than mis-parsed ones.

## Part 4: Importing Full Dataset
Marks: 10 

The only additional task in this part is to:
* Run the parser on all the files
* For each file **add a 'Date' column, which should be inferred from the filename**
* Concatenate each DataFrame into one large DataFrame. *Hint: concat*
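
The steps above can be sketched on toy data as follows. The two small DataFrames stand in for the output of the parser and are made up for illustration; the point is only the pattern of adding a `Date` column per file and calling `pd.concat` once on a list of frames.

```python
import pandas as pd

# Made-up stand-ins for parsed files, keyed by file name
parsed = {
    '01-08-2018.csv': pd.DataFrame(
        {'Bank': ['Bank A'], 'Donor_Name': ['ALI'], 'Amount': [1000.0]}),
    '02-08-2018.csv': pd.DataFrame(
        {'Bank': ['Bank B', 'Bank B'], 'Donor_Name': [None, 'SARA'], 'Amount': [500.0, 250.0]}),
}

frames = []
for name, frame in parsed.items():
    frame = frame.copy()
    # Strip the '.csv' extension and parse the remaining dd-mm-yyyy text
    frame['Date'] = pd.to_datetime(name[:-4], format='%d-%m-%Y')
    frames.append(frame)

# One concat over the whole list, with a fresh 0..n-1 index
full = pd.concat(frames, ignore_index=True)
print(full.shape)
```

Collecting the frames in a list and concatenating once avoids copying the growing DataFrame on every iteration.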

In [173]:
from datetime import datetime
import glob
import pandas as pd

files = glob.glob('./all_csvs/*.csv')
df['Date'] = pd.to_datetime(df['Date'], format = '%d-%m-%Y')

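# Collect one DataFrame per file and concatenate once; DataFrame.append was removed in pandas 2.0
frames = [df]
for file in files:
    if file[-14:] != example_file[-14:]:  # the example file is already in df
        temp = read_csv(file)
        temp['Date'] = pd.to_datetime(temp['Date'], format = '%d-%m-%Y')
        frames.append(temp)

df = pd.concat(frames, ignore_index = True)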

full_data = df
df

#full_data = # Your code goes here

#print(full_data.head())
#print(full_data.shape)
#print(full_data.tail())

Unnamed: 0,Bank,Donor_Name,Amount,Date
0,AL BARAKA BANK (PAKISTAN) LTD,MOHAMMAD AKHTAR 0117,70848.64,2018-10-01
1,AL BARAKA BANK (PAKISTAN) LTD,RIZWAN SULTAN 0117,30000.00,2018-10-01
2,AL BARAKA BANK (PAKISTAN) LTD,MUSHTAQ AHMED JANJUA 0117,19000.00,2018-10-01
3,AL BARAKA BANK (PAKISTAN) LTD,adc 0117,17000.00,2018-10-01
4,AL BARAKA BANK (PAKISTAN) LTD,ABDUL JABBAR 0117,1000.00,2018-10-01
5,AL BARAKA BANK (PAKISTAN) LTD,mushtaq ahmad 0117,1000.00,2018-10-01
6,AL BARAKA BANK (PAKISTAN) LTD,majeed 0117,1000.00,2018-10-01
7,AL BARAKA BANK (PAKISTAN) LTD,M SIDDIQ KHILJI 0117,1000.00,2018-10-01
8,AL BARAKA BANK (PAKISTAN) LTD,SHAMIM AKHTAR 0117,1000.00,2018-10-01
9,AL BARAKA BANK (PAKISTAN) LTD,SYED MUMTAZ ALI 0117,1000.00,2018-10-01


In [174]:
df[df.loc[:, 'Amount'].isnull()]

Unnamed: 0,Bank,Donor_Name,Amount,Date
783,Dubai Islamic Bank Pakistan Ltd,MALIK ASGHAR ALI AND FITAM MALIK CLUB ROAD BRA...,,2018-10-01
865,Habib Bank Limited,PUNJAB UNIVERSITY ASISTANT TREASURER PU EMERGE...,,2018-10-01
880,Habib Bank Limited,JAMILA ARIF 240 TOWN LINE ROAD COMMACK NEW YOR...,,2018-10-01
881,Habib Bank Limited,ARSHAD BUTT NEAR BIJLI GAR SARGHODA GUJTAR KUN...,,2018-10-01
884,Habib Bank Limited,GOLD MINE PLAZA 105 FEROZ PUR ROAD ICCHRA LAHO...,,2018-10-01
886,Habib Bank Limited,RASHDA PARVEEN KHAS SOHASA PO BOX SOHAWA TEHSI...,,2018-10-01
893,Habib Bank Limited,SHAISTA IFTIKHAR PGECHS PHASE 2 NEAR WAPDA TOW...,,2018-10-01
894,Habib Bank Limited,MUKHTIAR AHMED ST?8 QASIM TOWN NEAR POLICE Hab...,,2018-10-01
923,Habib Bank Limited,ARIFA 39 HADAIYT ULLAH BLOCK MUSTAFA TOWN LAHO...,,2018-10-01
941,Habib Bank Limited,SAKHI MUHAMMAD ISLAMIA PARK HOUSE NO588 ST Hab...,,2018-10-01


## Part 5: Data Integrity Checks
Marks: 20

* How many NaN values are there in each column? Why are they there? 
* What are the maximum and minimum values, is there anything peculiar about the max values?
* Are there any rows which are not NaN but should still be a different DataFrame altogether?
* Should these problem rows be removed? Can they be useful in other ways?
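
As a hedged starting point for these questions, the sketch below runs the basic checks on a toy DataFrame. The values are invented, and the cut-off of 1000 times the median is an arbitrary assumption chosen only to show the idea of flagging implausibly large amounts.

```python
import numpy as np
import pandas as pd

# Invented data with one missing donor, one missing amount and one implausible amount
toy = pd.DataFrame({
    'Bank': ['Bank A', 'Bank A', 'Bank B', 'Bank B'],
    'Donor_Name': ['ALI', None, 'SARA', '#REF!'],
    'Amount': [1000.0, np.nan, 250.0, 33499632791000.0],
})

# Missing values per column
nan_counts = toy.isna().sum()
# The largest amounts, for manual inspection
largest = toy.nlargest(2, 'Amount')
# Rows whose amount is far above the median (arbitrary factor, for illustration only)
median = toy['Amount'].median()
suspicious = toy[toy['Amount'] > 1000 * median]

print(nan_counts)
print(largest)
print(suspicious)
```

Keeping the flagged rows in a separate DataFrame, rather than dropping them, leaves them available for later inspection.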

In [175]:
#df.describe()
df[df.loc[:, 'Amount'] == df['Amount'].max()]

Unnamed: 0,Bank,Donor_Name,Amount,Date
1105,Habib Bank Limited,HIRA SULTANA ST ? 19 H ? 696 GULISTAN COLONY LHR,33499630000000.0,2018-10-01


In [176]:
df[df.loc[:, 'Donor_Name'] == df['Donor_Name'].min()]

Unnamed: 0,Bank,Donor_Name,Amount,Date
37124,The Bank of Punjab,#REF!,5000.0,2018-08-08
81911,Allied Bank Limited,#REF!,28565838.0,2018-08-17
137819,CITI Bank N A,#REF!,1239400.0,2018-07-30
142990,CITI Bank N A,#REF!,913950.0,2018-07-31
144636,The Bank of Punjab,#REF!,25829.78,2018-07-31


In [177]:
df[df.loc[:, 'Date'] == df['Date'].max()]

Unnamed: 0,Bank,Donor_Name,Amount,Date
25322,AL BARAKA BANK (PAKISTAN) LTD,HASSAN EJAZ 0117,100000.0,2018-10-05
25323,AL BARAKA BANK (PAKISTAN) LTD,MOHSIN 0117,100000.0,2018-10-05
25324,AL BARAKA BANK (PAKISTAN) LTD,ABDUL HAQ SIDDIQI 0117,50000.0,2018-10-05
25325,AL BARAKA BANK (PAKISTAN) LTD,javid akhter 0117,30000.0,2018-10-05
25326,AL BARAKA BANK (PAKISTAN) LTD,ADC 0117,10000.0,2018-10-05
25327,AL BARAKA BANK (PAKISTAN) LTD,MUHAMMAD USMAN 0117,9670.0,2018-10-05
25328,AL BARAKA BANK (PAKISTAN) LTD,HOPE SCHOOL 0117,3602.0,2018-10-05
25329,AL BARAKA BANK (PAKISTAN) LTD,HOPE SCHOOL 0117,3362.0,2018-10-05
25330,AL BARAKA BANK (PAKISTAN) LTD,HAFEEZ 0117,2000.0,2018-10-05
25331,AL BARAKA BANK (PAKISTAN) LTD,IRFAN 0117,1700.0,2018-10-05


The minimum amount corresponds to the entry 'Amount Inadvertently credit in the DAM fund Account in Previous Dates'. This appears to be an error in the entering of data on the other end.

The max amount appears to correspond to a phone number that was read as the amount because it was the last numeric entry in the row and was separated from the actual amount only by a comma. The raw data entered was '033499632791,000.00'. It would have to be dealt with as an individual outlier; no regular expression could separate it from an actual transaction entry.
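
Where the phone number ends and the amount begins in such a value is ambiguous, so the real amount cannot be recovered reliably. It may still be possible to flag values like this one for manual review. The sketch below assumes that genuine amounts are either plain digits or use three-digit comma grouping; that assumption and the name `has_valid_grouping` are illustrative and have not been checked against the full dataset.

```python
import re

# Either comma-grouped thousands (1,000 or 12,345,678) or plain digits, with optional decimals
GROUPED = re.compile(r'(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?')


def has_valid_grouping(text):
    """Return True if text looks like a normally formatted amount."""
    return GROUPED.fullmatch(text) is not None


raw = '033499632791,000.00'  # raw value quoted in the text above

# Stripping commas and converting, as the parser does, yields the huge value
print(float(raw.replace(',', '')))
# The strict format check rejects it, so the row can be set aside for review
print(has_valid_grouping(raw))
print(has_valid_grouping('1,000.00'))
```

A check like this only flags the row; deciding what the true amount was still requires looking at the original PDF.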

The minimum Date corresponds to the day the dam fund was started by the honourable CJP Mian Saqib Nisar.

The maximum Date corresponds to all the entries recorded on the 5th of October, which is the last date we take into account in our analysis because of the limited credits on Zamzar.

The minimum Donor_Name corresponds to the entry #REF!. This might be because the names of the donors were not recorded in these specific entries.

The maximum Donor_Name is simply the string that sorts last under the character-code (ASCII) ordering used when comparing strings in the DataFrame.

Some rows have the amount merged with the donor's phone number. These rows are problematic because they prevent us from finding the true maximum amount donated. However, they might be useful for acquiring data about donors and should be put in a different DataFrame altogether.
Similarly, all the rest of the data, other than amount and date, that we dumped into the middle column under Donor_Name includes things that might be relevant later, such as the addresses, districts, cities and phone numbers of the donors. This data should not be discarded but put in another column with more sophisticated and rigorous parsing.
It could serve other purposes, such as a location-based analysis of bank transactions, as opposed to the temporal analysis that we seem to be focusing on. In that case it would serve us better to put it in another DataFrame after converting it into meaningful, cleaned data.

In [178]:
df[df.loc[:, 'Donor_Name'] == 'n/a'].shape

(3892, 4)

The number of missing ('n/a') values for Donor_Name is 3892

In [179]:
df[df.loc[:, 'Amount'].isnull()].shape

(1370, 4)

The number of NaN values for Amount is 1370

In [180]:
df.shape

(146846, 4)

146846 is the total number of entries