<div align=center>
# Reproducible Notebook (Draft)
## University of Illinois
### Open Data Mashups
### Alex Dryden
### Spring 2019

<div>

-----

## Table of Contents: 

### [Purpose](#Purpose)

### [Methods](#Methods)
- [Loading Data Files](#Load)
- [Filtering Data](#Selecting-Data)
- [Assembly (Rinse, Repeat)](#Rinse,-Repeat)
- [Output](#Write)

### [Executive Summary](#Executive-Summary)

-----

## Purpose

This notebook should allow anyone interested in the final dataset to confirm how it was created from the [intermediate](./intermediate_index.ipynb) data. The methods section is designed to walk someone with minimal Python experience through the I/O, filtering, concatonation of the final dataset. See the [Executive Summary](#Executive-Summary) section for a more compressed version.

-----


----

## Methods





<div align=right>[Table of Contents](#Table-of-Contents)</div>


### Loading Data Files

So long as you are running this as a jupyter notebook, the current working directory is the same directory with the intermediate data files. If you are not, change the directory variable to the correct path where you have stored the intermediate data files. 

All of the intermediate data has been saved as a CSV, but we will be filtering and combining that data using a popular data manipulation and analysis library, pandas. Pansas has a simple way to import CSV data. By convention, we import the pandas library aliased as 'pd'.

-----

In [1]:
import pandas as pd

#file location
directory = './intermediate_data/intermediate_csv/'


----

<div align=right>[Table of Contents](#Table-of-Contents)</div>


### Managing Memory

Pandas is good at reducing the amount of RAM required to store and manipulate data, especially for numeric data. However, our intermediate datasets, when stored as a pandas dataframe, range from a few hundred MB to several GB. On an older laptop with other programs running in the background, it might cause some problems if we are doing all of our work while holding all of the intermediate data in memory. So we will work with and filter them one at a time, using the Jupyter reset magic to help manage memeory. 

Because this dataset is fouced only on regulated buildings that were sold 2018, we will start with that dataset and use it to as the primary filter. We eliminated columns that we did not need in the [intermediate](./intermediate_index.ipynb) data cleaning phase. 

----

In [2]:

#generate a dataframe from the csv file and use the BBL column as the index
transactions_df = pd.read_csv(directory + 'TransactionsClean.csv', index_col='BBL')

#with any pandas DF, using Python's built-in list() function will generate a list 
#of the column names. We can visually inspect that we have the columns we need
list(transactions_df)

['price',
 'cap_rate',
 'borough_cap_rate',
 'Latitude',
 'Longitude',
 'BIN',
 'deed_date',
 'watchlist']

----
<div align=right>[Table of Contents](#Table-of-Contents)</div>


### Filtering Data

We are using the BBL building identifier as an index becuase it is a common language that we can translate other building identifiers into, which we did during cleaning. Now we have a convenient way to quickly and effeciently filter the remaining data. Pandas offers a database-style join opperation that allows you to combine data with Venn Diagram-like logic. 

In this case, we are going to keep <i>all</i> of the data from the transactions data and add to it <i>any</i> rows from the new dataset that share the same BBL as an existing row in the transactions data. This has the effect of adding additional columns to our transactions dataset. If a row in the transactions dataframe has no match in the columns being added, then a Null value will populate that field. We will do this with each of the remaining intermediate datasets.

----

In [4]:
#generate a dataframe from the csv file and use the BBL column as the indes
valuation_df = pd.read_csv(directory + 'DoFValuationClean.csv', index_col='BBL')

#left join the transaction data with the boiler data, default is to join index-to-index
final_dataset = transactions_df.join(other=valuation_df, how='left')

#inspect to make sure all required columns are present 
list(final_dataset)

['price',
 'cap_rate',
 'borough_cap_rate',
 'Latitude',
 'Longitude',
 'BIN',
 'deed_date',
 'watchlist',
 'Owner',
 'Total_Units',
 'Residential_Units',
 'Buildings',
 'Stories',
 'Land_Market_Value',
 'Total_Market_Value',
 'Transitional_Assesed_Land_Value',
 'Transitional_Assessed_Total_Value,',
 'Actual_Assesed_Land_Value',
 'Actual_Assessed_Total_Value,',
 '5_Year_Exemption_Flags',
 'Exempt_Total_Value']


<div align=right>[Table of Contents](#Table-of-Contents)</div>


### <i> Manage Memory </i>
These datasets are orders of magnatude smaller now than when they were coverted into intermediate datasets. But on lightweight machines it may still be worthwile to remove pointers to memory we no longer need.

In [5]:
%who_ls

['directory', 'final_dataset', 'pd', 'transactions_df', 'valuation_df']

---
Now lets remove the boiler_df and confirm that it is no longer in the namespace

---

In [6]:
#use reset magic to remove the df from memory

%reset_selective -f transactions_df
%reset_selective -f valuation_df

%who_ls

['directory', 'final_dataset', 'pd']

---

<div align=right>[Table of Contents](#Table-of-Contents)</div>



### Rinse, Repeat

Because we have BBL as a column in each of our intermediate datasets, we can easily use the same process we outlined above for the remaing intermediate datasets. To simplify the code, we store theremaining intermediate datasets in a list and process them in a loop.

---

In [7]:
#remaining 5 dataset names
intermediate_data = [
    'HDPComplaintProblemsClean.csv',
    'DHPFeesClean.csv',
]

#for each dataset, load it into a df, set index, join to transactions and then remove
for dataset in intermediate_data:
    other = pd.read_csv(directory + dataset, index_col='BBL')
    final_dataset.join(other=other, how='left')
    %reset_selective -f other


    

---

<div align=right>[Table of Contents](#Table-of-Contents)</div>


### Output

Now all that is left is to write the data to a CSV. To demonstrate that this has totally recreated the final dataset, we can use checksums to compare them. 

#### Get Checksum
We can use the md5 algorithm to get a hash value of the final dataset you downloaded with this directory and then delete that dataset. 

----

In [16]:
import hashlib
import os

loc = 'final_dataset.csv'

with open(loc, 'rb') as f:
    data = f.read()
    first_checksum = hashlib.md5(data).hexdigest()
    
os.remove(loc)
first_checksum

'a27d32de6bdf518332c54306e27d140c'

----

<div align=right>[Table of Contents](#Table-of-Contents)</div>



#### Write
Now we write the final dataset we just created to the same location

----

In [17]:
final_dataset.to_csv(r'./final_dataset.csv', index='BBL', header=True)

---

#### Compare

Now we get the checksum of the csv we just produced and compare it to the previous checksum

---

In [18]:
with open(loc, 'rb') as f:
    data = f.read()
    second_checksum = hashlib.md5(data).hexdigest()
    
first_checksum == second_checksum

True

---
The two values are the same, so we know we just reproduced the final dataset exactly

---