### **Z ASMT DATA EXTRACT & MERGE SCRIPT**

**PLACE THIS SCRIPT** in the folder or directory containing the quarterly data folders.

For instance, place this script at the same level as the "20181230" and "20190616" data folders from ZTRAX (that is, next to them, not in its own separate folder).

----------------------------------------------------
**This script is for ZASMT DATA; it does the following:**
- Works through any selected ZTRAX data folders.

*For each folder:*
-   Takes data from selected files and only keeps selected variables from these files.
-   Merges the selected ZAsmt files/variables into a single output file for each selected quarterly data folder.
-   Creates a data_summary file for each merged output file, containing summary statistics for the original and merged data, the total number of dropped duplicate rows, and the execution time.
----------------------------------------------------

**TO RUN** the script, simply click the run button that appears near the top left of the first cell.

Only this cell needs to be run, provided the function cells below it have already been run in the current session so that their definitions exist. If any other cell is changed, run the changed cell individually before running the main (first) cell again.

**Potential improvements:**
*This script was written for quick results/development rather than maintainability*. All data that needs to be changed by the user to run the script on a different ZTRAX dataset should be in one place for ease of use/modification.
Further, there is repetition in creating the dataframes from each separate file. Future development can remove these separate functions for each pre-defined file and create a single function which takes the file of interest as its parameter. This will benefit maintainability and allow the user to easily add additional files of interest; right now, the script assumes we will always want to extract the same individual files from each ZAsmt data folder.
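
The single-function idea described above could look roughly like the following. This is a minimal sketch, not the script's current design: the names `FILE_SPECS` and `read_zasmt_file` are hypothetical, only two of the files are listed, and the demo data is made up. The column positions and names are copied from `get_taxDistrict` and `get_value`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical spec table: file name -> (column positions, column names).
# Adding a new file of interest would mean adding one entry here.
FILE_SPECS = {
    "TaxDistrict.txt": ([0, 1], ["RowID", "TaxDistrictStndCode"]),
    "Value.txt": ([0, 3, 4], ["RowID", "TotalAssessedValue", "AssessmentYear"]),
}


def read_zasmt_file(folder, filename, specs=FILE_SPECS):
    """Read one ZAsmt file, keep the listed columns, and name them."""
    positions, names = specs[filename]
    path = os.path.join(folder, "ZAsmt", filename)
    df = pd.read_csv(path, sep="|", on_bad_lines="skip", encoding="latin-1")
    df = df.iloc[:, positions]
    df.columns = names
    return df


# Tiny demonstration on a made-up file (not real ZTRAX data).
demo_root = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_root, "ZAsmt"))
with open(os.path.join(demo_root, "ZAsmt", "Value.txt"), "w", encoding="latin-1") as f:
    f.write("c0|c1|c2|c3|c4\n")  # with these read_csv settings, line 1 becomes the header
    f.write("1|x|y|250000|2018\n")
    f.write("2|x|y|310000|2018\n")

value = read_zasmt_file(demo_root, "Value.txt")
print(value)
```

With a table like this, `create_dfs_from_files` could loop over the entries instead of calling one function per file, and the user-editable settings would sit in one place.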


In [16]:
import pandas as pd
import time

folders = ["20181230", "20190319", "20190616", "20190918"]

def main():
    start_time = time.time()

    for folder in folders:
        print("Current Folder: ", folder)

        dataframes = create_dfs_from_files(folder)
        print("Created dataframes")

        drop_counties(dataframes["main"])
        print("Dropped duplicates")

        merged, num_dropped = merge(dataframes)
        print("Merged")

        write_file(folder, merged)
        print("Wrote file")

        exec_time = time.time() - start_time

        create_summary(folder, merged, dataframes, num_dropped, exec_time)
        print("Created summary")
        
        print("Completed " + folder + ". Execution time (seconds): " + str(exec_time))
        print("")


#RUN SCRIPT
main()

Current Folder:  20181230


Created dataframes
Dropped duplicates
Merged
Wrote file
Created summary
Completed 20181230. Execution time (seconds): 491.7272672653198

Current Folder:  20190319


Created dataframes
Dropped duplicates
Merged
Wrote file
Created summary
Completed 20190319. Execution time (seconds): 1024.5489990711212

Current Folder:  20190616


Created dataframes
Dropped duplicates
Merged
Wrote file
Created summary
Completed 20190616. Execution time (seconds): 1615.906179189682

Current Folder:  20190918


Created dataframes
Dropped duplicates
Merged
Wrote file
Created summary
Completed 20190918. Execution time (seconds): 2350.4842069149017



In [3]:
def create_dfs_from_files(folder):
    dataframes = {}
    dataframes["building"] = get_building(folder)
    dataframes["buildingAreas"] = get_buildingAreas(folder)
    dataframes["saleData"] = get_saleData(folder)
    dataframes["taxDistrict"] = get_taxDistrict(folder)
    dataframes["value"] = get_value(folder)
    dataframes["main"] = get_main(folder)
    dataframes["mailAddress"] = get_mailAddress(folder)
    return dataframes


The following get functions each read in the corresponding data file, keep only the variables of interest (selected by column position), and assign those columns their proper names manually.
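
Because the columns are named by hand, it is worth checking how the first line of each file is treated. The sketch below uses made-up, headerless pipe-delimited text; whether the real ZAsmt files have a header row is an assumption to verify against the ZTRAX layout documentation. It only shows how `read_csv` behaves with and without `header=None`.

```python
import io

import pandas as pd

# Made-up, headerless pipe-delimited text standing in for a ZAsmt file.
raw = "101|2|1995\n102|1|2003\n103|4|1987\n"

# Default behaviour: the first line is consumed as column labels.
with_default = pd.read_csv(io.StringIO(raw), sep="|")

# header=None: every line is treated as data.
with_header_none = pd.read_csv(io.StringIO(raw), sep="|", header=None)

print(len(with_default), len(with_header_none))

# Select by position and name manually, as the get functions do.
selected = with_header_none.iloc[:, [0, 2]]
selected.columns = ["RowID", "YearBuilt"]
print(selected)
```

If the real files turn out to be headerless, the default call would silently lose the first record of every file.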

In [4]:
def get_building(folder): 
    building = pd.read_csv(folder + '\\ZAsmt\\Building.txt', sep='|', on_bad_lines='skip', encoding='latin-1')

    building = building.iloc[:, [0, 1, 14, 19, 26]]

    building.columns = ["RowID", "NoOfUnits", "YearBuilt", "TotalBedrooms", "TotalActualBathCount"]

    return building

In [5]:
def get_buildingAreas(folder):
    buildingAreas = pd.read_csv(folder + '\\ZAsmt\\BuildingAreas.txt', sep='|', on_bad_lines='skip', encoding='latin-1')
    
    buildingAreas = buildingAreas.iloc[:, [0, 4]]
    
    buildingAreas.columns = ["RowID", "BuildingAreaSqFt"]

    return buildingAreas

In [6]:
def get_saleData(folder):
    saleData = pd.read_csv(folder + '\\ZAsmt\\SaleData.txt', sep='|', on_bad_lines='skip', encoding='latin-1')

    saleData = saleData.iloc[:, [0, 2, 3, 5, 11]]

    saleData.columns = ["RowID", "SellerFullName", "BuyerFullName", "DocumentDate", "SalesPriceAmount"]

    return saleData

In [7]:
def get_taxDistrict(folder):
    taxDistrict = pd.read_csv(folder + '\\ZAsmt\\TaxDistrict.txt', sep='|', on_bad_lines='skip', encoding='latin-1')

    taxDistrict = taxDistrict.iloc[:, [0, 1]]

    taxDistrict.columns = ["RowID", "TaxDistrictStndCode"]

    return taxDistrict

In [8]:
def get_value(folder):
    value = pd.read_csv(folder + '\\ZAsmt\\Value.txt', sep='|', on_bad_lines='skip', encoding='latin-1')

    value = value.iloc[:, [0, 3, 4]]

    value.columns = ["RowID", "TotalAssessedValue", "AssessmentYear"]

    return value

In [9]:
def get_main(folder):
    main = pd.read_csv(folder + '\\ZAsmt\\Main.txt', sep='|', on_bad_lines='skip', encoding='latin-1')

    main = main.iloc[:, [0, 1, 4, 20, 21, 22, 23, 24, 25, 26, 27, 29, 30, 35, 36, 38, 39, 40, 70, 92]]

    main.columns = ["RowID", "ImportParcelID", "County" , "PropertyHouseNumber", "PropertyHouseNumberExt", "PropertyStreetPreDirectional", "PropertyStreetName", "PropertyStreetSuffix", "PropertyStreetPostDirectional", "PropertyFullStreetAddress", "PropertyCity", "PropertyZip", "PropertyZip4", "PropertyZoningSourceCode", "CensusTract", "TaxAmount", "TaxYear", "TaxDelinquencyFlag", "LotSizeSquareFeet", "BatchID"]

    return main

In [10]:
def get_mailAddress(folder):
    mailAddress = pd.read_csv(folder + '\\ZAsmt\\MailAddress.txt', sep='|', on_bad_lines='skip', encoding="latin-1")

    mailAddress = mailAddress.iloc[:, [0, 2, 3, 4, 5, 6, 7, 9, 10, 11, 13, 14]]
    mailAddress.columns = ["RowID", "MailHouseNumber", "MailHouseNumberExt", "MailStreetPreDirectional", "MailStreetName", "MailStreetSuffix", "MailStreetPostDirectional", "MailBuildingNumber", "MailFullStreetAddress", "MailCity", "MailZip", "MailZip4"]

    return mailAddress

MODIFY DESIRED COUNTIES HERE. To use a different set of counties, create a new variable holding a list of the county names you want, all lowercase, and pass that variable to the isin() function in place of COUNTIES_ATL.

This function drops every row whose county is not in the 29-county Atlanta MSA from the main dataframe. The "Main" file, and the resultant dataframe, is the best source of county information from all of the files; we can assume that every property must have at least an entry in the main file.

Soon, we will begin the merging process, starting with the main file. With this assumption, we can speed up the merging process by dropping now and performing a left join on the other files. The alternative would be dropping the counties after the merge, but we would still be using the "County" variable from the main dataframe, so it is more efficient to do it now.
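
The county filter described above can be tried on a toy dataframe first. This is a minimal sketch under stated assumptions: `COUNTIES_CUSTOM` is a hypothetical replacement list and the rows are made up.

```python
import pandas as pd

# Hypothetical custom county list (all lowercase), used instead of COUNTIES_ATL.
COUNTIES_CUSTOM = ["fulton", "dekalb"]

# Made-up rows standing in for the "main" dataframe.
main = pd.DataFrame({
    "RowID": [1, 2, 3, 4],
    "County": ["FULTON", "Cobb", "DeKalb", None],
})

# Lowercase the county names so the comparison is case-insensitive.
# A missing county stays missing, and isin() treats it as "not in the list".
keep = main["County"].str.lower().isin(COUNTIES_CUSTOM)
filtered = main.loc[keep]

print(filtered)
```

Note that `loc` returns a new dataframe and leaves `main` itself unchanged, so a helper built this way has to return the result (or drop rows in place) for the caller to see the filtering.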

In [11]:
def drop_counties(main):
    COUNTIES_ATL = ["barrow", "bartow", "butts", "carroll", "cherokee", "clayton", "cobb", "coweta", "dawson", "dekalb", "douglas", "fayette", "forsyth", "fulton", "gwinnett", "haralson", "heard", "henry", "jasper", "lamar", "meriwether", "morgan", "newton", "paulding", "pickens", "pike", "rockdale", "spalding", "walton"]
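    # Drop rows in place: reassigning the local name `main` would leave the caller's dataframe unchanged.
    keep = main['County'].str.lower().isin(COUNTIES_ATL) # Change COUNTIES_ATL (or similar) here
    main.drop(main[~keep].index, inplace=True)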

Merges the dataframes from each file, using a left join. A left join keeps all of the data from the left dataframe and adds matching data from the right dataframe, if there is any. Since we begin the merge with the main dataframe, we are utilizing the assumption that any property will at least have an entry in main. We keep this data throughout and add to it if there is any additional data with a matching RowID via left join.

Finally, we drop any exact duplicate rows to reduce the data size. The number of dropped rows is also calculated and recorded in the "data_summary" file.
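
The left join and the duplicate count can be seen on a toy example. The data below is made up; the column names follow the ones used in this script.

```python
import pandas as pd

main = pd.DataFrame({"RowID": [1, 2, 3], "County": ["fulton", "cobb", "dekalb"]})

# RowID 1 appears twice with identical values; RowID 3 has no match at all.
value = pd.DataFrame({
    "RowID": [1, 1, 2],
    "TotalAssessedValue": [250000, 250000, 310000],
})

# Every row of main is kept; unmatched rows get NaN in the added columns,
# and a RowID that matches several right-hand rows is repeated once per match.
merged = main.merge(value, how="left", on="RowID")
prev_size = len(merged.index)

# Only rows that are identical in every column are removed.
merged = merged.drop_duplicates()
num_dropped = prev_size - len(merged.index)

print(merged)
print(num_dropped)
```

Rows that share a RowID but differ in any column survive `drop_duplicates`, so files with several records per RowID can still enlarge the merged output.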

In [12]:
def merge(dataframes):
    merged = dataframes["main"].merge(dataframes["mailAddress"], how="left", on="RowID")

    merged = merged.merge(dataframes["saleData"], how="left", on="RowID")
    merged = merged.merge(dataframes["value"], how="left", on="RowID")
    merged = merged.merge(dataframes["building"], how="left", on="RowID")
    merged = merged.merge(dataframes["buildingAreas"], how="left", on="RowID")
    merged = merged.merge(dataframes["taxDistrict"], how="left", on="RowID")

    prev_size = len(merged.index)

    merged.drop_duplicates(inplace=True)

    num_dropped = prev_size - len(merged.index)

    return merged, num_dropped

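Writes the merged dataframe to "{folder}_asmt_out.csv" in the "out" folder, where {folder} is the name of the quarterly data folder (for example, "20181230"). The "out" folder must already exist next to the script; to_csv does not create it.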
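
If the output folder might be missing, a variant along these lines would create it first. This is a sketch only: `write_file_safe` and its `out_dir` parameter are hypothetical names, not part of the script, and the demo writes to a temporary directory.

```python
import os
import tempfile

import pandas as pd


def write_file_safe(folder, merged, out_dir="out"):
    """Hypothetical variant of write_file that creates out_dir if needed."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, folder + "_asmt_out.csv")
    merged.to_csv(path)
    return path


# Demonstration in a throwaway directory with made-up data.
demo_out = os.path.join(tempfile.mkdtemp(), "out")
demo_path = write_file_safe("20181230", pd.DataFrame({"RowID": [1, 2]}), out_dir=demo_out)
print(demo_path)
```

Using `os.path.join` instead of a hard-coded backslash also lets the same code run outside Windows.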

In [13]:
def write_file(folder, merged):
    merged.to_csv("out" + "\\" + folder + "_asmt_out.csv")

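Creates a "{folder}_asmt_data_summary.txt" file in the "out" folder for each quarterly data folder, containing the following:
- Execution time (seconds). The timer starts once, before the first folder, so each value is cumulative from the start of the run.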
- Number of dropped duplicates in the merged file.
- Summary statistics for each of the original files (excluding those where this information is meaningless, ex: mailAddress): building, buildingAreas, saleData, value, and main.
- Summary statistics for the new merged file.
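
To preview what one statistics section of the summary looks like, the same `describe().round(2).to_string()` chain can be run on a toy dataframe. The data is made up.

```python
import pandas as pd

value = pd.DataFrame({
    "RowID": [1, 2, 3],
    "TotalAssessedValue": [250000.0, 310000.0, 190000.0],
})

# Same pattern the summary uses: a label line, then the describe() table as text.
section = "Value: \n" + value.describe().round(2).to_string() + "\n\n"
print(section)
```

For a dataframe with both numeric and text columns, `describe()` reports only the numeric ones by default, so text fields such as names and addresses do not appear in these tables.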

In [14]:
def create_summary(folder, merged, dataframes, num_dropped, exec_time):
    # (label, key) pairs for the original files that get a statistics section
    sections = [("Building", "building"), ("BuildingAreas", "buildingAreas"),
                ("SaleData", "saleData"), ("Value", "value"), ("Main", "main")]

    # The with-block closes the file even if one of the writes fails
    with open("out" + "\\" + folder + "_asmt_data_summary.txt", 'w') as txt:
        txt.write("Execution time (seconds): " + str(exec_time))
        txt.write("\n\n")

        txt.write("Number of Dropped Duplicates: " + str(num_dropped))
        txt.write("\n\n")

        txt.write("Original Data Statistics")
        txt.write("\n\n")

        for label, key in sections:
            txt.write(label + ": ")
            txt.write("\n")
            txt.write(dataframes[key].describe().round(2).to_string())
            txt.write("\n\n")

        txt.write("Merged Data Statistics: ")
        txt.write("\n")
        txt.write(merged.describe().round(2).to_string())
        txt.write("\n\n")