### **Z TRANS DATA EXTRACT & MERGE SCRIPT**

**PLACE THIS SCRIPT** in the folder or directory containing the quarterly data folders.

For instance, this script should sit in the same directory, and at the same level (i.e., not in its own separate folder), as the "20181230" and "20190616" data folders from ZTRAX.

----------------------------------------------------
**This script is for ZTRANS DATA; it does the following:**
- Works through any selected ZTRAX data folders.

*For each folder:*
-   Takes data from selected files and only keeps selected variables from these files.
-   Merges the selected ZTrans files/variables into a single output file for each selected quarterly data folder.
-   Creates a data_summary file for each merged output file, containing the original and new summary statistics, the total number of dropped duplicate rows, and the execution time.
----------------------------------------------------

**TO RUN** the script, click the run button that appears near the top left of the first cell.

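Only run this cell, unless changes are made to any other cells. If other cells are changed, run the changed cells individually before running the main (first) cell again. In a fresh session, run each of the function cells below once first, since the main cell calls the functions they define.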

**Potential improvements:**
*This script was written for quick results/development rather than maintainability*. Everything a user must change to run the script on a different ZTRAX dataset (the folder list, the selected files and columns, and the county list) should be gathered in one place for ease of use and modification.
Further, the dataframe for each file is created by its own near-identical function. Future development could replace these per-file functions with a single function that takes the file of interest as a parameter. This would improve maintainability and let the user add files of interest easily; right now, the script assumes we will always want to extract the same individual files from each ZTrans data folder.
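
As a rough sketch of the single-function idea, the per-file details could be moved into one table and read by one loader. `FILE_SPECS`, `read_ztrans_file`, and `load_selected_files` are hypothetical names, not part of the script; only two files are listed, with positions and column names copied from the corresponding functions further down. The sketch uses `os.path.join` instead of hard-coded backslashes, which is an assumption about how the paths might be built.

```python
import os

import pandas as pd

# Hypothetical table: dataframe key -> (file name, column positions to keep, new column names).
# Only two entries are shown; the other files would be added the same way.
FILE_SPECS = {
    "buyerName": (
        "BuyerName.txt",
        [0, 1, 2, 3, 4],
        ["TransId", "BuyerFirstMiddleName", "BuyerLastName",
         "BuyerIndividualFullName", "BuyerNonIndividualName"],
    ),
    "sellerName": (
        "SellerName.txt",
        [0, 1, 2, 3, 4],
        ["TransId", "SellerFirstMiddleName", "SellerLastName",
         "SellerIndividualFullName", "SellerNonIndividualName"],
    ),
}


def read_ztrans_file(folder, key):
    """Read one ZTrans file, keeping and renaming only the columns listed in FILE_SPECS."""
    file_name, positions, names = FILE_SPECS[key]
    path = os.path.join(folder, "ZTrans", file_name)
    df = pd.read_csv(path, sep="|", on_bad_lines="skip")
    df = df.iloc[:, positions]
    df.columns = names
    return df


def load_selected_files(folder):
    """Build the same kind of dictionary of dataframes, driven entirely by FILE_SPECS."""
    return {key: read_ztrans_file(folder, key) for key in FILE_SPECS}
```

With this layout, adding a file of interest would mean adding one entry to the table rather than writing a new function.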


In [30]:
import pandas as pd
import time

folders = ["20181230", "20190319", "20190616", "20190918", "20191009", "20200102", "20200407", "20200811", "20201012", "20210111", "20210405", "20210802", "20211018", "20220429"]

def main():
    start_time = time.time()

    for folder in folders:
        print("Current Folder: ", folder)

        dataframes = create_dfs_from_files(folder)
        print("Created dataframes")

        # Keep only rows from the selected counties (duplicates are dropped later, in merge())
        drop_counties(dataframes["main"])
        print("Dropped duplicates")

        merged, num_dropped = merge(dataframes)
        print("Merged")

        write_file(folder, merged)
        print("Wrote file")

        # Cumulative time since the script started (start_time is set once, before the loop)
        exec_time = time.time() - start_time

        create_summary(folder, merged, dataframes, num_dropped, exec_time)
        print("Created summary")
        
        print("Completed " + folder + ". Execution time (seconds): " + str(exec_time))
        print("")


#RUN SCRIPT
main()

Current Folder:  20181230


Created dataframes
Dropped duplicates
Merged
Wrote file
Created summary
Completed 20181230. Execution time (seconds): 991.3085854053497

Current Folder:  20190319


Created dataframes
Dropped duplicates
Merged
Wrote file
Created summary
Completed 20190319. Execution time (seconds): 2028.1399517059326

Current Folder:  20190616


Created dataframes
Dropped duplicates
Merged
Wrote file
Created summary
Completed 20190616. Execution time (seconds): 3102.0462307929993

Current Folder:  20190918


Created dataframes
Dropped duplicates
Merged
Wrote file
Created summary
Completed 20190918. Execution time (seconds): 4220.619454860687

Current Folder:  20191009


Created dataframes
Dropped duplicates
Merged
Wrote file
Created summary
Completed 20191009. Execution time (seconds): 5342.93445110321

Current Folder:  20200102


Created dataframes
Dropped duplicates
Merged
Wrote file
Created summary
Completed 20200102. Execution time (seconds): 6510.534721136093

Current Folder:  20200407


Created dataframes
Dropped duplicates
Merged
Wrote file
Created summary
Completed 20200407. Execution time (seconds): 7700.800858259201

Current Folder:  20200811


Created dataframes
Dropped duplicates
Merged
Wrote file
Created summary
Completed 20200811. Execution time (seconds): 8919.879530668259

Current Folder:  20201012


Created dataframes
Dropped duplicates
Merged
Wrote file
Created summary
Completed 20201012. Execution time (seconds): 10181.633672237396

Current Folder:  20210111


Created dataframes
Dropped duplicates
Merged
Wrote file
Created summary
Completed 20210111. Execution time (seconds): 11469.96124625206

Current Folder:  20210405


Created dataframes
Dropped duplicates
Merged
Wrote file
Created summary
Completed 20210405. Execution time (seconds): 12794.764294862747

Current Folder:  20210802


Created dataframes
Dropped duplicates
Merged
Wrote file
Created summary
Completed 20210802. Execution time (seconds): 14151.533039331436

Current Folder:  20211018


Created dataframes
Dropped duplicates
Merged
Wrote file
Created summary
Completed 20211018. Execution time (seconds): 15732.553567886353

Current Folder:  20220429


Created dataframes
Dropped duplicates
Merged
Wrote file
Created summary
Completed 20220429. Execution time (seconds): 16625.930213928223
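
The times printed above accumulate across folders, because `start_time` is set once before the loop; for example, 20190319 on its own took roughly 2028 - 991 ≈ 1037 seconds. If per-folder times are wanted, one possible approach is to start a separate timer inside the loop. The sketch below is illustrative only: `timed_run` and `process_folder` are hypothetical names standing in for the loop body of `main()`.

```python
import time


def timed_run(folders, process_folder):
    """Time each folder separately and also report the total for the whole run."""
    run_start = time.perf_counter()
    per_folder = {}
    for folder in folders:
        folder_start = time.perf_counter()  # reset for every folder
        process_folder(folder)
        per_folder[folder] = time.perf_counter() - folder_start
    total = time.perf_counter() - run_start
    return per_folder, total


# Stand-in workload; in the real script this would be the create/drop/merge/write steps.
per_folder, total = timed_run(["20181230", "20190319"], lambda folder: time.sleep(0.01))
for folder, seconds in per_folder.items():
    print(folder, round(seconds, 3))
print("Total:", round(total, 3))
```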



In [15]:
def create_dfs_from_files(folder):
    dataframes = {}
    dataframes["foreclosureName"] = get_foreclosureName(folder)
    dataframes["propertyInfo"] = get_propertyInfo(folder)
    dataframes["main"] = get_main(folder)
    dataframes["sellerMailAddress"] = get_sellerMailAddress(folder)
    dataframes["buyerMailAddress"] = get_buyerMailAddress(folder)
    dataframes["buyerName"] = get_buyerName(folder)
    dataframes["sellerName"] = get_sellerName(folder)
    return dataframes


The following get_* functions each read the corresponding data file, keep only the variables of interest (selected by column position), and assign those columns their proper names manually.
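
One detail worth checking when reading these files: by default, `pd.read_csv` treats the first line of a file as column names. Whether the ZTrans text files begin with a header row is not stated here; if they do not, the first record of each file would be consumed as a header and lost. The toy example below (made-up data, not ZTRAX content) shows the difference that `header=None` makes.

```python
import io

import pandas as pd

# Toy pipe-delimited data with NO header row (three records).
raw = "101|GA|fulton\n102|GA|cobb\n103|GA|dekalb\n"

# Default behaviour: the first line becomes the column names, so two rows remain.
with_default = pd.read_csv(io.StringIO(raw), sep="|")

# header=None: every line is data, and columns are numbered 0, 1, 2.
with_header_none = pd.read_csv(io.StringIO(raw), sep="|", header=None)

print(len(with_default), len(with_header_none))  # 2 3
```

Selecting columns by position with `iloc` works the same way in both cases, so the only change needed, if the files turn out to be headerless, would be the extra `header=None` argument.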

In [16]:
def get_foreclosureName(folder): 
    ForeclosureNameAddress = pd.read_csv(folder + '\\ZTrans\\ForeclosureNameAddress.txt', sep='|', on_bad_lines='skip')

    ForeclosureNameAddress = ForeclosureNameAddress.iloc[:, [0, 2, 3, 4, 5, 7, 9, 11, 12, 14, 15]]

    ForeclosureNameAddress.columns = ["TransId", "FCMailFirstMiddleName", "FCMailLastName", "FCMailIndividualFullName", "FCMailNonIndividualName", "FCMailFullStreetAddress", "FCMailBuildingNumber", "FCMailUnit", "FCMailCity", "FCMailZip", "FCMailZip4"]

    return ForeclosureNameAddress

In [17]:
def get_propertyInfo(folder):
    PropertyInfo = pd.read_csv(folder + '\\ZTrans\\PropertyInfo.txt', sep='|', on_bad_lines='skip')

    PropertyInfo = PropertyInfo.iloc[:, [0, 8, 9, 10, 11, 12, 13, 14, 15, 16, 18, 19]]

    PropertyInfo.columns = ["TransId", "PropertyHouseNumber", "PropertyHouseNumberExt", "PropertyStreetPreDirectional", "PropertyStreetName", "PropertyStreetSuffix", "PropertyStreetPostDirectional", "PropertyBuildingNumber", "PropertyFullStreetAddress", "PropertyCity", "PropertyZip", "PropertyZip4"]

    return PropertyInfo

In [18]:
def get_main(folder):
    Main = pd.read_csv(folder + '\\ZTrans\\Main.txt', sep='|', on_bad_lines='skip')

    Main = Main.iloc[:, [0, 1, 4, 17, 18, 24, 104, 105, 127]]

    Main.columns = ["TransId", "FIPS", "County", "DocumentDate", "SignatureDate", "SalesPriceAmount", "TotalDelinquentAmount", "DelinquentAsOfDate", "BatchID"]

    return Main

In [19]:
def get_sellerMailAddress(folder):
    SellerMailAddress = pd.read_csv(folder + '\\ZTrans\\SellerMailAddress.txt', sep='|', on_bad_lines='skip')

    SellerMailAddress = SellerMailAddress.iloc[:, [0, 3, 4, 5, 6, 7, 8, 10, 11, 12, 14, 15]]

    SellerMailAddress.columns = ["TransId", "SellerMailHouseNumber", "SellerMailHouseNumberExt", "SellerMailStreetPreDirectional", "SellerMailStreetName", "SellerMailStreetSuffix", "SellerMailStreetPostDirectional", "SellerMailBuildingNumber", "SellerMailFullStreetAddress", "SellerMailCity", "SellerMailZip", "SellerMailZip4"]
    return SellerMailAddress

In [20]:
def get_buyerMailAddress(folder):
    BuyerMailAddress = pd.read_csv(folder + '\\ZTrans\\BuyerMailAddress.txt', sep='|', on_bad_lines='skip')

    BuyerMailAddress = BuyerMailAddress.iloc[:, [0, 3, 4, 5, 6, 7, 8, 10, 11, 12, 14, 15]]

    BuyerMailAddress.columns = ["TransId", "BuyerMailHouseNumber", "BuyerMailHouseNumberExt", "BuyerMailStreetPreDirectional", "BuyerMailStreetName", "BuyerMailStreetSuffix", "BuyerMailStreetPostDirectional", "BuyerMailBuildingNumber", "BuyerMailFullStreetAddress", "BuyerMailCity", "BuyerMailZip", "BuyerMailZip4"]

    return BuyerMailAddress

In [21]:
def get_buyerName(folder):   
    BuyerName = pd.read_csv(folder + '\\ZTrans\\BuyerName.txt', sep='|', on_bad_lines='skip')

    BuyerName = BuyerName.iloc[:, [0, 1, 2, 3, 4]]

    BuyerName.columns = ["TransId", "BuyerFirstMiddleName", "BuyerLastName", "BuyerIndividualFullName", "BuyerNonIndividualName"]

    return BuyerName

In [22]:
def get_sellerName(folder):
    SellerName = pd.read_csv(folder + '\\ZTrans\\SellerName.txt', sep='|', on_bad_lines='skip')

    SellerName = SellerName.iloc[:, [0, 1, 2, 3, 4]]

    SellerName.columns = ["TransId", "SellerFirstMiddleName", "SellerLastName", "SellerIndividualFullName", "SellerNonIndividualName"]

    return SellerName

MODIFY DESIRED COUNTIES HERE. Create a new variable holding a list of the counties you want, all lowercase. Then pass that variable to the isin() function in place of COUNTIES_ATL.

Drops every row whose county is not in the 29-county Atlanta MSA from the main dataframe. The "Main" file, and the resulting dataframe, is the best source of county information among all of the files; we assume that every property has at least one entry in the main file.

Soon, we will begin the merging process, starting with the main file. Under this assumption, we can speed up the merge by dropping counties now and performing a left join on the other files. The alternative would be to drop counties after the merge, but that would still rely on the "County" variable from the main dataframe, so it is more efficient to do it now.

In [23]:
def drop_counties(main):
    COUNTIES_ATL = ["barrow", "bartow", "butts", "carroll", "cherokee", "clayton", "cobb", "coweta", "dawson", "dekalb", "douglas", "fayette", "forsyth", "fulton", "gwinnett", "haralson", "heard", "henry", "jasper", "lamar", "meriwether", "morgan", "newton", "paulding", "pickens", "pike", "rockdale", "spalding", "walton"]
    # Change COUNTIES_ATL (or similar) here
    in_selected = main['County'].str.lower().isin(COUNTIES_ATL)
    # Drop in place: reassigning the local name "main" would leave the caller's dataframe unfiltered
    main.drop(main.index[~in_selected], inplace=True)
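
The county filter can be tried on toy data. The sketch below uses a hypothetical helper, `filter_counties`, that returns the filtered frame. Boolean indexing with `.loc` produces a new DataFrame and leaves the original untouched, so the result has to be returned to the caller (or the frame modified in place) for the filter to take effect.

```python
import pandas as pd


def filter_counties(main, counties):
    # Hypothetical helper for illustration; names are matched case-insensitively.
    keep = main["County"].str.lower().isin(counties)
    return main.loc[keep]


# Made-up rows: mixed case, a county outside the list, and a missing county.
toy = pd.DataFrame({
    "TransId": [1, 2, 3, 4],
    "County": ["FULTON", "Cobb", "bibb", None],
})

kept = filter_counties(toy, ["fulton", "cobb"])
print(kept["TransId"].tolist())  # [1, 2]
print(len(toy))                  # 4 (the original frame is unchanged)
```

Rows with a missing county are excluded as well, since a missing value never matches an entry in the list.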

Merges the dataframes from each file, using a left join. A left join keeps all of the rows from the left dataframe and adds matching data from the right dataframe, if there is any. Since we begin the merge with the main dataframe, we rely on the assumption that any property will have at least one entry in main. We keep this data throughout and, via the left join, add any data from the other files with a matching TransId.

Finally, we drop any duplicates to reduce the data size. The number of dropped entries is also calculated and recorded in the "data_summary" file.

In [29]:
def merge(dataframes):
    merged = dataframes["main"].merge(dataframes["propertyInfo"], how="left", on="TransId")

    merged = merged.merge(dataframes["buyerMailAddress"], how="left", on="TransId")
    merged = merged.merge(dataframes["sellerMailAddress"], how="left", on="TransId")
    merged = merged.merge(dataframes["buyerName"], how="left", on="TransId")
    merged = merged.merge(dataframes["sellerName"], how="left", on="TransId")
    merged = merged.merge(dataframes["foreclosureName"], how="left", on="TransId")

    prev_size = len(merged.index)

    merged.drop_duplicates(inplace=True)

    num_dropped = prev_size - len(merged.index)

    return merged, num_dropped
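
To make the left-join behaviour concrete, here is a small example on made-up data (the values are not from ZTRAX). It shows three things: rows in main with no match are kept with missing values, rows in the other file with no match in main are discarded, and a TransId that appears several times in the other file produces several merged rows. `drop_duplicates` then removes only rows that are identical in every column.

```python
import pandas as pd

main = pd.DataFrame({
    "TransId": [1, 2, 3],
    "SalesPriceAmount": [100000, 250000, 90000],
})

# TransId 1 has two buyers, TransId 2 is listed twice with the same buyer,
# TransId 3 has no buyer record, and TransId 9 does not exist in main.
buyers = pd.DataFrame({
    "TransId": [1, 1, 2, 2, 9],
    "BuyerLastName": ["Smith", "Jones", "Lee", "Lee", "Roe"],
})

merged = main.merge(buyers, how="left", on="TransId")
prev_size = len(merged.index)

deduplicated = merged.drop_duplicates()
num_dropped = prev_size - len(deduplicated.index)

print(prev_size, num_dropped)  # 5 1
print(deduplicated)
```

The two rows for TransId 1 survive deduplication because their buyer names differ, so the merged file can contain more than one row per transaction.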

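Writes the merged dataframe to "out\{folder}_trans_out.csv", where {folder} is the name of the quarterly data folder (for example, "20181230"). The "out" folder must already exist next to the script.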

In [25]:
def write_file(folder, merged):
    merged.to_csv("out" + "\\" + folder + "_trans_out.csv")
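
`to_csv` does not create missing directories, so the write fails if there is no "out" folder. A possible variant, sketched below under the hypothetical name `write_file_checked`, creates the folder when needed. It also passes `index=False`, which is a design choice rather than something the script does: it leaves the unnamed row-index column out of the CSV.

```python
import os
import tempfile

import pandas as pd


def write_file_checked(folder, merged, out_dir="out"):
    # Hypothetical variant of write_file, for illustration only.
    os.makedirs(out_dir, exist_ok=True)  # no error if the folder already exists
    path = os.path.join(out_dir, folder + "_trans_out.csv")
    merged.to_csv(path, index=False)  # index=False omits the row-index column
    return path


# Demo in a throwaway directory so nothing is written next to the data folders.
with tempfile.TemporaryDirectory() as tmp:
    toy = pd.DataFrame({"TransId": [1, 2], "County": ["fulton", "cobb"]})
    written = write_file_checked("20181230", toy, out_dir=os.path.join(tmp, "out"))
    columns_on_disk = pd.read_csv(written).columns.tolist()

print(columns_on_disk)  # ['TransId', 'County']
```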

Creates a "{folder}_trans_data_summary.txt" file in the "out" folder for each quarterly data folder, containing the following:
- Execution time (seconds), measured from the start of the run.
- Number of dropped duplicates in the merged file.
- Summary statistics for each of the original files: Main, BuyerMailAddress, SellerMailAddress, PropertyInfo, BuyerName, SellerName, and ForeclosureNameAddress.
- Summary statistics for the new merged file.

In [26]:
def create_summary(folder, merged, dataframes, num_dropped, exec_time):
    # (label, dataframes key) pairs, in the order they appear in the summary file
    sections = [
        ("Main", "main"),
        ("BuyerMailAddress", "buyerMailAddress"),
        ("SellerMailAddress", "sellerMailAddress"),
        ("PropertyInfo", "propertyInfo"),
        ("BuyerName", "buyerName"),
        ("SellerName", "sellerName"),
        ("ForeclosureNameAddress", "foreclosureName"),
    ]

    # "with" closes the file even if an error occurs partway through
    with open("out" + "\\" + folder + "_trans_data_summary.txt", 'w') as txt:
        txt.write("Execution time (seconds): " + str(exec_time))
        txt.write("\n\n")

        txt.write("Number of Dropped Duplicates: " + str(num_dropped))
        txt.write("\n\n")

        txt.write("Original Data Statistics")
        txt.write("\n\n")

        for label, key in sections:
            txt.write(label + ": ")
            txt.write("\n")
            txt.write(dataframes[key].describe().round(2).to_string())
            txt.write("\n\n")

        txt.write("Merged Data Statistics: ")
        txt.write("\n")
        txt.write(merged.describe().round(2).to_string())
        txt.write("\n\n")
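
One point about `describe()` that affects what ends up in the summary: when a dataframe has both numeric and text columns, `describe()` reports on the numeric columns only. For files made up mostly of names and addresses, that may leave little beyond TransId in the statistics. Passing `include="all"` adds count, unique, top, and freq for text columns. The example below uses made-up data to show the difference; whether the fuller output is wanted in the summary file is a judgment call.

```python
import pandas as pd

names = pd.DataFrame({
    "TransId": [1, 2, 3],
    "BuyerLastName": ["Smith", "Lee", "Lee"],
})

numeric_only = names.describe()             # default: numeric columns only
everything = names.describe(include="all")  # adds text-column statistics

print(list(numeric_only.columns))  # ['TransId']
print(list(everything.columns))    # ['TransId', 'BuyerLastName']
print(everything.loc["unique", "BuyerLastName"])  # 2
```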