### **Z TRANS DATA EXTRACT & MERGE SCRIPT**

**PLACE THIS SCRIPT** in the folder or directory containing the quarterly data folders.

For instance, this script should be placed in the same folder, and on the same level (i.e., not in its own separate folder), as the "20181230" and "20190616" data folders from ZTRAX.
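
Before a long run, it can help to confirm the placement by checking that each quarterly folder sits next to the script and contains a `ZTrans` subfolder. The sketch below is illustrative only: `missing_ztrans_folders` is a hypothetical helper, not part of the script, and it assumes the `<folder>/ZTrans` layout that the read functions rely on.

```python
from pathlib import Path


def missing_ztrans_folders(folders, base_dir="."):
    """Return the quarterly folders that lack a ZTrans subfolder under base_dir.

    Hypothetical helper, not part of the original script.
    """
    base = Path(base_dir)
    return [f for f in folders if not (base / f / "ZTrans").is_dir()]


# An empty list means every listed folder is on the same level as the script.
print(missing_ztrans_folders(["20181230", "20190616"]))
```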

----------------------------------------------------
**This script is for ZTRANS DATA; it does the following:**
- Works through any selected ZTRAX data folders.

*For each folder:*
-   Takes data from selected files and keeps only the selected variables from each file.
-   Merges the selected ZTrans files/variables into a single output file for each selected quarterly data folder.
-   Creates a data summary file for each merged output file, containing the original and new summary statistics as well as the total number of dropped duplicate rows and the execution time.

----------------------------------------------------

**TO RUN** the script, simply click the run button that appears near the top left of the first cell.

Only run this cell, unless changes are made to any other cells. If other cells are changed, run the changed cells individually before running the main (first) cell again.

**Potential improvements:**
*This script was written for quick results/development rather than maintainability*. All values that the user needs to change to run the script on a different ZTRAX dataset should be gathered in one place for ease of use and modification.
Further, there is repetition in creating the dataframes from each separate file. Future development could replace the separate functions for each pre-defined file with a single function that takes the file of interest as its parameter. This would improve maintainability and let the user easily add further files of interest; right now, the script assumes we will always want to extract the same individual files from each ZTrans data folder.
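
A minimal sketch of what such a single reader might look like, assuming the per-file settings are moved into one dictionary. `FILE_SPECS` and `read_ztrans_file` are hypothetical names, the SellerName entry is trimmed to three columns for brevity, and the in-memory sample stands in for a real `SellerName.txt`.

```python
import csv
import io

import pandas as pd

# Hypothetical spec table: one entry per ZTrans file, holding the positional
# columns to keep and the names/dtypes to give them.
FILE_SPECS = {
    "SellerName": {
        "usecols": [0, 3, 4],
        "dtype": {
            "TransId": "Int32",
            "SellerIndividualFullName": pd.StringDtype(),
            "SellerNonIndividualName": pd.StringDtype(),
        },
    },
}


def read_ztrans_file(source, name, specs=FILE_SPECS):
    """Read one headerless, pipe-delimited ZTrans file according to its spec."""
    spec = specs[name]
    return pd.read_csv(
        source,
        sep="|",
        on_bad_lines="skip",
        encoding="latin-1",
        quoting=csv.QUOTE_NONE,
        header=None,
        usecols=spec["usecols"],
        names=list(spec["dtype"]),  # dict order supplies the column names
        dtype=spec["dtype"],
    )


# Made-up two-row sample standing in for folder + '\\ZTrans\\SellerName.txt'.
sample = io.BytesIO(b"1|JANE Q|DOE|JANE Q DOE|\n2||||ACME LLC\n")
seller_names = read_ztrans_file(sample, "SellerName")
print(seller_names)
```

With this layout, adding another file of interest would mean adding one entry to the dictionary rather than writing a new function.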


In [51]:
import pandas as pd
import time
import csv
import pyarrow  # Parquet engine used by DataFrame.to_parquet (pip install pyarrow)

#["20181230", "20190319", "20190616", "20190918", "20191009", "20200102", "20200407", "20200811", "20201012", "20210111", "20210405", "20210802", "20210802", "20211018", "20220429"]
folders = ["20221103"]

def main():
    for folder in folders:
        # Time each folder separately so its summary reports its own run time.
        start_time = time.time()
        print("Current Folder: ", folder)

        dataframes = create_dfs_from_files(folder)
        print("Created dataframes")

        dataframes["main"] = drop_counties(dataframes["main"])
        print("Dropped counties and duplicates")

        merged, num_dropped = merge(dataframes)
        print("Merged")

        write_file(folder, merged)
        print("Wrote file")

        exec_time = time.time() - start_time

        create_summary(folder, merged, dataframes, num_dropped, exec_time)
        print("Created summary")
        
        print("Completed " + folder + ". Execution time (seconds): " + str(exec_time))
        print("")


#RUN SCRIPT
main()

#CHECK THE NV DATA FOR COUNTIES

Current Folder:  20221103
Reading ForeclosureNameAddress
Reading propertyInfo
Reading Main
Reading SellerMailAddress
Reading BuyerMailAddress
Reading BuyerName
Reading SellerName
Created dataframes
Dropped counties and duplicates
Main df shape:  (5296545, 15)
After merging propertyInfo:  (5296545, 28)
After merging buyerMailAddress:  (5296545, 40)
After merging sellerMailAddress:  (5296545, 51)
After merging buyerName:  (5296545, 53)
After merging sellerName:  (5296545, 55)
After merging foreclosureName:  (5296545, 65)
Merged
Wrote file
Created summary
Completed 20221103. Execution time (seconds): 532.2796258926392



In [39]:
def create_dfs_from_files(folder):
    dataframes = {}
    dataframes["foreclosureName"] = get_foreclosureName(folder)
    dataframes["propertyInfo"] = get_propertyInfo(folder)
    dataframes["main"] = get_main(folder)
    dataframes["sellerMailAddress"] = get_sellerMailAddress(folder)
    dataframes["buyerMailAddress"] = get_buyerMailAddress(folder)
    dataframes["buyerName"] = get_buyerName(folder)
    dataframes["sellerName"] = get_sellerName(folder)
    return dataframes


The following get methods read in the corresponding data files, selecting only the variables of interest and giving those columns their proper names manually.
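
To illustrate the pattern on a toy input: the positions in `usecols` refer to the raw file, `names` supplies one name per kept position in the same order, and the nullable `"Int32"` dtype lets a missing ID stay an integer column with `<NA>` instead of becoming a float. The three-row sample and its column names are made up for this sketch.

```python
import csv
import io

import pandas as pd

# Made-up sample in the same headerless, pipe-delimited layout.
raw = io.BytesIO(b"10|x|ATLANTA|30301\n11|y|ATLANTA|\n|z|DECATUR|30030\n")

df = pd.read_csv(
    raw,
    sep="|",
    encoding="latin-1",
    quoting=csv.QUOTE_NONE,
    header=None,
    usecols=[0, 2, 3],                 # positions in the raw file
    names=["TransId", "City", "Zip"],  # one name per kept position, in order
    dtype={"TransId": "Int32", "City": "category", "Zip": "category"},
)

print(df.dtypes)
print(df)
```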

In [40]:
def get_foreclosureName(folder):
    print("Reading ForeclosureNameAddress")
    ForeclosureNameAddress = pd.read_csv(
        folder + '\\ZTrans\\ForeclosureNameAddress.txt',
        sep='|',
        on_bad_lines='skip',
        encoding='latin-1',
        quoting=csv.QUOTE_NONE,
        header=None,
        usecols=[0, 2, 3, 4, 5, 7, 9, 11, 12, 14, 15],
        names=["TransId", "FCMailFirstMiddleName", "FCMailLastName", "FCMailIndividualFullName", "FCMailNonIndividualName", "FCMailFullStreetAddress", "FCMailBuildingNumber",
            "FCMailUnit", "FCMailCity", "FCMailZip", "FCMailZip4"],
        dtype={"TransId": "Int32", "FCMailFirstMiddleName": pd.StringDtype(), "FCMailLastName": pd.StringDtype(), "FCMailIndividualFullName": pd.StringDtype(), "FCMailNonIndividualName": pd.StringDtype(), "FCMailFullStreetAddress": pd.StringDtype(), "FCMailBuildingNumber": pd.StringDtype(),
            "FCMailUnit": pd.StringDtype(), "FCMailCity": "category", "FCMailZip": "category", "FCMailZip4": "category"})

    return ForeclosureNameAddress

def get_propertyInfo(folder):
    print("Reading PropertyInfo")
    PropertyInfo = pd.read_csv(
        folder + '\\ZTrans\\PropertyInfo.txt',
        sep='|',
        on_bad_lines='skip',
        encoding='latin-1',
        quoting=csv.QUOTE_NONE,
        header=None,
        usecols=[0, 8, 9, 10, 11, 12, 13, 14, 15, 16, 18, 19, 64],
        names=["TransId", "PropertyHouseNumber", "PropertyHouseNumberExt", "PropertyStreetPreDirectional", "PropertyStreetName", "PropertyStreetSuffix", "PropertyStreetPostDirectional",
            "PropertyBuildingNumber", "PropertyFullStreetAddress", "PropertyCity", "PropertyZip", "PropertyZip4", "ImportParcelID"],
        dtype={"TransId": "Int32", "PropertyHouseNumber": pd.StringDtype(), "PropertyHouseNumberExt": pd.StringDtype(), "PropertyStreetPreDirectional": "category", "PropertyStreetName": pd.StringDtype(), "PropertyStreetSuffix": pd.StringDtype(), "PropertyStreetPostDirectional": "category",
            "PropertyBuildingNumber": pd.StringDtype(), "PropertyFullStreetAddress": pd.StringDtype(), "PropertyCity": "category", "PropertyZip": "category", "PropertyZip4": "category", "ImportParcelID": "Int32"})

    return PropertyInfo

In [41]:
def get_main(folder):
    print("Reading Main")
    Main = pd.read_csv(
        folder + '\\ZTrans\\Main.txt',
        sep='|',
        on_bad_lines='skip',
        encoding='latin-1',
        quoting=csv.QUOTE_NONE,
        header=None,
        usecols=[0, 1, 3, 4, 16, 17, 18, 24, 25, 30, 32, 62, 104, 105, 127],
        names=["TransId", "FIPS", "County", "DocumentTypeStndCode", "DataClassStndCode", "DocumentDate", "SignatureDate", "SalesPriceAmount", "SalesPriceAmountStndCode", "IntraFamilyTransferFlag",
            "PropertyUseStndCode", "LoanTypeStndCode", "TotalDelinquentAmount", "DelinquentAsOfDate", "BatchID"],
        dtype={"TransId": "Int32", "FIPS": "Int32", "County": "category", "DocumentTypeStndCode": "category", "DataClassStndCode": "category", "DocumentDate": pd.StringDtype(), "SignatureDate": pd.StringDtype(), "SalesPriceAmount": "Float32", "SalesPriceAmountStndCode": "category", "IntraFamilyTransferFlag": "category",
            "PropertyUseStndCode": "category", "LoanTypeStndCode": "category", "TotalDelinquentAmount": "Float32", "DelinquentAsOfDate": pd.StringDtype(), "BatchID": "Int32"})

    return Main

In [42]:
def get_sellerMailAddress(folder):
    print("Reading SellerMailAddress")
    SellerMailAddress = pd.read_csv(
        folder + '\\ZTrans\\SellerMailAddress.txt',
        sep='|',
        on_bad_lines='skip',
        encoding='latin-1',
        quoting=csv.QUOTE_NONE,
        header=None,
        usecols=[0, 3, 4, 5, 6, 7, 8, 10, 11, 12, 14, 15],
        names=["TransId", "SellerMailHouseNumber", "SellerMailHouseNumberExt", "SellerMailStreetPreDirectional", "SellerMailStreetName", "SellerMailStreetSuffix", "SellerMailStreetPostDirectional",
            "SellerMailBuildingNumber", "SellerMailFullStreetAddress", "SellerMailCity", "SellerMailZip", "SellerMailZip4"],
        dtype={"TransId": "Int32", "SellerMailHouseNumber": pd.StringDtype(), "SellerMailHouseNumberExt": pd.StringDtype(), "SellerMailStreetPreDirectional": "category", "SellerMailStreetName": pd.StringDtype(), "SellerMailStreetSuffix": pd.StringDtype(), "SellerMailStreetPostDirectional": "category",
            "SellerMailBuildingNumber": pd.StringDtype(), "SellerMailFullStreetAddress": pd.StringDtype(), "SellerMailCity": "category", "SellerMailZip": "category", "SellerMailZip4": "category"})

    return SellerMailAddress

def get_buyerMailAddress(folder):
    print("Reading BuyerMailAddress")
    BuyerMailAddress = pd.read_csv(
        folder + '\\ZTrans\\BuyerMailAddress.txt',
        sep='|',
        on_bad_lines='skip',
        encoding='latin-1',
        quoting=csv.QUOTE_NONE,
        header=None,
        usecols=[0, 3, 4, 5, 6, 7, 8, 10, 11, 12, 14, 15],
        names=["TransId", "BuyerMailHouseNumber", "BuyerMailHouseNumberExt", "BuyerMailStreetPreDirectional", "BuyerMailStreetName", "BuyerMailStreetSuffix", "BuyerMailStreetPostDirectional",
            "BuyerMailBuildingNumber", "BuyerMailFullStreetAddress", "BuyerMailCity", "BuyerMailZip", "BuyerMailZip4"],
        dtype={"TransId": "Int32", "BuyerMailHouseNumber": pd.StringDtype(), "BuyerMailHouseNumberExt": pd.StringDtype(), "BuyerMailStreetPreDirectional": "category", "BuyerMailStreetName": pd.StringDtype(), "BuyerMailStreetSuffix": pd.StringDtype(), "BuyerMailStreetPostDirectional": "category",
            "BuyerMailBuildingNumber": pd.StringDtype(), "BuyerMailFullStreetAddress": pd.StringDtype(), "BuyerMailCity": "category", "BuyerMailZip": "category", "BuyerMailZip4": "category"})

    return BuyerMailAddress

In [43]:
def get_buyerName(folder):
    print("Reading BuyerName")
    BuyerName = pd.read_csv(
        folder + '\\ZTrans\\BuyerNameAgg.csv',
        dtype={"TransId": "Int32", "BuyerIndividualFullName": pd.StringDtype(), "BuyerNonIndividualName": pd.StringDtype()})

    return BuyerName

In [44]:
def get_sellerName(folder):
    print("Reading SellerName")
    SellerName = pd.read_csv(
        folder + '\\ZTrans\\SellerNameAgg.csv',
        dtype={"TransId": "Int32", "SellerIndividualFullName": pd.StringDtype(), "SellerNonIndividualName": pd.StringDtype()})

    return SellerName

In [45]:
def get_propertyInfo(folder):
    # Redefines the earlier get_propertyInfo: reads PropertyAgg.csv instead of PropertyInfo.txt.
    print("Reading propertyInfo")
    PropertyInfo = pd.read_csv(
        folder + '\\ZTrans\\PropertyAgg.csv',
        dtype={"TransId": "Int32", "PropertyHouseNumber": pd.StringDtype(), "PropertyHouseNumberExt": pd.StringDtype(), "PropertyStreetPreDirectional": "category", "PropertyStreetName": pd.StringDtype(), "PropertyStreetSuffix": pd.StringDtype(), "PropertyStreetPostDirectional": "category",
            "PropertyBuildingNumber": pd.StringDtype(), "PropertyFullStreetAddress": pd.StringDtype(), "PropertyCity": "category", "PropertyZip": "category", "PropertyZip4": "category", "ImportParcelID": "Int32"})

    return PropertyInfo

In [46]:
def get_buyerMailAddress(folder):
    print("Reading BuyerMailAddress")
    BuyerMailAddress = pd.read_csv(
        folder + '\\ZTrans\\BuyerMailAgg.csv',
        dtype={"TransId": "Int32", "BuyerMailHouseNumber": pd.StringDtype(), "BuyerMailHouseNumberExt": pd.StringDtype(), "BuyerMailStreetPreDirectional": "category", "BuyerMailStreetName": pd.StringDtype(), "BuyerMailStreetSuffix": pd.StringDtype(), "BuyerMailStreetPostDirectional": "category",
            "BuyerMailBuildingNumber": pd.StringDtype(), "BuyerMailFullStreetAddress": pd.StringDtype(), "BuyerMailCity": "category", "BuyerMailZip": "category", "BuyerMailZip4": "category"})

    return BuyerMailAddress

def get_buyerName(folder):
    print("Reading BuyerName")
    BuyerName = pd.read_csv(
        folder + '\\ZTrans\\BuyerName.txt',
        sep='|',
        on_bad_lines='skip',
        encoding='latin-1',
        quoting=csv.QUOTE_NONE,
        header=None,
        usecols=[0, 1, 2, 3, 4, 5],
        names=["TransId", "BuyerFirstMiddleName", "BuyerLastName", "BuyerIndividualFullName", "BuyerNonIndividualName", "BuyerNameSequenceNumber"],
        dtype={"TransId": "Int32", "BuyerFirstMiddleName": pd.StringDtype(), "BuyerLastName": pd.StringDtype(), "BuyerIndividualFullName": pd.StringDtype(), "BuyerNonIndividualName": pd.StringDtype(), "BuyerNameSequenceNumber": "Int16"})

    return BuyerName

def get_sellerName(folder):
    print("Reading SellerName")
    SellerName = pd.read_csv(
        folder + '\\ZTrans\\SellerName.txt',
        sep='|',
        on_bad_lines='skip',
        encoding='latin-1',
        quoting=csv.QUOTE_NONE,
        header=None,
        usecols=[0, 1, 2, 3, 4],
        names=["TransId", "SellerFirstMiddleName", "SellerLastName", "SellerIndividualFullName", "SellerNonIndividualName"],
        dtype={"TransId": "Int32", "SellerFirstMiddleName": pd.StringDtype(), "SellerLastName": pd.StringDtype(), "SellerIndividualFullName": pd.StringDtype(), "SellerNonIndividualName": pd.StringDtype()})

    return SellerName

MODIFY DESIRED COUNTIES HERE. Create a new variable holding the list of counties you want (county names in all lowercase, or FIPS codes). Then change the variable passed to the isin() function to use it.

Drops any county not in the 29-county Atlanta MSA from the main dataframe. The "Main" file, and the resulting dataframe, is the best source of county information among all of the files; we assume that every property has at least one entry in the main file.

Soon, we will begin the merging process, starting with the main file. With this assumption, we can speed up the merging process by dropping now and performing a left join on the other files. The alternative would be dropping the counties after the merge, but we would still be using the "County" variable from the main dataframe, so it is more efficient to do it now.
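
The filter-then-join order can be sketched on made-up rows as follows. `FIPS_WANTED` is a hypothetical short list for illustration; the by-name variant at the end mirrors the lowercase county lists.

```python
import pandas as pd

# Made-up rows: 13121 (Fulton) and 13089 (DeKalb) are Georgia codes, 1001 is not.
main = pd.DataFrame({
    "TransId": [1, 2, 3],
    "FIPS": [13121, 1001, 13089],
    "County": ["FULTON", "AUTAUGA", "DEKALB"],
})
property_info = pd.DataFrame({
    "TransId": [1, 2, 3],
    "PropertyCity": ["ATLANTA", "PRATTVILLE", "DECATUR"],
})

FIPS_WANTED = [13089, 13121]  # hypothetical short list

# Filter first, then left-join: rows for unwanted counties never enter the merge.
kept = main.loc[main["FIPS"].isin(FIPS_WANTED)]
merged = kept.merge(property_info, how="left", on="TransId")

# Filtering by name works the same way on the lowercased County column.
kept_by_name = main.loc[main["County"].str.lower().isin(["fulton", "dekalb"])]

print(merged)
```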

In [47]:
def drop_counties(main):
    COUNTIES_ATL = ["barrow", "bartow", "butts", "carroll", "cherokee", "clayton", "cobb", "coweta", "dawson", "dekalb", "douglas", "fayette", "forsyth", "fulton", "gwinnett", "haralson", "heard", "henry", "jasper", "lamar", "meriwether", "morgan", "newton", "paulding", "pickens", "pike", "rockdale", "spalding", "walton"] # 13
    FIPS_GA = [13013, 13015, 13035, 13045, 13057, 13063, 13067, 13077, 13085, 13089, 13097, 13113, 13117, 13121, 13135, 13143, 13149, 13151, 13159, 13171, 13199, 13211, 13217, 13223, 13227, 13231, 13247, 13255, 13297]
    COUNTIES_NC = ["anson", "cabarrus", "gaston", "iredell", "lincoln", "mecklenburg", "rowan", "union", "chester", "lancaster", "york"] # 37
    COUNTIES_MD = ["anne arundel", "baltimore", "carroll", "harford", "howard", "queen annes", "baltimore city"] #24, baltimore (independent) -> baltimore city (looked at source)
    FIPS_MD = [24003, 24005, 24013, 24025, 24027, 24035, 24510]
    COUNTIES_MN = ["anoka", "carver", "chisago", "dakota", "hennepin", "isanti", "le sueur", "mille lacs", "ramsey", "scott", "sherburne", "washington", "wright", "pierce", "st croix"] #27
    FIPS_MN = [27003, 27019, 27025, 27037, 27053, 27059, 27079, 27095, 27123, 27139, 27141, 27163, 27171]
    COUNTIES_NV = "clark" #32
    FIPS_NV = [32003]
    COUNTIES_WI = ["milwaukee", "ozaukee", "washington", "waukesha"] #55
    #main = main.set_index('County')
    #return main.loc[main.index.str.lower().isin(COUNTIES_MD)]
    #return main.loc[main['County'].str.lower().isin(COUNTIES_MD)] # Change [COUNTIES_ATL (OR SIMILAR)] HERE
    #return main.loc[main['County'].str.lower() == COUNTIES_NV] # Change [COUNTIES_ATL (OR SIMILAR)] HERE

    return main.loc[main['FIPS'].isin(FIPS_GA)]

    #"lamar", "meriwether", "morgan"

Merges the dataframes from each file, using a left join. A left join keeps all of the data from the left dataframe and adds matching data from the right dataframe, if there is any. Since we begin the merge with the main dataframe, we rely on the assumption that any property will have at least one entry in main. We keep this data throughout and, via left join, add any additional data with a matching TransId.
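
A small made-up example of the left-join behavior on `TransId`: unmatched rows from main are kept with missing values, rows that exist only on the right are discarded, and a `TransId` that appears more than once on the right produces more than one output row.

```python
import pandas as pd

main = pd.DataFrame({"TransId": [1, 2, 3], "SalesPriceAmount": [100.0, 200.0, 300.0]})

# TransId 2 has two buyers, TransId 3 has none, and TransId 9 is not in main.
buyer_name = pd.DataFrame({
    "TransId": [1, 2, 2, 9],
    "BuyerIndividualFullName": ["A SMITH", "B JONES", "C JONES", "D LEE"],
})

merged = main.merge(buyer_name, how="left", on="TransId")
print(merged)
```

Comparing `merged.shape` with the shape of main after each merge, as the script's print statements do, is a quick way to spot a right-hand file that has several rows per `TransId`.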

Duplicates are dropped after each merge to reduce the data size. The change in row count between the first merge and the final merged dataframe is also calculated and recorded in the data summary file.
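
One way to count exactly how many duplicate rows a single step removes, independent of any rows the merge itself adds, is to compare the row count immediately before and after `drop_duplicates`. This is a sketch with a hypothetical helper, `drop_and_count`, and made-up rows; it is not how the script currently computes its figure.

```python
import pandas as pd


def drop_and_count(df):
    """Return (deduplicated frame, number of exact duplicate rows removed).

    Hypothetical helper, not part of the original script.
    """
    deduped = df.drop_duplicates()
    return deduped, len(df.index) - len(deduped.index)


merged = pd.DataFrame({
    "TransId": [1, 1, 2, 2],
    "PropertyCity": ["ATLANTA", "ATLANTA", "DECATUR", "TUCKER"],
})
merged, num_dropped = drop_and_count(merged)
print(num_dropped)
```

Only the repeated `(1, "ATLANTA")` row is removed; the two rows for `TransId` 2 differ in `PropertyCity`, so both are kept.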

In [48]:
def merge(dataframes):
    print("Main df shape: ", dataframes["main"].shape)
    merged = dataframes["main"].merge(dataframes["propertyInfo"], how="left", left_on="TransId", right_on="TransId")
    merged.drop_duplicates(inplace=True)
    prev_size = len(merged.index)
    print("After merging propertyInfo: ", merged.shape)

    merged = merged.merge(dataframes["buyerMailAddress"], how="left", left_on="TransId", right_on="TransId")
    merged.drop_duplicates(inplace=True)
    print("After merging buyerMailAddress: ", merged.shape)

    merged = merged.merge(dataframes["sellerMailAddress"], how="left", on="TransId")
    merged.drop_duplicates(inplace=True)
    print("After merging sellerMailAddress: ", merged.shape)

    merged = merged.merge(dataframes["buyerName"], how="left", on="TransId")
    merged.drop_duplicates(inplace=True)
    print("After merging buyerName: ", merged.shape)

    merged = merged.merge(dataframes["sellerName"], how="left", on="TransId")
    merged.drop_duplicates(inplace=True)
    print("After merging sellerName: ", merged.shape)

    merged = merged.merge(dataframes["foreclosureName"], how="left", on="TransId")
    merged.drop_duplicates(inplace=True)
    print("After merging foreclosureName: ", merged.shape)

    num_dropped = prev_size - len(merged.index)

    return merged, num_dropped

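Writes the merged dataframe to the "out" folder as "{folder}_trans_out.csv" and "{folder}_parquet_trans_out.parquet", where {folder} is the name of the quarterly data folder. The "out" folder must already exist.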
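
If the output folder might not exist yet, it can be created before writing. The sketch below is a hypothetical variant, `write_csv`, that uses `pathlib` and covers only the CSV output; the demo writes into a temporary directory rather than a real "out" folder.

```python
import tempfile
from pathlib import Path

import pandas as pd


def write_csv(out_dir, folder, merged):
    """Write merged to <out_dir>/<folder>_trans_out.csv, creating out_dir if needed.

    Hypothetical variant of write_file; the Parquet output is omitted here.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    target = out_path / (folder + "_trans_out.csv")
    merged.to_csv(target, index=False)
    return target


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    written = write_csv(Path(tmp) / "out", "20221103", pd.DataFrame({"TransId": [1, 2]}))
    first_line = written.read_text().splitlines()[0]

print(first_line)
```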

In [49]:
def write_file(folder, merged):
    merged.to_csv("out" + "\\" + folder + "_trans_out.csv", index=False)
    merged.to_parquet("out" + "\\" + folder + "_parquet_trans_out.parquet", index=False)

Creates a "{folder}_trans_data_summary.txt" file in the "out" folder for each quarterly data folder, containing the following:
- Execution time (seconds).
- Number of dropped duplicates in the merged file.
- Summary statistics for each of the original files: main, buyerMailAddress, sellerMailAddress, propertyInfo, buyerName, sellerName, and foreclosureName.
- Summary statistics for the new merged file.
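
When reading the summary, keep in mind that `describe()` with default arguments reports only numeric columns for a dataframe of mixed types, so string and category columns are left out. The made-up example below shows the difference; `include="all"` is a standard pandas option, not something the script currently uses.

```python
import pandas as pd

df = pd.DataFrame({
    "SalesPriceAmount": [100.0, 200.0, 300.0],
    "County": pd.Series(["FULTON", "FULTON", "DEKALB"], dtype="category"),
})

numeric_only = df.describe().round(2)    # default: numeric columns only
everything = df.describe(include="all")  # adds count/unique/top/freq for County

print(numeric_only.to_string())
```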

In [50]:
def create_summary(folder, merged, dataframes, num_dropped, exec_time):
    # Section label -> key in the dataframes dict, in the order they are written.
    sections = [
        ("Main", "main"),
        ("BuyerMailAddress", "buyerMailAddress"),
        ("SellerMailAddress", "sellerMailAddress"),
        ("PropertyInfo", "propertyInfo"),
        ("BuyerName", "buyerName"),
        ("SellerName", "sellerName"),
        ("ForeclosureNameAddress", "foreclosureName"),
    ]

    # The with-block closes the file even if describe() raises.
    with open("out" + "\\" + folder + "_trans_data_summary.txt", 'w') as txt:
        txt.write("Execution time (seconds): " + str(exec_time))
        txt.write("\n\n")

        txt.write("Change in observations from original file: " + str(num_dropped))
        txt.write("\n\n")

        txt.write("Original Data Statistics")
        txt.write("\n\n")

        for label, key in sections:
            txt.write(label + ": ")
            txt.write("\n")
            txt.write(dataframes[key].describe().round(2).to_string())
            txt.write("\n\n")

        txt.write("Merged Data Statistics: ")
        txt.write("\n")
        txt.write(merged.describe().round(2).to_string())
        txt.write("\n\n")