### **Z (NEW) ASMT DATA EXTRACT & MERGE SCRIPT**

**PLACE THIS SCRIPT** in the folder or directory containing the quarterly data folders.

For instance, this script should sit at the same level as the "20181230" and "20190616" data folders from ZTRAX (i.e., not in its own separate folder).

----------------------------------------------------
**This script is for ZAsmt DATA; it does the following:**
- Works through any selected ZTRAX data folders.

*For each folder:*
-   Takes data from selected files and keeps only selected variables from these files.
-   Merges the selected ZAsmt files/variables into a single output file for each selected quarterly data folder.
-   Creates a data_summary file for each merged output file, containing the original and new summary statistics, the total number of dropped duplicate rows, and the execution time.
----------------------------------------------------

**TO RUN** the script, simply click the run button that appears near the top left of the first cell.

Only run this cell, unless changes are made to any other cells. If other cells are changed, run the changed cells individually before running the main (first) cell again.

**Potential improvements:**
*This script was written for quick results/development rather than maintainability*. Everything the user needs to change to run the script on a different ZTRAX dataset should live in one place for ease of use/modification.
Further, the dataframes are created from each file by near-identical code. Future development can replace the separate per-file functions with a single function that takes the file of interest as its parameter. This would improve maintainability and let the user easily add more files of interest; right now, the script assumes we will always want to extract the same files from each ZAsmt data folder.
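
A minimal sketch of that single-function idea follows. The names `FILE_SPECS` and `read_zasmt_file` are hypothetical and not part of the script; the column positions, names, and dtypes are copied from two of the existing get functions, and the remaining files would be added to the table the same way.

```python
import csv
import os

import pandas as pd

# Hypothetical spec table: one entry per ZAsmt file, mapping each 0-based
# column position to (column name, dtype). Only two files are shown here.
FILE_SPECS = {
    "buildingAreas": ("BuildingAreas.txt", {
        0: ("RowID", pd.StringDtype()),
        4: ("BuildingAreaSqFt", "Int32"),
    }),
    "value": ("Value.txt", {
        0: ("RowID", pd.StringDtype()),
        3: ("TotalAssessedValue", "Int32"),
        4: ("AssessmentYear", "Int32"),
    }),
}


def read_zasmt_file(folder, key, specs=FILE_SPECS):
    """Read one ZAsmt file described in `specs`, keeping only its listed columns."""
    filename, columns = specs[key]
    positions = sorted(columns)
    names = [columns[p][0] for p in positions]
    dtypes = {columns[p][0]: columns[p][1] for p in positions}
    return pd.read_csv(
        os.path.join(folder, "ZAsmt", filename),
        sep="|",
        on_bad_lines="skip",
        encoding="latin-1",
        quoting=csv.QUOTE_NONE,
        header=None,
        usecols=positions,
        names=names,
        dtype=dtypes,
    )


# Example call (assumes the folder exists next to the script):
# value = read_zasmt_file("20220429", "value")
```

With a table like this, adding a file of interest means adding one dictionary entry rather than writing a new function.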


In [14]:
import pandas as pd
import time
import csv
import pyarrow  # Parquet engine used by DataFrame.to_parquet (pip install pyarrow)

#["20181230", "20190319", "20190616", "20190918", "20191009", "20200102", "20200407", "20200811", "20201012", "20210111", "20210405", "20210802", "20210802", "20211018", "20220429"]
folders = ["20220429"]

def main():
    for folder in folders:
        # Restart the timer for each folder so the reported time is per folder
        start_time = time.time()
        print("Current Folder: ", folder)

        dataframes = create_dfs_from_files(folder)
        print("Created dataframes")

        dataframes["main"] = drop_counties(dataframes["main"])
        print("Dropped counties and duplicates")

        merged = merge(dataframes)
        print("Merged")

        write_file(folder, merged)
        print("Wrote file")

        exec_time = time.time() - start_time
        
        print("Completed " + folder + ". Execution time (seconds): " + str(exec_time))
        print("")


#RUN SCRIPT
main()

#CHECK THE NV DATA FOR COUNTIES

Current Folder:  20220429
building
buildingAreas
saleData
taxDistrict
value
main
mail
Created dataframes
Dropped counties and duplicates
Main: (2234759, 21)
After mailAddress: (2234759, 32)
After value: (2234759, 34)
After saleData: (5241642, 38)
After building: (5343906, 46)
After buildingAreas: (10262664, 47)
After taxDistrict: (26334453, 48)
Merged
Wrote file
Completed 20220429. Execution time (seconds): 736.8481013774872



In [2]:
def create_dfs_from_files(folder):
    dataframes = {}
    dataframes["building"] = get_building(folder)
    dataframes["buildingAreas"] = get_buildingAreas(folder)
    dataframes["saleData"] = get_saleData(folder)
    dataframes["taxDistrict"] = get_taxDistrict(folder)
    dataframes["value"] = get_value(folder)
    dataframes["main"] = get_main(folder)
    dataframes["mailAddress"] = get_mailAddress(folder)
    return dataframes


The following get methods read in the corresponding data files, selecting only the variables of interest by column position and manually giving those columns their proper names.
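
To illustrate how positional selection, manual naming, and nullable dtypes work together, here is a small sketch on made-up in-memory data (the sample rows are invented and do not come from ZTRAX). It uses the same `read_csv` options as the get methods, minus the file path and encoding.

```python
import csv
import io

import pandas as pd

# Made-up sample: three pipe-delimited rows with five unnamed columns each.
sample = io.StringIO(
    "100|a|b|1200|x\n"
    "101|a|b||x\n"  # the fourth column is empty in this row
    "102|a|b|950|x\n"
)

df = pd.read_csv(
    sample,
    sep="|",
    quoting=csv.QUOTE_NONE,
    header=None,               # the files have no header row
    usecols=[0, 3],            # keep columns by 0-based position
    names=["RowID", "BuildingAreaSqFt"],  # one name per kept column
    dtype={"RowID": pd.StringDtype(), "BuildingAreaSqFt": "Int32"},
)

print(df.dtypes)
print(df)
```

The capitalized "Int32" is pandas' nullable integer type, so the empty field is stored as a missing value instead of forcing the whole column to float.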

In [3]:
def get_building(folder):
    print("building")
    building = pd.read_csv(
        folder + '\\ZAsmt\\Building.txt',
        sep='|',
        on_bad_lines='skip',
        encoding='latin-1',
        quoting=csv.QUOTE_NONE,
        header=None,
        usecols=[0, 1, 4, 5, 7, 14, 19, 26, 44],
        names=["RowID", "NoOfUnits", "PropertyCountyLandUseCode", "PropertyLandUseStndCode", "PropertyStateLandUseCode", "YearBuilt", "TotalBedrooms", "TotalActualBathCount", "StoryTypeStndCode"],
        dtype={"RowID": pd.StringDtype(), "NoOfUnits": "Int32", "PropertyCountyLandUseCode": "category", "PropertyLandUseStndCode": "category", "PropertyStateLandUseCode": "category", "YearBuilt": "Int16", "TotalBedrooms": "category", "TotalActualBathCount": "float32", "StoryTypeStndCode": "category"})

    return building

In [4]:
def get_buildingAreas(folder):
    print("buildingAreas")
    buildingAreas = pd.read_csv(
        folder + '\\ZAsmt\\BuildingAreas.txt',
        sep='|',
        on_bad_lines='skip',
        encoding='latin-1',
        quoting=csv.QUOTE_NONE,
        header=None,
        usecols=[0, 4],
        names=["RowID", "BuildingAreaSqFt"],
        dtype={"RowID": pd.StringDtype(), "BuildingAreaSqFt": "Int32"})

    return buildingAreas

In [5]:
def get_saleData(folder):
    print("saleData")
    saleData = pd.read_csv(
        folder + '\\ZAsmt\\SaleData.txt',
        sep='|',
        on_bad_lines='skip',
        encoding='latin-1',
        quoting=csv.QUOTE_NONE,
        header=None,
        usecols=[0, 2, 3, 5, 11],
        names=["RowID", "SellerFullName", "BuyerFullName", "DocumentDate", "SalesPriceAmount"],
        dtype={"RowID": pd.StringDtype(), "SellerFullName": pd.StringDtype(), "BuyerFullName": pd.StringDtype(), "DocumentDate": pd.StringDtype(), "SalesPriceAmount": "Int32"})

    return saleData

In [6]:
def get_taxDistrict(folder):
    print("taxDistrict")
    taxDistrict = None
    
    try:
        taxDistrict = pd.read_csv(
            folder + '\\ZAsmt\\TaxDistrict.txt',
            sep='|',
            on_bad_lines='skip',
            encoding='latin-1',
            quoting=csv.QUOTE_NONE,
            header=None,
            usecols=[0, 1],
            names=["RowID", "TaxDistrictStndCode"],
            dtype={"RowID": pd.StringDtype(), "TaxDistrictStndCode": "category"})
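    except Exception as err:
        # Catch Exception rather than using a bare except, and report the cause
        print("taxDistrict was empty or unreadable:", err)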
    
    return taxDistrict

In [7]:
def get_value(folder):
    print("value")
    value = pd.read_csv(
        folder + '\\ZAsmt\\Value.txt',
        sep='|',
        on_bad_lines='skip',
        encoding='latin-1',
        quoting=csv.QUOTE_NONE,
        header=None,
        usecols=[0, 3, 4],
        names=["RowID", "TotalAssessedValue", "AssessmentYear"],
        dtype={"RowID": pd.StringDtype(), "TotalAssessedValue": "Int32", "AssessmentYear": "Int32"})

    return value

In [8]:
def get_main(folder):
    print("main")
    main = pd.read_csv(
        folder + '\\ZAsmt\\Main.txt',
        sep='|',
        on_bad_lines='skip',
        encoding='latin-1',
        quoting=csv.QUOTE_NONE,
        header=None,
        usecols=[0, 1, 2, 4, 20, 21, 22, 23, 24, 25, 26, 27, 29, 30, 35, 36, 38, 39, 40, 70, 92],
        names=["RowID", "ImportParcelID", "FIPS", "County" , "PropertyHouseNumber", "PropertyHouseNumberExt", "PropertyStreetPreDirectional", "PropertyStreetName", "PropertyStreetSuffix", "PropertyStreetPostDirectional", "PropertyFullStreetAddress", "PropertyCity", "PropertyZip", "PropertyZip4", "PropertyZoningSourceCode", "CensusTract", "TaxAmount", "TaxYear", "TaxDelinquencyFlag", "LotSizeSquareFeet", "BatchID"],
        # FIPS is Int32: county codes above 32767 (e.g. 37xxx, 55xxx) do not fit in Int16
        dtype={"RowID": pd.StringDtype(), "ImportParcelID": "Int32", "FIPS": "Int32", "County": "category", "PropertyHouseNumber": pd.StringDtype(), "PropertyHouseNumberExt": pd.StringDtype(), "PropertyStreetPreDirectional": "category", "PropertyStreetName": pd.StringDtype(), "PropertyStreetSuffix": pd.StringDtype(), "PropertyStreetPostDirectional": "category", "PropertyFullStreetAddress": pd.StringDtype(), "PropertyCity": "category", "PropertyZip": "category", "PropertyZip4": "category", "PropertyZoningSourceCode": "category", "CensusTract": "Int32", "TaxAmount": "float64", "TaxYear": "Int16", "TaxDelinquencyFlag": "category", "LotSizeSquareFeet": "float64", "BatchID": "Int32"})

    return main

In [9]:
def get_mailAddress(folder):
    print("mail")
    mailAddress = pd.read_csv(
        folder + '\\ZAsmt\\MailAddress.txt',
        sep='|',
        on_bad_lines='skip',
        encoding='latin-1',
        quoting=csv.QUOTE_NONE,
        header=None,
        usecols=[0, 2, 3, 4, 5, 6, 7, 9, 10, 11, 13, 14],
        names=["RowID", "MailHouseNumber", "MailHouseNumberExt", "MailStreetPreDirectional", "MailStreetName", "MailStreetSuffix", "MailStreetPostDirectional", "MailBuildingNumber", "MailFullStreetAddress", "MailCity", "MailZip", "MailZip4"],
        dtype={"RowID": pd.StringDtype(), "MailHouseNumber": pd.StringDtype(), "MailHouseNumberExt": pd.StringDtype(), "MailStreetPreDirectional": "category", "MailStreetName": pd.StringDtype(), "MailStreetSuffix": pd.StringDtype(), "MailStreetPostDirectional": "category", "MailBuildingNumber": pd.StringDtype(), "MailFullStreetAddress": pd.StringDtype(), "MailCity": "category", "MailZip": "category", "MailZip4": "category"})

    return mailAddress

MODIFY DESIRED COUNTIES HERE. To filter by county name, create a new variable holding a list of the counties you want, all lowercase, and pass it to isin() in one of the commented-out return lines. The active return line filters on the FIPS column using FIPS_GA instead.

Drops any county not in the 29-county Atlanta MSA from the main dataframe. The "Main" file, and the resulting dataframe, is the best source of county information among all of the files; we assume that every property has at least one entry in the main file.

The merging process comes next, starting with the main dataframe. Under this assumption, we can speed up the merge by dropping counties now and performing left joins against the other files. The alternative would be dropping the counties after the merge, but that would still rely on the "County" variable from the main dataframe, so filtering first is more efficient.
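
As a rough sketch of how the county selection could be passed in rather than edited inside the function, the hypothetical helper below (`filter_counties` is not part of the script) accepts either a list of FIPS codes or a list of lowercase county names. The toy dataframe is invented for illustration.

```python
import pandas as pd


def filter_counties(main, fips=None, counties=None):
    """Keep rows of `main` whose FIPS code or lowercased County name is wanted.

    Hypothetical helper: pass exactly one of `fips` (list of ints) or
    `counties` (list of lowercase county names).
    """
    if (fips is None) == (counties is None):
        raise ValueError("pass exactly one of fips or counties")
    if fips is not None:
        mask = main["FIPS"].isin(fips)
    else:
        mask = main["County"].str.lower().isin(counties)
    return main.loc[mask]


# Toy data standing in for the main dataframe.
toy = pd.DataFrame({
    "RowID": ["a", "b", "c"],
    "FIPS": [13121, 13067, 13001],
    "County": ["FULTON", "COBB", "APPLING"],
})

print(filter_counties(toy, fips=[13121, 13067]))
print(filter_counties(toy, counties=["fulton"]))
```

Filtering on FIPS avoids ambiguity between same-named counties in different states (for example, "carroll" and "washington" each appear in more than one of the county lists).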

In [10]:
def drop_counties(main):
    COUNTIES_ATL = ["barrow", "bartow", "butts", "carroll", "cherokee", "clayton", "cobb", "coweta", "dawson", "dekalb", "douglas", "fayette", "forsyth", "fulton", "gwinnett", "haralson", "heard", "henry", "jasper", "lamar", "meriwether", "morgan", "newton", "paulding", "pickens", "pike", "rockdale", "spalding", "walton"] # 13
    FIPS_GA = [13013, 13015, 13035, 13045, 13057, 13063, 13067, 13077, 13085, 13089, 13097, 13113, 13117, 13121, 13135, 13143, 13149, 13151, 13159, 13171, 13199, 13211, 13217, 13223, 13227, 13231, 13247, 13255, 13297]  # 29 counties; duplicate 13089 removed
    COUNTIES_NC = ["anson", "cabarrus", "gaston", "iredell", "lincoln", "mecklenburg", "rowan", "union", "chester", "lancaster", "york"] # 37
    COUNTIES_MD = ["anne arundel", "baltimore", "carroll", "harford", "howard", "queen annes", "baltimore city"] #24, baltimore (independent) -> baltimore city (looked at source)
    COUNTIES_MN = ["anoka", "carver", "chisago", "dakota", "hennepin", "isanti", "le sueur", "mille lacs", "ramsey", "scott", "sherburne", "washington", "wright", "pierce", "st croix"] #27
    COUNTIES_NV = "clark" #32
    COUNTIES_WI = ["milwaukee", "ozaukee", "washington", "waukesha"] #55
    #main = main.set_index('County')
    #return main.loc[main.index.str.lower().isin(COUNTIES_MD)]
    #return main.loc[main['County'].str.lower().isin(COUNTIES_MD)] # Change [COUNTIES_ATL (OR SIMILIAR)] HERE
    #return main.loc[main['County'].str.lower() == COUNTIES_NV] # Change [COUNTIES_ATL (OR SIMILIAR)] HERE

    return main.loc[main['FIPS'].isin(FIPS_GA)]

    #"lamar", "meriwether", "morgan"

Merges the dataframes from each file using left joins. A left join keeps every row from the left dataframe and adds matching data from the right dataframe, if there is any. Because the merge begins with the main dataframe, it relies on the assumption that every property has at least one entry in main. That data is kept throughout, and columns from the other files are added wherever a matching RowID exists. A RowID that matches several rows in the right dataframe produces one output row per match, which is why the row count grows in the printed shapes.

Duplicates are dropped after each merge (except the final taxDistrict merge) to reduce the data size. The number of dropped entries is also calculated and recorded in the "data_summary" file.
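
The toy example below (invented data, not ZTRAX records) sketches how a left join behaves when one RowID has several matches on the right side, and how pandas' `validate` argument can flag that situation.

```python
import pandas as pd

main = pd.DataFrame({"RowID": ["a", "b"], "County": ["FULTON", "COBB"]})
sales = pd.DataFrame({
    "RowID": ["a", "a", "a"],
    "SalesPriceAmount": [100000, 150000, 200000],
})

# "a" matches three sale rows; "b" matches none but is still kept.
merged = main.merge(sales, how="left", on="RowID")
print(merged.shape)  # (4, 3)

# validate="one_to_one" raises if RowID is repeated on either side.
try:
    main.merge(sales, how="left", on="RowID", validate="one_to_one")
except pd.errors.MergeError as err:
    print("not one-to-one:", err)
```

Row "b" survives with a missing SalesPriceAmount, while row "a" is repeated once per sale. The same mechanism would explain the jump in row counts after the saleData, building, and buildingAreas merges in the output above.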

In [11]:
def merge(dataframes):
    print("Main:", dataframes["main"].shape)
    merged = dataframes["main"].merge(dataframes["mailAddress"], how="left", left_on="RowID", right_on="RowID")
    merged.drop_duplicates(inplace=True)
    print("After mailAddress:", merged.shape)

    merged = merged.merge(dataframes["value"], how="left", left_on="RowID", right_on="RowID")
    merged.drop_duplicates(inplace=True)
    print("After value:", merged.shape)

    merged = merged.merge(dataframes["saleData"], how="left", on="RowID")
    merged.drop_duplicates(inplace=True)
    print("After saleData:", merged.shape)

    merged = merged.merge(dataframes["building"], how="left", on="RowID")
    merged.drop_duplicates(inplace=True)
    print("After building:", merged.shape)

    merged = merged.merge(dataframes["buildingAreas"], how="left", on="RowID")
    merged.drop_duplicates(inplace=True)
    print("After buildingAreas:", merged.shape)


    try:
        merged = merged.merge(dataframes["taxDistrict"], how="left", on="RowID")
        print("After taxDistrict:", merged.shape)
    except Exception as err:
        # taxDistrict is None when its file could not be read
        print("excluded taxDistrict:", err)

    return merged
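
The merge function above drops duplicates but does not count them. One possible way to track the number of dropped rows for a summary file is sketched here; `merge_and_dedupe` is a hypothetical helper and the toy data is invented.

```python
import pandas as pd


def merge_and_dedupe(left, right, key="RowID"):
    """Left-join `right` onto `left`, drop exact duplicate rows, and
    return the result together with the number of rows dropped."""
    merged = left.merge(right, how="left", on=key)
    rows_before = len(merged)
    merged = merged.drop_duplicates()
    return merged, rows_before - len(merged)


left = pd.DataFrame({"RowID": ["a", "b"]})
right = pd.DataFrame({"RowID": ["a", "a", "b"], "Value": [1, 1, 2]})

result, dropped = merge_and_dedupe(left, right)
print(len(result), dropped)  # 2 1
```

Summing the returned counts across all merge steps would give a total suitable for recording alongside the other summary statistics.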

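Writes the merged dataframe to "out\{folder}_asmt_out.csv" and "out\{folder}parquet_asmt_out.parquet", where {folder} is the quarterly data folder name. The "out" folder must already exist.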

In [12]:
def write_file(folder, merged):
    merged.to_csv("out" + "\\" + folder + "_asmt_out.csv", index=False)
    merged.to_parquet("out" + "\\" + folder + "parquet_asmt_out.parquet", index=False)
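
A possible variant of the write step is sketched below, assuming it is acceptable to create the output folder automatically and to build paths with `pathlib` so they are not tied to Windows-style separators. The function name `write_outputs` and its `out_dir` and `write_parquet` parameters are hypothetical; the file names match the ones used above.

```python
from pathlib import Path

import pandas as pd


def write_outputs(folder, merged, out_dir="out", write_parquet=True):
    """Write `merged` as CSV (and optionally Parquet) under `out_dir`."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    csv_path = out_path / f"{folder}_asmt_out.csv"
    merged.to_csv(csv_path, index=False)
    if write_parquet:
        # Requires a Parquet engine such as pyarrow to be installed.
        merged.to_parquet(out_path / f"{folder}parquet_asmt_out.parquet", index=False)
    return csv_path


# Example call:
# write_outputs("20220429", merged)
```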