# Processing the CoreLogic Data using PySpark

In this Jupyter notebook, we demonstrate how you can process the CoreLogic data using PySpark. We will show you how to import the data, explore rows and columns, filter for a given location, and write a subset of the data to a csv file for later use. This [PySpark cheat sheet](http://datacamp-community-prod.s3.amazonaws.com/acfa4325-1d43-4542-8ce4-bea2d287db10) provides a great overview of available PySpark functionality.

The notebook was created using a 'Jupyter + Spark Basic session' in [Open OnDemand](https://arc.umich.edu/open-ondemand/) (OOD) on [Great Lakes](https://arc.umich.edu/greatlakes/) (GL). This automatically initializes Spark in the background.

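If you are not running the notebook using OOD on GL, you will most likely have to make sure PySpark is installed on your system and then initialize a Spark session yourself. The cells below use the ```spark``` session object, which you can create by typing

```from pyspark.sql import SparkSession```

```spark = SparkSession.builder.master('local[2]').getOrCreate()```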

If you encounter any errors or have questions about this Jupyter notebook, or if you would like us to add another PySpark example using the CoreLogic data, feel free to reach out to [Armand Burks](mailto:arburks@umich.edu) and [Jule Krüger](mailto:julianek@umich.edu).

## Importing the CoreLogic data with PySpark

The CoreLogic data are stored in a Turbo volume on the ```/nfs``` drive. To get access, you will need to sign a Memorandum of Understanding (MOU) with the University of Michigan Library. For more information about this, see [here](https://github.com/arc-ts/corelogic-on-greatlakes/tree/main/intro-to-corelogic-data). To execute this notebook, you will have to be granted access to the CoreLogic data on Turbo.

The raw data come in three separate files: deeds (28GB), foreclosures (6GB), and taxes (24GB). [CSCAR](https://cscar.research.umich.edu/) pre-processed the CoreLogic data and stored each raw file in 100 separate partitions. You can read more about this methodology in ```/nfs/turbo/lib-data-corelogic/Docs/cscar_data.txt```.

We are going to work with the pre-processed data to improve import speeds. Let's store the paths to the pre-processed partitioned files in three separate variables:

In [1]:
turbo_path_fcl = "/nfs/turbo/lib-data-corelogic/Data/fcl/*.gz"
turbo_path_deed = "/nfs/turbo/lib-data-corelogic/Data/deed/*.gz"
turbo_path_tax = "/nfs/turbo/lib-data-corelogic/Data/tax/*.gz"

The following command imports a set of partitioned CoreLogic data files into a dataframe (df). You can switch between foreclosures, deeds, and taxes by choosing the relevant path. In each text file, the pipe ```|``` is used as the delimiter, so we pass it in the ```sep``` argument.
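Because a full import takes minutes, it can help to check the delimiter and header on a single file first. The following is a minimal sketch using only the Python standard library. It builds a tiny stand-in partition in a temporary folder, so the file name and its contents are made up for illustration and are not taken from the CoreLogic data.

```python
import csv
import gzip
import os
import tempfile

# Build a tiny stand-in for one gzipped, pipe-delimited partition.
# File name and values are hypothetical; the real partitions live under
# /nfs/turbo/lib-data-corelogic/Data/.
sample_path = os.path.join(tempfile.mkdtemp(), "part-00000.gz")
with gzip.open(sample_path, "wt", newline="") as handle:
    handle.write("FIPS|SITUS CITY|SITUS STATE\n")
    handle.write("26163|DETROIT|MI\n")

# Read only the header and the first data row.
# delimiter="|" plays the same role as sep="|" in spark.read.csv.
with gzip.open(sample_path, "rt", newline="") as handle:
    reader = csv.reader(handle, delimiter="|")
    header = next(reader)
    first_row = next(reader)

print(header)
print(first_row)
```

Pointing ```sample_path``` at one real partition file should work the same way, assuming the file is plain gzipped text.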

__A note on import speeds__: Because of their size, it takes a little while to read in each partitioned data file collection (foreclosures about 50 seconds, taxes about 4.5 minutes, deeds about 4 minutes). Importing the raw data files takes much longer, e.g., foreclosures about 6.5 minutes. Using the pre-processed data from CSCAR greatly improves import speeds as Spark can spread computation across multiple cores.

Let's work with the deed data for now:

In [2]:
df = spark.read.csv(turbo_path_deed, inferSchema=True, header=True, sep="|")

Let's look at the data.

In [3]:
df.show()


In [4]:
df.columns

['FIPS',
 'APN (Parcel Number) (unformatted)',
 'PCL ID IRIS FORMATTED',
 'APN SEQUENCE NUMBER',
 'PENDING RECORD INDICATOR',
 'CORPORATE INDICATOR',
 'OWNER FULL NAME',
 'OWNER 1 LAST NAME',
 'OWNER 1 FIRST NAME & M I',
 'OWNER 2 LAST NAME',
 'OWNER 2 FIRST NAME & MI',
 'OWNER ETAL INDICATOR',
 'C/O NAME',
 'OWNER RELATIONSHIP RIGHTS CODE',
 'OWNER RELATIONSHIP TYPE',
 'PARTIAL INTEREST INDICATOR',
 'ABSENTEE OWNER STATUS',
 'PROPERTY LEVEL LATITUDE',
 'PROPERTY LEVEL LONGITUDE',
 'SITUS HOUSE NUMBER PREFIX',
 'SITUS HOUSE NUMBER',
 'SITUS HOUSE NUMBER SUFFIX',
 'SITUS DIRECTION',
 'SITUS STREET NAME',
 'SITUS MODE',
 'SITUS QUADRANT',
 'SITUS APARTMENT UNIT',
 'SITUS CITY',
 'SITUS STATE',
 'SITUS ZIP CODE',
 'SITUS CARRIER CODE',
 'MAILING HOUSE NUMBER PREFIX',
 'MAILING HOUSE NUMBER',
 'MAILING HOUSE NUMBER SUFFIX',
 'MAILING DIRECTION',
 'MAILING STREET NAME',
 'MAILING MODE',
 'MAILING QUADRANT',
 'MAILING APARTMENT UNIT',
 'MAILING PROPERTY CITY',
 'MAILING PROPERTY STATE',
 'MA

Let's assume we want to filter the CoreLogic data for the city of Detroit only. Detroit is in Wayne County. The FIPS code for Wayne County is 26163 (state code 26 for Michigan followed by county code 163).

In [5]:
subset = df.filter(df.FIPS==26163)
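A county FIPS code is a two-digit state code followed by a three-digit county code. If Spark reads the ```FIPS``` column as an integer, which is likely with ```inferSchema=True```, any leading zero is dropped. That does not matter for Michigan, but it would for states with a code below 10. The helper ```county_fips``` below is a hypothetical name introduced only to illustrate this.

```python
def county_fips(state_code: int, county_code: int) -> str:
    """Return the five-digit county FIPS code as a zero-padded string."""
    return f"{state_code:02d}{county_code:03d}"


wayne = county_fips(26, 163)   # Michigan = 26, Wayne County = 163
autauga = county_fips(1, 1)    # Alabama = 01, Autauga County = 001

# As strings the codes keep their padding; as integers the padding is lost.
print(wayne, int(wayne))       # 26163 26163
print(autauga, int(autauga))   # 01001 1001
```

When filtering an integer column, compare against the integer form (for example 1001 rather than '01001').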

In [6]:
subset.show()


How many rows are in the Wayne County subset? Use the ```count()``` method.

In [7]:
subset.count()

2190976

We need to clarify which variable indicates the location/city of a given property. We will work with 'SITUS CITY' for now.

In [11]:
subset.select('SITUS CITY').distinct().collect()

[Row(SITUS CITY='LINCOLN PARK'),
 Row(SITUS CITY='MELVINDALE'),
 Row(SITUS CITY='NORTHVILLE'),
 Row(SITUS CITY='TRENTON'),
 Row(SITUS CITY='YPSILANTI'),
 Row(SITUS CITY='BELLEVILLE'),
 Row(SITUS CITY='OAK PARK'),
 Row(SITUS CITY=None),
 Row(SITUS CITY='RIVERVIEW'),
 Row(SITUS CITY='GROSSE POINTE WOODS'),
 Row(SITUS CITY='FLAT ROCK'),
 Row(SITUS CITY='GROSSE ILE'),
 Row(SITUS CITY='GROSSE POINTE PARK'),
 Row(SITUS CITY='WARREN'),
 Row(SITUS CITY='GIBRALTAR'),
 Row(SITUS CITY='GROSSE POINTE'),
 Row(SITUS CITY='MONROE'),
 Row(SITUS CITY='WYANDOTTE'),
 Row(SITUS CITY='HIGHLAND'),
 Row(SITUS CITY='WOODHAVEN'),
 Row(SITUS CITY='INKSTER'),
 Row(SITUS CITY='SOUTHGATE'),
 Row(SITUS CITY='SUMPTER TWP'),
 Row(SITUS CITY='GARDEN CITY'),
 Row(SITUS CITY='HIGHLAND PARK'),
 Row(SITUS CITY='DETROIT'),
 Row(SITUS CITY='FERNDALE'),
 Row(SITUS CITY='CHELSEA'),
 Row(SITUS CITY='BROWNSTOWN TWP'),
 Row(SITUS CITY='NEW HUDSON'),
 Row(SITUS CITY='LANSING'),
 Row(SITUS CITY='DEARBORN HEIGHTS'),
 Row(SITUS CITY

We will still have to do some filtering here (perhaps on 'SITUS CITY', if that is the right variable), because the Wayne County subset includes observations for cities other than Detroit.

We will write the data to a new ```output/``` folder in csv format. By default, Spark writes the data into multiple files, one per partition. This is recommended practice, especially if the data are large (many rows and/or columns).

In [25]:
##still some work to do here before we can save the data
##subset.write.csv("output/corelogic_data_deeds_Detroit")

There is also a way to force the data into a single file. This is a little risky, so caution is advised: ```coalesce(1)``` combines all the data into a single partition on one core, and if the data do not fit, Spark will throw an error. (_scratch/social_work_root/[uniqname]/[filename]). The ```mode="overwrite"``` argument overwrites the previously created directory.


In [12]:
##subset.coalesce(1).write.csv("output/corelogic_data_deeds_Detroit", mode="overwrite")