# Processing the CoreLogic Data using PySpark

In this Jupyter notebook, we demonstrate how to process the CoreLogic data using PySpark. We show how to import the data, explore rows and columns, filter for a given location, and write a subset of the data to a CSV file for later use. This [PySpark cheat sheet](http://datacamp-community-prod.s3.amazonaws.com/acfa4325-1d43-4542-8ce4-bea2d287db10) provides a useful overview of available PySpark functionality.

The notebook was created using a 'Jupyter + Spark Basic session' in [Open OnDemand](https://arc.umich.edu/open-ondemand/) (OOD) on [Great Lakes](https://arc.umich.edu/greatlakes/) (GL). This session automatically initializes Spark in the background and provides the ```spark``` object used in the cells below.

If you are not running the notebook through OOD on GL, you will need to install PySpark on your system and initialize it yourself by typing

```from pyspark import SparkContext```

```sc = SparkContext(master = 'local[2]')```

If you encounter any errors or have questions about this Jupyter notebook, or if you would like us to add another PySpark example using the CoreLogic data, feel free to reach out to [Armand Burks](mailto:arburks@umich.edu) and [Jule Krüger](mailto:julianek@umich.edu).

## Importing the CoreLogic data with PySpark

The data are stored in a Turbo volume. To get access, you will need to sign a Memorandum of Understanding (MOU) with the University of Michigan Library. To execute this notebook, you will have to be granted access to the CoreLogic data on Turbo.

The raw data come in three separate files: deeds (28 GB), foreclosures (6 GB), and taxes (24 GB). [CSCAR](https://cscar.research.umich.edu/) pre-processed the CoreLogic data and stored each raw file in 100 separate partitions. You can read more about this methodology in ```/nfs/turbo/lib-data-corelogic/Docs/cscar_data.txt```.

We store the paths to the pre-processed partitioned files in three separate variables:

In [23]:
turbo_path_fcl = "/nfs/turbo/lib-data-corelogic/Data/fcl/*.gz"
turbo_path_deed = "/nfs/turbo/lib-data-corelogic/Data/deed/*.gz"
turbo_path_tax = "/nfs/turbo/lib-data-corelogic/Data/tax/*.gz"
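
Before reading the data, it can help to confirm that a glob pattern actually matches files, for example when access to Turbo has only just been granted. The sketch below uses only the Python standard library; ```count_partitions``` is a hypothetical helper name. Given the description above, a count of about 100 per collection would be expected, though this sketch has not been run against the Turbo volume.

```python
import glob


def count_partitions(pattern: str) -> int:
    """Return the number of files matching a glob pattern."""
    return len(glob.glob(pattern))


turbo_path_fcl = "/nfs/turbo/lib-data-corelogic/Data/fcl/*.gz"

# Prints 0 if the volume is not mounted or the pattern matches nothing.
print(count_partitions(turbo_path_fcl))
```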

The following command imports a CoreLogic data file into a dataframe (df). You can switch between foreclosures, deeds, and taxes by passing the relevant path variable. Each text file uses the pipe ```|``` as its delimiter, so we set the ```sep``` argument accordingly.

__A note on import speeds__: Because of their size, it takes a little while to read in each partitioned data file collection (foreclosures about 50 seconds, taxes about 4.5 minutes, deeds about XXX minutes). Importing the raw data files takes much longer, e.g., foreclosures about 6.5 minutes.

Because of these speed differences, we will continue this demo using the pre-partitioned foreclosure data.

In [None]:
df = spark.read.csv(turbo_path_fcl, inferSchema=True, header=True, sep="|")

Let's look at the data.

In [11]:
df.show()

[Output omitted: a very wide table showing the first 20 rows, one column per field.]

In [12]:
df.columns

['APN',
 'FIPS',
 'State',
 'County',
 'BatchDateSeq Number',
 'DeedCategory',
 'DocumentType',
 'RecordingDate',
 'DocumentYear',
 'DocumentNumber',
 'DocumentBook',
 'DocumentPage',
 'TitleCompanyCode',
 'TitleCompanyName',
 'AttorneyName',
 'AttorneyPhoneNumber',
 '1stDefendantBorrowerOwnerFirstName',
 '1stDefendantBorrowerOwnerLastName',
 '1stDefendantBorrowerOwnerCompanyName',
 '2ndDefendantBorrowerOwnerFirstName',
 '2ndDefendantBorrowerOwnerLastName',
 '2ndDefendantBorrowerOwnerCompanyName',
 '3rdDefendantBorrowerOwnerFirstName',
 '3rdDefendantBorrowerOwnerLastName',
 '3rdDefendantBorrowerOwnerCompanyName',
 '4thDefendantBorrowerOwnerFirstName',
 '4thDefendantBorrowerOwnerLastName',
 '4thDefendantBorrowerOwnerCompanyName',
 'DefendantBorrowerOwnerEtAlIndicator',
 'Filler1',
 'DateofDefault',
 'AmountofDefault',
 'Filler2',
 'FilingDate',
 'CourtCaseNumber',
 'LisPendensType',
 'Plaintiff1/Seller',
 'Plaintiff2/Seller',
 'FinalJudgmentAmount',
 'Filler3',
 'AuctionDate',
 'Auction

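Detroit is in Wayne County, Michigan. The FIPS code for Wayne County is 26163 (state code 26, county code 163). We filter on this value to keep only the Wayne County records.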

In [14]:
subset = df.filter(df.FIPS==26163)
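
A five-digit county FIPS code is a two-digit state code followed by a three-digit county code. The pure-Python sketch below shows how the value in the filter above is composed, and why codes for states with a leading zero become shorter when a column is parsed as an integer. ```county_fips``` is a hypothetical helper, not part of PySpark.

```python
def county_fips(state_code: int, county_code: int) -> str:
    """Combine a 2-digit state code and a 3-digit county code into a 5-digit string."""
    return f"{state_code:02d}{county_code:03d}"


wayne = county_fips(26, 163)  # Michigan = 26, Wayne County = 163
print(wayne)                  # 26163
print(int(wayne))             # 26163, the integer form used in the filter above

# A state code below 10 loses its leading zero once parsed as an integer.
autauga = county_fips(1, 1)   # Alabama = 01, Autauga County = 001
print(autauga)                # 01001
print(int(autauga))           # 1001
```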

In [15]:
subset.show()

[Output omitted: a very wide table showing the first 20 rows of the subset, one column per field.]

How many rows are in the Wayne County subset? Use the ```count()``` method.

In [16]:
subset.count()

521040

Next, we list the distinct city names that appear in the subset.

In [21]:
subset.select('PropertyCity1').distinct().show()

+-------------------+
|      PropertyCity1|
+-------------------+
|       LINCOLN PARK|
|         MELVINDALE|
|         NORTHVILLE|
|            TRENTON|
|         BELLEVILLE|
|               null|
|          RIVERVIEW|
|GROSSE POINTE WOODS|
|          FLAT ROCK|
|         GROSSE ILE|
| GROSSE POINTE PARK|
|          GIBRALTAR|
|      GROSSE POINTE|
|             MONROE|
|          WYANDOTTE|
|          WOODHAVEN|
|            INKSTER|
|          SOUTHGATE|
|        SUMPTER TWP|
|        GARDEN CITY|
+-------------------+
only showing top 20 rows



We write the subset to a new ```output/``` folder in CSV format. By default, Spark writes the data as multiple part files, one per partition. This is recommended practice, especially when the data are large (many rows and/or columns).

In [25]:
subset.write.csv("output/PPP_data_Ann_Arbor")

There is also a way to force the data into a single file. This is a little risky, so caution is advised: ```coalesce(1)``` combines all the data into a single partition that is processed on one core, and if the data do not fit, Spark will throw an error. (_scratch/social_work_root/[uniqname]/[filename]). The ```mode="overwrite"``` argument overwrites the previously created directory.


In [47]:
subset.coalesce(1).write.csv("output/PPP_data_Ann_Arbor", mode="overwrite")