## Reading Data in PySpark
This chapter goes into more detail about the methods available for reading data in PySpark. The previous chapter, [Introduction to PySpark](../pyspark-intro/pyspark-intro), briefly covered [`spark.read.csv()`](https://spark.apache.org/docs/latest/sql-data-sources-csv.html) and [`spark.read.parquet()`](https://spark.apache.org/docs/latest/sql-data-sources-parquet.html) for reading CSV and parquet files respectively.

This chapter explains when you would use [`spark.read.csv()`](https://spark.apache.org/docs/latest/sql-data-sources-csv.html) or [`spark.read.parquet()`](https://spark.apache.org/docs/latest/sql-data-sources-parquet.html), and introduces further methods for reading data, including from Hive tables and ORC files.

## Reading in CSV files
To read in a CSV file, you can use the following code:

In [14]:
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder.master("local[2]")
         .appName("reading-data")
         .getOrCreate())

animal_rescue = spark.read.csv("../data/animal_rescue.csv", header = True, inferSchema = True)
animal_rescue.show()

+--------------+----------------+-------+-------+---------------+---------+--------------+---------------------+-----------------------+--------------------+--------------------+------------------+--------------------+-----------------+--------------------------+--------------------+---------+--------------------+-----------+--------------------+---------------+----------------+---------+----------+---------------+----------------+
|IncidentNumber|  DateTimeOfCall|CalYear|FinYear| TypeOfIncident|PumpCount|PumpHoursTotal|HourlyNotionalCost(£)|IncidentNotionalCost(£)|    FinalDescription|   AnimalGroupParent|      OriginofCall|        PropertyType| PropertyCategory|SpecialServiceTypeCategory|  SpecialServiceType| WardCode|                Ward|BoroughCode|             Borough|  StnGroundName|PostcodeDistrict|Easting_m|Northing_m|Easting_rounded|Northing_rounded|
+--------------+----------------+-------+-------+---------------+---------+--------------+---------------------+----------------

It's important to note the two arguments we have provided to the `spark.read.csv()` function, `header` and `inferSchema`.

By setting `header` to True, we're saying that we want the top row to be used as the column names. If we did not set this argument to True, then the top row would be treated as the first row of data, and the columns would be given the default names "_c0", "_c1", "_c2" and so on.

`inferSchema` is very important - a disadvantage of CSV files is that they do not carry a schema in the way parquet files do. By setting `inferSchema` to True, we're allowing Spark to attempt to work out the data type of each column from its contents. If this were set to False, which is the default, then every column would be read in as a string.

Note that `inferSchema` doesn't always get the types right - we can see this in the `DateTimeOfCall` column in the output below. It should be a timestamp type, but it has been read in as a string type.

In [6]:
animal_rescue.printSchema()

root
 |-- IncidentNumber: string (nullable = true)
 |-- DateTimeOfCall: string (nullable = true)
 |-- CalYear: integer (nullable = true)
 |-- FinYear: string (nullable = true)
 |-- TypeOfIncident: string (nullable = true)
 |-- PumpCount: double (nullable = true)
 |-- PumpHoursTotal: double (nullable = true)
 |-- HourlyNotionalCost(£): integer (nullable = true)
 |-- IncidentNotionalCost(£): double (nullable = true)
 |-- FinalDescription: string (nullable = true)
 |-- AnimalGroupParent: string (nullable = true)
 |-- OriginofCall: string (nullable = true)
 |-- PropertyType: string (nullable = true)
 |-- PropertyCategory: string (nullable = true)
 |-- SpecialServiceTypeCategory: string (nullable = true)
 |-- SpecialServiceType: string (nullable = true)
 |-- WardCode: string (nullable = true)
 |-- Ward: string (nullable = true)
 |-- BoroughCode: string (nullable = true)
 |-- Borough: string (nullable = true)
 |-- StnGroundName: string (nullable = true)
 |-- PostcodeDistrict: string (nullable = true)

To correct this, we can either cast the column to a timestamp type after reading, or we can provide a schema when reading the file in.

In [12]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, TimestampType

# A schema supplied to the CSV reader is applied to the columns by position,
# not by header name, so the fields must follow the file's column order
# without gaps. FinYear sits between CalYear and TypeOfIncident in the file.
custom_schema = StructType([
    StructField("IncidentNumber", StringType(), True),
    StructField("DateTimeOfCall", TimestampType(), True),
    StructField("CalYear", IntegerType(), True),
    StructField("FinYear", StringType(), True),
    StructField("TypeOfIncident", StringType(), True),
])

animal_rescue = spark.read.csv(path = "../data/animal_rescue.csv", header = True, schema = custom_schema)
animal_rescue.printSchema()

root
 |-- IncidentNumber: string (nullable = true)
 |-- DateTimeOfCall: timestamp (nullable = true)
 |-- CalYear: integer (nullable = true)
 |-- FinYear: string (nullable = true)
 |-- TypeOfIncident: string (nullable = true)



We can see that `DateTimeOfCall` now has a timestamp type. Providing a schema also makes the read more efficient. To infer a schema, Spark has to make an extra pass over the data before reading it in, because the first rows of a column are not enough to determine its type. A column could contain, for example, an integer in each of the first 1000 rows and a string in the 1001st row - in which case it has to be inferred as a string, but this wouldn't be obvious from the first 1000 rows. This extra pass can be expensive for large files. When a schema is provided, Spark can skip it.

## Reading in parquet files
To read in a parquet file, you can use the following code:

In [11]:
animal_rescue = spark.read.parquet("../data/animal_rescue.parquet")

As you can see, we didn't have to provide a schema or use the `inferSchema` argument - this is because parquet files already have a schema associated with them, which is stored in the file metadata. This is one of the benefits of reading from a parquet file rather than a CSV. Other benefits include:

* Column-based format - parquet files are organised by columns, rather than by row. This allows for better compression and more efficient use of storage space, as columns typically contain similar data types and repeating values. Additionally, when accessing only specific columns, PySpark can skip reading in unnecessary data and only read in the columns of interest.
* Predicate pushdown - parquet supports predicate pushdown. This means that if you read in the full dataset and then filter, the filter clause can be "pushed down" to where the data is stored, so that data which cannot match is skipped before it is read in, reducing the amount of memory that the data will take up.
* Compression - parquet has built-in compression methods to reduce the required storage space.
* Complex data types - parquet files support complex data types such as nested data.

## Reading in a Hive table