## Reading Data in PySpark
This chapter goes into more detail about the methods available for reading data in PySpark. The previous chapter, [Introduction to PySpark](../pyspark-intro/pyspark-intro), briefly covered [`spark.read.csv()`](https://spark.apache.org/docs/latest/sql-data-sources-csv.html) and [`spark.read.parquet()`](https://spark.apache.org/docs/latest/sql-data-sources-parquet.html), which read in data from CSV and parquet files respectively.

It explains when you would use [`spark.read.csv()`](https://spark.apache.org/docs/latest/sql-data-sources-csv.html) or [`spark.read.parquet()`](https://spark.apache.org/docs/latest/sql-data-sources-parquet.html), and introduces further ways to read in data, including Hive tables, ORC files and Avro files.

Let's start by setting up a Spark session and reading in the config file.

In [20]:
from pyspark.sql import SparkSession, functions as F
import yaml

spark = (SparkSession.builder.master("local[2]")
         .appName("reading-data")
         .getOrCreate())

with open("../../config.yaml") as f:
    config = yaml.safe_load(f)

## Reading in CSV files
To read in a CSV file, you can use the following code:

In [2]:
animal_rescue = spark.read.csv(config["rescue_path_csv"], header = True, inferSchema = True)

It's important to note the two arguments we have provided to the [`spark.read.csv()`](https://spark.apache.org/docs/latest/sql-data-sources-csv.html) function, `header` and `inferSchema`.

By setting `header` to True, we're saying that we want the top row to be used as the column names. If we did not set this argument to True, then the top row would be treated as the first row of data, and the columns would be given default names of "_c0", "_c1", "_c2" and so on.
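
To make the effect of `header` concrete, here is a plain-Python sketch that imitates the naming rule using the standard `csv` module. It is not Spark code: the sample data and the `read_rows` helper are made up for illustration, and only the `_c0`, `_c1` naming pattern is taken from Spark's behaviour.

```python
import csv
import io

# Hypothetical two-column sample; not taken from the animal rescue data.
SAMPLE = "animal,count\ncat,3\ndog,5\n"

def read_rows(text, header):
    """Imitate how the `header` option decides column names (illustration only)."""
    rows = list(csv.reader(io.StringIO(text)))
    if header:
        # The top row supplies the column names and is not treated as data.
        names, data = rows[0], rows[1:]
    else:
        # The top row is ordinary data; names follow the _c0, _c1, ... pattern.
        names = [f"_c{i}" for i in range(len(rows[0]))]
        data = rows
    return names, data

names_with_header, data_with_header = read_rows(SAMPLE, header=True)
names_no_header, data_no_header = read_rows(SAMPLE, header=False)

print(names_with_header, len(data_with_header))
print(names_no_header, len(data_no_header))
```

Without the header option, the row holding the column names ends up as the first data row, which is rarely what you want for a file that has a header line.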

`inferSchema` is very important - a disadvantage of CSV files is that they are not associated with a schema in the same way parquet files are. By setting `inferSchema` to True, we're asking Spark to work out the data type of each column from its contents. If this were set to False, then every column would be read in as a string by default.

Note that `inferSchema` may not always give the result you're expecting - we can see this in the DateTimeOfCall column in the schema printed below. We may want this as a timestamp type, but it has been read in as a string.

In [3]:
animal_rescue.printSchema()

root
 |-- IncidentNumber: string (nullable = true)
 |-- DateTimeOfCall: string (nullable = true)
 |-- CalYear: integer (nullable = true)
 |-- FinYear: string (nullable = true)
 |-- TypeOfIncident: string (nullable = true)
 |-- PumpCount: double (nullable = true)
 |-- PumpHoursTotal: double (nullable = true)
 |-- HourlyNotionalCost(£): integer (nullable = true)
 |-- IncidentNotionalCost(£): double (nullable = true)
 |-- FinalDescription: string (nullable = true)
 |-- AnimalGroupParent: string (nullable = true)
 |-- OriginofCall: string (nullable = true)
 |-- PropertyType: string (nullable = true)
 |-- PropertyCategory: string (nullable = true)
 |-- SpecialServiceTypeCategory: string (nullable = true)
 |-- SpecialServiceType: string (nullable = true)
 |-- WardCode: string (nullable = true)
 |-- Ward: string (nullable = true)
 |-- BoroughCode: string (nullable = true)
 |-- Borough: string (nullable = true)
 |-- StnGroundName: string (nullable = true)
 |-- PostcodeDistrict: string (nullable = true)

To correct this, we can either cast the column to a timestamp type after reading it in, or we can provide a schema when reading it in.

In [4]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType

# With the default reader options, a user-supplied schema is applied to the CSV columns
# by position, so the fields must be listed in file order (FinYear comes before
# TypeOfIncident). Columns beyond the last field listed here are not read in.
custom_schema = StructType([
    StructField("IncidentNumber", StringType(), True),
    StructField("DateTimeOfCall", TimestampType(), True),
    StructField("CalYear", IntegerType(), True),
    StructField("FinYear", StringType(), True),
    StructField("TypeOfIncident", StringType(), True),
])

animal_rescue = spark.read.csv(path=config["rescue_path_csv"], header=True, schema=custom_schema)
animal_rescue.printSchema()

root
 |-- IncidentNumber: string (nullable = true)
 |-- DateTimeOfCall: timestamp (nullable = true)
 |-- CalYear: integer (nullable = true)
 |-- FinYear: string (nullable = true)
 |-- TypeOfIncident: string (nullable = true)



We can see that DateTimeOfCall has now been read in as a timestamp type. Providing a schema also improves the efficiency of the operation. This is because, in order to infer a schema, Spark needs to scan the dataset. A column could contain, for example, an integer for the first 1000 rows and a string for the 1001st row - in which case it would be inferred as a string, but this wouldn't be obvious from the first 1000 rows. This extra pass over the data can be expensive for large files. Providing a schema means that Spark no longer has to examine the data before reading it in.
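
The 1000-row example above can be sketched in plain Python. This is only an illustration of why looking at part of a column is not enough; `infer_type` is a hypothetical helper and is far simpler than Spark's actual inference logic.

```python
def infer_type(values):
    """Return "integer" if every value parses as an int, otherwise "string"."""
    for value in values:
        try:
            int(value)
        except ValueError:
            return "string"
    return "integer"

# Hypothetical column: 1000 integer-like values followed by one non-numeric value.
column = [str(i) for i in range(1000)] + ["unknown"]

type_from_first_1000 = infer_type(column[:1000])
type_from_all_rows = infer_type(column)

print(type_from_first_1000)  # integer
print(type_from_all_rows)    # string
```

Only the scan over every value gives the type that is safe for the whole column, which is why inference costs an extra pass and why a supplied schema avoids it.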

## Reading in parquet files
To read in a parquet file, you can use [`spark.read.parquet()`](https://spark.apache.org/docs/latest/sql-data-sources-parquet.html) as demonstrated below:

In [5]:
animal_rescue = spark.read.parquet(config["rescue_path"])

As you can see, we didn't have to provide a schema or use the inferSchema argument - this is because parquet files already have a schema associated with them, which is stored in the metadata. This is one of the benefits of reading in from a parquet file as opposed to a CSV, and there are several more benefits, including: 

* Column-based format - parquet files are organised by columns, rather than by row. This allows for better compression and more efficient use of storage space, as columns typically contain similar data types and repeating values. Additionally, when accessing only specific columns, Spark can skip reading in unnecessary data and only read in the columns of interest.
* Predicate pushdown - parquet supports predicate pushdown. This means that if you read in the full dataset and then filter it, the filter clause will be "pushed down" to where the data is stored, so data can be filtered out before it is read in, reducing the amount of memory that the data will take up.
* Compression - parquet has built-in compression methods to reduce the required storage space.
* Complex data types - parquet files support complex data types such as nested data.
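
The first two benefits can be illustrated with a toy model in plain Python. It assumes a file split into row groups, each storing its columns separately along with min/max statistics, which is a simplified picture of how parquet is laid out. The `scan` function and all of the values are hypothetical; this is not how Spark or parquet is actually implemented.

```python
# Toy columnar file: each row group keeps every column as its own list,
# plus (min, max) statistics for the CalYear column. Values are made up.
row_groups = [
    {"stats": {"CalYear": (2009, 2011)},
     "columns": {"CalYear": [2009, 2010, 2011],
                 "AnimalGroupParent": ["Cat", "Dog", "Cat"]}},
    {"stats": {"CalYear": (2012, 2014)},
     "columns": {"CalYear": [2012, 2013, 2014],
                 "AnimalGroupParent": ["Bird", "Cat", "Fox"]}},
    {"stats": {"CalYear": (2015, 2017)},
     "columns": {"CalYear": [2015, 2016, 2017],
                 "AnimalGroupParent": ["Dog", "Cat", "Horse"]}},
]

def scan(row_groups, wanted_column, filter_column, min_value):
    """Return wanted_column values for rows where filter_column >= min_value."""
    results = []
    groups_read = 0
    for group in row_groups:
        _, group_max = group["stats"][filter_column]
        if group_max < min_value:
            continue  # the statistics show that no row in this group can match
        groups_read += 1
        # Only the two columns involved are touched; any others are never read.
        filter_values = group["columns"][filter_column]
        wanted_values = group["columns"][wanted_column]
        results.extend(w for f, w in zip(filter_values, wanted_values) if f >= min_value)
    return results, groups_read

animals, groups_read = scan(row_groups, "AnimalGroupParent", "CalYear", 2014)
print(animals, groups_read)
```

In this model the first row group is skipped entirely because its maximum `CalYear` is below the filter value, so only two of the three groups are read.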

## Reading in an Optimised Row Columnar (ORC) file
To read in an ORC file, you can use the following code:

In [21]:
animal_rescue = spark.read.orc(config["rescue_path_orc"])

An ORC file shares a lot of the same benefits that a parquet file has over a CSV, including:
* The schema is stored with the file, meaning we don't have to specify a schema
* Column-based formatting
* Predicate pushdown support
* Built-in compression
* Support of complex data types

## Reading in an Avro file
To read in an Avro file, you can use the following code:

In [25]:
animal_rescue = spark.read.format("avro").load(config["rescue_path_avro"])

Note that, unlike for the other formats, Spark doesn't have a built-in `spark.read.avro()` method, so we read the file in by first specifying the format as "avro" and then using `.load()`. The same pattern works for other formats, for example `spark.read.format("parquet").load(...)`, but it is slightly more verbose than the dedicated methods demonstrated above. Avro support is provided by the external `spark-avro` module, so it may need to be added to your Spark session if it isn't already available in your environment.

While an Avro file has many of the benefits associated with parquet and ORC files, such as being associated with a schema and having built-in compression methods, there is one key difference: an Avro file is row-based, not column-based.

See the section "So, which file format should I use?" for a discussion on row-based formats vs column-based formats.

## Reading in a Hive table
To read in a Hive table, we can use one of the following approaches:

In [6]:
# using spark.sql
animal_rescue = spark.sql("SELECT * FROM train_tmp.animal_rescue")

# using spark.read.table
animal_rescue = spark.read.table("train_tmp.animal_rescue")

Both of these methods achieve the same end result. The SQL approach is useful if you want to combine the read with additional queries or if you're more familiar with SQL syntax; otherwise `spark.read.table` works just as well.

One thing to note is that we first specify the database name and then the name of the table. This is because Hive tables are stored within databases. This isn't necessary if you're already in the correct database - you can specify which database you're working in by using [`spark.sql("USE database_name")`](https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-qry-select-usedb.html), so we could also read in the dataset using this code:

In [13]:
spark.sql("USE train_tmp")
animal_rescue = spark.read.table("animal_rescue")

In many ways, Hive tables have a lot of the same benefits that a parquet file has, such as the storing of schemas and supporting predicate pushdowns. In fact, a Hive table may consist of parquet files - a parquet file is the default underlying file structure PySpark uses when saving out a Hive table. A Hive table can consist of any of the underlying data formats we've discussed in this chapter.

The key benefits that a Hive table offers over other file formats are:

* Metadata - Hive tables contain more detailed metadata than parquet files, though both store information on schemas.
* Mutability - Hive tables are mutable. This means you can modify them in place using the [ALTER TABLE](https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-alter-table.html#:~:text=SET%20TABLE%20PROPERTIES,to%20drop%20the%20table%20property.) SQL statement, which can be used to add, modify or delete columns in a table (exactly which operations are supported depends on the table format and Spark version). In contrast, parquet files are immutable - to make a change to a parquet file, you have to read it in first, make the changes and then save it by either overwriting the original file or by creating a new file.
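
As a hedged illustration of the mutability point, the sketch below builds an `ALTER TABLE ... ADD COLUMNS` statement for the table used earlier in this chapter. The `add_column_statement` helper and the `Notes` column are hypothetical and exist only for this example.

```python
def add_column_statement(table, column, data_type):
    """Build a Spark SQL statement that adds one column to an existing table."""
    return f"ALTER TABLE {table} ADD COLUMNS ({column} {data_type})"

# "Notes" is a hypothetical column used only for illustration.
statement = add_column_statement("train_tmp.animal_rescue", "Notes", "STRING")
print(statement)

# In a Spark session with access to the table, this could then be run as:
# spark.sql(statement)
```

Running the statement changes the table definition in place, whereas the equivalent change to a standalone parquet file would mean reading it in, adding the column and writing it back out.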

## So, which file format should I use?
There are a number of factors to consider when deciding which file format to use.

### Do I need my data to be human-readable?

Of the file formats discussed in this chapter, only CSV files are human-readable. This means that if you need to look at the raw file yourself, e.g. for quality assurance purposes, your data would need to be in a CSV file. However, CSV files are generally less efficient to read in than other file formats such as parquet or ORC files, particularly if you only require a subset of the available columns.

It's also important to note that Spark is a big data solution, so if you're working with only a small amount of data that needs to be manually examined by a human, it may be worth reconsidering whether Spark is needed at all - it could be a good idea to read [when to use Spark](../spark-overview/when-to-use-spark.md).

### Row-based vs columnar-based formats
Generally, if you're focused more on write-intensive data applications (writing out data more than you're reading in data), for example, regularly writing new rows to a database, a row-based file format tends to work best. This is because a new row can just be appended to the end of the file, after the last existing row.

If you have a dataset with many columns, and you're only reading in a subset of columns, a columnar-based format such as parquet or ORC will work better. This is because, with row-based formatting, Spark will need to read in all of the columns before disregarding columns it doesn't need (as data are saved row-by-row). In contrast, with columnar formatting, Spark only needs to read in the relevant columns. 

Generally, aggregation tends to be more efficient with column-based formatting - imagine you want to calculate the maximum value of a column. Since all values of a column are stored together in a column-based format, this will be much more efficient when compared to a row-based format.
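
The trade-off described in this section can be sketched in plain Python by holding the same records in two layouts. This is a rough model only: the records are made up, and counting "values visited" stands in for the amount of data a reader would have to go through.

```python
# The same three hypothetical records in a row-based and a column-based layout.
row_layout = [
    {"IncidentNumber": "A1", "CalYear": 2009, "PumpCount": 1.0},
    {"IncidentNumber": "A2", "CalYear": 2010, "PumpCount": 2.0},
    {"IncidentNumber": "A3", "CalYear": 2011, "PumpCount": 1.0},
]
column_layout = {
    "IncidentNumber": ["A1", "A2", "A3"],
    "CalYear": [2009, 2010, 2011],
    "PumpCount": [1.0, 2.0, 1.0],
}

# Writing: the row layout needs one append; the column layout touches every column.
new_record = {"IncidentNumber": "A4", "CalYear": 2012, "PumpCount": 3.0}
row_layout.append(new_record)
for name, value in new_record.items():
    column_layout[name].append(value)

# Aggregating one column: count how many stored values each layout goes through.
values_visited_rows = sum(len(record) for record in row_layout)  # every field of every row
values_visited_columns = len(column_layout["PumpCount"])         # just the one column

max_from_rows = max(record["PumpCount"] for record in row_layout)
max_from_columns = max(column_layout["PumpCount"])

print(max_from_rows, max_from_columns)
print(values_visited_rows, values_visited_columns)
```

Both layouts give the same maximum, but the row-based one passes over all twelve stored values to get there while the column-based one only needs the four values in `PumpCount`.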

For optimising storage space, a column-based format would be more appropriate. This is because the values within a column tend to share a type and a consistent format (imagine a column containing dates), whereas the values within a row can vary in type. Because of this, columnar compression methods tend to be more effective.
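
One way to see why column order compresses well is run-length encoding, which replaces consecutive repeats with a value and a count. The sketch below is a simplified plain-Python illustration on made-up data, not the encoding logic of any particular file format.

```python
from itertools import groupby

def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]

# Hypothetical data: six records that share a date and mostly share an animal type.
rows = [("2019-01-01", "Cat", i) for i in range(4)]
rows += [("2019-01-01", "Dog", i) for i in range(4, 6)]

# Row-based order: values from different columns are interleaved.
row_order = [value for row in rows for value in row]
# Column-based order: all dates, then all animal types, then all ids.
column_order = [row[col] for col in range(3) for row in rows]

runs_row_order = run_length_encode(row_order)
runs_column_order = run_length_encode(column_order)

print(len(runs_row_order), len(runs_column_order))
```

In row order no two neighbouring values are equal, so nothing collapses, whereas in column order the repeated dates and animal types each shrink to a single pair.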

[This article](https://www.snowflake.com/trending/avro-vs-parquet) goes into more depth about when you should use Avro files (a row-based format) vs parquet files (a column-based format).

### Do you work primarily with databases/SQL?
If you're primarily working with databases/tables within databases and SQL, it may be a good idea to use a Hive table. You can use any format as the underlying data format within a Hive table - so it may be worthwhile reviewing the data formats presented in this chapter to decide which format would be most appropriate for your use case.

## Further resources
[Generic Load/Save Functions - Spark 3.4.1 Documentation](https://spark.apache.org/docs/3.2.0/sql-data-sources-load-save-functions.html)

[CSV Files - Spark 3.4.1 Documentation](https://spark.apache.org/docs/latest/sql-data-sources-csv.html)

[Parquet Files - Spark 3.4.1 Documentation](https://spark.apache.org/docs/latest/sql-data-sources-parquet.html)

[ORC Files - Spark 3.4.1 Documentation](https://spark.apache.org/docs/latest/sql-data-sources-orc.html)

[Hive Tables - Spark 3.4.1 Documentation](https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html)

[Avro Files - Spark 3.4.1 Documentation](https://spark.apache.org/docs/latest/sql-data-sources-avro.html)
