## Reading a CSV file

Loading data is the first step in building a data transformation pipeline. “Comma-separated values” (CSV) is a commonly used file format for data exchange. You’re now going to use Spark to read a CSV file.

You’ve seen in the videos how to load landing/prices.csv. Now let’s do the same for landing/ratings.csv, step by step. Remember, the actual data lake is made available to you under ~/workspace/mnt/data_lake.

A SparkSession named spark has already been loaded for you.

### Instructions
- Create a DataFrameReader object using the spark.read property.
- Make the reader object use the header of the CSV file to name the columns automatically, by passing the correct keyword arguments to the reader’s .options() method.

In [None]:
# Read a CSV file, using its header row for the column names
df = (spark.read
      .options(header="true")
      .csv("/home/repl/workspace/mnt/data_lake/landing/ratings.csv"))

df.show()

## Defining a schema

In the last exercise, you read a CSV file using PySpark. Because you didn’t define a schema, all column values were parsed as strings, which can be cumbersome and inefficient to process. You are usually better off defining the data types in a schema yourself.
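
The effect is easy to reproduce outside Spark. The sketch below is not PySpark: it uses Python's standard csv module on a few made-up rows (the values are hypothetical; only the column names come from ratings.csv) to show why all-string columns are awkward to work with.

```python
import csv
import io

# Hypothetical sample rows; only the column names match ratings.csv.
raw = io.StringIO(
    "brand,model,absorption_rate,comfort\n"
    "BrandA,ModelX,3,4\n"
    "BrandB,ModelY,5,2\n"
)

rows = list(csv.DictReader(raw))

# Without type information every value is a string, including the ratings.
value_types = {type(value).__name__ for row in rows for value in row.values()}
print(value_types)

# Adding two untyped ratings concatenates the text instead of summing numbers,
# so an explicit cast is needed before any arithmetic.
concatenated = rows[0]["comfort"] + rows[1]["comfort"]
summed = int(rows[0]["comfort"]) + int(rows[1]["comfort"])
print(concatenated, summed)
```

A schema moves that cast into the read step, so every later transformation sees numeric columns.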

To do this, you use classes from the pyspark.sql.types module. Its StructType() class expects a list of StructField() instances, each of which adds one field to the schema. Various other types, such as ByteType() and IntegerType(), are also defined in this module and can be used to specify the data type of each field. In this exercise, all of these classes have been imported for you.

In the ratings.csv dataset from the previous exercise, the rating values in the columns “absorption_rate” and “comfort” are expressed on a scale from 1 to 5, much like the star ratings on Amazon’s web store. Because of that, they easily fit into a ByteType(), which can hold values between -128 and 127. The other columns are better left as StringType().
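
The quoted range follows from ByteType() being a signed integer stored in a single byte. A quick plain-Python check of that arithmetic (no Spark needed; the variable names are illustrative only):

```python
# A signed 8-bit (two's complement) integer covers -2**7 through 2**7 - 1.
BITS = 8
byte_min = -(2 ** (BITS - 1))
byte_max = 2 ** (BITS - 1) - 1
print(byte_min, byte_max)

# Ratings on a 1-to-5 scale sit comfortably inside that range.
ratings_fit = all(byte_min <= rating <= byte_max for rating in range(1, 6))
print(ratings_fit)
```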

Feel free to explore the previous Spark DataFrame in the console using df.show() so you can map each column to the correct type.

### Instructions
- Define the schema for the CSV file that has the columns “brand”, “model”, “absorption_rate” and “comfort”, in that order.
- Pass the predefined schema while loading the CSV file using the .schema() method.

In [None]:
# Define the schema
schema = StructType([
  StructField("brand", StringType(), nullable=False),
  StructField("model", StringType(), nullable=False),
  StructField("absorption_rate", ByteType(), nullable=True),
  StructField("comfort", ByteType(), nullable=True)
])

better_df = (spark
             .read
             .options(header="true")
             # Pass the predefined schema to the Reader
             .schema(schema)
             .csv("/home/repl/workspace/mnt/data_lake/landing/ratings.csv"))
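
# Print each column name with its data type (pprint is in the standard library)
from pprint import pprint
pprint(better_df.dtypes)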