# PySpark - Read, Write and Validate Data -- PGA Tour Project
* Notebook by Adam Lang
* Date: 12/30/2024

# Overview
* In this notebook we will go over reading, writing and validating data in PySpark using a famous dataset from the PGA Tour that came from Kaggle.


## Create a Spark Session
* This is always the first step in any PySpark workflow.

In [38]:
## create spark session
import pyspark
from pyspark.sql import SparkSession


## now create the session
spark = SparkSession.builder.appName("PGATourData").getOrCreate()
spark

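## count the registered executors (block managers, driver included)
## note: this is not a core count; spark.sparkContext.defaultParallelism reports the default parallelism
executors = spark._jsc.sc().getExecutorMemoryStatus().keySet().size()
print(f"The number of executors we are working with is: {executors}.")
spark

The number of executors we are working with is: 1.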


## Reading in PGA csv dataset
* You can download the `pga_tour_historical` dataset, save it in whatever folder you want, then read it in.

**Data Source:** https://www.kaggle.com/bradklassen/pga-tour-20102018-data



In [3]:
## data path
path = "/content/drive/MyDrive/Colab Notebooks/PySpark Data Science/"

## read in the data
pga_data = spark.read.csv(path+'pga_tour_historical.csv',
                          inferSchema=True,
                          header=True)

In [2]:
# from google.colab import drive
# drive.mount('/content/drive')

Mounted at /content/drive
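Concatenating `path+'pga_tour_historical.csv'` only works when `path` ends with a slash. A minimal sketch of a more forgiving way to build the file path follows. The helper names `csv_path` and `read_pga_csv` are hypothetical, and an existing `spark` session is assumed.

```python
import posixpath


def csv_path(folder, filename="pga_tour_historical.csv"):
    """Join folder and file name; works with or without a trailing slash."""
    return posixpath.join(folder, filename)


def read_pga_csv(spark, folder):
    """Read the CSV with a header row, letting Spark infer the column types."""
    return spark.read.csv(csv_path(folder), header=True, inferSchema=True)


# Both spellings of the folder produce the same file path.
print(csv_path("/content/drive/MyDrive"))
print(csv_path("/content/drive/MyDrive/"))
```

With the notebook's variables, this would be called as `read_pga_csv(spark, path)`.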


## 1. View first 5 lines of dataframe
* Generate a view of the first 5 lines of the dataframe using 2 different methods.
  * `.show()`
  * `.limit(5).toPandas()`

In [5]:
## first 5 lines of df -- method 1
pga_data.show(5)

+---------------+------+----------------+--------------------+-----+
|    Player Name|Season|       Statistic|            Variable|Value|
+---------------+------+----------------+--------------------+-----+
|Robert Garrigus|  2010|Driving Distance|Driving Distance ...|   71|
|   Bubba Watson|  2010|Driving Distance|Driving Distance ...|   77|
| Dustin Johnson|  2010|Driving Distance|Driving Distance ...|   83|
|Brett Wetterich|  2010|Driving Distance|Driving Distance ...|   54|
|    J.B. Holmes|  2010|Driving Distance|Driving Distance ...|  100|
+---------------+------+----------------+--------------------+-----+
only showing top 5 rows



In [6]:
## method 2
pga_data.limit(5).toPandas()

| | Player Name | Season | Statistic | Variable | Value |
|---|---|---|---|---|---|
| 0 | Robert Garrigus | 2010 | Driving Distance | Driving Distance - (ROUNDS) | 71 |
| 1 | Bubba Watson | 2010 | Driving Distance | Driving Distance - (ROUNDS) | 77 |
| 2 | Dustin Johnson | 2010 | Driving Distance | Driving Distance - (ROUNDS) | 83 |
| 3 | Brett Wetterich | 2010 | Driving Distance | Driving Distance - (ROUNDS) | 54 |
| 4 | J.B. Holmes | 2010 | Driving Distance | Driving Distance - (ROUNDS) | 100 |


## 2. Print the schema details
* Now we will print the details of the dataframe's schema that Spark inferred to confirm that it was inferred correctly. Inference sometimes gets a column type wrong, so it is worth checking.

In [40]:
## df schema
print("The PGA Data schema is:")
pga_data.printSchema()
print("\n\n")
print("The PGA data columns are:")
print(pga_data.columns)
print("\n\n")
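print("The df is described as:")
## note: describe() is lazy, so print() shows only the result's column types; call .show() to see the statistics
print(pga_data.describe())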

The PGA Data schema is:
root
 |-- Player Name: string (nullable = true)
 |-- Season: integer (nullable = true)
 |-- Statistic: string (nullable = true)
 |-- Variable: string (nullable = true)
 |-- Value: integer (nullable = true)




The PGA data columns are:
['Player Name', 'Season', 'Statistic', 'Variable', 'Value']



The df is described as:
DataFrame[summary: string, Player Name: string, Season: string, Statistic: string, Variable: string, Value: string]
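Reading the printed schema by eye is easy to get wrong, so the check can also be done in code. The sketch below compares `(column, type)` pairs, in the form returned by a DataFrame's `dtypes` attribute, against the types we expect. The function name `schema_mismatches` and the example dtype list are hypothetical and only illustrate the idea.

```python
def schema_mismatches(actual_dtypes, expected):
    """Compare df.dtypes-style pairs against the expected types.

    Returns {column: (actual_type, expected_type)} for every column whose
    type differs; a column missing from the DataFrame reports None.
    """
    actual = dict(actual_dtypes)
    return {
        name: (actual.get(name), wanted)
        for name, wanted in expected.items()
        if actual.get(name) != wanted
    }


# Expected types, spelled the way df.dtypes spells them.
expected_types = {"Season": "int", "Value": "int"}

# Hypothetical dtypes for illustration; with a real DataFrame pass pga_data.dtypes.
example_dtypes = [
    ("Player Name", "string"),
    ("Season", "int"),
    ("Statistic", "string"),
    ("Variable", "string"),
    ("Value", "string"),
]

print(schema_mismatches(example_dtypes, expected_types))
```

An empty dictionary means every listed column has the expected type.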


## 3. Edit the schema during the read in

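* With `inferSchema=True`, Spark did not correctly infer that the **"Value" column was an integer value**. (Judging by the cell numbers, the schema printed above comes from a run made after the custom schema below was applied, which is why it already lists `Value` as an integer.)
* Let's specify the schema this time to tell Spark what it should be.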

* Here is a link to see a list of PySpark data types:
https://spark.apache.org/docs/latest/sql-ref-datatypes.html

In [8]:
## import all types from pyspark
from pyspark.sql.types import *

In [12]:
## edit schema of Value column
data_schema = [StructField("Player Name", StringType(), True),
               StructField("Season", IntegerType(), True),
               StructField("Statistic", StringType(), True),
               StructField("Variable", StringType(), True),
               StructField("Value", IntegerType(), True)]

## create a final schema
final_schema = StructType(fields=data_schema)

## read in file again
pga_data = spark.read.csv(path+'pga_tour_historical.csv',
                          schema=final_schema, ## pass custom schema here
                          header=True)

## show
pga_data.limit(5).toPandas()

| | Player Name | Season | Statistic | Variable | Value |
|---|---|---|---|---|---|
| 0 | Robert Garrigus | 2010 | Driving Distance | Driving Distance - (ROUNDS) | 71 |
| 1 | Bubba Watson | 2010 | Driving Distance | Driving Distance - (ROUNDS) | 77 |
| 2 | Dustin Johnson | 2010 | Driving Distance | Driving Distance - (ROUNDS) | 83 |
| 3 | Brett Wetterich | 2010 | Driving Distance | Driving Distance - (ROUNDS) | 54 |
| 4 | J.B. Holmes | 2010 | Driving Distance | Driving Distance - (ROUNDS) | 100 |


In [13]:
## now lets check schema
pga_data.printSchema()

root
 |-- Player Name: string (nullable = true)
 |-- Season: integer (nullable = true)
 |-- Statistic: string (nullable = true)
 |-- Variable: string (nullable = true)
 |-- Value: integer (nullable = true)



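Summary
* We successfully read the `Value` column in as an integer. Keep in mind that with an explicit `IntegerType`, Spark's default permissive CSV mode reads any value it cannot parse as an integer as null.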
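As an alternative to building a `StructType`, `spark.read.csv` also accepts the schema as a DDL-formatted string. The sketch below builds that string from name and type pairs. The helpers `ddl_schema` and `read_with_ddl` are hypothetical names, and the backticks are there so that column names containing spaces remain valid.

```python
def ddl_schema(fields):
    """Build a DDL schema string from (column name, SQL type) pairs.

    Names are wrapped in backticks so that names with spaces stay valid.
    """
    return ", ".join(f"`{name}` {sql_type}" for name, sql_type in fields)


pga_fields = [
    ("Player Name", "STRING"),
    ("Season", "INT"),
    ("Statistic", "STRING"),
    ("Variable", "STRING"),
    ("Value", "INT"),
]
pga_ddl = ddl_schema(pga_fields)
print(pga_ddl)


def read_with_ddl(spark, csv_file):
    """Read a CSV using the DDL string in place of a StructType."""
    return spark.read.csv(csv_file, schema=pga_ddl, header=True)
```

Either form should produce the same schema; the string version is simply shorter to type when the columns are few.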

## 4. Generate summary statistics for only one variable

* Let's generate summary statistics for only the "Value" column using the `.describe` function
* (count, mean, stddev, min, max)

In [41]:
## summary stats for Value column
from pyspark.sql.functions import *  ## if you want to use col function

## select only value col
pga_data.describe(['Value']).show()

+-------+------------------+
|summary|             Value|
+-------+------------------+
|  count|           1657247|
|   mean|12494.388998743096|
| stddev|157274.75673570752|
|    min|              -178|
|    max|           3564954|
+-------+------------------+



In [19]:
## summary stats for the value column
pga_data.select('Value').summary('count','mean','stddev','min','max').show()

+-------+------------------+
|summary|             Value|
+-------+------------------+
|  count|           1657247|
|   mean|12494.388998743096|
| stddev|157274.75673570752|
|    min|              -178|
|    max|           3564954|
+-------+------------------+



Summary
* Both methods above give you the summary stats.

## 5. Generate summary statistics for TWO variables
* Now we will generate ONLY the `count`, `min` and `max` for BOTH the "Value" and "Season" variables using the select function.

* We will do this without using the `.describe` function.

In [21]:
## summary for Value and Season
pga_data.select('Value','Season').summary('count','min','max').show()

+-------+-------+-------+
|summary|  Value| Season|
+-------+-------+-------+
|  count|1657247|2740403|
|    min|   -178|   2010|
|    max|3564954|   2018|
+-------+-------+-------+
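The two `count` values in the table differ because `count` only counts non-null entries. The short sketch below works out the gap from the numbers shown above and adds a hypothetical helper, `null_counts`, that would measure nulls directly on a Spark DataFrame. A likely cause here is that values which do not parse as integers were read as null under the `IntegerType` schema, though that is an assumption rather than something the output proves.

```python
# Counts copied from the summary table above.
value_count = 1657247
season_count = 2740403

# `count` skips nulls, so at least this many rows have a null Value.
value_nulls = season_count - value_count
print(f"Rows with a null Value (at least): {value_nulls}")


def null_counts(df, columns):
    """Count null rows per column directly on a Spark DataFrame."""
    return {c: df.filter(df[c].isNull()).count() for c in columns}
```

On the notebook's data the helper would be called as `null_counts(pga_data, ['Value', 'Season'])`.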



## 6. Write a parquet file

* Now we will write a parquet file (not partitioned) from the PGA dataset.
  * First we create a new dataframe containing ONLY the "Season" and "Value" fields (using the `select` function from the previous section), then write it as a plain parquet file. The next section writes the same dataframe partitioned by "Season".

*Note:* if any of your column names contain spaces, Spark will produce an error message on this call. That is why we select ONLY the "Season" and "Value" fields. Ideally we should rename those columns.
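Renaming the columns would let us keep all five fields. The sketch below is one way to do it, assuming the characters to avoid are space and `,;{}()=` plus newline and tab, which is the set Spark has reported as invalid for Parquet column names (newer versions may be more lenient). `sanitize` and `with_clean_names` are hypothetical helper names.

```python
import re

# Characters assumed to be disallowed in Parquet column names.
_INVALID = re.compile(r"[ ,;{}()\n\t=]")


def sanitize(name):
    """Replace each disallowed character with an underscore."""
    return _INVALID.sub("_", name)


def with_clean_names(df):
    """Return the DataFrame with every column renamed via sanitize()."""
    return df.toDF(*[sanitize(c) for c in df.columns])


print([sanitize(c) for c in ["Player Name", "Season", "Value"]])
```

With this in place, `with_clean_names(pga_data)` would rename `Player Name` to `Player_Name` before the write.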

In [42]:
## write parquet file
season_val_df = pga_data.select('Season','Value')
season_val_df.write.mode("overwrite").parquet('partition_parquet/')

## 7. Write a partitioned parquet file

You will need to use the same limited dataframe that you created in the previous question to accomplish this task as well.

In [43]:
## partition parquet by season
season_val_df.write.mode("overwrite").partitionBy("Season").parquet("partitioned_parquet/")
season_val_df.limit(5).toPandas()

| | Season | Value |
|---|---|---|
| 0 | 2010 | 71 |
| 1 | 2010 | 77 |
| 2 | 2010 | 83 |
| 3 | 2010 | 54 |
| 4 | 2010 | 100 |
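A partitioned write creates one sub-folder per partition value, named `Season=<value>`. To confirm the write did what we expect, the folder names can be listed. This is a minimal sketch that assumes the output sits on the local filesystem (as it does in Colab); `list_partitions` is a hypothetical helper, and the demo uses a temporary directory rather than the real output.

```python
import os
import tempfile


def list_partitions(root, column):
    """Return the sorted partition values found as `column=value` folders."""
    prefix = column + "="
    return sorted(
        entry[len(prefix):]
        for entry in os.listdir(root)
        if entry.startswith(prefix) and os.path.isdir(os.path.join(root, entry))
    )


# Demo on a throwaway directory laid out like a partitioned write.
with tempfile.TemporaryDirectory() as demo_root:
    for season in (2010, 2011):
        os.makedirs(os.path.join(demo_root, f"Season={season}"))
    # Spark also leaves marker files such as _SUCCESS; they are ignored.
    open(os.path.join(demo_root, "_SUCCESS"), "w").close()
    demo_partitions = list_partitions(demo_root, "Season")

print(demo_partitions)
```

Against the real output this would be `list_partitions('partitioned_parquet/', 'Season')`.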


## 8. Read in a partitioned parquet file

Now try reading in the partitioned parquet file you just created above.

In [30]:
## read in partitioned parquet file
path_parq='/content/'
season_val_part = spark.read.parquet(path_parq+'partitioned_parquet')
season_val_part.limit(5).toPandas()

| | Value | Season |
|---|---|---|
| 0 | 71 | 2010 |
| 1 | 77 | 2010 |
| 2 | 83 | 2010 |
| 3 | 54 | 2010 |
| 4 | 100 | 2010 |


## 9. Reading in a set of partitioned parquet files

Now try only reading Seasons 2010, 2011 and 2012.

In [33]:
path_parq = '/content/partitioned_parquet/'

In [45]:
## read in specific files from partitioned parquet
## have to use .option("basePath")
part_df = spark.read.option("basePath",path_parq).parquet(path_parq+'Season=2010/',
                                                          path_parq+'Season=2011/',
                                                          path_parq+'Season=2012/')
part_df.show(5)

+-----+------+
|Value|Season|
+-----+------+
|   71|  2010|
|   77|  2010|
|   83|  2010|
|   54|  2010|
|  100|  2010|
+-----+------+
only showing top 5 rows
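Typing each partition path by hand is error-prone, so the list of paths can be generated instead. The sketch below assumes the `Season=<value>` folder layout written earlier; `partition_paths` and `read_partitions` are hypothetical helper names.

```python
import posixpath


def partition_paths(base_path, column, values):
    """Build one `column=value` path per requested partition value."""
    return [posixpath.join(base_path, f"{column}={v}") for v in values]


def read_partitions(spark, base_path, column, values):
    """Read only the listed partitions, keeping the partition column."""
    paths = partition_paths(base_path, column, values)
    # basePath tells Spark where partition discovery starts, so the
    # partition column is still included in the result.
    return spark.read.option("basePath", base_path).parquet(*paths)


print(partition_paths("/content/partitioned_parquet/", "Season", [2010, 2011, 2012]))
```

With the notebook's variables, `read_partitions(spark, path_parq, 'Season', [2010, 2011, 2012])` would read the same three seasons.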



## 10. Create your own dataframe

* Now let's create our own dataframe below using PySpark's *.createDataFrame* function.

* We will make one that contains 2 variables and at least 3 rows.


In [35]:
## create custom df -- pizza types and cost
vals = [('Pepperoni', 10.99),('Cheese', 8.99),('Hawaiian',12.50),('BBQ Chicken', 15.99)]

## create pyspark df
spark_df = spark.createDataFrame(vals, ['Pizza Type','Cost'])
spark_df.show()

+-----------+-----+
| Pizza Type| Cost|
+-----------+-----+
|  Pepperoni|10.99|
|     Cheese| 8.99|
|   Hawaiian| 12.5|
|BBQ Chicken|15.99|
+-----------+-----+



In [36]:
## schema
spark_df.printSchema()

root
 |-- Pizza Type: string (nullable = true)
 |-- Cost: double (nullable = true)

