<a href="https://colab.research.google.com/github/ad17171717/YouTube-Tutorials/blob/main/Google%20Colab%20Tutorials/Google_Colab_%2B_PySpark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **PySpark**

**PySpark is the Python API for Apache Spark. It enables users to perform real-time, large-scale data processing in a distributed environment using Python. PySpark combines Python’s learnability and ease of use with the power of Apache Spark to enable processing and analysis of data at any size for everyone familiar with Python. PySpark supports all of Spark’s features such as Spark SQL, DataFrames, Structured Streaming, Machine Learning (MLlib) and Spark Core.**

<sup>Source: [PySpark Documentation](https://spark.apache.org/docs/latest/api/python/index.html) from Apache.org</sup>

## **Installing PySpark**

In [1]:
#check that java is installed
!java -version

openjdk version "11.0.23" 2024-04-16
OpenJDK Runtime Environment (build 11.0.23+9-post-Ubuntu-1ubuntu122.04.1)
OpenJDK 64-Bit Server VM (build 11.0.23+9-post-Ubuntu-1ubuntu122.04.1, mixed mode, sharing)


In [2]:
#install pyspark
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 317.0/317.0 MB 2.7 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488491 sha256=3d038ba9f5b5b2a1c298b8f1505a03f29c9684d0dc59134724a8d85a447de6cc
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1


In [3]:
import os
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, LongType, TimestampType

## **Working with PySpark**

### **Download Dataset**

In [None]:
#use curl to download the data then unzip it into the content directory
!curl -O https://datarepo.eng.ucsd.edu/mcauley_group/data/amazon_2023/raw/review_categories/Automotive.jsonl.gz
!gunzip -f Automotive.jsonl.gz

<sup>Data Source: [Amazon Reviews](https://amazon-reviews-2023.github.io/#) from UCSD by Hou, Yupeng and Li, Jiacheng and He, Zhankui and Yan, An and Chen, Xiusi and McAuley, Julian</sup>

In [None]:
print(f'The size of the Amazon Automotive Reviews file is: {os.path.getsize("/content/Automotive.jsonl") / (1024 ** 3):.2f} GB')

### **Initializing a SparkSession**

**`SparkSession` is a class in the `pyspark.sql` module. It is the entry point for working with Apache Spark and allows us to create a PySpark DataFrame.**

<sup>Source: [pyspark.sql.SparkSession](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.html) from Spark.Apache.org</sup>

In [4]:
spark = SparkSession.builder.appName('AutomotiveData').getOrCreate()

print(f'The Spark version is {spark.version}')

The Spark version is 3.5.1


### **Reading in the Data**

#### **pandas**

In [None]:
#attempting to read an 8 GB file into a pandas DataFrame will cause the session to crash
pandas_df = pd.read_json('/content/Automotive.jsonl', lines=True)
print(pandas_df.head())
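
**When a file is too large for `pd.read_json`, one way to inspect it without exhausting memory is to stream it line by line. The sketch below uses only the standard library; the helper name `iter_jsonl` and the two-record sample file are hypothetical stand-ins, not part of the original notebook.**

```python
import json
import tempfile
from itertools import islice


def iter_jsonl(path):
    """Yield one parsed record per non-empty line, without loading the whole file."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Hypothetical two-record sample standing in for the real reviews file.
sample_lines = [
    '{"rating": 5.0, "title": "Great", "helpful_vote": 1}',
    '{"rating": 3.0, "title": "Okay", "helpful_vote": 0}',
]
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("\n".join(sample_lines) + "\n")
    sample_path = tmp.name

# islice stops after at most 5 records, so only those lines are ever parsed.
first_records = list(islice(iter_jsonl(sample_path), 5))
for record in first_records:
    print(record["rating"], record["title"])
```

**Because `iter_jsonl` is a generator, memory use stays roughly proportional to one line at a time rather than to the whole file.**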

#### **PySpark**

##### **Using a Schema for a PySpark DataFrame**

**Within Apache Spark a `schema` provides the data format for a DataFrame or a Dataset.**

**For our `schema`, we first define a `StructType`, which is a collection of `StructField`s. Each `StructField` describes one column: the column name (`name`), the type of data it holds (`dataType`) and whether the column can contain null values (`nullable`).**
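
**To make the `name` / `dataType` / `nullable` triple concrete, here is a plain-Python analogy. It is not the PySpark API and does not reproduce how Spark handles mismatched values; `SCHEMA_SKETCH` and `check_record` are hypothetical names used only for illustration.**

```python
import json

# Plain-Python stand-in for a schema: column name -> (expected type, nullable).
# This mirrors the name / dataType / nullable triple; it is not the PySpark API.
SCHEMA_SKETCH = {
    "asin": (str, True),
    "helpful_vote": (int, True),
    "rating": (float, True),
}


def check_record(record, schema):
    """Return a list of problems found when comparing one record to the schema."""
    problems = []
    for name, (expected_type, nullable) in schema.items():
        value = record.get(name)
        if value is None:
            # A missing or null value is only a problem for non-nullable columns.
            if not nullable:
                problems.append(f"{name}: null not allowed")
        elif not isinstance(value, expected_type):
            problems.append(
                f"{name}: expected {expected_type.__name__}, got {type(value).__name__}"
            )
    return problems


# Hypothetical records: one that matches the sketch, one that does not.
good = json.loads('{"asin": "B000TEST", "helpful_vote": 2, "rating": 4.0}')
bad = json.loads('{"asin": "B000TEST", "helpful_vote": "two", "rating": null}')
print(check_record(good, SCHEMA_SKETCH))
print(check_record(bad, SCHEMA_SKETCH))
```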

In [None]:
schema = StructType([
    StructField("asin", StringType(), True),
    StructField("helpful_vote", IntegerType(), True),
    StructField("images", StringType(), True),
    StructField("parent_asin", StringType(), True),
    StructField("rating", DoubleType(), True),
    StructField("text", StringType(), True),
    StructField("timestamp", TimestampType(), True),
    StructField("title", StringType(), True),
    StructField("user_id", StringType(), True),
    StructField("verified_purchase", StringType(), True)
])
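
**The `timestamp` column above is declared as `TimestampType`. If the raw value in the file turns out to be a Unix epoch integer in milliseconds (an assumption worth checking against a few records), it would need converting before it reads as a sensible date. The sketch below shows that conversion in plain Python; the sample value and the helper name `millis_to_datetime` are hypothetical.**

```python
from datetime import datetime, timedelta, timezone


def millis_to_datetime(epoch_millis):
    """Convert a Unix epoch value in milliseconds to a timezone-aware UTC datetime."""
    # Split into whole seconds and leftover milliseconds to avoid float rounding.
    seconds, millis = divmod(epoch_millis, 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=millis)


sample_millis = 1588687728923  # hypothetical sample value, assumed to be milliseconds
converted = millis_to_datetime(sample_millis)
print(converted.isoformat())
```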

In [None]:
spark_df = spark.read.schema(schema).json('/content/Automotive.jsonl')

### **PySpark Operations**

In [None]:
#display first 5 rows of the pyspark DataFrame
#show() prints the rows itself and returns None, so it does not need to be wrapped in print()
spark_df.show(5)

In [None]:
#print the columns and datatypes of the DataFrame
spark_df.printSchema()

In [None]:
#retrieve summary statistics for the DataFrame
spark_df.describe().show()

In [None]:
#retrieve value counts for the rating column
spark_df.groupBy('rating').count().show()
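
**Conceptually, `groupBy('rating').count()` tallies the number of rows for each distinct rating. On a handful of records the same tally can be sketched with `collections.Counter`. Spark distributes this work across partitions, which this single-process sketch does not; the sample records are hypothetical.**

```python
from collections import Counter

# Hypothetical mini-sample of review records (only the rating field matters here).
records = [
    {"rating": 5.0},
    {"rating": 4.0},
    {"rating": 5.0},
    {"rating": 1.0},
    {"rating": 5.0},
]

# Count how many records share each rating value.
rating_counts = Counter(record["rating"] for record in records)
for rating, count in sorted(rating_counts.items()):
    print(f"{rating}: {count}")
```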

## **Limitations of PySpark within Google Colab**

**Google Colab is available to Google Account holders at no monetary cost, and because of this the resources within a Colab session are limited. Computation may also be slower because of cloud latency. In this tutorial we saw how long it can take to read an 8 GB file into a PySpark DataFrame or to run calculations within a Google Colab session. The free edition of Google Colab can be great for learning PySpark or working with early-stage prototype scripts. However, it should not be used for development or production purposes.**

**Consider working with datasets of 2 GB or smaller when using PySpark in Google Colab.**

<sup>Source: [Google Colab Frequently Asked Questions](https://research.google.com/colaboratory/faq.html)</sup>
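
**The 2 GB guideline can be turned into a quick check before loading a file. This is a minimal sketch: the helper name `fits_colab_guideline` is hypothetical, the threshold simply restates the suggestion above using 1024 ** 3 bytes per GB, and the small temporary file stands in for a real dataset.**

```python
import os
import tempfile

# 2 GB guideline from the text, expressed in bytes (using 1024 ** 3 bytes per GB).
MAX_BYTES = 2 * 1024 ** 3


def fits_colab_guideline(path, max_bytes=MAX_BYTES):
    """Return True when the file at `path` is at or under the size guideline."""
    return os.path.getsize(path) <= max_bytes


# Hypothetical 1 KB file standing in for a dataset.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 1024)
    small_path = tmp.name

print(fits_colab_guideline(small_path))
```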

# **References and Additional Learning**

## **Data**

- **[Amazon Reviews](https://amazon-reviews-2023.github.io/#) from UCSD by Hou, Yupeng and Li, Jiacheng and He, Zhankui and Yan, An and Chen, Xiusi and McAuley, Julian**

## **Documentation**

- **[Google Colab Frequently Asked Questions](https://research.google.com/colaboratory/faq.html) from research.google.com**

- **[PySpark Documentation](https://spark.apache.org/docs/latest/api/python/index.html) from Apache.org**

# **Connect**
- **Feel free to connect with Adrian on [YouTube](https://www.youtube.com/channel/UCPuDxI3xb_ryUUMfkm0jsRA), [LinkedIn](https://www.linkedin.com/in/adrian-dolinay-frm-96a289106/), [X](https://twitter.com/DolinayG), [GitHub](https://github.com/ad17171717), [Medium](https://adriandolinay.medium.com/) and [Odysee](https://odysee.com/@adriandolinay:0). Happy coding!**