# PySpark

Installation: `!pip install pyspark==3.5.0`


- A Spark or PySpark application starts with an entry point, the `SparkSession`.
- The `SparkSession` is the entry point to programming Spark with the Dataset and DataFrame API.
- It gives access to the `SparkContext` (via `spark.sparkContext`) and, since Spark 2.0, covers the functionality that used to be split across `SQLContext` and `HiveContext`.
- A `SparkSession` is created through the `SparkSession.builder` attribute, ending the chain with `getOrCreate()`.

In [1]:
from pyspark.sql import SparkSession

# spark = SparkSession.builder.master("local[*]").appName("Spark-Read").config("spark.some.config.option", "config-value").getOrCreate()  # Create the SparkSession
spark = SparkSession.builder.appName('Spark-Read').getOrCreate()
spark.sparkContext.setJobDescription("Learning Spark UI")

spark

In [2]:
# Method 1: Using the `read.csv` method to read csv file
# the `header` option is set to True to indicate that the first row in the csv file contains the column names.
# the `inferSchema` option is set to True to automatically infer the data types of the columns.

# Note:
# - You also have the option of strictly specifying a schema when you read in data (which we recommend in production
# scenarios)

spark_df1 = spark.read.csv('./data/testdata.csv', header=True, inferSchema=True)

# show(n) prints the first n rows of the DataFrame as a text table
spark_df1.show(3)
# spark_df1.show(10, truncate=False)  # truncate=False prints full cell values



In [12]:
# Read file Method 2: Using the 'option' method to read csv file

# the 'header' option indicates that the first row in the csv file contains the column names.
# inferSchema is not set here, so every column is read as a string.
spark_df2 = spark.read.option("header", "true").csv('./data/testdata.csv')

# .head(n) returns the first n rows of the DataFrame as a list of Row objects.
spark_df2.head(5)

# .printSchema method prints the schema of the Dataframe.
# spark_df2.printSchema()

spark_df2.schema
spark_df2.dtypes # Get the data types of each column in the DataFrame as a list of tuples (column_name, data_type)
spark_df2.columns

# .tail(n) returns the last n rows of the DataFrame as a list of Row objects.
spark_df2.tail(10)

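spark_df1.MSISDN  # attribute access returns a Column object, not the data itself
spark_df1.select('MSISDN').show(5)  # select the column and print its first 5 rows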

+--------+
|  MSISDN|
+--------+
|54924133|
|54846497|
|57369115|
|57113437|
|58468805|
+--------+
only showing top 5 rows



In [13]:
!pip install pyspark




