## PySpark Basics

In [1]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.4.0.tar.gz (310.8 MB)
     -------------------------------------- 310.8/310.8 MB 1.3 MB/s eta 0:00:00
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting py4j==0.10.9.7 (from pyspark)
  Downloading py4j-0.10.9.7-py2.py3-none-any.whl (200 kB)
     -------------------------------------- 200.5/200.5 kB 4.0 MB/s eta 0:00:00
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py): started
  Building wheel for pyspark (setup.py): finished with status 'done'
  Created wheel for pyspark: filename=pyspark-3.4.0-py2.py3-none-any.whl size=311317167 sha256=b9e564ba7f19f8c467b87b6b96c857c0a632eb4aa59d48d79dbc11608242072b
  Stored in directory: c:\users\azam\appdata\local\pip\cache\wheels\9f\34\a4\159aa12d0a510d5ff7c8f0220abbea42e5d81ecf588c4fd884
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.7 pyspark-3.4



In [1]:
import pyspark

In [2]:
import pandas as pd

In [3]:
path = r"C:\Users\Azam\Desktop\Extra Work\Udemy ML\UNZIP_FOR_NOTEBOOKS_FINAL\03-Pandas\movie_scores.csv"

In [4]:
df = pd.read_csv(path)

In [5]:
df.head()

  first_name last_name   age  sex  pre_movie_score  post_movie_score
0        Tom     Hanks  63.0    m              8.0              10.0
1        NaN       NaN   NaN  NaN              NaN               NaN
2       Hugh   Jackman  51.0    m              NaN               NaN
3      Oprah   Winfrey  66.0    f              6.0               8.0
4       Emma     Stone  31.0    f              7.0               9.0


In [6]:
# Always start a Spark session before using any Spark functionality

from pyspark.sql import SparkSession

In [7]:
spark = SparkSession.builder.appName('Practice').getOrCreate()

#### SparkSession: SparkSession is the entry point to Spark's DataFrame and SQL functionality. It lets you read data, run queries, and operate on distributed datasets. Since Spark 2.0 it unifies the earlier SQLContext and HiveContext APIs and wraps a SparkContext, which remains available as spark.sparkContext.

#### builder: builder is an attribute of SparkSession (accessed without parentheses) that returns a SparkSession.Builder object. This class provides a fluent API for configuring the session: each configuration method returns the builder, so calls can be chained.

#### appName: appName is a builder method that sets the name of your Spark application. It is optional, but a meaningful name makes the application easy to identify in the Spark web UI and logs; if you omit it, Spark generates a name. In the code above, the name is set to 'Practice'.

#### getOrCreate: getOrCreate is a method of the SparkSession.Builder object. It returns the existing SparkSession if one is already running; otherwise it creates a new one from the options set on the builder. Because of this, re-running the cell does not start a second Spark application: sessions in the same process share a single SparkContext, and only one SparkContext can be active per JVM (Java Virtual Machine).



In [8]:
spark

In [10]:
df_pyspark = spark.read.csv(path)

In [11]:
df_pyspark

DataFrame[_c0: string, _c1: string, _c2: string, _c3: string, _c4: string, _c5: string]

In [12]:
df_pyspark.show()

+----------+---------+----+----+---------------+----------------+
|       _c0|      _c1| _c2| _c3|            _c4|             _c5|
+----------+---------+----+----+---------------+----------------+
|first_name|last_name| age| sex|pre_movie_score|post_movie_score|
|       Tom|    Hanks|63.0|   m|            8.0|            10.0|
|      null|     null|null|null|           null|            null|
|      Hugh|  Jackman|51.0|   m|           null|            null|
|     Oprah|  Winfrey|66.0|   f|            6.0|             8.0|
|      Emma|    Stone|31.0|   f|            7.0|             9.0|
+----------+---------+----+----+---------------+----------------+



In [13]:
# Use the first row (first_name, last_name, age, sex, pre_movie_score, post_movie_score) as the column names

df_pyspark = spark.read.option('header','true').csv(path)

In [14]:
df_pyspark.show()

+----------+---------+----+----+---------------+----------------+
|first_name|last_name| age| sex|pre_movie_score|post_movie_score|
+----------+---------+----+----+---------------+----------------+
|       Tom|    Hanks|63.0|   m|            8.0|            10.0|
|      null|     null|null|null|           null|            null|
|      Hugh|  Jackman|51.0|   m|           null|            null|
|     Oprah|  Winfrey|66.0|   f|            6.0|             8.0|
|      Emma|    Stone|31.0|   f|            7.0|             9.0|
+----------+---------+----+----+---------------+----------------+



In [15]:
# pandas DataFrame
type(df)

pandas.core.frame.DataFrame

In [16]:
# PySpark DataFrame
type(df_pyspark)

pyspark.sql.dataframe.DataFrame

In [17]:
df_pyspark.head(3)

[Row(first_name='Tom', last_name='Hanks', age='63.0', sex='m', pre_movie_score='8.0', post_movie_score='10.0'),
 Row(first_name=None, last_name=None, age=None, sex=None, pre_movie_score=None, post_movie_score=None),
 Row(first_name='Hugh', last_name='Jackman', age='51.0', sex='m', pre_movie_score=None, post_movie_score=None)]

In [18]:
df_pyspark.printSchema()

root
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- age: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- pre_movie_score: string (nullable = true)
 |-- post_movie_score: string (nullable = true)



In [19]:
num_rows = df_pyspark.count()  # Get the number of rows
num_columns = len(df_pyspark.columns)  # Get the number of columns

print("Number of rows:", num_rows)
print("Number of columns:", num_columns)

Number of rows: 5
Number of columns: 6
