# Spark High-Level (DataFrame, Dataset, SQL) API
DataFrame, Dataset, and SQL are Spark's structured high-level APIs. Data is organized into named columns, like a table in a relational database. The schema, a concept introduced with the DataFrame, defines the column names and the data type stored in each column.

In [2]:
import findspark
findspark.init()

In [3]:
from pyspark.sql import SparkSession

In [4]:
spark = SparkSession.builder\
.master("local[4]")\
.appName("DataFrame")\
.config("spark.driver.memory", "2g")\
.config("spark.executor.memory", "2g")\
.getOrCreate()

In [5]:
sc = spark.sparkContext

## Creating DataFrame from list

In [6]:
from pyspark.sql import Row
list_rdd = sc.parallelize([1,2,325,546,5,7,2,32,324]).map(lambda x: Row(x))

In [7]:
df_list = list_rdd.toDF(["Numbers"])

In [8]:
df_list.show()

+-------+
|Numbers|
+-------+
|      1|
|      2|
|    325|
|    546|
|      5|
|      7|
|      2|
|     32|
|    324|
+-------+



## Creating DataFrame from range

In [10]:
df_from_range = sc.parallelize(range(1,30,3))\
.map(lambda x: (x,))\
.toDF(["Range Numbers"])

In [11]:
df_from_range.show(5)

+-------------+
|Range Numbers|
+-------------+
|            1|
|            4|
|            7|
|           10|
|           13|
+-------------+
only showing top 5 rows



In [12]:
from pyspark.sql.types import IntegerType
df_from_range2 = spark.createDataFrame(range(1,30,3), IntegerType())
df_from_range2.show(3)

+-----+
|value|
+-----+
|    1|
|    4|
|    7|
+-----+
only showing top 3 rows



## Creating DataFrame from File

#### Without inferSchema, Spark reads every column as a string

In [13]:
df_from_file = spark.read\
.option("header","True")\
.option("inferSchema", "True")\
.csv("data/film_data.csv")

df_from_file.show(10)

+--------------------+---------+------+-----+-------+----+--------+
|                Name|    Genre|Length|Score|Country|Year|  Budget|
+--------------------+---------+------+-----+-------+----+--------+
|         stand by Me|Adventure|    89|  8.1|    USA|1986| 8000000|
|ferris Bueller's ...|   Comedy|   103|  7.8|    USA|1986| 6000000|
|             Top Gun|   Action|   110|  6.9|    USA|1986|15000000|
|              Aliens|   Action|   137|  8.4|    USA|1986|18500000|
|Flight of the Nav...|Adventure|    90|  6.9|    USA|1986| 9000000|
|             Platoon|    Drama|   120|  8.1|     UK|1986| 6000000|
|           Labyrinth|Adventure|   101|  7.4|     UK|1986|25000000|
|         Blue Velvet|    Drama|   120|  7.8|    USA|1986| 6000000|
|      Pretty in Pink|   Comedy|    96|  6.8|    USA|1986| 9000000|
|             The Fly|    Drama|    96|  7.5|    USA|1986|15000000|
+--------------------+---------+------+-----+-------+----+--------+
only showing top 10 rows



In [14]:
df_movie = df_from_file
df_movie.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Genre: string (nullable = true)
 |-- Length: integer (nullable = true)
 |-- Score: double (nullable = true)
 |-- Country: string (nullable = true)
 |-- Year: string (nullable = true)
 |-- Budget: integer (nullable = true)



In [15]:
df_movie.count()

99

#### Selecting dataframe columns

In [16]:
df_movie.select("Name","Genre","Length","Score").show(10)

+--------------------+---------+------+-----+
|                Name|    Genre|Length|Score|
+--------------------+---------+------+-----+
|         stand by Me|Adventure|    89|  8.1|
|ferris Bueller's ...|   Comedy|   103|  7.8|
|             Top Gun|   Action|   110|  6.9|
|              Aliens|   Action|   137|  8.4|
|Flight of the Nav...|Adventure|    90|  6.9|
|             Platoon|    Drama|   120|  8.1|
|           Labyrinth|Adventure|   101|  7.4|
|         Blue Velvet|    Drama|   120|  7.8|
|      Pretty in Pink|   Comedy|    96|  6.8|
|             The Fly|    Drama|    96|  7.5|
+--------------------+---------+------+-----+
only showing top 10 rows



#### Sorting dataframe by descending movie score

In [17]:
df_movie.sort(df_movie.Score.desc()).show(10)

+--------------------+---------+------+-----+-------+----+--------+
|                Name|    Genre|Length|Score|Country|Year|  Budget|
+--------------------+---------+------+-----+-------+----+--------+
|              Aliens|   Action|   137|  8.4|    USA|1986|18500000|
|             Platoon|    Drama|   120|  8.1|     UK|1986| 6000000|
|         stand by Me|Adventure|    89|  8.1|    USA|1986| 8000000|
|           Sacrifice|    Drama|   149|  8.1| Sweden|1986|       0|
|Hannah and Her Si...|   Comedy|   107|  8.0|    USA|1986| 6400000|
|ferris Bueller's ...|   Comedy|   103|  7.8|    USA|1986| 6000000|
|         Blue Velvet|    Drama|   120|  7.8|    USA|1986| 6000000|
|         Down by Law|   Comedy|   107|  7.8|    USA|1986| 1100000|
|The Name of the Rose|    Crime|   130|  7.8|  Italy|1986|       0|
| When the Wind Blows|Animation|    80|  7.8|     UK|1986|       0|
+--------------------+---------+------+-----+-------+----+--------+
only showing top 10 rows



#### Sorting dataframe by ascending movie score

In [18]:
df_movie.sort(df_movie.Score.asc()).show(10)
# Equivalent, because sort() is ascending by default:
#df_movie.sort("Score").show(10)

+--------------------+---------+------+-----+-------+----+--------+
|                Name|    Genre|Length|Score|Country|Year|  Budget|
+--------------------+---------+------+-----+-------+----+--------+
|     King Kong Lives|   Action|   105|  3.8|    USA|1986|10000000|
|               Troll|   Comedy|    82|  4.3|    USA|1986| 1100000|
|      The Men's Club|   Comedy|   101|  4.5|    USA|1986|       0|
|     Howard the Duck|   Action|   110|  4.6|    USA|1986|35000000|
|         Solarbabies|   Action|    94|  4.8|    USA|1986|25000000|
|Police Academy 3:...|   Comedy|    83|  5.2|    USA|1986|       0|
|            Soul Man|   Comedy|   104|  5.2|    USA|1986|       0|
|          Iron Eagle|   Action|   117|  5.3|    USA|1986|       0|
|        Psicosis III|   Horror|    93|  5.3|    USA|1986|       0|
|The Clan of the C...|Adventure|    98|  5.3|    USA|1986|15000000|
+--------------------+---------+------+-----+-------+----+--------+
only showing top 10 rows



In [19]:
df_movie.head(3)

[Row(Name='stand by Me', Genre='Adventure', Length=89, Score=8.1, Country='USA', Year='1986', Budget=8000000),
 Row(Name="ferris Bueller's Day Off", Genre='Comedy', Length=103, Score=7.8, Country='USA', Year='1986', Budget=6000000),
 Row(Name='Top Gun', Genre='Action', Length=110, Score=6.9, Country='USA', Year='1986', Budget=15000000)]

In [20]:
df_movie.limit(10)

DataFrame[Name: string, Genre: string, Length: int, Score: double, Country: string, Year: string, Budget: int]

#### Getting the number of DataFrame partitions

In [21]:
df_movie.repartition(3).rdd.getNumPartitions()

3

In [22]:
df_movie.repartition("Genre","Name").show(10)

+--------------------+---------+------+-----+-------+----+--------+
|                Name|    Genre|Length|Score|Country|Year|  Budget|
+--------------------+---------+------+-----+-------+----+--------+
|           Sacrifice|    Drama|   149|  8.1| Sweden|1986|       0|
|          Tough Guys|   Comedy|   104|  6.2|    USA|1986|10000000|
| Armed and Dangerous|   Action|    88|  5.6|    USA|1986|12000000|
| When the Wind Blows|Animation|    80|  7.8|     UK|1986|       0|
|             Top Gun|   Action|   110|  6.9|    USA|1986|15000000|
|  Jumpin' Jack Flash|   Comedy|   105|  5.8|    USA|1986|       0|
|Hannah and Her Si...|   Comedy|   107|  8.0|    USA|1986| 6400000|
|      Something Wild|   Comedy|   114|  6.9|    USA|1986|       0|
|      Pretty in Pink|   Comedy|    96|  6.8|    USA|1986| 9000000|
|         The Mission|Adventure|   125|  7.5|     UK|1986|24500000|
+--------------------+---------+------+-----+-------+----+--------+
only showing top 10 rows



#### collect() --> returns all records to the driver as a list of Row (use it only on small DataFrames)

In [23]:
df_movie.collect()

[Row(Name='stand by Me', Genre='Adventure', Length=89, Score=8.1, Country='USA', Year='1986', Budget=8000000),
 Row(Name="ferris Bueller's Day Off", Genre='Comedy', Length=103, Score=7.8, Country='USA', Year='1986', Budget=6000000),
 Row(Name='Top Gun', Genre='Action', Length=110, Score=6.9, Country='USA', Year='1986', Budget=15000000),
 Row(Name='Aliens', Genre='Action', Length=137, Score=8.4, Country='USA', Year='1986', Budget=18500000),
 Row(Name='Flight of the Navigator', Genre='Adventure', Length=90, Score=6.9, Country='USA', Year='1986', Budget=9000000),
 Row(Name='Platoon', Genre='Drama', Length=120, Score=8.1, Country='UK', Year='1986', Budget=6000000),
 Row(Name='Labyrinth', Genre='Adventure', Length=101, Score=7.4, Country='UK', Year='1986', Budget=25000000),
 Row(Name='Blue Velvet', Genre='Drama', Length=120, Score=7.8, Country='USA', Year='1986', Budget=6000000),
 Row(Name='Pretty in Pink', Genre='Comedy', Length=96, Score=6.8, Country='USA', Year='1986', Budget=9000000),
 

#### Return all column names

In [24]:
df_movie.columns

['Name', 'Genre', 'Length', 'Score', 'Country', 'Year', 'Budget']

#### Calculating the correlation between two variables (columns)

In [25]:
df_movie.corr("Length","Score")

0.25962854190011336

In [26]:
df_movie.count()

99

#### Selecting distinct column values from a DataFrame

In [27]:
df_movie.select("Genre").distinct().show()

+---------+
|    Genre|
+---------+
|    Crime|
| Thriller|
|Adventure|
|    Drama|
|Animation|
|   Horror|
|Biography|
|   Comedy|
|   Action|
|   Sci-Fi|
+---------+



#### Dropping a column from a DataFrame

In [28]:
df_movie.drop("Budget").show(5)

+--------------------+---------+------+-----+-------+----+
|                Name|    Genre|Length|Score|Country|Year|
+--------------------+---------+------+-----+-------+----+
|         stand by Me|Adventure|    89|  8.1|    USA|1986|
|ferris Bueller's ...|   Comedy|   103|  7.8|    USA|1986|
|             Top Gun|   Action|   110|  6.9|    USA|1986|
|              Aliens|   Action|   137|  8.4|    USA|1986|
|Flight of the Nav...|Adventure|    90|  6.9|    USA|1986|
+--------------------+---------+------+-----+-------+----+
only showing top 5 rows



#### Rounding Movie Score values by using withColumn()

In [29]:
# Note: this import shadows Python's built-in round() for the rest of the notebook
from pyspark.sql.functions import col, round
df_movie.select("Score")\
.withColumn("Rounded_Score", round(col("Score"))).show(10)

+-----+-------------+
|Score|Rounded_Score|
+-----+-------------+
|  8.1|          8.0|
|  7.8|          8.0|
|  6.9|          7.0|
|  8.4|          8.0|
|  6.9|          7.0|
|  8.1|          8.0|
|  7.4|          7.0|
|  7.8|          8.0|
|  6.8|          7.0|
|  7.5|          8.0|
+-----+-------------+
only showing top 10 rows
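
Spark's `round` sends half-way values away from zero (HALF_UP), whereas Python's built-in `round` sends them to the nearest even number. The plain-Python sketch below models the Spark behaviour with the `decimal` module so the difference is visible on a value such as 6.5. `spark_like_round` is a hypothetical helper, not a Spark function.

```python
from decimal import Decimal, ROUND_HALF_UP

def spark_like_round(value: float) -> float:
    """Round to a whole number with ties going away from zero (HALF_UP)."""
    # Going through str() avoids binary floating-point noise in the Decimal.
    return float(Decimal(str(value)).quantize(Decimal("1"), rounding=ROUND_HALF_UP))

scores = [8.1, 7.5, 6.5]
half_up = [spark_like_round(s) for s in scores]
half_even = [float(round(s)) for s in scores]  # Python's built-in round()

print(half_up)    # [8.0, 8.0, 7.0]
print(half_even)  # [8.0, 8.0, 6.0]
```

Because the cell above imports `round` from `pyspark.sql.functions`, the built-in is no longer reachable under that name in the notebook. Importing the module under an alias and calling `F.round` is one way to avoid the clash.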



#### Dropping fully duplicated rows from a DataFrame

In [30]:
df_movie.dropDuplicates().show(5)

+--------------------+---------+------+-----+-------+----+--------+
|                Name|    Genre|Length|Score|Country|Year|  Budget|
+--------------------+---------+------+-----+-------+----+--------+
|The Texas Chainsa...|   Comedy|    89|  5.6|    USA|1986| 4700000|
|        River's Edge|    Crime|    99|  7.1|    USA|1986| 1900000|
|Police Academy 3:...|   Comedy|    83|  5.2|    USA|1986|       0|
|The Great Mouse D...|Animation|    74|  7.2|    USA|1986|14000000|
|          Witchboard|   Horror|    98|  5.7|     UK|1986| 2000000|
+--------------------+---------+------+-----+-------+----+--------+
only showing top 5 rows



#### Dropping rows that are duplicated on selected columns

In [31]:
df_movie.dropDuplicates(["Name","Genre"]).show(5)

+------------------+---------+------+-----+-------+----+--------+
|              Name|    Genre|Length|Score|Country|Year|  Budget|
+------------------+---------+------+-----+-------+----+--------+
|        Tough Guys|   Comedy|   104|  6.2|    USA|1986|10000000|
|Jumpin' Jack Flash|   Comedy|   105|  5.8|    USA|1986|       0|
|   Ruthless People|   Comedy|    93|  6.9|    USA|1986|       0|
|            Aliens|   Action|   137|  8.4|    USA|1986|18500000|
|  An American Tail|Animation|    80|  6.9|    USA|1986|       0|
+------------------+---------+------+-----+-------+----+--------+
only showing top 5 rows



In [32]:
df_movie.dtypes

[('Name', 'string'),
 ('Genre', 'string'),
 ('Length', 'int'),
 ('Score', 'double'),
 ('Country', 'string'),
 ('Year', 'string'),
 ('Budget', 'int')]

In [33]:
df_movie.explain()

== Physical Plan ==
*(1) FileScan csv [Name#81,Genre#82,Length#83,Score#84,Country#85,Year#86,Budget#87] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/home/asus/Desktop/BigData-MachineLearning-Notes/ApacheSpark/data/film_dat..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<Name:string,Genre:string,Length:int,Score:double,Country:string,Year:string,Budget:int>


### Filtering dataframe values by Movie Length

In [34]:
df_movie.filter(df_movie.Length > 100).show(10)

+--------------------+---------+------+-----+-------+----+--------+
|                Name|    Genre|Length|Score|Country|Year|  Budget|
+--------------------+---------+------+-----+-------+----+--------+
|ferris Bueller's ...|   Comedy|   103|  7.8|    USA|1986| 6000000|
|             Top Gun|   Action|   110|  6.9|    USA|1986|15000000|
|              Aliens|   Action|   137|  8.4|    USA|1986|18500000|
|             Platoon|    Drama|   120|  8.1|     UK|1986| 6000000|
|           Labyrinth|Adventure|   101|  7.4|     UK|1986|25000000|
|         Blue Velvet|    Drama|   120|  7.8|    USA|1986| 6000000|
|          Highlander|   Action|   116|  7.2|     UK|1986|16000000|
|           Manhunter|    Crime|   120|  7.2|    USA|1986|15000000|
|            9½ Weeks|    Drama|   117|  5.9|    USA|1986|17000000|
|     Howard the Duck|   Action|   110|  4.6|    USA|1986|35000000|
+--------------------+---------+------+-----+-------+----+--------+
only showing top 10 rows



### Filtering dataframe values by Genre and Movie Length

In [35]:
df_movie.filter((df_movie.Genre == "Action") & (df_movie.Length > 100)).show(10)

+--------------------+------+------+-----+-------+----+--------+
|                Name| Genre|Length|Score|Country|Year|  Budget|
+--------------------+------+------+-----+-------+----+--------+
|             Top Gun|Action|   110|  6.9|    USA|1986|15000000|
|              Aliens|Action|   137|  8.4|    USA|1986|18500000|
|          Highlander|Action|   116|  7.2|     UK|1986|16000000|
|     Howard the Duck|Action|   110|  4.6|    USA|1986|35000000|
|    Heartbreak Ridge|Action|   130|  6.8|    USA|1986|15000000|
|          Iron Eagle|Action|   117|  5.3|    USA|1986|       0|
|The Karate Kid Pa...|Action|   113|  5.9|    USA|1986|       0|
|     King Kong Lives|Action|   105|  3.8|    USA|1986|10000000|
|      Running Scared|Action|   107|  6.5|    USA|1986|       0|
|            Raw Deal|Action|   106|  5.5|    USA|1986| 8500000|
+--------------------+------+------+-----+-------+----+--------+
only showing top 10 rows



### Counting Genre values by using groupBy()

In [36]:
df_movie.groupBy("Genre").count().sort("count", ascending = False).show(10)

+---------+-----+
|    Genre|count|
+---------+-----+
|   Comedy|   30|
|   Action|   22|
|    Drama|   15|
|Adventure|   11|
|   Horror|    7|
|    Crime|    6|
|Animation|    4|
|Biography|    2|
| Thriller|    1|
|   Sci-Fi|    1|
+---------+-----+



### Ordering dataframe values by descending Movie Length and ascending Name

In [37]:
df_movie.orderBy(df_movie.Length.desc(), "Name").show()

+--------------------+---------+------+-----+-------+----+--------+
|                Name|    Genre|Length|Score|Country|Year|  Budget|
+--------------------+---------+------+-----+-------+----+--------+
|           Sacrifice|    Drama|   149|  8.1| Sweden|1986|       0|
|              Aliens|   Action|   137|  8.4|    USA|1986|18500000|
|    Heartbreak Ridge|   Action|   130|  6.8|    USA|1986|15000000|
|The Name of the Rose|    Crime|   130|  7.8|  Italy|1986|       0|
|     The Delta Force|   Action|   125|  5.6|    USA|1986|12000000|
|         The Mission|Adventure|   125|  7.5|     UK|1986|24500000|
|             Piratas|Adventure|   121|  6.1| France|1986|40000000|
|          Betty Blue|    Drama|   120|  7.4| France|1986|       0|
|         Blue Velvet|    Drama|   120|  7.8|    USA|1986| 6000000|
|           Manhunter|    Crime|   120|  7.2|    USA|1986|15000000|
|             Platoon|    Drama|   120|  8.1|     UK|1986| 6000000|
|Children of a Les...|    Drama|   119|  7.2|   

# [Example]: Word Count Using a DataFrame

In [38]:
story_df = spark.read.text("data/HanselStory.txt")

#### Show the first 3 rows

In [39]:
story_df.show(3, truncate=False)

+---------------------------------------------------------+
|value                                                    |
+---------------------------------------------------------+
|Once upon a time there dwelt on the outskirts of a       |
|large forest a poor woodcutter with his wife and two     |
|children; the boy was called Hansel and the girl Grettel.|
+---------------------------------------------------------+
only showing top 3 rows



In [40]:
from pyspark.sql.functions import explode, split, col

### Splitting each line into words and aliasing the resulting column

In [41]:
words = story_df.select(explode(split(col("value"), " ")).alias("Words"))

In [42]:
words.show(5)

+-----+
|Words|
+-----+
| Once|
| upon|
|    a|
| time|
|there|
+-----+
only showing top 5 rows



#### Counting word occurrences by using groupBy()

In [43]:
words.groupBy("Words").count().orderBy("count", ascending = False).show(10)

+-----+-----+
|Words|count|
+-----+-----+
|  the|  113|
|  and|   91|
|   to|   44|
|    a|   42|
| they|   34|
|   of|   31|
|  had|   27|
|  was|   19|
|   in|   19|
|   on|   19|
+-----+-----+
only showing top 10 rows
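
The `split`, `explode`, and `groupBy().count()` steps above form the DataFrame version of a classic word count. As a rough plain-Python analogue (not how Spark executes it), the same logic can be sketched with `collections.Counter`, using the three story lines shown earlier as input.

```python
from collections import Counter

lines = [
    "Once upon a time there dwelt on the outskirts of a",
    "large forest a poor woodcutter with his wife and two",
    "children; the boy was called Hansel and the girl Grettel.",
]

# split(" ") + explode: one flat list entry per word
words = [word for line in lines for word in line.split(" ")]

# groupBy("Words").count() + orderBy("count", ascending=False)
word_counts = Counter(words).most_common(3)
print(word_counts)
```

Splitting on a single space keeps punctuation attached to the word (`children;`, `Grettel.`), in the sketch and in the Spark version alike, so a real analysis would probably normalise the text first.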



# --> Executing SQL Queries on a CSV File
### The movie DataFrame loaded from the CSV file is used for the SQL queries

In [44]:
df_movie.show(10)

+--------------------+---------+------+-----+-------+----+--------+
|                Name|    Genre|Length|Score|Country|Year|  Budget|
+--------------------+---------+------+-----+-------+----+--------+
|         stand by Me|Adventure|    89|  8.1|    USA|1986| 8000000|
|ferris Bueller's ...|   Comedy|   103|  7.8|    USA|1986| 6000000|
|             Top Gun|   Action|   110|  6.9|    USA|1986|15000000|
|              Aliens|   Action|   137|  8.4|    USA|1986|18500000|
|Flight of the Nav...|Adventure|    90|  6.9|    USA|1986| 9000000|
|             Platoon|    Drama|   120|  8.1|     UK|1986| 6000000|
|           Labyrinth|Adventure|   101|  7.4|     UK|1986|25000000|
|         Blue Velvet|    Drama|   120|  7.8|    USA|1986| 6000000|
|      Pretty in Pink|   Comedy|    96|  6.8|    USA|1986| 9000000|
|             The Fly|    Drama|    96|  7.5|    USA|1986|15000000|
+--------------------+---------+------+-----+-------+----+--------+
only showing top 10 rows



#### cache() marks the DataFrame to be kept in memory; the first action fills the cache, and later queries reuse it instead of re-reading the CSV file

In [45]:
df_movie.cache()

DataFrame[Name: string, Genre: string, Length: int, Score: double, Country: string, Year: string, Budget: int]

### Creating a temporary SQL view from the loaded CSV DataFrame

In [46]:
df_movie.createOrReplaceTempView("MovieTable")

### Selecting all rows from MovieTable

In [47]:
spark.sql("""

    SELECT * FROM MovieTable

""").show(10)

+--------------------+---------+------+-----+-------+----+--------+
|                Name|    Genre|Length|Score|Country|Year|  Budget|
+--------------------+---------+------+-----+-------+----+--------+
|         stand by Me|Adventure|    89|  8.1|    USA|1986| 8000000|
|ferris Bueller's ...|   Comedy|   103|  7.8|    USA|1986| 6000000|
|             Top Gun|   Action|   110|  6.9|    USA|1986|15000000|
|              Aliens|   Action|   137|  8.4|    USA|1986|18500000|
|Flight of the Nav...|Adventure|    90|  6.9|    USA|1986| 9000000|
|             Platoon|    Drama|   120|  8.1|     UK|1986| 6000000|
|           Labyrinth|Adventure|   101|  7.4|     UK|1986|25000000|
|         Blue Velvet|    Drama|   120|  7.8|    USA|1986| 6000000|
|      Pretty in Pink|   Comedy|    96|  6.8|    USA|1986| 9000000|
|             The Fly|    Drama|    96|  7.5|    USA|1986|15000000|
+--------------------+---------+------+-----+-------+----+--------+
only showing top 10 rows



#### Filtering by movie length using SQL WHERE

In [48]:
spark.sql("""

    SELECT * FROM MovieTable
    WHERE Length > 100

""").show(10)

+--------------------+---------+------+-----+-------+----+--------+
|                Name|    Genre|Length|Score|Country|Year|  Budget|
+--------------------+---------+------+-----+-------+----+--------+
|ferris Bueller's ...|   Comedy|   103|  7.8|    USA|1986| 6000000|
|             Top Gun|   Action|   110|  6.9|    USA|1986|15000000|
|              Aliens|   Action|   137|  8.4|    USA|1986|18500000|
|             Platoon|    Drama|   120|  8.1|     UK|1986| 6000000|
|           Labyrinth|Adventure|   101|  7.4|     UK|1986|25000000|
|         Blue Velvet|    Drama|   120|  7.8|    USA|1986| 6000000|
|          Highlander|   Action|   116|  7.2|     UK|1986|16000000|
|           Manhunter|    Crime|   120|  7.2|    USA|1986|15000000|
|            9½ Weeks|    Drama|   117|  5.9|    USA|1986|17000000|
|     Howard the Duck|   Action|   110|  4.6|    USA|1986|35000000|
+--------------------+---------+------+-----+-------+----+--------+
only showing top 10 rows



### Grouping and counting Movie Genres

In [49]:
spark.sql("""

    SELECT Genre, COUNT(Genre) as Count
    FROM MovieTable
    GROUP BY Genre
    ORDER BY Count desc

""").show(10)

+---------+-----+
|    Genre|Count|
+---------+-----+
|   Comedy|   30|
|   Action|   22|
|    Drama|   15|
|Adventure|   11|
|   Horror|    7|
|    Crime|    6|
|Animation|    4|
|Biography|    2|
| Thriller|    1|
|   Sci-Fi|    1|
+---------+-----+



In [50]:
spark.sql("""

    SELECT Genre, COUNT(Genre) as Count
    FROM MovieTable
    GROUP BY Genre
    HAVING COUNT(Genre) > 10

""").show(10)

+---------+-----+
|    Genre|Count|
+---------+-----+
|Adventure|   11|
|    Drama|   15|
|   Comedy|   30|
|   Action|   22|
+---------+-----+
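
`HAVING` filters whole groups after aggregation, whereas `WHERE` filters individual rows before it. Starting from the per-genre counts printed by the earlier `GROUP BY` query, the effect of `HAVING COUNT(Genre) > 10` can be sketched in plain Python as a filter over the aggregated values.

```python
# Per-genre counts as printed by the GROUP BY query above
genre_counts = {
    "Comedy": 30, "Action": 22, "Drama": 15, "Adventure": 11, "Horror": 7,
    "Crime": 6, "Animation": 4, "Biography": 2, "Thriller": 1, "Sci-Fi": 1,
}

# HAVING COUNT(Genre) > 10: keep only groups whose aggregate passes the test
popular = {genre: n for genre, n in genre_counts.items() if n > 10}
print(popular)
```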



In [51]:
spark.sql("""

    SELECT * 
    FROM MovieTable
    WHERE Name LIKE 'A%'
    
""").show(10)

+-------------------+---------+------+-----+-------+----+--------+
|               Name|    Genre|Length|Score|Country|Year|  Budget|
+-------------------+---------+------+-----+-------+----+--------+
|             Aliens|   Action|   137|  8.4|    USA|1986|18500000|
|   An American Tail|Animation|    80|  6.9|    USA|1986|       0|
|     At Close Range|    Crime|   111|  7.0|    USA|1986| 6500000|
|About Last Night...|   Comedy|   113|  6.2|    USA|1986|       0|
|Armed and Dangerous|   Action|    88|  5.6|    USA|1986|12000000|
|   April Fool's Day|   Horror|    89|  6.2|    USA|1986| 5000000|
+-------------------+---------+------+-----+-------+----+--------+



In [52]:
spark.sql("""

    SELECT * 
    FROM MovieTable
    WHERE Name LIKE '%a'
    
""").show(10)

+--------------------+------+------+-----+-------+----+--------+
|                Name| Genre|Length|Score|Country|Year|  Budget|
+--------------------+------+------+-----+-------+----+--------+
|Big Trouble in Li...|Action|    99|  7.3|    USA|1986|25000000|
|               Cobra|Action|    87|  5.7|    USA|1986|25000000|
+--------------------+------+------+-----+-------+----+--------+
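
In a `LIKE` pattern, `%` matches any run of characters, so `'A%'` means "starts with A" and `'%a'` means "ends with a". The match is case-sensitive. A small plain-Python sketch of the same two predicates, using a few titles from the outputs above:

```python
titles = ["Aliens", "An American Tail", "Cobra", "Top Gun", "Platoon"]

# WHERE Name LIKE 'A%'  -> the name starts with an uppercase "A"
starts_with_a = [t for t in titles if t.startswith("A")]

# WHERE Name LIKE '%a'  -> the name ends with a lowercase "a"
ends_with_a = [t for t in titles if t.endswith("a")]

print(starts_with_a)
print(ends_with_a)
```

The DataFrame API offers the same checks without SQL through `Column.like`, `Column.startswith`, and `Column.endswith`.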



In [53]:
spark.sql("""

    SELECT * 
    FROM MovieTable
    WHERE Score BETWEEN 7 and 10
    
""").show(10)

+--------------------+---------+------+-----+-------+----+--------+
|                Name|    Genre|Length|Score|Country|Year|  Budget|
+--------------------+---------+------+-----+-------+----+--------+
|         stand by Me|Adventure|    89|  8.1|    USA|1986| 8000000|
|ferris Bueller's ...|   Comedy|   103|  7.8|    USA|1986| 6000000|
|              Aliens|   Action|   137|  8.4|    USA|1986|18500000|
|             Platoon|    Drama|   120|  8.1|     UK|1986| 6000000|
|           Labyrinth|Adventure|   101|  7.4|     UK|1986|25000000|
|         Blue Velvet|    Drama|   120|  7.8|    USA|1986| 6000000|
|             The Fly|    Drama|    96|  7.5|    USA|1986|15000000|
|          Highlander|   Action|   116|  7.2|     UK|1986|16000000|
|Big Trouble in Li...|   Action|    99|  7.3|    USA|1986|25000000|
|           Manhunter|    Crime|   120|  7.2|    USA|1986|15000000|
+--------------------+---------+------+-----+-------+----+--------+
only showing top 10 rows



# --> DataFrame String Functions

In [54]:
# Wildcard import for brevity; it shadows Python built-ins such as round, sum, min and max
from pyspark.sql.functions import *


### Concat()

In [55]:
df_movie.select("Country","Year") \
.withColumn("Year_Country", concat(col("Country"),lit(" - "),col("Year"))) \
.show(truncate=False)

+---------+----+----------------+
|Country  |Year|Year_Country    |
+---------+----+----------------+
|USA      |1986|USA - 1986      |
|USA      |1986|USA - 1986      |
|USA      |1986|USA - 1986      |
|USA      |1986|USA - 1986      |
|USA      |1986|USA - 1986      |
|UK       |1986|UK - 1986       |
|UK       |1986|UK - 1986       |
|USA      |1986|USA - 1986      |
|USA      |1986|USA - 1986      |
|USA      |1986|USA - 1986      |
|Australia|1986|Australia - 1986|
|UK       |1986|UK - 1986       |
|USA      |1986|USA - 1986      |
|USA      |1986|USA - 1986      |
|USA      |1986|USA - 1986      |
|USA      |1986|USA - 1986      |
|USA      |1986|USA - 1986      |
|USA      |1986|USA - 1986      |
|USA      |1986|USA - 1986      |
|USA      |1986|USA - 1986      |
+---------+----+----------------+
only showing top 20 rows



### Number Format 

In [56]:
df_movie.select("Budget")\
.withColumn("Budget_Format", format_number(col("Budget"), 2)).show(10)

+--------+-------------+
|  Budget|Budget_Format|
+--------+-------------+
| 8000000| 8,000,000.00|
| 6000000| 6,000,000.00|
|15000000|15,000,000.00|
|18500000|18,500,000.00|
| 9000000| 9,000,000.00|
| 6000000| 6,000,000.00|
|25000000|25,000,000.00|
| 6000000| 6,000,000.00|
| 9000000| 9,000,000.00|
|15000000|15,000,000.00|
+--------+-------------+
only showing top 10 rows
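
`format_number(col, 2)` adds thousands separators and fixes the value to two decimal places, returning a string column. Python's format mini-language produces the same text for a single value, which can be handy for checking expected results. This is an analogue only, not what Spark runs internally.

```python
budgets = [8000000, 18500000, 0]

# "," inserts thousands separators, ".2f" keeps two decimal places
formatted = [f"{b:,.2f}" for b in budgets]
print(formatted)
```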



### Lower(), initcap(), length()

In [57]:
df_movie.select("Name")\
.withColumn("Name_lower", lower(col("Name")))\
.withColumn("Name_initcap", initcap(col("Name")))\
.withColumn("Name_length", length(col("Name")))\
.show()

+--------------------+--------------------+--------------------+-----------+
|                Name|          Name_lower|        Name_initcap|Name_length|
+--------------------+--------------------+--------------------+-----------+
|         stand by Me|         stand by me|         Stand By Me|         11|
|ferris Bueller's ...|ferris bueller's ...|Ferris Bueller's ...|         24|
|             Top Gun|             top gun|             Top Gun|          7|
|              Aliens|              aliens|              Aliens|          6|
|Flight of the Nav...|flight of the nav...|Flight Of The Nav...|         23|
|             Platoon|             platoon|             Platoon|          7|
|           Labyrinth|           labyrinth|           Labyrinth|          9|
|         Blue Velvet|         blue velvet|         Blue Velvet|         11|
|      Pretty in Pink|      pretty in pink|      Pretty In Pink|         14|
|             The Fly|             the fly|             The Fly|          7|
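
`initcap` upper-cases the first letter of each whitespace-separated word and lower-cases the rest. Python's `str.title()` looks similar but also capitalises the letter after an apostrophe, so a closer plain-Python model splits on spaces first. `initcap_like` is a hypothetical helper written only for this comparison.

```python
def initcap_like(text: str) -> str:
    """Capitalise each space-separated word, lower-casing the remaining letters."""
    return " ".join(word.capitalize() for word in text.split(" "))

name = "ferris Bueller's Day Off"
print(initcap_like(name))  # Ferris Bueller's Day Off
print(name.title())        # Ferris Bueller'S Day Off
```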

### Trim --> (rtrim, ltrim, trim)

In [58]:
df_movie.select("Name")\
.withColumn("Name_rtrim", rtrim(col("Name")))\
.withColumn("Name_ltrim", ltrim(col("Name")))\
.withColumn("Name_trim", trim(col("Name")))\
.show()

+--------------------+--------------------+--------------------+--------------------+
|                Name|          Name_rtrim|          Name_ltrim|           Name_trim|
+--------------------+--------------------+--------------------+--------------------+
|         stand by Me|         stand by Me|         stand by Me|         stand by Me|
|ferris Bueller's ...|ferris Bueller's ...|ferris Bueller's ...|ferris Bueller's ...|
|             Top Gun|             Top Gun|             Top Gun|             Top Gun|
|              Aliens|              Aliens|              Aliens|              Aliens|
|Flight of the Nav...|Flight of the Nav...|Flight of the Nav...|Flight of the Nav...|
|             Platoon|             Platoon|             Platoon|             Platoon|
|           Labyrinth|           Labyrinth|           Labyrinth|           Labyrinth|
|         Blue Velvet|         Blue Velvet|         Blue Velvet|         Blue Velvet|
|      Pretty in Pink|      Pretty in Pink|      Prett

### regexp_replace() and split()

In [59]:
df_movie.select("Genre", "Name")\
.withColumn("Genre_GNR", regexp_replace(col("Genre"), "Action", "ACT"))\
.withColumn("Name_split", split(col("Name"), " "))\
.withColumn("Name_first_split", col("Name_split")[0])\
.show(truncate = False)

+---------+---------------------------+---------+---------------------------------+----------------+
|Genre    |Name                       |Genre_GNR|Name_split                       |Name_first_split|
+---------+---------------------------+---------+---------------------------------+----------------+
|Adventure|stand by Me                |Adventure|[stand, by, Me]                  |stand           |
|Comedy   |ferris Bueller's Day Off   |Comedy   |[ferris, Bueller's, Day, Off]    |ferris          |
|Action   |Top Gun                    |ACT      |[Top, Gun]                       |Top             |
|Action   |Aliens                     |ACT      |[Aliens]                         |Aliens          |
|Adventure|Flight of the Navigator    |Adventure|[Flight, of, the, Navigator]     |Flight          |
|Drama    |Platoon                    |Drama    |[Platoon]                        |Platoon         |
|Adventure|Labyrinth                  |Adventure|[Labyrinth]                      |Labyrint

### Upper()

In [60]:
df_movie.select("Name")\
.withColumn("Name_Upper", upper(col("Name"))).show()

+--------------------+--------------------+
|                Name|          Name_Upper|
+--------------------+--------------------+
|         stand by Me|         STAND BY ME|
|ferris Bueller's ...|FERRIS BUELLER'S ...|
|             Top Gun|             TOP GUN|
|              Aliens|              ALIENS|
|Flight of the Nav...|FLIGHT OF THE NAV...|
|             Platoon|             PLATOON|
|           Labyrinth|           LABYRINTH|
|         Blue Velvet|         BLUE VELVET|
|      Pretty in Pink|      PRETTY IN PINK|
|             The Fly|             THE FLY|
|    Crocodile Dundee|    CROCODILE DUNDEE|
|          Highlander|          HIGHLANDER|
|               Lucas|               LUCAS|
|Big Trouble in Li...|BIG TROUBLE IN LI...|
|           Manhunter|           MANHUNTER|
|            9½ Weeks|            9½ WEEKS|
|   Maximum Overdrive|   MAXIMUM OVERDRIVE|
|Little Shop of Ho...|LITTLE SHOP OF HO...|
|          The Wraith|          THE WRAITH|
|     Howard the Duck|     HOWAR

# --> Data Cleaning and Saving as a new File

In [61]:
df_dirty_data = spark.read\
.option("header","True")\
.option("inferSchema", "True")\
.csv("data/film_dirty_data.csv")

df_dirty_data.show(10)

+--------------------+---------+------+-----+-------+----+--------+
|                name|    genre|length|score|country|year|  budget|
+--------------------+---------+------+-----+-------+----+--------+
|         stand by Me|Adventure|    89|  8.1|    USA|1986| 8000000|
|ferris Bueller's ...|   Comedy|  null|  7.8|    USA|1986| 6000000|
|             Top Gun|     null|   110|  6.9|    USA|1986|15000000|
|              Aliens|   Action|   137|  8.4|   null|1986|18500000|
|Flight of the Nav...|Adventure|    90| null|    USA|1986| 9000000|
|             Platoon|    Drama|   120|  8.1|     UK|1986| 6000000|
|           Labyrinth|Adventure|   101|  7.4|     UK|1986|25000000|
|         Blue Velvet|    Drama|   120| null|    USA|1986| 6000000|
|      Pretty in Pink|   Comedy|  null|  6.8|    USA|1986| 9000000|
|             The Fly|    Drama|    96|  7.5|    USA|1986|15000000|
+--------------------+---------+------+-----+-------+----+--------+
only showing top 10 rows



In [62]:
from pyspark.sql import functions as F

In [63]:
# Average length and score, rounded to whole numbers, used to fill missing values
avg_length = df_dirty_data.agg(F.round(F.avg("length"))).collect()[0][0]
avg_score = df_dirty_data.agg(F.round(F.avg("score"))).collect()[0][0]

df_clean = df_dirty_data\
.withColumn("name", F.trim(F.initcap(F.col("name"))))\
.withColumn("genre", F.when(F.col("genre").isNull(), "Unknown").otherwise(F.trim(F.upper(F.col("genre")))))\
.withColumn("country", F.when(F.col("country").isNull(), "Unknown").otherwise(F.trim(F.upper(F.col("country")))))\
.withColumn("length", F.when(F.col("length").isNull(), avg_length).otherwise(F.col("length")))\
.withColumn("score", F.when(F.col("score").isNull(), avg_score).otherwise(F.col("score")))

df_clean.show(10)

+--------------------+---------+------+-----+-------+----+--------+
|                name|    genre|length|score|country|year|  budget|
+--------------------+---------+------+-----+-------+----+--------+
|         Stand By Me|ADVENTURE|  89.0|  8.1|    USA|1986| 8000000|
|Ferris Bueller's ...|   COMEDY| 103.0|  7.8|    USA|1986| 6000000|
|             Top Gun|  Unknown| 110.0|  6.9|    USA|1986|15000000|
|              Aliens|   ACTION| 137.0|  8.4|Unknown|1986|18500000|
|Flight Of The Nav...|ADVENTURE|  90.0|  6.0|    USA|1986| 9000000|
|             Platoon|    DRAMA| 120.0|  8.1|     UK|1986| 6000000|
|           Labyrinth|ADVENTURE| 101.0|  7.4|     UK|1986|25000000|
|         Blue Velvet|    DRAMA| 120.0|  6.0|    USA|1986| 6000000|
|      Pretty In Pink|   COMEDY| 103.0|  6.8|    USA|1986| 9000000|
|             The Fly|    DRAMA|  96.0|  7.5|    USA|1986|15000000|
+--------------------+---------+------+-----+-------+----+--------+
only showing top 10 rows
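
The null-filling rule applied to `length` and `score` above (replace a missing value with the rounded column average) can be illustrated on a few values in plain Python. This is a minimal sketch and not part of the notebook: the helper `fill_with_rounded_mean` and the sample values are hypothetical. Python's built-in `round` also resolves exact ties differently from Spark's `round`, so the two can disagree when an average ends in exactly .5.

```python
from statistics import mean

def fill_with_rounded_mean(values, ndigits=0):
    """Replace None entries with the mean of the known entries, rounded to ndigits."""
    known = [v for v in values if v is not None]
    fill_value = float(round(mean(known), ndigits))
    return [fill_value if v is None else float(v) for v in values]

# Hypothetical sample values, not taken from the CSV file
lengths = [89, None, 110, 137, None, 96]
scores = [8.1, None, 6.9, 7.8]

filled_lengths = fill_with_rounded_mean(lengths)
filled_scores_whole = fill_with_rounded_mean(scores)
filled_scores_one_decimal = fill_with_rounded_mean(scores, ndigits=1)

print(filled_lengths)             # [89.0, 108.0, 110.0, 137.0, 108.0, 96.0]
print(filled_scores_whole)        # [8.1, 8.0, 6.9, 7.8]
print(filled_scores_one_decimal)  # [8.1, 7.6, 6.9, 7.8]
```

Rounding to zero decimals is why the filled scores in the output above appear as whole numbers such as 6.0. Passing a scale of 1 to Spark's `round` would likely keep one decimal, matching the precision of the other scores.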



### Writing clean data to disk

In [64]:
df_clean.coalesce(1).write.mode("overwrite")\
.option("sep",",")\
.option("header", "True")\
.csv("data/film_clean_data")
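
Spark writes a directory rather than a single file, so `data/film_clean_data` ends up holding `part-*.csv` files plus marker files such as `_SUCCESS`. Here `coalesce(1)` reduces the data to a single partition, so typically only one part file is produced. A minimal sketch of locating the part files with the standard library follows. The helper name `find_part_files` is hypothetical, and the directory layout is simulated in a temporary folder with made-up file names so the snippet runs without Spark.

```python
import tempfile
from pathlib import Path

def find_part_files(output_dir):
    """Return the CSV part files inside a Spark output directory, sorted by name."""
    return sorted(Path(output_dir).glob("part-*.csv"))

# Simulate the layout of a Spark CSV output directory (file names are made up)
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "film_clean_data"
    out_dir.mkdir()
    (out_dir / "_SUCCESS").touch()
    (out_dir / "part-00000-example.csv").write_text("name,genre\nStand By Me,ADVENTURE\n")
    part_names = [p.name for p in find_part_files(out_dir)]

print(part_names)  # ['part-00000-example.csv']
```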

### Reading clean data from disk

In [65]:
df_clean_read = spark.read\
.option("inferSchema", "True")\
.option("header", "True")\
.option("sep", ",")\
.csv("data/film_clean_data")

df_clean_read.show(10)

+--------------------+---------+------+-----+-------+----+--------+
|                name|    genre|length|score|country|year|  budget|
+--------------------+---------+------+-----+-------+----+--------+
|         Stand By Me|ADVENTURE|  89.0|  8.1|    USA|1986| 8000000|
|Ferris Bueller's ...|   COMEDY| 103.0|  7.8|    USA|1986| 6000000|
|             Top Gun|  Unknown| 110.0|  6.9|    USA|1986|15000000|
|              Aliens|   ACTION| 137.0|  8.4|Unknown|1986|18500000|
|Flight Of The Nav...|ADVENTURE|  90.0|  6.0|    USA|1986| 9000000|
|             Platoon|    DRAMA| 120.0|  8.1|     UK|1986| 6000000|
|           Labyrinth|ADVENTURE| 101.0|  7.4|     UK|1986|25000000|
|         Blue Velvet|    DRAMA| 120.0|  6.0|    USA|1986| 6000000|
|      Pretty In Pink|   COMEDY| 103.0|  6.8|    USA|1986| 9000000|
|             The Fly|    DRAMA|  96.0|  7.5|    USA|1986|15000000|
+--------------------+---------+------+-----+-------+----+--------+
only showing top 10 rows



## Creating a Schema Manually

In [66]:
df_manual_schema = spark.read\
.option("header","True")\
.option("inferSchema", "True")\
.csv("data/film_data.csv")

df_manual_schema.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Genre: string (nullable = true)
 |-- Length: integer (nullable = true)
 |-- Score: double (nullable = true)
 |-- Country: string (nullable = true)
 |-- Year: string (nullable = true)
 |-- Budget: integer (nullable = true)



In [67]:
df_manual_schema.show(5)

+--------------------+---------+------+-----+-------+----+--------+
|                Name|    Genre|Length|Score|Country|Year|  Budget|
+--------------------+---------+------+-----+-------+----+--------+
|         stand by Me|Adventure|    89|  8.1|    USA|1986| 8000000|
|ferris Bueller's ...|   Comedy|   103|  7.8|    USA|1986| 6000000|
|             Top Gun|   Action|   110|  6.9|    USA|1986|15000000|
|              Aliens|   Action|   137|  8.4|    USA|1986|18500000|
|Flight of the Nav...|Adventure|    90|  6.9|    USA|1986| 9000000|
+--------------------+---------+------+-----+-------+----+--------+
only showing top 5 rows



In [68]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType

In [69]:
manual_schema = StructType(
[
    StructField("Name", StringType(), True),
    StructField("Genre", StringType(), True),
    StructField("Length", IntegerType(), True),
    StructField("Score", FloatType(), True),
    StructField("Country", StringType(), True),
    StructField("Year", IntegerType(), True),
    StructField("Budget", FloatType(), True)
]
)

In [70]:
df_manual_schema2 = spark.read\
.option("header", "True")\
.option("sep",",")\
.schema(manual_schema)\
.csv("data/film_data.csv")

df_manual_schema2.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Genre: string (nullable = true)
 |-- Length: integer (nullable = true)
 |-- Score: float (nullable = true)
 |-- Country: string (nullable = true)
 |-- Year: integer (nullable = true)
 |-- Budget: float (nullable = true)



In [71]:
df_manual_schema2.show(5)

+--------------------+---------+------+-----+-------+----+---------+
|                Name|    Genre|Length|Score|Country|Year|   Budget|
+--------------------+---------+------+-----+-------+----+---------+
|         stand by Me|Adventure|    89|  8.1|    USA|1986|8000000.0|
|ferris Bueller's ...|   Comedy|   103|  7.8|    USA|1986|6000000.0|
|             Top Gun|   Action|   110|  6.9|    USA|1986|    1.5E7|
|              Aliens|   Action|   137|  8.4|    USA|1986|   1.85E7|
|Flight of the Nav...|Adventure|    90|  6.9|    USA|1986|9000000.0|
+--------------------+---------+------+-----+-------+----+---------+
only showing top 5 rows
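
`FloatType` is a 32-bit single-precision type, whereas the inferred schema used `double` for Score and `integer` for Budget. As a rough illustration of what single precision can do to such values, the sketch below round-trips Python floats through a 32-bit representation with the standard `struct` module. The helper `to_float32` is hypothetical and only approximates what Spark stores. `DoubleType` or an integer type may be a safer choice when exact values matter.

```python
import struct

def to_float32(value):
    """Round-trip a Python float through IEEE 754 single precision."""
    return struct.unpack("f", struct.pack("f", value))[0]

score = to_float32(8.1)
budget = to_float32(18500000.0)
odd_budget = to_float32(16777217.0)  # 2**24 + 1, a made-up value

print(score == 8.1)          # False: 8.1 is only approximated in 32 bits
print(budget == 18500000.0)  # True: this value is exactly representable
print(odd_budget)            # 16777216.0: the trailing 1 is lost
```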



#### Formatting Budget column values 

In [72]:
# Render Budget as a string with thousands separators and two decimals
df_manual_schema2 = df_manual_schema2\
.withColumn("Budget", F.format_number(F.col("Budget"), 2))

#### Replacing commas (,) with points (.)

In [73]:
df_manual_schema2 = df_manual_schema2.withColumn("Budget", F.regexp_replace(F.col("Budget"), ",", "."))

In [74]:
df_manual_schema2.show(3)

+--------------------+---------+------+-----+-------+----+-------------+
|                Name|    Genre|Length|Score|Country|Year|       Budget|
+--------------------+---------+------+-----+-------+----+-------------+
|         stand by Me|Adventure|    89|  8.1|    USA|1986| 8.000.000.00|
|ferris Bueller's ...|   Comedy|   103|  7.8|    USA|1986| 6.000.000.00|
|             Top Gun|   Action|   110|  6.9|    USA|1986|15.000.000.00|
+--------------------+---------+------+-----+-------+----+-------------+
only showing top 3 rows
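
The two steps above first render the number with thousands separators and two decimals, then swap every comma for a point. `Budget` therefore becomes a string column in which the grouping and decimal separators are identical. The same transformation can be sketched in plain Python to make the effect visible. The helper names are hypothetical. `swap_separators` shows one possible way to obtain the `8.000.000,00` style instead, which is an assumption about the intended result rather than what the notebook does.

```python
def format_budget(value):
    """Mimic format_number(value, 2) followed by replacing ',' with '.'."""
    grouped = f"{value:,.2f}"  # e.g. '8,000,000.00'
    return grouped.replace(",", ".")

def swap_separators(value):
    """Alternative: exchange ',' and '.' so the decimal mark stays distinct."""
    return f"{value:,.2f}".translate(str.maketrans(",.", ".,"))

print(format_budget(8000000.0))    # 8.000.000.00
print(format_budget(15000000.0))   # 15.000.000.00
print(swap_separators(8000000.0))  # 8.000.000,00
```

In Spark, `F.translate` could play the role of `swap_separators` if the second style is the one wanted.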



### Saving Spark DataFrame into a partitioned folder

In [75]:
df_manual_schema2\
.coalesce(1)\
.write\
.mode("overwrite")\
.option("sep",",")\
.option("header", "True")\
.csv("data/film_clean_data2")

### Converting Spark DataFrame to Pandas DataFrame and saving as a single CSV

In [76]:
df_manual_schema2.toPandas().to_csv("data/film_clean_data.csv")

# --> PySpark DataFrame API Date-Time Operations

In [77]:
df_datetime = spark.read\
.option("header", "True")\
.option("sep",",")\
.option("inferSchema", "True")\
.csv("data/film_data.csv")

In [78]:
df_datetime = df_datetime.withColumn("Year", F.regexp_replace(F.col("Year"), "1986", "01.01.1986 00:01"))
df_datetime.show(3)

+--------------------+---------+------+-----+-------+----------------+--------+
|                Name|    Genre|Length|Score|Country|            Year|  Budget|
+--------------------+---------+------+-----+-------+----------------+--------+
|         stand by Me|Adventure|    89|  8.1|    USA|01.01.1986 00:01| 8000000|
|ferris Bueller's ...|   Comedy|   103|  7.8|    USA|01.01.1986 00:01| 6000000|
|             Top Gun|   Action|   110|  6.9|    USA|01.01.1986 00:01|15000000|
+--------------------+---------+------+-----+-------+----------------+--------+
only showing top 3 rows



In [79]:
from pyspark.sql import functions as F

In [80]:
current_format = "dd.MM.yyyy HH:mm"

In [81]:
df_datetime2 = df_datetime.select("Year")\
.withColumn("Normal_Format", F.to_date(F.col("Year"), current_format))\
.withColumn("Standart_Format", F.to_timestamp(F.col("Year"), current_format))

df_datetime2.show(10)

+----------------+-------------+-------------------+
|            Year|Normal_Format|    Standart_Format|
+----------------+-------------+-------------------+
|01.01.1986 00:01|   1986-01-01|1986-01-01 00:01:00|
|01.01.1986 00:01|   1986-01-01|1986-01-01 00:01:00|
|01.01.1986 00:01|   1986-01-01|1986-01-01 00:01:00|
|01.01.1986 00:01|   1986-01-01|1986-01-01 00:01:00|
|01.01.1986 00:01|   1986-01-01|1986-01-01 00:01:00|
|01.01.1986 00:01|   1986-01-01|1986-01-01 00:01:00|
|01.01.1986 00:01|   1986-01-01|1986-01-01 00:01:00|
|01.01.1986 00:01|   1986-01-01|1986-01-01 00:01:00|
|01.01.1986 00:01|   1986-01-01|1986-01-01 00:01:00|
|01.01.1986 00:01|   1986-01-01|1986-01-01 00:01:00|
+----------------+-------------+-------------------+
only showing top 10 rows



In [82]:
df_datetime2.printSchema()

root
 |-- Year: string (nullable = true)
 |-- Normal_Format: date (nullable = true)
 |-- Standart_Format: timestamp (nullable = true)
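
Spark's datetime pattern `dd.MM.yyyy HH:mm` reads day, month, four-digit year, hour (0-23) and minute. For readers more familiar with Python's `datetime` module, the equivalent `strptime` directive string is sketched below as a cross-check. The variable names are illustrative, and the mapping covers only the pattern letters used here.

```python
from datetime import datetime

raw = "01.01.1986 00:01"
python_format = "%d.%m.%Y %H:%M"  # counterpart of Spark's "dd.MM.yyyy HH:mm"

parsed = datetime.strptime(raw, python_format)
print(parsed.date())  # 1986-01-01, comparable to to_date
print(parsed)         # 1986-01-01 00:01:00, comparable to to_timestamp
```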



### Date Format Conversion

In [83]:
format_tr = "dd/MM/yyyy HH:mm:ss"
format_eng = "MM-dd-yyyy HH.mm.ss"


df_datetime3 = df_datetime2\
.withColumn("TR Format", F.date_format(F.col("Standart_Format"), format_tr))\
.withColumn("ENG Format",F.date_format(F.col("Standart_Format"), format_eng))

df_datetime3.select("TR Format", "ENG Format").show(10)

+-------------------+-------------------+
|          TR Format|         ENG Format|
+-------------------+-------------------+
|01/01/1986 00:01:00|01-01-1986 00.01.00|
|01/01/1986 00:01:00|01-01-1986 00.01.00|
|01/01/1986 00:01:00|01-01-1986 00.01.00|
|01/01/1986 00:01:00|01-01-1986 00.01.00|
|01/01/1986 00:01:00|01-01-1986 00.01.00|
|01/01/1986 00:01:00|01-01-1986 00.01.00|
|01/01/1986 00:01:00|01-01-1986 00.01.00|
|01/01/1986 00:01:00|01-01-1986 00.01.00|
|01/01/1986 00:01:00|01-01-1986 00.01.00|
|01/01/1986 00:01:00|01-01-1986 00.01.00|
+-------------------+-------------------+
only showing top 10 rows
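
Because every row holds 1 January, the output above cannot show that the two patterns order day and month differently. The sketch below formats a made-up timestamp with Python's `strftime`, using directive strings assumed to correspond to `format_tr` and `format_eng`, so the swap becomes visible. It is an illustration only and does not call Spark.

```python
from datetime import datetime

# Hypothetical timestamp with distinct day and month values
ts = datetime(1986, 3, 25, 14, 5, 9)

tr_style = ts.strftime("%d/%m/%Y %H:%M:%S")   # Spark: dd/MM/yyyy HH:mm:ss
eng_style = ts.strftime("%m-%d-%Y %H.%M.%S")  # Spark: MM-dd-yyyy HH.mm.ss

print(tr_style)   # 25/03/1986 14:05:09
print(eng_style)  # 03-25-1986 14.05.09
```

In Spark patterns the letter case matters: `MM` is the month while `mm` is the minute, and `HH` is the 24-hour clock while `hh` is the 12-hour clock.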



### date_add(), year(), datediff()

In [84]:
df_datetime4 = df_datetime2\
.withColumn("OneYear", F.date_add(F.col("Standart_Format"), 365))\
.withColumn("OnlyYear",F.year(F.col("Standart_Format")))\
.withColumn("DateDifference", F.datediff(F.col("OneYear"), df_datetime2.Standart_Format))

df_datetime4 = df_datetime4.drop("Normal_Format")
df_datetime4.show(10)

+----------------+-------------------+----------+--------+--------------+
|            Year|    Standart_Format|   OneYear|OnlyYear|DateDifference|
+----------------+-------------------+----------+--------+--------------+
|01.01.1986 00:01|1986-01-01 00:01:00|1987-01-01|    1986|           365|
|01.01.1986 00:01|1986-01-01 00:01:00|1987-01-01|    1986|           365|
|01.01.1986 00:01|1986-01-01 00:01:00|1987-01-01|    1986|           365|
|01.01.1986 00:01|1986-01-01 00:01:00|1987-01-01|    1986|           365|
|01.01.1986 00:01|1986-01-01 00:01:00|1987-01-01|    1986|           365|
|01.01.1986 00:01|1986-01-01 00:01:00|1987-01-01|    1986|           365|
|01.01.1986 00:01|1986-01-01 00:01:00|1987-01-01|    1986|           365|
|01.01.1986 00:01|1986-01-01 00:01:00|1987-01-01|    1986|           365|
|01.01.1986 00:01|1986-01-01 00:01:00|1987-01-01|    1986|           365|
|01.01.1986 00:01|1986-01-01 00:01:00|1987-01-01|    1986|           365|
+----------------+-------------------+