## Manipulating data in dataframes to get what we want

### Import PySpark and create a SparkSession

This is the first step when working with PySpark. The spark session object also gives access to the Spark web UI for monitoring jobs.


In [1]:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("ManipulateData").getOrCreate()

### Creating a DataFrame

In [2]:
names = spark.createDataFrame([("Abraham","Lincoln")],['first_name','last_name'])

In [3]:
names.show()

+----------+---------+
|first_name|last_name|
+----------+---------+
|   Abraham|  Lincoln|
+----------+---------+



The RDD underlying each DataFrame has an id that is unique within the current SparkContext, so it can be used to tell DataFrames apart.

In [4]:
names.rdd.id()

10

### Importing functionality from sql.functions to operate on columns

Here we concatenate the two columns first_name and last_name, separated by a space, into a new column called full_name.

In [5]:
from pyspark.sql.functions import *
names = names.select(names.first_name,names.last_name,concat_ws(' ', names.first_name, names.last_name).alias('full_name'))

#### DataFrames are immutable: each transformation returns a new DataFrame, so the id changes

In [7]:
names.rdd.id()

14

In [8]:
names.show()

+----------+---------+---------------+
|first_name|last_name|      full_name|
+----------+---------+---------------+
|   Abraham|  Lincoln|Abraham Lincoln|
+----------+---------+---------------+
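
As a rough plain-Python analogy (not Spark code), concat_ws behaves like joining the non-null values with the separator. The helper name `concat_ws_like` is hypothetical and exists only for this sketch.

```python
def concat_ws_like(sep, *values):
    """Join values with sep, skipping None, loosely mirroring Spark's concat_ws."""
    return sep.join(v for v in values if v is not None)


full_name = concat_ws_like(" ", "Abraham", "Lincoln")
print(full_name)

# A missing value is skipped rather than turning the whole result into None
print(concat_ws_like(" ", "Abraham", None))
```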



In [9]:
path = 'Datasets/'

#### Reading a CSV into DataFrame

In [10]:
videos = spark.read.csv(path+"youtubevideos.csv",header=True,inferSchema=True)

In [11]:
videos.limit(4).toPandas()

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
0,2kyS6SvSYSE,17.14.11,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13T17:13:01.000Z,SHANtell martin,748374,57527,2966,15954,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...
1,1ZAPwfrtAFY,17.14.11,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,2017-11-13T07:30:00.000Z,"""last week tonight trump presidency""|""last wee...",2418783,97185,6146,12703,https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg,False,False,False,"One year after the presidential election, John..."
2,5qpjK5DgCt4,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12T19:05:24.000Z,"""racist superman""|""rudy""|""mancuso""|""king""|""bac...",3191434,146033,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...
3,puqaWrEC7tY,17.14.11,Nickelback Lyrics: Real or Fake?,Good Mythical Morning,24,2017-11-13T11:00:04.000Z,"""rhett and link""|""gmm""|""good mythical morning""...",343168,10172,666,2146,https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg,False,False,False,Today we find out if Link is a Nickelback amat...


In [12]:
videos.printSchema()

root
 |-- video_id: string (nullable = true)
 |-- trending_date: string (nullable = true)
 |-- title: string (nullable = true)
 |-- channel_title: string (nullable = true)
 |-- category_id: string (nullable = true)
 |-- publish_time: string (nullable = true)
 |-- tags: string (nullable = true)
 |-- views: string (nullable = true)
 |-- likes: string (nullable = true)
 |-- dislikes: string (nullable = true)
 |-- comment_count: string (nullable = true)
 |-- thumbnail_link: string (nullable = true)
 |-- comments_disabled: string (nullable = true)
 |-- ratings_disabled: string (nullable = true)
 |-- video_error_or_removed: string (nullable = true)
 |-- description: string (nullable = true)



### Importing datatypes from sql.types as we will change the datatype of some columns

In [13]:
from pyspark.sql.types import *

In [19]:
# Cast the numeric columns to integers and parse trending_date into a date.
df = (
    videos.withColumn("views", videos["views"].cast(IntegerType()))
    .withColumn("likes", videos["likes"].cast(IntegerType()))
    .withColumn("dislikes", videos["dislikes"].cast(IntegerType()))
    # Caution: in Spark date patterns 'mm' is minute-of-hour ('MM' is month), and raw values
    # such as 17.14.11 are ordered year.day.month. This pattern therefore produces the
    # 2011-01-17 seen in the output; 'yy.dd.MM' would parse 17.14.11 as 2017-11-14.
    .withColumn("trending_date", to_date(videos.trending_date, "dd.mm.yy"))
)
# publish_time is not converted here: its values contain the literal characters T and Z
# (for example 2017-11-13T17:13:01.000Z), which are removed in the following cells
# before calling to_timestamp.
# printSchema() prints the schema itself and returns None, so print() adds a trailing "None".
print(df.printSchema())
df.limit(4).toPandas()

root
 |-- video_id: string (nullable = true)
 |-- trending_date: date (nullable = true)
 |-- title: string (nullable = true)
 |-- channel_title: string (nullable = true)
 |-- category_id: string (nullable = true)
 |-- publish_time: string (nullable = true)
 |-- tags: string (nullable = true)
 |-- views: integer (nullable = true)
 |-- likes: integer (nullable = true)
 |-- dislikes: integer (nullable = true)
 |-- comment_count: string (nullable = true)
 |-- thumbnail_link: string (nullable = true)
 |-- comments_disabled: string (nullable = true)
 |-- ratings_disabled: string (nullable = true)
 |-- video_error_or_removed: string (nullable = true)
 |-- description: string (nullable = true)

None


Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
0,2kyS6SvSYSE,2011-01-17,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13T17:13:01.000Z,SHANtell martin,748374,57527,2966,15954,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...
1,1ZAPwfrtAFY,2011-01-17,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,2017-11-13T07:30:00.000Z,"""last week tonight trump presidency""|""last wee...",2418783,97185,6146,12703,https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg,False,False,False,"One year after the presidential election, John..."
2,5qpjK5DgCt4,2011-01-17,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12T19:05:24.000Z,"""racist superman""|""rudy""|""mancuso""|""king""|""bac...",3191434,146033,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...
3,puqaWrEC7tY,2011-01-17,Nickelback Lyrics: Real or Fake?,Good Mythical Morning,24,2017-11-13T11:00:04.000Z,"""rhett and link""|""gmm""|""good mythical morning""...",343168,10172,666,2146,https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg,False,False,False,Today we find out if Link is a Nickelback amat...
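
The raw trending_date strings, such as 17.14.11, appear to be ordered year.day.month. As a plain-Python cross-check (using `datetime.strptime` directives, not Spark's pattern letters), parsing the value in that order gives 14 November 2017:

```python
from datetime import datetime

raw = "17.14.11"  # raw trending_date value shown in the CSV preview

# Python directives: %y = two-digit year, %d = day of month, %m = month
parsed = datetime.strptime(raw, "%y.%d.%m").date()
print(parsed)  # 2017-11-14
```

The corresponding Spark pattern would be yy.dd.MM, since uppercase MM is the month and lowercase mm is minutes.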


In [20]:
from pyspark.sql.functions import *

Replacing T with space character using regexp_replace from sql.functions

In [23]:
df = df.withColumn("publish_time_2",regexp_replace(df.publish_time,"T"," "))

Replacing Z with an empty string using regexp_replace from sql.functions

In [24]:
df = df.withColumn("publish_time_2", regexp_replace(df.publish_time_2,"Z",""))

In [25]:
df.select('publish_time_2').show(5,False)

+-----------------------+
|publish_time_2         |
+-----------------------+
|2017-11-13 17:13:01.000|
|2017-11-13 07:30:00.000|
|2017-11-12 19:05:24.000|
|2017-11-13 11:00:04.000|
|2017-11-12 18:01:41.000|
+-----------------------+
only showing top 5 rows



Now we can convert publish_time_2 to a timestamp with to_timestamp and store the result in publish_time_3

In [26]:
df = df.withColumn("publish_time_3",to_timestamp(df.publish_time_2,'yyyy-MM-dd HH:mm:ss.SSS'))

In [29]:
df.printSchema()

root
 |-- video_id: string (nullable = true)
 |-- trending_date: date (nullable = true)
 |-- title: string (nullable = true)
 |-- channel_title: string (nullable = true)
 |-- category_id: string (nullable = true)
 |-- publish_time: string (nullable = true)
 |-- tags: string (nullable = true)
 |-- views: integer (nullable = true)
 |-- likes: integer (nullable = true)
 |-- dislikes: integer (nullable = true)
 |-- comment_count: string (nullable = true)
 |-- thumbnail_link: string (nullable = true)
 |-- comments_disabled: string (nullable = true)
 |-- ratings_disabled: string (nullable = true)
 |-- video_error_or_removed: string (nullable = true)
 |-- description: string (nullable = true)
 |-- publish_time_2: string (nullable = true)
 |-- publish_time_3: timestamp (nullable = true)
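
The same clean-then-parse sequence can be sketched in plain Python for a single value. This is only an analogy built on `re.sub` and `datetime.strptime`; it is not how Spark implements regexp_replace or to_timestamp.

```python
import re
from datetime import datetime

raw = "2017-11-13T17:13:01.000Z"  # sample publish_time value from the data

# Same two substitutions as the regexp_replace cells: T -> space, Z -> removed
cleaned = re.sub("Z", "", re.sub("T", " ", raw))

# %f accepts the fractional seconds (".000")
ts = datetime.strptime(cleaned, "%Y-%m-%d %H:%M:%S.%f")
print(cleaned)
print(ts)
```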



### Renaming a column

In [29]:
renamed = df.withColumnRenamed("publish_time_3", "new")  # arguments: existing name, new name

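### Achieving the same result with translate

translate replaces characters one for one: each character in the second argument is replaced by the character at the same position in the third argument, and characters with no counterpart are removed. Here T maps to a space and Z has no counterpart, so it is dropped. That is why the result contains only one space.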

In [28]:
df.select("publish_time",translate(col("publish_time"),"TZ"," ").alias("trans")).show(5,False)

+------------------------+-----------------------+
|publish_time            |trans                  |
+------------------------+-----------------------+
|2017-11-13T17:13:01.000Z|2017-11-13 17:13:01.000|
|2017-11-13T07:30:00.000Z|2017-11-13 07:30:00.000|
|2017-11-12T19:05:24.000Z|2017-11-12 19:05:24.000|
|2017-11-13T11:00:04.000Z|2017-11-13 11:00:04.000|
|2017-11-12T18:01:41.000Z|2017-11-12 18:01:41.000|
+------------------------+-----------------------+
only showing top 5 rows
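
The positional mapping can be mimicked in plain Python with `str.translate`. The helper `translate_like` is a hypothetical name for this sketch and only approximates the Spark function for simple inputs like the one above.

```python
def translate_like(text, matching, replace):
    """Map matching[i] to replace[i]; delete characters of `matching` with no counterpart."""
    table = {
        ord(ch): (replace[i] if i < len(replace) else None)  # None means "delete"
        for i, ch in enumerate(matching)
    }
    return text.translate(table)


result = translate_like("2017-11-13T17:13:01.000Z", "TZ", " ")
print(result)
```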



#### Removing leading and trailing whitespace

In [30]:
df = df.withColumn("title", trim(df.title))  # ltrim / rtrim strip only the left / right side

In [31]:
df.select("title").show()

+--------------------+
|               title|
+--------------------+
|WE WANT TO TALK A...|
|The Trump Preside...|
|Racist Superman |...|
|Nickelback Lyrics...|
|I Dare You: GOING...|
|2 Weeks with iPho...|
|Roy Moore & Jeff ...|
|5 Ice Cream Gadge...|
|The Greatest Show...|
|Why the rise of t...|
|Dion Lewis' 103-Y...|
|(SPOILERS) 'Shiva...|
|Marshmello - Bloc...|
|Which Countries A...|
|SHOPPING FOR NEW ...|
|    The New SpotMini|
|One Change That W...|
|How does your bod...|
|HomeMade Electric...|
|Founding An Inbre...|
+--------------------+
only showing top 20 rows



#### Making characters lowercase

In [34]:
df.select("title",lower(df.title)).show(5,False)

+--------------------------------------------------------------+--------------------------------------------------------------+
|title                                                         |lower(title)                                                  |
+--------------------------------------------------------------+--------------------------------------------------------------+
|WE WANT TO TALK ABOUT OUR MARRIAGE                            |we want to talk about our marriage                            |
|The Trump Presidency: Last Week Tonight with John Oliver (HBO)|the trump presidency: last week tonight with john oliver (hbo)|
|Racist Superman | Rudy Mancuso, King Bach & Lele Pons         |racist superman | rudy mancuso, king bach & lele pons         |
|Nickelback Lyrics: Real or Fake?                              |nickelback lyrics: real or fake?                              |
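|I Dare You: GOING BALD!?                                      |i dare you: going bald!?                                      |
+--------------------------------------------------------------+--------------------------------------------------------------+
only showing top 5 rows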

### Performing if-else-like operations on columns

The when function from sql.functions evaluates its conditions in order: if likes > dislikes the result is good, if likes < dislikes it is bad, and otherwise it is undetermined.

In [32]:
df.select("likes","dislikes",(when(df.likes>df.dislikes,'good').when(df.likes<df.dislikes,'bad').otherwise("undetermined")).alias("favorability")).limit(5).toPandas()

Unnamed: 0,likes,dislikes,favorability
0,57527,2966,good
1,97185,6146,good
2,146033,5339,good
3,10172,666,good
4,132235,1989,good


Do the same with an SQL-like statement by using expr

In [33]:
df.select('likes','dislikes',expr("CASE WHEN likes>dislikes THEN 'good' WHEN dislikes>likes THEN 'bad' ELSE 'undetermined' END AS favorability")).show(3)

+------+--------+------------+
| likes|dislikes|favorability|
+------+--------+------------+
| 57527|    2966|        good|
| 97185|    6146|        good|
|146033|    5339|        good|
+------+--------+------------+
only showing top 3 rows



Making it more SQL-like with selectExpr

In [34]:
df.selectExpr("likes","dislikes","CASE WHEN likes>dislikes THEN 'good' WHEN dislikes>likes THEN 'bad' ELSE 'undetermined' END AS favorability").show(3)

+------+--------+------------+
| likes|dislikes|favorability|
+------+--------+------------+
| 57527|    2966|        good|
| 97185|    6146|        good|
|146033|    5339|        good|
+------+--------+------------+
only showing top 3 rows
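
The branching logic shared by the three variants above can be written as an ordinary Python function. This is a sketch for reasoning about the cases, not Spark code; the name `favorability` is reused from the column alias, and the explicit None check reflects that a SQL comparison involving a null is not true, so such rows fall through to the final branch.

```python
def favorability(likes, dislikes):
    """Plain-Python analogue of the when/otherwise chain."""
    # A comparison with a missing value is never true, so neither branch matches
    if likes is None or dislikes is None:
        return "undetermined"
    if likes > dislikes:
        return "good"
    if likes < dislikes:
        return "bad"
    return "undetermined"  # equal counts


labels = [
    favorability(57527, 2966),
    favorability(10, 20),
    favorability(5, 5),
    favorability(None, 3),
]
print(labels)
```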



Some more concatenation

In [36]:
df.select(concat_ws(' ',df.title,df.channel_title).alias("text")).show(5,False)

+------------------------------------------------------------------------------+
|text                                                                          |
+------------------------------------------------------------------------------+
|WE WANT TO TALK ABOUT OUR MARRIAGE CaseyNeistat                               |
|The Trump Presidency: Last Week Tonight with John Oliver (HBO) LastWeekTonight|
|Racist Superman | Rudy Mancuso, King Bach & Lele Pons Rudy Mancuso            |
|Nickelback Lyrics: Real or Fake? Good Mythical Morning                        |
|I Dare You: GOING BALD!? nigahiga                                             |
+------------------------------------------------------------------------------+
only showing top 5 rows



#### Let's extract year and month from the trending_date column

In [37]:
df.select("trending_date",year("trending_date"),month("trending_date")).show(5)

+-------------+-------------------+--------------------+
|trending_date|year(trending_date)|month(trending_date)|
+-------------+-------------------+--------------------+
|   2011-01-17|               2011|                   1|
|   2011-01-17|               2011|                   1|
|   2011-01-17|               2011|                   1|
|   2011-01-17|               2011|                   1|
|   2011-01-17|               2011|                   1|
+-------------+-------------------+--------------------+
only showing top 5 rows



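#### Get the difference between two dates

You cannot simply subtract one column from the other to get a day count, so we use datediff(end, start), which returns the number of days from start to end. Dividing the result by 365 gives an approximate number of years.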

In [38]:
df.select("trending_date","publish_time_3",datediff(df.publish_time_3,df.trending_date).alias('difference')).show(5)

+-------------+-------------------+----------+
|trending_date|     publish_time_3|difference|
+-------------+-------------------+----------+
|   2011-01-17|2017-11-13 17:13:01|      2492|
|   2011-01-17|2017-11-13 07:30:00|      2492|
|   2011-01-17|2017-11-12 19:05:24|      2491|
|   2011-01-17|2017-11-13 11:00:04|      2492|
|   2011-01-17|2017-11-12 18:01:41|      2491|
+-------------+-------------------+----------+
only showing top 5 rows
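
The first row can be checked by hand with Python's `datetime.date`. The helper `datediff_like` is a hypothetical stand-in that keeps the same argument order (end first, then start); it is not the Spark function.

```python
from datetime import date


def datediff_like(end, start):
    """Number of days from start to end, using the same argument order as datediff."""
    return (end - start).days


days = datediff_like(date(2017, 11, 13), date(2011, 1, 17))
print(days)        # 2492, matching the first row above
print(days / 365)  # rough number of years

# Swapping the arguments flips the sign
print(datediff_like(date(2011, 1, 17), date(2017, 11, 13)))
```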



### Split a column and convert into an array

In [39]:
array = df.select("title",split(df.title," ").alias("new"))

In [40]:
array.show(5,False)

+--------------------------------------------------------------+-------------------------------------------------------------------------+
|title                                                         |new                                                                      |
+--------------------------------------------------------------+-------------------------------------------------------------------------+
|WE WANT TO TALK ABOUT OUR MARRIAGE                            |[WE, WANT, TO, TALK, ABOUT, OUR, MARRIAGE]                               |
|The Trump Presidency: Last Week Tonight with John Oliver (HBO)|[The, Trump, Presidency:, Last, Week, Tonight, with, John, Oliver, (HBO)]|
|Racist Superman | Rudy Mancuso, King Bach & Lele Pons         |[Racist, Superman, |, Rudy, Mancuso,, King, Bach, &, Lele, Pons]         |
|Nickelback Lyrics: Real or Fake?                              |[Nickelback, Lyrics:, Real, or, Fake?]                                   |
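|I Dare You: GOING BALD!?                                      |[I, Dare, You:, GOING, BALD!?]                                           |
+--------------------------------------------------------------+-------------------------------------------------------------------------+
only showing top 5 rows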

#### Check whether a title contains the word MARRIAGE

There are multiple ways to do this. Here array_contains returns true when the array holds an element exactly equal to MARRIAGE.

In [41]:
array.select("title",array_contains(array.new,"MARRIAGE")).show(1,False)

+----------------------------------+-----------------------------+
|title                             |array_contains(new, MARRIAGE)|
+----------------------------------+-----------------------------+
|WE WANT TO TALK ABOUT OUR MARRIAGE|true                         |
+----------------------------------+-----------------------------+
only showing top 1 row
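
Because the array was produced by splitting on spaces, the check is an exact match against whole tokens, and punctuation stays attached to them. A plain-Python illustration with one of the titles above (list membership standing in for array_contains):

```python
title = "The Trump Presidency: Last Week Tonight with John Oliver (HBO)"
tokens = title.split(" ")

# Element-wise equality: the token is "Presidency:", so the bare word is not found
has_word = "Presidency" in tokens
has_word_with_colon = "Presidency:" in tokens

# A substring test on the raw string ignores token boundaries
substring_match = "Presidency" in title

print(has_word, has_word_with_colon, substring_match)
```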



#### Get the distinct entries of an array column

In [43]:
array.select("title",array_distinct(array.new)).show(5,False)

+--------------------------------------------------------------+-------------------------------------------------------------------------+
|title                                                         |array_distinct(new)                                                      |
+--------------------------------------------------------------+-------------------------------------------------------------------------+
|WE WANT TO TALK ABOUT OUR MARRIAGE                            |[WE, WANT, TO, TALK, ABOUT, OUR, MARRIAGE]                               |
|The Trump Presidency: Last Week Tonight with John Oliver (HBO)|[The, Trump, Presidency:, Last, Week, Tonight, with, John, Oliver, (HBO)]|
|Racist Superman | Rudy Mancuso, King Bach & Lele Pons         |[Racist, Superman, |, Rudy, Mancuso,, King, Bach, &, Lele, Pons]         |
|Nickelback Lyrics: Real or Fake?                              |[Nickelback, Lyrics:, Real, or, Fake?]                                   |
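|I Dare You: GOING BALD!?                                      |[I, Dare, You:, GOING, BALD!?]                                           |
+--------------------------------------------------------------+-------------------------------------------------------------------------+
only showing top 5 rows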

#### Remove elements from array in column

In [44]:
array.select("title",array_remove(array.new,"WE")).show(5,False)

+--------------------------------------------------------------+-------------------------------------------------------------------------+
|title                                                         |array_remove(new, WE)                                                    |
+--------------------------------------------------------------+-------------------------------------------------------------------------+
|WE WANT TO TALK ABOUT OUR MARRIAGE                            |[WANT, TO, TALK, ABOUT, OUR, MARRIAGE]                                   |
|The Trump Presidency: Last Week Tonight with John Oliver (HBO)|[The, Trump, Presidency:, Last, Week, Tonight, with, John, Oliver, (HBO)]|
|Racist Superman | Rudy Mancuso, King Bach & Lele Pons         |[Racist, Superman, |, Rudy, Mancuso,, King, Bach, &, Lele, Pons]         |
|Nickelback Lyrics: Real or Fake?                              |[Nickelback, Lyrics:, Real, or, Fake?]                                   |
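|I Dare You: GOING BALD!?                                      |[I, Dare, You:, GOING, BALD!?]                                           |
+--------------------------------------------------------------+-------------------------------------------------------------------------+
only showing top 5 rows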
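
None of the sample titles shown repeats a word, so the effect of array_distinct is hard to see in the output. A small plain-Python sketch with a made-up token list shows what the two array operations do to a single row (list operations standing in for the Spark functions):

```python
tokens = ["WE", "WANT", "WE", "TALK"]  # made-up example with a repeated word

# Drop duplicates while keeping the remaining elements, comparable to array_distinct
distinct = list(dict.fromkeys(tokens))

# Drop every element equal to "WE", comparable to array_remove
removed = [t for t in tokens if t != "WE"]

print(distinct)
print(removed)
```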

### Applying a user-defined function to a column

To apply our own Python function to a column, we wrap it as a user-defined function (UDF) with udf, as shown below.

In [45]:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

In [46]:
def square(x):
    return int(x**2)

In [47]:
square_udf = udf(lambda z:square(z),IntegerType())

In [48]:
df.select("dislikes",square_udf('dislikes')).where(col("dislikes").isNotNull()).show(5)

+--------+------------------+
|dislikes|<lambda>(dislikes)|
+--------+------------------+
|    2966|           8797156|
|    6146|          37773316|
|    5339|          28504921|
|     666|            443556|
|    1989|           3956121|
+--------+------------------+
only showing top 5 rows
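
When writing the Python function behind a UDF, it can help to handle missing values and the 32-bit range of IntegerType explicitly. The variant below is a hypothetical sketch: `safe_square` and `INT_MAX` are names introduced here, and the range guard is a defensive assumption, since the original does not cover what happens to out-of-range results.

```python
INT_MAX = 2**31 - 1  # largest value a 32-bit integer column can hold


def safe_square(x):
    """Square x, returning None for missing input or results too large for 32 bits."""
    if x is None:
        return None
    result = int(x) ** 2
    return result if result <= INT_MAX else None


print(safe_square(2966))   # 8797156, as in the first row above
print(safe_square(None))   # None
print(safe_square(50000))  # None, because 2_500_000_000 exceeds INT_MAX
```

It could be registered in the same way as square, for example with udf(safe_square, IntegerType()).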

