# **Manipulating Data in Dataframes**

In this lecture we will learn how to manipulate data in dataframes. You will need these techniques to accomplish some of the following tasks:

- Change data types when they are incorrectly interpreted;
- Clean your data;
- Create new columns;
- Rename columns;
- Extract or create new values.

We will also cover how to manipulate arrays.

**So let's get started!**

In [22]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('ManipulateDataset').getOrCreate()
spark

In [3]:
names = spark.createDataFrame([('Abraham', 'Lincoln')], ['first_name', 'last_name'])
names.show()

+----------+---------+
|first_name|last_name|
+----------+---------+
|   Abraham|  Lincoln|
+----------+---------+



In [4]:
names.rdd.id()

10

In [5]:
from pyspark.sql.functions import *

names = names.select(names.first_name, names.last_name, concat_ws(' ', names.first_name, names.last_name))
names.show()

+----------+---------+-----------------------------------+
|first_name|last_name|concat_ws( , first_name, last_name)|
+----------+---------+-----------------------------------+
|   Abraham|  Lincoln|                    Abraham Lincoln|
+----------+---------+-----------------------------------+



In [6]:
names.rdd.id()

16

In [7]:
path = './data/'

videos = spark.read.csv(path+'youtubevideos.csv', inferSchema=True, header=True)

# printSchema() prints the schema itself and returns None, so it needs no print() wrapper
videos.printSchema()
videos.limit(5).toPandas()

root
 |-- video_id: string (nullable = true)
 |-- trending_date: string (nullable = true)
 |-- title: string (nullable = true)
 |-- channel_title: string (nullable = true)
 |-- category_id: string (nullable = true)
 |-- publish_time: string (nullable = true)
 |-- tags: string (nullable = true)
 |-- views: string (nullable = true)
 |-- likes: string (nullable = true)
 |-- dislikes: string (nullable = true)
 |-- comment_count: string (nullable = true)
 |-- thumbnail_link: string (nullable = true)
 |-- comments_disabled: string (nullable = true)
 |-- ratings_disabled: string (nullable = true)
 |-- video_error_or_removed: string (nullable = true)
 |-- description: string (nullable = true)

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
0,2kyS6SvSYSE,17.14.11,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13T17:13:01.000Z,SHANtell martin,748374,57527,2966,15954,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...
1,1ZAPwfrtAFY,17.14.11,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,2017-11-13T07:30:00.000Z,"""last week tonight trump presidency""|""last wee...",2418783,97185,6146,12703,https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg,False,False,False,"One year after the presidential election, John..."
2,5qpjK5DgCt4,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12T19:05:24.000Z,"""racist superman""|""rudy""|""mancuso""|""king""|""bac...",3191434,146033,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...
3,puqaWrEC7tY,17.14.11,Nickelback Lyrics: Real or Fake?,Good Mythical Morning,24,2017-11-13T11:00:04.000Z,"""rhett and link""|""gmm""|""good mythical morning""...",343168,10172,666,2146,https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg,False,False,False,Today we find out if Link is a Nickelback amat...
4,d380meD0W0M,17.14.11,I Dare You: GOING BALD!?,nigahiga,24,2017-11-12T18:01:41.000Z,"""ryan""|""higa""|""higatv""|""nigahiga""|""i dare you""...",2095731,132235,1989,17518,https://i.ytimg.com/vi/d380meD0W0M/default.jpg,False,False,False,I know it's been a while since we did this sho...


**Available types**

- DataType
- NullType
- StringType
- BinaryType
- BooleanType
- DateType
- TimestampType
- DecimalType
- DoubleType
- FloatType
- ByteType
- IntegerType
- LongType
- ShortType
- ArrayType
- MapType
- StructField
- StructType

In [8]:
from pyspark.sql.types import *

In [9]:
from pyspark.sql.functions import *

def formatTrendingDate(date):
    new_date = str(date)
    sep = '-'
    new_date_concat = '20' + new_date[0:2] + sep + new_date[6:] + sep + new_date[3:5]
    return new_date_concat

def parseDateInRFC3339(date):
    new_date = str(date)
    new_date = new_date.replace('T', ' ')
    return new_date[0:-5]

parseRFC3339 = udf(parseDateInRFC3339)
parseTrendingDate = udf(formatTrendingDate)
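
Both helpers are ordinary Python functions, so they can be checked without Spark before they are wrapped in `udf`. The sketch below is a minimal check that assumes the sample values shown in the first rows of the dataset. The cross-check against `datetime.strptime` is an addition made here for illustration and is not part of the original notebook.

```python
from datetime import datetime

def formatTrendingDate(date):
    # Input looks like 'yy.dd.mm' (e.g. '17.14.11'); output is 'yyyy-mm-dd'.
    new_date = str(date)
    sep = '-'
    return '20' + new_date[0:2] + sep + new_date[6:] + sep + new_date[3:5]

def parseDateInRFC3339(date):
    # Input looks like '2017-11-13T17:13:01.000Z'; replace 'T' and drop the '.000Z' suffix.
    new_date = str(date).replace('T', ' ')
    return new_date[0:-5]

trending = formatTrendingDate('17.14.11')
published = parseDateInRFC3339('2017-11-13T17:13:01.000Z')
print(trending)   # 2017-11-14
print(published)  # 2017-11-13 17:13:01

# Cross-check both results against the standard library parser.
expected_date = datetime.strptime('17.14.11', '%y.%d.%m').strftime('%Y-%m-%d')
expected_time = datetime.strptime(
    '2017-11-13T17:13:01.000Z', '%Y-%m-%dT%H:%M:%S.%fZ'
).strftime('%Y-%m-%d %H:%M:%S')
print(expected_date == trending, expected_time == published)  # True True
```

Because both helpers call `str()` first, a `None` input becomes the string `'None'` and produces a malformed value instead of raising an error, so null handling may deserve a separate check.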

In [10]:
df = videos \
    .withColumn('views', videos['views'].cast(IntegerType())) \
    .withColumn('likes', videos['likes'].cast(IntegerType())) \
    .withColumn('dislikes', videos['dislikes'].cast(IntegerType())) \
    .withColumn('trending_date', to_date(parseTrendingDate(col('trending_date')), 'yyyy-MM-dd')) \
    .withColumn('publish_time', to_timestamp(parseRFC3339(col('publish_time')), 'yyyy-MM-dd HH:mm:ss'))

In [11]:
df.printSchema()

root
 |-- video_id: string (nullable = true)
 |-- trending_date: date (nullable = true)
 |-- title: string (nullable = true)
 |-- channel_title: string (nullable = true)
 |-- category_id: string (nullable = true)
 |-- publish_time: timestamp (nullable = true)
 |-- tags: string (nullable = true)
 |-- views: integer (nullable = true)
 |-- likes: integer (nullable = true)
 |-- dislikes: integer (nullable = true)
 |-- comment_count: string (nullable = true)
 |-- thumbnail_link: string (nullable = true)
 |-- comments_disabled: string (nullable = true)
 |-- ratings_disabled: string (nullable = true)
 |-- video_error_or_removed: string (nullable = true)
 |-- description: string (nullable = true)



In [12]:
df.limit(4).toPandas()

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
0,2kyS6SvSYSE,2017-11-14,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13 17:13:01,SHANtell martin,748374,57527,2966,15954,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...
1,1ZAPwfrtAFY,2017-11-14,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,2017-11-13 07:30:00,"""last week tonight trump presidency""|""last wee...",2418783,97185,6146,12703,https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg,False,False,False,"One year after the presidential election, John..."
2,5qpjK5DgCt4,2017-11-14,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12 19:05:24,"""racist superman""|""rudy""|""mancuso""|""king""|""bac...",3191434,146033,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...
3,puqaWrEC7tY,2017-11-14,Nickelback Lyrics: Real or Fake?,Good Mythical Morning,24,2017-11-13 11:00:04,"""rhett and link""|""gmm""|""good mythical morning""...",343168,10172,666,2146,https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg,False,False,False,Today we find out if Link is a Nickelback amat...


In [13]:
# Alternative: use the regexp_replace function instead of a UDF
# df = df.withColumn('publish_time', regexp_replace(df['publish_time'], 'T', ' '))

In [14]:
# Translate: 'T' becomes a space; 'Z' has no counterpart in the replacement string, so it is removed
videos.select('publish_time', translate(col('publish_time'), 'TZ', ' ').alias('translated_time')).show(5, False)

+------------------------+-----------------------+
|publish_time            |translated_time        |
+------------------------+-----------------------+
|2017-11-13T17:13:01.000Z|2017-11-13 17:13:01.000|
|2017-11-13T07:30:00.000Z|2017-11-13 07:30:00.000|
|2017-11-12T19:05:24.000Z|2017-11-12 19:05:24.000|
|2017-11-13T11:00:04.000Z|2017-11-13 11:00:04.000|
|2017-11-12T18:01:41.000Z|2017-11-12 18:01:41.000|
+------------------------+-----------------------+
only showing top 5 rows
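
The output above shows that `translate` works character by character: `T` is swapped for a space and `Z` disappears. The sketch below mirrors that behavior in plain Python so the rule can be tried without a Spark session. `translate_like_spark` is a hypothetical helper written for this illustration, not a Spark API.

```python
def translate_like_spark(text, matching, replace):
    """Plain-Python mirror of the character mapping seen in the cell above."""
    table = {}
    for i, ch in enumerate(matching):
        # A character with a counterpart in `replace` is substituted;
        # a character without one is deleted (mapped to None).
        table.setdefault(ord(ch), replace[i] if i < len(replace) else None)
    return text.translate(table)

print(translate_like_spark('2017-11-13T17:13:01.000Z', 'TZ', ' '))
# 2017-11-13 17:13:01.000
```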



In [15]:
# Trim: remove leading and trailing spaces (ltrim and rtrim handle one side only)

videos.select(trim(videos['title'])).show(5, False)

+--------------------------------------------------------------+
|trim(title)                                                   |
+--------------------------------------------------------------+
|WE WANT TO TALK ABOUT OUR MARRIAGE                            |
|The Trump Presidency: Last Week Tonight with John Oliver (HBO)|
|Racist Superman | Rudy Mancuso, King Bach & Lele Pons         |
|Nickelback Lyrics: Real or Fake?                              |
|I Dare You: GOING BALD!?                                      |
+--------------------------------------------------------------+
only showing top 5 rows
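
The sample titles carry no padding, so the effect of trimming is not visible in the output above. The following plain-Python sketch uses a hypothetical padded title to show the three variants side by side. It assumes that Spark's `trim`, `ltrim`, and `rtrim` remove space characters only, which is why `strip(' ')` is used and not the bare `strip()`.

```python
padded = '   I Dare You: GOING BALD!?  '  # hypothetical padded value

both_sides = padded.strip(' ')    # counterpart of trim
left_only = padded.lstrip(' ')    # counterpart of ltrim
right_only = padded.rstrip(' ')   # counterpart of rtrim

print(repr(both_sides))   # 'I Dare You: GOING BALD!?'
print(repr(left_only))    # 'I Dare You: GOING BALD!?  '
print(repr(right_only))   # '   I Dare You: GOING BALD!?'
```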



In [16]:
# Lower

videos.select('title', lower(videos['title'])).show(5, False)

+--------------------------------------------------------------+--------------------------------------------------------------+
|title                                                         |lower(title)                                                  |
+--------------------------------------------------------------+--------------------------------------------------------------+
|WE WANT TO TALK ABOUT OUR MARRIAGE                            |we want to talk about our marriage                            |
|The Trump Presidency: Last Week Tonight with John Oliver (HBO)|the trump presidency: last week tonight with john oliver (hbo)|
|Racist Superman | Rudy Mancuso, King Bach & Lele Pons         |racist superman | rudy mancuso, king bach & lele pons         |
|Nickelback Lyrics: Real or Fake?                              |nickelback lyrics: real or fake?                              |
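|I Dare You: GOING BALD!?                                      |i dare you: going bald!?                                      |
+--------------------------------------------------------------+--------------------------------------------------------------+
only showing top 5 rows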

In [17]:
# Case When

# Option 1: when().otherwise()
df.select('title', 'likes', 'dislikes', when(col('likes') > col('dislikes'), 'Good').otherwise('Bad').alias('Rating')).show(5, False)

+--------------------------------------------------------------+------+--------+------+
|title                                                         |likes |dislikes|Rating|
+--------------------------------------------------------------+------+--------+------+
|WE WANT TO TALK ABOUT OUR MARRIAGE                            |57527 |2966    |Good  |
|The Trump Presidency: Last Week Tonight with John Oliver (HBO)|97185 |6146    |Good  |
|Racist Superman | Rudy Mancuso, King Bach & Lele Pons         |146033|5339    |Good  |
|Nickelback Lyrics: Real or Fake?                              |10172 |666     |Good  |
|I Dare You: GOING BALD!?                                      |132235|1989    |Good  |
+--------------------------------------------------------------+------+--------+------+
only showing top 5 rows



In [18]:
# Option 2: SQL CASE WHEN inside expr()
df.select('title', 'likes', 'dislikes', expr(
    '''
        CASE
            WHEN likes > dislikes THEN 'Good'
            WHEN likes < dislikes THEN 'Bad'
            ELSE 'Undetermined'
        END AS Rating
    ''')).show(5, False)

+--------------------------------------------------------------+------+--------+------+
|title                                                         |likes |dislikes|Rating|
+--------------------------------------------------------------+------+--------+------+
|WE WANT TO TALK ABOUT OUR MARRIAGE                            |57527 |2966    |Good  |
|The Trump Presidency: Last Week Tonight with John Oliver (HBO)|97185 |6146    |Good  |
|Racist Superman | Rudy Mancuso, King Bach & Lele Pons         |146033|5339    |Good  |
|Nickelback Lyrics: Real or Fake?                              |10172 |666     |Good  |
|I Dare You: GOING BALD!?                                      |132235|1989    |Good  |
+--------------------------------------------------------------+------+--------+------+
only showing top 5 rows
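
The first five rows produce the same rating under both options, but the two expressions are not equivalent. Option 1 has only two outcomes, while Option 2 adds an `ELSE` branch. The sketch below restates both rules as plain Python functions (hypothetical names, written for this illustration) and runs them on a tie and on a missing value. It assumes SQL comparison semantics, where a comparison involving a null is neither true nor false.

```python
def rating_when_otherwise(likes, dislikes):
    # Option 1: everything that is not "likes > dislikes" lands in otherwise('Bad'),
    # including ties and rows where either value is missing.
    if likes is not None and dislikes is not None and likes > dislikes:
        return 'Good'
    return 'Bad'

def rating_case_when(likes, dislikes):
    # Option 2: a missing value matches neither WHEN branch, so ELSE applies.
    if likes is None or dislikes is None:
        return 'Undetermined'
    if likes > dislikes:
        return 'Good'
    if likes < dislikes:
        return 'Bad'
    return 'Undetermined'  # tie

samples = [(57527, 2966), (10, 10), (None, 5)]
for likes, dislikes in samples:
    print(likes, dislikes,
          rating_when_otherwise(likes, dislikes),
          rating_case_when(likes, dislikes))
# 57527 2966 Good Good
# 10 10 Bad Undetermined
# None 5 Bad Undetermined
```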



In [19]:
# Concatenate

df.select(concat_ws(' ', df.title, df.channel_title).alias('text')).show(5, False)

+------------------------------------------------------------------------------+
|text                                                                          |
+------------------------------------------------------------------------------+
|WE WANT TO TALK ABOUT OUR MARRIAGE CaseyNeistat                               |
|The Trump Presidency: Last Week Tonight with John Oliver (HBO) LastWeekTonight|
|Racist Superman | Rudy Mancuso, King Bach & Lele Pons Rudy Mancuso            |
|Nickelback Lyrics: Real or Fake? Good Mythical Morning                        |
|I Dare You: GOING BALD!? nigahiga                                             |
+------------------------------------------------------------------------------+
only showing top 5 rows
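
One detail of `concat_ws` that the sample rows do not exercise is how it treats missing values: null inputs are skipped, so a single null does not turn the whole result into null. A minimal plain-Python mirror, using the hypothetical helper `concat_ws_like_spark`, could look like this:

```python
def concat_ws_like_spark(sep, *values):
    # Missing (None) inputs are skipped; the remaining values are joined with sep.
    return sep.join(str(v) for v in values if v is not None)

print(concat_ws_like_spark(' ', 'Abraham', 'Lincoln'))  # Abraham Lincoln
print(concat_ws_like_spark(' ', 'Abraham', None))       # Abraham
```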



In [20]:
# Extract the year and month from a date column
df.select('trending_date', year('trending_date'), month('trending_date')).show(5, False)

+-------------+-------------------+--------------------+
|trending_date|year(trending_date)|month(trending_date)|
+-------------+-------------------+--------------------+
|2017-11-14   |2017               |11                  |
|2017-11-14   |2017               |11                  |
|2017-11-14   |2017               |11                  |
|2017-11-14   |2017               |11                  |
|2017-11-14   |2017               |11                  |
+-------------+-------------------+--------------------+
only showing top 5 rows



In [23]:
df.select('trending_date', 'publish_time', datediff(df['trending_date'], df['publish_time'])).show(5)

+-------------+-------------------+-------------------------------------+
|trending_date|       publish_time|datediff(trending_date, publish_time)|
+-------------+-------------------+-------------------------------------+
|   2017-11-14|2017-11-13 17:13:01|                                    1|
|   2017-11-14|2017-11-13 07:30:00|                                    1|
|   2017-11-14|2017-11-12 19:05:24|                                    2|
|   2017-11-14|2017-11-13 11:00:04|                                    1|
|   2017-11-14|2017-11-12 18:01:41|                                    2|
+-------------+-------------------+-------------------------------------+
only showing top 5 rows
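
In the output above, a video published at 17:13 on 2017-11-13 and trending on 2017-11-14 gets a difference of 1, although less than 24 hours separate the two values. This is consistent with `datediff` comparing calendar dates and ignoring the time of day. The sketch below reproduces the first and third rows under that assumption. `datediff_like_spark` is a hypothetical helper, not a Spark function.

```python
from datetime import date, datetime

def datediff_like_spark(end, start):
    # Reduce both arguments to calendar dates, then count whole days between them.
    end_date = end.date() if isinstance(end, datetime) else end
    start_date = start.date() if isinstance(start, datetime) else start
    return (end_date - start_date).days

trending = date(2017, 11, 14)
print(datediff_like_spark(trending, datetime(2017, 11, 13, 17, 13, 1)))  # 1
print(datediff_like_spark(trending, datetime(2017, 11, 12, 19, 5, 24)))  # 2
```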



In [26]:
array = df.select('title', split(df.title, ' ').alias('new'))
array.limit(5).toPandas()

Unnamed: 0,title,new
0,WE WANT TO TALK ABOUT OUR MARRIAGE,"[WE, WANT, TO, TALK, ABOUT, OUR, MARRIAGE]"
1,The Trump Presidency: Last Week Tonight with J...,"[The, Trump, Presidency:, Last, Week, Tonight,..."
2,"Racist Superman | Rudy Mancuso, King Bach & Le...","[Racist, Superman, |, Rudy, Mancuso,, King, Ba..."
3,Nickelback Lyrics: Real or Fake?,"[Nickelback, Lyrics:, Real, or, Fake?]"
4,I Dare You: GOING BALD!?,"[I, Dare, You:, GOING, BALD!?]"


In [29]:
array.select('title', array_contains(array['new'], "MARRIAGE")).limit(5).toPandas()

Unnamed: 0,title,"array_contains(new, MARRIAGE)"
0,WE WANT TO TALK ABOUT OUR MARRIAGE,True
1,The Trump Presidency: Last Week Tonight with J...,False
2,"Racist Superman | Rudy Mancuso, King Bach & Le...",False
3,Nickelback Lyrics: Real or Fake?,False
4,I Dare You: GOING BALD!?,False


In [31]:
array.select('title', array_distinct(array['new'])).limit(10).toPandas()

Unnamed: 0,title,array_distinct(new)
0,WE WANT TO TALK ABOUT OUR MARRIAGE,"[WE, WANT, TO, TALK, ABOUT, OUR, MARRIAGE]"
1,The Trump Presidency: Last Week Tonight with J...,"[The, Trump, Presidency:, Last, Week, Tonight,..."
2,"Racist Superman | Rudy Mancuso, King Bach & Le...","[Racist, Superman, |, Rudy, Mancuso,, King, Ba..."
3,Nickelback Lyrics: Real or Fake?,"[Nickelback, Lyrics:, Real, or, Fake?]"
4,I Dare You: GOING BALD!?,"[I, Dare, You:, GOING, BALD!?]"
5,2 Weeks with iPhone X,"[2, Weeks, with, iPhone, X]"
6,Roy Moore & Jeff Sessions Cold Open - SNL,"[Roy, Moore, &, Jeff, Sessions, Cold, Open, -,..."
7,5 Ice Cream Gadgets put to the Test,"[5, Ice, Cream, Gadgets, put, to, the, Test]"
8,The Greatest Showman | Official Trailer 2 [HD]...,"[The, Greatest, Showman, |, Official, Trailer,..."
9,Why the rise of the robots won’t mean the end ...,"[Why, the, rise, of, robots, won’t, mean, end,..."


In [32]:
array.select('title', array_remove(array['new'], 'WE')).limit(5).toPandas()

Unnamed: 0,title,"array_remove(new, WE)"
0,WE WANT TO TALK ABOUT OUR MARRIAGE,"[WANT, TO, TALK, ABOUT, OUR, MARRIAGE]"
1,The Trump Presidency: Last Week Tonight with J...,"[The, Trump, Presidency:, Last, Week, Tonight,..."
2,"Racist Superman | Rudy Mancuso, King Bach & Le...","[Racist, Superman, |, Rudy, Mancuso,, King, Ba..."
3,Nickelback Lyrics: Real or Fake?,"[Nickelback, Lyrics:, Real, or, Fake?]"
4,I Dare You: GOING BALD!?,"[I, Dare, You:, GOING, BALD!?]"
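
The four array functions used above (`split`, `array_contains`, `array_distinct`, `array_remove`) each have a close counterpart on Python lists. The sketch below applies those counterparts to the first title, plus one hypothetical title with repeated words, as a way to reason about the expected results. Two assumptions are worth flagging: Spark's `split` interprets its pattern as a regular expression while `str.split` treats it literally (the two agree for a single space), and the order-preserving behavior of `array_distinct` is inferred from the output shown above.

```python
title = 'WE WANT TO TALK ABOUT OUR MARRIAGE'

words = title.split(' ')                        # split(title, ' ')
has_marriage = 'MARRIAGE' in words              # array_contains(new, 'MARRIAGE')
has_lowercase = 'marriage' in words             # matching is exact and case-sensitive
without_we = [w for w in words if w != 'WE']    # array_remove(new, 'WE')

repeated = 'the end of the road'.split(' ')     # hypothetical title with a duplicate
distinct = list(dict.fromkeys(repeated))        # array_distinct: first occurrences kept

print(words)          # ['WE', 'WANT', 'TO', 'TALK', 'ABOUT', 'OUR', 'MARRIAGE']
print(has_marriage)   # True
print(has_lowercase)  # False
print(without_we)     # ['WANT', 'TO', 'TALK', 'ABOUT', 'OUR', 'MARRIAGE']
print(distinct)       # ['the', 'end', 'of', 'road']
```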


In [33]:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

In [34]:
def square(x):
    return int(x**2)

In [37]:
square_udf = udf(lambda z: square(z), IntegerType())

In [39]:
df.select('dislikes', square_udf('dislikes')).where(col('dislikes').isNotNull()) \
    .limit(5).toPandas()

Unnamed: 0,dislikes,<lambda>(dislikes)
0,2966,8797156
1,6146,37773316
2,5339,28504921
3,666,443556
4,1989,3956121
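
The UDF above declares `IntegerType()` as its return type, which is a 32-bit signed integer. Squares grow quickly, so it is worth checking which inputs still fit. The sketch below is a plain-Python check with a hypothetical helper, `fits_integer_type`. What Spark does with an out-of-range value returned by a Python UDF is not covered in this lecture, so declaring `LongType()` for large counts is suggested here only as a cautious assumption.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1  # bounds of a 32-bit signed integer

def square(x):
    return int(x**2)

def fits_integer_type(value):
    # True when the value can be stored in a 32-bit signed integer.
    return INT32_MIN <= value <= INT32_MAX

for dislikes in (2966, 46340, 46341):
    result = square(dislikes)
    print(dislikes, result, fits_integer_type(result))
# 2966 8797156 True
# 46340 2147395600 True
# 46341 2147488281 False
```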
