<p align="center">
<img src="https://github.com/datacamp/python-live-training-template/blob/master/assets/datacamp.svg?raw=True" alt = "DataCamp icon" width="50%">
</p>
<br><br>

## **Cleaning Data with PySpark**

Welcome to this hands-on training where we will investigate cleaning a dataset using Python and Apache Spark! During this training, we will cover:

* Efficiently loading data into a Spark DataFrame
* Handling errant rows / columns from the dataset, including comments, missing data, combined or misinterpreted columns, etc.
* Using Python UDFs to run advanced transformations on data


## **The Dataset**

The dataset used in this webinar is a set of gzip-compressed, tab-separated files named `netflix_titles_dirty_*.csv.gz`. They contain information about the movies and television shows available on Netflix. These are the *dirty* versions of the dataset; we will cover the individual problems as we work through the notebook.

Given that this is a data cleaning webinar, let's look at our intended result. The cleaned dataset will contain the following information:

- `show_id`: A unique identifier for the show
- `type`: The type of content, `Movie` or `TV Show`
- `title`: The title of the content
- `director`: The director (or directors)
- `cast`: The cast
- `country`: Country (or countries) where the content is available
- `date_added`: Date added to Netflix
- `release_year`: Year of content release
- `rating`: Content rating
- `duration`: The duration
- `listed_in`: The genres the content is listed in
- `description`: A description of the content



## **Setting up a PySpark session**

Before we can start processing our data, we need to configure a PySpark session for Google Colab. Note that this setup is specific to running Spark and Python in Colab and is likely not required in other environments.

In [0]:
# Run this code as is to install Spark in Colab
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
!tar xf spark-2.4.5-bin-hadoop2.7.tgz
!pip install -q findspark

In [0]:
# Run this code to setup the environment
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.5-bin-hadoop2.7"

In [0]:
# Finally, setup our Spark session
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

## **Getting started**

Before doing anything else, let's copy our data files locally. Run the following cell to download the *dirty* files.

In [6]:
# Copy our dataset locally

!wget -O /tmp/netflix_titles_dirty_01.csv.gz 'https://github.com/datacamp/data-cleaning-with-pyspark-live-training/blob/master/data/netflix_titles_dirty_01.csv.gz?raw=True'
!wget -O /tmp/netflix_titles_dirty_02.csv.gz 'https://github.com/datacamp/data-cleaning-with-pyspark-live-training/blob/master/data/netflix_titles_dirty_02.csv.gz?raw=True'
!wget -O /tmp/netflix_titles_dirty_03.csv.gz 'https://github.com/datacamp/data-cleaning-with-pyspark-live-training/blob/master/data/netflix_titles_dirty_03.csv.gz?raw=True'
!wget -O /tmp/netflix_titles_dirty_04.csv.gz 'https://github.com/datacamp/data-cleaning-with-pyspark-live-training/blob/master/data/netflix_titles_dirty_04.csv.gz?raw=True'
!wget -O /tmp/netflix_titles_dirty_05.csv.gz 'https://github.com/datacamp/data-cleaning-with-pyspark-live-training/blob/master/data/netflix_titles_dirty_05.csv.gz?raw=True'
!wget -O /tmp/netflix_titles_dirty_06.csv.gz 'https://github.com/datacamp/data-cleaning-with-pyspark-live-training/blob/master/data/netflix_titles_dirty_06.csv.gz?raw=True'
!wget -O /tmp/netflix_titles_dirty_07.csv.gz 'https://github.com/datacamp/data-cleaning-with-pyspark-live-training/blob/master/data/netflix_titles_dirty_07.csv.gz?raw=True'



--2020-06-12 03:21:54--  https://github.com/datacamp/data-cleaning-with-pyspark-live-training/blob/master/data/netflix_titles_dirty_01.csv.gz?raw=True
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/datacamp/data-cleaning-with-pyspark-live-training/raw/master/data/netflix_titles_dirty_01.csv.gz [following]
--2020-06-12 03:21:55--  https://github.com/datacamp/data-cleaning-with-pyspark-live-training/raw/master/data/netflix_titles_dirty_01.csv.gz
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/datacamp/data-cleaning-with-pyspark-live-training/master/data/netflix_titles_dirty_01.csv.gz [following]
--2020-06-12 03:21:55--  https://raw.githubusercontent.com/datacamp/data-cleaning-with-pyspark-live-training/master/data/netflix_titles_dirty_01.csv.gz
Re

## Now, let's verify that we have all 7 files we expect

In [7]:
!ls /tmp/netflix_titles*

/tmp/netflix_titles_dirty_01.csv.gz  /tmp/netflix_titles_dirty_05.csv.gz
/tmp/netflix_titles_dirty_02.csv.gz  /tmp/netflix_titles_dirty_06.csv.gz
/tmp/netflix_titles_dirty_03.csv.gz  /tmp/netflix_titles_dirty_07.csv.gz
/tmp/netflix_titles_dirty_04.csv.gz


## And then, we'll take a look at the first 20 rows of one of the files

In [8]:
!gunzip -c /tmp/netflix_titles_dirty_03.csv.gz | head -20

80142103	Movie	Bottom of the World	Richard Sears	Jena Malone, Douglas Smith, Ted Levine, Tamara Duarte, Kelly Pendygraft, Mark Sivertsen, Jon McLaren	Canada, United States	March 31, 2017	2017	TV-MA	84 min	Dramas, Independent Movies, Thrillers	En route to a fresh start in Los Angeles, young couple Alex and Scarlett stop over in a sleepy Southwestern town that loosens their grip on reality.
80179907	Movie	Bridget Christie: Stand Up for Her		Bridget Christie	United Kingdom	March 31, 2017	2016	TV-MA	51 min	Stand-Up Comedy	Performing stand-up for a packed house in London's Hoxton Hall, comedian Bridget Christie dives into the politics of gender, sex and equality.
80152842	Movie	FirstBorn	Nirpal Bhogal	Antonia Thomas, Luke Norris, Thea Petrie, Eileen Davies, Jonathan Hyde	United Kingdom	March 31, 2017	2016	TV-MA	90 min	Horror Movies, International Movies	A young couple fights supernatural foes in an attempt to save their daughter from the dark and mysterious forces that follow her every move

# Loading our initial DataFrame

Let's take a look at what Spark does with our data and see if it can properly parse the input. To do this, we'll first load the content into a DataFrame using the `spark.read.csv()` method. We'll pass in three arguments: the path to the file(s), `header=False`, and `sep`. Our files do not have a header row, so we state this explicitly to make sure a data row is never interpreted as a header. The `sep` option specifies the field separator. In CSV files this is often a comma `,`, but in our files it's a tab character, `\t`.

In [0]:
titles_df = spark.read.csv('/tmp/netflix_titles_dirty*.csv.gz', header=False, sep='\t')
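
To make the `header=False` and `sep='\t'` options more concrete, here is a Spark-free sketch using Python's standard `csv` module on two made-up rows (the sample values are assumptions, not rows from the dataset). It only illustrates how a tab-separated file without a header row is split into positional fields; Spark's reader is a separate implementation.

```python
import csv
import io

# Two made-up, tab-separated rows with no header line (hypothetical sample data)
sample = "101\tMovie\tExample Title\n102\tTV Show\tAnother Title\n"

# delimiter="\t" plays the role of sep='\t'; because there is no header row,
# every line is treated as data
rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))

# Name the fields by position, similar to the _c0, _c1, ... names Spark assigns
named = [{f"_c{i}": value for i, value in enumerate(row)} for row in rows]
print(named[0])
```

Because no header is read, the column names carry no meaning yet, which is why the real DataFrame starts out with `_c0` through `_c11`.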

## Initial analysis

Let's look at the first 150 rows using the `.show()` method on the DataFrame. We'll pass in the number of rows to display and set the `truncate` option to `False` so we can see all the DataFrame content.

In [17]:
titles_df.show(150, truncate=False)

+------------------------------------------------------------+-------+-------------------------------------------------+-----------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+---------------+----+--------+--------+-------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|_c0                                                         |_c1    |_c2

If we start by looking at the first column of data, we see that most of the entries are numeric IDs. If we scroll through the data we do see at least one random text entry. Let's make a note of this and browse through the rest of the data, looking for anything that might be out of the ordinary.

We can also use the `.printSchema()` method to print the schema associated with the data. Notice that we have 12 columns (which is expected based on our format information), but there are no column names, every column is typed as a string, and each field is nullable.

In [14]:
titles_df.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: string (nullable = true)
 |-- _c10: string (nullable = true)
 |-- _c11: string (nullable = true)



## Bypassing the CSV interpreter

Our data on initial pass looks ok, but we can see that we have a few random rows even in our small sample. We know the first column should be an integer value but it looks like there are some that do not meet this requirement. Let's run a quick select statement on the DataFrame to determine the makeup of the content.

In [18]:
# Import the Pyspark SQL helper functions
from pyspark.sql import functions as F

# Determine how many rows have a column that converts properly to an integer value
titles_df.filter(F.col("_c0").cast("int").isNotNull()).count()


6173

In [20]:
# Look at rows that don't convert properly
titles_df.filter(F.col("_c0").cast("int").isNull()).show(truncate=False)

+------------------------------------------------------------------------------------------------------+-------------------------------------------------+-----------------------------------------------------------------------------------+----+-----+----+----+----+----+----+----+----+
|_c0                                                                                                   |_c1                                              |_c2                                                                                |_c3 |_c4  |_c5 |_c6 |_c7 |_c8 |_c9 |_c10|_c11|
+------------------------------------------------------------------------------------------------------+-------------------------------------------------+-----------------------------------------------------------------------------------+----+-----+----+----+----+----+----+----+----+
|# The wave for issues such as semantic network and computing                                          |null                                     

In [21]:
# Look at the full count of rows
titles_df.count()

6238
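
The `cast("int")` checks above can be pictured as "try to convert, otherwise return nothing". The plain-Python sketch below imitates that idea on a few illustrative values. `to_int_or_none` is a hypothetical helper, and Python's `int()` is not identical to Spark's cast (for example, Python integers have no 32-bit limit), so treat it as an analogy only.

```python
def to_int_or_none(value):
    """Return an int when the text parses as one, otherwise None."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


# Illustrative first-column values: two IDs, a comment, a misplaced type, a missing value
first_column = ["80142103", "# a comment row", "TV Show", None, "70213078"]

converted = [to_int_or_none(value) for value in first_column]

# Equivalent in spirit to the isNotNull() and isNull() filters
valid_ids = [value for value in converted if value is not None]
invalid_count = sum(value is None for value in converted)
```

Rows that land in the "invalid" bucket are the ones worth inspecting by hand, which is what the second cell above does.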

Looking at our content, we notice we have several types of problems:

- Comment rows: These begin with a `#` character in the first column, and all other columns are null
- Missing first column: We have a few rows that begin with `TV Show` or `Movie`, which should be in the second column.
- Odd columns: There are a few rows where the columns seem out of sync (i.e., a content type in the ID field, dates in the wrong column, etc.).

We could fairly easily remove rows that match this pattern, but we're not entirely sure what to expect here. This is a common issue when trying to parse a large amount of data, be it in native Python, in Spark, or even with command-line tools. 

What we need to do is bypass most of the CSV parser's intelligence, but still load the content into a DataFrame. One way to do this is to modify an option on the CSV loader.

# CSV loading

Our initial import relies on the defaults for the CSV import mechanism. This typically assumes an actual comma-separated value file using `,` between fields and a normal row-level terminator (i.e., `\r\n`, `\r`, or `\n`). While this often works well, it doesn't handle every data cleaning need you may have, especially if you want to save the errant data for later examination.

One way we can trick our CSV load is to specify a custom separator that we know does not exist within our dataset. As we used above, the option to do this is called `sep` and takes a single character to be used as the column separator. The separator cannot be an empty string so depending on your data, you may need to determine a character that is not used. For our purposes, let's use a curly brace, `{`, which is most likely not present in our data.

In [0]:
# Load the files into a DataFrame with a single column

titles_single_df = spark.read.csv('/tmp/netflix_titles_dirty*.csv.gz', sep='{')
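
One quick way to see why this works, and to check that a candidate separator really is unused, is to try it on a few lines in plain Python. The lines below are shortened, illustrative stand-ins for real rows, and the variable names are hypothetical.

```python
# Shortened, illustrative lines: a normal row, a comment row, and a broken row
lines = [
    "80142103\tMovie\tBottom of the World",
    "# a comment row",
    "TV Show\tTV-MA",
]

candidate = "{"

# The trick only works if the candidate separator never occurs in the data
separator_unused = all(candidate not in line for line in lines)

# Splitting on an unused character leaves every line as a single field
single_column = [line.split(candidate) for line in lines]
field_counts = [len(fields) for fields in single_column]
```

If `separator_unused` were false for the real files, a different character would be needed, otherwise some rows would be split into more than one column.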

In [23]:
titles_single_df.count()

6238

In [24]:
titles_single_df.show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|_c0                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      

In [25]:
titles_single_df.printSchema()

root
 |-- _c0: string (nullable = true)



# Q&A?

# Cleaning up our data

We know from some earlier analysis that we have comment rows in place (ie, rows that begin with a `#`). While Spark provides a DataFrame option to handle this automatically, let's consider what it would take to remove comment rows.

We need to:

- Determine if the column / line starts with a `#`
- If so, filter these out to a new DataFrame

There are many ways to accomplish this in Spark, but let's use a conceptually straightforward option, the `.startswith()` column method.

In [26]:
titles_single_new_df = titles_single_df.filter(F.col('_c0').startswith('#'))
titles_single_new_df.show()

+--------------------+
|                 _c0|
+--------------------+
|# The wave for is...|
|# Which was, Chan...|
|# Five 40 dangero...|
|# Heisei Hyakkei ...|
|# (or will advanc...|
|# Site Montana en...|
|# To improving ex...|
|# Stanford Linear...|
|# Europeans. Carl...|
|# Offer internati...|
|# Hundred Views S...|
|# Filipino Americ...|
|# West Side desce...|
|# To British colo...|
|# Rates than laws...|
|# (87.8 percent B...|
|# On exteriority,...|
|# Roosevelt Unive...|
|# An "or". a conv...|
|# 1611–1613 Kalma...|
+--------------------+
only showing top 20 rows



In [27]:
titles_single_new_df.count()

47

We've determined that we have 47 rows that begin with a comment character. We can now easily filter these from our DataFrame as we do below.

*Note*: We're doing things in a more difficult fashion than is strictly necessary in order to illustrate the options. The Spark CSV reader has a `comment` option that skips all rows starting with a given character; it is disabled by default, which is why the comment rows were loaded above. It only supports a single character, so consider what you would do with multi-character markers (i.e., `//` or `/*` from C-style syntax). Our filter-based method does not depend on any reader option and works in any of the 2.x Spark releases.
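
For the multi-character case raised in the note, the same "starts with" idea extends naturally. The sketch below is plain Python rather than Spark, and the set of markers and the sample lines are assumptions chosen for illustration.

```python
# Hypothetical comment markers; str.startswith accepts a tuple of prefixes
COMMENT_PREFIXES = ("#", "//", "/*")


def is_comment(line):
    """True when the line begins with any of the comment markers."""
    return line.startswith(COMMENT_PREFIXES)


lines = [
    "# a hash-style comment",
    "// a C++-style comment",
    "/* a C-style comment */",
    "80142103\tMovie\tBottom of the World",
]

comment_rows = [line for line in lines if is_comment(line)]
data_rows = [line for line in lines if not is_comment(line)]
```

In PySpark, a comparable filter could be built by combining several `.startswith()` conditions with `|`, or with a single `.rlike()` pattern.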

In [0]:
# Filter out comments

titles_single_df = titles_single_df.filter(~ F.col('_c0').startswith('#'))

In [29]:
titles_single_df.count()

6191

# Checking column counts

Our next step for cleaning this dataset in PySpark is to determine how many columns each row has and take care of any outliers. We know from our earlier examination that the dataset should have 12 usable columns per row.

First, let's determine how many columns are present in the data and add that as a column. We'll do this with a combination of the `split` and `size` functions, along with the `.withColumn()` method.

In [0]:
# Add a column representing the total number of fields / columns

titles_single_df = titles_single_df.withColumn('fieldCount', F.size(F.split(titles_single_df['_c0'], '\t')))
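
The `split` plus `size` combination amounts to "split the line on tabs and count the pieces". A minimal plain-Python sketch of that per-row logic follows; the sample lines are illustrative stand-ins and `EXPECTED_FIELDS` is a hypothetical name.

```python
EXPECTED_FIELDS = 12

# Illustrative lines: one well-formed row (12 fields) and two broken ones
lines = [
    "\t".join(f"field{i}" for i in range(12)),
    "TV Show\tTV-MA\tCrime",
    "Black Snake Moan",
]

# Equivalent in spirit to F.size(F.split(col, '\t'))
field_counts = [len(line.split("\t")) for line in lines]

# Separate rows with the expected shape from the ones to set aside
good_rows = [line for line, n in zip(lines, field_counts) if n == EXPECTED_FIELDS]
bad_rows = [line for line, n in zip(lines, field_counts) if n != EXPECTED_FIELDS]
```

Keeping the bad rows in their own collection, rather than discarding them, mirrors the separate DataFrame created for later analysis.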

In [40]:
# Show rows with a fieldcount > 12 (Note, select statement here isn't necessarily required - used to reorder the columns for easier viewing)
titles_single_df.select('fieldcount', '_c0').where('fieldCount > 12').show(truncate=False)

+----------+
|fieldcount|
+----------+
|        32|
|        30|
|        34|
|        35|
|        35|
|        21|
|        30|
|        32|
|        33|
|        31|
|        33|
|        23|
|        28|
|        17|
|        20|
|        26|
|        23|
|        27|
|        35|
|        15|
+----------+
only showing top 20 rows



In [42]:
# Check for any rows with fewer than 12 columns

titles_single_df.select('fieldcount', '_c0').where('fieldcount < 12').show(truncate=False)

+----------+--------------------+
|fieldcount|                 _c0|
+----------+--------------------+
|         3|TV Show	TV-MA	Cri...|
|        11|80109038	TV Show	...|
|        11|80227114	Movie	Kü...|
|         8|80108518	Anthony ...|
|         3|The Miracle	Dongh...|
|        11|80179762	She-Ra a...|
|         5|Movie	Harold and ...|
|         1|    Black Snake Moan|
|        10|70178618	Shark Ni...|
|        10|80160504	La Doña	...|
|         5|80195049	Movie	Ib...|
|         2|Jen Kirkman	May 2...|
|        10|81077597	Movie	I ...|
|         3|Movie		This fun, ...|
|        10|81047677	Movie	I ...|
|         9|	Kaycie Chase, Da...|
|        10|80223136	Movie	Ma...|
|         8|80020131	Kate Hig...|
|         1|            70213078|
|         5|The 2000s		United...|
+----------+--------------------+
only showing top 20 rows



In [0]:
# Save these to a separate dataframe for later analysis

titles_badrows_df = titles_single_df.where('fieldcount != 12')

In [44]:
# Determine total number of "bad" rows
titles_badrows_df.count()

78

In [0]:
# Set the dataframe without the bad rows
titles_single_df = titles_single_df.where('fieldcount == 12')

In [46]:
# How many current rows in dataframe?

titles_single_df.count()

6113

# Q&A

# More cleaning / prep

Now that we've removed rows that don't fit our basic formatting, let's continue on with making our dataframe more useful.

First, let's create a new column that is a list of all "columns" using the `pyspark.sql.functions.split` function. We'll call this `splitcolumn`.

In [0]:
# Create a list of split strings as a new column named splitcolumn
titles_cleaned_df = titles_single_df.select(F.split('_c0', '\t').alias('splitcolumn'))


In [102]:
# View the contents
titles_cleaned_df.show(truncate=False)

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|splitcolumn                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 

# Creating typed columns

There are several ways to do this operation depending on your needs, but for this dataset we'll explicitly convert the strings in the list column to actual DataFrame columns. The `.getItem()` method returns the value at the specified index of the list column (i.e., of `splitcolumn`).

Note that for `show_id` and `release_year`, we'll also use `.cast()` to specify them as integers rather than just strings.

In [0]:
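# IntegerType is required for the casts on show_id and release_year
from pyspark.sql.types import IntegerType

titles_cleaned_df = titles_cleaned_df.withColumn('show_id', titles_cleaned_df.splitcolumn.getItem(0).cast(IntegerType()))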
titles_cleaned_df = titles_cleaned_df.withColumn('type', titles_cleaned_df.splitcolumn.getItem(1))
titles_cleaned_df = titles_cleaned_df.withColumn('title', titles_cleaned_df.splitcolumn.getItem(2))
titles_cleaned_df = titles_cleaned_df.withColumn('director', titles_cleaned_df.splitcolumn.getItem(3))
titles_cleaned_df = titles_cleaned_df.withColumn('cast', titles_cleaned_df.splitcolumn.getItem(4))
titles_cleaned_df = titles_cleaned_df.withColumn('country', titles_cleaned_df.splitcolumn.getItem(5))
titles_cleaned_df = titles_cleaned_df.withColumn('date_added', titles_cleaned_df.splitcolumn.getItem(6))
titles_cleaned_df = titles_cleaned_df.withColumn('release_year', titles_cleaned_df.splitcolumn.getItem(7).cast(IntegerType()))
titles_cleaned_df = titles_cleaned_df.withColumn('rating', titles_cleaned_df.splitcolumn.getItem(8))
titles_cleaned_df = titles_cleaned_df.withColumn('duration', titles_cleaned_df.splitcolumn.getItem(9))
titles_cleaned_df = titles_cleaned_df.withColumn('listed_in', titles_cleaned_df.splitcolumn.getItem(10))
titles_cleaned_df = titles_cleaned_df.withColumn('description', titles_cleaned_df.splitcolumn.getItem(11))
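
Conceptually, the twelve statements above pair each position in the list with a column name and convert two of the values to integers. The plain-Python sketch below shows that mapping for one row. `to_record`, `COLUMNS`, and `INT_COLUMNS` are hypothetical names, and the row is an abbreviated version of one shown earlier.

```python
COLUMNS = [
    "show_id", "type", "title", "director", "cast", "country",
    "date_added", "release_year", "rating", "duration",
    "listed_in", "description",
]
INT_COLUMNS = {"show_id", "release_year"}


def to_record(fields):
    """Pair each positional field with its column name, casting the integer columns."""
    record = {}
    for name, value in zip(COLUMNS, fields):
        if name in INT_COLUMNS:
            # Fall back to None when the text is not an integer
            try:
                value = int(value)
            except ValueError:
                value = None
        record[name] = value
    return record


# Abbreviated row (long fields shortened for display)
fields = [
    "80142103", "Movie", "Bottom of the World", "Richard Sears",
    "Jena Malone, Douglas Smith", "Canada, United States",
    "March 31, 2017", "2017", "TV-MA", "84 min",
    "Dramas, Independent Movies, Thrillers", "En route to a fresh start...",
]
record = to_record(fields)
```

Driving the mapping from a single list of names keeps the position-to-name pairing in one place, which makes it harder to mislabel a column.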

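Let's now drop the columns that aren't needed anymore. These are `_c0` (the original single-line string), `fieldcount`, and `splitcolumn`. You can drop them one at a time, or pass several column names to a single `.drop()` call. Note that `.drop()` ignores names that are not present, so listing `_c0` and `fieldcount` is harmless even though the earlier `select` kept only `splitcolumn`.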

In [0]:
titles_cleaned_df = titles_cleaned_df.drop('_c0', 'fieldcount', 'splitcolumn')

Let's verify our content, check the row count, and then look at our schema.

In [105]:
titles_cleaned_df.show()

+--------+-------+--------------------+--------------------+--------------------+--------------------+---------------+------------+------+--------+--------------------+--------------------+
| show_id|   type|               title|            director|                cast|             country|     date_added|release_year|rating|duration|           listed_in|         description|
+--------+-------+--------------------+--------------------+--------------------+--------------------+---------------+------------+------+--------+--------------------+--------------------+
|81002212|  Movie|            Bioscope|Gajendra Ahire, V...|Nina Kulkarni, Su...|               India|August 15, 2018|        2015| TV-14| 131 min|Dramas, Internati...|Inspired by class...|
|80176707|  Movie|  For Here or to Go?|   Rucha Humnabadkar|Ali Fazal, Melani...|United States, India|August 15, 2018|        2015| TV-MA| 105 min|Comedies, Dramas,...|A software engine...|
|80199381|  Movie|            Hostiles|        Sco

In [106]:
titles_cleaned_df.count()

6113

In [107]:
titles_cleaned_df.printSchema()

root
 |-- show_id: integer (nullable = true)
 |-- type: string (nullable = true)
 |-- title: string (nullable = true)
 |-- director: string (nullable = true)
 |-- cast: string (nullable = true)
 |-- country: string (nullable = true)
 |-- date_added: string (nullable = true)
 |-- release_year: integer (nullable = true)
 |-- rating: string (nullable = true)
 |-- duration: string (nullable = true)
 |-- listed_in: string (nullable = true)
 |-- description: string (nullable = true)



# Even more cleanup

While we've successfully removed a lot of problem rows, let's look at fixing an issue remaining with the dataset. 

Let's look at the distinct values available in the show `type` column. 

In [108]:
titles_cleaned_df.select('type').distinct().show()

+-------+
|   type|
+-------+
|TV Show|
|  Movie|
|       |
+-------+



In [109]:
titles_cleaned_df.where('type == ""').show()

+--------+----+--------------------+--------------+--------------------+----------------+----------------+------------+------+--------+--------------------+--------------------+
| show_id|type|               title|      director|                cast|         country|      date_added|release_year|rating|duration|           listed_in|         description|
+--------+----+--------------------+--------------+--------------------+----------------+----------------+------------+------+--------+--------------------+--------------------+
|80032097|    |        Romeo Ranjha|Navaniat Singh|Jazzy B, Garry Sa...|   India, Canada|November 1, 2017|        2014| TV-14| 122 min|Action & Adventur...|Two common boys t...|
|81039410|    |         Dry Martina|  Che Sandoval|Antonella Costa, ...|Chile, Argentina|    May 11, 2019|        2018| TV-MA| 100 min|Dramas, Independe...|An odd encounter ...|
|80156212|    |           Wakefield| Robin Swicord|Bryan Cranston, J...|   United States|   March 2, 2019|    

You'll notice that we have 5 rows where the type is not specified. We have a couple of options:

- Drop the rows
- Infer what the `type` is based on other content available in the dataset

You could remove the problem rows using something like:

```
titles_cleaned_df = titles_cleaned_df.where('type == "TV Show" or type == "Movie"')
```

That feels a bit like cheating though - let's consider how else we could determine this largely automatically.

If you look at the `duration` column, you'll notice that there are different meanings behind the entries. We have durations that contain the word *min* (minutes) or the word `Season` for seasons of the show. We can try to use these to properly set the `type` value for these rows.

This problem is a bit tricky though as Spark does not have the concept of updating data within a column without jumping through several hoops. We can however work through this issue using a User Defined Function, or UDF.

## UDF

If you haven't worked with them before, a UDF is a Python function that gets applied to every row of a DataFrame. They are extremely flexible and can help us work through issues such as our current one.

First, let's define our Python function that takes two arguments, a showtype (ie, *Movie*, *TV Show*, or other) and the showduration. Note that these are strings in this case. We'll check if the showtype is already a Movie or TV Show - if so, just return that value. Otherwise, we'll check if the showduration ends with *min*, indicating a Movie. If not, we'll specify it as a TV Show.


In [0]:
# Define the UDF callable

def deriveType(showtype, showduration):
  if showtype == 'Movie' or showtype == 'TV Show':
    return showtype
  else:
    if showduration.endswith('min'):
      return "Movie"
    else:
      return "TV Show"
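
Because the UDF body is ordinary Python, its logic can be checked on a few values before Spark is involved. The sketch below does that with a null-tolerant variant; `derive_type_safe` is a hypothetical name, and returning `None` when the duration is missing is an assumption rather than something the dataset requires.

```python
def derive_type_safe(showtype, showduration):
    """Null-tolerant variant of deriveType (hypothetical helper)."""
    if showtype in ("Movie", "TV Show"):
        return showtype
    if showduration is None:
        # Assumption: with no duration there is nothing to infer from
        return None
    if showduration.strip().endswith("min"):
        return "Movie"
    return "TV Show"


# Quick checks covering each branch
checks = {
    ("", "122 min"): derive_type_safe("", "122 min"),
    ("", "2 Seasons"): derive_type_safe("", "2 Seasons"),
    ("Movie", "1 Season"): derive_type_safe("Movie", "1 Season"),
    ("", None): derive_type_safe("", None),
}
```

The `None` guard matters because calling `.endswith()` on a missing value raises an `AttributeError`, which would surface as a failed Spark task rather than a clear message.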

# Define the UDF for Spark

Now we need to configure the UDF for Spark to access it accordingly.

In [0]:
from pyspark.sql.functions import udf
# StringType declares the return type of the UDF
from pyspark.sql.types import StringType

udfDeriveType = udf(deriveType, StringType())

In [0]:
# Create a new derived column, passing in the appropriate values

titles_cleaned_df = titles_cleaned_df.withColumn('derivedType', udfDeriveType(F.col('type'), F.col('duration')))

In [116]:
# Show the rows where type is an empty string again, examining the derivedType
titles_cleaned_df.where('type == ""').show()

+--------+----+--------------------+--------------+--------------------+----------------+----------------+------------+------+--------+--------------------+--------------------+-----------+
| show_id|type|               title|      director|                cast|         country|      date_added|release_year|rating|duration|           listed_in|         description|derivedType|
+--------+----+--------------------+--------------+--------------------+----------------+----------------+------------+------+--------+--------------------+--------------------+-----------+
|80032097|    |        Romeo Ranjha|Navaniat Singh|Jazzy B, Garry Sa...|   India, Canada|November 1, 2017|        2014| TV-14| 122 min|Action & Adventur...|Two common boys t...|      Movie|
|81039410|    |         Dry Martina|  Che Sandoval|Antonella Costa, ...|Chile, Argentina|    May 11, 2019|        2018| TV-MA| 100 min|Dramas, Independe...|An odd encounter ...|      Movie|
|80156212|    |           Wakefield| Robin Swicord

In [0]:
# Drop the original type column and rename derivedType to type
titles_cleaned_df = titles_cleaned_df.drop('type').withColumnRenamed('derivedType', 'type')

In [118]:
# Verify we only have two types available
titles_cleaned_df.select('type').distinct().show()

+-------+
|   type|
+-------+
|TV Show|
|  Movie|
+-------+



In [119]:
# Verify our row count is the same
titles_cleaned_df.count()

6113

# Saving data for analysis / further processing

The last step of our data cleaning is to save the cleaned DataFrame to a file. If you plan to do any further analysis or processing using Spark, Parquet is highly recommended. Other formats are available depending on your needs, but Spark is optimized to take advantage of Parquet.

In [0]:
# Save the data

titles_cleaned_df.write.parquet('/tmp/cleaned_data.parquet', mode='overwrite')

# Challenges

We've looked at several data cleaning operations using Spark. Here are some other challenges to consider within the dataset:

1) *Splitting names* - You may have noticed that the names of the cast and directors are combined into a single comma-separated string. Consider how you would turn that data into a list / array column to more easily answer detailed questions (which shows have the largest cast, etc.).

2) *Splitting names further* - Consider taking any of the name fields and splitting it into first name, last name, etc. Take special consideration about how you would handle initials, names with more than 3 components, etc.

3) *Parsing dates* - Look at the `date_added` field and determine if and how you could reliably convert this to an actual datetime field.
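
As one possible starting point for challenges 1 and 3, the row-level logic can be prototyped in plain Python before deciding how to express it in Spark. The helper names below are hypothetical, and the date format is an assumption based on values such as `March 31, 2017` seen in the sample rows; month names are parsed using the default English locale.

```python
from datetime import datetime


def split_names(names):
    """Turn a comma-separated string of names into a list of trimmed names."""
    return [name.strip() for name in names.split(",") if name.strip()]


def parse_date_added(text):
    """Parse dates like 'March 31, 2017'; return None when the text does not match."""
    try:
        return datetime.strptime(text.strip(), "%B %d, %Y").date()
    except (AttributeError, ValueError):
        return None


cast_list = split_names("Jena Malone, Douglas Smith, Ted Levine")
added = parse_date_added("March 31, 2017")
unparsed = parse_date_added("not a date")
```

Returning `None` for values that do not match makes it easy to count how many rows fail to parse, which helps judge whether the conversion is reliable.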

# Last Q&A