<p align="center">
<img src="https://github.com/datacamp/python-live-training-template/blob/master/assets/datacamp.svg?raw=True" alt = "DataCamp icon" width="50%">
</p>
<br><br>

## **Cleaning Data with PySpark**

Welcome to this hands-on training where we will investigate cleaning a dataset using Python and Apache Spark! During this training, we will cover:

* Efficiently loading data into a Spark DataFrame
* Handling errant rows / columns from the dataset, including comments, missing data, combined or misinterpreted columns, etc.
* Using Python UDFs to run advanced transformations on data
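
As a rough illustration of the last point, the sketch below wraps an ordinary Python function as a Spark UDF. It assumes that `duration` values for movies look like `90 min`, which is an assumption about the raw data rather than something shown in this notebook. The names `duration_minutes` and `duration_minutes_udf` are hypothetical. The import guard is only there so the snippet also runs where PySpark is not installed.

```python
import re
from typing import Optional

# Assumed format: an integer followed by "min", e.g. "90 min".
_MINUTES = re.compile(r"^\s*(\d+)\s*min\s*$", re.IGNORECASE)


def duration_minutes(raw: Optional[str]) -> Optional[int]:
    """Return the number of minutes in a string like '90 min', else None."""
    if raw is None:
        return None
    match = _MINUTES.match(raw)
    return int(match.group(1)) if match else None


try:
    # Wrap the plain function so Spark can apply it to a column.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    duration_minutes_udf = udf(duration_minutes, IntegerType())
except ImportError:
    # PySpark is not available; the plain function can still be tested.
    duration_minutes_udf = None
```

With a DataFrame `df` that has a `duration` column, this might be applied as `df.withColumn("duration_min", duration_minutes_udf(df["duration"]))`. Python UDFs move every row between the JVM and Python, so built-in Spark functions are usually faster when they can express the same logic.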


## **The Dataset**

The dataset used in this webinar is a set of CSV files named `netflix_titles_raw*.csv`. These contain information related to the movies and television shows available on Netflix. These are the *dirty* versions of the dataset - we will cover the individual problems as we work through the notebook.

Given that this is a data cleaning webinar, let's look at our intended result. The cleaned dataset will contain the following information:

- `show_id`: A unique identifier for the show
- `type`: The type of content, `Movie` or `TV Show`
- `title`: The title of the content
- `director`: The director (or directors)
- `cast`: The cast
- `country`: Country (or countries) where the content is available
- `date_added`: Date added to Netflix
- `release_year`: Year of content release
- `rating`: Content rating
- `duration`: The duration
- `listed_in`: The genres the content is listed in
- `description`: A description of the content
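
One way to load these columns efficiently is to pass an explicit schema instead of asking Spark to infer one, since inference needs an extra pass over the files. The sketch below builds a DDL schema string from the column list above. The column names come from that list, but every type is an assumption (for example, `date_added` is kept as text because its format is not stated here), and `NETFLIX_COLUMNS`, `to_ddl`, and `NETFLIX_SCHEMA` are hypothetical names.

```python
# (column name, Spark SQL type) pairs. Names follow the list above; types are assumed.
NETFLIX_COLUMNS = [
    ("show_id", "STRING"),
    ("type", "STRING"),
    ("title", "STRING"),
    ("director", "STRING"),
    ("cast", "STRING"),
    ("country", "STRING"),
    ("date_added", "STRING"),   # left as text until the date format is known
    ("release_year", "INT"),
    ("rating", "STRING"),
    ("duration", "STRING"),
    ("listed_in", "STRING"),
    ("description", "STRING"),
]


def to_ddl(columns):
    """Join (name, type) pairs into a DDL schema string, backtick-quoting each name."""
    return ", ".join(f"`{name}` {dtype}" for name, dtype in columns)


NETFLIX_SCHEMA = to_ddl(NETFLIX_COLUMNS)
```

With an active session, the string could then be passed to the reader, for example `spark.read.csv(path, schema=NETFLIX_SCHEMA)`, where `path` points at wherever the `netflix_titles_raw*.csv` files are stored.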



## **Setting up a PySpark session**

Before we can start processing our data, we need to configure a PySpark session for Google Colab. Note that these steps are specific to running Spark and Python in Colab and are likely not required in other environments.

In [0]:
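# Run this code as is to install Spark in Colab.
# If the Spark download fails, note that older releases are moved from downloads.apache.org to archive.apache.org/dist/spark/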
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
!tar xf spark-2.4.5-bin-hadoop2.7.tgz
!pip install -q findspark

In [0]:
# Run this code to set up the environment
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.5-bin-hadoop2.7"

In [0]:
# Finally, set up our Spark session
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

## **Getting started**

In [5]:
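# Download the Airbnb sample CSV from the template repository into the local environment
!wget -O /tmp/airbnb.csv 'https://github.com/datacamp/python-live-training-template/blob/master/data/airbnb.csv?raw=True'
# Load it into a Spark DataFrame, taking column names from the header row and inferring column types
airbnb = spark.read.csv('/tmp/airbnb.csv', inferSchema=True, header=True)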

--2020-06-02 05:09:48--  https://github.com/datacamp/python-live-training-template/blob/master/data/airbnb.csv?raw=True
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/datacamp/python-live-training-template/raw/master/data/airbnb.csv [following]
--2020-06-02 05:09:49--  https://github.com/datacamp/python-live-training-template/raw/master/data/airbnb.csv
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/datacamp/python-live-training-template/master/data/airbnb.csv [following]
--2020-06-02 05:09:49--  https://raw.githubusercontent.com/datacamp/python-live-training-template/master/data/airbnb.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.co

In [17]:
# Peek at the first 10 lines of the raw file
!head /tmp/airbnb.csv

,listing_id,name,host_id,host_name,neighbourhood_full,coordinates,room_type,price,number_of_reviews,last_review,reviews_per_month,availability_365,rating,number_of_stays,5_stars,listing_added
0,13740704,"Cozy,budget friendly, cable inc, private entrance!",20583125,Michel,"Brooklyn, Flatlands","(40.63222, -73.93398)",Private room,45$,10,2018-12-12,0.7,85,4.100953568650489,12.0,0.6094315047128488,2018-06-08
1,22005115,Two floor apartment near Central Park,82746113,Cecilia,"Manhattan, Upper West Side","(40.78761, -73.96862)",Entire home/apt,135$,1,2019-06-30,1.0,145,3.36759984148828,1.2,0.7461345904105886,2018-12-25
2,21667615,Beautiful 1BR in Brooklyn Heights,78251,Leslie,"Brooklyn, Brooklyn Heights","(40.7007, -73.99517)",Entire home/apt,150$,0,,,65,,,,2018-08-15
3,6425850,"Spacious, charming studio",32715865,Yelena,"Manhattan, Upper West Side","(40.79169, -73.97498)",Entire home/apt,86$,5,2017-09-23,0.13,0,4.76320299429773,6.0,0.7699470781766179,2017-03-20
4,22986519,Bedroom on the liv
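
The raw rows above show two columns that would need cleaning before analysis: `coordinates` combines latitude and longitude in one string, and `price` carries a trailing `$`. The sketch below shows one possible way to parse both in plain Python, assuming every value follows the formats visible in these sample rows. The names `split_coordinates` and `parse_price` are hypothetical.

```python
import re
from typing import Optional, Tuple

# "(40.63222, -73.93398)" -> two signed decimal numbers inside parentheses.
_COORDS = re.compile(r"^\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)$")
# "45$" -> digits followed by a dollar sign.
_PRICE = re.compile(r"^(\d+)\s*\$$")


def split_coordinates(raw: Optional[str]) -> Optional[Tuple[float, float]]:
    """Return (latitude, longitude) from a '(lat, lon)' string, else None."""
    if raw is None:
        return None
    match = _COORDS.match(raw.strip())
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def parse_price(raw: Optional[str]) -> Optional[int]:
    """Return the integer amount from a string like '45$', else None."""
    if raw is None:
        return None
    match = _PRICE.match(raw.strip())
    return int(match.group(1)) if match else None
```

Either function could be wrapped as a Python UDF. In Spark itself, the same idea can likely be expressed with built-in functions such as `regexp_extract` followed by a cast, which avoids the overhead of a UDF.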

In [16]:
# Summary statistics (count, mean, stddev, min, max) for the numeric and string columns
airbnb.describe().show()

+-------+--------------------+--------------------+--------------------+-------------------+--------------------+--------------------+--------------------+--------------------+-------------------+------------------+------------------+--------------------+------------------+------------------+------------------+-------------------+-------------+
|summary|                 _c0|          listing_id|                name|            host_id|           host_name|  neighbourhood_full|         coordinates|           room_type|              price| number_of_reviews|       last_review|   reviews_per_month|  availability_365|            rating|   number_of_stays|            5_stars|listing_added|
+-------+--------------------+--------------------+--------------------+-------------------+--------------------+--------------------+--------------------+--------------------+-------------------+------------------+------------------+--------------------+------------------+------------------+-------------

In [8]:
# Show the first 20 rows without truncating long values
airbnb.show(truncate=False)

+---+----------+--------------------------------------------------+---------+------------+-----------------------------+------------------------------+---------------+-----+-----------------+-----------+-----------------+----------------+------------------+------------------+------------------+-------------+
|_c0|listing_id|name                                              |host_id  |host_name   |neighbourhood_full           |coordinates                   |room_type      |price|number_of_reviews|last_review|reviews_per_month|availability_365|rating            |number_of_stays   |5_stars           |listing_added|
+---+----------+--------------------------------------------------+---------+------------+-----------------------------+------------------------------+---------------+-----+-----------------+-----------+-----------------+----------------+------------------+------------------+------------------+-------------+
|0  |13740704  |Cozy,budget friendly, cable inc, private entrance!|205

In [0]:
# Register the DataFrame as a temporary view so it can be queried with SQL
airbnb.createOrReplaceTempView("airbnb")

In [11]:
# Count the distinct host names in the dataset
spark.sql("select distinct host_name from airbnb").count()

4109

In [14]:
# URL of the Spark web UI for this session
spark.sparkContext.uiWebUrl

'http://fab795d401eb:4040'