# Initializing Spark
********************************************

In [1]:
import os
import sys

# Make the driver and the workers use the same Python interpreter as this notebook
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

import findspark
findspark.init()  # locate the Spark installation and make pyspark importable

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession, SQLContext

conf = SparkConf().setAppName('appName').setMaster('local')
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
# Legacy entry point (deprecated since Spark 3.0); SparkSession offers the same functionality
sqlcontext = SQLContext(sc)


# Knowledge Questions
*******************************************************************

### Spark Assumptions and Spark Optimization

#### Some Facts:
##### 1. Spark DataFrames are immutable
##### 2. Spark can run on clusters of multiple machines
##### 3. It is always clear how the values were created (lineage)
##### 4. Transformations are lazy; actions are eager
##### 5. A transformation results in a new, modified DataFrame
##### 6. The `explain()` function shows the physical plan Spark has created
##### 7. If data has to be filtered and sorted, the filter operation is executed before the sort operation
##### 8. Functional programming
##### 9. Distributive property and homomorphism
##### 10. By default, Spark saves data in several partitions
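
Fact 9 is stated tersely. One common reading, offered here as an interpretation rather than as the author's definition, is that an aggregation can be computed per partition and the partial results combined afterwards. The sketch below uses plain Python lists as stand-ins for partitions; the numbers and the `merge` helper are made up for illustration.

```python
from functools import reduce

# Hypothetical data split into three partitions, as a cluster might hold it.
partitions = [[1, 2, 3], [4, 5], [6]]
all_values = [v for part in partitions for v in part]

# sum and max are distributive: per-partition results combine with the same operation.
total = sum(sum(part) for part in partitions)
overall_max = max(max(part) for part in partitions)

# mean is not distributive on its own, but (sum, count) pairs are:
# each partition maps to a pair, and pairs are merged component-wise.
def merge(a, b):
    return (a[0] + b[0], a[1] + b[1])

partial_pairs = [(sum(part), len(part)) for part in partitions]
pair_sum, pair_count = reduce(merge, partial_pairs)
mean = pair_sum / pair_count

print(total, overall_max, mean)
```

The combined results match what a single pass over `all_values` would give, which is why such aggregations can run in parallel.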





# Programming 
*********************************************************************

### Selection

You are given a data set called "German/US Trending YouTube Video Statistics" that contains, among others, the following columns:

    video_id: contains the video ID
    trending_date: contains the date on which the video has been trending.
    title: the title of the video
    channel_title: the name of the channel that published the video.
    category_id: the category id of the video
    publish_time: publish date and time of the video
    tags: the tags of the video.
    views: number of views.
    likes: number of likes.
    dislikes: number of dislikes.
You are asked to select only the following columns into a new dataframe: "video_id", "trending_date", "title", "views".

Please write the corresponding line of code. The code must be a single line containing a single instruction.

Note: The name of the dataframe is "dataframe".

The name of the new dataframe is "videoStats".

In [24]:
dataframe = spark.read.csv("USvideos.csv", header=True, inferSchema=True)

In [25]:
dataframe.printSchema()

root
 |-- video_id: string (nullable = true)
 |-- trending_date: string (nullable = true)
 |-- title: string (nullable = true)
 |-- channel_title: string (nullable = true)
 |-- category_id: string (nullable = true)
 |-- publish_time: string (nullable = true)
 |-- tags: string (nullable = true)
 |-- views: string (nullable = true)
 |-- likes: string (nullable = true)
 |-- dislikes: string (nullable = true)
 |-- comment_count: string (nullable = true)
 |-- thumbnail_link: string (nullable = true)
 |-- comments_disabled: string (nullable = true)
 |-- ratings_disabled: string (nullable = true)
 |-- video_error_or_removed: string (nullable = true)
 |-- description: string (nullable = true)
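
Every column in the schema above, including `views` and `likes`, was inferred as string even though `inferSchema=True` was set. One plausible cause (an assumption, not verified against this file) is that quoted text fields such as `description` contain line breaks, which the default CSV reader splits into separate rows. The sketch below passes the `multiLine` and `escape` options of `spark.read.csv`; the function name `read_trending_csv` is hypothetical.

```python
def read_trending_csv(spark, path):
    """Read a trending-videos CSV whose quoted fields may span several lines.

    `spark` is expected to be a pyspark.sql.SparkSession. The function name
    and the chosen option values are illustrative assumptions, not part of
    the original notebook.
    """
    return spark.read.csv(
        path,
        header=True,       # first line holds the column names
        inferSchema=True,  # let Spark guess column types from the data
        multiLine=True,    # allow a quoted field to contain line breaks
        escape='"',        # a doubled quote inside a quoted field is a literal quote
    )
```

A call such as `read_trending_csv(spark, "USvideos.csv").printSchema()` would show whether `views` is then inferred as a numeric type.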



In [26]:
videoStats = dataframe.select("video_id", "trending_date", "title", "views")

In [27]:
videoStats.show()

+-----------+-------------+--------------------+-------+
|   video_id|trending_date|               title|  views|
+-----------+-------------+--------------------+-------+
|2kyS6SvSYSE|     17.14.11|WE WANT TO TALK A...| 748374|
|1ZAPwfrtAFY|     17.14.11|The Trump Preside...|2418783|
|5qpjK5DgCt4|     17.14.11|Racist Superman |...|3191434|
|puqaWrEC7tY|     17.14.11|Nickelback Lyrics...| 343168|
|d380meD0W0M|     17.14.11|I Dare You: GOING...|2095731|
|gHZ1Qz0KiKM|     17.14.11|2 Weeks with iPho...| 119180|
|39idVpFF7NQ|     17.14.11|Roy Moore & Jeff ...|2103417|
|nc99ccSXST0|     17.14.11|5 Ice Cream Gadge...| 817732|
|jr9QtXwC9vc|     17.14.11|The Greatest Show...| 826059|
|TUmyygCMMGA|     17.14.11|Why the rise of t...| 256426|
|9wRQljFNDW8|     17.14.11|Dion Lewis' 103-Y...|  81377|
|VifQlJit6A0|     17.14.11|(SPOILERS) 'Shiva...| 104578|
|5E4ZBSInqUU|     17.14.11|Marshmello - Bloc...| 687582|
|GgVmn66oK_A|     17.14.11|Which Countries A...| 544770|
|TaTleo4cOs8|     17.14.11|SHOP

### Columns Creation

Now that we have selected the columns we are interested in, we would like to create a new column named "new" inside the "videoStats" dataframe that contains the views divided by 100.

Example:

| row | views | new |
|-----|-------|-----|
| 1   | 10000 | 100 |

In [28]:
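# withColumn returns a new DataFrame and leaves videoStats unchanged;
# assign the result (videoStats = videoStats.withColumn(...)) to keep the column
videoStats.withColumn("new", videoStats["views"] / 100).show()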

+-----------+-------------+--------------------+-------+--------+
|   video_id|trending_date|               title|  views|     new|
+-----------+-------------+--------------------+-------+--------+
|2kyS6SvSYSE|     17.14.11|WE WANT TO TALK A...| 748374| 7483.74|
|1ZAPwfrtAFY|     17.14.11|The Trump Preside...|2418783|24187.83|
|5qpjK5DgCt4|     17.14.11|Racist Superman |...|3191434|31914.34|
|puqaWrEC7tY|     17.14.11|Nickelback Lyrics...| 343168| 3431.68|
|d380meD0W0M|     17.14.11|I Dare You: GOING...|2095731|20957.31|
|gHZ1Qz0KiKM|     17.14.11|2 Weeks with iPho...| 119180|  1191.8|
|39idVpFF7NQ|     17.14.11|Roy Moore & Jeff ...|2103417|21034.17|
|nc99ccSXST0|     17.14.11|5 Ice Cream Gadge...| 817732| 8177.32|
|jr9QtXwC9vc|     17.14.11|The Greatest Show...| 826059| 8260.59|
|TUmyygCMMGA|     17.14.11|Why the rise of t...| 256426| 2564.26|
|9wRQljFNDW8|     17.14.11|Dion Lewis' 103-Y...|  81377|  813.77|
|VifQlJit6A0|     17.14.11|(SPOILERS) 'Shiva...| 104578| 1045.78|
|5E4ZBSInq

### Statistics


You are provided with a dataframe called "videoStats" about "German Trending YouTube Video Statistics" that contains the following columns:

    video_id: contains the video ID
    trending_date: contains the date on which the video has been trending.
    title: the title of the video
    views: number of views.

Note: For the following questions, you do not need to assign the results to a new variable.

1. Please write the line of code to calculate the mean of the views:

2. Please write the line of code to calculate the population standard deviation of the views:

3. Please write the line of code to find the maximum of the views:



In [29]:
videoStats.columns

['video_id', 'trending_date', 'title', 'views']

In [30]:
videoStats.agg({"views":"mean"}).show()

+------------------+
|        avg(views)|
+------------------+
|2360784.6382573447|
+------------------+



In [32]:
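# Caution: "std" computes the sample standard deviation; the population
# standard deviation asked for above is the "stddev_pop" aggregate
videoStats.agg({"views":"std"}).show()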

+-----------------+
|    stddev(views)|
+-----------------+
|7394113.759703937|
+-----------------+
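
The question asks for the population standard deviation, while `std` in Spark is the sample variant. Spark exposes the two as the aggregates `stddev_pop` and `stddev_samp`. The plain-Python sketch below uses the standard `statistics` module, not Spark, to show how the two differ; the five view counts are copied from the first rows shown earlier and serve only as sample data.

```python
import math
import statistics

# A handful of view counts, used only as sample data for this sketch.
views = [748374, 2418783, 3191434, 343168, 2095731]
n = len(views)

population_sd = statistics.pstdev(views)  # divides the squared deviations by n
sample_sd = statistics.stdev(views)       # divides the squared deviations by n - 1

# The two values differ only by the factor sqrt((n - 1) / n).
assert math.isclose(population_sd, sample_sd * math.sqrt((n - 1) / n))
print(population_sd < sample_sd)
```

With tens of thousands of rows the factor is very close to 1, so the two results are nearly identical, but they answer different questions.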



In [33]:
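# Caution: "views" was read as a string column (see printSchema above), so this
# returns the lexicographic maximum, not the largest number; cast the column
# to a numeric type first to get the numeric maximum
videoStats.agg({"views":"max"}).show()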



+----------+
|max(views)|
+----------+
|     99999|
+----------+
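
The maximum of 99999 is smaller than the mean reported above, which is consistent with `views` being compared as text rather than as a number. The plain-Python sketch below reproduces the effect with a few view counts from the earlier output. In PySpark, the analogous fix would presumably be to cast the column before aggregating, for example with `videoStats["views"].cast("long")`.

```python
# View counts as strings, the way the schema above reports the column.
views_as_text = ["748374", "2418783", "3191434", "81377"]

# Strings compare character by character, so "8..." beats "3..." regardless of length.
text_max = max(views_as_text)

# Converting to int first compares the actual magnitudes.
numeric_max = max(int(v) for v in views_as_text)

print(text_max, numeric_max)
```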



### Grouping and Aggregation

You are provided with a dataframe called "videoStats" about "German Trending YouTube Video Statistics" that contains the following columns:

    video_id: contains the video ID
    trending_date: contains the date on which the video has been trending.
    title: the title of the video
    views: number of views.


Group the dataframe by the "trending_date" column and aggregate by the mean of the "views" column. Store the result in a new dataframe named "videoStatGroup".
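
As a rough mental model of what grouping followed by a mean aggregation computes, the plain-Python sketch below performs the same two steps on a tiny list of rows. The rows and variable names are made up for illustration, and Spark's distributed execution is not modeled.

```python
from collections import defaultdict

# Hypothetical (trending_date, views) rows; the values are made up for illustration.
rows = [
    ("17.14.11", 100),
    ("17.14.11", 300),
    ("17.15.11", 50),
]

# Group: collect the views belonging to each trending_date.
views_by_date = defaultdict(list)
for trending_date, views in rows:
    views_by_date[trending_date].append(views)

# Aggregate: reduce each group to its mean.
mean_views_by_date = {
    date: sum(values) / len(values) for date, values in views_by_date.items()
}

print(mean_views_by_date)
```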

In [43]:
videoStats.columns

['video_id', 'trending_date', 'title', 'views']

In [44]:
videoStats.show(60)

+--------------------+-------------+--------------------+--------+
|            video_id|trending_date|               title|   views|
+--------------------+-------------+--------------------+--------+
|         2kyS6SvSYSE|     17.14.11|WE WANT TO TALK A...|  748374|
|         1ZAPwfrtAFY|     17.14.11|The Trump Preside...| 2418783|
|         5qpjK5DgCt4|     17.14.11|Racist Superman |...| 3191434|
|         puqaWrEC7tY|     17.14.11|Nickelback Lyrics...|  343168|
|         d380meD0W0M|     17.14.11|I Dare You: GOING...| 2095731|
|         gHZ1Qz0KiKM|     17.14.11|2 Weeks with iPho...|  119180|
|         39idVpFF7NQ|     17.14.11|Roy Moore & Jeff ...| 2103417|
|         nc99ccSXST0|     17.14.11|5 Ice Cream Gadge...|  817732|
|         jr9QtXwC9vc|     17.14.11|The Greatest Show...|  826059|
|         TUmyygCMMGA|     17.14.11|Why the rise of t...|  256426|
|         9wRQljFNDW8|     17.14.11|Dion Lewis' 103-Y...|   81377|
|         VifQlJit6A0|     17.14.11|(SPOILERS) 'Shiva...|  104

In [45]:
videoStatGroup = videoStats.groupBy("trending_date").agg({"views":"mean"})
videoStatGroup.show()

+--------------------+------------------+
|       trending_date|        avg(views)|
+--------------------+------------------+
|            18.08.05|       4717125.855|
|            17.20.11|       1285000.555|
|            17.09.12|       1472959.565|
|            18.14.02|1711845.1859296483|
|            18.20.03| 2572119.185929648|
|            18.04.05|        3779207.67|
|            18.21.02|        1271300.39|
| 2018\n\nStill ha...|              null|
|             fitness|              null|
|            18.30.04|        4089305.62|
| Twitter &amp; Wi...|              null|
|            18.06.05|       4192681.855|
|            18.11.02|       2025360.815|
|            18.15.03|2139255.1959798997|
|            18.08.06|       5542179.725|
|            18.09.06|        5458786.25|
|            18.08.01|        1159250.43|
|            18.20.01|         750397.96|
| at 10/9c on Nati...|              null|
| Then and Now: Pr...|              null|
+--------------------+------------