# YouTube Data Analysis

This use case analyzes publicly available YouTube data using Apache Spark. The data set is described below under the heading Data Set Description.
Using that data set, we will draw out some insights, such as the top 10 rated videos on YouTube and who uploaded the most videos.

**Data Set Description**

- **Column 1:** Video id of 11 characters.

- **Column 2:** Uploader of the video.

- **Column 3:** Interval between the day YouTube was established and the date the video was uploaded.

- **Column 4:** Category of the video.

- **Column 5:** Length of the video.

- **Column 6:** Number of views for the video.

- **Column 7:** Rating on the video.

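- **Column 8:** Number of ratings given for the video.

- **Column 9:** Number of comments on the video.

- **Column 10 onward:** IDs of videos related to the uploaded video. When the file is loaded, these fill every column from the tenth on (`_c9` through `_c22` in the schema printed below).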
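The column layout above can be sketched in plain Python, independent of Spark. This is a minimal illustration and not part of the original notebook: the field names (`uploader`, `age`, `length`, `views`, `num_ratings`, `num_comments`, `related_ids`) are hypothetical labels chosen here, the sample line is an abbreviated stand-in modeled on the first row shown later in the notebook, and each line is assumed to contain at least nine tab-separated fields.

```python
# Hypothetical names for the first nine columns; the notebook itself only
# names video_id, category, and Rating.
FIELD_NAMES = [
    "video_id", "uploader", "age", "category", "length",
    "views", "rating", "num_ratings", "num_comments",
]


def parse_record(line):
    """Split one tab-delimited line into named fields plus related video IDs."""
    parts = line.rstrip("\n").split("\t")
    record = dict(zip(FIELD_NAMES, parts[:9]))
    # Cast the numeric columns (assumes they are present and well formed).
    record["age"] = float(record["age"])
    record["rating"] = float(record["rating"])
    for key in ("length", "views", "num_ratings", "num_comments"):
        record[key] = int(record[key])
    # Everything after the ninth field is treated as a related video ID.
    record["related_ids"] = parts[9:]
    return record


# Abbreviated, illustrative line (only two related IDs are kept).
sample = "\t".join([
    "QuRYeRnAuXM", "EvilSquirrelPictures", "1135", "Pets & Animals",
    "252", "1075", "4.96", "46", "86", "gFa1YMEJFag", "nRcovJn9xHg",
])
record = parse_record(sample)
print(record["category"], record["rating"], len(record["related_ids"]))
```

Keeping the related IDs in a single list reflects that their number can vary from row to row, which is also why Spark ends up with many trailing string columns when it loads the file.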

## Initializing Spark Session

In [32]:
import findspark
findspark.init()
import pyspark

In [33]:
from pyspark.sql import SparkSession

In [34]:
spark = SparkSession.builder.appName('usecase_9').getOrCreate()
sc = spark.sparkContext
sc.setLogLevel('ERROR')

### YouTube Data

In [35]:
# The file is tab-delimited with no header row, so Spark names the columns _c0, _c1, ...
youtube_df = spark.read.format('csv').options(header=False, inferSchema=True, delimiter='\t').load('youtubedata.txt')

In [36]:
youtube_df.count()

4100

In [37]:
youtube_df.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: double (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: integer (nullable = true)
 |-- _c5: integer (nullable = true)
 |-- _c6: double (nullable = true)
 |-- _c7: integer (nullable = true)
 |-- _c8: integer (nullable = true)
 |-- _c9: string (nullable = true)
 |-- _c10: string (nullable = true)
 |-- _c11: string (nullable = true)
 |-- _c12: string (nullable = true)
 |-- _c13: string (nullable = true)
 |-- _c14: string (nullable = true)
 |-- _c15: string (nullable = true)
 |-- _c16: string (nullable = true)
 |-- _c17: string (nullable = true)
 |-- _c18: string (nullable = true)
 |-- _c19: string (nullable = true)
 |-- _c20: string (nullable = true)
 |-- _c21: string (nullable = true)
 |-- _c22: string (nullable = true)



In [38]:
youtube_df.show(3)

+-----------+--------------------+------+--------------+---+----+----+---+---+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
|        _c0|                 _c1|   _c2|           _c3|_c4| _c5| _c6|_c7|_c8|        _c9|       _c10|       _c11|       _c12|       _c13|       _c14|       _c15|       _c16|       _c17|       _c18|       _c19|       _c20|       _c21|       _c22|
+-----------+--------------------+------+--------------+---+----+----+---+---+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
|QuRYeRnAuXM|EvilSquirrelPictures|1135.0|Pets & Animals|252|1075|4.96| 46| 86|gFa1YMEJFag|nRcovJn9xHg|3TYqkBJ9YRk|rSJ8QZWBegU|0TZqX5MbXMA|UEvVksP91kg|ZTopArY7Nbg|0RViGi2Rne8|HT_QlOJbDpg|YZev1imoxX8|8qQrrfUTmh0|zQ83d_D2MGs|u6_DQQjLsAw|73Wz9CQFDtE|
|3TYqkBJ9YRk

In [39]:
from pyspark.sql.functions import desc

### Problem Statement 1: Find the top five categories by number of videos uploaded.

In [40]:
youtube_df = youtube_df.withColumnRenamed('_c3','category')

In [41]:
youtube_df.groupBy('category').count().orderBy(desc('count')).show(5)

+---------------+-----+
|       category|count|
+---------------+-----+
|  Entertainment|  908|
|          Music|  862|
|         Comedy|  414|
| People & Blogs|  398|
|News & Politics|  333|
+---------------+-----+
only showing top 5 rows
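The grouping pattern used here generalizes to any column. As a hedged sketch, the hypothetical helper `top_n_by_count` below (not part of the original notebook) wraps the same `groupBy`, `count`, and `orderBy` calls. It assumes `df` is a Spark DataFrame and that no existing column is already named `count`.

```python
def top_n_by_count(df, column, n=5):
    """Return a DataFrame with the n most frequent values of `column`.

    `df` is assumed to be a pyspark.sql.DataFrame. The result has two
    columns: `column` and `count`, sorted by `count` in descending order.
    """
    return (
        df.groupBy(column)                   # one group per distinct value
          .count()                           # adds a `count` column
          .orderBy("count", ascending=False) # largest groups first
          .limit(n)
    )
```

For example, `top_n_by_count(youtube_df, '_c1', n=1).show()` would list the uploader with the most videos, assuming `_c1` still holds the uploader as in the schema above.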



### Problem Statement 2: Find the top 10 rated videos on YouTube.

In [42]:
youtube_df = youtube_df.withColumnRenamed('_c0','video_id').withColumnRenamed('_c6','Rating')

In [43]:
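# Spark resolves column names case-insensitively by default, so 'rating' matches the 'Rating' column
youtube_df.select('video_id','rating').show(3)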

+-----------+------+
|   video_id|rating|
+-----------+------+
|QuRYeRnAuXM|  4.96|
|3TYqkBJ9YRk|   5.0|
|rSJ8QZWBegU|  4.31|
+-----------+------+
only showing top 3 rows



In [44]:
# Summing per video_id leaves the rating unchanged as long as each video_id appears only once
youtube_df.groupBy('video_id').agg({'Rating':'sum'}).orderBy(desc('sum(Rating)')).show(10)

+-----------+-----------+
|   video_id|sum(Rating)|
+-----------+-----------+
|FG-j841ezPw|        5.0|
|GCeXFaL24UA|        5.0|
|ENmQiCV_N1c|        5.0|
|4dK-9jLPGqc|        5.0|
|i8Jtlmtz6rE|        5.0|
|TaLUDgZTp6E|        5.0|
|Hg47-CwiP-I|        5.0|
|nUu52z7Jo6w|        5.0|
|pEbK3C7bZxU|        5.0|
|ZzuGxkWLops|        5.0|
+-----------+-----------+
only showing top 10 rows
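Every video in this top 10 has a rating of 5.0, so the ordering among them is not meaningful. One possible refinement, sketched below under the assumption that `_c7` holds the number of ratings (Column 8 in the data set description), is to break ties by how many ratings a video received. The helper name `top_rated` and its parameters are hypothetical and not part of the original notebook.

```python
def top_rated(df, n=10, rating_col="Rating", votes_col="_c7"):
    """Return the n highest-rated rows, breaking ties by number of ratings.

    `df` is assumed to be a pyspark.sql.DataFrame that contains both
    `rating_col` and `votes_col`. Both sort keys are descending, so among
    videos with the same rating, the one with more ratings comes first.
    """
    return df.orderBy([rating_col, votes_col], ascending=[False, False]).limit(n)
```

Calling `top_rated(youtube_df).select('video_id', 'Rating', '_c7').show()` after the renames above would display the result. Whether the number of ratings is the right tie-breaker depends on the question being asked; number of views (`_c5`) is another candidate.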



## Closing Spark Session

In [45]:
spark.stop()