# YouTube Analysis and Recommender System

In this notebook, we perform a detailed analysis of YouTube data to build a recommender system. YouTube, being one of the largest video-sharing platforms, holds a wealth of information that can be leveraged to understand user preferences, predict trends, and suggest relevant content.

## Overview

This project analyzes YouTube videos, their metadata, and user interactions, and builds a system that recommends videos based on factors such as view history, user preferences, and content similarity.

## Objectives

- **Data Collection**: Retrieve YouTube video data, including video titles, descriptions, view counts, likes, and comments.
- **Data Preprocessing**: Clean and prepare the dataset for analysis.
- **Exploratory Data Analysis (EDA)**: Investigate patterns, trends, and insights from the dataset.
- **Recommender System**: Build a content-based or collaborative filtering recommender system to suggest videos to users.
- **Evaluation**: Evaluate the recommender system using appropriate metrics like accuracy, precision, and recall.

## Dataset

The dataset used in this analysis is collected from publicly available YouTube video statistics. It includes various features such as:

- Video Title
- Channel Name
- View Count
- Like Count
- Comment Count
- Video Description
- Tags
- Published Date

## Methodology

We will employ machine learning algorithms to build our recommender system. Depending on the approach, we might use:
- **Content-Based Filtering**: Recommending videos based on similarities between the content of videos.
- **Collaborative Filtering**: Recommending videos based on user interaction history and preferences.
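
To make the content-based option concrete, here is a minimal pure-Python sketch that scores videos by the cosine similarity of their TF-IDF vectors. It is illustrative only and not necessarily the method used later in this notebook: the helper names (`tokenize`, `tfidf_vectors`, `cosine`, `most_similar`) and the sample texts are assumptions made up for this example, and a real pipeline would more likely build the vectors from the `title`, `description`, and `video_tags` columns with a library implementation.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            # Term frequency times a smoothed inverse document frequency.
            term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(index, vectors, top_n=2):
    """Indices of the top_n videos most similar to the video at `index`."""
    scores = [(cosine(vectors[index], v), j)
              for j, v in enumerate(vectors) if j != index]
    scores.sort(key=lambda pair: pair[0], reverse=True)
    return [j for _, j in scores[:top_n]]


# Made-up sample texts standing in for video titles/tags.
videos = [
    "python spark tutorial big data",
    "spark dataframe tutorial python",
    "funny cat compilation",
    "cat and dog funny moments",
]
vectors = tfidf_vectors(videos)
print(most_similar(0, vectors, top_n=1))
```

Collaborative filtering would instead start from a user-by-video interaction matrix, which the feature list above does not describe, so it is not sketched here.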

The model will be evaluated based on the ability to suggest relevant and engaging videos to users.
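
One common way to quantify "relevant" for a ranked list is precision and recall at a cutoff `k`. The sketch below is an assumption about how such an evaluation could look, not the notebook's confirmed procedure; the function name `precision_recall_at_k` and the video ids are hypothetical.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Compute precision@k and recall@k for one user.

    recommended: ranked list of recommended video ids (best first)
    relevant:    set of video ids the user actually found relevant
    k:           number of top recommendations to evaluate
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    # Precision divides by k (the usual convention), recall by the
    # number of relevant items; both fall back to 0.0 when undefined.
    precision = hits / k if k > 0 else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up ids for illustration only.
recommended = ["v1", "v7", "v3", "v9", "v4"]
relevant = {"v3", "v4", "v8"}
precision, recall = precision_recall_at_k(recommended, relevant, k=5)
print(precision, recall)
```

Averaging these values over all users gives a single score per cutoff, which makes it possible to compare different recommenders on the same data.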

## Conclusion

By the end of this notebook, we aim to have a fully functional recommender system capable of providing personalized video recommendations, enhancing user engagement, and improving their overall experience on YouTube.



## Initializing PySpark

Before using PySpark, we need to initialize the Spark session.

In [9]:
import os
import sys

from pyspark.sql import SparkSession

# getOrCreate() returns the active session if one already exists and
# otherwise starts a new local session that uses all available cores.
# The former `if not 'spark' in locals()` fallback could never run,
# because `spark` is always defined at this point, so it is removed.
spark = SparkSession.builder \
    .master("local[*]") \
    .config("spark.submit.deployMode", "client") \
    .getOrCreate()

sc = spark.sparkContext

## Load the Data from a CSV file

In [10]:
filePath = os.path.join(os.getcwd(), "dataset", "trending_yt_videos_113_countries.csv")
data = (
    spark.read.format("csv")
    .option("header", "true")      # first line holds the column names
    .option("sep", ",")
    .option("multiLine", "true")   # allow quoted fields that contain newlines
    .option("quote", "\"")
    .load(filePath)
)

data.show(20, False)


## Configure the Schema

Every column is read from the CSV as a string, so we define a schema with the proper data types and create a `Row` constructor for building typed records.

In [11]:
data.printSchema()

root
 |-- title: string (nullable = true)
 |-- channel_name: string (nullable = true)
 |-- daily_rank: string (nullable = true)
 |-- daily_movement: string (nullable = true)
 |-- weekly_movement: string (nullable = true)
 |-- snapshot_date: string (nullable = true)
 |-- country: string (nullable = true)
 |-- view_count: string (nullable = true)
 |-- like_count: string (nullable = true)
 |-- comment_count: string (nullable = true)
 |-- description: string (nullable = true)
 |-- thumbnail_url: string (nullable = true)
 |-- video_id: string (nullable = true)
 |-- channel_id: string (nullable = true)
 |-- video_tags: string (nullable = true)
 |-- kind: string (nullable = true)
 |-- publish_date: string (nullable = true)
 |-- langauge: string (nullable = true)



In [12]:
from pyspark.sql.types import (
    DateType,
    IntegerType,
    StringType,
    StructField,
    StructType,
)

# One field per CSV column, in file order (18 columns in total).
# 'country' appears once, matching the output of printSchema() above.
schema = StructType([
    StructField('title', StringType(), True),
    StructField('channel_name', StringType(), True),
    StructField('daily_rank', IntegerType(), True),
    StructField('daily_movement', IntegerType(), True),
    StructField('weekly_movement', IntegerType(), True),
    StructField('snapshot_date', DateType(), True),
    StructField('country', StringType(), True),
    StructField('view_count', IntegerType(), True),
    StructField('like_count', IntegerType(), True),
    StructField('comment_count', IntegerType(), True),
    StructField('description', StringType(), True),
    StructField('thumbnail_url', StringType(), True),
    StructField('video_id', StringType(), True),
    StructField('channel_id', StringType(), True),
    StructField('video_tags', StringType(), True),
    StructField('kind', StringType(), True),
    StructField('publish_date', DateType(), True),
    # The source column is spelled 'langauge'; the schema uses the corrected name.
    StructField('language', StringType(), True),
    ])

In [13]:
from pyspark.sql import Row

# Row constructor with one field per CSV column, in file order.
Video = Row('title', 'channel_name', 'daily_rank', 'daily_movement', 'weekly_movement', 'snapshot_date', 'country', 'view_count', 'like_count', 'comment_count',
            'description', 'thumbnail_url', 'video_id', 'channel_id', 'video_tags', 'kind', 'publish_date', 'language')
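
A positional constructor like this expects values that are already typed, while the CSV delivers strings. The sketch below shows one way the per-record conversion could look. It is plain Python so that it runs without Spark: `VideoRecord` is a `namedtuple` stand-in for the `Row` constructor, the helpers `to_int`, `to_date`, and `parse_record` are hypothetical names, the sample values are made up, and the date handling assumes each date string starts with `YYYY-MM-DD`.

```python
from collections import namedtuple
from datetime import date, datetime

FIELDS = [
    'title', 'channel_name', 'daily_rank', 'daily_movement', 'weekly_movement',
    'snapshot_date', 'country', 'view_count', 'like_count', 'comment_count',
    'description', 'thumbnail_url', 'video_id', 'channel_id', 'video_tags',
    'kind', 'publish_date', 'language',
]
INT_FIELDS = {'daily_rank', 'daily_movement', 'weekly_movement',
              'view_count', 'like_count', 'comment_count'}
DATE_FIELDS = {'snapshot_date', 'publish_date'}

# Stand-in for the Row-based constructor, so the sketch needs no Spark.
VideoRecord = namedtuple('VideoRecord', FIELDS)


def to_int(value):
    """Convert a string to int; return None for missing or malformed input."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def to_date(value):
    """Parse the leading YYYY-MM-DD part of a string (assumed format)."""
    try:
        return datetime.strptime(value[:10], '%Y-%m-%d').date()
    except (TypeError, ValueError):
        return None


def parse_record(raw):
    """Turn a list of 18 raw string fields into a typed VideoRecord."""
    converted = []
    for name, value in zip(FIELDS, raw):
        if name in INT_FIELDS:
            converted.append(to_int(value))
        elif name in DATE_FIELDS:
            converted.append(to_date(value))
        else:
            converted.append(value)
    return VideoRecord(*converted)


# Made-up sample record; the empty comment_count becomes None.
raw = ['Some title', 'Some channel', '1', '0', '3', '2024-01-15', 'US',
       '1000', '50', '', 'A description', 'http://example.com/t.jpg',
       'vid123', 'chan456', 'tag1, tag2', 'youtube#video',
       '2024-01-10 00:00:00', 'en']
video = parse_record(raw)
print(video.daily_rank, video.snapshot_date, video.comment_count)
```

Returning `None` for values that cannot be parsed mirrors the nullable fields in the schema, so malformed entries do not abort the whole load.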