YouTube ranks videos as top trending using a few key factors based on user interactions, such as the number of views, shares, comments, and likes. The top trending videos mostly belong to categories such as movies, music, celebrities, reality shows, and TV shows. The dataset USvideos.csv contains a few years of daily trending YouTube videos. The dataset category_title.csv maps each category_id to its corresponding title.
Use Spark on a cloud service to do exploratory analysis and extract actionable insights, stored in a suitable format and exposed through an API for retrieval. Note that the dataset is not available on the cloud service you'll use, so it needs to be streamed to the cloud service.
Dataset:
Insights to be collected: Insights.md
- We use the Confluent Platform Community version and Kafka Connect's CSV Source Connector to read the CSV files and publish their data, along with the schema, to their respective topics.
- With the ADLS Gen2 Sink Connector as the consumer, we read data from the published topics and convert it into Avro files, which are stored in an ADLS Gen2 container named 'topics' that the connector creates automatically.
- Our Scala-based Spark application, running on an Azure HDInsight Spark cluster, reads the Avro data, generates insights, and stores the results as JSON files in an ADLS Gen2 container. Since IntelliJ IDEA is our development IDE, it is convenient to use the Azure Toolkit for IntelliJ to run the application remotely on the HDInsight cluster from within the IDE.
- Finally, we copy the insights JSON files from the ADLS Gen2 container into a Cosmos DB SQL API container using Azure Data Factory.
- The insights from the YouTube video analysis are then available for querying through the Cosmos DB SQL API.
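
As a rough illustration of the Spark step in the pipeline above, the following is a minimal sketch, not the project's actual code. It assumes the Avro records keep the CSV column names, that USvideos.csv has `video_id` and `views` columns (only `category_id` and `title` are confirmed by the description above), and that the `spark-avro` module is available on the cluster. The object name, the insight computed, and the three path arguments are hypothetical; the real list of insights is in Insights.md.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count, desc, max, sum}

// Hypothetical example job: total views per category title.
object TrendingInsights {

  // Joins videos to category titles and aggregates views per category.
  // Assumed columns: videos(video_id, category_id, views), categories(category_id, title).
  def viewsByCategory(videos: DataFrame, categories: DataFrame): DataFrame = {
    // Rename the category title so it cannot clash with a video-level title column.
    val cats = categories.select(
      col("category_id").cast("int").as("category_id"),
      col("title").as("category_title"))

    // A video can trend on several days; keep its highest view count
    // so that daily rows are not summed repeatedly.
    val perVideo = videos
      .select(
        col("video_id"),
        col("category_id").cast("int").as("category_id"),
        col("views").cast("long").as("views"))
      .groupBy("category_id", "video_id")
      .agg(max("views").as("views"))

    perVideo
      .join(cats, Seq("category_id"))
      .groupBy("category_title")
      .agg(sum("views").as("total_views"), count("video_id").as("videos"))
      .orderBy(desc("total_views"))
  }

  def main(args: Array[String]): Unit = {
    // Expected arguments (all hypothetical): Avro path of the videos topic,
    // Avro path of the categories topic, output path for the JSON result.
    require(args.length == 3, "usage: <videosPath> <categoriesPath> <outputPath>")
    val Array(videosPath, categoriesPath, outputPath) = args

    val spark = SparkSession.builder().appName("TrendingInsights").getOrCreate()

    // Reading with format "avro" requires the external spark-avro module.
    val videos = spark.read.format("avro").load(videosPath)
    val categories = spark.read.format("avro").load(categoriesPath)

    // Write the insight as JSON files, overwriting any previous run.
    viewsByCategory(videos, categories)
      .write.mode("overwrite").json(outputPath)

    spark.stop()
  }
}
```

Keeping the aggregation in a function that takes and returns DataFrames makes it possible to exercise the logic on small in-memory data before submitting the job to the cluster.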
For the detailed steps, please check solution_steps.pdf.
GNU General Public License v3.0 or later
See LICENSE.txt for the full text.
