**Coursebook: Fetch Data Youtube API**

- Part of Data Engineering Airflow Specialization
- Course Length: 3 Hours
- Last Updated: May 2024

---


- Developed by [Algoritma](https://algorit.ma/)'s product division and instructors team

# Background



## Top-Down Approach

The coursebook is part of the **Data Engineering Airflow Specialization** offered by [Algoritma](https://algorit.ma). It takes a more accessible approach compared to Algoritma's core educational products, by getting participants to overcome the "how" barrier first, rather than a detailed breakdown of the "why". 

This translates to an overall easier learning curve, one where the reader is prompted to write short snippets of code in frequent intervals, before being offered an explanation on the underlying theoretical frameworks. Instead of mastering the syntactic design of the Python programming language, then moving into data structures, and then the `pandas` library, and then the mathematical details in an imputation algorithm, and its code implementation; we would do the opposite: Implement the imputation, then a succinct explanation of why it works and applicational considerations (what to look out for, what are assumptions it made, when _not_ to use it, etc).

The coursebook is intended for a restricted audience only, i.e. the individuals and organizations having received this coursebook directly from the training organization. It may not be reproduced, distributed, translated or adapted in any form outside these individuals and organizations without permission.

Algoritma is a data science education center based in Jakarta. We organize workshops and training programs to help working professionals and students gain mastery in various data science sub-fields: data visualization, machine learning, data modeling, statistical inference, etc.

## Training Objectives

This coursebook shows how to fetch data from a REST API, clean it, and store it in a new file. We will use the YouTube API to fetch trending videos, clean the data, and store it in a JSON file. No prior programming knowledge is assumed.

This coursebook focuses on:

- Importance of Data Fetching
- Fetching Videos with the YouTube API
- Data Preprocessing or Cleaning Data
- Storing Data in a JSON File



## Importance of Data Fetching

In today's data-driven world, access to timely and relevant data is crucial for businesses, researchers, and individuals alike. Data fetching refers to the process of retrieving information from various sources, enabling analysis, insights, and informed decision-making. The significance of data fetching lies in:

1. **Informed Decision-Making**: Data fetching provides access to real-time and historical data, empowering organizations to make informed decisions based on accurate information and trends.

2. **Business Intelligence**: Fetching data from diverse sources enables businesses to gain insights into market trends, customer behavior, and competitor activities, aiding in strategic planning and performance evaluation.

3. **Research and Innovation**: Researchers rely on data fetching to gather empirical evidence, conduct analyses, and validate hypotheses across various domains, fostering innovation and scientific discovery.

4. **Personalization**: Data fetching facilitates the delivery of personalized experiences in sectors such as e-commerce, entertainment, and healthcare, enhancing customer satisfaction and engagement.

## Methods of Data Fetching
There are several methods to fetch data from different sources, each suited to specific requirements and use cases:

1. **Web Scraping**: This involves extracting data directly from web pages using automated scripts or tools. While effective for accessing publicly available information, web scraping may face challenges related to website structure and legality.

2. **File Transfer**: Data can be fetched by transferring files from one system to another using protocols like FTP (File Transfer Protocol) or SFTP (Secure File Transfer Protocol). This method is commonly used for batch data transfers.

3. **Database Queries**: Fetching data from databases involves executing queries using SQL (Structured Query Language) or NoSQL (Not Only SQL) languages to retrieve specific information from structured datasets stored in relational or non-relational databases.

4. **API (Application Programming Interface)**: APIs provide a standardized way for different software applications to communicate and exchange data. They offer a structured interface for accessing specific functionalities and datasets, making them a powerful and versatile method for data fetching.

### Example: YouTube API

The YouTube API is a specific API provided by Google that allows developers to access and interact with YouTube's features and data programmatically. With the YouTube API, developers can perform tasks such as retrieving video information, uploading videos, managing playlists, and accessing analytics data.

In the context of our project, we will focus on learning how to fetch data from YouTube using its API. By leveraging the YouTube API, we can programmatically retrieve trending videos, access metadata such as titles, descriptions, and view counts, and incorporate this data into our analytics pipelines or applications. Understanding how to interact with the YouTube API opens up opportunities for data-driven insights and innovation in various domains, including marketing, content creation, and audience engagement.

### Introduction to API
An API, or Application Programming Interface, is a set of rules, protocols, and tools that allows different software applications to communicate with each other. APIs define the methods and data formats that applications can use to request and exchange information.

**Key Characteristics of APIs:**
- **Standardization**: APIs provide a standardized way for applications to interact, ensuring compatibility and interoperability across different systems.
- **Abstraction**: APIs abstract the underlying complexities of systems, allowing developers to access functionality or data without needing to understand the internal workings.
- **Security**: APIs often incorporate authentication and authorization mechanisms to control access to sensitive data and resources.
- **Scalability**: APIs are designed to handle large volumes of requests efficiently, enabling scalability and performance optimization.

# Obtaining a YouTube API Key and Creating a YouTube API Client

**1. Sign in to Google Developer Console**
- Go to the [Google Developer Console](https://console.developers.google.com/).
- Sign in with your Google account or create one if you don't have it already.

**2. Create a New Project**
- Once signed in, create a new project by clicking on the "Select a project" dropdown menu at the top of the page and then clicking on the "+ New Project" button.
- Enter a name for your project and click "Create".

**3. Enable the YouTube Data API v3**
- In the Google Developer Console, navigate to the "Library" section from the left sidebar.
- Search for "YouTube Data API v3" and click on it.
- Click the "Enable" button to enable the API for your project.

**4. Create API Key**
- After enabling the YouTube Data API v3, navigate to the "Credentials" section from the left sidebar.
- Click on the "Create credentials" dropdown menu and select "API key".
- A dialog will appear displaying your API key. Copy this API key and store it securely.

**5. Secure Your API Key**
- It's essential to keep your API key secure to prevent unauthorized access and usage. Consider restricting the API key to only allow requests from specific websites or applications.

**6. Integrate API Key into Your Application**
- In your Python script, import the necessary libraries (`dotenv` for managing environment variables, `os` for interacting with the operating system).
- Use the `load_dotenv` function to load environment variables from a `.env` file. Ensure that you have the `python-dotenv` library installed (`pip install python-dotenv`).
- Set your YouTube API key as an environment variable in the `.env` file. For example:
  ```
  YOUTUBE_API_KEY=YOUR_API_KEY_HERE
  ```

## Library Preparation

In [1]:
import os
from dotenv import load_dotenv

from datetime import datetime, timezone
import json # reading and writing JSON files
import isodate # parsing ISO 8601 durations
from googleapiclient.discovery import build

In [2]:
# load API key from .env file
load_dotenv("dags/.env") # path only for trial
api_key = os.environ.get("YOUTUBE_API_KEY")

**Note**: When running inside Docker later, use a path under `"/opt/airflow/dags/..."` instead.
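Because `os.environ.get()` returns `None` when the variable is missing, a wrong `.env` path can go unnoticed until the first API call fails. A minimal sketch of a guard, assuming a hypothetical helper named `get_api_key` (it is not part of the original notebook) and assuming `load_dotenv()` has already been called:

```python
import os


def get_api_key(var_name="YOUTUBE_API_KEY"):
    """Return the API key stored in an environment variable.

    `get_api_key` is a hypothetical helper for illustration; it assumes
    the environment has already been populated, e.g. by load_dotenv().
    """
    api_key = os.environ.get(var_name)
    if not api_key:
        # fail early with a readable message instead of a confusing API error later
        raise RuntimeError(
            f"{var_name} is not set; check the path passed to load_dotenv()"
        )
    return api_key
```

Calling `get_api_key()` right after `load_dotenv(...)` would surface a missing key immediately.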

## Access the API in Your Script

- Within your Python script, import the `dotenv` and `os` libraries to access the API key.
- Use the `os.environ.get()` function to retrieve the API key from the environment variables.
- With the API key obtained, you can now create a YouTube API client using the `build` function from the `googleapiclient.discovery` module.

In [3]:
# create YouTube API client
youtube = build("youtube", "v3", developerKey=api_key)

By following these steps, you can obtain a YouTube API key and create a YouTube API client in your Python script. This allows you to programmatically interact with the YouTube Data API, fetching data such as trending videos, video metadata, and analytics information for further processing or analysis in your applications or workflows. Remember to keep your API key secure and adhere to Google's usage policies to avoid any potential issues or restrictions.

# Fetching Videos Using YouTube API

## 1. Make a Single API Request for Videos
Let's begin by taking our first step into accessing YouTube's vast video data through its API. We'll make a single request tailored to our criteria, fetching videos based on specific parameters. 

This initial action lays the foundation for our journey, offering a fundamental understanding of how to engage with the API and retrieve essential data. 

In [4]:
region_code='ID'

In [5]:
# make API request for videos
request = youtube.videos().list(

    # define the parts of the video data to retrieve
    part="snippet,contentDetails,statistics",  

    # specify the type of chart to fetch videos from
    chart="mostPopular",  

    # specify the region based on ISO 3166-1 alpha-2 country code
    regionCode=region_code,  

    # define the maximum number of videos to retrieve in a single request (1-50)
    maxResults=50  
)

Let's break down the purpose of each parameter in the API request:

- `part`: This parameter specifies which parts of the video resource to include in the API response. In this case, we're requesting the `snippet` (basic details like title, description), `contentDetails` (information about the video content like duration), and `statistics` (metrics like view counts, likes, comments) parts.

- `chart`: This parameter identifies the chart to retrieve videos from. Here, `mostPopular` returns the most popular videos for the specified region.

- `regionCode`: This parameter specifies the region for which to fetch videos. It uses the ISO 3166-1 alpha-2 country code format to identify the desired region.

- `maxResults`: The `maxResults` parameter sets the maximum number of videos to retrieve in a single API request. In this case, we're fetching up to 50 videos at a time, which is the maximum allowed by the YouTube API.

**Note:** For detailed documentation of the `youtube.videos().list()` parameters, see [here](https://developers.google.com/youtube/v3/docs/videos/list).

By providing these parameters in our API request, we define the scope and criteria for fetching videos from YouTube, tailoring the data retrieval process to our specific needs and objectives.
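Since the request parameters are plain keyword arguments, they can also be collected in a dictionary and checked before the request is built. The sketch below assumes a hypothetical helper named `build_trending_params`; the resulting dictionary could then be passed as `youtube.videos().list(**params)`.

```python
def build_trending_params(region_code, max_results=50, page_token=None):
    """Collect keyword arguments for a videos().list() request (hypothetical helper)."""
    if not 1 <= max_results <= 50:
        # the API accepts between 1 and 50 results per request
        raise ValueError("max_results must be between 1 and 50")

    params = {
        "part": "snippet,contentDetails,statistics",
        "chart": "mostPopular",
        "regionCode": region_code,
        "maxResults": max_results,
    }
    if page_token:
        # only include pageToken when asking for a later page
        params["pageToken"] = page_token
    return params


params = build_trending_params("ID")
print(params)
```

Keeping the parameters in one place makes it easier to reuse the same request for different regions or pages.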

## 2. Execute a Single API Request for Videos

Next, let's execute the request using the `execute()` method. This method is like pressing the "send" button on an email: it is what actually sends our request to YouTube's servers. Once we call `execute()`, YouTube receives the request, processes it, and sends back a response containing the videos we asked for, along with details such as titles and view counts.

In [6]:
response = request.execute()
response

{'kind': 'youtube#videoListResponse',
 'etag': 'VztUde9rJ5agADJsrNY60gJ4OFM',
 'items': [{'kind': 'youtube#video',
   'etag': 'ELrQJuz8aLr0qnwwmOqOL3Ndbi8',
   'id': '4ofJpOEXrZs',
   'snippet': {'publishedAt': '2024-05-03T22:00:08Z',
    'channelId': 'UCn_FAXem2-e3HQvmK-mOH4g',
    'title': 'THE AMAZING DIGITAL CIRCUS - Ep 2: Candy Carrier Chaos!',
    'description': 'The gang are BACK for a WAaAaAaACKY candy filled adventure! They also discover that their lives literally have no meaning. Woohoo! So waaccky!!!!! \n\nALSOOO we just dropped the entire main Digital Circus characters as PLUSHIES!!!: https://digitalcircus.store/ Consider getting one to help support the production of more episodes. No pressure though! Have a great day ❤️',
    'thumbnails': {'default': {'url': 'https://i.ytimg.com/vi/4ofJpOEXrZs/default.jpg',
      'width': 120,
      'height': 90},
     'medium': {'url': 'https://i.ytimg.com/vi/4ofJpOEXrZs/mqdefault.jpg',
      'width': 320,
      'height': 180},
     'hig

If successful, this method returns a response body with the following structure:
```
{
  'kind': 'youtube#videoListResponse',
  'etag': etag,
  'items': [
    video Resource
  ],
  'nextPageToken': string,
  'pageInfo': {'totalResults': integer, 'resultsPerPage': integer},
}
```
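To make this structure concrete, the sketch below reads the top-level fields from a hand-written sample dictionary. The sample values (`"abc123"`, `"CDIQAA"`, and the counts) are made up for illustration and do not come from a real response.

```python
# hand-written sample that mimics the documented structure (values are made up)
sample_response = {
    "kind": "youtube#videoListResponse",
    "etag": "sample-etag",
    "items": [{"kind": "youtube#video", "id": "abc123"}],
    "nextPageToken": "CDIQAA",
    "pageInfo": {"totalResults": 200, "resultsPerPage": 50},
}

page_info = sample_response.get("pageInfo", {})
total_results = page_info.get("totalResults")
results_per_page = page_info.get("resultsPerPage")

# .get() returns None when the key is absent, e.g. on the last page
next_token = sample_response.get("nextPageToken")

print(f"{len(sample_response['items'])} item(s) in this page")
print(f"{total_results} results in total, {results_per_page} per page")
print(f"next page token: {next_token}")
```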

## 3. Extract Videos from API Response

Once we've sent our request, the next step is to get the videos from the response sent back by the API. This allows us to access all the important information about each video returned by the API, such as titles, descriptions, and view counts. This step is essential because it's how we actually get the data we're interested in from the API's response.

In [7]:
videos = response.get("items", [])
videos

[{'kind': 'youtube#video',
  'etag': 'ELrQJuz8aLr0qnwwmOqOL3Ndbi8',
  'id': '4ofJpOEXrZs',
  'snippet': {'publishedAt': '2024-05-03T22:00:08Z',
   'channelId': 'UCn_FAXem2-e3HQvmK-mOH4g',
   'title': 'THE AMAZING DIGITAL CIRCUS - Ep 2: Candy Carrier Chaos!',
   'description': 'The gang are BACK for a WAaAaAaACKY candy filled adventure! They also discover that their lives literally have no meaning. Woohoo! So waaccky!!!!! \n\nALSOOO we just dropped the entire main Digital Circus characters as PLUSHIES!!!: https://digitalcircus.store/ Consider getting one to help support the production of more episodes. No pressure though! Have a great day ❤️',
   'thumbnails': {'default': {'url': 'https://i.ytimg.com/vi/4ofJpOEXrZs/default.jpg',
     'width': 120,
     'height': 90},
    'medium': {'url': 'https://i.ytimg.com/vi/4ofJpOEXrZs/mqdefault.jpg',
     'width': 320,
     'height': 180},
    'high': {'url': 'https://i.ytimg.com/vi/4ofJpOEXrZs/hqdefault.jpg',
     'width': 480,
     'height': 360

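The expression `response.get('items', [])` retrieves the value stored under the key `'items'`. If the key is missing, it returns the default given as the second argument, here an empty list, instead of raising an error.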
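The difference between `.get()` with a default and plain indexing shows up when the key is absent. A small sketch using a hand-written response that has no `items` key:

```python
# a response without an "items" key (hand-written for illustration)
empty_response = {"kind": "youtube#videoListResponse"}

# .get() with a default returns the fallback instead of raising
videos_safe = empty_response.get("items", [])
print(videos_safe)

# plain indexing raises KeyError for a missing key
try:
    empty_response["items"]
except KeyError as err:
    print(f"KeyError: {err}")
```

With the empty-list default, a later `for video in videos:` loop simply does nothing when no videos come back.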

## 4. Processing the Extracted Videos

### a) Basic Processing for Extracting Video Data

Now that we've obtained the videos from the API response, we can proceed to process them according to our requirements. This processing step involves extracting relevant information from each video or presenting them to the user in a suitable format.

In [8]:
# iterate through the first five videos
for video in videos[:5]:
    title = video["snippet"]["title"]  # extract video's title
    published_at = video["snippet"]["publishedAt"]  # extract video's publish date 
    view_count = video["statistics"]["viewCount"]  # extract video's view count 

    print(f"Title: {title}, Published At: {published_at}, Views: {view_count}")

Title: THE AMAZING DIGITAL CIRCUS - Ep 2: Candy Carrier Chaos!, Published At: 2024-05-03T22:00:08Z, Views: 37702723
Title: NGE-RATING SEMUA ROBOSEN TRANSFORMERS YANG PERNAH DIBUAT!, Published At: 2024-05-04T11:30:16Z, Views: 766791
Title: ARAFAH NGAKU SIAPA PACARNYA SEKARANG! - OMWEN, Published At: 2024-05-04T12:00:55Z, Views: 207703
Title: KELUARGA BARU KU PART 5 (TAMAT) - Animasi Sekolah, Published At: 2024-05-04T11:12:35Z, Views: 4198713
Title: skibidi toilet 73 (part 2), Published At: 2024-05-03T03:00:06Z, Views: 22953193


In this code snippet, we iterate over the first five videos in the `videos` list and extract specific details such as the title, publish date, and view count. These details can then be used for various purposes, such as analysis, reporting, or presentation to users. 

> However, it's important to note that the data fetched from each video may include more information beyond what is currently being printed.

The first method above (basic processing) involves directly extracting specific details, such as the video title, publish date, and view count, from the 'video' object using individual assignments. 

> While suitable for basic extraction needs, this approach can become cumbersome as requirements become more complex. 

### b) Advanced Processing Techniques for Comprehensive Video Extraction

For that reason, let's move on to an advanced approach that extracts more detailed information and appends it to a list.

#### Define Relevant Information to Extract

If the requirements are more complex or the details fetched are more numerous, the basic processing is not recommended. We need a method that allows for easier customization and scalability as new attributes can be added or modified without altering the code structure.

Thus, we can use a more structured approach: define a dictionary, `infos`, that maps categories such as `snippet` and `statistics` to the attributes we want.

Then, by iterating through each category and storing the specified details in a dictionary called `video_details`, we streamline the extraction process and keep the code readable.

In [9]:
# define the relevant information to extract
infos = {
    'snippet': ['title', 'publishedAt', 'channelTitle'],
    'statistics': ['viewCount']
}

Expand the original `infos` dictionary into `complete_infos`, adding more attributes to ensure a thorough extraction of video data.

In [10]:
complete_infos = {
    'snippet':['title', 'publishedAt', 'channelTitle',
               'description', 'categoryId', 'defaultAudioLanguage', 'thumbnails'],
    'statistics':['viewCount', 'likeCount', 'commentCount'],
    'contentDetails':['duration']
}

With the advanced method, even as we add more information we want to fetch, there's no need to worry about longer code.

#### Loop Through `infos` to Get All Information

This loop iterates over different categories of information stored in the 'complete_infos' dictionary. 

In [11]:
# loop through different categories of information
for key, value in complete_infos.items():
    print(f'key: {key} ||  value: {value}')

key: snippet ||  value: ['title', 'publishedAt', 'channelTitle', 'description', 'categoryId', 'defaultAudioLanguage', 'thumbnails']
key: statistics ||  value: ['viewCount', 'likeCount', 'commentCount']
key: contentDetails ||  value: ['duration']


Each category, represented by a key, corresponds to a list of specific details we want to extract from the videos.

#### Accessing Video Details

The code iterates over the videos in the `videos` list. For each video, it initializes an empty dictionary called `video_details` to store the extracted details, and then loops through the categories in `complete_infos`.

In [12]:
for video in videos:
    # for each video, it initializes an empty dictionary to store its details
    video_details = {}
    
    for key, value in complete_infos.items():
        for info in value:
            video_details[info] = video.get(key, {}).get(info)


Next, it loops through the specific details (`info`) listed under each category (`key`). 

Then, `video.get(key, {}).get(info)` accesses each detail from the `video` object using the `get()` method. If the category is missing, the first `get()` falls back to an empty dictionary; if the detail is missing, the second `get()` returns `None`.

This is equivalent to:

```python
if key in video and info in video[key]:
    video_details[info] = video[key][info]
else:
    video_details[info] = None
```
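To see the chained `get()` calls in action, the sketch below runs them on a hand-written video dictionary that has a `snippet` part but no `statistics` part. The sample values are made up.

```python
# hand-written sample: "statistics" is missing on purpose
sample_video = {"id": "abc123", "snippet": {"title": "Sample title"}}

# attribute present
title = sample_video.get("snippet", {}).get("title")

# attribute missing inside an existing category
language = sample_video.get("snippet", {}).get("defaultAudioLanguage")

# whole category missing: the first get() falls back to {}
views = sample_video.get("statistics", {}).get("viewCount")

print(title, language, views)
```

Both missing cases end up as `None`, so the loop never raises a `KeyError` for incomplete video resources.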

#### Saving Extracted Details

```python
videos_list = []

for video in videos:
    # code: accessing video details
    videos_list.append(video_details)
```

After extracting all the details for a video, the `video_details` dictionary containing these details is appended to the `videos_list`. This list will eventually contain dictionaries, each representing the extracted details of a single video.


Let's combine the code into a single loop.

In [13]:
# initialize list to save videos data
videos_list = []

for video in videos:
    video_details = {}
    
    for key, value in complete_infos.items():
        for info in value:
            video_details[info] = video.get(key, {}).get(info)

    # saving extracted details
    videos_list.append(video_details)

In [14]:
videos_list

[{'title': 'THE AMAZING DIGITAL CIRCUS - Ep 2: Candy Carrier Chaos!',
  'publishedAt': '2024-05-03T22:00:08Z',
  'channelTitle': 'GLITCH',
  'description': 'The gang are BACK for a WAaAaAaACKY candy filled adventure! They also discover that their lives literally have no meaning. Woohoo! So waaccky!!!!! \n\nALSOOO we just dropped the entire main Digital Circus characters as PLUSHIES!!!: https://digitalcircus.store/ Consider getting one to help support the production of more episodes. No pressure though! Have a great day ❤️',
  'categoryId': '1',
  'defaultAudioLanguage': 'en',
  'thumbnails': {'default': {'url': 'https://i.ytimg.com/vi/4ofJpOEXrZs/default.jpg',
    'width': 120,
    'height': 90},
   'medium': {'url': 'https://i.ytimg.com/vi/4ofJpOEXrZs/mqdefault.jpg',
    'width': 320,
    'height': 180},
   'high': {'url': 'https://i.ytimg.com/vi/4ofJpOEXrZs/hqdefault.jpg',
    'width': 480,
    'height': 360},
   'standard': {'url': 'https://i.ytimg.com/vi/4ofJpOEXrZs/sddefault.jpg',

#### Define Additional Details to Extract (`trendingAt` & `videoId`)

Since we'll be updating the data continually, let's record the time at which each video was fetched from the trending chart (`trendingAt`), and also its video ID (`videoId`) to make storing it in the database easier later.

In [15]:
# retrieves the current time in UTC format 
time_now = datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')
time_now

'2024-05-06T00:50:43Z'

In [16]:
# way to get videoId
videos[0].get("id")

'4ofJpOEXrZs'

In the previous code, we initialized `video_details` as an empty dictionary (`video_details = {}`).

Because we want to add `videoId` and `trendingAt` to `video_details`, we can put them directly in the initialization:

`video_details = {'videoId': video.get("id"), 'trendingAt': time_now}`

In [17]:
videos_list = []

for video in videos:
    # initialize a dictionary to store video details
    video_details = {'videoId': video.get("id"), 'trendingAt': time_now}
    
    for key, value in complete_infos.items():
        for info in value:
            video_details[info] = video.get(key, {}).get(info)
    
    videos_list.append(video_details)

Finally, we have completed the process. We've transformed raw video data into a structured format ready for analysis.  

Overall, while the basic method may suffice for simple tasks, the advanced method provides a more robust and scalable solution capable of meeting complex requirements and promoting better development practices.
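Since the same loop body is needed again whenever more videos are fetched, it can be convenient to wrap it in a function. A minimal sketch, assuming a hypothetical function name `extract_video_details` and a hand-written sample video (neither appears in the original notebook):

```python
from datetime import datetime, timezone


def extract_video_details(video, infos, fetched_at):
    """Flatten one video resource into a single dictionary (hypothetical helper)."""
    details = {"videoId": video.get("id"), "trendingAt": fetched_at}
    for key, value in infos.items():
        for info in value:
            # missing categories or attributes become None
            details[info] = video.get(key, {}).get(info)
    return details


# hand-written sample data for illustration
sample_infos = {"snippet": ["title"], "statistics": ["viewCount"]}
sample_video = {
    "id": "abc123",
    "snippet": {"title": "Sample title"},
    "statistics": {"viewCount": "42"},
}

fetched_at = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
print(extract_video_details(sample_video, sample_infos, fetched_at))
```

With such a helper, building the list could be written as a list comprehension over `videos`.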

## 5. Write Fetched Videos Data to a JSON file

Writing fetched videos data to a JSON file is crucial as it serves as an interim step in the development process before transitioning to storing the data in a database. This enables efficient data management and facilitates the seamless integration of the extracted information into the database for future development.

We can store the fetched data using `json.dump()`, which takes two required arguments: the data to be stored and the file object the data will be written to. 

- `file_path` is the path where the JSON file will be created or overwritten.
- `videos_list` is a list of dictionaries, each representing the details of one video.

In [18]:
file_path = 'dags/tmp_file.json'

**Note**: When running inside Docker later, use a path under `"/opt/airflow/dags/..."` instead.

In [19]:
with open(file_path, "w") as f:
    json.dump(videos_list, f)

`json.dump()` serializes the `videos_list` into a JSON formatted string and writes it to the file.

> Open the target file in write mode ('w') using a 'with' statement. This ensures that the file is properly closed after writing, even if an exception occurs.
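By default, `json.dump()` escapes non-ASCII characters (such as the emoji in the description shown earlier) as `\uXXXX` sequences. That output is valid JSON, but if a human-readable file is preferred, `ensure_ascii=False` together with an explicit `encoding="utf-8"` keeps the characters as they are. The sketch below writes to a temporary file rather than `dags/tmp_file.json`, and the sample record is made up:

```python
import json
import os
import tempfile

# made-up sample record containing non-ASCII characters
sample_list = [{"videoId": "abc123", "title": "Have a great day ❤️"}]

# write to a throwaway location so the real output file is untouched
tmp_path = os.path.join(tempfile.mkdtemp(), "sample.json")

with open(tmp_path, "w", encoding="utf-8") as f:
    # indent=2 pretty-prints; ensure_ascii=False keeps the emoji readable
    json.dump(sample_list, f, ensure_ascii=False, indent=2)

# read the file back to confirm the round trip preserves the data
with open(tmp_path, "r", encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded == sample_list)
```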

In [20]:
len(videos_list)

50

With a single API request, we can get at most 50 trending videos. Congratulations! 

In summary:
1. We sent a request to the YouTube API to fetch videos based on specific criteria.
2. The API responded with the requested videos.
3. We extracted important details like titles, publish dates, and view counts from the response.
4. Then, we processed these details to gain insights or present them to users.
5. Finally, we stored the data into a JSON file.

Through these steps, we effectively interacted with the YouTube API, retrieved relevant video data, and made use of it for our purposes.

**Disadvantages of Single Request**

There are typically 200 trending YouTube videos at any given time, but because a single request returns at most 50 items, we cannot retrieve all of them in one fetch. 

This is the inherent limitation of a single request: we only see part of the data and miss the rest. Hence the need for pagination.

**Solution:**
To solve the limitations of single requests, we can implement pagination by:

1. Sending an initial request to fetch the first batch of videos.
2. Checking if there are more pages of results available.
3. If more pages are available, sending additional requests for each page until all desired data is retrieved.
4. Processing and aggregating the data from each page to obtain a complete dataset for analysis or presentation.

By following this approach, we ensure comprehensive data retrieval while maintaining efficient resource usage and scalability in our application.
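The steps above can be sketched without calling the real API by simulating pages with a stub. `fetch_page` and `FAKE_PAGES` below are hypothetical stand-ins for `youtube.videos().list(...).execute()`; only the loop logic is the point.

```python
# hypothetical stand-in for the API: maps a page token to a response
FAKE_PAGES = {
    None: {"items": [{"id": "a"}, {"id": "b"}], "nextPageToken": "page2"},
    "page2": {"items": [{"id": "c"}, {"id": "d"}], "nextPageToken": "page3"},
    "page3": {"items": [{"id": "e"}]},  # last page: no nextPageToken
}


def fetch_page(page_token=None):
    """Simulate one API request (stub for illustration only)."""
    return FAKE_PAGES[page_token]


all_items = []
page_token = None

while True:
    # steps 1 and 3: request the first page, then each following page
    response = fetch_page(page_token)

    # step 4: aggregate the results of every page
    all_items.extend(response.get("items", []))

    # step 2: a missing nextPageToken means there are no more pages
    page_token = response.get("nextPageToken")
    if page_token is None:
        break

print([item["id"] for item in all_items])
```

The loop stops as soon as a response carries no `nextPageToken`, which is how the last page is recognized.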

## Paginate Through Results to Get All Trending Videos

Pagination allows us to retrieve additional batches of videos beyond the initial request.  In this case, we utilize multiple requests to fetch data in batches, ensuring comprehensive coverage of trending videos.

It's a very simple and straightforward process, but by using pagination, we can overcome limitations and ensure we get all trending videos.

```python
next_page_token = ""
while next_page_token is not None:
    # all code before here
    # ...
    next_page_token = response.get("nextPageToken", None)
```

In [21]:
videos_list = []
# initialize the token
next_page_token = ""

while next_page_token is not None:
    request = youtube.videos().list(
        part="snippet,contentDetails,statistics",
        chart="mostPopular",
        regionCode=region_code,
        maxResults=50,
        pageToken=next_page_token # add pageToken to inform the API
    )
    response = request.execute()
    videos = response.get("items", [])

    time_now = datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')
    for video in videos:
        video_details = {'videoId': video["id"], 'trendingAt': time_now}
        
        for key, value in complete_infos.items():
            for info in value:
                video_details[info] = video.get(key, {}).get(info)
                    
        videos_list.append(video_details)

    # store the next page to fetch (token)
    next_page_token = response.get("nextPageToken", None)

with open(file_path, "w") as f:
    json.dump(videos_list, f)

In [22]:
len(videos_list)

200

Yay! By utilizing pagination, we've finally obtained all the trending videos! It's simple, isn't it?

#### Conclusion
By following these steps, we can progressively learn how to fetch videos from YouTube's API, starting from making a single request and advancing to paginating through the results. This approach allows beginners to grasp the fundamentals of API interaction and data retrieval while gradually building their skills.

# Preprocessing YouTube Data

Preprocessing is essential for preparing the raw data fetched from YouTube for analysis or other use: it restructures the data into a more organized format, enriches it with additional information, cleans up inconsistencies, and normalizes its structure. 

> Preprocessing enhances the data's readability and usability while ensuring consistency and reliability. It also improves the efficiency of downstream applications or analyses that rely on the processed data. 

Let's take a look at our sample raw data!

In [23]:
# load the fetched videos data from the JSON file
with open(file_path, 'r') as f:
    videos_list = json.load(f)

In [24]:
example_video = videos_list[0]
example_video

{'videoId': '4ofJpOEXrZs',
 'trendingAt': '2024-05-06T00:50:43Z',
 'title': 'THE AMAZING DIGITAL CIRCUS - Ep 2: Candy Carrier Chaos!',
 'publishedAt': '2024-05-03T22:00:08Z',
 'channelTitle': 'GLITCH',
 'description': 'The gang are BACK for a WAaAaAaACKY candy filled adventure! They also discover that their lives literally have no meaning. Woohoo! So waaccky!!!!! \n\nALSOOO we just dropped the entire main Digital Circus characters as PLUSHIES!!!: https://digitalcircus.store/ Consider getting one to help support the production of more episodes. No pressure though! Have a great day ❤️',
 'categoryId': '1',
 'defaultAudioLanguage': 'en',
 'thumbnails': {'default': {'url': 'https://i.ytimg.com/vi/4ofJpOEXrZs/default.jpg',
   'width': 120,
   'height': 90},
  'medium': {'url': 'https://i.ytimg.com/vi/4ofJpOEXrZs/mqdefault.jpg',
   'width': 320,
   'height': 180},
  'high': {'url': 'https://i.ytimg.com/vi/4ofJpOEXrZs/hqdefault.jpg',
   'width': 480,
   'height': 360},
  'standard': {'url': '

We will preprocess this data by converting the duration to seconds, mapping category identifiers to their names, parsing thumbnail URLs, and converting string representations of numerical values to integers. 

## Converting Duration to Seconds

YouTube returns each video's duration as an ISO 8601 duration string, a standardized format that encodes hours, minutes, and seconds. A string like this is awkward to compare or aggregate, so we convert it to a plain number of seconds.

In [25]:
example_video.get('duration', '')

'PT25M14S'

We can achieve the conversion by using the `isodate.parse_duration()` function. This function parses the duration string into a duration object, from which we can extract the total duration in seconds using the `.total_seconds()` method.

In [26]:
# convert ISO 8601 duration to seconds
isodate.parse_duration(example_video.get('duration', '')).total_seconds()

1514.0

> **Insight:** The example video's duration is 1514 seconds (25 minutes and 14 seconds).

With the duration in seconds, we can easily perform calculations, comparisons, and manipulations on video durations, allowing for more efficient data processing and analysis.
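To illustrate what the conversion involves, here is a simplified standard-library sketch that handles durations of the form `PnDTnHnMnS` with whole numbers only. It is not a replacement for `isodate`, which supports much more of the ISO 8601 syntax; the function name `duration_to_seconds` is hypothetical.

```python
import re

# simplified pattern: optional days, hours, minutes, and whole seconds
DURATION_PATTERN = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def duration_to_seconds(duration):
    """Convert a simple ISO 8601 duration string to seconds (hypothetical helper)."""
    match = DURATION_PATTERN.match(duration or "")
    if match is None:
        return None  # unsupported or empty input

    # components that are absent from the string count as zero
    parts = {name: int(value or 0) for name, value in match.groupdict().items()}
    return (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )


print(duration_to_seconds("PT25M14S"))  # 25 * 60 + 14 = 1514
```

Returning `None` for an empty string is a design choice of this sketch, useful when a video has no `duration` value.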

## Convert `categoryId` to Category based on Categories Dictionary

YouTube videos data often includes a `categoryId` attribute representing the category of the video. 

However, these IDs are not intuitive for users to understand. Therefore, by converting `categoryId` to a more descriptive category based on a predefined dictionary, we can provide users with clearer and more meaningful information about the content of the videos. This enhances user experience and makes the data more accessible and understandable.

**Note**: When running inside Docker later, use a path under `"/opt/airflow/dags/..."` instead.

In [27]:
with open('dags/categories.json', 'r') as f:
    categories = json.load(f)

categories

{'1': 'Film & Animation',
 '2': 'Autos & Vehicles',
 '10': 'Music',
 '15': 'Pets & Animals',
 '17': 'Sports',
 '18': 'Short Movies',
 '19': 'Travel & Events',
 '20': 'Gaming',
 '21': 'Videoblogging',
 '22': 'People & Blogs',
 '23': 'Comedy',
 '24': 'Entertainment',
 '25': 'News & Politics',
 '26': 'Howto & Style',
 '27': 'Education',
 '28': 'Science & Technology',
 '30': 'Movies',
 '31': 'Anime/Animation',
 '32': 'Action/Adventure',
 '33': 'Classics',
 '34': 'Comedy',
 '35': 'Documentary',
 '36': 'Drama',
 '37': 'Family',
 '38': 'Foreign',
 '39': 'Horror',
 '40': 'Sci-Fi/Fantasy',
 '41': 'Thriller',
 '42': 'Shorts',
 '43': 'Shows',
 '44': 'Trailers'}

In [28]:
example_video.get('categoryId', '')

'1'

In [29]:
complete_infos

{'snippet': ['title',
  'publishedAt',
  'channelTitle',
  'description',
  'categoryId',
  'defaultAudioLanguage',
  'thumbnails'],
 'statistics': ['viewCount', 'likeCount', 'commentCount'],
 'contentDetails': ['duration']}

In [30]:
example_video.get('categoryId', '')

'1'

We can achieve the conversion by utilizing a predefined dictionary named `categories`. 

By accessing this dictionary with the `categoryId` retrieved from each video, we can obtain the corresponding category. 

In [31]:
# convert categoryId to category
categories.get(example_video.get('categoryId', ''))

'Film & Animation'

This process ensures that each video's category is represented in a more user-friendly and understandable format.
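
One detail worth noting is that `categories.get(...)` returns `None` when an ID is not in the dictionary. The sketch below shows one possible way to fall back to a readable label instead. The helper name `to_category` and the `'Unknown'` label are assumptions for illustration, and the dictionary is only a small excerpt of the one loaded above.

```python
# Small excerpt of the categories dictionary, for illustration only
categories = {"1": "Film & Animation", "10": "Music", "20": "Gaming"}

def to_category(category_id, categories, default="Unknown"):
    """Map a categoryId to its name, falling back to `default` (hypothetical helper)."""
    if category_id is None:
        return default
    # The dictionary keys are strings, so normalise integer IDs as well
    return categories.get(str(category_id), default)

print(to_category("1", categories))    # Film & Animation
print(to_category(10, categories))     # Music
print(to_category("999", categories))  # Unknown
```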

## Parse the Thumbnail `url`

Parsing the thumbnail URL is important to extract a single URL that represents the thumbnail image of the video in a standardized format. 

Before parsing, the thumbnail data is structured into a dictionary with different sizes of thumbnails (default, medium, high, standard, maxres) each containing a URL along with its corresponding width and height.

In [32]:
example_video['thumbnails']

{'default': {'url': 'https://i.ytimg.com/vi/4ofJpOEXrZs/default.jpg',
  'width': 120,
  'height': 90},
 'medium': {'url': 'https://i.ytimg.com/vi/4ofJpOEXrZs/mqdefault.jpg',
  'width': 320,
  'height': 180},
 'high': {'url': 'https://i.ytimg.com/vi/4ofJpOEXrZs/hqdefault.jpg',
  'width': 480,
  'height': 360},
 'standard': {'url': 'https://i.ytimg.com/vi/4ofJpOEXrZs/sddefault.jpg',
  'width': 640,
  'height': 480},
 'maxres': {'url': 'https://i.ytimg.com/vi/4ofJpOEXrZs/maxresdefault.jpg',
  'width': 1280,
  'height': 720}}

For our purposes, it is enough to retrieve only the URL of the standard thumbnail image.

In [33]:
# parse the thumbnail url
example_video.get('thumbnails', {}).get('standard', {}).get('url')

'https://i.ytimg.com/vi/4ofJpOEXrZs/sddefault.jpg'

After parsing, we keep only the URL of the standard thumbnail (`sddefault`, 640 by 480 pixels), which is larger than the default, medium, and high sizes and is suitable for display in most contexts. Not every video provides a `standard` thumbnail; in that case the chained `.get()` calls return `None` instead of raising an error.

This single URL is more convenient for storage, sharing, and usage in applications than the original nested thumbnail data.
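
If a missing `standard` thumbnail should not leave the field empty, one option is to try several sizes in order of preference. The following is a minimal sketch; the helper name `pick_thumbnail_url` and the preference order are assumptions for illustration, not part of the coursebook's pipeline.

```python
def pick_thumbnail_url(thumbnails, preferred=("standard", "high", "medium", "default")):
    """Return the first available thumbnail URL in order of preference (hypothetical helper)."""
    for size in preferred:
        url = (thumbnails or {}).get(size, {}).get("url")
        if url:
            return url
    return None  # no thumbnail of any preferred size

# Example data with no 'standard' entry
thumbnails = {
    "default": {"url": "https://i.ytimg.com/vi/4ofJpOEXrZs/default.jpg"},
    "high": {"url": "https://i.ytimg.com/vi/4ofJpOEXrZs/hqdefault.jpg"},
}

print(pick_thumbnail_url(thumbnails))  # falls back to the 'high' URL
```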

## Convert Statistical Columns to Integer from String

Converting 'viewCount', 'likeCount', and 'commentCount' from string to integer is essential for proper data handling and analysis. By converting these values to integers, we can perform mathematical operations, comparisons, and statistical analysis accurately. 

In [34]:
example_video['viewCount']

'37702723'

We accomplish the conversion by utilizing the int() function, which converts the string representation of 'viewCount' (and similarly 'likeCount' and 'commentCount') to an integer format. 

This ensures that these values are represented as numerical data, enabling us to perform mathematical operations and analysis accurately. 

In [35]:
int(example_video.get('viewCount', 0))

37702723

> **Insight:** The example video has 37,702,723 views in total.

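Additionally, providing a default value of 0 handles the case where the 'viewCount' key is missing. Note that the default does not cover a key that is present but holds `None` or a non-numeric string; `int()` still raises an error in those cases.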

This conversion allows us to effectively analyze the popularity, engagement, and interactions of YouTube videos based on their view counts, like counts, and comment counts.
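
For a conversion that tolerates missing keys, `None`, and malformed strings alike, a small wrapper around `int()` is one option. This is a minimal sketch; the helper name `to_int` and the sample dictionary are made up for illustration.

```python
def to_int(value, default=0):
    """Convert value to int, returning `default` when that is not possible (hypothetical helper)."""
    try:
        return int(value)
    except (TypeError, ValueError):
        # TypeError: value is None; ValueError: value is not a numeric string
        return default

# Made-up sample: likeCount stored as None, commentCount missing entirely
sample_video = {"viewCount": "37702723", "likeCount": None}

print(to_int(sample_video.get("viewCount")))     # 37702723
print(to_int(sample_video.get("likeCount")))     # 0
print(to_int(sample_video.get("commentCount")))  # 0
```

Silently replacing bad values with 0 is a trade-off: it keeps the pipeline running, but a video whose likes are hidden becomes indistinguishable from one with zero likes. Passing `default=None` keeps that distinction.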

## Preprocess Each Video in the List

Combining all the preprocessing functions into a single loop allows for efficient processing of each video in the video list. Instead of manually applying each function to every video, looping through the list enables automated application of all necessary preprocessing steps. 

In [36]:
for video in videos_list:
    # convert categoryId to category based on categories dictionary
    video['category'] = categories.get(video.get('categoryId', ''))
    del video['categoryId']

    # convert ISO 8601 duration to seconds
    # (the value is overwritten in place, so the key must not be deleted afterwards)
    video['duration'] = isodate.parse_duration(video.get('duration', '')).total_seconds()

    # parse the thumbnail url
    video['thumbnailUrl'] = video.get('thumbnails', {}).get('standard', {}).get('url')
    del video['thumbnails']
    
    # convert viewCount, likeCount, and commentCount to integer
    # (`or 0` also covers keys stored as None when the API did not return a statistic)
    video['viewCount'] = int(video.get('viewCount') or 0)
    video['likeCount'] = int(video.get('likeCount') or 0)
    video['commentCount'] = int(video.get('commentCount') or 0)

Looping applies exactly the same steps to every video, which matters when dealing with a large number of videos, such as the 200 videos typically present in trending YouTube data. Keeping all the preprocessing in one place ensures the data is prepared consistently for further analysis or usage.

## Store Postprocessed Data into a JSON File

After preprocessing, the next step is to store the postprocessed data into a JSON format. This allows for easy storage, sharing, and retrieval of the processed data, ensuring that it remains accessible for future analysis or usage.

In [37]:
processed_file_path = 'dags/tmp_file_processed.json'

**Note**: Later, when running inside Docker, use the `"/opt/airflow/dags/..."` path instead.

In [38]:
# save the processed videos data to a new file
with open(processed_file_path, "w") as f:
    json.dump(videos_list, f)
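
A quick way to confirm that the file was written correctly is to read it back and compare it with the data in memory. The sketch below does this in a temporary directory so that it does not touch the `dags/` folder. The sample record is made up, and the `ensure_ascii=False` and `indent=2` arguments are optional additions (not used in the cell above) that keep non-ASCII titles readable and the file easier to inspect.

```python
import json
import os
import tempfile

# Made-up sample record for illustration
videos = [{"videoId": "abc123", "title": "Contoh Judul é", "viewCount": 10}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "processed.json")

    # write the data; an explicit encoding avoids platform-dependent defaults
    with open(path, "w", encoding="utf-8") as f:
        json.dump(videos, f, ensure_ascii=False, indent=2)

    # read it back to verify the round trip
    with open(path, "r", encoding="utf-8") as f:
        reloaded = json.load(f)

print(reloaded == videos)  # True when the round trip preserved the data
```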

# Make Custom Function for All Processes

## Make Fetching Data Function 

Create a single function that combines all the fetching steps for YouTube video data into one streamlined process.

In [39]:
def fetch_trending_videos(region_code: str, file_path: str):
    """Fetches trending videos from YouTube for a specific region.

    Args:
        region_code: A string representing the ISO 3166-1 alpha-2 country code for the desired region.
        file_path: A string representing the path to the file to be written.
    """
    
    # load API key from .env file
    load_dotenv("/opt/airflow/dags/.env")
    api_key = os.environ.get("YOUTUBE_API_KEY")

    # create YouTube API client
    youtube = build("youtube", "v3", developerKey=api_key)

    complete_infos = {
        'snippet':['title', 'publishedAt', 'channelTitle',
                'description', 'categoryId', 'defaultAudioLanguage', 'thumbnails'],
        'statistics':['viewCount', 'likeCount', 'commentCount'],
        'contentDetails':['duration']
    }

    # fetch videos page by page until there are no more results
    videos_list = []
    # initialize the token
    next_page_token = ""

    while next_page_token is not None:
        request = youtube.videos().list(
            part="snippet,contentDetails,statistics",
            chart="mostPopular",
            regionCode=region_code,
            maxResults=50,
            pageToken=next_page_token # add pageToken to inform the API
        )
        response = request.execute()
        videos = response.get("items", [])

        time_now = datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')
        for video in videos:
            video_details = {'videoId': video["id"], 'trendingAt': time_now}
            
            for key, value in complete_infos.items():
                for info in value:
                    video_details[info] = video.get(key, {}).get(info)
                        
            videos_list.append(video_details)

        # store the next page to fetch (token)
        next_page_token = response.get("nextPageToken", None)

    with open(file_path, "w") as f:
        json.dump(videos_list, f)
    
    print('Done fetched the data.')

In [40]:
fetch_trending_videos(region_code='ID', file_path='dags/tmp_file.json')

Done fetched the data.


**Note**: Later, when running inside Docker, use the `"/opt/airflow/dags/..."` path instead.

We name this function `fetch_trending_videos` because it encapsulates fetching trending videos from YouTube for a specific region and storing them in a JSON file.
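
The pagination loop inside the function can be studied on its own, without calling the real API. In the sketch below, `list_page` is a hypothetical callable standing in for `youtube.videos().list(...).execute()`, and `FAKE_PAGES` is made-up data that imitates three pages of results. Only the `items` and `nextPageToken` keys mirror the response handling shown above.

```python
def fetch_all_pages(list_page):
    """Collect items from every page returned by `list_page` (hypothetical helper)."""
    items = []
    page_token = ""  # same starting value as in fetch_trending_videos
    while page_token is not None:
        response = list_page(page_token)
        items.extend(response.get("items", []))
        # the last page has no nextPageToken, so .get() returns None and the loop ends
        page_token = response.get("nextPageToken")
    return items

# Fake three-page API used only for this demonstration
FAKE_PAGES = {
    "": {"items": ["a", "b"], "nextPageToken": "p2"},
    "p2": {"items": ["c", "d"], "nextPageToken": "p3"},
    "p3": {"items": ["e"]},
}

all_items = fetch_all_pages(lambda token: FAKE_PAGES[token])
print(all_items)  # ['a', 'b', 'c', 'd', 'e']
```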

## Make Data Preprocessing Function 

Create a single function that combines all the preprocessing steps for YouTube video data into one streamlined process.

In [41]:
def data_processing(source_file_path: str, target_file_path: str):
    """Processes the raw data fetched from YouTube.
    
    Args:
        source_file_path: A string representing the path to the file to be processed.
        target_file_path: A string representing the path to the file to be written.
    """
    # Load the fetched videos data from the json file
    with open(source_file_path, 'r') as f:
        videos_list = json.load(f)
    
    # Load the categories dictionary from the json file
    with open('dags/categories.json', 'r') as f:
        categories = json.load(f)
    
    # process each video in the list
    for video in videos_list:
        # convert categoryId to category based on categories dictionary
        video['category'] = categories.get(video.get('categoryId', ''))
        del video['categoryId']

        # convert ISO 8601 duration to seconds
        # (the value is overwritten in place, so the key must not be deleted afterwards)
        video['duration'] = isodate.parse_duration(video.get('duration', '')).total_seconds()

        # parse the thumbnail url
        video['thumbnailUrl'] = video.get('thumbnails', {}).get('standard', {}).get('url')
        del video['thumbnails']
        
        # convert viewCount, likeCount, and commentCount to integer
        # (`or 0` also covers keys stored as None when the API did not return a statistic)
        video['viewCount'] = int(video.get('viewCount') or 0)
        video['likeCount'] = int(video.get('likeCount') or 0)
        video['commentCount'] = int(video.get('commentCount') or 0)

    # save the processed videos data to a new file
    with open(target_file_path, "w") as f:
        json.dump(videos_list, f)

    print('Done preprocessed data.')

In [42]:
data_processing('dags/tmp_file.json', 'dags/tmp_file_processed.json')

Done preprocessed data.


**Note**: Later, when running inside Docker, use the `"/opt/airflow/dags/..."` path instead.

We name this function `data_processing`. It involves several steps: converting `categoryId` to a category based on a predefined dictionary, converting the ISO 8601 duration to seconds, parsing the thumbnail URL, and converting `viewCount`, `likeCount`, and `commentCount` to integers.

## Why It Is Important to Make All Processes a Function

It's important to make this process a function for several reasons:
1. **Reusability**: By encapsulating these preprocessing steps into a function, we can easily reuse it across different parts of the codebase or in other projects. This promotes code reuse and reduces redundancy.
  
2. **Modularity**: The function creates a clear separation of concerns by encapsulating the preprocessing logic. This makes the code easier to understand, maintain, and debug.

3. **Abstraction**: The function abstracts away the implementation details of the preprocessing steps, allowing users to focus on the high-level logic without getting bogged down in the specifics.

4. **Testability**: By making the preprocessing logic a function, it becomes easier to write unit tests to ensure the correctness of the processing steps.
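
As an illustration of the testability point, a function that takes a source path and a target path can be exercised with temporary files. In this sketch, `convert_counts_file` is a hypothetical, much-reduced stand-in for `data_processing` that converts only `viewCount`, and the input records are made up.

```python
import json
import os
import tempfile

def convert_counts_file(source_file_path, target_file_path):
    """Hypothetical stand-in for data_processing: converts only viewCount."""
    with open(source_file_path, "r") as f:
        videos = json.load(f)
    for video in videos:
        video["viewCount"] = int(video.get("viewCount") or 0)
    with open(target_file_path, "w") as f:
        json.dump(videos, f)

def test_convert_counts_file():
    with tempfile.TemporaryDirectory() as tmp_dir:
        source = os.path.join(tmp_dir, "raw.json")
        target = os.path.join(tmp_dir, "processed.json")

        # arrange: write a small raw file with one valid and one missing count
        raw = [{"videoId": "a", "viewCount": "42"}, {"videoId": "b", "viewCount": None}]
        with open(source, "w") as f:
            json.dump(raw, f)

        # act: run the function under test
        convert_counts_file(source, target)

        # assert: the output file contains integers
        with open(target, "r") as f:
            result = json.load(f)
    assert [video["viewCount"] for video in result] == [42, 0]
    return result

result = test_convert_counts_file()
print(result)
```

Because the function reads and writes only the paths it is given, the test never touches the real `dags/` files.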

Relating it to the use case of scheduling with Apache Airflow, making this preprocessing logic a function enables us to easily integrate it into an Airflow DAG (Directed Acyclic Graph). We can create an Airflow operator that calls this function to preprocess the data before performing further tasks in the pipeline. This promotes automation, scalability, and maintainability of the data processing workflow within Airflow.

# Conclusion

In summary, we've successfully completed the fetch and preprocess data steps, enabling us to retrieve trending YouTube videos for a specific region, process the data to extract relevant information, and store it in a JSON format. By encapsulating these processes within functions and ensuring modularity and reusability, we've established a robust foundation for automating our data pipeline.

Moving forward, our next steps involve integrating these functions into a Directed Acyclic Graph (DAG) using Apache Airflow. This will allow us to automate the entire process, scheduling it to run at regular intervals to ensure we always have up-to-date data. Additionally, we can implement error handling, logging, and monitoring within Airflow to ensure the reliability and resilience of our data pipeline. 

Overall, this integration with Airflow will enhance the scalability, efficiency, and maintainability of our data acquisition workflow, enabling us to derive valuable insights from trending YouTube data with ease.