### Background: Importance of Data Fetching and Introduction to APIs

#### Importance of Data Fetching:
In today's data-driven world, access to timely and relevant data is crucial for businesses, researchers, and individuals alike. Data fetching refers to the process of retrieving information from various sources, enabling analysis, insights, and informed decision-making. The significance of data fetching lies in:

1. **Informed Decision-Making**: Data fetching provides access to real-time and historical data, empowering organizations to make informed decisions based on accurate information and trends.

2. **Business Intelligence**: Fetching data from diverse sources enables businesses to gain insights into market trends, customer behavior, and competitor activities, aiding in strategic planning and performance evaluation.

3. **Research and Innovation**: Researchers rely on data fetching to gather empirical evidence, conduct analyses, and validate hypotheses across various domains, fostering innovation and scientific discovery.

4. **Personalization**: Data fetching facilitates the delivery of personalized experiences in sectors such as e-commerce, entertainment, and healthcare, enhancing customer satisfaction and engagement.

#### Methods of Data Fetching:
There are several methods to fetch data from different sources, each suited to specific requirements and use cases:

1. **Web Scraping**: This involves extracting data directly from web pages using automated scripts or tools. While effective for accessing publicly available information, web scraping may face challenges related to website structure and legality.

2. **File Transfer**: Data can be fetched by transferring files from one system to another using protocols like FTP (File Transfer Protocol) or SFTP (Secure File Transfer Protocol). This method is commonly used for batch data transfers.

3. **Database Queries**: Fetching data from databases involves executing queries, using SQL (Structured Query Language) for relational databases or the query interface of a NoSQL (Not Only SQL) store for non-relational ones, to retrieve specific information from stored datasets.

4. **API (Application Programming Interface)**: APIs provide a standardized way for different software applications to communicate and exchange data. They offer a structured interface for accessing specific functionalities and datasets, making them a powerful and versatile method for data fetching.

#### Introduction to API:
An API, or Application Programming Interface, is a set of rules, protocols, and tools that allows different software applications to communicate with each other. APIs define the methods and data formats that applications can use to request and exchange information.

**Key Characteristics of APIs:**
- **Standardization**: APIs provide a standardized way for applications to interact, ensuring compatibility and interoperability across different systems.
- **Abstraction**: APIs abstract the underlying complexities of systems, allowing developers to access functionality or data without needing to understand the internal workings.
- **Security**: APIs often incorporate authentication and authorization mechanisms to control access to sensitive data and resources.
- **Scalability**: APIs are designed to handle large volumes of requests efficiently, enabling scalability and performance optimization.

#### Example: YouTube API:
The YouTube API is a specific API provided by Google that allows developers to access and interact with YouTube's features and data programmatically. With the YouTube API, developers can perform tasks such as retrieving video information, uploading videos, managing playlists, and accessing analytics data.

In the context of our project, we will focus on learning how to fetch data from YouTube using its API. By leveraging the YouTube API, we can programmatically retrieve trending videos, access metadata such as titles, descriptions, and view counts, and incorporate this data into our analytics pipelines or applications. Understanding how to interact with the YouTube API opens up opportunities for data-driven insights and innovation in various domains, including marketing, content creation, and audience engagement.

### Step-by-Step Guide: Obtaining a YouTube API Key and Creating a YouTube API Client

#### 1. Sign in to Google Developer Console:
- Go to the [Google Developer Console](https://console.developers.google.com/).
- Sign in with your Google account or create one if you don't have it already.

#### 2. Create a New Project:
- Once signed in, create a new project by clicking on the "Select a project" dropdown menu at the top of the page and then clicking on the "+ New Project" button.
- Enter a name for your project and click "Create".

#### 3. Enable the YouTube Data API v3:
- In the Google Developer Console, navigate to the "Library" section from the left sidebar.
- Search for "YouTube Data API v3" and click on it.
- Click the "Enable" button to enable the API for your project.

#### 4. Create API Key:
- After enabling the YouTube Data API v3, navigate to the "Credentials" section from the left sidebar.
- Click on the "Create credentials" dropdown menu and select "API key".
- A dialog will appear displaying your API key. Copy this API key and store it securely.

#### 5. Secure Your API Key:
- It's essential to keep your API key secure to prevent unauthorized access and usage. Consider restricting the API key to only allow requests from specific websites or applications.

#### 6. Integrate API Key into Your Application:
- In your Python script, import the necessary libraries (`dotenv` for managing environment variables, `os` for interacting with the operating system).
- Use the `load_dotenv` function to load environment variables from a `.env` file. Ensure that you have the `python-dotenv` library installed (`pip install python-dotenv`).
- Set your YouTube API key as an environment variable in the `.env` file. For example:
  ```
  YOUTUBE_API_KEY=YOUR_API_KEY_HERE
  ```

In [24]:
# import necessary libraries
import os
from dotenv import load_dotenv

In [25]:
# load API key from .env file
load_dotenv("dags/.env") # path only for trial
api_key = os.environ.get("YOUTUBE_API_KEY")

**Note**: When running inside the Airflow Docker container later, use `/opt/airflow/dags/.env` as the path instead.
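
`os.environ.get()` returns `None` when the variable is missing (for example, when the path passed to `load_dotenv` is wrong), and the problem would then surface only at the first API call. Below is a minimal sketch of a fail-fast check. The helper name `require_env` and the variable `DEMO_API_KEY` are hypothetical and used only for illustration; they are not part of `python-dotenv` or the original notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error.

    Hypothetical helper for illustration only.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; "
            "check the path passed to load_dotenv()."
        )
    return value


# Demonstration with a placeholder variable rather than a real key.
os.environ["DEMO_API_KEY"] = "placeholder"
demo_key = require_env("DEMO_API_KEY")
print(demo_key)
```

In the notebook, the same check could be applied to `YOUTUBE_API_KEY` right after `load_dotenv(...)`.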


#### 7. Access the API in Your Script:
- Within your Python script, import the `dotenv` and `os` libraries to access the API key.
- Use the `os.environ.get()` function to retrieve the API key from the environment variables.
- With the API key obtained, you can now create a YouTube API client using the `build` function from the `googleapiclient.discovery` module.

In [26]:
# create YouTube API client
from googleapiclient.discovery import build
youtube = build("youtube", "v3", developerKey=api_key)

By following these steps, you can obtain a YouTube API key and create a YouTube API client in your Python script. This allows you to programmatically interact with the YouTube Data API, fetching data such as trending videos, video metadata, and video statistics for further processing or analysis in your applications or workflows. Remember to keep your API key secure and adhere to Google's usage policies to avoid any potential issues or restrictions.

### Step-by-Step Guide: Fetching Videos Using YouTube API

#### 1. Make a Single API Request for Videos
Let's begin with a single request to the YouTube API that fetches videos matching specific parameters. This first call shows the basic pattern for building a request and retrieving data.

In [27]:
region_code='ID'

In [28]:
# Make API request for videos
request = youtube.videos().list(

    # define the parts of the video data to retrieve: snippet (basic details), contentDetails (video content), statistics (view counts, likes, comments).
    part="snippet,contentDetails,statistics",  

    # specify the type of chart to fetch videos from: "mostPopular" retrieves the most popular videos.
    chart="mostPopular",  

    # specify the region for which to fetch videos based on ISO 3166-1 alpha-2 country code.
    regionCode=region_code,  

    # define the maximum number of videos to retrieve in a single request.
    maxResults=50  
)

Let's break down the purpose of each parameter in the API request:

- `part`: This parameter specifies which parts of the video resource to include in the API response. In this case, we're requesting the `snippet` (basic details like title, description), `contentDetails` (information about the video content like duration), and `statistics` (metrics like view counts, likes, comments) parts.

- `chart`: The `chart` parameter selects the chart to retrieve videos from. Here, `mostPopular` returns the most popular videos for the specified region.

- `regionCode`: This parameter specifies the region for which to fetch videos. It uses the ISO 3166-1 alpha-2 country code format to identify the desired region.

- `maxResults`: The `maxResults` parameter sets the maximum number of videos to retrieve in a single API request. In this case, we're fetching up to 50 videos at a time, which is the maximum allowed by the YouTube API.

By providing these parameters in our API request, we define the scope and criteria for fetching videos from YouTube, tailoring the data retrieval process to our specific needs and objectives.
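
To make the role of each parameter more concrete, the sketch below builds the equivalent HTTP query string by hand with the standard library. It only illustrates how the parameters end up in a request URL; the client library constructs the real request itself, and the API key shown is a placeholder.

```python
from urllib.parse import urlencode

# REST endpoint for the videos resource of the YouTube Data API v3.
BASE_URL = "https://www.googleapis.com/youtube/v3/videos"

params = {
    "part": "snippet,contentDetails,statistics",  # sections of each video resource
    "chart": "mostPopular",                       # which chart to read from
    "regionCode": "ID",                           # ISO 3166-1 alpha-2 country code
    "maxResults": 50,                             # page size
    "key": "YOUR_API_KEY_HERE",                   # placeholder, not a real key
}

# urlencode percent-encodes the values and joins them with "&".
url = f"{BASE_URL}?{urlencode(params)}"
print(url)
```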

#### 2. Execute a Single API Request for Videos

Next, execute the request with the `execute()` method. Calling it is like pressing "send" on an email: it transmits the request to YouTube's servers, which process it and send back a response. The response contains the requested videos along with details such as titles, view counts, and more.

In [29]:
response = request.execute()
response

{'kind': 'youtube#videoListResponse',
 'etag': 'BWRLo_y0AaU9tf6fM1OuDBAR0oo',
 'items': [{'kind': 'youtube#video',
   'etag': 'l__jdILeDTBnmfNj4bvJM47HsHE',
   'id': '2bopnQpPwgI',
   'snippet': {'publishedAt': '2024-03-17T18:49:27Z',
    'channelId': 'UCViN2fPWI6zuZDXwn0O54xg',
    'title': 'Manchester United 2-2 (4-3 aet) Liverpool | FA Cup 23/24 Match Highlights',
    'description': 'Tonton liga-liga terbaik hanya di beIN SPORTS CONNECT: https://bein.onelink.me/bApY/beinYT \n\nSimak berita dan update menarik seputar liga terbaik dan beIN Sports Indonesia di semua sosial media:\n\nFacebook: https://www.facebook.com/beinsportsindonesia/\nTwitter: https://twitter.com/beinsportsid\nInstagram: https://www.instagram.com/beinsports.id/\nYouTube: https://www.youtube.com/c/beINSPORTSID',
    'thumbnails': {'default': {'url': 'https://i.ytimg.com/vi/2bopnQpPwgI/default.jpg',
      'width': 120,
      'height': 90},
     'medium': {'url': 'https://i.ytimg.com/vi/2bopnQpPwgI/mqdefault.jpg',
   

#### 3. Extract Videos from API Response

Once we've sent our request, the next step is to get the videos from the response sent back by the API. This allows us to access all the important information about each video returned by the API, such as titles, descriptions, and view counts. This step is essential because it's how we actually get the data we're interested in from the API's response.

In [30]:
videos = response.get("items", [])

#### 4. Processing the Extracted Videos

Now that we've obtained the videos from the API response, we can proceed to process them according to our requirements. This processing step involves extracting relevant information from each video or presenting them to the user in a suitable format.

In [31]:
for video in videos[:5]:
    title = video["snippet"]["title"]  # Extract the title of the video
    published_at = video["snippet"]["publishedAt"]  # Extract the publish date of the video
    view_count = video["statistics"]["viewCount"]  # Extract the view count of the video
    print(f"Title: {title}, Published At: {published_at}, Views: {view_count}")

Title: Manchester United 2-2 (4-3 aet) Liverpool | FA Cup 23/24 Match Highlights, Published At: 2024-03-17T18:49:27Z, Views: 1330742
Title: HAPPY ASMARA Feat. GILGA SAHID - MANOT | Feat. BINTANG FORTUNA (Official Music Video), Published At: 2024-03-17T03:00:07Z, Views: 1565368
Title: STUDY TOUR #25 - Serena Hilang, Published At: 2024-03-17T09:05:15Z, Views: 346045
Title: Kiky Saputri dan Suami Buka Suara Terkait Kehamilannya, Published At: 2024-03-17T16:35:08Z, Views: 255976
Title: NGABUBURIT 2, Published At: 2024-03-17T09:19:49Z, Views: 882143


In this code snippet, we iterate over each video in the `videos` list and extract specific details such as the title, publish date, and view count. These details can then be used for various purposes, such as analysis, reporting, or presentation to users. 

Processing the videos in this manner allows us to work with the data in a meaningful way, enabling us to derive insights and take actions based on the information retrieved from the API.
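
As a small illustration of presenting the data in a more readable form, the sketch below converts the view count to an integer and parses the publish timestamp. The sample values are copied from the output above; the formatting choices are assumptions made for this example.

```python
from datetime import datetime, timezone

# Sample values copied from the notebook output.
video = {
    "snippet": {"title": "NGABUBURIT 2", "publishedAt": "2024-03-17T09:19:49Z"},
    "statistics": {"viewCount": "882143"},
}

# The API returns counts as strings, so convert before arithmetic or formatting.
views = int(video["statistics"]["viewCount"])

# Parse the UTC timestamp into a timezone-aware datetime.
published = datetime.strptime(
    video["snippet"]["publishedAt"], "%Y-%m-%dT%H:%M:%SZ"
).replace(tzinfo=timezone.utc)

print(f"{video['snippet']['title']}: {views:,} views, published {published:%Y-%m-%d %H:%M} UTC")
# NGABUBURIT 2: 882,143 views, published 2024-03-17 09:19 UTC
```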

Next, let's extract more detailed information and collect it in a list (advanced approach).

In [32]:
# define the relevant information to extract
infos = {
    'snippet': ['title', 'publishedAt', 'channelTitle'],
    'statistics': ['viewCount']
}

In [33]:
from datetime import datetime, timedelta, timezone

In [34]:
# record the current UTC time as the moment the videos were observed trending
now = datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')
now

'2024-03-18T09:02:57Z'

In [35]:
# loop through different categories of information
for key, value in infos.items():
    print(f'key: {key} ||  value: {value}')

key: snippet ||  value: ['title', 'publishedAt', 'channelTitle']
key: statistics ||  value: ['viewCount']


Let's extend `infos` to cover more fields (`complete_infos`):

In [36]:
complete_infos = {
    'snippet':['title', 'publishedAt', 'channelTitle',
               'description', 'categoryId', 'defaultAudioLanguage', 'thumbnails'],
    'statistics':['viewCount', 'likeCount', 'commentCount'],
    'contentDetails':['duration']
}

In [37]:
# initialize a list to store extracted video details
videos_list = []

# iterate through all videos
for video in videos:
    # initialize a dictionary to store video details
    video_details = {'videoId': video.get("id"), 'trendingAt': now}
    
    for key, value in complete_infos.items():
        # loop through specific details in each category
        for info in value:
            # extract information from the video and store it in the details dictionary
            # use .get() method to handle missing information and avoid KeyError
            video_details[info] = video.get(key, {}).get(info)
    
    # append the extracted details to the list
    videos_list.append(video_details)

**Additional**

```python
video_details[info] = video.get(key, {}).get(info)
```

is equivalent to:

```python
if key in video and info in video[key]:
    video_details[info] = video[key][info]
else:
    video_details[info] = None
```
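
The equivalence can be checked on a small hand-made dictionary. The sketch below also adds a defensive variant, `with_get_defensive`, which is an addition for illustration: it tolerates a key that is present but explicitly set to `None`, a case in which both forms above would raise an error.

```python
def with_get(video, key, info):
    # chained .get(): the compact form
    return video.get(key, {}).get(info)


def with_if(video, key, info):
    # explicit membership checks: the long form
    if key in video and info in video[key]:
        return video[key][info]
    return None


def with_get_defensive(video, key, info):
    # variant that also tolerates a key explicitly set to None
    return (video.get(key) or {}).get(info)


# Hand-made example with no "statistics" key.
video = {"id": "2bopnQpPwgI", "snippet": {"title": "Example title"}}

for key, info in [("snippet", "title"), ("snippet", "missing"), ("statistics", "viewCount")]:
    assert with_get(video, key, info) == with_if(video, key, info)

print(with_get_defensive({"statistics": None}, "statistics", "viewCount"))  # None
```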

##### Write Fetched Videos Data to a JSON file

- `file_path` is the path where the JSON file will be created or overwritten.
- `videos_list` is a list containing dictionaries, with each dictionary representing details of a video.

> Open the target file in write mode ('w') using a 'with' statement. This ensures that the file is properly closed after writing, even if an exception occurs.

In [38]:
# file_path = '/opt/airflow/dags/tmp_file.json'
file_path = 'dags/tmp_file.json'

In [39]:
import json

with open(file_path, "w") as f:
    json.dump(videos_list, f)

In [40]:
len(videos_list)

50

Use the `json.dump()` function to write the contents of `videos_list` to the file `f`. `json.dump()` serializes the `videos_list` into a JSON formatted string and writes it to the file.
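
A quick way to confirm that nothing is lost in serialization is to write the list and read it back. The sketch below does this in a temporary directory so it does not touch project files; the sample record and the `ensure_ascii=False` / `indent=2` options are choices made for this example, not settings used by the notebook.

```python
import json
import os
import tempfile

# Small hand-made sample standing in for the real videos_list.
sample_videos = [
    {"videoId": "2bopnQpPwgI", "title": "Example title", "viewCount": "1330742"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "tmp_file.json")

    # ensure_ascii=False keeps non-ASCII titles readable; indent=2 pretty-prints.
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump(sample_videos, f, ensure_ascii=False, indent=2)

    # Read the file back to verify the round trip.
    with open(demo_path, "r", encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded == sample_videos)
```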

In summary:
1. We sent a request to the YouTube API to fetch videos based on specific criteria.
2. The API responded with the requested videos.
3. We extracted important details like titles, publish dates, and view counts from the response.
4. Then, we processed these details to gain insights or present them to users.

Through these steps, we effectively interacted with the YouTube API, retrieved relevant video data, and made use of it for our purposes.

**Disadvantages of Single Request**:

When relying solely on a single request to fetch data from the YouTube API, there are a few drawbacks to consider:

1. **Limited Data**: With just one request, we can only retrieve a fixed number of videos. This often isn't enough for thorough analysis or meeting application needs, leaving out potentially important data.

2. **Risk of Data Loss**: If the number of videos exceeds what can be fetched in a single request, we may lose valuable data. Without pagination, there's no way to access additional videos beyond what's returned initially.

3. **Increased Latency**: Fetching a large amount of data in one go can lead to delays, slowing down our application's responsiveness. Users might experience longer wait times, impacting their experience.


#### Solution:
To solve the limitations of single requests, we can implement pagination by:

1. Sending an initial request to fetch the first batch of videos.
2. Checking if there are more pages of results available.
3. If more pages are available, sending additional requests for each page until all desired data is retrieved.
4. Processing and aggregating the data from each page to obtain a complete dataset for analysis or presentation.

By following this approach, we ensure comprehensive data retrieval while maintaining efficient resource usage and scalability in our application.

#### [Advanced] Paginate Through Results Using a While Loop:
- **What**: To fetch more than the initial batch of videos, we use a while loop to paginate through the results until we have enough data.
- **Why**: Pagination allows us to retrieve additional batches of videos beyond the initial request.

```python
next_page_token = ""
while next_page_token is not None:
    # your code here
    # ...
    next_page_token = response.get("nextPageToken", None)
```

In [41]:
# fetch every page of results until there is no next page token
videos_list = []
next_page_token = ""

while next_page_token is not None:

    request = youtube.videos().list(
        part="snippet,contentDetails,statistics",
        chart="mostPopular",
        regionCode=region_code,
        maxResults=50,
        pageToken=next_page_token
    )
    response = request.execute()
    videos = response.get("items", [])

    next_page_token = response.get("nextPageToken", None)

    now = datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')
    for video in videos:
        video_details = {
            'videoId': video["id"],
            'trendingAt': now
        }
        
        for key, value in complete_infos.items():
            for info in value:
                video_details[info] = video.get(key, {}).get(info)
                    
        videos_list.append(video_details)

with open(file_path, "w") as f:
    json.dump(videos_list, f)

In [42]:
len(videos_list)

200
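
The loop above keeps requesting pages until no token is returned. If a cap on the total number of videos is wanted, the same pattern can be wrapped in a function that takes the page-fetching call as an argument. The sketch below is one possible shape, tested against a fake three-page API; `fetch_all`, `fake_fetch_page`, and `FAKE_PAGES` are hypothetical names introduced only for this example.

```python
from typing import Callable, Dict, List, Optional


def fetch_all(fetch_page: Callable[[Optional[str]], Dict], max_results: int) -> List[Dict]:
    """Collect items page by page until max_results is reached or pages run out."""
    items: List[Dict] = []
    page_token: Optional[str] = None
    while len(items) < max_results:
        response = fetch_page(page_token)
        items.extend(response.get("items", []))
        page_token = response.get("nextPageToken")
        if page_token is None:
            break  # last page reached
    return items[:max_results]


# Fake three-page API used only for demonstration.
FAKE_PAGES = {
    None: {"items": [{"id": "a"}, {"id": "b"}], "nextPageToken": "p2"},
    "p2": {"items": [{"id": "c"}, {"id": "d"}], "nextPageToken": "p3"},
    "p3": {"items": [{"id": "e"}]},
}


def fake_fetch_page(page_token: Optional[str]) -> Dict:
    return FAKE_PAGES[page_token]


print(len(fetch_all(fake_fetch_page, max_results=3)))    # 3
print(len(fetch_all(fake_fetch_page, max_results=100)))  # 5
```

With the real client, `fetch_page` would presumably wrap the `youtube.videos().list(...).execute()` call and pass the token through as `pageToken`.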

#### Conclusion
By following these steps, we can progressively learn how to fetch videos from YouTube's API, starting from making a single request and advancing to paginating through the results. This approach allows beginners to grasp the fundamentals of API interaction and data retrieval while gradually building their skills.

# Data Preprocessing

In [43]:
# Load the fetched videos data and categories dictionary from json files
with open(file_path, 'r') as f:
    videos_list = json.load(f)

with open('dags/categories.json', 'r') as f:
    categories = json.load(f)

In [44]:
import isodate

In [45]:
complete_infos

{'snippet': ['title',
  'publishedAt',
  'channelTitle',
  'description',
  'categoryId',
  'defaultAudioLanguage',
  'thumbnails'],
 'statistics': ['viewCount', 'likeCount', 'commentCount'],
 'contentDetails': ['duration']}

In [51]:
example_video = videos_list[0]
example_video

{'videoId': '2bopnQpPwgI',
 'trendingAt': '2024-03-18T09:02:57Z',
 'title': 'Manchester United 2-2 (4-3 aet) Liverpool | FA Cup 23/24 Match Highlights',
 'publishedAt': '2024-03-17T18:49:27Z',
 'channelTitle': 'beIN SPORTS Indonesia',
 'description': 'Tonton liga-liga terbaik hanya di beIN SPORTS CONNECT: https://bein.onelink.me/bApY/beinYT \n\nSimak berita dan update menarik seputar liga terbaik dan beIN Sports Indonesia di semua sosial media:\n\nFacebook: https://www.facebook.com/beinsportsindonesia/\nTwitter: https://twitter.com/beinsportsid\nInstagram: https://www.instagram.com/beinsports.id/\nYouTube: https://www.youtube.com/c/beINSPORTSID',
 'categoryId': '17',
 'defaultAudioLanguage': None,
 'thumbnails': {'default': {'url': 'https://i.ytimg.com/vi/2bopnQpPwgI/default.jpg',
   'width': 120,
   'height': 90},
  'medium': {'url': 'https://i.ytimg.com/vi/2bopnQpPwgI/mqdefault.jpg',
   'width': 320,
   'height': 180},
  'high': {'url': 'https://i.ytimg.com/vi/2bopnQpPwgI/hqdefault.j

## Convert categoryId to category based on categories dictionary

In [59]:
example_video.get('categoryId', '')

'17'

In [54]:
# convert categoryId to category based on categories dictionary
categories.get(example_video.get('categoryId', ''))

'Sports'
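
The contents of `categories.json` are not shown here. Judging from the lookup above, it appears to be a flat mapping from category ID strings to names. The sketch below assumes that shape, with a hypothetical two-entry excerpt, and shows how a fallback value can be supplied for IDs that are not in the mapping.

```python
# Assumed shape of the mapping; a hypothetical excerpt, not the real file.
categories = {"10": "Music", "17": "Sports"}


def category_name(category_id, default="Unknown"):
    """Look up a category name, returning a default for unknown or missing IDs."""
    return categories.get(category_id, default)


print(category_name("17"))   # Sports
print(category_name("999"))  # Unknown
print(category_name(17))     # Unknown, because the keys are strings, not integers
```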

## Convert ISO 8601 duration to seconds

In [61]:
example_video.get('duration', '')

'PT11M51S'

In [60]:
# convert ISO 8601 duration to seconds
isodate.parse_duration(example_video.get('duration', '')).total_seconds()

711.0
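
To show what the conversion involves, here is a standard-library sketch that handles only the day, hour, minute, and second components of an ISO 8601 duration. It is not how `isodate` is implemented and it does not cover the full standard (years, months, weeks, and fractional values are unsupported); `duration_to_seconds` is a name chosen for this example.

```python
import re

# Matches durations such as "PT11M51S", "PT1H2M3S", or "P1DT2H".
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def duration_to_seconds(duration: str) -> int:
    """Convert a simple ISO 8601 duration string to a number of seconds."""
    match = _DURATION_RE.match(duration)
    if match is None:
        raise ValueError(f"Unsupported duration: {duration!r}")
    # Components that are absent from the string are treated as zero.
    parts = {name: int(value or 0) for name, value in match.groupdict().items()}
    return (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )


print(duration_to_seconds("PT11M51S"))  # 711
```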

## Parse the thumbnail url

In [None]:
example_video['thumbnails']

In [64]:
# Parse the thumbnail url
example_video.get('thumbnails', {}).get('standard', {}).get('url')

'https://i.ytimg.com/vi/2bopnQpPwgI/sddefault.jpg'
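
The lookup above returns `None` when a video has no `standard` thumbnail, which, as far as I know, is not guaranteed for every video. One way to handle that is to fall back through the other sizes. The sketch below assumes the preference order `standard`, `high`, `medium`, `default`; the helper name and the order are illustrative choices.

```python
# Sizes to try, from most to least preferred (an assumed order for this example).
THUMBNAIL_PREFERENCE = ("standard", "high", "medium", "default")


def pick_thumbnail_url(thumbnails):
    """Return the URL of the first available thumbnail size, or None."""
    thumbnails = thumbnails or {}
    for size in THUMBNAIL_PREFERENCE:
        url = thumbnails.get(size, {}).get("url")
        if url:
            return url
    return None


# Example with only the two smallest sizes present.
thumbnails = {
    "default": {"url": "https://i.ytimg.com/vi/2bopnQpPwgI/default.jpg"},
    "medium": {"url": "https://i.ytimg.com/vi/2bopnQpPwgI/mqdefault.jpg"},
}
print(pick_thumbnail_url(thumbnails))
```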

## Convert viewCount, likeCount, and commentCount to integer

In [67]:
example_video['viewCount']

'1330742'

In [66]:
int(example_video.get('viewCount', 0))

1330742
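
One caveat: the default in `.get('viewCount', 0)` applies only when the key is absent. The fetching step earlier stores `None` for fields the API did not return, so the key can exist with a `None` value, and `int(None)` raises a `TypeError`. A minimal sketch of a tolerant conversion follows; `to_int` is a hypothetical helper name.

```python
def to_int(value, default=0):
    """Convert a count such as '1330742' to int, using default for None or ''."""
    if value is None or value == "":
        return default
    return int(value)


# Hand-made record where likeCount was not returned by the API.
record = {"viewCount": "1330742", "likeCount": None}

print(to_int(record.get("viewCount")))     # 1330742
print(to_int(record.get("likeCount")))     # 0
print(to_int(record.get("commentCount")))  # 0
```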

## Preprocess each video in the list

In [68]:
# process each video in the list
for video in videos_list:
    # convert categoryId to category based on categories dictionary
    video['category'] = categories.get(video.get('categoryId', ''))
    del video['categoryId']

    # convert ISO 8601 duration to seconds, overwriting the raw string in place
    # (the key must not be deleted afterwards, or the converted value is lost)
    video['duration'] = isodate.parse_duration(video.get('duration', '')).total_seconds()

    # parse the thumbnail url (`or {}` covers a key that is present but None)
    video['thumbnailUrl'] = (video.get('thumbnails') or {}).get('standard', {}).get('url')
    del video['thumbnails']

    # convert viewCount, likeCount, and commentCount to integer
    # (`or 0` covers counts that are present but None)
    video['viewCount'] = int(video.get('viewCount') or 0)
    video['likeCount'] = int(video.get('likeCount') or 0)
    video['commentCount'] = int(video.get('commentCount') or 0)

In [70]:
# processed_file_path = '/opt/airflow/dags/tmp_file_processed.json'

processed_file_path = 'dags/tmp_file_processed.json'

In [71]:
# Save the processed videos data to a new file
with open(processed_file_path, "w") as f:
    json.dump(videos_list, f)

In [75]:
def data_processing(source_file_path: str, target_file_path: str):
    """Processes the raw data fetched from YouTube.
    
    Args:
        source_file_path: A string representing the path to the file to be processed.
        target_file_path: A string representing the path to the file to be written.
    """
    # Load the fetched videos data from the json file
    with open(source_file_path, 'r') as f:
        videos_list = json.load(f)
    
    # Load the categories dictionary from the json file
    with open('dags/categories.json', 'r') as f:
        categories = json.load(f)
    
    # process each video in the list
    for video in videos_list:
        # convert categoryId to category based on categories dictionary
        video['category'] = categories.get(video.get('categoryId', ''))
        del video['categoryId']

        # convert ISO 8601 duration to seconds, overwriting the raw string in place
        # (the key must not be deleted afterwards, or the converted value is lost)
        video['duration'] = isodate.parse_duration(video.get('duration', '')).total_seconds()

        # parse the thumbnail url (`or {}` covers a key that is present but None)
        video['thumbnailUrl'] = (video.get('thumbnails') or {}).get('standard', {}).get('url')
        del video['thumbnails']

        # convert viewCount, likeCount, and commentCount to integer
        # (`or 0` covers counts that are present but None)
        video['viewCount'] = int(video.get('viewCount') or 0)
        video['likeCount'] = int(video.get('likeCount') or 0)
        video['commentCount'] = int(video.get('commentCount') or 0)

    # save the processed videos data to a new file
    with open(target_file_path, "w") as f:
        json.dump(videos_list, f)

    print('Done preprocessed data.')

In [76]:
data_processing(file_path, processed_file_path)

Done preprocessed data.
