# Fetch Raw Data From YouTube

In this notebook, we'll dive into YouTube data to construct an NLP dataset from a YouTube channel. Our focus will be on a compelling episode of the Lex Fridman podcast featuring [astrobiologist Betül Kaçar](https://www.youtube.com/watch?v=NXU_M4030nE&ab_channel=LexFridman).

Luckily for us, YouTube is full of freely contributed, user-generated text. To collect it from our target channel, we'll query the YouTube Data API.

In [2]:
# Helper libraries
import warnings

# Scientific and visual libraries
import pandas as pd
import googleapiclient.errors

%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format = 'retina'

# Various settings
warnings.filterwarnings("ignore")
pd.set_option("display.max_rows", 120)
pd.set_option("display.max_colwidth", 40)
pd.set_option("display.precision", 4)
pd.set_option("display.max_columns", None)

To kick off our analysis, we first need to gather the raw comments from the chosen YouTube video. Using the YouTube Data API, our ETL process will collect the data and load it into a pandas DataFrame.

The code below shows the extraction step. Using the `googleapiclient` library, the function queries the YouTube API and captures, for each comment, the author, the publication and update timestamps, the like count, and the text.

The `fetch_youtube_comments` function requires your personal YouTube API key for authentication. It is implemented in the `data` module.
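
The `data` module itself is not shown in this notebook, so here is only a rough sketch of what such an extraction function might look like. The names `parse_comment_threads` and `fetch_comments_sketch` are hypothetical, and the sketch assumes the function pages through the API's `commentThreads.list` endpoint and keeps top-level comments only; the real implementation may differ.

```python
from typing import Any, Dict, List


def parse_comment_threads(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten one `commentThreads.list` response page into row dicts."""
    rows = []
    for item in response.get("items", []):
        top = item["snippet"]["topLevelComment"]["snippet"]
        rows.append(
            {
                "author": top.get("authorDisplayName"),
                "published_at": top.get("publishedAt"),
                "updated_at": top.get("updatedAt"),
                "likes": top.get("likeCount"),
                "text": top.get("textDisplay"),
            }
        )
    return rows


def fetch_comments_sketch(part, video_id, limit, service_name, version, api_key):
    """Hypothetical stand-in for `fetch_youtube_comments` (an assumption)."""
    # Imported lazily so the parsing helper works without the client installed.
    from googleapiclient.discovery import build

    youtube = build(service_name, version, developerKey=api_key)
    rows: List[Dict[str, Any]] = []
    page_token = None
    while len(rows) < limit:
        params = {
            "part": part,
            "videoId": video_id,
            # The API returns at most 100 threads per page.
            "maxResults": min(100, limit - len(rows)),
            "textFormat": "plainText",
        }
        if page_token:
            params["pageToken"] = page_token
        response = youtube.commentThreads().list(**params).execute()
        rows.extend(parse_comment_threads(response))
        page_token = response.get("nextPageToken")
        if not page_token:
            break  # no more pages
    return rows[:limit]
```

Keeping the parsing step separate from the network call makes it possible to check the row layout against a saved response without spending API quota.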

In [3]:
from youtube_analysis.data import fetch_youtube_comments
import youtube_analysis.config as config


def fetch_batch_data_from_video(vid, limit=3000):
    """Fetch up to `limit` comments of video `vid` as a DataFrame."""
    api_service_name = "youtube"
    api_version = "v3"
    api_key = config.YOUTUBE_API_KEY
    columns = ["author", "published_at", "updated_at", "likes", "text"]
    try:
        comments = fetch_youtube_comments(
            "snippet", vid, limit, api_service_name, api_version, api_key
        )
    except googleapiclient.errors.HttpError as e:
        print(f"Error {e.resp.status} occurred while fetching data:\n{e.content}")
        # Return an empty frame so the function never references an unbound variable.
        return pd.DataFrame(columns=columns)
    return pd.DataFrame(comments, columns=columns)

We first need to copy the target video's ID from YouTube (the `v` parameter of its URL). We can then use it like this:

In [4]:
lex_video_id = "NXU_M4030nE"
lex_comments = fetch_batch_data_from_video(vid=lex_video_id)
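
Rather than copying the ID by hand, it can also be read from the watch URL with the standard library. This is a small sketch; `extract_video_id` is a hypothetical helper, not part of the `youtube_analysis` package, and it only handles `watch?v=...` style URLs.

```python
from urllib.parse import parse_qs, urlparse


def extract_video_id(url: str) -> str:
    """Return the `v` query parameter of a youtube.com/watch URL."""
    query = parse_qs(urlparse(url).query)
    if "v" not in query:
        raise ValueError(f"No video ID found in URL: {url}")
    return query["v"][0]


url = "https://www.youtube.com/watch?v=NXU_M4030nE&ab_channel=LexFridman"
print(extract_video_id(url))  # NXU_M4030nE
```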

In [5]:
lex_comments.head()

|   | author | published_at | updated_at | likes | text |
|---|---|---|---|---|---|
| 0 | Lex Fridman | 2022-12-29T17:34:04Z | 2022-12-29T17:34:04Z | 194 | Here are the timestamps. Please chec... |
| 1 | John Dickinson | 2023-11-19T10:59:46Z | 2023-11-19T10:59:46Z | 0 | Protein research is a major new brea... |
| 2 | John Dickinson | 2023-11-19T10:50:38Z | 2023-11-19T10:50:38Z | 0 | This is a good one. |
| 3 | john g henderson | 2023-11-18T03:47:08Z | 2023-11-18T03:49:31Z | 0 | A very interesting conversation unti... |
| 4 | arife dickerson | 2023-11-17T20:49:24Z | 2023-11-17T20:49:24Z | 0 | This chick is always in Bilderberg g... |


Now all we have to do is save our corpus. To avoid hard-coding lengthy paths and copy/pasting them repeatedly, I import my own `paths` module. I save in pickle format because it is convenient for Python objects.

In [6]:
from youtube_analysis.paths import RAW_DATA_DIR

lex_comments.to_pickle(RAW_DATA_DIR / "lex_comments.pkl")
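
As a quick sanity check, a pickled DataFrame can be loaded back with `pd.read_pickle` and compared with the original. The sketch below uses a tiny made-up frame and a temporary directory instead of `RAW_DATA_DIR`, so the values and the file name are illustrative assumptions only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up stand-in with the same columns as `lex_comments`.
sample = pd.DataFrame(
    {
        "author": ["example_user"],
        "published_at": ["2023-01-01T00:00:00Z"],
        "updated_at": ["2023-01-01T00:00:00Z"],
        "likes": [0],
        "text": ["Great episode!"],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample_comments.pkl"  # hypothetical file name
    sample.to_pickle(path)
    restored = pd.read_pickle(path)

# Pickle keeps dtypes and the index, so the round trip is lossless.
assert restored.equals(sample)
```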