<center><h1 style="background-color: #C6F3CD; border-radius: 10px; color: #FFFFFF; padding: 5px;">
Building a Simple Data Pipeline
</h1></center>

**Link to the article**: https://medium.com/ai-in-plain-english/how-to-build-data-pipelines-for-ml-projects-5675662c0483?sk=7b3e13d667f1ce337ace8985df05072e

In [1]:
import requests
import json
import polars as pl
from API import youtube  # my YouTube Data API key
from youtube_transcript_api import YouTubeTranscriptApi

<center><h1 style="background-color: #C6F3CD; border-radius: 10px; color: #FFFFFF; padding: 5px;">
Extract
</h1></center>

In [2]:
def getVideoRecords(response: requests.models.Response) -> list:
    """
        Function to extract YouTube video data from GET request response
    """

    video_record_list = []
    
    for raw_item in json.loads(response.text)['items']:
    
        # only execute for youtube videos
        if raw_item['id']['kind'] != "youtube#video":
            continue
        
        video_record = {}
        video_record['video_id'] = raw_item['id']['videoId']
        video_record['datetime'] = raw_item['snippet']['publishedAt']
        video_record['title'] = raw_item['snippet']['title']
        
        video_record_list.append(video_record)

    return video_record_list

In [3]:
channel_id = "UCCF6pCTGMKdo9r_kFQS-H3Q"
# define URL for the YouTube Search API
url = "https://www.googleapis.com/youtube/v3/search"
# initialize page token
page_token = None
# initialize list to store video data
video_record_list = []

In [4]:
# extract video data across multiple search result pages
while page_token != 0:
    # define parameters for API call
    params = {"key": youtube, 'channelId': channel_id, 'part': ["snippet","id"],
               'order': "date", 'maxResults':50, 'pageToken': page_token}
    # make get request
    response = requests.get(url, params=params)

    # append video records to list
    video_record_list += getVideoRecords(response)

    try:
        # grab next page token
        page_token = json.loads(response.text)['nextPageToken']
    except KeyError:
        # no next page token in the response: end the while loop
        page_token = 0
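
The pagination loop above can also be written as a small generator that stops as soon as a response carries no `nextPageToken`. This is only a sketch: `iter_pages` and `fetch_page` are hypothetical names (`fetch_page` stands in for the `requests.get(...)` call plus JSON decoding), and the fake pages are made-up data so the loop can be exercised without an API key.

```python
from typing import Callable, Iterator, Optional


def iter_pages(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield decoded result pages until no 'nextPageToken' is returned."""
    page_token = None
    while True:
        page = fetch_page(page_token)
        yield page
        # a missing token means this was the last page
        page_token = page.get("nextPageToken")
        if page_token is None:
            break


# Made-up pages keyed by page token, standing in for real API responses
FAKE_PAGES = {
    None: {"items": ["a", "b"], "nextPageToken": "p2"},
    "p2": {"items": ["c"]},
}

all_items = []
for page in iter_pages(lambda token: FAKE_PAGES[token]):
    all_items += page["items"]
```

Keeping the stop condition inside the generator avoids the `0` sentinel and leaves the calling loop free to focus on collecting records.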

In [5]:
df = pl.DataFrame(video_record_list)
df.head()

video_id,datetime,title
str,str,str
"""lPdek5rgQlQ""","""2024-07-11T17:00:26Z""","""I&#39;ve said it once if I&#39…"
"""DcIV4YCfAw0""","""2024-07-11T09:00:10Z""","""Lost Temple and the SECRET GLA…"
"""ryhJSUAS_dw""","""2024-07-10T17:00:41Z""","""I gave this guy +13 IN CLAWS t…"
"""Vtpgsf_hpO0""","""2024-07-10T09:00:24Z""","""There&#39;s 3 games here, you …"
"""NTuRDDvjkvQ""","""2024-07-09T17:00:16Z""","""This guy is so INCREDIBLY BM a…"


In [6]:
def extract_text(transcript: list) -> str:
    """
        Function to join the text of all transcript segments (a list of dicts)
    """
    
    text_list = [segment['text'] for segment in transcript]
    return ' '.join(text_list)

In [7]:
transcript_text_list = []

for video_id in df['video_id']:

    # try to extract captions
    try:
        transcript = YouTubeTranscriptApi.get_transcript(video_id)
        transcript_text = extract_text(transcript)
    # if not available set as n/a
    except Exception:
        transcript_text = "n/a"
    
    transcript_text_list.append(transcript_text)

In [8]:
# add transcripts to dataframe
df = df.with_columns(pl.Series(name="transcript", values=transcript_text_list))
df.head()

video_id,datetime,title,transcript
str,str,str,str
"""lPdek5rgQlQ""","""2024-07-11T17:00:26Z""","""I&#39;ve said it once if I&#39…","""yo it's PTO again let's go P N…"
"""DcIV4YCfAw0""","""2024-07-11T09:00:10Z""","""Lost Temple and the SECRET GLA…","""huntresses don't need a buff t…"
"""ryhJSUAS_dw""","""2024-07-10T17:00:41Z""","""I gave this guy +13 IN CLAWS t…","""guys what's up it is time to f…"
"""Vtpgsf_hpO0""","""2024-07-10T09:00:24Z""","""There&#39;s 3 games here, you …","""yo guys what's up it's W3 Cham…"
"""NTuRDDvjkvQ""","""2024-07-09T17:00:16Z""","""This guy is so INCREDIBLY BM a…","""oh it's this guy night off ver…"


<center><h1 style="background-color: #C6F3CD; border-radius: 10px; color: #FFFFFF; padding: 5px;">
Transform
</h1></center>

In [66]:
# Check for duplicate values

print("shape:", df.shape)
print("n unique rows:", df.n_unique())
for j in range(df.shape[1]):
    print("n unique elements (" + df.columns[j] + "):", df[:,j].n_unique())

shape: (515, 4)
n unique rows: 515
n unique elements (video_id): 515
n unique elements (datetime): 515
n unique elements (title): 515
n unique elements (transcript): 417


In [65]:

sorted_df = df.sort('video_id')

# Identify duplicates based on 'video_id'
# (shift leaves a null in the first position; fill it so the first row is kept)
duplicates_mask = (sorted_df['video_id'].shift(1) == sorted_df['video_id']).fill_null(False)

# Print information about duplicates
print("Duplicate rows:")
print(sorted_df.filter(duplicates_mask))

# Remove duplicates and update df
df = sorted_df.filter(~duplicates_mask)

# Print updated shape and unique counts
print("Shape after removing duplicates:", df.shape)
print("Number of unique rows:", df.n_unique())
for col in df.columns:
    print(f"Number of unique elements ({col}):", df[col].n_unique())


Duplicate rows:
shape: (15, 4)
┌─────────────┬─────────────────────┬───────────────────────────────┬──────────────────────────────┐
│ video_id    ┆ datetime            ┆ title                         ┆ transcript                   │
│ ---         ┆ ---                 ┆ ---                           ┆ ---                          │
│ str         ┆ datetime[μs]        ┆ str                           ┆ str                          │
╞═════════════╪═════════════════════╪═══════════════════════════════╪══════════════════════════════╡
│ 0YZFjTaKc8M ┆ 2023-05-08 10:04:17 ┆ My Ally LEFT GAME IMMEDIATELY ┆ n/a                          │
│             ┆                     ┆ …                             ┆                              │
│ 1ePFN0fR9KM ┆ 2022-06-26 20:30:00 ┆ Here We Go Again... | WC3 |   ┆ n/a                          │
│             ┆                     ┆ Gr…                           ┆                              │
│ 5GPz35VbkMg ┆ 2021-10-20 11:46:25 ┆ Grubby | WC3 | [LEGEND
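
Duplicates can also be removed before the DataFrame is built, at the level of the record dictionaries. A minimal sketch, assuming records shaped like those returned by `getVideoRecords`; `drop_duplicate_records` and the sample records are hypothetical and not part of the original notebook.

```python
def drop_duplicate_records(records: list, key: str = "video_id") -> list:
    """Keep the first record seen for each value of `key`, preserving order."""
    seen = set()
    unique_records = []
    for record in records:
        if record[key] in seen:
            continue
        seen.add(record[key])
        unique_records.append(record)
    return unique_records


# Made-up records for illustration
sample_records = [
    {"video_id": "abc", "title": "first"},
    {"video_id": "xyz", "title": "second"},
    {"video_id": "abc", "title": "first (repeated)"},
]
deduped = drop_duplicate_records(sample_records)
```

On the DataFrame itself, Polars also offers `DataFrame.unique(subset="video_id")` for the same purpose.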

In [23]:
# Convert the datetime column from string to Datetime
df = df.with_columns(pl.col('datetime').cast(pl.Datetime))
print(df.head())

shape: (5, 4)
┌─────────────┬─────────────────────┬───────────────────────────────┬──────────────────────────────┐
│ video_id    ┆ datetime            ┆ title                         ┆ transcript                   │
│ ---         ┆ ---                 ┆ ---                           ┆ ---                          │
│ str         ┆ datetime[μs]        ┆ str                           ┆ str                          │
╞═════════════╪═════════════════════╪═══════════════════════════════╪══════════════════════════════╡
│ lPdek5rgQlQ ┆ 2024-07-11 17:00:26 ┆ I&#39;ve said it once if      ┆ yo it's PTO again let's go P │
│             ┆                     ┆ I&#39…                        ┆ N…                           │
│ DcIV4YCfAw0 ┆ 2024-07-11 09:00:10 ┆ Lost Temple and the SECRET    ┆ huntresses don't need a buff │
│             ┆                     ┆ GLA…                          ┆ t…                           │
│ ryhJSUAS_dw ┆ 2024-07-10 17:00:41 ┆ I gave this guy +13 IN CLAWS  ┆ guys wh

In [27]:
# Handle special characters
print(df['title'][3])

There&#39;s 3 games here, you need to see all 3 to appreciate the 3rd - WC3


In [28]:
print(df['transcript'][3])

yo guys what's up it's W3 Champions felt like uh giving myself and you guys some juicy orc versus night off try hard or at least some juicy uh try hard games of course I didn't know what race I'm going to be playing against be a bit weird if I knew who I was going to be playing against wouldn't it like how do you know do you have some kind of cheat no I don't have a cheat why do you ask so yeah p is very good um I'm going to be doing this cool thing that I've seen something doing uh recently from uh Hitman do you want to know what it is it's a cool thing it's uh I've seen it from Hitman but before that I've seen it from me who knows maybe he's seen it from me I first did this in 2006 but I saw him do this six days ago it's basically where you get double circlet and then you start uh doing knock knock jokes in the night of place thank you for the sub con saon something so you do a scout with a Pon and then you go home but you don't really go home you instead make a Voodoo Lounge on The 

In [53]:
special_strings = ['&#39;', '&amp;']
special_string_replacements = ["'", "&"]

# replace every occurrence (str.replace only replaces the first match)
for special, replacement in zip(special_strings, special_string_replacements):
    df = df.with_columns(
        pl.col('title').str.replace_all(special, replacement, literal=True),
        pl.col('transcript').str.replace_all(special, replacement, literal=True),
    )
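
The replacement loop above handles a fixed list of entities. As an alternative sketch, Python's standard library can decode any HTML character reference with `html.unescape`; the helper name `clean_text` and the sample string (including the `&amp; more` part) are made up for illustration.

```python
import html


def clean_text(text: str) -> str:
    """Decode HTML character references such as &#39; and &amp;."""
    return html.unescape(text)


# Made-up sample string in the style of the video titles
cleaned = clean_text("There&#39;s 3 games here &amp; more")
```

One way to apply this to a column would be `pl.col('title').map_elements(clean_text, return_dtype=pl.String)`, at the cost of running Python code for every row.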

In [54]:
print(df['title'][3])

There's 3 games here, you need to see all 3 to appreciate the 3rd - WC3


In [55]:
df

video_id,datetime,title,transcript
str,datetime[μs],str,str
"""lPdek5rgQlQ""",2024-07-11 17:00:26,"""I've said it once if I've said…","""yo it's PTO again let's go P N…"
"""DcIV4YCfAw0""",2024-07-11 09:00:10,"""Lost Temple and the SECRET GLA…","""huntresses don't need a buff t…"
"""ryhJSUAS_dw""",2024-07-10 17:00:41,"""I gave this guy +13 IN CLAWS t…","""guys what's up it is time to f…"
"""Vtpgsf_hpO0""",2024-07-10 09:00:24,"""There's 3 games here, you need…","""yo guys what's up it's W3 Cham…"
"""NTuRDDvjkvQ""",2024-07-09 17:00:16,"""This guy is so INCREDIBLY BM a…","""oh it's this guy night off ver…"
…,…,…,…
"""CSm6i0tPv-I""",2020-11-20 14:09:46,"""Grubby | WC3 | GHOULS + ALL TH…","""creep to look for other aras g…"
"""IhvrXf9ZE60""",2020-11-18 13:40:07,"""Grubby | WC3 | 4v4 | Losing My…","""n/a"""
"""XphBV-QeZq0""",2020-11-14 00:23:56,"""Grubby | WC3 | KEYBOARD PoV!""","""paladin [Laughter] all right g…"
"""YhM4TrOYsG0""",2020-11-09 23:24:02,"""Grubby | WC3 | Down To Basics …","""all right so what we're about …"


<center><h1 style="background-color: #C6F3CD; border-radius: 10px; color: #FFFFFF; padding: 5px;">
Load
</h1></center>

In [57]:
df.write_parquet('video-transcripts.parquet')
df.write_csv('video-transcripts.csv')