# Welcome to "The Real Househelps of Kawangware" Episode Link Retrieval Notebook!

🎬 Lights, Camera, Links! 📺

Get ready to streamline your "The Real Househelps of Kawangware" binge-watching experience like never before! In this notebook, we're on a mission to organize and gather all the episode links from the show's YouTube channel, ensuring you can navigate through the series effortlessly.

No more endless scrolling or frantic searching – we're here to curate a neatly ordered list of episode links, so you can dive straight into the drama without missing a beat. From the pilot to the latest release, every episode link will be at your fingertips, meticulously arranged for your convenience.

Join us as we harness the power of web scraping to extract these valuable links and present them in sequential order. With our streamlined approach, you'll spend less time searching and more time indulging in the captivating world of "The Real Househelps of Kawangware."

So grab your 🍿, settle into your favorite spot, and let's embark on this quest to organize and conquer the episode links. Lights, camera, links – let's make binge-watching a breeze!

# Imports

Import the libraries needed for the project.

In [3]:
# common imports 
import numpy as np
import pandas as pd


# Data Sources 

The data source for this project is YouTube. The data is scraped from the main TRHK YouTube channel using the `scrapetube` package.

### Scrape Tube

ScrapeTube is a Python library designed specifically for scraping data from YouTube. It offers a convenient and efficient way to extract various types of information from YouTube channels, videos, and playlists. 📺 👾

[Check out ScrapeTube here!](https://github.com/TeamHG-Memex/scrape-tube)

In [2]:
# Scrapetube import
import scrapetube

# Get the videos from the TRHK channel, identified by its channel ID
videos = scrapetube.get_channel("UCP456Szyc9zy-f0j3UoHzeg")



Create an empty DataFrame to store our data; let's call it `df`.


In [3]:
df = pd.DataFrame()

Loop through all the YouTube video objects returned by the scraper and append them to a list.


In [4]:

videos_list = []
for video in  videos:
    videos_list.append(video)
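
The loop above walks an iterator and collects every item. As a minimal sketch, assuming `videos` behaves like an ordinary Python generator, the same result can be had with `list()`, and `itertools.islice` can take just the first few items for a quick trial run. The `fake_channel_videos` function below is a hypothetical stand-in for the scraped iterator, not part of ScrapeTube.

```python
from itertools import islice


def fake_channel_videos(n):
    """Hypothetical stand-in for the scraped iterator; yields n fake records."""
    for i in range(n):
        yield {"videoId": f"vid{i:03d}"}


# list() consumes the whole iterator, giving the same result as the append loop
all_videos = list(fake_channel_videos(5))

# islice() stops after the first two items without consuming the rest
first_two = list(islice(fake_channel_videos(5), 2))

print(len(all_videos), [v["videoId"] for v in first_two])
```

Note that a generator can only be consumed once, so any second pass should use the list rather than the original iterator.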

Convert the list to a pandas DataFrame with `pd.DataFrame(videos_list)`.


In [5]:
df = pd.DataFrame(videos_list)

In [6]:
# save a copy of our scraped data before cleaning 
df.to_csv("data/scraped_data.csv")

## Data Description

In [7]:
# getting Dataframe info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 710 entries, 0 to 709
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   videoId             710 non-null    object
 1   thumbnail           710 non-null    object
 2   title               710 non-null    object
 3   descriptionSnippet  709 non-null    object
 4   publishedTimeText   710 non-null    object
 5   lengthText          710 non-null    object
 6   viewCountText       710 non-null    object
 7   navigationEndpoint  710 non-null    object
 8   trackingParams      710 non-null    object
 9   showActionMenu      710 non-null    bool  
 10  shortViewCountText  710 non-null    object
 11  menu                710 non-null    object
 12  thumbnailOverlays   710 non-null    object
 13  richThumbnail       564 non-null    object
dtypes: bool(1), object(13)
memory usage: 72.9+ KB


- We have 710 videos from the channel with 14 columns

- Most of our data is in nested objects / lists and needs to be cleaned 🧹

- We also have unnecessary columns for our use case  🗑️

- Our data is relatively complete: only `descriptionSnippet` (1 missing value) and `richThumbnail` (146 missing values) contain nulls 🧐
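
To get a feel for what "nested objects" means here, the sketch below flattens one record with `pd.json_normalize`. The `sample` record is hypothetical and only mimics the shape seen in the scraped columns; the notebook itself cleans each column by hand further down, so treat this as an alternative way to peek at the structure rather than the method used here.

```python
import pandas as pd

# Hypothetical record shaped like the nested rows in the scraped data
sample = [{
    "videoId": "abc123",
    "title": {"runs": [{"text": "Example title | TRHK EP1 Pt 1"}]},
    "publishedTimeText": {"simpleText": "10 months ago"},
    "viewCountText": {"simpleText": "1,234 views"},
}]

# Nested dicts become dotted column names; lists (like title.runs) stay as lists
flat = pd.json_normalize(sample)
print(flat.columns.tolist())
```

Dict-valued fields such as `publishedTimeText` flatten nicely, while list-valued fields such as `title.runs` still need an explicit extraction step.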


## Data Cleaning

### Dropping unnecessary columns  

We keep the following columns:

* **videoId** - the unique identifier for the video

* **thumbnail** - a thumbnail image of the video

* **title** - the title of the video

* **descriptionSnippet** - a short description of the video

* **publishedTimeText** - how long ago the video was published, as relative text (for example, "10 months ago")

* **lengthText** - the length of the video

* **viewCountText** - the number of views the video has received

### Columns to drop

The following columns are not necessary for our analysis and can be dropped:

* `navigationEndpoint`: This column contains the URL to the video's page on YouTube. We rebuild the link from `videoId` instead.
* `trackingParams`: This column contains tracking parameters that are used by YouTube to track the performance of the video.
* `showActionMenu`: This column indicates whether or not the action menu is visible on the video's page.
* `shortViewCountText`: This column contains an abbreviated view count, which is redundant with `viewCountText`.
* `menu`: This column contains the menu items that are available on the video's page.
* `thumbnailOverlays`: This column contains the overlays that are displayed on the video's thumbnail.
* `richThumbnail`: This column contains the rich thumbnail that is displayed on the video's page.
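
A small sketch of the drop step on a toy frame, in case the scraper returns a slightly different set of columns on another run. The `toy` frame and the `unwanted` list are made up for illustration; the `errors="ignore"` option tells `drop` to skip labels that are not present instead of raising a `KeyError`.

```python
import pandas as pd

# Toy frame with a few of the scraped column names
toy = pd.DataFrame({
    "videoId": ["abc123"],
    "title": ["Example title"],
    "trackingParams": ["xyz"],
    "menu": ["..."],
})

# richThumbnail is deliberately absent from the toy frame
unwanted = ["trackingParams", "menu", "richThumbnail"]

# errors="ignore" skips missing labels instead of raising KeyError
trimmed = toy.drop(columns=unwanted, errors="ignore")
print(trimmed.columns.tolist())
```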


In [8]:
df = df.drop(columns=['trackingParams','showActionMenu','shortViewCountText','menu','thumbnailOverlays','richThumbnail',"navigationEndpoint"])
df.head(5)

Unnamed: 0,videoId,thumbnail,title,descriptionSnippet,publishedTimeText,lengthText,viewCountText
0,nsJIaPvbspg,{'thumbnails': [{'url': 'https://i.ytimg.com/v...,{'runs': [{'text': 'Heartbreak ni ile ile | TR...,{'runs': [{'text': 'The Real Househelps of Kaw...,{'simpleText': '10 months ago'},{'accessibility': {'accessibilityData': {'labe...,"{'simpleText': '251,135 views'}"
1,kn11JasWx28,{'thumbnails': [{'url': 'https://i.ytimg.com/v...,{'runs': [{'text': 'Luma Mongaras wakwende uko...,{'runs': [{'text': 'The Real Househelps of Kaw...,{'simpleText': '10 months ago'},{'accessibility': {'accessibilityData': {'labe...,"{'simpleText': '197,956 views'}"
2,EqvW6AwLB5s,{'thumbnails': [{'url': 'https://i.ytimg.com/v...,{'runs': [{'text': 'Mapenzi Inawaramba! | TRHK...,{'runs': [{'text': 'The Real Househelps of Kaw...,{'simpleText': '10 months ago'},{'accessibility': {'accessibilityData': {'labe...,"{'simpleText': '47,116 views'}"
3,xX_6THG8YR4,{'thumbnails': [{'url': 'https://i.ytimg.com/v...,{'runs': [{'text': 'Ndanda ni wewe! | TRHK EP3...,{'runs': [{'text': 'The Real Househelps of Kaw...,{'simpleText': '10 months ago'},{'accessibility': {'accessibilityData': {'labe...,"{'simpleText': '153,011 views'}"
4,qZ2jpOiPQsQ,{'thumbnails': [{'url': 'https://i.ytimg.com/v...,{'runs': [{'text': 'He he he!!! Sema kuwithdra...,{'runs': [{'text': 'The Real Househelps of Kaw...,{'simpleText': '10 months ago'},{'accessibility': {'accessibilityData': {'labe...,"{'simpleText': '179,489 views'}"


## Checking for null values

In [9]:
df.isnull().sum()

videoId               0
thumbnail             0
title                 0
descriptionSnippet    1
publishedTimeText     0
lengthText            0
viewCountText         0
dtype: int64

We have only 1 null value, in `descriptionSnippet`.

In [10]:
# finding the row with the null value
df[df["descriptionSnippet"].isna()]

Unnamed: 0,videoId,thumbnail,title,descriptionSnippet,publishedTimeText,lengthText,viewCountText
709,q7gLD7y8oik,{'thumbnails': [{'url': 'https://i.ytimg.com/v...,{'runs': [{'text': 'The Real Househelps of Kaw...,,{'simpleText': '10 years ago'},{'accessibility': {'accessibilityData': {'labe...,"{'simpleText': '16,037 views'}"


The row above is the only one with a null value.

## Data conversion

### Cleaning thumbnail column

In [11]:
# Getting thumbnail link from nested data 
df['thumbnail'] = df['thumbnail'].apply(lambda x:x['thumbnails'][0]['url'])
df.head(2)

Unnamed: 0,videoId,thumbnail,title,descriptionSnippet,publishedTimeText,lengthText,viewCountText
0,nsJIaPvbspg,https://i.ytimg.com/vi/nsJIaPvbspg/hqdefault.j...,{'runs': [{'text': 'Heartbreak ni ile ile | TR...,{'runs': [{'text': 'The Real Househelps of Kaw...,{'simpleText': '10 months ago'},{'accessibility': {'accessibilityData': {'labe...,"{'simpleText': '251,135 views'}"
1,kn11JasWx28,https://i.ytimg.com/vi/kn11JasWx28/hqdefault.j...,{'runs': [{'text': 'Luma Mongaras wakwende uko...,{'runs': [{'text': 'The Real Househelps of Kaw...,{'simpleText': '10 months ago'},{'accessibility': {'accessibilityData': {'labe...,"{'simpleText': '197,956 views'}"


### Cleaning title Column 

In [12]:
# Getting title text from nested title data
df['title'] = df["title"].apply(lambda x:x['runs'][0]['text'])
df.head(2)

Unnamed: 0,videoId,thumbnail,title,descriptionSnippet,publishedTimeText,lengthText,viewCountText
0,nsJIaPvbspg,https://i.ytimg.com/vi/nsJIaPvbspg/hqdefault.j...,Heartbreak ni ile ile | TRHK EP312 Pt 2,{'runs': [{'text': 'The Real Househelps of Kaw...,{'simpleText': '10 months ago'},{'accessibility': {'accessibilityData': {'labe...,"{'simpleText': '251,135 views'}"
1,kn11JasWx28,https://i.ytimg.com/vi/kn11JasWx28/hqdefault.j...,Luma Mongaras wakwende uko NKT! | TRHK EP312 Pt 1,{'runs': [{'text': 'The Real Househelps of Kaw...,{'simpleText': '10 months ago'},{'accessibility': {'accessibilityData': {'labe...,"{'simpleText': '197,956 views'}"
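
The cleaned titles shown above end with tags such as `TRHK EP312 Pt 2` or `TRHK EP312 Promo`. As a rough sketch, assuming that pattern holds for most titles (we have not checked all 710), a regular expression can pull out the episode and part numbers, which gives a sort key that does not depend on scrape order. The `parse_episode` helper is hypothetical.

```python
import re

# Pattern assumed from titles such as "... | TRHK EP312 Pt 2"
EPISODE_RE = re.compile(r"\bEP\s*(\d+)(?:\s+Pt\s*(\d+))?")


def parse_episode(title):
    """Return (episode, part) as ints, or None if the title has no EP tag.

    Titles without a part number (promos, for example) get part 0.
    """
    match = EPISODE_RE.search(title)
    if match is None:
        return None
    episode = int(match.group(1))
    part = int(match.group(2)) if match.group(2) else 0
    return episode, part


titles = [
    "Heartbreak ni ile ile | TRHK EP312 Pt 2",
    "Luma Mongaras wakwende uko NKT! | TRHK EP312 Pt 1",
    "Ndanda ni wewe! | TRHK EP311 Pt 2",
]

# Sort from the earliest episode and part to the latest
ordered = sorted(titles, key=parse_episode)
print(ordered)
```

Titles that do not match return `None`, so they would need to be filtered out or given a fallback key before sorting.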


### Cleaning descriptionSnippet Column

In [13]:
# Getting description text from nested descriptionSnippet 
# For null values fill with unknown
df['descriptionSnippet'] = df['descriptionSnippet'].apply(lambda x: x['runs'][0]['text'] if isinstance(x,dict) else "unknown")
df.head(2)

Unnamed: 0,videoId,thumbnail,title,descriptionSnippet,publishedTimeText,lengthText,viewCountText
0,nsJIaPvbspg,https://i.ytimg.com/vi/nsJIaPvbspg/hqdefault.j...,Heartbreak ni ile ile | TRHK EP312 Pt 2,The Real Househelps of Kawangware follows the ...,{'simpleText': '10 months ago'},{'accessibility': {'accessibilityData': {'labe...,"{'simpleText': '251,135 views'}"
1,kn11JasWx28,https://i.ytimg.com/vi/kn11JasWx28/hqdefault.j...,Luma Mongaras wakwende uko NKT! | TRHK EP312 Pt 1,The Real Househelps of Kawangware follows the ...,{'simpleText': '10 months ago'},{'accessibility': {'accessibilityData': {'labe...,"{'simpleText': '197,956 views'}"
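
The `isinstance` check above guards against the one missing description. A slightly more general sketch is a small helper that walks a path of keys and indexes and falls back to a default whenever a step is missing. The name `dig` is our own invention, not a pandas or ScrapeTube function.

```python
def dig(obj, *path, default="unknown"):
    """Follow keys/indexes in `path`; return `default` if any step is missing."""
    for key in path:
        try:
            obj = obj[key]
        except (KeyError, IndexError, TypeError):
            # KeyError: missing dict key, IndexError: empty list,
            # TypeError: value is not subscriptable (for example NaN)
            return default
    return obj


nested = {"runs": [{"text": "A short description"}]}
print(dig(nested, "runs", 0, "text"))
print(dig(float("nan"), "runs", 0, "text"))
```

With such a helper, a cell like the one above could be written as `df['descriptionSnippet'].apply(lambda x: dig(x, 'runs', 0, 'text'))`, and it would also cope with an empty `runs` list.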


### Cleaning publishedTimeText Column

In [14]:
# Getting publishedTimeText from nested object 
df['publishedTimeText'] = df['publishedTimeText'].apply(lambda x:x['simpleText'])
df.head(2)

Unnamed: 0,videoId,thumbnail,title,descriptionSnippet,publishedTimeText,lengthText,viewCountText
0,nsJIaPvbspg,https://i.ytimg.com/vi/nsJIaPvbspg/hqdefault.j...,Heartbreak ni ile ile | TRHK EP312 Pt 2,The Real Househelps of Kawangware follows the ...,10 months ago,{'accessibility': {'accessibilityData': {'labe...,"{'simpleText': '251,135 views'}"
1,kn11JasWx28,https://i.ytimg.com/vi/kn11JasWx28/hqdefault.j...,Luma Mongaras wakwende uko NKT! | TRHK EP312 Pt 1,The Real Househelps of Kawangware follows the ...,10 months ago,{'accessibility': {'accessibilityData': {'labe...,"{'simpleText': '197,956 views'}"
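
Relative text like "10 months ago" cannot be turned into an exact date, but it can be approximated if we know when the scrape ran. The sketch below is only an estimate: the 30-day month and 365-day year are simplifying assumptions, the scrape date is made up, and `approx_published` is a hypothetical helper.

```python
import re
from datetime import datetime, timedelta

# Rough day counts per unit; 30-day months and 365-day years are assumptions
UNIT_DAYS = {"day": 1, "week": 7, "month": 30, "year": 365}


def approx_published(text, now):
    """Estimate a publish date from text like '10 months ago'; None if unparsed."""
    match = re.search(r"(\d+)\s+(day|week|month|year)s?\s+ago", text)
    if match is None:
        return None
    amount, unit = int(match.group(1)), match.group(2)
    return now - timedelta(days=amount * UNIT_DAYS[unit])


scrape_date = datetime(2024, 1, 1)  # hypothetical scrape date
print(approx_published("10 months ago", scrape_date))
print(approx_published("10 years ago", scrape_date))
```

Because YouTube rounds these labels, many videos share the same text, so the estimate is too coarse to order episodes within the same month.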


### Cleaning lengthText Column

In [15]:
# Getting lengthText from nested object  
df["lengthText"] = df["lengthText"].apply(lambda x:x['accessibility']['accessibilityData']['label'])
df.head(2)

Unnamed: 0,videoId,thumbnail,title,descriptionSnippet,publishedTimeText,lengthText,viewCountText
0,nsJIaPvbspg,https://i.ytimg.com/vi/nsJIaPvbspg/hqdefault.j...,Heartbreak ni ile ile | TRHK EP312 Pt 2,The Real Househelps of Kawangware follows the ...,10 months ago,"12 minutes, 36 seconds","{'simpleText': '251,135 views'}"
1,kn11JasWx28,https://i.ytimg.com/vi/kn11JasWx28/hqdefault.j...,Luma Mongaras wakwende uko NKT! | TRHK EP312 Pt 1,The Real Househelps of Kawangware follows the ...,10 months ago,"11 minutes, 18 seconds","{'simpleText': '197,956 views'}"
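
The length labels are readable but not numeric. If we ever want to sort or filter by duration, a small parser can convert labels like the ones above into seconds. This is a sketch that assumes labels only use hours, minutes, and seconds; `duration_seconds` is a hypothetical helper.

```python
import re

SECONDS_PER_UNIT = {"hour": 3600, "minute": 60, "second": 1}


def duration_seconds(label):
    """Convert a label such as '12 minutes, 36 seconds' to total seconds."""
    total = 0
    for amount, unit in re.findall(r"(\d+)\s+(hour|minute|second)s?", label):
        total += int(amount) * SECONDS_PER_UNIT[unit]
    return total


print(duration_seconds("12 minutes, 36 seconds"))  # 756
print(duration_seconds("44 seconds"))              # 44
```

A threshold on this value could, for example, help separate short promos from full episode parts.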


### Cleaning viewCountText Column

In [16]:
# Getting viewCountText from nested object 
df['viewCountText'] = df['viewCountText'].apply(lambda x:x['simpleText'])
df.head(2)

Unnamed: 0,videoId,thumbnail,title,descriptionSnippet,publishedTimeText,lengthText,viewCountText
0,nsJIaPvbspg,https://i.ytimg.com/vi/nsJIaPvbspg/hqdefault.j...,Heartbreak ni ile ile | TRHK EP312 Pt 2,The Real Househelps of Kawangware follows the ...,10 months ago,"12 minutes, 36 seconds","251,135 views"
1,kn11JasWx28,https://i.ytimg.com/vi/kn11JasWx28/hqdefault.j...,Luma Mongaras wakwende uko NKT! | TRHK EP312 Pt 1,The Real Househelps of Kawangware follows the ...,10 months ago,"11 minutes, 18 seconds","197,956 views"
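
The view counts are still strings like "251,135 views". A minimal sketch for turning them into integers is shown below; it simply keeps the digits, and returns `None` for text without any digits (we assume a label such as "No views" could occur, although we have not seen one in this data). `view_count` is a hypothetical helper.

```python
import re


def view_count(text):
    """Extract the integer count from text such as '251,135 views'."""
    digits = re.sub(r"\D", "", text)  # drop commas, spaces, and letters
    return int(digits) if digits else None


print(view_count("251,135 views"))  # 251135
print(view_count("No views"))       # None
```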


- Append the `videoId` to the end of the YouTube watch URL to build each episode link:


In [17]:
df = df.assign(Link='https://www.youtube.com/watch?v='+df['videoId'])


In [18]:
df.head()

Unnamed: 0,videoId,thumbnail,title,descriptionSnippet,publishedTimeText,lengthText,viewCountText,Link
0,nsJIaPvbspg,https://i.ytimg.com/vi/nsJIaPvbspg/hqdefault.j...,Heartbreak ni ile ile | TRHK EP312 Pt 2,The Real Househelps of Kawangware follows the ...,10 months ago,"12 minutes, 36 seconds","251,135 views",https://www.youtube.com/watch?v=nsJIaPvbspg
1,kn11JasWx28,https://i.ytimg.com/vi/kn11JasWx28/hqdefault.j...,Luma Mongaras wakwende uko NKT! | TRHK EP312 Pt 1,The Real Househelps of Kawangware follows the ...,10 months ago,"11 minutes, 18 seconds","197,956 views",https://www.youtube.com/watch?v=kn11JasWx28
2,EqvW6AwLB5s,https://i.ytimg.com/vi/EqvW6AwLB5s/hqdefault.j...,Mapenzi Inawaramba! | TRHK EP312 Promo,The Real Househelps of Kawangware follows the ...,10 months ago,44 seconds,"47,116 views",https://www.youtube.com/watch?v=EqvW6AwLB5s
3,xX_6THG8YR4,https://i.ytimg.com/vi/xX_6THG8YR4/hqdefault.j...,Ndanda ni wewe! | TRHK EP311 Pt 2,The Real Househelps of Kawangware follows the ...,10 months ago,"11 minutes, 53 seconds","153,011 views",https://www.youtube.com/watch?v=xX_6THG8YR4
4,qZ2jpOiPQsQ,https://i.ytimg.com/vi/qZ2jpOiPQsQ/hqdefault.j...,He he he!!! Sema kuwithdraw pesa kwa choo | TR...,The Real Househelps of Kawangware follows the ...,10 months ago,"12 minutes, 8 seconds","179,489 views",https://www.youtube.com/watch?v=qZ2jpOiPQsQ


In [19]:
# reorder our dataframe from the first episode to the last
# the scrape lists the newest videos first, so reversing the index puts the oldest first
# sort_index returns a new DataFrame, so the result must be assigned back to df
df = df.sort_index(ascending=False)
df.head()


Save the cleaned data.

In [20]:
df.to_csv("data/cleaned_data.csv")
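
Two details are worth keeping in mind when saving: `to_csv` does not create missing folders, and by default it also writes the index as an unnamed first column, which pandas labels `Unnamed: 0` when the file is read back. The sketch below demonstrates both on a toy frame in a temporary directory; the paths and the `toy` frame are illustrative only.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"videoId": ["abc123"], "title": ["Example title"]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "data")
    os.makedirs(out_dir, exist_ok=True)  # to_csv will not create the folder

    path = os.path.join(out_dir, "cleaned_data.csv")
    toy.to_csv(path, index=False)  # index=False leaves the index out of the file
    reloaded = pd.read_csv(path)

print(reloaded.columns.tolist())
```

Whether to keep the index is a judgment call: here it records the original scrape order, so dropping it only makes sense once the rows are sorted the way we want.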

#### Woohoo! We've Successfully Scraped "The Real Househelps of Kawangware" YouTube Channel!

🎉📺🎉

We've reached the end of our exhilarating journey through the digital realm of "The Real Househelps of Kawangware"! Armed with our Python skills and the powerful ScrapeTube library, we've successfully scraped a treasure trove of data from the show's YouTube channel.

But this is just the beginning! With our newfound dataset in hand, the possibilities are endless. Whether we're analyzing viewer engagement, uncovering trends, or simply indulging in some binge-watching, the insights we've gathered will surely add depth and excitement to our fandom.

As we close this notebook, let's take a moment to appreciate the thrill of discovery and the satisfaction of a job well done. We've practiced the art of data scraping, and "The Real Househelps of Kawangware" is just the beginning of our data-driven adventures.

So, what's next? Perhaps another YouTube channel to explore, or maybe a deep dive into data visualization and analysis? The choice is ours, and the world of data awaits!

Until next time, let's keep scraping, keep exploring, and keep embracing the thrill of discovery. And remember, the real drama isn't just on screen – it's in the data!

Lights out, camera off, but the adventure continues... 🚀✨