<center>

# Extracting Text From A YouTube Playlist

</center>

In [1]:
# Install these libraries if they are not already installed

# pip install selenium
# pip install beautifulsoup4    # provides the bs4 module
# pip install pandas

In [2]:
from selenium import webdriver
from bs4 import BeautifulSoup

import re
import pandas as pd

#### <span style="background-color: yellow;">Notes on libraries</span>

**`selenium`** - Selenium is a Python library for automating web browsers. It can open pages, interact with page elements, and read the rendered HTML, which makes it useful for automated testing and for scraping pages that build their content with JavaScript.

**`bs4`** - Beautiful Soup (imported as `bs4`) is a Python library for parsing HTML and XML documents. It lets you navigate the document tree, search for elements (including by CSS selector), and extract their text.

**`webdriver`** - `webdriver` is the module inside Selenium that provides the browser drivers, such as `webdriver.Chrome()` and `webdriver.Firefox()`. A driver object controls one browser session: it loads pages, clicks buttons, fills in forms, and exposes the page source.

**`BeautifulSoup`** - `BeautifulSoup` is the main class of the `bs4` package. Given an HTML or XML string and a parser name, it builds a searchable tree from which you can extract text, links, tables, or images. It tolerates malformed HTML, which is common on real web pages.

**`re`** - The `re` module in Python's standard library provides regular expression operations: searching, matching, splitting, and search-and-replace on strings based on patterns.

**`pandas`** - pandas is a data manipulation and analysis library. Its `DataFrame` structure handles tabular data and supports cleaning, filtering, transformation, aggregation, and reading or writing formats such as CSV. It integrates with NumPy and Matplotlib.


In [3]:
# Specify the URL of the webpage
url = 'https://www.youtube.com/playlist?list=PLot-Xpze53lfQmTEztbgdp8ALEoydvnRQ'

# Configure the Selenium webdriver
# Here we use 'webdriver' to load and extract from the webpage as evident in the below steps
driver = webdriver.Chrome()

# Load the webpage
driver.get(url)

# Extract the page source
page_source = driver.page_source

# Parse the HTML source using BeautifulSoup
soup = BeautifulSoup(page_source, 'html.parser')

# Extract all the text from the webpage
all_text = soup.get_text()

# Print the extracted text
#print(all_text)          # Commenting this print as the o/p is very large

# Quit the webdriver.     # This is done in order to free up the resources
driver.quit()
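
YouTube builds much of its page with JavaScript, so the HTML returned by `driver.page_source` immediately after `driver.get(url)` may not yet contain the playlist entries. One way to handle this, sketched below, is to poll `page_source` until an expected marker string appears. The helper name `wait_for_text` and the `FakeDriver` stand-in are hypothetical and exist only for illustration. Selenium also ships its own `WebDriverWait` class for this purpose; the version here relies only on the `page_source` attribute so that it can be tried without opening a browser.

```python
import time


def wait_for_text(driver, marker, timeout_seconds=15.0, poll_seconds=0.5):
    """Return driver.page_source once it contains `marker`.

    Hypothetical helper for illustration; raises TimeoutError if the
    marker has not appeared when the timeout elapses.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        source = driver.page_source
        if marker in source:
            return source
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{marker!r} not found after {timeout_seconds} s")
        time.sleep(poll_seconds)


class FakeDriver:
    """Stand-in for a Selenium driver: the content 'loads' on the third read."""

    def __init__(self):
        self.reads = 0

    @property
    def page_source(self):
        self.reads += 1
        if self.reads < 3:
            return "<html><body>loading</body></html>"
        return "<html><body>Two Sum - Leetcode 1</body></html>"


fake = FakeDriver()
html = wait_for_text(fake, "Leetcode", timeout_seconds=5.0, poll_seconds=0.01)
print(fake.reads)  # number of times page_source was read before the marker appeared
```

With the real driver from the cell above, the call would look like `page_source = wait_for_text(driver, "Leetcode")`, placed before `driver.quit()`. The marker `"Leetcode"` is an assumption based on the titles in this playlist.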


In [4]:
text = all_text

# Remove leading and trailing spaces
text = text.strip()

# Split the text into lines
lines = text.splitlines()

# Remove empty lines
lines = [line for line in lines if line.strip() != ""]

# Join the lines back into a single string
result = "\n".join(lines)

# print(result)  # Commenting this print as the o/p is large


#### <span style="background-color: yellow;">Explaining : lines = [line for line in lines if line.strip() != ""]</span>


* **`line.strip()`** removes any leading or trailing whitespace characters from a line.
* **`line.strip() != ""`** checks if the stripped line is not an empty string.
* The list comprehension **`[line for line in lines if line.strip() != ""]`** iterates over each line in the lines list and filters out any lines that are empty (contain only whitespace characters). It creates a new list with the non-empty lines.
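
The same filter can be tried on a small hand-made string. The sample text below is made up for illustration and only mimics the indentation and blank lines seen in the scraped page.

```python
# Made-up sample that imitates the scraped text: blank lines, whitespace-only
# lines, and lines with leading or trailing spaces.
sample_text = "\n  Home\n\n    \nPlay all\n\t\n  Valid Anagram - Leetcode 242 - Python  \n"

# Same steps as in the cell above.
lines = sample_text.strip().splitlines()
non_empty = [line for line in lines if line.strip() != ""]

print(non_empty)
```

Note that the comprehension keeps `line`, not `line.strip()`, so the surviving lines retain their original indentation. Only the test for emptiness uses the stripped version.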

In [5]:
# Displaying the first 50 lines of the o/p 

text = all_text.strip()
lines = text.splitlines()
lines = [line for line in lines if line.strip() != ""]

# Print the first 50 lines
print('\n'.join(lines[:50]))


•
NaN / NaN
      Back
  IN
Skip navigation
  Search
        Search
  Search with your voice
Sign in
  IN
Home
    Home
Shorts
    Shorts
Subscriptions
    Subscriptions
Library
    Library
History
    History
Play all
EASY
NeetCode
·
61 videos
302,728 views
Last updated on Sep 6, 2022
  Save playlist
  Share
Play all
Shuffle
 …...More...More 
 …...More...More 
Play all
Shuffle
    NeetCode
1
  12:01
Now playing
          Valid Anagram - Leetcode 242 - Python
NeetCode
    NeetCode
•
221K views • 1 year ago
•
2
  8:26
Now playing
          Two Sum - Leetcode 1 - HashMap - Python
NeetCode


In [6]:
# import re                  # re stands for "regular expression". The re module is used here. Note that "re" is a module, not a library.
"""
The objective here is to select only those lines that contain the word "Leetcode". This is because it is the only word
common to every title in the playlist, and it is followed by the problem number.
"""
def filter_lines_with_word(lines, word):
    filtered_lines = [line for line in lines if re.search(word, line)]
    return filtered_lines

def print_filtered_lines(filtered_lines):
    for line in filtered_lines:
        print(line.strip())

def filter_lines_with_Leetcode(text):
    lines = text.splitlines()
    filtered_lines = filter_lines_with_word(lines, r"Leetcode")   
    return [line.strip() for line in filtered_lines]                  

# Example usage
text = result

filtered_lines = filter_lines_with_Leetcode(text)
print_filtered_lines(filtered_lines)


Valid Anagram - Leetcode 242 - Python
Two Sum - Leetcode 1 - HashMap - Python
Maximum Subarray - Amazon Coding Interview Question - Leetcode 53 - Python
TWO SUM II - Amazon Coding Interview Question - Leetcode 167 - Python
House Robber -  Leetcode 198 - Python Dynamic Programming
Merge Two Sorted Lists - Leetcode 21 - Python
Sliding Window: Best Time to Buy and Sell Stock - Leetcode 121 - Python
Merge Sorted Array - Leetcode 88 - Python
Climbing Stairs - Dynamic Programming - Leetcode 70 - Python
Reverse Linked List - Iterative AND Recursive - Leetcode 206 - Python
Diameter of a Binary Tree - Leetcode 543 - Python
Valid Parentheses - Stack - Leetcode 20 - Python
Palindrome Linked List - Leetcode 234 - Python
Invert Binary Tree - Depth First Search - Leetcode 226
Leetcode 1299 - REPLACE ELEMENTS WITH GREATEST ELEMENT ON RIGHT SIDE
Merge Two Binary Trees - Leetcode 617
Reverse Integer - Bit Manipulation - Leetcode 7 - Python
Lowest Common Ancestor of a Binary Search Tree - Leetcode 235 -

#### <span style="background-color: yellow;">NOTE -</span>

An alternative strategy here would be to copy the output of the above code cell, paste it into an Excel file, and make the necessary alterations there (i.e., transform the data for a cleaner look). This would be more time efficient.

#### <span style="background-color: yellow;">Explaining each function -</span>

* **`filter_lines_with_word(lines, word):`** This function takes a list of lines (`lines`) and a regular expression pattern (`word`). It uses a list comprehension to keep only the lines in which `re.search()` finds the pattern, and returns them as the list `filtered_lines`.

* **`print_filtered_lines(filtered_lines):`** This function takes a list of filtered lines (`filtered_lines`) and prints each one after removing leading and trailing whitespace with `strip()`.

* **`filter_lines_with_Leetcode(text):`** This function takes a string (`text`), splits it into lines with `splitlines()`, and calls `filter_lines_with_word` with those lines and the pattern `r"Leetcode"`. It returns a new list in which each matching line has had its leading and trailing whitespace stripped.

* **`r"Leetcode":`** is a raw string literal holding the regular expression pattern `Leetcode`. In a raw string, backslashes (`\`) are kept as literal characters instead of starting escape sequences, which matters for patterns such as `r"\d+"`. This particular pattern contains no backslash, so the `r` prefix changes nothing here, but using `r"..."` for every regular expression is a common Python convention.
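
A small sketch of the points above, using made-up sample lines. It shows that `re.search` treats its first argument as a pattern rather than plain text, and that the `r` prefix only matters once the pattern contains a backslash. The use of `re.escape` is an addition for illustration and is not something the notebook itself does.

```python
import re

# Made-up lines in the style of the scraped page.
lines = ["  Two Sum - Leetcode 1 - HashMap - Python", "NeetCode", "221K views"]

# Keep only the lines in which the pattern matches somewhere, then strip them.
matches = [line.strip() for line in lines if re.search(r"Leetcode", line)]
print(matches)

# Without a backslash, the raw and the normal literal are the same string.
same_without_backslash = r"Leetcode" == "Leetcode"

# With a backslash, the raw literal saves doubling it: r"\d+" equals "\\d+".
same_with_backslash = r"\d+" == "\\d+"
print(same_without_backslash, same_with_backslash)

# Because the word is used as a pattern, special characters act as regex
# syntax. Here the parentheses form a group, so "(Live)" matches plain "Live".
titles = ["Q&A (Live)", "Live stream"]
as_pattern = [t for t in titles if re.search("(Live)", t)]

# re.escape makes every character literal, so the parentheses must be present.
as_literal = [t for t in titles if re.search(re.escape("(Live)"), t)]
print(as_pattern, as_literal)
```

For a plain word such as `Leetcode` the two approaches give the same result, so escaping only becomes relevant if the search word contains characters like `.`, `(`, or `+`.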

In [7]:
# import pandas as pd           # pandas library is used here. "pandas" is both a library as well as a module.

def print_filtered_lines(lines):
    for line in lines:
        print(line)

filtered_lines = [
    "Valid Anagram - Leetcode 242 - Python",
    "Two Sum - Leetcode 1 - HashMap - Python",
    "Maximum Subarray - Amazon Coding Interview Question - Leetcode 53 - Python",
    "TWO SUM II - Amazon Coding Interview Question - Leetcode 167 - Python",
    "House Robber - Leetcode 198 - Python Dynamic Programming",
    "Merge Two Sorted Lists - Leetcode 21 - Python",
    "Sliding Window: Best Time to Buy and Sell Stock - Leetcode 121 - Python",
    "Merge Sorted Array - Leetcode 88 - Python",
    "Climbing Stairs - Dynamic Programming - Leetcode 70 - Python",
    "Reverse Linked List - Iterative AND Recursive - Leetcode 206 - Python",
    "Diameter of a Binary Tree - Leetcode 543 - Python",
    "Valid Parentheses - Stack - Leetcode 20 - Python",
    "Palindrome Linked List - Leetcode 234 - Python",
    "Invert Binary Tree - Depth First Search - Leetcode 226",
    "Leetcode 1299 - REPLACE ELEMENTS WITH GREATEST ELEMENT ON RIGHT SIDE",
    "Merge Two Binary Trees - Leetcode 617",
    "Reverse Integer - Bit Manipulation - Leetcode 7 - Python",
    "Lowest Common Ancestor of a Binary Search Tree - Leetcode 235 - Python",
    "Happy Number - Leetcode 202 - Python",
    "Design Min Stack - Amazon Interview Question - Leetcode 155 - Python",
    "Remove Linked List Elements - Leetcode 203",
    "Search Insert Position - Binary Search - Leetcode 35 - Python",
    "Last Stone Weight - Priority Queue - Leetcode 1046 - Python",
    "Remove Duplicates from Sorted Array - Leetcode 26 - Python",
    "Ugly Number - Leetcode 263 - Python",
    "Length of Last Word - Leetcode 58 - Python",
    "Remove Element - Leetcode 27 - Python",
    "Unique Email Addresses - Two Solutions - Leetcode 929 Python",
    "Min Cost Climbing Stairs - Dynamic Programming - Leetcode 746 - Python",
    "Subtree of Another Tree - Leetcode 572 - Python",
    "Valid Palindrome - Leetcode 125 - Python",
    "Isomorphic Strings - Leetcode 205 - Python",
    "Number of 1 Bits - Leetcode 191 - Python",
    "Contains Duplicate - Leetcode 217 - Python",
    "Kth Largest Element in a Stream - Leetcode 703 - Python",
    "Remove Duplicates from Sorted List - Leetcode 83 - Python",
    "Can Place Flowers - Leetcode 605 - Python",
    "Find the Index of the First Occurrence in a String - Leetcode 28 - Python",
    "Knuth–Morris–Pratt KMP - Find the Index of the First Occurrence in a String - Leetcode 28 - Python",
    "Majority Element - Leetcode 169 - Python",
    "Implement Stack using Queues - Leetcode 225 - Python",
    "Squares of a Sorted Array - Leetcode 977 - Python",
    "Path Sum - Leetcode 112 - Python",
    "Move Zeroes - Leetcode 283 - Python",
    "Find Pivot Index - Leetcode 724 - Python",
    "Single Number - Leetcode 136 - Python",
    "Intersection of Two Linked Lists - Leetcode 160 - Python",
    "Find All Numbers Disappeared in an Array - Leetcode 448 - Python",
    "Maximum Number of Balloons - Leetcode 1189 - Python",
    "Guess Number Higher or Lower - Leetcode 374 - Python",
    "Arranging Coins - Leetcode 441 - Python",
    "Valid Perfect Square - Leetcode 367 - Python",
    "Word Pattern - Leetcode 290 - Python",
    "Iterative & Recursive - Binary Tree Inorder Traversal - Leetcode 94 - Python",
    "Next Greater Element I - Leetcode 496 - Python",
    "Binary Search - Leetcode 704 - Python",
    "Reverse String - 3 Ways - Leetcode 344 - Python",
    "Valid Palindrome II - Leetcode 680 - Python",
    "Baseball Game - Leetcode 682 - Python",
    "Shift 2D Grid - Leetcode 1260 - Python",
    "Construct String from Binary Tree - Leetcode 606 - Python"
]

df = pd.DataFrame({'Title': filtered_lines})

print(df)


                                                Title
0               Valid Anagram - Leetcode 242 - Python
1             Two Sum - Leetcode 1 - HashMap - Python
2   Maximum Subarray - Amazon Coding Interview Que...
3   TWO SUM II - Amazon Coding Interview Question ...
4   House Robber - Leetcode 198 - Python Dynamic P...
..                                                ...
56    Reverse String - 3 Ways - Leetcode 344 - Python
57        Valid Palindrome II - Leetcode 680 - Python
58              Baseball Game - Leetcode 682 - Python
59             Shift 2D Grid - Leetcode 1260 - Python
60  Construct String from Binary Tree - Leetcode 6...

[61 rows x 1 columns]


In [8]:
# Create a new column called "LeetCode"

# Extract the number that follows "Leetcode " (an empty string is stored when a title has no such number)
df['LeetCode'] = df['Title'].str.extract(r'Leetcode (\d+)', expand=False).fillna('')

# Displaying the entire df
pd.set_option('display.max_rows', None)

df


Unnamed: 0,Title,LeetCode
0,Valid Anagram - Leetcode 242 - Python,242
1,Two Sum - Leetcode 1 - HashMap - Python,1
2,Maximum Subarray - Amazon Coding Interview Que...,53
3,TWO SUM II - Amazon Coding Interview Question ...,167
4,House Robber - Leetcode 198 - Python Dynamic P...,198
5,Merge Two Sorted Lists - Leetcode 21 - Python,21
6,Sliding Window: Best Time to Buy and Sell Stoc...,121
7,Merge Sorted Array - Leetcode 88 - Python,88
8,Climbing Stairs - Dynamic Programming - Leetco...,70
9,Reverse Linked List - Iterative AND Recursive ...,206


In [9]:
df.dtypes

Title       object
LeetCode    object
dtype: object

In [10]:
# Converting the Data Types of "Title" and "LeetCode"

df["Title"] = df["Title"].astype(str)
df["LeetCode"] = df["LeetCode"].astype(int)

In [11]:
df.dtypes

Title       object
LeetCode     int32
dtype: object

#### <span style="background-color: yellow;">NOTE - </span>
In the above code we tried to convert the `"Title"` column from `object` to a string type with `astype(str)`, yet `df.dtypes` still reports `object`. This is expected behaviour rather than a syntax problem: in the pandas version used here, a column of Python `str` values is stored with the `object` dtype, so `astype(str)` leaves the reported dtype unchanged. The digits inside the titles are not the cause, because they are simply characters within each string. To get pandas' dedicated string dtype, use `astype("string")` instead.
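
A minimal sketch of the dtype behaviour, using a two-row DataFrame made up for illustration. The dtype printed for `astype(str)` depends on the installed pandas version (the notebook's own output shows `object`), so no expected output is given for that line.

```python
import pandas as pd

# Made-up titles; the digits are part of each string, not separate integers.
demo = pd.DataFrame({"Title": ["Two Sum - Leetcode 1", "Valid Anagram - Leetcode 242"]})

# Every element is already a Python str.
all_str = all(isinstance(value, str) for value in demo["Title"])
print(all_str)

# Dtype reported after astype(str); this is version-dependent.
print(demo["Title"].astype(str).dtype)

# Explicitly request pandas' dedicated string dtype.
as_string = demo["Title"].astype("string")
print(as_string.dtype)
```

Either way the values themselves are unchanged; only the dtype that pandas reports for the column differs.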

In [12]:
# Arranging the df in ascending order of "LeetCode"

# Sort the DataFrame in ascending order of the "LeetCode" column
df_sorted = df.sort_values("LeetCode", ignore_index=True)

# Display the first 10 rows of the sorted DataFrame
df_sorted.head(10)


Unnamed: 0,Title,LeetCode
0,Two Sum - Leetcode 1 - HashMap - Python,1
1,Reverse Integer - Bit Manipulation - Leetcode ...,7
2,Valid Parentheses - Stack - Leetcode 20 - Python,20
3,Merge Two Sorted Lists - Leetcode 21 - Python,21
4,Remove Duplicates from Sorted Array - Leetcode...,26
5,Remove Element - Leetcode 27 - Python,27
6,Find the Index of the First Occurrence in a St...,28
7,Knuth–Morris–Pratt KMP - Find the Index of the...,28
8,Search Insert Position - Binary Search - Leetc...,35
9,Maximum Subarray - Amazon Coding Interview Que...,53


In [13]:
# Converting df into .csv file
df_sorted.to_csv('YT_playlist_data.csv', index=False)     # Here "index=False" specifies that you don't want to include the index column in the CSV file

# Reading the .csv file back in
csv_df = pd.read_csv('YT_playlist_data.csv', index_col=False)

csv_df.head(10)


Unnamed: 0,Title,LeetCode
0,Two Sum - Leetcode 1 - HashMap - Python,1
1,Reverse Integer - Bit Manipulation - Leetcode ...,7
2,Valid Parentheses - Stack - Leetcode 20 - Python,20
3,Merge Two Sorted Lists - Leetcode 21 - Python,21
4,Remove Duplicates from Sorted Array - Leetcode...,26
5,Remove Element - Leetcode 27 - Python,27
6,Find the Index of the First Occurrence in a St...,28
7,Knuth–Morris–Pratt KMP - Find the Index of the...,28
8,Search Insert Position - Binary Search - Leetc...,35
9,Maximum Subarray - Amazon Coding Interview Que...,53
