<h1 style="text-align: center">Fornesus Blog Web Scraping Project</h1>
<p style="text-align: center">Author: Fornesus</p>
<p style="text-align: center">Date: 13 April 2024</p>

<h2>Introduction</h2>

<p>Buenas at Kumusta, I am Fornesus and, in this project, I utilized GitHub Copilot to assist me in doing something new: web scraping an existing website, in this case, my blog - <a href="https://fornesus.blog" style="color: blue;"><b>Fornesus Blog</b></a>, which you can find at <a href="https://fornesus.blog" style="color: blue;"><em>https://fornesus.blog</em></a>.</p>
<p>NOTE: While I used AI (GitHub Copilot) as a pair programmer throughout this project, the text in this notebook was not AI-generated.</p>

<h3>Background Information</h3>
<p>While there is much discussion on the impact that GenAI (generative AI) will have in the fields of data science and software engineering, one thing is certain, which is that the top skill of the future is problem solving, as has always been the case in the past.  Though problem solving has often been gatekept (still the case in many ways within corporate America), individuals are becoming empowered to identify, create and solve for an array of problems within their work scope, as a hobby, etc.  The real potential in the concept of the <em>AI revolution</em> lies in its capability to empower individuals to forge their own paths and increase the scope of our problem solving skills and overall creativity so long as we ensure that we understand the constraints of GenAI and understand the <em>value of ethics as it relates to GenAI</em>.</p>
<p>That could be a whole report in and of itself but, for the duration of this notebook, I will focus on the details of my first web scraping project, with the help of GitHub Copilot.</p>

<h3>Programming Context</h3>
<p>Python is a powerful programming language and acts as a sort of Swiss army knife for programmers, in much the same way that JavaScript does (for some) in the context of the web (they also happen to be my preferred programming languages).  What makes programming languages like Python and JavaScript so useful is the plethora of libraries that were created to make life easier for programmers, which enables us to focus on the most important task: solving the problem at hand.</p>
<p>So, in this notebook (which uses Python), the first step will be to import two important Python libraries for web scraping: <b>bs4 (BeautifulSoup4)</b> and <b>requests</b>.</p>
<h3>Import Details</h3>
<p>The <b>BeautifulSoup4 library</b> is a Python library that can be used by the program to parse and scrape HTML/XML content.</p>
<p style="margin-left: 2em">If you don't know what HTML is, HTML, or <b>Hypertext Markup Language</b>, is the markup language that describes the structure and content of a webpage, which your browser then renders for you.  On a regular webpage (and not a Python notebook), you can right-click and choose the option labeled <b>View Page Source</b> (or something similar), and you will most likely see the HTML text that the website's server has sent to your browser.  Alternatively, you can right-click and choose the option labeled <b>Inspect Element</b> (or something similar, such as <b>Inspect</b>), and you will see your browser's developer tools where, in real time, you can explore the intricacies of the different elements on the webpage.  If you would like to learn more, feel free to use Microsoft Copilot, Google Gemini, ChatGPT, or another general-purpose GenAI tool to learn more about what you can do in a browser's developer tools.</p>
<p>The <b>BeautifulSoup class</b> is how the program will instantiate this process of parsing HTML/XML content within a source, in this case, a webpage.</p>
<p>Finally, importing the <b>requests</b> library enables the program to make <b>HTTP/1.1</b> requests through a simple API that abstracts away the lower-level details of the process.</p>
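<p>To make the parsing step more concrete, here is a minimal sketch of what BeautifulSoup does with a piece of HTML.  The HTML snippet and the variable names are made up for illustration (they are not taken from my blog), and no network request is involved.</p>

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet standing in for a real webpage
html = """
<div>
  <h1>Hello</h1>
  <p>A short paragraph.</p>
  <a href="/about">About</a>
</div>
"""

# "html.parser" is Python's built-in parser, so no extra install is needed
soup = BeautifulSoup(html, "html.parser")

# find_all accepts a list of tag names and returns matches in document order
found = soup.find_all(["h1", "p", "a"])
tag_names = [tag.name for tag in found]

# Text content and attributes can be read from individual tags
heading_text = soup.h1.text
link_target = soup.a["href"]

print(tag_names)
print(heading_text)
print(link_target)
```

<p>The same three ideas (building a soup, finding tags by name, and reading text or attributes) are what the rest of this notebook relies on.</p>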

<h3>The Imported Libraries</h3>

In [45]:
from bs4 import BeautifulSoup
import requests

<p>To arrive at the above code, I asked GitHub Copilot "which libraries I could use to process data in an HTML web page", but later specified that I would like to retrieve the contents of all of the individual HTML elements that I would specify.  It is very important that you learn how to structure your prompts when you use any GenAI tool, whether it be GitHub Copilot, Microsoft Copilot, or another GenAI tool.</p>

<p>GenAI tools, such as GitHub Copilot, require refinement (at the user's discretion) because AI, and machine code in general, operate in a manner called "structured thinking".</p>

<p>Structured thinking is about analyzing a problem by learning to understand different steps required in solving said problem, then figuring out a solution.  Afterwards, the solution is broken down into different steps, much like a recipe, to figure out what tasks need to be completed to meet the needs of a solution and whatever functionalities that a solution entails.</p>

<h3>Structured Thinking and GenAI</h3>

<p>Generative AI tools require refinement (at the user's discretion) because AI, and software programs in general, operate in terms of structured thinking.</p>

<p>Both GenAI tools and programs require this understanding of how machines interpret code (whether that code is in the form of "natural" human language or written in a programming language) since the goal of LLMs (large language models) is to empower the user to generate new content (the output) based off of a prompt (the input), much like how the goal of writing software is to implement a solution using a specific programming language or a specific stack of programming languages.  This is why good code and good GenAI prompts contain a task (keep this as simple as possible), context that adds more information about the task, and formatting that is clear and consistent.</p>

<p>Context makes stories a lot clearer than when you don't have them, right?  In the same way, GenAI prompts and software or code can include context to be more specific, to be tailored to an industry-specific task, etc.  And, as you will find out, context is also why I include so many Python comments (which start with a hashtag "#" symbol).</p>
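<p>As a rough illustration of the task, context, and formatting structure, the sketch below assembles a prompt from those three parts.  The "build_prompt" helper and the example wording are hypothetical; they are not the exact prompts that I gave to Copilot.</p>

```python
def build_prompt(task, context, output_format):
    """Assemble a prompt from a task, its context, and the desired format."""
    # Hypothetical helper: one labeled line per part keeps the prompt consistent
    return f"Task: {task}\nContext: {context}\nFormat: {output_format}"


# Example wording only, written for this sketch
prompt = build_prompt(
    task="Show me how to scrape specific elements from an HTML webpage in Python.",
    context="I want div, heading, image, link, and paragraph tags, not just HTML tables.",
    output_format="A short, commented Python function.",
)

print(prompt)
```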

<h3>Finding the correct library</h3>
<p>The original recommendation that I was given was to use <b>NumPy</b>, a popular Python library in the field of data science, which has the limitation of only being able to collect structured (or tabular) HTML code, which essentially means that any code collected would need to be in the format of an HTML table.  Thus, I had to rewrite my original prompt to specify that I wanted to "scrape different elements in an HTML webpage" and not just HTML tables.  This led GitHub Copilot to recommend the <b>bs4 (BeautifulSoup4)</b> and <b>requests</b> libraries, as written in the code above.</p>

<h2>Program Code Summary</h2>

<p>Now that we have the necessary libraries imported, we can finally get into writing the code itself.</p>
<p>So I actually began writing the code by prompting Copilot to show me how to create a browser request, the code for which you can find in the "browser_request" function, as well as the "body" variable in the "content" function that is meant to store the output that I wanted from the program.</p>
<p>That's when Copilot output that the different elements of the browser request were the following: headers for a browser, a response from the server of the website, a "soup" which is an instantiation of how the BeautifulSoup4 library parses the HTML content of a webpage, and a variable for the elements to scrape all of the HTML tags as indicated under my discretion.</p>
<p>Once I had indicated the HTML elements that I wanted the program to scrape, I prompted Copilot to show me how to loop through this list of elements.</p>
<p>The original variant of this program simply output all of the indicated elements in the order in which they appeared in the webpage.</p>
<p>After some thought, I realized that grouping elements by div elements would make for a more readable output, and so I prompted Copilot to show me how to, first, loop only through div elements (I would later add section elements as well) and, second, loop through each division element's child elements.  These two loops, along with the div_count variable and the body variable, form the rest of the "content" function.</p>
<p>Finally, I tested out the functionality by creating an introduction string that included the greeting, name, additional information, title of the website to explore, and the url of the website to explore.  I later refactored this string into the "intro" function that would take these different elements as arguments to output the necessary introductory content.  I used the same approach with the conclusion which is instantiated as the "ending" function, which takes only the title and url elements as arguments.</p>
<p>To put everything together, I concatenated the introduction, body, and conclusion elements and returned a string containing all of this data in the "read_webpage_data" function.</p>

<p>Now, let's look deeper into the code.  Of course, the first function that shows up is the "intro" function, as coded below:</p>

In [46]:
# Define the "intro" function:
def intro(greeting, my_name, additional_info, title, url):
    # Define the introduction of the content
    introduction = "\nIntroduction: "
    
    # Add the greeting, name, and additional information to the introduction
    introduction += f"{greeting}, I am {my_name} and this is a web scraping project that I am creating using the BS4 (BeautifulSoup4) library in Python.\n\n"
    introduction += additional_info + "\n\n"
    
    # Add the title and URL of the webpage to the introduction
    introduction += f"The content in the homepage of {title} <{url}>, are as follows:\n\n"
    
    # Return the introduction
    return introduction

<p>The "intro" function takes in the arguments of: greeting, my_name, additional_info, title, and url.  Using these variables, the function generates an "introduction" variable, to which a formatted string with the argument variables is appended.  Finally, the completed "introduction" variable is returned by the function for use in the "read_webpage_data" function.</p>

In [47]:
# Define the "ending" function:
def ending(title, url):
    # Define the conclusion of the content
    conclusion = "-" * 50 + "\n"
    
    # Add the end of the output message to the conclusion, along with the title and URL
    conclusion += f"\nEnd of the output for the content in the homepage of my blog, ({title} <{url[8::]}>).\n\n"
    
    # Add the final message to the conclusion  
    conclusion += "I am Fornesus, the owner of the blog and the programmer of this code. Thank you for visiting!\n"
    
    # Return the conclusion
    return conclusion

<p>The "ending" function takes in the arguments of: title and url.  Using these variables, the function generates a "conclusion" variable, to which a formatted string with the argument variables is appended.  Finally, the completed "conclusion" variable is returned by the function for use in the "read_webpage_data" function.</p>
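<p>One detail worth noting: the slice url[8::] in the "ending" function drops exactly eight characters, which matches "https://" but not "http://".  A possible alternative, sketched below, uses the standard library's urllib.parse module instead.  The "display_url" name is my own choice for this sketch and is not part of the code above.</p>

```python
from urllib.parse import urlparse


def display_url(url):
    """Return the URL without its scheme, for use in printed output."""
    parts = urlparse(url)
    # netloc is the host (such as "fornesus.blog"); path is whatever follows it
    return parts.netloc + parts.path


print(display_url("https://fornesus.blog"))       # fornesus.blog
print(display_url("http://fornesus.blog/blog/"))  # fornesus.blog/blog/
```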

In [48]:
# Define the "browser_request" function:
def browser_request(url):
    # Send HTTP request by masquerading as a browser to prevent 403 error and access the website:
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 Edge/16.16299'}
    
    # Send a GET request to the specified URL
    response = requests.get(url, headers=headers)
    
    # Parse the HTML content of the response with Beautiful Soup
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Find all HTML tags
    elements = soup.find_all(['div', 'section', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'img', 'iframe', 'a', 'p'])

    # Return the elements and the URL
    return [elements, url]

<p>Copilot showed me that a browser request built with the requests and BS4 libraries contains headers, a response, a soup, and the elements that you would like the program to scrape which, in my case, are: div, section, h1, h2, h3, h4, h5, h6, img, iframe, a, and p.  Thus, this content is available in the "browser_request" function as the headers, response, soup, and elements variables, while only the elements and the given URL are returned by the function.</p>
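<p>The function above assumes that the request always succeeds.  As a hedged sketch of one way to make it more defensive, the version below adds a timeout, raises an error on 4xx/5xx responses, and separates parsing from fetching so that the parsing half can be tried without a network connection.  The names "parse_elements" and "fetch_elements", the shortened User-Agent string, and the 10-second timeout are all assumptions made for this example.</p>

```python
import requests
from bs4 import BeautifulSoup

# Assumed defaults for this sketch; a site may expect a fuller User-Agent string
TAGS = ["div", "section", "h1", "h2", "h3", "h4", "h5", "h6", "img", "iframe", "a", "p"]
HEADERS = {"User-Agent": "Mozilla/5.0"}


def parse_elements(html, tags=TAGS):
    """Parse an HTML string and return every tag whose name is in `tags`."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find_all(tags)


def fetch_elements(url, timeout=10):
    """Download `url` and return [elements, url], like browser_request."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page
    response.raise_for_status()
    return [parse_elements(response.text), url]


# The parsing half works on any HTML string, with no network access needed
sample = parse_elements("<div><p>Hi</p><span>skipped</span></div>")
sample_names = [tag.name for tag in sample]
print(sample_names)
```

<p>Because "fetch_elements" returns the same [elements, url] list as "browser_request", it could be passed to the "content" function in the same way.</p>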

In [49]:
# Define the "content" function:
def content(information):
    # Assign the elements argument to the "elements" variable:
    elements = information[0]
    url = information[1]

    # Create the "body" variable as an empty string:
    body = ''

    # Track the count of <div> elements
    div_count = 0

    # Loop through the elements
    for element in elements:
        # Check if the element is a <div> or <section> element
        if element.name == 'div' or element.name == 'section':
            element_list = [child for child in element.children if child.name in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'img', 'iframe', 'a', 'p']]
            # Check if the <div> element has any children
            if len(element_list) == 0:
                continue
            # If the division element has children, increment the div_count and add the content to the body
            else:
                # Increment the div_count
                div_count += 1

                # Add heading content to the body for each element
                body += "-" * 50 + "\n"
                body += "\nContent Division <{}> element #{}:\n\n".format(element.name, div_count)
                
                # Initialize the child_count variable
                child_count = 0

                # Loop through the children of the <div> element
                for child in element_list:
                    # Check if the child is a valid element
                    if child.name in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'img', 'iframe', 'a', 'p']:
                        # Increment the child_count
                        child_count += 1
                        # Check the type of element and add the content to the body
                        if child.name == 'img':
                            body += "Child {} - <{}>:\n".format(child_count, child.name)
                            body += "+ \"{}\"\n\n".format(child['src'])
                        # Check if the child is an anchor tag
                        elif child.name == 'a':
                            # Check if the href attribute is empty
                            if child['href'] == '':
                                body += "Child #{} - <{}>:\n".format(child_count, child.name)
                                body += "+ \"This link is empty.\"\n\n"
                            # Check if the href attribute is a relative URL
                            elif 'https' not in child['href']:
                                body += "Child #{} - <{}>:\n".format(child_count, child.name)
                                body += "+ \"{}\"\n\n".format(url + '/' + child['href'])
                            # Else, add the absolute URL to the body
                            else:
                                body += "Child #{} - <{}>:\n".format(child_count, child.name)
                                body += "+ \"{}\"\n\n".format(child['href'])
                        # Check if the child is an iframe element
                        elif child.name == 'iframe':
                            body += "Child #{} - <{}>:\n".format(child_count, child.name)
                            body += "+ \"{}\"\n\n".format(child['src'])
                        # Else, add the text content of the element to the body
                        else:
                            body += "Child #{} - <{}>:\n".format(child_count, child.name)
                            body += "+ \"{}\"\n\n".format(child.text.strip())

    # Return the body content
    return body

<p>The "content" function takes in an argument called "information", which is a list containing the scraped elements and the URL.</p>
<p>This is also where looping through the elements was done in the following manner:</p>
<ul>
    <li>Create a loop over each element in the list of HTML elements that were scraped.</li>
    <li>Filter the loop with an if statement so that only division tags (div and section) are initially picked out.</li>
    <li>Within this if statement, create a variable to store a list of the child elements that match any of the desired HTML elements.</li>
    <li>Then, create a nested if statement to weed out any section or div elements that have no matching child elements.</li>
    <li>Within the else statement, increment the div_count variable to track the count of section or div elements, add a header for each division element to the body, then create a child_count variable to track the count of child elements for each section or div element.</li>
    <li>Create a loop that goes through each element in the element_list variable that is also a desired HTML element, with if/elif/else statements tailored for each HTML element type (as needed).</li>
    <li>Finally, when the outer loop is done, the function reaches the return statement, which returns all of the content in the "body" variable.</li>
</ul>
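<p>Two edge cases in the anchor-tag branch are worth a sketch.  First, child['href'] raises a KeyError when an anchor tag has no href attribute at all, whereas child.get('href') returns None.  Second, building relative links with url + '/' + href can produce a doubled slash, and the 'https' check treats "http://" links as relative.  The standard library's urljoin function handles these cases.  The "resolve_link" helper below is a hypothetical name for this sketch and is not part of the "content" function above.</p>

```python
from urllib.parse import urljoin


def resolve_link(base_url, href):
    """Return an absolute URL for `href`, or None when there is no link."""
    # A missing attribute (None) or an empty string leaves nothing to resolve
    if not href:
        return None
    # urljoin keeps absolute URLs as they are and resolves relative ones
    return urljoin(base_url, href)


print(resolve_link("https://fornesusart.com", "/#about"))
print(resolve_link("https://fornesus.blog", "login"))
print(resolve_link("https://fornesus.blog", "https://fornesus.com"))
print(resolve_link("https://fornesus.blog", ""))
```

<p>Inside the loop, this could be called as resolve_link(url, child.get('href')), with a None result mapped to the "This link is empty." message.</p>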

In [50]:
# Define the "read_webpage_data" function:
def read_webpage_data(greeting, name, additional_info, title, url):
    # Define the introduction, body, and conclusion of the content
    introduction = intro(greeting, name, additional_info, title, url)
    # Assign the content of the webpage to the variable "body"
    body = content(browser_request(url))
    # Assign the conclusion to the variable "conclusion"
    conclusion = ending(title, url)

    # Return the concatenated content
    return introduction + body + conclusion

<p>Now, this is the fun part: putting it all together!  The "read_webpage_data" function takes in all of the information that it, and the helper functions, need to do their jobs.  In this function, the return value of the "intro" function is assigned to the introduction variable.  The body variable holds the result of the "content(browser_request(url))" function call, which sends an HTTP request to the stated website and passes the scraped elements along to create the actual informational output of the program.  The conclusion variable holds the return value of the "ending" function.  Finally, the introduction, body, and conclusion variables are concatenated to form a single string that contains the desired output of the program.</p>

In [51]:
# Define the necessary variables for the "intro" function:
greeting = "Buenas at Kumusta"
name = "Fornesus"
title = "Fornesus Blog"
url = 'https://fornesus.blog'
additional_info = f"I am scraping the content of my blog, {title}, available at {url}."

# The conclusion is built inside read_webpage_data by ending(title, url),
# so no separate conclusion string is needed here.

# Print out the content of the specified webpage
print(read_webpage_data(greeting, name, additional_info, title, url))


Introduction: Buenas at Kumusta, I am Fornesus and this is a web scraping project that I am creating using the BS4 (BeautifulSoup4) library in Python.

I am scraping the content of my blog, Fornesus Blog, available at https://fornesus.blog.

The content in the homepage of Fornesus Blog <https://fornesus.blog>, are as follows:

--------------------------------------------------

Content Division <div> element #1:

Child #1 - <a>:
+ "https://fornesus.blog/"

--------------------------------------------------

Content Division <div> element #2:

Child #1 - <a>:
+ "https://fornesus.blog/login"

--------------------------------------------------

Content Division <div> element #3:

Child #1 - <a>:
+ "https://fornesus.blog/blog/"

--------------------------------------------------

Content Division <div> element #4:

Child #1 - <a>:
+ "https://fornesus.com"

--------------------------------------------------

Content Division <div> element #5:

Child #1 - <a>:
+ "https://fornes.us"

-------

<p>As you may have noticed, I added a LOT of formatting to the output of this program, which undoubtedly took the most time to refine.  Formatting and human-friendly software design, including writing clear comments and formatting output for human-friendly reading, lead to better and more effective code and solutions.</p>

In [52]:
# Separate the results for each website with a divider line:
print()
print("-" * 50)
print()


--------------------------------------------------



<p>Now, let's see what happens when I use these same exact functions above to scrape data from my portfolio website at <a href="https://chris.com.ph">https://chris.com.ph</a>.</p>
<p>The necessary code can be found below:</p>

In [54]:
# Define the necessary variables for the "intro" function:
greeting = "Buenas at Kumusta"
name = "Fornesus"
title = "What Fornesus Builds"
url = "https://chris.com.ph"
additional_info = f"I am scraping the content of my portfolio site, {title}, available at {url}."

# The conclusion is built inside read_webpage_data by ending(title, url),
# so it is not passed in as an extra argument.

# Print the content of the specified webpage
print(read_webpage_data(greeting, name, additional_info, title, url))


Introduction: Buenas at Kumusta, I am Fornesus and this is a web scraping project that I am creating using the BS4 (BeautifulSoup4) library in Python.

I am scraping the content of my portfolio site, What Fornesus Builds, available at https://chris.com.ph.

The content in the homepage of What Fornesus Builds <https://chris.com.ph>, are as follows:

--------------------------------------------------

Content Division <div> element #1:

Child 1 - <img>:
+ "https://assets.zyrosite.com/cdn-cgi/image/format=auto,w=551,h=704,fit=crop/d95rrNbqkpIXbpoo/logo-square-dOqNNjk0a0HN12xM.png"

--------------------------------------------------

Content Division <div> element #2:

Child 1 - <img>:
+ "https://assets.zyrosite.com/cdn-cgi/image/format=auto,w=328,h=418,fit=crop/d95rrNbqkpIXbpoo/logo-square-dOqNNjk0a0HN12xM.png"

--------------------------------------------------

Content Division <div> element #3:

Child #1 - <h1>:
+ "Hi, I'm Fornesus"

--------------------------------------------------


In [55]:
# Separate the results for each website with a divider line:
print()
print("-" * 50)
print()


--------------------------------------------------



<p>Finally, let's see what happens when I use these same exact functions above to scrape data from my art portfolio website at <a href="https://fornesusart.com">https://fornesusart.com</a>.</p>
<p>The necessary code can be found below:</p>

In [56]:
# Define the necessary variables for the "intro" function:
greeting = "Buenas at Kumusta"
name = "Fornesus"
title = "Fornesus Art"
url = "https://fornesusart.com"
additional_info = f"I am scraping the content of my art portfolio, {title}, available at {url}."

# The conclusion is built inside read_webpage_data by ending(title, url),
# so it is not passed in as an extra argument.

# Print the content of the specified webpage
print(read_webpage_data(greeting, name, additional_info, title, url))


Introduction: Buenas at Kumusta, I am Fornesus and this is a web scraping project that I am creating using the BS4 (BeautifulSoup4) library in Python.

I am scraping the content of my art portfolio, Fornesus Art, available at https://fornesusart.com.

The content in the homepage of Fornesus Art <https://fornesusart.com>, are as follows:

--------------------------------------------------

Content Division <div> element #1:

Child #1 - <a>:
+ "https://fornesusart.com//"

--------------------------------------------------

Content Division <div> element #2:

Child #1 - <a>:
+ "https://fornesusart.com//"

--------------------------------------------------

Content Division <div> element #3:

Child #1 - <a>:
+ "https://fornesusart.com//#about"

--------------------------------------------------

Content Division <div> element #4:

Child #1 - <a>:
+ "https://fornesusart.com//#showcase"

--------------------------------------------------

Content Division <div> element #5:

Child #1 - <

<p>Thank you for reading through this notebook and I hope that an overview of this process helps with your own journey in learning a programming language, or even jumpstarts a new one if you weren't already considering it.</p>
<p>We live in times when knowledge is more accessible than it has ever been, and taking advantage of these opportunities by supercharging your learning may be a worthwhile endeavor, or at least a fun way to quench your curiosity!</p>
<p>Salamat!</p>