# Web Scraping Workshop with Python Demo

If you have dabbled even a little bit in the world of data science, you have likely heard the term *web scraping*. **Web scraping** is the process of using automation to collect large amounts of data from publicly available sources on the internet.

There are three steps required to successfully scrape a website:

1) **Retrieving the data**

    When you enter a URL into your browser and load a website, the content and structure of the webpage are downloaded to your computer and displayed by your browser. This data is stored in an *HTML* file. To scrape a webpage, we must get access to this HTML data so that we can analyze its contents.
 
2) **Parsing the data**

    Now that you have access to a website's HTML data, you have to make sense of it. A big part of this step is figuring out which components of the HTML you need and which ones you don't. This is often the hardest and most tedious part of web scraping, but it is where the magic really happens.
    
3) **Using the data**

    Congratulations! Now that you have your data, you can do something cool with it. You can analyze this data, gain new insights and knowledge, or even build your very own data science app with it (with proper attribution, of course!).
    
In this workshop, you will learn the basics of creating your own web scraping script using the `BeautifulSoup` and `requests` packages for the Python programming language. You will learn how to perform all three steps of web scraping: retrieving the raw HTML from a webpage, parsing that HTML to extract the data you need, and storing that data to develop new insights or build your very own data science app. We will also discuss how web scraping is used in common technologies and software today, as well as the potential ethical implications of the practice.

This workshop is meant to be **introductory** and is open to all skill levels. No prior knowledge of web scraping or any of the Python packages mentioned is required.

## Getting Started

This workshop relies heavily on the `requests` and `beautifulsoup4` packages for Python. To install them, run the cell below:

In [1]:
!pip install requests
!pip install beautifulsoup4

Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable


Now that you have installed the `requests` and `beautifulsoup4` packages, import them below:

In [110]:
# Import the `requests` package
import requests

# Import `BeautifulSoup` from the `bs4` package
from bs4 import BeautifulSoup

import pandas as pd

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
import re
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\yeahs\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

As you may have noticed above, the package is installed as `beautifulsoup4` but imported under the shorter name `bs4`. From `bs4`, we import the `BeautifulSoup` class.

**Congratulations!** You are now ready to start web scraping.

## Part 1: Introduction

Remember, when you enter a URL into your browser and load a website, the content and structure of the webpage are downloaded to your computer and displayed by your browser.

This data is stored in an *HTML* file! To scrape a webpage, we must get access to this HTML data so that we can analyze its contents.

In Python, we can use the `requests` package to download and store the HTML from a URL!

Which URL?

In [16]:
# URL for the NYU Computer Science course schedule (Spring 2024):
url = "https://cs.nyu.edu/dynamic/courses/schedule/?semester=spring_2024"
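The part of the URL after the `?` is a *query string*, and here it selects the semester. The sketch below uses only Python's standard library to take the URL apart and rebuild it with a different value. The helper `with_semester` and the value `fall_2024` are assumptions made up for illustration; whether the site actually serves a page for that value is not confirmed here.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode

schedule_url = "https://cs.nyu.edu/dynamic/courses/schedule/?semester=spring_2024"

# Split the URL into scheme, host, path, query string, and fragment
parts = urlsplit(schedule_url)

# Turn the query string into a dictionary of parameter name -> list of values
query = parse_qs(parts.query)
print(parts.netloc)
print(query)


def with_semester(base_url, semester):
    """Hypothetical helper: return base_url with its query replaced by semester=<semester>."""
    p = urlsplit(base_url)
    new_query = urlencode({"semester": semester})
    return urlunsplit((p.scheme, p.netloc, p.path, new_query, p.fragment))


# "fall_2024" is an assumed value, used only to show the mechanics
print(with_semester(schedule_url, "fall_2024"))
```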


Now, we want to fetch the data from this URL! Let's use the `requests.get()` function:

In [17]:
# Get data from the URL
data = requests.get(url)
data

<Response [200]>

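As you can see, we received a response! The number in the response is an HTTP *status code*. `200` means the request succeeded. Have you ever seen a *404 Not Found* error? That is a status code as well, and it means the server could not find the page.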
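To get a feel for status codes without making any requests, here is a minimal sketch using Python's built-in `http.HTTPStatus`. The helper names `describe_status` and `is_success` are made up for this example.

```python
from http import HTTPStatus


def describe_status(code):
    """Return the numeric code together with its standard phrase."""
    status = HTTPStatus(code)
    return f"{status.value} {status.phrase}"


def is_success(code):
    """Codes in the 200-299 range indicate a successful request."""
    return 200 <= code < 300


for code in (200, 404, 500):
    label = "success" if is_success(code) else "not success"
    print(describe_status(code), "->", label)
```

With `requests`, the same number is available as `data.status_code`, and calling `data.raise_for_status()` raises an exception for 4xx and 5xx responses, which is a convenient guard before parsing.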

Now, let's find that HTML:

In [18]:
# Find data text (the HTML!) 
html = data.text
html

'\n<!DOCTYPE html>\n<!--[if lt IE 7]>\n<html class="no-js lt-ie9 lt-ie8 lt-ie7" lang="en"> <![endif]-->\n<!--[if IE 7]>\n<html class="no-js lt-ie9 lt-ie8" lang="en"> <![endif]-->\n<!--[if IE 8]>\n<html class="no-js lt-ie9" lang="en"> <![endif]-->\n<!--[if gt IE 8]><!-->\n<html class="no-js" lang="en">\n<!--<![endif]-->\n<head>\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\n    <meta charset="utf-8">\n    <title>NYU Computer Science Department</title>\n    <meta name="description" content="">\n    <meta name="viewport" content="width=device-width, initial-scale=1">\n    <link href="//maxcdn.bootstrapcdn.com/bootstrap/3.3.0/css/bootstrap.min.css" rel="stylesheet" type="text/css">\n    <link rel="stylesheet" href="//maxcdn.bootstrapcdn.com/font-awesome/4.3.0/css/font-awesome.min.css">\n    <link rel="stylesheet" href="/static/css/main.css?20180702b" type="text/css">\n    <script src="/static/js/vendor/modernizr-2.8.3-respond-1.4.2.min.js"></script>\n\n    <!-- favicon -->\n  

Wow, that is quite long! Congratulations, you have now completed step 1 -- retrieving the data! But now, how do we make sense of all that gibberish?

Now that we have the HTML data, we can unleash the full power of `BeautifulSoup` to parse it.

In [19]:
# Parse the HTML we extracted with Python's built-in `html.parser`, and save the result to `soup`.
soup = BeautifulSoup(html, 'html.parser')


In [20]:
# Now, let's print out the "prettified" version of that HTML!
print(soup.prettify())

<!DOCTYPE html>
<!--[if lt IE 7]>
<html class="no-js lt-ie9 lt-ie8 lt-ie7" lang="en"> <![endif]-->
<!--[if IE 7]>
<html class="no-js lt-ie9 lt-ie8" lang="en"> <![endif]-->
<!--[if IE 8]>
<html class="no-js lt-ie9" lang="en"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en">
 <!--<![endif]-->
 <head>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta charset="utf-8"/>
  <title>
   NYU Computer Science Department
  </title>
  <meta content="" name="description"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <link href="//maxcdn.bootstrapcdn.com/bootstrap/3.3.0/css/bootstrap.min.css" rel="stylesheet" type="text/css"/>
  <link href="//maxcdn.bootstrapcdn.com/font-awesome/4.3.0/css/font-awesome.min.css" rel="stylesheet"/>
  <link href="/static/css/main.css?20180702b" rel="stylesheet" type="text/css"/>
  <script src="/static/js/vendor/modernizr-2.8.3-respond-1.4.2.min.js">
  </script>
  <!-- favicon -->
  <link href="/home/apple-ico

`BeautifulSoup` does a lot more than just printing HTML in a nice way. We can now also find specific elements on the webpage!

Let's find all of the data for the computer science courses. How? Let's take a look at the HTML.

You will find that each course sits in its own `<li>` element with the class `row`! So, let's try and find those elements.
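To see how `find_all` and `find` behave before pointing them at the real page, here is a minimal sketch on a tiny hand-written snippet. The markup below is an assumption invented for illustration and is much simpler than the actual course page.

```python
from bs4 import BeautifulSoup

# Hypothetical, hand-written markup (not copied from the real page)
sample_html = """
<ul>
  <li class="row"><a class="expand">Fundamental Algorithms</a></li>
  <li class="row"><a class="expand">Programming Languages</a></li>
  <li class="header">Not a course</li>
</ul>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns every matching tag; class_ filters on the CSS class
rows = sample_soup.find_all("li", class_="row")

# find returns the first match inside each row
titles = [row.find("a", class_="expand").get_text(strip=True) for row in rows]
print(titles)

# When nothing matches, find returns None, which is why the checks
# of the form `... if tag else 'N/A'` are needed on real pages
print(sample_soup.find("table"))
```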

In [34]:
# Find all course entries
courses = soup.find_all('li', class_='row')

for course in courses:
    # Safely extract the course code
    course_code_tag = course.find('a', {'data-toggle': 'tooltip'})
    course_code = course_code_tag.get_text(strip=True) if course_code_tag else 'N/A'
    
    # Safely extract the course title
    course_title_tag = course.find('a', class_='expand')
    course_title = course_title_tag.get_text(strip=True) if course_title_tag else 'N/A'
    
    # Safely extract the instructor name
    instructor_span = course.find('span', class_='col-xs-12 col-sm-2')
    instructor_tag = instructor_span.find('a') if instructor_span else None
    instructor_name = instructor_tag.get_text(strip=True) if instructor_tag else 'N/A'
    
    # Safely extract the course schedule
    schedule_tags = course.find_all('span', class_='col-xs-12 col-sm-2')
    schedule = schedule_tags[-2].get_text(strip=True) if len(schedule_tags) > 1 else 'N/A'
    
    # Print the extracted information if course code and title are found
    if course_code != 'N/A' and course_title != 'N/A':
        print(f"Course Code: {course_code}")
        print(f"Title: {course_title}")
        print(f"Instructor: {instructor_name}")
        print(f"Schedule: {schedule}")
        print('-' * 40)

Course Code: CSCI-GA.1170-​001
Title: Fundamental Algorithms
Instructor: Yevgeniy Dodis
Schedule: W 4:55-6:55PM
----------------------------------------
Course Code: CSCI-GA.1170-​003
Title: Fundamental Algorithms Recitation
Instructor: N/A
Schedule: F 4:55-5:45PM
----------------------------------------
Course Code: CSCI-GA.1170-​004
Title: Fundamental Algorithms
Instructor: N/A
Schedule: T 4:55-6:55PM
----------------------------------------
Course Code: CSCI-GA.1180-​001
Title: Mathematical Techniques For CS Applications
Instructor: N/A
Schedule: W 7:10-9:10PM
----------------------------------------
Course Code: CSCI-GA.2110-​001
Title: Programming Languages
Instructor: N/A
Schedule: M 4:55-6:55PM
----------------------------------------
Course Code: CSCI-GA.2110-​002
Title: Programming Languages Recitation
Instructor: N/A
Schedule: R 7:10-8:00PM
----------------------------------------
Course Code: CSCI-GA.2130-​001
Title: Compiler Construction
Instructor: Joseph Tassarotti
Schedu

In [39]:
# Find all course entries
courses = soup.find_all('li', class_='row')

# Initialize an empty list to store each course's information
courses_data = []

for course in courses:
    # Safely extract the course code
    course_code_tag = course.find('a', {'data-toggle': 'tooltip'})
    course_code = course_code_tag.get_text(strip=True) if course_code_tag else 'N/A'
    
    # Safely extract the course title
    course_title_tag = course.find('a', class_='expand')
    course_title = course_title_tag.get_text(strip=True) if course_title_tag else 'N/A'
    
    # Safely extract the instructor name
    instructor_span = course.find('span', class_='col-xs-12 col-sm-2')
    instructor_tag = instructor_span.find('a') if instructor_span else None
    instructor_name = instructor_tag.get_text(strip=True) if instructor_tag else 'N/A'
    
    # Safely extract the course schedule
    schedule_tags = course.find_all('span', class_='col-xs-12 col-sm-2')
    schedule = schedule_tags[-2].get_text(strip=True) if len(schedule_tags) > 1 else 'N/A'
    
    # Store the extracted information in a dictionary and add it to the list
    course_info = {
        'Course Code': course_code,
        'Title': course_title,
        'Instructor': instructor_name,
        'Schedule': schedule
    }
    courses_data.append(course_info)

# Convert the list of course information dictionaries to a DataFrame
courses_df = pd.DataFrame(courses_data)

# Display the DataFrame
courses_df.head(50)

|   | Course Code | Title | Instructor | Schedule |
|---|---|---|---|---|
| 0 | CSCI-GA.1170-​001 | Fundamental Algorithms | Yevgeniy Dodis | W 4:55-6:55PM |
| 1 |  | Fundamental Algorithms Recitation |  | R 5:55-6:45PM |
| 2 | CSCI-GA.1170-​003 | Fundamental Algorithms Recitation |  | F 4:55-5:45PM |
| 3 | CSCI-GA.1170-​004 | Fundamental Algorithms |  | T 4:55-6:55PM |
| 4 |  | Fundamental Algorithms Recitation |  | R 3:45-4:35PM |
| 5 | CSCI-GA.1180-​001 | Mathematical Techniques For CS Applications |  | W 7:10-9:10PM |
| 6 | CSCI-GA.2110-​001 | Programming Languages |  | M 4:55-6:55PM |
| 7 | CSCI-GA.2110-​002 | Programming Languages Recitation |  | R 7:10-8:00PM |
| 8 |  | Programming Languages Recitation |  | R 7:10-8:00PM |
| 9 | CSCI-GA.2130-​001 | Compiler Construction | Joseph Tassarotti | M 7:10-9:10PM |
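Scraped text can carry invisible characters. The course codes above appear to contain a zero-width space (U+200B) after the last hyphen, which would make a comparison such as `code == "CSCI-GA.1170-001"` fail. A minimal sketch, assuming that is the character involved (the helper name `clean_code` is made up for this example):

```python
ZERO_WIDTH_SPACE = "\u200b"


def clean_code(raw):
    """Remove zero-width spaces and surrounding whitespace from a scraped string."""
    return raw.replace(ZERO_WIDTH_SPACE, "").strip()


# A course code with an invisible zero-width space after the last hyphen
raw_code = "CSCI-GA.1170-\u200b001"

print(raw_code == "CSCI-GA.1170-001")              # the raw string does not match
print(clean_code(raw_code) == "CSCI-GA.1170-001")  # the cleaned string does
print(len(raw_code) - len(clean_code(raw_code)))   # number of characters removed
```

On a DataFrame column, the same cleanup could be applied with `courses_df['Course Code'].str.replace('\u200b', '', regex=False)`.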


## Part 2: Text Analysis

Now that we know how to retrieve and parse data, we can use it for something cool! This time, we will scrape the text of a play and analyze its dialogue.

In [107]:
# URL for the Project Gutenberg HTML edition of "Hedda Gabler" by Henrik Ibsen
url = "https://www.gutenberg.org/files/4093/4093-h/4093-h.htm"
# Get data from the URL
data = requests.get(url)
data

<Response [200]>

In [81]:
# Find data text (the HTML!) 
html = data.text
html



In [82]:
# Parse the HTML we extracted with Python's built-in `html.parser`, and save the result to `soup`.
soup = BeautifulSoup(html, 'html.parser')

In [83]:
# Now, let's print out the "prettified" version of that HTML!
print(soup.prettify())

<?xml version="1.0" encoding="us-ascii"?>
<!DOCTYPE html
   PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" >
<html lang="en" xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <title>
   Hedda Gabler, by Henrik Ibsen
  </title>
  <style type="text/css" xml:space="preserve">
   body { margin:5%; background:#faebd0; text-align:justify}
    P { text-indent: 1em; margin-top: .25em; margin-bottom: .25em; }
    H1,H2,H3,H4,H5,H6 { text-align: center; margin-left: 15%; margin-right: 15%; }
    hr  { width: 50%; text-align: center;}
    .foot { margin-left: 20%; margin-right: 20%; text-align: justify; text-indent: -3em; font-size: 90%; }
    blockquote {font-size: 97%; font-style: italic; margin-left: 10%; margin-right: 10%;}
    .mynote    {background-color: #DDE; color: #000; padding: .5em; margin-left: 10%; margin-right: 10%; font-family: sans-serif; font-size: 95%;}
    .toc       { margin-left: 10%; margin-bottom: .75em;}
    .toc2      { mar

In [101]:
# Initialize a list to hold all character dialogues along with names
character_dialogues = []

# Temporary storage for character name and dialogue
current_character = None
current_dialogue = []

# Regular expression pattern to match stage directions in square brackets
pattern = re.compile(r'\[.*?\]')

# Loop through each paragraph in the HTML
for p in soup.find_all('p'):
    text = p.get_text().strip()
    # Remove stage directions from the text
    text = re.sub(pattern, '', text).strip()
    
    if text.isupper() and len(text.split()) <= 3:  # Assuming character names are in uppercase and short
        # If there's an ongoing dialogue, save it before moving to the next character
        if current_character and current_dialogue:
            dialogue_str = ' '.join(current_dialogue).replace('\n', ' ').strip()
            character_dialogues.append({'Character': current_character, 'Dialogue': dialogue_str})
            current_dialogue = []  # Reset the dialogue list for the next character

        current_character = text  # Update the current character

    elif current_character and text:  # If we have a current character, add non-empty text to their dialogue
        current_dialogue.append(text)

# Add the last dialogue if it exists
if current_character and current_dialogue:
    dialogue_str = ' '.join(current_dialogue).replace('\n', ' ').strip()
    character_dialogues.append({'Character': current_character, 'Dialogue': dialogue_str})

# Convert the list of dialogues into a DataFrame
df_dialogues = pd.DataFrame(character_dialogues)

# Display the DataFrame (pandas truncates long output automatically)
df_dialogues

|   | Character | Dialogue |
|---|---|---|
| 0 | ACT FOURTH. | From Munich, on June 29, 1890, Ibsen wrote to ... |
| 1 | W. A. | Transcriber's Note: The inclusion or omission ... |
| 2 | MISS TESMAN. | Upon my word, I don't believe they are s... |
| 3 | BERTA. | I told you so, Miss. Remember how late the ste... |
| 4 | MISS TESMAN. | Well well—let them have their sleep out. But l... |
| ... | ... | ... |
| 1631 | BRACK. | Every blessed evening, with all the plea... |
| 1632 | HEDDA. | Yes, don't you flatter yourself we will, Judge... |
| 1633 | TESMAN. | Oh, now she is playing with those pistols again. |
| 1634 | TESMAN. | Shot herself! Shot herself in the temple! Fanc... |

In [108]:
# Initialize the VADER sentiment intensity analyzer
sia = SentimentIntensityAnalyzer()

# Function to get sentiment scores
def get_sentiment_scores(text):
    return sia.polarity_scores(text)

# Applying the function to the 'Dialogue' column and creating new columns for each score
df_dialogues['Sentiment'] = df_dialogues['Dialogue'].apply(get_sentiment_scores)
df_dialogues[['Negative', 'Neutral', 'Positive', 'Compound']] = df_dialogues['Sentiment'].apply(pd.Series)

# Dropping the 'Sentiment' column as it's no longer needed
df_dialogues.drop(columns='Sentiment', inplace=True)

# Displaying the updated DataFrame
df_dialogues

|   | Character | Dialogue | Negative | Neutral | Positive | Compound |
|---|---|---|---|---|---|---|
| 0 | ACT FOURTH. | From Munich, on June 29, 1890, Ibsen wrote to ... | 0.061 | 0.813 | 0.126 | 0.9998 |
| 1 | W. A. | Transcriber's Note: The inclusion or omission ... | 0.000 | 0.945 | 0.055 | 0.4215 |
| 2 | MISS TESMAN. | Upon my word, I don't believe they are s... | 0.000 | 1.000 | 0.000 | 0.0000 |
| 3 | BERTA. | I told you so, Miss. Remember how late the ste... | 0.068 | 0.932 | 0.000 | -0.3417 |
| 4 | MISS TESMAN. | Well well—let them have their sleep out. But l... | 0.000 | 0.725 | 0.275 | 0.8100 |
| ... | ... | ... | ... | ... | ... | ... |
| 1631 | BRACK. | Every blessed evening, with all the plea... | 0.000 | 0.675 | 0.325 | 0.8475 |
| 1632 | HEDDA. | Yes, don't you flatter yourself we will, Judge... | 0.134 | 0.741 | 0.125 | 0.2033 |
| 1633 | TESMAN. | Oh, now she is playing with those pistols again. | 0.000 | 0.816 | 0.184 | 0.2023 |
| 1634 | TESMAN. | Shot herself! Shot herself in the temple! Fanc... | 0.000 | 1.000 | 0.000 | 0.0000 |
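One natural next step is to average the compound score per character. The sketch below does this with the standard library only, using the handful of rows visible in the output above, so the numbers describe that small sample and not the whole play. The first two rows (`ACT FOURTH.` and `W. A.`) are left out because they appear to be front matter picked up by the uppercase heuristic, not characters.

```python
from collections import defaultdict
from statistics import mean

# (character, compound score) pairs copied from the rows shown above
rows = [
    ("MISS TESMAN.", 0.0000),
    ("BERTA.", -0.3417),
    ("MISS TESMAN.", 0.8100),
    ("BRACK.", 0.8475),
    ("HEDDA.", 0.2033),
    ("TESMAN.", 0.2023),
    ("TESMAN.", 0.0000),
]

# Group the scores by character
by_character = defaultdict(list)
for character, compound in rows:
    by_character[character].append(compound)

# Average each character's scores
mean_compound = {name: mean(scores) for name, scores in by_character.items()}

# Print from most positive to most negative
for name, score in sorted(mean_compound.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<14}{score:+.4f}")
```

On the full DataFrame, the equivalent pandas call would be `df_dialogues.groupby('Character')['Compound'].mean()`.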
