# Introduction to Web Scraping

Welcome to the web scraping workshop. In this workshop, you will learn the fundamental concepts and techniques used to extract data from websites and organize it into structured form. Web scraping is a useful skill in today's data-driven world: the data you collect can reveal patterns, trends, and relationships that inform decision-making and drive business success.

## Outline

1. **Introduction to Python Web Scraping**
    - Overview of Python libraries
    - Requests library
    - BeautifulSoup library
    - Pandas library
2. **Making HTTP Requests with Requests**
    - Basic GET request
    - Handling different HTTP methods (POST, PUT, DELETE)
    - Using headers with requests
3. **Parsing HTML with BeautifulSoup**
    - Creating a BeautifulSoup object
    - Navigating and searching the parse tree
    - Extracting data from HTML tags
4. **Data Extraction and Storage**
    - Extracting specific elements
    - Formatting data for storage
    - Saving data with Pandas
5. **Data Manipulation and Cleaning with Pandas**
    - Creating and saving DataFrames
    - Reading data from files
    - Data manipulation operations
    - Data cleaning operations
6. **Exercises**
    - Basic GET request
    - Finding specific HTML elements
    - Creating and saving DataFrames
    - Data statistics and analysis

Run the cell below to install all the libraries needed for this workshop. To run a cell, press the button that appears when you hover over it.

In [None]:
%pip install pandas requests beautifulsoup4

## 1 Introduction to Python Web Scraping

Here is an overview of the Python libraries used in this workshop:

- **Requests**: A simple HTTP library that allows you to send HTTP requests using Python. This will be used to send requests to the website we want to scrape.
  
- **BeautifulSoup**: A Python library for pulling data out of HTML and XML files. This will be used to parse the HTML content of the website and extract the data we need.

- **Pandas**: A powerful data manipulation and analysis library that provides data structures like DataFrames, which are ideal for handling structured data. Think of this as Excel for Python, where each column is a Series and each row is a record. This will be used so that we can easily manipulate and clean our data.

## 2 Introduction to requests

The requests library allows you to send HTTP requests using Python. It abstracts the complexities of making requests behind a simple API, making it easy to interact with web services and APIs.

### 2.1 Making a basic GET request

A GET request is used to retrieve data from a server. Here is an example:

In [None]:
import requests

response = requests.get("https://api.github.com") 
# used to send a get request to the url specified
print(response.status_code) # used to check if the request is successful
# 200 is used to indicate that the request was successful
# you may know 404 as the error code for a page not found
print(response.json()) # used to print the content of the page
#alternatively you can use response.text to print the content of the page

The `requests.get` function sends a GET request to the specified URL. The response from the server is stored in the `response` variable.

**Response object**: The response object contains all the information returned by the server, including:

- `response.status_code`: The HTTP status code (e.g., 200 for success, 404 for not found).
- `response.text`: The content of the response as a string.
- `response.json()`: If the response content is in JSON format, this method parses it into a Python object (usually a dictionary or a list).
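
Status codes are easier to act on when you can read them at a glance. The sketch below uses Python's standard `http.HTTPStatus` to turn a numeric code, such as the value of `response.status_code`, into a short description. The helper name `describe_status` is made up for this illustration and is not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short, readable description of an HTTP status code."""
    # HTTPStatus raises ValueError for codes the standard library does not know
    status = HTTPStatus(code)
    if 200 <= code < 300:
        category = "success"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "other"
    return f"{code} {status.phrase} ({category})"


print(describe_status(200))  # 200 OK (success)
print(describe_status(404))  # 404 Not Found (client error)
```

In a scraper you could call `describe_status(response.status_code)` before deciding whether to parse the page.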

### 2.2 Handling Different HTTP Methods

The requests library supports various HTTP methods, including POST, PUT, DELETE, etc.

#### POST Request

A POST request is used to send data to a server to create or update a resource.

In [None]:
url = 'https://httpbin.org/post'
data = {'key': 'value'}

response = requests.post(url, data=data)

print(response.status_code)
print(response.json())
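
To see what a POST request actually carries, it can help to build one without sending it. The sketch below prepares two requests, one with `data=` and one with `json=`, and prints the `Content-Type` header and body that `requests` would send. The variable names `form_request` and `json_request` are chosen for this illustration only.

```python
import json

import requests

# Build the requests without sending them, so nothing goes over the network
form_request = requests.Request(
    "POST", "https://httpbin.org/post", data={"key": "value"}
).prepare()
json_request = requests.Request(
    "POST", "https://httpbin.org/post", json={"key": "value"}
).prepare()

# data= is sent as a URL-encoded form
print(form_request.headers["Content-Type"])
print(form_request.body)

# json= is serialized to JSON and labeled accordingly
print(json_request.headers["Content-Type"])
print(json.loads(json_request.body))
```

Which of the two a server expects depends on the API you are talking to, so check its documentation.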

#### 2.3 PUT Request

A PUT request is used to update a resource.

In [None]:
url = 'https://httpbin.org/put'
data = {'key': 'value'}

response = requests.put(url, data=data)

print(response.status_code)
print(response.json())

#### 2.4 DELETE Request

A DELETE request is used to delete a resource.

In [None]:
url = 'https://httpbin.org/delete'

response = requests.delete(url)

print(response.status_code)
print(response.json())

### 2.5 Using Headers with Requests

When making HTTP requests, headers can be used to provide additional information to the server. 

Headers are key-value pairs that can include information such as the user agent, content type, and authorization tokens. Using headers can help you interact with web services and APIs more effectively.

Here is an example of how to use headers with the `requests` library in Python:



In [None]:
url = 'https://teamacademy.nl/faq/'

# Headers send additional information to the server along with the request
headers = {
    # Identifies the client; here we present ourselves as a desktop browser
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    # Media types we accept, with HTML preferred
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    # Preferred response language (English)
    'Accept-Language': 'en-US,en;q=0.5',
    # Asks the server not to compress the response body
    'Accept-Encoding': 'identity'
}

response = requests.get(url, headers=headers)

print(response.status_code)
print(response.content)

In this example, the `headers` dictionary contains several common headers:
- `User-Agent`: Identifies the client software making the request.
- `Accept`: Specifies the media types that are acceptable for the response.
- `Accept-Language`: Specifies the preferred languages for the response.
- `Accept-Encoding`: Specifies the content encoding that is acceptable for the response.

Including these headers tells the server which client is making the request and what kind of response it prefers, which helps the server process the request correctly.
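
If you make several requests to the same site, one option is to set the headers once on a `requests.Session` instead of passing them every time. The sketch below is a minimal illustration that sends nothing over the network. The User-Agent string and the name `scraper_session` are made up for this example.

```python
import requests

# A session keeps default headers (and cookies) across requests
scraper_session = requests.Session()
scraper_session.headers.update({
    "User-Agent": "workshop-scraper/0.1",  # hypothetical client name
    "Accept-Language": "en-US,en;q=0.5",
})

# Prepare a request through the session to inspect the merged headers
prepared = scraper_session.prepare_request(
    requests.Request("GET", "https://example.com/")
)
print(prepared.headers["User-Agent"])
print(prepared.headers["Accept-Language"])
```

Calling `scraper_session.get(...)` would then send these headers automatically with each request.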

## 3 Transition from Requests to BeautifulSoup

While the `requests` library is excellent for making HTTP requests and retrieving the content of web pages, it does not provide tools for manipulating and extracting data from the HTML content. This is where `BeautifulSoup` comes in. `BeautifulSoup` is a Python library used for parsing HTML and XML documents and extracting data from them in a hierarchical and readable manner.

### 3.1 Why Use BeautifulSoup?

- **HTML Parsing**: BeautifulSoup takes the HTML content and builds a tree of HTML tags that can be used to extract data.
- **Easy Navigation**: It provides methods to navigate and search that tree, making it easy to find and extract specific elements.
- **Data Extraction**: You can extract data from HTML tags, attributes, and text content.


In [None]:
from bs4 import BeautifulSoup # used to import the BeautifulSoup class from the bs4 module

content = response.content # used to get the content of the page

soup = BeautifulSoup(content, 'html.parser') # used to create a BeautifulSoup object

print(soup.prettify()) # used to print the content of the page in a readable format

### 3.2 Are we allowed to scrape any website?

Web scraping can be a powerful tool for data collection and analysis, but it is important to understand the legal implications and ethical considerations associated with it.

1. **Terms of Service**: Many websites have terms of service (ToS) that explicitly prohibit scraping. Violating these terms can lead to legal consequences, including being banned from the website or facing legal action.

2. **Copyright Laws**: The content on websites is often protected by copyright laws. Scraping and using this content without permission can infringe on the copyright holder's rights.

3. **Data Privacy**: Scraping personal data from websites can violate data privacy laws, such as the General Data Protection Regulation (GDPR) in the European Union. Ensure that you comply with relevant data protection regulations when scraping data.

4. **Robots.txt**: Websites often use a `robots.txt` file to indicate which parts of the site can be crawled by web crawlers. While this file is not legally binding, it is considered good practice to respect the directives specified in `robots.txt`.

All in all, the basic rule of thumb is to keep scraping to a minimum so that you do not overload the server.
For an informal discussion of whether scraping is legal, see [this Reddit thread](https://www.reddit.com/r/webscraping/comments/yrs5eq/can_you_just_scrape_any_website/).
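
Checking `robots.txt` can be automated with Python's standard `urllib.robotparser` module. The sketch below parses a made-up `robots.txt` (the rules and URLs are purely illustrative) and asks whether two paths may be fetched. For a real site you would instead call `set_url()` with the site's `robots.txt` address followed by `read()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only
sample_robots_txt = """
User-agent: *
Disallow: /private/
"""

robots_parser = RobotFileParser()
robots_parser.parse(sample_robots_txt.splitlines())

# can_fetch(user_agent, url) applies the parsed rules to a URL
allowed_faq = robots_parser.can_fetch("*", "https://example.com/faq/")
allowed_private = robots_parser.can_fetch("*", "https://example.com/private/data")

print(allowed_faq)      # True
print(allowed_private)  # False
```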


### 3.3 Get useful information from the HTML content

Now that we have this output, we want to do something useful with it, such as extracting data from the printed content. This is where BeautifulSoup comes in clutch: it allows us to specify what we want to extract from the HTML content using methods like `find()`, `find_all()`, and `select()`.

In [None]:
# Find the first <h2> tag
first_h2 = soup.find('h2')
# the <h1> to <h6> tags define headings
# the number after the h is the heading level (1 is the most important)
if first_h2:
    print(first_h2.text)
else:
    print('No <h2> tag found')

print('----------------------')

# find the next <h2> tag
# find() returns None when nothing matches, so only search if a first <h2> exists
next_h2 = first_h2.find_next('h2') if first_h2 else None
if next_h2:
    print(next_h2.text)
else:
    print('No next <h2> tag found')

print('----------------------')

# Find all <p> tags
all_p_tags = soup.find_all(['p']) # here we can also add more things to the list
# the p tag is used to define a paragraph
# using this we will find all the paragraphs in the page
for p in all_p_tags:
    print(p.text)

In [None]:
# Use the select method to find all <p> tags with a specific class
selected_paragraphs = soup.select('p.some-class')

# Print the text content of each selected paragraph
for paragraph in selected_paragraphs:
    print(paragraph.text)

The tags, classes, and ids used in these examples come from the webpage that we are trying to scrape. Your browser's Inspect Element tool shows the HTML content of the page, and we can use it to find the tags that we want to extract or focus on.
To pull out those elements, we can use methods like `find()`, `find_all()`, and `select()`. An example that extracts a specific element found with Inspect Element follows the list below.

Common HTML elements include:
- `<a>`: Defines a hyperlink
- `<div>`: Defines a division or section
- `<p>`: Defines a paragraph
- `<h1>` to `<h6>`: Define HTML headings
- `<span>`: Defines an inline container for a piece of text
- `<ul>`: Defines an unordered list
- `<ol>`: Defines an ordered list
- `<li>`: Defines a list item
- `<table>`: Defines a table
- `<tr>`: Defines a row in a table
- `<td>`: Defines a cell in a table
- `<th>`: Defines a header cell in a table
- `<form>`: Defines an HTML form for user input
- `<input>`: Defines an input control
- `<button>`: Defines a clickable button
- `<img>`: Embeds an image
- `<nav>`: Defines navigation links
- `<header>`: Defines a header for a document or section
- `<footer>`: Defines a footer for a document or section
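
To see how `find()`, `find_all()`, and `select()` behave on a few of these elements, here is a small sketch that runs on an inline HTML string instead of a live page. The HTML, the class name `intro`, and the variable names are all made up for this illustration.

```python
from bs4 import BeautifulSoup

# A tiny, made-up page so the example does not depend on any website
sample_html = """
<html><body>
  <h2>First heading</h2>
  <p class="intro">Intro paragraph</p>
  <p>Plain paragraph</p>
  <h2>Second heading</h2>
  <a href="https://example.com/docs">Docs</a>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first match only
first_heading = sample_soup.find("h2").get_text()

# find_all() returns every match as a list
all_paragraphs = [p.get_text() for p in sample_soup.find_all("p")]

# select() takes a CSS selector: <p> tags with class "intro"
intro_paragraphs = [p.get_text() for p in sample_soup.select("p.intro")]

# Attributes are read like dictionary keys
link_targets = [a["href"] for a in sample_soup.find_all("a", href=True)]

print(first_heading)
print(all_paragraphs)
print(intro_paragraphs)
print(link_targets)
```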


In [None]:
# Extract title
title_div = soup.find('div', id='elementor-tab-title-26411')
# the title_div is used to find the div with the id elementor-tab-title-26411
# this div was found by inspecting the page
if title_div:
    #if we find it
    titles = title_div.find_all('a', class_='elementor-accordion-title')
    # we find all the a tags with the class elementor-accordion-title
    title_texts = [title.text for title in titles] # gets all the text from the a tags
    print("Titles:", title_texts)
else:
    print("No title div found")

# Extract content and links
content_div = soup.find('div', id='elementor-tab-content-26411')
# the content_div is used to find the div with the id elementor-tab-content-26411
if content_div:
    # if we find it
    paragraphs = content_div.get_text().strip() # gets all the text from the div
    links = [(a.text, a['href']) for a in content_div.find_all('a', href=True)] # finds all the links in the div
    print("\nContent:", paragraphs)
    print("\nLinks:")
    for text, url in links:
        print(f"- {text}: {url}")
else:
    print("No content div found")

We now want to do something useful with the text that we extracted from the webpage, such as saving it to a file or working with it in some other productive way.

This can be done by first formatting the data so that it is easier to work with. We can use the `pandas` library to do this.

But first we need to decide what we would want to save. The data that can be saved here can be pairs of questions and answers. 

To do this we first need to select all the questions and answers from the webpage.

In [None]:
questions = [] # used to store the questions
answers = [] # used to store the answers

for title in soup.find_all('a', class_='elementor-accordion-title'):
    # finds all the a tags with the class elementor-accordion-title
    # that's where the questions are stored
    questions.append(title.text)

for content in soup.find_all('div', class_='elementor-tab-content'):
    # finds all the divs with the class elementor-tab-content
    # that's where the answers are stored
    answers.append(content.text)

# guard against empty results, e.g. if the page layout has changed
if questions and answers:
    print(f"First Question: {questions[0]}")
    print(f"First Answer: {answers[0]}")
else:
    print("No questions or answers found")
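
Scraped text often carries stray newlines, tabs, and repeated spaces from the page layout. As one possible formatting step before storage, the sketch below collapses all whitespace into single spaces. The helper `normalize_whitespace` and the sample string are made up for this illustration.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    # split() with no arguments splits on any whitespace and drops empty pieces
    return " ".join(text.split())


# Made-up example of how scraped text can look
raw_answer = "\n\t  You can apply\n   online at any time.  \n"
clean_answer = normalize_whitespace(raw_answer)

print(repr(clean_answer))  # 'You can apply online at any time.'
```

The same helper could be applied to each scraped question and answer before saving them.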

But before we can finish this, we need to know a little about `pandas` and how it works.

## 4 What is Pandas?

Pandas is a powerful data manipulation and analysis library in Python. It provides data structures like DataFrames, which are ideal for handling structured data. Below is a demonstration of some basic operations you can perform using Pandas, starting with creating a DataFrame.

### 4.1 Creating a DataFrame 

A DataFrame is a two-dimensional data structure that can store different types of data in columns. You can think of it as a spreadsheet in Excel.

In [None]:
import pandas as pd

data = { # We can use a dictionary to create a DataFrame
    'Name': ['Alice', 'Bob', 'Charlie'], #This will add values to the first column
    # The values in the list will be added to the DataFrame
    # and the value on the left (Key) will be the name of the column
    'Age': [25, 30, 35], #This will add values to the second column
    'Gender': ['F', 'M', 'M'] #This will add values to the third column
}
df = pd.DataFrame(data) #This will create a DataFrame from the data we have
df #This will display the DataFrame

### 4.2 Save to file

We can then save this to a file using the following code:

In [None]:
import os

os.makedirs('data', exist_ok=True) # create the data folder if it does not exist yet
df.to_csv(path_or_buf='data/output.csv', index=False)
# Here index is set to False as we don't need to save
# the row indexes to the file

### 4.3 Reading Data from a File

Pandas can read data from various file formats, such as CSV, Excel, and SQL databases. Here is an example of reading data from a CSV file using the one we created above:

In [None]:
new_df = pd.read_csv('data/output.csv') # read the data from the file we saved above

# we can check if the two DataFrames are the same by using the equals method
print(df.equals(other=new_df)) # prints True if the two DataFrames are the same

new_df
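
A save-and-load round trip can also be tried without touching the file system. This minimal sketch assumes a small made-up DataFrame: calling `to_csv` without a path returns the CSV as a string, and `io.StringIO` lets `read_csv` read that string as if it were a file. The names `people` and `restored` are chosen for this illustration.

```python
import io

import pandas as pd

# Made-up data, separate from the DataFrame used in the workshop cells
people = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

# With no path given, to_csv returns the CSV content as a string
csv_text = people.to_csv(index=False)

# StringIO wraps the string so read_csv can treat it like a file
restored = pd.read_csv(io.StringIO(csv_text))

print(restored.equals(people))
```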

### 4.4 Data Manipulation

You can perform various data manipulation operations using Pandas. Here are examples of some common operations:

In [None]:
# Filtering rows based on a condition
filtered_df = df[df['Age'] > 25]
print("Filtered DataFrame (Age > 25):")
print(filtered_df)

# Adding a new column
df['Country'] = ['USA', 'Canada', 'UK']
print("\nDataFrame with new column 'Country':")
print(df)

# Dropping a column
df_dropped = df.drop(columns=['Gender'])
print("\nDataFrame after dropping 'Gender' column:")
print(df_dropped)

# Renaming columns
df_renamed = df.rename(columns={'Name': 'Full Name', 'Age': 'Years'})
print("\nDataFrame with renamed columns:")
print(df_renamed)

# Sorting the DataFrame by a column
df_sorted = df.sort_values(by='Age', ascending=False)
print("\nDataFrame sorted by 'Age' in descending order:")
print(df_sorted)

### 4.5 Data Cleaning

Data cleaning is an essential step in data preprocessing. It involves identifying and correcting errors in the data to improve its quality and reliability. Which operations you need depends on the data you are working with. Here are some common data cleaning operations, along with when to use them:

In [None]:
# Handling missing values
df_cleaned = df.copy()

# Adding some dummy data with missing values
df_cleaned.loc[3] = ['David', None, 'M', 'Australia']
df_cleaned.loc[4] = [None, 28, None, 'India']
# loc accesses rows and columns by label; assigning to a new label adds a row

print("DataFrame with missing values:")
print(df_cleaned)

# Dropping rows with missing values
df_dropped_na = df_cleaned.dropna()
print("\nDataFrame after dropping rows with missing values:")
print(df_dropped_na)


Dropping rows works well when missing data represents a small portion of your dataset (a common rule of thumb is under 5%) or when the missing values would significantly bias your analysis.
In the above example, a better approach is to fill the missing values, replacing them with either "Unknown" or the mean of the column.

In [None]:
# Fill the numeric columns with the mean value
df_cleaned['Age'] = df_cleaned['Age'].fillna(df_cleaned['Age'].mean())
print("\nDataFrame after filling missing values in 'Age' column with mean value:")
print(df_cleaned)

# Filling missing values with a default value
df_filled_na = df_cleaned.fillna(value='Unknown')
print("\nDataFrame after filling missing values with 'Unknown':")
df_filled_na

This approach is better here because it does not remove any rows from the dataset, so whatever we do with the data next can use every record. Keep in mind that filled-in values are estimates, not observations.
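
Before choosing between dropping and filling, it can help to measure how much data is actually missing. The sketch below uses a small made-up DataFrame. `isna()` marks missing cells, and taking the mean of those True/False values gives the share of missing entries per column. The names `sample`, `missing_share`, and `rows_with_gaps` are chosen for this illustration.

```python
import pandas as pd

# Made-up data with a few gaps
sample = pd.DataFrame({
    "Name": ["Alice", "Bob", None, "Dana"],
    "Age": [25, None, 35, 40],
})

# Share of missing values per column (True counts as 1, False as 0)
missing_share = sample.isna().mean()

# Number of rows that have at least one missing value
rows_with_gaps = int(sample.isna().any(axis=1).sum())

print(missing_share)
print(rows_with_gaps)
```

Here a quarter of each column is missing and two of the four rows would be lost by `dropna()`, which suggests that filling is the safer choice for this sample.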

## 5 Exercises

#### 1. **Basic GET Request**:
 
Write code to send a GET request to https://example.com and print the status code.

In [None]:
# Your code here to:
# 1. Send GET request to https://example.com
# 2. Print the status code

#### 2. **Find Second Header**: 
Write code using BeautifulSoup to find and print the text of the second h1 tag from this HTML string:

In [None]:
html = """
<html>
    <h1>Welcome</h1>
    <h1>Second Header</h1>
    <p>Some text</p>
</html>
"""

# Your code here to:
# 1. Create BeautifulSoup object
# 2. Find first h1 tag
# 3. Find next h1 tag
# 4. Print its text

#### 3. **Create simple DataFrame**

Create a pandas DataFrame with 3 columns: 'Name', 'Age', and 'City' containing data for 2 people and save it to data/people.csv


In [None]:
# Your code here to:
# 1. Create dictionary with data
# 2. Convert to DataFrame
# 3. Display DataFrame
# 4. Save DataFrame to CSV file

#### 4. **Create the QA dataframe**

Based on the scraped data from before, create a pandas DataFrame with 2 columns: 'Question' and 'Answer' containing the questions and answers you extracted from the webpage.

In [None]:
# Your code here to:
# 1. Create DataFrame from lists
# 2. Display the second row of the DataFrame
# 3. Save DataFrame to CSV file at location data/qa.csv

#### 5. **Data statistics**

Return the total number of questions and answers in the DataFrame, the average answer length, the longest answer, and the shortest question.



In [None]:
# Your code here to:
# 1. Count total Q&As
# 2. Calculate average answer length
# 3. Find longest answer and shortest question
# 4. Display all statistics
# hints, use len() to count the total Q&As
# use apply() to calculate the average answer length
# use idxmax() to find the longest answer
# use idxmin() to find the shortest question
# use print to display the statistics