# Introduction to Web Scraping

Welcome to the web scraping workshop. In this workshop, you will learn the fundamental concepts and techniques used to extract valuable insights from websites and large data structures. Data mining is a crucial skill in today's data-driven world: it lets you uncover patterns, trends, and relationships within data that can inform decision-making and drive business success.

## Course Objectives

By the end of this course, you will be able to:

- Understand the basic principles of data mining and its importance.
- Preprocess and clean data to prepare it for analysis.
- Apply various data mining techniques such as classification, clustering, and association rule mining.
- Evaluate the performance of data mining models.
- Use popular data mining tools and software.

## Course Outline

1. **Introduction to Data Mining**

    - Definition and significance
    - Applications of data mining
    - Data mining process

2. **Web Scraping**

    - Introduction to web scraping
    - Web scraping tools
    - Web scraping with Python
  
3. **Data Preprocessing**

    - Data cleaning
    - Data transformation
    - Data reduction
    - Data discretization

4. **Data Formatting**

    - Data types
    - Data encoding
    - Data normalization
    - Data standardization

Run the cell below to install the libraries needed for this workshop.

In [1]:
%pip install pandas requests beautifulsoup4
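
As a quick check that the installation worked, the sketch below imports the three libraries and prints their versions. The `versions` dictionary is only an illustrative name; the version numbers you see depend on your environment.

```python
# Import each library under the name used later in this workshop
import bs4
import pandas as pd
import requests

# Collect the installed version of each library
versions = {
    "pandas": pd.__version__,
    "requests": requests.__version__,
    "beautifulsoup4": bs4.__version__,
}

for name, version in versions.items():
    print(f"{name}: {version}")
```

If any of the imports fails with `ModuleNotFoundError`, rerun the install cell and restart the kernel.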




## Introduction to Python Data Mining tools

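Here is an overview of the Python libraries that we will be using in this workshop:

- **Pandas**: A powerful data manipulation and analysis library that provides data structures like DataFrames, which are ideal for handling structured data. Think of it as Excel for Python, where each column is a Series and each row is a record. We will use it to manipulate and clean our data.
- **requests**: A library for sending HTTP requests from Python. We will use it to download web pages.
- **Beautiful Soup** (`beautifulsoup4`): A library for parsing HTML. We will use it to extract headings, links, and paragraphs from the pages we download.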

## Introduction to requests

The requests library allows you to send HTTP requests using Python. It abstracts the complexities of making requests behind a simple API, making it easy to interact with web services and APIs.

### Making a basic GET request

A GET request is used to retrieve data from a server. Here is an example of what it looks like:

In [None]:
import requests

response = requests.get('https://teamacademy.nl/')
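
Real requests can fail or hang, so it may help to wrap the call in a small function. The sketch below is one possible approach, not the only one: `fetch_html` is a hypothetical helper name, and the 10 second timeout is an arbitrary assumption.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the request fails.

    fetch_html is a hypothetical helper for illustration only.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # Raise requests.HTTPError for 4xx and 5xx status codes
        response.raise_for_status()
    except requests.RequestException as error:
        print(f"Request failed: {error}")
        return None
    return response.text


# A URL without a scheme is rejected before any network traffic is sent
result = fetch_html("not-a-url")
print(result)
```

Returning `None` on failure keeps the calling code simple: it can check the result once instead of handling several exception types.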


In [None]:

import requests
from bs4 import BeautifulSoup

# Step 1: Send a GET request to the website
url = 'http://example.com'  # Replace with the URL of the website you want to scrape
response = requests.get(url)

# Step 2: Check the status code of the response
if response.status_code == 200:
    print("Successfully fetched the webpage")
else:
    print(f"Failed to fetch the webpage. Status code: {response.status_code}")

# Step 3: Parse the HTML content of the webpage
soup = BeautifulSoup(response.content, 'html.parser')

# Step 4: Extract specific data from the HTML
# For example, let's extract all the headings (h1, h2, h3, etc.) from the webpage
headings = soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])

# Step 5: Print the extracted headings
for heading in headings:
    print(heading.text.strip())

# Step 6: Extract links from the webpage
links = soup.find_all('a')

# Step 7: Print the extracted links
for link in links:
    href = link.get('href')
    text = link.text.strip()
    print(f"Link text: {text}, URL: {href}")

# Step 8: Extract paragraphs from the webpage
paragraphs = soup.find_all('p')

# Step 9: Print the extracted paragraphs
for paragraph in paragraphs:
    print(paragraph.text.strip())
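
To practise the parsing steps without depending on a live website, the sketch below runs Beautiful Soup on an inline HTML string and collects the links into a DataFrame. The HTML snippet and the names `rows` and `links_df` are made up for illustration; in the cell above, `response.content` would take the place of `html`.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical HTML snippet standing in for response.content
html = """
<html><body>
  <h1>Sample page</h1>
  <p>First paragraph.</p>
  <a href="/about">About</a>
  <a href="https://example.com/contact">Contact</a>
  <a>No link target</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for link in soup.find_all("a"):
    href = link.get("href")
    if href is None:
        # Skip anchors that have no href attribute
        continue
    rows.append({"text": link.get_text(strip=True), "url": href})

# One row per link, ready for cleaning with Pandas
links_df = pd.DataFrame(rows)
print(links_df)
```

Collecting the results as a list of dictionaries first makes it easy to turn scraped data into a DataFrame in one step.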

## Demonstration on How to Use Pandas

Pandas is a powerful data manipulation and analysis library in Python. It provides data structures like DataFrames, which are ideal for handling structured data. Below is a demonstration of some basic operations you can perform using Pandas, starting with creating a DataFrame.

### 1.1 Creating a DataFrame 

A DataFrame is a two-dimensional data structure that can store a different type of data in each column. You can think of it as a spreadsheet in Excel.

In [2]:
import pandas as pd

data = {  # We can use a dictionary to create a DataFrame
    'Name': ['Alice', 'Bob', 'Charlie'],  # Values for the first column
    'Age': [25, 30, 35],  # Values for the second column
    'Gender': ['F', 'M', 'M'],  # Values for the third column
}
df = pd.DataFrame(data)  # Create a DataFrame from the dictionary
df  # Display the DataFrame

      Name  Age Gender
0    Alice   25      F
1      Bob   30      M
2  Charlie   35      M
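
To make the "columns are Series, rows are records" idea concrete, here is a small sketch that inspects a DataFrame like the one above. The variable names `people`, `ages`, and `first_row` are chosen for illustration only.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "Gender": ["F", "M", "M"],
})

# Selecting one column returns a Series
ages = people["Age"]
print(type(ages).__name__)

# Selecting a row by position also returns a Series, indexed by column name
first_row = people.iloc[0]
print(first_row["Name"])

# shape is (rows, columns); dtypes lists the type of each column
print(people.shape)
print(people.dtypes)
```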


### 1.2 Save to file

We can then save this to a file using the following code:

In [3]:
df.to_csv(path_or_buf='output.csv', index=False)
# index=False leaves the row index (0, 1, 2) out of the file,
# since we don't need to save it
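
To see what `index=False` changes, the sketch below calls `to_csv` without a path, which returns the CSV text as a string instead of writing a file. The small `people` DataFrame is a stand-in for illustration.

```python
import pandas as pd

people = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

# With no path given, to_csv returns the CSV text instead of writing a file
with_index = people.to_csv()
without_index = people.to_csv(index=False)

# Compare the header lines of the two versions
print(with_index.splitlines()[0])
print(without_index.splitlines()[0])
```

With the index included, the header line starts with an empty column label, and that extra column comes back as `Unnamed: 0` when the file is read again.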

### 1.3 Reading Data from a File

Pandas can read data from various file formats, such as CSV, Excel, and SQL databases. Here is an example of reading data from a CSV file using the one we created above:

In [4]:
new_df = pd.read_csv('output.csv')  # Read the data back from the file

# equals returns True if the two DataFrames have the same shape, values,
# and column types. Only the last expression in a cell is displayed, so
# wrap this call in print() if you want to see its result.
df.equals(other=new_df)

new_df

      Name  Age Gender
0    Alice   25      F
1      Bob   30      M
2  Charlie   35      M
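
As a sketch of how the round trip can be checked more explicitly, the example below writes to an in-memory buffer instead of a file and compares the result in two ways. The names `people`, `buffer`, and `round_trip` are illustrative; `io.StringIO` simply stands in for `output.csv`.

```python
import io

import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "Gender": ["F", "M", "M"],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
people.to_csv(buffer, index=False)
buffer.seek(0)  # Rewind so read_csv starts at the beginning

round_trip = pd.read_csv(buffer)

# equals gives a single True/False answer
print(people.equals(round_trip))

# assert_frame_equal raises an AssertionError describing the first mismatch
pd.testing.assert_frame_equal(people, round_trip)
print("Round trip preserved the data")
```

`equals` is handy for a quick check, while `assert_frame_equal` is more useful when the comparison fails and you need to know where the two DataFrames differ.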


### 1.4 Data Manipulation

You can perform various data manipulation operations using Pandas. Here are examples of some common operations:

In [5]:
# Filtering rows based on a condition
filtered_df = df[df['Age'] > 25]
print("Filtered DataFrame (Age > 25):")
print(filtered_df)

# Adding a new column
df['Country'] = ['USA', 'Canada', 'UK']
print("\nDataFrame with new column 'Country':")
print(df)

# Dropping a column
df_dropped = df.drop(columns=['Gender'])
print("\nDataFrame after dropping 'Gender' column:")
print(df_dropped)

# Renaming columns
df_renamed = df.rename(columns={'Name': 'Full Name', 'Age': 'Years'})
print("\nDataFrame with renamed columns:")
print(df_renamed)

# Sorting the DataFrame by a column
df_sorted = df.sort_values(by='Age', ascending=False)
print("\nDataFrame sorted by 'Age' in descending order:")
print(df_sorted)

Filtered DataFrame (Age > 25):
      Name  Age Gender
1      Bob   30      M
2  Charlie   35      M

DataFrame with new column 'Country':
      Name  Age Gender Country
0    Alice   25      F     USA
1      Bob   30      M  Canada
2  Charlie   35      M      UK

DataFrame after dropping 'Gender' column:
      Name  Age Country
0    Alice   25     USA
1      Bob   30  Canada
2  Charlie   35      UK

DataFrame with renamed columns:
  Full Name  Years Gender Country
0     Alice     25      F     USA
1       Bob     30      M  Canada
2   Charlie     35      M      UK

DataFrame sorted by 'Age' in descending order:
      Name  Age Gender Country
2  Charlie   35      M      UK
1      Bob   30      M  Canada
0    Alice   25      F     USA
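
Note that `drop`, `rename`, and `sort_values` return new DataFrames and leave `df` unchanged, which is why the sorted output above still has the original column names. Because of this, operations can be chained. The sketch below combines two filter conditions with `&` and chains a rename and a sort; `people` and `result` are illustrative names.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "Gender": ["F", "M", "M"],
})

result = (
    # Each condition needs its own parentheses when combined with &
    people[(people["Age"] > 25) & (people["Gender"] == "M")]
    .rename(columns={"Name": "Full Name"})
    .sort_values(by="Age", ascending=False)
    .reset_index(drop=True)  # Renumber the rows from 0
)
print(result)

# The original DataFrame still has its original column names
print(people.columns.tolist())
```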


### 1.5 Data Cleaning

Data cleaning is an essential step in data preprocessing. It involves identifying and correcting errors in the data to improve its quality and reliability. Which cleaning operations you need depends on the data you are working with. Here are some common ones, along with when to use them:

In [6]:
# Handling missing values
df_cleaned = df.copy()

# Adding some dummy data with missing values.
# .loc accesses rows and columns by label; assigning to a label
# that does not exist yet (3 and 4 here) appends a new row.
df_cleaned.loc[3] = ['David', None, 'M', 'Australia']
df_cleaned.loc[4] = [None, 28, None, 'India']

print("DataFrame with missing values:")
print(df_cleaned)

# Dropping rows with missing values
df_dropped_na = df_cleaned.dropna()
print("\nDataFrame after dropping rows with missing values:")
print(df_dropped_na)


DataFrame with missing values:
      Name   Age Gender    Country
0    Alice    25      F        USA
1      Bob    30      M     Canada
2  Charlie    35      M         UK
3    David  None      M  Australia
4     None    28   None      India

DataFrame after dropping rows with missing values:
      Name Age Gender Country
0    Alice  25      F     USA
1      Bob  30      M  Canada
2  Charlie  35      M      UK
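
Before dropping anything, it can help to measure how much data is actually missing. The sketch below counts missing values per column, computes the share of rows that `dropna()` would remove, and shows the `subset` argument for dropping rows only when a specific column is empty. The `records` DataFrame mirrors the example above and all variable names are illustrative.

```python
import pandas as pd

records = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", None],
    "Age": [25, 30, 35, None, 28],
    "Gender": ["F", "M", "M", "M", None],
})

# Number of missing values in each column
missing_per_column = records.isna().sum()
print(missing_per_column)

# Share of rows with at least one missing value (what dropna() would remove)
missing_share = records.isna().any(axis=1).mean()
print(f"Rows affected: {missing_share:.0%}")

# Drop a row only if 'Name' is missing, keeping rows with other gaps
only_named = records.dropna(subset=["Name"])
print(only_named)
```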


Dropping rows works best when missing data makes up a small portion of your dataset (<5%) or when the missing values would significantly bias your analysis. In the example above, it removes two of the five rows, so a better approach is to fill the missing values with either "Unknown" or the mean of the column.

In [None]:
# Fill the numeric columns with the mean value
df_cleaned['Age'] = df_cleaned['Age'].fillna(df_cleaned['Age'].mean())
print("\nDataFrame after filling missing values in 'Age' column with mean value:")
print(df_cleaned)

# Filling missing values with a default value
df_filled_na = df_cleaned.fillna(value='Unknown')
print("\nDataFrame after filling missing values with 'Unknown':")
df_filled_na


DataFrame after filling missing values in 'Age' column with mean value:
      Name   Age Gender    Country
0    Alice  25.0      F        USA
1      Bob  30.0      M     Canada
2  Charlie  35.0      M         UK
3    David  29.5      M  Australia
4     None  28.0   None      India

DataFrame after filling missing values with 'Unknown':


      Name   Age   Gender    Country
0    Alice  25.0        F        USA
1      Bob  30.0        M     Canada
2  Charlie  35.0        M         UK
3    David  29.5        M  Australia
4  Unknown  28.0  Unknown      India


This approach is better because it does not remove any rows from the dataset, so later analysis can use every record. Keep in mind that a filled value is an estimate rather than an observation.
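
One possible refinement, sketched below under the assumption that you want to remember which values were estimated, is to record a flag column before filling and to pass `fillna` a dictionary so each column gets its own fill value. The column name `Age_was_missing` and the `records` data are hypothetical.

```python
import pandas as pd

records = pd.DataFrame({
    "Name": ["Alice", "Bob", None],
    "Age": [25.0, None, 35.0],
})

# Remember which ages were missing before they are filled in
records["Age_was_missing"] = records["Age"].isna()

# A dictionary maps each column name to its own fill value
filled = records.fillna({
    "Age": records["Age"].mean(),  # mean() ignores missing values
    "Name": "Unknown",
})
print(filled)
```

The flag column lets later steps treat filled values differently, for example by excluding them from a chart.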