<h1>Data Cookbook: Fundamentals of Web Scraping &amp; Data Wrangling</h1>
<p>
Web scraping is an important part of data collection. A lot of the time, useful data won't be readily available to download. With web scraping, we can automate the process of gathering data from multiple web pages, saving us time and effort.

By scraping websites, we can access a wide range of data, including text, images, tables, and more. This data can be used for various purposes, such as market research, sentiment analysis, price comparison, content aggregation, and data analysis.
</p>

A basic website is set up for the demo.<br>
Run `python app.py` to start the server before proceeding.

<h2>Basic web requests</h2>

<p>The libraries <b>requests</b> and <b>BeautifulSoup</b> are commonly used in Python for web scraping and web content retrieval.

<b>requests</b>: A library for sending HTTP requests to web pages and web services. With requests, you can retrieve HTML content, make GET and POST requests, handle cookies and sessions, and interact with web APIs.

<b>BeautifulSoup</b>: A library for parsing and navigating HTML and XML documents. It lets you extract specific data from a page by traversing the document's structure: you can search for tags, access tag attributes, and extract text.

Together, these libraries cover the two halves of a scraping task: requests fetches the content, and BeautifulSoup extracts the data of interest from it.</p>

In [33]:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

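This is the most basic usage of requests. `requests.get()` sends a GET request to the given URL and returns a response object. Its `text` attribute holds the body of the requested page (here, the home page's HTML), not the content of the whole site.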

In [34]:
url = "http://127.0.0.1:5000/"
response = requests.get(url)

print(response.text)

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Document</title>
</head>
<body>
    <h3>click link to see tables</h3>
    <a href="/table1">table 1</a>
    <a href="/table2">table 2</a>
    <a href="/">home</a>
    <h3>API guide</h3>
    <p>access subdomain 'table-data' and 'table2-data' to retrieve data directly</p>
    <p>stock image</p>
    <img alt="stock" src="/static/stock.png"></img>
</body>
</html>
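
The HTML printed above can be parsed with BeautifulSoup rather than read by eye. The following is a minimal sketch: it assumes the page body looks like the output above (a trimmed copy is hard-coded as `html` so the snippet runs without the demo server) and uses Python's built-in `html.parser` backend. The variable names are illustrative.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Trimmed copy of the home page shown above; with the demo server running,
# this could instead be `requests.get(base_url).text`.
base_url = "http://127.0.0.1:5000/"
html = """
<body>
    <h3>click link to see tables</h3>
    <a href="/table1">table 1</a>
    <a href="/table2">table 2</a>
    <a href="/">home</a>
    <img alt="stock" src="/static/stock.png"></img>
</body>
"""

soup = BeautifulSoup(html, "html.parser")

# Map each link's visible text to its absolute URL.
links = {
    a.get_text(strip=True): urljoin(base_url, a["href"])
    for a in soup.find_all("a")
}

# Collect the absolute URL of every image on the page.
image_urls = [urljoin(base_url, img["src"]) for img in soup.find_all("img")]

print(links)
print(image_urls)
```

`find_all` returns every matching tag, and indexing a tag like a dictionary (`a["href"]`) reads one of its attributes. `urljoin` turns the site-relative paths into full URLs that can be passed to `requests.get()`.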


Here we see an image on the page: `<img alt="stock" src="/static/stock.png"></img>`. Retrieving an image is a bit different, since its binary content can't be printed meaningfully. To keep the image, write the response's raw bytes (`response.content`) to a file.

In [35]:
# Send a GET request to the image URL
# (url already ends with "/", so the path is appended without a leading slash)
response = requests.get(url + 'static/stock.png')

# Check if the request was successful
if response.status_code == 200:
    # Save the image locally
    with open("saved_image.png", "wb") as file:
        file.write(response.content)
        print("Image saved successfully.")
else:
    print("Failed to retrieve the image.")


Image saved successfully.
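
The status check above covers HTTP errors, but a request can also fail before any response arrives, for example when the demo server is not running. A hedged sketch of a slightly more defensive download helper follows. `download_file` is a hypothetical name, not part of requests, and the 5-second timeout is an arbitrary choice.

```python
from urllib.parse import urljoin

import requests


def download_file(url, path, timeout=5):
    """Save the body of `url` to `path`; return True on success, False otherwise."""
    try:
        response = requests.get(url, timeout=timeout)
        # Raise requests.HTTPError for 4xx/5xx responses.
        response.raise_for_status()
    except requests.RequestException as exc:
        # Covers invalid URLs, connection errors, timeouts, and HTTP error statuses.
        print(f"Failed to retrieve {url}: {exc}")
        return False
    with open(path, "wb") as file:
        file.write(response.content)
    return True


base_url = "http://127.0.0.1:5000/"
# urljoin avoids a doubled slash when joining the base URL and the image path.
image_url = urljoin(base_url, "/static/stock.png")
print(image_url)
```

With the demo server running, `download_file(image_url, "saved_image.png")` would save the same image as the cell above.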


<h2>Using APIs</h2>
While some sites provide no way to retrieve data other than scraping their pages, others offer built-in methods for fetching data directly (along with various automation functions) in the form of an API. APIs are used in a wide range of applications, from web and mobile app development to cloud services integration, IoT (Internet of Things), and more. They let developers build on existing services and functionality, which reduces development time and effort and makes it easier for different pieces of software to work together.

The provided sample site has a simple built-in API that returns the data tables as JSON. If you navigate directly to the API links, you'll see the raw JSON. This is much easier to process than extracting the tables from HTML.

In [39]:
# URLs for the APIs
url_table_data = url+'table-data'
url_table2_data = url+'table2-data'

# Making GET requests to the APIs
response_table_data = requests.get(url_table_data)
response_table2_data = requests.get(url_table2_data)

# Assuming the APIs return JSON data, parse it into Python objects (dicts/lists)
table_data = response_table_data.json()
table2_data = response_table2_data.json()
# Turn the JSON output into a Pandas DataFrame for further analysis
table_dp = pd.DataFrame(table_data)
table2_dp = pd.DataFrame(table2_data)
# A notebook cell only displays its last expression, so only table2_dp.head() is shown
table_dp.head()
table2_dp.head()


Unnamed: 0,Name,Email,Phone Number,Department,Manager
0,John Doe,john.doe@example.com,123-456-7890,Engineering,Jane Wilson
1,Jane Smith,jane.smith@example.com,234-567-8901,Design,Robert Black
2,Mike Johnson,mike.johnson@example.com,345-678-9012,Marketing,Emily Green
3,Sarah Williams,sarah.williams@example.com,456-789-0123,Data Science,Michael Brown
4,David Brown,david.brown@example.com,567-890-1234,Product,Lisa White
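
The `Unnamed: 0` column in the output above looks like a leftover row index (pandas typically creates a column with that name when a CSV that was saved with its index is read back), though the source of the demo data isn't shown here. Assuming that is the case, and assuming the API returns a list of records, a small wrangling step could drop it. The two records below are copied from the output above so the sketch runs without the server.

```python
import pandas as pd

# Two records copied from the output above; the list-of-records shape is an
# assumption about what the demo API returns.
records = [
    {"Unnamed: 0": 0, "Name": "John Doe", "Email": "john.doe@example.com",
     "Phone Number": "123-456-7890", "Department": "Engineering",
     "Manager": "Jane Wilson"},
    {"Unnamed: 0": 1, "Name": "Jane Smith", "Email": "jane.smith@example.com",
     "Phone Number": "234-567-8901", "Department": "Design",
     "Manager": "Robert Black"},
]

df = pd.DataFrame(records)

# Drop the redundant index column and switch to snake_case column names.
clean = df.drop(columns=["Unnamed: 0"])
clean.columns = [name.lower().replace(" ", "_") for name in clean.columns]

print(clean.columns.tolist())
```

Normalising column names early makes later steps such as filtering or grouping less error-prone, since names like `phone_number` can be used without quoting spaces.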
