## Web Scraping Using Python


### Objective:

The objective of this project is to automate the extraction of data from websites. Web scraping can help transform unstructured web data into structured datasets that are easier to analyze and use.

### Import Libraries:

We import the 'requests' library to make HTTP requests and 'BeautifulSoup' (from the 'bs4' package) to parse HTML and XML documents. We also import 'pandas' to hold the extracted tables as DataFrames, 'create_engine' from 'sqlalchemy' to connect to the database, and 'logging' to record progress and errors.

BeautifulSoup converts an HTML document into a tree of Python objects that can be searched and navigated, which simplifies extracting data from complex pages.
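
As a minimal sketch of how the parsed tree can be navigated, the snippet below parses a small HTML string and reads values from it. The HTML, the 'title' id, and the 'teams' class are invented for illustration and do not come from the scraped site.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only to illustrate navigation of the parsed tree
html = """
<html><body>
  <h1 id="title">Teams</h1>
  <ul class="teams">
    <li>Boston Bruins</li>
    <li>Buffalo Sabres</li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

# find() returns the first matching element, find_all() returns every match
title = soup.find('h1', id='title').text.strip()
teams = [li.text.strip() for li in soup.find('ul', class_='teams').find_all('li')]

print(title, teams)
```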

In [26]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from sqlalchemy import create_engine
import logging

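### Set up logging:

The logging module in Python provides a flexible framework for emitting log messages from Python programs. The logging.basicConfig function configures the logging system. Here it writes messages of level INFO and above to the file 'scraping.log', each prefixed with a timestamp and the level name.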

In [27]:
# Setup logging
logging.basicConfig(filename='scraping.log', level=logging.INFO, 
                    format='%(asctime)s %(levelname)s:%(message)s')
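
To see what this format produces without writing to 'scraping.log', the sketch below attaches the same format string to a separate logger that writes to an in-memory buffer. The logger name 'scraping_demo' and the messages are assumptions made for the example.

```python
import io
import logging

# Write to an in-memory buffer instead of a file so the output is easy to inspect
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter('%(asctime)s %(levelname)s:%(message)s'))

demo_logger = logging.getLogger('scraping_demo')  # hypothetical logger name
demo_logger.setLevel(logging.INFO)
demo_logger.addHandler(handler)
demo_logger.propagate = False  # keep demo messages out of the root logger

demo_logger.debug('not recorded, DEBUG is below the INFO threshold')
demo_logger.info('Fetched page')
demo_logger.error('Error fetching URL')

log_lines = stream.getvalue().splitlines()
for line in log_lines:
    print(line)
```

Only the INFO and ERROR messages appear, each as a timestamp followed by the level name and the message.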

### Function to scrape data from a given URL
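**scrape_data** fetches the HTML content of a given URL and parses it.

The request is wrapped in a try/except block. response.raise_for_status() raises an HTTPError if the server returned an error status (4xx or 5xx). That error, like timeouts and connection failures, is a RequestException, so it is logged and the function returns None.

BeautifulSoup then extracts the data from each table with the CSS class 'table', and each table becomes a pandas DataFrame.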
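
The behaviour of raise_for_status() can be checked offline. The sketch below builds Response objects by hand rather than contacting a server. The helper name 'status_outcome' and the placeholder URL are assumptions for illustration.

```python
import requests

def status_outcome(status_code, reason):
    """Hypothetical helper: report how raise_for_status() treats a status code."""
    response = requests.Response()          # built by hand, no network access
    response.status_code = status_code
    response.reason = reason
    response.url = 'https://example.com/'   # placeholder URL
    try:
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        # HTTPError is a subclass of RequestException, so it is caught here
        return f"raised {type(e).__name__}"
    return "ok"

outcomes = {code: status_outcome(code, reason)
            for code, reason in [(200, 'OK'), (204, 'No Content'),
                                 (404, 'Not Found'), (500, 'Server Error')]}
print(outcomes)
```

Only the 4xx and 5xx codes raise. Other successful codes such as 204 pass the check just as 200 does.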

In [28]:

def scrape_data(url):
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        logging.error(f"Error fetching URL {url}: {e}")
        return None
    
    try:
        soup = BeautifulSoup(response.text, 'html.parser')
        tables = soup.find_all('table', class_='table')
        data_frames = []
        
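        for table in tables:
            rows = table.find_all('tr')
            if not rows:
                # Skip tables without rows; rows[0] would raise IndexError
                continue
            headers = [header.text.strip() for header in rows[0].find_all('th')]
            data = []
            for row in rows[1:]:
                cols = row.find_all('td')
                if cols:  # ignore rows that contain no data cells
                    data.append([col.text.strip() for col in cols])
            df = pd.DataFrame(data, columns=headers)
            data_frames.append(df)

        return data_frames
    except (AttributeError, ValueError) as e:
        # ValueError covers rows whose cell count does not match the headers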
        logging.error(f"Error parsing HTML from {url}: {e}")
        return None
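
The parsing step can be tried without a network request. The sketch below applies the same table-to-DataFrame logic to an inline HTML string. The helper name 'tables_to_frames' and the sample HTML are assumptions for illustration, and the second, empty table is there to show why a table without rows needs to be skipped.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up HTML: one populated table and one empty table
sample_html = """
<table class="table">
  <tr><th>Team Name</th><th>Year</th></tr>
  <tr><td> Boston Bruins </td><td>1990</td></tr>
  <tr><td>Buffalo Sabres</td><td>1990</td></tr>
</table>
<table class="table"></table>
"""

def tables_to_frames(html):
    """Hypothetical helper: the parsing steps of scrape_data without the HTTP request."""
    soup = BeautifulSoup(html, 'html.parser')
    frames = []
    for table in soup.find_all('table', class_='table'):
        rows = table.find_all('tr')
        if not rows:
            continue  # an empty table has no header row to read
        headers = [th.text.strip() for th in rows[0].find_all('th')]
        data = [[td.text.strip() for td in row.find_all('td')] for row in rows[1:]]
        frames.append(pd.DataFrame(data, columns=headers))
    return frames

frames = tables_to_frames(sample_html)
print(frames[0])
```

All values are kept as strings, so numeric columns such as 'Year' would still need converting before analysis.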

### Function to save DataFrame to PostgreSQL:
The **save_to_postgresql** function saves a DataFrame to a PostgreSQL database.

The **create_engine** function creates an Engine object, which serves as the primary interface to the database. The Engine manages connections to the database and provides a high-level interface for executing SQL statements. Because to_sql is called with if_exists='replace', an existing table with the same name is dropped and recreated on each call.

In [29]:

def save_to_postgresql(df, table_name, username, password, host, dbname):
    try:
        engine = create_engine(f'postgresql+psycopg2://{username}:{password}@{host}/{dbname}')
        df.to_sql(table_name, engine, if_exists='replace', index=False)
        logging.info(f"Data saved to PostgreSQL table {table_name} successfully")
    except Exception as e:
        logging.error(f"Error saving data to PostgreSQL table {table_name}: {e}")
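
The connection string above interpolates the password directly, which can break if the password contains characters such as '@' or '/'. As a hedged alternative, assuming SQLAlchemy 1.4 or later, the URL can be built with URL.create, which escapes those characters. The password below is made up for the example.

```python
from sqlalchemy.engine import URL

url = URL.create(
    drivername='postgresql+psycopg2',
    username='postgres',
    password='p@ss/word',   # made-up password with characters that need escaping
    host='localhost',
    database='web_scraping',
)

# Rendering with hide_password=True is suitable for log messages
safe_for_logs = url.render_as_string(hide_password=True)
# The full rendering percent-encodes the special characters in the password
full_url = url.render_as_string(hide_password=False)

print(safe_for_logs)
```

The resulting URL object can be passed to create_engine in place of the formatted string.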

### Fetch the URLs from the Website

Website links: https://www.scrapethissite.com/pages/ajax-javascript/#2015

https://www.scrapethissite.com/pages/forms/

https://www.scrapethissite.com/pages/advanced/

Note that the '#2015' fragment in the first link is handled by the browser and is not sent to the server. That page also appears to fill its table with JavaScript, so a plain HTTP request may return no data rows for it.

The main function stores the database connection details and the base table name. It then scrapes each URL and saves every DataFrame it gets back to its own table in the PostgreSQL database.

In [30]:
def main():
    urls = [
        "https://www.scrapethissite.com/pages/ajax-javascript/#2015",
        "https://www.scrapethissite.com/pages/forms/",
        "https://www.scrapethissite.com/pages/advanced/"
    ]
    
    username = 'postgres'
    password = 'sql123'
    host = 'localhost'
    dbname = 'web_scraping'
    base_table_name = 'scraped_data'

    for url_idx, url in enumerate(urls):
        data_frames = scrape_data(url)
        if data_frames:
            for idx, df in enumerate(data_frames):
                # Include the URL index so later pages do not overwrite earlier tables
                table_name = f"{base_table_name}_{url_idx}_{idx}"
                save_to_postgresql(df, table_name, username, password, host, dbname)

if __name__ == '__main__':
    main()

The scraped data is saved to the PostgreSQL database 'web_scraping'.
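
Because each table is written with if_exists='replace', tables from different pages need distinct names. One possible approach, sketched below, derives the name from the last path segment of the URL. The helper 'table_name_for' and its naming scheme are assumptions, not part of the original code.

```python
import re
from urllib.parse import urlparse

def table_name_for(url, table_idx, base='scraped_data'):
    """Hypothetical helper: build a table name from the last URL path segment."""
    path = urlparse(url).path.strip('/')      # the '#2015' fragment is not part of the path
    segment = path.split('/')[-1] if path else 'root'
    # Keep only lowercase letters and digits, joined by underscores
    slug = re.sub(r'[^a-z0-9]+', '_', segment.lower()).strip('_')
    return f"{base}_{slug}_{table_idx}"

urls = [
    "https://www.scrapethissite.com/pages/ajax-javascript/#2015",
    "https://www.scrapethissite.com/pages/forms/",
    "https://www.scrapethissite.com/pages/advanced/",
]
names = [table_name_for(u, 0) for u in urls]
print(names)
```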
