# Automated Document Scraper for Multiple Formats

### Project Overview
In this notebook, we build a document scraper that searches the web for documents on a given subject area, downloads them locally, and organizes them by file type. It covers five formats (PDF, DOCX, TXT, CSV, and HTML), which makes it practical to collect a large number of documents without manual searching.

### Objective
The goal of this project is to create a tool that:
1. **Takes a user-defined subject area** as input.
2. **Searches the web** for documents related to the subject area, focusing on one file format at a time.
3. **Downloads and organizes** the documents locally in a structured manner.
4. **Processes** the content of each file, converting it to plain text where needed (PDF, DOCX, and HTML files).

This notebook will guide you through the steps involved in setting up the scraper, executing it, and managing the downloaded files effectively.
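
To make the "one file format at a time" idea concrete, here is a minimal sketch of the per-format queries. It mirrors the query pattern used by the search function later in this notebook; the helper name `build_queries` is hypothetical and only for illustration.

```python
def build_queries(subject, formats):
    """Return one search query per file format.

    The pattern "<subject> <format> filetype:<format>" matches the query
    string built in the notebook's search function.
    """
    return {fmt: f"{subject} {fmt} filetype:{fmt}" for fmt in formats}


# Example subject taken from the sample run shown in this notebook
for fmt, query in build_queries("Geological Engineering", ["pdf", "docx"]).items():
    print(f"{fmt}: {query}")
```

The `filetype:` operator asks the search engine to restrict results to that file extension, so each format gets its own query.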

### Steps to Follow
1. **Input Subject Area**: You will be prompted to enter the topic or keyword related to the documents you wish to scrape.
2. **Format-Specific Search**: The program searches Google for files of each specified format (PDF, DOCX, TXT, CSV, and HTML) relevant to the subject area.
3. **File Download and Organization**: Files are downloaded into one subfolder per format, and their content is extracted where applicable (e.g., converting PDFs and DOCX files to plain text).
4. **Error Handling**: Network errors, inaccessible URLs, and files that cannot be parsed are caught and reported, so a single bad URL does not stop the run.
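
One failure mode worth guarding against is a URL that returns something other than the expected format, such as an HTML error page saved under a `.pdf` name. The sketch below checks the leading bytes of a download against known file signatures. It is not part of the notebook's code; `SIGNATURES` and `looks_like` are hypothetical names, and the check is only a plausibility test, not full validation.

```python
# Leading byte signatures for the binary formats handled in this notebook.
# PDF files begin with "%PDF-"; DOCX files are ZIP archives, which begin
# with the local file header "PK\x03\x04".
SIGNATURES = {
    "pdf": b"%PDF-",
    "docx": b"PK\x03\x04",
}


def looks_like(file_format, data):
    """Return True if `data` plausibly matches `file_format`.

    Text formats (txt, csv, html) have no fixed signature, so they always pass.
    """
    signature = SIGNATURES.get(file_format)
    return signature is None or data.startswith(signature)


print(looks_like("pdf", b"%PDF-1.7 ..."))            # True
print(looks_like("pdf", b"<html>Not found</html>"))  # False
```

A check like this could run on the downloaded bytes before the file is handed to the PDF or DOCX parser.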

### Prerequisites
The following Python packages are used in this notebook:
- `requests`: for downloading files.
- `googlesearch-python`: to search Google for relevant document URLs.
- `pdfplumber`: for extracting text from PDF files.
- `python-docx`: for extracting text from DOCX files.
- `pandas`: for handling CSV data.
- `beautifulsoup4`: for parsing HTML content.

Please ensure these packages are installed by running:
```python
!pip install requests googlesearch-python pdfplumber python-docx pandas beautifulsoup4
```
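
Several of these packages are imported under a different name than the one used with `pip` (for example, `python-docx` is imported as `docx`). The following sketch checks which of them are available before running the rest of the notebook. The helper `missing_packages` is a hypothetical name; the import names match the import cell below.

```python
import importlib.util

# pip distribution name -> importable module name
REQUIRED = {
    "requests": "requests",
    "googlesearch-python": "googlesearch",
    "pdfplumber": "pdfplumber",
    "python-docx": "docx",
    "pandas": "pandas",
    "beautifulsoup4": "bs4",
}


def missing_packages(required):
    """Return the pip names whose module cannot be found in this environment."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


missing = missing_packages(REQUIRED)
print("Missing packages:", ", ".join(missing) if missing else "none")
```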

### Important Notes
1. **Terms of Use**: Scraping Google search results and downloading documents must comply with the terms of service of the respective websites and Google itself.
2. **File Organization**: All files will be organized in a local directory named "downloaded_files" within the notebook’s working directory.
3. **Scalability**: The code requests up to 100 search results per file format and stops after 10 URLs; adjust both limits to suit the scale of your project.
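
As one practical step toward the first note, a site's `robots.txt` can be consulted before downloading from it. The sketch below uses the standard library's `urllib.robotparser` on robots.txt text supplied as a string. It is an illustration rather than part of the notebook's code: `is_allowed` is a hypothetical helper, the rules shown are made up, and passing this check does not by itself establish compliance with a site's terms of service.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Check `url` against the text of a robots.txt file for `user_agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, for illustration only
example_rules = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(example_rules, "https://example.com/docs/report.pdf"))     # True
print(is_allowed(example_rules, "https://example.com/private/report.pdf"))  # False
```

In live use, `RobotFileParser.set_url()` followed by `read()` fetches the rules from the site itself instead of a string.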

After completing the setup, run the code below to begin the search and download process.


In [8]:
import os
import requests
from googlesearch import search
import pdfplumber
from docx import Document
import pandas as pd
from bs4 import BeautifulSoup
import re


In [9]:
# Define the directory where files will be saved
save_dir = "downloaded_files"
os.makedirs(save_dir, exist_ok=True)

# Ask the user for the subject area
subject_area = input("Enter the subject area you want to search: ")


# Define the formats to search for and download
formats = ["pdf", "docx", "txt", "csv", "html"]
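
The processing cell below names each file after the last `/`-separated segment of its URL, so query strings and percent-encoding (such as `%20`) end up in the saved file name. Here is a minimal sketch of an alternative that keeps only the URL path and decodes it first. The helper `filename_from_url`, its `fallback` parameter, and the example URL are hypothetical; the character set replaced is the same one `sanitize_filename` uses.

```python
import posixpath
import re
from urllib.parse import unquote, urlparse


def filename_from_url(url, fallback="download"):
    """Derive a local file name from the path component of `url`."""
    path = urlparse(url).path                  # drops the query string and fragment
    name = unquote(posixpath.basename(path))   # last path segment, %XX escapes decoded
    name = re.sub(r'[<>:"/\\|?*]', "_", name)  # replace characters unsafe in file names
    return name or fallback


# Hypothetical URL, for illustration only
print(filename_from_url("https://example.com/files/Engineering%20Geology.pdf?ref=search"))
# Engineering Geology.pdf
```

The `fallback` value covers URLs that end in `/` and therefore have no final path segment.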


In [11]:
# Sanitize file names
def sanitize_filename(filename):
    return re.sub(r'[<>:"/\\|?*]', '_', filename)

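# Download a URL and save the response body to disk
def download_file(url, save_path):
    """Download url to save_path; return True on success, False otherwise."""
    try:
        # A timeout keeps one unresponsive server from stalling the whole run
        response = requests.get(url, timeout=30)
        if response.status_code == 200:
            with open(save_path, "wb") as f:
                f.write(response.content)
            print(f"Downloaded: {save_path}")
            return True
        print(f"Failed to download: {url} (HTTP {response.status_code})")
    except Exception as e:
        print(f"Error downloading {url}: {e}")
    return False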

# Function to extract and organize file content
def process_document(url, file_extension, format_dir):
    file_name = sanitize_filename(url.split('/')[-1])
    file_path = os.path.join(format_dir, file_name)  # Keep original extension

    # Download the file; skip processing if nothing was saved
    download_file(url, file_path)
    if not os.path.isfile(file_path):
        return

    # Process based on file format
    if file_extension == "pdf":
        try:
            with pdfplumber.open(file_path) as pdf:
                # Extract each page once; pages without extractable text are skipped
                page_texts = (page.extract_text() for page in pdf.pages)
                text = "\n".join(t for t in page_texts if t)
                with open(f"{file_path}.txt", "w", encoding="utf-8") as f:
                    f.write(text)
        except Exception as e:
            print(f"Error processing PDF {file_name}: {e}")

    elif file_extension == "docx":
        try:
            doc = Document(file_path)
            text = "\n".join([para.text for para in doc.paragraphs])
            with open(f"{file_path}.txt", "w", encoding="utf-8") as f:
                f.write(text)
        except Exception as e:
            print(f"Error processing DOCX {file_name}: {e}")

    elif file_extension == "csv":
        try:
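            # Round-trip through pandas: raises on unparseable files and
            # rewrites the CSV in a normalized form
            df = pd.read_csv(file_path)
            df.to_csv(file_path, index=False)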
        except Exception as e:
            print(f"Error processing CSV {file_name}: {e}")

    elif file_extension == "html":
        try:
            # Parse the copy already saved to disk instead of fetching the URL again
            with open(file_path, "rb") as f:
                soup = BeautifulSoup(f, "html.parser")
            with open(f"{file_path}.txt", "w", encoding="utf-8") as f:
                f.write(soup.get_text())
        except Exception as e:
            print(f"Error processing HTML {file_name}: {e}")

# Function to search and download up to 10 documents of a given format
def search_and_download_by_format(subject, file_format):
    query = f"{subject} {file_format} filetype:{file_format}"
    downloaded_count = 0
    
    # Create a subfolder for the current file format
    format_dir = os.path.join(save_dir, file_format)
    os.makedirs(format_dir, exist_ok=True)

    for url in search(query, num_results=100):  # Request up to 100 results; the loop stops after 10 URLs
        if 'robots.txt' in url:
            continue
        print(f"Found URL: {url}")
        process_document(url, file_format, format_dir)
        downloaded_count += 1
        if downloaded_count >= 10:
            break
    print(f"Completed downloading {downloaded_count} {file_format.upper()} files.")

# Iterate through formats, searching and downloading each one
for file_format in formats:
    print(f"\nSearching and downloading {file_format.upper()} files for: {subject_area}")
    search_and_download_by_format(subject_area, file_format)



Searching and downloading PDF files for: Geological Engineering
Found URL: https://vardhaman.org/wp-content/uploads/2021/03/ENGINEERING-GEOLOGY-1.pdf
Downloaded: downloaded_files\pdf\ENGINEERING-GEOLOGY-1.pdf
Found URL: https://edisciplinas.usp.br/pluginfile.php/5587989/mod_resource/content/2/A_Geology_for_Engineers_Seventh_Edition.pdf
Downloaded: downloaded_files\pdf\A_Geology_for_Engineers_Seventh_Edition.pdf
Found URL: https://geomuseu.ist.utl.pt/SEMINAR2012/Livros/EngenhariaGeologica.pdf
Downloaded: downloaded_files\pdf\EngenhariaGeologica.pdf
Found URL: https://hostnezt.com/cssfiles/geology/Engineering%20Geology%20-%20Principles%20and%20Practice.pdf
Downloaded: downloaded_files\pdf\Engineering%20Geology%20-%20Principles%20and%20Practice.pdf
Found URL: https://www.routledge.com/rsc/downloads/SW3524_Sample.pdf?srsltid=AfmBOop0zuU8HMRuFj1J7wzOPj7pmJPU0fnE8no_NU-k4lKR5CbiR9cT
Downloaded: downloaded_files\pdf\SW3524_Sample.pdf_srsltid=AfmBOop0zuU8HMRuFj1J7wzOPj7pmJPU0fnE8no_NU-k4lKR5C