# IMF Publication Data Scraping Script

## Description

This Python script scrapes publication data from the International Monetary Fund (IMF) website. It queries the IMF search API page by page, collects each publication's title, date, URL, and tags, follows each article page to download the attached file when one is linked, and writes the collected metadata to a JSON file. Transient network errors are retried automatically.

## Key Features

- **Session Management with Retry**
  - Uses a `requests.Session` with a retry strategy to manage transient errors such as timeouts or server errors, increasing the robustness of the scraping process.

- **Progress Tracking**
  - Uses `tqdm` to display a progress bar over the result pages being scraped.

- **Data Extraction**
  - Parses JSON responses from the IMF API to extract relevant publication metadata.
  - Scrapes individual article pages to locate and download publication files.

- **Data Storage**
  - Downloads publication files and saves them locally.
  - Stores metadata in a JSON format, allowing for organized storage and easy retrieval for future analysis.
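
The metadata-extraction step can be tried on its own, without network access. The sketch below is illustrative only: the sample payload is hypothetical and mirrors just the field names the script reads (`results`, `title`, `date`, `url`, `tags`), so the real API response may contain more fields or differ in shape. The helper name `extract_metadata` and its `site_root` parameter are also assumptions introduced here.

```python
# Hypothetical sample payload; only the key names come from the script.
sample_response = {
    "results": [
        {"title": "Example Report", "date": "2024-01-15",
         "url": "/en/Publications/example", "tags": ["Country Report"]},
        {"title": "Entry Without Tags", "date": "2024-02-01",
         "url": "/en/Publications/other"},
    ]
}


def extract_metadata(payload, site_root="https://www.imf.org"):
    """Return a list of metadata dicts, tolerating missing keys."""
    records = []
    for item in payload.get("results", []):
        records.append({
            "title": item.get("title", ""),
            "date": item.get("date", ""),
            "url": site_root + item.get("url", ""),
            "tags": item.get("tags", []),
        })
    return records


records = extract_metadata(sample_response)
print(records[0]["url"])
print(records[1]["tags"])
```

Using `.get` with defaults means a result that lacks a field produces an empty value instead of raising `KeyError` and stopping the run.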


In [None]:
import requests
import json
from bs4 import BeautifulSoup
import os
from tqdm import tqdm
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Base URL for the IMF API
base_url = "https://www.imf.org/api/imf/countrysearch/search"

# Initialize a list to store all results
all_results = []

# Create a directory named 'save' if it does not exist
save_directory = "save"
os.makedirs(save_directory, exist_ok=True)

# Create a session object and set up a retry strategy
session = requests.Session()
retry_strategy = Retry(
    total=5,  # Total number of retry attempts
    backoff_factor=1,  # Exponential backoff factor applied to the delay between retries
    status_forcelist=[429, 500, 502, 503, 504],  # HTTP status codes that trigger a retry
    allowed_methods=["HEAD", "GET", "OPTIONS"]  # Methods allowed to retry
)
adapter = HTTPAdapter(max_retries=retry_strategy)
session.mount("https://", adapter)
session.mount("http://", adapter)

# Number of pages to scrape in this run (371 available in total)
total_pages = 20

# Use tqdm to add a progress bar
for page in tqdm(range(1, total_pages + 1), desc="Scraping Pages", unit="page"):
    # Define parameters for the API request
    params = {
        "language": "en",
        "country": "0f4439dc-b751-4b90-9c3a-b9923544c3da",
        "selectedFilters": "Publications",
        "currentPage": page,
        "pageSize": 10,
        "excludeItemId": "89af01a4-70e0-498e-8edd-4d7c9946d752"
    }

    # Send a GET request to retrieve the response
    try:
        response = session.get(base_url, params=params, timeout=10)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(f"Error fetching page {page}: {e}")
        continue

    # Ensure the request was successful
    if response.status_code == 200:
        data = response.json()

        # Extract title, date, URL, and tags from each result
        for item in data['results']:
            title = item['title']
            date = item['date']
            url = item['url']
            tags = item['tags']

            # Initialize file name as an empty string
            file_name = ""

            # Download files from each article page
            article_url = f"https://www.imf.org{url}"  # Ensure URL is constructed correctly
            try:
                article_response = session.get(article_url, timeout=10)
                article_response.raise_for_status()
                soup = BeautifulSoup(article_response.content, 'html.parser')
                download_link = soup.select_one('.piwik_download')

                if download_link and download_link.get('href'):
                    download_url = f"https://www.imf.org{download_link.get('href')}"  # Ensure download URL is correct
                    file_name = download_url.split('/')[-1]
                    file_path = os.path.join(save_directory, file_name)

                    # Send GET request to download file
                    file_response = session.get(download_url, timeout=10)
                    file_response.raise_for_status()
                    # Save file locally
                    with open(file_path, 'wb') as f:
                        f.write(file_response.content)
                    print(f"Downloaded: {file_name}")
                else:
                    print(f"No download link found for article: {title}")
            except requests.exceptions.RequestException as e:
                print(f"Error fetching article {title}: {e}")

            # Append extracted information and file name to results list
            all_results.append({
                "title": title,
                "date": date,
                "url": f"https://www.imf.org{url}",
                "tags": tags,
                # Text file name: 'name.ashx' becomes 'name.txt'; an empty file_name yields 'txt'
                "filename": file_name[:-4] + "txt"
            })

    else:
        print(f"Failed to retrieve data for page {page}")

# Save all results to a JSON file
json_path = os.path.join(save_directory, 'imf_data8.json')  # Output files are named imf_data<number>.json
with open(json_path, 'w', encoding='utf-8') as f:
    json.dump(all_results, f, ensure_ascii=False, indent=4)

print(f"Total articles scraped: {len(all_results)}")
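
The scraper builds URLs by string concatenation and derives the text file name by slicing off the last four characters. The sketch below shows possible alternatives using only the standard library, and makes the behavior of the slicing rule explicit. All helper names and the sample file names and URLs are hypothetical. Note that the slicing rule yields `'txt'` for an empty name and `'indextxt'` for a name such as `index.pdf`, which may explain the values the later filtering scripts test for.

```python
import os
from urllib.parse import urljoin, urlparse

SITE_ROOT = "https://www.imf.org"  # same host the script uses


def absolute_url(href):
    """Resolve a relative or already-absolute href against the site root."""
    return urljoin(SITE_ROOT, href)


def file_name_from_url(url):
    """Take the last path segment, ignoring any query string or fragment."""
    return os.path.basename(urlparse(url).path)


def text_name_by_slicing(file_name):
    """The script's rule: drop the last four characters, append 'txt'."""
    return file_name[:-4] + "txt"


def text_name_by_splitext(file_name):
    """Alternative: swap the extension; return '' when there is no file."""
    if not file_name:
        return ""
    return os.path.splitext(file_name)[0] + ".txt"


# Hypothetical inputs for illustration.
print(absolute_url("/-/media/Files/report.ashx"))
print(file_name_from_url("https://example.org/a/b.pdf?x=1"))
print(text_name_by_slicing("report.ashx"), text_name_by_slicing("index.pdf"))
print(text_name_by_splitext("index.pdf"))
```

If the extension-swapping variant were adopted, the downstream checks against `'txt'` and `'indextxt'` would need to change accordingly.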


# PDF Text Extraction and Cleaning Script

## Description

This script processes the `.ashx` files downloaded into the `save` directory. It copies each file under a `.pdf` extension (the script assumes the downloads are PDF documents served with an `.ashx` extension), extracts the text with `PyPDF2`, cleans it, and saves the result as a separate text file. This automates a bulk extraction task that would be tedious to do by hand.

## Key Features

- **PDF Extraction and Cleaning**
  - The script extracts text from PDF files using `PyPDF2` and cleans the extracted text to remove unwanted characters and patterns. This ensures that the text is in a readable format.

- **File Handling**
  - Converts `.ashx` files to `.pdf` files by copying and renaming them. It then processes these PDF files, extracting and cleaning the text, and saves the results in a specified directory.

- **Directory Management**
  - Automatically creates directories for storing converted PDFs and cleaned text files if they do not exist, ensuring the script runs smoothly even if these directories are not pre-created.

- **Error Handling**
  - Includes error handling to manage exceptions that might occur during file reading, writing, or text extraction, providing informative messages for debugging purposes.
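
Because the conversion step only copies and renames files, a download that is not actually a PDF (an HTML error page, for example) would fail later during extraction. A minimal sketch of a pre-check, assuming a strict test on the `%PDF-` signature at the start of the file, is shown below. The helper name `looks_like_pdf` and the sample file contents are made up for illustration, and some valid PDFs carry a few bytes before the signature, which this strict check would reject.

```python
import os
import tempfile

PDF_MAGIC = b"%PDF-"  # signature that begins a PDF file


def looks_like_pdf(path):
    """Return True if the file begins with the PDF signature."""
    with open(path, "rb") as f:
        return f.read(len(PDF_MAGIC)) == PDF_MAGIC


# Demonstration with two throwaway files (contents are made up).
with tempfile.TemporaryDirectory() as tmp:
    real = os.path.join(tmp, "real.ashx")
    fake = os.path.join(tmp, "fake.ashx")
    with open(real, "wb") as f:
        f.write(b"%PDF-1.7\n% rest of the document")
    with open(fake, "wb") as f:
        f.write(b"<html><body>Not found</body></html>")
    real_is_pdf = looks_like_pdf(real)
    fake_is_pdf = looks_like_pdf(fake)

print(real_is_pdf, fake_is_pdf)
```

A check like this could run before the copy step so that non-PDF downloads are reported and skipped.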


In [None]:
import re
import PyPDF2
import os
import shutil


def extract_text_from_pdf(file_path):
    """
    Extracts text from a PDF file.

    :param file_path: The path to the PDF file.
    :return: Extracted text or None if an error occurs.
    """
    try:
        # Open the PDF file
        with open(file_path, 'rb') as file:
            # Create a PDF reader object
            reader = PyPDF2.PdfReader(file)
            # Initialize a variable to store extracted text
            text = ''
            # Loop through each page
            for page in reader.pages:
                # Extract text and append to the string; pages with no extractable text contribute nothing
                text += page.extract_text() or ''
        return text
    except Exception as e:
        print(f"Failed to process PDF file {file_path}: {e}")
        return None

def clean_text(text):
    """
    Cleans extracted text by removing unwanted characters and patterns.

    :param text: The raw extracted text.
    :return: Cleaned text.
    """
    # Remove unwanted characters or patterns
    text = re.sub(r'\x0c', '', text)  # Remove form feed characters
    text = re.sub(r'[\r\n]+', '\n', text)  # Normalize newlines
    text = re.sub(r'\s*-\s*\n', '', text)  # Rejoin words hyphenated across line breaks (must run before newlines are collapsed)
    text = re.sub(r'\s+', ' ', text)  # Collapse all whitespace, including newlines, to single spaces
    text = re.sub(r'_', '', text)  # Remove underscores
    text = re.sub('\ufffd', '', text)  # Remove the Unicode replacement character (U+FFFD)
    text = re.sub(r'\. ', '', text)  # Remove periods that are followed by a space
    text = re.sub(r'\x08', '', text)  # Remove backspace characters
    text = re.sub(r'\.{2}', '', text)  # Remove two consecutive periods
    text = re.sub(r'\-{2}', '', text)  # Remove two consecutive hyphens
    text = re.sub(r'[\x0c\x0e]', '', text)  # Remove form feed and \x0e characters
    # Remove known headers or footers
    text = re.sub(r'INTERNATIONAL MONETARY FUND', '', text, flags=re.IGNORECASE)
    text = re.sub(r'\bPage\s*\d+\b', '', text, flags=re.IGNORECASE)  # Remove "Page" numbering

    text = re.sub(r'[\x00-\x1F\x7F]', '', text)  # Remove ASCII non-printable characters
    return text.strip()

def save_text_to_file(text, output_path):
    """
    Saves cleaned text to a new text file.

    :param text: The cleaned text.
    :param output_path: Path where the text file will be saved.
    """
    with open(output_path, 'w', encoding='utf-8') as file:
        file.write(text)

def process_pdf(file_path, output_text_path):
    """
    Extracts and cleans text from a PDF, then saves it to a text file.

    :param file_path: Path to the PDF file.
    :param output_text_path: Path where the cleaned text will be saved.
    :return: The path to the saved text file, or None if extraction fails.
    """
    # Extract text from the PDF
    extracted_text = extract_text_from_pdf(file_path)
    if extracted_text is not None:
        # Clean the extracted text
        cleaned_text = clean_text(extracted_text)
        # Save the cleaned text to a file
        save_text_to_file(cleaned_text, output_text_path)
        return output_text_path
    else:
        return None

# Directory paths
input_directory = 'save'
pdf_directory = 'savepdf'
text_directory = 'saveText'

# Create target directories if they do not exist
os.makedirs(pdf_directory, exist_ok=True)
os.makedirs(text_directory, exist_ok=True)

# Iterate over all .ashx files in the save directory
for filename in os.listdir(input_directory):
    if filename.endswith('.ashx'):
        # Source file path
        source_file_path = os.path.join(input_directory, filename)
        # Force conversion to .pdf file path
        pdf_file_name = os.path.splitext(filename)[0] + '.pdf'
        pdf_file_path = os.path.join(pdf_directory, pdf_file_name)

        # Copy the file and rename it as .pdf
        shutil.copy(source_file_path, pdf_file_path)
        print(f"Converted to PDF file: {pdf_file_path}")

        # Output text file path
        text_file_name = os.path.splitext(pdf_file_name)[0] + '.txt'
        text_file_path = os.path.join(text_directory, text_file_name)

        # Process the PDF and save the cleaned text
        try:
            result = process_pdf(pdf_file_path, text_file_path)
            if result:
                print(f"Cleaned text saved to {text_file_path}")
            else:
                print(f"Failed to extract text from PDF file: {pdf_file_path}")
        except Exception as e:
            print(f"Error processing PDF file {pdf_file_path}: {e}")
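
The cleaning step applies several regular expressions in sequence, and their order affects the result. In particular, a pattern that rejoins words hyphenated across a line break can only match while the newline is still present. The sketch below illustrates this with the same two patterns on a made-up sample. The function names are hypothetical.

```python
import re

HYPHEN_BREAK = re.compile(r'\s*-\s*\n')  # hyphen at the end of a line
WHITESPACE = re.compile(r'\s+')          # any run of whitespace


def rejoin_then_collapse(text):
    """Rejoin hyphenated line breaks first, then collapse whitespace."""
    text = HYPHEN_BREAK.sub('', text)
    return WHITESPACE.sub(' ', text).strip()


def collapse_then_rejoin(text):
    """Collapse whitespace first; the hyphen pattern can no longer match."""
    text = WHITESPACE.sub(' ', text)
    return HYPHEN_BREAK.sub('', text).strip()


sample = "Infla-\ntion rose\nsharply."  # made-up sample text

print(rejoin_then_collapse(sample))
print(collapse_then_rejoin(sample))
```

One caveat: rejoining this way also fuses genuine compounds that happen to break at a hyphen, so `long-\nterm` would become `longterm`.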


# IMF Publication Metadata Processing Script

## Description

This script processes JSON files containing publication metadata from the International Monetary Fund (IMF) and segregates entries based on whether they have a valid filename field. The results are saved as both JSON and Excel files for easy access and analysis.

## Key Features

- **File Processing**
  - Uses `glob` to find all JSON files matching the pattern `imf_data*.json`, allowing it to process multiple files in one run.

- **Data Segregation**
  - **With Filename**: Entries with a non-empty filename field that is not just `'txt'`.
  - **Without Filename**: Entries missing the filename field or having it set to an empty string or `'txt'`.

- **Data Output**
  - The results are saved into both JSON and Excel formats, providing flexible options for further analysis or reporting:
    - **JSON Files**: `with_filename.json` and `without_filename.json` store the segregated data.
    - **Excel Files**: `with_filename.xlsx` and `without_filename.xlsx` provide a tabular view, which is especially useful for data analysis using spreadsheet software.

- **DataFrame Conversion**
  - By converting the lists to Pandas DataFrames, the script can easily export the data to Excel format, which is suitable for handling tabular data and conducting further analyses.

This script efficiently categorizes and exports IMF publication metadata, allowing users to easily handle and analyze the data in both JSON and Excel formats.
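
The segregation rule can be expressed as a small predicate and checked against each case it is meant to cover. This is a sketch only: the function name `has_valid_filename` and the sample entries are made up for illustration.

```python
def has_valid_filename(entry):
    """True when 'filename' is present, non-empty, and not the bare 'txt'."""
    name = entry.get("filename")
    return bool(name) and name != "txt"


# Made-up entries covering each case.
entries = [
    {"title": "A", "filename": "report.txt"},  # valid
    {"title": "B", "filename": "txt"},         # placeholder left by an empty name
    {"title": "C", "filename": ""},            # empty string
    {"title": "D"},                            # field missing
]

with_filename = [e for e in entries if has_valid_filename(e)]
without_filename = [e for e in entries if not has_valid_filename(e)]

print([e["title"] for e in with_filename])
print([e["title"] for e in without_filename])
```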


In [None]:
import json
import pandas as pd
import glob

with_filename = []
without_filename = []

# Iterate over all JSON files matching the pattern 'imf_data*.json'
for filename in glob.glob('imf_data*.json'):
    with open(filename, 'r', encoding='utf-8') as file:
        data = json.load(file)
        for entry in data:
            # Check if the entry has a 'filename' field that is not empty and not equal to 'txt'
            if 'filename' in entry and entry['filename'] and entry['filename'] != 'txt':
                with_filename.append(entry)
            else:
                without_filename.append(entry)

# Define output JSON filenames
json_with_filename = 'with_filename.json'
json_without_filename = 'without_filename.json'

# Save entries with valid filenames to a JSON file
with open(json_with_filename, 'w', encoding='utf-8') as f:
    json.dump(with_filename, f, ensure_ascii=False, indent=4)
print(f"Created JSON file with valid filenames: {json_with_filename}")

# Save entries without valid filenames to another JSON file
with open(json_without_filename, 'w', encoding='utf-8') as f:
    json.dump(without_filename, f, ensure_ascii=False, indent=4)
print(f"Created JSON file without valid filenames: {json_without_filename}")

# Convert lists to DataFrames
df_with_filename = pd.DataFrame(with_filename)
df_without_filename = pd.DataFrame(without_filename)

# Define output Excel filenames
excel_with_filename = 'with_filename.xlsx'
excel_without_filename = 'without_filename.xlsx'

# Export entries with valid filenames to an Excel file
df_with_filename.to_excel(excel_with_filename, index=False)
print(f"Created Excel file with valid filenames: {excel_with_filename}")

# Export entries without valid filenames to another Excel file
df_without_filename.to_excel(excel_without_filename, index=False)
print(f"Created Excel file without valid filenames: {excel_without_filename}")


# JSON Data Processing Script

## Description

This script processes data from a JSON file, segregates the entries based on a specific condition regarding the filename field, and exports the separated data into both JSON and Excel formats. This allows for easy data management and analysis.

## Key Features

- **Loading JSON Data**
  - The script reads data from a specified JSON file.

- **Data Separation**
  - It separates the entries based on whether the filename field equals `"indextxt"`.

- **Exporting Data**
  - The separated data is then saved into separate JSON and Excel files for further analysis and reporting.
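
The separation step can also be done in a single pass over the data. The sketch below assumes a generic helper, `partition`, which is a hypothetical name introduced here, and the sample entries are made up.

```python
def partition(entries, predicate):
    """Split entries into (matching, non_matching) in one pass."""
    matched, rest = [], []
    for entry in entries:
        (matched if predicate(entry) else rest).append(entry)
    return matched, rest


# Made-up entries for illustration.
entries = [
    {"title": "A", "filename": "report.txt"},
    {"title": "B", "filename": "indextxt"},
    {"title": "C"},
]

to_remove, remaining_data = partition(
    entries, lambda e: e.get("filename") == "indextxt"
)

print(len(to_remove), len(remaining_data))
```

Both lists preserve the original order of the entries, and an entry with no `filename` field falls into the remaining group because `entry.get` returns `None`.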

In [None]:
import json
import pandas as pd

# Load the JSON data from the specified file
file_path = 'with_filename.json'  # Path to your JSON file
with open(file_path, 'r', encoding='utf-8') as file:
    data = json.load(file)  # Load JSON data

# Separate the entries based on the condition for filename
# Filter entries where the filename is "indextxt"
to_remove = [entry for entry in data if entry.get('filename') == "indextxt"]
# Get the remaining entries
remaining_data = [entry for entry in data if entry.get('filename') != "indextxt"]

# Export the separated data to JSON files
removed_json_path = 'removed_entries.json'  # Path for JSON file of removed entries
remaining_json_path = 'remaining_entries.json'  # Path for JSON file of remaining entries

# Save removed entries to JSON file
with open(removed_json_path, 'w', encoding='utf-8') as file:
    json.dump(to_remove, file, ensure_ascii=False, indent=4)

# Save remaining entries to JSON file
with open(remaining_json_path, 'w', encoding='utf-8') as file:
    json.dump(remaining_data, file, ensure_ascii=False, indent=4)

# Convert the data to DataFrames and save as Excel files
removed_df = pd.DataFrame(to_remove)  # Convert removed entries to DataFrame
remaining_df = pd.DataFrame(remaining_data)  # Convert remaining entries to DataFrame

removed_excel_path = 'removed_entries.xlsx'  # Path for Excel file of removed entries
remaining_excel_path = 'remaining_entries.xlsx'  # Path for Excel file of remaining entries

# Save removed entries to Excel file
removed_df.to_excel(removed_excel_path, index=False)

# Save remaining entries to Excel file
remaining_df.to_excel(remaining_excel_path, index=False)

# Print the paths of the saved files
print(f"Removed entries saved to JSON: {removed_json_path}")
print(f"Remaining entries saved to JSON: {remaining_json_path}")
print(f"Removed entries saved to Excel: {removed_excel_path}")
print(f"Remaining entries saved to Excel: {remaining_excel_path}")
