# Extract Relevant Form 990s

## Goal of this script

The goal of this script is to store the Form 990s of the nonprofit organizations (NPOs) you are interested in in a specific folder for later extraction and analysis. The previous script wrote the Employer Identification Numbers (EINs) of those NPOs to a .txt file, which serves as the input for this code.

## Step 1: Download XML files

The IRS used to store the Form 990s of NPOs on Amazon Web Services (AWS). It now links all Form 990 XML files by year, from 2018 to 2024, directly on its website: https://www.irs.gov/charities-non-profits/form-990-series-downloads

Unfortunately, as far as I know, you have to download the 990 data for ALL NPOs for the years you are interested in, and then extract the specific Form 990s you are interested in. This is because the IRS does not offer any online tool for picking out individual NPO Form 990s, and it does not let you download a selection of them. **Note that the Form 990 data amounts to around 20 GB per year, so across 2018-2023 this amounts to around 120 GB.**

So, without further ado, to download the XML files:

- Go to: https://www.irs.gov/charities-non-profits/form-990-series-downloads, and download the years you are interested in.
- Once downloaded, unzip the data to a folder of your choosing using extraction software like 7-Zip. I have found that unzipping with the built-in Windows extractor takes much longer than using 7-Zip.
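
If you would rather script the unzip step than do it by hand, the following is a minimal sketch using Python's standard `zipfile` module. The function name `unzip_all` and the folder layout (one subfolder per archive) are my own assumptions, not something the IRS prescribes. If `zipfile` raises `NotImplementedError` because an archive uses a compression method it does not support, fall back to 7-Zip for that archive.

```python
import zipfile
from pathlib import Path


def unzip_all(zip_dir, out_dir):
    """Extract every .zip archive in zip_dir into its own subfolder of out_dir.

    Hypothetical helper: the name and the one-subfolder-per-archive layout
    are illustrative assumptions.
    """
    zip_dir, out_dir = Path(zip_dir), Path(out_dir)
    extracted = []
    for zip_path in sorted(zip_dir.glob("*.zip")):
        # Use the archive name (without .zip) as the subfolder name
        target = out_dir / zip_path.stem
        target.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(target)
        extracted.append(target)
    return extracted


# Example call (paths are placeholders, replace them with your own):
# unzip_all("990_Research/downloads/2023", "990_Research/form_990s/2023")
```

Because the later extraction code walks the source directory recursively, the extra subfolder level created here does not need any special handling.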




## Step 2: Extract your NPO Form 990s

- Lastly, run the code below to extract the NPO Form 990s you care about. The only parts of the code you need to replace with your own paths are **_nonprofits_directory_**, **_source_directory_**, and **_target_eins_file_**.
- The code simply checks each Form 990 XML File for the filer EIN and sees if it matches any of the EINs you care about.
- The detect_encoding function is important because the XML files do not all share one encoding (some are UTF-8, some are something else). This matters when we actually read a file, because some XML files will not open unless you specify the correct encoding.
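
To make the EIN check concrete, here is a minimal sketch of the same idea using only the standard library. It is not the method used in the code below (which relies on `chardet` and BeautifulSoup). It assumes the file is parsed from raw bytes, so the XML parser reads the encoding from the file's own XML declaration. The helper names `local_name` and `filer_ein` are hypothetical, and the sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def filer_ein(xml_bytes):
    """Return the EIN found inside the first <Filer> element, or None."""
    root = ET.fromstring(xml_bytes)
    for element in root.iter():
        if local_name(element.tag) == "Filer":
            for child in element.iter():
                if local_name(child.tag) == "EIN":
                    return (child.text or "").strip()
    return None


# Made-up sample document, not a real filing
sample = (
    b'<?xml version="1.0" encoding="utf-8"?>'
    b'<Return xmlns="http://www.irs.gov/efile">'
    b'<ReturnHeader><Filer><EIN>123456789</EIN></Filer></ReturnHeader>'
    b'</Return>'
)

target_eins = {"123456789"}
print(filer_ein(sample) in target_eins)  # prints True
```

Matching on the local tag name sidesteps XML namespaces, so the lookup works whether or not the elements carry a namespace prefix.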

In [None]:
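import os
import shutil
import chardet  # third-party: pip install chardet
# third-party: pip install beautifulsoup4 lxml (the 'xml' parser used below needs lxml)
from bs4 import BeautifulSoup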

# Path to the directory for storing selected nonprofits' files
nonprofits_directory = "990_Research\\important_990s\\nonprofits_directory"

# Path to the directory containing the files to be processed
source_directory = "990_Research\\form_990s\\2023"

# Path to the file containing the target EINs
target_eins_file = "990_Research\\nonprofit_data\\EINs.txt"

# Load target EINs from a file (one EIN per line) into a set, skipping blank lines
def load_target_eins(file_path):
    with open(file_path, 'r') as file:
        target_eins = {line.strip() for line in file if line.strip()}
    return target_eins

# Function to detect the encoding of a file using the chardet library
def detect_encoding(file_path):
    # Open the file in binary mode and read the raw bytes
    with open(file_path, 'rb') as f:
        raw_data = f.read()
    # chardet returns a dict with an 'encoding' key
    result = chardet.detect(raw_data)
    # 'encoding' is None when chardet cannot decide, so fall back to UTF-8
    return result['encoding'] or 'utf-8'

# Function to process a single file.
# Returns the file path on a match, an error message on failure, and None otherwise.
def process_file(file_path):
    try:
        # Detect the file's encoding
        encoding = detect_encoding(file_path)
        # Open the file with the detected encoding
        with open(file_path, 'r', encoding=encoding) as file:
            contents = file.read()
        # Use BeautifulSoup to parse the XML content
        soup = BeautifulSoup(contents, 'xml')
        # Find the <EIN> inside the <Filer> section; skip files that lack either tag
        filer = soup.find('Filer')
        ein_tag = filer.find('EIN') if filer else None
        if ein_tag is None:
            return None
        EIN = ein_tag.text.strip()
        if EIN in target_EINs:
            # Copy the file if target EIN is found
            shutil.copy(file_path, nonprofits_directory)
            # Remove the EIN from the set, so only the first matching filing per EIN is copied
            target_EINs.discard(EIN)
            # Return the path if EIN found
            return file_path
    except Exception as e:
        # Return an error message if an exception occurs
        return f"Error processing {file_path}: {e}"

    # Return None if no target EIN is found
    return None

# Generator function to get all file paths in the directory
def get_all_files(directory):
    # Walk through directory
    for root, _, files in os.walk(directory):
        # Iterate over each file
        for filename in files:
            # Yield the full file path
            yield os.path.join(root, filename)

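# Make sure the destination folder exists; otherwise shutil.copy would create a file with that name
os.makedirs(nonprofits_directory, exist_ok=True)

# Load the target EINs into a set
target_EINs = load_target_eins(target_eins_file)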

# Gather all file paths into a list
all_files = list(get_all_files(source_directory))

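# Process each file; process_file returns a path on a match or an error message on failure
for file_path in all_files:
    result = process_file(file_path)
    if result is None:
        continue
    if result.startswith("Error processing"):
        # Report failures separately so they are not mistaken for matches
        print(result)
    else:
        print(f"Copied: {result}")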

print("done")
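
One caveat: the matching above is an exact string comparison, so an EIN written as `12-3456789` in the .txt file will not match `123456789` in the XML. If your EIN list may contain hyphens, stray whitespace, or dropped leading zeros, something like the following sketch could be applied to each line before building the set. The helper `normalize_ein` is hypothetical, and it assumes that a valid EIN is exactly nine digits.

```python
import re


def normalize_ein(raw):
    """Return a nine-digit EIN string, or None if raw cannot be one.

    Hypothetical helper: assumes EINs are nine digits and that shorter
    values lost their leading zeros (for example after being read as numbers).
    """
    # Drop everything that is not a digit (hyphens, spaces, newlines)
    digits = re.sub(r"\D", "", raw)
    if not digits or len(digits) > 9:
        return None
    # Restore leading zeros
    return digits.zfill(9)


print(normalize_ein("12-3456789"))  # prints 123456789
print(normalize_ein(" 1234567 "))   # prints 001234567
```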
