# **Subset NPOs**

## Goal of this Script:

The primary use case of this script is to subset the NPOs whose Form 990 data you want to analyze.
There are MANY ways to choose the NPOs you want, including:

    - EIN (Unique Identifier of NPOs)
    - City
    - State
    - Classification Code(s)
    - Asset Amount
    - National Taxonomy of Exempt Entities (NTEE) common codes
    - NTEE core codes
    - and more...

Subsetting on any of the variables above is simple, as we will see in Step 2.

## **Step 1: Download list of ALL NPOs and Their Attributes (Less than 0.5 GB)**

Based on my research, the IRS used to store data on NPOs and Form 990s on Amazon Web Services (AWS).

Now, the IRS stores information on NPOs and their attributes in .csv files on its website here: **https://www.irs.gov/charities-non-profits/exempt-organizations-business-master-file-extract-eo-bmf**

On its website, the IRS splits the data by state. It also splits the data by geographic region:

    - Northeast
    - Mid-Atlantic and Great Lakes
    - Gulf Coast and Pacific Coast
    - International
    - Puerto Rico

I assume we want to analyze NPOs across the United States, excluding Puerto Rico, so I load the **Northeast, Mid-Atlantic and Great Lakes, and Gulf Coast and Pacific Coast** regional files below.

In [6]:
import pandas as pd

# Paths to three files
file_paths = [
    "eo1.csv",
    "eo2.csv",
    "eo3.csv"
]

# Read all datasets into a list of DataFrames
dataframes = [pd.read_csv(file_path) for file_path in file_paths]

In [8]:
# Append the datasets into a single DataFrame; the column names are the same in all of them.
combined_df = pd.concat(dataframes, ignore_index=True)
combined_df.head()

Unnamed: 0,EIN,NAME,ICO,STREET,CITY,STATE,ZIP,GROUP,SUBSECTION,AFFILIATION,...,ASSET_CD,INCOME_CD,FILING_REQ_CD,PF_FILING_REQ_CD,ACCT_PD,ASSET_AMT,INCOME_AMT,REVENUE_AMT,NTEE_CD,SORT_NAME
0,19818,PALMER SECOND BAPTIST CHURCH,,1050 THORNDIKE ST,PALMER,MA,01069-1507,3125,3,9,...,0,0,6,0,12,,,,,5662.0
1,29215,ST GEORGE CATHEDRAL,,523 E BROADWAY,SOUTH BOSTON,MA,02127-4415,2365,3,9,...,0,0,6,0,12,,,,,
2,587764,IGLESIA BETHESDA INC,,13 CUMMINGHAM ST,LOWELL,MA,01852-0000,0,3,3,...,0,0,6,0,12,,,,X21,
3,635913,MINISTERIO APOSTOLICO JESUCRISTO ES EL SENOR INC,,454 ESSEX ST,LAWRENCE,MA,01840-1242,0,3,3,...,0,0,6,0,12,,,,X21,
4,765634,MERCY CHAPEL INTERNATIONAL,,75 MORTON VILLAGE DR APT 408,MATTAPAN,MA,02126-2433,0,3,3,...,0,0,6,0,12,,,,X20,
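An EIN is a nine-digit number, but the `EIN` column above was parsed as an integer, so values such as `19818` have lost their leading zeros. The sketch below shows two possible ways to keep or restore them. It uses a tiny made-up CSV rather than the real IRS files, and it assumes the raw files store EINs zero-padded; the second row and its organization name are hypothetical.

```python
import io
import pandas as pd

# Tiny stand-in for one of the IRS extracts (illustrative only;
# the real files have many more columns, and the second row is made up).
sample_csv = io.StringIO(
    "EIN,NAME,STATE,NTEE_CD\n"
    "000019818,PALMER SECOND BAPTIST CHURCH,MA,\n"
    "050573745,EXAMPLE FOSTER CARE AGENCY,NY,P32\n"
)

# Option 1: read EIN as a string so leading zeros are never dropped.
df = pd.read_csv(sample_csv, dtype={"EIN": str})

# Option 2: if the EINs were already parsed as integers,
# zero-pad them back to nine digits.
ein_as_int = pd.Series([19818, 50573745])
ein_padded = ein_as_int.astype(str).str.zfill(9)

print(df["EIN"].tolist())
print(ein_padded.tolist())
```

Whether the downstream Form 990 lookup needs the padded form depends on the tool you use, so treat this as an optional safeguard.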


## **Step 2: Subset and Export Your NPO EINs**
We just loaded every NPO in the three regional files, along with its attributes, as currently published by the IRS. The dataset looks scary, but thankfully, the IRS has a handy data dictionary found here: **https://www.irs.gov/pub/irs-soi/eo-info.pdf**

Say we only care about NPOs that satisfy two conditions:
    - They are in **New York** (East Coast Best Coast)
    - Their NTEE_CD is **P32**, indicating the NPO is a Human Services NPO focused on Foster Care (See data dictionary link above).

We can easily subset the data in pandas, as seen below. We only care about each NPO's Employer Identification Number (EIN), because this is the unique identifier with which we will fetch the NPO's Form 990s.

**Make sure you read the IRS data dictionary carefully, as it contains a wealth of information for subsetting the NPOs you care about!**

In [43]:
subset_dataset = combined_df.loc[
    (combined_df['NTEE_CD'] == "P32") & (combined_df['STATE'] == "NY"),
    "EIN"]

In [44]:
subset_dataset.head()

30461     50573745
47083    113044834
65055    133986940
74317    141810672
90606    202242921
Name: EIN, dtype: int64
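The same boolean-mask pattern extends to the other variables listed at the top of this script. As a rough sketch, the example below combines several NTEE codes (via `isin`), a state, and a minimum asset amount. The rows in `toy_df`, the code list, and the `min_assets` threshold are all hypothetical; only the column names come from the combined DataFrame.

```python
import pandas as pd

# Hypothetical rows that reuse column names from combined_df.
toy_df = pd.DataFrame({
    "EIN": [111111111, 222222222, 333333333, 444444444],
    "CITY": ["ALBANY", "BUFFALO", "BOSTON", "NEW YORK"],
    "STATE": ["NY", "NY", "MA", "NY"],
    "NTEE_CD": ["P32", "P30", "P32", None],
    "ASSET_AMT": [250000.0, 50000.0, 900000.0, 10000.0],
})

wanted_codes = ["P30", "P32"]  # illustrative list of NTEE codes
min_assets = 100_000           # illustrative asset threshold

# Each condition is a boolean Series; & keeps rows where all are True.
mask = (
    toy_df["NTEE_CD"].isin(wanted_codes)
    & (toy_df["STATE"] == "NY")
    & (toy_df["ASSET_AMT"] >= min_assets)
)
toy_subset = toy_df.loc[mask, "EIN"]
print(toy_subset.tolist())
```

Rows with a missing `NTEE_CD` or `ASSET_AMT` evaluate to False in these comparisons, so they are dropped from the subset.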

In [45]:
# Export the Series to a text file
output_path = "subset_dataset.txt"
subset_dataset.to_csv(output_path, index=False, header=False)
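To check the export, the file can be read back into a plain Python list with one EIN per line. The sketch below writes to a temporary directory instead of the working directory and reuses the first two EINs from the output above. Zero-padding to nine digits is an assumption about what the downstream step expects.

```python
import os
import tempfile
import pandas as pd

# First two EINs from the subset shown earlier.
eins = pd.Series([50573745, 113044834], name="EIN")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "subset_dataset.txt")

    # Same export call as above: one EIN per line, no index, no header.
    eins.to_csv(path, index=False, header=False)

    # Read the file back, skip blank lines, and restore leading zeros.
    with open(path) as fh:
        ein_list = [line.strip().zfill(9) for line in fh if line.strip()]

print(ein_list)
```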

## **Conclusion & Note** 

The purpose of this file is to find the Employer Identification Numbers (EINs) of the NPOs whose Form 990s you are looking to analyze. You can subset the NPOs based on 28 variables, including name, state, NTEE code, and asset amount, using the following data dictionary as a guide: https://www.irs.gov/pub/irs-soi/eo-info.pdf