# Extracting file names, EINs, and Tax Years

## Extracting file names, EINs, and Tax Years from the index files

According to [this source](https://appliednonprofitresearch.com/posts/2020/06/skip-the-irs-990-efile-indices/), the IRS index files are not fully reliable, but they should be good enough for a first pass. If necessary, we can instead walk the entire directory and extract the information from each return, but that seems wasteful for now.

In [1]:
import pandas as pd
from tqdm import tqdm
import os
import asyncio
import aiofiles
import logging

logging.basicConfig(format='%(message)s', filename='extracting.log', level=logging.DEBUG)

Note: `data/index_2014.csv` has a malformed entry on line 39569 that pandas fails to parse. I manually edited the file, changing `,AMAGEMENT` to what I believe it should be, `MANAGEMENT`. Additionally, `data/index_2019.csv` has an implausible tax period of `210805` on line 247851, which I have not corrected.
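
Rows like these can be located before pandas trips over them by scanning the file with the standard `csv` module and flagging any row with an unexpected field count or tax period. This is only a sketch: the helper name `find_suspect_rows`, the year bounds, and the shortened sample rows are assumptions for illustration, not part of the original notebook.

```python
import csv
import io


def find_suspect_rows(lines, min_year=2000, max_year=2030):
    """Yield (line_number, reason) for rows that look malformed.

    `lines` is any iterable of CSV text lines (an open file works).
    The year bounds are illustrative assumptions, not IRS rules.
    """
    reader = csv.reader(lines)
    header = next(reader)
    period_col = header.index('TAX_PERIOD')
    for row in reader:
        line_number = reader.line_num
        if len(row) != len(header):
            yield line_number, f'expected {len(header)} fields, got {len(row)}'
            continue
        period = row[period_col]
        # TAX_PERIOD is expected to look like YYYYMM.
        if len(period) != 6 or not period.isdigit():
            yield line_number, f'tax period {period!r} is not YYYYMM'
            continue
        year, month = int(period[:4]), int(period[4:])
        if not (min_year <= year <= max_year and 1 <= month <= 12):
            yield line_number, f'tax period {period!r} is out of range'


# Made-up, shortened sample: a stray comma on line 3 and a year-2108 period on line 4.
sample = io.StringIO(
    'RETURN_ID,EIN,TAX_PERIOD,TAXPAYER_NAME\n'
    '1,591971002,201009,ANGELUS INC\n'
    '2,123456789,201312,SOME ,AMAGEMENT CO\n'
    '3,987654321,210805,ANOTHER ORG\n'
)
suspects = list(find_suspect_rows(sample))
for line_number, reason in suspects:
    print(line_number, reason)
```

On the real files, passing `open('data/index_2014.csv', newline='')` in place of `sample` should work the same way, assuming the header row itself is intact.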

In [2]:
years = list(range(2011, 2021))
frames = [pd.read_csv(f'data/index_{year}.csv') for year in years]
index = pd.concat(frames)

index

Unnamed: 0,RETURN_ID,FILING_TYPE,EIN,TAX_PERIOD,SUB_DATE,TAXPAYER_NAME,RETURN_TYPE,DLN,OBJECT_ID
0,9091250,EFILE,591971002,201009,11/30/2011 1:06:39 AM,ANGELUS INC,990,93493316003251,201103169349300325
1,9091274,EFILE,251713602,201106,11/30/2011 1:09:14 AM,TOUCH-STONE SOLUTIONS INC,990,93493313012311,201113139349301231
2,9091275,EFILE,232705170,201012,11/30/2011 1:09:16 AM,RONALD MCDONALD HOUSE CHARITIES- PHILADELPHIA ...,990,93493313013011,201113139349301301
3,9091276,EFILE,581805618,201106,11/30/2011 1:09:19 AM,TORRINGTON VOA ELDERLY HOUSING INC BELL PARK T...,990,93493313013111,201113139349301311
4,9091277,EFILE,581876019,201106,11/30/2011 1:09:21 AM,HOUSTON VOA INDEPENDENT HOUSING INC HEIGHTS MANOR,990,93493313013161,201113139349301316
...,...,...,...,...,...,...,...,...,...
146851,17092613,EFILE,300272104,201812,1/31/2020 6:29:36 AM,HEALTHY NEIGHBORHOODS INC,990,93493319122009,201903199349312200
146852,17092616,EFILE,300056848,201812,1/31/2020 6:29:38 AM,DETROIT PUBLIC SAFETY FOUNDATION,990,93493319122159,201903199349312215
146853,17094197,EFILE,160730210,201812,1/31/2020 8:34:49 AM,DUNKIRK VETERANS OF WORLD WAR II INC,990O,93493319122209,201903199349312220
146854,17068165,EFILE,311781473,201812,1/24/2020 7:09:51 PM,OHIO FARM BUREAU FOUNDATION,990,93493319122409,201903199349312240


In [3]:
index = index[['EIN', 'TAX_PERIOD', 'OBJECT_ID']]
index = index.sort_values(['EIN', 'TAX_PERIOD'])
index = index.drop_duplicates()
# index.to_csv('data/sorted_index.csv', header=True, index=False)

## Extracting EINs and Tax Years from the xml files
This is much slower than the index-based method above, but it reads the values directly from each return instead of relying on the indices.  
Note: all the information we need is in the header, so instead of parsing the entire XML file, we read only the first lines until both values are found.

In [2]:
columns = ['EIN', 'TAX_YEAR', 'OBJECT_ID']

### Extract info from one file

In [3]:
async def extract_one(name, path=None):
    if path is None:
        path = f'data/{name}_public.xml'
    
    ein, tax_year = None, None
    async with aiofiles.open(path) as file:
        i = 0
        async for line in file:
            i += 1 # Unfortunately, enumerate does not work with aiofiles objects
            if '<EIN>' in line:
                ein = int(line.strip()[5:-6]) # remove the EIN tags from the line
            if not tax_year and '<TaxYr>' in line:
                tax_year = int(line.strip()[7:-8]) # remove the TaxYr tags from the line
            if not tax_year and '<TaxYear>' in line:
                tax_year = int(line.strip()[9:-10]) # remove the TaxYear tags from the line

            if ein and tax_year: 
                break
            if i > 100:
                logging.error(name)
                return
    
    return pd.DataFrame([[ein, tax_year, name]], columns=columns)

In [4]:
def extract_one_synch(name, path=None):
    if path is None:
        path = f'data/{name}_public.xml'
    
    ein, tax_year = None, None
    with open(path) as file:
        for i, line in enumerate(file, start=1): # 1-based line counter
            if '<EIN>' in line:
                ein = int(line.strip()[5:-6]) # remove the EIN tags from the line
            if not tax_year and '<TaxYr>' in line:
                tax_year = int(line.strip()[7:-8]) # remove the TaxYr tags from the line
            if not tax_year and '<TaxYear>' in line:
                tax_year = int(line.strip()[9:-10]) # remove the TaxYear tags from the line

            if ein and tax_year: 
                break
            if i > 100:
                logging.error(name)
                return
    
    return pd.DataFrame([[ein, tax_year, name]], columns=columns)
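
The fixed slice offsets above assume that each tag sits alone on its line with no attributes or extra text. A regular expression is a somewhat more forgiving alternative. The following is a minimal sketch rather than the notebook's method: `parse_header` and its `max_lines` parameter are hypothetical names, and the sample header is a simplified stand-in for a real return. Like the loop above, it lets a later `<EIN>` overwrite an earlier one and keeps the first tax year it sees.

```python
import io
import re

# Matches an EIN, TaxYr, or TaxYear element whose content is all digits.
TAG_PATTERN = re.compile(r'<(EIN|TaxYr|TaxYear)>\s*(\d+)\s*</\1>')


def parse_header(lines, max_lines=100):
    """Return (ein, tax_year) found in the first `max_lines` lines, else None."""
    ein, tax_year = None, None
    for i, line in enumerate(lines):
        if i >= max_lines:
            break
        for tag, value in TAG_PATTERN.findall(line):
            if tag == 'EIN':
                ein = int(value)
            elif tax_year is None:
                tax_year = int(value)
        if ein is not None and tax_year is not None:
            return ein, tax_year
    return None


# Simplified, hand-written header using the EIN and tax year from the example below.
sample = io.StringIO(
    '<ReturnHeader>\n'
    '  <Filer>\n'
    '    <EIN>251753030</EIN>\n'
    '  </Filer>\n'
    '  <TaxYr>2015</TaxYr>\n'
    '</ReturnHeader>\n'
)
result = parse_header(sample)
print(result)
```

Because `parse_header` accepts any iterable of lines, it could be fed an open file handle in the synchronous version.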

### Example Usage

In [5]:
df = pd.DataFrame(columns=columns)

one_row = await extract_one('201602159349301240')
df = pd.concat([df, one_row]) # DataFrame.append was removed in pandas 2.0
df

Unnamed: 0,EIN,TAX_YEAR,OBJECT_ID
0,251753030,2015,201602159349301240


### Get a list of xml files in the data directory

In [6]:
all_files = [file for file in os.listdir('data') if file.endswith('_public.xml')]
all_object_ids = [int(name[:-11]) for name in all_files]
len(all_object_ids)

3261648
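
With millions of files in a single directory, `os.scandir` offers a way to iterate over the entries lazily instead of building the full list of names first. The sketch below is an illustration under assumptions: `iter_object_ids` is a hypothetical helper, and the demonstration runs against a temporary directory containing a few placeholder files rather than the real `data` directory.

```python
import os
import tempfile

SUFFIX = '_public.xml'


def iter_object_ids(directory):
    """Yield the integer object id of each '*_public.xml' file in `directory`."""
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file() and entry.name.endswith(SUFFIX):
                stem = entry.name[:-len(SUFFIX)]
                if stem.isdigit():  # skip names that are not pure object ids
                    yield int(stem)


# Demonstration on a throwaway directory with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ('201602159349301240_public.xml', 'notes.txt', 'index_2014.csv'):
        open(os.path.join(tmp, name), 'w').close()
    found = sorted(iter_object_ids(tmp))

print(found)
```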

### Extract the EIN and Tax Year from all these files
The next cell takes around 4 hours to run on my machine, and the synchronous version below is faster (about an hour), which I attribute to its list comprehension, so I've disabled the cell by changing its type to markdown. If you want to run it, change it back to a code cell.

```python
df = pd.DataFrame(columns=columns)

future_rows = [extract_one(name) for name in all_object_ids]
chunks_future_rows = [future_rows[x:x+100] for x in range(0, len(future_rows), 100)]

# asyncio.as_completed starts all of its tasks at the same time, so passing every
# coroutine at once eventually raises OSError: [Errno 24] Too many open files.
# To avoid this, the coroutines are processed in chunks of 100 files.

len(chunks_future_rows)

rows = []

for chunk in tqdm(chunks_future_rows, total=len(chunks_future_rows)):
    # Do not reset `rows` here, or only the last chunk would survive.
    for future in asyncio.as_completed(chunk):
        one_row = await future
        if one_row is None:
            continue
        rows.append(one_row)

df = pd.concat(rows)

df
```
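
An alternative to manual chunking is to cap concurrency with `asyncio.Semaphore`, so that at most a fixed number of files are open at any moment. This is a sketch under assumptions, not a tested replacement for the cell above: `bounded_gather` and `fake_extract` are hypothetical names, and `fake_extract` merely yields control where the real coroutine would open and read a file.

```python
import asyncio


async def bounded_gather(coroutines, limit=100):
    """Await all coroutines, running at most `limit` of them at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def run(coroutine):
        async with semaphore:
            return await coroutine

    # gather returns results in the same order as its inputs.
    return await asyncio.gather(*(run(c) for c in coroutines))


# Stand-in for extract_one that records how many calls are in flight at once.
state = {'active': 0, 'peak': 0}


async def fake_extract(name):
    state['active'] += 1
    state['peak'] = max(state['peak'], state['active'])
    await asyncio.sleep(0)  # where the real coroutine would read the file
    state['active'] -= 1
    return name


results = asyncio.run(bounded_gather([fake_extract(n) for n in range(50)], limit=5))
print(len(results), state['peak'])
```

Inside a notebook, where an event loop is already running, `await bounded_gather(...)` would take the place of the `asyncio.run` call. Note that `asyncio.gather` still creates one task per coroutine, so for millions of files it may be worth combining this with chunking.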

In [9]:
df = pd.concat([one_row for name in tqdm(all_object_ids) if (one_row := extract_one_synch(name)) is not None])
df

100%|██████████| 3261648/3261648 [59:09<00:00, 918.83it/s]  


Unnamed: 0,EIN,TAX_YEAR,OBJECT_ID
0,66070253,2016,201801359349311605
0,222925068,2016,201731309349100113
0,741509867,2014,201521969349300722
0,621769084,2012,201332079349200318
0,411443583,2018,201943189349308384
...,...,...,...
0,946174771,2014,201523229349300647
0,396607008,2013,201403239349100210
0,530167933,2016,201811009349300246
0,203616217,2012,201341359349201289


In [10]:
df = df.sort_values(['EIN', 'TAX_YEAR'])
df.to_csv('data/full_index.csv', header=True, index=False)