# Extracting file names, EINs, and Tax Years

## Extracting file names, EINs, and Tax Years from the index files

According to [this source](https://appliednonprofitresearch.com/posts/2020/06/skip-the-irs-990-efile-indices/), the index files are not fully reliable, but they should be good enough for now. If necessary, we can instead walk the entire data directory and extract the information from each return, but that seems wasteful at this stage.

In [None]:
import pandas as pd
from tqdm import tqdm
import os
import asyncio
import aiofiles
import logging

logging.basicConfig(format='%(asctime)s: %(message)s', filename='extracting.log', level=logging.DEBUG)

Note: `data/index_2014.csv` has a typo on line 39569 that pandas cannot parse. I manually edited the file, changing `,AMAGEMENT` to what I think it should be, `MANAGEMENT`. Additionally, `data/index_2019.csv` has a tax period of `210805` on line 247851, which I have not corrected.
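
Problems like the two above can be located before handing a file to pandas. The following is a minimal sketch using only the standard library, not what was done in this notebook. It assumes that `TAX_PERIOD` is formatted as `YYYYMM` and that years outside 2000 to 2030 are implausible; the function name `find_suspect_rows`, its parameters, and the sample rows are all made up for illustration.

```python
import csv
import io

def find_suspect_rows(lines, period_column='TAX_PERIOD', min_year=2000, max_year=2030):
    """Yield (line_number, reason) for rows that look malformed.

    `lines` is any iterable of text lines, such as an open file.
    The year bounds are assumptions, not values taken from the IRS.
    """
    reader = csv.reader(lines)
    header = next(reader)
    period_idx = header.index(period_column)
    for row in reader:
        line_number = reader.line_num  # 1-based line number in the source
        if len(row) != len(header):
            # A stray comma, for example, produces an extra field
            yield line_number, f'expected {len(header)} fields, got {len(row)}'
            continue
        period = row[period_idx]
        # Assumed format: four-digit year followed by two-digit month
        plausible = (
            len(period) == 6
            and period.isdigit()
            and min_year <= int(period[:4]) <= max_year
            and 1 <= int(period[4:]) <= 12
        )
        if not plausible:
            yield line_number, f'implausible {period_column}: {period!r}'

# Made-up sample with one bad tax period and one row with an extra field
sample = io.StringIO(
    'EIN,TAX_PERIOD,OBJECT_ID\n'
    '123456789,201412,201500000000000001\n'
    '123456789,210805,201900000000000002\n'
    '123456789,201512,EXTRA,201600000000000003\n'
)
suspects = list(find_suspect_rows(sample))
```

To scan a real index file, pass a handle opened with `open(path, newline='')` in place of `sample`.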

In [None]:
years = list(range(2011, 2021))
frames = [pd.read_csv(f'data/index_{year}.csv') for year in years]
index = pd.concat(frames)

index

In [None]:
index = index[['EIN', 'TAX_PERIOD', 'OBJECT_ID']]
index = index.sort_values(['EIN', 'TAX_PERIOD'])
index = index.drop_duplicates()
# index.to_csv('index/sorted_index.csv', header=True, index=False)

## Extracting EINs and Tax Years from the xml files
This will be much slower than the index-based method above, but it takes the values directly from the returns themselves, so it does not depend on the index files being correct.  
Note: All the information we need is in the header, so instead of parsing the entire XML file, we read only the first lines (up to 100) until we have found it.

In [None]:
columns = ['EIN', 'TAX_YEAR', 'OBJECT_ID']

### Extract info from one file

In [None]:
def extract_one(name, path=None):
    """Read the EIN and tax year from the header of one return.

    Returns a one-row DataFrame, or None (after logging the name) if either
    value was not found within the first 100 lines.
    """
    if path is None:
        path = f'data/{name}_public.xml'

    ein, tax_year = None, None
    with open(path) as file:
        for i, line in enumerate(file, start=1):
            # The slices assume each tag pair sits alone on its own line
            if '<EIN>' in line:
                ein = int(line.strip()[5:-6])  # strip <EIN> and </EIN>
            if not tax_year and '<TaxYr>' in line:
                tax_year = int(line.strip()[7:-8])  # strip <TaxYr> and </TaxYr>
            if not tax_year and '<TaxYear>' in line:
                tax_year = int(line.strip()[9:-10])  # strip <TaxYear> and </TaxYear>

            if (ein and tax_year) or i > 100:
                break

    if not (ein and tax_year):
        # Also covers files shorter than 100 lines that lack either value
        logging.error(name)
        return None

    return pd.DataFrame([[ein, tax_year, name]], columns=columns)
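
The line-slicing approach above depends on each tag sitting alone on its own line. As a hedged alternative, the same "stop once the header has been read" idea can be sketched with the standard library's incremental parser, `xml.etree.ElementTree.iterparse`. This is only a sketch under assumptions: that the header element is named `ReturnHeader` and that the files declare an XML namespace (both are guesses not confirmed in this notebook), and the names `read_header` and `local_name` and the sample document are hypothetical. As in `extract_one`, a later `EIN` overwrites an earlier one until both values have been found.

```python
import io
import xml.etree.ElementTree as ET

def local_name(tag):
    """Drop a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit('}', 1)[-1]

def read_header(source):
    """Return (ein, tax_year) read from the start of `source`.

    `source` is a binary file object. Parsing stops as soon as both values
    are known, or at the end of the element assumed to be the header.
    """
    ein, tax_year = None, None
    for _event, elem in ET.iterparse(source, events=('end',)):
        name = local_name(elem.tag)
        if name == 'EIN':
            ein = int(elem.text)
        elif name in ('TaxYr', 'TaxYear') and tax_year is None:
            tax_year = int(elem.text)
        if ein is not None and tax_year is not None:
            break
        if name == 'ReturnHeader':  # assumed name of the header element
            break
    return ein, tax_year

# Made-up document; the namespace and values are for illustration only
sample = io.BytesIO(b"""<?xml version="1.0" encoding="utf-8"?>
<Return xmlns="http://www.irs.gov/efile">
  <ReturnHeader>
    <TaxYr>2015</TaxYr>
    <Filer><EIN>123456789</EIN></Filer>
  </ReturnHeader>
  <ReturnData><EIN>999999999</EIN></ReturnData>
</Return>""")
header = read_header(sample)
```

Unlike the slicing version, this tolerates indentation changes and several tags on one line, at the cost of some parser overhead per file.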

### Example Usage

In [None]:
df = pd.DataFrame(columns=columns)

one_row = extract_one('201602159349301240')
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
df = pd.concat([df, one_row], ignore_index=True)
df

### Get a list of xml files in the data directory

In [None]:
# Strip the 11-character '_public.xml' suffix to get each return's object ID
all_files = [file[:-11] for file in os.listdir('data') if file.endswith('_public.xml')]
len(all_files)

### Extract the EIN and Tax Year from all these files

In [None]:
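# One progress bar is enough; files that failed extraction (None) are skipped
rows = [one_row for name in tqdm(all_files) if (one_row := extract_one(name)) is not None]
df = pd.concat(rows, ignore_index=True)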
df

In [None]:
df = df.sort_values(['EIN', 'TAX_YEAR'])
df.to_csv('index/full_index.csv', header=True, index=False)
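
With both indices built, one way to see how far the index files can be trusted is to compare them. The sketch below is an assumption about how such a check might look, not something run in this notebook. It compares only `(EIN, OBJECT_ID)` pairs, because the index files carry a tax period while the XML extraction yields a tax year, and the two are not directly comparable. The function name `compare_indices` and the sample EINs are made up.

```python
def compare_indices(index_rows, xml_rows):
    """Compare two iterables of (EIN, OBJECT_ID) pairs.

    Values are normalised first, since the object ID may be an integer in
    one source (read from CSV) and a string in the other (taken from a
    file name).
    """
    def normalise(rows):
        return {(int(ein), str(object_id)) for ein, object_id in rows}

    from_index = normalise(index_rows)
    from_xml = normalise(xml_rows)
    return {
        'agree': from_index & from_xml,
        'only_in_index': from_index - from_xml,
        'only_in_xml': from_xml - from_index,
    }

# Made-up rows: one pair agrees, one object ID has conflicting EINs
index_rows = [(123456789, 201602159349301240), (111111111, 201500000000000001)]
xml_rows = [(123456789, '201602159349301240'), (222222222, '201500000000000001')]
result = compare_indices(index_rows, xml_rows)
```

With the DataFrames above, the pairs could presumably be supplied as `index[['EIN', 'OBJECT_ID']].itertuples(index=False)` and `df[['EIN', 'OBJECT_ID']].itertuples(index=False)`.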