## Reducing the num file

Because the num files are large, reading them into a dataframe can take a long time. The dataset is therefore filtered and saved as a reduced version. The TSV files are converted to Parquet files, which are better suited to large datasets.

In [50]:
import pandas as pd
import os

# list the dataset folders in reverse-sorted order; os.listdir returns entries
# in arbitrary order, so sort explicitly to make the order deterministic
files = sorted(os.listdir('./datasets'), reverse=True)

# read only the important columns and specify each datatype so pandas doesn't have to infer it
columns_to_use = ['adsh', 'tag', 'qtrs', 'dimh', 'value', 'ddate', 'iprx']
dtype = {
    'adsh': 'category',
    'tag': 'category',
    'qtrs': 'Int32',
    'dimh': 'category',
    'value': 'float64',
    'ddate': 'Int32',
    'iprx': 'Int32',
}

# open each num tsv file; the file stores every xbrl value for the sec document
for file_name in files:
    path_values = f'./datasets/{file_name}/num.tsv'
    df_num = pd.read_table(path_values, usecols=columns_to_use, dtype=dtype)

    # keep only rows without a dimension (dimh == '0x00000000') and with iprx == 0,
    # which removes sub-variables (dimh) and duplicate values (iprx)
    reduced_df_num = df_num[(df_num['dimh'] == '0x00000000') & (df_num['iprx'] == 0)]

    # save the reduced table next to the original file
    reduced_df_num.to_parquet(f'./datasets/{file_name}/reduced_num.parquet', compression="snappy")
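
To make the filter above easier to follow, here is a minimal sketch that applies the same mask to a tiny, made-up table. The rows and values are invented for illustration and do not come from any SEC file; only the column names and the `'0x00000000'` and `0` conditions are taken from the cell above. The explicit `fillna(False)` is an addition that spells out what should happen to rows where `iprx` is missing.

```python
import pandas as pd

# Made-up rows for illustration only (not real SEC data).
sample = pd.DataFrame({
    "adsh": pd.Categorical(["a-1", "a-1", "a-2", "a-2"]),
    "tag": pd.Categorical(["Assets", "Assets", "Revenues", "Revenues"]),
    "dimh": pd.Categorical(["0x00000000", "0x1a2b3c4d", "0x00000000", "0x00000000"]),
    "iprx": pd.array([0, 0, 1, None], dtype="Int32"),
    "value": [100.0, 40.0, 250.0, 250.0],
})

# Same condition as in the loop: no dimension and iprx equal to 0.
mask = (sample["dimh"] == "0x00000000") & (sample["iprx"] == 0)

# With the nullable Int32 dtype a missing iprx yields <NA> in the mask;
# fillna(False) drops those rows explicitly.
reduced = sample[mask.fillna(False)]

print(reduced)
```

Only the first row survives: the second has a non-zero `dimh`, the third has `iprx == 1`, and the fourth has no `iprx` value at all.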


Definitions for each column can be found here: https://www.sec.gov/files/aqfsn_1.pdf
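
Once the reduced files exist, they can be read back instead of the original TSV files. The following is a rough sketch of one way to do that, assuming the `./datasets/<folder>/reduced_num.parquet` layout written above. The helper name `load_reduced` and its parameters are hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_reduced(base_dir="./datasets", columns=None):
    """Read every reduced_num.parquet under base_dir into one DataFrame.

    Hypothetical helper: `columns` optionally limits which columns are read.
    """
    # One reduced file per dataset folder, visited in sorted order.
    paths = sorted(Path(base_dir).glob("*/reduced_num.parquet"))
    frames = [pd.read_parquet(path, columns=columns) for path in paths]

    # pd.concat raises on an empty list, so return an empty frame instead.
    if not frames:
        return pd.DataFrame(columns=columns)
    return pd.concat(frames, ignore_index=True)
```

Calling `load_reduced(columns=['adsh', 'tag', 'value'])` would then return a single dataframe with just those columns. Note that categorical columns may fall back to `object` dtype after concatenation if the categories differ between files.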