# SEC Financial Statement Data Sets Tools - Quickstart

## TL;DR

This notebook gives a first introduction to using the secfsdstools (SEC Financial Statement Data Sets Tools) Python package: https://pypi.org/project/secfsdstools/

It is designed to work with the data provided by the "SEC Financial Statement Data Sets" (SFSDS) (https://www.sec.gov/dera/data/financial-statement-data-sets).

The SFSDS contains data from all reports that were filed with the SEC since 2009, for instance all annual and quarterly reports. The main assets that can be retrieved from this data set are the financial statements (balance sheet, income statement, and cash flow).

First, this notebook shows how the library is installed and configured. After that, it shows how the financial statements can be extracted from the data set.

For a detailed definition of the data set see https://www.sec.gov/files/aqfs.pdf.

## Principles / Concepts

The goal is bulk processing of the data.

To improve efficiency, the zip files are downloaded and indexed using a SQLite database table.
The index table contains information on all filed reports, over 500,000 in total. The first
download will take a couple of minutes, but after that all the data is on your local hard disk.

Using the index in the sqlite db allows for direct extraction of data for a specific report from the
appropriate zip file, reducing the need to open and search through each zip file.
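
To make the idea of the index concrete, here is a minimal sketch that uses only Python's built-in `sqlite3` module. The table name `report_index`, its columns, and the file names in the sample rows are illustrative assumptions, not the library's actual schema.

```python
import sqlite3

# Toy in-memory index: one row per filed report, pointing at the quarterly
# file that contains it. Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE report_index ("
    "adsh TEXT PRIMARY KEY, cik INTEGER, form TEXT, source_file TEXT)"
)
conn.executemany(
    "INSERT INTO report_index VALUES (?, ?, ?, ?)",
    [
        ("0000320193-22-000108", 320193, "10-K", "2022q4.zip"),
        ("0001193125-12-444068", 320193, "10-K", "2012q4.zip"),
    ],
)


def find_source_file(adsh):
    """Return the file that holds the report, or None if it is not indexed."""
    row = conn.execute(
        "SELECT source_file FROM report_index WHERE adsh = ?", (adsh,)
    ).fetchone()
    return row[0] if row else None


print(find_source_file("0000320193-22-000108"))
```

A single lookup by `adsh` tells the caller which file to open, so none of the other quarterly files has to be touched.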

Moreover, the downloaded zip files are converted to the parquet format which provides faster read access
to the data compared to reading the csv files inside the zip files.

The library is designed to have a low memory footprint, only parsing and reading the data for a specific
report into pandas dataframe tables.

## Installation
In order to install the library, just use pip install:
```
pip install secfsdstools
```

## Configure logging in Jupyter

In [1]:
# to ensure that the logging statements are shown in Jupyter output, run this cell
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

## Configure table output for Pandas

In [2]:
import pandas as pd
# ensure that all columns are shown and that column content is not cut
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', 1000)

## Configuration / Setup

In order to be used, the library needs to know where to store the compressed files from the Financial Statement Data Sets and where to store the sqlite database file. This is configured in a configuration file.

If you don't provide a config file, one will be created the first time you use the api. The configuration file will be created inside your home
directory. You can then change the content of it or directly start with the downloading of the data.

```
[DEFAULT]
downloaddirectory = <userhome>/secfsdstools/data/dld
parquetdirectory = <userhome>/secfsdstools/data/parquet
dbdirectory = <userhome>/secfsdstools/data/db
useragentemail = your.email@goeshere.com
```

The downloaddirectory is the folder in which the compressed data files are downloaded.
The parquetdirectory is the folder in which the transformed parquet version is stored.
The dbdirectory will contain the SQLite db file.
The useragentemail is set inside the header when requests to sec.gov are made. This should be your email address; however, since only very few requests are made, it doesn't really matter whether you change it or not.
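
If you want to inspect such a file yourself, the standard library's `configparser` can read it. The following is only a sketch: the helper name `load_settings` is hypothetical, `~` stands in for `<userhome>`, and the code does not claim to mirror how the library itself parses the file.

```python
import configparser
import os

# Same keys as in the sample above; "~" stands in for <userhome>.
SAMPLE_CONFIG = """
[DEFAULT]
downloaddirectory = ~/secfsdstools/data/dld
parquetdirectory = ~/secfsdstools/data/parquet
dbdirectory = ~/secfsdstools/data/db
useragentemail = your.email@goeshere.com
"""


def load_settings(text):
    """Parse the [DEFAULT] section and expand "~" in the directory entries."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    settings = {}
    for key, value in parser["DEFAULT"].items():
        if key.endswith("directory"):
            value = os.path.expanduser(value)
        settings[key] = value
    return settings


for key, value in load_settings(SAMPLE_CONFIG).items():
    print(f"{key}: {value}")
```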

If you want to start the download of the data "manually", just call the update method.

In [1]:
from secfsdstools.update import update

update()

2023-12-15 14:40:24,997 [INFO] configmgt  reading configuration from /home/sebacastillo/.secfsdstools.cfg
2023-12-15 14:40:24,999 [INFO] configmgt  SQLite db directory does not exist, creating it at /home/sebacastillo/genai0/data/db
2023-12-15 14:40:25,000 [INFO] configmgt  Download directory does not exist, creating it at /home/sebacastillo/genai0/secfsdstools/data/dld
2023-12-15 14:40:25,001 [INFO] configmgt  Parquet directory does not exist, creating it at /home/sebacastillo/genai0/data/parquet
2023-12-15 14:40:25,003 [INFO] updateprocess  Check if new report zip files are available...
2023-12-15 14:40:25,483 [INFO] updateprocess  check if there are new files to download from sec.gov ...
2023-12-15 14:40:27,679 [INFO] parallelexecution      items to process: 59
2023-12-15 14:40:27,684 [INFO] basedownloading      start to download 2023q3.zip 
2023-12-15 14:40:27,699 [INFO] basedownloading      start to download 2023q2.zip 
2023-12-15 14:40:27,700 [INFO] basedownloading      start to 

No rapid-api-key is set: 
If you are interested in daily updates, please have a look at https://rapidapi.com/hansjoerg.wingeier/api/daily-sec-financial-statement-dataset


2023-12-15 14:46:59,371 [INFO] toparquettransforming  processing 2009q2.zip
2023-12-15 14:46:59,376 [INFO] toparquettransforming  processing 2019q4.zip
2023-12-15 14:46:59,385 [INFO] toparquettransforming  processing 2022q2.zip
2023-12-15 14:46:59,400 [INFO] toparquettransforming  processing 2018q1.zip
2023-12-15 14:47:00,552 [INFO] toparquettransforming  processing 2021q3.zip
2023-12-15 14:47:23,640 [INFO] toparquettransforming  processing 2011q3.zip
2023-12-15 14:47:26,986 [INFO] toparquettransforming  processing 2010q4.zip
2023-12-15 14:47:28,728 [INFO] toparquettransforming  processing 2012q2.zip
2023-12-15 14:47:29,642 [INFO] toparquettransforming  processing 2019q2.zip
2023-12-15 14:47:32,945 [INFO] toparquettransforming  processing 2012q3.zip
2023-12-15 14:47:48,827 [INFO] toparquettransforming  processing 2015q2.zip
2023-12-15 14:47:54,025 [INFO] toparquettransforming  processing 2011q1.zip
2023-12-15 14:48:00,566 [INFO] toparquettransforming  processing 2021q2.zip
2023-12-15 1

The following tasks will be executed:
1. All currently available zip files are downloaded from sec.gov (these are over 50 files that will need over 2 GB of space on your local drive)
2. All the zip files are transformed and stored as parquet files. By default, the zip file is deleted afterwards. If you want to keep the zip files, set the parameter 'KeepZipFiles' in the config file to True.
3. An index inside a sqlite db file is created

This may take a few minutes.

If you don't call update "manually", then the first time you call a function from the library, a download will be triggered.

Moreover, at most once a day, the library checks whether a new zip file is available on sec.gov. If there is, a download will be started automatically.
If you don't want 'auto-update', set 'AutoUpdate' in your config file to False.
The new quarter zip files are available by the beginning of every quarter (January, April, July, October); hence, you have to run update() at the beginning of every quarter to get the data for the reports from the last quarter.
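
The quarterly rhythm can be expressed in a few lines. The helper name `latest_quarter_file` is hypothetical, and the rule is a simplification: it assumes that the file for a quarter is published as soon as the following quarter begins, which may not hold on the very first days of a quarter.

```python
from datetime import date


def latest_quarter_file(today):
    """Name of the newest quarterly file expected to exist on the given day."""
    current_quarter = (today.month - 1) // 3 + 1
    if current_quarter == 1:
        # From January to March the newest file covers Q4 of the previous year.
        return f"{today.year - 1}q4.zip"
    return f"{today.year}q{current_quarter - 1}.zip"


print(latest_quarter_file(date(2023, 12, 15)))  # 2023q3.zip
print(latest_quarter_file(date(2024, 1, 10)))   # 2023q4.zip
```

The first call matches the log output above, where `2023q3.zip` is the newest file downloaded on 2023-12-15.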

Note: the first time downloading data will take a couple of minutes, since over 2 GB of data will be downloaded and converted into parquet format.

Note: **If you plan to use Jupyter, make sure that you configure the directories at a location where your Jupyter process has access. The used default directory (your user home directory) will work.**

## A first simple example
Goal: present the information in the balance sheet of Apple's 2022 10-K report in the same way as it appears in the
original report on page 31 ("CONSOLIDATED BALANCE SHEETS"): https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019322000108/aapl-20220924.htm


In [2]:
from secfsdstools.e_collector.reportcollecting import SingleReportCollector
from secfsdstools.e_filter.rawfiltering import ReportPeriodAndPreviousPeriodRawFilter
from secfsdstools.e_presenter.presenting import StandardStatementPresenter

# the unique identifier for apple's 10-K report of 2022
apple_10k_2022_adsh = "0000320193-22-000108"

# use a Collector to grab the data of the 10-K report, filtered for balance sheet information
collector: SingleReportCollector = SingleReportCollector.get_report_by_adsh(
    adsh=apple_10k_2022_adsh,
    stmt_filter=["BS"]
)
rawdatabag = collector.collect()  # load the data from the disk

bs_df = (rawdatabag
         # ensure only data from the report period (2022) and the previous period (2021) is in the data
         .filter(ReportPeriodAndPreviousPeriodRawFilter())
         # join the content of the pre_txt and num_txt together
         .join()
         # format the data in the same way as it appears in the report
         .present(StandardStatementPresenter()))
print(bs_df)

2023-12-15 14:59:44,758 [INFO] configmgt  reading configuration from /home/sebacastillo/.secfsdstools.cfg


                    adsh coreg  \
0   0000320193-22-000108         
1   0000320193-22-000108         
2   0000320193-22-000108         
3   0000320193-22-000108         
4   0000320193-22-000108         
5   0000320193-22-000108         
6   0000320193-22-000108         
7   0000320193-22-000108         
8   0000320193-22-000108         
9   0000320193-22-000108         
10  0000320193-22-000108         
11  0000320193-22-000108         
12  0000320193-22-000108         
13  0000320193-22-000108         
14  0000320193-22-000108         
15  0000320193-22-000108         
16  0000320193-22-000108         
17  0000320193-22-000108         
18  0000320193-22-000108         
19  0000320193-22-000108         
20  0000320193-22-000108         
21  0000320193-22-000108         
22  0000320193-22-000108         
23  0000320193-22-000108         
24  0000320193-22-000108         
25  0000320193-22-000108         
26  0000320193-22-000108         
27  0000320193-22-000108         
28  0000320193

## Overview
The following diagram gives an overview of the secfsdstools library.

![Overview](https://github.com/HansjoergW/sec-fincancial-statement-data-set/raw/main/docs/images/overview.png)

It mainly consists of two processes. The first one is the "Data Update Process", which is responsible for downloading
the Financial Statement Data Sets zip files from the sec.gov website, transforming the content into parquet
format, and indexing the content of these files in a simple SQLite database. Again, this whole process can be started
"manually" by calling the update method, or it is done automatically, as described above.

The second main process is the "Data Processing Process", which works with the data that is stored inside the
sub.txt, pre.txt, and num.txt files from the zip files. The "Data Processing Process" mainly consists of four steps:

* **Collect** <br/> Collect the raw data from one or more different zip files. For instance, get all the data for a single
report, or get the data for all 10-K reports of a single company from several zip files.
* **Raw Processing** <br/> Once the data is collected, the collected data for sub.txt, pre.txt, and num.txt is available
as pandas dataframes. Filters can be applied, and the content can be saved and loaded directly.
* **Joined Processing** <br/> From the "Raw Data", a "joined" representation can be created. This joins the data from
the pre.txt and num.txt content together based on the "adsh", "tag", and "version" attributes. "Joined data" can also be
filtered, concatenated, and directly saved and loaded.
* **Present** <br/> Produce a single pandas dataframe out of the data and use it for further processing.

The diagram also shows the main classes with which a user interacts. Their use is described in the following chapters.


## General
Most of the classes you can interact with have a factory method whose name starts with "get_". All these factory methods
take at least one **optional** parameter called configuration, which is of type "Configuration".

If you do not provide this parameter, the class will read the configuration info from your configuration file in your home
directory. If, for whatever reason, you want to provide an alternative configuration, you can pass it in through this parameter.

However, normally you do not have to provide the "configuration" parameter.
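
The pattern itself is easy to picture. The sketch below uses made-up stand-ins (`ToyConfiguration`, `ToyReader`, `read_default_configuration`) to show a factory method with an optional configuration parameter; it does not reproduce the library's real classes or their fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyConfiguration:
    """Hypothetical stand-in for the library's Configuration type."""
    db_dir: str


def read_default_configuration() -> ToyConfiguration:
    # Stand-in for "read the configuration file in the home directory".
    return ToyConfiguration(db_dir="~/secfsdstools/data/db")


class ToyReader:
    def __init__(self, db_dir: str):
        self.db_dir = db_dir

    @classmethod
    def get_reader(cls, configuration: Optional[ToyConfiguration] = None) -> "ToyReader":
        # Fall back to the default configuration when none is passed in.
        if configuration is None:
            configuration = read_default_configuration()
        return cls(db_dir=configuration.db_dir)


print(ToyReader.get_reader().db_dir)
print(ToyReader.get_reader(ToyConfiguration(db_dir="/tmp/alt/db")).db_dir)
```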

## Index: working with the index
The first class that interacts with the index is the `IndexSearch` class. It provides a single method `find_company_by_name`
which executes a SQL `LIKE` search on the name of the available companies and returns a pandas dataframe with the columns
'name' and 'cik' (the central index key, or the unique id of a company in the financial statements data sets).
The main purpose of this class is to find the cik for a company (of course, you can also directly search the cik on https://www.sec.gov/edgar/searchedgar/companysearch).

In [3]:
from secfsdstools.c_index.searching import IndexSearch

index_search = IndexSearch.get_index_search()
results = index_search.find_company_by_name("apple")
print(results)

2023-12-15 15:00:33,207 [INFO] configmgt  reading configuration from /home/sebacastillo/.secfsdstools.cfg


                             name      cik
0       APPLE GREEN HOLDING, INC.  1510976
1    APPLE HOSPITALITY REIT, INC.  1418121
2                       APPLE INC   320193
3       APPLE ISPORTS GROUP, INC.  1134982
4          APPLE REIT EIGHT, INC.  1387361
5           APPLE REIT NINE, INC.  1418121
6          APPLE REIT SEVEN, INC.  1329011
7              APPLE REIT SIX INC  1277151
8            APPLE REIT TEN, INC.  1498864
9          APPLETON PAPERS INC/WI  1144326
10  DR PEPPER SNAPPLE GROUP, INC.  1418135
11   MAUI LAND & PINEAPPLE CO INC    63330
12          PINEAPPLE ENERGY INC.    22701
13  PINEAPPLE EXPRESS CANNABIS CO  1710495
14        PINEAPPLE EXPRESS, INC.  1654672
15       PINEAPPLE HOLDINGS, INC.    22701
16                PINEAPPLE, INC.  1654672


In [4]:
index_search.find_company_by_name("globant")


           name      cik
0  GLOBANT S.A.  1557860
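
The search for "apple" above also returns names such as "PINEAPPLE, INC.", which is what a substring pattern in a SQL `LIKE` search would produce. The following sketch reproduces that behaviour with the built-in `sqlite3` module. The table name `companies` and the `%fragment%` pattern are assumptions made for illustration; the three sample rows are taken from the outputs above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (name TEXT, cik INTEGER)")
conn.executemany(
    "INSERT INTO companies VALUES (?, ?)",
    [
        ("APPLE INC", 320193),
        ("PINEAPPLE, INC.", 1654672),
        ("GLOBANT S.A.", 1557860),
    ],
)


def find_by_name(fragment):
    # Assumed pattern: match the fragment anywhere in the name.
    # SQLite's LIKE is case-insensitive for ASCII letters by default.
    rows = conn.execute(
        "SELECT name, cik FROM companies WHERE name LIKE ? ORDER BY name",
        (f"%{fragment}%",),
    )
    return rows.fetchall()


print(find_by_name("apple"))
```

Because the match is a substring match, it is worth checking the returned names before picking a cik.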


Once you have the cik of a company, you can use the `CompanyIndexReader` to get information on available reports of a company.
To get an instance of the class, use the `get_company_index_reader` method and provide the cik parameter.

In [15]:
from secfsdstools.c_index.companyindexreading import CompanyIndexReader

apple_cik = 320193
apple_index_reader = CompanyIndexReader.get_company_index_reader(cik=apple_cik)

2023-12-15 15:05:45,326 [INFO] configmgt  reading configuration from /home/sebacastillo/.secfsdstools.cfg


First, you could use the method `get_latest_company_filing` which returns a dictionary with the latest filing of the company:

In [16]:
print(apple_index_reader.get_latest_company_filing())

{'adsh': '0000320193-23-000075', 'cik': 320193, 'name': 'APPLE INC', 'sic': 3571.0, 'countryba': 'US', 'stprba': 'CA', 'cityba': 'CUPERTINO', 'zipba': '95014', 'bas1': 'ONE APPLE PARK WAY', 'bas2': None, 'baph': '(408) 996-1010', 'countryma': 'US', 'stprma': 'CA', 'cityma': 'CUPERTINO', 'zipma': '95014', 'mas1': 'ONE APPLE PARK WAY', 'mas2': None, 'countryinc': 'US', 'stprinc': 'CA', 'ein': 942404110, 'former': 'APPLE INC', 'changed': 20070109.0, 'afs': '1-LAF', 'wksi': 0, 'fye': '0930', 'form': '8-K', 'period': 20230731, 'fy': nan, 'fp': None, 'filed': 20230803, 'accepted': '2023-08-03 16:30:00.0', 'prevrpt': 0, 'detail': 0, 'instance': 'aapl-20230803_htm.xml', 'nciks': 1, 'aciks': None}


Next, there are two methods which return the metadata of the reports that a company has filed. The result is either
returned as a list of `IndexReport` instances, if you use the method `get_all_company_reports`, or as a pandas dataframe, if
you use the method `get_all_company_reports_df`. Both methods can take an optional parameter forms, which defines the
types of reports that shall be returned. For instance, if you are only interested in the annual and quarterly reports,
set forms to `["10-K", "10-Q"]`.

In [17]:
# only show the annual reports of apple
print(apple_index_reader.get_all_company_reports_df(forms=["10-K"]))

                    adsh     cik       name  form     filed    period  \
0   0000320193-22-000108  320193  APPLE INC  10-K  20221028  20220930   
1   0000320193-21-000105  320193  APPLE INC  10-K  20211029  20210930   
2   0000320193-20-000096  320193  APPLE INC  10-K  20201030  20200930   
3   0000320193-19-000119  320193  APPLE INC  10-K  20191031  20190930   
4   0000320193-18-000145  320193  APPLE INC  10-K  20181105  20180930   
5   0000320193-17-000070  320193  APPLE INC  10-K  20171103  20170930   
6   0001628280-16-020309  320193  APPLE INC  10-K  20161026  20160930   
7   0001193125-15-356351  320193  APPLE INC  10-K  20151028  20150930   
8   0001193125-14-383437  320193  APPLE INC  10-K  20141027  20140930   
9   0001193125-13-416534  320193  APPLE INC  10-K  20131030  20130930   
10  0001193125-12-444068  320193  APPLE INC  10-K  20121031  20120930   
11  0001193125-11-282113  320193  APPLE INC  10-K  20111026  20110930   
12  0001193125-10-238044  320193  APPLE INC  10-K  

Note: the entries in the url column above directly open the filing of that report on the sec.gov website.

## Collect: collecting the data for reports
The previously introduced `IndexSearch` and `CompanyIndexReader` let you know what data is available, but they do not
return the real data of the financial statements. This is what the `Collector` classes are used for.

All the `Collector` classes have their own factory method(s) which instantiates the class. Most of these factory methods
also provide parameters to filter the data directly when being loaded from the parquet files.
These are
* the `forms_filter` <br> lets you select which report type should be loaded (e.g. "10-K" or "10-Q").<br>
  Note: the forms filter affects all dataframes (sub, pre, num).
* the `stmt_filter` <br> defines the statements that should be loaded (e.g., "BS" if only "Balance Sheet" data should be loaded) <br>
  Note: the stmt filter only affects the pre dataframe.
* the `tag_filter` <br> defines the tags, that should be loaded (e.g., "Assets" if only the "Assets" tag should be loaded) <br>
  Note: the tag filter affects the pre and num dataframes.

It is also possible to apply filters for these attributes after the data is loaded, but since the `Collector` classes
apply these filters directly during the load process from the parquet files (which means that less data is loaded from
the disk), filtering at load time is generally more efficient.
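
The notes on which dataframe each filter affects can be summarized in a toy function. Plain lists of dicts stand in for the sub, pre, and num dataframes, and `load_filtered` is a hypothetical helper, not a function of the library.

```python
def load_filtered(sub, pre, num, forms_filter=None, stmt_filter=None, tag_filter=None):
    """Toy illustration of which table each filter touches (rows are dicts)."""
    if forms_filter:
        # forms filter: restrict sub, then drop pre/num rows of the other reports
        sub = [r for r in sub if r["form"] in forms_filter]
        kept = {r["adsh"] for r in sub}
        pre = [r for r in pre if r["adsh"] in kept]
        num = [r for r in num if r["adsh"] in kept]
    if stmt_filter:
        # stmt filter: only the pre rows carry a statement type
        pre = [r for r in pre if r["stmt"] in stmt_filter]
    if tag_filter:
        # tag filter: both pre and num rows carry a tag
        pre = [r for r in pre if r["tag"] in tag_filter]
        num = [r for r in num if r["tag"] in tag_filter]
    return sub, pre, num


# made-up sample rows
sub = [{"adsh": "A", "form": "10-K"}, {"adsh": "B", "form": "10-Q"}]
pre = [
    {"adsh": "A", "stmt": "BS", "tag": "Assets"},
    {"adsh": "A", "stmt": "IS", "tag": "Revenues"},
    {"adsh": "B", "stmt": "BS", "tag": "Assets"},
]
num = [
    {"adsh": "A", "tag": "Assets"},
    {"adsh": "A", "tag": "Revenues"},
    {"adsh": "B", "tag": "Assets"},
]

s, p, n = load_filtered(sub, pre, num, forms_filter=["10-K"], stmt_filter=["BS"])
print(len(s), len(p), len(n))  # 1 1 2
```

Note that the `Revenues` row is still present in the num rows after the statement filter: it only disappears later, when pre and num are joined.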

All `Collector` classes have a `collect` method which loads the data from the parquet files and returns an instance
of `RawDataBag`. The `RawDataBag` instance then contains a pandas dataframe for the `sub` (submission) data, the
`pre` (presentation) data, and the `num` (numeric values) data.

The framework provides the following collectors:

---
* `SingleReportCollector` <br> As the name suggests, this `Collector` returns the data of a single report. It is 
  instantiated by providing the `adsh` of the desired report as parameter of the `get_report_by_adsh` factory method, 
  or by using an instance of the `IndexReport` as parameter of the `get_report_by_indexreport`. (As a reminder: 
  instances of `IndexReport` are returned by the `CompanyIndexReader` class).

In [18]:
from secfsdstools.e_collector.reportcollecting import SingleReportCollector

apple_10k_2022_adsh = "0000320193-22-000108"

collector: SingleReportCollector = SingleReportCollector.get_report_by_adsh(adsh=apple_10k_2022_adsh)
rawdatabag = collector.collect()

# as expected, there is just one entry in the submission dataframe
print(rawdatabag.sub_df, '\n')

# just print the size of the pre and num dataframes
print(rawdatabag.pre_df.shape)
print(rawdatabag.num_df.shape)

2023-12-15 15:07:12,854 [INFO] configmgt  reading configuration from /home/sebacastillo/.secfsdstools.cfg


                   adsh     cik       name     sic countryba stprba  \
0  0000320193-22-000108  320193  APPLE INC  3571.0        US     CA   

      cityba  zipba                bas1  bas2  ...    period      fy  fp  \
0  CUPERTINO  95014  ONE APPLE PARK WAY  None  ...  20220930  2022.0  FY   

      filed               accepted prevrpt detail               instance  \
0  20221028  2022-10-27 18:01:00.0       0      1  aapl-20220924_htm.xml   

  nciks  aciks  
0     1   None  

[1 rows x 36 columns] 

(185, 10)
(503, 9)


In [19]:
rawdatabag.pre_df

Unnamed: 0,adsh,report,line,stmt,inpth,rfile,tag,version,plabel,negating
0,0000320193-22-000108,1,46,CP,0,H,AmendmentFlag,dei/2022,Amendment Flag,0
1,0000320193-22-000108,1,29,CP,0,H,CityAreaCode,dei/2022,City Area Code,0
2,0000320193-22-000108,1,18,CP,0,H,CurrentFiscalYearEndDate,dei/2022,Current Fiscal Year End Date,0
3,0000320193-22-000108,1,17,CP,0,H,DocumentAnnualReport,dei/2022,Document Annual Report,0
4,0000320193-22-000108,1,48,CP,0,H,DocumentFiscalPeriodFocus,dei/2022,Document Fiscal Period Focus,0
...,...,...,...,...,...,...,...,...,...,...
180,0000320193-22-000108,1,12,CP,0,H,A3.050NotesDue2029Member,0000320193-22-000108,3.050% Notes due 2029,0
181,0000320193-22-000108,1,14,CP,0,H,A3.600NotesDue2042Member,0000320193-22-000108,3.600% Notes due 2042,0
182,0000320193-22-000108,4,7,CI,0,H,OtherComprehensiveIncomeLossDerivativeInstrume...,0000320193-22-000108,Total change in unrealized gains/losses on der...,0
183,0000320193-22-000108,4,5,CI,0,H,OtherComprehensiveIncomeLossDerivativeInstrume...,0000320193-22-000108,Change in fair value of derivative instruments,0


---
* `MultiReportCollector` <br> Contrary to the `SingleReportCollector`, this `Collector` can collect data from several
  reports. Moreover, the data of the reports is loaded in parallel, which especially improves the performance if the
  reports are from different quarters (i.e., are in different zip files). The class provides the factory methods 
  `get_reports_by_adshs` and `get_reports_by_indexreports`. The first takes a list of adsh strings, the second a list
  of `IndexReport` instances.

In [20]:
from secfsdstools.e_collector.multireportcollecting import MultiReportCollector
apple_10k_2022_adsh = "0000320193-22-000108"
apple_10k_2012_adsh = "0001193125-12-444068"

# load only the assets tags that are present in the 10-K report of apple in the years
# 2022 and 2012
collector: MultiReportCollector = MultiReportCollector.get_reports_by_adshs(
                                              adshs=[apple_10k_2022_adsh, apple_10k_2012_adsh],
                                              tag_filter=['Assets'])
rawdatabag = collector.collect()
# as expected, there are just two entries in the submission dataframe
print(rawdatabag.sub_df, '\n')
print(rawdatabag.num_df)  

2023-12-15 15:07:54,210 [INFO] configmgt  reading configuration from /home/sebacastillo/.secfsdstools.cfg
2023-12-15 15:07:54,212 [INFO] parallelexecution      items to process: 2


2023-12-15 15:07:57,130 [INFO] parallelexecution      commited chunk: 0


                   adsh     cik       name     sic countryba stprba  \
0  0000320193-22-000108  320193  APPLE INC  3571.0        US     CA   
1  0001193125-12-444068  320193  APPLE INC  3571.0        US     CA   

      cityba  zipba                bas1  bas2  ...    period      fy  fp  \
0  CUPERTINO  95014  ONE APPLE PARK WAY  None  ...  20220930  2022.0  FY   
1  CUPERTINO  95014   ONE INFINITE LOOP  None  ...  20120930  2012.0  FY   

      filed               accepted prevrpt detail               instance  \
0  20221028  2022-10-27 18:01:00.0       0      1  aapl-20220924_htm.xml   
1  20121031  2012-10-31 17:07:00.0       0      1      aapl-20120929.xml   

  nciks  aciks  
0     1   None  
1     1   None  

[2 rows x 36 columns] 

                   adsh     tag       version coreg     ddate  qtrs  uom  \
0  0000320193-22-000108  Assets  us-gaap/2022        20210930     0  USD   
1  0000320193-22-000108  Assets  us-gaap/2022        20220930     0  USD   
2  0001193125-12-444068 

---
* `ZipCollector` <br> This `Collector` collects the data of one single zip file (or rather, of the folder that contains the parquet
  files of this zip file). Since the original zip file contains the data for one quarter, the name you provide
  in the `get_zip_by_name` factory method reflects the quarter whose data you want to load, e.g. `2022q1.zip`.

In [21]:
from secfsdstools.e_collector.zipcollecting import ZipCollector

# only collect the Balance Sheet of annual reports that
# were filed during the first quarter in 2022
collector: ZipCollector = ZipCollector.get_zip_by_name(name="2022q1.zip",
                                                       forms_filter=["10-K"],
                                                       stmt_filter=["BS"])

rawdatabag = collector.collect()

# only show the size of the data frame
# .. over 4000 companies filed a 10-K report in q1 2022
print(rawdatabag.sub_df.shape)
print(rawdatabag.pre_df.shape)
print(rawdatabag.num_df.shape)    

2023-12-15 15:08:02,623 [INFO] configmgt  reading configuration from /home/sebacastillo/.secfsdstools.cfg
2023-12-15 15:08:02,625 [INFO] parallelexecution      items to process: 1
2023-12-15 15:08:02,626 [INFO] zipcollecting  processing /home/sebacastillo/genai0/data/parquet/quarter/2022q1.zip


2023-12-15 15:08:07,957 [INFO] parallelexecution      commited chunk: 0


(4875, 36)
(232863, 10)
(2404949, 9)


---
* `CompanyReportCollector` <br> This class returns reports for one or more companies. The factory method 
  `get_company_collector` provides the parameter `ciks` which takes a list of cik numbers.

In [22]:
from secfsdstools.e_collector.companycollecting import CompanyReportCollector

apple_cik = 320193
# load the data for all 10-K (annual) reports of apple
collector = CompanyReportCollector.get_company_collector(ciks=[apple_cik],
                                                         forms_filter=["10-K"])

rawdatabag = collector.collect()

# all filed 10-K reports for apple since 2009 are in the databag
print(rawdatabag.sub_df)

print(rawdatabag.pre_df.shape)
print(rawdatabag.num_df.shape) 

2023-12-15 15:08:12,253 [INFO] configmgt  reading configuration from /home/sebacastillo/.secfsdstools.cfg


2023-12-15 15:08:15,741 [INFO] parallelexecution      items to process: 14
2023-12-15 15:08:26,843 [INFO] parallelexecution      commited chunk: 0


                    adsh     cik       name     sic countryba stprba  \
0   0000320193-22-000108  320193  APPLE INC  3571.0        US     CA   
1   0000320193-21-000105  320193  APPLE INC  3571.0        US     CA   
2   0000320193-20-000096  320193  APPLE INC  3571.0        US     CA   
3   0000320193-19-000119  320193  APPLE INC  3571.0        US     CA   
4   0000320193-18-000145  320193  APPLE INC  3571.0        US     CA   
5   0000320193-17-000070  320193  APPLE INC  3571.0        US     CA   
6   0001628280-16-020309  320193  APPLE INC  3571.0        US     CA   
7   0001193125-15-356351  320193  APPLE INC  3571.0        US     CA   
8   0001193125-14-383437  320193  APPLE INC  3571.0        US     CA   
9   0001193125-13-416534  320193  APPLE INC  3571.0        US     CA   
10  0001193125-12-444068  320193  APPLE INC  3571.0        US     CA   
11  0001193125-11-282113  320193  APPLE INC  3571.0        US     CA   
12  0001193125-10-238044  320193  APPLE INC  3571.0        US   

## Raw Processing: working with the raw data
When the `collect` method of a `Collector` class is called, the data for the sub, pre, and num dataframes is loaded
and stored in the sub_df, pre_df, and num_df attributes of an instance of `RawDataBag`.

The `RawDataBag` provides the following methods:
* `save`, `load`<br> The content of a `RawDataBag` can be saved into a directory. Within that directory, 
   parquet files are stored for the content of the sub_df, pre_df, and num_df. In order to load this 
   data directly, the static method `RawDataBag.load()` can be used.
* `concat`<br> Several instances of a `RawDataBag` can be concatenated in one single instance. In order to do 
   that, the static method `RawDataBag.concat()` takes a list of RawDataBag as parameter.
* `join` <br> This method produces a `JoinedDataBag` by joining the content of the pre_df and num_df
   based on the columns adsh, tag, and version. It is an inner join. The joined dataframe appears as pre_num_df in
   the `JoinedDataBag`.
* `filter` <br> The filter method takes a parameter of the type `FilterRaw`, applies it to the data and
   produces a new instance of `RawDataBag` with the filtered data. Therefore, filters can also be chained like
   `a_filtered_RawDataBag = a_RawDataBag.filter(filter1).filter(filter2)`. Moreover, the `__getitem__` method
   is forwarded to the filter method, so you can also write `a_filtered_RawDataBag = a_RawDataBag[filter1][filter2]`.
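
The chaining and the bracket notation can be illustrated with a small stand-in. `ToyBag`, `ToyStmtFilter`, and `ToyTagFilter` are hypothetical classes written for this sketch, and the assumption that a filter object exposes a `filter(bag)` method is made only for the illustration; the real `RawDataBag` holds pandas dataframes.

```python
class ToyBag:
    """Hypothetical stand-in for a data bag; holds a list of row dicts."""

    def __init__(self, rows):
        self.rows = rows

    def filter(self, a_filter):
        # A filter receives a bag and returns a new bag; the original stays untouched.
        return a_filter.filter(self)

    def __getitem__(self, a_filter):
        # bag[a_filter] is forwarded to bag.filter(a_filter)
        return self.filter(a_filter)


class ToyStmtFilter:
    def __init__(self, stmts):
        self.stmts = set(stmts)

    def filter(self, bag):
        return ToyBag([r for r in bag.rows if r["stmt"] in self.stmts])


class ToyTagFilter:
    def __init__(self, tags):
        self.tags = set(tags)

    def filter(self, bag):
        return ToyBag([r for r in bag.rows if r["tag"] in self.tags])


bag = ToyBag([
    {"stmt": "BS", "tag": "Assets"},
    {"stmt": "BS", "tag": "Liabilities"},
    {"stmt": "IS", "tag": "Revenues"},
])

chained = bag.filter(ToyStmtFilter(["BS"])).filter(ToyTagFilter(["Assets"]))
indexed = bag[ToyStmtFilter(["BS"])][ToyTagFilter(["Assets"])]
print(chained.rows == indexed.rows)  # True
```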

It is simple to write your own filters; just get some inspiration from the ones that are already present in the
framework (module `secfsdstools.e_filter.rawfiltering`):

* `AdshRawFilter` <br> Filters the `RawDataBag` instance based on the list of adshs that were provided in the constructor. <br>
   ````
   a_filtered_RawDataBag = a_RawDataBag.filter(AdshRawFilter(adshs=['0001193125-09-214859', '0001193125-10-238044']))
   ````
* `StmtRawFilter` <br> Filters the `RawDataBag` instance based on the list of statements ('BS', 'CF', 'IS', ...). <br>
   ````
   a_filtered_RawDataBag = a_RawDataBag.filter(StmtRawFilter(stmts=['BS', 'IS']))
   ````
* `TagRawFilter` <br> Filters the `RawDataBag` instance based on the list of tags that is provided. <br>
   ````
   a_filtered_RawDataBag = a_RawDataBag.filter(TagRawFilter(tags=['Assets', 'Liabilities']))
   ````
* `MainCoregRawFilter` <br> Filters the `RawDataBag` so that data of subsidiaries are removed.
   ````
   a_filtered_RawDataBag = a_RawDataBag.filter(MainCoregRawFilter()) 
   ````
* `ReportPeriodAndPreviousPeriodRawFilter` <br> The data of a report usually also contains data from previous years.
  However, often you want just to analyze the data of the current and the previous year. This filter ensures that
  only data for the current period and the previous period are contained in the data.
   ````
   a_filtered_RawDataBag = a_RawDataBag.filter(ReportPeriodAndPreviousPeriodRawFilter()) 
   ````
* `ReportPeriodRawFilter` <br> If you are only interested in the data of a report that belongs to the current period
  of that report, you can use this filter. For instance, if you use a `CompanyReportCollector` to collect all
  10-K reports of a company, you may want to ensure that every report only contains data for its own period and not for
  previous periods.
   ````
   a_filtered_RawDataBag = a_RawDataBag.filter(ReportPeriodRawFilter()) 
   ````
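
To give an idea of what the two period filters do, here is a sketch in plain Python. It assumes that a numeric row belongs to the report period when its `ddate` equals the `period` of the report in the sub data, and that the previous period lies exactly one year earlier; both are simplifying assumptions and not the library's actual implementation. The helper name and the sample rows are made up.

```python
def keep_report_periods(sub_rows, num_rows, include_previous=False):
    """Keep the num rows whose ddate matches the period of their report."""
    period_by_adsh = {r["adsh"]: r["period"] for r in sub_rows}
    kept = []
    for row in num_rows:
        period = period_by_adsh[row["adsh"]]
        allowed = {period}
        if include_previous:
            # dates are yyyymmdd integers, so one year earlier is period - 10000
            allowed.add(period - 10000)
        if row["ddate"] in allowed:
            kept.append(row)
    return kept


# made-up sample rows
sub_rows = [{"adsh": "toy-1", "period": 20220930}]
num_rows = [
    {"adsh": "toy-1", "tag": "Assets", "ddate": 20200930},
    {"adsh": "toy-1", "tag": "Assets", "ddate": 20210930},
    {"adsh": "toy-1", "tag": "Assets", "ddate": 20220930},
]

print([r["ddate"] for r in keep_report_periods(sub_rows, num_rows)])
print([r["ddate"] for r in keep_report_periods(sub_rows, num_rows, include_previous=True)])
```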


## Joined Processing: working with joined data
When the `join` method of a `RawDataBag` instance is called, an instance of `JoinedDataBag` is returned. The returned
instance contains an attribute sub_df, which is a reference to the same sub_df that is in the `RawDataBag`.
In addition to that, the `JoinedDataBag` contains an attribute pre_num_df, which is an inner join of the pre_df and 
the num_df based on the columns adsh, tag, and version. Note that an entry in the pre_df can be joined with more than 
one entry in the num_df.
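
The effect of the inner join can be shown with a few made-up rows. The function below is a plain-Python sketch of an inner join on the three key columns, not the library's code (which works on pandas dataframes); `inner_join` and the sample data are hypothetical.

```python
def inner_join(pre_rows, num_rows, keys=("adsh", "tag", "version")):
    """Pair every pre row with every num row that has the same key values."""
    num_by_key = {}
    for num in num_rows:
        num_by_key.setdefault(tuple(num[k] for k in keys), []).append(num)
    joined = []
    for pre in pre_rows:
        for num in num_by_key.get(tuple(pre[k] for k in keys), []):
            joined.append({**pre, **num})
    return joined


# made-up sample rows
pre_rows = [
    {"adsh": "toy-1", "tag": "Assets", "version": "us-gaap/2022", "stmt": "BS"},
    {"adsh": "toy-1", "tag": "Liabilities", "version": "us-gaap/2022", "stmt": "BS"},
]
num_rows = [
    {"adsh": "toy-1", "tag": "Assets", "version": "us-gaap/2022", "ddate": 20210930},
    {"adsh": "toy-1", "tag": "Assets", "version": "us-gaap/2022", "ddate": 20220930},
    {"adsh": "toy-1", "tag": "Revenues", "version": "us-gaap/2022", "ddate": 20220930},
]

for row in inner_join(pre_rows, num_rows):
    print(row["tag"], row["stmt"], row["ddate"])
```

The single `Assets` pre row appears twice in the result, once per date. `Liabilities` (no num row) and `Revenues` (no pre row) are dropped, because an inner join keeps only keys that are present on both sides.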

The `JoinedDataBag` provides the following methods:
* `save`, `load`<br> The content of a `JoinedDataBag` can be saved into a directory. Within that directory,
  parquet files are stored for the content of the sub_df and the pre_num_df. In order to load this
  data directly, the static method `JoinedDataBag.load()` can be used.
* `concat`<br> Several instances of a `JoinedDataBag` can be concatenated in one single instance. In order to do
  that, the static method `JoinedDataBag.concat()` takes a list of `JoinedDataBag` instances as parameter.
* `filter` <br> The filter method takes a parameter of the type `FilterJoined`, applies it to the data and
  produces a new instance of `JoinedDataBag` with the filtered data. Therefore, filters can also be chained like
  `a_filtered_JoinedDataBag = a_JoinedDataBag.filter(filter1).filter(filter2)`. Moreover, the `__getitem__` method
  is forwarded to the filter method, so you can also write `a_filtered_JoinedDataBag = a_JoinedDataBag[filter1][filter2]`.
  **Note**: There aren't any filters for the JoinedDataBag in the framework yet. However, you can write them in the same
  way as a filter for a `RawDataBag` is written.
* `present` <br> The idea of the present method is to make a final presentation of the data as a pandas dataframe. 
  The method has a parameter presenter of type Presenter.
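
Conceptually, a presenter turns the joined rows into one table. A common shape for financial statements is one row per position and one column per date. The sketch below shows such a pivot in plain Python with made-up numbers; `pivot_by_date` is a hypothetical helper and not the implementation of any presenter in the library.

```python
def pivot_by_date(joined_rows):
    """One output row per tag, one column per ddate (newest date first)."""
    dates = sorted({r["ddate"] for r in joined_rows}, reverse=True)
    values_by_tag = {}
    for r in joined_rows:
        # dicts keep insertion order, so tags stay in order of first appearance
        values_by_tag.setdefault(r["tag"], {})[r["ddate"]] = r["value"]
    return [
        {"tag": tag, **{d: by_date.get(d) for d in dates}}
        for tag, by_date in values_by_tag.items()
    ]


# made-up sample rows with toy values
joined_rows = [
    {"tag": "Assets", "ddate": 20220930, "value": 100},
    {"tag": "Assets", "ddate": 20210930, "value": 90},
    {"tag": "Liabilities", "ddate": 20220930, "value": 60},
]

for row in pivot_by_date(joined_rows):
    print(row)
```

A position without a value for one of the dates (here `Liabilities` for 20210930) gets `None` in that column.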

## Present
It is simple to write your own presenter classes. So far, the framework provides the following Presenter 
implementations (module `secfsdstools.e_presenter.presenting`):

* `StandardStatementPresenter` <br> This presenter provides the data in the same form as you are used to seeing in
  the reports themselves. For instance, the primary financial statements (balance sheet, income statement, and cash flow)
  display the different positions in rows, and the columns contain the different dates/periods of the data.
  Let us say you want to recreate the BS information of Apple's 10-K report of 2022; you would write:
 

In [23]:
from secfsdstools.e_collector.reportcollecting import SingleReportCollector
from secfsdstools.e_filter.rawfiltering import ReportPeriodAndPreviousPeriodRawFilter
from secfsdstools.e_presenter.presenting import StandardStatementPresenter

apple_10k_2022_adsh = "0000320193-22-000108"

collector: SingleReportCollector = SingleReportCollector.get_report_by_adsh(
    adsh=apple_10k_2022_adsh,
    stmt_filter=["BS"]
)
rawdatabag = collector.collect()
bs_df = (rawdatabag.filter(ReportPeriodAndPreviousPeriodRawFilter())
                .join()
                .present(StandardStatementPresenter()))
print(bs_df) 

2023-12-15 15:08:36,476 [INFO] configmgt  reading configuration from /home/sebacastillo/.secfsdstools.cfg


                    adsh coreg  \
0   0000320193-22-000108         
1   0000320193-22-000108         
2   0000320193-22-000108         
3   0000320193-22-000108         
4   0000320193-22-000108         
5   0000320193-22-000108         
6   0000320193-22-000108         
7   0000320193-22-000108         
8   0000320193-22-000108         
9   0000320193-22-000108         
10  0000320193-22-000108         
11  0000320193-22-000108         
12  0000320193-22-000108         
13  0000320193-22-000108         
14  0000320193-22-000108         
15  0000320193-22-000108         
16  0000320193-22-000108         
17  0000320193-22-000108         
18  0000320193-22-000108         
19  0000320193-22-000108         
20  0000320193-22-000108         
21  0000320193-22-000108         
22  0000320193-22-000108         
23  0000320193-22-000108         
24  0000320193-22-000108         
25  0000320193-22-000108         
26  0000320193-22-000108         
27  0000320193-22-000108         
28  0000320193

 If you compare this with the real report at https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019322000108/aapl-20220924.htm
  you will notice that the order of the tags and the values are the same.

## What to do next
Definitely check out the notebook "03_explore_with_interactive_notebook.ipynb", which shows some examples of how the data can be explored interactively in Jupyter.