# Working with Snapshot Files

This notebook shows how to read and use content from Snapshot files, either from a local folder or from a specific file.

In this notebook...
* [Dependencies and Initialisation](#Dependencies-and-Initialisation)
* [Supported File Formats](#Supported-File-Formats)
* [Load a single AVRO file to a Pandas DataFrame](#Load-a-single-AVRO-file-to-a-Pandas-DataFrame)
* [Load all files within a folder](#Load-all-files-within-a-folder)


## Dependencies and Initialisation
Import statements and environment initialisation using the package `dotenv`. More details in the [Configuration notebook](0.2_configuration.ipynb).

In [1]:
from factiva.news import SnapshotFiles
print('Done!')

Done!


## Supported File Formats

At the moment, `factiva-*` packages only support `AVRO` files. Support for additional formats is under development.

## Load a single AVRO file to a Pandas DataFrame

This operation uses the method `<SnapshotFiles>.read_file`, which accepts the following parameters:

* **`filepath`**: Relative or absolute file path
* **`only_stats`**: _Optional_. Specifies whether only file metadata is loaded (True) or the full article content is loaded (False). On average, `only_stats=True` loads about 1/10 of the data and is recommended for quick metadata-based analysis. (Default is False)
* **`merge_body`**: _Optional_. Specifies whether the snippet field is merged into the body field, after which the snippet column is dropped. (Default is False)

In [2]:
sf = SnapshotFiles()
articles = sf.read_file('./data/part-000000000000.avro', only_stats=True)

In [3]:
articles.columns

Index(['an', 'company_codes', 'company_codes_about', 'company_codes_occur',
       'industry_codes', 'ingestion_datetime', 'language_code',
       'modification_datetime', 'publication_datetime', 'publisher_name',
       'region_codes', 'region_of_origin', 'source_code', 'source_name',
       'subject_codes', 'title', 'word_count'],
      dtype='object')

In [4]:
articles[['an', 'publication_datetime', 'title', 'industry_codes', 'language_code']].head()

| | an | publication_datetime | title | industry_codes | language_code |
|---|---|---|---|---|---|
| 0 | T000000020170922ed9m0006v | 2017-09-22 00:00:00.000 | Battery business plugs into electric car market | ,i3432,i35104,i353,i351,iaut,iindele,iindstrls... | en |
| 1 | DJDN000020160805ec8500186 | 2016-08-05 09:00:00.000 | Press Release: Magna Announces Record Second Q... | ,i3432,i353,iaut,iindele,iindstrls,itech, | en |
| 2 | CHNDLY0020210630eh6u00006 | 2021-06-30 00:00:00.000 | CATL prospects brighten on Tesla deal | ,i3432,i35104,i351,iaut,iindele,iindstrls,itech, | en |
| 3 | RTDJGE0020161130ecbu000ig | 2016-11-30 15:35:43.463 | Batteriehersteller/Varta verschiebt geplanten ... | ,i3432,iindele,iindstrls,itech, | de |
| 4 | DJDN000020200923eg9n0020e | 2020-09-23 14:38:27.054 | Global Energy Roundup: Market Talk | ,i3432,i1,i25,i342,i35101,i35104,iindstrls,i35... | en |
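The `industry_codes` values in the output above are single strings in which the codes are separated by commas, with a leading and a trailing comma. The following is a minimal sketch of how such a value could be turned into a list of codes. The helper name `split_codes` is hypothetical and not part of the `factiva-*` packages, and it assumes the other `*_codes` columns follow the same layout.

```python
def split_codes(raw):
    """Split a comma-delimited code string into a list of codes.

    Hypothetical helper for illustration. Empty entries produced by the
    leading and trailing commas are discarded, and non-string values
    (for example missing values) yield an empty list.
    """
    if not isinstance(raw, str):
        return []
    return [code for code in raw.split(',') if code]


# Value copied from row 1 of the output above
sample = ',i3432,i353,iaut,iindele,iindstrls,itech,'
codes = split_codes(sample)
print(codes)
```

With a loaded DataFrame, the same helper could be applied to the whole column, for example with `articles['industry_codes'].apply(split_codes)`.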


## Load all files within a folder

This operation uses the method `<SnapshotFiles>.read_folder` to load all files with the same extension into a single Pandas DataFrame.

* **`folderpath`**: Relative or absolute folder path
* **`file_format`**: `AVRO` is the only format available at the moment
* **`only_stats`**: _Optional_. Specifies whether only file metadata is loaded (True) or the full article content is loaded (False). On average, `only_stats=True` loads about 1/10 of the data and is recommended for quick metadata-based analysis. (Default is False)
* **`merge_body`**: _Optional_. Specifies whether the snippet field is merged into the body field, after which the snippet column is dropped. (Default is False)
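Before loading a folder, it can help to check which files share the expected extension. The sketch below uses only the standard library and is independent of how the package selects files internally, which this notebook does not show. The helper name `list_snapshot_files`, the second file name and `notes.txt` are hypothetical, and a temporary folder stands in for a real Snapshot folder so the example runs anywhere.

```python
import tempfile
from pathlib import Path


def list_snapshot_files(folderpath, extension='avro'):
    """Return the sorted files in `folderpath` that end with `.extension`.

    Hypothetical helper for illustration, not part of the factiva packages.
    """
    folder = Path(folderpath)
    if not folder.is_dir():
        raise NotADirectoryError(f'{folderpath} is not a folder')
    suffix = f'.{extension.lower()}'
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == suffix
    )


# Demo on a temporary folder holding two AVRO files and one unrelated file
with tempfile.TemporaryDirectory() as tmp:
    for name in ('part-000000000000.avro', 'part-000000000001.avro', 'notes.txt'):
        (Path(tmp) / name).touch()
    found = [p.name for p in list_snapshot_files(tmp)]

print(found)
```

Pointing the helper at the folder passed as `folderpath` gives a quick view of how many files a load is likely to cover.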

In [5]:
from factiva.news import SnapshotFiles
sf = SnapshotFiles()
articles = sf.read_folder(folderpath='./data/ztj2gkbldt', only_stats=False, merge_body=True)

In [6]:
articles.columns

Index(['copyright', 'subject_codes', 'modification_datetime', 'body',
       'company_codes_occur_ticker_exchange', 'company_codes_occur',
       'company_codes_about', 'company_codes_lineage',
       'company_codes_ticker_exchange',
       'company_codes_relevance_ticker_exchange', 'market_index_codes',
       'section', 'company_codes_association_ticker_exchange',
       'currency_codes', 'company_codes_about_ticker_exchange',
       'region_of_origin', 'company_codes_lineage_ticker_exchange',
       'ingestion_datetime', 'modification_date', 'source_name',
       'language_code', 'region_codes', 'company_codes_association',
       'person_codes', 'byline', 'dateline', 'company_codes_relevance',
       'source_code', 'an', 'word_count', 'company_codes', 'industry_codes',
       'title', 'publication_datetime', 'publisher_name', 'action'],
      dtype='object')
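To see what a full load adds over `only_stats=True`, the two column listings printed in this notebook can be compared with plain set operations. This is a sketch with the column names copied from the outputs above; other Snapshot extractions may expose different columns.

```python
# Columns printed after read_file(..., only_stats=True)
stats_columns = [
    'an', 'company_codes', 'company_codes_about', 'company_codes_occur',
    'industry_codes', 'ingestion_datetime', 'language_code',
    'modification_datetime', 'publication_datetime', 'publisher_name',
    'region_codes', 'region_of_origin', 'source_code', 'source_name',
    'subject_codes', 'title', 'word_count',
]

# Columns printed after read_folder(..., only_stats=False, merge_body=True)
full_columns = [
    'copyright', 'subject_codes', 'modification_datetime', 'body',
    'company_codes_occur_ticker_exchange', 'company_codes_occur',
    'company_codes_about', 'company_codes_lineage',
    'company_codes_ticker_exchange',
    'company_codes_relevance_ticker_exchange', 'market_index_codes',
    'section', 'company_codes_association_ticker_exchange',
    'currency_codes', 'company_codes_about_ticker_exchange',
    'region_of_origin', 'company_codes_lineage_ticker_exchange',
    'ingestion_datetime', 'modification_date', 'source_name',
    'language_code', 'region_codes', 'company_codes_association',
    'person_codes', 'byline', 'dateline', 'company_codes_relevance',
    'source_code', 'an', 'word_count', 'company_codes', 'industry_codes',
    'title', 'publication_datetime', 'publisher_name', 'action',
]

# Columns that only appear when the full content is loaded
full_only = sorted(set(full_columns) - set(stats_columns))
# Metadata columns missing from the full load (expected to be none)
stats_only = sorted(set(stats_columns) - set(full_columns))

print(len(full_only), 'columns only in the full load:', full_only)
print('columns only in the stats load:', stats_only)
```

In these two listings every metadata column also appears in the full load, and the full load adds `body` along with the additional code and ticker columns.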

In [7]:
articles[['an', 'publication_datetime', 'title', 'industry_codes', 'language_code']].head()

| | an | publication_datetime | title | industry_codes | language_code |
|---|---|---|---|---|---|
| 0 | T000000020170922ed9m0006v | 2017-09-22 00:00:00.000 | Battery business plugs into electric car market | ,i3432,i35104,i353,i351,iaut,iindele,iindstrls... | en |
| 1 | DJDN000020160805ec8500186 | 2016-08-05 09:00:00.000 | Press Release: Magna Announces Record Second Q... | ,i3432,i353,iaut,iindele,iindstrls,itech, | en |
| 2 | CHNDLY0020210630eh6u00006 | 2021-06-30 00:00:00.000 | CATL prospects brighten on Tesla deal | ,i3432,i35104,i351,iaut,iindele,iindstrls,itech, | en |
| 3 | RTDJGE0020161130ecbu000ig | 2016-11-30 15:35:43.463 | Batteriehersteller/Varta verschiebt geplanten ... | ,i3432,iindele,iindstrls,itech, | de |
| 4 | DJDN000020200923eg9n0020e | 2020-09-23 14:38:27.054 | Global Energy Roundup: Market Talk | ,i3432,i1,i25,i342,i35101,i35104,iindstrls,i35... | en |
