# Snapshot Extraction

This notebook shows how to run a Snapshot Extraction operation with minimal steps and a simple query.

In this notebook...
* [Dependencies and Initialisation](#Dependencies-and-Initialisation)
* [The Where Statement](#The-Where-Statement)
* [Load a Saved Query](#Load-a-Saved-Query)
* [Save the query for other operations](#Save-the-query-for-other-operations)
* [Extraction Query Options](#Extraction-Query-Options)
* [Running the Extraction Operation](#Running-the-Extraction-Operation)
* [Load the downloaded AVRO files to a Pandas DataFrame](#Load-the-downloaded-AVRO-files-to-a-Pandas-DataFrame)

## Dependencies and Initialisation
Import statements and environment initialisation using the package `dotenv`. More details in the [Configuration notebook](0.2_configuration.ipynb).

In [1]:
from factiva.news import Snapshot
from dotenv import load_dotenv
load_dotenv()
print('Done!')

Done!


## The Where Statement

This notebook uses a simple query for illustration purposes. For more tips about queries, or guidance on how to build complex or large queries, check out the [Complex and Large Queries](2.1_complex_large_queries.ipynb) notebook.

In [2]:
# Industry i3432 is for Batteries
where_statement = (
    r" publication_datetime >= '2016-01-01 00:00:00' "
    r" AND LOWER(language_code) IN ('en', 'de', 'fr') "
    r" AND REGEXP_CONTAINS(industry_codes, r'(?i)(^|,)(i3432)($|,)') "
)

s = Snapshot(query=where_statement)
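
The `REGEXP_CONTAINS` condition above relies on the `(^|,)` and `($|,)` anchors to match a whole code inside a comma-separated list. The sketch below reproduces that matching logic locally with Python's `re` module, purely as an illustration: the service evaluates the expression on its side, and the helper name `has_battery_code` is hypothetical.

```python
import re

# Same pattern as in the where statement. Python's `re` is used here only
# to illustrate the logic; it is not how the service evaluates the query.
BATTERY_CODE = re.compile(r"(?i)(^|,)(i3432)($|,)")


def has_battery_code(industry_codes: str) -> bool:
    """Return True when i3432 appears as a complete code in the list."""
    return BATTERY_CODE.search(industry_codes) is not None


samples = [
    ",i3432,i353,iaut,",  # code delimited by commas
    "i3432",              # code is the entire value
    ",I3432,",            # (?i) makes the match case-insensitive
    ",i34321,iaut,",      # a longer code that only shares the prefix
    ",xi3432,",           # a longer code that only shares the suffix
]

for value in samples:
    print(f"{value!r}: {has_battery_code(value)}")
```

Without the anchors, a plain substring search for `i3432` would also accept the last two samples.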

## Load a Saved Query

Loads a query that was saved to a JSON file. **To be implemented!**

In [None]:
# Load saved query code

## Save the query for other operations

Saves the query to a JSON file so that other operations can reuse it. **To be implemented!**

In [None]:
# Save query code
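
Until the package provides save and load methods, a query can be persisted with the standard library. The following is a minimal sketch and not part of the `factiva.news` API: the functions `save_query` and `load_query`, and the JSON layout with a `where` key, are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def save_query(where: str, path, **options) -> None:
    """Write the where statement and any extra options to a JSON file."""
    payload = {"where": where, **options}
    Path(path).write_text(json.dumps(payload, indent=2), encoding="utf-8")


def load_query(path) -> dict:
    """Read back a query saved by save_query()."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Round trip in a temporary folder so the example leaves no files behind.
example_where = (
    " publication_datetime >= '2016-01-01 00:00:00' "
    " AND LOWER(language_code) IN ('en', 'de', 'fr') "
)
with tempfile.TemporaryDirectory() as tmp:
    query_file = Path(tmp) / "batteries_query.json"  # hypothetical file name
    save_query(example_where, query_file, file_format="avro", limit=0)
    restored = load_query(query_file)

print(restored["where"] == example_where)
```

The restored statement can then be passed to the constructor used earlier, as in `Snapshot(query=restored['where'])`.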

## Extraction Query Options

An extraction query can use more parameters:

* **`file_format`**: _Optional_, _Default: `'avro'`_. File format to be used for Extractions. Possible values are `'avro'`, `'csv'` or `'json'`. Used only by the Extraction operation.
* **`limit`**: _Optional_, _Default: `0` (No limit)_. Non-negative integer that limits the number of documents to extract; `0` means no limit. Used only by the Extraction operation.

In [3]:
s.query.file_format = 'avro'
# s.query.limit = 1000     # Uncomment this line to set a max number of extracted documents
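
Because an extraction consumes allowance, it can be worth checking these two options before submitting the job. The helper below is a hypothetical sketch (it is not part of the package) that encodes only the rules listed above: three accepted formats and a non-negative integer limit.

```python
ALLOWED_FORMATS = ("avro", "csv", "json")


def validate_extraction_options(file_format="avro", limit=0):
    """Return normalised (file_format, limit) or raise ValueError."""
    fmt = str(file_format).strip().lower()
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"file_format must be one of {ALLOWED_FORMATS}, got {file_format!r}")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(limit, bool) or not isinstance(limit, int) or limit < 0:
        raise ValueError(f"limit must be a non-negative integer, got {limit!r}")
    return fmt, limit


fmt, limit = validate_extraction_options("AVRO", 1000)
print(fmt, limit)
```

The validated values can then be assigned to `s.query.file_format` and `s.query.limit` as in the cell above.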

## Running the Extraction Operation

This operation submits the query as an Extraction job against the Factiva Analytics archive, waits for the job to finish, and downloads the matching documents in the selected `file_format`. Unlike Explain or Analytics, which only report volumes, it delivers the documents themselves.

**This operation will decrement 1 extraction from your allowance**

Because an Extraction job decrements the account's allowance, it is highly recommended to run it only after verifying that the [Explain](1.4_snapshot_explain.ipynb) and/or [Analytics](1.5_snapshot_analytics.ipynb) jobs return volumes within the expected range.

The `<Snapshot>.process_extraction()` function runs the complete job and downloads the resulting files. If a more manual process is required (send job, monitor job, get results), please see the [detailed package documentation](https://factiva-news-python.readthedocs.io/).

>**Note**: when checking volumes beforehand, the sum of broken-down volumes in Analytics may not add up to the total reported by the Explain operation, because Analytics filters estimations according to the top sources by volume.

In [4]:
%%time
s.process_extraction()
print('Done!')

Done!
CPU times: user 5.27 s, sys: 5.08 s, total: 10.3 s
Wall time: 5min
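
The manual flow mentioned above (send job, monitor job, get results) comes down to a polling loop. The sketch below shows that loop in generic form and does not use the package's API: `wait_for_job`, the `get_state` callable, and the state names `"DONE"` and `"FAILED"` are placeholders, so consult the package documentation for the real method and state names.

```python
import time


def wait_for_job(get_state, poll_seconds=10.0, max_polls=360, sleep=time.sleep):
    """Poll get_state() until the job is done; return the number of polls.

    get_state is any callable returning the current job state as a string.
    The state names used here are placeholders, not the service's values.
    """
    for attempt in range(1, max_polls + 1):
        state = get_state()
        if state == "DONE":
            return attempt
        if state == "FAILED":
            raise RuntimeError("extraction job failed")
        sleep(poll_seconds)
    raise TimeoutError(f"job not finished after {max_polls} polls")


# Simulated job that completes on the third status check; the no-op sleep
# keeps the demonstration instantaneous.
fake_states = iter(["PENDING", "RUNNING", "DONE"])
polls = wait_for_job(lambda: next(fake_states), sleep=lambda seconds: None)
print(polls)
```

Capping the number of polls keeps a stuck job from blocking the notebook indefinitely.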


## Load the downloaded AVRO files to a Pandas DataFrame
Results are stored in a folder named after the Job ID property (`<Snapshot>.last_extraction_job.job_id`). The `SnapshotFiles` helper loads the folder's contents into a DataFrame.

In [12]:
s.last_extraction_job.job_id

'ztj2gkbldt'

In [1]:
from factiva.news import SnapshotFiles
sf = SnapshotFiles()
articles = sf.read_folder(s.last_extraction_job.job_id)
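
For readers who prefer not to depend on the helper, the same result can be approximated with general-purpose libraries. This is a sketch under stated assumptions and not a description of how `SnapshotFiles` works internally: it assumes the downloaded files carry an `.avro` extension and that the third-party packages `fastavro` and `pandas` are installed. The function names are hypothetical.

```python
import tempfile
from pathlib import Path


def list_avro_files(folder):
    """Return the .avro files found directly inside folder, sorted by name."""
    return sorted(Path(folder).glob("*.avro"))


def read_avro_folder(folder):
    """Load every AVRO file in folder into a single pandas DataFrame."""
    # Imported lazily so that listing files works without these packages.
    import fastavro
    import pandas as pd

    records = []
    for path in list_avro_files(folder):
        with path.open("rb") as handle:
            records.extend(fastavro.reader(handle))
    return pd.DataFrame.from_records(records)


# Demonstrate the file discovery step on a throwaway folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "part-000.avro").touch()
    (Path(tmp) / "notes.txt").touch()
    found = [path.name for path in list_avro_files(tmp)]

print(found)
```

With a real extraction, `read_avro_folder(s.last_extraction_job.job_id)` would be the counterpart of the `read_folder` call above, provided the folder sits in the current working directory.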

In [4]:
articles.columns

Index(['copyright', 'subject_codes', 'modification_datetime', 'body',
       'company_codes_occur_ticker_exchange', 'company_codes_occur',
       'company_codes_about', 'company_codes_lineage',
       'company_codes_ticker_exchange', 'snippet',
       'company_codes_relevance_ticker_exchange', 'market_index_codes',
       'section', 'company_codes_association_ticker_exchange',
       'currency_codes', 'company_codes_about_ticker_exchange',
       'region_of_origin', 'company_codes_lineage_ticker_exchange',
       'ingestion_datetime', 'modification_date', 'source_name',
       'language_code', 'region_codes', 'company_codes_association',
       'person_codes', 'byline', 'dateline', 'company_codes_relevance',
       'source_code', 'an', 'word_count', 'company_codes', 'industry_codes',
       'title', 'publication_datetime', 'publisher_name', 'action'],
      dtype='object')

In [3]:
articles[['an', 'publication_datetime', 'title', 'industry_codes', 'language_code']].head()

| | an | publication_datetime | title | industry_codes | language_code |
|---|---|---|---|---|---|
| 0 | T000000020170922ed9m0006v | 2017-09-22 00:00:00.000 | Battery business plugs into electric car market | ,i3432,i35104,i353,i351,iaut,iindele,iindstrls... | en |
| 1 | DJDN000020160805ec8500186 | 2016-08-05 09:00:00.000 | Press Release: Magna Announces Record Second Q... | ,i3432,i353,iaut,iindele,iindstrls,itech, | en |
| 2 | CHNDLY0020210630eh6u00006 | 2021-06-30 00:00:00.000 | CATL prospects brighten on Tesla deal | ,i3432,i35104,i351,iaut,iindele,iindstrls,itech, | en |
| 3 | RTDJGE0020161130ecbu000ig | 2016-11-30 15:35:43.463 | Batteriehersteller/Varta verschiebt geplanten ... | ,i3432,iindele,iindstrls,itech, | de |
| 4 | DJDN000020200923eg9n0020e | 2020-09-23 14:38:27.054 | Global Energy Roundup: Market Talk | ,i3432,i1,i25,i342,i35101,i35104,iindstrls,i35... | en |
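
As the output shows, `industry_codes` is a single string with leading and trailing commas. A small helper can turn it into a list for counting or filtering. This is a minimal sketch: `split_codes` is a hypothetical name, and the sample values are the untruncated ones copied from the rows above.

```python
from collections import Counter


def split_codes(value):
    """Split a ',code1,code2,' string into ['code1', 'code2']."""
    if not value:
        return []
    return [code for code in value.split(",") if code]


sample_rows = [
    ",i3432,i353,iaut,iindele,iindstrls,itech,",
    ",i3432,iindele,iindstrls,itech,",
]

# Count how often each code appears across the sample rows.
code_counts = Counter(code for row in sample_rows for code in split_codes(row))
print(code_counts.most_common(3))
```

On the full dataset the same function can be applied column-wise with `articles['industry_codes'].apply(split_codes)`.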
