# Snapshot Extraction

This notebook shows how to run a Snapshot Extraction operation with minimal steps and a simple query.

In this notebook...
* [Dependencies and Initialisation](#Dependencies-and-Initialisation)
* [The Where Statement](#The-Where-Statement)
* [Load a Saved Query](#Load-a-Saved-Query)
* [Save the query for other operations](#Save-the-query-for-other-operations)
* [Extraction Query Options](#Extraction-Query-Options)
* [Running the Extraction Operation](#Running-the-Extraction-Operation)
* [Load the downloaded AVRO files to a Pandas DataFrame](#Load-the-downloaded-AVRO-files-to-a-Pandas-DataFrame)

## Dependencies and Initialisation
Import statements and environment initialisation using the package `dotenv`. More details in the [Configuration notebook](0.2_configuration.ipynb).

In [2]:
from factiva.news import Snapshot
from dotenv import load_dotenv
load_dotenv()
print('Done!')

Done!


## The Where Statement

This notebook uses a simple query for illustration purposes. For more tips about queries, or guidance on how to build complex or large queries, check out the [Complex and Large Queries](2.1_complex_large_queries.ipynb) notebook.

In [2]:
# Industry i3432 is for Batteries
where_statement = (
    r" publication_datetime >= '2016-01-01 00:00:00' "
    r" AND LOWER(language_code) IN ('en', 'de', 'fr') "
    r" AND REGEXP_CONTAINS(industry_codes, r'(?i)(^|,)(i3432)($|,)') "
)

s = Snapshot(query=where_statement)
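
To see what the `REGEXP_CONTAINS` pattern in the where statement matches, the same expression can be tried locally with Python's `re` module. This is only an illustrative sketch: the service evaluates the pattern on its side, the helper name `has_industry_code` is hypothetical, and the sample strings are made up to mirror the comma-delimited `industry_codes` values.

```python
import re

# Same pattern as in the where statement: the code must be a complete
# comma-delimited item, matched case-insensitively.
pattern = re.compile(r'(?i)(^|,)(i3432)($|,)')


def has_industry_code(industry_codes: str) -> bool:
    """Return True when i3432 appears as a whole item in the list (hypothetical helper)."""
    return pattern.search(industry_codes) is not None


# Made-up sample values for illustration only
print(has_industry_code(',i3432,iindele,iindstrls,itech,'))  # whole item in the list
print(has_industry_code(',I3432,'))                           # upper-case variant
print(has_industry_code(',i34321,itech,'))                    # longer code that only starts with i3432
```

The `(^|,)` and `($|,)` groups are what prevent a partial match against a longer code that merely contains `i3432`.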

## Load a Saved Query

Loads a query that was saved to a JSON file. **To be implemented!**

In [None]:
# Load saved query code

## Save the query for other operations

Saves the current query to a JSON file so that other operations can reuse it. **To be implemented!**

In [None]:
# Save query code

## Extraction Query Options

An extraction query can use more parameters:

* **`file_format`**: _Optional_, _Default: `'avro'`_. File format to be used for Extractions. Possible values are `'avro'`, `'csv'` or `'json'`. Used only by the Extraction operation.
* **`limit`**: _Optional_, _Default: `0` (No limit)_. Positive integer that limits the number of documents to extract. Used only by the Extraction operation.

In [3]:
s.query.file_format = 'avro'
# s.query.limit = 1000     # Uncomment this line to set a max number of extracted documents
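
The option values could be checked before they are assigned to the query. The helper below is a hypothetical sketch, not part of the `factiva.news` package: it only encodes the rules listed above (three allowed formats, and a limit that is `0` or a positive integer).

```python
ALLOWED_FORMATS = ('avro', 'csv', 'json')


def check_extraction_options(file_format='avro', limit=0):
    """Validate option values against the rules listed above (hypothetical helper)."""
    if file_format not in ALLOWED_FORMATS:
        raise ValueError(
            f"file_format must be one of {ALLOWED_FORMATS}, got {file_format!r}"
        )
    # bool is a subclass of int in Python, so it is rejected explicitly
    if isinstance(limit, bool) or not isinstance(limit, int) or limit < 0:
        raise ValueError(
            f"limit must be 0 (no limit) or a positive integer, got {limit!r}"
        )
    return {'file_format': file_format, 'limit': limit}


options = check_extraction_options(file_format='avro', limit=1000)
# The validated values could then be assigned as in the cell above:
# s.query.file_format = options['file_format']
# s.query.limit = options['limit']
```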

## Running the Extraction Operation   `**(decremental)**`

This operation builds a collection of files containing the articles selected according to query conditions.

**This operation will decrement 1 extraction from your allowance**

Because an Extraction job decrements the account's allowance, it is highly recommended to run it only after verifying that [Explain](1.4_snapshot_explain.ipynb) and/or [Analytics](1.5_snapshot_analytics.ipynb) jobs for the same query return values in line with the expected volumes.
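
One way to make that verification explicit is a small guard that compares an estimated document count with the range you expect before spending an extraction. This is a minimal sketch: the function name and the numbers are hypothetical, and the estimate itself would come from an Explain or Analytics job as described in the linked notebooks.

```python
def volume_in_expected_range(estimated: int, low: int, high: int) -> bool:
    """Return True when the estimated volume falls within [low, high] (hypothetical helper)."""
    return low <= estimated <= high


# Hypothetical numbers for illustration; replace with the estimate
# returned by an Explain job and with your own expectations.
estimated_docs = 12_500

if volume_in_expected_range(estimated_docs, low=1_000, high=50_000):
    print('Volume looks reasonable; proceed with the extraction.')
else:
    print('Unexpected volume; review the query before extracting.')
```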

The `<Snapshot>.process_extraction()` function submits the job, monitors it and downloads the content in a single call. If a more manual process is required (send job, monitor job, get results), please see the [detailed package documentation](https://factiva-news-python.readthedocs.io/).
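
The submit, monitor and download sequence follows a common polling pattern. The sketch below shows only the monitoring step in a generic form: `wait_for_job`, the `get_state` callable and the state names `'DONE'` and `'FAILED'` are all placeholders chosen for illustration, not the package's API, so check the package documentation for the actual method and state names.

```python
import time


def wait_for_job(get_state, done_states=('DONE',), failed_states=('FAILED',),
                 poll_seconds=10.0, timeout_seconds=3600.0):
    """Poll get_state() until the job finishes, fails or times out (hypothetical helper)."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = get_state()
        if state in done_states:
            return state
        if state in failed_states:
            raise RuntimeError(f'Job ended in state {state!r}')
        if time.monotonic() >= deadline:
            raise TimeoutError(f'Job still in state {state!r} after {timeout_seconds} s')
        time.sleep(poll_seconds)


# Simulated job that reports three states in sequence (placeholder values)
simulated_states = iter(['PENDING', 'RUNNING', 'DONE'])
final_state = wait_for_job(lambda: next(simulated_states), poll_seconds=0)
print(final_state)
```

In a real workflow, `get_state` would wrap whatever call refreshes the job status, and the download step would run only after the function returns.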

To review the **Snapshot History**, please see the notebook [User Statistics](1.1_user_statistics.ipynb).

In [4]:
%%time
s.process_extraction()
print('Done!')

Done!
CPU times: user 8.65 s, sys: 8.25 s, total: 16.9 s
Wall time: 5min 13s


## Load the downloaded AVRO files to a Pandas DataFrame
Results are stored in a folder named after the job ID (`<Snapshot>.last_extraction_job.job_id`). A custom tool loads the folder's contents into a DataFrame.

In [5]:
s.last_extraction_job.job_id

'c1o5cbkuft'

In [4]:
from factiva.news import SnapshotFiles

# If using the previously executed extraction job
data_folder = f'data/{s.last_extraction_job.job_id}'

# If using a custom location
# data_folder = 'data/c1o5cbkuft'

sf = SnapshotFiles()
articles = sf.read_folder(data_folder)
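
Before loading, it can help to confirm that the job folder exists and actually contains downloaded files. The helper below is a hypothetical sketch that uses only the standard library; it assumes the downloaded files carry an `.avro` extension, which may not hold for other `file_format` values.

```python
from pathlib import Path


def list_avro_files(folder):
    """Return the sorted .avro files found directly inside folder (hypothetical helper)."""
    path = Path(folder)
    if not path.is_dir():
        raise FileNotFoundError(f'Folder not found: {path}')
    return sorted(
        p for p in path.iterdir()
        if p.is_file() and p.suffix.lower() == '.avro'
    )


# Example usage with the folder defined in the cell above:
# for avro_file in list_avro_files(data_folder):
#     print(avro_file.name, avro_file.stat().st_size, 'bytes')
```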

In [5]:
articles.columns

Index(['copyright', 'subject_codes', 'modification_datetime', 'body',
       'company_codes_occur_ticker_exchange', 'company_codes_occur',
       'company_codes_about', 'company_codes_lineage',
       'company_codes_ticker_exchange', 'snippet',
       'company_codes_relevance_ticker_exchange', 'market_index_codes',
       'section', 'company_codes_association_ticker_exchange',
       'currency_codes', 'company_codes_about_ticker_exchange',
       'region_of_origin', 'company_codes_lineage_ticker_exchange',
       'ingestion_datetime', 'modification_date', 'source_name',
       'language_code', 'region_codes', 'company_codes_association',
       'person_codes', 'byline', 'dateline', 'company_codes_relevance',
       'source_code', 'an', 'word_count', 'company_codes', 'industry_codes',
       'title', 'publication_datetime', 'publisher_name', 'action'],
      dtype='object')

In [8]:
articles[['an', 'publication_datetime', 'title', 'industry_codes', 'language_code']].head()

| | an | publication_datetime | title | industry_codes | language_code |
|---|---|---|---|---|---|
| 0 | FLYWAL0020191125efbp00jll | 2019-11-25 | 09:00 EST FuelCell announces completion of ini... | ,i3432,iindele,iindstrls,itech, | en |
| 1 | LABPRA0020191111efbb00003 | 2019-11-11 | Polymerkathoden: Die unlösliche Problemlösung? | ,i3432,iindele,iindstrls,itech, | de |
| 2 | TDLY000020210226eh2q0027j | 2021-02-26 | Prague; Czechia; Accumulators, primary cells a... | ,i3432,iindele,iindstrls,itech, | en |
| 3 | HNDBLT0020210211eh2b0000v | 2021-02-11 | HEDGEFONDS; Mehr Transparenz | ,iinv,ihedge,i81502,ialtinv,ifinal,i3432,iinde... | de |
| 4 | MRKRE00020210202eh1f0024t | 2021-01-15 | 2021 Global Forecast for All other round and p... | ,i3432,iindele,iindstrls,itech, | en |
