# Snapshot Explain

This notebook shows how to run a Snapshot Explain operation in a few steps, using a simple query.

In this notebook...
* [Dependencies and Initialisation](#dependencies-and-initialisation)
* [The Where Statement](#the-where-statement)
* [Running the Explain Operation](#running-the-explain-operation)
* [Getting Explain Samples](#getting-explain-samples)
* [Next Steps](#next-steps)

**_NOTE_**: This notebook was tested using the [`factiva-analytics`](https://factiva-analytics-python.readthedocs.io/) Python package version **0.3.13** (Sep 2024) and [Factiva Analytics v3.0](https://developer.dowjones.com/site/docs/factiva_apis/factiva_analytics_apis/factiva_snapshots_api/snapshots_3_0/index.gsp#) endpoints.

## Dependencies and Initialisation
Import statements and environment initialisation using the package `dotenv`. More details in the [Configuration notebook](0.2_configuration.ipynb).

In [1]:
import factiva.analytics as fa
from factiva.analytics import SnapshotExplain
from dotenv import load_dotenv
load_dotenv()
print(f"Using the Factiva Analytics Python Package version {fa.__version__}")

Using the Factiva Analytics Python Package version 0.3.13


## The Where Statement

This notebook uses a simple query for illustration purposes. For more tips about queries, or guidance on how to build complex or large queries, check out the [query reference](2.1_complex_large_queries.ipynb) notebook.

In [2]:
# Industry i3432 is for Batteries
where_statement = (
    r" publication_datetime >= '2016-01-01 00:00:00' "
    r" AND LOWER(language_code) IN ('en', 'de', 'fr') "
    r" AND REGEXP_CONTAINS(industry_codes, r'(?i)(^|,)(i3432)($|,)') "
)

expl = SnapshotExplain(query=where_statement)
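
The regular expression in the where statement above can be tried locally before submitting a query. The sketch below applies the same pattern with Python's `re` module, assuming the service treats it as a partial match (which is how `REGEXP_CONTAINS` behaves in BigQuery-style SQL). The helper name `matches_code` and the sample code lists are hypothetical, for illustration only.

```python
import re

# Same pattern as in the where statement: the code must be a whole item
# in a comma-separated list, matched case-insensitively.
INDUSTRY_PATTERN = re.compile(r"(?i)(^|,)(i3432)($|,)")


def matches_code(industry_codes: str) -> bool:
    """Hypothetical helper: True if the pattern is found anywhere in the string."""
    return INDUSTRY_PATTERN.search(industry_codes) is not None


# Made-up values for the industry_codes field
for codes in ["i3432", "i1,i3432,i2", "I3432,i1", "i34321", "xi3432,i1"]:
    print(f"{codes!r:16} -> {matches_code(codes)}")
```

The `(^|,)` and `($|,)` groups are what keep a longer code such as `i34321` from matching.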

## Running the Explain Operation

[Product Documentation - Snapshot Explain](https://developer.dowjones.com/site/docs/factiva_apis/factiva_analytics_apis/factiva_snapshots_api/snapshots_3_0/index.gsp#snapshotexplains-13)

This operation returns the number of documents matching the provided query in the Factiva Analytics archive.

The goal of this operation is to get a rough idea of the document volume. Used iteratively, it helps decide which criteria to add, remove or modify, by showing the impact of each change on the number of matching items.


The `<SnapshotExplain>.process_job()` method runs the whole operation and returns `True` when it completes; the estimate is then available as `<SnapshotExplain>.job_response.volume_estimate`. If a more manual process is required (submitting the job, monitoring it and getting its results as separate operations), please see the [detailed package documentation](https://factiva-analytics-python.readthedocs.io/).

In [3]:
%%time
expl.process_job()

CPU times: user 94.8 ms, sys: 21.9 ms, total: 117 ms
Wall time: 59.4 s


True

In [4]:
print(f'Explain operation ID: {expl.job_response.job_id}')
print(f'Document volume estimate: {expl.job_response.volume_estimate}')

Explain operation ID: 72e8e01e-fcce-4b23-8a2d-2fbb65e84fc2
Document volume estimate: 321078
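
The iterative use described in this section can be sketched as a small comparison of query variants. This is a minimal sketch, not part of the package: `compare_variants` and the variant labels are hypothetical, and `explain_volume` only wraps the calls already shown in this notebook. Going by the timing above, each Explain may take around a minute.

```python
from typing import Callable, Dict


def explain_volume(where: str):
    """Run one Explain and return its volume estimate, using the calls shown above."""
    # Imported here so compare_variants can be tried without the package or credentials
    from factiva.analytics import SnapshotExplain

    expl = SnapshotExplain(query=where)
    expl.process_job()
    return expl.job_response.volume_estimate


def compare_variants(variants: Dict[str, str], estimate: Callable[[str], int]) -> Dict[str, int]:
    """Hypothetical helper: map each labelled where statement to its estimate."""
    return {label: estimate(where) for label, where in variants.items()}


base = (
    r" publication_datetime >= '2016-01-01 00:00:00' "
    r" AND REGEXP_CONTAINS(industry_codes, r'(?i)(^|,)(i3432)($|,)') "
)
variants = {
    "all languages": base,
    "en, de, fr": base + r" AND LOWER(language_code) IN ('en', 'de', 'fr') ",
    "en only": base + r" AND LOWER(language_code) = 'en' ",
}

# With valid credentials:
# volumes = compare_variants(variants, explain_volume)
```

Passing the estimator as an argument keeps the comparison logic separate from the API call, so it can be exercised with a stand-in function first.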


## Getting Explain Samples

[Product documentation - Snapshot Explain Samples](https://developer.dowjones.com/site/docs/factiva_apis/factiva_analytics_apis/factiva_snapshots_api/snapshots_3_0/index.gsp#snapshotsamples-15)

As an extension of the Explain operation, it is possible to request a random set of article metadata samples matching the Explain criteria. The only requirement is the ID of a previously run Explain job, which the `SnapshotExplain` instance already holds (printed above as `job_response.job_id`).

The operation `get_samples(num_samples=10)` runs the API operation and stores the result in the `samples` property of the same `<SnapshotExplain>` instance. The response is essentially metadata content from a random selection of items. The parameter `num_samples` accepts an integer between `1` and `100`.

**_NOTE_**: It's important to run `<SnapshotExplain>.get_samples()` within 30 minutes of running the Explain operation. Otherwise no samples will be retrieved, as the server cache is flushed frequently.

In [5]:
expl.get_samples(num_samples=10)

True
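
The two constraints stated above (a `num_samples` value between 1 and 100, and a 30-minute window after the Explain) can be checked before calling the API. The following is a minimal sketch under those stated limits; `can_request_samples` and the `explain_finished_at` timestamp are hypothetical and would have to be recorded by the caller, since this notebook does not show the package exposing such a value.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

SAMPLES_WINDOW = timedelta(minutes=30)  # window stated in the note above
MIN_SAMPLES, MAX_SAMPLES = 1, 100       # accepted range for num_samples


def can_request_samples(
    explain_finished_at: datetime,
    num_samples: int,
    now: Optional[datetime] = None,
) -> bool:
    """Hypothetical pre-check: validate num_samples and the time window."""
    if not MIN_SAMPLES <= num_samples <= MAX_SAMPLES:
        raise ValueError(
            f"num_samples must be between {MIN_SAMPLES} and {MAX_SAMPLES}"
        )
    now = now or datetime.now(timezone.utc)
    return now - explain_finished_at <= SAMPLES_WINDOW


# Usage: record the time right after expl.process_job() returns,
# then check before calling expl.get_samples(num_samples=10).
finished = datetime.now(timezone.utc)
print(can_request_samples(finished, 10))
```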

By checking the value of `<SnapshotExplain>.samples.data.columns` it's possible to see the list of columns that can be used for a quick analysis. The property `data` is a `pandas.DataFrame`.

In [6]:
expl.samples.data.columns

Index(['an', 'company_codes', 'company_codes_about', 'company_codes_occur',
       'industry_codes', 'ingestion_datetime', 'modification_datetime',
       'publication_datetime', 'publisher_name', 'region_codes',
       'region_of_origin', 'source_code', 'source_name', 'subject_codes',
       'title', 'word_count', 'newswires_codes', 'restrictor_codes'],
      dtype='object')

Displaying a subset of the content.

In [7]:
expl.samples.data[['an', 'source_name', 'title', 'word_count']]

|   | an | source_name | title | word_count |
|---|---|---|---|---|
| 0 | B000000020180317ee3j0005l | Barron's | A Cheaper Electric-Car Play Than Tesla | 863 |
| 1 | B000000020210130eh2100001 | Barron's | An EV Play Driven Solely By Promise --- Quantu... | 1141 |
| 2 | B000000020210925eh9r000b5 | Barron's | EV Battery Recycler Starts With a Jolt --- New... | 998 |
| 3 | B000000020210911eh9d0002t | Barron's | The Great EV Battery Race | 771 |
| 4 | B000000020170225ed2r0008f | Barron's | This Week: Preview | 394 |
| 5 | B000000020190928ef9u000bl | Barron's | Electric-Car Battery Dilemma | 618 |
| 6 | B000000020211030ehb100003 | Barron's | Building A Better Battery --- Auto makers and ... | 2571 |
| 7 | B000000020170429ed510008p | Barron's | Cleaning Up for a Buyer A Closer Look --- Edge... | 999 |
| 8 | B000000020201225egcs000ul | Barron's | Epic Rise of a Battery Maker | 845 |
| 9 | B000000020160507ec590000l | Barron's | Sizing Up Small-Caps: Edgewell Could Fetch a 4... | 880 |
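
As an example of the quick analysis mentioned above, the sketch below summarises word counts per source and lists the longest items. To stay runnable without credentials it uses a stand-in DataFrame built from the `an`, `source_name` and `word_count` values displayed above; in the notebook, `expl.samples.data` would take its place.

```python
import pandas as pd

# Stand-in for expl.samples.data, copied from the output displayed above
samples = pd.DataFrame(
    {
        "an": [
            "B000000020180317ee3j0005l", "B000000020210130eh2100001",
            "B000000020210925eh9r000b5", "B000000020210911eh9d0002t",
            "B000000020170225ed2r0008f", "B000000020190928ef9u000bl",
            "B000000020211030ehb100003", "B000000020170429ed510008p",
            "B000000020201225egcs000ul", "B000000020160507ec590000l",
        ],
        "source_name": ["Barron's"] * 10,
        "word_count": [863, 1141, 998, 771, 394, 618, 2571, 999, 845, 880],
    }
)

# Number of items, mean and maximum word count per source
summary = samples.groupby("source_name")["word_count"].agg(["count", "mean", "max"])

# The three longest items in the sample
longest = samples.nlargest(3, "word_count")[["an", "word_count"]]

print(summary)
print(longest)
```

Because the samples are a random selection, such figures only hint at the shape of the full result set.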


## Next Steps

* Run an [analytics](1.5_snapshot_analytics.ipynb) operation to get a detailed time-series dataset of the estimates.
* Run an [extraction](1.6_snapshot_extraction.ipynb) and download the matching content.
* Fine-tune the query by adding or modifying criteria in `where_statement`, following the [query reference](2.1_complex_large_queries.ipynb).