# Snapshot Explain

This notebook shows how to run a Snapshot Explain operation with the minimal steps and a simple query.

In this notebook...
* [Dependencies and Initialisation](#dependencies-and-initialisation)
* [The Where Statement](#the-where-statement)
* [Running the Explain Operation](#running-the-explain-operation)
* [Getting Explain Samples](#getting-explain-samples)
* [Next Steps](#next-steps)

## Dependencies and Initialisation
Import statements and environment initialisation using the package `dotenv`. More details in the [Configuration notebook](0.2_configuration.ipynb).

In [1]:
from factiva.news import Snapshot
from dotenv import load_dotenv
load_dotenv()
print('Done!')

Done!


## The Where Statement

This notebook uses a simple query for illustration purposes. For more tips about queries, or guidance on how to build complex or large queries, check out the [query reference](2.1_complex_large_queries.ipynb) notebook.

In [2]:
# Industry i3432 is for Batteries
where_statement = (
    r" publication_datetime >= '2016-01-01 00:00:00' "
    r" AND LOWER(language_code) IN ('en', 'de', 'fr') "
    r" AND REGEXP_CONTAINS(industry_codes, r'(?i)(^|,)(i3432)($|,)') "
)

s = Snapshot(query=where_statement)
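
The industry filter above relies on a regular expression that matches a code only when it is a complete item of a comma-separated list. The following is a minimal local sketch of that idea using Python's `re` module. The helper `industry_code_pattern` and the test strings are hypothetical, and the sketch assumes that the service-side `REGEXP_CONTAINS` treats this simple pattern the same way Python does.

```python
import re


def industry_code_pattern(codes):
    """Build a pattern matching any of `codes` as a whole item in a comma-separated list."""
    # re.escape guards against regex metacharacters in a code (hypothetical helper)
    alternatives = "|".join(re.escape(code) for code in codes)
    return rf"(?i)(^|,)({alternatives})($|,)"


pattern = industry_code_pattern(["i3432"])

# Hypothetical field values, only to illustrate the matching behaviour
print(bool(re.search(pattern, "i1,i3432,iaut")))  # whole item in the list
print(bool(re.search(pattern, "I3432")))          # (?i) makes the match case-insensitive
print(bool(re.search(pattern, "i34321,i1")))      # longer code sharing the same prefix
```

Anchoring on `(^|,)` and `($|,)` is what prevents a longer code that merely starts with `i3432` from matching.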

## Running the Explain Operation

This operation returns the number of documents matching the provided query in the Factiva Analytics archive.

The goal of this operation is to give a rough idea of the document volume. When used iteratively, it helps decide which criteria to add, delete or modify, by showing the impact of each change on the number of matching items.


The `<Snapshot>.process_explain()` function directly returns this value. If a more manual process is required (send job, monitor job, get results), please see the [detailed package documentation](https://factiva-news-python.readthedocs.io/).

In [4]:
%%time
s.process_explain()
print(f'Explain operation ID: {s.last_explain_job.job_id}')
print(f'Document volume estimate: {s.last_explain_job.document_volume}')

Explain operation ID: df708daa-0726-43a5-b0b6-a50085924491
Document volume estimate: 177118
CPU times: user 298 ms, sys: 16.1 ms, total: 314 ms
Wall time: 47.9 s
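
The iterative use mentioned earlier can be sketched as a small comparison loop over several candidate queries. This is an illustration under assumptions, not part of the package: `compare_volumes`, `factiva_count` and the `candidates` labels are hypothetical names, and `factiva_count` only reuses the calls shown in this notebook. It also assumes `document_volume` is an integer, as the output above suggests.

```python
from typing import Callable, Dict


def compare_volumes(candidates: Dict[str, str], count_fn: Callable[[str], int]) -> Dict[str, int]:
    """Return the estimated document volume for each labelled where-statement."""
    return {label: count_fn(query) for label, query in candidates.items()}


def factiva_count(query: str) -> int:
    """Run one Explain with the calls shown in this notebook (needs valid credentials)."""
    from factiva.news import Snapshot  # imported lazily so compare_volumes works offline

    snapshot = Snapshot(query=query)
    snapshot.process_explain()
    return snapshot.last_explain_job.document_volume


base = (
    r" publication_datetime >= '2016-01-01 00:00:00' "
    r" AND REGEXP_CONTAINS(industry_codes, r'(?i)(^|,)(i3432)($|,)') "
)
candidates = {
    "all languages": base,
    "en, de, fr": base + r" AND LOWER(language_code) IN ('en', 'de', 'fr') ",
}

# Live run, one Explain job per candidate (each can take a while, as the timing above shows):
# volumes = compare_volumes(candidates, factiva_count)
```

Passing the counting function as an argument keeps the comparison logic testable without credentials; swapping in `factiva_count` runs the real Explain jobs.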


## Getting Explain Samples
As an extension of the Explain operation, it is possible to request a random set of article metadata samples matching the Explain criteria. The only requirement is a previously obtained Explain Job ID, which is taken from the `last_explain_job` instance within the `Snapshot` instance.

The operation `get_explain_job_samples(num_samples=10)` returns a Pandas DataFrame with the metadata content from a random selection of items. It accepts the parameter `num_samples`, which must be an integer between `1` and `100`.

In [5]:
samples = s.get_explain_job_samples(num_samples=10)

DataFrame size: (10, 16)
Columns: Index(['an', 'company_codes', 'company_codes_about', 'company_codes_occur',
       'industry_codes', 'ingestion_datetime', 'modification_datetime',
       'publication_datetime', 'publisher_name', 'region_codes',
       'region_of_origin', 'source_code', 'source_name', 'subject_codes',
       'title', 'word_count'],
      dtype='object')


The following code shows the list of columns that can be used for a quick analysis.

In [6]:
samples.columns

Index(['an', 'company_codes', 'company_codes_about', 'company_codes_occur',
       'industry_codes', 'ingestion_datetime', 'modification_datetime',
       'publication_datetime', 'publisher_name', 'region_codes',
       'region_of_origin', 'source_code', 'source_name', 'subject_codes',
       'title', 'word_count'],
      dtype='object')

Displaying a subset of the content.

In [8]:
samples[['an', 'source_name', 'title', 'word_count']]

| | an | source_name | title | word_count |
|---|---|---|---|---|
| 0 | OFBOAR0020220404ei3s0001d | The Official Board | FMC - The organizational chart displays its 33... | 643 |
| 1 | AFNR000020180322ee3n0000x | The Australian Financial Review | Buffett's Chinese battery to take on Tesla loc... | 707 |
| 2 | ALLZTG0020180301ee310003c | Aller-Zeitung | Bosch baut keine Batteriezellen; Konzern nennt... | 240 |
| 3 | ATS0000020180301ee310018h | ATS - Agence Télégraphique Suisse | Leclanché revoit sa perte nette pour 2017 | 220 |
| 4 | HUGNFR0020180322ee3m001be | Nasdaq / Globenewswire | Blue Solutions : résultats 2017 | 588 |
| 5 | DEALNEW020180321ee3e0000h | The Deal | As Shortage Looms, Canada's First Cobalt Picks... | 675 |
| 6 | CWNS000020180322ee3m0040m | Postmedia Breaking News | Major international lithium and battery manufa... | 1012 |
| 7 | STA0000020180326ee3q0005l | STA | Another record year for battery maker TAB | 194 |
| 8 | OFBOAR0020220321ei3f000ee | The Official Board | EnerSys - The organizational chart displays it... | 544 |
| 9 | SNLMMDE020180309ee380000z | SNL Metals & Mining Daily: East Edition | EU mandate lures refocused Australian junior H... | 750 |
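
As an example of the quick analysis these columns allow, the sketch below counts sampled items per source and computes the median word count. To stay self-contained it rebuilds a three-row DataFrame from rows 0, 7 and 8 of the sample output above; in the notebook itself, the `samples` DataFrame returned by `get_explain_job_samples` can be used directly. The variable names `items_per_source` and `median_words` are illustrative.

```python
import pandas as pd

# Rows 0, 7 and 8 copied from the sample output; a stand-in for the real `samples` DataFrame
samples = pd.DataFrame(
    {
        "an": [
            "OFBOAR0020220404ei3s0001d",
            "STA0000020180326ee3q0005l",
            "OFBOAR0020220321ei3f000ee",
        ],
        "source_name": ["The Official Board", "STA", "The Official Board"],
        "word_count": [643, 194, 544],
    }
)

# Number of sampled items per source, most frequent first
items_per_source = samples["source_name"].value_counts()

# Typical article length in the sample
median_words = samples["word_count"].median()

print(items_per_source)
print(f"Median word count: {median_words}")
```

Since the samples are a random selection of at most 100 items, such figures only hint at the composition of the full result set.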


## Next Steps

* Run an [analytics](1.5_snapshot_analytics.ipynb) operation to get a detailed time-series dataset of the estimates.
* Run an [extraction](1.6_snapshot_extraction.ipynb) and download the matching content.
* Fine-tune the query by adding or modifying the criteria in `where_statement` according to the [query reference](2.1_complex_large_queries.ipynb).