## MediaCollector

### Set environment variables

To load the project modules when this notebook runs from outside the repository folder, set **PATH** below to the repository location, e.g. ```C:/MediaCollector```:

In [None]:
PATH = "/path/to/MediaCollector" # <-- optional when running from the repository folder

In [None]:
import importlib.util, os

if not os.path.isdir(PATH):
    PATH = os.getcwd()
PATH = os.path.realpath(PATH)

spec = importlib.util.spec_from_file_location("__init__", os.path.join(PATH, '__init__.py'))
init = importlib.util.module_from_spec(spec)
spec.loader.exec_module(init)

%matplotlib inline
%load_ext autoreload
%autoreload 2

### Import functions

In [None]:
from MediaCollector import MediaCollector
from articles import NewsApiArticles
from content import NewsContent
from hyperlinks import NewsHyperlinks
from stories import MediaCloudStories

#### Import API credentials

In [None]:
from config import MCLOUD_KEY
from config import NEWSAPI_KEY

#### Override API credentials

This step is optional when the credentials are already defined in ```config.py```.

In [None]:
#MCLOUD_KEY  = "" # <-- MediaCloud application key
#NEWSAPI_KEY = "" # <-- News API application key

#### Set parameters

Set parameters to query either MediaCloud or News API. You may also pass a file containing articles or a list of URLs as input for content/hyperlinks.

**Note:** the `category` and `country` ([ISO 3166-1 alpha-2 code](https://www.iso.org/obp/ui/#search) e.g. "br" for Brazil) parameters are only available when querying [News API headlines](https://newsapi.org/docs/endpoints/top-headlines).

In [None]:
input_query = ""        # string or text file with URLs
language = ""           # language code e.g. "en" for English
method = "mediacloud"   # "mediacloud" or "newsapi"

start_date = ""         # as in "YYYY-MM-DD"
end_date = ""           # as in "YYYY-MM-DD"
limit = None            # maximum articles to get

content = False         # get page content
hyperlinks = False      # network of pages
max_workers = 20        # concurrent workers

category = ""           # for News API headlines only
country = ""            # for News API headlines only

output_folder = "MEDIA" # output folder name
merge_output = False    # for multiple collections
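
The exact layout of the input file is not specified here. As a minimal sketch, assuming a plain text file with one URL per line, a list of URLs could be prepared as follows. The helper name `read_urls` is hypothetical and not part of MediaCollector.

```python
from pathlib import Path


def read_urls(path):
    """Return the non-empty, non-comment lines of a text file as a list of URLs.

    Assumption: one URL per line; lines starting with '#' are ignored.
    """
    urls = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls
```

The resulting list could then be inspected before being passed on as input for content or hyperlink collection.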

### Collect media

Collect news articles and, optionally, gather page content and build a network of hyperlinks. If no date is set, today's articles are collected.

In [None]:
output = MediaCollector(input_query,
                        newsapi_key=NEWSAPI_KEY,
                        mcloud_key=MCLOUD_KEY,
                        method=method,
                        lang=language,
                        limit=limit,
                        since=start_date,
                        until=end_date,
                        content=content,
                        hyperlinks=hyperlinks,
                        max_workers=max_workers,
                        merge_output=merge_output,
                        output_folder=output_folder,
                        ext='csv')
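
The date handling described above can be illustrated with a small sketch. The helper `resolve_dates` is hypothetical; in particular, using `end_date` as the fallback for a missing `start_date` is an assumption of this example, not documented behavior of MediaCollector.

```python
from datetime import date, datetime


def resolve_dates(start_date="", end_date="", today=None):
    """Return (start, end) as 'YYYY-MM-DD' strings.

    Assumptions: an empty end date falls back to today, and an empty
    start date falls back to the end date, so no dates means "today only".
    """
    today = today or date.today()
    # strptime raises ValueError when a value is not in YYYY-MM-DD format
    end = datetime.strptime(end_date, "%Y-%m-%d").date() if end_date else today
    start = datetime.strptime(start_date, "%Y-%m-%d").date() if start_date else end
    if start > end:
        raise ValueError(f"start_date {start} is after end_date {end}")
    return start.isoformat(), end.isoformat()
```

Validating the strings up front surfaces a malformed date before any API request is made.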

### Advanced usage

Run the main function step by step by instantiating the underlying classes and calling them directly.

In [None]:
M = MediaCloudStories(MCLOUD_KEY)
N = NewsApiArticles(NEWSAPI_KEY)
C = NewsContent(max_workers=max_workers)
H = NewsHyperlinks(max_workers=max_workers)

#### Search MediaCloud stories

**Note:** the language parameter works here as a post-query filter, since MediaCloud does not seem to support it at search time.

In [None]:
output = M.stories(input_query,
                   lang=language,
                   limit=limit,
                   start_date=start_date,
                   end_date=end_date)
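
A post-query filter of this kind might look like the sketch below. It assumes each story is a dictionary with a `language` key holding a language code; the helper `filter_by_language` is hypothetical and is not the filter used internally.

```python
def filter_by_language(stories, lang):
    """Keep only stories whose 'language' field matches lang (case-insensitive).

    Assumption: each story is a dict with an optional 'language' key.
    An empty lang disables filtering.
    """
    if not lang:
        return list(stories)
    wanted = lang.lower()
    return [s for s in stories if (s.get("language") or "").lower() == wanted]
```

Because the filter runs after the query, the number of stories returned can be lower than `limit`.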

#### Search News API articles

**Note:** free accounts are limited to a maximum of 100 articles for both `everything` and `headlines` endpoints.

In [None]:
output = N.articles(input_query,
                    category=category,
                    country=country,
                    endpoint='everything',  # or 'headlines'
                    lang=language,
                    limit=limit,
                    start_date=start_date,
                    end_date=end_date)

#### Check News API sources

Returns a subset of news publishers that are available for `headlines`.

In [None]:
output = N.sources()

#### Get page content

Gather the content of every previously collected article through `news-please`.

In [None]:
output = C.from_articles(output)
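
To illustrate what `max_workers` controls, the sketch below fetches several pages concurrently with a thread pool from the standard library. It is a generic pattern, not the implementation of `NewsContent`; `fetch_all` and its `fetch` callback are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=20):
    """Apply fetch(url) to each URL concurrently; return {url: result or None}."""

    def safe_fetch(url):
        try:
            return fetch(url)
        except Exception:
            # A single failed page should not abort the whole batch
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with urls
        return dict(zip(urls, pool.map(safe_fetch, urls)))
```

A higher `max_workers` shortens the total wait for network-bound work, at the cost of more simultaneous requests to the same sites.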

#### Get page hyperlinks

Builds a network from page hyperlinks through `newspaper3k`. **Tip**: set a higher `depth` to also follow the links found on returned pages, up to `depth` levels.

In [None]:
H.from_articles(output, output_folder=output_folder, depth=1)
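
The effect of `depth` can be sketched as a breadth-first expansion over links. The example below is only an interpretation of the parameter, not the code behind `NewsHyperlinks`; `crawl_links` and the `get_links` callback are hypothetical, and the callback stands in for whatever extracts hyperlinks from a page.

```python
from collections import deque


def crawl_links(seeds, get_links, depth=1):
    """Breadth-first expansion of hyperlinks; returns (source, target) edges.

    With depth=1 only the seed pages are expanded; each extra level also
    expands the pages discovered on the previous level.
    """
    edges = []
    seen = set(seeds)
    queue = deque((url, 1) for url in seeds)
    while queue:
        url, level = queue.popleft()
        for target in get_links(url):
            edges.append((url, target))
            # Only pages within the depth limit are expanded further
            if target not in seen and level < depth:
                seen.add(target)
                queue.append((target, level + 1))
    return edges
```

The number of pages visited can grow quickly with each level, so small values of `depth` are a sensible starting point.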

### Data frame from output

Optionally build a data frame to inspect output objects. **Note:** requires importing `pandas` beforehand.

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame.from_dict(output); df

#### Compress output → `output.zip`

In [None]:
!zip -r output.zip {output_folder}
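
On systems where the `zip` command is not available (for example a default Windows installation), the same archive could be produced with the standard library. This is a sketch of an alternative, with `zip_folder` as a hypothetical helper name.

```python
import os
import shutil


def zip_folder(folder, archive_name="output"):
    """Create <archive_name>.zip containing folder; return the archive path.

    The folder name is kept as the top-level entry inside the archive,
    similar to what `zip -r` produces.
    """
    parent, name = os.path.split(os.path.abspath(folder))
    return shutil.make_archive(archive_name, "zip", root_dir=parent, base_dir=name)
```

For instance, `zip_folder(output_folder)` would write `output.zip` to the current working directory.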

### [Download output files](output.zip)

___
### References

* Beautiful Soup: https://pypi.org/project/beautifulsoup4/
* MediaCloud API Client: https://pypi.org/project/mediacloud/
* news-please: https://pypi.org/project/news-please/
* newsapi-python: https://pypi.org/project/newsapi-python/
* Newspaper3k: https://pypi.org/project/newspaper3k/