## MediaCollector

### Set environment variables

To load the modules when running this notebook from outside the repository folder, set **PATH** below to the repository location, e.g. ```C:/MediaCollector```:

In [None]:
PATH = "/path/to/MediaCollector" # <-- optional if running from native path

In [None]:
import importlib.util, os

if not os.path.isdir(PATH):
    PATH = os.getcwd()
PATH = os.path.realpath(PATH)

spec = importlib.util.spec_from_file_location("__init__", os.path.join(PATH, '__init__.py'))
init = importlib.util.module_from_spec(spec)
spec.loader.exec_module(init)

%matplotlib inline
%load_ext autoreload
%autoreload 2
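
The cell above relies on `importlib.util.spec_from_file_location`, which executes a file as a module regardless of `sys.path`. Below is a minimal, self-contained sketch of that mechanism, using a throwaway file in a temporary folder. The `load_module` helper and the `demo_init` module name are illustrative assumptions and not part of MediaCollector.

```python
import importlib.util
import os
import tempfile


def load_module(name, file_path):
    """Load a Python file as a module object without touching sys.path."""
    spec = importlib.util.spec_from_file_location(name, file_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {file_path!r}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file's top-level code
    return module


# Demonstration with a temporary stand-in for the repository's __init__.py.
with tempfile.TemporaryDirectory() as folder:
    file_path = os.path.join(folder, "__init__.py")
    with open(file_path, "w") as handle:
        handle.write("GREETING = 'loaded'\n")
    demo = load_module("demo_init", file_path)

print(demo.GREETING)
```

Checking `spec` before use gives a clearer error when the path points to something that cannot be loaded as a module.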

### Import functions

In [None]:
from MediaCollector import MediaCollector
from articles import news_articles
from content import news_content
from hyperlinks import news_hyperlinks
from stories import mc_stories

#### Set default API credentials

In [None]:
from config import MCLOUD_KEY
from config import NEWSAPI_KEY

#### Override API credentials (optional)

User definitions stored in ```config.py``` make this step optional.

In [None]:
#MCLOUD_KEY  = "" # <-- MediaCloud application key
#NEWSAPI_KEY = "" # <-- News API application key

#### Set parameters

Set parameters to query either News API articles or MediaCloud stories.

In [None]:
query      = ""      # search string, or text file with URLs
content    = True    # scrape page content
hyperlinks = True    # build a network of page hyperlinks
days       = 30      # number of days back to search for news
limit      = 100     # maximum number of articles
start_date = None    # "YYYY-MM-DD"
end_date   = None    # "YYYY-MM-DD"
lang       = 'all'   # language code
method     = 'all'   # API endpoint
output     = '.'     # output folder name
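
How `days`, `start_date` and `end_date` interact is not documented in this notebook. The sketch below shows one plausible way to turn them into a date window, under the assumption that explicit dates take precedence and `days` only fills in a missing start date. The `date_window` helper and its `today` argument are hypothetical and only serve this illustration.

```python
from datetime import date, datetime, timedelta


def date_window(days=30, start_date=None, end_date=None, today=None):
    """Return (start, end) as date objects for a news query.

    Assumed precedence (not confirmed by MediaCollector): explicit dates win,
    and `days` only fills in a missing start date.
    """
    today = today or date.today()
    end = datetime.strptime(end_date, "%Y-%m-%d").date() if end_date else today
    if start_date:
        start = datetime.strptime(start_date, "%Y-%m-%d").date()
    else:
        start = end - timedelta(days=days)
    if start > end:
        raise ValueError("start_date must not be later than end_date")
    return start, end


print(date_window(days=30, today=date(2020, 3, 31)))
print(date_window(start_date="2020-01-01", end_date="2020-01-15"))
```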

### Collect media

Get articles, scrape page content and build a network of hyperlinks.

In [None]:
articles = MediaCollector(query,
                          newsapi_key=NEWSAPI_KEY,
                          mcloud_key=MCLOUD_KEY,
                          method=method,
                          days=days,
                          lang=lang,
                          limit=limit,
                          since=start_date,
                          until=end_date,
                          content=content,
                          hyperlinks=hyperlinks,
                          output_folder=output)

### Alternative usage

Step-by-step execution of the main function.

#### 1/3) From News API `headlines`

Returns only the latest news containing the query, up to a maximum of 20 articles.

In [None]:
articles = news_articles(query,
                         NEWSAPI_KEY,
                         method='headlines',
                         lang=lang)

#### 2/3) From News API `everything`

**Note:** developer accounts are limited to a maximum of 100 results; requesting more returns the error code `maximumResultsReached`.

In [None]:
articles = news_articles(query,
                         NEWSAPI_KEY,
                         days=days,
                         lang=lang,
                         pages=(limit // 20 if limit > 20 else 1),  # integer page count
                         start_date=start_date,
                         end_date=end_date)
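
The division by 20 in the cell above suggests a page size of 20 articles. As a rough sketch under that assumption, the page count can be rounded up so that a partial last page is still requested, and capped at the 100-result ceiling mentioned in the note. The `pages_for` helper and both constants are illustrative names, not part of `news_articles`.

```python
import math

PAGE_SIZE = 20        # page size implied by the division by 20 (assumption)
DEVELOPER_CAP = 100   # developer-account ceiling mentioned in the note


def pages_for(limit, page_size=PAGE_SIZE, cap=DEVELOPER_CAP):
    """Number of whole pages needed to cover `limit` articles, within `cap`."""
    wanted = max(1, min(limit, cap))
    return math.ceil(wanted / page_size)


for limit in (5, 20, 30, 100, 250):
    print(limit, pages_for(limit))
```

Rounding up differs from plain integer division for limits such as 30, where two pages are needed to reach the requested number of articles.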

#### 3/3) From MediaCloud `stories`

**Note:** the language parameter works here as a post-query filter, since MediaCloud does not seem to support it at search time.

In [None]:
articles = mc_stories(query,
                      MCLOUD_KEY,
                      days=days,
                      lang=lang,
                      limit=limit,
                      start_date=start_date,
                      end_date=end_date)
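
A post-query language filter of the kind described in the note could look like the sketch below. It assumes each story is a dictionary with a `language` field, which is a guess about the record layout rather than a documented property of `mc_stories`; the `filter_by_language` helper and the sample records are made up for illustration.

```python
def filter_by_language(stories, lang="all", key="language"):
    """Keep stories whose language field equals `lang`; 'all' keeps everything."""
    if lang in (None, "all"):
        return list(stories)
    return [story for story in stories if story.get(key) == lang]


# Made-up records for illustration only.
sample = [
    {"title": "A", "language": "en"},
    {"title": "B", "language": "pt"},
    {"title": "C", "language": None},
]
print([s["title"] for s in filter_by_language(sample, "pt")])
```

Because the filter runs after the query, fewer than `limit` stories may remain once other languages are discarded.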

#### Get page content

Calls `NewsPlease` to extract the content of every previously collected article.

In [None]:
articles = news_content(articles)

#### Get page hyperlinks

Builds a network of the articles' hyperlinks, using `Newspaper` and `BeautifulSoup` to scrape pages. **Tip**: try setting a higher number of `levels` (depth).

In [None]:
levels = 1 # <-- set network depth

G = news_hyperlinks(articles, levels=levels)
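
To illustrate what network depth means, here is an offline sketch that follows links breadth-first for a given number of `levels`. It deliberately swaps in the standard library `html.parser` and a dictionary of made-up pages instead of the scraping libraries used by `news_hyperlinks`, and it assumes that one level equals one hop from the seed articles. `LinkParser`, `crawl_edges` and `PAGES` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def crawl_edges(seeds, pages, levels=1):
    """Breadth-first edge list (source, target) down to `levels` hops."""
    edges, seen, frontier = [], set(seeds), list(seeds)
    for _ in range(levels):
        next_frontier = []
        for url in frontier:
            parser = LinkParser(url)
            parser.feed(pages.get(url, ""))  # unknown pages yield no links
            for target in parser.links:
                edges.append((url, target))
                if target not in seen:
                    seen.add(target)
                    next_frontier.append(target)
        frontier = next_frontier
    return edges


# Made-up pages standing in for downloaded HTML.
PAGES = {
    "https://a.example/": '<a href="/b">b</a> <a href="https://c.example/">c</a>',
    "https://a.example/b": '<a href="https://c.example/">c</a>',
}
print(len(crawl_edges(["https://a.example/"], PAGES, levels=1)))
print(len(crawl_edges(["https://a.example/"], PAGES, levels=2)))
```

Each extra level scrapes every newly discovered page, so the number of requests can grow quickly with depth.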

#### Compress output →  `output.zip`

In [None]:
!zip output.zip *json *csv *gml
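
The `zip` command may not be available on every system (for example, a default Windows installation). As a hedged alternative, the sketch below builds a similar archive with the standard library `zipfile` module. The `zip_outputs` helper is an illustrative name, and the demonstration runs in a temporary folder with made-up files rather than on real output.

```python
import glob
import os
import tempfile
import zipfile


def zip_outputs(folder=".", archive="output.zip", patterns=("*json", "*csv", "*gml")):
    """Write every file in `folder` matching `patterns` into `archive`."""
    archive_path = os.path.join(folder, archive)
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as bundle:
        for pattern in patterns:
            for file_path in sorted(glob.glob(os.path.join(folder, pattern))):
                # Store files flat, without the folder prefix.
                bundle.write(file_path, arcname=os.path.basename(file_path))
    return archive_path


# Demonstration with placeholder files; only the JSON and CSV files match.
with tempfile.TemporaryDirectory() as folder:
    for name in ("articles.json", "articles.csv", "notes.txt"):
        with open(os.path.join(folder, name), "w") as handle:
            handle.write("demo")
    with zipfile.ZipFile(zip_outputs(folder)) as bundle:
        names = sorted(bundle.namelist())

print(names)
```

In the notebook itself, calling `zip_outputs(output)` would be the counterpart of the shell command, assuming the output files were written to the `output` folder.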

### [Download output files](output.zip)

___
### References

* Beautiful Soup: https://pypi.org/project/beautifulsoup4/
* MediaCloud API Client: https://pypi.org/project/mediacloud/
* news-please: https://pypi.org/project/news-please/
* newsapi-python: https://pypi.org/project/newsapi-python/
* Newspaper: https://pypi.org/project/newspaper3k/