## MediaCollector

### Set environment variables

To load the modules used in this notebook when running it from outside the repository folder, set **PATH** below to the script's location, e.g. ```C:/MediaCollector```:

In [1]:
PATH = "/media/data/scripts/chn@git/chn-tools/tools/MediaCollector" # <-- optional when running from the repository folder

In [2]:
import importlib.util, os

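# Fall back to the current working directory if PATH does not exist
if not os.path.isdir(PATH):
    PATH = os.getcwd()
PATH = os.path.realpath(PATH)

# Execute the package's __init__.py directly from its file path
spec = importlib.util.spec_from_file_location("__init__", os.path.join(PATH, '__init__.py'))
init = importlib.util.module_from_spec(spec)
spec.loader.exec_module(init)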

%matplotlib inline
%load_ext autoreload
%autoreload 2
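
The loading step above relies on `importlib` to execute a file by path, which works even when the folder is not on `sys.path`. A minimal sketch of the same pattern wrapped in a reusable function is shown here; the name `load_module_from_path` is hypothetical and not part of MediaCollector.

```python
import importlib.util


def load_module_from_path(name, file_path):
    """Load a Python source file as a module, independent of sys.path.

    `name` is the module name to assign; `file_path` points to a .py file.
    """
    spec = importlib.util.spec_from_file_location(name, file_path)
    # spec_from_file_location returns None when no loader matches the file
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load {name!r} from {file_path!r}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file's top-level code
    return module
```

Raising `ImportError` early gives a clearer message than the `AttributeError` that would otherwise appear when `spec` is `None`.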

### Import functions

In [3]:
from mclib import mc_stories
from newslib import news_articles
from newslib import news_content
from newslib import news_hyperlinks

#### Set default API credentials

In [4]:
from config import MCLOUD_KEY
from config import NEWSAPI_KEY

#### Override API credentials (optional)

This step is optional when credentials are already defined in ```config.py```. **Note**: values passed to functions at call time override the predefined settings.

In [None]:
#MCLOUD_KEY  = "" # <-- MediaCloud application key
#NEWSAPI_KEY = "" # <-- News API application key
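
The override rule can be pictured as a simple precedence check: a key passed at call time wins, and the configured default is used otherwise. The sketch below is illustrative only; `resolve_key` is a hypothetical helper, and how `mclib` and `newslib` handle this internally is not shown in this notebook.

```python
def resolve_key(passed_key=None, config_key=None):
    """Return the key passed at call time, else the configured default.

    Hypothetical helper: empty strings and None both count as "not set".
    """
    if passed_key:
        return passed_key
    if config_key:
        return config_key
    raise ValueError("No API key provided and none found in the configuration")
```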

### Get news articles

Set parameters to query either News API articles or MediaCloud stories. **Note:** both `lang` and `method` arguments below are for News API only.

In [9]:
query      = "rap"   # string to search
days       = 30      # number of days to look back
limit      = 100     # maximum number of articles
start_date = None    # "YYYY-MM-DD"
end_date   = None    # "YYYY-MM-DD"
lang       = 'all'   # language code
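
How `days` interacts with `start_date` and `end_date` is not spelled out here. One plausible reading is that explicit dates take priority and `days` fills in a missing start date by counting back from the end date. The sketch below works under that assumption; `resolve_date_range` and its `today` parameter are hypothetical and not part of `newslib` or `mclib`.

```python
from datetime import date, timedelta


def resolve_date_range(days, start_date=None, end_date=None, today=None):
    """Return (start, end) as "YYYY-MM-DD" strings.

    Assumption: a missing end date defaults to today, and a missing
    start date defaults to `days` before the end date.
    """
    end = date.fromisoformat(end_date) if end_date else (today or date.today())
    start = date.fromisoformat(start_date) if start_date else end - timedelta(days=days)
    if start > end:
        raise ValueError("start_date must not be later than end_date")
    return start.isoformat(), end.isoformat()
```

The `today` argument exists only to make the function deterministic when testing.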

#### 1/3) From News API `headlines`

In [10]:
articles = news_articles(query, NEWSAPI_KEY, limit, days, start_date, end_date, lang, 'headlines')

Connecting to News API...

Got 20 articles.


#### 2/3) From News API `all`

**Note:** developer accounts are limited to a maximum of 100 results; requests beyond that limit return the error code `maximumResultsReached`.

In [None]:
articles = news_articles(query, NEWSAPI_KEY, limit, days, start_date, end_date, lang, 'all')

Connecting to News API...
Start date: 2019-11-03 
End date:   2019-10-04
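
To stay within the 100-result ceiling, a client can cap the requested `limit` before paginating. The sketch below plans which pages to request; `plan_requests` is a hypothetical helper, not part of `newslib`, and it assumes the `page` and `pageSize` parameters of the News API `everything` endpoint, where `pageSize` is capped at 100.

```python
import math


def plan_requests(limit, page_size=100, max_results=100):
    """Return (capped_limit, pages) for a paginated query.

    `max_results` models the account ceiling (100 for developer accounts);
    `pages` lists the 1-based page numbers to request.
    """
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    page_size = min(page_size, 100)           # API maximum per page
    capped = max(0, min(limit, max_results))  # never ask past the ceiling
    pages = math.ceil(capped / page_size) if capped else 0
    return capped, list(range(1, pages + 1))
```

Because the page size stays constant across requests, the final page may return more items than needed, so the caller would truncate the combined results to `capped_limit`.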


#### 3/3) From MediaCloud `stories`

In [None]:
articles = mc_stories(query, MCLOUD_KEY, limit, days, start_date, end_date, lang)

### Get content

Calls `NewsPlease` to extract the content of every previously collected article.

In [None]:
content = news_content(articles)
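
As a rough illustration of this step, the sketch below loops over the collected articles and extracts each page with `NewsPlease.from_url`. It is not the implementation of `news_content`: the function name `collect_content`, the assumption that each article is a dict with a `"url"` key, and the shape of the returned dict are all hypothetical. The `extract` parameter allows swapping in another extractor, for example during testing.

```python
def collect_content(articles, extract=None):
    """Map each article URL to its extracted title and main text.

    Assumption: `articles` is an iterable of dicts with a "url" key.
    """
    if extract is None:
        # Lazy import so the function can be used without news-please
        # when a custom extractor is supplied.
        from newsplease import NewsPlease

        def extract(url):
            article = NewsPlease.from_url(url)
            return {
                "title": getattr(article, "title", None),
                "text": getattr(article, "maintext", None),
            }

    content = {}
    for item in articles:
        url = item.get("url")
        if not url or url in content:
            continue  # skip entries without a URL and duplicates
        try:
            content[url] = extract(url)
        except Exception as error:  # keep going if one page fails
            content[url] = {"error": str(error)}
    return content
```

Recording the error instead of raising keeps a single unreachable page from aborting the whole run.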

### Get hyperlinks

Builds a network of the articles' hyperlinks, using `Newspaper` and `BeautifulSoup` to scrape pages. **Tip**: try setting a higher number of `levels` (depth).

In [None]:
levels = 1 # <-- set network depth

G = news_hyperlinks(articles, levels=levels)
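
The effect of `levels` can be illustrated with a small breadth-first crawl: level 1 follows links from the seed pages only, level 2 also follows links from the pages found at level 1, and so on. The sketch below is not the code behind `news_hyperlinks`; it swaps in the standard library's `html.parser` in place of Newspaper and BeautifulSoup, and the names `extract_links`, `crawl_links` and the `fetch` callable are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html, base_url):
    """Return absolute http(s) links found in `html`."""
    parser = LinkParser()
    parser.feed(html)
    absolute = (urljoin(base_url, href) for href in parser.links)
    return [u for u in absolute if urlparse(u).scheme in ("http", "https")]


def crawl_links(seed_urls, fetch, levels=1):
    """Return a set of (source, target) edges, following links `levels` deep.

    `fetch` is any callable that takes a URL and returns its HTML.
    """
    edges, visited = set(), set()
    frontier = list(dict.fromkeys(seed_urls))  # de-duplicate, keep order
    for _ in range(levels):
        next_frontier = []
        for url in frontier:
            if url in visited:
                continue
            visited.add(url)
            try:
                html = fetch(url)
            except Exception:
                continue  # unreachable pages contribute no edges
            for target in extract_links(html, url):
                edges.add((url, target))
                if target not in visited:
                    next_frontier.append(target)
        frontier = next_frontier
    return edges
```

The number of pages fetched can grow quickly with each extra level, which is worth keeping in mind when raising `levels`.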

#### Compress output → `output.zip`

In [None]:
!zip output.zip *{json,csv,gdf}
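
On systems without the `zip` command, the same archive can be produced with Python's `zipfile` module. This is a minimal sketch under the assumption that the output files carry `.json`, `.csv` or `.gdf` extensions; `zip_outputs` is a hypothetical helper and, unlike the shell pattern above, it only matches names containing a dot before the extension.

```python
import glob
import os
import zipfile


def zip_outputs(directory=".", archive="output.zip", extensions=("json", "csv", "gdf")):
    """Compress matching files in `directory` and return the names added."""
    archive_path = os.path.join(directory, archive)
    added = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for ext in extensions:
            for path in sorted(glob.glob(os.path.join(directory, f"*.{ext}"))):
                name = os.path.basename(path)
                zf.write(path, arcname=name)  # store without the folder prefix
                added.append(name)
    return added
```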

### [Download output files](output.zip)

___
### References

* Beautiful Soup: https://pypi.org/project/beautifulsoup4/
* MediaCloud API Client: https://pypi.org/project/mediacloud/
* news-please: https://pypi.org/project/news-please/
* newsapi-python: https://pypi.org/project/newsapi-python/
* Newspaper: https://pypi.org/project/newspaper3k/