- Overview
- Media
- Media Health
- Feeds
- Stories
- Sentences
- Word Counting
- Tags and Tag Sets
- Registration and Authentication
- Register
- Reset password
- Log in
- User Profile
- Stats
- Util
- Extended Examples
- Output Format / JSON
- Create a CSV file with all media sources.
- Grab all processed stories from US Mainstream Media as a stream
- Grab stories by querying stories_public/list
- Grab all stories in The New York Times during October 2012
- Get word counts for top words for sentences matching 'trayvon' in US Mainstream Media during April 2012
- Get word counts for top words for sentences with the tag 'odd' in tag_set = 'ts'
- Find the tags_id for 'odd' given the tag_sets_id
- Grab stories from 10 January 2014 with the tag 'foo:bar'
- Grab stories by querying stories_public/list
Every call below includes a key parameter which will authenticate the user to the API service. The key parameter is excluded from the examples in the sections below for brevity.
To get a key, register for a user:
https://core.mediacloud.org/login/register
Once you have an account go here to see your key:
https://core.mediacloud.org/admin/profile
https://api.mediacloud.org/api/v2/media/single/1?key=KRN4T5JGJ2A
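As a minimal sketch of attaching the key parameter to a request URL in Python: the helper name `build_url` is an assumption made for illustration, and the base URL and sample key are copied from the example above.

```python
from urllib.parse import urlencode

# Base URL as used by the examples in this document.
API_BASE = "https://api.mediacloud.org/api/v2"


def build_url(path, key, params=None):
    """Return a full request URL with the key parameter appended (hypothetical helper)."""
    query = dict(params or {})
    query["key"] = key  # every call carries the user's API key
    return f"{API_BASE}/{path.lstrip('/')}?{urlencode(query)}"


print(build_url("media/single/1", "KRN4T5JGJ2A"))
```

Keeping the key in one helper means the other examples can stay focused on the call-specific parameters.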
A Python client for our API is now available. Users who develop in Python will probably find it easier to use this client than to make web requests directly. The Python client is available here.
Note: by default the API only returns a subset of the available fields in returned objects. The returned fields are those that we consider to be the most relevant to users of the API. If the all_fields parameter is provided and is non-zero, then a more complete list of fields will be returned. For space reasons, we do not list the all_fields parameter on individual API descriptions.
The following languages are supported (by 2 letter language code):
- ca (Catalan)
- da (Danish)
- de (German)
- en (English)
- es (Spanish)
- fi (Finnish)
- fr (French)
- ha (Hausa)
- hi (Hindi)
- hu (Hungarian)
- it (Italian)
- ja (Japanese)
- lt (Lithuanian)
- nl (Dutch)
- no (Norwegian)
- pt (Portuguese)
- ro (Romanian)
- ru (Russian)
- sv (Swedish)
- tr (Turkish)
- zh (Chinese)
Media Cloud returns an appropriate HTTP status code for any error, along with a JSON document in the following format:
{ "error": "error message" }
Each user is limited to 1,000 API calls and 20,000 stories returned in any 7 day period. Requests submitted beyond this limit will result in a status 403 error. Users who need access to more requests should email info@mediacloud.org.
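The error document and the 403 status described above can be handled in one place. The following is a sketch only: `MediaCloudError` and `check_response` are hypothetical names, and treating every 403 as a possible quota problem is an assumption, since the text only says that exceeding the limit produces a 403.

```python
import json


class MediaCloudError(Exception):
    """Raised for any non-200 response (hypothetical exception type)."""

    def __init__(self, status, message):
        super().__init__(f"HTTP {status}: {message}")
        self.status = status
        self.message = message
        # Assumption: a 403 may mean the 7 day quota is exhausted.
        self.possibly_rate_limited = status == 403


def check_response(status, body):
    """Return the decoded JSON body, or raise MediaCloudError for an error status."""
    if status == 200:
        return json.loads(body)
    try:
        message = json.loads(body).get("error", "unknown error")
    except (ValueError, AttributeError):
        message = body  # body was not the documented {"error": ...} document
    raise MediaCloudError(status, message)
```

A caller would pass the HTTP status and raw body from whatever HTTP library it uses, and decide whether to back off when `possibly_rate_limited` is set.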
The Media API calls provide information about media sources. A media source is a publisher of content, such as the New York Times or Instapundit. Every story belongs to a single media source. Each media source can have zero or more feeds.
URL | Function |
---|---|
api/v2/media/single/<media_id> | Return the media source in which media_id equals <media_id> |
None.
Fetching information on The New York Times
URL: https://api.mediacloud.org/api/v2/media/single/1
Response:
[
{
"url": "http://nytimes.com",
"name": "New York Times",
"media_id": 1,
"is_healthy": 1,
"is_monitored": 1,
"public_notes": "all the news that's fit to print",
"editor_notes": "first media source",
"num_stories_90": 123,
"num_sentences_90": 1234,
"start_date": "2016-01-01",
"media_source_tags": [
{
"tag_sets_id": 5,
"show_on_stories": null,
"tags_id": 8875027,
"show_on_media": 1,
"description": "Top U.S. mainstream media according Google Ad Planner's measure of unique monthly users.",
"tag_set": "collection",
"tag": "ap_english_us_top25_20100110",
"label": "U.S. Mainstream Media"
}
],
"activities": [
{
"date": "2015-08-12 18:17:35.922523",
"field": "name",
"new_value": "New York Times",
"old_value": "nytimes.com"
}
]
}
]
URL | Function |
---|---|
api/v2/media/list | Return multiple media sources |
Parameter | Default | Notes |
---|---|---|
last_media_id | 0 | Return media sources with a media_id greater than this value |
rows | 20 | Number of media sources to return. Cannot be larger than 100 |
name | none | Name of media source for which to search |
tag_name | none | Name of tag for which to return belonging media |
timespans_id | null | Return media within the given timespan |
topic_mode | null | If set to 'live', return media from live topics |
tags_id | null | Return media associated with the given tag |
q | null | Return media with at least one sentence that matches the Solr query |
include_dups | 0 | Include duplicate media among the results |
unhealthy | none | Only return media that are currently marked as unhealthy (see mediahealth/list) |
similar_media_id | none | Return media with the most tags in common |
sort | id | Sort order of media: id or num_stories |
If the name parameter is specified, the call returns only media sources that match a case-insensitive search for the specified value. If the specified value is fewer than 3 characters long, the call returns an empty list.
By default, media are sorted by media_id. If the sort parameter is set to 'num_stories', the media will be sorted by decreasing number of stories in the past 90 days.
By default, calls that specify a name parameter will only return media that are not duplicates of some other media source. Media Cloud has many media sources that are either subsets of other media sources or are just holders for spidered media from a given media source, both of which are marked as duplicate media and are not included in the default results. If the 'include_dups' parameter is set to 1, those duplicate sources will be included in the results.
If the timespans_id parameter is specified, return media within the given time slice, sorted by descending inlink_count within the timespan. If topic_mode is set to 'live', return media from the live topic stories rather than from the frozen snapshot.
If the q parameter is specified, return only media that include at least one sentence that matches the given Solr query. For a description of the Solr query format, see the stories_public/list call.
URL: https://api.mediacloud.org/api/v2/media/list?last_media_id=1&rows=2
Output format is the same as for api/v2/media/single above.
URL | Function |
---|---|
api/v2/media/submit_suggestion | Suggest a media source for Media Cloud to crawl |
This API end point allows the user to suggest a new media source to the Media Cloud team for regular crawling.
Field | Description |
---|---|
url | URL of the media source home page (required) |
name | Human readable name of media source (optional) |
feed_url | URL of RSS, RDF, or Atom syndication feed for the source (optional) |
reason | Reason media source should be added to the system (optional) |
tags_ids | list of suggested tags to add to the source (optional) |
URL: https://api.mediacloud.org/api/v2/media/submit_suggestion
Input:
{
"name": "Cameroon Tribune",
"url": "http://www.cameroon-tribune.cm"
}
Output:
{ "success": 1 }
The Media Health API call provides information about the health of a media source, meaning to what degree we are capturing all of the stories published by that media source. Media Cloud collects its data via automatically detected RSS feeds on the open web. This means first that the system generally has data for a given media source from the time we first enter that source into our database. Second, Media Cloud data for a given media source is only as good as the set of feeds we have for that source. Our feed scraper is not perfect and so sometimes misses feeds it should be collecting. Third, feeds change over time. We periodically rescrape every media source for new feeds, but this takes time and is not perfect.
The only way we have of judging health is the relative number of stories over time. This call provides a set of metrics that compare the current number of stories being collected by the media source with the number of stories collected over the past 90 days, and also compares coverage over time with the expected volume. More details are in the field descriptions below.
URL | Function |
---|---|
api/v2/mediahealth/list | Return media health data for the given media sources |
Parameter | Default | Notes |
---|---|---|
media_id | none | Return health data for the given media sources. May be specified multiple times. |
Field | Description |
---|---|
media_id | The id of the media source |
is_healthy | Is the media source currently returning at least 25% of the 90 day averages of stories and sentences |
has_active_feed | Does the media source have at least one active syndicated feed (which may not be returning any stories) |
num_stories | Number of stories collected yesterday |
num_stories_w | Average number of stories collected in the last 7 days |
num_stories_90 | Average number of stories collected in the last 90 days |
num_stories_y | Average number of stories collected in the last year |
num_sentences | Number of sentences collected yesterday |
num_sentences_w | Average number of sentences collected in the last 7 days |
num_sentences_90 | Average number of sentences collected in the last 90 days |
num_sentences_y | Average number of sentences collected in the last year |
expected_stories | Average number of stories collected for each of the 20 days with the highest number of stories |
expected_sentences | Average number of sentences collected for each of the 20 days with the highest number of sentences |
start_date | First week on which at least 25% of expected_stories and expected_sentences were collected |
end_date | Last week on which at least 25% of expected_stories and expected_sentences were collected |
coverage_gaps | Number of weeks between start_date and end_date for which fewer than 25% of expected_stories or expected_sentences were collected |
coverage_gaps_list | List of weeks between start_date and end_date for which fewer than 25% of expected_stories or expected_sentences were collected |
Fetch media health information for media source 4438:
https://api.mediacloud.org/api/v2/mediahealth/list?media_id=4438
Response:
[
{
"media_id": "4438",
"is_healthy": 1,
"has_active_feed": 1,
"num_stories": 42,
"num_stories_w": "28.57",
"num_stories_90": "30.54",
"num_stories_y": "33.00",
"num_sentences": 1200,
"num_sentences_w": "873.86",
"num_sentences_90": "877.16",
"num_sentences_y": "926.83",
"start_date": "2011-01-03 00:00:00-05",
"end_date": "2016-02-22 00:00:00-05",
"expected_stories": "49.97",
"expected_sentences": "1166.22",
"coverage_gaps": 1,
"coverage_gaps_list": [
{
"media_id": "4438",
"stat_week": "2013-12-23 00:00:00-05",
"num_stories": "12.43",
"num_sentences": "350.29",
"expected_stories": "49.97",
"expected_sentences": "1166.22"
}
]
}
]
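Because the averages in this response arrive as strings, a client usually converts them before comparing. The sketch below condenses one record into a few ratios; the function name and the choice of ratios are illustrative assumptions, not part of the API.

```python
def summarize_health(record):
    """Condense one mediahealth/list record into comparable numbers (hypothetical helper)."""
    recent = float(record["num_stories_w"])       # averages are returned as strings
    baseline = float(record["num_stories_90"])
    expected = float(record["expected_stories"])
    return {
        "media_id": int(record["media_id"]),
        "is_healthy": bool(record["is_healthy"]),
        # Share of the 90 day average reached over the last 7 days.
        "week_vs_90_days": round(recent / baseline, 2) if baseline else None,
        # Share of the expected (20 busiest days) volume reached over the last 7 days.
        "week_vs_expected": round(recent / expected, 2) if expected else None,
        # Keep only the date part of each gap week.
        "gap_weeks": [gap["stat_week"][:10] for gap in record.get("coverage_gaps_list", [])],
    }
```

Applied to the record above, the last week reaches roughly 94% of the 90 day average and 57% of the expected volume, with one gap week starting 2013-12-23.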
A feed is either a syndicated feed, such as an RSS feed, or a single web page. Each feed is downloaded between once an hour and once a day depending on traffic. Each time a syndicated feed is downloaded, each new URL found in the feed is added to the feed's media source as a story. Each time a web page feed is downloaded, that web page itself is added as a story for the feed's media source.
Each feed belongs to a single media source. Each story can belong to one or more feeds from the same media source.
URL | Function |
---|---|
api/v2/feeds/single/<feeds_id> | Return the feed for which feeds_id equals <feeds_id> |
None.
URL: https://api.mediacloud.org/api/v2/feeds/single/1
[
{
"name": "Bits",
"url": "http://bits.blogs.nytimes.com/rss2.xml",
"feeds_id": 1,
"type": "syndicated",
"media_id": 1
}
]
URL | Function |
---|---|
api/v2/feeds/list | Return multiple feeds |
Parameter | Default | Notes |
---|---|---|
last_feeds_id | 0 | Return feeds in which feeds_id is greater than this value |
rows | 20 | Number of feeds to return. Cannot be larger than 100 |
media_id | (required) | Return feeds belonging to the media source |
URL: https://api.mediacloud.org/api/v2/feeds/list?media_id=1
Output format is the same as for api/v2/feeds/single above.
A story represents a single published piece of content. Each unique URL downloaded from any syndicated feed within a single media source is represented by a single story. For example, a single New York Times newspaper story is a Media Cloud story, as is a single Instapundit blog post. Only one story may exist for a given title for each 24 hours within a single media source.
The following table describes the meaning and origin of fields returned by both api/v2/stories_public/single and api/v2/stories_public/list.
Field | Description |
---|---|
stories_id | The internal Media Cloud ID for the story. |
media_id | The internal Media Cloud ID for the media source to which the story belongs. |
media_name | The name of the media source to which the story belongs. |
media_url | The URL of the media source to which the story belongs. |
publish_date | The publish date of the story as specified in the RSS feed. |
tags | A list of any tags associated with this story, including those written through the write-back API. |
collect_date | The date the RSS feed was actually downloaded. |
url | The URL field in the RSS feed. |
guid | The GUID field in the RSS feed. Defaults to the URL if no GUID is specified in the RSS feed. |
language | The language of the story as detected by the chromium compact language detector library. |
title | The title of the story as found in the RSS feed. |
ap_syndicated | Whether our detection algorithm thinks that this is an English language syndicated AP story |
URL | Function |
---|---|
api/v2/stories_public/single/<stories_id> | Return the story for which stories_id equals <stories_id> |
Note: This fetches data on the CC licensed Global Voices story "Myanmar's new flag and new name" from November 2010.
URL: https://api.mediacloud.org/api/v2/stories_public/single/27456565
[
{
"collect_date": "2010-11-24 15:33:39",
"url": "http://globalvoicesonline.org/2010/10/26/myanmars-new-flag-and-new-name/comment-page-1/#comment-1733161",
"guid": "http://globalvoicesonline.org/?p=169660#comment-1733161",
"publish_date": "2010-11-24 04:05:00",
"media_id": 1144,
"media_name": "Global Voices Online",
"media_url": "http://globalvoicesonline.org/",
"stories_id": 27456565,
"story_tags": [ 1234235 ]
}
]
URL | Function |
---|---|
api/v2/stories_public/list | Return multiple processed stories |
Parameter | Default | Notes |
---|---|---|
last_processed_stories_id | 0 | Return stories in which the processed_stories_id is greater than this value. |
rows | 20 | Number of stories to return, max 10,000. |
feeds_id | null | Return only stories that match the given feeds_id, sorted by descending publish date |
q | null | If specified, return only results that match the given Solr query. Only one q parameter may be included. |
fq | null | If specified, filter results by the given Solr query. More than one fq parameter may be included. |
sort | processed_stories_id | Returned results sort order. Supported values: processed_stories_id (the default); random - order results randomly but consistently for a given search |
The last_processed_stories_id parameter can be used to page through these results. The API will return stories with a processed_stories_id greater than this value. To get a continuous stream of stories as they are processed by Media Cloud, the user must make a series of calls to api/v2/stories_public/list in which last_processed_stories_id for each call is set to the processed_stories_id of the last story in the previous call to the API. A single call can only return up to 10,000 results, but you can get the full list of results by paging through it using last_processed_stories_id.
Note: stories_id and processed_stories_id are separate values. The order in which stories are processed is different than the stories_id order. The processing pipeline involves downloading, extracting, and vectoring stories. Requesting by the processed_stories_id field guarantees that the user will receive every story (matching the query criteria if present) in the order it is processed by the system.
The q and fq parameters specify queries to be sent to a Solr server that indexes all Media Cloud stories. The Solr server provides full text search indexing of each sentence collected by Media Cloud. All content is stored as individual sentences. The api/v2/stories_public/list call searches for sentences matching the q and / or fq parameters if specified and returns the stories that include at least one sentence returned by the specified query.
The q and fq parameters are passed directly through to Solr. Documentation of the format of the q and fq parameters is here.
Below are the fields that may be used as Solr query parameters, for example 'text:obama AND media_id:1':
Field | Description |
---|---|
sentence | the text of the sentence |
stories_id | a story ID |
media_id | the Media Cloud media source ID of a story |
publish_date | the publish date of a story |
tags_id_stories | the ID of a tag associated with a story |
tags_id_media | the ID of a tag associated with a media source |
processed_stories_id | the processed_stories_id as returned by stories_public/list |
Be aware that ':' is usually replaced with '%3A' in programmatically generated URLs.
Solr range queries may only be used within the fq parameter. Using a range query in the main q query will result in an error.
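The encoding and range-query rules above can be illustrated with a short Python sketch. The helper name is hypothetical, and the date range uses standard Solr date syntax, which is assumed rather than taken from this document.

```python
from urllib.parse import parse_qs, urlencode


def build_query_string(q, filter_queries=()):
    """Encode one q parameter and any number of fq parameters (hypothetical helper)."""
    pairs = [("q", q)] + [("fq", fq) for fq in filter_queries]
    return urlencode(pairs)  # ':' becomes '%3A', spaces become '+'


qs = build_query_string(
    "obama AND media_id:1",
    # Range queries go in fq only; the date format here is an assumption.
    ["publish_date:[2012-10-01T00:00:00Z TO 2012-11-01T00:00:00Z]"],
)
print(qs)
# Decoding recovers the original clauses, ':' included.
print(parse_qs(qs))
```

Passing a list of pairs to `urlencode` is what allows the fq key to repeat, which a plain dict would not.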
In addition, the following fields may be entered as pseudo queries within the Solr query:
Pseudo Query Field | Description |
---|---|
topic | a topic id |
timespan | a timespan id |
link_from_tag | a tag id, returns stories linked from stories associated with the tag |
link_to_story | a story id, returns stories that link to the story |
link_from_story | a story id, returns stories that are linked from the story |
link_to_medium | a medium id, returns stories that link to stories within the medium |
link_from_medium | a medium id, returns stories that are linked from stories within the medium |
To include one of these fields in a larger Solr query, delineate with {~ }, for example:
{~ topic:1 } and media_id:1
The API will translate the given pseudo query into a stories_id: clause in the larger Solr query. So the above query will be translated into the following, assuming topic 1 consists of stories with ids 1, 2, 3, and 4.
stories_id:( 1 2 3 4 ) and media_id:1
If '-1' is appended to the timespan query field value, the pseudo query will match stories from the live topic matching the given time slice rather than from the dump. For example, the following will return live stories from timespan 1234:
{~ timespan:1234-1 }
The link_* pseudo query fields all must be within the same {~ } clause as a timespan query and return links from the associated timespan. For example, the following returns stories that link to story 5678 within the specified time slice:
{~ timespan:1234-1 link_to_story:5678 }
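The pseudo query rules above (the {~ } delimiters, the '-1' live suffix, and link_* fields sharing a clause with a timespan) can be sketched as a small builder. The function `pseudo_query` and its arguments are hypothetical and only assemble the string; the API does the actual translation.

```python
def pseudo_query(timespan=None, live=False, **fields):
    """Build a {~ ... } pseudo query clause (hypothetical helper)."""
    parts = []
    if timespan is not None:
        # Appending '-1' selects the live topic rather than the dump.
        parts.append(f"timespan:{timespan}{'-1' if live else ''}")
    if timespan is None and any(name.startswith("link_") for name in fields):
        raise ValueError("link_* fields must share a clause with a timespan")
    parts.extend(f"{name}:{value}" for name, value in fields.items())
    return "{~ " + " ".join(parts) + " }"


print(pseudo_query(topic=1) + " and media_id:1")
print(pseudo_query(timespan=1234, live=True, link_to_story=5678))
```

The two printed strings reproduce the examples given in the text above.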
The output of these calls is in exactly the same format as for the api/v2/stories_public/single call.
URL: https://api.mediacloud.org/api/v2/stories_public/list?last_processed_stories_id=8625915
Return a stream of all stories processed by Media Cloud, greater than the last_processed_stories_id.
Return a stream of all stories from The New York Times mentioning 'obama' greater than the given last_processed_stories_id.
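The paging scheme described for stories_public/list can be written as a generator. This is a sketch under two assumptions: `fetch_page` is a caller-supplied function (not part of any Media Cloud client) that takes a dict of query parameters and returns the decoded JSON list, and an empty list means no further stories are currently available.

```python
def iter_processed_stories(fetch_page, q=None, fq=None, rows=1000, start_after=0):
    """Yield stories in processing order by paging with last_processed_stories_id."""
    last_id = start_after
    while True:
        params = {"last_processed_stories_id": last_id, "rows": rows}
        if q:
            params["q"] = q
        if fq:
            params["fq"] = fq
        page = fetch_page(params)
        if not page:
            return  # assumption: an empty page means nothing more to fetch right now
        yield from page
        # The next call resumes after the last story of this page.
        last_id = page[-1]["processed_stories_id"]
```

Storing the last processed_stories_id seen lets a later run pass it as `start_after` and resume the stream where it stopped.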
Parameter | Default | Notes |
---|---|---|
q | n/a | q ("query") parameter which is passed directly to Solr |
fq | null | fq ("filter query") parameter which is passed directly to Solr |
split | null | if set to 1 or true, split the counts into date ranges |
split_period | day | return counts for these date periods: day, week, month, year |
The q and fq parameters are passed directly through to Solr (see description of q and fq parameters in api/v2/stories_public/list section above).
The call returns the number of stories returned by Solr for the specified query.
If split is specified, split the counts into periods set by split_period.
Count stories containing the word 'obama' in The New York Times.
URL: https://api.mediacloud.org/api/v2/stories_public/count?q=obama&fq=media_id:1
{
"count": 6620
}
Count stories containing 'africa' in the New York Times for each week from 2014-01-01 to 2014-03-01:
{
"counts": [
{
"count": 25,
"date": "2013-12-30 00:00:00"
},
{
"count": 59,
"date": "2014-01-06 00:00:00"
},
{
"count": 70,
"date": "2014-01-13 00:00:00"
},
{
"count": 71,
"date": "2014-01-20 00:00:00"
},
{
"count": 80,
"date": "2014-01-27 00:00:00"
},
{
"count": 57,
"date": "2014-02-03 00:00:00"
},
{
"count": 54,
"date": "2014-02-10 00:00:00"
},
{
"count": 45,
"date": "2014-02-17 00:00:00"
},
{
"count": 44,
"date": "2014-02-24 00:00:00"
}
]
}
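A split response like the one above is easy to post-process. The helper below is an illustrative sketch (its name and return shape are assumptions) that totals the buckets and finds the busiest period.

```python
from datetime import datetime


def summarize_split_counts(payload):
    """Return the total and the peak bucket of a split count response (hypothetical helper)."""
    buckets = [
        (datetime.strptime(item["date"], "%Y-%m-%d %H:%M:%S").date(), item["count"])
        for item in payload["counts"]
    ]
    peak_date, peak_count = max(buckets, key=lambda bucket: bucket[1])
    return {
        "total": sum(count for _, count in buckets),
        "peak_date": peak_date.isoformat(),
        "peak_count": peak_count,
    }
```

For the weekly response above, this yields a total of 505 stories with the peak of 80 in the week starting 2014-01-27.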
Parameter | Default | Notes |
---|---|---|
q | n/a | q ("query") parameter which is passed directly to Solr |
fq | null | fq ("filter query") parameter which is passed directly to Solr |
limit | 1000 | number of tags to fetch from Solr |
tag_sets_id | null | return only tags belonging to this tag set |
The q and fq parameters are passed directly through to Solr (see description of q and fq parameters in api/v2/stories_public/list section above).
The call returns a list of the tags most commonly associated with stories that match the given query. The limit parameter is applied before the tag_sets_id parameter, so fewer than limit (or zero) results may be returned for a given tag set even if tags from that tag set are associated with stories matching the query.
Count tags in stories containing the word 'obama' in The New York Times.
URL: https://api.mediacloud.org/api/v2/stories_public/tag_count?q=obama&fq=media_id:1&limit=3
[
{
"count": 20240,
"description": "politics and government",
"is_static": false,
"label": "politics and government",
"show_on_media": null,
"show_on_stories": null,
"tag": "politics and government",
"tag_set_label": "nyt_labels",
"tag_set_name": "nyt_labels",
"tag_sets_id": 1963,
"tags_id": 9360836
},
{
"count": 17491,
"description": "Obama",
"is_static": false,
"label": "Obama",
"show_on_media": null,
"show_on_stories": null,
"tag": "Obama",
"tag_set_label": "cliff_people",
"tag_set_name": "cliff_people",
"tag_sets_id": 2389,
"tags_id": 9362721
},
{
"count": 15904,
"description": "united states politics and government",
"is_static": false,
"label": "united states politics and government",
"show_on_media": null,
"show_on_stories": null,
"tag": "united states politics and government",
"tag_set_label": "nyt_labels",
"tag_set_name": "nyt_labels",
"tag_sets_id": 1963,
"tags_id": 9360846
}
]
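Since limit is applied before tag_sets_id, one possible workaround is to request a larger limit without tag_sets_id and filter on the client. The following sketch shows only the filtering step; the function name is hypothetical.

```python
def top_tags_in_set(tag_counts, tag_sets_id, n=10):
    """Keep the n highest-count tags of one tag set from a tag_count response."""
    in_set = [tag for tag in tag_counts if tag["tag_sets_id"] == tag_sets_id]
    return sorted(in_set, key=lambda tag: tag["count"], reverse=True)[:n]
```

With the response above, `top_tags_in_set(response, 1963)` would keep the two nyt_labels tags and drop the cliff_people tag.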
Parameter | Default | Notes |
---|---|---|
q | n/a | q ("query") parameter which is passed directly to Solr |
fq | null | fq ("filter query") parameter which is passed directly to Solr |
rows | 1000 | number of stories to return from Solr, max 100,000 |
max_words | n/a | max number of non-zero count word stems to return for each story |
stopword_length | n/a | if set to 'tiny', 'short', or 'long', eliminate the stop word list of that length |
The q and fq parameters are passed directly through to Solr (see description of q and fq parameters in api/v2/stories_public/list section above).
If stopword_length is specified, eliminate the 'tiny', 'short', or 'long' list of stopwords from the results, if the system has stopwords for the language of each story. See Supported Languages for a list of supported languages and their codes.
Field | Description |
---|---|
word_matrix | a dictionary of stories_ids, each pointing to a dictionary of word counts |
word_list | the list of word stems counted, in the order of the index used for the word counts |
The word_matrix is a dictionary with the stories_id as the key and the word count dictionary as the value. For each word count dictionary, the key is the word index of the word in the word_list and the value is the count of the word in that story.
The word list is a list of lists. The overall list includes the stems in the order that is referenced by the word index in the word_matrix word count dictionary for each story. Each individual list member includes the stem counted and the most common full word used with that stem in the set.
For the following two stories:
story id 1: 'foo bar bars'
story id 2: 'foo bars foos foo'
the returned data would look like:
{
"word_matrix": {
"1": {
"0": 1,
"1": 2
},
"2": {
"0": 3,
"1": 1
}
},
"word_list": [
["foo", "foo"],
["bar", "bars"]
]
}
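To make the index-based layout concrete, the sketch below expands a response of this shape into per-story counts keyed by stem. The function name is an assumption for illustration.

```python
def decode_word_matrix(payload):
    """Map each stories_id to a {stem: count} dictionary (hypothetical helper)."""
    # word_list entries are [stem, most common full term]; the position is the word index.
    stems = [stem for stem, _term in payload["word_list"]]
    return {
        int(stories_id): {stems[int(index)]: count for index, count in counts.items()}
        for stories_id, counts in payload["word_matrix"].items()
    }
```

For the two stories above, the result maps story 1 to one 'foo' and two 'bar', and story 2 to three 'foo' and one 'bar'.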
The text of every story processed by Media Cloud is parsed into individual sentences. Duplicate sentences within the same media source in the same week are dropped (the large majority of those duplicate sentences are navigational snippets wrongly included in the extracted text by the extractor algorithm).
This call has been removed. Consider using api/v2/stories_public/count instead.
This call has been removed. Consider using api/v2/stories_public/tag_count instead.
Returns word frequency counts of the most common words in a randomly sampled set of all sentences returned by querying Solr using the q and fq parameters, with stopwords removed by default. Words are stemmed before being counted. For each word, the call returns the stem and the full term most used with the given stem in the specified Solr query (for example, in the example below, 'democrat' is the stem that appeared 58 times and 'democrats' is the word that was most commonly stemmed into 'democrat').
Parameter | Default | Notes |
---|---|---|
q | n/a | q ("query") parameter which is passed directly to Solr |
fq | null | fq ("filter query") parameter which is passed directly to Solr |
num_words | 500 | Number of words to return |
sample_size | 1000 | Number of sentences to sample, max 100,000 |
random_seed | 1 | Seed value to use when generating random sample |
include_stopwords | 0 | Set to 1 to disable stopword removal |
include_stats | 0 | Set to 1 to include stats about the request as a whole (such as total number of words) |
See /api/v2/stories_public/list above for Solr query syntax.
To provide quick results, the API counts words in a randomly sampled set of sentences returned by the given query. By default, the request will sample 1000 sentences and return 500 words. You can make the API sample more sentences. The system takes about one second to process each multiple of 1000 sentences.
Sentences are tokenized into words by identifying each sentence's language and using that language's sentence splitting algorithm. Additionally, both English and the identified language's stopwords are removed from the results. See Supported Languages for a list of supported languages and their codes.
Setting the include_stats parameter to 1 changes the structure of the response, as shown in the example below. The following fields are included in the stats response:
Field | Description |
---|---|
num_words_returned | The number of words returned by the call, up to num_words |
num_sentences_returned | The number of sentences returned by the call, up to sample_size |
num_sentences_found | The total number of sentences found by Solr to match the query |
num_words_param | The num_words param passed into the call, or the default value |
sample_size_param | The sample size passed into the call, or the default value |
Get word frequency counts for all sentences containing the word 'obama' in The New York Times
URL: https://api.mediacloud.org/api/v2/wc/list?q=obama+AND+media_id:1
[
{
"count": 1014,
"stem": "obama",
"term": "obama"
},
{
"count": 106,
"stem": "republican",
"term": "republican"
},
{
"count": 78,
"stem": "campaign",
"term": "campaign"
},
{
"count": 72,
"stem": "romney",
"term": "romney"
},
{
"count": 59,
"stem": "washington",
"term": "washington"
},
{
"count": 58,
"stem": "democrat",
"term": "democrats"
}
]
Get word frequency counts for all sentences containing the word 'obama' in The New York Times, with stats data included
URL: https://api.mediacloud.org/api/v2/wc/list?q=obama+AND+media_id:1&include_stats=1
{
"stats": {
"num_words_returned": 5123,
"num_sentences_returned": 899,
"num_sentences_found": 899
},
"words": [
{
"count":1014,
"stem":"obama",
"term":"obama"
},
{
"count":106,
"stem":"republican",
"term":"republican"
},
{
"count":78,
"stem":"campaign",
"term":"campaign"
},
{
"count":72,
"stem":"romney",
"term":"romney"
},
{
"count":59,
"stem":"washington",
"term":"washington"
},
{
"count":58,
"stem":"democrat",
"term":"democrats"
}
]
}
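Because the response is a bare list without stats and a dictionary with them, client code may want to normalise both shapes. A minimal sketch, with a hypothetical helper name:

```python
def split_wc_response(payload):
    """Return (words, stats) for either shape of the wc/list response."""
    if isinstance(payload, dict):
        # Shape with stats: {"stats": {...}, "words": [...]}
        return payload.get("words", []), payload.get("stats", {})
    # Shape without stats: a bare list of word entries.
    return payload, {}


words, stats = split_wc_response([{"count": 58, "stem": "democrat", "term": "democrats"}])
print(words[0]["term"], stats)
```

Downstream code can then always iterate over `words` and treat `stats` as optional.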
Media Cloud associates tags with media sources, stories, and individual sentences. A tag consists of a short snippet of text, a tags_id, and a tag_sets_id. Each tag belongs to a single tag set. The tag set provides a separate name space for a group of related tags. Each tag has a unique name ('tag') within its tag set. Each tag set consists of a tag_sets_id and a unique name.
For example, the 'gv_country' tag set includes the tags japan, brazil, haiti and so on. Each of these tags is associated with some number of media sources (indicating that the given media source has been cited in a story tagged with the given country in a Global Voices post).
URL | Function |
---|---|
api/v2/tags/single/<tags_id> | Return the tag in which tags_id equals <tags_id> |
None.
Field | Description |
---|---|
tags_id | Media Cloud internal tag ID |
tag_sets_id | Media Cloud internal ID of the parent tag set |
tag | text of tag, often cryptic |
label | a short human readable label for the tag |
description | a couple of sentences describing the meaning of the tag |
show_on_media | recommendation to show this tag as an option for searching Solr using the tags_id_media |
show_on_stories | recommendation to show this tag as an option for searching Solr using the tags_id_stories |
is_static | if true, users can expect this tag and its associations not to change in major ways |
tag_set_name | name field of associated tag set |
tag_set_label | label field of associated tag set |
tag_set_description | description field of associated tag set |
The show_on_media and show_on_stories fields are useful for picking out which tags are likely to be useful for external researchers. A tag should be considered useful for searching via tags_id_media or tags_id_stories if show_on_media or show_on_stories, respectively, is set to true for either the specific tag or its parent tag set.
Fetching information on the tag 8875027.
URL: https://api.mediacloud.org/api/v2/tags/single/8875027
Response:
[
{
"tag_sets_id": 5,
"show_on_stories": null,
"label": "U.S. Mainstream Media",
"tag": "ap_english_us_top25_20100110",
"tags_id": 8875027,
"show_on_media": 1,
"description": "Top U.S. mainstream media according Google Ad Planner's measure of unique monthly users.",
"tag_set_name": "collection",
"tag_set_label": "Collection",
"tag_set_description": "Curated collections of media sources"
}
]
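The rule for judging a tag useful (show_on_media or show_on_stories set on either the tag or its parent tag set) can be expressed directly. This is a sketch; the function name and the returned dictionary keys are illustrative choices.

```python
def searchable_fields(tag, tag_set=None):
    """Report which Solr tag fields a tag is recommended for (hypothetical helper)."""
    tag_set = tag_set or {}
    return {
        # A flag on either the tag or its parent tag set is enough.
        "tags_id_media": bool(tag.get("show_on_media") or tag_set.get("show_on_media")),
        "tags_id_stories": bool(tag.get("show_on_stories") or tag_set.get("show_on_stories")),
    }
```

For the tag above, show_on_media is 1 and show_on_stories is null, so the tag would be recommended for tags_id_media searches only unless its tag set sets show_on_stories.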
URL | Function |
---|---|
api/v2/tags/list | Return multiple tags |
Parameter | Default | Notes |
---|---|---|
last_tags_id | 0 | Return tags with a tags_id greater than this value |
tag_sets_id | none | Return tags belonging to the given tag sets. The most useful tag set is tag set 5. Can be passed multiple times to return any tag belonging to any of the tag sets. |
rows | 20 | Number of tags to return. Cannot be larger than 100 |
public | none | If public=1, return only public tags (see below) |
search | none | Search for tags by text (see below) |
similar_tags_id | none | return list of tags with a similar |
If set to 1, the public parameter will return only tags that are generally useful for public consumption. Those tags are defined as tags for which show_on_media or show_on_stories is set to true for either the tag or the tag's parent tag_set. As described above in tags/single, a public tag can be usefully searched using the Solr tags_id_media field if show_on_media is true and by the tags_id_stories field if show_on_stories is true.
If the search parameter is set, the call will return only tags that match a case-insensitive search for the given text. The search includes the tag and label fields of the tags plus the name and label fields of the associated tag sets. So a search for 'politics' will match tags whose tag or label field includes 'politics' and also tags belonging to a tag set whose name or label field includes 'politics'. If the search parameter has fewer than three characters, an empty result set will be returned.
URL: https://api.mediacloud.org/api/v2/tags/list?rows=2&tag_sets_id=5&last_tags_id=8875026
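The parameters above can be combined into a request URL as in the following minimal sketch. The helper name `build_tags_list_url` is hypothetical (it is not part of any Media Cloud client), and the `key` parameter is omitted as elsewhere in this document.

```python
from urllib.parse import urlencode

# Base URL taken from the example URLs in this document.
API_BASE = "https://api.mediacloud.org/api/v2"


def build_tags_list_url(tag_sets_ids=(), last_tags_id=0, rows=20, public=None, search=None):
    """Build a tags/list URL from the documented parameters (hypothetical helper)."""
    if rows > 100:
        raise ValueError("rows cannot be larger than 100")
    query = [("last_tags_id", last_tags_id), ("rows", rows)]
    # tag_sets_id may be passed multiple times, once per tag set.
    query.extend(("tag_sets_id", ts_id) for ts_id in tag_sets_ids)
    if public:
        query.append(("public", 1))
    if search is not None:
        query.append(("search", search))
    return API_BASE + "/tags/list?" + urlencode(query)


print(build_tags_list_url(tag_sets_ids=[5], last_tags_id=8875026, rows=2))
```

Paging through all tags of a tag set then amounts to calling the URL repeatedly with last_tags_id set to the largest tags_id seen so far.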
URL | Function |
---|---|
api/v2/tag_sets/single/<tag_sets_id> | Return the tag set in which tag_sets_id equals <tag_sets_id> |

None.
Field | Description |
---|---|
tag_sets_id | Media Cloud internal ID of the tag set |
name | text of tag set, often cryptic |
label | a short human readable label for the tag set |
description | a couple of sentences describing the meaning of the tag set |
show_on_media | recommendation to show this tag set as an option for searching Solr using the tags_id_media field |
show_on_stories | recommendation to show this tag set as an option for searching Solr using the tags_id_stories field |

The show_on_media and show_on_stories fields are useful for picking out which tags are likely to be useful for external researchers. A tag should be considered useful for searching via tags_id_media or tags_id_stories if show_on_media or show_on_stories, respectively, is set to true for either the specific tag or its parent tag set.
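The rule above can be written as a small predicate. This is a sketch only: the function name `is_useful_for` is hypothetical, and it assumes the tag and tag set are plain dicts shaped like the tags/single and tag_sets/single responses, with null decoded to None.

```python
def is_useful_for(tag, tag_set, field):
    """Return True if the flag is set on the tag or on its parent tag set.

    field is either 'show_on_media' or 'show_on_stories'.
    """
    if field not in ("show_on_media", "show_on_stories"):
        raise ValueError("unknown field: {}".format(field))
    # A null (None) or 0 value counts as "not set".
    return bool(tag.get(field)) or bool(tag_set.get(field))


# Values copied from the tags/single and tag_sets/single examples in this document.
tag = {"tags_id": 8875027, "tag_sets_id": 5, "show_on_media": 1, "show_on_stories": None}
tag_set = {"tag_sets_id": 5, "show_on_media": None, "show_on_stories": None}

print(is_useful_for(tag, tag_set, "show_on_media"))
print(is_useful_for(tag, tag_set, "show_on_stories"))
```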
Fetching information on the tag set 5.
URL: https://api.mediacloud.org/api/v2/tag_sets/single/5
Response:
[
{
"tag_sets_id": 5,
"show_on_stories": null,
"name": "collection",
"label": "Collections",
"show_on_media": null,
"description": "Curated collections of media sources. This is our primary way of organizing our media sources -- almost every media source in our system is a member of one or more of these curated collections. Some collections are manually curated, and others are generated using quantitative metrics."
}
]
URL | Function |
---|---|
api/v2/tag_sets/list | Return all tag_sets |

Parameter | Default | Notes |
---|---|---|
last_tag_sets_id | 0 | Return tag sets with a tag_sets_id greater than this value |
rows | 20 | Number of tag sets to return. Cannot be larger than 100 |

None.
URL: https://api.mediacloud.org/api/v2/tag_sets/list
URL | Function |
---|---|
api/v2/auth/register | Register a new user. |

admin.

Field | Description |
---|---|
email | (string) Email of new user. |
password | (string) Password of new user. |
full_name | (string) Full name of new user. |
notes | (string) User's explanation on how user intends to use Media Cloud. |
subscribe_to_newsletter | (integer) Whether or not user wants to subscribe to our mailing list. |
activation_url | (string) Client's URL used for user account activation. |

Asking user to re-enter password and comparing the two values is left to the client.
Client should prevent automated registrations with a CAPTCHA.
After successful registration, the user cannot log in immediately because they first need to activate their account via email. The user will be sent an email with a link to activation_url and the following GET parameters:
- email -- user's email to be used as a parameter to auth/activate;
- activation_token -- user's activation token to be used as a parameter to auth/activate.
{
"success": 1
}
After successful registration, the user is sent an email inviting them to open a link (activation_url?email=...&activation_token=...).
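A client that receives the user at activation_url needs to pull the two GET parameters out of the link and forward them to auth/activate. The following is a rough sketch of that step; the helper name `activation_payload` is hypothetical, and the example link simply reuses the activation_url and token values that appear in this document's examples.

```python
from urllib.parse import urlsplit, parse_qs


def activation_payload(clicked_url):
    """Extract the auth/activate input fields from the link the user opened."""
    params = parse_qs(urlsplit(clicked_url).query)
    try:
        return {
            "email": params["email"][0],
            "activation_token": params["activation_token"][0],
        }
    except KeyError as missing:
        raise ValueError("activation link is missing parameter {}".format(missing))


link = ("https://dashboard.mediacloud.org/activate"
        "?email=foo%40bar.baz"
        "&activation_token=3a0e7de3ba8e19227847b59e43f2ce54c98ec897")
print(activation_payload(link))
```

The returned dict has the same shape as the auth/activate input described in this document.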
{
"error": "Reason why the user can not be registered (e.g. duplicate email)."
}
URL: https://api.mediacloud.org/api/v2/auth/register
Input:
{
"email": "foo@bar.baz",
"password": "qwerty1",
"full_name": "Foo Bar",
"notes": "Just feeling like it.",
"subscribe_to_newsletter": 1,
"activation_url": "https://dashboard.mediacloud.org/activate"
}
Output:
{
"success": 1
}
URL | Function |
---|---|
api/v2/auth/activate | Activate user using email and activation token from registration email. |

admin.

Field | Description |
---|---|
email | (string) Email of user to be activated. |
activation_token | (string) Activation token sent by email. |

{
"success": 1,
"profile": {
"Full profile information as in auth/profile."
}
}
{
"error": "Reason why user activation has failed."
}
URL: https://api.mediacloud.org/api/v2/auth/activate
Input:
{
"email": "foo@bar.baz",
"activation_token": "3a0e7de3ba8e19227847b59e43f2ce54c98ec897"
}
Output:
{
"success": 1,
"profile": {
"Full profile information as in auth/profile."
}
}
URL | Function |
---|---|
api/v2/auth/resend_activation_link | Resend activation email for newly registered user. |

admin.

Field | Description |
---|---|
email | (string) Email of newly created user to resend the activation email to. |
activation_url | (string) Client's URL used for user account activation. |

For the description of activation_url, see auth/register.
{
"success": 1
}
{
"error": "Reason why the activation email can not be resent."
}
URL: https://api.mediacloud.org/api/v2/auth/resend_activation_link
Input:
{
"email": "foo@bar.baz",
"activation_url": "https://dashboard.mediacloud.org/activate"
}
Output:
{
"success": 1
}
URL | Function |
---|---|
api/v2/auth/send_password_reset_link | Email a link to user to be used to reset their password. |

admin.

Field | Description |
---|---|
email | (string) Email of user to send the password reset link to. |
password_reset_url | (string) Client's URL used for setting new password. |

The user will be sent an email with a link to password_reset_url and the following GET parameters:
- email -- user's email to be used as a parameter to auth/reset_password;
- password_reset_token -- user's password reset token to be used as a parameter to auth/reset_password.
{
"success": 1
}
After a successful auth/send_password_reset_link call, the user is sent an email inviting them to open a link (password_reset_url?email=...&password_reset_token=...).
{
"error": "Reason why the password reset link can not be sent."
}
URL: https://api.mediacloud.org/api/v2/auth/send_password_reset_link
Input:
{
"email": "foo@bar.baz",
"password_reset_url": "https://dashboard.mediacloud.org/reset_password"
}
Output:
{
"success": 1
}
URL | Function |
---|---|
api/v2/auth/reset_password | Reset user's password using their password reset token sent by auth/send_password_reset_link. |

admin.

Field | Description |
---|---|
email | (string) Email of user whose password is to be reset. |
password_reset_token | (string) Password reset token sent by email. |
new_password | (string) User's new password. |

{
"success": 1
}
{
"error": "Reason why the password can not be reset."
}
URL: https://api.mediacloud.org/api/v2/auth/reset_password
Input:
{
"email": "foo@bar.baz",
"password_reset_token": "3a0e7de3ba8e19227847b59e43f2ce54c98ec897",
"new_password": "qwerty1"
}
Output:
{
"success": 1
}
URL | Function |
---|---|
api/v2/auth/login | Authenticate user with email + password and return user's API key and profile. |

API call is rate-limited.

admin-read.

Parameter | Notes |
---|---|
email | (string) Email address of the user. |
password | (string) Password of the user. |

{
"success": 1,
"profile": {
"Full profile information as in auth/profile."
}
}
{
"error": "User was not found, password is incorrect, user is inactive or some other reason."
}
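Because the auth calls return either a success object or an error object, a client usually needs one place that tells the two apart. The sketch below does only that, on a raw JSON response body; `AuthError` and `profile_from_response` are hypothetical names, and how the request itself is sent is deliberately left out.

```python
import json


class AuthError(Exception):
    """Raised when an auth call returns an "error" object (hypothetical name)."""


def profile_from_response(body):
    """Return the "profile" object from a successful response body, else raise."""
    payload = json.loads(body)
    if "error" in payload:
        raise AuthError(payload["error"])
    if payload.get("success") != 1:
        raise AuthError("unexpected response: {!r}".format(payload))
    return payload["profile"]


# Illustrative response bodies, shaped like the success and error objects above.
ok_body = '{"success": 1, "profile": {"email": "user@email.com", "api_key": "example-key"}}'
error_body = '{"error": "User was not found."}'

print(profile_from_response(ok_body)["api_key"])
```

The API key in the returned profile is what subsequent calls pass as the key parameter.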
URL: https://api.mediacloud.org/api/v2/auth/login
Input:
{
"email": "user@email.com",
"password": "qwerty1"
}
Output:
{
"success": 1,
"profile": {
"Full profile information as in auth/profile."
}
}
URL | Function |
---|---|
api/v2/auth/profile | Return profile information about the requesting user. |

search.
{
"email": "(string) users@email.address",
"full_name": "(string) User's Full Name",
"api_key": "(string) User's API key.",
"notes": "(string) User's 'notes' field.",
"created_date": "(ISO 8601 date) of when the user was created.",
"active": "(integer) 1 if user is active (has activated account via email), 0 otherwise.",
"auth_roles": [
"(string) user-role-1",
"(string) user-role-2"
],
"limits": {
"weekly": {
"requests": {
"used": "(integer) Weekly request count",
"limit": "(integer) Weekly request limit; 0 if no limit"
},
"requested_items": {
"used": "(integer) Weekly requested items count",
"limit": "(integer) Weekly requested items limit; 0 if no limit"
}
}
}
}
Includes a list of authentication roles for the user that give the user permission to access various parts of the backend web interface and some of the private API functionality (that for example allow editing and administration of Media Cloud's sources).
Media Cloud currently includes the following authentication roles:
Role | Permission Granted |
---|---|
admin | Read and write every resource |
admin-readonly | Read every resource |
media-edit | Edit media sources |
stories-edit | Edit stories |
search | Access https://core.mediacloud.org/search page |
tm | Access legacy topic mapper web interface |
tm-readonly | Access legacy topic mapper web interface without editing privileges |

URL: https://api.mediacloud.org/api/v2/auth/profile
{
"email": "hroberts@cyber.law.harvard.edu",
"full_name": "Hal Roberts",
"api_key": "bae132d8de0e0565cc9b84ec022e367f71f6dabf",
"notes": "Media Cloud Geek",
"created_date": "2017-03-24T03:23:47+00:00",
"active": 1,
"auth_roles": [
"media-edit",
"stories-edit"
],
"limits": {
"weekly": {
"requests": {
"used": 200,
"limit": 0
},
"requested_items": {
"used": 2000,
"limit": 0
}
}
}
}
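The limits object can be used to work out how much of the weekly quota is left, keeping in mind that a limit of 0 means "no limit". A minimal sketch, assuming the profile has already been decoded into a dict (the function name `remaining_weekly` is hypothetical):

```python
def remaining_weekly(profile, kind="requests"):
    """Return the remaining weekly quota, or None when there is no limit.

    kind is 'requests' or 'requested_items'.
    """
    usage = profile["limits"]["weekly"][kind]
    if usage["limit"] == 0:
        # A limit of 0 means the user is not limited.
        return None
    return max(usage["limit"] - usage["used"], 0)


# The limits from the example response above: both limits are 0 (unlimited).
unlimited = {"limits": {"weekly": {
    "requests": {"used": 200, "limit": 0},
    "requested_items": {"used": 2000, "limit": 0},
}}}
# An illustrative, made-up limited profile.
limited = {"limits": {"weekly": {
    "requests": {"used": 200, "limit": 1000},
    "requested_items": {"used": 2000, "limit": 1500},
}}}

print(remaining_weekly(unlimited))
print(remaining_weekly(limited))
print(remaining_weekly(limited, "requested_items"))
```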
URL | Function |
---|---|
api/v2/auth/change_password | Change user's password. |

search.

Field | Description |
---|---|
old_password | (string) User's old password. |
new_password | (string) User's new password. |

Asking user to re-enter password and comparing the two values is left to the client.
{
"success": 1
}
{
"error": "Reason why the password can not be changed."
}
URL: https://api.mediacloud.org/api/v2/auth/change_password
Input:
{
"old_password": "qwerty1",
"new_password": "qwerty1"
}
Output:
{
"success": 1
}
URL | Function |
---|---|
api/v2/auth/reset_api_key | Reset user's API key. |

search.
{
"success": 1,
"profile": {
"Full profile information as in auth/profile, including the new API key."
}
}
{
"error": "Reason why resetting user's API key has failed."
}
URL: https://api.mediacloud.org/api/v2/auth/reset_api_key
Output:
{
"success": 1,
"profile": {
"Full profile information as in auth/profile, including the new API key."
}
}
URL | Function |
---|---|
api/v2/stats/list | Return basic summary stats about total sources, stories, feeds, etc. processed by Media Cloud |

( none )

Field | Description |
---|---|
total_stories | total number of stories in the Media Cloud database |
total_downloads | total number of downloads (including stories and feeds) in the Media Cloud database |
total_sentences | total number of sentences in the Media Cloud database |
active_crawled_feeds | number of syndicated feeds with a story in the last 180 days |
active_crawled_media | number of media sources with an active crawled feed |
daily_stories | number of stories added yesterday |
daily_downloads | number of downloads added yesterday |

URL: https://api.mediacloud.org/api/v2/stats/list
{
"total_stories": 516145344,
"total_downloads": 941078656,
"total_sentences": 6899028480,
"active_crawled_media": 123,
"active_crawled_feeds": 123,
"daily_stories": 123,
"daily_downloads": 123
}
Detect whether a given block of content is likely to be AP syndicated content by looking for certain signals in the text (for example 'boston (ap)') and by comparing the text to the text of AP content in the Media Cloud database.

Field | Description |
---|---|
content | text or html content |

Field | Description |
---|---|
is_syndicated | 1 if the story is syndicated, 0 otherwise |
URL: https://api.mediacloud.org/api/v2/util/is_syndicated_ap
Input:
{
"content": "WASHINGTON (AP) -- Republican Sen. Marco Rubio declared Thursday he will vote against the GOP'S sweeping tax package unless negotiators expand its child tax credit, jeopardizing the Republicans' razor-thin margin as they try to muscle the $1.5 trillion bill through Congress next week."
}
{
"is_syndicated": 1
}
Note: The Python examples below are included for reference purposes. However, a Python client for our API is now available and most Python users will find it much easier to use the API client instead of making web requests directly.
The format of the API responses is determined by the Accept header on the request. The default is application/json. Other supported formats include text/html, text/x-json, and text/x-php-serialization. It's recommended that you explicitly set the Accept header rather than relying on the default.
Here's an example of setting the Accept header in Python:
from importlib.metadata import version
import requests

# Compare the version numerically; comparing version strings directly is unreliable.
assert tuple(int(p) for p in version('requests').split('.')[:3]) >= (1, 2, 3)

MY_KEY = 'YOUR_API_KEY'  # replace with your own API key
params = {'key': MY_KEY}
r = requests.get('https://api.mediacloud.org/api/v2/media/list',
                 params=params,
                 headers={'Accept': 'application/json'})
data = r.json()
# Create a CSV file with all media sources.
# MY_KEY is assumed to hold your API key.
import csv
import requests

media = []
start = 0
rows = 100
while True:
    params = {'start': start, 'rows': rows, 'key': MY_KEY}
    print("start:{} rows:{}".format(start, rows))
    r = requests.get('https://api.mediacloud.org/api/v2/media/list', params=params, headers={'Accept': 'application/json'})
    data = r.json()
    if len(data) == 0:
        break
    start += rows
    media.extend(data)

fieldnames = ['media_id', 'url', 'name']
# newline='' lets the csv module control line endings.
with open('/tmp/media.csv', 'w', newline='') as csvfile:
    print("open")
    # extrasaction='ignore' drops any fields not listed in fieldnames.
    cwriter = csv.DictWriter(csvfile, fieldnames, extrasaction='ignore')
    cwriter.writeheader()
    cwriter.writerows(media)
This is broken down into multiple steps for convenience and because that's probably how a real user would do it.
You almost always want to search by a specific media source or media collection. The easiest way to find a relevant media collection is to use our Sources Tool. The URL for the US Mainstream Media collection in the Sources Tool looks like this:
https://sources.mediameter.org/#media-tag/8875027/details
The number in that URL is the tags_id of the media collection.
We can obtain all stories by repeatedly querying api/v2/stories_public/list using the q parameter to restrict to tags_id_media:8875027 and changing the last_processed_stories_id parameter.
This is shown in the Python code below, where process_stories is a user-provided function to process this data.
import requests

# MY_KEY is assumed to hold your API key.
start = 0
rows = 100
while True:
    params = {'last_processed_stories_id': start, 'rows': rows, 'q': 'tags_id_media:8875027', 'key': MY_KEY}
    print("Fetching {} stories starting from {}".format(rows, start))
    r = requests.get('https://api.mediacloud.org/api/v2/stories_public/list/', params=params, headers={'Accept': 'application/json'})
    stories = r.json()
    if len(stories) == 0:
        break
    # Continue after the last story returned in this page.
    start = stories[-1]['processed_stories_id']
    process_stories(stories)
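The paging pattern used here (ask for stories after last_processed_stories_id, then continue from the last story returned) can be separated from the HTTP call, which makes it easier to reuse and to test. The sketch below is one way to do that; `iter_stories` and `fake_fetch` are hypothetical names, and the fake page function stands in for the real request under the assumption that each page is ordered by processed_stories_id.

```python
def iter_stories(fetch_page, rows=100):
    """Yield stories one by one, paging with last_processed_stories_id.

    fetch_page(last_processed_stories_id, rows) must return a list of story dicts.
    """
    last_id = 0
    while True:
        stories = fetch_page(last_id, rows)
        if not stories:
            return
        yield from stories
        # Continue after the last story of this page.
        last_id = stories[-1]["processed_stories_id"]


# Stand-in for the HTTP request, so the sketch runs without network access.
FAKE_STORIES = [{"processed_stories_id": i} for i in range(1, 6)]


def fake_fetch(last_id, rows):
    newer = [s for s in FAKE_STORIES if s["processed_stories_id"] > last_id]
    return newer[:rows]


print([s["processed_stories_id"] for s in iter_stories(fake_fetch, rows=2)])
```

In real use, fetch_page would wrap a request to stories_public/list with the desired q and fq parameters.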
Currently, the best way to do this is to create a CSV file with all media sources as shown in the earlier example.
Once you have this CSV file, manually search for The New York Times. You should find an entry for The New York Times at the top of the file with media_id=1.
We can obtain the desired stories by repeatedly querying api/v2/stories_public/list using the q parameter to restrict media_id to 1 and the fq parameter to restrict by date range. We repeatedly change the last_processed_stories_id parameter to obtain all stories.
This is shown in the Python code below, where process_stories is a user-provided function to process this data.
import requests

# MY_KEY is assumed to hold your API key.
start = 0
rows = 100
while True:
    params = {
        'last_processed_stories_id': start,
        'rows': rows,
        'q': 'media_id:1',
        # October 2012
        'fq': 'publish_date:[2012-10-01T00:00:00Z TO 2012-11-01T00:00:00Z]',
        'key': MY_KEY
    }
    print("Fetching {} stories starting from {}".format(rows, start))
    r = requests.get('https://api.mediacloud.org/api/v2/stories_public/list/', params=params, headers={'Accept': 'application/json'})
    stories = r.json()
    if len(stories) == 0:
        break
    start = stories[-1]['processed_stories_id']
    process_stories(stories)
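Writing the publish_date range by hand is easy to get wrong, so it can help to generate it. The following sketch builds a one-month fq value in the same form as the example; the function name `month_fq` is hypothetical.

```python
from datetime import date


def month_fq(year, month):
    """Build a publish_date fq value spanning one calendar month."""
    start = date(year, month, 1)
    # The range ends at the first day of the following month.
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return "publish_date:[{}T00:00:00Z TO {}T00:00:00Z]".format(
        start.isoformat(), end.isoformat())


print(month_fq(2012, 10))
```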
Get word counts for top words for sentences matching 'trayvon' in US Mainstream Media during April 2012
As above, find the tags_id of the US Mainstream Media collection (8875027).
One way to appropriately restrict the data is by setting the q parameter to restrict by sentence content and then the fq parameter twice to restrict by tags_id_media and publish_date.
Below q is set to "text:trayvon" and fq is set to "tags_id_media:8875027" and "publish_date:[2012-04-01T00:00:00.000Z TO 2012-05-01T00:00:00.000Z]". (Note that "[" and "]" are URL encoded and spaces are encoded as "+".)
curl 'https://api.mediacloud.org/api/v2/wc?q=text:trayvon&fq=tags_id_media:8875027&fq=publish_date:%5B2012-04-01T00:00:00.000Z+TO+2012-05-01T00:00:00.000Z%5D'
Alternatively, we could use a single large query by setting q to "text:trayvon AND tags_id_media:8875027 AND publish_date:[2012-04-01T00:00:00.000Z TO 2012-05-01T00:00:00.000Z]":
curl 'https://api.mediacloud.org/api/v2/wc?q=text:trayvon+AND+tags_id_media:8875027+AND+publish_date:%5B2012-04-01T00:00:00.000Z+TO+2012-05-01T00:00:00.000Z%5D'
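The same request URLs can be produced in Python rather than encoded by hand. This is a sketch using only the standard library; the helper name `wc_url` is hypothetical, and colons are deliberately left unencoded so the output has the same form as the curl examples (the key parameter is again omitted).

```python
from urllib.parse import urlencode


def wc_url(q, fqs=()):
    """Build a wc URL with one q parameter and any number of fq parameters."""
    query = [("q", q)] + [("fq", fq) for fq in fqs]
    # safe=":" keeps colons readable; brackets and spaces are still encoded.
    return "https://api.mediacloud.org/api/v2/wc?" + urlencode(query, safe=":")


april_2012 = "publish_date:[2012-04-01T00:00:00.000Z TO 2012-05-01T00:00:00.000Z]"
print(wc_url("text:trayvon", ["tags_id_media:8875027", april_2012]))
```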
The user requests a list of all tag sets.
curl https://api.mediacloud.org/api/v2/tag_sets/list
[
{
"tag_sets_id": 597,
"name": "gv_country"
},
{
"tag_sets_id": 800,
"name": "ts"
}
]
(Additional tag sets skipped for brevity.)
Looking through the output, the user sees that the tag_sets_id is 800.
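Instead of reading through the output by eye, the lookup can be done in code once the response has been decoded. A minimal sketch, using the two tag sets shown above as sample data (the function name `find_tag_sets_id` is hypothetical):

```python
def find_tag_sets_id(tag_sets, name):
    """Return the tag_sets_id of the tag set called name, or None if absent."""
    for tag_set in tag_sets:
        if tag_set["name"] == name:
            return tag_set["tag_sets_id"]
    return None


# The decoded tag_sets/list response from the example above.
tag_sets = [
    {"tag_sets_id": 597, "name": "gv_country"},
    {"tag_sets_id": 800, "name": "ts"},
]

print(find_tag_sets_id(tag_sets, "ts"))
```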
The following Python function shows how to find a tags_id given a tag_sets_id:
import requests

def find_tags_id(tag_name, tag_sets_id):
    """Return the tags_id of tag_name within tag_sets_id, or -1 if not found."""
    last_tags_id = 0
    rows = 100
    while True:
        # tag_sets_id is passed as a query parameter of tags/list.
        params = {'last_tags_id': last_tags_id, 'tag_sets_id': tag_sets_id, 'rows': rows, 'key': MY_KEY}
        print("last_tags_id:{} rows:{}".format(last_tags_id, rows))
        r = requests.get('https://api.mediacloud.org/api/v2/tags/list', params=params, headers={'Accept': 'application/json'})
        tags = r.json()
        if len(tags) == 0:
            break
        for tag in tags:
            if tag['tag'] == tag_name:
                return tag['tags_id']
            # Track the largest tags_id seen so the next page starts after it.
            last_tags_id = max(tag['tags_id'], last_tags_id)
    return -1
Assume that the user determined that the tags_id was 12345678 using the above code. The following will return the word count for all sentences in stories belonging to any media source associated with tag 12345678.
curl 'https://api.mediacloud.org/api/v2/wc?q=tags_id_media:12345678'
To find the tag_sets_id and then the tags_id, see the "Get Word Counts for Top Words for Sentences with the Tag 'odd' in tag_set = 'ts'" example above.
We assume the tags_id is 678910.
import requests

# MY_KEY is assumed to hold your API key.
start = 0
rows = 100
while True:
    params = {
        'last_processed_stories_id': start,
        'rows': rows,
        'q': 'tags_id_stories:678910',
        # Restrict to stories published on 10 January 2014.
        'fq': 'publish_date:[2014-01-10T00:00:00Z TO 2014-01-11T00:00:00Z]',
        'key': MY_KEY
    }
    print("Fetching {} stories starting from {}".format(rows, start))
    r = requests.get('https://api.mediacloud.org/api/v2/stories_public/list/', params=params, headers={'Accept': 'application/json'})
    stories = r.json()
    if len(stories) == 0:
        break
    start = stories[-1]['processed_stories_id']
    process_stories(stories)