## Introduction

This notebook uses the [Rogue Scholar API](https://api.rogue-scholar.org/posts) to find Rogue Scholar blog posts about the Retraction Watch project. [Retraction Watch](https://retractionwatch.com/) reports on retractions of scientific papers. The project was started in 2010 by Ivan Oransky and Adam Marcus.

:::{.callout-note}
* We use the query `retraction watch`.
* We limit results to posts published since `2010` (the year Retraction Watch launched) and written in English (language `en`).
* We retrieve the title, authors, publication date (`published_at`), abstract (`summary`), blog name (`blog_name`), `blog_slug`, and `doi`.
* We sort the results in reverse chronological order (newest first).
:::
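The settings listed in the note map onto query-string parameters of the `posts` endpoint. The following is a minimal sketch of how such a URL could be assembled, assuming the parameter names used later in this notebook (`query`, `published_since`, `language`, `sort`, `order`, `per_page`, `include_fields`). The helper `build_posts_url` is a hypothetical name introduced for illustration, and the sketch only builds the URL without calling the API.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.rogue-scholar.org/"


def build_posts_url(query, published_since="2010", language="en", per_page=50):
    """Build a search URL for the posts endpoint (hypothetical helper)."""
    params = {
        "query": query,
        "published_since": published_since,
        "language": language,
        "sort": "published_at",  # sort by publication date ...
        "order": "desc",         # ... newest first
        "per_page": per_page,
        "include_fields": "title,authors,published_at,summary,blog_name,blog_slug,doi",
    }
    # urlencode turns spaces into "+"; safe="," keeps the field list readable
    return BASE_URL + "posts?" + urlencode(params, safe=",")


print(build_posts_url("retraction watch"))
```

Encoding the parameters from a dictionary avoids hand-escaping the space in the query and keeps each filter on its own line.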

## Results

The query `retraction watch` matched 22 of 10560 total blog posts. After manual curation, 12 posts remained:

```{mermaid}
flowchart LR
  A[10560] -- Query: retraction watch --> B(22)
  B -- Manual curation --> C(12)
```
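The manual curation step in the flowchart amounts to dropping search hits by their position in the result list. The sketch below illustrates this with made-up data. It assumes, as the notebook's code does, that each hit wraps the post in a `document` key; the function name `curate` and the sample titles are hypothetical.

```python
def curate(hits, excluded_indexes):
    """Return the post documents whose position is not in excluded_indexes."""
    excluded = set(excluded_indexes)
    return [hit["document"] for i, hit in enumerate(hits) if i not in excluded]


# Made-up hits standing in for an API response
hits = [{"document": {"title": f"Post {i}"}} for i in range(5)]
kept = curate(hits, [1, 3])
print([post["title"] for post in kept])
```

Filtering by position is fragile: the indexes shift whenever a new post matches the query, so the exclusion list has to be reviewed each time the notebook is rerun.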

#| label: query-retraction-watch

import requests
import locale
import json
import pydash as py_
import re
import html
from typing import Optional
import datetime
from IPython.display import Markdown
locale.setlocale(locale.LC_ALL, "en_US")
baseUrl = "https://api.rogue-scholar.org/"
query = "retraction watch"
published_since = "2010"
feature_image = 0
curated = [1,2,3,9,12,16]

include_fields = "title,authors,published_at,summary,blog_name,blog_slug,doi,url,image"
url = baseUrl + f"posts?query={query.replace(' ', '+')}&published_since={published_since}&language=en&sort=published_at&order=desc&per_page=50&include_fields={include_fields}"
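response = requests.get(url, timeout=30)
# Fail early on HTTP errors; a non-JSON error body would otherwise raise JSONDecodeError
response.raise_for_status()
result = response.json()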

def get_post(post):
    return post["document"]

def format_post(post):
    doi = post.get("doi", None)
    url = f"[{doi}]({doi})\n<br />" if doi else ""
    title = f"[{post['title']}]({doi})" if doi else f"[{post['title']}]({post['url']})"
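    # utcfromtimestamp() is deprecated since Python 3.12; use a timezone-aware UTC datetime
    published_at = datetime.datetime.fromtimestamp(post["published_at"], tz=datetime.timezone.utc).strftime("%B %-d, %Y")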
    blog = f"[{post['blog_name']}](https://rogue-scholar.org/blogs/{post['blog_slug']})"
    author = ", ".join([ f"{x['name']}" for x in post.get("authors", None) or [] ])
    summary = post["summary"]
    return f"### {title}\n{url}Published {published_at} in {blog}<br />{author}<br /><br />{summary}\n"

posts = [ get_post(x) for i, x in enumerate(result["hits"]) if i not in curated]
posts_as_string = "\n".join([ format_post(x) for x in posts])

def doi_from_url(url: str) -> Optional[str]:
    """Return a DOI from a URL"""
    match = re.search(
        r"\A(?:(http|https)://(dx\.)?(doi\.org|handle\.stage\.datacite\.org|handle\.test\.datacite\.org)/)?(doi:)?(10\.\d{4,5}/.+)\Z",
        url,
    )
    if match is None:
        return None
    return match.group(5).lower()

# Get csl-formatted metadata for all posts that have a DOI
def get_csl(post):
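    doi = doi_from_url(post["doi"])
    if doi is None:
        raise ValueError(f"Not a valid DOI: {post['doi']}")
    res = requests.get(baseUrl + "posts/" + doi + "?format=csl", timeout=30)
    # Fail early on HTTP errors instead of trying to parse a non-JSON body
    res.raise_for_status()
    csl = res.json()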
    
    # remove keys not compatible with CSL spec
    csl = py_.omit(csl, ["license", "original-title"])
    csl["id"] = post["doi"]
    csl["title"] = html.unescape(post["title"])
    csl["container-title"] = post["blog_name"]
    csl["type"] = "article"
    return json.dumps(csl, indent=2)

csl_list = "[\n" + ",\n".join([ get_csl(x) for x in posts if x.get("doi", None) is not None ]) + "\n]"
with open('references.json', 'w') as f:
    f.write(csl_list)

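images = [ x["image"] for x in posts if x.get("image", None) is not None ]
# Only show a feature image if one exists at the requested position
image = images[feature_image] if len(images) > feature_image else None
markdown = f"![]({image})\n\n" if image else ""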
markdown += posts_as_string
Markdown(markdown)

JSONDecodeError: Expecting value: line 1 column 1 (char 0)
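The `JSONDecodeError` above, with the message `Expecting value: line 1 column 1 (char 0)`, is what Python's JSON parser raises when the text it is given is empty or does not start with valid JSON, for example an HTML error page. The output does not show which request in the notebook returned such a body. As a minimal sketch, a response body could be parsed defensively like this; `parse_json_body` is a hypothetical helper, not part of the notebook or of any library.

```python
import json


def parse_json_body(text):
    """Return the parsed JSON value, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


print(parse_json_body('{"hits": []}'))
print(parse_json_body(""))
```

A caller can then check for `None` and report the failing URL instead of stopping with a traceback.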

## References

::: {#refs}
:::