## PCOR Data
List of PCOR and PCORI PMIDs from Lauren Maggio. The original spreadsheets contained:

 - "PCORI": PMIDs for PCORI-funded research that has a PMID (n=1,390).
 - "PCOR": PMIDs for research that matches the search hedge for patient-centered outcomes research (n=11,991).

Sorting and deduplicating these lists reduced them to 11,559 unique PMIDs for PCOR and 1,387 for PCORI.
The deduplicated lists were saved in Hive as `pcor_pmids` and `pcori_pmids`.
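
The sort-and-deduplicate step can be sketched as follows. This is a minimal illustration, not the procedure actually used on the spreadsheets: the helper name `dedupe_pmids`, the whitespace handling, and the sample values are all assumptions.

```python
def dedupe_pmids(raw_pmids):
    """Return unique PMIDs sorted numerically (hypothetical helper).

    Assumes PMIDs arrive as strings; blanks are dropped and surrounding
    whitespace is stripped before comparing.
    """
    cleaned = {p.strip() for p in raw_pmids if p and p.strip()}
    return sorted(cleaned, key=int)


# Illustrative input with one duplicate and one blank cell
sample = ["24825528", "12150500", " 24825528", "", "1870651"]
unique = dedupe_pmids(sample)
print(len(sample), "raw values ->", len(unique), "unique PMIDs")
```

Sorting numerically rather than lexically keeps shorter PMIDs such as `1870651` ahead of eight-digit ones.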

A third Hive table, `pmids_in_w`, holds the PMID links found in W pages. It was populated on 2019-05-20 with the following SQL query:
```sql
SELECT
  DISTINCT el_from AS page_id,
  REGEXP_REPLACE(el_to, '.+?(=|/)(\\d+).*', '\\2') AS pmid,
  now() AS dt
FROM
  externallinks
WHERE
  el_from IN (
    SELECT
      page_id
    FROM
      page
    WHERE
      page_namespace = 0
  )
  AND (
    lower(el_to) REGEXP '.*ncbi.*(=|/)pubmed'
    OR lower(el_to) REGEXP 'pubmed\.gov'
  )
  AND el_to REGEXP '(=|/)\\d{4,}'
```
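
To make the URL filtering and PMID extraction easier to follow, here is a rough Python equivalent of the regular expressions in the query. It is a sketch only: the function name `extract_pmid` and the example URLs are assumptions, and it uses `re.search` on the understanding that Hive's `REGEXP` matches anywhere in the string.

```python
import re

# Patterns mirror the SQL filters in the query
NCBI_PUBMED = re.compile(r".*ncbi.*(=|/)pubmed")
PUBMED_GOV = re.compile(r"pubmed\.gov")
HAS_ID = re.compile(r"(=|/)\d{4,}")
# Keep the first run of digits that follows "=" or "/"
EXTRACT = re.compile(r".+?(=|/)(\d+).*")


def extract_pmid(url):
    """Return the PMID embedded in a PubMed URL, or None (hypothetical helper)."""
    lowered = url.lower()
    if not (NCBI_PUBMED.search(lowered) or PUBMED_GOV.search(lowered)):
        return None  # not a PubMed link
    if not HAS_ID.search(url):
        return None  # no ID of four or more digits
    return EXTRACT.sub(r"\2", url)


# Illustrative URLs, not taken from the externallinks table
for url in [
    "https://www.ncbi.nlm.nih.gov/pubmed/24825528",
    "https://www.ncbi.nlm.nih.gov/pubmed?term=12150500",
    "http://pubmed.gov/1870651",
    "https://example.org/articles/12345",
]:
    print(url, "->", extract_pmid(url))
```

Because the extraction pattern is lazy, it picks up the first digit run after a `=` or `/`, so a URL with an earlier numeric path segment would yield that segment instead of the PMID.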

In [1]:
# basic setup
# use PySpark YARN kernel
import pyspark
import re
import pyspark.sql
from pyspark.sql import *
import pandas as pd
import matplotlib.pyplot as plt
import hashlib
import os.path
from pyspark.sql.functions import desc
from datetime import timedelta, date

%matplotlib inline
spark_hive = pyspark.sql.HiveContext(sc)

In [3]:
# count of distinct PCOR PMIDs found in W pages

pcor_query = """
select
  distinct pmid
from
  ryanmax.pmids_in_w
where
  pmids_in_w.pmid in (
    select
      distinct pmid
    from
      ryanmax.pcor_pmids
  )
"""

pcor_pmids = spark.sql(pcor_query)
print('# distinct PCOR PMIDs found in W: ',pcor_pmids.count())
pcor_pmids.show()

# distinct PCOR PMIDs found in W:  139
+--------+
|    pmid|
+--------+
|24825528|
|12150500|
| 1870651|
|21317246|
|11871850|
|24154835|
|29845606|
|24029581|
|23849146|
|16097911|
|23633698|
|16719549|
|24491310|
|15911845|
|10208075|
|21972305|
|12137631|
|21975733|
|25658275|
|29062538|
+--------+
only showing top 20 rows



In [4]:
# count of distinct (page_id, PMID) pairs for PCOR PMIDs found in W pages

pcor_query = """
select
  distinct page_id, pmid
from
  ryanmax.pmids_in_w
where
  pmids_in_w.pmid in (
    select
      distinct pmid
    from
      ryanmax.pcor_pmids
  )
"""

pcor_pmids = spark.sql(pcor_query)
print('# distinct page/PCOR PMID found in W: ',pcor_pmids.count())
pcor_pmids.show()

# distinct page/PCOR PMID found in W:  161
+--------+--------+
| page_id|    pmid|
+--------+--------+
| 2601911|21937524|
|28858211|20961244|
|10360168|29512110|
|60367555|29096653|
|  579403|27601495|
|  198725|10490440|
| 1567332| 3830932|
|23453324|20542857|
|23453327|30025154|
| 1414111|11437001|
| 1395311|12813601|
|31617226|12084703|
|45169338|25378444|
| 2155752|26766577|
|   70547|23588749|
|    1805|26560888|
|27399297|29512110|
|48354345|22511692|
|  648828|17719803|
|11960085|21860452|
+--------+--------+
only showing top 20 rows
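
The pair count (161) exceeds the distinct PMID count (139) because a single PMID can be linked from more than one page. A small sketch of that relationship, using four pairs copied from the sample output above (PMID 29512110 appears on two pages):

```python
# (page_id, pmid) pairs copied from the sample rows shown above
pairs = [
    (10360168, 29512110),
    (27399297, 29512110),  # same PMID, different page
    (2601911, 21937524),
    (28858211, 20961244),
]

distinct_pairs = set(pairs)
distinct_pmids = {pmid for _, pmid in distinct_pairs}

print("distinct page/PMID pairs:", len(distinct_pairs))
print("distinct PMIDs:", len(distinct_pmids))
```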



In [5]:
# count of distinct (page_id, PMID) pairs for PCORI PMIDs found in W pages

pcori_query = """
select
  distinct page_id, pmid
from
  ryanmax.pmids_in_w
where
  pmids_in_w.pmid in (
    select
      distinct pmid
    from
      ryanmax.pcori_pmids
  )
"""

pcori_pmids = spark.sql(pcori_query)
print('# distinct page/PCORI PMID found in W: ',pcori_pmids.count())
pcori_pmids.show()

# distinct page/PCORI PMID found in W:  30
+--------+--------+
| page_id|    pmid|
+--------+--------+
|  159010|27623861|
|    1914|29260224|
|56103154|27779803|
|  351581|28154833|
| 1600967|27623861|
|  774446|29623949|
| 1165522|28191449|
|52716604|26020598|
|56103154|27875626|
|29697910|28540344|
|49421690|27925423|
|14673089|29987313|
| 1810614|28235242|
|  613640|29260224|
|  541592|29260224|
| 2245783|29985746|
|   56557|28600913|
|53976473|28245152|
|45314824|25200366|
|21741953|30518517|
+--------+--------+
only showing top 20 rows



In [6]:
# count of distinct PCORI PMIDs found in W pages

pcori_query = """
select
  distinct pmid
from
  ryanmax.pmids_in_w
where
  pmids_in_w.pmid in (
    select
      distinct pmid
    from
      ryanmax.pcori_pmids
  )
"""

pcori_pmids = spark.sql(pcori_query)
print('# distinct PCORI PMIDs found in W: ',pcori_pmids.count())
pcori_pmids.show()

# distinct PCORI PMIDs found in W:  24
+--------+
|    pmid|
+--------+
|30518517|
|28245152|
|28191449|
|28154833|
|27115262|
|26020598|
|30044773|
|28157742|
|29623949|
|29260224|
|28600913|
|27779803|
|28540344|
|25200366|
|28235242|
|27713905|
|27925423|
|29987313|
|27623861|
|25263997|
+--------+
only showing top 20 rows
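
The four queries above differ only in the selected columns and the PMID table. One possible way to avoid repeating them is a small query builder. The helper name `overlap_query` and its parameters are hypothetical, and plain string formatting is only reasonable here because the table and column names are fixed, trusted values.

```python
def overlap_query(pmid_table, columns="pmid", schema="ryanmax"):
    """Build the overlap query for a given PMID table (hypothetical helper).

    pmid_table: "pcor_pmids" or "pcori_pmids"
    columns:    "pmid" for distinct PMIDs, "page_id, pmid" for distinct pairs
    """
    return f"""
select
  distinct {columns}
from
  {schema}.pmids_in_w
where
  pmids_in_w.pmid in (
    select
      distinct pmid
    from
      {schema}.{pmid_table}
  )
"""


# In the notebook this would be passed to spark.sql(...), e.g.
# spark.sql(overlap_query("pcori_pmids", "page_id, pmid")).count()
print(overlap_query("pcori_pmids", "page_id, pmid"))
```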

