## Import data into PySpark
- https://github.com/titipata/pubmed_parser/wiki/Download-and-preprocess-MEDLINE-dataset

- MEDLINE bulk download (quote the URL so the local shell does not try to expand the glob):
  wget 'ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/*.gz'

- MEDLINE updates:
  wget 'ftp://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/*.gz'

- Change Java version: https://kodejava.org/how-do-i-set-the-default-java-jdk-version-on-mac-os-x/

In [17]:
import os
from glob import glob

import findspark
findspark.init()  # locate the Spark installation before importing pyspark

import pandas as pd
import pubmed_parser as pp
from pyspark import SparkConf, SparkContext
from pyspark.sql import Row, SparkSession, SQLContext

In [2]:
# Create the Spark session and the handles derived from it
conf = SparkConf().setMaster("local").setAppName("medline db")
spark = SparkSession.builder \
    .config(conf=conf) \
    .getOrCreate()
sc = spark.sparkContext

sqlContext = SQLContext(sc)

## Import medline files into a spark dataframe
- https://github.com/titipata/pubmed_parser/wiki/Download-and-preprocess-MEDLINE-dataset

In [3]:
medline_files_rdd = sc.parallelize(glob('../data/pubmed/*.gz'))

medline_files_rdd

ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:195

In [4]:
parse_results_rdd = medline_files_rdd.\
    flatMap(lambda x: [Row(file_name=os.path.basename(x), **publication_dict) 
                       for publication_dict in pp.parse_medline_xml(x)])

parse_results_rdd

PythonRDD[1] at RDD at PythonRDD.scala:53

In [5]:
medline_df = parse_results_rdd.toDF()

In [6]:
# Process updates and deletes
from pyspark.sql import Window
# Note: max and sum imported here shadow the Python built-ins of the same name
from pyspark.sql.functions import rank, max, sum, desc

In [7]:
window = Window.partitionBy(['pmid']).orderBy(desc('file_name'))

In [8]:
windowed_df = medline_df.select(
    max('delete').over(window).alias('is_deleted'),
    rank().over(window).alias('pos'),
    '*')

In [9]:
medline_lastview = windowed_df.where('is_deleted = False and pos = 1')

In [10]:
type(medline_lastview)

pyspark.sql.dataframe.DataFrame
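
The window logic above keeps, for each `pmid`, only the row from the most recent file and drops the article if that row is a delete marker. As a rough illustration of the same rule outside Spark, here is a minimal plain-Python sketch. The function name `latest_view` and the sample records (including the file names) are made up for this example, and ties within one file are not handled the way `rank()` handles them.

```python
def latest_view(records):
    """Keep, per pmid, the record with the greatest file_name;
    drop the pmid entirely if that record is flagged as deleted."""
    latest = {}
    for rec in records:
        current = latest.get(rec["pmid"])
        # Later update files sort after earlier ones by name
        if current is None or rec["file_name"] > current["file_name"]:
            latest[rec["pmid"]] = rec
    return [rec for rec in latest.values() if not rec["delete"]]


# Made-up records for illustration only
records = [
    {"pmid": "1", "file_name": "pubmed20n0001.xml.gz", "delete": False, "title": "old"},
    {"pmid": "1", "file_name": "pubmed20n1016.xml.gz", "delete": False, "title": "revised"},
    {"pmid": "2", "file_name": "pubmed20n0001.xml.gz", "delete": False, "title": "kept once"},
    {"pmid": "2", "file_name": "pubmed20n1017.xml.gz", "delete": True, "title": ""},
]
result = latest_view(records)
```

In this sample only the revised version of article 1 survives: article 2 is removed because its most recent record is a delete.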

## Explore db
- https://medium.com/@aieeshashafique/exploratory-data-analysis-using-pyspark-dataframe-in-python-bd55c02a2852

In [11]:
medline_lastview.printSchema()

root
 |-- is_deleted: boolean (nullable = true)
 |-- pos: integer (nullable = true)
 |-- abstract: string (nullable = true)
 |-- affiliations: string (nullable = true)
 |-- authors: string (nullable = true)
 |-- chemical_list: string (nullable = true)
 |-- country: string (nullable = true)
 |-- delete: boolean (nullable = true)
 |-- doi: string (nullable = true)
 |-- file_name: string (nullable = true)
 |-- issn_linking: string (nullable = true)
 |-- journal: string (nullable = true)
 |-- keywords: string (nullable = true)
 |-- medline_ta: string (nullable = true)
 |-- mesh_terms: string (nullable = true)
 |-- nlm_unique_id: string (nullable = true)
 |-- other_id: string (nullable = true)
 |-- pmc: string (nullable = true)
 |-- pmid: string (nullable = true)
 |-- pubdate: string (nullable = true)
 |-- publication_types: string (nullable = true)
 |-- references: string (nullable = true)
 |-- title: string (nullable = true)



In [12]:
# Select columns
columns = ['abstract', 'file_name', 'journal', 'keywords', 'mesh_terms', 'pubdate'] 
df = medline_lastview[columns]


In [14]:
df.printSchema()

root
 |-- abstract: string (nullable = true)
 |-- file_name: string (nullable = true)
 |-- journal: string (nullable = true)
 |-- keywords: string (nullable = true)
 |-- mesh_terms: string (nullable = true)
 |-- pubdate: string (nullable = true)



In [16]:
# df.collect()

In [7]:
# df = df.toPandas()

## Extracting medline citations

In [3]:
import medic

In [5]:
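%%bash
# Shell pipeline: it needs the %%bash cell magic (run as plain Python it raises a SyntaxError).
# URL and QUERY must be defined in this cell or exported in the environment.
wget "$URL?db=PubMed&retmax=99&term=$QUERY" -O - 2> /dev/null \
| grep "^<Id>" \
| sed -E 's|</?Id>||g' \
| cut -f3 \
> pmids.txt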
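
The pipeline above pulls the `<Id>` values out of a search response. The same extraction step can be sketched in Python, assuming the response is XML with `<Id>` elements under an `<IdList>` element (the shape NCBI ESearch returns). The helper `extract_pmids` and the sample response are hypothetical, and no network request is made here.

```python
import xml.etree.ElementTree as ET


def extract_pmids(esearch_xml):
    """Return the text of every <Id> element found under <IdList>."""
    root = ET.fromstring(esearch_xml)
    return [id_elem.text for id_elem in root.findall("./IdList/Id")]


# Hand-written sample in the shape of a search response; the IDs are made up
sample = """<eSearchResult>
  <Count>2</Count>
  <IdList>
    <Id>11111111</Id>
    <Id>22222222</Id>
  </IdList>
</eSearchResult>"""
pmids = extract_pmids(sample)
```

Parsing the XML instead of grepping lines avoids depending on how the response happens to be indented.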

## Calculate TF-IDF
- https://github.com/titipata/pubmed_parser/wiki/Download-and-preprocess-MEDLINE-dataset
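
As a placeholder for this step, here is a minimal plain-Python sketch of TF-IDF over a few abstracts. It is not the Spark implementation: it assumes whitespace tokenization and the unsmoothed weight tf(t, d) * log(N / df(t)), and the function name `tf_idf` and the sample texts are invented for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Made-up abstracts for illustration only
abstracts = ["gene expression in mice", "gene therapy trial"]
weights = tf_idf(abstracts)
```

A term that occurs in every document, such as "gene" here, gets weight zero under this unsmoothed formula, while terms unique to one abstract get positive weights.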

## Classify with fastText
- https://www.futurice.com/blog/classifying-text-with-fasttext-in-pyspark
- https://fasttext.cc/docs/en/supervised-tutorial.html
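
fastText's supervised mode reads one example per line, with each label prefixed by `__label__`. A minimal sketch of turning a record into such a line follows. Using MeSH terms as labels, replacing spaces in labels with underscores, and the helper name `to_fasttext_line` are all assumptions made for this example, not something the notebook prescribes.

```python
def to_fasttext_line(labels, text):
    """Format one training example as '__label__A __label__B some text'."""
    # Labels must not contain spaces, so join multi-word labels with underscores
    prefix = " ".join(
        "__label__" + label.strip().replace(" ", "_") for label in labels
    )
    # Collapse newlines and repeated whitespace so the example stays on one line
    cleaned = " ".join(text.split())
    return f"{prefix} {cleaned}"


# Made-up labels and abstract text for illustration only
line = to_fasttext_line(
    ["Neoplasms", "Gene Expression"],
    "Gene expression\nin tumour cells.",
)
```

Lines produced this way can be written to a text file and used as the training input described in the tutorial linked above.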

## Topic modeling
- https://nbviewer.jupyter.org/github/chambliss/Notebooks/blob/master/Word2Vec_News_Analysis.ipynb