## How to run this notebook

1. Download the docker image: `docker pull jupyter/pyspark-notebook`

2. Make sure you have the input data, `umls_cui_in_titles.txt`, which contains the UMLS IDs for each title and is produced by running `get_ids_from_abs.py`.
 
3. Start the PySpark Jupyter notebook by running the Docker container with the data directory mounted as a volume:
   - `docker run -it -p 8888:8888 -v /Users/slin/covid_nlp/title_result:/mnt/result jupyter/pyspark-notebook`

4. Go to `http://localhost:8888` in a browser. It will ask for a token and a password. The token is printed in the console running the notebook; the password can be anything.

5. Import this file into the Docker container.

See more instructions here: https://levelup.gitconnected.com/using-docker-and-pyspark-134cd4cab867

In [26]:
import os
import math
os.listdir('/mnt/result') 

['100k_200k',
 '1_100k',
 '200k_345k',
 'concept_counts',
 'concept_counts_filtered1',
 'concept_counts_filtered_abstracts',
 'concept_map',
 'count',
 'count_sorted',
 'umls_cui_in_abstracts.txt',
 'umls_cui_in_titles.txt']

In [2]:
import pandas as pd 
import numpy
import matplotlib.pyplot as plt 
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf

In [3]:
sc = SparkContext("local","Find number of occurrences of concepts")

### Step 1  - Get the UMLS counts in ALL titles

In [4]:
# each line of the input file lists the UMLS CUIs present in one title; there are ~138k titles
words = sc.textFile("/mnt/result/umls_cui_in_titles.txt").flatMap(lambda line: line.split(","))

In [5]:
words.take(5)

['C3714514', 'C0948075', 'C2242472', 'C0009450', 'C0699744']

In [6]:
wordCounts = words.map(lambda word: (word, 1)).reduceByKey(lambda a,b:a +b)

In [7]:
wordCounts.take(1)

[('C3714514', 12513)]

In [8]:
wordCounts

PythonRDD[8] at RDD at PythonRDD.scala:53

In [9]:
# sort the result just to get an idea
counts_sorted = wordCounts.sortBy(lambda item: item[1], ascending=False)
# counts_sorted.saveAsTextFile("/mnt/result/count_sorted")

In [10]:
counts_sorted.take(10)

[('C0009450', 13686),
 ('C0042769', 13094),
 ('C3714514', 12513),
 ('C0206419', 12111),
 ('C0948075', 11369),
 ('C0010078', 11191),
 ('C0206423', 10145),
 ('C1550587', 9993),
 ('C1556682', 9782),
 ('C1175743', 7511)]

These are the ten most frequent CUIs in the titles. C0009450 means "communicable diseases" and C0042769 means "virus disease", which makes sense.

### Step 2 - make a map of the concept name to concept ids

In [11]:
concept_maps = sc.textFile("/mnt/result/concept_map").map(lambda line: line.split(","))

In [12]:
concept_maps.take(1)

[['22274', 'C0027651', 'C2981607', 'C1882062', 'C1368871', 'C0026640']]

In this file, each concept ID is followed by its related CUIs (strings that start with "C"). We need to use the CUI counts (`wordCounts`) to obtain a count for each concept ID.
The relationship is not 1-to-1: a CUI can belong to several concepts. The best data structure I can think of is a map from each CUI to the list of concept IDs that contain it, plus a second map that keeps the count for each concept.
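
As a minimal sketch of that CUI-to-concepts map, here is the same idea in plain Python. The rows below are made-up toy values in the shape of the `concept_map` file (concept ID first, then its CUIs), and `cui_to_concepts` is a name introduced only for this illustration; the notebook itself builds the equivalent pairs with Spark in the following cells.

```python
from collections import defaultdict

# Hypothetical toy rows, not taken from the real concept_map file:
# the first field is the concept ID, the remaining fields are its CUIs.
rows = [
    ["101", "C0000001", "C0000002"],
    ["102", "C0000002", "C0000003"],
]

# Map each CUI to every concept ID that lists it.
cui_to_concepts = defaultdict(list)
for concept_id, *cuis in rows:
    for cui in cuis:
        cui_to_concepts[cui].append(concept_id)

print(dict(cui_to_concepts))
```

A CUI shared by two concepts, like `C0000002` here, ends up with both concept IDs in its list, which is the many-to-many case described above.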

In [13]:
# convert from ['22274', 'C0027651', 'C2981607', 'C1882062', 'C1368871', 'C0026640'] to
# ('22274', ['C0027651', 'C2981607', ...]), i.e. (concept id, list of CUIs)

concept_maps = concept_maps.map(lambda line: (line[0], line[1:]))
    

In [14]:
concept_maps.take(1)

[('22274', ['C0027651', 'C2981607', 'C1882062', 'C1368871', 'C0026640'])]

In [25]:
concept_maps.count()

10448

In [15]:
def convert_to_tuple_list(entry):
    """Expand (concept_id, [cui, ...]) into a list of (cui, concept_id) pairs."""
    concept_id, cuis = entry
    return [(cui, concept_id) for cui in cuis]

ulms_concept_rdd = concept_maps.flatMap(convert_to_tuple_list)

In [16]:
ulms_concept_rdd.take(10)

[('C0027651', '22274'),
 ('C2981607', '22274'),
 ('C1882062', '22274'),
 ('C1368871', '22274'),
 ('C0026640', '22274'),
 ('C0002895', '22281'),
 ('C2699300', '22281'),
 ('C1260595', '22281'),
 ('C0750151', '22281'),
 ('C3273373', '22281')]

## Inspect UMLS appearing in multiple concept names
For example, many concept names contain the word "infection" for different body parts and causes.


In [28]:
# First count the occurrence of each UMLS term in the concept names and make a new map
umls_count = ulms_concept_rdd.map(lambda x: (x[0], 1)).reduceByKey(lambda a,b:a +b).sortBy(lambda item: item[1], ascending=False)

In [29]:
umls_count.take(10)  # some CUIs appear in more than 600 concepts

[('C0016658', 618),
 ('C1963113', 617),
 ('C1880851', 617),
 ('C0016662', 616),
 ('C1160964', 613),
 ('C1306459', 244),
 ('C0560267', 174),
 ('C0024620', 153),
 ('C3263723', 125),
 ('C0027651', 117)]

In [78]:
umls_count.count()

11054

In [80]:
# see the distribution of the counts across 20 equal-width buckets
umls_count.map(lambda x: x[1]).histogram(20)

([1.0,
  31.85,
  62.7,
  93.55000000000001,
  124.4,
  155.25,
  186.10000000000002,
  216.95000000000002,
  247.8,
  278.65000000000003,
  309.5,
  340.35,
  371.20000000000005,
  402.05,
  432.90000000000003,
  463.75,
  494.6,
  525.45,
  556.3000000000001,
  587.15,
  618],
 [10958, 70, 14, 3, 2, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5])
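
To read this output, it may help to reproduce the bucket edges by hand. The sketch below assumes the buckets are evenly spaced between the minimum (1) and maximum (618) count, which is how `histogram` behaves when given a bucket count; the variable names are introduced only for this illustration.

```python
# Min and max of the per-CUI concept counts, taken from the output above.
lo, hi = 1, 618
n_buckets = 20

# Equal-width buckets between the min and the max.
width = (hi - lo) / n_buckets
edges = [lo + i * width for i in range(n_buckets + 1)]

# Bucket counts copied from the histogram output above.
counts = [10958, 70, 14, 3, 2, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5]

# Share of CUIs that fall in the first bucket (fewer than ~32 concepts).
share_first = counts[0] / sum(counts)
print(width, sum(counts), round(share_first, 4))
```

The bucket counts sum to the 11054 distinct CUIs reported by `umls_count.count()`, and about 99% of them sit in the first bucket, so only a handful of CUIs are shared by many concepts.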

# Use TF-IDF

Calculate the IDF as log(N/nt), where N is the total number of concepts and nt is the number of concepts the UMLS ID appears in.
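
As a quick worked example of this formula, the sketch below plugs in numbers already shown in this notebook: 10448 concepts in total, and 618 concepts for C0016658, the most widely shared CUI. It assumes the natural logarithm, which is what `math.log` computes by default.

```python
import math

n_concepts = 10448  # concept_maps.count() in this notebook
n_t = 618           # number of concepts that contain C0016658

# IDF of the most widely shared CUI: small, because it is not very specific.
idf_common = math.log(n_concepts / n_t)

# IDF of a CUI that appears in exactly one concept: the largest possible value.
idf_rare = math.log(n_concepts / 1)

print(idf_common, idf_rare)
```

The widely shared CUI gets an IDF of about 2.83, while a CUI unique to one concept gets about 9.25, so rare CUIs weigh roughly three times more in a concept's score.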

In [79]:
joined_result = ulms_concept_rdd.join(wordCounts)

In [71]:
joined_result.take(5)  # (CUI, (concept id, count of the CUI in titles)); the count is the raw TF

[('C2981607', ('22274', 6)),
 ('C2981607', ('24602', 6)),
 ('C2981607', ('27516', 6)),
 ('C2981607', ('27835', 6)),
 ('C2981607', ('30061', 6))]

In [90]:
# Calculate IDF 
n_concepts = concept_maps.count()
umls_idf = umls_count.map(lambda x: (x[0], math.log(n_concepts/x[1])))

In [91]:
umls_idf.take(5)

[('C0016658', 2.8276773940585214),
 ('C1963113', 2.8292968276108246),
 ('C1880851', 2.8292968276108246),
 ('C0016662', 2.8309188879826928),
 ('C1160964', 2.835800915580001)]

In [92]:
# create a dictionary for easy lookup
idf_dict = dict(umls_idf.collect())

In [93]:
idf_dict['C0016658']

2.8276773940585214

In [98]:
# Multiply log(tf) by idf; a CUI that occurs only once contributes log(1) = 0
joined_result_scaled = joined_result.map(lambda x: ( x[1][0] , math.log(x[1][1]) * idf_dict[x[0]]))
joined_result_scaled.take(2)

[('22274', 8.538682574872077), ('24602', 8.538682574872077)]

In [101]:
# sum the log(tf) * idf scores of all CUIs that belong to each concept ID
joined_result_sum = joined_result_scaled.groupByKey().mapValues(sum)
joined_result_sum = joined_result_sum.sortBy(lambda item: item[1], ascending=False)
joined_result_sum.take(10)

[('19131544', 741.7130033963231),
 ('40233612', 707.289657870309),
 ('44783628', 563.7453826287107),
 ('19131481', 546.8678465408319),
 ('40173507', 507.69418833449095),
 ('4058695', 481.7558123104858),
 ('42800463', 471.9893986668932),
 ('4331309', 465.53202097814795),
 ('4098617', 438.70593407153245),
 ('37119138', 426.7260232871088)]
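
To sanity-check the Spark steps above on a small input, the same scoring can be sketched in plain Python. Everything in this block is an assumption for illustration: the toy `concept_map` and `titles` values are made up, and the variable names are not part of the notebook.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy inputs (not taken from the real files).
concept_map = {"101": ["C0000001", "C0000002"], "102": ["C0000002"]}
titles = [["C0000001", "C0000002"], ["C0000001"], ["C0000002"], ["C0000002"]]

# TF: raw count of each CUI over all titles (the role of wordCounts).
tf = Counter(cui for title in titles for cui in title)

# IDF: log(N / n_t), where n_t is the number of concepts listing the CUI.
n_concepts = len(concept_map)
n_t = Counter(cui for cuis in concept_map.values() for cui in cuis)
idf = {cui: math.log(n_concepts / n) for cui, n in n_t.items()}

# Score per concept: sum of log(tf) * idf over its CUIs found in the titles.
scores = defaultdict(float)
for concept_id, cuis in concept_map.items():
    for cui in cuis:
        if cui in tf:  # mirrors the inner join with wordCounts
            scores[concept_id] += math.log(tf[cui]) * idf[cui]

print(dict(scores))
```

In this toy data, concept `102` scores exactly zero because its only CUI appears in every concept (IDF of 0). Entries like that are the ones dropped by the zero-count filter a few cells below.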

In [102]:
joined_result_sum.count()  # number of potential features; we will probably only use the top few

9006

In [104]:
# get rid of zero counts
joined_result_sum = joined_result_sum.filter(lambda x: x[1] > 0)

In [105]:
# save to a file to be used later in our model training
joined_result_sum.saveAsTextFile("/mnt/result/concept_counts_title_tfidf")

# Repeat all for abstracts

In [108]:
words = sc.textFile("/mnt/result/umls_cui_in_abstracts.txt").flatMap(lambda line: line.split(","))
wordCounts = words.map(lambda word: (word, 1)).reduceByKey(lambda a,b:a +b)
joined_result = ulms_concept_rdd.join(wordCounts)
joined_result_scaled = joined_result.map(lambda x: ( x[1][0] , math.log(x[1][1]) * idf_dict[x[0]]))
joined_result_sum = joined_result_scaled.groupByKey().mapValues(sum)
joined_result_sum = joined_result_sum.sortBy(lambda item: item[1], ascending=False)
joined_result_sum.take(10)

[('19131544', 1121.4249107333112),
 ('40233612', 1107.4251094679921),
 ('19131481', 879.0779182056724),
 ('44783628', 820.0347352556407),
 ('4098740', 768.105817450619),
 ('37119138', 728.6940575475434),
 ('3018994', 726.5751710216754),
 ('3022250', 726.5751710216754),
 ('40173507', 700.4274869799882),
 ('42800463', 675.430511191428)]

In [109]:
joined_result_sum = joined_result_sum.filter(lambda x: x[1] > 0)
joined_result_sum.saveAsTextFile("/mnt/result/concept_counts_abstracts_tfidf")