## How to run this notebook

1. Download the docker image: `docker pull jupyter/pyspark-notebook`

2. Make sure you have the input data, `umls_cui_in_titles.txt`, which contains the UMLS IDs for each title and is produced by running `get_ids_from_abs.py`.
 
3. Start the PySpark Jupyter notebook by running the Docker container and mounting the directory that holds the data as a volume:
   - `docker run -it -p 8888:8888 -v /Users/slin/covid_nlp/title_result:/mnt/result jupyter/pyspark-notebook`

4. Go to `http://localhost:8888` in a browser. It will ask for a token and a password. The token is printed in the console running the notebook; the password can be anything.

5. Import this notebook into the Docker container.

See more instructions here: https://levelup.gitconnected.com/using-docker-and-pyspark-134cd4cab867

In [1]:
import os
os.listdir('/mnt/result') 

['100k_200k',
 '1_100k',
 '200k_345k',
 'concept_counts',
 'concept_map',
 'count',
 'count_sorted',
 'umls_cui_in_titles.txt']

In [2]:
import pandas as pd 
import numpy
import matplotlib.pyplot as plt 
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf

In [3]:
sc = SparkContext("local","Find number of occurrences of concepts")

### Step 1 - Get the UMLS counts in ALL titles

In [4]:
# each line of umls_cui_in_titles.txt lists the UMLS CUIs present in one title; there are ~138k titles
words = sc.textFile("/mnt/result/umls_cui_in_titles.txt").flatMap(lambda line: line.split(","))

In [5]:
words.take(5)

['C3714514', 'C0948075', 'C2242472', 'C0009450', 'C0699744']

In [6]:
wordCounts = words.map(lambda word: (word, 1)).reduceByKey(lambda a,b:a +b)

In [7]:
wordCounts.take(1)

[('C3714514', 12513)]
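
The `flatMap` / `map` / `reduceByKey` chain above is the usual word-count pattern. As a rough sketch of what it computes, here is the same thing in plain Python on a tiny made-up sample (the three lines below are invented for illustration and are not taken from `umls_cui_in_titles.txt`):

```python
from collections import Counter

# Hypothetical sample: each string mimics one line of the input file,
# i.e. the comma-separated CUIs found in one title.
sample_lines = [
    "C3714514,C0948075",
    "C3714514,C0009450",
    "C0009450",
]

# Splitting every line and counting each CUI mirrors
# flatMap(split) -> map((cui, 1)) -> reduceByKey(add).
sample_counts = Counter(cui for line in sample_lines for cui in line.split(","))

print(sample_counts.most_common())
```

Spark does the same counting, just partitioned across workers, which is why the result is an RDD of `(CUI, count)` pairs rather than a dictionary.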

In [153]:
wordCounts

PythonRDD[284] at RDD at PythonRDD.scala:53

In [8]:
# sort the result just to get an idea
counts_sorted = wordCounts.sortBy(lambda item: item[1], ascending=False)
# counts_sorted.saveAsTextFile("/mnt/result/count_sorted")

In [9]:
counts_sorted.take(10)

[('C0009450', 13686),
 ('C0042769', 13094),
 ('C3714514', 12513),
 ('C0206419', 12111),
 ('C0948075', 11369),
 ('C0010078', 11191),
 ('C0206423', 10145),
 ('C1550587', 9993),
 ('C1556682', 9782),
 ('C1175743', 7511)]

These are the ten most frequent CUIs in the file. C0009450 means "communicable diseases" and C0042769 means "virus disease", which makes sense.

### Step 2 - Make a map of the concept name to concept IDs

In [53]:
concept_maps = sc.textFile("/mnt/result/concept_map").map(lambda line: line.split(","))

In [54]:
concept_maps.take(1)

[['22274', 'C0027651', 'C2981607', 'C1882062', 'C1368871', 'C0026640']]

In this file, each concept ID is followed by its related CUIs (strings that start with "C"). We need to use the per-CUI counts (wordCounts) to obtain a count for each concept ID.
Since it's not a 1-to-1 relationship, and some CUIs may belong to multiple concepts, the best data structure I can think of is a map from CUI to a list of concept IDs. We'd use another map to keep the count for each concept.
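
To make the two-map idea concrete, here is a minimal plain-Python sketch. The rows and counts are made up for illustration (in particular, `C0027651` being shared by two concepts is an assumption of this example), and the names `cui_to_concepts` and `concept_counts` are hypothetical:

```python
from collections import defaultdict

# Hypothetical rows in the same shape as concept_map: [concept_id, cui, cui, ...]
rows = [
    ["22274", "C0027651", "C2981607"],
    ["22281", "C0002895", "C0027651"],
]

# Map 1: CUI -> list of concept IDs that contain it.
cui_to_concepts = defaultdict(list)
for concept_id, *cuis in rows:
    for cui in cuis:
        cui_to_concepts[cui].append(concept_id)

# Hypothetical per-CUI counts, standing in for wordCounts.
cui_counts = {"C0027651": 5, "C2981607": 2, "C0002895": 1}

# Map 2: concept ID -> summed count of its CUIs.
concept_counts = defaultdict(int)
for cui, n in cui_counts.items():
    for concept_id in cui_to_concepts.get(cui, []):
        concept_counts[concept_id] += n

print(dict(concept_counts))
```

Note that a shared CUI contributes its count to every concept it belongs to, which is the reason for filtering such CUIs later on.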

In [55]:
# convert from ['22274', 'C0027651', 'C2981607', 'C1882062', 'C1368871', 'C0026640'] to
# ('22274', ['C0027651', 'C2981607', ...]), i.e. (concept id, list of CUIs)

concept_maps = concept_maps.map(lambda line: (line[0], line[1:]))
    

In [56]:
concept_maps.take(1)

[('22274', ['C0027651', 'C2981607', 'C1882062', 'C1368871', 'C0026640'])]

In [71]:
def convert_to_tuple_list(entry):
    # entry is (concept_id, [cui, ...]); emit one (cui, concept_id) pair per CUI
    concept_id, cuis = entry
    return [(cui, concept_id) for cui in cuis]

ulms_concept_rdd = concept_maps.flatMap(convert_to_tuple_list)

In [72]:
ulms_concept_rdd.take(10)

[('C0027651', '22274'),
 ('C2981607', '22274'),
 ('C1882062', '22274'),
 ('C1368871', '22274'),
 ('C0026640', '22274'),
 ('C0002895', '22281'),
 ('C2699300', '22281'),
 ('C1260595', '22281'),
 ('C0750151', '22281'),
 ('C3273373', '22281')]

## Filter out CUIs that appear in too many concept names
These CUIs are not specific to any one concept, so we'll filter them out. For example, many concept names contain the word "infection" for different body parts and causes.


In [76]:
# First count the occurrence of each UMLS term in the concept names and make a new map
umls_count = ulms_concept_rdd.map(lambda x: (x[0], 1)).reduceByKey(lambda a,b:a +b).sortBy(lambda item: item[1], ascending=False)

In [77]:
umls_count.take(10)

[('C0016658', 618),
 ('C1963113', 617),
 ('C1880851', 617),
 ('C0016662', 616),
 ('C1160964', 613),
 ('C1306459', 244),
 ('C0560267', 174),
 ('C0024620', 153),
 ('C3263723', 125),
 ('C0027651', 117)]

In [78]:
umls_count.count()

11054

In [80]:
# see the distribution of the counts across 20 equal-width buckets
umls_count.map(lambda x: x[1]).histogram(20)

([1.0,
  31.85,
  62.7,
  93.55000000000001,
  124.4,
  155.25,
  186.10000000000002,
  216.95000000000002,
  247.8,
  278.65000000000003,
  309.5,
  340.35,
  371.20000000000005,
  402.05,
  432.90000000000003,
  463.75,
  494.6,
  525.45,
  556.3000000000001,
  587.15,
  618],
 [10958, 70, 14, 3, 2, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5])
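
The output is a pair: 21 bucket edges followed by 20 bucket counts. Almost all of the 11054 CUIs (10958) fall in the first bucket, and 5 sit in the last one. As a sketch of the arithmetic behind the edges (my own helper, `even_bucket_edges`, not PySpark's code), evenly spaced edges between the minimum and maximum can be computed like this:

```python
def even_bucket_edges(lo, hi, buckets):
    """Return buckets + 1 evenly spaced edges from lo to hi."""
    step = (hi - lo) / buckets
    edges = [lo + i * step for i in range(buckets)]
    edges.append(hi)  # use the exact maximum as the final edge
    return edges

# The minimum (1) and maximum (618) are read off the histogram output above.
edges = even_bucket_edges(1, 618, 20)
print(len(edges), edges[:3])
```

Each bucket is about 30.85 wide, so with a distribution this skewed the histogram mostly confirms that only a handful of CUIs appear in hundreds of concepts.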

In [81]:
# Collect the CUIs we need to filter out. Let's be aggressive and drop those that appear in more than 1 concept
umls_count_filter = umls_count.filter(lambda x: x[1] > 1)
umls_count_filter.count()

5251

In [82]:
umls_count_filter_set = set(umls_count_filter.map(lambda x: x[0]).collect())

In [84]:
# Filter out those that are in >1 concept names
ulms_concept_rdd_filtered = ulms_concept_rdd.filter(lambda x: x[0] not in umls_count_filter_set)
ulms_concept_rdd_filtered.take(2)

[('C2699300', '22281'), ('C1260595', '22281')]
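
The filtering step boils down to "count how many concepts each CUI appears in, then keep only pairs whose CUI appears exactly once". A minimal plain-Python sketch on made-up pairs (the sharing of `C0027651` between two concepts is an assumption of this example):

```python
from collections import Counter

# Hypothetical (CUI, concept ID) pairs in the same shape as ulms_concept_rdd.
pairs = [
    ("C0027651", "22274"),
    ("C2981607", "22274"),
    ("C0027651", "22281"),
    ("C2699300", "22281"),
]

# Number of concepts each CUI appears in.
concepts_per_cui = Counter(cui for cui, _ in pairs)

# CUIs shared by more than one concept are treated as unspecific.
shared = {cui for cui, n in concepts_per_cui.items() if n > 1}

# Keep only pairs whose CUI belongs to a single concept.
filtered = [(cui, concept_id) for cui, concept_id in pairs if cui not in shared]
print(filtered)
```

Using a set for the lookup keeps each membership test cheap, which matters when the filter runs once per pair.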

In [85]:
# join UMLS count from titles with this filtered result, key being the UMLS terms
joined_result = ulms_concept_rdd_filtered.join(wordCounts)
joined_result.take(2) # each item: (UMLS ID, (concept ID, count of UMLS ID from all titles))

[('C2699300', ('22281', 6)), ('C1260595', ('22281', 6))]

In [87]:
# sum the title counts of the UMLS terms for each concept ID
joined_result = joined_result.map(lambda x: x[1])
joined_result = joined_result.groupByKey().mapValues(sum)
joined_result = joined_result.sortBy(lambda item: item[1], ascending=False)

In [88]:
joined_result.take(10)

[('1792515', 3905),
 ('3034780', 2907),
 ('440022', 2281),
 ('44507566', 1880),
 ('2617205', 1364),
 ('198677', 1339),
 ('2514534', 1189),
 ('433131', 1188),
 ('432436', 1003),
 ('4275257', 927)]
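
For intuition, the join-then-sum step can be sketched in plain Python. The two counts of 6 come from the join output above; the third pair and the extra CUI `C9999999` are made up to show that an inner join drops CUIs missing from either side:

```python
from collections import defaultdict

# (CUI, concept ID) pairs after filtering; the last pair is hypothetical.
filtered_pairs = [
    ("C2699300", "22281"),
    ("C1260595", "22281"),
    ("C2981607", "22274"),
]

# Per-CUI title counts; C9999999 is a hypothetical CUI with no concept.
title_counts = {"C2699300": 6, "C1260595": 6, "C9999999": 3}

# Inner join on CUI, then sum the counts per concept ID.
totals = defaultdict(int)
for cui, concept_id in filtered_pairs:
    if cui in title_counts:
        totals[concept_id] += title_counts[cui]

# Sort by count, highest first, like sortBy(..., ascending=False).
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

Concepts whose remaining CUIs never occur in any title drop out of the result entirely, which is one reason the final list can be much shorter than the list of concepts.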

In [90]:
joined_result.count()  # this is how many potential features we can use, though we'll probably only use the top few

1470

In [91]:
ulms_concept_rdd.count()  # this is how many (CUI, concept ID) pairs we had before filtering

39717

In [89]:
# save to a file to be used later in our model training
joined_result.saveAsTextFile("/mnt/result/concept_counts_filtered1")
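
As far as I know, `saveAsTextFile` writes a directory of part files and stores each element as its string representation, so a saved line should look like `('1792515', 3905)`. Assuming that format, here is a minimal sketch for reading the lines back later; `parse_saved_line` is a hypothetical helper, and the sample lines reuse two of the results shown earlier:

```python
import ast

def parse_saved_line(line):
    """Turn a line such as "('1792515', 3905)" back into (concept_id, count)."""
    concept_id, count = ast.literal_eval(line.strip())
    return concept_id, int(count)

# Sample lines written in the assumed output format.
sample = ["('1792515', 3905)\n", "('3034780', 2907)\n"]
parsed = [parse_saved_line(line) for line in sample]
print(parsed)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for parsing these lines.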