<a href="https://colab.research.google.com/github/artms-18/ML-Projects/blob/main/SkimLit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SkimLit

The purpose of this notebook is to build an NLP model to make reading medical abstracts easier.

The paper we're replicating (the source of the dataset that we'll be using) is available here: 
https://arxiv.org/abs/1710.06071

Reading through the paper above, we see that the model architecture they use to achieve their best results is available here: https://arxiv.org/abs/1612.05251


## Confirm access to a GPU

In [2]:
!nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-3d418ffd-aeaf-b61e-8a83-339f7851f29f)


## Get data

Since we'll be replicating the paper above (PubMed 200k RCT), let's download the dataset they used.

https://github.com/Franck-Dernoncourt/pubmed-rct

In [13]:
!git clone https://github.com/Franck-Dernoncourt/pubmed-rct
!ls pubmed-rct

fatal: destination path 'pubmed-rct' already exists and is not an empty directory.
PubMed_200k_RCT
PubMed_200k_RCT_numbers_replaced_with_at_sign
PubMed_20k_RCT
PubMed_20k_RCT_numbers_replaced_with_at_sign
README.md


In [14]:
# Check what files are in the PubMed_20K dataset

!ls pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/

dev.txt  test.txt  train.txt


In [19]:
# Start our experiments using the 20k dataset
data_dir = "/content/pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/"

In [20]:
import os
filenames = [data_dir + filename for filename in os.listdir(data_dir)]
filenames

['/content/pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/test.txt',
 '/content/pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/dev.txt',
 '/content/pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/train.txt']
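Notice that the files come back in no particular order (`test.txt` first, `train.txt` last), because `os.listdir` makes no ordering promise. Here's a minimal sketch of a more predictable listing, assuming we want sorted full paths; `list_files` is a hypothetical helper (not part of the original notebook) and it's demonstrated on a throwaway directory rather than the real dataset.

```python
import os
import tempfile


def list_files(directory):
    """Return the sorted full paths of the entries in `directory` (hypothetical helper)."""
    # os.path.join works whether or not `directory` ends with a slash
    return sorted(os.path.join(directory, name) for name in os.listdir(directory))


# Demo on a temporary directory that mimics the dataset layout
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ("train.txt", "dev.txt", "test.txt"):
        open(os.path.join(tmp_dir, name), "w").close()
    demo_names = [os.path.basename(path) for path in list_files(tmp_dir)]

print(demo_names)
```

Sorting costs nothing here and makes the cell output the same on every run.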

## Preprocess data

Now that we've got some text data, it's time to become one with it.

And one of the best ways to become one with the data is to ...

> Visualize, visualize, visualize

Let's write a function to read in all of the lines of a text file with Python.

In [21]:
# Create a function to read the lines of a document

def get_lines(filename):
  """
  Reads filename (a text filename) and returns the lines of text as a list.

  Args:
    filename: a string containing the target filepath

  Returns:
    A list of strings with one string per line from the target filename.
  """

  with open(filename, "r") as f:
    return f.readlines()
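To see what a reader like this returns without touching the real dataset, here's a small sketch on a temporary file. `get_lines_stripped` is a hypothetical variant (not the notebook's `get_lines`): it assumes UTF-8 text and strips the trailing newline from each line, which `readlines()` keeps.

```python
import os
import tempfile


def get_lines_stripped(filename):
    """Read a text file and return its lines without trailing newlines (hypothetical variant)."""
    # Assumption: the file is UTF-8 encoded
    with open(filename, "r", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


# Write a tiny file in the same layout as the dataset: ID line, labelled sentence, blank line
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("###123\nOBJECTIVE\tAn example sentence .\n\n")
    tmp_path = tmp.name

demo_lines = get_lines_stripped(tmp_path)
os.remove(tmp_path)  # clean up the temporary file

print(demo_lines)
```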

In [47]:
# Let's read in the training lines

train_lines = get_lines(data_dir+"train.txt")
train_lines[:20]

['###24293578\n',
 'OBJECTIVE\tTo investigate the efficacy of @ weeks of daily low-dose oral prednisolone in improving pain , mobility , and systemic low-grade inflammation in the short term and whether the effect would be sustained at @ weeks in older adults with moderate to severe knee osteoarthritis ( OA ) .\n',
 'METHODS\tA total of @ patients with primary knee OA were randomized @:@ ; @ received @ mg/day of prednisolone and @ received placebo for @ weeks .\n',
 'METHODS\tOutcome measures included pain reduction and improvement in function scores and systemic inflammation markers .\n',
 'METHODS\tPain was assessed using the visual analog pain scale ( @-@ mm ) .\n',
 'METHODS\tSecondary outcome measures included the Western Ontario and McMaster Universities Osteoarthritis Index scores , patient global assessment ( PGA ) of the severity of knee OA , and @-min walk distance ( @MWD ) .\n',
 'METHODS\tSerum levels of interleukin @ ( IL-@ ) , IL-@ , tumor necrosis factor ( TNF ) - , and 

In [48]:
len(train_lines)

210040

In [49]:
test_lines = get_lines(data_dir + "test.txt")
dev_lines = get_lines(data_dir + "dev.txt")

In [50]:
sample_line = train_lines[1]
sample_line

'OBJECTIVE\tTo investigate the efficacy of @ weeks of daily low-dose oral prednisolone in improving pain , mobility , and systemic low-grade inflammation in the short term and whether the effect would be sustained at @ weeks in older adults with moderate to severe knee osteoarthritis ( OA ) .\n'

In [51]:
import re

temp_list = re.split(r'\t', sample_line)
len(temp_list[1].split())



49
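The same label/text split works without a regular expression. A minimal sketch using `str.partition`, run on a shortened, made-up version of the sample line (so the word count differs from the 49 above):

```python
# Shortened stand-in for a dataset line: "<LABEL>\t<sentence>\n"
sample = "OBJECTIVE\tTo investigate the efficacy of @ weeks of daily low-dose oral prednisolone .\n"

# partition splits on the first tab only and always returns three parts
label, _, text = sample.partition("\t")

# Whitespace-separated tokens in the sentence (punctuation counts as a token)
word_count = len(text.split())

print(label, word_count)
```

`partition` never raises when the separator is missing; it just returns an empty `text`, which is handy for lines that carry no label.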

Let's think about how we want our data to look...

Here's how I think our data would be best represented:

```
[{'line_number': 0,
  'target': 'BACKGROUND',
  'text': 'Emotional eating is associated with overeating and the development of obesity .\n',
  'total_lines': 11},
 ...]
```
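One way to get records of this shape is to group sentences by abstract, so that `line_number` restarts at 0 for each abstract and `total_lines` describes the abstract the sentence belongs to. The sketch below makes a few assumptions: an abstract starts at a line beginning with `###` and ends at a blank line (as in the training lines printed earlier), `total_lines` is taken to be the number of sentences in the abstract, and `parse_abstracts` is a hypothetical helper name.

```python
def parse_abstracts(lines):
    """Group labelled sentences by abstract and number them within each abstract."""
    samples = []
    current = []  # (target, text) pairs of the abstract being read

    def flush():
        # Emit one record per sentence of the finished abstract, then reset
        for number, (target, text) in enumerate(current):
            samples.append({
                "line_number": number,
                "target": target,
                "text": text,
                "total_lines": len(current),  # assumption: count of sentences
            })
        current.clear()

    for line in lines:
        line = line.strip()
        if line.startswith("###") or not line:
            flush()  # an ID line or a blank line closes the previous abstract
        else:
            target, _, text = line.partition("\t")
            current.append((target, text))
    flush()  # in case the input does not end with a blank line
    return samples


# Tiny made-up input in the dataset's layout
demo = [
    "###1\n",
    "BACKGROUND\tFirst sentence .\n",
    "RESULTS\tSecond sentence .\n",
    "\n",
    "###2\n",
    "METHODS\tOnly sentence .\n",
    "\n",
]
parsed = parse_abstracts(demo)
print(parsed)
```

With this grouping, the ID lines and blank lines never become records themselves; they only mark where one abstract ends and the next begins.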


In [59]:
def preprocess(lines):

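  '''
  Takes a list of lines and reformats each one as a dictionary.

  Args:
    lines: a list of strings, each holding a target label and text separated by a tab

  Returns:
    A list of dictionaries, one per line, with the keys 'line_number',
    'target', 'text' and 'total_lines'. Note that 'total_lines' currently
    stores the number of whitespace-separated tokens in the text.

  '''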

  formatted_list = []

  for i, line in enumerate(lines):

    temp = re.split(r'\t', line)
    target = temp[0]

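    # Lines without a tab (abstract IDs such as '###24293578\n' and blank lines)
    # have no text part, so fall back to the placeholder string 'None'
    try:
      text = temp[1]
    except IndexError:
      text = 'None'

    # Number of whitespace-separated tokens in the text
    length = len(text.split())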

    small_dict = {'line_number': i,
                  'target': target,
                  'text': text,
                  'total_lines': length}

    formatted_list.append(small_dict)

  return formatted_list


In [60]:
# Drop the first line of train_lines
# Note: re-running this cell drops one more line each time
train_lines = train_lines[1:]

preprocessed_train = preprocess(train_lines)

In [61]:
preprocessed_train[:10]

[{'line_number': 0,
  'target': 'METHODS',
  'text': 'Serum levels of interleukin @ ( IL-@ ) , IL-@ , tumor necrosis factor ( TNF ) - , and high-sensitivity C-reactive protein ( hsCRP ) were measured .\n',
  'total_lines': 29},
 {'line_number': 1,
  'target': 'RESULTS',
  'text': 'There was a clinically relevant reduction in the intervention group compared to the placebo group for knee pain , physical function , PGA , and @MWD at @ weeks .\n',
  'total_lines': 30},
 {'line_number': 2,
  'target': 'RESULTS',
  'text': 'The mean difference between treatment arms ( @ % CI ) was @ ( @-@ @ ) , p < @ ; @ ( @-@ @ ) , p < @ ; @ ( @-@ @ ) , p < @ ; and @ ( @-@ @ ) , p < @ , respectively .\n',
  'total_lines': 55},
 {'line_number': 3,
  'target': 'RESULTS',
  'text': 'Further , there was a clinically relevant reduction in the serum levels of IL-@ , IL-@ , TNF - , and hsCRP at @ weeks in the intervention group when compared to the placebo group .\n',
  'total_lines': 36},
 {'line_number': 4,
  't