# Data Preprocessing

The first thing we need to do is preprocess our data. Let's break this down into two smaller steps:

1. **Download** the data into a **text file**.
2. **Process** the text file data into a list of **weighted surface co-occurrences**.

Below, we have implemented these steps.

## 1. Downloading the Data
### E-Utilities and Biopython 

To download the data we need, we will make use of E-Utilities (NCBI Entrez Programming Utilities), a set of tools designed to facilitate the process of downloading large sets of bioinformatics data.

A general introduction to the E-Utilities:
- https://www.ncbi.nlm.nih.gov/books/NBK25497/
- *'A set of nine server-side programs that provide a stable interface into the Entrez query and database system at the NCBI'*.
- Uses a fixed URL syntax that translates a standard set of input parameters into the values necessary for various NCBI software components to search for and retrieve the requested data.
- The E-utilities are therefore the structured interface to the Entrez system, which currently includes 38 databases covering a variety of biomedical data, including nucleotide and protein sequences, gene records, three-dimensional molecular structures, and the biomedical literature.
- To access data, a piece of software posts an E-utility URL to NCBI, then retrieves and processes the results.
- Any programming language that can send a URL to the E-utilities server and interpret the XML response can be used (e.g. Python, Perl, Java, C++).
- NCBI requests that users limit requests to no more than 3 per second.
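
To make the "fixed URL syntax" concrete, here is a minimal sketch of how an ESearch URL could be assembled by hand. The helper name `build_eutils_url` is hypothetical and only for illustration, and the `retmax=20` value is arbitrary; the base address and the `db`, `term` and `retmax` parameters are those documented for the E-utilities. No request is sent.

```python
from urllib.parse import urlencode

# Base address shared by the E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_eutils_url(utility, **params):
    """Return the URL for one E-utility call (hypothetical helper)."""
    return f"{EUTILS_BASE}{utility}.fcgi?{urlencode(params)}"


# An ESearch query for PubMed records matching 'biology'.
url = build_eutils_url("esearch", db="pubmed", term="biology", retmax=20)
print(url)
```

Biopython's `Entrez` module builds and sends URLs of this shape for us, which is why it is used in the rest of this section.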

From this, I have gleaned that I can use a combination of **ESearch** and **EFetch** to find and retrieve the data I want.

To use these tools, a program must post an 'E-Utility URL' to NCBI. **Biopython** is a library that provides a module called Entrez to send these URLs from Python.

Below, it is used to search PubMed for articles matching the term 'biology' (up to 100,000 results) and to fetch their abstracts, which are then saved to a single text file, 'data/bio_corpus.txt'.

In [3]:
from Bio import Entrez
import xml.etree.ElementTree as ET

Entrez.email = 'aidanlowrie@example.com'
search_word = 'biology'

search_handle = Entrez.esearch(db='pubmed', term=search_word, retmax=100000)
record = Entrez.read(search_handle)
search_handle.close()

uids = record['IdList']

fetch_handle = Entrez.efetch(db="pubmed", id=','.join(uids), 
                       rettype="abstract", retmode="xml") # Return data in XML form.

data = fetch_handle.read()
fetch_handle.close()

# Now parse the XML
root = ET.fromstring(data)

# The path to the abstract text will depend on the structure of the returned XML, 
# but it will be something like this:
abstracts = root.findall(".//AbstractText")

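# Write just the abstract text to the file
with open("data/bio_corpus.txt", "w") as file:
    for abstract in abstracts:
        # itertext() also collects text nested inside inline tags such as <i> or <sup>,
        # and avoids writing the string "None" when abstract.text is empty.
        file.write("".join(abstract.itertext()) + "\n\n")  # Add two newlines for separation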
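
Requesting every UID in a single EFetch call can be fragile for large result sets, and NCBI asks for no more than three requests per second. One possible approach, sketched here under assumptions, is to fetch in batches and pause between requests. The helper names, the batch size of 200 and the 0.34-second interval are illustrative choices, not values taken from the code above or from NCBI.

```python
import time


def batched(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def fetch_in_batches(uids, fetch_batch, batch_size=200, min_interval=0.34):
    """Call `fetch_batch` once per batch of UIDs, pausing so that no more
    than about three requests are sent per second."""
    results = []
    for batch in batched(uids, batch_size):
        started = time.monotonic()
        results.append(fetch_batch(batch))
        elapsed = time.monotonic() - started
        if elapsed < min_interval:
            time.sleep(min_interval - elapsed)
    return results


# Stand-in fetch function so the sketch runs without network access.
print(fetch_in_batches(list(range(5)), len, batch_size=2, min_interval=0))
```

In practice, `fetch_batch` could be a small function that calls `Entrez.efetch(db="pubmed", id=",".join(batch), retmode="xml")` and returns the data read from the handle.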


According to the [PubMed help document](https://pubmed.ncbi.nlm.nih.gov/help/#understanding-docsum), searches of its database return results sorted by a Best Match algorithm. This algorithm weights each result by its relevance to the search query and orders the results by that weight. Recently published and highly cited articles receive a higher weight.

This means that the data reflects the **state of biology research as of April 2023**.

Originally, the goal was to retrieve the full text of each article, but the database does not allow this, so abstracts were downloaded instead. The resulting file is around 27 MB, a similar size to the Reuters corpus.


## 2. Preprocessing the Text File
### The CorpusProcessor

In assignment 2, I created a ```CorpusProcessor``` class. An object of this class takes in any corpus of text and processes it by running it through a pipeline of (customisable) steps, outlined below:
1. Breaking the corpus into sentences.
2. Tagging and lemmatizing the corpus.
3. Removing words with fewer than three alphanumeric characters.
4. Removing stopwords.
5. Removing infrequent words.
6. Finding surface co-occurrences.
7. Removing the least frequent co-occurrences.
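
The `CorpusProcessor` source is not reproduced in this notebook, so the following is only a rough sketch of what step 6 might look like: counting word pairs that appear within a fixed window of each other in the same sentence. The function name, the window size of 2 and the choice to treat pairs as unordered are all assumptions made for illustration.

```python
from collections import Counter


def surface_cooccurrences(sentences, window=2):
    """Count unordered word pairs that occur within `window` tokens of
    each other inside the same sentence."""
    counts = Counter()
    for sentence in sentences:
        for i, word in enumerate(sentence):
            # Look only to the right, and store each pair in sorted order
            # so that (a, b) and (b, a) are counted together.
            for neighbour in sentence[i + 1:i + 1 + window]:
                counts[tuple(sorted((word, neighbour)))] += 1
    return counts


example = [["gene", "expression", "regulate", "cell"], ["cell", "gene"]]
counts = surface_cooccurrences(example, window=2)
print(counts.most_common())
```

Here "gene" and "cell" are too far apart in the first sentence to be counted, but they are adjacent in the second, so the pair is counted once.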

With the data saved to a text file, it can now be passed into a CorpusProcessor object, which will automatically process the text to our specifications.

To pass it into the object, the data must first be wrapped in an object that exposes a ```sents()``` method.

In [1]:
# Importing nltk resources.
import nltk
from nltk.corpus.reader import PlaintextCorpusReader
from nltk.tokenize import sent_tokenize, word_tokenize

# Making a CorpusReader class that I can pass into my CorpusProcessor
class CorpusReader:
    def __init__(self, file_path):
        self.file_path = file_path
        self.raw_text = ''
        self.sentences = self.read_to_sentences()
        
    def sents(self):
        return [sentence for sentence in self.sentences]
    
    # This function opens a file and tokenizes its contents, returning a series of sentences.
    def read_to_sentences(self):
        with open(self.file_path, 'r') as file:
            self.raw_text = file.read()
        tokenized_sentences = []
        sentences = sent_tokenize(self.raw_text)
        for sentence in sentences:
            tokenized_sentences.append(word_tokenize(sentence))
        return tokenized_sentences

corpus = CorpusReader('data/bio_corpus.txt')


Now, the object can be passed to the CorpusProcessor.

In [5]:
# Importing the CorpusProcessor class.
from surface_cooccurrences import CorpusProcessor

# Passing the CorpusReader object into the CorpusProcessor.
processed_corpus = CorpusProcessor(corpus, 
                                   remove_most_frequent=20, # Remove 20 most frequent words.
                                   frequency_threshold=25, # Remove words that appear under 25 times throughout the corpus. 
                                   sc_frequency_threshold=10) # Remove surface co-occurrence word pairs that appear together under 10 times.

From this, a list of **noun surface co-occurrences** is generated. This is what we will be focusing on.

### Weighted Surface Co-occurrences 

With our list of surface co-occurrences, we now apply smoothed positive pointwise mutual information (smoothed PPMI) to produce weighted surface co-occurrences. This has been implemented below. The weighted surface co-occurrences are then saved to a CSV file.
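
As a reading of the implementation in this section (a transcription of the code as written, not a general definition of smoothed PPMI), the weight given to a word pair can be written as:

$$
\mathrm{SPPMI}(w_1, w_2) = \max\left(0,\ \log_2 \frac{O_{11}}{E_{11}}\right),
\qquad
E_{11} = \frac{R_1 \cdot C_1^{\alpha}}{N},
\qquad
\alpha = 0.75
$$

Here $O_{11}$ is the observed co-occurrence count of the pair, $R_1$ and $C_1$ are the lemma frequencies of the first and second word, and $N$ is the total number of surface co-occurrences. In the code, $E_{11}$ is additionally floored at 0.001 to avoid division by zero.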

In [15]:
import math
from collections import Counter
import csv

# Smoothed PPMI: PPMI with the context count raised to the power alpha.
def smoothed_ppmi(o_11, r_1, c_1, n, alpha=0.75):
    # Pass the caller's alpha through rather than hard-coding 0.75.
    return ppmi(o_11, r_1, c_1, n, alpha=alpha)

# Carry out ppmi function. 
def ppmi(o_11, r_1, c_1, n, alpha=0):
    if alpha > 0:
        c_1 = c_1 ** alpha    
    observed = o_11
    expected = (r_1*c_1)/n 
    result = math.log(observed/max(0.001, expected), 2)
    return max(0, result)

# Takes a CorpusProcessor object and a weighting function, and returns a Counter
# mapping each surface co-occurrence pair to its weight.
def weighted_surface_cooccurrences(corpus_processor_object, measure_function):
    adjusted_surface_frequencies = Counter()
    # The total is the same for every pair, so compute it once outside the loop.
    n = sum(corpus_processor_object.surface_cooccurrences.values())
    for key, value in corpus_processor_object.surface_cooccurrences.items():
        o_11 = value
        r_1 = corpus_processor_object.filtered_lemma_frequencies[key[0]]
        c_1 = corpus_processor_object.filtered_lemma_frequencies[key[1]]
        adjusted_surface_frequencies[key] = measure_function(o_11, r_1, c_1, n)
    return adjusted_surface_frequencies

# Applying smoothed PPMI to the surface co-occurrence counts to produce weighted co-occurrences.
sc_sppmi = weighted_surface_cooccurrences(processed_corpus, smoothed_ppmi)
# processed_corpus stays bound to the CorpusProcessor; it is needed again when saving the corpus.
pair_frequencies = list(sc_sppmi.items())
sorted_pairs = sorted(pair_frequencies, key=lambda x: x[1], reverse=True)

# Saving weighted surface cooccurrence data to a csv.
with open('data/bio_surface_cooccurrences.csv', 'w', newline='') as csvfile:
    csv_writer = csv.writer(csvfile)
    csv_writer.writerow(['Word1', 'Word2', 'Frequency'])
    for pair in sorted_pairs:
        csv_writer.writerow([pair[0][0].split('-')[0], pair[0][1].split('-')[0], pair[1]])

## Saving the Processed Corpus to a Text File

In the next section, keyword extraction, we will need access to the entire processed corpus, which supplies the semantic data that keyword extraction relies on.

In [None]:
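with open('processed_corpus.txt', 'w') as file:
    for sentence in processed_corpus.corpus:
        # Assumes each sentence is either a string or a list of string tokens.
        line = sentence if isinstance(sentence, str) else ' '.join(sentence)
        file.write(line + '\n')  # One sentence per line.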


# New Extra Stuff For Relationship Extraction

I haven't had time to get this into the notebook in a neat way, but I thought you guys might appreciate having access to all my code.

In [None]:
# Importing nltk resources.
import nltk
from nltk.corpus.reader import PlaintextCorpusReader
from nltk.tokenize import sent_tokenize, word_tokenize
from transformers import pipeline
import csv

# Opening corpus.
with open("data/genetics_corpus.txt", "r") as file:
    corpus_text = file.read()

corpus_sentences = sent_tokenize(corpus_text)

def extract_triplets(text):
    triplets = []
    relation, subject, object_ = '', '', ''
    text = text.strip()
    current = 'x'
    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").split():
        if token == "<triplet>":
            current = 't'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
                relation = ''
            subject = ''
        elif token == "<subj>":
            current = 's'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
            object_ = ''
        elif token == "<obj>":
            current = 'o'
            relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '':
        triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
    return triplets

triplet_extractor = pipeline('text2text-generation', model='Babelscape/rebel-large', tokenizer='Babelscape/rebel-large')

with open("triplets.csv", "w", newline="") as file:
    field_names = ['head', 'type', 'tail']
    writer = csv.DictWriter(file, fieldnames=field_names)
    writer.writeheader()  # Write the column names as the first row.
    for sentence in corpus_sentences:
        extracted_text = triplet_extractor.tokenizer.batch_decode([triplet_extractor(sentence, return_tensors=True, return_text=False)[0]["generated_token_ids"]])
        triplets = extract_triplets(extracted_text[0])
        for triplet in triplets:
            writer.writerow(triplet)
        print([(triplet['head'], triplet['type'], triplet['tail']) for triplet in triplets])
    