# Data Preprocessing

# Knowledge Graph Representation

## E-Utilities

E-Utilities (NCBI Entrez Programming Utilities) is a set of tools designed to facilitate the process of downloading large sets of bioinformatics data.

A general introduction to the E-Utilities:
- https://www.ncbi.nlm.nih.gov/books/NBK25497/
- *'A set of nine server-side programs that provide a stable interface into the Entrez query and database system at the NCBI'*.
- Uses a fixed URL syntax that translates a standard set of input parameters into the values necessary for various NCBI software components to search for and retrieve the requested data.
- The E-utilities are therefore the structured interface to the Entrez system, which currently includes 38 databases covering a variety of biomedical data, including nucleotide and protein sequences, gene records, three-dimensional molecular structures, and the biomedical literature.
- To access data, a program sends an E-utility URL to NCBI, retrieves the response, and processes it.
- Any programming language that can send a URL to the E-utilities server and interpret the XML response can be used (e.g. Python, Perl, Java, C++).
- NCBI asks users to make no more than 3 requests per second (without an API key).
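
The fixed URL syntax described in the list above can be sketched without any network access. The example below only builds an ESearch URL with the standard library. The base address is NCBI's published E-utilities endpoint, while the helper name `build_esearch_url` and the parameter values are assumptions chosen for illustration.

```python
from urllib.parse import urlencode

# Base address of the E-utilities server; each utility is a separate .fcgi program.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

def build_esearch_url(db, term, retmax=20):
    """Build (but do not send) an ESearch URL. Hypothetical helper for illustration."""
    params = {"db": db, "term": term, "retmax": retmax}
    return EUTILS_BASE + "esearch.fcgi?" + urlencode(params)

url = build_esearch_url("pubmed", "genetics", retmax=5)
print(url)
```

Sending this URL would return an XML document listing the matching UIDs, which is the step that a client library automates.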

From this, I gathered that I can combine **ESearch** (to find matching records) and **EFetch** (to retrieve them) to get the data I want.

## Biopython

As explained above, E-Utilities is a set of programs for interacting with NCBI's database system. To use them, software must send an 'E-Utility URL' to NCBI.

Biopython provides, among many other things, a module called Entrez that sends these URLs from Python.

Below, I use it to fetch the abstracts of PubMed articles matching the term 'genetics' and save them to a single text file, 'data/genetics_corpus.txt'.

In [2]:
# Importing the Entrez module from Biopython.
from Bio import Entrez

In [3]:
# NCBI requires that you set your email address when using E-Utilities.
Entrez.email = 'aidanlowrie@example.com'

In [6]:
# Search PubMed for articles related to 'genetics'.
search_word = 'genetics'

# ESearch returns the UIDs of matching records. retmax is the most UIDs we ask for
# (NCBI may cap how many are actually returned in one response).
search_handle = Entrez.esearch(db='pubmed', term=search_word, retmax=100000)

# Parse the XML response into a dictionary-like record.
record = Entrez.read(search_handle)

search_handle.close()

URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:992)>

I wanted to know how the retmax parameter decides which articles are returned: are they a random sample, or the top results under some ranking, such as citation count?

According to the [PubMed help document](https://pubmed.ncbi.nlm.nih.gov/help/#understanding-docsum), searches of its database return results sorted by a Best Match algorithm. This algorithm gives each result a weight based on its relevance to the search query and orders the results by that weight. Recently published and highly cited articles are given a higher weight.

This means that my data reflects the state of genetics research as of April 2023. Learning how the field has changed over time would require further analysis, which would be an interesting next step for the research.

Next, I need to extract the list of unique identifiers (**UIDs**) from the dictionary. **EFetch** requires this list.

In [20]:
# The returned UIDs are stored under the search record's 'IdList' key.
uids = record['IdList']

In [21]:
# EFetch retrieves the abstract for each UID, returned as plain text.
fetch_handle = Entrez.efetch(db="pubmed", id=','.join(uids),
                             rettype="abstract", retmode="text")

data = fetch_handle.read()
fetch_handle.close()
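
Passing every UID in a single call can produce a very large request. One common alternative, sketched below, splits the list into batches and pauses between calls to stay under the three-requests-per-second guideline. The helper name `batched`, the batch size of 200 and the delay are assumptions, and the `Entrez.efetch` call is left commented out so that the sketch runs offline.

```python
import time

def batched(items, size):
    """Yield consecutive slices of `items`, each holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Stand-in UIDs for illustration; in the notebook these come from record['IdList'].
example_uids = [str(n) for n in range(1, 451)]

batches = list(batched(example_uids, 200))
for batch in batches:
    id_string = ','.join(batch)
    # fetch_handle = Entrez.efetch(db="pubmed", id=id_string,
    #                              rettype="abstract", retmode="text")
    time.sleep(0.34)  # pause so that at most about three requests go out per second
```

With 450 stand-in UIDs this produces three batches, the last one shorter than the others.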

*Originally I wanted to retrieve the full text of each article, but this is apparently not available from the database. Collecting it manually would require far too much work (and might infringe intellectual property or copyright law), so I'm using the abstracts, which are never behind a paywall, instead.*

In [22]:
# Saving the data to a file.
with open("data/genetics_corpus.txt", "w") as file:
    file.write(data)

With that, the data has been saved to a text file. I think the size of the file is appropriate: around 27 MB, similar to the Reuters corpus.

## Preprocessing

First, I need to wrap my data in a class that provides a ```sents()``` method.

In [2]:
# Importing nltk resources.
import nltk
from nltk.corpus.reader import PlaintextCorpusReader
from nltk.tokenize import sent_tokenize, word_tokenize

In [3]:
# A minimal corpus reader class that I can pass into my CorpusProcessor.
class CorpusReader:
    def __init__(self, file_path):
        self.file_path = file_path
        self.raw_text = ''
        self.sentences = self.read_to_sentences()

    def sents(self):
        # Return a copy so that callers cannot modify the stored list.
        return list(self.sentences)

    # Reads the file and returns its contents as a list of sentences,
    # where each sentence is a list of word tokens.
    def read_to_sentences(self):
        with open(self.file_path, 'r') as file:
            self.raw_text = file.read()
        return [word_tokenize(sentence) for sentence in sent_tokenize(self.raw_text)]

In [4]:
corpus = CorpusReader('data/genetics_corpus.txt')
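
To show the shape of the data that ```sents()``` returns, here is a small stand-in that needs no NLTK models. It uses naive regular expressions in place of `sent_tokenize` and `word_tokenize`, so the class name `TinyCorpusReader` and its tokenization rules are illustrative assumptions, not the tokenizers used above.

```python
import re

class TinyCorpusReader:
    """Stand-in reader: holds text in memory and exposes sents() like the class above."""

    def __init__(self, raw_text):
        self.raw_text = raw_text

    def sents(self):
        # Naive sentence split: break at whitespace that follows '.', '!' or '?'.
        sentences = re.split(r'(?<=[.!?])\s+', self.raw_text.strip())
        # Naive word split: runs of word characters, or single punctuation marks.
        return [re.findall(r'\w+|[^\w\s]', sentence) for sentence in sentences]

demo = TinyCorpusReader("Genes encode proteins. Mutations alter genes.")
print(demo.sents())
```

The result is a list of sentences, each a list of tokens, which is the structure the downstream processing expects.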

## The CorpusProcessor

In assignment 2, I created a ```CorpusProcessor``` class. An object of this class takes any text corpus and runs it through a pipeline of (customisable) steps:
1. Breaking the corpus into sentences.
2. Tagging and lemmatizing the corpus.
3. Removing words with fewer than three alphanumeric characters.
4. Removing stopwords.
5. Removing infrequent words.
6. Finding surface co-occurrences.
7. Removing the least frequent co-occurrences.

The previous steps were necessary to get my data into a format that could be processed by a CorpusProcessor. Now, the rest of the preprocessing is just a matter of passing the data into the object.
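
Step 6 is the least self-explanatory part of the pipeline, so here is a minimal sketch of how surface co-occurrences can be counted. It is not the ```CorpusProcessor``` implementation: the function name `count_surface_cooccurrences`, the symmetric window and the window size are all assumptions made for illustration.

```python
from collections import Counter

def count_surface_cooccurrences(sentences, window=2):
    """Count (word, neighbour) pairs that occur within `window` tokens of each other."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Look at neighbours on both sides, staying inside the sentence.
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

demo_sentences = [["gene", "encode", "protein"],
                  ["gene", "mutation", "alter", "protein"]]
demo_counts = count_surface_cooccurrences(demo_sentences, window=1)
print(demo_counts[("gene", "encode")], demo_counts[("gene", "protein")])
```

With a window of one token, 'gene' and 'encode' are counted as neighbours, while 'gene' and 'protein' never fall inside the same window in either sentence.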

In [5]:
# Importing the CorpusProcessor class.
from surface_cooccurrences import CorpusProcessor

In [6]:
processed_corpus = CorpusProcessor(corpus, 
                                   remove_most_frequent=20,
                                   frequency_threshold=25, 
                                   sc_frequency_threshold=10)

In [15]:
import math
from collections import Counter

def smoothed_ppmi(o_11, r_1, c_1, n, alpha=0.75):
    # Pass the caller's alpha through instead of a hard-coded value.
    return ppmi(o_11, r_1, c_1, n, alpha=alpha)

def ppmi(o_11, r_1, c_1, n, alpha=0):
    # Smoothing: raise the context frequency to the power alpha (n is left unchanged).
    if alpha > 0:
        c_1 = c_1 ** alpha
    observed = o_11
    expected = (r_1 * c_1) / n
    # Floor the expected frequency to avoid dividing by a value close to zero.
    result = math.log(observed / max(0.001, expected), 2)
    # Negative associations are clipped to zero.
    return max(0, result)

def weighted_surface_cooccurrences(corpus_processor_object, measure_function):
    cooccurrences = corpus_processor_object.surface_cooccurrences
    frequencies = corpus_processor_object.filtered_lemma_frequencies
    # The sample size is the same for every pair, so compute it once.
    n = sum(cooccurrences.values())
    adjusted_surface_frequencies = Counter()
    for key, o_11 in cooccurrences.items():
        r_1 = frequencies[key[0]]
        c_1 = frequencies[key[1]]
        adjusted_surface_frequencies[key] = measure_function(o_11, r_1, c_1, n)
    return adjusted_surface_frequencies
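
A worked example may make the weighting concrete. The sketch below restates the unsmoothed calculation in a self-contained form, without the floor on the expected frequency. The name `ppmi_example` and all of the counts are made up for illustration: a pair seen 10 times, whose words occur 100 and 50 times, in a sample of 10,000.

```python
import math

def ppmi_example(o_11, r_1, c_1, n):
    """Positive PMI: log2 of observed over expected frequency, floored at zero."""
    expected = (r_1 * c_1) / n
    return max(0.0, math.log2(o_11 / expected))

# Expected frequency is 100 * 50 / 10000 = 0.5, so the pair occurs 20 times
# more often than chance predicts, and PPMI = log2(20).
score = ppmi_example(o_11=10, r_1=100, c_1=50, n=10_000)
print(round(score, 3))  # 4.322

# A pair seen less often than expected has negative PMI, which is clipped to 0.
rare = ppmi_example(o_11=1, r_1=400, c_1=100, n=10_000)
print(rare)  # 0.0
```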

In [24]:
from collections import Counter
sc_sppmi = weighted_surface_cooccurrences(processed_corpus, smoothed_ppmi)

In [25]:
# From here on, processed_corpus refers to the weighted co-occurrence counts.
processed_corpus = sc_sppmi

In [27]:
# Sort the (pair, weight) items by weight, highest first.
pair_frequencies = list(processed_corpus.items())
sorted_pairs = sorted(pair_frequencies, key=lambda x: x[1], reverse=True)

In [28]:
import csv

# Note: the 'Frequency' column holds the smoothed PPMI weight, not a raw count.
with open('data/genetics_surface_cooccurrences.csv', 'w', newline='') as csvfile:
    csv_writer = csv.writer(csvfile)
    csv_writer.writerow(['Word1', 'Word2', 'Frequency'])
    for (word1, word2), weight in sorted_pairs:
        # Keep only the part of each key before the first '-'.
        csv_writer.writerow([word1.split('-')[0], word2.split('-')[0], weight])
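
If the CSV is read back later (for example, to build a graph of weighted word pairs), something like the following would work. The sketch uses an in-memory file with made-up rows in the same three-column layout, so the sample values and variable names are illustrative assumptions.

```python
import csv
import io

# Made-up rows in the same layout as genetics_surface_cooccurrences.csv.
sample_csv = io.StringIO(
    "Word1,Word2,Frequency\n"
    "gene,expression,5.2\n"
    "protein,binding,4.8\n"
)

# DictReader uses the header row as keys; weights are parsed back into floats.
edges = [
    (row['Word1'], row['Word2'], float(row['Frequency']))
    for row in csv.DictReader(sample_csv)
]
print(edges)
```

Each tuple can then be treated as a weighted edge between two words.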