# Learning Goal
Understand why natural language processing and text representation are important, the different ways to represent text, and how to implement a few simple text representations.

# Install Libraries
We'll be using the gensim library to learn word embeddings. The commented-out lines below install gensim through conda and pip, respectively. This notebook uses `gensim.summarization` and the `size` and `iter` arguments of `Word2Vec`, which were removed or renamed in gensim 4.0, so the install commands pin a 3.x release.

In [30]:
# STUDENT TEST RUN
from gensim.models import Word2Vec
from gensim.summarization.textcleaner import split_sentences, tokenize_by_word
import numpy as np
# conda install -c anaconda "gensim<4"
# pip install "gensim<4"

# Testing that our code works
This code reads the raw .xml file from PubMed and extracts the abstracts into a more readable text file, one abstract per line.

In [5]:
# STUDENT TEST RUN

# Process the original pubmed download. This is just so you can see how it's done. We won't work with the xml file.
n_abs = 0
with open("data/pubmed_sample_test.txt", "w") as outfile:
    with open("data/pubmed20n0001.xml", "r") as pubmed_file:
        for line in pubmed_file:
            if "<AbstractText>" in line:
                line = line.strip()  # remove leading and trailing whitespace
                # These tags identify when an abstract is present in the xml file.
                line = line.replace("<AbstractText>", "").replace("</AbstractText>", "")
                outfile.write(line + "\n")  # write the text to the text file
                n_abs += 1
print(n_abs, "abstracts processed")

15437 abstracts processed
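The string matching above only finds abstracts whose opening tag is exactly `<AbstractText>` and whose text sits on a single line. As a hedged alternative, the sketch below uses the standard library's `xml.etree.ElementTree` to collect every `AbstractText` element, including ones that carry attributes or span several lines. The `sample_xml` string and the `extract_abstracts` helper are made up for illustration; the real PubMed file is much larger and may be better processed incrementally, for example with `ET.iterparse`.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document imitating the PubMed layout (not real data).
sample_xml = """<PubmedArticleSet>
  <PubmedArticle>
    <Abstract>
      <AbstractText>First abstract text.</AbstractText>
    </Abstract>
  </PubmedArticle>
  <PubmedArticle>
    <Abstract>
      <AbstractText Label="METHODS">Second abstract,
      spread over two lines.</AbstractText>
    </Abstract>
  </PubmedArticle>
</PubmedArticleSet>"""


def extract_abstracts(xml_text):
    """Return the text of every <AbstractText> element, one string per element."""
    root = ET.fromstring(xml_text)
    abstracts = []
    for element in root.iter("AbstractText"):
        # itertext() also collects text nested inside child tags.
        text = "".join(element.itertext())
        # Collapse line breaks and runs of whitespace into single spaces.
        abstracts.append(" ".join(text.split()))
    return abstracts


abstracts = extract_abstracts(sample_xml)
for abstract in abstracts:
    print(abstract)
print(len(abstracts), "abstracts processed")
```

The second sample abstract would be skipped by the line-based check because its tag has a `Label` attribute, which is one reason the two approaches can report different counts.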


# Read in data from file
The data is a text file of abstracts, one per line. We'll read this data into a list.

In [6]:
abstract_list = []
with open("data/pubmed_sample.txt", "r") as abstract_file:
    for line in abstract_file:
        abstract_list.append(line.strip())
print(len(abstract_list), "abstracts read in")


15437 abstracts read in


In [7]:
for i in range(5):
    print(abstract_list[i])
    print("***************************************************\n\n\n")

(--)-alpha-Bisabolol has a primary antipeptic action depending on dosage, which is not caused by an alteration of the pH-value. The proteolytic activity of pepsin is reduced by 50 percent through addition of bisabolol in the ratio of 1/0.5. The antipeptic action of bisabolol only occurs in case of direct contact. In case of a previous contact with the substrate, the inhibiting effect is lost.
***************************************************



A report is given on the recent discovery of outstanding immunological properties in BA 1 [N-(2-cyanoethylene)-urea] having a (low) molecular mass M = 111.104. Experiments in 214 DS carcinosarcoma bearing Wistar rats have shown that BA 1, at a dosage of only about 12 percent LD50 (150 mg kg) and negligible lethality (1.7 percent), results in a recovery rate of 40 percent without hyperglycemia and, in one test, of 80 percent with hyperglycemia. Under otherwise unchanged conditions the reference substance ifosfamide (IF) -- a further development

# Process our data
The next step is to split our abstracts into sentences. Word2vec can learn the context around words from either sentences or entire documents (abstracts). This is a design choice and up to you. In the next section, split the abstracts into sentences and store them in a list with one sentence per element. We'll use the function `split_sentences`, which is documented at https://radimrehurek.com/gensim/summarization/textcleaner.html.


In [8]:
sentence_list = []
for abstract in abstract_list:
    sentences = split_sentences(abstract)
    for sentence in sentences:
        sentence_list.append(sentence)
print(len(sentence_list), "sentences extracted")

101724 sentences extracted
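To make the splitting step less of a black box, here is a deliberately naive sketch of sentence splitting with a regular expression. It is not gensim's algorithm: `naive_split_sentences` is a hypothetical helper that simply breaks after `.`, `!` or `?` when the next word starts with an uppercase letter, so abbreviations followed by a capitalised word would be split incorrectly.

```python
import re

# A boundary is whitespace that follows ., ! or ? and precedes an uppercase letter.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def naive_split_sentences(text):
    """Split text into sentences with a simple punctuation heuristic."""
    return [part.strip() for part in _SENTENCE_BOUNDARY.split(text) if part.strip()]


abstract = (
    "(--)-alpha-Bisabolol has a primary antipeptic action depending on dosage, "
    "which is not caused by an alteration of the pH-value. "
    "The proteolytic activity of pepsin is reduced by 50 percent through "
    "addition of bisabolol in the ratio of 1/0.5. "
    "The antipeptic action of bisabolol only occurs in case of direct contact. "
    "In case of a previous contact with the substrate, the inhibiting effect is lost."
)

sentences = naive_split_sentences(abstract)
for sentence in sentences:
    print(sentence)
```

The period inside `1/0.5` is left alone because it is not followed by whitespace, while the period that ends that sentence does trigger a split.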


In [9]:
for i in range(5):
    print(sentence_list[i])
    print("***************************************************\n")

(--)-alpha-Bisabolol has a primary antipeptic action depending on dosage, which is not caused by an alteration of the pH-value.
***************************************************

The proteolytic activity of pepsin is reduced by 50 percent through addition of bisabolol in the ratio of 1/0.5.
***************************************************

The antipeptic action of bisabolol only occurs in case of direct contact.
***************************************************

In case of a previous contact with the substrate, the inhibiting effect is lost.
***************************************************

A report is given on the recent discovery of outstanding immunological properties in BA 1 [N-(2-cyanoethylene)-urea] having a (low) molecular mass M = 111.104.
***************************************************



## Gensim expects each sentence or document as a list of words
Gensim works with sentences or documents not as strings, but as lists of words, or tokens. So we need to convert each sentence and each abstract into a list of tokens. We can use the function `tokenize_by_word` for this; see the documentation at https://radimrehurek.com/gensim/summarization/textcleaner.html.

In [10]:
# One more step. Word2Vec expects a list of texts, where each text is a list of tokens, or words.
abstract_list_tokenized = []
for abstract in abstract_list:
    tokens = list(tokenize_by_word(abstract))
    abstract_list_tokenized.append(tokens)

In [11]:
sentence_list_tokenized = []
for sentence in sentence_list:
    tokens = list(tokenize_by_word(sentence))
    sentence_list_tokenized.append(tokens)

In [12]:
abstract_list[0]

'(--)-alpha-Bisabolol has a primary antipeptic action depending on dosage, which is not caused by an alteration of the pH-value. The proteolytic activity of pepsin is reduced by 50 percent through addition of bisabolol in the ratio of 1/0.5. The antipeptic action of bisabolol only occurs in case of direct contact. In case of a previous contact with the substrate, the inhibiting effect is lost.'

In [13]:
abstract_list_tokenized[0]

['alpha',
 'bisabolol',
 'has',
 'a',
 'primary',
 'antipeptic',
 'action',
 'depending',
 'on',
 'dosage',
 'which',
 'is',
 'not',
 'caused',
 'by',
 'an',
 'alteration',
 'of',
 'the',
 'ph',
 'value',
 'the',
 'proteolytic',
 'activity',
 'of',
 'pepsin',
 'is',
 'reduced',
 'by',
 'percent',
 'through',
 'addition',
 'of',
 'bisabolol',
 'in',
 'the',
 'ratio',
 'of',
 'the',
 'antipeptic',
 'action',
 'of',
 'bisabolol',
 'only',
 'occurs',
 'in',
 'case',
 'of',
 'direct',
 'contact',
 'in',
 'case',
 'of',
 'a',
 'previous',
 'contact',
 'with',
 'the',
 'substrate',
 'the',
 'inhibiting',
 'effect',
 'is',
 'lost']
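Notice what the tokenizer did in the output above: everything is lowercased, punctuation is removed, hyphenated terms such as `pH-value` are split, and numbers such as `50` and `1/0.5` disappear. The sketch below approximates that behaviour with a regular expression. `simple_tokenize` is a hypothetical stand-in, not gensim's implementation, and may differ from `tokenize_by_word` on other inputs (accented characters, for instance).

```python
import re

# Runs of letters only: word characters minus digits and underscores.
_WORD = re.compile(r"[^\W\d_]+")


def simple_tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return _WORD.findall(text.lower())


sentence = (
    "(--)-alpha-Bisabolol has a primary antipeptic action depending on dosage, "
    "which is not caused by an alteration of the pH-value."
)
tokens = simple_tokenize(sentence)
print(tokens)
```

Dropping numbers is a real loss of information for biomedical text (dosages, ratios), so it is worth keeping in mind when interpreting the embeddings later.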

# Training word embeddings
Using the function `Word2Vec` from gensim, we can now train word embeddings. The documentation is at https://radimrehurek.com/gensim/models/word2vec.html.

In [14]:
model_abstract = Word2Vec(
    sentences=abstract_list_tokenized,  # corpus we're training on (one abstract per text)
    size=100,  # dimension of the word embeddings (`vector_size` in gensim 4+)
    window=5,  # max distance between the current and predicted word within a text
    min_count=1,  # words must occur at least min_count times to be learned
    workers=6,  # number of threads used to train the model
    iter=5,  # number of passes over the data (`epochs` in gensim 4+)
)
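The `window` argument decides which neighbouring words count as context for a given word. The sketch below lists the (word, context word) pairs that a window produces for a short token list. It is only an illustration of the idea: `context_pairs` is a hypothetical helper, and gensim's real training loop differs in its details, such as how pairs are sampled and combined.

```python
def context_pairs(tokens, window):
    """Return (center, context) pairs for all tokens within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = ["the", "antipeptic", "action", "of", "bisabolol"]
pairs = context_pairs(tokens, window=2)

# Context words seen for "action" with a window of 2.
action_context = [context for center, context in pairs if center == "action"]
print(action_context)
```

This also shows why training on sentences and training on whole abstracts can give different embeddings: with abstracts, a window can reach across a sentence boundary.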

In [15]:
model_sentence = Word2Vec(
    sentences=sentence_list_tokenized,  # corpus we're training on (one sentence per text)
    size=100,  # dimension of the word embeddings (`vector_size` in gensim 4+)
    window=5,  # max distance between the current and predicted word within a text
    min_count=5,  # words must occur at least min_count times to be learned
    workers=6,  # number of threads used to train the model
    iter=5,  # number of passes over the data (`epochs` in gensim 4+)
)
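The two models above use different `min_count` values (1 and 5). A rough way to see the effect is to count tokens and keep only those that occur often enough, as in the sketch below. `build_vocabulary` and `toy_corpus` are made up for illustration and are not how gensim stores its vocabulary.

```python
from collections import Counter


def build_vocabulary(tokenized_texts, min_count):
    """Return the set of tokens that occur at least `min_count` times."""
    counts = Counter(token for text in tokenized_texts for token in text)
    return {token for token, count in counts.items() if count >= min_count}


# Hypothetical toy corpus: a list of texts, each a list of tokens.
toy_corpus = [
    ["bisabolol", "reduces", "pepsin", "activity"],
    ["pepsin", "activity", "is", "reduced"],
    ["activity", "of", "pepsin"],
]

vocab_all = build_vocabulary(toy_corpus, min_count=1)
vocab_frequent = build_vocabulary(toy_corpus, min_count=2)
print(len(vocab_all), "words with min_count=1")
print(sorted(vocab_frequent), "with min_count=2")
```

With `min_count=1` every token gets a vector, including words seen only once, whose embeddings are learned from very little evidence. A higher threshold gives a smaller vocabulary, and words below it cannot be looked up afterwards.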

## Explore trained word embeddings
Now we can explore the word embeddings. Take a look at them. How many are there? How big are they? Do the nearest neighbors make sense?

In [16]:
embeddings = model_abstract.wv

In [17]:
embeddings.vectors.shape

(49667, 100)

In [18]:
embeddings.most_similar("dosage")

[('daily', 0.7900731563568115),
 ('lorazepam', 0.7866021990776062),
 ('doses', 0.7565786838531494),
 ('dosages', 0.7515625953674316),
 ('flunitrazepam', 0.7309408187866211),
 ('clobazam', 0.7291922569274902),
 ('dose', 0.7207592129707336),
 ('medication', 0.7198101282119751),
 ('diazepam', 0.7167715430259705),
 ('night', 0.7145477533340454)]
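The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below reimplements the idea on plain Python lists so it runs without any trained model. The three-dimensional `toy_vectors` are invented for illustration, and the local `most_similar` function is a simplified stand-in, not gensim's code.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=3):
    """Rank all other words by cosine similarity to `word`."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vector))
        for other, vector in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical 3-dimensional embeddings (real ones here have 100 dimensions).
toy_vectors = {
    "dose": [1.0, 0.9, 0.0],
    "dosage": [0.9, 1.0, 0.1],
    "mouse": [0.0, 0.1, 1.0],
    "hamster": [0.1, 0.0, 0.9],
}

neighbours = most_similar("dosage", toy_vectors)
print(neighbours[0])
```

Cosine similarity depends only on the direction of the vectors, not their length, so a score close to 1 means two words appear in similar contexts, which is not the same as being synonyms (see `lower` and `higher` below).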

In [19]:
embeddings.most_similar("lower")

[('higher', 0.9551340937614441),
 ('larger', 0.8494495153427124),
 ('smaller', 0.8398783802986145),
 ('greater', 0.8242799043655396),
 ('slower', 0.8149850368499756),
 ('faster', 0.7730324268341064),
 ('less', 0.7588409185409546),
 ('weaker', 0.7478044033050537),
 ('shorter', 0.7442339658737183),
 ('olds', 0.7146769762039185)]

In [20]:
embeddings.most_similar("mouse")

[('chick', 0.8861088752746582),
 ('embryo', 0.885520339012146),
 ('chicken', 0.8504881858825684),
 ('spleen', 0.8464064598083496),
 ('embryonic', 0.8459387421607971),
 ('thymus', 0.8300246596336365),
 ('fibroblasts', 0.8206666111946106),
 ('hamster', 0.8181971311569214),
 ('transplantable', 0.812572181224823),
 ('ascites', 0.8125156164169312)]

In [21]:
embeddings.most_similar("doctor")

[('planning', 0.9345557689666748),
 ('nurses', 0.930653989315033),
 ('professional', 0.9247720241546631),
 ('faculty', 0.9179189801216125),
 ('dreams', 0.91535484790802),
 ('item', 0.9147955775260925),
 ('instruction', 0.9137358665466309),
 ('ambulatory', 0.9137187600135803),
 ('chiefly', 0.9132506847381592),
 ('cardiosurgical', 0.9129608869552612)]

In [22]:
embeddings.most_similar("patient")

[('child', 0.8291334509849548),
 ('children', 0.8103070259094238),
 ('patients', 0.8045302033424377),
 ('syndrome', 0.8039336204528809),
 ('illness', 0.7976101040840149),
 ('complication', 0.7855963706970215),
 ('disease', 0.7851829528808594),
 ('symptoms', 0.7780343890190125),
 ('history', 0.76768958568573),
 ('diagnosis', 0.7598612308502197)]

In [23]:
embeddings.most_similar("man")

[('humans', 0.8267447352409363),
 ('born', 0.7447031736373901),
 ('animal', 0.7442922592163086),
 ('dog', 0.7438254356384277),
 ('selfmedication', 0.731346845626831),
 ('developing', 0.7291499972343445),
 ('persons', 0.7241732478141785),
 ('adults', 0.7154818773269653),
 ('rhesus', 0.712918221950531),
 ('fetus', 0.7105545997619629)]

In [24]:
embeddings.most_similar("woman")

[('girl', 0.9408396482467651),
 ('child', 0.9113438725471497),
 ('boy', 0.9050248861312866),
 ('undescended', 0.8856906294822693),
 ('febrile', 0.868108868598938),
 ('fever', 0.8678246140480042),
 ('aml', 0.8620784878730774),
 ('recurrent', 0.8598812818527222),
 ('blunt', 0.8578423261642456),
 ('colitis', 0.8576388359069824)]

In [25]:
embeddings.most_similar("dna")

[('rna', 0.9252740144729614),
 ('collagen', 0.7753475904464722),
 ('protein', 0.7356704473495483),
 ('phage', 0.7323349714279175),
 ('chromatin', 0.7231100797653198),
 ('peptidoglycan', 0.7049508094787598),
 ('polymerase', 0.6963043212890625),
 ('particles', 0.6945540904998779),
 ('denatured', 0.6872721314430237),
 ('peroxidase', 0.6838382482528687)]

# Representing text with word embeddings
As we saw in the presentation, you can represent text as a bag of words with lists of counts or indices, but we can also represent text using the embeddings we just created.
How might we represent the sentence "The antipeptic action of bisabolol only occurs in case of direct contact"?
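For comparison, here is a minimal sketch of the bag-of-words representation mentioned above, using only the standard library. Each text becomes a vector of counts with one position per vocabulary word. The helper names `build_index` and `bag_of_words` are hypothetical, and the two example texts are lowercased sentences from the first abstract.

```python
from collections import Counter


def build_index(tokenized_texts):
    """Map every distinct token to a vector position, in first-seen order."""
    index = {}
    for tokens in tokenized_texts:
        for token in tokens:
            index.setdefault(token, len(index))
    return index


def bag_of_words(tokens, index):
    """Return a count vector; tokens missing from the index are ignored."""
    vector = [0] * len(index)
    for token, count in Counter(tokens).items():
        if token in index:
            vector[index[token]] = count
    return vector


texts = [
    "the antipeptic action of bisabolol only occurs in case of direct contact".split(),
    "in case of a previous contact with the substrate the inhibiting effect is lost".split(),
]
index = build_index(texts)
vectors = [bag_of_words(tokens, index) for tokens in texts]
print(vectors[0])
```

The count vector is as long as the vocabulary and mostly zeros, and it treats every word as unrelated to every other word. The embedding-based representation that follows is short and dense instead.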

In [33]:
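text_to_rep = "The antipeptic action of bisabolol only occurs in case of direct contact"
tokenized_text_rep = tokenize_by_word(text_to_rep)
# Sum the embeddings of all tokens to get a single vector for the sentence.
token_rep = np.zeros(100)  # must match the embedding dimension (size=100)
for token in tokenized_text_rep:
    # get_vector raises a KeyError for words that are not in the vocabulary
    token_rep += embeddings.get_vector(token)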
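Summing works, but the result grows with the length of the sentence, and the loop fails on any word the model has never seen. One common variation, sketched below under those assumptions, is to average the vectors and skip unknown words. `average_embedding` and the two-dimensional `toy_vectors` are hypothetical; with the trained model you could test membership with `token in embeddings` before calling `get_vector`.

```python
def average_embedding(tokens, vectors, dim):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    total = [0.0] * dim
    n_found = 0
    for token in tokens:
        vector = vectors.get(token)
        if vector is None:
            continue  # out-of-vocabulary word: ignore it
        total = [t + v for t, v in zip(total, vector)]
        n_found += 1
    if n_found == 0:
        return total  # all-zero vector when nothing was recognised
    return [t / n_found for t in total]


# Hypothetical 2-dimensional embeddings for illustration only.
toy_vectors = {
    "antipeptic": [1.0, 0.0],
    "action": [0.0, 1.0],
    "contact": [1.0, 1.0],
}

rep = average_embedding(
    ["antipeptic", "action", "unseenword", "contact"], toy_vectors, dim=2
)
print(rep)
```

Whether to sum or average is a design choice: for cosine similarity the two give the same direction when all tokens are known, but averaging keeps the magnitudes comparable across sentences of different lengths.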

In [34]:
token_rep

array([ -6.3173846 ,  -0.48974925,   5.27286469, -12.50651808,
        11.54413133,  -0.23772389,  -1.22886669,   1.89218713,
         7.49742263,   0.6647285 ,  -2.82737012,  -2.60073211,
        -4.02593215,   0.30432927,  -2.42086267,  -2.2361285 ,
        -5.62169872,   1.54503212, -11.69214445,  -0.79232608,
        -0.33236859,   3.1782889 ,  -6.59507738, -10.04997568,
         6.55995556,  -5.80688068,   6.09653857,   3.5436502 ,
        -8.05478445,  -2.28972913,   0.91302256,  -6.59270721,
        -2.08980044,  -3.78746317,  -0.29013413,  -1.0528254 ,
         1.89586251,   3.38953174,  11.10219219,  -3.70852464,
         3.25008322,   3.16323836,  -1.11842857,  -4.82449154,
         4.96149293,   8.45263036,   3.03171396,  -7.14921184,
        -3.80331249,  -3.12980413,   2.81587895,  -1.31853396,
         4.72511805,   5.29715664,   0.93269341,  -2.12010075,
         4.05143256,   4.38863342,  -1.82183438,   8.66315102,
         0.86496164,   6.87171341,  11.20077016,   1.83