# Learning Goal
Understand why natural language processing and text representation are important, the different ways to represent text, and how to implement a few simple text representations.

# Install Libraries
We'll use the gensim library to learn word embeddings. The commented-out lines below show how to install gensim with conda and pip, respectively.

In [None]:
# STUDENT TEST RUN
from gensim.models import Word2Vec
from gensim.summarization.textcleaner import split_sentences, tokenize_by_word
# gensim.summarization was removed in gensim 4.0, so install a 3.x release:
# conda install -c anaconda "gensim<4.0"
# pip install "gensim<4.0"

# Testing that our code works
This code reads the raw PubMed .xml file and extracts abstracts into a more readable text file, one abstract per line.

In [None]:
# STUDENT TEST RUN

# Process the original PubMed download. This is just so you can see how it's done; we won't work with the XML file directly.
n_abs = 0
with open("data/pubmed_sample_test.txt", "w") as outfile:
    with open("data/pubmed20n0001.xml", "r") as pubmed_file:
        for line in pubmed_file:
            if "<AbstractText>" in line:
                line = line.strip()  # remove leading and trailing whitespace
                # These tags mark where an abstract appears in the XML file.
                line = line.replace("<AbstractText>", "").replace("</AbstractText>", "")
                outfile.write(line + "\n")  # write the abstract text to the output file
                n_abs += 1
print(n_abs, "abstracts processed")
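
The line-by-line approach above only matches abstracts whose opening tag is exactly `<AbstractText>` and which sit on a single line. PubMed records with structured abstracts carry attributes on the tag (for example `Label="BACKGROUND"`), which that check skips. As an alternative, here is a minimal sketch using the standard library's XML parser. The XML string is a made-up stand-in shaped like a PubMed export, not real data, and `demo_xml` and `demo_abstracts` are illustrative names.

```python
import xml.etree.ElementTree as ET

# A tiny made-up document shaped like a PubMed export (not real records).
demo_xml = """<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><Article><Abstract>
    <AbstractText>Plain abstract text.</AbstractText>
  </Abstract></Article></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><Article><Abstract>
    <AbstractText Label="BACKGROUND">Labelled section with <i>inline</i> markup.</AbstractText>
  </Abstract></Article></MedlineCitation></PubmedArticle>
</PubmedArticleSet>"""

root = ET.fromstring(demo_xml)

demo_abstracts = []
for element in root.iter("AbstractText"):
    # itertext() also collects text nested inside inline tags such as <i>.
    text = "".join(element.itertext()).strip()
    if text:
        demo_abstracts.append(text)

for abstract in demo_abstracts:
    print(abstract)
```

For a file as large as a full PubMed download, `ET.iterparse` can stream elements instead of loading the whole tree into memory.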

# Read in data from file
The data is a text file of abstracts, one per line. We'll read it into a list with one abstract per element.
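
One possible way to do this, as a sketch rather than the required solution: read the file line by line, strip whitespace, and skip empty lines. The helper name `read_abstracts` and the temporary demo file are assumptions made so the example runs on its own; in the notebook you would pass the path of your abstracts file instead.

```python
import os
import tempfile


def read_abstracts(path):
    """Return one abstract per non-empty line of the file at `path`."""
    with open(path, "r", encoding="utf-8") as infile:
        return [line.strip() for line in infile if line.strip()]


# Demonstrate on a small temporary file standing in for the real data file.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_abstracts.txt")
    with open(demo_path, "w", encoding="utf-8") as outfile:
        outfile.write("First abstract. It has two sentences.\n\nSecond abstract.\n")
    demo_abstracts = read_abstracts(demo_path)

print(len(demo_abstracts), "abstracts read")
```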

In [None]:
abstract_list = []


In [None]:
for i in range(5):
    print(abstract_list[i])
    print("***************************************************\n\n\n")

# Process our data
The next step is to split our abstracts into sentences. Word2vec can learn the context around words from either individual sentences or entire documents (abstracts); which to use is a design choice that is up to you. In the next cell, split the abstracts into sentences and store them in a list with one sentence per element. We'll use the function `split_sentences`, documented at https://radimrehurek.com/gensim/summarization/textcleaner.html.
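
A minimal sketch of the splitting step, assuming a gensim 3.x install where `split_sentences` is importable. Because the summarization module was removed in gensim 4.0, the sketch falls back to a rough regular-expression splitter when the import fails; that fallback is a stand-in written for this example, not gensim's algorithm. The two demo abstracts are made up.

```python
import re

try:
    # Available in gensim 3.x only; the summarization module was removed in 4.0.
    from gensim.summarization.textcleaner import split_sentences
except ImportError:
    def split_sentences(text):
        """Rough stand-in: split after '.', '!' or '?' followed by whitespace."""
        return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


# Made-up abstracts used only for illustration.
demo_abstracts = [
    "Aspirin reduces fever. It also relieves mild pain.",
    "Ibuprofen is an anti-inflammatory drug.",
]

# Flatten: one sentence per list element, across all abstracts.
demo_sentences = []
for abstract in demo_abstracts:
    demo_sentences.extend(split_sentences(abstract))

for sentence in demo_sentences:
    print(sentence)
```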


In [None]:
sentence_list = []


In [None]:
for i in range(5):
    print(sentence_list[i])
    print("***************************************************\n")

## Gensim expects each sentence or document as a list of words
Gensim works with sentences and documents not as strings but as lists of words (tokens), so we need to convert each sentence and each abstract into a list of tokens. We can use the function `tokenize_by_word` for this; it returns a generator, so wrap the result in `list()`. See the documentation (https://radimrehurek.com/gensim/summarization/textcleaner.html).
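
A sketch of the tokenization step under the same assumption as before: `tokenize_by_word` comes from gensim 3.x, and the fallback below is a simple stand-in (lowercased runs of letters) defined only so the example runs without that module. The demo sentences are made up.

```python
import re

try:
    # Available in gensim 3.x only; the summarization module was removed in 4.0.
    from gensim.summarization.textcleaner import tokenize_by_word
except ImportError:
    def tokenize_by_word(text):
        """Rough stand-in: yield lowercased runs of letters."""
        return (m.group() for m in re.finditer(r"[^\W\d_]+", text.lower()))


# Made-up sentences used only for illustration.
demo_sentences = ["Aspirin reduces fever.", "It also relieves mild pain."]

# The tokenizer yields tokens lazily, so materialize each result as a list.
demo_tokenized = [list(tokenize_by_word(sentence)) for sentence in demo_sentences]

print(demo_tokenized)
```

The same list comprehension applies to abstracts: tokenizing a whole abstract gives one long token list per document instead of one per sentence.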

In [None]:
# One more step. Word2Vec expects a list of texts, where each text is a list of tokens (words).
abstract_list_tokenized = []

In [None]:
sentence_list_tokenized = []

In [None]:
abstract_list[0]

In [None]:
abstract_list_tokenized[0]

# Training word embeddings
Using the `Word2Vec` class from gensim, we can now train word embeddings. See the documentation at https://radimrehurek.com/gensim/models/word2vec.html.

In [None]:
model_abstract = Word2Vec(

                )

In [None]:
model_sentence = Word2Vec(

                )

## Explore trained word embeddings
Now we can explore the word embeddings. Take a look at them. How many are there? How big are they? Do they make sense?
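
One common way to check whether embeddings make sense is to look at a word's nearest neighbours by cosine similarity, which is what gensim's `model.wv.most_similar(word)` ranks by. The sketch below computes that ranking by hand on made-up three-dimensional toy vectors so the idea is visible without a trained model; `toy_vectors`, `cosine_similarity`, and `nearest` are illustrative names, and the numbers are not real embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up toy vectors, chosen by hand for illustration only.
toy_vectors = {
    "aspirin": [0.9, 0.1, 0.0],
    "ibuprofen": [0.8, 0.2, 0.1],
    "kidney": [0.0, 0.1, 0.9],
}


def nearest(word, vectors):
    """Return (similarity, other_word) pairs, most similar first."""
    scored = [
        (cosine_similarity(vectors[word], vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, reverse=True)


for similarity, other in nearest("aspirin", toy_vectors):
    print(other, round(similarity, 3))
```

With a trained model, `model_abstract.wv["some_word"]` returns the vector for a word in the vocabulary, and `most_similar` performs the ranking for you.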

In [None]:
embeddings = model_abstract.wv

In [None]:
embeddings.vectors.shape

In [None]:
# How else can we explore word embeddings?