# UTILISING UNSTRUCTURED DATA IN GEOSCIENCE
## DAY ONE: WORD EMBEDDINGS

This tutorial is split into three parts:
1. Investigating pre-trained embeddings: finding the most similar words and computing word-vector maths
2. Creating our own embeddings
3. Comparing our embeddings with the pre-trained embeddings
    
Prior to beginning, please ensure you have downloaded the data and installed the python packages detailed in the environment.yaml file.

## TODAY'S CHALLENGE

The challenge for day one is: 

   **Create a set of geoscientific word embeddings and identify the most similar term to each of the given terms. Similarly, calculate the nearest term to each vector maths problem.**
   
Terms for the most similar:
- salt
- ghost
- gather
- elastic

Vector calculations to compute:
- P-wave - compressional + shear
- seal - mudstone + sandstone
- PSTM - time + depth
- Kirchoff - ray + wavefield
    
Please submit all results via https://forms.gle/RPjt4af7smMToq4z8 by 11:59pm GMT on 15 June 2021. 



### Notebook set-up
Importing all the packages that we will need for this tutorial 

In [87]:
import os

import gensim
import nltk
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/cebirnie/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

The imported packages include:

- **gensim** :  https://pypi.org/project/gensim/

    This is the primary package we will be using for generating and using the word embeddings.
    
    
- **nltk** : https://www.nltk.org/

    This is the package we will use for processing of the text data prior to its conversion to word embeddings. The 'punkt' extension is to provide sentence tokenisation. (Tokenisation will be discussed below.)
    
    
- **pandas** : https://pandas.pydata.org/

    We will use this package to aid our comparison between the pre-trained and custom-made embeddings, as well as to load in our corpus.

# PART 1: Getting to grips with pre-trained word embeddings

In this section we are going to use pre-trained word embeddings, look at word similarity and perform some vector maths.

The word vectors have been downloaded from: http://vectors.nlpl.eu/repository/ a resource by the University of Oslo where you will find many more pretrained word embeddings.

The downloaded embeddings are loaded using the Gensim Python package. 

**Load the pre-trained vectors from file. I assume they have been placed in a folder called `data` that is one level up.**

In [51]:
# Data directory
data_dir = "../data/"

# Load vectors 
wikiemb_path = os.path.join(data_dir, "wiki_w2v.bin")
wiki_vecs = gensim.models.KeyedVectors.load_word2vec_format(wikiemb_path, binary=True)

## Word Similarity

Using cosine similarity we can measure how close two word vectors are, and so find the words most similar to a given word.
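To make the similarity measure concrete, here is a minimal sketch of cosine similarity using numpy. The three vectors are made-up toy values rather than real embeddings, and the helper name `cosine_similarity` is our own. Cosine distance is simply one minus this value.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional "embeddings" (made-up values, not taken from any model)
a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]   # same direction as a, different length
c = [0.0, 0.0, 3.0]   # orthogonal to a

print(cosine_similarity(a, b))  # close to 1: the vectors point the same way
print(cosine_similarity(a, c))  # close to 0: the vectors are unrelated
```

Note that the length of a vector does not matter here, only its direction, which is why `a` and `b` count as maximally similar.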

We are going to use the function: https://tedboy.github.io/nlps/generated/generated/gensim.models.Word2Vec.most_similar.html

In [90]:
# Check out the documentation of the function
wiki_vecs.most_similar?

**Example: How to find the 10 most similar words to `happy`**

In [96]:
wiki_vecs.most_similar(positive=['happy'])

[('Dataset', 0.08874545991420746),
 ('pipelines', 0.08865741640329361),
 ('Examination', 0.0821446105837822),
 ('Encana', 0.07870388776063919),
 ('NUC', 0.077957883477211),
 ('PDR', 0.07451966404914856),
 ('Datum', 0.0742947906255722),
 ('Quadro', 0.07304269820451736),
 ('Seismic', 0.07278499007225037),
 ('NMC', 0.07264432311058044)]

**To do:**
1. Find the 10 most similar words to `education`
2. Find the most similar word to `science`
3. Advanced: Find the least similar word to `happy`

## Basic word vector maths

Just as we found the most similar words to a single word, we can combine word vectors and find the word most similar to the result, because every word is represented as a vector. Such vector additions and subtractions are the equivalent of analogies in the language domain.

Again, here we shall be using word2vec's `most_similar` function, but this time instead of passing a single word as either positive or negative, we will pass in a list of words.

In [97]:
# Looking back at the documentation notice that the input to 'positive' and 'negative' are lists
wiki_vecs.most_similar?

**Example: What is the female equivalent of a king?** 

In [4]:
print (wiki_vecs.most_similar(positive=["king", "woman"], negative = ["man"])[0])

('queen', 0.7168769240379333)


Explanation: To understand which vectors should be summed versus subtracted consider what the analogy is. In this scenario, we start with the positive king vector and we wish to remove the male vector and add the female. As such the equation would be: **king-man+woman**
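The arithmetic behind **king-man+woman** can be sketched with hand-made vectors. Everything below is illustrative: the 2-D vectors, the extra word `prince` and the helper `nearest` are assumptions invented for this example, and gensim's own implementation differs in detail (for instance, it normalises the vectors before combining them).

```python
import numpy as np

# Made-up 2-D vectors: axis 0 ~ "royalty", axis 1 ~ "gender" (illustrative only)
toy_vecs = {
    "king":   np.array([1.0,  1.0]),
    "queen":  np.array([1.0, -1.0]),
    "man":    np.array([0.0,  1.0]),
    "woman":  np.array([0.0, -1.0]),
    "prince": np.array([0.8,  1.0]),
}

def nearest(target, vecs, exclude=()):
    """Return the word whose vector has the highest cosine similarity to target."""
    best_word, best_score = None, -np.inf
    for word, vec in vecs.items():
        if word in exclude:
            continue  # skip the words that were used to build the query
        score = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        if score > best_score:
            best_word, best_score = word, score
    return best_word, float(best_score)

# king - man + woman
target = toy_vecs["king"] - toy_vecs["man"] + toy_vecs["woman"]
answer, score = nearest(target, toy_vecs, exclude={"king", "man", "woman"})
print(answer, score)
```

The query words are excluded from the candidates, otherwise one of the inputs would often be returned as its own nearest neighbour.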

**Example 2: What is the capital of England?**

In [5]:
print (wiki_vecs.most_similar(positive=["Oslo", "England"], negative = ["Norway"])[0])

('London', 0.6139545440673828)


Explanation: In this example we were not told which analogy to use, so we had to understand that the question is asking about a country and its capital city and use this to make our own analogy. In this case we used the equation: **Oslo-Norway+England**

**To do:**
1. Repeat example 2 with your own analogy to find the capital of Scotland
2. Write your own analogy and equation to compute the past tense of run
3. Write your own analogy and equation to determine the colour of the sky
4. Advanced: We do not always need 3 components to the equation. Answer the following: What is a king if he is not a royal?

# PART 2: Creating our own word embeddings

In this section we are going to generate our own word embeddings from geoscientific texts. 

To do so we will need to: 
1. read in our corpus (geoscientific text),
2. perform any necessary processing of the corpus,
3. compute the word vectors


## 2.1 Data loading
Iraya Energies has very kindly provided the corpus for this summer school. The corpus is composed of summaries of geoscience conference abstracts and journal papers. 

If using the flat file, I assume that this is in the same location as the wiki embeddings.

In [18]:
geodata = pd.read_json(os.path.join(data_dir, "document_info.json"))
geodata.head()

| | classification | page_num | par_num | is_cli | file | doc_text | remarks |
|---|---|---|---|---|---|---|---|
| 0 | text | 1 | 1 | False | d4018029-1b2f-45ca-a9d1-ebacfa337567 | Yogyakarta is one of most-populated provinces ... | {'TITLE': 'Determining Groundwater Recharge Po... |
| 1 | text | 1 | 1 | False | f90fb8fe-225a-404c-b0d2-78c9e6586aa2 | The magnetotelluric (MT) 1D inversion modeling... | {'TITLE': 'Magnetotelluric 1D Inversion Using ... |
| 2 | text | 1 | 1 | False | 215e638f-328d-4c01-a3b2-971dbea28321 | In direct current (DC) sounding (VES) data, in... | {'TITLE': 'Comparison of Particle Swarm Optimi... |
| 3 | text | 1 | 1 | False | 14b0e724-a9f9-4258-b908-8a91912f1978 | In order to perform seismic inversion and have... | {'TITLE': 'Quantification of errors in well-tr... |
| 4 | text | 1 | 1 | False | 36d262a1-4bce-4571-8e85-9b090f7b019f | Seismic wave energy attenuation and velocity d... | {'TITLE': 'Reflectivity Dispersion for Gas Det... |


In this tutorial we are only interested in the summaries of the documents so let us just extract out that information from the dataframe.

In [26]:
geotexts = geodata['doc_text'].values.tolist()

We now have a list of summaries. First, let's check how many summaries we have:

In [27]:
len(geotexts)

1047

And let's look at the first few summaries:

In [29]:
geotexts[:3]

['Yogyakarta is one of most-populated provinces in Indonesia having a high groundwater utilization. Most of water usages are still derived from groundwater resource. This condition is exacerbated by urbanization process which has a big deal on decreasing recharge area of groundwater due to land use change. To understand the water balance and the vulnerability of groundwater in Yogyakarta, the recharge area of groundwater need to be analyzed. The quantifying of recharge potential zone in Yogyakarta, was conducted by the integration of all factors influencing the hydrogeological process, those are lithology, land cover/land use, lineament and drainage frequency density, and geomorphology. The data were gained from satellite images (DEM and Landsat 8) and other exogenetic data (geomorphologic and geologic map). A GIS approach was used to integrate each influencing factor which has its own degree of effect. The groundwater recharge potential zone in Yogyakarta is well estimated using this 

## 2.2 Pre-processing the text

The first thing we must do is convert the strings of words into lists of tokens. A token is roughly a word, although the tokeniser also splits off punctuation marks as separate tokens (as the output below shows). We will use the nltk package to do this. After doing so, we will quickly analyse our corpus and look at different preprocessing options.
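To see why a dedicated tokeniser is preferable to splitting on spaces, here is a small sketch that uses only Python's standard library. The regular expression is a simplified stand-in of our own, not what nltk does internally, and the sentence is a shortened version of the first sentence in the corpus.

```python
import re

# Shortened version of the first corpus sentence (for illustration)
sentence = "Yogyakarta is one of most-populated provinces in Indonesia."

# Naive approach: split on whitespace, punctuation stays glued to the word
whitespace_tokens = sentence.split()

# Simplified tokeniser: words (optionally hyphenated) or single punctuation marks
regex_tokens = re.findall(r"\w+(?:-\w+)*|[^\w\s]", sentence)

print(whitespace_tokens[-1])   # 'Indonesia.'
print(regex_tokens[-2:])       # ['Indonesia', '.']
```

With whitespace splitting, `Indonesia.` and `Indonesia` would be treated as two different tokens and would therefore get two different embeddings.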

In [30]:
# Join lines together so it becomes one long line
text = " ".join(geotexts)

# Separate out the sentences 
sentences = nltk.sent_tokenize(text)

# Separate out each word within each sentence
tokenised_sents = [nltk.word_tokenize(sent) for sent in sentences]

In [99]:
# Let us look at our first sentence, now that it has been tokenised
tokenised_sents[0]

['Yogyakarta',
 'is',
 'one',
 'of',
 'most-populated',
 'provinces',
 'in',
 'Indonesia',
 'having',
 'a',
 'high',
 'groundwater',
 'utilization',
 '.']

**Example: how many tokens do we have in total?**

In [31]:
total_tokens = [t for sent in tokenised_sents for t in sent]

print ('Total number of tokens: %i'%len(total_tokens))

Total number of tokens: 160321


**To do:**
1. How many unique tokens do we have?
2. Lowercase all the tokens
3. Advanced: stem each token (hint: check out https://www.nltk.org/howto/stem.html)

## 2.3 Compute the word embeddings

We are going to use gensim's modelling package: https://radimrehurek.com/gensim/models/word2vec.html

In the first example we will use the tokenised sentences with minimal preprocessing and follow the same modelling methodology as was used for the wiki embeddings: 
- Skipgram approach
- Vector size of 300
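For intuition on the skip-gram approach listed above: the model is trained to predict the context words that surround each centre word within a window. The sketch below only illustrates how such (centre, context) pairs could be formed. The function `skipgram_pairs` and the toy sentence are our own assumptions, and real implementations such as gensim add refinements (for example subsampling of frequent words) that are not shown here.

```python
def skipgram_pairs(tokens, window=2):
    """Build (centre, context) pairs for every token within `window` positions."""
    pairs = []
    for i, centre in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((centre, tokens[j]))
    return pairs

# Toy tokenised sentence (illustrative)
tokens = ["seismic", "wave", "energy", "attenuation"]
pairs = skipgram_pairs(tokens, window=1)
print(pairs)
```

A larger `window` produces more pairs per word and tends to capture broader topical context, whereas a small window focuses on immediate neighbours.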

In [103]:
# Let us look at the doc string for the function we will use to create the embeddings
gensim.models.Word2Vec?

**Example: Generating skipgram embeddings from our corpus**

In [102]:
# Skip-gram model
sg_geoscience = gensim.models.Word2Vec(tokenised_sents, sg=1, min_count=2, window=5, vector_size=300)
sg_geoscience.train(tokenised_sents, total_examples=len(tokenised_sents), epochs=250)

(27776237, 40080250)

In [112]:
%%timeit
# Skip-gram model
sg_geoscience = gensim.models.Word2Vec(tokenised_sents, sg=1, min_count=2, window=5, vector_size=300)
sg_geoscience.train(tokenised_sents, total_examples=len(tokenised_sents), epochs=250)

2min 54s ± 8.26 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
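Note that the `%%timeit` output above reports 7 runs, so the cell repeats the whole training seven times, and names assigned inside a `%%timeit` cell are typically not kept afterwards. When a single timed run is enough, a plain timer is an alternative. This is a minimal sketch using the standard library; the helper `time_once` is our own, and `sum` is only a stand-in workload for the training call.

```python
import time

def time_once(fn, *args, **kwargs):
    """Call fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload; in the notebook this would be the Word2Vec training call
result, seconds = time_once(sum, range(1_000_000))
print(f"took {seconds:.4f} s")
```

Because the result is returned alongside the timing, the trained model would remain available for the later cells.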


In [109]:
# Let's access the embeddings and see how many were created
len(sg_geoscience.wv.vectors)

7305
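The number of embeddings depends on `min_count`, because tokens that occur fewer than `min_count` times in the corpus are dropped from the vocabulary. The effect can be sketched with a simple frequency count. The toy sentences and the helper `vocab_for` are made up for illustration and do not reproduce gensim's internals.

```python
from collections import Counter

# Toy tokenised corpus (illustrative)
toy_sents = [
    ["salt", "dome", "seismic"],
    ["salt", "seismic", "migration"],
    ["seismic", "gather"],
]

# Count how often each token appears across all sentences
counts = Counter(tok for sent in toy_sents for tok in sent)

def vocab_for(counts, min_count):
    """Tokens that would survive a given min_count threshold."""
    return sorted(tok for tok, n in counts.items() if n >= min_count)

print(vocab_for(counts, 1))  # ['dome', 'gather', 'migration', 'salt', 'seismic']
print(vocab_for(counts, 2))  # ['salt', 'seismic']
```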

**To do:**
1. Vary the size of the vector, how does this influence training time?
2. Change the min_count, how does this influence the number of embeddings created?
3. Advanced: create a new set of embeddings from a CBOW methodology (hint: look at the modelling documentation for class gensim.models.word2vec.Word2Vec on https://radimrehurek.com/gensim/models/word2vec.html)

# PART 3: Comparing the embeddings

Here we use the analysis techniques from Part 1 and the embeddings we created in Part 2. However, this time our analysis focuses on a geoscientific use case, so all the terms and analogies will be geoscience-related.

**Example: We have created a little function to compare the most similar words to a specified word for both the pre-trained and custom embeddings. In this example we look at the most similar words to the term 'shale'.**

In [110]:
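def comparing_embeddings_similarity(word, g_emb, sg_emb):
    """Show the 5 most similar words to `word` in both sets of embeddings, side by side."""
    # g_emb: pre-trained KeyedVectors; sg_emb: our trained Word2Vec model (vectors live in .wv)
    g = pd.DataFrame(g_emb.most_similar(positive=[word])[:5], columns=["g_name", "g_score"])
    sg = pd.DataFrame(sg_emb.wv.most_similar(positive=[word])[:5], columns=["sg_name", "sg_score"])

    df = pd.concat([g, sg], axis=1)
    display(df)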

In [111]:
word = 'shale' 

comparing_embeddings_similarity(word, wiki_vecs, sg_geoscience)

| | g_name | g_score | sg_name | sg_score |
|---|---|---|---|---|
| 0 | shales | 0.783693 | Barnett | 0.380249 |
| 1 | Shale | 0.75016 | About | 0.367455 |
| 2 | oil-shale | 0.711005 | Umr | 0.367139 |
| 3 | siltstone | 0.69624 | roughly | 0.357243 |
| 4 | dolostone | 0.694625 | Fiqa | 0.35536 |
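The side-by-side layout of this comparison comes from `pd.concat` with `axis=1`, which places the two result frames next to each other and aligns them on their row index. Here is a minimal, self-contained sketch. The two lists stand in for `most_similar` output, with scores rounded from the results above.

```python
import pandas as pd

# Stand-ins for most_similar output: lists of (word, score) tuples
wiki_hits = [("shales", 0.78), ("siltstone", 0.70)]
geo_hits = [("Barnett", 0.38), ("roughly", 0.36)]

g = pd.DataFrame(wiki_hits, columns=["g_name", "g_score"])
sg = pd.DataFrame(geo_hits, columns=["sg_name", "sg_score"])

# axis=1 joins column-wise, so row i of each frame ends up on the same line
side_by_side = pd.concat([g, sg], axis=1)
print(side_by_side)
```

Each row therefore pairs the i-th ranked word from one model with the i-th ranked word from the other. The words on the same row are not otherwise related.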


Some terms may not be present in both corpora, in which case no embedding will exist for that term in one of the models. In this scenario, our function will not work:

In [72]:
word = 'Marmousi' 
comparing_embeddings_similarity(word, wiki_vecs, sg_geoscience)

KeyError: "Key 'Marmousi' not present"

In that case we can look only at the geoscience embeddings to see which terms are most similar, rather than comparing the two sets of embeddings.

In [74]:
word = 'Marmousi' 

pd.DataFrame(sg_geoscience.wv.most_similar(positive=[word]),columns=["sg_name","sg_score"])

| | sg_name | sg_score |
|---|---|---|
| 0 | SEG/EAGE | 0.575661 |
| 1 | implementations | 0.559229 |
| 2 | Marmousi‐2 | 0.491406 |
| 3 | Foothill | 0.485939 |
| 4 | non‐quadratic | 0.41716 |
| 5 | II | 0.413697 |
| 6 | Pre-Stack | 0.404282 |
| 7 | difference-based | 0.399245 |
| 8 | AzAVO | 0.39902 |
| 9 | Bouguer | 0.397072 |

**To do:**
1. What is the most similar term to wave?
2. What is the most similar term to migration?
3. Advanced: Incorporate the custom made CBoW embeddings from Part 2 into the function and rerun the similarity studies.
4. Advanced: Add an option into the function to return the least similar words

| | sg_name | sg_score |
|---|---|---|
| 0 | converted-wave | 0.439103 |
| 1 | single-sensor | 0.427688 |
| 2 | landstreamers | 0.427032 |
| 3 | S-wave | 0.420408 |
| 4 | C-wave | 0.41643 |
| 5 | SH-wave | 0.415499 |
| 6 | subtracted | 0.366667 |
| 7 | assurance | 0.362692 |
| 8 | sorting | 0.354691 |
| 9 | converted | 0.346103 |


**Example: Let us now consider the word vector maths. In this case let us consider the coal equivalent of a salt dome.**

In [56]:
print (wiki_vecs.most_similar(positive=["dome", "coal"],negative=["salt"]))
print (sg_geoscience.wv.most_similar(positive=["dome", "coal"],negative=["salt"]))

[('cupola', 0.5283207893371582), ('domes', 0.511414647102356), ('round-topped', 0.49815085530281067), ('roof', 0.4923069179058075), ('domed', 0.48560717701911926), ('smokestack', 0.48251205682754517), ('firebox', 0.4752655625343323), ('trainshed', 0.4744787812232971), ('chimneys', 0.47257155179977417), ('skylight', 0.46635544300079346)]
[('seam', 0.40692535042762756), ('inhomogeneities', 0.4021550416946411), ('organic-rich', 0.3903917074203491), ('Selar', 0.3770110607147217), ('Cornish', 0.3614294230937958), ('ash', 0.3448043167591095), ('dolerite', 0.33493053913116455), ('time-step', 0.33342188596725464), ('Ombuku', 0.33070141077041626), ('sides', 0.3279433250427246)]


**To do:**
1. Determine what is the shear equivalent of a P-wave
2. Write your own geoscience analogy (see https://www.earthdoc.org/docserver/fulltext/fb/38/7/fb2020051.pdf?expires=1619893516&id=id&accname=fromqa190&checksum=9E55711AF8CF1D67250F04B959D084CD for inspiration)
3. Advanced: Incorporate the custom made CBoW embeddings from Part 2 into the function and rerun the analogy studies.

# Final task

Vary all the hyperparameters of the embedding-creation pipeline (Part 2) and see how this changes the results in Part 3.