<a href="https://colab.research.google.com/github/antahiap/dsr-nlp/blob/main/notebooks/03_gensim_lsdyna.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using Gensim to create word embeddings for LSDYNA MAnual


In [1]:
# Install wget (if not already installed)
!apt-get -qq install wget

# Download a specific file from the GitHub repository
!wget "https://github.com/antahiap/dsr-nlp/tree/main/data/lsdyna_i_r13.txt"
!wget "https://github.com/antahiap/dsr-nlp/tree/main/data/lsdyna_ii_r13.txt"


--2023-08-09 09:29:37--  https://github.com/antahiap/dsr-nlp/tree/main/data/lsdyna_i_r13.txt
Resolving github.com (github.com)... 20.205.243.166
Connecting to github.com (github.com)|20.205.243.166|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://github.com/antahiap/dsr-nlp/blob/main/data/lsdyna_i_r13.txt [following]
--2023-08-09 09:29:38--  https://github.com/antahiap/dsr-nlp/blob/main/data/lsdyna_i_r13.txt
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 4589 (4.5K) [text/plain]
Saving to: ‘lsdyna_i_r13.txt’


2023-08-09 09:29:38 (65.0 MB/s) - ‘lsdyna_i_r13.txt’ saved [4589/4589]

--2023-08-09 09:29:38--  https://github.com/antahiap/dsr-nlp/tree/main/data/lsdyna_ii_r13.txt
Resolving github.com (github.com)... 20.205.243.166
Connecting to github.com (github.com)|20.205.243.166|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://github.co

In [2]:
!ls

data  lsdyna_ii_r13.txt  lsdyna_i_r13.txt  sample_data


## Import all necessary modules.

In [3]:
import os
import logging
import os
import multiprocessing
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
from nltk.tokenize import sent_tokenize
import nltk
nltk.download('punkt')
import numpy as np
import seaborn as sns
from scipy import spatial
import matplotlib.pyplot as plt

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


## Train Gensim.

Here we feed all the text data into Gensim to train Word2Vec.

- [Gensim homepage](https://radimrehurek.com/gensim/).
- [Wikipedia: Word2Vec](https://en.wikipedia.org/wiki/Word2vec).

In [4]:
from gensim.test.utils import datapath
from gensim import utils

class MyCorpus:
    """An iterator that yields sentences (lists of str)."""

    def __init__(self):
        self.lines = []

        files = os.listdir(".")
        files = [file for file in files if file.endswith(".txt")]
        print(f"Found {len(files)} files")

        for file in files:
            print(file)
            for line in open(file):
                self.lines += [line]
        print(f"Got {len(self.lines)} lines.")

    def __iter__(self):
        for line in self.lines:
            preprocessed_line = utils.simple_preprocess(line)
            yield preprocessed_line

In [5]:
import gensim.models

sentences = MyCorpus()

model = gensim.models.Word2Vec(
    sentences=sentences,
    sg=1,
    vector_size=300,
    window=20,
    min_count=3,
    workers=multiprocessing.cpu_count()
)

print("Done.")

Found 2 files
Got 2 lines.
Done.


## Find most similar words.

With vectors it is easy to find the nearest neighbours.

Note: Feel free to experiment with your own words.

In [14]:
# Access the vocabulary
vocabulary = model.wv.key_to_index

# Print the vocabulary
for word in vocabulary:
    print(word)


null
false
name
nlp
dsr
path
txt
antahiap
contenttype
data
true
file
com
https
github
main
directory
lsdyna_i_r
lsdyna_ii_r
pdf
ls
settings
dismiss
blob
notice
new
gitignore
totalcount
items
envs
img
notebooks
src
about
readme
md
changes
propose
or
make
to
in
signed
be
must
you
creating
docs
and
requirements
symbols
ce


In [15]:
model.wv.most_similar("requirements", topn=20)

[('txt', 0.9993817210197449),
 ('main', 0.9993041753768921),
 ('ce', 0.9992712736129761),
 ('data', 0.9992510080337524),
 ('totalcount', 0.9992466568946838),
 ('md', 0.9992452263832092),
 ('file', 0.9992308020591736),
 ('gitignore', 0.9992303252220154),
 ('directory', 0.9992228150367737),
 ('settings', 0.9992197155952454),
 ('lsdyna_i_r', 0.9992177486419678),
 ('path', 0.9992145895957947),
 ('in', 0.9991920590400696),
 ('or', 0.9991906881332397),
 ('notebooks', 0.9991894960403442),
 ('img', 0.9991830587387085),
 ('make', 0.9991651773452759),
 ('to', 0.9991651773452759),
 ('about', 0.9991628527641296),
 ('you', 0.9991617798805237)]

## Plot word similarities.

That was just one word. Let us generate a similarity matrix of a lot of words. Again, use your own.

In [9]:
def plot_similarities(words):
    features = [np.array(model.wv[word]) for word in words]

    similarities = np.zeros((len(features), len(features)))
    for index1, feature1 in enumerate(features):
        for index2, feature2 in enumerate(features):
            similarities[index1, index2] = 1 - spatial.distance.cosine(feature1, feature2)

    fig, ax = plt.subplots(figsize=(12, 12))
    g = sns.heatmap(
        similarities,
        annot=True,
        xticklabels=words,
        yticklabels=words,
        cmap="inferno"
    )
    g.set_xticklabels(words, rotation=90)
    g.set_yticklabels(words, rotation=0)
    g.set_title("Semantic Similarity")

words = [
    "harry",
    "hermione",
    "ron",
    "fred",
    "george",
    "snape",
    "wand",
    "snitch",
    "marauder",
    "map",
    "hogwarts",
    "slytherin",
    "gryffindor",
    "hufflepuff",
    "ravenclaw",
    "voldemort",
    "tom",
    "horcrux",
    "snake",
    "nagini",
    "wizard",
    "witch"
]
plot_similarities(words)

KeyError: ignored

# Thank you!