<a href="https://colab.research.google.com/github/astrapi69/DroidBallet/blob/master/NLP_D1_2_E2_Word_Embeddings_Job_Ads_helper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a id='Q0'></a>
<center><a target="_blank" href="https://learning.constructor.org/"><img src="https://drive.google.com/uc?id=1RNy-ds7KWXFs7YheGo9OQwO3OnpvRSU1" width="200" style="background:none; border:none; box-shadow:none;" /></a> </center>

_____

<center> <h1> Helper Notebook: Projecting Word Embeddings </h1> </center>

<p style="margin-bottom:1cm;"></p>

_____

<center>Constructor Academy, 2024</center>


# Helper Notebook: Projecting Word Embeddings
We will work with job ads from job.ch. A dataset of 10,000 English job ads is provided.

The goal of this exercise is to develop a working understanding of word2vec and to use t-SNE and UMAP projections as a way to analyze word embeddings.

As in any classical NLP task, the steps in this analysis are:

- Clean data
- Build a corpus
- Train word2vec
- Visualize using t-SNE or UMAP

In [None]:
! pip install umap-learn

In [None]:
import re

import nltk
import numpy as np
import pandas as pd
import umap
from gensim.models import word2vec
from matplotlib import pyplot as plt
from sklearn.manifold import TSNE

# Download the NLTK resources used for stopword removal and tokenization
nltk.download('stopwords')
nltk.download('punkt')

%matplotlib inline

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


## Download dataset

In [None]:
!curl -L -o job_ads_eng.csv "http://drive.google.com/uc?export=download&id=1IGCgrq7AqygIaLcjiFwlqgcNoQd1OAqo"

## Data preparation
### Load the data set

In [None]:
data = pd.read_csv("job_ads_eng.csv")  # .sample(50000, random_state=23)
data.head(3)

### Data cleaning

In [None]:
STOP_WORDS = nltk.corpus.stopwords.words("english")


def clean_sentence(sentence):
    """
    remove chars that are not letters or numbers, downcase
    """
    # TODO
    # Hint: Use regex!


def remove_stopwords(sentence):
    """
    remove stopwords
    """
    # TODO
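
One possible way to fill in the two stubs above, as a minimal sketch rather than the reference solution. The regular expression and the tiny `DEMO_STOP_WORDS` set are assumptions made so the example runs offline; in the notebook you would pass the NLTK `STOP_WORDS` list instead. Note that this pattern also drops non-ASCII letters such as umlauts.

```python
import re

# Hypothetical stand-in for the NLTK stopword list, so the sketch runs offline
DEMO_STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to", "we", "with"}


def clean_sentence(sentence):
    """Lowercase the text and replace every run of non-alphanumeric characters with a space."""
    sentence = re.sub(r"[^a-z0-9]+", " ", sentence.lower())
    return sentence.strip()


def remove_stopwords(sentence, stop_words=DEMO_STOP_WORDS):
    """Drop every whitespace-separated token that appears in stop_words."""
    return " ".join(word for word in sentence.split() if word not in stop_words)


example = "We are looking for a Data Scientist (m/f) with 3+ years of Python!"
print(remove_stopwords(clean_sentence(example)))
```

Replacing unwanted characters with a space (instead of deleting them) keeps words such as "m/f" from being glued together.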


In [None]:
data = data.dropna(subset=["Content"])  # remove rows without content
data["Content"] = data["Content"].apply(clean_sentence)
data["Content"] = data["Content"].apply(remove_stopwords)
data.head(3)

In [None]:
# Let's have a look at an example text
data[data["Job title"].str.contains("Data")]["Content"].values[1]

### Create the corpus

In [None]:
# Create a list of lists containing the words of each description

# TODO
corpus = ??
corpus[0][:10]
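
A minimal sketch of the list-of-lists structure the corpus should have. The `cleaned_ads` list is a hypothetical stand-in for the cleaned `data["Content"]` column. Because the cleaned text only contains whitespace-separated tokens, `str.split()` is assumed to be sufficient here; `nltk.word_tokenize` would be an alternative.

```python
# Hypothetical cleaned descriptions standing in for data["Content"]
cleaned_ads = [
    "looking data scientist python sql",
    "senior java developer zurich",
]

# Each description becomes a list of tokens; the corpus is a list of such lists
corpus = [text.split() for text in cleaned_ads]

print(corpus[0][:3])
```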

## Create word embeddings
We use the word2vec implementation from the gensim package.

In [None]:
# TODO
model = ??
model.wv["email"]

## Project and plot embeddings
Let's use t-SNE or UMAP to project the embeddings into a 2- or 3-dimensional space. For plotting, we use an interactive Plotly figure.

In [None]:
from plotly import express as px


def plot_embeddings(model, projection="tsne", dim=2, wordlist=None, **kwargs):
    """Project the embeddings of `model` and show them in an interactive plot."""
    # Validate early so an unsupported dimension fails before the costly projection
    if dim not in (2, 3):
        raise ValueError("Dimension of the projection has to be 2 or 3.")

    vectors_proj, labels = project_embeddings(
        model, projection=projection, dim=dim, wordlist=wordlist, **kwargs
    )

    if dim == 2:
        plot_2d(vectors_proj, labels)
    else:
        plot_3d(vectors_proj, labels)


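def project_embeddings(model, projection="tsne", dim=2, wordlist=None, **kwargs):
    """Return the projected vectors and their labels for the words in `wordlist`."""
    if not wordlist:
        # Default to the full vocabulary of the model
        wordlist = model.wv.key_to_index

    labels = list(wordlist)
    # Stack into one 2-D array; some scikit-learn versions fail on a plain list here
    vectors = np.array([model.wv[word] for word in wordlist])

    if projection == "tsne":
        vectors_proj = call_tsne(vectors, n_components=dim, **kwargs)
    elif projection == "umap":
        vectors_proj = call_umap(vectors, n_components=dim, **kwargs)
    else:
        raise ValueError("projection has to be 'tsne' or 'umap'.")
    return vectors_proj, labels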


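def call_tsne(vectors, n_components, **kwargs):
    # Default settings; keyword arguments passed by the caller override them.
    # Note: newer scikit-learn releases rename `n_iter` to `max_iter`;
    # adjust the key below if your installed version rejects it.
    arguments = dict(perplexity=40, init="pca", n_iter=2500, random_state=23)
    arguments.update(kwargs)
    tsne_model = TSNE(n_components=n_components, **arguments)
    vectors_proj = tsne_model.fit_transform(vectors)
    return vectors_proj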


def call_umap(vectors, n_components, **kwargs):
    # Default settings; keeping random_state here lets callers override it via kwargs
    arguments = dict(n_neighbors=15, min_dist=0.1, metric="euclidean", random_state=42)
    arguments.update(kwargs)
    umap_model = umap.UMAP(n_components=n_components, **arguments)
    vectors_proj = umap_model.fit_transform(vectors)
    return vectors_proj


def plot_2d(vectors_proj, labels=None):
    # Split the projected points into one list per axis
    x = [vec[0] for vec in vectors_proj]
    y = [vec[1] for vec in vectors_proj]

    fig = px.scatter(x=x, y=y, text=labels)
    fig.update_traces(textposition="top center", textfont_size=10)
    fig.update_layout(height=800, title_text="2d projection of word embeddings")
    fig.show()


def plot_3d(vectors_proj, labels=None):
    # Split the projected points into one list per axis
    x = [vec[0] for vec in vectors_proj]
    y = [vec[1] for vec in vectors_proj]
    z = [vec[2] for vec in vectors_proj]

    fig = px.scatter_3d(x=x, y=y, z=z, text=labels)
    fig.update_traces(textposition="top center", textfont_size=10, marker_size=3)
    fig.update_layout(height=800, title_text="3d projection of word embeddings")
    fig.show()

In [None]:
plot_embeddings(model, projection="umap", dim=2)

## Export the word embeddings
This allows us to visualize them at https://projector.tensorflow.org/

![tensorflow_projector_job_adds.gif](attachment:tensorflow_projector_job_adds.gif)

In [None]:
# TODO
labels = ??
vectors = ??

In [None]:
pd.DataFrame(labels).to_csv("labels.tsv", sep="\t", index=False, header=False)
pd.DataFrame(vectors).to_csv("vectors.tsv", sep="\t", index=False, header=False)
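
A small sketch of the file layout produced by the export above: one tab-separated vector per row, and one label per line in the same order, without a header row. The `demo_labels` and `demo_vectors` values are hypothetical; with gensim 4 the labels and vectors would likely come from `model.wv.index_to_key` and `model.wv.vectors`. The sketch writes to in-memory buffers instead of files so it can run anywhere.

```python
import csv
import io

# Hypothetical stand-in embeddings; with gensim 4 one would likely use
# list(model.wv.index_to_key) and model.wv.vectors instead
demo_labels = ["python", "java", "email"]
demo_vectors = [[0.1, 0.2, 0.3], [0.2, 0.1, 0.4], [0.9, 0.8, 0.1]]

# Write to in-memory buffers here; the notebook writes real .tsv files
vectors_buffer = io.StringIO()
writer = csv.writer(vectors_buffer, delimiter="\t", lineterminator="\n")
writer.writerows(demo_vectors)  # one tab-separated vector per row

labels_buffer = io.StringIO()
labels_buffer.write("\n".join(demo_labels) + "\n")  # one label per line, same order

print(vectors_buffer.getvalue())
print(labels_buffer.getvalue())
```

Row *i* of the vectors file must correspond to line *i* of the labels file, so both should be built from the same ordering of the vocabulary.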

Go to https://projector.tensorflow.org/ and click `Load` to upload the vectors file and the labels file.

## Word embeddings - Try different parameters

In [None]:
# A more selective model: a word has to occur at least 1000 times in the corpus

# TODO
model = ??
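
gensim's `Word2Vec` has a `min_count` parameter that ignores all words whose total frequency is below the given value. The sketch below only reproduces that vocabulary filter on a hypothetical toy corpus (it does not train a model), to show which words would survive a given threshold.

```python
from collections import Counter

# Hypothetical toy corpus of tokenized job ads
toy_corpus = [
    ["python", "developer", "python", "sql"],
    ["java", "developer", "sql"],
    ["python", "analyst"],
]

min_count = 2  # illustrative threshold; the exercise asks for 1000

# Count how often each word occurs across the whole corpus
frequencies = Counter(word for ad in toy_corpus for word in ad)

# Only words that reach the threshold stay in the vocabulary
vocabulary = sorted(word for word, count in frequencies.items() if count >= min_count)
print(vocabulary)
```

Running the same count on the real corpus before training gives a quick estimate of how many words remain to be plotted.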

In [None]:
plot_embeddings(model, projection="umap", dim=2)

In [None]:
# Create word embeddings with 300 components

# TODO
model = ??

In [None]:
plot_embeddings(model, projection="umap", dim=2)

## Let's find some similar words to our query
Create the word embeddings.

In [None]:
# TODO
model = ??

Define a search word and find the most similar words.

In [None]:
# plot the most similar words
search_word = "python"

# TODO
m_similar = ??
wordlist = ??
# add the word itself
wordlist.append(search_word)
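
In gensim, `model.wv.most_similar(search_word, topn=...)` returns a list of `(word, similarity)` tuples ranked by cosine similarity. The sketch below illustrates that ranking in plain Python on hand-picked, hypothetical 3-dimensional vectors; the helper names `cosine_similarity` and `most_similar` are defined here for illustration and are not the gensim implementation.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration
demo_embeddings = {
    "python": [0.9, 0.1, 0.0],
    "java": [0.8, 0.2, 0.1],
    "sql": [0.6, 0.5, 0.1],
    "email": [0.0, 0.1, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, embeddings, topn=2):
    """Return the topn (word, similarity) pairs closest to `word`, excluding itself."""
    query = embeddings[word]
    scores = [
        (other, cosine_similarity(query, vector))
        for other, vector in embeddings.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


demo_similar = most_similar("python", demo_embeddings)
demo_wordlist = [word for word, _ in demo_similar]
demo_wordlist.append("python")  # add the query word itself
print(demo_wordlist)
```

The word list is built by keeping only the first element of each tuple, which is the same shape of list that `plot_embeddings` expects for its `wordlist` argument.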

Plot the search word together with the similar words.

In [None]:
plot_embeddings(model, projection="umap", dim=2, wordlist=wordlist)

Print the list of similar words.

In [None]:
m_similar