# Create Custom Word Embeddings

In this notebook we create custom word embeddings. They are trained without supervision on our own text, so they are generic and can be reused in any of our NLP applications that take word embeddings as input.

The notebook covers the following steps:

  1. Install & Import Packages
  2. Define the Text Pre-processing Pipeline
  3. Create the Text Corpus
  4. Train Word Embeddings
  5. Convert & Save

In [1]:
# Let's mount our G-Drive. 

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# 1. Install & Import Packages

Restart the runtime once the pip installs have completed, so that the upgraded packages are picked up.

In [2]:
!pip install -U fasttext
!pip install -U gensim
!pip install tiny-tokenizer
!pip install flair

Collecting fasttext
  Downloading https://files.pythonhosted.org/packages/10/61/2e01f1397ec533756c1d893c22d9d5ed3fce3a6e4af1976e0d86bb13ea97/fasttext-0.9.1.tar.gz (57kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... done
  Created wheel for fasttext: filename=fasttext-0.9.1-cp36-cp36m-linux_x86_64.whl size=2384065 sha256=d4e85b70e9f7fdf878e13da282a573148197922f0aea470a1280c3cc258812bc
  Stored in directory: /root/.cache/pip/wheels/9f/f0/04/caa82c912aee89ce76358ff954f3f0729b7577c8ff23a292e3
Successfully built fasttext
Installing collected packages: fasttext
Successfully installed fasttext-0.9.1
Collecting gensim
  Downloading https://files.pythonhosted.org/packages/d1/dd/112bd4258cee11e0baaaba064060eb156475a42362e59e3ff28e7ca2d29d/gensim-3.8.1-cp36-cp36m-manylinux1_x86_64.whl (24.2MB)
Installing colle
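
After restarting the runtime, it can help to confirm that the packages are visible before moving on. The following is a minimal sketch, assuming Python 3.8 or newer (for `importlib.metadata`); the helper name `installed_versions` is hypothetical and not part of any of these libraries.

```python
from importlib import metadata

# Distribution names as used in the pip commands above.
PACKAGES = ["fasttext", "gensim", "flair"]

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as missing would need to be installed again before the import cell is run.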

In [1]:
# Let's import our packages !

import pandas as pd
import fasttext
import re
import html
from tqdm import tqdm 
from flair.embeddings import WordEmbeddings

# 2. Text Pre-processing Pipeline

In NLP, whether the text is structured or unstructured, the one constant is text pre-processing. We build the text-cleaning pipeline around the task and its requirements.

Each try-except block below could be factored into its own function and invoked from `preprocess_text()`, which then acts as a pipeline of the cleaning steps your dataset needs.

In [0]:
# Pre-compiled pattern that matches HTML tags such as <p> or </strong>
clean = re.compile(r'<.*?>')

def preprocess_text(text):
  try:
    # strip HTML tags, then decode entities such as &amp;
    # soup = BeautifulSoup(text, "html.parser")
    # text = soup.get_text()
    text = clean.sub('', text)
    text = html.unescape(text)
  except Exception:
    # non-string input (e.g. NaN) ends up here; report it and re-raise
    print("Error in HTML Processing ...")
    print(text)
    raise

  try:
    # replace newlines, tabs and carriage returns with spaces
    # (often present in really noisy text)
    text = text.translate(str.maketrans("\n\t\r", "   "))
  except Exception:
    print("Error in removing extra lines ...")
    print(text)

  try:
    # collapse runs of spaces and trim both ends
    text = re.sub(' +', ' ', text)
    text = text.strip()
  except Exception:
    print("Error in extra whitespace removal ...")
    print(text)

  return text
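
The modular variant mentioned above could look roughly like the sketch below. The step functions and `run_pipeline` are hypothetical names chosen for illustration; they apply the same three cleaning operations as `preprocess_text()`, but leave out the error reporting to keep the example short.

```python
import html
import re

TAG_PATTERN = re.compile(r"<.*?>")
SPACE_PATTERN = re.compile(r" +")

def strip_html(text):
    """Remove HTML tags, then decode entities such as &amp;."""
    return html.unescape(TAG_PATTERN.sub("", text))

def flatten_lines(text):
    """Replace newlines, tabs and carriage returns with spaces."""
    return text.translate(str.maketrans("\n\t\r", "   "))

def squeeze_spaces(text):
    """Collapse runs of spaces and trim both ends."""
    return SPACE_PATTERN.sub(" ", text).strip()

# Hypothetical step list; reorder or extend it to suit the dataset.
PIPELINE = [strip_html, flatten_lines, squeeze_spaces]

def run_pipeline(text, steps=PIPELINE):
    """Apply each cleaning step in order and return the cleaned text."""
    for step in steps:
        text = step(text)
    return text

print(run_pipeline("<p>Tom &amp; Jerry\n\tare   <b>friends</b></p>"))
# Tom & Jerry are friends
```

Keeping the steps in a list makes it easy to switch one off, or to add a dataset-specific step, without touching the others.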

# 3. Creating the Corpus

To train word embeddings we first need a single corpus containing all the text we want the model to learn from. This involves the following:

1. Load all the documents (text files, sentences, etc.) and pre-process them
2. *One Line Per Document*: the training format
3. Save to re-use
4. Train FastText Embedding
5. Convert to a re-usable & distributable Gensim format

In [3]:
# Define the Base Path & Data Files
path = '/content/drive/My Drive/ICDMAI_Tutorial/notebook/'
data ='filtered_data/question_tag_text_mapping.pkl'

# Load the structured file with all document Text
question_tag = pd.read_pickle(path+data)
question_tag.head()

Unnamed: 0,Id,OwnerUserId,CreationDate,ClosedDate,Score,Title,Body,CreationMonth,CreationYear,Tag
0,120,83.0,2008-08-01 15:50:08,,21,ASP.NET Site Maps,<p>Has anyone got experience creating <strong>...,8,2008,"[asp.net, sql]"
1,260,91.0,2008-08-01 23:22:08,,49,Adding scripting functionality to .NET applica...,<p>I have a little game written in C#. It uses...,8,2008,"[c#, .net]"
2,330,63.0,2008-08-02 02:51:36,,29,Should I use nested classes in this case?,<p>I am working on a collection of classes use...,8,2008,[c++]
3,470,71.0,2008-08-02 15:11:47,2016-03-26T05:23:29Z,13,Homegrown consumption of web services,<p>I've been writing a few web services for a ...,8,2008,"[web-services, .net]"
4,580,91.0,2008-08-02 23:30:59,,21,Deploying SQL Server Databases from Test to Live,<p>I wonder how you guys manage deployment of ...,8,2008,[sql-server]


In [4]:
# Apply the pre-processing to the Question Title & Body

question_tag['Title'] = question_tag['Title'].apply(preprocess_text)
question_tag['Body'] = question_tag['Body'].apply(preprocess_text)
question_tag.head()

Unnamed: 0,Id,OwnerUserId,CreationDate,ClosedDate,Score,Title,Body,CreationMonth,CreationYear,Tag
0,120,83.0,2008-08-01 15:50:08,,21,ASP.NET Site Maps,Has anyone got experience creating SQL-based A...,8,2008,"[asp.net, sql]"
1,260,91.0,2008-08-01 23:22:08,,49,Adding scripting functionality to .NET applica...,I have a little game written in C#. It uses a ...,8,2008,"[c#, .net]"
2,330,63.0,2008-08-02 02:51:36,,29,Should I use nested classes in this case?,I am working on a collection of classes used f...,8,2008,[c++]
3,470,71.0,2008-08-02 15:11:47,2016-03-26T05:23:29Z,13,Homegrown consumption of web services,I've been writing a few web services for a .ne...,8,2008,"[web-services, .net]"
4,580,91.0,2008-08-02 23:30:59,,21,Deploying SQL Server Databases from Test to Live,I wonder how you guys manage deployment of a d...,8,2008,[sql-server]


In [0]:
# Iterate & create the "One Line Per Document" training format

text_lines =  list()
for index in tqdm(question_tag.index) :
  title = question_tag.loc[index,'Title']
  body =  question_tag.loc[index,'Body']

  text =  title + '. ' + body
  text_lines.append(text)

# Save to re-use easily next-time
with open(path+'/training_data/training_data.txt', 'w',encoding ='utf-8') as filehandle:
    filehandle.writelines("%s\n" % sent for sent in text_lines)
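
The one-line-per-document format only holds if no document contains a line break of its own. The sketch below shows one way to enforce and check this with the standard library alone. The helper names `to_line` and `write_corpus` are hypothetical, and the sample documents are made-up placeholders, not rows from the dataset.

```python
import os
import tempfile

def to_line(title, body):
    """Join title and body as the cell above does, forced onto a single line."""
    # str.split() with no argument splits on any whitespace, including newlines.
    return " ".join(f"{title}. {body}".split())

def write_corpus(documents, file_path):
    """Write one document per line and return the number of documents written."""
    lines = [to_line(title, body) for title, body in documents]
    with open(file_path, "w", encoding="utf-8") as handle:
        handle.writelines(line + "\n" for line in lines)
    return len(lines)

# Made-up sample documents; the first body contains a stray newline.
docs = [("Title one", "Body with\na newline"), ("Title two", "Plain body")]

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "training_data.txt")
    expected = write_corpus(docs, target)
    with open(target, encoding="utf-8") as handle:
        actual = sum(1 for _ in handle)

print(expected, actual)
# 2 2
```

If the two counts differ, at least one document was split across lines and the corpus should be regenerated before training.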

# 4. Train Word Embeddings

Here we use FastText to train the word embeddings. Training is fast, which makes it a practical way to evaluate how valuable custom embeddings are for the task at hand.


FastText, open-sourced by Facebook, is available both as a command-line tool and as a Python library. Where possible, use the command-line tool, as it is faster.

The options and hyperparameters to tune are listed in the documentation.


**Link**  : https://github.com/facebookresearch/fastText/tree/master/python#train_unsupervised-parameters

In [0]:
# Skipgram model :
model = fasttext.train_unsupervised(path+'/training_data/training_data.txt', model='skipgram',dim=300,verbose=3)
model.save_model(path+"/training_data/training_data_skipgram_d300.bin")
print("Model Saved")
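
For the command-line route mentioned above, the equivalent call can be assembled from Python. This is a sketch under assumptions: it presumes a compiled `fasttext` executable is on the PATH, and the file names are placeholders. The helper `build_fasttext_command` is hypothetical. The CLI writes both `<output>.bin` and `<output>.vec`.

```python
import os
import shutil
import subprocess

def build_fasttext_command(input_path, output_prefix, model="skipgram", dim=300,
                           binary="fasttext"):
    """Return the argument list for an unsupervised fastText CLI training run."""
    return [binary, model, "-input", input_path,
            "-output", output_prefix, "-dim", str(dim)]

# Placeholder file names; point these at the real corpus and output location.
command = build_fasttext_command("training_data.txt", "training_data_skipgram_d300")
print(" ".join(command))
# fasttext skipgram -input training_data.txt -output training_data_skipgram_d300 -dim 300

# Run only when the executable and the corpus file are both present.
if shutil.which(command[0]) and os.path.exists("training_data.txt"):
    subprocess.run(command, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with paths that contain spaces, such as `My Drive`.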

# 5. Convert & Save

You will generally want to convert the embeddings to Gensim format, as Gensim is a widely used library and its format is supported by most frameworks.

In [0]:
from gensim.models import KeyedVectors

# We use the .vec file which does not contain the Meta-Data of FastText word-embeddings
model = KeyedVectors.load_word2vec_format('/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/word_embedding/model.vec')

# Save
model.save('/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/word_embedding/gensim_model')
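
The cell above reads a `model.vec` file, but this notebook does not show the step that wrote it (the training cell saves only a `.bin` file). One possible way to produce it is to dump the vocabulary in word2vec text format: a header line with the word count and dimension, then one word and its vector per line. The sketch below uses hypothetical helpers (`write_vec`, `read_vec`) and a toy dictionary with made-up numbers in place of a trained model.

```python
import os
import tempfile

def write_vec(words, get_vector, dim, file_path):
    """Write vectors in word2vec text format: 'count dim' header, then one word per line."""
    with open(file_path, "w", encoding="utf-8") as handle:
        handle.write(f"{len(words)} {dim}\n")
        for word in words:
            values = " ".join(f"{value:.4f}" for value in get_vector(word))
            handle.write(f"{word} {values}\n")

def read_vec(file_path):
    """Read the file back into (count, dim, {word: vector}) to check the layout."""
    with open(file_path, encoding="utf-8") as handle:
        count, dim = map(int, handle.readline().split())
        vectors = {}
        for line in handle:
            parts = line.rstrip("\n").split(" ")
            vectors[parts[0]] = [float(value) for value in parts[1:]]
    return count, dim, vectors

# Toy stand-in for a trained model (made-up numbers).
toy = {"sql": [0.1, 0.2, 0.3], "python": [0.4, 0.5, 0.6]}

with tempfile.TemporaryDirectory() as tmp:
    vec_path = os.path.join(tmp, "model.vec")
    write_vec(list(toy), toy.__getitem__, 3, vec_path)
    count, dim, vectors = read_vec(vec_path)

print(count, dim, vectors["sql"])
# 2 3 [0.1, 0.2, 0.3]

# With a trained fasttext model, the same writer could be called as:
# write_vec(model.get_words(), model.get_word_vector, model.get_dimension(), "model.vec")
```

The file holds only the vectors for words in the vocabulary. Vectors that FastText can compose from subwords for unseen words are not carried over.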

In [7]:
# Test that we can load it in the downstream task/code base we want.
word_embeddings = [ WordEmbeddings('/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/word_embedding/gensim_model')]
print("Word Embedding Loaded Successfully in Flair")

Word Embedding Loaded Successfully in Flair


In [15]:
# Looking at the embeddings
from flair.embeddings import WordEmbeddings
from flair.data import Sentence
word_embeddings = WordEmbeddings('/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/word_embedding/gensim_model')

# some example sentence
sentence = Sentence('Welcome to ICDMAI 2020')
# sentence = Sentence(question_tag.head(1))

# embed sentences
word_embeddings.embed(sentence)

# go through each token in sentence
for token in sentence:
    # print token string
    print("Word : {}".format(token.text))
    # print embedding of this Token
    print(token.embedding)

    # print shape of embedding of this Token
    print(token.embedding.shape)

Word : Welcome
tensor([-0.0477,  0.2607,  0.2033,  0.2048, -0.0809,  0.1777,  0.0807, -0.2186,
         0.0905, -0.2447, -0.0885,  0.3721,  0.7040,  0.4241,  0.4265, -0.8608,
         0.1301, -0.7587, -0.2028,  0.6566, -0.5186, -0.3107, -0.4054, -0.4756,
         0.1132, -0.0198,  0.4574,  0.1947, -0.4173,  0.0510, -0.2822, -0.2527,
         0.1500, -0.4198, -0.0433,  0.5972,  0.2721, -0.6502,  0.7049, -0.3756,
         0.2631,  0.5738, -0.6479,  0.1652,  0.2238,  0.6156,  0.3448,  0.1264,
        -0.0845,  0.2268, -0.3862,  0.6731,  0.7547,  0.3776, -0.0597, -0.0291,
         0.0330,  0.1897,  0.1477, -0.2566,  0.4579, -0.1078,  0.0178, -0.5004,
         0.1670,  0.4812,  0.0785,  0.3406, -0.3407, -0.3803,  0.0712,  0.2021,
         0.4849,  0.5238,  0.1135,  0.4455, -0.0758,  0.2082, -0.1574,  0.5648,
        -0.7159, -0.2169,  0.1408, -0.0399, -0.1633, -0.1153,  0.1199, -0.1713,
         0.1180, -0.1937, -0.5096, -0.1051,  0.6814,  0.5651, -0.0935,  0.1203,
        -0.0947, -0.3284,