Date : / /2023

## Word2Vec

<img src="93033pic1.png" alt="word2vec model">

1. Word2Vec is a group of related models used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2Vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space.

2. Word2Vec can use either of two model architectures to produce these distributed representations of words: Continuous Bag-of-Words (CBOW) or continuous skip-gram. In both architectures, Word2Vec considers both individual words and a sliding context window as it iterates over the corpus [2].

3. In the CBOW architecture, the model predicts the current word from its surrounding context words. CBOW can be viewed as a 'fill in the blank' task, where the word embedding represents the way the word influences the relative probabilities of other words in the context window [2].
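The 'fill in the blank' view of CBOW can be made concrete by listing the training examples a sliding window would produce. The following is a minimal sketch in plain Python, not the gensim implementation; the function name `cbow_pairs` and the window size are assumptions chosen for illustration.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs for a CBOW-style task."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]      # words before the target
        right = tokens[i + 1:i + 1 + window]     # words after the target
        context = left + right
        if context:                              # skip one-word sentences
            pairs.append((context, target))
    return pairs


sentence = ['bye', 'see', 'later', 'goodbye', 'come']
cbow_examples = cbow_pairs(sentence, window=1)
for context, target in cbow_examples:
    print(context, '->', target)
```

Each printed line reads as "given these neighbours, predict the word in the middle", which is the prediction task CBOW is trained on.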

In the continuous skip-gram architecture, the model uses the current word to predict the surrounding window of context words. The skip-gram architecture weights nearby context words more heavily than more distant ones. According to the authors' note, CBOW is faster, while skip-gram does a better job for infrequent words [2].
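Skip-gram reverses the direction of prediction: each word is paired with every neighbour inside its window. The sketch below is a plain-Python illustration under simplifying assumptions: `skipgram_pairs` is a hypothetical helper, and it uses a fixed window, so it does not reproduce the heavier weighting of nearby words (which is commonly implemented by sampling a smaller effective window at random for each target).

```python
def skipgram_pairs(tokens, window=2):
    """Return (target_word, context_word) pairs for a skip-gram-style task."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                      # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


skipgram_examples = skipgram_pairs(['pay', 'using', 'credit', 'card'], window=1)
print(skipgram_examples)
```

In gensim, the choice between the two architectures is made with the `sg` argument of `Word2Vec` (`sg=0` for CBOW, the default, and `sg=1` for skip-gram).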

After the model has been trained, the learned word embeddings are positioned in the vector space such that words that share common contexts in the corpus (that is, words that are semantically and syntactically similar) are located close to one another. More dissimilar words are located farther apart [2].
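"Close" in this vector space is usually measured with cosine similarity, which is also the score that gensim's `most_similar` reports. The sketch below computes it by hand; the three-dimensional vectors are hypothetical values picked for illustration, not output from a trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings (real Word2Vec vectors have ~100+ dimensions).
toy_vectors = {
    'physics': [0.9, 0.1, 0.0],
    'biology': [0.8, 0.2, 0.1],
    'goodbye': [0.0, 0.1, 0.9],
}

sim_close = cosine_similarity(toy_vectors['physics'], toy_vectors['biology'])
sim_far = cosine_similarity(toy_vectors['physics'], toy_vectors['goodbye'])
print(sim_close > sim_far)
```

A value near 1 means the two vectors point in almost the same direction; a value near 0 means they are nearly orthogonal.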

Word2Vec was introduced by a team of researchers at Google led by Tomas Mikolov. Google hosts an open-source version of Word2Vec released under an Apache 2.0 license. In 2014, Mikolov left Google for Facebook, and in May 2015, Google was granted a patent for the method, which does not abrogate the Apache license under which it has been released [1].

Word2Vec is not a single algorithm; rather, it is a family of model architectures and optimizations that can be used to learn word embeddings from large datasets. Embeddings learned through Word2Vec have proven successful on a variety of downstream natural language processing tasks [4].

The output of the Word2Vec neural net is a vocabulary in which each item has a vector attached to it. These vectors can be fed into a deep-learning net or simply queried to detect relationships between words [1].

In terms of implementation, you can use the Keras Subclassing API to define your own Word2Vec model. This involves defining layers for the target and context embeddings, and a function that computes the dot product of the target and context embeddings for each training pair.
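The scoring step described above can be illustrated without Keras. The sketch below is a plain-Python stand-in, not a Keras model: it keeps two separate lookup tables (playing the role of the target and context embedding layers) and scores a training pair by dot product. The table sizes, the random initialisation, the word ids, and the use of two negative samples are all assumptions made for illustration.

```python
import math
import random

random.seed(0)  # fixed seed so the toy embeddings are reproducible

VOCAB_SIZE, EMBED_DIM = 5, 4  # hypothetical sizes


def random_table(rows, cols):
    """Build a rows x cols table of small random weights."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Two separate tables, mirroring the target and context embedding layers.
target_embedding = random_table(VOCAB_SIZE, EMBED_DIM)
context_embedding = random_table(VOCAB_SIZE, EMBED_DIM)


def score(target_id, context_ids):
    """Dot product of one target vector with each candidate context vector."""
    t = target_embedding[target_id]
    return [sum(a * b for a, b in zip(t, context_embedding[c])) for c in context_ids]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# One training pair: target word 2, one true context word (id 3),
# followed by two hypothetical negative samples (ids 0 and 4).
logits = score(2, [3, 0, 4])
probs = [sigmoid(z) for z in logits]
print(probs)
```

Training would then push the first probability towards 1 and the others towards 0 by adjusting both tables; in a Keras subclassed model, the two tables would typically be `Embedding` layers and `score` would be the body of `call`.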

### Imports

!pip install textblob

In [16]:
### -------------------
### Importing libraries
### -------------------

import pandas as pd

# import numpy as np
# import matplotlib.pyplot as plt
# import seaborn as sns

import nltk, string, gensim
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

import json    # read and parse JSON files

from textblob import Word

from gensim.models import Word2Vec

In [2]:
nltk.download('abc')

[nltk_data] Downloading package abc to /home/dai/nltk_data...
[nltk_data]   Package abc is already up-to-date!


True

In [3]:
from nltk.corpus import abc

### Global variables

### Input data

In [4]:
### ----------
### input data
### ----------

abc.sents()

[['PM', 'denies', 'knowledge', 'of', 'AWB', 'kickbacks', 'The', 'Prime', 'Minister', 'has', 'denied', 'he', 'knew', 'AWB', 'was', 'paying', 'kickbacks', 'to', 'Iraq', 'despite', 'writing', 'to', 'the', 'wheat', 'exporter', 'asking', 'to', 'be', 'kept', 'fully', 'informed', 'on', 'Iraq', 'wheat', 'sales', '.'], ['Letters', 'from', 'John', 'Howard', 'and', 'Deputy', 'Prime', 'Minister', 'Mark', 'Vaile', 'to', 'AWB', 'have', 'been', 'released', 'by', 'the', 'Cole', 'inquiry', 'into', 'the', 'oil', 'for', 'food', 'program', '.'], ...]

In [5]:
model = gensim.models.Word2Vec(abc.sents())

In [6]:
# Find the ten words whose vectors are closest to 'science' (cosine similarity)

data = model.wv.most_similar('science')

print(data)

[('law', 0.9344958662986755), ('policy', 0.9130299091339111), ('general', 0.9113556742668152), ('agriculture', 0.9049243927001953), ('education', 0.9046909213066101), ('discussion', 0.9044297933578491), ('media', 0.9019559025764465), ('biology', 0.8990610837936401), ('physics', 0.8984453082084656), ('department', 0.8983114361763)]


### Reading json file

1. A JSON file stores data as key-value pairs.
2. The file read here is intents.json.
3. A JSON file can also be loaded into a pandas DataFrame.
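To show how JSON maps onto Python objects, the sketch below parses a small hand-written string rather than a file. The two intents are a hypothetical, shortened sample in the same shape as the intents file used in this section.

```python
import json

# Hypothetical two-intent sample; the real file has more intents and patterns.
raw = '''
[
  {"tag": "welcome", "patterns": ["Hi", "Hello"],
   "responses": ["Hello, thanks for contacting us"]},
  {"tag": "goodbye", "patterns": ["Bye", "See you later"],
   "responses": ["See you later, thanks for visiting"]}
]
'''

intents = json.loads(raw)   # JSON array -> list, JSON object -> dict

# Index the patterns by tag for quick lookup.
patterns_by_tag = {item['tag']: item['patterns'] for item in intents}
print(patterns_by_tag['goodbye'])
```

`json.load` (without the trailing `s`) does the same conversion but reads from an open file handle instead of a string.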

In [7]:
# File being read: 'intents.json'
# Data source: https://mitu.co.in/dataset/

json_file = 'intents.json'

with open(json_file, 'r') as fh:
    data = json.load(fh)

# Display the parsed data

data

[{'tag': 'welcome',
  'patterns': ['Hi',
   'How are you',
   'Is any one to talk?',
   'Hello',
   'hi are you available'],
  'responses': ['Hello, thanks for contacting us',
   'Good to see you here',
   ' Hi there, how may I assist you?']},
 {'tag': 'goodbye',
  'patterns': ['Bye', 'See you later', 'Goodbye', 'I will come back soon'],
  'responses': ['See you later, thanks for visiting',
   'have a great day ahead',
   'Wish you Come back again soon.']},
 {'tag': 'thankful',
  'patterns': ['Thanks for helping me',
   'Thank your guidance',
   "That's helpful and kind from you"],
  'responses': ['Happy to help!',
   'Any time!',
   'My pleasure',
   'It is my duty to help you']},
 {'tag': 'hoursopening',
  'patterns': ['What hours are you open?',
   'Tell your opening time?',
   'When are you open?',
   'Just your timing please'],
  'responses': ["We're open every day 8am-7pm",
   'Our office hours are 8am-7pm every day',
   'We open office at 8 am and close at 7 pm']},
 {'tag': 'pay

#### Reading the file into a pandas DataFrame

In [8]:
# Load the same JSON file directly into a DataFrame (one row per intent)

df = pd.read_json('intents.json')

df

| | tag | patterns | responses |
|---|---|---|---|
| 0 | welcome | [Hi, How are you, Is any one to talk?, Hello, ... | [Hello, thanks for contacting us, Good to see ... |
| 1 | goodbye | [Bye, See you later, Goodbye, I will come back... | [See you later, thanks for visiting, have a gr... |
| 2 | thankful | [Thanks for helping me, Thank your guidance, T... | [Happy to help!, Any time!, My pleasure, It is... |
| 3 | hoursopening | [What hours are you open?, Tell your opening t... | [We're open every day 8am-7pm, Our office hour... |
| 4 | payments | [Can I pay using credit card?, Can I pay usin... | [We accept VISA, Mastercard and credit card, W... |


In [9]:
# Join each list of patterns into a single comma-separated string

df['patterns'] = df['patterns'].apply(', '.join)

df

| | tag | patterns | responses |
|---|---|---|---|
| 0 | welcome | Hi, How are you, Is any one to talk?, Hello, h... | [Hello, thanks for contacting us, Good to see ... |
| 1 | goodbye | Bye, See you later, Goodbye, I will come back ... | [See you later, thanks for visiting, have a gr... |
| 2 | thankful | Thanks for helping me, Thank your guidance, Th... | [Happy to help!, Any time!, My pleasure, It is... |
| 3 | hoursopening | What hours are you open?, Tell your opening ti... | [We're open every day 8am-7pm, Our office hour... |
| 4 | payments | Can I pay using credit card?, Can I pay using... | [We accept VISA, Mastercard and credit card, W... |


In [10]:
stop = stopwords.words('english')

**What does the following command do?**

**df['patterns'] = df['patterns'].str.replace('[^\w\s]','')**

The command df['patterns'] = df['patterns'].str.replace('[^\w\s]','') is meant to strip punctuation and other symbols from the 'patterns' column of the DataFrame df by replacing them with an empty string.

Let's break it down:

    df['patterns']: selects the 'patterns' column of the DataFrame df.

    .str.replace('[^\w\s]',''): calls pandas' Series.str.replace(), which replaces occurrences of a pattern in every string of the column. Read as a regular expression, '[^\w\s]' matches any single character that is neither a word character (\w: letters, digits, underscore) nor a whitespace character (\s). The second argument '' is the replacement string, so every match is removed.

One caveat: the pattern is only interpreted as a regular expression when regex=True. From pandas 2.0 onward the default is regex=False, so the call as written searches for the literal text [^\w\s] and leaves the column unchanged. That would explain why the "'s" token is still present in the cleaned output further down.

In summary, with regex=True this command removes every character that is not a letter, digit, underscore, or whitespace from the 'patterns' column of the DataFrame df.
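The pattern can be tried out with the standard library alone. pandas' `str.replace` behaves like one of the two calls below depending on its `regex` argument; the sample sentence is made up for illustration.

```python
import re

text = "that's helpful, and kind from you!"

# Regex replacement: drop every character that is neither \w nor \s.
cleaned = re.sub(r'[^\w\s]', '', text)

# Literal replacement: searches for the 7-character text "[^\w\s]" itself,
# which does not occur in the sentence, so nothing changes.
literal = text.replace(r'[^\w\s]', '')

print(cleaned)
print(literal == text)
```

The first call removes the apostrophe, comma, and exclamation mark; the second returns the sentence untouched.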

In [14]:
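# 1. Lowercase every word
df['patterns'] = df['patterns'].apply(lambda text: ' '.join(w.lower() for w in text.split()))

# 2. Tokenize and drop single-character punctuation tokens
df['patterns'] = df['patterns'].apply(lambda text: ' '.join(t for t in word_tokenize(text) if t not in string.punctuation))

# 3. Strip leftover punctuation characters (pandas >= 2.0 needs regex=True for this
#    pattern to act as a regex; as written it is matched literally)
df['patterns'] = df['patterns'].str.replace(r'[^\w\s]', '')

# 4. Drop tokens made up only of digits
df['patterns'] = df['patterns'].apply(lambda text: ' '.join(w for w in text.split() if not w.isdigit()))

# 5. Drop English stop words
df['patterns'] = df['patterns'].apply(lambda text: ' '.join(w for w in text.split() if w not in stop))

# 6. Lemmatize each remaining word with TextBlob
df['patterns'] = df['patterns'].apply(lambda text: ' '.join(Word(w).lemmatize() for w in text.split()))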

In [15]:
df['patterns']

0                       hi one talk hello hi available
1                 bye see later goodbye come back soon
2        thanks helping thank guidance 's helpful kind
3       hour open tell opening time open timing please
4    pay using credit card pay using mastercard pay...
Name: patterns, dtype: object

In [19]:
# Split each cleaned pattern string into a list of tokens: one list per intent

bigger_list = [text.split() for text in df['patterns']]

bigger_list

[['hi', 'one', 'talk', 'hello', 'hi', 'available'],
 ['bye', 'see', 'later', 'goodbye', 'come', 'back', 'soon'],
 ['thanks', 'helping', 'thank', 'guidance', "'s", 'helpful', 'kind'],
 ['hour', 'open', 'tell', 'opening', 'time', 'open', 'timing', 'please'],
 ['pay',
  'using',
  'credit',
  'card',
  'pay',
  'using',
  'mastercard',
  'pay',
  'using',
  'cash']]

In [21]:
# Train Word2Vec on the custom corpus; min_count=1 keeps words that occur only once

model = Word2Vec(bigger_list, min_count = 1, workers = 4)

print(model)

Word2Vec<vocab=32, vector_size=100, alpha=0.025>
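The `vocab=32` figure and the need for `min_count = 1` can both be checked by counting words directly. The sketch below uses only the standard library and the token lists shown earlier; `surviving_vocab` is a hypothetical helper that mimics the frequency cut-off, not a gensim function.

```python
from collections import Counter

# The tokenised patterns printed by the earlier cell.
corpus = [
    ['hi', 'one', 'talk', 'hello', 'hi', 'available'],
    ['bye', 'see', 'later', 'goodbye', 'come', 'back', 'soon'],
    ['thanks', 'helping', 'thank', 'guidance', "'s", 'helpful', 'kind'],
    ['hour', 'open', 'tell', 'opening', 'time', 'open', 'timing', 'please'],
    ['pay', 'using', 'credit', 'card', 'pay', 'using', 'mastercard',
     'pay', 'using', 'cash'],
]

counts = Counter(word for sentence in corpus for word in sentence)


def surviving_vocab(min_count):
    """Words that occur at least min_count times, in alphabetical order."""
    return sorted(word for word, n in counts.items() if n >= min_count)


print(len(surviving_vocab(1)))   # every distinct word
print(surviving_vocab(2))        # only the repeated words
print(surviving_vocab(5))        # gensim's default min_count
```

No word in this corpus occurs five times, so gensim's default `min_count=5` would leave an empty vocabulary; lowering it to 1 keeps all 32 distinct words.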


In [25]:
# Save the trained model; .save() writes gensim's own format whatever the file extension

model.save('Word2Vec.model')

model.save('model.bin')

In [23]:
new_model = Word2Vec.load('model.bin')

In [24]:
vocab = list(new_model.wv.key_to_index)

vocab

['using',
 'pay',
 'hi',
 'open',
 'later',
 'soon',
 'back',
 'come',
 'goodbye',
 'bye',
 'see',
 'helping',
 'available',
 'hello',
 'talk',
 'one',
 'thanks',
 'cash',
 'thank',
 'mastercard',
 "'s",
 'helpful',
 'kind',
 'hour',
 'tell',
 'opening',
 'time',
 'timing',
 'please',
 'credit',
 'card',
 'guidance']

In [26]:
similar_words = new_model.wv.most_similar('kind')

print(similar_words)

[('one', 0.19613029062747955), ('pay', 0.18879325687885284), ('guidance', 0.14262470602989197), ('credit', 0.13661333918571472), ('hour', 0.10765139758586884), ('see', 0.09932278841733932), ('bye', 0.07770184427499771), ('opening', 0.0754380002617836), ('helpful', 0.06751954555511475), ('mastercard', 0.04943053424358368)]
