# Developing a word2vec model

Objective:

  To understand the process of training and using a word2vec model to capture semantic meaning from text data.

Background:

  Word2Vec is a popular method for representing words as dense numeric vectors. It learns each word's vector from the contexts in which the word appears in a text corpus, so words used in similar contexts end up with similar vectors. This exercise shows how to train a Word2Vec model and use it for NLP tasks.

Scenario:

  Imagine you are working for an e-learning company, "Blue Data EdTech", which wants to recommend relevant courses to its users based on the content of courses they've previously enrolled in. You decide to use Word2Vec to understand the content of courses and then make relevant recommendations.

Dataset:

  The dataset consists of course descriptions from various subjects offered by "Blue Data EdTech". Each course description is about 100-200 words long, giving a brief overview of what the course covers.

Data Cleaning and Pre-processing:
- Load the dataset into a suitable data structure (e.g., pandas DataFrame).
- Clean the data by removing special characters and numbers and converting all text to lowercase.
- Tokenize the cleaned data.
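
The cleaning and tokenizing steps above can be sketched on a single made-up description. This is a minimal illustration using only the standard library; the sample sentence and the function name `preprocess` are assumptions, not part of the dataset. It strips digits as well as punctuation, as the task statement asks, and replaces removed characters with a space so that hyphenated words split into separate tokens.

```python
import re

def preprocess(text):
    """Lowercase, strip everything except letters and whitespace, then tokenize."""
    text = text.lower()
    # Replace non-letters with a space so "hands-on" becomes two tokens
    text = re.sub(r"[^a-z\s]", " ", text)
    # split() with no arguments collapses runs of whitespace
    return text.split()

# Hypothetical course description, used only for illustration
sample = "Intro to Python 3: hands-on Data Analysis (beginner level)!"
tokens = preprocess(sample)
print(tokens)
```

The result is a flat list of lowercase word tokens, which is the per-document shape Gensim expects: one list of tokens per course description.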

Train a Word2Vec Model:
- Use the Gensim library to train a Word2Vec model on the tokenized course descriptions.
- Experiment with different hyperparameters (e.g., vector size, window size, min_count) to optimize your model.
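
One way to organize the hyperparameter experiments is to enumerate a small grid and train one model per combination. The sketch below is an assumption about how this could be done, not a prescribed procedure: the helper names `param_grid` and `train_models` and the candidate values are hypothetical, while `vector_size`, `window`, `min_count`, `epochs`, `seed`, and `workers` are Gensim 4 `Word2Vec` parameters.

```python
from itertools import product

def param_grid(vector_sizes=(10, 50), windows=(3, 5), min_counts=(1, 2)):
    """Return every combination of the given hyperparameter values as dicts."""
    return [
        {"vector_size": v, "window": w, "min_count": m}
        for v, w, m in product(vector_sizes, windows, min_counts)
    ]

def train_models(sentences, grid, epochs=20, seed=42):
    """Train one Word2Vec model per parameter set; returns (params, model) pairs."""
    # Imported here so the grid helper above works without gensim installed
    from gensim.models import Word2Vec
    results = []
    for params in grid:
        # A fixed seed and a single worker reduce run-to-run variation
        model = Word2Vec(sentences, epochs=epochs, seed=seed, workers=1, **params)
        results.append((params, model))
    return results

grid = param_grid()
print(len(grid), "configurations, first:", grid[0])
```

Given a tokenized corpus (a list of token lists), `train_models(corpus, param_grid())` would return eight models whose `model.wv.most_similar(...)` results can be compared side by side. Note that with `min_count=2`, words that occur only once are dropped from the vocabulary, so looking them up raises a `KeyError`.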

Model Evaluation:
- Use the model to find the most similar words to some test words (e.g., "programming", "data", "design").
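
For a single query word, Gensim's `most_similar` ranks the other vocabulary words by cosine similarity to the query word's vector. The sketch below reproduces that ranking in plain Python so the scores are easier to interpret. The three-dimensional vectors are invented for illustration and do not come from a trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=3):
    """Rank all other words by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scores = [(w, cosine(query, vec)) for w, vec in vectors.items() if w != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical 3-dimensional embeddings, invented for illustration
toy_vectors = {
    "programming": [0.9, 0.1, 0.0],
    "python": [0.8, 0.2, 0.1],
    "design": [0.1, 0.9, 0.2],
    "photoshop": [0.0, 0.8, 0.3],
}
ranked = most_similar("programming", toy_vectors)
print(ranked)
```

A score near 1 means two vectors point in almost the same direction; a score near 0 means they are close to orthogonal. The score says nothing about vector length.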


In [3]:
import pandas as pd
import re
import numpy as np

In [6]:
# load dataset
df = pd.read_csv("Course_recommendation.csv")
df.shape

(56, 3)

In [None]:
df.head()

In [None]:
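def get_corpus(ps):
  # Combine the domain, title, and description of one course (row) into a single string
  text = ps['Domain'] + " " + ps["Course Title"] + " " + ps['Course Description']
  return text

# One combined string per course
corpus = [get_corpus(row) for _, row in df.iterrows()]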
corpus


In [None]:
def clean_data(doc):
  # Drop every character that is not a letter, digit, or whitespace
  doc = re.sub(r"[^a-zA-Z0-9\s]", "", doc)
  doc = doc.strip()
  return doc

corpus = [clean_data(doc) for doc in corpus]
corpus

In [44]:
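from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Lowercase each document and split it into tokens on whitespace
corpus = [doc.lower().split() for doc in corpus]

# ENGLISH_STOP_WORDS is a frozenset, so membership tests are fast
corpus = [[w for w in doc if w not in ENGLISH_STOP_WORDS] for doc in corpus]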

## Developing a word2vec model using gensim

In [45]:
from gensim import models

In [59]:
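# 10-dimensional vectors, context window of 3, keep every word (min_count=1), 20 training passes
model = models.Word2Vec(corpus, vector_size=10, window=3, min_count=1, epochs=20)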

In [60]:
model

<gensim.models.word2vec.Word2Vec at 0x7f9e300cb8b0>

In [61]:
model.wv['python']

array([ 0.06664161, -0.09053505, -0.03858478,  0.00308311, -0.09070887,
       -0.01679079,  0.06406178,  0.06560419, -0.03008008, -0.0251871 ],
      dtype=float32)

In [63]:
model.wv.most_similar("design")

[('photoshop', 0.867955207824707),
 ('neural', 0.8625140190124512),
 ('data', 0.8281702995300293),
 ('art', 0.8266161680221558),
 ('cloud', 0.7981105446815491),
 ('covers', 0.7965190410614014),
 ('spark', 0.7853581309318542),
 ('funding', 0.7693154811859131),
 ('variables', 0.7692983150482178),
 ('learn', 0.7668612003326416)]

In [64]:
model.wv.most_similar("programming")

[('started', 0.9157686233520508),
 ('protect', 0.8068074584007263),
 ('seo', 0.805827260017395),
 ('techniques', 0.7980539202690125),
 ('power', 0.7945994138717651),
 ('concepts', 0.7859807014465332),
 ('optimization', 0.78365159034729),
 ('resource', 0.7811184525489807),
 ('include', 0.7451245784759521),
 ('world', 0.7399213910102844)]
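
To get from word vectors to the course recommendations described in the scenario, one common approach is to average the word vectors of each course description and rank the other courses by cosine similarity. This is an assumption about how the recommendation step could work, not something the notebook implements. The sketch uses made-up two-dimensional vectors and course names, and the helper names are hypothetical. With a trained model, `model.wv` could be passed as `vectors`, since it supports both `word in model.wv` and `model.wv[word]`.

```python
import math

def mean_vector(tokens, vectors):
    """Average the vectors of the tokens that are in the vocabulary."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None  # no known words, so no vector for this document
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

def unit(vec):
    """Scale a vector to length 1 so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def recommend(course, course_tokens, vectors, topn=2):
    """Return the `topn` other courses closest to `course`."""
    doc_vecs = {}
    for name, tokens in course_tokens.items():
        mean = mean_vector(tokens, vectors)
        if mean is not None:
            doc_vecs[name] = unit(mean)
    query = doc_vecs[course]
    scores = [
        (name, sum(a * b for a, b in zip(query, vec)))
        for name, vec in doc_vecs.items() if name != course
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical word vectors and tokenized course descriptions
toy_vectors = {"python": [1.0, 0.1], "code": [0.9, 0.2],
               "color": [0.1, 1.0], "layout": [0.2, 0.9]}
courses = {
    "Intro to Python": ["python", "code"],
    "Advanced Coding": ["code", "python", "layout"],
    "Graphic Design": ["color", "layout"],
}
recommendations = recommend("Intro to Python", courses, toy_vectors)
print(recommendations)
```

Averaging ignores word order and gives every token equal weight, so it is a baseline rather than a finished recommender. Its quality also depends directly on the quality of the underlying word vectors.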