# <span style="color:#800000"> Fitting a Word2Vec Model on the IkoKaziKE data</span>

<span style="color:orange">Here, I load the data, check for missing values, keep what I need and use the BOW model.</span>

In [1]:
#Import the necessary libraries
!pip install gensim
import re
import string 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from plotly import graph_objs as go
import nltk
import warnings
warnings.filterwarnings("ignore")
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from collections import Counter
from bs4 import BeautifulSoup



In [2]:
#import the dataset
df_tweet=pd.read_csv("twitter-job-hunter-chatbot.csv")
df_tweet.head()

Unnamed: 0,Datetime,Text,Source,harsh tag,Favourite Count,Retweets,6,7,submitter_name
0,18/07/2020 11:46,"b""Ladies!!! #IkoKaziKE #IkoKazi Today, I reall...",b'Mimimkenya7',"['IkoKaziKE', 'IkoKazi']",,775.0,,,Brian Cheye
1,18/07/2020 11:18,b'#IkoKaziKe #IkoKazi Looking for a Audit Trai...,b'MtandaoPromoter',"['IkoKaziKe', 'IkoKazi']",,5.0,,,Brian Cheye
2,18/07/2020 09:08,b'@moneychapKE are an online crowdfunding and ...,b'KameneAndJalas',"['KameneAndJalas', 'ikoJob', 'TwendeKaziKe', '...",,15675.0,,,Brian Cheye
3,18/07/2020 09:07,b'We are an online crowdfunding and fundraisin...,b'moneychapKE',"['IkoKazi', 'IkoKaziKE']",,79.0,,,Brian Cheye
4,18/07/2020 08:46,b'Striding into the weekend with a productive ...,b'amunsoft',"['webdeveloper', 'business', 'IkoKazi', 'IkoKa...",,2.0,,,Brian Cheye


In [3]:
print(df_tweet.shape)

(29674, 9)


In [4]:
#df_tweet.info()

In [5]:
#Variable Data types
df_tweet.dtypes

Datetime            object
Text                object
Source              object
harsh tag           object
Favourite Count    float64
Retweets           float64
6                  float64
7                  float64
submitter_name      object
dtype: object

# Cleaning the Data

In [6]:
df_tweet.isnull().sum()

Datetime               0
Text                   0
Source               214
harsh tag          24953
Favourite Count      532
Retweets            2552
6                   6599
7                   6599
submitter_name         0
dtype: int64

Since `harsh tag` has so many missing values that dropping those rows would be unreasonable,
we fill its missing values with "no tag" and then drop the remaining NAs.

In [7]:
df_tweet["harsh tag"] = df_tweet["harsh tag"].fillna("no tag")  # assign back instead of chained inplace

<span style="color:green">We can now drop the NA values without worrying about losing too many observations.</span>

In [8]:
df_tweet.dropna(inplace=True)
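
To see why the column is filled first, here is a minimal sketch on a made-up three-row frame (the values are hypothetical; only the column names follow the dataset). Dropping NAs straight away would discard every row that merely lacks a tag, whereas filling `harsh tag` first keeps them.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; only the column names mirror the real data
toy = pd.DataFrame({
    "Text": ["a", "b", "c"],
    "harsh tag": ["['IkoKazi']", np.nan, np.nan],
    "Retweets": [1.0, 2.0, np.nan],
})

# Dropping immediately keeps only rows with no missing value at all
rows_if_dropped_first = len(toy.dropna())

# Filling the sparse column first preserves rows that merely lack a tag
filled = toy.copy()
filled["harsh tag"] = filled["harsh tag"].fillna("no tag")
rows_if_filled_first = len(filled.dropna())

print(rows_if_dropped_first, rows_if_filled_first)
```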

In [9]:
#confirm if there are still na values
df_tweet.isnull().sum()

Datetime           0
Text               0
Source             0
harsh tag          0
Favourite Count    0
Retweets           0
6                  0
7                  0
submitter_name     0
dtype: int64

In [10]:
df_tweet.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22865 entries, 4047 to 27121
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Datetime         22865 non-null  object 
 1   Text             22865 non-null  object 
 2   Source           22865 non-null  object 
 3   harsh tag        22865 non-null  object 
 4   Favourite Count  22865 non-null  float64
 5   Retweets         22865 non-null  float64
 6   6                22865 non-null  float64
 7   7                22865 non-null  float64
 8   submitter_name   22865 non-null  object 
dtypes: float64(4), object(5)
memory usage: 1.7+ MB


In [11]:
#check if we have any duplicates
print(df_tweet.duplicated().any())

True


In [12]:
#drop the duplicates
df_tweet.drop_duplicates(keep=False,inplace=True)

In [13]:
#Confirm if the duplicates have been dropped
print(df_tweet.duplicated().any())

False
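
Note that `keep=False` discards every row that has a duplicate, including its first occurrence. A small sketch with hypothetical data contrasts it with the pandas default, `keep='first'`:

```python
import pandas as pd

# Hypothetical toy data with one repeated row
toy = pd.DataFrame({"text": ["hiring", "hiring", "vacancy"]})

# keep=False removes all copies of a duplicated row
dropped_all = toy.drop_duplicates(keep=False)

# keep='first' retains one copy of each duplicated row
kept_first = toy.drop_duplicates(keep="first")

print(len(dropped_all), len(kept_first))
```

Which variant is appropriate depends on whether repeated tweets should count once or not at all.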


In [14]:
df_tweet.describe()

Unnamed: 0,Favourite Count,Retweets,6,7
count,22807.0,22807.0,22807.0,22807.0
mean,382.4571,5.172535,356.0263,147.239225
std,10310.95,82.011641,10306.85,1213.919912
min,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0
50%,1.0,0.0,0.0,0.0
75%,2.0,2.0,0.0,0.0
max,1496504.0,4167.0,1496395.0,28407.0


# <u><span style="color:orange"> Dealing with outliers </span></u>

In [15]:
from scipy import stats
num = ['Retweets']

for i, col in enumerate(num):
    z=np.abs(stats.zscore(df_tweet[col]))
print(z)

[0.06307212 0.06307212 0.05087846 ... 0.05087846 0.05087846 0.06307212]


In [16]:
#subset the data with observations whose z score is less than 3
df1=df_tweet[(z<3)].copy()   # copy so later column edits do not act on a view of df_tweet
print(f"Data frame with outliers had : {df_tweet.shape[0]}")
print(f"Data frame without outliers has : {df1.shape[0]}")


Data frame with outliers had : 22807
Data frame without outliers has : 22768
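
For reference, the z-score rule used above can be written out with NumPy alone. The sample values below are made up; `scipy.stats.zscore` uses the population standard deviation (`ddof=0`) by default, which is also what `np.std` computes.

```python
import numpy as np

# Hypothetical retweet counts with one extreme value
retweets = np.array([0.0] * 10 + [1.0] * 9 + [500.0])

# z = |x - mean| / std, using the population standard deviation (ddof=0)
z = np.abs((retweets - retweets.mean()) / retweets.std())

# Keep observations within 3 standard deviations of the mean
mask = z < 3
kept = retweets[mask]

print(len(retweets), len(kept))
```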


In [17]:
#remove spaces and convert the variable names to lower
df1.columns=df1.columns.str.lower().str.strip()

In [18]:
#Clean the text: lowercase, remove punctuation and white space
#create a function then use it to clean the text columns
def clean_text(text):
    '''Make text lowercase, remove text in square brackets, remove links, remove punctuation
    and remove words containing numbers.'''
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)                # text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # links
    text = re.sub(r'<.*?>+', '', text)                 # HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub('\n', '', text)                      # newlines
    text = re.sub(r'\w*\d\w*', '', text)               # words containing digits

    # NOTE: punctuation (including apostrophes) and digits are already gone here, so every
    # rule below except " whats " has nothing left to match unless it is moved above the punctuation step.
    text = re.sub(r"'s", " ", text)  # "Sam is" and "Sam's" (his) can't be told apart, so "'s" is simply dropped
    text = re.sub(r" whats ", " what is ", text, flags=re.IGNORECASE)
    text = re.sub(r"'ve", " have ", text)
    text = re.sub(r"can't", "can not", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am", text, flags=re.IGNORECASE)
    text = re.sub(r"'re", " are ", text)
    text = re.sub(r"'d", " would ", text)
    text = re.sub(r"'ll", " will ", text)
    text = re.sub(r"e\.g\.", " eg ", text, flags=re.IGNORECASE)
    text = re.sub(r"(\d+)[kK]", r" \g<1>000 ", text)   # [kK] matches either case; (kK) only matched the literal "kK"
    text = re.sub(r"e-mail", " email ", text, flags=re.IGNORECASE)
    text = re.sub(r"\(s\)", " ", text, flags=re.IGNORECASE)
    text = re.sub(r"[c-fC-F]:/", " disk ", text)
    return text
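
The order of the substitutions matters: once punctuation is stripped, apostrophes are gone, which is why a token such as `im` shows up in the cleaned tweets. The sketch below uses two hypothetical helpers, `expand_then_strip` and `strip_then_expand`, to compare both orderings on two of the rules. It is an illustration only, not the cleaning applied in this notebook.

```python
import re
import string

PUNCT_PATTERN = '[%s]' % re.escape(string.punctuation)

def expand_then_strip(text):
    """Expand two contractions first, then strip punctuation."""
    text = str(text).lower()
    text = re.sub(r"can't", "can not", text)
    text = re.sub(r"i'm", "i am", text)
    # Punctuation is removed only after the apostrophes have been used
    return re.sub(PUNCT_PATTERN, '', text)

def strip_then_expand(text):
    """Same rules in the notebook's order: punctuation goes first."""
    text = str(text).lower()
    text = re.sub(PUNCT_PATTERN, '', text)
    # The apostrophes are already gone, so these rules find nothing
    text = re.sub(r"can't", "can not", text)
    return re.sub(r"i'm", "i am", text)

sample = "I'm hiring, can't wait!"
print(expand_then_strip(sample))   # i am hiring can not wait
print(strip_then_expand(sample))   # im hiring cant wait
```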


In [19]:
df1['text']=df1['text'].apply(clean_text)
df1['source']=df1['source'].apply(clean_text)

In [20]:
df1.head()

Unnamed: 0,datetime,text,source,harsh tag,favourite count,retweets,6,7,submitter_name
4047,2020-07-18 10:16:18+00:00,job seekers free cover letter template ikok...,simoningari,no tag,1.0,0.0,4342.0,4994.0,Dennis Mwaniki
4048,2020-07-18 10:09:08+00:00,get the latest and best car shade and all cust...,simonmagak,no tag,0.0,0.0,28.0,92.0,Dennis Mwaniki
4049,2020-07-18 10:05:52+00:00,join jobalertke for latest updates on ikok...,simoningari,no tag,2.0,1.0,4342.0,4994.0,Dennis Mwaniki
4050,2020-07-18 10:05:22+00:00,hey people im looking for a farm manager with ...,vidolebaridi,no tag,0.0,0.0,141.0,485.0,Dennis Mwaniki
4051,2020-07-18 10:05:16+00:00,job hunting can be emotionally draining to en...,simoningari,no tag,0.0,0.0,4342.0,4994.0,Dennis Mwaniki


In [21]:
#use a subset of what I need.
df_w2v = df1[["datetime" ,"text","source","harsh tag","retweets"]].copy()  #get rid of favourite count, 6, 7 and submitter_name

In [22]:
df_w2v.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22768 entries, 4047 to 27121
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   datetime   22768 non-null  object 
 1   text       22768 non-null  object 
 2   source     22768 non-null  object 
 3   harsh tag  22768 non-null  object 
 4   retweets   22768 non-null  float64
dtypes: float64(1), object(4)
memory usage: 1.0+ MB


In [23]:
df_w2v.dtypes

datetime      object
text          object
source        object
harsh tag     object
retweets     float64
dtype: object

# <span style="color:maroon">Word2Vec Model </span>

Here we implement the Word2Vec model using the Gensim library.

Procedure:
<br> 0. Create/identify your corpus
<br> 1. Tokenize (preprocessing)
<br> 2. Remove stopwords
<br> 3. Model with Word2Vec
<br> 4. Split test and train data
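
As a rough sketch of steps 0 to 2, the snippet below runs a tiny made-up corpus through tokenizing and stopword removal. To stay dependency-free it uses `str.split` and a hand-written stopword set as stand-ins for NLTK's `word_tokenize` and `stopwords.words("english")`; both stand-ins are assumptions for illustration.

```python
# 0. A hypothetical corpus of already-cleaned tweets
corpus = [
    "we are hiring a farm manager",
    "job whatsapp groups to join ikokazike",
]

# Stand-in stopword set (NLTK's English list is much longer)
STOPWORDS = {"we", "are", "a", "to"}

def tokenize_and_filter(sentence):
    # 1. Tokenize on whitespace
    tokens = sentence.split()
    # 2. Remove stopwords
    return [tok for tok in tokens if tok not in STOPWORDS]

sentences = [tokenize_and_filter(s) for s in corpus]
print(sentences)
```

A list of token lists like `sentences` is the input shape that Gensim's `Word2Vec` expects in step 3.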

In [24]:
#1.Tokenizing

nltk.download('punkt')
nltk.download('wordnet')


[nltk_data] Downloading package punkt to C:\Users\Eunice
[nltk_data]     Mutahi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\Eunice
[nltk_data]     Mutahi\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

# Creating the model

In [25]:
def prep_text(text):
    # 1. Remove HTML.
    my_text = BeautifulSoup(text, "html.parser").get_text()

    # 2. Tokenize words
    tokens = nltk.word_tokenize(my_text)

    # 3. Load the English stopwords as a set for fast membership tests
    stops = set(stopwords.words("english"))

    # 4. Remove the stopwords
    words = [x for x in tokens if x not in stops]

    return words

In [26]:
#Apply prep_text to every tweet
df_w2v['preped_text']=df_w2v['text'].apply(prep_text)

In [27]:
df_w2v.head(10)

Unnamed: 0,datetime,text,source,harsh tag,retweets,preped_text
4047,2020-07-18 10:16:18+00:00,job seekers free cover letter template ikok...,simoningari,no tag,0.0,"[job, seekers, free, cover, letter, template, ..."
4048,2020-07-18 10:09:08+00:00,get the latest and best car shade and all cust...,simonmagak,no tag,0.0,"[get, latest, best, car, shade, customised, ca..."
4049,2020-07-18 10:05:52+00:00,join jobalertke for latest updates on ikok...,simoningari,no tag,1.0,"[join, jobalertke, latest, updates, ikokazike,..."
4050,2020-07-18 10:05:22+00:00,hey people im looking for a farm manager with ...,vidolebaridi,no tag,0.0,"[hey, people, im, looking, farm, manager, spec..."
4051,2020-07-18 10:05:16+00:00,job hunting can be emotionally draining to en...,simoningari,no tag,0.0,"[job, hunting, emotionally, draining, ensure, ..."
4052,2020-07-18 09:59:06+00:00,job whatsapp groups to join ikokazike,simoningari,no tag,0.0,"[job, whatsapp, groups, join, ikokazike]"
4053,2020-07-18 09:47:12+00:00,i am looking for good carpenter to do some wor...,babananii,no tag,1.0,"[looking, good, carpenter, work, recommendatio..."
4054,2020-07-18 09:46:00+00:00,kot were still taking applications for regiona...,qsskenya,no tag,1.0,"[kot, still, taking, applications, regional, h..."
4055,2020-07-18 09:40:05+00:00,the east african community is looking for a h...,findjobskenya,no tag,0.0,"[east, african, community, looking, human, res..."
4056,2020-07-18 09:39:26+00:00,vacancy at miritini medical labtech qualificat...,hossanamiritini,no tag,3.0,"[vacancy, miritini, medical, labtech, qualific..."


In [28]:

from gensim.models import Word2Vec

In [29]:
#fit the model
word2vec=Word2Vec(df_w2v['preped_text'],min_count=2)      # Include only words that appear at least 2 times in the corpus
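
The effect of `min_count` can be illustrated without Gensim: tokens whose corpus frequency falls below the threshold are left out of the vocabulary. The sketch below mimics that filter with `collections.Counter` on hypothetical token lists; it only imitates the pruning rule, not Gensim's internals.

```python
from collections import Counter

# Hypothetical tokenized tweets
sentences = [
    ["job", "vacancy", "nairobi"],
    ["job", "hiring"],
    ["vacancy", "apply"],
]

min_count = 2

# Corpus-wide frequency of every token
freq = Counter(tok for sent in sentences for tok in sent)

# Tokens seen fewer than min_count times are discarded
vocab = sorted(tok for tok, n in freq.items() if n >= min_count)
print(vocab)
```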

In [30]:
#Let's view the vocabulary
#wv.vocab is the gensim 3.x attribute; gensim 4.0+ replaced it with wv.key_to_index
vocabulary=word2vec.wv.vocab
print(vocabulary)



In [49]:
# Count how often each non-stopword token occurs in a single tweet
def counting(text):
    # Reuse the preprocessing from prep_text (strip HTML, tokenize, drop stopwords)
    words = prep_text(text)

    # Counter tallies token frequencies; return a plain dict for display
    return dict(Counter(words))

        

In [50]:
#Count the word frequencies of every tweet
df_w2v['count_text']=df_w2v['text'].apply(counting)

In [51]:

print(df_w2v['count_text'])

4047     {'job': 1, 'seekers': 1, 'free': 1, 'cover': 1...
4048     {'get': 1, 'latest': 1, 'best': 1, 'car': 1, '...
4049     {'join': 1, 'jobalertke': 1, 'latest': 1, 'upd...
4050     {'hey': 1, 'people': 1, 'im': 1, 'looking': 1,...
4051     {'job': 3, 'hunting': 2, 'emotionally': 1, 'dr...
                               ...                        
27116    {'looking': 1, 'job': 1, 'opportunities': 1, '...
27117    {'looking': 1, 'job': 1, 'opportunities': 1, '...
27118    {'looking': 1, 'job': 1, 'opportunities': 1, '...
27119    {'looking': 1, 'job': 1, 'opportunities': 1, '...
27121       {'used': 1, 'original': 1, 'padsikokazike': 1}
Name: count_text, Length: 22768, dtype: object


In [53]:
#Get all counts
top = Counter([item for sublist in df_w2v['preped_text'] for item in sublist])  #counts the frequency of words
temp = pd.DataFrame(top.most_common(10))  #A dataframe of the top 10
temp.columns = ['Common_words','count']
temp.style.background_gradient(cmap='Blues')

Unnamed: 0,Common_words,count
0,ikokazike,12717
1,job,11281
2,vacancy,8493
3,hiring,4480
4,looking,4200
5,apply,4083
6,ikokazi,3636
7,jobs,3552
8,check,2681
9,opportunities,2485


# Model Analysis

Let's now explore what we created.

Finding Vectors for a Word
We know that the Word2Vec model converts words to their corresponding vectors. Let's see how we can view the vector representation of a particular word.

In [72]:
vec1=word2vec.wv['jobs']
print(vec1)

[ 5.40504038e-01  9.66072679e-01  9.84730124e-01  9.83429015e-01
 -7.59980381e-01  6.62668943e-01  5.87260962e-01  2.64047056e-01
 -5.49408235e-02  3.82870138e-02  3.75827610e-01 -9.79443714e-02
  1.53482902e+00  2.80314296e-01 -8.94509777e-02 -5.48211694e-01
 -2.57424384e-01  8.87459040e-01  1.03759892e-01  1.04219377e+00
  7.11032867e-01  2.20856738e+00  1.68405843e+00  9.24764201e-02
  8.11248004e-01  1.08494103e+00 -6.92745205e-03  5.36264718e-01
 -7.67842889e-01 -7.40501344e-01  1.16757369e+00 -4.18461561e-01
  3.79425108e-01 -3.14091504e-01  2.59481072e-01 -4.78525579e-01
  1.97310090e-01 -1.37005734e+00  6.91328764e-01 -7.16852784e-01
  6.13990784e-01 -1.16432154e+00  7.37183571e-01  3.88904437e-02
  1.07252026e+00 -1.00167476e-01  1.58698082e-01  2.01555419e+00
  2.52359182e-01  3.73059750e-01  1.85973632e+00  3.34861815e-01
 -7.77836800e-01  1.20185447e+00  1.39400281e-03 -6.08495414e-01
 -6.80903912e-01  3.49670887e-01 -2.65605241e-01  9.40679431e-01
  1.95879757e+00  3.53820

Finding Similar Words

Word2Vec is meant to preserve contextual information about words. We can check this by finding the words most similar to the word "job".

In [54]:
similar=word2vec.wv.most_similar('job')
print(similar)

[('diss', 0.8201165199279785), ('assembler', 0.8180742263793945), ('vivo', 0.816642165184021), ('readvertisement', 0.8150718212127686), ('jobot', 0.8134506940841675), ('hts', 0.8108627796173096), ('npcpizzahut', 0.8107939958572388), ('treasurer', 0.8104172945022583), ('premisehealth', 0.810227632522583), ('sidian', 0.8093194365501404)]
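
`most_similar` ranks words by the cosine similarity between their vectors. A minimal NumPy sketch of that ranking, on made-up 3-dimensional vectors (Gensim's default is 100 dimensions), might look like this. The function names and vectors are hypothetical.

```python
import numpy as np

# Hypothetical word vectors
vectors = {
    "job": np.array([1.0, 0.0, 0.0]),
    "jobs": np.array([0.9, 0.1, 0.0]),
    "car": np.array([0.0, 1.0, 0.0]),
}

def cosine(u, v):
    # Dot product divided by the product of the vector lengths
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_similar(word, topn=2):
    # Score every other word against the query and sort descending
    scores = [(w, cosine(vectors[word], v)) for w, v in vectors.items() if w != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

ranked = most_similar("job")
print(ranked)
```

Scores close to 1 mean the two vectors point in nearly the same direction. A score near 0 means they are close to orthogonal.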
