# Session 2 - Word representation

Task 1

Generalize all the preprocessing tasks into one single function that can be used in the Vectorizer (pass booleans so that each preprocessing step can be activated or deactivated)

Task 2

Research the Hashing Vectorizer: what advantages and disadvantages does it offer? Implement it for your project (use TF-IDF for the rest)


Task 3

Research and create a presentation of the Latent Dirichlet Allocation (LDA) model.

Task 4

Implement your research model in order to predict industries on your dataset

Task 5

Learn and implement techniques to evaluate your model [Use sklearn](https://scikit-learn.org/stable/modules/model_evaluation.html)

Task 6 (Optional)

Create a WordCloud for each cluster predicted by your model

In [None]:
import pandas as pd
import random
import regex as re
import unicodedata
import nltk
import spacy
import string
from sklearn.feature_extraction.text import HashingVectorizer
!python -m spacy download en_core_web_sm >> /dev/null
!pip install gensim



In [None]:
dataset = pd.read_csv('employer_raw_data.csv')

In [None]:

old_sentences = dataset['description'].values

In [None]:
# Preprocessing function

def get_preprocessing_function(
    use_lower: bool = True,
    use_alpha: bool = True,
    use_stemming: bool = False,
    use_lemmatization: bool = True,
    punctuation: bool = True,
    numbers: bool = True,
    url: bool = True
):
    # NOTE: use_lemmatization is accepted but no lemmatization step is applied below.
    # The inner helpers are named remove_* so that they do not shadow the boolean
    # flags `punctuation` and `numbers` (shadowing made those flags always truthy).

    # Build reusable objects once instead of on every call
    punctuation_table = str.maketrans("", "", string.punctuation)
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    stemmer = nltk.stem.PorterStemmer()

    # Make lowercase
    def lower(text: str):
        return text.lower() if use_lower else text

    # Remove URLs
    def remove_urls(text: str):
        return url_pattern.sub('', text) if url else text

    # Remove punctuation
    def remove_punctuation(text: str):
        return text.translate(punctuation_table) if punctuation else text

    # Remove digits
    def remove_numbers(text: str):
        return ''.join(ch for ch in text if not ch.isdigit()) if numbers else text

    # Replace every run of characters other than a-z with a single space
    # (this also drops uppercase letters, so it assumes lowercased input)
    def alpha(text: str):
        return re.sub("[^a-z]+", " ", text) if use_alpha else text

    # Apply Porter stemming to each whitespace-separated token
    def stemming(text: str):
        return " ".join(stemmer.stem(word) for word in text.split()) if use_stemming else text

    def preprocess(text: str):
        # Order of processing steps: URLs are removed before punctuation,
        # otherwise the URL pattern no longer matches
        steps = [
            lower,
            remove_urls,
            remove_punctuation,
            remove_numbers,
            alpha,
            stemming
        ]
        for step in steps:
            text = step(text)
        return text

    return preprocess

In [None]:
# Define preprocess function and apply to random sentence

preprocess = get_preprocessing_function(
    use_lower = True,
    use_alpha = True,
    use_stemming = True,
    punctuation = True,
    numbers = True,
    url = True
)

sentence = random.choice(list(old_sentences))
processed_sentence = preprocess(sentence)

print(f"""
Non processed corpus:
{sentence}
------------------------
Processed corpus:
{processed_sentence}
""")


Non processed corpus:
STMicroelectronics is a global independent semiconductor company and a leader in developing and delivering semiconductor solutions across the spectrum of microelectronics applications. An unrivaled combination of silicon and system expertise, manufacturing strength, Intellectual Property (IP) portfolio, and strategic partners positions, STMicroelectronics is at the forefront of System-on-Chip ... STMicroelectronics Standard Products are a broad range of industry-standard and drop-in replacements for the most popular general-purpose analog ICs, discrete and serial EEPROMs. The Standard Products are manufactured to the highest quality standards with many AECQ-qualified for automotive applications. ... Mouser® and Mouser Electronics® are ... Cut Tape. Product is cut from a full reel tape into customized quantities. MouseReel™ (Add $7.00 reeling fee) A product reel is cut according to customer-specified quantities. All MouseReel orders are non-cancellable and non-ret

In [None]:
# Creating a new column for the clean description
from tqdm import tqdm
dataset["description"] = dataset["description"].fillna(".")
dataset["description"] = dataset["description"].astype(str)
clean_descriptions=[]
descriptions=dataset["description"].values 
for desc in tqdm(descriptions):
    clean_descriptions.append(preprocess(desc)) 
dataset['clean_description']=clean_descriptions

100%|██████████| 20000/20000 [04:35<00:00, 72.56it/s]


In [None]:

sentences = dataset['clean_description'].values

In [None]:
dataset['clean_description']=dataset['clean_description'].values

In [None]:
#copying data with clean sets to new dataset
dataset.to_csv('clean_set.csv', index=False)


Cleaning industry data

In [None]:
training=pd.read_csv('industry_data.csv')
training

training["description"] = training["description"].fillna(".")
training["description"] = training["description"].astype(str)
clean_descriptions=[]
descriptions=training["description"].values 
for desc in tqdm(descriptions):
    clean_descriptions.append(preprocess(desc)) 
training['clean_description']=clean_descriptions
training['clean_description']=training['clean_description'].values.astype(str)
training.to_csv('industry_data.csv', index=False)

100%|██████████| 13/13 [00:01<00:00, 10.99it/s]


In [None]:
# Defining stopwords
with open("stopwords.txt", "r") as f_in:
    stopwords = [i.strip().lower() for i in f_in.readlines()]

### Count Vectorizer

The count vectorizer returns an encoded vector whose length equals the size of the vocabulary; each entry is the integer count of how many times that word appears in the document.
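
To make the encoding concrete, here is a minimal pure-Python sketch of the same idea. It is not scikit-learn's implementation: the helper `count_vectorize` and the toy documents are hypothetical, and tokens are simply split on whitespace.

```python
from collections import Counter

def count_vectorize(documents):
    """Return (vocabulary, rows); each row holds the per-term counts of one document."""
    # Vocabulary: every distinct whitespace-separated token, sorted alphabetically
    # and mapped to a column index.
    all_terms = sorted({tok for doc in documents for tok in doc.split()})
    vocabulary = {term: idx for idx, term in enumerate(all_terms)}
    rows = []
    for doc in documents:
        row = [0] * len(vocabulary)
        for term, n in Counter(doc.split()).items():
            row[vocabulary[term]] = n
        rows.append(row)
    return vocabulary, rows

# Toy corpus (hypothetical, not taken from the dataset)
toy_docs = ["bank account bank", "care network", "bank care"]
vocabulary, rows = count_vectorize(toy_docs)
print(vocabulary)
print(rows)
```

Every row has one column per vocabulary term, so the vector length grows with the vocabulary.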

In [None]:
# Implementing Vectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Parameters that we can tune
NGRAM = (1, 1) # Unigrams only; widen the range to add features when context is needed
MIN_DF = 2 # Ignore terms that appear in fewer than 2 documents
MAX_DF = 0.4 # Ignore terms that appear in more than 40% of documents
MAX_FEATURES = 4000 # Maximum size of the vocabulary

count_vec = CountVectorizer(
    ngram_range = NGRAM,
    tokenizer = lambda s: s.split(),
    stop_words = stopwords,
    min_df = MIN_DF,
    max_df = MAX_DF,
    max_features = MAX_FEATURES    
)

In [None]:
# Fit the vectorizer and explore data
sample = dataset["clean_description"].sample(100)
count_vec.fit(sample)

print(count_vec.vocabulary_)

{'resourc': 81, 'care': 12, 'network': 61, 'medic': 56, 'rang': 76, 'improv': 41, 'counti': 22, 'benefit': 7, 'healthcar': 37, 'patient': 67, 'research': 80, 'hospit': 39, 'facil': 31, 'integr': 44, 'career': 13, 'opportun': 65, 'share': 88, 'nation': 60, 'region': 78, 'educ': 27, 'visit': 97, 'senior': 87, 'join': 46, 'bank': 6, 'onlin': 64, 'account': 0, 'place': 69, 'hour': 40, 'insur': 43, 'institut': 42, 'rate': 77, 'specialist': 92, 'privat': 74, 'client': 16, 'agenc': 1, 'local': 52, 'regist': 79, 'call': 10, 'assist': 4, 'applic': 3, 'emerg': 28, 'complet': 18, 'onli': 63, 'life': 48, 'hi': 38, 'children': 14, 'depart': 25, 'live': 50, 'scienc': 85, 'llc': 51, 'memori': 58, 'salari': 82, 'power': 70, 'control': 21, 'announc': 2, 'manufactur': 54, 'softwar': 90, 'price': 73, 'well': 98, 'know': 47, 'secur': 86, 'sourc': 91, 'ga': 35, 'data': 23, 'capit': 11, 'director': 26, 'partner': 66, 'ltd': 53, 'uk': 96, 'train': 95, 'sale': 83, 'associ': 5, 'invest': 45, 'limit': 49, 'citi

In [None]:

vector=count_vec.transform(sample)    
vector.todense() # Convert the sparse count matrix to a dense matrix for display

matrix([[0, 0, 0, ..., 1, 0, 0],
        [7, 0, 0, ..., 0, 0, 0],
        [1, 2, 0, ..., 1, 0, 0],
        ...,
        [0, 0, 1, ..., 0, 0, 2],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 2, 0, ..., 0, 0, 0]])

### Hashing Vectorizer

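Vocabularies can become large when using counts and frequencies. A hashing vectorizer instead applies a one-way hash to each word to turn it into a column index, so no vocabulary has to be stored and the vector length can be fixed to any size in advance. The drawbacks are that the encoding cannot be converted back to a word and that distinct words can collide in the same column when the number of features is small. By default, scikit-learn's HashingVectorizer also uses a signed hash and L2-normalizes each row, which is why the output below contains negative, non-integer values.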
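
The hashing idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not scikit-learn's implementation: the helper `hash_vectorize` is hypothetical, and `hashlib.md5` is used only because it is stable across runs (scikit-learn uses a signed MurmurHash3).

```python
import hashlib

def hash_vectorize(text, n_features=8):
    """Map a document to a fixed-length count vector without storing a vocabulary."""
    vector = [0] * n_features
    for token in text.split():
        # A stable hash of the token picks the column. md5 is an assumption
        # made for reproducibility, not the hash scikit-learn uses.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        column = int.from_bytes(digest[:4], "big") % n_features
        vector[column] += 1
    return vector

doc_vec = hash_vectorize("bank account bank care")
print(doc_vec)
```

The vector length stays at `n_features` regardless of how many distinct words appear, and nothing in `doc_vec` records which word produced which column.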

In [None]:
# Implementing Hashing Vectorizer

from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(n_features=20)
vector = vectorizer.transform(sample)
print(vector.shape)
print(vector.toarray())

(100, 20)
[[-0.05638839  0.09867968 -0.31013613 ... -0.09867968 -0.38062162
  -0.0140971 ]
 [-0.04950738 -0.09901475  0.42081271 ...  0.07426107 -0.19802951
   0.02475369]
 [-0.19014018  0.24718224  0.15211215 ...  0.26619626 -0.34225233
   0.        ]
 ...
 [-0.17156089  0.          0.19062321 ... -0.11437393 -0.17156089
   0.07624929]
 [ 0.16204746  0.13889782  0.34724455 ...  0.02314964  0.09259855
  -0.11574818]
 [-0.3243575   0.01621787  0.08108937 ...  0.3243575  -0.27570387
  -0.04865362]]


### Training dataset with LDA

TF-IDF (term frequency times inverse document frequency) is a word-frequency score that highlights the more interesting words, e.g. those that are frequent in one document but not across documents.
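
As a rough sketch of how such a score can be computed, the snippet below multiplies raw term counts by a smoothed IDF, ln((1 + n) / (1 + df)) + 1. The helper `tfidf` and the toy documents are hypothetical, and the per-document normalization that a full vectorizer would usually apply is left out.

```python
import math
from collections import Counter

def tfidf(documents):
    """Return one {term: score} dict per document, using a smoothed IDF."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + d)) + 1 for term, d in df.items()}
    # Score = raw count in the document times the term's IDF.
    return [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]

# Toy corpus (hypothetical, not taken from the dataset)
toy_docs = ["bank account bank", "bank care", "care network"]
scores = tfidf(toy_docs)
print(scores[0])
```

A term that occurs in a single document ("account") gets a higher IDF than one shared by several documents ("bank").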

In [None]:
import pandas as pd
training_df = pd.read_csv('industry_data.csv')
training_corpus = training_df['clean_description'].values.astype(str)
industry_names= training_df['industry'].values

employer_df = pd.read_csv("clean_set.csv")
employer_df = employer_df.drop(labels=["employers", "description"], axis=1)
corpus = employer_df['clean_description'].values.astype(str)

In [None]:
# Defining stopwords
with open("stopwords.txt", "r") as f_in:
    stopwords = [i.strip().lower() for i in f_in.readlines()]

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
NGRAM = (1, 1) # Unigrams only; widen the range to add features when context is needed
MIN_DF = 2 # Ignore terms that appear in fewer than 2 documents
MAX_DF = 0.4 # Ignore terms that appear in more than 40% of documents
#MAX_FEATURES = 4000 # Maximum size of the vocabulary

idf_vec = TfidfVectorizer(
    #ngram_range=NGRAM,
    tokenizer=lambda s: s.split(),
    stop_words=stopwords,
    min_df=MIN_DF,
    max_df=MAX_DF,
    #max_features=MAX_FEATURES,
    use_idf=True,
    smooth_idf=True
)

In [None]:
# Transform the industry data into vectors with the TF-IDF vectorizer

vector_idf = idf_vec.fit_transform(training_corpus)


In [None]:
#importing lda 
from sklearn.decomposition import LatentDirichletAllocation as LDA
lda = LDA(n_components=20,learning_method='online', learning_offset=40, n_jobs=-1)
#getting industry topics
industry_topics=lda.fit_transform(vector_idf)

In [None]:
# Transform the employer data into vectors with the fitted TF-IDF vectorizer;
# the LDA topics are computed from these vectors in the next cell
employer_vectors=idf_vec.transform(corpus)

In [None]:
employer_topics=lda.transform(employer_vectors)

In [None]:
print(len(employer_topics))

20000


In [None]:
import numpy as np
industry_prediction = []
for employer_vec in employer_topics:
    distances = []
    for industry_vec in industry_topics:
        # Measure how close the company topics are to this industry's topics
        distances.append(np.linalg.norm(industry_vec - employer_vec))
    # Pick the closest industry
    best_industry_index = np.argmin(distances)
    industry_prediction.append(industry_names[best_industry_index])
#print(len(industry_prediction))
#components=lda.components_
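
The nested loop above can also be written with NumPy broadcasting. The sketch below is one possible vectorized variant: the helper `nearest_rows` and the toy arrays are hypothetical stand-ins for `employer_topics`, `industry_topics` and `industry_names`.

```python
import numpy as np

def nearest_rows(queries, references):
    """For each query row, return the index of the closest reference row (Euclidean)."""
    # Broadcasting: (n_queries, 1, k) - (1, n_refs, k) -> (n_queries, n_refs, k)
    diffs = queries[:, None, :] - references[None, :, :]
    distances = np.linalg.norm(diffs, axis=2)
    return distances.argmin(axis=1)

# Toy topic distributions (hypothetical values, not taken from the dataset)
toy_industry_topics = np.array([[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]])
toy_industry_names = np.array(["banking", "healthcare", "software"])
toy_employer_topics = np.array([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])

best = nearest_rows(toy_employer_topics, toy_industry_topics)
print(toy_industry_names[best])
```

This builds an intermediate array of shape (employers, industries, topics), which trades memory for speed.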

In [None]:


employer_df["lda_prediction"] = industry_prediction

In [None]:
employer_df.to_csv('clean_set.csv', index=False)

### LDA Example

In [None]:
# Implementing Latent Dirichlet Allocation (LDA) model
#It builds a topic per document model and words per topic model, modeled as Dirichlet distributions.

import gensim
from gensim.test.utils import common_texts
from gensim.corpora.dictionary import Dictionary

# Create a corpus from a list of texts
common_dictionary = Dictionary(common_texts)
common_corpus = [common_dictionary.doc2bow(text) for text in common_texts]

lda = gensim.models.LdaModel(common_corpus, num_topics=10)

In [None]:
import io
import os.path
import re
import tarfile

import smart_open

def extract_documents(url='https://cs.nyu.edu/~roweis/data/nips12raw_str602.tgz'):
    with smart_open.open(url, "rb") as file:
        with tarfile.open(fileobj=file) as tar:
            for member in tar.getmembers():
                if member.isfile() and re.search(r'nipstxt/nips\d+/\d+\.txt', member.name):
                    member_bytes = tar.extractfile(member).read()
                    yield member_bytes.decode('utf-8', errors='replace')

docs = list(extract_documents())

In [None]:
# Tokenize the documents.
from nltk.tokenize import RegexpTokenizer

# Split the documents into tokens.
tokenizer = RegexpTokenizer(r'\w+')
for idx in range(len(docs)):
    docs[idx] = docs[idx].lower()  # Convert to lowercase.
    docs[idx] = tokenizer.tokenize(docs[idx])  # Split into words.

# Remove numbers, but not words that contain numbers.
docs = [[token for token in doc if not token.isnumeric()] for doc in docs]

# Remove words that are only one character.
docs = [[token for token in doc if len(token) > 1] for doc in docs]

In [None]:
# Lemmatize the documents.
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]

In [None]:
# Compute bigrams.
from gensim.models import Phrases

# Add bigrams to docs (only ones that appear 20 times or more).
bigram = Phrases(docs, min_count=20)
for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            docs[idx].append(token)

In [None]:
# Remove rare and common tokens.
from gensim.corpora import Dictionary

# Create a dictionary representation of the documents.
dictionary = Dictionary(docs)

# Filter out words that occur in fewer than 20 documents, or in more than 50% of the documents.
dictionary.filter_extremes(no_below=20, no_above=0.5)

In [None]:
# Bag-of-words representation of the documents.
corpus = [dictionary.doc2bow(doc) for doc in docs]

In [None]:
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

Number of unique tokens: 8644
Number of documents: 1740


In [None]:
# Train LDA model.
from gensim.models import LdaModel

# Set training parameters.
num_topics = 50
chunksize = 2000
passes = 20
#iterations = 400
eval_every = None  # Don't evaluate model perplexity, takes too much time.

# Make an index-to-word dictionary.
temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token

model = LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    num_topics=num_topics,
    passes=passes,
    eval_every=eval_every
)


In [None]:
top_topics = model.top_topics(corpus) #, num_words=20)

avg_topic_coherence = sum([t[1] for t in top_topics]) / num_topics
print('Average topic coherence: %.4f.' % avg_topic_coherence)

from pprint import pprint
pprint(top_topics)

Average topic coherence: -1.1779.
[([(0.007037914, 'gaussian'),
   (0.0063822237, 'matrix'),
   (0.0059578186, 'density'),
   (0.0048813866, 'noise'),
   (0.004702009, 'approximation'),
   (0.004515913, 'prior'),
   (0.00432641, 'bayesian'),
   (0.0042887274, 'solution'),
   (0.004117072, 'likelihood'),
   (0.004024778, 'mixture'),
   (0.0039797793, 'component'),
   (0.0036772664, 'log'),
   (0.003604146, 'estimate'),
   (0.003441111, 'rule'),
   (0.003437534, 'sample'),
   (0.0034183024, 'variance'),
   (0.0033686808, 'posterior'),
   (0.0031950378, 'field'),
   (0.0031777166, 'xi'),
   (0.0029016682, 'optimal')],
  -0.9087692348419095),
 ([(0.020912563, 'neuron'),
   (0.017845048, 'cell'),
   (0.00756838, 'spike'),
   (0.007193502, 'response'),
   (0.006995943, 'synaptic'),
   (0.0067574014, 'activity'),
   (0.005945886, 'stimulus'),
   (0.0058641955, 'firing'),
   (0.0047210045, 'connection'),
   (0.0045576114, 'cortex'),
   (0.0043599596, 'field'),
   (0.0042713624, 'visual'),
   (
