## Bag of Words

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
class Category:
    language = 'LANGUAGE'
    back_end = 'BACK_END'

train_sample_X = ['I am good in python scripting language',
                  'I have expertize in julia and R languages',
                  'I have knowledge of mongodb database',
                  'I have 5.5 years of work experience in docker and docker hub']

train_sample_y = [Category.language, Category.language,
                  Category.back_end, Category.back_end]

* `ngram_range = (1, 2)` makes the vectorizer count both single words (unigrams) and pairs of adjacent words (bigrams) as features
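As a rough illustration of what `ngram_range` changes, the sketch below fits two vectorizers on a single training sentence and compares their vocabularies. The variable names are illustrative and not part of the notebook; it assumes scikit-learn's default tokenizer, which drops one-character tokens such as `I`.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ['I have knowledge of mongodb database']

# Unigrams only (the default) versus unigrams plus bigrams.
unigram_vectorizer = CountVectorizer(ngram_range=(1, 1)).fit(sentences)
both_vectorizer = CountVectorizer(ngram_range=(1, 2)).fit(sentences)

# vocabulary_ maps each feature string to its column index.
unigram_features = sorted(unigram_vectorizer.vocabulary_)
both_features = sorted(both_vectorizer.vocabulary_)

print(unigram_features)
print(both_features)
```

With `(1, 2)` the vocabulary contains every unigram plus each pair of adjacent tokens, so the feature matrix gets wider as the range grows.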

In [3]:
vectorizer = CountVectorizer(ngram_range = (1,2))
train_vectors_X = vectorizer.fit_transform(train_sample_X)

In [4]:
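# Note: get_feature_names() was removed in scikit-learn 1.2;
# on newer versions use vectorizer.get_feature_names_out() instead.
print(vectorizer.get_feature_names())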

['am', 'am good', 'and', 'and docker', 'and languages', 'database', 'docker', 'docker and', 'docker hub', 'experience', 'experience in', 'expertize', 'expertize in', 'good', 'good in', 'have', 'have expertize', 'have knowledge', 'have years', 'hub', 'in', 'in docker', 'in julia', 'in python', 'julia', 'julia and', 'knowledge', 'knowledge of', 'language', 'languages', 'mongodb', 'mongodb database', 'of', 'of mongodb', 'of work', 'python', 'python scripting', 'scripting', 'scripting language', 'work', 'work experience', 'years', 'years of']


In [5]:
train_vectors_X.toarray()

array([[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0,
        0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0],
       [0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0,
        1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,
        0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 1, 1, 0, 0, 2, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1]],
      dtype=int64)

### Build a small SVM classifier

In [6]:
from sklearn import svm

In [7]:
svm_model = svm.SVC(kernel = 'linear')

In [8]:
svm_model.fit(X = train_vectors_X,
              y = train_sample_y)

SVC(kernel='linear')

In [9]:
sample_test_sentences = ['I am good in C++ language',
                         'I have knowledge in MySQL database']

In [10]:
test_samples = vectorizer.transform(sample_test_sentences)
svm_model.predict(test_samples)

array(['LANGUAGE', 'BACK_END'], dtype='<U8')

* Bag of words handles words outside the training vocabulary poorly: they are simply ignored at prediction time
* The exact same word must appear in the training data; even a small change in a word (for example, `languages` instead of `language`) produces a different feature, so predictions suffer
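The limitation above can be sketched without scikit-learn. The helper functions below are a simplified stand-in for `CountVectorizer` (lowercasing, tokens of two or more word characters), not its actual implementation, and the names `tokenize` and `bag_of_words` are made up for this example.

```python
import re
from collections import Counter

def tokenize(text):
    # Simplified stand-in for the default tokenizer:
    # lowercase, keep runs of two or more word characters.
    return re.findall(r'\b\w\w+\b', text.lower())

train_sentences = ['I am good in python scripting language',
                   'I have knowledge of mongodb database']

# The vocabulary is fixed at training time.
vocabulary = sorted({tok for s in train_sentences for tok in tokenize(s)})

def bag_of_words(text):
    # Count only the words that exist in the training vocabulary.
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

seen = bag_of_words('python language')          # both words are known
unseen = bag_of_words('javascript languages')   # neither word is known

print(sum(seen), sum(unseen))
```

The second sentence maps to an all-zero vector, so the classifier receives no signal from it even though `languages` differs from a training word by one letter.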

## Word Vectors

In [11]:
import spacy

nlp = spacy.load('en_core_web_lg')

In [12]:
doc = [nlp(sentences) for sentences in train_sample_X]
doc

[I am good in python scripting language,
 I have expertize in julia and R languages,
 I have knowledge of mongodb database,
 I have 5.5 years of work experience in docker and docker hub]

* Below are the word embeddings for each token of the first sentence

In [13]:
[sentences.vector for sentences in doc[0]]

[array([ 1.8733e-01,  4.0595e-01, -5.1174e-01, -5.5482e-01,  3.9716e-02,
         1.2887e-01,  4.5137e-01, -5.9149e-01,  1.5591e-01,  1.5137e+00,
        -8.7020e-01,  5.0672e-02,  1.5211e-01, -1.9183e-01,  1.1181e-01,
         1.2131e-01, -2.7212e-01,  1.6203e+00, -2.4884e-01,  1.4060e-01,
         3.3099e-01, -1.8061e-02,  1.5244e-01, -2.6943e-01, -2.7833e-01,
        -5.2123e-02, -4.8149e-01, -5.1839e-01,  8.6262e-02,  3.0818e-02,
        -2.1253e-01, -1.1378e-01, -2.2384e-01,  1.8262e-01, -3.4541e-01,
         8.2611e-02,  1.0024e-01, -7.9550e-02, -8.1721e-01,  6.5621e-03,
         8.0134e-02, -3.9976e-01, -6.3131e-02,  3.2260e-01, -3.1625e-02,
         4.3056e-01, -2.7270e-01, -7.6020e-02,  1.0293e-01, -8.8653e-02,
        -2.9087e-01, -4.7214e-02,  4.6036e-02, -1.7788e-02,  6.4990e-02,
         8.8451e-02, -3.1574e-01, -5.8522e-01,  2.2295e-01, -5.2785e-02,
        -5.5981e-01, -3.9580e-01, -7.9849e-02, -1.0933e-02, -4.1722e-02,
        -5.5576e-01,  8.8707e-02,  1.3710e-01, -2.9

In [14]:
train_word_vector_X = [sentences.vector for sentences in doc]

In [15]:
svm_model_word_vector = svm.SVC(kernel = 'linear')

In [16]:
svm_model_word_vector.fit(X = train_word_vector_X,
                          y = train_sample_y)

SVC(kernel='linear')

In [17]:
sample_text_for_testing_word_vector = 'I have work experience in C++ langauge'
sample_text_for_testing_word_vector = nlp(sample_text_for_testing_word_vector)
svm_model_word_vector.predict([sample_text_for_testing_word_vector.vector])

array(['LANGUAGE'], dtype='<U8')

In [18]:
# sample_text_for_testing_word_vector = [sample_text_for_testing_word_vector]
test_samples = vectorizer.transform(['I have work experience in C++ langauge'])
svm_model.predict(test_samples)

array(['BACK_END'], dtype='<U8')

Advantage of word vectors
* The bag of words model predicts `BACK_END` for the same sentence, whereas the word vector model correctly predicts the `LANGUAGE` label

Disadvantage of word vectors
* With more words in a sentence (and more layers), the averaged word embedding loses the meaning of some important words
* Whether the word "work" is used as a verb or a noun, it gets the same vector, so the embedding may not capture its meaning in context
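A small sketch of the averaging drawback, using made-up 2-dimensional embeddings rather than real spaCy vectors. The dictionary `toy_embeddings` and the function `average_vector` are hypothetical names for this illustration only.

```python
# Toy 2-dimensional embeddings; the values are made up for illustration.
toy_embeddings = {
    'python': [1.0, 0.0],
    'docker': [0.0, 1.0],
    'and': [0.1, 0.1],
    'with': [0.1, 0.1],
    'work': [0.5, 0.5],
}

def average_vector(tokens):
    """Mean of the token vectors, the same idea as a sentence-level average."""
    dims = len(next(iter(toy_embeddings.values())))
    total = [0.0] * dims
    for token in tokens:
        for i, value in enumerate(toy_embeddings[token]):
            total[i] += value
    return [component / len(tokens) for component in total]

short = average_vector(['python'])
long_sentence = average_vector(['work', 'with', 'python', 'and', 'docker'])
reordered = average_vector(['docker', 'and', 'python', 'with', 'work'])

print(short, long_sentence, reordered)
```

In the longer sentence the strong `python` signal is diluted by the other tokens, and reordering the words leaves the average unchanged, so word order and context are lost.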

### Regex

In [19]:
import re

In [20]:
pattern = '^a.*n$'
text = ['arjun', 'ayana', 'knowledge']
# re.findall(pattern = pattern, string = text)
re.compile(pattern = pattern)

re.compile(r'^a.*n$', re.UNICODE)

In [21]:
[re.findall(pattern, word) for word in text]

[['arjun'], [], []]
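The pattern `^a.*n$` reads as: start of string (`^`), the letter `a`, any run of characters (`.*`), the letter `n`, end of string (`$`). The cell above compiles the pattern but does not reuse the compiled object; a minimal sketch of doing so, with illustrative variable names, could look like this:

```python
import re

# Compile once, then reuse the compiled pattern for every word.
compiled = re.compile(r'^a.*n$')
words = ['arjun', 'ayana', 'knowledge']

# Keep only the words that start with 'a' and end with 'n'.
matches = [word for word in words if compiled.search(word)]
print(matches)
```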

### Techniques for spell correction, sentiment

In [22]:
from textblob import TextBlob

In [23]:
sample_phrase = 'I am goed swimmer'
tb_phrase = TextBlob(text = sample_phrase)

In [24]:
corrected_sentence = tb_phrase.correct()
corrected_sentence

TextBlob("I am good swimmer")

* POS tagging from TextBlob

In [25]:
corrected_sentence.tags

[('I', 'PRP'), ('am', 'VBP'), ('good', 'JJ'), ('swimmer', 'NN')]

* POS tagging from spaCy

In [26]:
doc_pos = nlp(text = str(corrected_sentence))
[(words.text, words.pos_) for words in doc_pos]

[('I', 'PRON'), ('am', 'AUX'), ('good', 'ADJ'), ('swimmer', 'NOUN')]

In [27]:
tb_phrase.sentiment

Sentiment(polarity=0.0, subjectivity=0.0)

### Recurrent Neural Networks
* Drawbacks
    * Long-range dependencies are not always captured well
    * The sequential nature of RNNs makes them hard to parallelize, so they do not use modern GPUs effectively
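Both drawbacks can be seen in a toy scalar recurrence. This is only a sketch: the weights `w_x` and `w_h` are arbitrary values chosen for illustration, and a real RNN would use weight matrices and learned parameters.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.8, h0=0.0):
    """Run a one-unit RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1})."""
    hidden = h0
    states = []
    for x in inputs:
        # Each step needs the previous hidden state, so the loop
        # cannot be computed for all time steps in parallel.
        hidden = math.tanh(w_x * x + w_h * hidden)
        states.append(hidden)
    return states

# A single non-zero input at the first step, followed by zeros.
states = rnn_forward([1.0, 0.0, 0.0, 0.0])
print(states)
```

The hidden state shrinks at every step after the first input, which is one way to picture how information from early tokens fades over long sequences.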

### Transformers Architecture

In [30]:
import spacy
import torch

In [29]:
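# Assign the loaded pipeline to nlp so the cells below use BERT
# rather than the en_core_web_lg pipeline loaded earlier.
nlp = spacy.load('en_trf_bertbaseuncased_lg')
nlp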

<spacy_transformers.language.TransformersLanguage at 0x29b6d196278>

In [32]:
class Category:
    language = 'LANGUAGE'
    back_end = 'BACK_END'

train_sample_X = ['I am good in python scripting language',
                  'I have expertize in julia and R languages',
                  'I know C++, Java',
                  'I know to work with my SQL',
                  'I have knowledge of mongodb database',
                  'I have 5.5 years of work experience in docker and docker hub']

train_sample_y = [Category.language, Category.language, Category.language,
                  Category.back_end, Category.back_end, Category.back_end]

#### Building SVM model using BERT

In [34]:
from sklearn import svm

doc_bert = [nlp(sentences) for sentences in train_sample_X]

train_bert_X = [sentences.vector for sentences in doc_bert]

svm_model_word_vector = svm.SVC(kernel = 'linear')

svm_model_word_vector.fit(X = train_bert_X,
                          y = train_sample_y)

SVC(kernel='linear')

#### Testing the model

In [40]:
sample_text_for_testing_word_vector = 'Experience with javascript language'
sample_text_for_testing_word_vector = nlp(sample_text_for_testing_word_vector)
svm_model_word_vector.predict([sample_text_for_testing_word_vector.vector])

array(['LANGUAGE'], dtype='<U8')