### BPEmb is a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia. Its intended use is as input for neural models in natural language processing.

# Which vocabulary size should I pick?
The vocabulary size is the sum of the number of BPE merge operations and the number of distinct characters in the training data. The number of BPE merge operations determines whether the resulting symbols tend to be short (few merge operations) or long (many merge operations). Using very few merge operations produces mostly character unigrams, bigrams, and trigrams, while performing a large number of merge operations creates symbols that represent the most frequent words.

-->The advantage of having few operations is a smaller vocabulary of symbols, so you need less data to learn representations (embeddings) of these symbols.  
-->The disadvantage is that you need data to learn how to compose those symbols into meaningful units (e.g. words).

-->The advantage of having many operations is that many frequent words get their own symbols, so you don't have to learn what the word railway means by composing it from the symbols r, ail, and way.  
-->The disadvantage is that you need more data to train good embeddings for these longer symbols. Such data is available for high-resource languages like English, but less so for low-resource languages like Khmer.
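The trade-off above can be sketched with a toy BPE learner. This is a minimal illustration in plain Python, not the code BPEmb uses; the helper names (`learn_bpe`, `apply_merge`, `segment`) and the tiny word list are assumptions made up for this example.

```python
from collections import Counter

def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge operations from a list of words (toy version)."""
    # Every word starts out as a sequence of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # most frequent pair; ties go to the first one seen
        merges.append(best)
        merged_corpus = Counter()
        for symbols, freq in corpus.items():
            merged_corpus[apply_merge(symbols, best)] += freq
        corpus = merged_corpus
    return merges

def segment(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)

# Hypothetical toy training data
words = ["railway"] * 5 + ["rail"] * 3 + ["way"] * 3 + ["sail"] * 2
few_merges = learn_bpe(words, 2)
many_merges = learn_bpe(words, 10)

num_chars = len(set("".join(words)))
print("vocabulary size with few merges: ", num_chars + len(few_merges))
print("vocabulary size with many merges:", num_chars + len(many_merges))
print(segment("railway", few_merges))   # several short symbols
print(segment("railway", many_merges))  # fewer, longer symbols
```

With few merges the frequent word is still split into several short pieces; with enough merges it becomes a single symbol, at the price of a larger vocabulary.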

In [1]:
from bpemb import BPEmb

In [2]:
# load English BPEmb model with default vocabulary size (10k) and 50-dimensional embeddings
bpemb_en = BPEmb(lang="en", dim=50)

downloading https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.model


100%|███████████████████████████████████████████████████████████████████████| 400869/400869 [00:02<00:00, 144013.01B/s]


downloading https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.d50.w2v.bin.tar.gz


100%|█████████████████████████████████████████████████████████████████████| 1924908/1924908 [00:11<00:00, 172315.66B/s]


In [3]:
bpemb_en.encode("nepotism")        # with only 10,000 symbols (relatively few merge operations), a rare word is split into several subwords

['▁ne', 'p', 'ot', 'ism']

In [5]:
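bpemb_zh = BPEmb(lang="zh", vs=1000)        # request a very small vocabulary (very few merge operations)
# apply Chinese BPE subword segmentation model
bpemb_zh.encode("这是一个中文句子")  # "This is a Chinese sentence."
# With a much larger vocabulary the expected segmentation is coarser:
# ['▁这是一个', '中文', '句子']  # ["This is a", "Chinese", "sentence"]
# Here BPEmb falls back from vs=1000 to vs=10000 (see the log below), which gives the finer segmentation printed at the end.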

BPEmb fallback: zh from vocab size 1000 to 10000
downloading https://nlp.h-its.org/bpemb/zh/zh.wiki.bpe.vs10000.model




downloading https://nlp.h-its.org/bpemb/zh/zh.wiki.bpe.vs10000.d100.w2v.bin.tar.gz




['▁', '这', '是一个', '中文', '句', '子']

In [6]:
# Embeddings are wrapped in a gensim KeyedVectors object
print(type(bpemb_en.emb))

# You can use BPEmb objects like gensim KeyedVectors
bpemb_en.most_similar("ford")

<class 'gensim.models.keyedvectors.Word2VecKeyedVectors'>


[('ton', 0.9460763931274414),
 ('ston', 0.9455490112304688),
 ('bury', 0.9116721153259277),
 ('ington', 0.9063345193862915),
 ('well', 0.8849488496780396),
 ('▁chester', 0.8847955465316772),
 ('field', 0.8794451951980591),
 ('wick', 0.8759092092514038),
 ('ingham', 0.8693511486053467),
 ('worth', 0.8684747219085693)]
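The scores returned by gensim's `most_similar` are cosine similarities between embedding vectors. The following is a minimal sketch of that ranking in plain Python; the three-dimensional vectors are made-up values for illustration (not real BPEmb embeddings), and `cosine_similarity` and `toy_most_similar` are hypothetical helper names.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def toy_most_similar(query, table, topn=3):
    """Rank all other symbols by cosine similarity to `query`."""
    scores = [
        (symbol, cosine_similarity(table[query], vector))
        for symbol, vector in table.items()
        if symbol != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical toy embeddings, invented for this example
toy_vectors = {
    "ford": [0.9, 0.1, 0.0],
    "ton": [0.8, 0.2, 0.1],
    "bury": [0.7, 0.3, 0.0],
    "ism": [0.0, 0.1, 0.9],
}

for symbol, score in toy_most_similar("ford", toy_vectors):
    print(symbol, round(score, 3))
```

Symbols that point in a similar direction to the query (here the place-name suffixes) rank first, while an unrelated suffix scores close to zero.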

In [9]:
type(bpemb_en.vectors)
bpemb_en.vectors.shape               # 10000 symbols, each with a 50-dimensional embedding

(10000, 50)

In [13]:
# To use subword embeddings in your neural network, either encode your input into subword IDs:

ids = bpemb_en.encode_ids("nepotism")
print(ids)
bpemb_en.vectors[ids].shape



[140, 9929, 85, 679]


(4, 50)

In [14]:

# Or use the embed method:
# apply English subword segmentation and perform embedding lookup
bpemb_en.embed("nepotism").shape


(4, 50)
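Both routes yield one 50-dimensional vector per subword. One simple way to turn these into a single fixed-size word vector is to average them. The sketch below shows the lookup and the averaging in plain Python; the random table is a stand-in with the same shape as `bpemb_en.vectors` (an assumption so the example runs without downloading the model), and `embed_ids` and `mean_pool` are hypothetical helper names.

```python
import random

random.seed(0)
VOCAB_SIZE, DIM = 10000, 50  # same shape as bpemb_en.vectors above

# Stand-in for the pre-trained embedding table (random values, NOT real BPEmb vectors)
vectors = [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(VOCAB_SIZE)]

def embed_ids(ids, table):
    """Embedding lookup: one row of the table per subword ID."""
    return [table[i] for i in ids]

def mean_pool(rows):
    """Average the subword vectors dimension by dimension."""
    return [sum(column) / len(rows) for column in zip(*rows)]

ids = [140, 9929, 85, 679]  # the IDs printed for "nepotism" above
subword_vectors = embed_ids(ids, vectors)
word_vector = mean_pool(subword_vectors)

print(len(subword_vectors), len(subword_vectors[0]))  # number of subwords, embedding dimension
print(len(word_vector))                                # one pooled vector of the same dimension
```

Averaging discards subword order, so sequence models usually consume the per-subword vectors directly; the pooled vector is mainly useful as a quick fixed-size feature.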

# Part-of-speech Tagging

In corpus linguistics, part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.

Once performed by hand, POS tagging is now done in the context of computational linguistics, using algorithms that associate discrete terms, as well as hidden parts of speech, with a set of descriptive tags. POS-tagging algorithms fall into two distinct groups: rule-based and stochastic. E. Brill's tagger, one of the first and most widely used English POS taggers, employs rule-based algorithms.
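To make the two groups concrete, here is a minimal sketch of each in plain Python. The suffix rules and the tiny training set are assumptions invented for illustration; this is neither Brill's tagger nor the tagger NLTK uses, only the simplest instance of each idea.

```python
import re
from collections import Counter, defaultdict

# Rule-based: hand-written suffix patterns, first matching rule wins (illustrative rules only).
RULES = [
    (r".*ly$", "RB"),    # adverbs often end in -ly
    (r".*ing$", "VBG"),  # gerunds / present participles
    (r".*ed$", "VBD"),   # simple past
    (r".*s$", "NNS"),    # plural nouns
    (r".*", "NN"),       # default: singular noun
]

def rule_tag(words):
    """Tag each word with the first rule whose pattern matches."""
    return [(w, next(tag for pattern, tag in RULES if re.match(pattern, w))) for w in words]

# Stochastic (simplest form): pick the tag seen most often for each word in tagged training data.
def train_unigram(tagged_sentences):
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tag_counts.most_common(1)[0][0] for word, tag_counts in counts.items()}

def unigram_tag(words, model, default="NN"):
    """Look up each word's most frequent tag, falling back to `default` for unknown words."""
    return [(w, model.get(w.lower(), default)) for w in words]

# Hypothetical training data
training = [
    [("the", "DT"), ("refuse", "NN")],
    [("they", "PRP"), ("refuse", "VBP")],
    [("the", "DT"), ("refuse", "NN"), ("permit", "NN")],
]
model = train_unigram(training)

print(rule_tag(["completely", "running", "cars"]))
print(unigram_tag(["They", "refuse"], model))
```

The unigram tagger labels "refuse" as a noun even after "They", because it ignores context; practical stochastic taggers add context features to resolve such cases.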

We will also see how tagging is the second step in the typical NLP pipeline, following tokenization.

The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging. Parts of speech are also known as word classes or lexical categories. The collection of tags used for a particular task is known as a tagset. Our emphasis in this chapter is on exploiting tags, and tagging text automatically.



### 1   Using a Tagger
A part-of-speech tagger, or POS-tagger, processes a sequence of words, and attaches a part of speech tag to each word (don't forget to import nltk):

 	
The Penn Treebank tag list maps each tag abbreviation to its part of speech: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html


In [17]:
import nltk
nltk.download('averaged_perceptron_tagger')
# nltk.word_tokenize (used below) also needs the 'punkt' tokenizer models: nltk.download('punkt')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\UMANG PATEL\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


True

In [18]:
text = nltk.word_tokenize("And now for something which is completely different")
nltk.pos_tag(text)

[('And', 'CC'),
 ('now', 'RB'),
 ('for', 'IN'),
 ('something', 'NN'),
 ('which', 'WDT'),
 ('is', 'VBZ'),
 ('completely', 'RB'),
 ('different', 'JJ')]

Here we see that:  
and is CC --> a coordinating conjunction;  
now and completely are RB --> adverbs;  
for is IN --> a preposition;  
something is NN --> a noun;  
different is JJ --> an adjective.
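For quick reference while reading tagger output, the abbreviations can be kept in a small lookup table. The sketch below covers only the tags that occur in the sentence above, with descriptions taken from the Penn Treebank tag list; the dictionary name `PENN_TAGS` is an assumption for this example.

```python
# Descriptions for the Penn Treebank tags that occur in the example sentence
PENN_TAGS = {
    "CC": "coordinating conjunction",
    "RB": "adverb",
    "IN": "preposition or subordinating conjunction",
    "NN": "noun, singular or mass",
    "WDT": "wh-determiner",
    "VBZ": "verb, 3rd person singular present",
    "JJ": "adjective",
}

# Tagger output copied from the cell above
tagged = [
    ("And", "CC"), ("now", "RB"), ("for", "IN"), ("something", "NN"),
    ("which", "WDT"), ("is", "VBZ"), ("completely", "RB"), ("different", "JJ"),
]

for word, tag in tagged:
    # Fall back to a placeholder for tags missing from the small table
    print(f"{word:<12}{tag:<5}{PENN_TAGS.get(tag, 'unknown tag')}")
```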

In [20]:
text = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
nltk.pos_tag(text)

[('They', 'PRP'),
 ('refuse', 'VBP'),
 ('to', 'TO'),
 ('permit', 'VB'),
 ('us', 'PRP'),
 ('to', 'TO'),
 ('obtain', 'VB'),
 ('the', 'DT'),
 ('refuse', 'NN'),
 ('permit', 'NN')]


Notice that refuse and permit each appear both as a verb (refuse as VBP, a present tense verb; permit as VB, a base form verb) and as a noun (NN). E.g. refUSE is a verb meaning "deny," while REFuse is a noun meaning "trash" (i.e. they are not homophones). Thus, we need to know which word is being used in order to pronounce the text correctly. (For this reason, text-to-speech systems usually perform POS-tagging.)
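Words that receive more than one tag can be found mechanically by grouping the tagger output by word. A minimal sketch using the tagged sentence from the cell above (copied by hand so the snippet runs without NLTK):

```python
from collections import defaultdict

# Tagger output copied from the cell above
tagged = [
    ("They", "PRP"), ("refuse", "VBP"), ("to", "TO"), ("permit", "VB"), ("us", "PRP"),
    ("to", "TO"), ("obtain", "VB"), ("the", "DT"), ("refuse", "NN"), ("permit", "NN"),
]

# Collect the set of tags seen for each (lowercased) word
tags_by_word = defaultdict(set)
for word, tag in tagged:
    tags_by_word[word.lower()].add(tag)

# Keep only the words that were tagged in more than one way
ambiguous = {word: sorted(tags) for word, tags in tags_by_word.items() if len(tags) > 1}
print(ambiguous)
```

Only refuse and permit remain: "to" occurs twice but always with the same tag.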


In [22]:
nltk.download('brown')
 
text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())


[nltk_data] Downloading package brown to C:\Users\UMANG
[nltk_data]     PATEL\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\brown.zip.



In [23]:
# text.similar() prints its result and returns None, so call it directly instead of wrapping it in print()
text.similar('woman')    # noun
print()
text.similar('bought')   # verb
print()
text.similar('over')     # preposition
print()
text.similar('the')      # determiner

man time day year car moment world house family child country boy
state job place way war girl work word

made said done put had seen found given left heard was been brought
set got that took in told felt

in on to of and for with from at by that into as up out down through
is all about

a his this their its her an that our any all one these my in your no
some other and


In [26]:
nltk.download('universal_tagset')
nltk.download('treebank')
print(nltk.corpus.brown.tagged_words(tagset='universal'))

nltk.corpus.treebank.tagged_words(tagset='universal')



[nltk_data] Downloading package universal_tagset to C:\Users\UMANG
[nltk_data]     PATEL\AppData\Roaming\nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!
[nltk_data] Downloading package treebank to C:\Users\UMANG
[nltk_data]     PATEL\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\treebank.zip.


[('The', 'DET'), ('Fulton', 'NOUN'), ...]


[('Pierre', 'NOUN'), ('Vinken', 'NOUN'), (',', '.'), ...]

In [28]:
# Tagged corpora for several other languages are distributed with NLTK, including Chinese, Hindi, Portuguese, Spanish, Dutch and Catalan.
# These usually contain non-ASCII text. Python 3 prints it as-is, as in the output below (Python 2 showed hexadecimal escapes when printing a larger structure such as a list).
nltk.download('indian')

# nltk.corpus.sinica_treebank.tagged_words()

nltk.corpus.indian.tagged_words()

# nltk.corpus.mac_morpho.tagged_words()


[nltk_data] Downloading package indian to C:\Users\UMANG
[nltk_data]     PATEL\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\indian.zip.


[('মহিষের', 'NN'), ('সন্তান', 'NN'), (':', 'SYM'), ...]

In [33]:
from nltk.corpus import brown
brown_news_tagged = brown.tagged_words(categories="news",tagset="universal")
tag_freq = nltk.FreqDist(tag for (word,tag) in brown_news_tagged)
tag_freq.most_common()

[('NOUN', 30654),
 ('VERB', 14399),
 ('ADP', 12355),
 ('.', 11928),
 ('DET', 11389),
 ('ADJ', 6706),
 ('ADV', 3349),
 ('CONJ', 2717),
 ('PRON', 2535),
 ('PRT', 2264),
 ('NUM', 2166),
 ('X', 92)]

In [109]:
list(tag_freq.keys())
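# list(tag_freq.values())[:5] takes the first five tags in insertion order, not the five most frequent ones.
# The ratio below matches the counts of NOUN, VERB, ADP, DET and ADJ; for the top five by frequency use tag_freq.most_common(5).
sum(list(tag_freq.values())[:5])/sum(list(tag_freq.values()))   # about 75% of the tokens carry one of these five tags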

0.7508701792071921
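The coverage of the five most frequent tags can be checked directly from the counts listed above. The sketch below copies those counts into a `collections.Counter` (so it runs without NLTK) and uses `most_common(5)`, which sorts by frequency rather than relying on the order in which the tags were first seen.

```python
from collections import Counter

# Tag counts copied from the tag_freq.most_common() output above
tag_counts = Counter({
    "NOUN": 30654, "VERB": 14399, "ADP": 12355, ".": 11928, "DET": 11389, "ADJ": 6706,
    "ADV": 3349, "CONJ": 2717, "PRON": 2535, "PRT": 2264, "NUM": 2166, "X": 92,
})

total = sum(tag_counts.values())
top5 = tag_counts.most_common(5)                    # five most frequent tags, sorted by count
coverage = sum(count for _, count in top5) / total  # share of tokens covered by those tags

print([tag for tag, _ in top5])
print(round(coverage, 3))
```

By frequency, the top five include the punctuation tag `.` and cover about 80% of the tokens; the 0.75 above corresponds to NOUN, VERB, ADP, DET and ADJ, that is, without punctuation.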

## How to use POS (part-of-speech) features in scikit-learn classifiers (SVM, etc.)


In [145]:
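# CountVectorizer and TF-IDF expect plain strings, not (word, tag) tuples such as [('have', 'VB'), ...],
# so we join each word and its tag into a single token like "have_VB"

text = nltk.word_tokenize('Someone should have this ring to a volcano')

text_tagged = nltk.pos_tag(text)

new_text = [word + "_" + tag for (word, tag) in text_tagged]
doc_text = " ".join(new_text)
doc_text = [doc_text + "_umang"]   # wrap in a list (one document); "_umang" is an arbitrary suffix that ends up on the last token
doc_text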

['Someone_NN should_MD have_VB this_DT ring_NN to_TO a_DT volcano_NN_umang']

In [146]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
dtm = cv.fit_transform(doc_text)
dtm.toarray()
cv.get_feature_names()   # removed in scikit-learn 1.2; newer versions use cv.get_feature_names_out()

['a_dt',
 'have_vb',
 'ring_nn',
 'should_md',
 'someone_nn',
 'this_dt',
 'to_to',
 'volcano_nn_umang']
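The feature names are lowercased and each word stays glued to its tag because of CountVectorizer's default settings: `lowercase=True` and the token pattern `(?u)\b\w\w+\b`, in which the underscore counts as a word character. The sketch below imitates just these two defaults with the standard `re` module; it is a simplified stand-in, not scikit-learn's implementation, and `analyze` is a hypothetical helper name.

```python
import re

# Same regular expression as CountVectorizer's default token_pattern:
# runs of two or more word characters (letters, digits, underscore)
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def analyze(doc):
    """Lowercase the document and extract tokens, mimicking the default analyzer."""
    return TOKEN_PATTERN.findall(doc.lower())

doc = "Someone_NN should_MD have_VB this_DT ring_NN to_TO a_DT volcano_NN_umang"
vocabulary = sorted(set(analyze(doc)))
print(vocabulary)

# A bare one-letter word would be dropped, but "a_DT" survives because the tag makes it longer
print(analyze("a ring"))
```

Because lowercasing also affects the tags, `NN` and a hypothetical lowercase word ending in `_nn` would collide; passing `lowercase=False` to CountVectorizer keeps the tags in upper case if that matters.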