# Machine learning entity recognition


Objective: Recognize a set of “machine learning” entities by analyzing ML text.

Examples of entities could be:

    • Supervised ML algorithm (decision tree, random forest, SVM, etc.)
    • Unsupervised ML algorithm (clustering, knn, event detection)
    • ML software (keras, sklearn, etc.)


For this approach we will:
    
    • Train a model using a data set of IOB-formatted sentences found on Kaggle
    • Build a custom data set using machine learning texts
    • Train our model using this custom data
    • Train a spaCy model using our custom data set
    • Compare our model with the spaCy model
    

## Processing the data set


We proceed to import the data.



In [1]:
import pandas as pd
import numpy as np
data = pd.read_csv("ner_dataset.csv", encoding="latin1")

In [2]:
data.head(20)

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,,of,IN,O
2,,demonstrators,NNS,O
3,,have,VBP,O
4,,marched,VBN,O
5,,through,IN,O
6,,London,NNP,B-geo
7,,to,TO,O
8,,protest,VB,O
9,,the,DT,O


Our data set consists of 4 columns:

    • Sentence #: the sentence number; each new sentence starts with a new sentence number
    • Word: the word of the sentence
    • POS: the part-of-speech tag for each word
    • Tag: the entity classification in IOB format, where the I prefix indicates that the word is inside a chunk, B that the word is at the beginning of a chunk, and O that the word is outside any chunk (no classification / not an entity). The entity types are:
        geo = Geographical Entity
        org = Organization
        per = Person
        gpe = Geopolitical Entity
        tim = Time indicator
        art = Artifact
        eve = Event
        nat = Natural Phenomenon
        
Afterwards, we will filter the model's predictions in order to keep only the ML entities.
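To make the IOB scheme and the filtering step concrete, here is a minimal sketch of how tagged tokens can be grouped back into entities and then filtered by type. The helper names `iob_to_spans` and `keep_type`, the toy sentence, and the `ML` entity type are assumptions made for illustration only; they are not part of the Kaggle data set.

```python
def iob_to_spans(tokens, tags):
    """Group IOB-tagged tokens into (entity_text, entity_type) pairs."""
    spans, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        prefix, _, ent_type = tag.partition("-")
        if prefix == "B" or (prefix == "I" and ent_type != current_type):
            # A B- tag (or an I- tag of a different type) starts a new chunk
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [token], ent_type
        elif prefix == "I":
            # Same type as the open chunk: the token continues it
            current.append(token)
        else:
            # An O tag closes any open chunk
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        spans.append((" ".join(current), current_type))
    return spans


def keep_type(spans, wanted):
    """Keep only the entity texts of one entity type."""
    return [text for text, ent_type in spans if ent_type == wanted]


# Hypothetical example sentence with hand-written tags
tokens = ["Random", "forest", "beats", "SVM", "in", "London"]
tags = ["B-ML", "I-ML", "O", "B-ML", "O", "B-geo"]
spans = iob_to_spans(tokens, tags)
print(spans)
print(keep_type(spans, "ML"))
```

The same grouping can be applied to the predicted tags of any sentence, so the filtering does not depend on which model produced them.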


As we can see, there are plenty of missing values in the dataset. It will be better for the classifier if we fill in the missing values (the sentence number).

In [3]:
# Forward-fill the sentence number (fillna(method='ffill') is deprecated in recent pandas)
df = data.ffill()[:150000]


In [4]:
df.head(20)

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,Sentence: 1,of,IN,O
2,Sentence: 1,demonstrators,NNS,O
3,Sentence: 1,have,VBP,O
4,Sentence: 1,marched,VBN,O
5,Sentence: 1,through,IN,O
6,Sentence: 1,London,NNP,B-geo
7,Sentence: 1,to,TO,O
8,Sentence: 1,protest,VB,O
9,Sentence: 1,the,DT,O


In [5]:
print(f" We have {df['Sentence #'].nunique()} sentences which contains {df.Word.nunique()} unique words and {df.Tag.nunique()} unique tags")
print("And also we have the following distribution of tags")

 We have 6834 sentences which contains 13379 unique words and 17 unique tags
And also we have the following distribution of tags


In [6]:
df.groupby('Tag').size().reset_index(name='counts')


Unnamed: 0,Tag,counts
0,B-art,91
1,B-eve,80
2,B-geo,5130
3,B-gpe,2535
4,B-nat,46
5,B-org,2799
6,B-per,2519
7,B-tim,2789
8,I-art,50
9,I-eve,61


## Preprocessing
Now we will preprocess the data and use conditional random fields (CRF), a model used to parse and label sequential data.

In [7]:
import sklearn_crfsuite
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics
from collections import Counter

In [8]:
# Helper that groups the rows into sentences: a list of sentences, where each sentence
# is a list of (Word, POS tag, Entity tag) tuples
class SentenceGetter(object):

    def __init__(self, data):
        """
        :param data: a pandas data frame with the structure: | Sentence # | Word | POS | Tag |
            ref: Annotated Corpus for Named Entity Recognition, a feature-engineered corpus annotated with IOB and POS tags
            https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus#ner_dataset.csv
        :attribute n_sent: number of the next sentence to return (iterator position)
        :attribute data: the pandas data frame passed in
        :attribute empty: boolean indicating whether there is no data
        :lambda agg_func: grouping function to structure the sentences for processing
        :attribute grouped: a pandas series of sentences indexed by sentence number
        :attribute sentences: a list of sentences, each a list of tagged words in the following format
              [[(Word,Pos tag,Entity tag),....,(Word,Pos tag,Entity)],....,[(Word,Pos tag,Entity tag),..,(Word,Pos tag,Entity)]]
        """
        self.n_sent = 1
        self.data = data
        self.empty = False
        agg_func = lambda s: [(w, p, t) for w, p, t in zip(s['Word'].values.tolist(),
                                                           s['POS'].values.tolist(),
                                                           s['Tag'].values.tolist())]
        self.grouped = self.data.groupby('Sentence #').apply(agg_func)
        self.sentences = [s for s in self.grouped]

    def get_next(self):
        """
        :return: the next sentence, or None when there are no more sentences
        """
        try:
            s = self.grouped['Sentence: {}'.format(self.n_sent)]
            self.n_sent += 1
            return s
        except KeyError:
            # No sentence with this number: we have reached the end
            return None

In [9]:
getter = SentenceGetter(df)
sentences = getter.sentences

In [10]:
df.groupby('Sentence #').head(1)

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
24,Sentence: 2,Families,NNS,O
54,Sentence: 3,They,PRP,O
68,Sentence: 4,Police,NNS,O
83,Sentence: 5,The,DT,O
108,Sentence: 6,The,DT,O
132,Sentence: 7,The,DT,O
153,Sentence: 8,The,DT,O
181,Sentence: 9,Iran,NNP,B-gpe
196,Sentence: 10,Iranian,JJ,B-gpe


### A little explanation:
We are grouping our data by sentence in order to build a list of sentences, where each sentence is a list of its words and each word carries its POS tag and its entity tag, in this manner:
  
  [[(Word,Pos tag,Entity tag),....,(Word,Pos tag,Entity)],....,[(Word,Pos tag,Entity tag),..,(Word,Pos tag,Entity)]]
  

In [11]:
sentences[:2]

[[('Thousands', 'NNS', 'O'),
  ('of', 'IN', 'O'),
  ('demonstrators', 'NNS', 'O'),
  ('have', 'VBP', 'O'),
  ('marched', 'VBN', 'O'),
  ('through', 'IN', 'O'),
  ('London', 'NNP', 'B-geo'),
  ('to', 'TO', 'O'),
  ('protest', 'VB', 'O'),
  ('the', 'DT', 'O'),
  ('war', 'NN', 'O'),
  ('in', 'IN', 'O'),
  ('Iraq', 'NNP', 'B-geo'),
  ('and', 'CC', 'O'),
  ('demand', 'VB', 'O'),
  ('the', 'DT', 'O'),
  ('withdrawal', 'NN', 'O'),
  ('of', 'IN', 'O'),
  ('British', 'JJ', 'B-gpe'),
  ('troops', 'NNS', 'O'),
  ('from', 'IN', 'O'),
  ('that', 'DT', 'O'),
  ('country', 'NN', 'O'),
  ('.', '.', 'O')],
 [('Iranian', 'JJ', 'B-gpe'),
  ('officials', 'NNS', 'O'),
  ('say', 'VBP', 'O'),
  ('they', 'PRP', 'O'),
  ('expect', 'VBP', 'O'),
  ('to', 'TO', 'O'),
  ('get', 'VB', 'O'),
  ('access', 'NN', 'O'),
  ('to', 'TO', 'O'),
  ('sealed', 'JJ', 'O'),
  ('sensitive', 'JJ', 'O'),
  ('parts', 'NNS', 'O'),
  ('of', 'IN', 'O'),
  ('the', 'DT', 'O'),
  ('plant', 'NN', 'O'),
  ('Wednesday', 'NNP', 'B-tim'),
  ('

## Feature Extraction
Next we extract more features and transform our data into the format defined by the sklearn-crfsuite wrapper:
https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html


In [12]:
def word2features(sent, i):
    """
    :param sent: a sentence in the format [(Word,Pos tag,Entity tag),....,(Word,Pos tag,Entity tag)]
    :param i: position of the word whose features we want to extract
    :return: a dictionary of features in the format defined by the sklearn CRF wrapper
    """
    word = sent[i][0]
    postag = sent[i][1]

    features = {
        #constant bias feature
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.isdigit()': word.isdigit(),
        #postag and type
        'postag': postag,
        #simple postag
        'postag[:2]': postag[:2],
    }
    if i > 0:
        #info about the previous word in the sentence (if this is not the first word)
        word1 = sent[i - 1][0]
        postag1 = sent[i - 1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True
    if i < len(sent) - 1:
        #info about the next word in the sentence (if this is not the last word)
        word1 = sent[i + 1][0]
        postag1 = sent[i + 1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        #this word is the End Of the Sentence
        features['EOS'] = True
    return features


def sent2features(sent):
    """
    :param sent: a sentence whose features we want to obtain
    :return: a list with one feature dictionary per word in the sentence
    """
    return [word2features(sent, i) for i in range(len(sent))]


def sent2labels(sent):
    """ 
    :param sent: a sentence whose labels we want to obtain
    :return: a list of the labels (entity tags) present in the sentence
    """
    return [label for token, postag, label in sent]


def sent2tokens(sent):
    """
    :param sent: a sentence whose tokens we want to obtain
    :return: a list of the tokens (words) of the sentence
    """
    return [token for token, postag, label in sent]

## Training

Next we build our training and test sets.



In [13]:
from sklearn.model_selection import train_test_split
X = [sent2features(s) for s in sentences]
y = [sent2labels(s) for s in sentences]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Let's look at how our data is arranged.

In [14]:
print(X_train[0][:5],"\n\n\n",y_train[0])

[{'bias': 1.0, 'word.lower()': 'philip', 'word[-3:]': 'lip', 'word[-2:]': 'ip', 'word.isupper()': False, 'word.isdigit()': False, 'postag': 'NNP', 'postag[:2]': 'NN', 'BOS': True, '+1:word.lower()': 'alston', '+1:word.isupper()': False, '+1:postag': 'NNP', '+1:postag[:2]': 'NN'}, {'bias': 1.0, 'word.lower()': 'alston', 'word[-3:]': 'ton', 'word[-2:]': 'on', 'word.isupper()': False, 'word.isdigit()': False, 'postag': 'NNP', 'postag[:2]': 'NN', '-1:word.lower()': 'philip', '-1:word.istitle()': True, '-1:word.isupper()': False, '-1:postag': 'NNP', '-1:postag[:2]': 'NN', '+1:word.lower()': 'said', '+1:word.isupper()': False, '+1:postag': 'VBD', '+1:postag[:2]': 'VB'}, {'bias': 1.0, 'word.lower()': 'said', 'word[-3:]': 'aid', 'word[-2:]': 'id', 'word.isupper()': False, 'word.isdigit()': False, 'postag': 'VBD', 'postag[:2]': 'VB', '-1:word.lower()': 'alston', '-1:word.istitle()': True, '-1:word.isupper()': False, '-1:postag': 'NNP', '-1:postag[:2]': 'NN', '+1:word.lower()': 'he', '+1:word.is


We will now train our model. In this case we use conditional random fields (CRF), which let us take the sequence of the data into account.

We will use the limited-memory BFGS (L-BFGS) optimization algorithm, because it needs only a limited amount of memory to train the model. It is an alternative to the gradient descent algorithm and likewise minimizes the training error.

https://en.wikipedia.org/wiki/Limited-memory_BFGS
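To illustrate what taking the sequence into account means: a linear-chain CRF scores a whole label sequence by combining per-token scores with label-to-label transition scores, and prediction picks the best-scoring sequence. The sketch below is a toy Viterbi decoder with hand-picked scores. The function name `viterbi`, the labels, and all the numbers are assumptions for illustration; this is not how sklearn_crfsuite is implemented internally.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions: one dict per token, mapping label -> score
    transitions: dict mapping (previous_label, label) -> score
    Missing entries count as a score of 0.0.
    """
    if not emissions:
        return []
    # best[i][label] = best score of any path that ends in `label` at token i
    best = [{lab: emissions[0].get(lab, 0.0) for lab in labels}]
    back = []
    for em in emissions[1:]:
        scores, pointers = {}, {}
        for lab in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions.get((p, lab), 0.0))
            scores[lab] = best[-1][prev] + transitions.get((prev, lab), 0.0) + em.get(lab, 0.0)
            pointers[lab] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label
    path = [max(labels, key=lambda lab: best[-1][lab])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["O", "B-ML", "I-ML"]
# Made-up scores for the three tokens "random forest works"
emissions = [{"B-ML": 2.0, "O": 1.0}, {"I-ML": 1.0, "O": 1.2}, {"O": 2.0}]
transitions = {("B-ML", "I-ML"): 1.5, ("O", "I-ML"): -5.0}

# Picking the best label for each token independently ignores the sequence
greedy = [max(labels, key=lambda lab: em.get(lab, 0.0)) for em in emissions]
print(greedy)                                    # ['B-ML', 'O', 'O']
print(viterbi(emissions, transitions, labels))   # ['B-ML', 'I-ML', 'O']
```

With these toy numbers the transition score from B-ML to I-ML is what turns the second token into part of the entity, which a per-token classifier would miss.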




In [15]:
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)
crf.fit(X_train, y_train)

CRF(algorithm='lbfgs', all_possible_states=None,
  all_possible_transitions=True, averaging=None, c=None, c1=0.1, c2=0.1,
  calibration_candidates=None, calibration_eta=None,
  calibration_max_trials=None, calibration_rate=None,
  calibration_samples=None, delta=None, epsilon=None, error_sensitive=None,
  gamma=None, keep_tempfiles=None, linesearch=None, max_iterations=100,
  max_linesearch=None, min_freq=None, model_filename=None,
  num_memories=None, pa_type=None, period=None, trainer_cls=None,
  variance=None, verbose=False)

## Testing and evaluating the model


In [16]:
y_pred = crf.predict(X_test)

In [17]:
Tags=np.unique(df["Tag"]).tolist()
Tags

['B-art',
 'B-eve',
 'B-geo',
 'B-gpe',
 'B-nat',
 'B-org',
 'B-per',
 'B-tim',
 'I-art',
 'I-eve',
 'I-geo',
 'I-gpe',
 'I-nat',
 'I-org',
 'I-per',
 'I-tim',
 'O']

Because O (non-entity) is by far our most common tag, it would make our results look better than they are. So when we measure the quality of the model we will leave out the O tag:

In [18]:
Tags.pop()
Tags

['B-art',
 'B-eve',
 'B-geo',
 'B-gpe',
 'B-nat',
 'B-org',
 'B-per',
 'B-tim',
 'I-art',
 'I-eve',
 'I-geo',
 'I-gpe',
 'I-nat',
 'I-org',
 'I-per',
 'I-tim']
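A small worked example shows why the O tag is left out. It uses plain token accuracy on made-up labels rather than the classification report, and the helper `token_accuracy` is hypothetical; the point is only that a model predicting O everywhere still looks good while O is counted.

```python
def token_accuracy(y_true, y_pred, ignore=None):
    """Share of correctly predicted tokens, optionally skipping one true label."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != ignore]
    if not pairs:
        return 0.0
    return sum(t == p for t, p in pairs) / len(pairs)


# Toy data: 8 non-entity tokens and 2 entity tokens
y_true_toy = ["O"] * 8 + ["B-geo", "B-per"]
# A useless model that predicts O for every token
y_pred_toy = ["O"] * 10

print(token_accuracy(y_true_toy, y_pred_toy))              # 0.8
print(token_accuracy(y_true_toy, y_pred_toy, ignore="O"))  # 0.0
```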

In [19]:
print(metrics.flat_classification_report(y_test, y_pred, labels = Tags))

              precision    recall  f1-score   support

       B-art       0.33      0.12      0.18        16
       B-eve       0.78      0.41      0.54        17
       B-geo       0.79      0.90      0.84      1047
       B-gpe       0.92      0.78      0.85       509
       B-nat       0.67      0.44      0.53         9
       B-org       0.74      0.66      0.70       580
       B-per       0.82      0.82      0.82       501
       B-tim       0.91      0.83      0.87       562
       I-art       0.00      0.00      0.00         7
       I-eve       0.89      0.62      0.73        13
       I-geo       0.78      0.83      0.80       197
       I-gpe       1.00      0.36      0.53        11
       I-nat       0.75      1.00      0.86         3
       I-org       0.76      0.71      0.74       463
       I-per       0.79      0.89      0.83       543
       I-tim       0.84      0.72      0.77       170

   micro avg       0.81      0.80      0.81      4648
   macro avg       0.74   

In [20]:
countert=Counter(y_test[1])
counterp=Counter(y_pred[1])
print("Testing a use case \n",countert,"\n",counterp)

Testing a use case 
 Counter({'O': 16, 'B-gpe': 1, 'B-tim': 1}) 
 Counter({'O': 16, 'B-gpe': 1, 'B-tim': 1})


### Focus on the main objective: "identify ML entities "

Now, just to take a look, we will check how our current model works in a real environment (ML text).

We have the following text:



In [21]:
text= "Named-entity recognition (NER) refers to a data extraction task that is responsible for finding, storing and sorting textual content into default categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values and percentages."

Now we need to apply the same ***preprocessing*** we applied to our training data in order to use our model.

So we need to split the text into sentences and POS tag them. We will use the spaCy library to do that, and also to compare NER results.

In [22]:
!python -m spacy download en_core_web_md
import spacy
nlp = spacy.load('en_core_web_md')
doc = nlp(text)
print(f"Our doc is:\n{doc}\n\nIts conformed by the sentences\n\n{list(doc.sents)}")

    Linking successful
    /anaconda3/envs/mlprojectsy/lib/python3.6/site-packages/en_core_web_md
    -->
    /anaconda3/envs/mlprojectsy/lib/python3.6/site-packages/spacy/data/en_core_web_md

    You can now load the model via spacy.load('en_core_web_md')

Our doc is:
Named-entity recognition (NER) refers to a data extraction task that is responsible for finding, storing and sorting textual content into default categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values and percentages.

Its conformed by the sentences

[Named-entity recognition (NER) refers to a data extraction task that is responsible for finding, storing and sorting textual content into default categories such as the names of persons, organizations, locations, expressions of times, quantities, moneta

In [23]:
#Picking up a sentence from the text.
sentence=next(doc.sents)
ent=sentence[0]
print(sentence,"\n")

Named-entity recognition (NER) refers to a data extraction task that is responsible for finding, storing and sorting textual content into default categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values and percentages. 



In [24]:
#preprocessing data
translation=str.maketrans('','','-')
#we remove the - character from the spaCy POS tags because the tags in our training set do not contain that character
processedSentence=[(w.text,w.tag_.translate(translation)) for w in sentence]

In [25]:
processedSentence

[('Named', 'VBN'),
 ('-', 'HYPH'),
 ('entity', 'NN'),
 ('recognition', 'NN'),
 ('(', 'LRB'),
 ('NER', 'NNP'),
 (')', 'RRB'),
 ('refers', 'VBZ'),
 ('to', 'IN'),
 ('a', 'DT'),
 ('data', 'NN'),
 ('extraction', 'NN'),
 ('task', 'NN'),
 ('that', 'WDT'),
 ('is', 'VBZ'),
 ('responsible', 'JJ'),
 ('for', 'IN'),
 ('finding', 'VBG'),
 (',', ','),
 ('storing', 'VBG'),
 ('and', 'CC'),
 ('sorting', 'VBG'),
 ('textual', 'JJ'),
 ('content', 'NN'),
 ('into', 'IN'),
 ('default', 'NN'),
 ('categories', 'NNS'),
 ('such', 'JJ'),
 ('as', 'IN'),
 ('the', 'DT'),
 ('names', 'NNS'),
 ('of', 'IN'),
 ('persons', 'NNS'),
 (',', ','),
 ('organizations', 'NNS'),
 (',', ','),
 ('locations', 'NNS'),
 (',', ','),
 ('expressions', 'NNS'),
 ('of', 'IN'),
 ('times', 'NNS'),
 (',', ','),
 ('quantities', 'NNS'),
 (',', ','),
 ('monetary', 'JJ'),
 ('values', 'NNS'),
 ('and', 'CC'),
 ('percentages', 'NNS'),
 ('.', '.')]
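The hyphen removal above relies on `str.translate` with a deletion table. As a standalone sketch, assuming spaCy emits the Penn Treebank bracket tags as `-LRB-` and `-RRB-` (which matches the `LRB` and `RRB` values shown in the output above after cleaning):

```python
# Third argument of str.maketrans lists the characters to delete
strip_hyphens = str.maketrans("", "", "-")

# Sample tags chosen for illustration
spacy_style_tags = ["-LRB-", "-RRB-", "NNP", "HYPH"]
cleaned = [tag.translate(strip_hyphens) for tag in spacy_style_tags]
print(cleaned)  # ['LRB', 'RRB', 'NNP', 'HYPH']
```

Tags without a hyphen pass through unchanged, so the same table can be applied to every token.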

#### Now let's extract features and predict the classes

In [26]:
sentenceFeatures=sent2features(processedSentence)

In [27]:
prediction=crf.predict([sentenceFeatures])

from tabulate import tabulate
from spacy import displacy
displacy.render(sentence, style='ent', jupyter=True)



In [28]:
for i in range(len(prediction[0])):
    print(f"\nWord:{processedSentence[i]} | Prediction:{prediction[0][i]}")


Word:('Named', 'VBN') | Prediction:O

Word:('-', 'HYPH') | Prediction:O

Word:('entity', 'NN') | Prediction:O

Word:('recognition', 'NN') | Prediction:O

Word:('(', 'LRB') | Prediction:O

Word:('NER', 'NNP') | Prediction:B-org

Word:(')', 'RRB') | Prediction:O

Word:('refers', 'VBZ') | Prediction:O

Word:('to', 'IN') | Prediction:O

Word:('a', 'DT') | Prediction:O

Word:('data', 'NN') | Prediction:O

Word:('extraction', 'NN') | Prediction:O

Word:('task', 'NN') | Prediction:O

Word:('that', 'WDT') | Prediction:O

Word:('is', 'VBZ') | Prediction:O

Word:('responsible', 'JJ') | Prediction:O

Word:('for', 'IN') | Prediction:O

Word:('finding', 'VBG') | Prediction:O

Word:(',', ',') | Prediction:O

Word:('storing', 'VBG') | Prediction:O

Word:('and', 'CC') | Prediction:O

Word:('sorting', 'VBG') | Prediction:O

Word:('textual', 'JJ') | Prediction:O

Word:('content', 'NN') | Prediction:O

Word:('into', 'IN') | Prediction:O

Word:('default', 'NN') | Prediction:O

Word:('categories', 'NNS') 



As we can see, the predictions are reasonable for the general-purpose tags (almost every token is O), but the model knows nothing about ML entities yet: "NER" is tagged B-org. We therefore propose the following next steps to improve our predictions:
- Do some data mining and create a small dataset of sentences from ML texts
- Add a custom tag named ML to this custom dataset. This could help the model learn where ML entities tend to appear in an ML text




### Building our ML custom data set

We want to build a custom dataset, and also a helper to extract text from machine learning PDFs.



In [29]:
!pip install pdfminer.six
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter#process_pdf
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from io import StringIO



In [30]:
def pdf_to_text(pdfname):
    # PDFMiner boilerplate
    rsrcmgr = PDFResourceManager()
    sio = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, sio, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    # Extract text (the with block closes the file even if a page fails)
    with open(pdfname, 'rb') as fp:
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)

    # Get text from StringIO
    text = sio.getvalue()

    # Cleanup
    device.close()
    sio.close()

    return text

In [31]:
data=pdf_to_text("mltop10.pdf")


We will build a new data set from a machine learning post on medium.com, trying to pick sentences that contain important and common machine learning entities.


In [32]:
import spacy
nlp = spacy.load('en_core_web_md')
doc = nlp(data)

In [33]:
sentences=(list(doc.sents))

Let's take a look at the data.

In [34]:
print(sentences[10:14])

[Of course, the algorithms you try must be appropriate for your problem,
which is where picking the right machine learning task comes in., As an
analogy, if you need to clean your house, you might use a vacuum, a
broom, or a mop, but you wouldn’t bust out a shovel and start digging.

, The Big Principle

, However, there is a common principle that underlies all supervised
machine learning algorithms for predictive modeling.

]
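The sentences above still carry the hard line breaks of the PDF layout. One possible cleanup, which is not part of the original pipeline, is to collapse those line breaks before handing the text to spaCy. The sketch below assumes that blank lines separate paragraphs and that a single newline is only a layout artifact; `normalize_pdf_text` is a hypothetical helper name.

```python
import re


def normalize_pdf_text(raw):
    """Collapse layout line breaks, keeping blank lines as paragraph breaks."""
    paragraphs = re.split(r"\n\s*\n", raw)
    # str.split() with no arguments also removes tabs and repeated spaces
    cleaned = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)


sample = "Linear regression has been\nextensively studied.\n\nThe Big Principle\n"
print(normalize_pdf_text(sample))
```

Words hyphenated across a line break are left as they are, since a rule for rejoining them would need to know whether the hyphen belongs to the word.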


We will use the spaCy entity recognition module to identify some ML entities, and then define our own list of entities to obtain an improved dataset.

In [35]:
print(doc.ents)

(Newbies, James Le Jan 20, 

, the “, No Free Lunch, 

, 

, 

, 

, 

, 

, Y, 

, 

, Y =, Y, 

, 10, Linear Regression, one, 

, 

, B1, B0, B1, 

, more than 200 years, two, 

, 

, 0, 0 and 1, less than 0.5, 

, 3, Linear Discriminant Analysis, two, more than two, Linear Discriminant Analysis, 

, 

1, 2, 

, Linear Discriminant Analysis, Gaussian, 

, 4, 

, 5, 

, two, 1, 2, Bayes Theorem, Gaussian, 6, KNN, KNN, 

, 

, 
inches, 

, 

, 

, 7, K-Nearest Neighbors, The Learning Vector Quantization, 

, LVQ, K-Nearest Neighbors, between 0 and 1, 

, KNN, LVQ, 

, 8, Vector Machines, one, 

, SVM, two-dimensions, SVM, 

, two, 

, one, 

, 9, Random Forest, Random Forest, one, 

, 

, Random Forest, 

, 

, 

, 10, second, first, first, AdaBoost, 

, first, 

, 

, 1, 2, 3, 4, 

, Machine Learning, Machine Learning, 

, — —

, GitHub, https://jameskle.com/., LinkedIn, 

 )


As we can see, there are many ML entities.
We select the sentences that contain those ML entities
and that are long enough to contribute significant features to our model.

In [36]:
definedMLentites=["KNN","Random","Forest","SVM","K-Nearest","Neighbors","Regression","Trees","Linear","Regression","LVQ","Naive","Bayes","Linear","Discriminant","Analysis","LDA","Random","Forest","Neural","Network","Logistic","Regression"]
lowerDML=[mlen.lower() for mlen in definedMLentites]
customData=[sentence for sentence in doc.sents if len(sentence)>9 and  any(mlEnt in sentence.text for mlEnt in definedMLentites)]

In [37]:
customData[:4]

[Linear regression is perhaps one of the most well-known and well-,
 Linear regression has been around for more than 200 years and has been
 extensively studied.,
 Logistic regression is another technique borrowed by machine learning
 from the field of statistics.,
 Logistic regression is like linear regression in that the goal is to find the
 values for the coefficients that weight each input variable.]
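The selection above uses `mlEnt in sentence.text`, which is a case-sensitive substring test: "Linear" also matches "Linearly", while an all-lowercase mention does not match a capitalized list entry. A hedged alternative, shown on plain strings rather than spaCy spans, is a case-insensitive whole-word match. The helper `mentions_any` is hypothetical and not part of the notebook.

```python
import re


def mentions_any(sentence, terms):
    """Case-insensitive, whole-word check for any of the terms."""
    if not terms:
        return False
    # re.escape keeps characters such as "-" in "K-Nearest" literal
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b"
    return re.search(pattern, sentence, flags=re.IGNORECASE) is not None


terms = ["Random", "Forest", "Linear", "K-Nearest"]
print(mentions_any("We fit a random forest to the data.", terms))  # True
print(mentions_any("Linearly separable data", terms))              # False
print("Linear" in "Linearly separable data")                       # True
```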


As we can see, we have a good selection of sentences that contain the most common ML entities, so we will extract these sentences and tag them using the custom IOB tags "B-ML" and "I-ML", which indicate that a token refers to a machine learning entity.

Now we will POS tag the dataset and run entity recognition on it; afterwards we will change the tags to indicate where the ML entities are located.

In [38]:
customData=[[(word.text,
              word.tag_.translate(translation),
              "B-ML" if word.text in definedMLentites
              else "I-ML" if word.text in lowerDML
              else "O" if word.ent_iob_ == "O"
              else f"{word.ent_iob_}-{word.ent_type_.lower()}")
             for word in sentence]
            for sentence in customData]
customData[:2]

[[('Linear', 'NNP', 'B-ML'),
  ('regression', 'NN', 'I-ML'),
  ('is', 'VBZ', 'O'),
  ('perhaps', 'RB', 'O'),
  ('one', 'CD', 'B-cardinal'),
  ('of', 'IN', 'O'),
  ('the', 'DT', 'O'),
  ('most', 'RBS', 'O'),
  ('well', 'RB', 'O'),
  ('-', 'HYPH', 'O'),
  ('known', 'VBN', 'O'),
  ('and', 'CC', 'O'),
  ('well-', 'JJ', 'O'),
  ('\n', '', 'O')],
 [('Linear', 'NNP', 'B-ML'),
  ('regression', 'NN', 'I-ML'),
  ('has', 'VBZ', 'O'),
  ('been', 'VBN', 'O'),
  ('around', 'RB', 'O'),
  ('for', 'IN', 'O'),
  ('more', 'JJR', 'B-date'),
  ('than', 'IN', 'I-date'),
  ('200', 'CD', 'I-date'),
  ('years', 'NNS', 'I-date'),
  ('and', 'CC', 'O'),
  ('has', 'VBZ', 'O'),
  ('been', 'VBN', 'O'),
  ('\n', '', 'O'),
  ('extensively', 'RB', 'O'),
  ('studied', 'VBD', 'O'),
  ('.', '.', 'O')]]
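Note that the cell above picks B-ML or I-ML from the capitalization of each word, which happens to work for "Linear regression" but would tag "Logistic Regression" as two B-ML tokens. As a possible variant, and not what the notebook does, the tags can be assigned by position inside a matched phrase: B-ML for the first token and I-ML for the rest. The function `tag_phrases` and the phrase list are assumptions for illustration.

```python
def tag_phrases(tokens, phrases, label="ML"):
    """Assign B-/I- tags to tokens that match one of the phrases."""
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    # Longer phrases first, so "Linear Regression" wins over "Regression"
    for phrase in sorted(phrases, key=len, reverse=True):
        words = phrase.lower().split()
        if not words:
            continue
        n = len(words)
        for start in range(len(tokens) - n + 1):
            window_free = all(t == "O" for t in tags[start:start + n])
            if window_free and lowered[start:start + n] == words:
                tags[start] = f"B-{label}"
                for k in range(start + 1, start + n):
                    tags[k] = f"I-{label}"
    return tags


tokens = ["Logistic", "Regression", "is", "like", "linear", "regression", "."]
phrases = ["Linear Regression", "Logistic Regression", "Regression"]
print(tag_phrases(tokens, phrases))
# ['B-ML', 'I-ML', 'O', 'O', 'B-ML', 'I-ML', 'O']
```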

Now let's add this custom data to our training data and see what happens.

First we have to extract the features.

### Building the training and testing set


In [39]:
X.extend([sent2features(s) for s in customData])
y.extend([sent2labels(s) for s in customData])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [40]:
crf.fit(X_train, y_train)

CRF(algorithm='lbfgs', all_possible_states=None,
  all_possible_transitions=True, averaging=None, c=None, c1=0.1, c2=0.1,
  calibration_candidates=None, calibration_eta=None,
  calibration_max_trials=None, calibration_rate=None,
  calibration_samples=None, delta=None, epsilon=None, error_sensitive=None,
  gamma=None, keep_tempfiles=None, linesearch=None, max_iterations=100,
  max_linesearch=None, min_freq=None, model_filename=None,
  num_memories=None, pa_type=None, period=None, trainer_cls=None,
  variance=None, verbose=False)

Now let's test our model on some ML text.

In [41]:
test="To solve a classification task by a supervised machine learning model like SVM, the task usually involves with training and testing data, which consist of some data instances."

Again, we have to preprocess the ML text.

In [42]:
test = nlp(test)
#Picking up a sentence from the text.
sentence=next(test.sents)

In [43]:
#preprocessing data
translation=str.maketrans('','','-')
#we remove the - character from the spaCy POS tags because the tags in our training set do not contain that character
processedSentence=[(w.text,w.tag_.translate(translation)) for w in sentence]

In [44]:
processedSentence

[('To', 'TO'),
 ('solve', 'VB'),
 ('a', 'DT'),
 ('classification', 'NN'),
 ('task', 'NN'),
 ('by', 'IN'),
 ('a', 'DT'),
 ('supervised', 'JJ'),
 ('machine', 'NN'),
 ('learning', 'VBG'),
 ('model', 'NN'),
 ('like', 'IN'),
 ('SVM', 'NNP'),
 (',', ','),
 ('the', 'DT'),
 ('task', 'NN'),
 ('usually', 'RB'),
 ('involves', 'VBZ'),
 ('with', 'IN'),
 ('training', 'NN'),
 ('and', 'CC'),
 ('testing', 'NN'),
 ('data', 'NNS'),
 (',', ','),
 ('which', 'WDT'),
 ('consist', 'VBP'),
 ('of', 'IN'),
 ('some', 'DT'),
 ('data', 'NN'),
 ('instances', 'NNS'),
 ('.', '.')]

In [45]:
sentenceFeatures=sent2features(processedSentence)

Now we will make a prediction using spaCy and then compare it with the prediction from our model.

In [46]:
from spacy import displacy 
displacy.render(sentence, style='ent', jupyter=True)

In [47]:
prediction=crf.predict([sentenceFeatures])

In [48]:
for i in range(len(prediction[0])):
    print(f"\nWord:{processedSentence[i]} | Prediction:{prediction[0][i]}")


Word:('To', 'TO') | Prediction:O

Word:('solve', 'VB') | Prediction:O

Word:('a', 'DT') | Prediction:O

Word:('classification', 'NN') | Prediction:O

Word:('task', 'NN') | Prediction:O

Word:('by', 'IN') | Prediction:O

Word:('a', 'DT') | Prediction:O

Word:('supervised', 'JJ') | Prediction:O

Word:('machine', 'NN') | Prediction:O

Word:('learning', 'VBG') | Prediction:O

Word:('model', 'NN') | Prediction:O

Word:('like', 'IN') | Prediction:O

Word:('SVM', 'NNP') | Prediction:B-org

Word:(',', ',') | Prediction:O

Word:('the', 'DT') | Prediction:O

Word:('task', 'NN') | Prediction:O

Word:('usually', 'RB') | Prediction:O

Word:('involves', 'VBZ') | Prediction:O

Word:('with', 'IN') | Prediction:O

Word:('training', 'NN') | Prediction:O

Word:('and', 'CC') | Prediction:O

Word:('testing', 'NN') | Prediction:O

Word:('data', 'NNS') | Prediction:O

Word:(',', ',') | Prediction:O

Word:('which', 'WDT') | Prediction:O

Word:('consist', 'VBP') | Prediction:O

Word:('of', 'IN') | Prediction:

The model flags SVM as an entity, although in the output above it is labeled B-org rather than B-ML.

Now we will measure the accuracy.


In [49]:
y_pred=crf.predict(X_test)
Tags.extend(["I-ML","B-ML"])

In [50]:
print(metrics.flat_classification_report(y_test, y_pred, labels = Tags))

              precision    recall  f1-score   support

       B-art       0.40      0.14      0.21        14
       B-eve       0.78      0.44      0.56        16
       B-geo       0.81      0.87      0.84      1060
       B-gpe       0.92      0.80      0.86       484
       B-nat       0.83      0.45      0.59        11
       B-org       0.73      0.70      0.71       581
       B-per       0.81      0.84      0.83       499
       B-tim       0.91      0.85      0.88       565
       I-art       0.00      0.00      0.00         3
       I-eve       0.86      0.40      0.55        15
       I-geo       0.82      0.73      0.77       219
       I-gpe       1.00      0.25      0.40        12
       I-nat       0.75      1.00      0.86         3
       I-org       0.76      0.73      0.75       497
       I-per       0.81      0.90      0.86       575
       I-tim       0.79      0.70      0.75       162
        I-ML       0.00      0.00      0.00         2
        B-ML       0.00    



As we can see, the ML tags get scores of 0. This is largely due to the big difference in the amount of data: 150000 from the Kaggle dataset vs. 24 from the custom dataset, so the test set contains only 2 I-ML tokens. The model could therefore be improved considerably if we had sufficient ML data.

Now let's use spaCy to build a custom model.

## Training the spacy model

In [51]:
from __future__ import unicode_literals, print_function

import plac
import random
from pathlib import Path
import spacy
from spacy.util import minibatch, compounding

In [52]:
# new entity label
LABEL = 'ML'

spaCy uses its own training data format, which looks like this:

TRAIN_DATA = [("Phrase", {'entities': [(0, 6, 'ANIMAL')]}),("Phrase", {'entities': []})]
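In this format each entity is a `(start, end, label)` triple of character offsets into the text, with `end` exclusive. A quick way to sanity-check annotations is to slice the text with those offsets. The helper `entity_texts` is hypothetical, and the example sentence is one of the custom sentences selected earlier in this notebook.

```python
def entity_texts(example):
    """Return the (text, label) pairs that the character offsets point at."""
    text, annotations = example
    return [(text[start:end], label) for start, end, label in annotations["entities"]]


example = (
    "Linear regression has been around for more than 200 years.",
    {"entities": [(0, 17, "ML")]},
)
print(entity_texts(example))  # [('Linear regression', 'ML')]
```

If a slice cuts a word in half or includes stray whitespace, the offsets are wrong and should be fixed before training.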

So let's use the data that we obtained previously.

In [53]:
definedMLentites=["KNN","Random","Forest","SVM","K-Nearest","Neighbors","Regression","Trees","Linear","Regression","LVQ","Naive Bayes","Linear","Discriminant","Analysis","LDA","Random","Forest","Neural","Network","Logistic","Regression"]
lowerDML=[mlen.lower() for mlen in definedMLentites]
customData=[sentence for sentence in doc.sents if len(sentence)>9 and  any(mlEnt in sentence.text for mlEnt in definedMLentites)]

In [54]:
customData[:5]

[Linear regression is perhaps one of the most well-known and well-,
 Linear regression has been around for more than 200 years and has been
 extensively studied.,
 Logistic regression is another technique borrowed by machine learning
 from the field of statistics.,
 Logistic regression is like linear regression in that the goal is to find the
 values for the coefficients that weight each input variable.,
 Logistic Regression is a classification algorithm traditionally limited to
 only two-class classification problems.]

In [55]:
definedMLentites=["KNN","Random Forest","SVM","K-Nearest Neighbors","Decision Trees","Trees","Linear Regression","LVQ","Naive Bayes","Linear Discriminant Analysis","Learning Vector Quantization","LDA","Random Forest","Neural Network","Logistic Regression"]

# Build TRAIN_DATA entries of the form (sentence, {'entities': [(start, end, label)]})
import re

TRAIN_DATA=list()

regexEnt=[re.compile(ent, re.IGNORECASE) for ent in definedMLentites]
for sentence in customData:
    matches=[]
    for regex in regexEnt:
        for match in regex.finditer(sentence.text):
            matches.append((match.start(),match.end(),"ML"))
    TRAIN_DATA.append((sentence.text,{"entities":matches}))

print(TRAIN_DATA[:5])

[('Linear regression is perhaps one of the most well-known and well-\n', {'entities': [(0, 17, 'ML')]}), ('Linear regression has been around for more than 200 years and has been\nextensively studied.', {'entities': [(0, 17, 'ML')]}), ('Logistic regression is another technique borrowed by machine learning\nfrom the field of statistics.', {'entities': [(0, 19, 'ML')]}), ('Logistic regression is like linear regression in that the goal is to find the\nvalues for the coefficients that weight each input variable.', {'entities': [(28, 45, 'ML'), (0, 19, 'ML')]}), ('Logistic Regression is a classification algorithm traditionally limited to\nonly two-class classification problems.', {'entities': [(0, 19, 'ML')]})]
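One thing to watch in the cell above: the entity list contains "Random Forest" twice and both "Decision Trees" and "Trees", so a sentence that mentions them would produce duplicate or nested spans, which spaCy's NER training may reject. A minimal sketch of one way to clean the matches, keeping the longest span whenever two overlap; `drop_overlaps` is a hypothetical helper and not part of the notebook.

```python
def drop_overlaps(spans):
    """Remove duplicate and overlapping (start, end, label) spans, longest first."""
    kept = []
    # set() removes exact duplicates; longer spans are considered first
    for start, end, label in sorted(set(spans), key=lambda s: (-(s[1] - s[0]), s[0])):
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)


# "Decision Trees" at 0-14, "Trees" at 9-14, a duplicate, and an unrelated span
matches = [(9, 14, "ML"), (0, 14, "ML"), (0, 14, "ML"), (20, 23, "ML")]
print(drop_overlaps(matches))  # [(0, 14, 'ML'), (20, 23, 'ML')]
```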


In [56]:
new_model_name="ML"
output_dir="SpacyModels"
def trainSpacyNER(model=None,new_model_name='ML', output_dir="SPacy", n_iter=10):
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank('en')  # create blank Language class
        print("Created blank 'en' model")
    # Add entity recognizer to model if it's not in the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner)
    # otherwise, get it, so we can add labels to it
    else:
        ner = nlp.get_pipe('ner')

    ner.add_label(LABEL)   # add new entity label to entity recognizer
    if model is None:
        optimizer = nlp.begin_training()
    else:
        # Note that 'begin_training' initializes the models, so it'll zero out
        # existing entity types.
        optimizer = nlp.entity.create_optimizer()

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            losses = {}
            # batch up the examples using spaCy's minibatch
            batches = minibatch(TRAIN_DATA, size=compounding(4., 32., 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer, drop=0.35,
                           losses=losses)
            print('Losses', losses)
    # save model to output directory
    print(output_dir)
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.meta['name'] = new_model_name  # rename model
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)
    return nlp

        

In [57]:
nlp=trainSpacyNER(None,new_model_name=new_model_name,output_dir="SpacyModels")

Created blank 'en' model
Losses {'ner': 14.03367133517014}
Losses {'ner': 4.3819994722815485}
Losses {'ner': 3.863196328151602}
Losses {'ner': 2.616340077769049}
Losses {'ner': 1.7884353150897003}
Losses {'ner': 1.0731156938768376}
Losses {'ner': 1.3629447333280211}
Losses {'ner': 0.5518898429205578}
Losses {'ner': 0.6829992751232408}
Losses {'ner': 0.4944052013255248}
SpacyModels
Saved model to SpacyModels


## Making the prediction and visualizing the results

In [68]:
# test the trained model
test_text = """A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane."""
doc = nlp(test_text)
#Let's visualize the entities in the test_text
colors = {'ML': 'linear-gradient(90deg, #aa9cfc, #fc9ce7)'}
options = {'ents': ['ML'], 'colors': colors}
displacy.render(doc, style='ent',jupyter=True,options=options)