# Fine-Tune embeddings based on SPAM dataset



*   It comprises of popular and state-of-the-art word embeddings, such as GloVe, BERT, ELMo, Character Embeddings, etc. There are very easy to use thanks to the Flair API
*   Flair’s interface allows us to combine different word embeddings and use them to embed documents. This in turn leads to a significant uptick in results

*   ‘Flair Embedding’ is the signature embedding provided within the Flair library. It is powered by contextual string embeddings. 
*   It is also possible to fine-tune embeddings to your specific domain before using them for classification/tagging tasks!
*   Flair supports a number of languages – and is always looking to add new ones








## Flair supports 3 main models as of the time of this notebook: TextClassifier, TextRegressor and SequenceTagger - ONLY RUNS ON COLLAB BECAUSE OF MODEL SIZES

In [1]:
!pip install tiny-tokenizer flair



In [2]:
import pandas as pd
import os
import io
from pathlib import Path
from sklearn.model_selection import train_test_split
from flair.data import Corpus
from flair.datasets import ClassificationCorpus, TREC_6              
from typing import List

from flair.data import Corpus
from flair.datasets import TREC_6, WIKINER_ENGLISH


from flair.data_fetcher import NLPTaskDataFetcher
from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentLSTMEmbeddings, CharacterEmbeddings 
from flair.embeddings import TokenEmbeddings, StackedEmbeddings, DocumentRNNEmbeddings, BertEmbeddings, OpenAIGPTEmbeddings
from flair.models import TextClassifier, SequenceTagger
from flair.visual.training_curves import Plotter
from flair.trainers import ModelTrainer

from flair.data import Dictionary
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

from google.colab import drive

# Text Classification on Spam 

### Flair expects FastText format of data!

In [3]:
drive.mount('/content/gdrive')
datapath = 'gdrive/My Drive/spam.csv'                                           # spam.csv has to be in your google drive!
df = pd.read_csv(datapath, encoding='latin-1')

# from google.colab import files                                                # alternative to opening GDrive files - upload them manually
# uploaded = files.upload()
#df = pd.read_csv(io.StringIO(uploaded['spam.csv'].decode('latin-1')))

df = df[['v1', 'v2']].rename(columns={"v1":"label", "v2":"text"})
df['label'] = '__label__' + df['label'].astype(str)                             # That is how Flair expects the labels as of here:
                                                                                # https://towardsdatascience.com/text-classification-with-state-of-the-art-nlp-library-flair-b541d7add21f
                                                                                # a.k.a FastText format (Facebook)
df.head()

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/gdrive


Unnamed: 0,label,text
0,__label__ham,"Go until jurong point, crazy.. Available only ..."
1,__label__ham,Ok lar... Joking wif u oni...
2,__label__spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,__label__ham,U dun say so early hor... U c already then say...
4,__label__ham,"Nah I don't think he goes to usf, he lives aro..."


In [0]:
## Split Train/Test/Validation

from sklearn.model_selection import train_test_split

pathfiles = 'gdrive/My Drive/'
data_folder = 'gdrive/My Drive/'


df.iloc[0:int(len(df)*0.8)].to_csv('train.csv', sep='\t', index = False, header = False)
df.iloc[int(len(df)*0.8):int(len(df)*0.9)].to_csv('test.csv', sep='\t', index = False, header = False)
df.iloc[int(len(df)*0.9):].to_csv('dev.csv', sep='\t', index = False, header = False);

# train, test = train_test_split(df, test_size=0.2, random_state=1)
# train, validation = train_test_split(train, test_size=0.2, random_state=1)

# train.to_csv(pathfiles+'train.csv')
# test.to_csv(pathfiles+'test.csv')
# validation.to_csv(pathfiles+'dev.csv')

In [5]:
#corpus: Corpus = NLPTaskDataFetcher.load_classification_corpus(Path(data_folder))           # old classification corpus load - depricated

corpus = NLPTaskDataFetcher.load_classification_corpus(Path('./'), 
                                                       test_file='test.csv', 
                                                       dev_file='dev.csv', 
                                                       train_file='train.csv')               # WORKS for classification

#corpus : Corpus = ClassificationCorpus(data_folder, test_file='test.csv', dev_file='dev.csv', train_file='train.csv')
  
# corpus: Corpus = TextCorpus(data_folder,                                                   # maybe used for fine tuning
#                     dictionary=Dictionary,
#                     #is_forward_lm=True,
#                     character_level=True)

label_dict = corpus.make_label_dictionary()

print(corpus)                                                                                
print(label_dict)                                                                            # Prints the different labels
print(corpus.get_label_distribution())                                                       # label distribution in the dataset

2020-01-18 19:09:51,313 Reading data from .
2020-01-18 19:09:51,316 Train: train.csv
2020-01-18 19:09:51,319 Dev: dev.csv
2020-01-18 19:09:51,320 Test: test.csv


  after removing the cwd from sys.path.
  train_file, tokenizer=tokenizer, max_tokens_per_doc=max_tokens_per_doc
  test_file, tokenizer=tokenizer, max_tokens_per_doc=max_tokens_per_doc
  dev_file, tokenizer=tokenizer, max_tokens_per_doc=max_tokens_per_doc


2020-01-18 19:09:53,305 Computing label dictionary. Progress:


100%|██████████| 4457/4457 [00:00<00:00, 339835.53it/s]

2020-01-18 19:09:53,345 [b'ham', b'spam']
Corpus: 4457 train + 558 dev + 557 test sentences
<flair.data.Dictionary object at 0x7f8bcb000e80>
defaultdict(<function Corpus.get_label_distribution.<locals>.<lambda> at 0x7f8bd04cc8c8>, {'ham': 3855, 'spam': 602})





In [6]:
#3. Choose your embedding types!
word_embeddings = [WordEmbeddings('glove')
                   ,FlairEmbeddings('news-forward-fast') 
                   ,FlairEmbeddings('news-backward-fast')
                  #,OpenAIGPTEmbeddings()
                   #,BertEmbeddings()
                  ]

#4. Init document embedding
document_embeddings = DocumentRNNEmbeddings(word_embeddings, hidden_size=512, reproject_words=True, reproject_words_dimension=256)

#5. Create the text classifier
#classifier = TextClassifier(document_embeddings, label_dictionary=corpus.make_label_dictionary(), multi_label=False)
classifier = TextClassifier(document_embeddings, label_dictionary=corpus.make_label_dictionary(), multi_label=False)    # Try with multi_label = True


#6. Initialize the text classifier trainer
#trainer = TextClassifier(classifier, corpus, label_dict)

classifier = TextClassifier(document_embeddings, label_dict, multi_label=False)

trainer = ModelTrainer(classifier, corpus)

#7. Start the training
trainer.train('./' 
              #, learning_rate=0.1
              #, mini_batch_size=32
              #, anneal_factor=0.5
              #, patience=5
              , max_epochs=10)

# #8. Predict something
# sentence = classifier.predict(Sentence('hello'))
# print(sentence.labels)

#9. plot training curves (optional)                                             # need to locate files on disk to work!
# plotter = Plotter()
# plotter.plot_training_curves('./resources/sentiment_classifier-11classes-en/results/loss.tsv')
# plotter.plot_weights('./resources/sentiment_classifier-11classes-en/results/weights.txt')

2020-01-18 19:10:00,252 https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings/glove.gensim.vectors.npy not found in cache, downloading to /tmp/tmpon4ywkqf


100%|██████████| 160000128/160000128 [00:09<00:00, 17194814.96B/s]

2020-01-18 19:10:10,025 copying /tmp/tmpon4ywkqf to cache at /root/.flair/embeddings/glove.gensim.vectors.npy





2020-01-18 19:10:10,220 removing temp file /tmp/tmpon4ywkqf
2020-01-18 19:10:11,326 https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings/glove.gensim not found in cache, downloading to /tmp/tmp17qwaxh4


100%|██████████| 21494764/21494764 [00:01<00:00, 11978502.29B/s]

2020-01-18 19:10:13,592 copying /tmp/tmp17qwaxh4 to cache at /root/.flair/embeddings/glove.gensim
2020-01-18 19:10:13,612 removing temp file /tmp/tmp17qwaxh4



  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL


2020-01-18 19:10:15,887 https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings/lm-news-english-forward-1024-v0.2rc.pt not found in cache, downloading to /tmp/tmpw4qna2ud


100%|██████████| 19689779/19689779 [00:01<00:00, 10467083.04B/s]

2020-01-18 19:10:18,312 copying /tmp/tmpw4qna2ud to cache at /root/.flair/embeddings/lm-news-english-forward-1024-v0.2rc.pt
2020-01-18 19:10:18,331 removing temp file /tmp/tmpw4qna2ud





2020-01-18 19:10:29,184 https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings/lm-news-english-backward-1024-v0.2rc.pt not found in cache, downloading to /tmp/tmp3vlhf4nm


100%|██████████| 19689779/19689779 [00:01<00:00, 11111307.98B/s]

2020-01-18 19:10:31,530 copying /tmp/tmp3vlhf4nm to cache at /root/.flair/embeddings/lm-news-english-backward-1024-v0.2rc.pt
2020-01-18 19:10:31,551 removing temp file /tmp/tmp3vlhf4nm





2020-01-18 19:10:32,118 Computing label dictionary. Progress:


100%|██████████| 4457/4457 [00:00<00:00, 250630.30it/s]

2020-01-18 19:10:32,139 [b'ham', b'spam']
2020-01-18 19:10:32,146 ----------------------------------------------------------------------------------------------------
2020-01-18 19:10:32,147 Model: "TextClassifier(
  (document_embeddings): DocumentRNNEmbeddings(
    (embeddings): StackedEmbeddings(
      (list_embedding_0): WordEmbeddings('glove')
      (list_embedding_1): FlairEmbeddings(
        (lm): LanguageModel(
          (drop): Dropout(p=0.25, inplace=False)
          (encoder): Embedding(275, 100)
          (rnn): LSTM(100, 1024)
          (decoder): Linear(in_features=1024, out_features=275, bias=True)
        )
      )
      (list_embedding_2): FlairEmbeddings(
        (lm): LanguageModel(
          (drop): Dropout(p=0.25, inplace=False)
          (encoder): Embedding(275, 100)
          (rnn): LSTM(100, 1024)
          (decoder): Linear(in_features=1024, out_features=275, bias=True)
        )
      )
    )
    (word_reprojection_map): Linear(in_features=2148, out_features=2




2020-01-18 19:10:32,488 epoch 1 - iter 0/140 - loss 0.77019781 - samples/sec: 1413.99
2020-01-18 19:10:35,268 epoch 1 - iter 14/140 - loss 0.33843742 - samples/sec: 162.02
2020-01-18 19:10:37,501 epoch 1 - iter 28/140 - loss 0.28109490 - samples/sec: 202.10
2020-01-18 19:10:39,409 epoch 1 - iter 42/140 - loss 0.24306910 - samples/sec: 236.94
2020-01-18 19:10:41,723 epoch 1 - iter 56/140 - loss 0.23098593 - samples/sec: 194.91
2020-01-18 19:10:43,881 epoch 1 - iter 70/140 - loss 0.21363495 - samples/sec: 209.30
2020-01-18 19:10:46,030 epoch 1 - iter 84/140 - loss 0.19400018 - samples/sec: 210.57
2020-01-18 19:10:48,126 epoch 1 - iter 98/140 - loss 0.17833633 - samples/sec: 215.86
2020-01-18 19:10:50,498 epoch 1 - iter 112/140 - loss 0.16816837 - samples/sec: 190.16
2020-01-18 19:10:52,554 epoch 1 - iter 126/140 - loss 0.16221067 - samples/sec: 219.74
2020-01-18 19:10:54,566 ----------------------------------------------------------------------------------------------------
2020-01-18 19

  "type " + obj.__name__ + ". It won't be checked "
  "type " + obj.__name__ + ". It won't be checked "
  "type " + obj.__name__ + ". It won't be checked "
  "type " + obj.__name__ + ". It won't be checked "
  "type " + obj.__name__ + ". It won't be checked "
  "type " + obj.__name__ + ". It won't be checked "
  "type " + obj.__name__ + ". It won't be checked "
  "type " + obj.__name__ + ". It won't be checked "
  "type " + obj.__name__ + ". It won't be checked "
  "type " + obj.__name__ + ". It won't be checked "


2020-01-18 19:11:00,040 ----------------------------------------------------------------------------------------------------
2020-01-18 19:11:00,116 epoch 2 - iter 0/140 - loss 0.01854147 - samples/sec: 6255.28
2020-01-18 19:11:00,945 epoch 2 - iter 14/140 - loss 0.09392258 - samples/sec: 550.44
2020-01-18 19:11:01,760 epoch 2 - iter 28/140 - loss 0.10018810 - samples/sec: 560.94
2020-01-18 19:11:02,572 epoch 2 - iter 42/140 - loss 0.08292114 - samples/sec: 563.27
2020-01-18 19:11:03,463 epoch 2 - iter 56/140 - loss 0.07423747 - samples/sec: 518.36
2020-01-18 19:11:04,298 epoch 2 - iter 70/140 - loss 0.07306584 - samples/sec: 549.96
2020-01-18 19:11:05,107 epoch 2 - iter 84/140 - loss 0.07050647 - samples/sec: 565.29
2020-01-18 19:11:05,910 epoch 2 - iter 98/140 - loss 0.07722138 - samples/sec: 570.50
2020-01-18 19:11:06,688 epoch 2 - iter 112/140 - loss 0.07726955 - samples/sec: 588.79
2020-01-18 19:11:07,535 epoch 2 - iter 126/140 - loss 0.08307600 - samples/sec: 539.59
2020-01-18 19

{'test_score': 0.9803,
 'dev_score_history': [0.9695,
  0.9731,
  0.9821,
  0.9857,
  0.9444,
  0.9875,
  0.9767,
  0.9767,
  0.9875,
  0.9875],
 'train_loss_history': [0.1562643663824669,
  0.08526597598434559,
  0.07627157708629966,
  0.060274979184448185,
  0.05797219698184303,
  0.057871729982019005,
  0.05044327272501375,
  0.042743298105363335,
  0.04400325102532016,
  0.03971474029655967],
 'dev_loss_history': [tensor(0.0839, device='cuda:0'),
  tensor(0.0892, device='cuda:0'),
  tensor(0.0567, device='cuda:0'),
  tensor(0.0523, device='cuda:0'),
  tensor(0.1743, device='cuda:0'),
  tensor(0.0465, device='cuda:0'),
  tensor(0.0680, device='cuda:0'),
  tensor(0.0599, device='cuda:0'),
  tensor(0.0388, device='cuda:0'),
  tensor(0.0451, device='cuda:0')]}

# Text Classification on TREC_6 

In [0]:
# 1. get the corpus

corpus = TREC_6()

# 2. create the label dictionary
label_dict = corpus.make_label_dictionary()

print(len(corpus.train))

# 3. make a list of word embeddings
word_embeddings = [#WordEmbeddings('glove')
                    #,FlairEmbeddings('news-forward')                           # unknown CUDA error when trying to load these - It's because of GPU VRAM
                    #,FlairEmbeddings('news-forward-fast')                         
                    #,FlairEmbeddings('news-backward')          
                    #,FlairEmbeddings('multi-forward')
                    #,FlairEmbeddings('multi-backward')
                    BertEmbeddings('bert-base-multilingual-uncased')            # BERT + GPT embeddings = really slow training!
                    ,OpenAIGPTEmbeddings()
                    #,BytePairEmbeddings(language='en')                         # new in Flair, supposedly very good for small models
                    #,CharacterEmbeddings()                     
                  ]

# 4. initialize document embedding by passing list of word embeddings
# Can choose between many RNN types (GRU by default, to change use rnn_type parameter)
document_embeddings: DocumentRNNEmbeddings = DocumentRNNEmbeddings(word_embeddings
                                                                     , hidden_size=512
                                                                     , reproject_words=True
                                                                     , reproject_words_dimension=256
                                                                     #, rnn_type='LSTM'
                                                                     )

# 5. create the text classifier
classifier = TextClassifier(document_embeddings, label_dictionary=label_dict)

# 6. initialize the text classifier trainer
trainer = ModelTrainer(classifier, corpus)

# 7. start the training
trainer.train('resources/taggers/ag_news',                                      
              learning_rate=0.1,
              mini_batch_size=32,
              anneal_factor=0.5,
              patience=5,
              max_epochs=15,
             checkpoint=True)                                                                # checkpoint set to True enables you to stop training and resume later!

# checkpoint = tagger.load_checkpoint(Path('resources/taggers/example-ner/checkpoint.pt'))   # Resume from Checkpoint if previously paused training
# trainer = ModelTrainer.load_from_checkpoint(checkpoint, corpus)
# trainer.train('resources/taggers/example-ner',
#               EvaluationMetric.MICRO_F1_SCORE,
#               learning_rate=0.1,
#               mini_batch_size=32,
#               anneal_factor=0.5,
#               max_epochs=15,
#               checkpoint=True)

# 8. plot training curves (optional)                                                         # need to locate files on disk to work!
# plotter = Plotter()
# plotter.plot_training_curves('loss.tsv')
# plotter.plot_weights('weights.txt'

OSError: ignored

# NER on custom dataset (Kaggle)

In [7]:
# IMPORTANT: Data is expected to be of 'Obama N B-PER' (word pos ner) format, WITHOUT full stop lines, and a gap line separating different sentences

# Load data csv

drive.mount('/content/gdrive')
os.path.exists('gdrive/My Drive/ner_dataset.csv')

# from google.colab import files                                                # alternatively upload it manually using this snippet
# uploaded = files.upload()
# df = pd.read_csv(io.StringIO(uploaded['train_ner.csv']
#                              .decode('latin-1').fillna(method="ffill")))

drive.mount('/content/gdrive')
datapath = 'gdrive/My Drive/ner_dataset.csv'                                    # train_ner.csv has to be in your google drive!
df = pd.read_csv(datapath, encoding='latin-1').fillna(method="ffill")

# split and save train/test/dev to csvs because flair only reads shit from disk when loading corpus

df = df[['Word', 'POS', 'Tag']]                                                 # discard sentence no

df.rename(columns={'Word': 'text', 'POS': 'pos', 'Tag':'ner'}, inplace=True)    # rename it as flair expects

data_folder = 'gdrive/My Drive/nerReformed'

### replace rows where text=='.' with a blank row (GH proposal)

df.text = df.text.replace('.', '')
df.pos = df.pos.replace('.', '')
df.ner = df.pos.replace('', '')

#df.loc[(df['pos']=='.')]                 # should return nothing

if not os.path.exists(data_folder):
  os.mkdir(data_folder)
  
# ensure that a single sentence is not spitted in different train/dev/test sets



# df.iloc[0:int(len(df)*0.8)].to_csv(data_folder+'/train.csv', sep='\t', index = False, header = False)
# df.iloc[int(len(df)*0.8):int(len(df)*0.9)].to_csv(data_folder+'/test.csv', sep='\t', index = False, header = False)
# df.iloc[int(len(df)*0.9):].to_csv(data_folder+'/dev.csv', sep='\t', index = False, header = False);

df.head(25)          # check if data is as we want it to be

### For Flair ColumnCorpus we need this format:

# - George N B-PER
# - Washington N I-PER
# - went V O
# - to P O
# - Washington N B-LOC
# - ...

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


FileNotFoundError: ignored

In [0]:
trainstart = df.index[0]
print("Trainstart: "+str(trainstart))                                           # should return 0 
tmpdf = df[int(len(df)*0.8):]                                                   
trainstop = tmpdf.loc[0:tmpdf['text'].gt('').idxmin(),:].index[-1]              # from 0.8*df to next blank => end of sentence!
print("Trainstop: "+str(trainstop))                                             # should return 838861 (it's where the last train sentence ends)
train = df.iloc[0:trainstop]
#print(train.tail())


teststart = trainstop+1
#tmpdf = df.iloc[teststart:int(len(df)),]
tmpdf = df[int(len(df)*0.9):]
print("Teststart: "+str(int(teststart)))
print(tmpdf['text'].gt('').idxmin())
teststop = tmpdf.loc[int(len(df)*0.9):tmpdf['text'].gt('').idxmin(),:].index[-1]
print("Teststop: "+str(teststop))
test = df.iloc[teststart:teststop]
print(test.tail())

devstart = teststop+1
print("Devstart: "+str(devstart))
dev = df.iloc[devstart:]
print("Devstop: "+str(dev.index[-1]))
print(dev.tail())


train.to_csv(data_folder+'/train.csv', sep='\t', index = False, header = False)
test.to_csv(data_folder+'/test.csv', sep='\t', index = False, header = False)
dev.to_csv(data_folder+'/dev.csv', sep='\t', index = False, header = False);

Trainstart: 0
Trainstop: 838861
Teststart: 838862
943742
Teststop: 943742
          text  pos  ner
943737  linked  VBN  VBN
943738      to   TO   TO
943739     the   DT   DT
943740  Juarez  NNP  NNP
943741  cartel   NN   NN
Devstart: 943743
Devstop: 1048574
              text  pos  ner
1048570       they  PRP  PRP
1048571  responded  VBD  VBD
1048572         to   TO   TO
1048573        the   DT   DT
1048574     attack   NN   NN


In [0]:
from flair.datasets import ColumnCorpus

# define columns

columns = {0: 'text', 1: 'pos', 2: 'ner'}
tag_type = 'ner'

# 1. Load Corpus

# load corpus by pointing to folder. Train, dev and test gets identified automatically. 
corpus: Corpus = ColumnCorpus(data_folder, columns,
                              train_file='train.csv',
                              test_file='test.csv',
                              dev_file='dev.csv')                               # out of memory if no downsampling done! 

2019-07-22 17:37:07,954 Reading data from gdrive/My Drive/nerReformed
2019-07-22 17:37:07,956 Train: gdrive/My Drive/nerReformed/train.csv
2019-07-22 17:37:07,963 Dev: gdrive/My Drive/nerReformed/dev.csv
2019-07-22 17:37:07,964 Test: gdrive/My Drive/nerReformed/test.csv


In [0]:
print("Train has a size of "+str(len(corpus.train))+" sentences")             # Flair thinks each split is only a sentence!
#print(corpus.train[0][1])                                                     # list of sentences, so [][] is token/word

Train has a size of 38188 sentences


In [0]:
tag_dictionary = corpus.make_tag_dictionary('ner')

print(corpus)                                                                                
print(tag_dictionary.get_items())                                                                          # Prints the different labels
print(corpus.get_label_distribution())                                                       # label distribution in the dataset

Corpus: 38188 train + 4815 dev + 4757 test sentences
['<unk>', 'O', 'NNS', 'IN', 'VBP', 'VBN', 'NNP', 'TO', 'VB', 'DT', 'NN', 'CC', 'JJ', 'VBD', 'WP', '``', 'CD', 'PRP', 'VBZ', 'POS', 'VBG', 'RB', ',', 'WRB', 'PRP$', 'MD', 'WDT', 'JJR', ':', 'JJS', 'WP$', 'RP', 'PDT', 'NNPS', 'EX', 'RBS', 'LRB', 'RRB', '$', 'RBR', ';', '', 'UH', 'FW', '<START>', '<STOP>']
defaultdict(<function Corpus.get_label_distribution.<locals>.<lambda> at 0x7fe41898f9d8>, {})


In [0]:
# ALSO: in_memory=False  :    using PyTorch dataloaders, does not keep whole dataset in memory - can support much larger datasets

# 2. Create Tag dictionary

tag_dictionary = corpus.make_tag_dictionary('ner')

# 3. Embeddings
embedding_types: List[TokenEmbeddings] = [
    WordEmbeddings('glove'),                                                    # Recommended by Flair paper: glove + bi-directional Flair
    FlairEmbeddings('news-forward'),
    FlairEmbeddings('news-backward')
    # OpenAIGPTEmbeddings(),
    # BertEmbeddings()                                                          # BERT embeddings = VERY slow training (30 mins for first iteration!)
    # CharacterEmbeddings(),
    # FlairEmbeddings('multi-forward',
    # FlairEmbeddings('multi-backward')
    # BertEmbeddings('bert-base-multilingual-uncased'
    # BytePairEmbeddings(language='en')                                         # new in Flair, supposedly very good for small models
]

embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)

# 4. Initialize sequence tagger
tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type,
                                        use_crf=True)
#tagger = SequenceTagger.load('/path/to/model.pt')                              # to load model from .pt file
  
# 5. initialize trainer

trainer: ModelTrainer = ModelTrainer(tagger, corpus)
  
# 6. start training
trainer.train('resources/taggers/example-ner'
              ,learning_rate=0.1
              ,mini_batch_size=64
              ,max_epochs=5
             #,checkpoint=True                                                  # to stop and resume training in case of large models!
             )

# 7. plot training curves (optional)                                            # need to locate files on disk to work!
# plotter = Plotter()
# plotter.plot_training_curves('loss.tsv')
# plotter.plot_weights('weights.txt'

  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL


2019-07-22 18:04:39,425 ----------------------------------------------------------------------------------------------------
2019-07-22 18:04:39,426 Evaluation method: MICRO_F1_SCORE
2019-07-22 18:04:39,971 ----------------------------------------------------------------------------------------------------
2019-07-22 18:04:41,664 epoch 1 - iter 0/597 - loss 90.36824036
2019-07-22 18:05:37,665 epoch 1 - iter 59/597 - loss 45.03847485
2019-07-22 18:06:26,436 epoch 1 - iter 118/597 - loss 31.50181136
2019-07-22 18:07:19,312 epoch 1 - iter 177/597 - loss 24.92577699
2019-07-22 18:08:07,531 epoch 1 - iter 236/597 - loss 21.07376778
2019-07-22 18:09:01,452 epoch 1 - iter 295/597 - loss 18.43487070
2019-07-22 18:09:50,972 epoch 1 - iter 354/597 - loss 16.46917385
2019-07-22 18:10:45,472 epoch 1 - iter 413/597 - loss 14.93163178
2019-07-22 18:11:34,911 epoch 1 - iter 472/597 - loss 13.70888507
2019-07-22 18:12:28,972 epoch 1 - iter 531/597 - loss 12.72079313
2019-07-22 18:13:18,820 epoch 1 - i

# NER on WIKINER_ENGLISH

In [0]:
# 1. get the corpus
corpus: Corpus = WIKINER_ENGLISH().downsample(0.1)
print(corpus)

# 2. what tag do we want to predict?
tag_type = 'ner'

# 3. make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
print(tag_dictionary.idx2item)

# 4. initialize embeddings
embedding_types: List[TokenEmbeddings] = [

    WordEmbeddings('glove'),
    # comment in this line to use character embeddings
    # CharacterEmbeddings(),

    # comment in these lines to use flair embeddings
    FlairEmbeddings('news-forward'),                                            # Stacked Glove + Flair embeddings supposedly gives the best performance
    FlairEmbeddings('news-backward'),
]

embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)

# 5. initialize sequence tagger
from flair.models import SequenceTagger

tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type,
                                        use_crf=True)

# 6. initialize trainer
from flair.trainers import ModelTrainer

trainer: ModelTrainer = ModelTrainer(tagger, corpus)

# 7. start training
trainer.train('resources/taggers/example-ner',
              learning_rate=0.1,
              mini_batch_size=64,
              max_epochs=5)

# 8. plot training curves (optional)
from flair.visual.training_curves import Plotter
plotter = Plotter()
plotter.plot_training_curves('resources/taggers/example-ner/loss.tsv')
plotter.plot_weights('resources/taggers/example-ner/weights.txt')

2019-07-12 09:51:24,770 Reading data from /root/.flair/datasets/wikiner_english
2019-07-12 09:51:24,772 Train: /root/.flair/datasets/wikiner_english/aij-wikiner-en-wp3.train
2019-07-12 09:51:24,773 Dev: None
2019-07-12 09:51:24,774 Test: None
Corpus: 11514 train + 1279 dev + 1422 test sentences
[b'<unk>', b'O', b'B-ORG', b'E-ORG', b'S-ORG', b'S-LOC', b'S-PER', b'B-LOC', b'E-LOC', b'B-MISC', b'E-MISC', b'S-MISC', b'B-PER', b'E-PER', b'I-MISC', b'I-PER', b'I-ORG', b'I-LOC', b'<START>', b'<STOP>']


  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL


2019-07-12 09:52:07,751 https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings-v0.4.1/big-news-forward--h2048-l1-d0.05-lr30-0.25-20/news-forward-0.4.1.pt not found in cache, downloading to /tmp/tmpbnvua6c_


100%|██████████| 73034624/73034624 [00:04<00:00, 15843433.68B/s]

2019-07-12 09:52:12,820 copying /tmp/tmpbnvua6c_ to cache at /root/.flair/embeddings/news-forward-0.4.1.pt





2019-07-12 09:52:12,911 removing temp file /tmp/tmpbnvua6c_
2019-07-12 09:52:13,927 https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings-v0.4.1/big-news-backward--h2048-l1-d0.05-lr30-0.25-20/news-backward-0.4.1.pt not found in cache, downloading to /tmp/tmp3cy9r8nw


100%|██████████| 73034575/73034575 [00:04<00:00, 16848970.40B/s]

2019-07-12 09:52:18,823 copying /tmp/tmp3cy9r8nw to cache at /root/.flair/embeddings/news-backward-0.4.1.pt





2019-07-12 09:52:18,904 removing temp file /tmp/tmp3cy9r8nw
2019-07-12 09:52:19,484 ----------------------------------------------------------------------------------------------------
2019-07-12 09:52:19,486 Evaluation method: MICRO_F1_SCORE
2019-07-12 09:52:20,134 ----------------------------------------------------------------------------------------------------
2019-07-12 09:52:22,857 epoch 1 - iter 0/180 - loss 83.74026489
2019-07-12 09:52:41,735 epoch 1 - iter 18/180 - loss 23.84901428
2019-07-12 09:53:01,097 epoch 1 - iter 36/180 - loss 18.43657202
2019-07-12 09:53:22,616 epoch 1 - iter 54/180 - loss 15.74591389
2019-07-12 09:53:41,659 epoch 1 - iter 72/180 - loss 13.75866139
2019-07-12 09:54:02,691 epoch 1 - iter 90/180 - loss 12.37682143
2019-07-12 09:54:22,523 epoch 1 - iter 108/180 - loss 11.36019380
2019-07-12 09:54:42,604 epoch 1 - iter 126/180 - loss 10.50427462
2019-07-12 09:55:01,799 epoch 1 - iter 144/180 - loss 9.82579675
2019-07-12 09:55:21,525 epoch 1 - iter 162/180