# Sentiment classification using transformer models

This notebook requires the transformers, PyTorch and simpletransformers packages.

In [None]:
!conda install pytorch cpuonly -c pytorch
!pip install transformers
!pip install simpletransformers

Hugging Face transformers provides a **pipeline** abstraction that performs an NLP task with a pretrained model.

In [40]:
from transformers import pipeline

We load the transformer model 'distilbert-base-uncased-finetuned-sst-2-english', which is fine-tuned for binary classification, from the Hugging Face model repository:

https://huggingface.co/models

We need to load both the sequence classification model and its tokenizer, which converts sentences into tokens according to the model's vocabulary.

Loading the model takes some time.

In [31]:
sentimentenglish = pipeline("sentiment-analysis", 
                            model="distilbert-base-uncased-finetuned-sst-2-english", 
                            tokenizer="distilbert-base-uncased-finetuned-sst-2-english")

In [42]:
sentence_pos_en = "Nice hotel and the service is great"

In [33]:
sentimentenglish(sentence_pos_en)

[{'label': 'POSITIVE', 'score': 0.9998814463615417}]

In [43]:
sentence_neg_en = "The rooms are dirty and the wifi does not work"

In [44]:
sentimentenglish(sentence_neg_en)

[{'label': 'NEGATIVE', 'score': 0.9997869729995728}]

For Dutch sentiment analysis we can use a fine-tuned Dutch model. Again, loading this model takes some time.

In [36]:
sentimentdutch = pipeline("sentiment-analysis", 
                          model="wietsedv/bert-base-dutch-cased-finetuned-sentiment", 
                          tokenizer="wietsedv/bert-base-dutch-cased-finetuned-sentiment")

In [45]:
sentence_pos_nl = "Mooi hotel en de service is geweldig"
sentence_neg_nl = "De kamers zijn smerig en de wifi doet het niet"

In [46]:
sentimentdutch(sentence_pos_nl)

[{'label': 'pos', 'score': 0.9999955892562866}]

In [47]:
sentimentdutch(sentence_neg_nl)

[{'label': 'neg', 'score': 0.6675440073013306}]
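
Both pipelines return a list with one dictionary per input, but the label spellings differ ('POSITIVE'/'NEGATIVE' versus 'pos'/'neg'), and the Dutch negative example has a much lower score (about 0.67) than the others. The sketch below shows one way such results could be normalized and filtered. The helper names signed_score and is_confident and the 0.9 threshold are assumptions made for illustration; they are not part of transformers.

```python
from typing import Dict, Union

# Label spellings seen in the outputs above; other models may use different labels.
POSITIVE_LABELS = {"POSITIVE", "pos"}
NEGATIVE_LABELS = {"NEGATIVE", "neg"}

Prediction = Dict[str, Union[str, float]]


def signed_score(prediction: Prediction) -> float:
    """Map one prediction to [-1, 1]: positive labels keep the score, negative ones flip it."""
    label = prediction["label"]
    score = float(prediction["score"])
    if label in POSITIVE_LABELS:
        return score
    if label in NEGATIVE_LABELS:
        return -score
    raise ValueError(f"Unknown label: {label!r}")


def is_confident(prediction: Prediction, threshold: float = 0.9) -> bool:
    """Return True when the score reaches the (arbitrarily chosen) threshold."""
    return float(prediction["score"]) >= threshold


# Results copied from the cells above.
english_neg = [{'label': 'NEGATIVE', 'score': 0.9997869729995728}]
dutch_neg = [{'label': 'neg', 'score': 0.6675440073013306}]

for result in (english_neg, dutch_neg):
    prediction = result[0]  # the pipeline returns one dictionary per input sentence
    print(signed_score(prediction), is_confident(prediction))
```

With this threshold the English prediction counts as confident and the Dutch one does not, which is a reminder to look at the score and not only at the label.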

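We can use the same models to inspect how a transformer model represents the sentences. We use the simpletransformers package to encode sentences: https://simpletransformers.ai

We load the English model again. Loading prints a long warning about unused checkpoint weights: the checkpoint is a DistilBERT model, while model_type="bert" requests a BERT architecture, so the parameter names do not match. The code still runs, but keep in mind that the resulting vectors may not reflect the fine-tuned weights.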

In [41]:
from simpletransformers.language_representation import RepresentationModel
        
# sentences = ["Example sentence 1", "Example sentence 2"]
model = RepresentationModel(
    model_type="bert",
    model_name="distilbert-base-uncased-finetuned-sst-2-english",
    use_cuda=False,  # run on the CPU
)

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing BertForTextRepresentation: ['distilbert.embeddings.word_embeddings.weight', 'distilbert.embeddings.position_embeddings.weight', 'distilbert.embeddings.LayerNorm.weight', 'distilbert.embeddings.LayerNorm.bias', 'distilbert.transformer.layer.0.attention.q_lin.weight', 'distilbert.transformer.layer.0.attention.q_lin.bias', 'distilbert.transformer.layer.0.attention.k_lin.weight', 'distilbert.transformer.layer.0.attention.k_lin.bias', 'distilbert.transformer.layer.0.attention.v_lin.weight', 'distilbert.transformer.layer.0.attention.v_lin.bias', 'distilbert.transformer.layer.0.attention.out_lin.weight', 'distilbert.transformer.layer.0.attention.out_lin.bias', 'distilbert.transformer.layer.0.sa_layer_norm.weight', 'distilbert.transformer.layer.0.sa_layer_norm.bias', 'distilbert.transformer.layer.0.ffn.lin1.weight', 'distilbert.transformer.layer.0.ffn.lin1.bias', 'distilbert.

In [48]:
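# encode_sentences expects a list of sentences. A bare string is passed here, which
# most likely explains the shape below: one entry per character, each with 3 token
# vectors ([CLS], the character, [SEP]). Pass [sentence_pos_en] to encode the sentence as a whole.
word_vectors = model.encode_sentences(sentence_pos_en, combine_strategy=None)
word_vectors.shape
# assert word_vectors.shape == (2, 5, 768)  # for two 3-token sentences: one vector per token; BERT-based models add 2 tokens per sentence by default ([CLS] and [SEP])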

(35, 3, 768)
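
The leading 35 in this shape equals the number of characters in the sentence, which suggests that the string was iterated character by character instead of being treated as one sentence. The short, library-free sketch below illustrates the difference between iterating over a string and over a list that contains the string. It is plain Python and makes no claim about the internals of simpletransformers.

```python
sentence_pos_en = "Nice hotel and the service is great"

# Iterating over a string yields single characters, spaces included.
items_from_string = list(sentence_pos_en)

# Iterating over a list that wraps the string yields one item: the whole sentence.
items_from_list = list([sentence_pos_en])

print(len(items_from_string))  # 35
print(len(items_from_list))    # 1
```

Under that reading, each of the 35 entries holds 3 token vectors ([CLS], one character, [SEP]) of 768 dimensions each.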

In [49]:
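# The ids below match the Dutch sentence, so we tokenize sentence_pos_nl
# (sentence_pos is not defined in this notebook).
model.tokenizer(sentence_pos_nl)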

{'input_ids': [101, 100, 3309, 4372, 2139, 2326, 2003, 16216, 8545, 6392, 8004, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
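
The output contains 12 ids for a sentence of 7 words. In the BERT uncased vocabulary, 101 and 102 are the special tokens [CLS] and [SEP] and 100 is [UNK]; the remaining extra ids come from splitting a word that is not in the vocabulary into sub-word pieces. The sketch below is a simplified greedy longest-match tokenizer in the spirit of WordPiece. The toy vocabulary and the particular split of "geweldig" are assumptions chosen for illustration, not the real vocabulary or the real tokenizer implementation.

```python
from typing import List, Set

# Hypothetical toy vocabulary; the real vocabulary has roughly 30,000 entries.
TOY_VOCAB: Set[str] = {"hotel", "en", "de", "service", "is", "ge", "##we", "##ld", "##ig"}


def wordpiece(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Split one word into the longest matching vocabulary pieces, left to right."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes [UNK]
        pieces.append(piece)
        start = end
    return pieces


def tokenize(sentence: str, vocab: Set[str]) -> List[str]:
    """Tokenize a whitespace-separated sentence and add [CLS] and [SEP]."""
    tokens = ["[CLS]"]
    for word in sentence.split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens


tokens = tokenize("Mooi hotel en de service is geweldig", TOY_VOCAB)
print(tokens)
print(len(tokens))  # 12
```

In this toy setup "Mooi" maps to [UNK] and "geweldig" is split into four pieces, which gives 12 tokens in total, the same count as in the output above.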

In [50]:
for i, wv in enumerate(word_vectors):
    print(i, wv)

0 [[-0.17017488  0.71369517 -0.01806724 ...  0.5349348  -0.24569088
   0.13944192]
 [ 0.08725815 -0.14023611  0.5224909  ...  0.68421304  0.17557685
   0.9815214 ]
 [-0.5099201  -0.15960257  0.19829248 ...  0.11305754  0.2703912
   0.4930439 ]]
1 [[-0.2635178   0.69835997  0.1807023  ...  0.12218485 -0.05363609
  -0.04349846]
 [-0.2290503  -0.31323934 -0.09026391 ... -0.11378837  0.6890996
   0.9761355 ]
 [-0.6575065  -0.14601189  0.4783722  ... -0.13051917  0.5676145
   0.34388614]]
2 [[-0.04677839  1.058596   -0.0518712  ...  0.4523391  -0.06119518
  -0.13534532]
 [ 0.03040627 -0.02507277  0.5108627  ...  0.42033428  0.59341174
   0.5665082 ]
 [-0.28787726  0.1826487   0.23353215 ...  0.13552424  0.53988576
   0.10681154]]
3 [[-0.30855674  1.112034    0.229579   ...  0.3355292   0.19497839
  -0.04163709]
 [ 0.25500843  0.24027602  1.689946   ...  0.40422124  0.8796059
   0.60989505]
 [-0.6100894   0.24360615  0.4677887  ... -0.00281549  0.77212596
   0.15940882]]
4 [[-0.38781768  1.0

In [51]:
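# combine_strategy="mean" returns one vector per input instead of one per token.
# Note that a bare string is passed; wrap it in a list, [sentence_pos_en],
# to get a single vector for the whole sentence.
word_vectors = model.encode_sentences(sentence_pos_en, combine_strategy="mean")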

In [52]:
word_vectors.shape

(35, 768)
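
Compared with the earlier shape (35, 3, 768), the token axis has disappeared: with combine_strategy="mean" the token vectors of each input are presumably averaged into a single 768-dimensional vector. The sketch below shows that averaging step on a small toy example in plain Python. The function name mean_pool and the toy numbers are assumptions for illustration; this is not the simpletransformers implementation.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]]) -> List[float]:
    """Average a list of equally sized token vectors into one vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    n_tokens = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n_tokens for d in range(dim)]


# Toy input: 3 tokens with 4 dimensions each (the real vectors have 768 dimensions).
toy_tokens = [
    [1.0, 0.0, 2.0, -1.0],
    [3.0, 0.0, 4.0, 1.0],
    [5.0, 3.0, 0.0, 0.0],
]

sentence_vector = mean_pool(toy_tokens)
print(sentence_vector)  # [3.0, 1.0, 2.0, 0.0]
```

Mean pooling gives a fixed-size vector regardless of the number of tokens, at the cost of discarding word order.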

In [53]:
for i, wv in enumerate(word_vectors):
    print(i, wv)

0 [-1.97612286e-01  1.37952149e-01  2.34238729e-01  4.79142070e-01
 -4.50602323e-01 -6.22054040e-01 -1.66488504e+00 -1.21125042e-01
  6.89874947e-01  1.31480500e-01  1.51602225e-02  1.06349230e+00
  7.99287021e-01 -1.37319791e+00 -4.00674671e-01 -5.47820143e-02
 -5.79391658e-01 -8.63535181e-02  1.40607521e-01  4.01484698e-01
  2.90431708e-01  8.92159641e-01  8.11023235e-01  2.42092550e-01
 -1.40116835e+00 -1.44791412e+00  1.37597620e+00  3.82016778e-01
  2.09350049e-01  1.29595912e+00 -3.54021460e-01  7.03827217e-02
  1.97698367e+00  1.82605803e+00 -7.05033720e-01  8.48804653e-01
 -3.76876265e-01  1.33903992e+00  7.07435682e-02  3.68672490e-01
 -8.87290955e-01  5.07061183e-03 -1.62070656e+00  4.08390164e-02
 -2.72857487e-01 -1.15605533e+00  9.71599877e-01  6.34618461e-01
 -1.77257493e-01 -2.40391493e-03  5.68731844e-01 -1.10419428e+00
  1.65958214e+00 -6.16028011e-01 -4.08925146e-01 -3.78011733e-01
  1.17622805e+00  5.37027597e-01  3.95034961e-02  1.10219347e+00
  5.27991951e-01  7.879

# End of this notebook