## This notebook describes how to use the soraberta model.


In [1]:
!pip install -q transformers  # install the Hugging Face transformers library


## Import the pipeline from Hugging Face. Pipelines are an easy way to use models for inference.

In [2]:
from transformers import pipeline  


## Specify the task and the name of the model to use. In our case the task is fill-mask: an arbitrarily chosen word is replaced with the `<mask>` token, and the model must predict the word that belongs in its place.

In [3]:
unmasker = pipeline('fill-mask', model='abdouaziiz/soraberta')


In [4]:
unmasker("juroom naari jullit man nanoo boole jend aw nag walla <mask>.")


[{'score': 0.9785267114639282,
  'sequence': 'juroom naari jullit man nanoo boole jend aw nag walla gileem.',
  'token': 4621,
  'token_str': ' gileem'},
 {'score': 0.009154203347861767,
  'sequence': 'juroom naari jullit man nanoo boole jend aw nag walla jend.',
  'token': 2155,
  'token_str': ' jend'},
 {'score': 0.0028022238984704018,
  'sequence': 'juroom naari jullit man nanoo boole jend aw nag walla aw.',
  'token': 704,
  'token_str': ' aw'},
 {'score': 0.0010274219093844295,
  'sequence': 'juroom naari jullit man nanoo boole jend aw nag walla pel.',
  'token': 1171,
  'token_str': ' pel'},
 {'score': 0.0005228584632277489,
  'sequence': 'juroom naari jullit man nanoo boole jend aw nag walla juum.',
  'token': 5820,
  'token_str': ' juum'}]
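
The pipeline returns a list of candidate dictionaries, ordered by score in the output above. A minimal sketch of picking the best completion and keeping only the confident candidates could look like this. The `predictions` list is copied from that output with rounded scores, and the `threshold` value is an arbitrary choice made for illustration.

```python
# Candidates copied from the output above (scores rounded, 'sequence' omitted).
predictions = [
    {"score": 0.9785, "token": 4621, "token_str": " gileem"},
    {"score": 0.0092, "token": 2155, "token_str": " jend"},
    {"score": 0.0028, "token": 704, "token_str": " aw"},
    {"score": 0.0010, "token": 1171, "token_str": " pel"},
    {"score": 0.0005, "token": 5820, "token_str": " juum"},
]

# The candidate with the highest score is the model's preferred completion.
best = max(predictions, key=lambda p: p["score"])
best_word = best["token_str"].strip()  # drop the leading space of the token

# Keep only candidates above a (hypothetical) confidence threshold.
threshold = 0.005
confident = [p["token_str"].strip() for p in predictions if p["score"] >= threshold]

print(best_word, confident)
```

The leading space in each `token_str` is part of the token itself, so it is stripped before the word is displayed.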

## Tokenization is an essential step in natural language processing: it breaks a string of text into semantically useful units called tokens.

*   Import AutoTokenizer from Hugging Face.
*   It is a generic tokenizer class: calling the AutoTokenizer.from_pretrained() class method instantiates the appropriate tokenizer class of the library.



In [5]:
from transformers import AutoTokenizer

In [6]:
# Instantiate the tokenizer from the model name, in this case abdouaziiz/soraberta.
tokenizer = AutoTokenizer.from_pretrained('abdouaziiz/soraberta')

In [7]:
text = "juroom naari jullit man nanoo boole jend aw nag walla "  # the text to encode

In [8]:
# return_tensors='pt' makes the tokenizer return PyTorch tensors
text_encoded = tokenizer.encode_plus(text, return_tensors='pt')

In [9]:
text_encoded

{'input_ids': tensor([[   0, 5547,  623,  694,  390, 1412,  894, 2155,  704,  392,  502,  225,
            2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [10]:
text_encoded["input_ids"]  # the token ids of the encoded text

tensor([[   0, 5547,  623,  694,  390, 1412,  894, 2155,  704,  392,  502,  225,
            2]])

In [11]:
# A simple way to get the text back: decode each token id separately and join the pieces with spaces
decode_text = " ".join(tokenizer.decode(i.item()) for i in text_encoded["input_ids"].flatten())

In [12]:
decode_text   

'<s> juroom  naari  jullit  man  nanoo  boole  jend  aw  nag  walla   </s>'
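
The doubled spaces in the decoded string come from joining with `" "` pieces that already carry their own leading space. The sketch below reproduces this without the tokenizer. The `pieces` list is an assumption inferred from the decoded output above, not something printed by the notebook.

```python
# Per-token strings inferred from the decoded output above (an assumption):
# every token except the first word carries its own leading space,
# and the trailing space of the input text becomes a token of its own.
pieces = ["<s>", "juroom", " naari", " jullit", " man", " nanoo", " boole",
          " jend", " aw", " nag", " walla", " ", "</s>"]

# Joining with a space adds a second space in front of most words.
spaced = " ".join(pieces)

# Dropping the special tokens and joining without a separator restores the input.
special_tokens = {"<s>", "</s>"}
restored = "".join(p for p in pieces if p not in special_tokens)

print(repr(spaced))
print(repr(restored))
```

With the real tokenizer, `tokenizer.decode(text_encoded["input_ids"][0], skip_special_tokens=True)` should decode the whole sequence in a single call.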

### Use the Model

In [13]:
from transformers import AutoConfig, AutoModel
# Download the configuration from huggingface.co and cache it.
config = AutoConfig.from_pretrained('abdouaziiz/soraberta')

In [14]:
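# Note: from_config builds the architecture with randomly initialized weights;
# AutoModel.from_pretrained('abdouaziiz/soraberta') would load the trained weights instead.
model = AutoModel.from_config(config)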
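
`AutoModel.from_config` only builds the architecture described by the configuration; it does not load trained weights. The sketch below illustrates this offline with a tiny, hypothetical `RobertaConfig`. The sizes are made up for the example, and the assumption that soraberta is a RoBERTa-style model rests only on the `<s>` and `</s>` tokens seen above.

```python
import torch
from transformers import AutoModel, RobertaConfig

# Tiny, hypothetical configuration so the example runs quickly and offline.
config = RobertaConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    max_position_embeddings=64,
)

# Two models built from the same config get different random weights.
model_a = AutoModel.from_config(config).eval()
model_b = AutoModel.from_config(config).eval()
same = torch.equal(
    model_a.get_input_embeddings().weight,
    model_b.get_input_embeddings().weight,
)

# A forward pass still works and returns one hidden vector per input token.
input_ids = torch.tensor([[0, 5, 6, 2]])
attention_mask = torch.ones_like(input_ids)
with torch.no_grad():
    hidden = model_a(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state

print(same, tuple(hidden.shape))
```

To get meaningful hidden states for Wolof text, loading the trained weights with `AutoModel.from_pretrained('abdouaziiz/soraberta')` is the usual choice.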

In [15]:
ids = text_encoded["input_ids"]
mask = text_encoded["attention_mask"]

In [16]:
output = model(input_ids=ids, attention_mask=mask)

In [17]:
output

BaseModelOutputWithPoolingAndCrossAttentions([('last_hidden_state',
                                               tensor([[[-0.8085,  1.7774,  0.7821,  ..., -0.3124, -1.6087,  0.7636],
                                                        [-0.1598,  2.5424,  0.4493,  ...,  1.2420,  0.0758,  0.5907],
                                                        [-1.1785,  3.5558, -0.1567,  ..., -0.5016,  0.3329,  1.3283],
                                                        ...,
                                                        [-1.9481,  2.9195,  0.5745,  ...,  1.4593, -0.8427,  0.9266],
                                                        [-0.1577,  2.2856, -1.3407,  ...,  0.2362, -1.3839,  0.6557],
                                                        [-0.8987,  0.8281, -1.0878,  ...,  0.8771, -0.3006,  1.0715]]],
                                                      grad_fn=<NativeLayerNormBackward>)),
                                              ('pooler_output',
      