## This notebook describes how to use the bert-base-wolof model.


In [1]:
!pip install -q transformers  # install the Hugging Face Transformers library


## Import the pipeline from Hugging Face. Pipelines are a quick and easy way to use models for inference.

In [2]:
from transformers import pipeline  

## Specify the task and the model. Here the task is fill-mask, and the model name is abdouaziiz/bert-base-wolof. In fill-mask, a chosen word is replaced with the [MASK] token and the model must predict the word that belongs in its place.

In [3]:
unmasker = pipeline('fill-mask', model='abdouaziiz/bert-base-wolof')
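
The fill-mask idea described above can be sketched without the model: the network assigns a score (logit) to every vocabulary entry at the masked position, the scores are normalized with a softmax, and the highest-probability words are returned. The vocabulary and logits below are made-up toy values, and `top_k_fill` is a hypothetical helper, not part of the transformers API.

```python
import math

def top_k_fill(logits, vocab, k=3):
    """Rank candidate words for a masked position (illustrative helper only)."""
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Pair each word with its probability and keep the k most likely.
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Toy values, assumed for illustration; a real model scores its whole vocabulary.
toy_vocab = ["seqet", "daw", "yoot", "seqat", "yaqu"]
toy_logits = [2.0, 1.9, 1.5, 1.4, 1.2]

candidates = top_k_fill(toy_logits, toy_vocab, k=3)
for word, prob in candidates:
    print(f"{word}: {prob:.3f}")
```

The real pipeline does the same kind of ranking, only over the full vocabulary of the model.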



## In this example we take the sentence "kuy yoot du seqet" and replace the word seqet with the [MASK] token.


In [4]:
unmasker("kuy yoot du [MASK].")

[{'score': 0.09505137801170349,
  'sequence': 'kuy yoot du seqet.',
  'token': 13578,
  'token_str': 'seqet'},
 {'score': 0.08882257342338562,
  'sequence': 'kuy yoot du daw.',
  'token': 679,
  'token_str': 'daw'},
 {'score': 0.05779007449746132,
  'sequence': 'kuy yoot du yoot.',
  'token': 5117,
  'token_str': 'yoot'},
 {'score': 0.05671064183115959,
  'sequence': 'kuy yoot du seqat.',
  'token': 4992,
  'token_str': 'seqat'},
 {'score': 0.04699993506073952,
  'sequence': 'kuy yoot du yaqu.',
  'token': 1735,
  'token_str': 'yaqu'}]
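
The pipeline returns a plain list of dictionaries, so it can be processed with ordinary Python. The sketch below copies the output shown above into a literal (so it runs without downloading the model) and picks out the best candidate; the variable names are illustrative.

```python
# Output of unmasker("kuy yoot du [MASK].") copied from the cell above.
results = [
    {"score": 0.09505137801170349, "sequence": "kuy yoot du seqet.", "token": 13578, "token_str": "seqet"},
    {"score": 0.08882257342338562, "sequence": "kuy yoot du daw.", "token": 679, "token_str": "daw"},
    {"score": 0.05779007449746132, "sequence": "kuy yoot du yoot.", "token": 5117, "token_str": "yoot"},
    {"score": 0.05671064183115959, "sequence": "kuy yoot du seqat.", "token": 4992, "token_str": "seqat"},
    {"score": 0.04699993506073952, "sequence": "kuy yoot du yaqu.", "token": 1735, "token_str": "yaqu"},
]

# The candidate with the highest probability.
best = max(results, key=lambda r: r["score"])

# Compact (word, probability) view of all candidates.
ranked_words = [(r["token_str"], round(r["score"], 3)) for r in results]

# Probability mass covered by the five candidates shown.
covered = sum(r["score"] for r in results)

print(best["sequence"])
print(ranked_words)
print(f"{covered:.3f}")
```

The five candidates together cover only about a third of the probability mass, so the remaining mass is spread over words that are not shown.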

## Tokenization is an essential step in natural language processing: it breaks a string of text into units called tokens.

*   Import AutoTokenizer from Hugging Face.
*   AutoTokenizer is a generic tokenizer class. Calling the AutoTokenizer.from_pretrained() class method instantiates the appropriate tokenizer class of the library for the given model.



In [6]:
from transformers import AutoTokenizer

In [7]:
# Load the tokenizer by model name, in this case abdouaziiz/bert-base-wolof.
tokenizer = AutoTokenizer.from_pretrained('abdouaziiz/bert-base-wolof')

In [8]:
text = "bala ngay joxe sa tont dangay nasal sa xel bu baax"  # the text to encode

In [9]:
# Calling the tokenizer directly is the current idiom; it returns the same fields as encode_plus.
text_encoded = tokenizer(text, return_tensors='pt')

In [10]:
text_encoded

{'input_ids': tensor([[   2,  740,  532,  644,  192, 2959,  857, 9294,  192,  354,  152,  333,
            3]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
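
In the output above every entry of attention_mask is 1 because a single sentence needs no padding. The mask becomes informative when sequences of different lengths are batched together. The sketch below illustrates that idea in pure Python; `pad_batch` is a hypothetical helper, and `pad_id=0` is an assumption, not a value confirmed for this tokenizer.

```python
def pad_batch(sequences, pad_id=0):
    """Pad id sequences to a common length and build the matching attention mask.

    pad_id=0 is assumed for illustration; check tokenizer.pad_token_id for the real value.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two short id sequences of different lengths (2 and 3 are [CLS] and [SEP] in the output above).
batch = pad_batch([[2, 740, 532, 3], [2, 192, 3]])
print(batch["input_ids"])
print(batch["attention_mask"])
```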

In [11]:
text_encoded["input_ids"]  # the token ids of the encoded text

tensor([[   2,  740,  532,  644,  192, 2959,  857, 9294,  192,  354,  152,  333,
            3]])

In [12]:
decode_text = " ".join(tokenizer.decode(i.item()) for i in text_encoded["input_ids"].flatten())  # decode each id back to its token and join the tokens to recover the original text

In [13]:
decode_text   

'[CLS] bala ngay joxe sa tont dangay nasal sa xel bu baax [SEP]'
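
The encode and decode round trip above can be mimicked with a small lookup table. The ids below are read off by aligning the input_ids shown earlier with the decoded tokens; the whitespace split is a simplification for this one sentence and is not the subword algorithm the real tokenizer uses. `toy_encode` and `toy_decode` are hypothetical names.

```python
# Token-to-id pairs taken from the notebook output for this sentence only.
toy_vocab = {
    "[CLS]": 2, "[SEP]": 3,
    "bala": 740, "ngay": 532, "joxe": 644, "sa": 192, "tont": 2959,
    "dangay": 857, "nasal": 9294, "xel": 354, "bu": 152, "baax": 333,
}
id_to_token = {idx: tok for tok, idx in toy_vocab.items()}

def toy_encode(text):
    """Split on whitespace, look up each word, and add the special tokens."""
    word_ids = [toy_vocab[word] for word in text.split()]
    return [toy_vocab["[CLS]"]] + word_ids + [toy_vocab["[SEP]"]]

def toy_decode(ids):
    """Map each id back to its token and join with spaces."""
    return " ".join(id_to_token[i] for i in ids)

text = "bala ngay joxe sa tont dangay nasal sa xel bu baax"
ids = toy_encode(text)
print(ids)
print(toy_decode(ids))
```

The word "sa" occurs twice in the sentence and maps to the same id (192) both times, which matches the repeated id in the tensor above.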

### Use the Model

In [14]:
from transformers import AutoConfig, AutoModel
# Download configuration from huggingface.co and cache.
config = AutoConfig.from_pretrained('abdouaziiz/bert-base-wolof')

In [15]:
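# Caution: from_config builds the architecture from the config only, with randomly initialized weights.
# To load the trained weights, use AutoModel.from_pretrained('abdouaziiz/bert-base-wolof') instead.
model = AutoModel.from_config(config)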

In [16]:
ids = text_encoded["input_ids"]
mask = text_encoded["attention_mask"]

In [17]:
output = model(ids, attention_mask=mask)

In [18]:
output

BaseModelOutputWithPoolingAndCrossAttentions([('last_hidden_state',
                                               tensor([[[-0.3933, -0.6215, -0.4389,  ...,  0.1719, -0.6878,  0.0537],
                                                        [-0.1656, -0.7140,  0.5332,  ...,  0.5549, -0.2320, -0.7053],
                                                        [ 0.7974,  0.3598,  0.4698,  ..., -0.4585, -0.8902, -2.2183],
                                                        ...,
                                                        [-0.2315, -1.4880,  0.3105,  ..., -0.0174,  0.2971,  0.5815],
                                                        [-0.7146, -0.7959, -0.0891,  ..., -1.1802, -0.9113, -1.9362],
                                                        [-0.4821, -0.0262,  0.1877,  ..., -0.6339,  0.7420, -0.5355]]],
                                                      grad_fn=<NativeLayerNormBackward0>)),
                                              ('pooler_output',
     