In [5]:
!pip install transformers



In [6]:
!pip install huggingface_hub



Let's have a quick look at the 🤗 Transformers library features. The library downloads pretrained models for Natural
Language Understanding (NLU) tasks, such as analyzing the sentiment of a text, and Natural Language Generation (NLG),
such as completing a prompt with new text or translating in another language.

First we will see how to easily leverage the pipeline API to quickly use those pretrained models at inference. Then, we
will dig a little bit more and see how the library gives you access to those models and helps you preprocess your data.

The easiest way to use a pretrained model on a given task is to use `pipeline`. 🤗 Transformers
provides the following tasks out of the box:

- Sentiment analysis: is a text positive or negative?
- Text generation (in English): provide a prompt and the model will generate what follows.
- Name entity recognition (NER): in an input sentence, label each word with the entity it represents (person, place,
  etc.)
- Question answering: provide the model with some context and a question, extract the answer from the context.
- Filling masked text: given a text with masked words (e.g., replaced by `[MASK]`), fill the blanks.
- Summarization: generate a summary of a long text.
- Translation: translate a text in another language.
- Feature extraction: return a tensor representation of the text.

Let's see how this work for sentiment analysis (the other tasks are all covered in the [task summary](https://huggingface.co/transformers/task_summary.html)):

In [7]:
from transformers import pipeline

In [8]:
classifier = pipeline('sentiment-analysis')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


In [9]:
results =  classifier("wverything is fine")
results

[{'label': 'POSITIVE', 'score': 0.999466598033905}]

In [10]:
results =  classifier(["everything is fine",
                       "everything is bad",
                       "everything is good",
                       "everything is terrible",
                       "everything is great",])
results

[{'label': 'POSITIVE', 'score': 0.999861478805542},
 {'label': 'NEGATIVE', 'score': 0.9998108744621277},
 {'label': 'POSITIVE', 'score': 0.9998520612716675},
 {'label': 'NEGATIVE', 'score': 0.999581515789032},
 {'label': 'POSITIVE', 'score': 0.9998767375946045}]

## Multiliguistic BERT

In [11]:
classifier = pipeline('sentiment-analysis', model ="nlptown/bert-base-multilingual-uncased-sentiment")

In [12]:
classifier(["tout va bien",
                       "tout est mauvais",
                       "Tout est bon",
                       "tout est terrible",
                       "tout est bon",])

[{'label': '5 stars', 'score': 0.42690595984458923},
 {'label': '1 star', 'score': 0.8061236143112183},
 {'label': '5 stars', 'score': 0.5202799439430237},
 {'label': '1 star', 'score': 0.5367847681045532},
 {'label': '5 stars', 'score': 0.5202799439430237}]

This classifier can now deal with texts in English, French, but also Dutch, German, Italian and Spanish! You can also
replace that name by a local folder where you have saved a pretrained model (see below). You can also pass a model
object and its associated tokenizer.

We will need two classes for this. The first is `AutoTokenizer`, which we will use to download the
tokenizer associated to the model we picked and instantiate it. The second is
`AutoModelForSequenceClassification` (or
`TFAutoModelForSequenceClassification` if you are using TensorFlow), which we will use to download
the model itself. Note that if we were using the library on an other task, the class of the model would change. The
[task summary](https://huggingface.co/transformers/task_summary.html) tutorial summarizes which class is used for which task.

In [13]:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

In [14]:
model_name ="nlptown/bert-base-multilingual-uncased-sentiment"
model = TFAutoModelForSequenceClassification.from_pretrained(model_name, from_pt=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

All the weights of TFBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


In [15]:
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

In [16]:
classifier("I am a good boy")

[{'label': '4 stars', 'score': 0.4229269027709961}]

In [17]:
tf_batch = tokenizer(
    ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="tf"
)

In [18]:
for key, value in tf_batch.items():
    print(f"{key}: {value.numpy().tolist()}")

input_ids: [[101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103, 100, 58263, 13299, 119, 102], [101, 11312, 18763, 10855, 11530, 112, 162, 39487, 10197, 119, 102, 0, 0, 0]]
token_type_ids: [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
attention_mask: [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]]


## Redoing Fake News Detection with Transformers

In [19]:
import pandas as pd

In [20]:
df = pd.read_csv('../data/fake-news/train.csv')

In [21]:
df.head()

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


In [22]:
df.isnull().sum()

id           0
title      558
author    1957
text        39
label        0
dtype: int64

In [23]:
df = df.dropna(ignore_index=True)

In [24]:
df.isnull().sum()

id        0
title     0
author    0
text      0
label     0
dtype: int64

In [25]:
## indepentent features
X=df.drop('label',axis=1)

In [26]:
## dependent feature
y = df['label']

In [27]:
messages = X['title'].copy()

In [28]:
messages

0        House Dem Aide: We Didn’t Even See Comey’s Let...
1        FLYNN: Hillary Clinton, Big Woman on Campus - ...
2                        Why the Truth Might Get You Fired
3        15 Civilians Killed In Single US Airstrike Hav...
4        Iranian woman jailed for fictional unpublished...
                               ...                        
18280    Rapper T.I.: Trump a ’Poster Child For White S...
18281    N.F.L. Playoffs: Schedule, Matchups and Odds -...
18282    Macy’s Is Said to Receive Takeover Approach by...
18283    NATO, Russia To Hold Parallel Exercises In Bal...
18284                            What Keeps the F-35 Alive
Name: title, Length: 18285, dtype: object

In [29]:
messages = list(messages)

In [30]:
messages

['House Dem Aide: We Didn’t Even See Comey’s Letter Until Jason Chaffetz Tweeted It',
 'FLYNN: Hillary Clinton, Big Woman on Campus - Breitbart',
 'Why the Truth Might Get You Fired',
 '15 Civilians Killed In Single US Airstrike Have Been Identified',
 'Iranian woman jailed for fictional unpublished story about woman stoned to death for adultery',
 'Jackie Mason: Hollywood Would Love Trump if He Bombed North Korea over Lack of Trans Bathrooms (Exclusive Video) - Breitbart',
 'Benoît Hamon Wins French Socialist Party’s Presidential Nomination - The New York Times',
 'A Back-Channel Plan for Ukraine and Russia, Courtesy of Trump Associates - The New York Times',
 'Obama’s Organizing for Action Partners with Soros-Linked ‘Indivisible’ to Disrupt Trump’s Agenda',
 'BBC Comedy Sketch "Real Housewives of ISIS" Causes Outrage',
 'Russian Researchers Discover Secret Nazi Military Base ‘Treasure Hunter’ in the Arctic [Photos]',
 'US Officials See No Link Between Trump and Russia',
 'Re: Yes, Th

In [31]:
y = list(y)
y


[1,
 0,
 1,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 1,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 1,
 1,
 1,
 1,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 1,
 1,
 0,
 1,
 1,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 1,
 1,
 1,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 1,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 1,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 1,
 1,
 1,
 0,
 1,
 0,
 1,
 1,
 0,
 0,


In [32]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(messages, y, test_size=.2, random_state=0)

In [33]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [34]:
tokenized_data = tokenizer(X_train, return_tensors="tf", padding=True)
# Tokenizer returns a BatchEncoding, but we convert that to a dict for Keras
tokenized_data = dict(tokenized_data)

In [35]:
import numpy as np
labels = np.array(y_train)

In [36]:
labels

array([1, 1, 1, ..., 1, 0, 0])

In [37]:
from transformers import TFAutoModelForSequenceClassification
from tensorflow.keras.optimizers import Adam

model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
model.compile(optimizer=Adam(lr=5e-5), loss="binary_crossentropy", metrics=["accuracy"])


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_projector.bias']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

In [38]:
model.fit(tokenized_data, labels, epochs=2, batch_size=32)

Epoch 1/2
 20/458 [>.............................] - ETA: 43:51 - loss: 7.3542 - accuracy: 0.4453

KeyboardInterrupt: 