<a href="https://colab.research.google.com/github/ValdazoAmerico/transformers-pipeline/blob/main/transformers_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [25]:
!pip install transformers



In [26]:
from transformers import pipeline

In [27]:
all_tasks = ['feature-extraction',
             'text-classification',
             'sentiment-analysis',
             'token-classification',
             'ner',
             'question-answering',
             'fill-mask',
             'summarization',
             'translation_xx_to_yy',
             'text2text-generation',
             'text-generation',
             'zero-shot-classification',
             'conversational']

# Sentiment analysis

In [28]:
m = pipeline('sentiment-analysis')

In [29]:
m('i am happy')

[{'label': 'POSITIVE', 'score': 0.9998801946640015}]

In [30]:
import pandas as pd
from nltk.corpus import twitter_samples

In [31]:
import nltk

In [32]:
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

In [33]:
# Use the same label strings the sentiment pipeline returns (POSITIVE / NEGATIVE)
# so that gold labels and predictions can be compared directly.
documents = ([(t, "POSITIVE") for t in twitter_samples.strings("positive_tweets.json")] +
             [(t, "NEGATIVE") for t in twitter_samples.strings("negative_tweets.json")])

In [34]:
df = pd.DataFrame(documents, columns=['tweet','label'])

In [35]:
df.shape

(10000, 2)

In [36]:
df['label'].value_counts()

POSITIVE    5000
NEGATIVE    5000
Name: label, dtype: int64

In [None]:
pred = m(df.tweet.tolist())

In [37]:
import numpy as np

In [38]:
tweets = df.tweet.tolist()

In [47]:
m(tweets[2])

[{'label': 'POSITIVE', 'score': 0.999620258808136}]

In [42]:
tweets_chunks = np.array_split(tweets, 10)
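
The cell above splits the tweets into 10 chunks with NumPy. As an alternative, here is a minimal sketch of fixed-size batching in plain Python that keeps each batch as an ordinary list of strings. The names `iter_batches` and `batch_size` are chosen here for illustration and are not part of the original notebook.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Stand-in data; in the notebook this would be the `tweets` list.
sample = [f"tweet {i}" for i in range(7)]
batches = list(iter_batches(sample, 3))
print([len(b) for b in batches])  # [3, 3, 1]
```

With the notebook's objects, each batch could then be passed to `m(batch)` in a loop; that part is left out here because it needs the model download.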

In [70]:
# Classify the first 100 tweets, one call per tweet
pred = [m(tweet) for tweet in tweets[:100]]

In [58]:
# Take an explicit copy so that adding a column does not raise SettingWithCopyWarning
df_new = df[0:100].copy()

In [71]:
# Each prediction is a one-element list; keep only its label
pred_label = [p[0]['label'] for p in pred]

In [73]:
df_new['pred'] = pred_label


In [75]:
df_new

| | tweet | label | pred |
|---|---|---|---|
| 0 | #FollowFriday @France_Inte @PKuchly57 @Milipol... | POSITIVE | POSITIVE |
| 1 | @Lamb2ja Hey James! How odd :/ Please call our... | POSITIVE | POSITIVE |
| 2 | @DespiteOfficial we had a listen last night :)... | POSITIVE | POSITIVE |
| 3 | @97sides CONGRATS :) | POSITIVE | POSITIVE |
| 4 | yeaaaah yippppy!!! my accnt verified rqst has... | POSITIVE | NEGATIVE |
| ... | ... | ... | ... |
| 95 | Those friends know themselves :) | POSITIVE | POSITIVE |
| 96 | waiting for nudes :-) | POSITIVE | NEGATIVE |
| 97 | @JacobWhitesides go sleep u ! :))))))))) | POSITIVE | NEGATIVE |
| 98 | Stats for the day have arrived. 1 new follower... | POSITIVE | NEGATIVE |


In [77]:
from sklearn.metrics import accuracy_score
accuracy_score(df_new['label'], df_new['pred'])

0.56
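
Note that the first 100 rows are all positive tweets, so the 0.56 above is the share of positive tweets the model labeled POSITIVE, not accuracy over both classes. The sketch below shows how labels can be pulled from pipeline-style outputs and compared against gold labels without scikit-learn. The `sample_outputs` and `gold` values are made-up stand-ins, and the helper names are hypothetical.

```python
from collections import Counter


def labels_from_outputs(outputs):
    """Take the label from each single-input result (a one-element list of dicts)."""
    return [result[0]["label"] for result in outputs]


def accuracy(gold, predicted):
    """Fraction of positions where the two label lists agree."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on empty input")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Hypothetical outputs, shaped like what `m(text)` returns in this notebook.
sample_outputs = [
    [{"label": "POSITIVE", "score": 0.99}],
    [{"label": "NEGATIVE", "score": 0.87}],
    [{"label": "POSITIVE", "score": 0.95}],
    [{"label": "NEGATIVE", "score": 0.61}],
]
gold = ["POSITIVE", "POSITIVE", "POSITIVE", "NEGATIVE"]

predicted = labels_from_outputs(sample_outputs)
print(accuracy(gold, predicted))  # 0.75

# Counts of (gold, predicted) pairs, a small confusion table
pair_counts = Counter(zip(gold, predicted))
print(pair_counts)
```

Sampling rows from both halves of `df` before scoring would give a figure that reflects both classes.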

# Named Entity Recognition

In [78]:
m = pipeline("ner")
m("Hugging Face is a French company based in New-York")

[{'end': 2,
  'entity': 'I-ORG',
  'index': 1,
  'score': 0.9968093,
  'start': 0,
  'word': 'Hu'},
 {'end': 7,
  'entity': 'I-ORG',
  'index': 2,
  'score': 0.93339366,
  'start': 2,
  'word': '##gging'},
 {'end': 12,
  'entity': 'I-ORG',
  'index': 3,
  'score': 0.978439,
  'start': 8,
  'word': 'Face'},
 {'end': 24,
  'entity': 'I-MISC',
  'index': 6,
  'score': 0.99816346,
  'start': 18,
  'word': 'French'},
 {'end': 45,
  'entity': 'I-LOC',
  'index': 10,
  'score': 0.9978009,
  'start': 42,
  'word': 'New'},
 {'end': 46,
  'entity': 'I-LOC',
  'index': 11,
  'score': 0.94690883,
  'start': 45,
  'word': '-'},
 {'end': 50,
  'entity': 'I-LOC',
  'index': 12,
  'score': 0.99755406,
  'start': 46,
  'word': 'York'}]
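
The output above is per sub-word token, so "Hugging Face" arrives as `Hu`, `##gging`, `Face`. Below is a minimal sketch of one way to merge such tokens back into whole entities, using only the fields shown above. It assumes that tokens with the same tag and consecutive `index` values belong together, which is a simplification, and `group_entities` is a hypothetical helper rather than a library function. Recent versions of `transformers` offer an `aggregation_strategy` argument on the NER pipeline for this purpose; check the documentation for your installed version.

```python
def group_entities(text, tokens):
    """Merge consecutive tokens that share an entity tag into one span."""
    groups = []
    for tok in tokens:
        last = groups[-1] if groups else None
        continues_last = (
            last is not None
            and tok["entity"] == last["entity"]
            and tok["index"] == last["last_index"] + 1
        )
        if continues_last:
            # Extend the current span to cover this token
            last["end"] = tok["end"]
            last["last_index"] = tok["index"]
            last["scores"].append(tok["score"])
        else:
            groups.append({
                "entity": tok["entity"],
                "start": tok["start"],
                "end": tok["end"],
                "last_index": tok["index"],
                "scores": [tok["score"]],
            })
    # Recover the surface text from character offsets; average the token scores
    return [
        {
            "entity": g["entity"],
            "word": text[g["start"]:g["end"]],
            "score": sum(g["scores"]) / len(g["scores"]),
        }
        for g in groups
    ]


sentence = "Hugging Face is a French company based in New-York"
# Token-level output copied from the cell above
ner_tokens = [
    {"entity": "I-ORG", "index": 1, "score": 0.9968093, "start": 0, "end": 2, "word": "Hu"},
    {"entity": "I-ORG", "index": 2, "score": 0.93339366, "start": 2, "end": 7, "word": "##gging"},
    {"entity": "I-ORG", "index": 3, "score": 0.978439, "start": 8, "end": 12, "word": "Face"},
    {"entity": "I-MISC", "index": 6, "score": 0.99816346, "start": 18, "end": 24, "word": "French"},
    {"entity": "I-LOC", "index": 10, "score": 0.9978009, "start": 42, "end": 45, "word": "New"},
    {"entity": "I-LOC", "index": 11, "score": 0.94690883, "start": 45, "end": 46, "word": "-"},
    {"entity": "I-LOC", "index": 12, "score": 0.99755406, "start": 46, "end": 50, "word": "York"},
]

entities = group_entities(sentence, ner_tokens)
for e in entities:
    print(e["entity"], e["word"])
# I-ORG Hugging Face
# I-MISC French
# I-LOC New-York
```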

# Fill mask

In [83]:
nlp_fill = pipeline('fill-mask')
nlp_fill('Hugging Face is a French company based in <mask>')

[{'score': 0.27758949995040894,
  'sequence': 'Hugging Face is a French company based in Paris',
  'token': 2201,
  'token_str': ' Paris'},
 {'score': 0.14941278100013733,
  'sequence': 'Hugging Face is a French company based in Lyon',
  'token': 12790,
  'token_str': ' Lyon'},
 {'score': 0.045764125883579254,
  'sequence': 'Hugging Face is a French company based in Geneva',
  'token': 11559,
  'token_str': ' Geneva'},
 {'score': 0.04576260223984718,
  'sequence': 'Hugging Face is a French company based in France',
  'token': 1470,
  'token_str': ' France'},
 {'score': 0.040675751864910126,
  'sequence': 'Hugging Face is a French company based in Brussels',
  'token': 6497,
  'token_str': ' Brussels'}]

In [99]:
pd.DataFrame(nlp_fill("Argentina is a country located in South <mask>", top_k=5))


| | sequence | score | token | token_str |
|---|---|---|---|---|
| 0 | Argentina is a country located in South America | 0.64124 | 730 | America |
| 1 | Argentina is a country located in South Africa | 0.333807 | 1327 | Africa |
| 2 | Argentina is a country located in South Asia | 0.015436 | 1817 | Asia |
| 3 | Argentina is a country located in South Sudan | 0.003906 | 6312 | Sudan |
| 4 | Argentina is a country located in South Korea | 0.001698 | 1101 | Korea |


# Feature extraction

In [133]:
tokens = "Argentina is a third wolrd country located in South America".split()

In [134]:
nlp_features = pipeline('feature-extraction')
output = nlp_features(tokens)
res = np.array(output)

Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertModel: ['vocab_projector.bias', 'vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [136]:
res.shape

(10, 5, 768)
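
The shape reads as (inputs, token positions, hidden size). Because the sentence was split into 10 words before the call, each word was treated as a separate input, and 768 is the hidden size of the DistilBERT checkpoint named in the log above. A common way to turn per-token vectors into one vector per input is mean pooling. The sketch below uses a tiny made-up nested list in place of the real output and pools in plain Python; `mean_pool` is a hypothetical helper, and averaging over every position (padding and special tokens included) is a simplifying assumption.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into a single vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    count = len(token_vectors)
    # zip(*...) walks the vectors dimension by dimension
    return [sum(dim_values) / count for dim_values in zip(*token_vectors)]


# Toy stand-in with shape (2 inputs, 3 token positions, 4 dimensions)
toy_output = [
    [[1.0, 2.0, 3.0, 4.0], [3.0, 2.0, 1.0, 0.0], [2.0, 2.0, 2.0, 2.0]],
    [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]],
]

pooled = [mean_pool(vectors) for vectors in toy_output]
print(pooled)  # [[2.0, 2.0, 2.0, 2.0], [1.0, 1.0, 1.0, 1.0]]
```

On the NumPy array from the notebook, the equivalent is `res.mean(axis=1)`, which would give an array of shape (10, 768).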

# Text generation

In [100]:
m = pipeline("text-generation")


In [127]:
m("As far I am concerned, I will", do_sample=True, max_length=100, temperature=1.0)[0]['generated_text']

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'As far I am concerned, I will be making no such statements. I think what is clear for you is that, even though we are concerned about some of the issues, in that sense, we should not be concerned about those on the political right – because that was why I did not support him in some fashion whatsoever."\n\nIn the end Michael Dugher was elected to take a seat held by him and has claimed his supporters "hate gays and lesbians". But it is an election he won'