# Transformers pipelines out of the box

### References
* [Summary of the tasks](https://huggingface.co/transformers/task_summary.html)

In [1]:
import numpy as np
import pandas as pd

from nltk.corpus import twitter_samples
from sklearn.metrics import accuracy_score
from transformers import pipeline
from transformers import set_seed


all_tasks = [
    "sentiment-analysis",
    "ner",
    "fill-mask",
    "text-generation",
    "feature-extraction",
    "translation_xx_to_yy",
    "summarization",
    "question-answering",
    "text-classification",
    "token-classification",
    "text2text-generation",
    "zero-shot-classification",
    "conversational",
]


# Sentiment analysis

In [3]:
m = pipeline('sentiment-analysis', framework="pt")

In [4]:
m("I want to die")

[{'label': 'NEGATIVE', 'score': 0.9984995722770691}]

In [5]:
m('We are very happy to show you the 🤗 Transformers library.')

[{'label': 'POSITIVE', 'score': 0.9997795224189758}]

### Sentiment analysis on the NLTK tweets dataset

In [7]:
documents = ([(t, "POSITIVE") for t in twitter_samples.strings("positive_tweets.json")] + 
             [(t, "NEGATIVE") for t in twitter_samples.strings("negative_tweets.json")])

In [8]:
df = pd.DataFrame(documents, columns=["tweet", "label"])

In [9]:
df.shape

(10000, 2)

In [11]:
tweets = df.tweet.tolist()

In [25]:
tweets_chunks = np.array_split(tweets, 100)

In [27]:
%%time
pred = []
for i, chunk in enumerate(tweets_chunks):
    print(i)
    pred += m(list(chunk))

0
1
2
...
99
CPU times: user 24min 11s, sys: 1min 43s, total: 25min 55s
Wall time: 4min 25s


In [40]:
df_pred = pd.DataFrame(pred)
df_pred.head()

Unnamed: 0,label,score
0,POSITIVE,0.926064
1,POSITIVE,0.999161
2,POSITIVE,0.99962
3,POSITIVE,0.992427
4,NEGATIVE,0.972595


In [41]:
df_final = df.join(df_pred.rename(columns={'label': 'pred'}))
df_final.label.value_counts(normalize=True)

POSITIVE    0.5
NEGATIVE    0.5
Name: label, dtype: float64

In [42]:
accuracy_score(df_final["label"], df_final["pred"])

0.6622
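
The batching loop and the accuracy computation above can be sketched without downloading a model. In this sketch `toy_classifier` is a hypothetical stand-in that only mimics the pipeline's output format (a list of `{'label', 'score'}` dicts); the emoticon rule inside it is purely illustrative and is not how the real model decides. `chunked` and `accuracy` are also illustrative helper names.

```python
def toy_classifier(texts):
    """Hypothetical stand-in for the pipeline: one dict per input text."""
    results = []
    for text in texts:
        label = "NEGATIVE" if ":(" in text else "POSITIVE"
        results.append({"label": label, "score": 1.0})
    return results


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and label agree."""
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


tweets = ["great day :)", "so sad :(", "love it", "missed the bus :(", "ok"]
labels = ["POSITIVE", "NEGATIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

# Same pattern as the loop above: predict chunk by chunk and concatenate.
pred = []
for chunk in chunked(tweets, 2):
    pred += toy_classifier(chunk)

pred_labels = [p["label"] for p in pred]
print(accuracy(labels, pred_labels))
```

Slicing a plain list keeps every chunk a list of `str`, so no `list(chunk)` conversion is needed, unlike with `np.array_split`.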

# [NER: Named Entity Recognition](https://huggingface.co/transformers/task_summary.html#named-entity-recognition)


In [45]:
m = pipeline('ner', framework="pt")
res = m('Hugging Face is a French company based in New-York.')

In [47]:
pd.DataFrame(res)

Unnamed: 0,word,score,entity,index,start,end
0,Hu,0.997094,I-ORG,1,0,2
1,##gging,0.934575,I-ORG,2,2,7
2,Face,0.978706,I-ORG,3,8,12
3,French,0.9982,I-MISC,6,18,24
4,New,0.998305,I-LOC,10,42,45
5,-,0.891346,I-LOC,11,45,46
6,York,0.997952,I-LOC,12,46,50
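
The output above is token-level: `Hu`, `##gging` and `Face` are three rows for a single entity. A minimal sketch of merging them, assuming that tokens with consecutive `index` values and the same label belong together and using the `start`/`end` offsets to slice the original text. `group_entities` is a hypothetical helper written for this note, not a library function; recent versions of the library also expose an `aggregation_strategy` argument on the NER pipeline for the same purpose.

```python
text = "Hugging Face is a French company based in New-York."

# Token-level predictions copied from the output above (scores omitted).
tokens = [
    {"word": "Hu", "entity": "I-ORG", "index": 1, "start": 0, "end": 2},
    {"word": "##gging", "entity": "I-ORG", "index": 2, "start": 2, "end": 7},
    {"word": "Face", "entity": "I-ORG", "index": 3, "start": 8, "end": 12},
    {"word": "French", "entity": "I-MISC", "index": 6, "start": 18, "end": 24},
    {"word": "New", "entity": "I-LOC", "index": 10, "start": 42, "end": 45},
    {"word": "-", "entity": "I-LOC", "index": 11, "start": 45, "end": 46},
    {"word": "York", "entity": "I-LOC", "index": 12, "start": 46, "end": 50},
]


def group_entities(text, tokens):
    """Merge tokens that are adjacent (consecutive index) and share a label."""
    groups = []
    for tok in tokens:
        label = tok["entity"].split("-", 1)[-1]  # "I-ORG" -> "ORG"
        last = groups[-1] if groups else None
        if last and last["label"] == label and tok["index"] == last["last_index"] + 1:
            # Extend the current group to cover this token.
            last["end"] = tok["end"]
            last["last_index"] = tok["index"]
        else:
            groups.append({"label": label, "start": tok["start"],
                           "end": tok["end"], "last_index": tok["index"]})
    # Slice the original text so sub-word markers such as "##" disappear.
    return [(text[g["start"]:g["end"]], g["label"]) for g in groups]


entities = group_entities(text, tokens)
print(entities)
```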


In [48]:
res = m('The winner of the Oscar of 2020 was Joaquin Phoenix')
pd.DataFrame(res)

Unnamed: 0,word,score,entity,index,start,end
0,Oscar,0.997312,I-MISC,5,18,23
1,of,0.806631,I-MISC,6,24,26
2,2020,0.685957,I-MISC,7,27,31
3,Joaquin,0.997767,I-PER,9,36,43
4,Phoenix,0.998577,I-PER,10,44,51


In [50]:
m = pipeline('ner', framework="pt")
res = m('Argentina, Brasil and Uruguay attended the Mercosur Gathering')
pd.DataFrame(res)

Unnamed: 0,word,score,entity,index,start,end
0,Argentina,0.999763,I-LOC,1,0,9
1,Brasil,0.999834,I-LOC,3,11,17
2,Uruguay,0.999801,I-LOC,5,22,29
3,Me,0.998633,I-MISC,8,43,45
4,##rc,0.983118,I-MISC,9,45,47
5,##os,0.857216,I-MISC,10,47,49
6,##ur,0.978866,I-MISC,11,49,51
7,Gathering,0.997672,I-MISC,12,52,61


## [Fill mask](https://huggingface.co/transformers/task_summary.html#masked-language-modeling)

In [51]:
m = pipeline('fill-mask', framework="pt")
res = m('Hugging Face is a French company based in <mask>')

In [53]:
pd.DataFrame(res)

Unnamed: 0,sequence,score,token,token_str
0,Hugging Face is a French company based in Paris,0.277589,2201,Paris
1,Hugging Face is a French company based in Lyon,0.149412,12790,Lyon
2,Hugging Face is a French company based in Geneva,0.045764,11559,Geneva
3,Hugging Face is a French company based in France,0.045763,1470,France
4,Hugging Face is a French company based in Brus...,0.040676,6497,Brussels


In [55]:
pd.set_option("display.max_colwidth", 500)

In [56]:
pd.DataFrame(m('Argentina is a third world country located in <mask>'))

Unnamed: 0,sequence,score,token,token_str
0,Argentina is a third world country located in Africa,0.17979,1327,Africa
1,Argentina is a third world country located in Antarctica,0.133594,27593,Antarctica
2,Argentina is a third world country located in Argentina,0.121582,5244,Argentina
3,Argentina is a third world country located in Asia,0.087151,1817,Asia
4,Argentina is a third world country located in Chile,0.055938,9614,Chile


In [57]:
pd.DataFrame(m('Argentina is a third-world country located in <mask>'))

Unnamed: 0,sequence,score,token,token_str
0,Argentina is a third-world country located in Argentina,0.194115,5244,Argentina
1,Argentina is a third-world country located in Africa,0.126507,1327,Africa
2,Argentina is a third-world country located in Antarctica,0.070002,27593,Antarctica
3,Argentina is a third-world country located in Asia,0.068985,1817,Asia
4,Argentina is a third-world country located in Switzerland,0.06082,6413,Switzerland


In [58]:
# This pipeline only works for inputs with exactly one token masked.
# That is why it cannot answer "South America": that answer needs two tokens.
pd.DataFrame(m('Argentina is a country located in <mask>'))

Unnamed: 0,sequence,score,token,token_str
0,Argentina is a country located in Argentina,0.437706,5244,Argentina
1,Argentina is a country located in Chile,0.105872,9614,Chile
2,Argentina is a country located in Africa,0.055104,1327,Africa
3,Argentina is a country located in Uruguay,0.046751,17609,Uruguay
4,Argentina is a country located in Russia,0.027996,798,Russia


In [63]:
pd.DataFrame(m('Argentina is a <mask> located in South America'))

Unnamed: 0,sequence,score,token,token_str
0,Argentina is a country located in South America,0.416754,247,country
1,Argentina is a region located in South America,0.160597,976,region
2,Argentina is a republic located in South America,0.126918,16441,republic
3,Argentina is a nation located in South America,0.085056,1226,nation
4,Argentina is a province located in South America,0.050522,2791,province


In [62]:
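# Weird: without a leading space the targets map to different tokens
# (note "acompany" and "aprov" in the sequences below).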
pd.DataFrame(m('Argentina is a <mask> located in South America', targets=["province", "company"]))

The specified target token `province` does not exist in the model vocabulary. Replacing with `prov`.


Unnamed: 0,sequence,score,token,token_str
0,Argentina is acompany located in South America,2.124516e-06,24233,company
1,Argentina is aprov located in South America,4.395487e-08,13138,prov


In [72]:
pd.options.display.float_format = '{:.12f}'.format

In [73]:
pd.DataFrame(m('Argentina is a <mask> located in South America', targets=[" province", " nation", " company", " apple"]))

Unnamed: 0,sequence,score,token,token_str
0,Argentina is a nation located in South America,0.085056111217,1226,nation
1,Argentina is a province located in South America,0.050522368401,2791,province
2,Argentina is a company located in South America,0.004292198457,138,company
3,Argentina is a apple located in South America,1.38e-10,15162,apple


In [84]:
pd.DataFrame(m('Argentina is a <mask> located in South America', top_k=20))

Unnamed: 0,sequence,score,token,token_str
0,Argentina is a country located in South America,0.4168,247,country
1,Argentina is a region located in South America,0.1606,976,region
2,Argentina is a republic located in South America,0.1269,16441,republic
3,Argentina is a nation located in South America,0.0851,1226,nation
4,Argentina is a province located in South America,0.0505,2791,province
5,Argentina is a city located in South America,0.0184,343,city
6,Argentina is a territory located in South America,0.0154,4284,territory
7,Argentina is a colony located in South America,0.0145,21878,colony
8,Argentina is a state located in South America,0.0145,194,state
9,Argentina is a municipality located in South America,0.0129,17300,municipality
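
The `score` column behaves like a probability distribution over the vocabulary at the masked position, which is consistent with `nation` scoring 0.0851 both with and without `targets` above. A minimal sketch of that idea, assuming a softmax over raw scores followed by a top-k cut; the logits and the five-word vocabulary below are hypothetical values chosen for illustration, not taken from the model.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - peak) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_k(probs, k):
    """Return the k most probable (token, probability) pairs, best first."""
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Hypothetical logits for the masked position over a toy vocabulary.
toy_logits = {"country": 5.0, "region": 4.0, "republic": 3.8,
              "company": 0.5, "apple": -9.0}

probs = softmax(toy_logits)
for token, p in top_k(probs, 3):
    print(f"{token:10s} {p:.4f}")
```

Restricting the candidates, as `targets` does, only selects rows from the same distribution; it does not renormalize the probabilities in this sketch.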


In [76]:
pd.options.display.float_format = '{:.4f}'.format
pd.DataFrame(m('Argentina is a country located in South <mask>', top_k=20))

Unnamed: 0,sequence,score,token,token_str
0,Argentina is a country located in South America,0.6412,730,America
1,Argentina is a country located in South Africa,0.3338,1327,Africa
2,Argentina is a country located in South Asia,0.0154,1817,Asia
3,Argentina is a country located in South Sudan,0.0039,6312,Sudan
4,Argentina is a country located in South Korea,0.0017,1101,Korea
5,Argentina is a country located in South China,0.0011,436,China
6,Argentina is a country located in South Americas,0.0006,11685,Americas
7,Argentina is a country located in South Pacific,0.0003,3073,Pacific
8,Argentina is a country located in South Carolina,0.0003,1961,Carolina
9,Argentina is a country located in South American,0.0002,470,American


In [85]:
m = pipeline('fill-mask', framework="pt")
pd.DataFrame(m('During my breakfast I had an <mask> juice', top_k=10))

Unnamed: 0,sequence,score,token,token_str
0,During my breakfast I had an orange juice,0.6901,8978,orange
1,During my breakfast I had an apple juice,0.2874,15162,apple
2,During my breakfast I had an Orange juice,0.0055,5726,Orange
3,During my breakfast I had an extra juice,0.0024,1823,extra
4,During my breakfast I had an avocado juice,0.0018,25358,avocado
5,During my breakfast I had an almond juice,0.0017,32473,almond
6,During my breakfast I had an olive juice,0.0015,14983,olive
7,During my breakfast I had an herbal juice,0.001,30868,herbal
8,During my breakfast I had an ice juice,0.0007,2480,ice
9,During my breakfast I had an acidic juice,0.0006,41314,acidic


In [78]:
pd.DataFrame(m('Argentina is a country located in South America <mask>', top_k=1))

Unnamed: 0,sequence,score,token,token_str
0,Argentina is a country located in South America.,0.7924,4,.


In [80]:
pd.DataFrame(m('Argentina is a country located in South America. <mask>', top_k=5))

Unnamed: 0,sequence,score,token,token_str
0,Argentina is a country located in South America. ®,0.2803,39110,®
1,Argentina is a country located in South America.,0.2438,2,</s>
2,Argentina is a country located in South America.​,0.034,13635,​
3,Argentina is a country located in South America. Contents,0.0196,36422,Contents
4,Argentina is a country located in South America..,0.0178,479,.


# [Text generation](https://huggingface.co/transformers/task_summary.html#text-generation)

https://www.kaggle.com/julian3833/gpt-2-large-774m-w-pytorch-not-that-impressive

In [102]:
from transformers import set_seed
set_seed(42)
#generator = pipeline('text-generation', model='gpt2')

#generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)


In [101]:
set_seed??

In [86]:
m = pipeline("text-generation", framework="pt")

In [88]:
res = m("As far as I am concerned, I will", max_length=50, do_sample=False)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [92]:
print(res[0]['generated_text'])

As far as I am concerned, I will be the first to admit that I am not a fan of the idea of a "free market." I think that the idea of a free market is a bit of a stretch. I think that the idea


In [93]:
m("As far as I am concerned, I will", max_length=100, do_sample=False)[0]['generated_text']

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'As far as I am concerned, I will be the first to admit that I am not a fan of the idea of a "free market." I think that the idea of a free market is a bit of a stretch. I think that the idea of a free market is a bit of a stretch. I think that the idea of a free market is a bit of a stretch. I think that the idea of a free market is a bit of a stretch. I think that the idea of a'
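
The repeated sentence above is typical of deterministic decoding: with `do_sample=False` and default settings the most likely next token is picked at every step, so once the text returns to a state it has already visited it keeps cycling. A toy sketch of that effect follows. The score table is hypothetical and only looks at the last token, whereas a real model conditions on the whole prefix.

```python
# Hypothetical next-token scores keyed by the previous token.
next_token_scores = {
    "I": {"think": 0.6, "will": 0.4},
    "think": {"that": 0.9, "so": 0.1},
    "that": {"the": 0.7, "I": 0.3},
    "the": {"idea": 0.8, "market": 0.2},
    "idea": {"is": 0.6, "of": 0.4},
    "is": {"a": 0.9, "the": 0.1},
    "a": {"stretch.": 0.7, "market": 0.3},
    "stretch.": {"I": 0.9, "The": 0.1},
}


def greedy_decode(start, steps):
    """Always append the highest-scoring next token (no randomness)."""
    tokens = [start]
    for _ in range(steps):
        candidates = next_token_scores[tokens[-1]]
        tokens.append(max(candidates, key=candidates.get))
    return tokens


out = greedy_decode("I", 16)
print(" ".join(out))
```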

In [107]:
res = m("As far as I am concerned, I will", max_length=100, num_return_sequences=5)
for r in res:
    print(r['generated_text'])
    print("-----")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


As far as I am concerned, I will never see what they've got, and I certainly don't want to have a bad influence on their future," he said. "They got a ton of money from it this year. That's a lot of money to make people talk about. So why not make more money?"


Bennet said he's always been concerned about giving away the $500,000 he spent in 2012-2013 on the club he founded as well as its history
-----
As far as I am concerned, I will play it all out in a few hours (and preferably more!). I'll look at some of the more common features from this post.

When I'm done with this, let's look at a few other options. On PC, I can play Super Mario Brothers and Super Puzzle Mode. If you aren't familiar, this is where you can download your Super Mario Bros. 2 game file for PC and play that level in Super Mario Bros. 3
-----
As far as I am concerned, I will never admit that the whole thing was a lie and a lie never happened. It was a lie, a lie, and I still believe that."

He continues, "I have n

In [108]:
m("As far as I am concerned, I will", max_length=100, do_sample=True)[0]['generated_text']

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'As far as I am concerned, I will remain ignorant of what really happened and of how it happened," she told HuffPost. "There is no question that my actions violated my responsibilities and as a result, the University of Wisconsin was not in compliance."\n\nFor the second year in a row, the university\'s legal team made numerous requests for details about Hoehn to give in December on the matter. Hoehn never replied, her attorney said Thursday afternoon in a court filing.\n\n'

In [109]:
m("As far as I am concerned, I will", max_length=100, do_sample=True)[0]['generated_text']

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'As far as I am concerned, I will call myself a fan of a game called World of Tanks, but I\'m only interested in the game because of its story," says the 30-year-old.\n\n"I can see it\'s a beautiful story. It doesn\'t feel like a real story. It just feels like a game. I want the players to explore things with their own minds. [The campaign system that\'s the focus] is designed to make sure that players have fun'

In [111]:
m("As far as I am concerned, I will", max_length=100, do_sample=True, temperature=0.05)[0]['generated_text']

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'As far as I am concerned, I will not be able to do this. I will be able to do this. I will be able to do this. I will be able to do this. I will be able to do this. I will be able to do this. I will be able to do this. I will be able to do this. I will be able to do this. I will be able to do this. I will be able to do this. I will be able'

In [112]:
m("As far as I am concerned, I will", max_length=100, do_sample=True, temperature=0.05)[0]['generated_text']

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'As far as I am concerned, I will not be able to do this. I will be able to do this in the future. I will be able to do this in the future. I will be able to do this in the future. I will be able to do this in the future. I will be able to do this in the future. I will be able to do this in the future. I will be able to do this in the future. I will be able to do this'
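
The two runs above with `temperature=0.05` are close to greedy decoding yet not identical. A sketch of why, assuming the usual definition in which logits are divided by the temperature before the softmax: a low temperature pushes almost all probability onto the top token, but sampling can still occasionally pick another one. The four-token vocabulary and its logits are hypothetical.

```python
import math
import random


def temperature_softmax(logits, temperature):
    """Divide logits by the temperature, then apply a softmax."""
    scaled = [v / temperature for v in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for four candidate next tokens.
vocab = ["not", "be", "never", "play"]
logits = [2.0, 1.5, 1.0, 0.5]

for t in (1.0, 0.05):
    probs = temperature_softmax(logits, t)
    print(t, [round(p, 4) for p in probs])

# Sampling from the low-temperature distribution with a fixed seed,
# analogous in spirit to calling set_seed before generating.
rng = random.Random(42)
cold = temperature_softmax(logits, 0.05)
print(rng.choices(vocab, weights=cold, k=5))
```

At temperature 1.0 the distribution is left as the model produced it, which matches the more varied text in the sampled runs.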

In [113]:
m("As far as I am concerned, I will", max_length=100, do_sample=True, temperature=1.0)[0]['generated_text']

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


"As far as I am concerned, I will not make a prediction based on the data that was available. Rather it will be because I think it is my job to check the data in order to figure out further. But my job is only to tell the tale of what happened and the possible consequences. I do not have to be on the front lines of this stuff. This will happen.\n\nI think everybody has something on their mind that could end up making this point to them. It's"

## Feature extraction

In [114]:
m = pipeline('feature-extraction', framework="pt")
output = m('Argentina is a third-world country located in South America')
res = np.array(output)   # (Samples, Tokens, Vector Size)

Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertModel: ['vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [118]:
len('Argentina is a third-world country located in South America'.split())

9

In [121]:
# Even assuming "third-world" counts as two tokens, I am still three tokens short
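
A likely explanation for the gap, offered as an assumption rather than something verified against the real tokenizer: BERT-style tokenizers split punctuation into its own token, so `third-world` becomes `third`, `-`, `world`, and they wrap every input in `[CLS]` and `[SEP]` markers. The function below is a simplified stand-in with a hypothetical name; it does not perform real WordPiece sub-word splitting.

```python
import re


def rough_bert_tokens(text):
    """Simplified stand-in for a BERT tokenizer.

    Splits on whitespace and punctuation and adds the [CLS]/[SEP] markers.
    Real WordPiece may split rare words further.
    """
    pieces = re.findall(r"\w+|[^\w\s]", text)
    return ["[CLS]"] + pieces + ["[SEP]"]


sentence = "Argentina is a third-world country located in South America"
tokens = rough_bert_tokens(sentence)
print(len(sentence.split()), len(tokens))
print(tokens)
```

Under these assumptions the 9 whitespace-separated words turn into 11 pieces plus 2 special tokens.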

In [120]:
pd.DataFrame(res[0])

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,758,759,760,761,762,763,764,765,766,767
0,0.3142,-0.0928,-0.1445,-0.3743,-0.2684,0.0025,0.4324,-0.011,0.2053,-1.005,...,0.1891,0.495,-0.2091,0.0907,0.0479,0.234,0.1152,-0.1387,0.2466,-0.1291
1,-0.2211,-0.1835,0.0413,-0.1406,0.114,-0.1885,0.4402,0.5387,0.6047,0.0857,...,-0.3314,0.3388,-0.4116,-0.07,0.1156,0.1657,0.0293,0.192,0.2495,0.0474
2,0.121,0.0032,-0.013,0.298,0.2649,0.1956,0.2696,0.2847,0.0151,-0.2208,...,-0.1311,0.5167,-0.2849,-0.0949,-0.513,0.0704,0.1531,0.4746,0.2661,-0.3561
3,0.0767,-0.1741,-0.3404,0.1893,0.3162,-0.0561,0.3991,0.6855,-0.0327,0.0884,...,-0.0888,0.8325,-0.3268,-0.1733,-0.3021,0.2575,0.4495,0.6846,0.2719,-0.3861
4,0.1867,-0.2539,-0.5851,-0.2226,0.2334,-0.0126,0.2989,0.263,-0.0889,-0.1003,...,-0.0888,0.7264,-0.5789,-0.0024,-0.5184,-0.1718,-0.116,0.6531,0.4978,0.069
5,0.518,-0.4451,-0.5487,0.1668,0.4793,-0.1344,0.2978,0.1596,0.037,-0.1935,...,0.175,0.4874,-0.7388,-0.1075,-0.2085,0.5319,0.5084,0.9223,0.0692,-0.4615
6,0.5295,-0.0484,-0.3013,0.1545,0.2818,0.0397,0.2347,0.3955,-0.214,-0.0846,...,-0.3223,0.6083,-0.3618,0.0352,-0.0974,0.0688,-0.0991,0.8469,-0.0829,-0.2676
7,-0.0829,-0.3128,-0.1534,-0.1733,0.0785,-0.1729,0.1882,0.2299,-0.2923,-0.0341,...,-0.1456,0.4434,-0.1903,-0.0701,-0.0783,0.3054,0.609,0.5938,-0.1504,-0.214
8,0.1025,-0.294,-0.197,0.2706,0.3495,0.019,0.1514,0.1349,-0.1195,0.2497,...,-0.014,0.1871,-0.3999,0.2439,-0.0265,0.1519,0.3591,0.6956,0.3446,-0.3511
9,0.2896,-0.1953,-0.0789,-0.1754,0.4192,0.1717,0.3444,0.0749,-0.1034,0.3028,...,0.0881,0.4991,-0.4306,0.1792,0.2573,0.067,0.1462,0.5654,0.1686,-0.717


In [125]:
tokens = 'Argentina is a third world country located in South America'.split()

In [126]:
m = pipeline('feature-extraction', framework="pt")
output_new = m(tokens)
res_new = np.array(output_new)   # (Samples, Tokens, Vector Size)


In [127]:
res_new.shape

(10, 3, 768)

In [137]:
# Now I am even more lost... I expected (10, 1, 768)

In [135]:
res_new[4]

array([[ 0.28765035,  0.09892987, -0.0486661 , ..., -0.08023636,
         0.3199439 , -0.05657339],
       [ 0.36360279, -0.1752699 ,  0.17662662, ..., -0.03108238,
         0.29846355, -0.09122071],
       [ 0.87552035,  0.68321186, -0.10243569, ..., -0.11852566,
         1.20282555, -0.31995371]])
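
The shape `(10, 3, 768)` is consistent with the pipeline treating the list as a batch of ten separate inputs, each wrapped in `[CLS]` and `[SEP]`, so a single word yields three token vectors. The sketch below illustrates this with a hypothetical simplified tokenizer; it assumes each word maps to exactly one token, which the regular array shape suggests is the case here.

```python
def rough_bert_tokens(text):
    """Simplified stand-in: whitespace split plus [CLS]/[SEP] markers."""
    return ["[CLS]"] + text.split() + ["[SEP]"]


words = "Argentina is a third world country located in South America".split()

# A list input is treated as a batch: every string is tokenized on its own.
batch = [rough_bert_tokens(w) for w in words]
print(len(batch), len(batch[0]))

# Position 1 of each sample is the word itself, between [CLS] and [SEP].
word_tokens = [sample[1] for sample in batch]
print(word_tokens == words)
```

Under the same assumption, `res_new[:, 1, :]` would select one 768-dimensional vector per word, giving a `(10, 768)` array.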

# Translation

In [142]:
m = pipeline("translation_en_to_de", framework="pt")
res = m("Argentina is a third world country located in South America")

In [143]:
res

[{'translation_text': 'Argentinien ist ein Land der Dritten Welt in Südamerika'}]

In [144]:
m = pipeline("translation_en_to_fr", framework="pt")
res = m("Argentina is a third world country located in South America")

In [145]:
res

[{'translation_text': "L'Argentine est un pays du tiers monde situé en Amérique du Sud."}]

In [2]:
m = pipeline("translation_en_to_es", framework="pt")
res = m("Argentina is a third world country located in South America")

In [3]:
res

[{'translation_text': 'located in South America. Argentina is a third world country located in South America. It'}]
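
The Spanish result above is not a translation. A plausible cause, stated as an assumption: the default checkpoint for translation tasks is a T5 model whose translation training covered English to German, French and Romanian, so `translation_en_to_es` falls outside what it learned. One possible workaround is to pass an explicit checkpoint. The sketch below uses `Helsinki-NLP/opus-mt-en-es` from the Hugging Face Hub; it was not run for this note, it needs a model download, and it assumes the `sentencepiece` package is installed. `build_en_es_translator` is a hypothetical helper name.

```python
def build_en_es_translator():
    """Create an English-to-Spanish pipeline with an explicit checkpoint.

    Assumption: the Helsinki-NLP/opus-mt-en-es checkpoint is reachable and
    sentencepiece is installed. The import is local so that defining this
    function does not require transformers to be importable.
    """
    from transformers import pipeline

    return pipeline(
        "translation",
        model="Helsinki-NLP/opus-mt-en-es",
        framework="pt",
    )
```

Calling `build_en_es_translator()("Argentina is a country located in South America")` should return a list with a `translation_text` entry, in the same format as the German and French cells.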

### [Summarization](https://huggingface.co/transformers/task_summary.html#summarization)

In [5]:
m = pipeline("summarization", framework="pt")

In [10]:
TEXT = "World War II or the Second World War, often abbreviated as WWII or WW2, was a global war that lasted from 1939 to 1945. It involved the vast majority of the world's countries—including all of the great powers—forming two opposing military alliances: the Allies and the Axis powers. In a total war directly involving more than 100 million personnel from more than 30 countries, the major participants threw their entire economic, industrial, and scientific capabilities behind the war effort, blurring the distinction between civilian and military resources."

In [11]:
len(TEXT.split())

87

In [12]:
m(TEXT, min_length=20, max_length=40)

[{'summary_text': ' World War II or the Second World War, often abbreviated as WWII or WW2, was a global war that lasted from 1939 to 1945 . It involved the vast majority of the world'}]

In [9]:
m("An apple a day, keeps the doctor away", min_length=2, max_length=6)

[{'summary_text': ' An apple a'}]

In [13]:
TEXT2 = """World War II or the Second World War, often abbreviated as WWII or WW2, was a global war that lasted from 1939 to 1945. It involved the vast majority of the world's countries—including all of the great powers—forming two opposing military alliances: the Allies and the Axis powers. In a total war directly involving more than 100 million personnel from more than 30 countries, the major participants threw their entire economic, industrial, and scientific capabilities behind the war effort, blurring the distinction between civilian and military resources. Aircraft played a major role in the conflict, enabling the strategic bombing of population centres and the only two uses of nuclear weapons in war to this day. World War II was by far the deadliest conflict in human history, and resulted in 70 to 85 million fatalities, a majority being civilians. Tens of millions of people died due to genocides (including the Holocaust), starvation, massacres, and disease. In the wake of the Axis defeat, Germany and Japan were occupied, and war crimes tribunals were conducted against German and Japanese leaders.

World War II is generally considered to have begun on 1 September 1939, when Nazi Germany, under Adolf Hitler, invaded Poland. The United Kingdom and France subsequently declared war on Germany on the 3rd. Under the Molotov–Ribbentrop Pact of August 1939, Germany and the Soviet Union had partitioned Poland and marked out their "spheres of influence" across Finland, Romania and the Baltic states. From late 1939 to early 1941, in a series of campaigns and treaties, Germany conquered or controlled much of continental Europe, and formed the Axis alliance with Italy and Japan (along with other countries later on). Following the onset of campaigns in North Africa and East Africa, and the fall of France in mid-1940, the war continued primarily between the European Axis powers and the British Empire, with war in the Balkans, the aerial Battle of Britain, the Blitz of the UK, and the Battle of the Atlantic. On 22 June 1941, Germany led the European Axis powers in an invasion of the Soviet Union, opening the Eastern Front, the largest land theatre of war in history and trapping the Axis powers, crucially the German Wehrmacht, in a war of attrition."""

In [14]:
len(TEXT2.split())

372

In [15]:
m(TEXT2, min_length=20, max_length=80)

[{'summary_text': " World War II or the Second World War, often abbreviated as WWII or WW2, was a global war that lasted from 1939 to 1945 . It involved the vast majority of the world's countries, including all of the great powers . Aircraft played a major role in the conflict, enabling strategic bombing of population centres and the only two uses of nuclear weapons in war to this day ."}]

# Question answering

In [17]:
qa = pipeline("question-answering", framework="pt")

context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the examples/pytorch/question-answering/run_squad.py script.
"""

In [18]:
result = qa(question="What is extractive question answering?", context=context)

In [19]:
result

{'score': 0.6177273988723755,
 'start': 34,
 'end': 95,
 'answer': 'the task of extracting an answer from a text given a question'}
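
The `start` and `end` fields in the result above are character offsets into `context`, which is what makes the task extractive: the answer is a slice of the input rather than generated text. A small check of that reading, with the context and the result values copied from the cells above:

```python
context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the examples/pytorch/question-answering/run_squad.py script.
"""

# Result copied from the output above.
result = {
    "score": 0.6177273988723755,
    "start": 34,
    "end": 95,
    "answer": "the task of extracting an answer from a text given a question",
}

# Slice the context with the reported offsets and compare with the answer.
span = context[result["start"]:result["end"]]
print(span == result["answer"])
```

The offsets count the leading newline of the triple-quoted string, so the context has to be passed exactly as it was given to the pipeline.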