# About

The goal of this notebook is to experiment with the following datasets:
* Wikipedia
* Tweets: https://github.com/mkearney/trumptweets/tree/master/data and https://www.kaggle.com/austinreese/trump-tweets
* Trump speeches: https://www.kaggle.com/christianlillelund/donald-trumps-rallies

Questions are generated from each text with the model from https://github.com/patil-suraj/question_generation

In [1]:
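# `display` and `HTML` are exposed by IPython.display; importing them from IPython.core.display is deprecated
from IPython.display import display, HTML
# Widen the notebook container to the full browser width
display(HTML("<style>.container { width:100% !important; }</style>"))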

In [3]:
! pip install wikipedia --quiet
! pip install -U transformers==3.0.0 --quiet
! python -m nltk.downloader punkt

! git clone https://github.com/patil-suraj/question_generation.git
    

[nltk_data] Downloading package punkt to /Users/viktor/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
fatal: destination path 'question_generation' already exists and is not an empty directory.


In [25]:
import glob
import os

import pandas as pd
import wikipedia

### Example of the model

In [5]:
from question_generation.pipelines import pipeline

nlp = pipeline("question-generation")
nlp("42 is the answer to life, the universe and everything.")

[{'answer': '42',
  'question': 'What is the answer to life, the universe and everything?'}]

In [7]:
nlp_multi_qg = pipeline("multitask-qa-qg")
nlp_qg = pipeline("question-generation")
nlp_e2e_qg = pipeline("e2e-qg")
nlp_multi_qg("Python is a programming language. Created by Guido van Rossum and first released in 1991.")


[{'answer': 'Python',
  'question': 'What language was created by Guido van Rossum?'},
 {'answer': 'Guido van Rossum', 'question': 'Who created Python?'},
 {'answer': '1991', 'question': 'When was Python first released?'}]

In [8]:
nlp_qg("Python is a programming language. Created by Guido van Rossum and first released in 1991.")



[{'answer': 'Python', 'question': 'What is a programming language?'},
 {'answer': 'Guido van Rossum', 'question': 'Who created Python?'},
 {'answer': '1991', 'question': 'When was Python first released?'}]

In [9]:
nlp_e2e_qg("Python is a programming language. Created by Guido van Rossum and first released in 1991.")



['What is a programming language?',
 'Who created Python?',
 'When was Python first released?']

## Using the model to generate questions on Wikipedia data

In [101]:
wikipedia.summary("Donald Trump", sentences=1)

'Donald John Trump (born June 14, 1946) is the 45th and current president of the United States.'

In [11]:
nlp_qg(wikipedia.summary("Donald Trump", sentences=1))

[{'answer': 'June 14, 1946', 'question': 'When was Donald John Trump born?'},
 {'answer': 'Donald John Trump',
  'question': 'Who is the 45th president of the United States?'}]

## Let's generate questions on tweets

In [12]:
path_to_data = os.path.join(os.getcwd(), "data")

In [13]:
trump_tweets = pd.read_csv(path_to_data+"/tweets/final-trump.csv", engine='python', usecols=['tweet_id', 'tweet_text']).drop_duplicates().reset_index(drop=True)

In [17]:
trump_tweets.head()

              tweet_id                                         tweet_text
0  1179422987684077568  The Do Nothing Democrats should be focused on ...
1  1197503790729121794  Corrupt politician Adam Schiff’s lies are grow...
2  1163961882945970176  Denmark is a very special country with incredi...
3  1211969354499284992  The Democrats will do anything to avoid a tria...
4  1212121012151689217  The U.S. Embassy in Iraq is, & has been for ho...


In [21]:
tweet = trump_tweets.loc[0, 'tweet_text']

In [24]:
nlp_qg(tweet)

[{'answer': 'BULLSHIT',
  'question': 'What is the name of the organization that the Do Nothing Democrats should not waste their time and energy on?'},
 {'answer': 'you’ll need it!',
  'question': 'What do Do Nothing Democrats need to do to get a better candidate?'}]

In [22]:
for twe in tweet.split('.'):
    print(twe)

The Do Nothing Democrats should be focused on building up our Country, not wasting everyone’s time and energy on BULLSHIT, which is what they have been doing ever since I got overwhelmingly elected in 2016, 223-306
 Get a better candidate this time, you’ll need it!


In [23]:
for twe in tweet.split('.'):
    print(nlp_qg(twe))

[{'answer': 'BULLSHIT', 'question': 'What does Do Nothing Democrats spend their time and energy on?'}]
[{'answer': 'Get a better candidate', 'question': 'What type of candidate will you need this time?'}]


## Let's generate questions based on a Trump speech

In [29]:
one_speech_path = glob.glob(path_to_data + "/speech/*.txt")[1]

In [36]:
with open(one_speech_path, 'r') as fd:
    speech_text = fd.read()

In [41]:
speech_text_sample = speech_text[:300]
speech_text_sample

'ell, thank you very much. And hello, Tupelo. This is great to be with you tonight, the great state of Mississippi. The great state of Mississippi, and by the way, the birthplace of a gentleman, not too many people heard of him, Elvis Presley. But to be with thousands of incredible patriots who put t'

In [33]:
nlp_qg(speech_text_sample)

[{'answer': 'thank you',
  'question': 'What did Tupelo say very much about Elvis Presley?'},
 {'answer': 'Tupelo', 'question': "Who was Elvis Presley's birthplace?"},
 {'answer': 'tonight', 'question': 'What is the great state of Mississippi?'},
 {'answer': 'Elvis Presley', 'question': 'Who was the birthplace of Tupelo?'},
 {'answer': 'thousands',
  'question': 'How many incredible patriots were there to be with Tupelo?'}]
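
The 300-character slice above ends in the middle of a word ('who put t'), so the model receives an incomplete last sentence. As a minimal sketch, the hypothetical helper below (`trim_to_last_sentence` is not part of the notebook or of the question_generation repository) cuts a sample at the last sentence-ending punctuation mark inside the character limit. It assumes that '.', '!' and '?' always end a sentence, which is not true for abbreviations such as "U.S.".

```python
def trim_to_last_sentence(text: str, max_chars: int = 300) -> str:
    """Cut text to at most max_chars, then drop a trailing partial sentence.

    Hypothetical helper: assumes '.', '!' and '?' mark sentence ends.
    """
    if len(text) <= max_chars:
        # Nothing was cut, so there is no partial sentence to drop.
        return text
    sample = text[:max_chars]
    # Position of the last sentence-ending character inside the sample.
    end = max(sample.rfind(mark) for mark in ".!?")
    if end == -1:
        # No sentence boundary found; fall back to the plain slice.
        return sample
    return sample[: end + 1]


example = (
    "Hello, Tupelo. This is great to be with you tonight. "
    "But to be with thousands of patriots"
)
print(trim_to_last_sentence(example, max_chars=60))
```

In the notebook this could replace the plain slice, for example `speech_text_sample = trim_to_last_sentence(speech_text, 300)`.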

# Conclusion

Using the model to generate questions, we can enrich our dataset with more information about Trump.
The hardest part is converting speeches and other text into a conversation with context. One option is to split the text into sentences and generate a question for each sentence, which would yield a conversation. This direction should be investigated.
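
The sentence-by-sentence idea can be sketched as follows. This is only an illustration under stated assumptions: `split_sentences` and `text_to_conversation` are hypothetical helpers, the regular expression is a naive splitter (it breaks on abbreviations such as "U.S."), and `qg` stands for any callable that returns a list of `{'question', 'answer'}` dictionaries, like the `nlp_qg` pipeline used in this notebook.

```python
import re
from typing import Callable, Dict, List

QuestionGenerator = Callable[[str], List[Dict[str, str]]]


def split_sentences(text: str) -> List[str]:
    """Naively split on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def text_to_conversation(text: str, qg: QuestionGenerator) -> List[Dict[str, str]]:
    """Generate questions sentence by sentence and keep them in text order."""
    turns = []
    for sentence in split_sentences(text):
        for pair in qg(sentence):
            turns.append(
                {
                    "question": pair["question"],
                    "answer": pair["answer"],
                    "context": sentence,  # the sentence the pair came from
                }
            )
    return turns


def fake_qg(sentence: str) -> List[Dict[str, str]]:
    """Stand-in generator so the sketch runs without downloading a model."""
    return [{"question": "What was said?", "answer": sentence}]


for turn in text_to_conversation("Hello, Tupelo. This is great!", fake_qg):
    print(turn)
```

With the real model, the call would be `text_to_conversation(tweet, nlp_qg)`. Since the notebook already downloads the punkt data, `nltk.sent_tokenize` could take the place of the regular expression for more robust sentence boundaries.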