# Data Scraping and Text Generation with a Ready-to-Use Model

### Data Scraping

In [1]:
!pip3 install newspaper3k  # newspaper3k: an article-scraping library hosted on GitHub

Collecting newspaper3k

In [2]:
# Build a newspaper source for each of a few popular news websites
import newspaper

websites = []

buzzfeed = newspaper.build('https://www.buzzfeednews.com/')
websites.append(buzzfeed)

cnn = newspaper.build('https://edition.cnn.com/')
websites.append(cnn)

usa_today = newspaper.build('https://www.usatoday.com/')
websites.append(usa_today)

In [3]:
for site in websites:
    for article in site.articles:
        print(article.url)

https://www.buzzfeednews.com/article/buzzfeednews/about-buzzfeed-news
https://www.buzzfeednews.com/article/emaoconnor/al-green-congress-should-have-impeached-trump-earlier
https://www.buzzfeednews.com/article/addybaird/house-votes-impeach-trump-twice-capitol-insurrection
https://www.buzzfeednews.com/article/rubycramer/andrew-yang-nyc-mayor-2021
https://www.buzzfeednews.com/article/olivianiland/breast-reduction-surgery-body-image
https://www.buzzfeednews.com/article/skbaer/flint-water-crisis-charges-rick-snyder
https://www.buzzfeednews.com/article/amberjamieson/klete-keller-olympian-charged-capitol
https://www.buzzfeednews.com/article/rubycramer/jaime-harrison-dnc-chair
https://www.buzzfeednews.com/article/mollyhensleyclancy/trump-supporters-voter-fraud-2020-conspiracy
https://www.buzzfeednews.com/article/nidhiprakash/biden-deportations-immigration-activists
https://www.buzzfeednews.com/article/zahrahirji/2020-tied-warmest-year-record
https://www.buzzfeednews.com/article/katienotopoulos

In [4]:
from newspaper import Article

In [5]:
titles = []

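for site in websites:
    for article in site.articles:
        try:
            # Download and parse each article individually to get its title
            a = Article(article.url, language='en')
            a.download()
            a.parse()
            titles.append(a.title)
        except Exception:
            # Report the URL that failed and move on to the next article
            print('***FAILED TO DOWNLOAD***', article.url)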

***FAILED TO DOWNLOAD*** http://www.preview.cnn.com/2015/04/02/world/iyw-guatemala-gender-violence/index.html
***FAILED TO DOWNLOAD*** https://money.cnn.com/business/media
***FAILED TO DOWNLOAD*** https://money.cnn.com/2021/01/14/investing/banks-fossil-fuels-trump-regulators/index.html
***FAILED TO DOWNLOAD*** https://money.cnn.com/2021/01/14/business/airline-security-dc/index.html
***FAILED TO DOWNLOAD*** https://money.cnn.com/2021/01/14/economy/unemployment-benefits-coronavirus/index.html
***FAILED TO DOWNLOAD*** https://money.cnn.com/2021/01/14/perspectives/jobs-boom-2021/index.html
***FAILED TO DOWNLOAD*** https://money.cnn.com/2021/01/14/investing/blackrock-earnings-ishares-etfs/index.html
***FAILED TO DOWNLOAD*** https://money.cnn.com/2021/01/14/economy/china-trade-surplus-intl-hnk/index.html
***FAILED TO DOWNLOAD*** https://money.cnn.com/2021/01/14/investing/delta-record-loss/index.html
***FAILED TO DOWNLOAD*** https://money.cnn.com/2021/01/13/media/trump-presidency-reliable-sou
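
Most of the failures visible in this output come from the `money.cnn.com` subdomain. As a minimal sketch, the loop could also record the failed URLs rather than only printing them, so they can be inspected or retried later. `collect_titles`, `fetch_title`, and `fake_fetch` are hypothetical names, not part of newspaper3k; in this notebook, `fetch_title` would wrap the `Article(...).download()` and `.parse()` calls.

```python
def collect_titles(urls, fetch_title):
    """Return (titles, failed_urls) for an iterable of URLs.

    fetch_title is any callable that takes a URL and returns a title string,
    raising an exception on failure (hypothetical; supplied by the caller).
    """
    titles, failed = [], []
    for url in urls:
        try:
            titles.append(fetch_title(url))
        except Exception:
            # Remember the URL so it can be retried or inspected later
            failed.append(url)
    return titles, failed


# Stand-in fetcher for demonstration only: no network access is performed.
def fake_fetch(url):
    if "money." in url:
        raise ConnectionError("simulated download failure")
    return "Title for " + url


demo_titles, demo_failed = collect_titles(
    ["https://example.com/a", "https://money.example.com/b"], fake_fetch
)
print(demo_titles)
print(demo_failed)
```

Keeping the failures in a list makes it easy to count them or to check whether they share a common domain.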

In [6]:
# Next, check whether each title contains non-English text (e.g. Spanish or Chinese characters) and remove those titles
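
The filtering code for this cell is not shown. One possible approach, given here only as a hedged sketch, is a character-based heuristic: keep a title if every character is ASCII or belongs to a small allow-list of typographic punctuation. `looks_english` and `ALLOWED_NON_ASCII` are hypothetical names, and the last two sample titles are made up for illustration. The heuristic catches Chinese text and accented Spanish, but it would not catch a Spanish title written entirely in ASCII.

```python
# Typographic punctuation common in English headlines (an assumed allow-list):
# curly single/double quotes, en dash, em dash, and ellipsis.
ALLOWED_NON_ASCII = set("\u2018\u2019\u201c\u201d\u2013\u2014\u2026")


def looks_english(title):
    """Heuristic: every character is ASCII or allowed typographic punctuation."""
    return all(ord(ch) < 128 or ch in ALLOWED_NON_ASCII for ch in title)


sample_titles = [
    "Andrew Yang Is Running For Mayor Of New York City",
    "She Believed Trump. Now She Doesn’t Believe In America.",
    "¿Qué pasó en el Capitolio?",  # made-up Spanish example
    "今日新闻",  # made-up Chinese example
]
kept = [t for t in sample_titles if looks_english(t)]
print(kept)
```

In the notebook, the equivalent step would be `titles = [t for t in titles if looks_english(t)]`.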



In [6]:
titles  # the remaining titles now look fine

['About BuzzFeed News',
 'The First Democrat Who Called For Impeachment Says Congress Should Have Acted Years Ago',
 'Trump Has Become The First President Ever To Be Impeached Twice, This Time For Inciting A Deadly Insurrection',
 'Andrew Yang Is Running For Mayor Of New York City',
 'My Breast Reduction Finally Made Me Feel At Home In My Body',
 "Michigan's Former Governor Has Been Charged With Willful Neglect In The Flint Water Crisis",
 'An Ex-Olympian Who Wore His Team USA Jacket At The Capitol Riot Has Been Charged',
 'Joe Biden Has Picked Jaime Harrison To Be Next Chair Of The Democratic Party',
 'She Believed Trump. Now She Doesn’t Believe In America.',
 'Immigrant Rights Groups Are Ramping Up Pressure On Biden To Uphold His 100-Day Deportation Ban',
 'Another Year, Another Record: 2020 Was One Of The Hottest Years Yet',
 'House Flippers And Real Estate Agents Are Going Viral On TikTok',
 'Trump Has Been Impeached. Now Mitch McConnell Will Have To Pick A Side.',
 'Men Have Eatin

### Preprocessing and Writing

In [7]:
import re

# Remove any "\n" characters so that every title fits on a single line
def preprocess(w):
    # Replace each run of newlines with a single space
    w = re.sub(r"[\n]+", " ", w)
    return w
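
To see what the regular expression does, here is a small self-contained check. The function body mirrors the one above, and the sample string is made up.

```python
import re


def preprocess(w):
    # Same substitution as above: each run of newlines becomes one space.
    return re.sub(r"[\n]+", " ", w)


sample = "Breaking News\n\nSecond line\nThird line"
print(preprocess(sample))  # Breaking News Second line Third line
```

Note that a trailing newline becomes a trailing space, so following the substitution with `.strip()` may be worthwhile.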

In [8]:
# Open the output file; UTF-8 keeps characters such as ’ intact
f = open("news.txt", "w+", encoding="utf-8")

In [9]:
import re

for i, title in enumerate(titles):
    titles[i] = preprocess(title)
    f.write(titles[i] + "\n")

f.close()
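
As an alternative to opening and closing the file by hand, the same write can be sketched with a `with` block, which closes the file even if a write fails. The example below writes two of the scraped titles to a temporary file (the `news_demo.txt` name is an assumption chosen to avoid touching `news.txt`) and reads it back to confirm there is one title per line.

```python
import os
import tempfile

demo_titles = [
    "About BuzzFeed News",
    "Andrew Yang Is Running For Mayor Of New York City",
]

# Write to a temporary directory so the real news.txt is left untouched.
path = os.path.join(tempfile.mkdtemp(), "news_demo.txt")

# The with-block closes the file even if a write raises an exception.
with open(path, "w", encoding="utf-8") as f:
    for title in demo_titles:
        f.write(title + "\n")

# Read the file back to confirm there is exactly one title per line.
with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()

print(len(lines) == len(demo_titles))
```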

In [10]:
!pip3 install textgenrnn



In [11]:
%tensorflow_version 1.x

TensorFlow 1.x selected.


In [12]:
from textgenrnn import textgenrnn

textgen = textgenrnn()

Using TensorFlow backend.




In [14]:
# Train on the scraped titles for one epoch; the result is a fake news title generator!

textgen.train_from_file('news.txt', num_epochs=1)
textgen.generate()

993 texts collected.
Training on 68,513 character sequences.

Epoch 1/1
####################
Temperature: 0.2
####################
Best contraced to be a brazo set of the best second to a both the menance of the best first time in 2020 to be a bad en has been selfies and secret to be a protest the help of the China because they are travilinations for the best protest behavation of the best story of the best special second to 

What's better the mention to the power for the best protest behind the world and secret of the beach for the best from Capitol to be a lot of the first time and second and a little en has been selecting the for the best from the world of the best from the form of the holida contracity of Capitol i

A la de second the parative of the Changes for better than the for the market for the for the best face of the Changes of Star Ward to the orders and security to be a los in 2020

####################
Temperature: 0.5
####################
A transformer de Japannin cove
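
The `Temperature` headers in the output refer to the sampling temperature. As a rough illustration (not textgenrnn's actual implementation), the sketch below rescales a made-up probability distribution over three candidate characters. Lower temperatures sharpen the distribution toward the most likely character, which is consistent with the repetitive "the best ..." phrasing in the 0.2 samples. `apply_temperature` is a hypothetical helper name.

```python
import math


def apply_temperature(probs, temperature):
    """Rescale a probability distribution by a sampling temperature."""
    # Divide log-probabilities by the temperature, then renormalise (softmax).
    scaled = [math.log(p) / temperature for p in probs]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


probs = [0.5, 0.3, 0.2]  # made-up next-character probabilities
for t in (0.2, 0.5, 1.0):
    print(t, [round(p, 3) for p in apply_temperature(probs, t)])
```

At a temperature of 1.0 the distribution is unchanged; as the temperature drops, almost all of the probability mass moves to the first character.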