### This notebook generates fake news from headlines stored in a JSON file. It is mainly inspired by this article: https://medium.com/@ageitgey/deepfaking-the-news-with-nlp-and-transformer-models-5e057ebd697d

This notebook was originally written to run on Google Colaboratory.

### Step 1: Load headlines from Google News and preprocess them (remove the website name)

In [4]:
#### Load headlines from Google News and preprocess them (remove the website name) ####

import json

with open('../../data/generator/headlines/headlines_for_bert.json') as f:
  headlines_json = json.load(f)

HEADLINES = []

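# Each top-level key (a keyword) maps to a list of headlines
for keyword in headlines_json:
  for h in headlines_json[keyword]:
    # Keep only the text before the last '-', which drops the trailing website name
    h = h.rsplit('-', 1)
    HEADLINES.append(h[0])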
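
The `rsplit('-', 1)` call above cuts at the last hyphen anywhere in the headline, so a headline that has no website suffix but contains a hyphenated word would be truncated. A minimal sketch of a stricter variant follows, assuming the website name is always appended as ` - Site Name`. `strip_source` is a hypothetical helper, not part of the notebook.

```python
def strip_source(headline, sep=" - "):
    """Drop a trailing ' - Site Name' suffix, if one is present."""
    # rpartition returns ('', '', headline) when the separator is absent,
    # so `found` is empty and the headline is kept whole.
    head, found, _ = headline.rpartition(sep)
    return head.strip() if found else headline.strip()


print(strip_source("Barr says DOJ has not seen evidence of fraud - CNN"))
print(strip_source("COVID-19 cases rise"))
```

Splitting on the spaced separator also removes the trailing space that the plain `rsplit('-', 1)` leaves at the end of each headline.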

### Step 2: Download Grover code and install requirements

In [3]:
%cd /content
!git clone https://github.com/rowanz/grover.git
%cd /content/grover
!python3 -m pip install regex jsonlines twitter-text-python feedparser

/content
Cloning into 'grover'...
remote: Enumerating objects: 104, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 104 (delta 0), reused 0 (delta 0), pack-reused 101
Receiving objects: 100% (104/104), 675.14 KiB | 21.78 MiB/s, done.
Resolving deltas: 100% (40/40), done.
/content/grover
Collecting jsonlines
  Downloading https://files.pythonhosted.org/packages/d4/58/06f430ff7607a2929f80f07bfd820acbc508a4e977542fefcc522cde9dff/jsonlines-2.0.0-py3-none-any.whl
Collecting twitter-text-python
  Downloading https://files.pythonhosted.org/packages/29/a9/3d9cc947dea07e42f55a3c9de741ceeea766f841bc08297605a6370dfca0/twitter-text-python-1.1.1.tar.gz
Collecting feedparser
  Downloading https://files.pythonhosted.org/packages/1c/21/faf1bac028662cc8adb2b5ef7a6f3999a765baa2835331df365289b0ca56/feedparser-6.0.2-py3-none-any.whl (80kB)
     |████████████████████████████████| 81kB 11.3MB/s 
Collecting sgmllib3k
  

### Step 3: Download Grover Pre-Trained 'Mega' Model

In [4]:
import os
import requests

model_type = "mega"

model_dir = os.path.join('/content/grover/models', model_type)
if not os.path.exists(model_dir):
    os.makedirs(model_dir)

for ext in ['data-00000-of-00001', 'index', 'meta']:
    # Stream each checkpoint file so it is never held fully in memory
    r = requests.get(f'https://storage.googleapis.com/grover-models/{model_type}/model.ckpt.{ext}', stream=True)
    with open(os.path.join(model_dir, f'model.ckpt.{ext}'), 'wb') as f:
        file_size = int(r.headers["content-length"])
        # A tiny response is most likely an error page rather than a checkpoint file
        if file_size < 1000:
            raise ValueError(f"model.ckpt.{ext} is only {file_size} bytes; does the file exist?")
        chunk_size = 1000
        for chunk in r.iter_content(chunk_size=chunk_size):
            f.write(chunk)
    print(f"Just downloaded {model_type}/model.ckpt.{ext}!", flush=True)

Just downloaded mega/model.ckpt.data-00000-of-00001!
Just downloaded mega/model.ckpt.index!
Just downloaded mega/model.ckpt.meta!
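
Since the checkpoint is large and Colab sessions are often restarted, it can help to skip files that are already on disk. The sketch below is one possible way to do that. It swaps `requests` for the standard library's `urllib.request` to stay dependency-free, and `checkpoint_files` and `download_missing` are hypothetical helper names, not part of Grover.

```python
import os
import shutil
import urllib.request

BASE_URL = "https://storage.googleapis.com/grover-models"  # same bucket as the cell above
EXTENSIONS = ["data-00000-of-00001", "index", "meta"]


def checkpoint_files(model_type, model_dir):
    """Return (url, local_path) pairs for the three checkpoint files."""
    return [
        (f"{BASE_URL}/{model_type}/model.ckpt.{ext}",
         os.path.join(model_dir, f"model.ckpt.{ext}"))
        for ext in EXTENSIONS
    ]


def download_missing(model_type, model_dir):
    """Download only the checkpoint files that are not already on disk."""
    os.makedirs(model_dir, exist_ok=True)
    for url, path in checkpoint_files(model_type, model_dir):
        if os.path.exists(path):
            continue  # already downloaded in an earlier session
        # Write to a temporary name so an interrupted download
        # is not mistaken for a complete file on the next run.
        tmp_path = path + ".part"
        with urllib.request.urlopen(url) as response, open(tmp_path, "wb") as out:
            shutil.copyfileobj(response, out, length=1024 * 1024)  # 1 MiB chunks
        os.replace(tmp_path, path)
```

It would be called as `download_missing("mega", "/content/grover/models/mega")`.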


### Step 4: Define functions to generate Fake News using real headlines with Grover

In [5]:
%tensorflow_version 1.x
import tensorflow as tf
import numpy as np
import sys
import feedparser
import time
from datetime import datetime, timedelta
import requests
import base64
from ttp import ttp

sys.path.append('../')
from lm.modeling import GroverConfig, sample
from sample.encoder import get_encoder, _tokenize_article_pieces, extract_generated_target
import random


def get_fake_articles(domain="www.nytimes.com"):
    """
    Create an article object for each headline in HEADLINES,
    suitable for feeding into Grover to generate the story body.
    The domain name is used to control the style of the text
    generated by Grover, e.g. bbc.co.uk would generate results in
    British English while nytimes.com would generate US English.
    """
    articles = []
    
    headlines_to_inject = HEADLINES

    for fake_headline in headlines_to_inject:
        days_ago = random.randint(1, 7)
        pub_datetime = datetime.now() - timedelta(days=days_ago)

        publish_date = pub_datetime.strftime('%m-%d-%Y')
        iso_date = pub_datetime.isoformat()

        articles.append({
            'summary': "",
            'title': fake_headline,
            'domain': domain,
            'text': '',
            'authors': ["Staff Writer"],
            'publish_date': publish_date,
            'iso_date': iso_date,
        })

    return articles


def generate_article_attribute(sess, encoder, tokens, probs, article, target='article'):

    """
    Given attributes about an article (title, author, etc), use that context to generate
    a replacement for one of those attributes using the Grover model.

    This function is based on the Grover examples distributed with the Grover code.
    """

    # Tokenize the raw article text
    article_pieces = _tokenize_article_pieces(encoder, article)

    # Grab the article elements the model cares about - domain, date, title, etc.
    context_formatted = []
    for key in ['domain', 'date', 'authors', 'title', 'article']:
        if key != target:
            context_formatted.extend(article_pieces.pop(key, []))

    # Start formatting the tokens in the way the model expects them, starting with
    # which article attribute we want to generate.
    context_formatted.append(encoder.__dict__['begin_{}'.format(target)])
    # Tell the model which special tokens (such as the end token) aren't part of the text
    ignore_ids_np = np.array(encoder.special_tokens_onehot)
    ignore_ids_np[encoder.__dict__['end_{}'.format(target)]] = 0

    # We are only going to generate one article attribute with a fixed
    # top_ps cut-off of 95%. This simple example isn't processing in batches.
    gens = []
    article['top_ps'] = [0.95]

    # Run the input through the TensorFlow model and grab the generated output
    tokens_out, probs_out = sess.run(
        [tokens, probs],
        feed_dict={
            # Feed real values into the input placeholders. These are
            # globals created in the Step 5 cell before this function is called.
            initial_context: [context_formatted],
            eos_token: encoder.__dict__['end_{}'.format(target)],
            ignore_ids: ignore_ids_np,
            p_for_topp: np.array([0.95]),
        }
    )

    # The model is done! Grab the results it generated and format the results into normal text.
    for t_i, p_i in zip(tokens_out, probs_out):
        extraction = extract_generated_target(output_tokens=t_i, encoder=encoder, target=target)
        gens.append(extraction['extraction'])

    # Return the generated text.
    return gens[-1]


TensorFlow 1.x selected.
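
To make the context-building step in `generate_article_attribute` concrete, here is a toy sketch that uses plain lists instead of Grover's encoder. The token ids and the `begin_tokens` mapping are made up for illustration, and `build_context` is a hypothetical name, so this only mirrors the loop above rather than reproducing Grover's tokenization.

```python
FIELD_ORDER = ["domain", "date", "authors", "title", "article"]


def build_context(pieces, target, begin_tokens):
    """Concatenate every field except `target`, then append the target's begin token."""
    context = []
    for key in FIELD_ORDER:
        if key != target:
            context.extend(pieces.get(key, []))
    # The trailing begin token tells the model which field to generate next
    context.append(begin_tokens[target])
    return context


# Toy token ids, made up for illustration only
pieces = {"domain": [1, 10, 2], "title": [3, 20, 21, 4], "article": [5, 30, 31, 6]}
begin_tokens = {"title": 3, "article": 5}

print(build_context(pieces, "article", begin_tokens))  # [1, 10, 2, 3, 20, 21, 4, 5]
print(build_context(pieces, "title", begin_tokens))    # [1, 10, 2, 5, 30, 31, 6, 3]
```

With `target="article"` the title is part of the context, and with `target="title"` the article body is. This is why a title can be regenerated from a body that was itself generated from the original headline.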



### Step 5: Generate the news

In [6]:
# Build one article stub per headline

articles = get_fake_articles()

# Randomize the order the articles are generated
random.shuffle(articles)

# Load the pre-trained Grover "Mega" model with 1.5 billion params
model_config_fn = '/content/grover/lm/configs/mega.json'
model_ckpt = '/content/grover/models/mega/model.ckpt'
encoder = get_encoder()
news_config = GroverConfig.from_json_file(model_config_fn)

generated_articles = []

# Set up TensorFlow session to make predictions
tf_config = tf.ConfigProto(allow_soft_placement=True)

with tf.Session(config=tf_config, graph=tf.Graph()) as sess:
    # Create the TensorFlow placeholders used to feed data to the Grover model
    # when making new predictions.
    initial_context = tf.placeholder(tf.int32, [1, None])
    p_for_topp = tf.placeholder(tf.float32, [1])
    eos_token = tf.placeholder(tf.int32, [])
    ignore_ids = tf.placeholder(tf.bool, [news_config.vocab_size])

    # Load the model config to get it set up to match the pre-trained model weights
    tokens, probs = sample(
        news_config=news_config,
        initial_context=initial_context,
        eos_token=eos_token,
        ignore_ids=ignore_ids,
        p_for_topp=p_for_topp,
        do_topk=False
    )

    # Restore the pre-trained Grover 'Mega' model weights
    saver = tf.train.Saver()
    saver.restore(sess, model_ckpt)

    # Generate an article for each headline loaded in Step 1
    for article in articles:
        print(f"Building article from headline '{article['title']}'")

        # Generate the article body from the real headline and the other metadata
        article['text'] = generate_article_attribute(sess, encoder, tokens, probs, article, target="article")

        # Now, generate a fake headline that better fits the generated article body
        # This replaces the real headline so none of the original article content remains
        article['title'] = generate_article_attribute(sess, encoder, tokens, probs, article, target="title")

        print(f" - Generated fake article titled '{article['title']}'")

        generated_articles.append(article)








Instructions for updating:
Use keras.layers.Dense instead.
Instructions for updating:
Please use `layer.__call__` method instead.

Instructions for updating:
`tf.batch_gather` is deprecated, please use `tf.gather` with `batch_dims=-1` instead.
INFO:tensorflow:Restoring parameters from /content/grover/models/mega/model.ckpt
Building article from headline 'Statement by Press Secretary Jen Psaki on the President's Travel to the United Kingdom and Belgium '
 - Generated fake article titled 'Full Statement by White House Spokesman Stephanie Grisham on NATO Summit'
Building article from headline 'Barr says DOJ has not seen evidence of fraud that would change election results '
 - Generated fake article titled 'Giuliani: DOJ has found no evidence of election fraud'
Building article from headline 'Scoop: Trump eyes digital media empire to take on Fox News '
 - Generated fake article titled 'Trump Administration Preparing to Launch Conservative Web Venture, ‘TrumpCo’'
Building article fro
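
The `p_for_topp` value of 0.95 fed to `sample` is a nucleus (top-p) cutoff: at each step only the most likely tokens whose probabilities add up to the cutoff remain eligible. Grover implements this in TensorFlow. The pure-Python sketch below only illustrates the idea on a made-up distribution, and `top_p_filter` and `sample_top_p` are hypothetical names.

```python
import random


def top_p_filter(probs, p=0.95):
    """Keep the smallest set of most likely tokens whose total probability reaches p."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, total = [], 0.0
    for token_id, prob in ranked:
        kept.append((token_id, prob))
        total += prob
        if total >= p:
            break
    # Renormalize so the kept probabilities sum to 1
    return {token_id: prob / total for token_id, prob in kept}


def sample_top_p(probs, p=0.95, rng=random):
    """Draw one token id from the truncated, renormalized distribution."""
    nucleus = top_p_filter(probs, p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]


toy_probs = [0.5, 0.3, 0.15, 0.04, 0.01]  # made-up distribution over 5 tokens
print(sorted(top_p_filter(toy_probs, p=0.9)))  # [0, 1, 2]
```

A lower cutoff keeps fewer candidate tokens and makes the text more predictable. A cutoff close to 1 keeps nearly the whole vocabulary in play.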

### Step 6: Save the generated articles to CSV to train our models with them



In [7]:
import pandas as pd

df = pd.DataFrame(generated_articles)
df.to_csv('../../data/generator/generated/headlines_for_bert.csv')
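
Because `to_csv` is called without `index=False`, the file starts with an unnamed index column. A minimal sketch of reading the file back for training follows, using only the standard library's `csv` module. The `label` field and its `"fake"` value are assumptions made for illustration, and `load_generated` is a hypothetical helper.

```python
import csv
import io


def load_generated(csv_file, label="fake"):
    """Read generated articles and attach a class label to each row."""
    rows = []
    for row in csv.DictReader(csv_file):
        row.pop("", None)  # drop the unnamed index column written by to_csv
        row["label"] = label
        rows.append(row)
    return rows


# Tiny in-memory stand-in for the real CSV file
sample_csv = io.StringIO(',title,text\n0,"Some headline","Some body"\n')
rows = load_generated(sample_csv)
print(rows[0]["title"], rows[0]["label"])
```

For the real file, `csv_file` would be the object returned by `open(path, newline='')`.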