In [1]:
import os
os.chdir('..')

# *Final Project for Gened 1023: Ignorance, Lies, Hogwash, and Humbug*
#### *Harvard College*

--------------------

# NewsBot

## Using AI to stop the spread of fake news on Twitter

![alt text](img/flowchart.png)

## Table of Contents:

#### I. [Introduction](#intro)
 1. Problem
 2. Intervention

#### II. [Content](#content)
 1. [The Tweets](#tweets)
 2. [The Model](#model)
 3. [The Bot](#bot)

#### III. [Limitations and Improvements](#limitations)
 1. Detecting which tweets contain links to news articles
 2. Deciding whose tweets to monitor
 3. Improving the NewsBot model
 4. Promoting the bot's visibility

#### IV. [Concluding Remarks](#conclusion)

<a id='intro'></a>

## Introduction

### Problem

In recent years, social media platforms like Facebook, Instagram, and Twitter have come under fire for enabling disinformation campaigns that have led us to the “post-truth era.” Some platforms have chosen to keep minimal restrictions on what can be posted on their websites in the name of free speech (e.g. Facebook), while others have chosen to do away with political ads entirely (e.g. Twitter). Yet, even in platforms that fall into this second category, disinformation remains a rampant problem. Whether maliciously or ignorantly, well-known influencers on Twitter frequently quote and spread false information, inflating the credibility of unreliable news organizations and bad actors on the platform. This often works to discredit trustworthy, essential institutions.

Voters rely on news and media to make decisions and hold their representatives accountable. When they lose trust in established sources of media to provide truthful information, they can make uninformed (or worse, misinformed) decisions. As we know from the 2016 election, these decisions have real consequences for our democracy. Knowing that both foreign and domestic agents are learning to leverage social media for their political gain, and given that more and more people are obtaining their news from social media, this is a growing concern. If we care at all about having a healthy democracy, then it’s clear that we must combat the spread of lies and hogwash on social media.


### Intervention

Given the sheer amount of content that gets posted on any major social media platform, no group of humans could possibly police the site for false information. Luckily, machines can now read and analyze huge amounts of information in minimal time, and AI is starting to learn how to sort fake news from real news. With a small fleet of virtual robots, it seems quite possible to scan entire sites for misinformation.

Of course, major platforms like Twitter are reluctant to call out popular users who knowingly spread false information. However, Twitter provides a [fully-functional API](https://developer.twitter.com/en/docs) through which it is possible to read and reply to any Tweet on their site. With this API, I built a bot that monitors the accounts of some of Twitter’s most popular political figures and checks the credibility of any news that they share. Given a list of users, the bot searches their most recent Tweets for links to news articles. When the bot finds a Tweet that links to an article, it computes various reliability scores for the article (using the model described [here](https://towardsdatascience.com/full-pipeline-project-python-ai-for-detecting-fake-news-with-nlp-bbb1eec4936d)) and replies to the Tweet with those scores. Users who happen to notice the reply will gain insight into how much they should trust the article. Over time, we hope the robot will cultivate a following large enough to boost its replies so that more users notice them. While this approach does not prevent users from posting misleading content, perhaps it can educate their followers on the reliability of the information they share.

This notebook will detail how the bot works, warn you about some of its shortcomings, and explain how they can be improved.

<a id='content'></a>

## Content

If you intend to use this bot (or any other Twitter bot), you must first [apply for a Twitter Developer account](https://developer.twitter.com/en/application/in-review). Once you've done that, head to the Developer Dashboard, create an App, and set the app's permissions to read and write. You should be able to find your consumer API keys and access tokens under the "Keys and tokens" tab of your new app. Copy them, and save them in separate text files in a folder called `newsbot_api_credentials`. The folder should be kept in the directory that contains this repository (i.e. outside of this repository). Your directory should look like this:
```
newsbot/
    docs/
        demo.ipynb
    NewsBot/
        TweetManager/
            logs/
                tweets.csv
            api_manager.py
            news_sites.txt
            read_tweets.py
        Bot.py
        PublicModels.py
newsbot_api_credentials/
    access_token.txt
    access_token_secret.txt
    consumer_api_key.txt
    consumer_api_key_secret.txt
```

These keys must be kept secret, which is why I have not included them in this repository.
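
As a rough illustration of how the four credential files might be read back in, here is a minimal sketch. It is not the code in `api_manager.py`; the function name `load_credentials` is hypothetical, and it assumes each file holds a single key and that the file names match the directory listing above.

```python
from pathlib import Path

# Assumption: the four file names match the directory listing above.
CREDENTIAL_FILES = (
    "consumer_api_key",
    "consumer_api_key_secret",
    "access_token",
    "access_token_secret",
)


def load_credentials(folder):
    """Read each credential file and return a dict keyed by file name (without .txt)."""
    folder = Path(folder)
    credentials = {}
    for name in CREDENTIAL_FILES:
        # strip() drops the trailing newline that most editors add on save
        credentials[name] = (folder / f"{name}.txt").read_text().strip()
    return credentials
```

If the working directory is the repository root, the folder would probably be reached as `../newsbot_api_credentials`. The resulting values can then be handed to whichever authentication class your Twitter library provides.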

In order to access Twitter's API, I will use a popular Python library called [Tweepy](https://www.tweepy.org/). Tweepy is great for handling logins, fetching information from Twitter, and doing everything else that we'd want to do with the API. If you have not installed the library, just run `pip install tweepy` from the terminal. Then, you'll be able to import it.

In [2]:
import tweepy

<a id='tweets'></a>

### The Tweets

This section will focus on how we retrieve and process information from Twitter. All the code for this section can be found in `NewsBot/TweetManager/read_tweets.py`.

In [3]:
from NewsBot.TweetManager import read_tweets

At a high level, this component of our bot will accomplish 4 things:

1. Find relevant, recent tweets from popular accounts
2. Check that the bot has not already replied to each tweet
3. Pick out the ones that contain a link to a news article
4. Save each remaining URL and remember which tweet it came from

Starting with number 1: we need to organize a list of popular accounts that frequently tweet about politics and often share political news. While it might be possible to automate this, the current implementation of this bot requires that we develop a list ourselves. We choose to do so by following all the relevant accounts we come across while manually operating the bot's account. In some ways, this strategy works to our advantage; because we engage with political content directly, Twitter's algorithm becomes more likely to show us political content, making our search for relevant accounts more efficient. For the purposes of this demonstration, I've only followed a handful of accounts.

We can retrieve the list of accounts we follow using the function `get_my_following` in `read_tweets.py`:

In [4]:
followed_accounts = read_tweets.get_my_following()
print(followed_accounts)

['SenatorCollins', 'seanhannity', 'TomiLahren', 'JoeBiden', 'JeffFlake', 'realDonaldTrump', 'gtconway3d']


Now, we need to get each tweet from each of these accounts in the last 7 days. We do so in a function called `get_user_tweets`. This function returns a list of tweets, which are encoded as `tweepy.models.Status` objects. These objects contain a lot of information; if you're interested in what they look like, uncomment the last line in the following cell.
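
The "last 7 days" cutoff can be illustrated with a small, self-contained sketch. This is not the body of `get_user_tweets`; the helper `filter_recent` is hypothetical, and the stand-in objects only mimic the `created_at` attribute that tweet objects carry. It also assumes timezone-aware timestamps; depending on the Tweepy version, `created_at` may be timezone-naive, in which case the cutoff would need to be naive as well.

```python
from datetime import datetime, timedelta, timezone
from types import SimpleNamespace


def filter_recent(tweets, max_n_days, now=None):
    """Keep only tweets whose created_at falls within the last max_n_days days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_n_days)
    return [t for t in tweets if t.created_at >= cutoff]


# Stand-in objects for illustration; real tweets would come from the API.
now = datetime(2020, 5, 8, tzinfo=timezone.utc)
tweets = [
    SimpleNamespace(id=1, created_at=now - timedelta(days=2)),
    SimpleNamespace(id=2, created_at=now - timedelta(days=10)),
]
recent = filter_recent(tweets, max_n_days=7, now=now)
print([t.id for t in recent])  # [1]
```

Passing `now` explicitly keeps the example reproducible; in practice it would default to the current time.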

In [5]:
tweet_list = []
for account in followed_accounts:
    tweet_list += read_tweets.get_user_tweets(screen_name=account, max_n_days=7)
# vars(tweet_list[0])

This function automatically filters out tweets that our robot has already replied to. However, it doesn't do so by using the API to read through all the replies - that would take far too long, and we would risk hitting Twitter's request rate cap. Instead, we store each of the bot's tweets in a file called `tweets.csv`, found in the `NewsBot/TweetManager/logs` directory. For each tweet that we want to reply to, we just check the file to see if we've already replied to it. This is a primitive solution, but at the scale that we plan to use this robot, it works just fine.

![alt text](img/tweets_log.png)
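
For readers curious how such a file-based check might look, here is a minimal sketch of a reply log. It is not the repository's implementation: the function names are hypothetical, and the single `tweet_id` column is an assumption rather than the actual layout of `tweets.csv`.

```python
import csv
import os


def load_replied_ids(log_path):
    """Return the set of tweet IDs already recorded in the log file."""
    if not os.path.exists(log_path):
        return set()
    with open(log_path, newline="") as f:
        # Assumption: the log has a header row with a "tweet_id" column.
        return {row["tweet_id"] for row in csv.DictReader(f)}


def record_reply(log_path, tweet_id):
    """Append one tweet ID to the log, writing the header if the file is new."""
    is_new = not os.path.exists(log_path)
    with open(log_path, "a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["tweet_id"])
        writer.writerow([str(tweet_id)])
```

Loading the IDs into a set once per run makes each "have we replied already?" check a constant-time lookup, and no API requests are spent on it.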

As per step 3, we must filter this list down to the tweets that contain a link to a news article. Though it might be best to take a model-based approach to finding which links point to political news, we take a simpler approach for ease of implementation. In a file called `news_sites.txt`, we write our own list of known news sites that are often shared on Twitter. For each candidate tweet, a function called `is_news` checks whether its link includes the base URL of any of these sites. If it doesn't, we remove the tweet from the list. The remaining tweets are the ones that we will send to the model and eventually reply to.
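
To make the idea concrete, here is a minimal sketch of a domain check. It is not the actual `is_news` function: the name `is_news_url` is hypothetical, it operates on a URL string rather than a tweet object, and it compares hostnames instead of searching for a substring. The format of `news_sites.txt` is also assumed to be one bare domain per entry.

```python
from urllib.parse import urlparse


def is_news_url(url, news_sites):
    """Return True if the URL's host is a listed site or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == site or host.endswith("." + site) for site in news_sites)


# Hypothetical entries; the real list lives in news_sites.txt.
news_sites = ["nytimes.com", "foxnews.com"]
print(is_news_url("https://www.nytimes.com/2020/05/04/nyregion/story.html", news_sites))  # True
print(is_news_url("https://bit.ly/abc123", news_sites))  # False
```

Comparing hostnames avoids accidental matches when a news domain merely appears somewhere in the path or query string of an unrelated link.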

In [6]:
# Filter out tweets that don't contain news
tweet_list = list(filter(read_tweets.is_news, tweet_list))
print("Number of tweets that contain a link to a news article: ", len(tweet_list))

Number of tweets that contain a link to a news article:  12


<a id='model'></a>

### The Model

Now that we have a list of tweets that each contain a link to a news article, it is time to figure out which articles are trustworthy and which ones are not. Given the time constraints for this project, I will be using [this publicly available fake news detector](https://www.unslanted.net/newsbot/) rather than building my own. We'll assume that it works reasonably well for now, but I will clarify some of its limitations later in this notebook.

Luckily, the tool is very easy to use. You just input the URL of a news article on the first page (left), the model reads the article, and the website redirects you to a page that presents its results (right). As you can see, the underlying model spits out four probabilities, indicating four possible levels of trustworthiness: "fake," "dodgy," "mostly true," and "true". An example is shown below, using [this article](https://www.nytimes.com/2020/05/04/nyregion/coronavirus-ny-hospital-workers.html?action=click&module=Top%20Stories&pgtype=Homepage) from the New York Times.

![alt text](img/unslanted_model.png)

To access the fake news detector, I use a well-known package called `selenium`. Selenium makes it easy to automate web browser interaction in Python. Using Selenium, we can enter the URL of each of our news articles in the textbox shown above, and then scrape the results from the following page. All of this is wrapped in the Python class `FakeNewsDetector`, located in `NewsBot/PublicModels.py`. Here is how you would use it:

*Note: To prevent a new window from popping up every time you make a prediction, I use a webdriver called PhantomJS. If you don't have the driver, you might need to install it. You can download it [here](https://phantomjs.org/download.html).*

In [7]:
from NewsBot.PublicModels import FakeNewsDetector

# Instantiate the model
model = FakeNewsDetector()

# Get the reliability scores for each article (disregard the warning)
predictions = model.predict_proba(tweet_list)



In [14]:
print("Predictions for " + model.extract_url(tweet_list[0]))
labels = ['Probability of fake', 'Probability of dodgy', 'Probability of mostly true', 'Probability of true']
for i in range(len(labels)):
    print(labels[i] + ': ' + str(predictions[0][i]))

Predictions for https://www.foxnews.com/politics/media-that-rushed-to-report-kavanaugh-allegations-are-now-less-interested-in-biden-sexual-assault-claim.amp
Probability of fake: 0.01497309810171502
Probability of dodgy: 0.04901016770438447
Probability of mostly true: 0.8522951882764285
Probability of true: 0.08372154591747195
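
Four raw probabilities like these are hard to read at a glance, so a reply would likely present them in a condensed form. The sketch below shows one possible way to do that. The wording and the helper `format_reply` are hypothetical; the text that the bot actually posts may differ.

```python
# Label order follows the model's four output classes as described above.
LABELS = ["fake", "dodgy", "mostly true", "true"]


def format_reply(probs):
    """Build a short message naming the top label and listing all four as percentages."""
    if len(probs) != len(LABELS):
        raise ValueError("expected one probability per label")
    lines = [f"{label}: {p:.0%}" for label, p in zip(LABELS, probs)]
    # Index of the largest probability picks the most likely label.
    top = LABELS[max(range(len(probs)), key=lambda i: probs[i])]
    return f"Most likely rating: {top}\n" + "\n".join(lines)


print(format_reply([0.01, 0.05, 0.85, 0.09]))
```

Rounding to whole percentages keeps the message short, which matters given Twitter's character limit.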


<a id='bot'></a>

### The Bot

While "The Bot" might be the star of the show, it's actually the simplest component of the project. In short, it just ties everything together; it keeps track of which accounts the bot follows, which of those accounts' tweets contain links to news articles, the reliability scores of each of those articles, and all tweets that the bot has already replied to. It can do all of this using two basic functions:

- ```python
Bot.update(self, max_n_days: int = None, include_replies: bool = False)
```
  - This function retrieves all the tweets that contain news and computes the reliability scores for each news article. The argument `max_n_days` controls how far back the bot will look for tweets (so if `max_n_days=4`, it will only look for tweets from the past 4 days). If left as `None`, the bot will simply check when it last replied to a tweet, and look for all new tweets since then. If `include_replies` is set to `True`, it will also search each account's replies for links to news articles. By default, this argument is set to `False`.
  
- ```python
Bot.post(self, clear_memory: bool = True)
```
  - This function replies to each tweet that was found to contain news, posting the corresponding reliability scores, and saves the tweet to a log file called `tweets.csv`. If `clear_memory` is set to `True`, the bot will forget which tweets it was supposed to reply to. This is set to `True` by default in order to prevent replying multiple times.
  
In case you want to view a tweet that the bot wants to reply to before it posts anything, you can also use the static method `view_tweet`, which will open your web browser to the tweet.

Here is an example:

In [2]:
from NewsBot.Bot import Bot

# Instantiate the bot
bot = Bot()

# Get all the tweets to reply to
bot.update(max_n_days=7, include_replies=False)



In [8]:
# Show that the update worked:
labels = ['Probability of fake', 'Probability of dodgy', 'Probability of mostly true', 'Probability of true']
print("Number of tweets to reply to: ", str(len(bot.tweets_to_reply_to)))
# Note: indexing starts at 0, so index 1 selects the second tweet in the list
print("Predictions for the first tweet to reply to: " + bot.tweets_to_reply_to[1].entities['urls'][0]['expanded_url'])
for i in range(len(labels)):
    print(labels[i] + ': ' + str(bot.tweet_probs[1][i]))

Number of tweets to reply to:  13
Predictions for the first tweet to reply to: https://nypost.com/2020/04/20/no-ny-times-fox-news-didnt-kill-joe-joyce/
Probability of fake: 0.16057720913610332
Probability of dodgy: 0.6480580599724753
Probability of mostly true: 0.1668864133719211
Probability of true: 0.024478317519500425


In [9]:
# View all tweets to reply to (WARNING: this will open many tabs in your web browser)
for tweet in bot.tweets_to_reply_to:
    bot.view_tweet(tweet)

In [None]:
# Reply to tweets
bot.post(clear_memory = True)

Once you run `Bot.post`, you should be able to see the bot's replies on Twitter:

![](img/reply0.png)

And that's everything you need to know to use this Twitter bot! The rest of the notebook will outline some of the bot's issues and offer some commentary on the efficacy of this intervention.

<a id='limitations'></a>

## Limitations and Improvements

#### 1. Detecting which tweets contain links to news articles

In its current state, the bot only flags a tweet when it contains a URL from one of the news sites listed in `news_sites.txt`. On one hand, this keeps our false-positive rate extremely low. On the other hand, there are many more news-related websites on the internet than are listed in that file. Plus, Twitter users often use shortened URLs that contain no reference to the website they redirect to. Our approach completely overlooks these links.

A future approach could use the URL and the linked webpage together to calculate the probability that any tweet shares a politically relevant news article. We would then flag any tweet with a probability that surpasses a certain threshold. Of course, this model would be difficult to develop, but doing so would enable the bot to flag tweets with links to obscure websites or blogs, which are often sources of fake news.

#### 2. Deciding whose tweets to monitor

Currently, the bot only reads the tweets of accounts it follows, and we followed those accounts by hand. We did this primarily for ease of implementation given this project's time constraints, but there might be better solutions. One is to save a list of accounts to monitor privately, rather than publicly following them. By showing the world who the bot is targeting, we might be opening the bot up to accusations of bias. However, if we really think that the bot is impartial, then the transparency of organizing our list of accounts by following them might be preferable.

The main problem with the current implementation is that we must write this list of accounts by hand. Ideally, the bot would find the platform's most influential figures in politics itself, using a variety of engagement metrics over time. However, the Twitter API might have certain limitations that make this difficult or impossible. More research is needed to know if this is a viable improvement.

#### 3. Improving the NewsBot model

Again, considering the time constraints for this project, it was not feasible to create our own model to determine the reliability of each news article. Instead, we used [this publicly available model](https://www.unslanted.net/newsbot/), which is described in detail [here](https://towardsdatascience.com/full-pipeline-project-python-ai-for-detecting-fake-news-with-nlp-bbb1eec4936d). If we only consider four possible classes of truthfulness (fake, dodgy, mostly true, true), then this model performs pretty well; an empirical model would put each news article in its correct category 35% of the time, while this model does so about 55% of the time. However, if this bot is going to call out journalists for writing fake news, an accuracy of 55% will not be good enough. Using this bot responsibly will require us to write a better model, which is possible with more data and more advanced modeling techniques.

#### 4. Promoting the bot's visibility

As mentioned above, this bot can only have an impact if users can read its replies. If no one engages with the bot's replies, then the bot is useless. Thus, it is incredibly important to develop a strategy for getting this bot noticed on Twitter. 

<a id='conclusion'></a>

## Concluding Remarks

In the short term, we hope that this Twitter bot will alert everyday Twitter users when they come across fake news, and that it will slowly improve the media diet of people who are persuaded by the bot. Of course, with few tweets and few followers, most Twitter users will never notice the bot, nor take it seriously. To meet this short-term goal, we need to boost the bot’s credibility and visibility by cultivating its following. At first, we might do this artificially by buying followers or creating a fleet of other bots that support each other’s content. Once humans start to notice the bot and its network begins to expand, we might stop producing artificial followers. This is only one strategy to make the bot more visible; social media experts might have more effective ones.

If the bot achieves these short-term results, we hope that developers at Twitter and other social media companies will learn that combating false information and enforcing their rules and policies for all users is fruitful in the long run. Armchair developers like me are limited by time, computational resources, and the Twitter API itself. Practically speaking, social media giants like Twitter have no such limits when it comes to how they choose to patrol their sites. To make real change in the long run, leaders in social media will need to be convinced that policing at this scale is both realistic and beneficial. By showing that this bot can be successful, we hope to inspire them to take more serious measures toward solving this problem.