# Identifying Anti-Refugee Tweets

In this notebook we'll be:
*   Exploring the Anti-Refugee Tweets Dataset
*   Developing Rule-Based Classifiers for Tweet Sentiment Detection



## Background


**Sentiment Analysis**<br>
Sentiment analysis is the process of using a computer to identify and categorize the opinions expressed in a piece of text, in order to determine whether the writer's attitude towards a given topic is positive, negative, or neutral. It can also reveal the writer's emotional state and the intended effect of their words.

**Why conduct sentiment analysis?**<br>
The answer depends on where the tool is applied! In business, it can be used to predict the sentiment of the consumers in a market, thereby aiding the growth of the company. In politics, the sentiments of the voters can be used to determine the most appropriate strategy. By listening to and analysing comments on Facebook and Twitter, local government departments can gauge public sentiment and use the results to improve services they provide to the public. Universities can use sentiment analysis to analyze student feedback and improve their curriculum. These are a few of the many uses of sentiment analysis. 

**What is Anti-Refugee Tweet Classification?**<br>
Anti-refugee tweet classification, the topic we will be covering over the next few days, means classifying a given tweet as pro-refugee or anti-refugee. For example:

> *anti-refugee tweet*: 'muslim refugee charged with beating a woman'<br>
> *pro-refugee tweet*: 'refugee hotspots in italy and greece not yet adequate'

As you can guess from the above example, an anti-refugee tweet tends to contain negative words and to convey negative sentiment towards refugees, portraying them in a negative light, while the converse is true for pro-refugee tweets. **Understanding anti-refugee sentiment is the first step in addressing it.** This project will allow us to use AI models to do so.

# Milestone 1: Exploring our data

In [None]:
#@title Run this to import all the necessary packages. This will take a few minutes! { display-mode: "form" }

import tweepy
from sklearn.metrics import accuracy_score
from datetime import datetime, timedelta
import re
import numpy as np
import random
import json
import math
from collections import Counter
import matplotlib.pyplot as plt
import os
import sys
import pandas

import nltk
nltk.download('punkt', quiet=True)
from nltk.tokenize import TweetTokenizer
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
nltk.download('stopwords', quiet=True)

import warnings
warnings.filterwarnings('ignore')

# from google.colab import drive
# drive.mount('/content/drive/')

import gdown
import zipfile
import shutil

# Download dataset, only try gdown if !wget doesn't work
# gdown.download('https://drive.google.com/uc?id=1ifYLZ-19ZyjjRUICe4PDRmZFAkyL73d0','./source_data.zip',True)
!wget -O source_data.zip 'https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/AI%20Scholars/Sessions%206%20-%2010%20(Projects)/Project%20-%20Anti-Refugee%20Tweets/Anti-Refugee%20Sentiment%20Analysis-20190614T171546Z-001%20(1).zip'

# Download data.json
!wget 'https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/AI%20Scholars/Sessions%206%20-%2010%20(Projects)/Project%20-%20Anti-Refugee%20Tweets/data.json'
my_zip = zipfile.ZipFile('./source_data.zip', mode = 'r')
my_zip.extractall()

# Download library for this notebook
!wget 'https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/AI%20Scholars/Sessions%206%20-%2010%20(Projects)/Project%20-%20Anti-Refugee%20Tweets/lib.py'
import lib
from lib import Tweet
from lib import Tweet_counts

from lib import *


### Understanding the structure of a tweet

Tweets are composed of:
* Hashtags: Keywords that start with the '#' symbol
* Mentions: Referencing another user/person with '@'
* Everything else: Anything that isn't a hashtag or mention!

We've made a convenient interface for processing our tweets, which we call the `Tweet` class. Let's try out the `Tweet` class!

In [None]:
my_tweet = Tweet('these are #tags and this is a @mention. hey #wait there\'s another @one here too','true') 
# takes in text and true or false - don't worry about true or false right now!

In [None]:
# check out the hashtags!
my_tweet.hashtags

['#tags', '#wait']

In [None]:
# check out the mentions!
my_tweet.mentions

['@mention', '@one']

In [None]:
# check out the tweet text!
my_tweet.tokenList

['these',
 'are',
 'tags',
 'and',
 'this',
 'is',
 'a',
 'hey',
 'wait',
 "there's",
 'another',
 'here',
 'too']
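
The `Tweet` class comes from the course's `lib.py`, whose implementation is not shown in this notebook. As a rough illustration of how a tweet can be split into the three parts listed above, here is a minimal sketch using regular expressions. The function name `split_tweet` and its exact rules are assumptions made for this example, not the library's actual code.

```python
import re

def split_tweet(text):
    """Hypothetical stand-in for the parsing done by the course's Tweet class."""
    hashtags = re.findall(r"#\w+", text)   # keywords that start with '#'
    mentions = re.findall(r"@\w+", text)   # user references that start with '@'
    # Drop mentions completely, but keep the word behind each '#'.
    cleaned = re.sub(r"@\w+", " ", text).replace("#", " ")
    # Everything else: lowercase words, digits, and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", cleaned.lower())
    return hashtags, mentions, tokens

hashtags, mentions, tokens = split_tweet(
    "these are #tags and this is a @mention. hey #wait there's another @one here too"
)
print(hashtags)
print(mentions)
print(tokens)
```

On this sample the sketch separates the same hashtags and mentions as the `Tweet` object above, though the real class may handle punctuation and edge cases differently.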

## Activity 1b. Examining our dataset

### Exercise (coding)


Let's take a look at our prebuilt dataset of tweets extracted from Twitter! It is stored in a file called `data.json`, which we downloaded in the setup cell.

In [None]:
# open the file, parse the JSON, and close the file automatically
with open('./data.json', 'r') as file:
  data = json.load(file)

Our `data` is a list of tweets that are classified as either TRUE (anti-refugee) or FALSE (pro-refugee). 

How many tweets do we have? 

In [None]:
len(data)

689

What does each data point look like? 

In [None]:
data[0]

{'classification': 'TRUE',
 'tweet': 'rt @_makada_ muslim refugee charged with beating ga woman with american flag skips court'}

Each data point is a dictionary (loaded from a JSON object) with two keys:

* classification: which is the category of the tweet 
* tweet: which is the tweet text

We can access each value in the dictionary through its key.

In [None]:
# use the 'classification' key to see the sentiment of a tweet
data[0]['classification']

'TRUE'

Let's split our tweets into two lists `pro` and `anti`, which contain pro-refugee and anti-refugee tweets respectively.

In [None]:
pro = []
anti = []

for tweet in data:
  if tweet['classification'] != 'TRUE':
    pro.append(tweet['tweet'])
  else:
    anti.append(tweet['tweet'])

How many tweets do we have of each class? 

In [None]:
# how many pro refugee tweets? 
len(pro)

350

In [None]:
# how many anti refugee tweets?
len(anti)

339

In [None]:
pro[0]

'rt @rangersfanjoe @timpimley @foxnews @cristinacorbin they were .  starbucks has hired 10000 vets since 2014 .  well before the refugee progr'

In [None]:
anti[0]

'rt @_makada_ muslim refugee charged with beating ga woman with american flag skips court'

# Milestone 2: Handmade classifiers

We'll now wrap our data in `Tweet` objects so we can access the hashtags and mentions easily.

The cell below builds the full data set (`tweet_data`) along with separate lists of pro-refugee (`pro_tweets`) and anti-refugee (`anti_tweets`) tweets.





In [None]:
#@title Formatting our data! {display-mode: 'form'}
tweet_data   = [Tweet(t['tweet'], t['classification']) for t in data]
pro_tweets     = [Tweet(t['tweet'], t['classification']) for t in data if t['classification'].lower()=='false']
anti_tweets    = [Tweet(t['tweet'], t['classification']) for t in data if t['classification'].lower()=='true']


In [None]:
print(pro_tweets[0])
print(pro_tweets[0].mentions)

rt they were starbucks has hired 10000 vets since 2014 well before the refugee progr
['@rangersfanjoe', '@timpimley', '@foxnews', '@cristinacorbin']


In [None]:
len(pro_tweets)

346
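
This count is lower than the 350 we got from the earlier split. The two splits use different tests: the earlier one sends every tweet whose label is not exactly `'TRUE'` to `pro`, while this one keeps only labels that equal `'false'` once lowercased. Tallying the raw label values would show which ones account for the gap. Here is a minimal sketch of that check on made-up records; the `sample` list is hypothetical, and you would pass `data` instead to inspect the real dataset.

```python
from collections import Counter

# Hypothetical records in the same shape as the entries of data.json.
sample = [
    {"classification": "TRUE", "tweet": "a"},
    {"classification": "FALSE", "tweet": "b"},
    {"classification": "False", "tweet": "c"},
    {"classification": "", "tweet": "d"},
]

# Count how often each raw label string occurs.
label_counts = Counter(record["classification"] for record in sample)
print(label_counts)

# Labels that are neither 'true' nor 'false' once lowercased.
unexpected = [label for label in label_counts if label.lower() not in ("true", "false")]
print(unexpected)
```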

## Activity 2a. Looking at pro vs. anti tweets

### Exercise (Discussion)

We reformatted our data into lists of `Tweet` objects. Let's now look at a few pro- and anti-refugee tweets, along with their original text, hashtags, and mentions.

In [None]:
# display the first 10 pro-refugee tweets!
for i in range(10):
  this_tweet = pro_tweets[i]
  print('---Original tweet text---')
  print(this_tweet.original_tweet_text)

  print('---Hashtags---')
  print(this_tweet.hashtags)

  print('---Mentions---')
  print(this_tweet.mentions)
  
  print('\n')

---Original tweet text---
rt @rangersfanjoe @timpimley @foxnews @cristinacorbin they were .  starbucks has hired 10000 vets since 2014 .  well before the refugee progr
---Hashtags---
[]
---Mentions---
['@rangersfanjoe', '@timpimley', '@foxnews', '@cristinacorbin']


---Original tweet text---
rt @refugeeinfobus another unaccompanied refugee child arrested in #calais tonight at a evening food distribution
---Hashtags---
['#calais']
---Mentions---
['@refugeeinfobus']


---Original tweet text---
rt @slade now there\'s an actual deadline .    please continue to shout raise awareness donate to lgbt refugee causes contact
---Hashtags---
[]
---Mentions---
['@slade']


---Original tweet text---
rt @gisellalomax helping refugees to thrive not just survive - a new approach on responding to refugee crises .
---Hashtags---
[]
---Mentions---
['@gisellalomax']


---Original tweet text---
rt @yemmadelrey repeat after me i need to watch this video every morning if my ungrateful ass ever think of skippi

In [None]:
# display the first 15 anti-refugee tweets!
for i in range(15):
  this_tweet = anti_tweets[i]
  print('---Original text---')
  print(this_tweet.original_tweet_text)

  print('---Hashtags---')
  print(this_tweet.hashtags)

  print('---Mentions---')
  print(this_tweet.mentions)
  
  print('\n')

---Original text---
rt @_makada_ muslim refugee charged with beating ga woman with american flag skips court
---Hashtags---
[]
---Mentions---
['@_makada_']


---Original text---
rt @_makada_ muslim refugee charged with beating ga woman with american flag skips court
---Hashtags---
[]
---Mentions---
['@_makada_']


---Original text---
rt @johnkstahlusa there\'s something wrong with this refugee nonsense .  real men stay and fight for their values and country .  #tcot
---Hashtags---
['#tcot']
---Mentions---
['@johnkstahlusa']


---Original text---
trouble is it\'s all dem-friendly spending  planned parenthood refugee resettlement continuing bribe to obamacar
---Hashtags---
[]
---Mentions---
[]


---Original text---
rt @amike4761 muslim refugees decline work say its against their religion to perform labor for americans .  deport them all ?
---Hashtags---
[]
---Mentions---
['@amike4761']


---Original text---
boycott and call out !  chobani yogurt founder pushing for more refugee labor  vi

**In your group, discuss:** 
Does a tweet always have a hashtag or mention? 


## Activity 2b. Handmade Rules for Classification

Rule-based classification uses rules defined by the user to sort tweets into the given categories. These rules are generally rigid, so a rule-based classifier cannot assign a probability to a tweet; it can only assign a category.

An example of a rule based classifier is:

> If the word 'potato' or 'spinach' occurs in a tweet, then classify the tweet as vegetable, otherwise classify it as a fruit!

Oftentimes, due to the rigidity and simplicity of the rule based classifier, the classification is faulty. Hence, do not expect a high accuracy from this classifier.
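
The vegetable rule above can be written in a few lines. This is only a sketch of that toy example; the names `VEGETABLE_WORDS` and `classify_food_tweet` are made up for illustration.

```python
# Words that trigger the "vegetable" rule in the toy example.
VEGETABLE_WORDS = {"potato", "spinach"}

def classify_food_tweet(text):
    """Return 'vegetable' if any rule word occurs in the text, otherwise 'fruit'."""
    words = text.lower().split()
    if any(word in VEGETABLE_WORDS for word in words):
        return "vegetable"
    return "fruit"

print(classify_food_tweet("roasted potato for dinner"))
print(classify_food_tweet("fresh mango smoothie"))
print(classify_food_tweet("my laptop is broken"))
```

The last call shows the rigidity problem: a tweet that has nothing to do with food still gets labeled as a fruit, because the rule has no way to say "neither".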

Before we begin making our rule based classifier, let us visualize the data. Visualization helps us understand properties of the data which will, in turn, help us with the rule based classifier.

### Exercise (Discussion): Figuring out the rules for our tweets



Rule based classification, as the name suggests, is based on a given set of rules. In case of tweets, these rules can be a lot of things. Let us look at the data to figure out the things that we can use for rules.

We know that we have the following unique things in tweets:

1. Hashtags
2. Mentions
3. Other words

**Question:** Do you think hashtags can be used to classify tweets? Give 5 examples of hashtags that can tell pro or anti refugee tweets apart.

In [None]:
# collect the hashtags of the first 10 pro tweets that have any
pro_hashtags = []
for tweet in pro_tweets:
  if len(tweet.hashtags) > 0:
    pro_hashtags.append(tweet.hashtags)
  if len(pro_hashtags) == 10:
    break

# count how often each non-stopword token occurs across the pro tweets
english_stopwords = set(stopwords.words('english'))  # build the stopword set once
pro_words_counter = {}  # maps each word to its number of occurrences

for tweet in pro_tweets:
  tokens = [t for t in tweet.tokenList if t not in english_stopwords]
  for token in tokens:
    if token in pro_words_counter:  # word already seen: add 1 to its count
      pro_words_counter[token] = pro_words_counter[token] + 1
    else:  # first occurrence: start the count at 1
      pro_words_counter[token] = 1

# print the counts sorted by value, descending
print(dict(sorted(pro_words_counter.items(), key=lambda item: item[1], reverse=True)))

{'refugee': 291, 'rt': 161, 'amp': 43, 'syrian': 36, 'refugees': 34, 'help': 33, 'children': 31, 'family': 24, 'story': 23, 'crisis': 23, 'people': 20, 'new': 19, 'support': 17, 'camp': 15, 'camps': 15, 'refugeeswelcome': 14, 'work': 14, 'us': 13, 'need': 12, 'report': 12, 'syria': 12, 'today': 12, 'women': 11, 'via': 11, 'read': 11, 'life': 11, 'child': 10, 'app': 10, 'education': 10, 'families': 10, 'walk': 9, 'mile': 9, 'shoes': 9, 'sign': 9, 'great': 9, 'rohingya': 8, 'day': 8, 'world': 8, 'one': 8, 'lives': 8, 'would': 8, 'many': 8, 'kids': 8, 'free': 8, '2017': 8, 'boat': 8, '16': 8, 'status': 8, 'please': 7, 'awareness': 7, 'may': 7, 'w': 7, 'see': 7, 'welcome': 7, 'user': 7, 'community': 7, 'first': 7, 'woman': 7, 'old': 7, '17': 7, 'every': 6, 'take': 6, 'supporting': 6, 'eu': 6, 'news': 6, 'war': 6, 'students': 6, 'like': 6, 'proud': 6, '2': 6, 'want': 6, 'pregnant': 6, 'school': 6, 'disaster': 6, 'provide': 6, 'military': 6, 'home': 6, 'back': 6, 'well': 5, 'calais': 5, 'hel

In [None]:
# display the first 10 anti hashtags
anti_hashtags = []

for tweet in anti_tweets:
  if len(tweet.hashtags) > 0: 
    anti_hashtags.append(tweet.hashtags)
  if len(anti_hashtags) == 10:
    break
anti_hashtags

[['#tcot'],
 ['#worldpenguinday'],
 ['#refugees'],
 ['#refugee', '#travelban'],
 ['#tuesdaymotivation'],
 ['#stephaniedavis'],
 ['#aid4yemen'],
 ['#flynn'],
 ['#worldp'],
 ['#worldpenguinday']]

**Question:** Can mentions (tags - '@') be used to classify tweets? Give 5 examples of mentions that can classify pro or anti tweets. 

In [None]:
# display the first 10 pro mentions

pro_mentions = []


for tweet in pro_tweets:
  if len(tweet.mentions) > 0:
    pro_mentions.append(tweet.mentions)
  if len(pro_mentions) == 10:
    break


pro_mentions

In [None]:
# display the first 10 anti mentions
anti_mentions = []


for tweet in anti_tweets:
  if len(tweet.mentions) > 0:
    anti_mentions.append(tweet.mentions)
  if len(anti_mentions) == 10:
    break


anti_mentions



**Question:** Do you think any other words in a tweet can be used to classify pro- or anti-refugee sentiment? Give 5 examples of words that can classify pro or anti tweets.

In [None]:
# display words in the first 10 pro tweets
english_stopwords = set(stopwords.words('english'))  # build the stopword set once
pro_words = []

for tweet in pro_tweets:
  tokens = [t for t in tweet.tokenList if t not in english_stopwords]
  pro_words.append(tokens)
  if len(pro_words) == 10:
    break

pro_words

[['rt',
  'starbucks',
  'hired',
  '10000',
  'vets',
  'since',
  '2014',
  'well',
  'refugee',
  'progr'],
 ['rt',
  'another',
  'unaccompanied',
  'refugee',
  'child',
  'arrested',
  'calais',
  'tonight',
  'evening',
  'food',
  'distribution'],
 ['rt',
  'actual',
  'deadline',
  'please',
  'continue',
  'shout',
  'raise',
  'awareness',
  'donate',
  'lgbt',
  'refugee',
  'causes',
  'contact'],
 ['rt',
  'helping',
  'refugees',
  'thrive',
  'survive',
  'new',
  'approach',
  'responding',
  'refugee',
  'crises'],
 ['rt',
  'repeat',
  'need',
  'watch',
  'video',
  'every',
  'morning',
  'ungrateful',
  'ass',
  'ever',
  'think',
  'skipping',
  'class',
  'amp',
  'takin'],
 ['government',
  'said',
  'take',
  'disabled',
  'refugee',
  'children',
  'people',
  'actually',
  'think',
  'keep',
  'evil',
  'fucks'],
 ['focus',
  'receiving',
  'countries',
  'imho',
  'list',
  'betts',
  'ampp',
  'collier',
  'fix',
  'worlds',
  'refugee',
  'system'],
 ['wa

In [None]:
# display words in the first 10 anti tweets
### YOUR CODE HERE
english_stopwords = set(stopwords.words('english'))  # build the stopword set once
anti_words = []

for tweet in anti_tweets:
  tokens = [t for t in tweet.tokenList if t not in english_stopwords]
  anti_words.append(tokens)
  if len(anti_words) == 10:
    break

anti_words
### END CODE

The more often a hashtag, mention, or word appears in one category compared with the other, the better we may expect it to work as a rule!
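
One way to make "more often in one category" concrete is to compare the share of tweets in each class that contain a word. The sketch below does this on tiny made-up token lists. The data and the `pro_share` score are assumptions for illustration, not part of the course library.

```python
from collections import Counter

# Hypothetical tokenized tweets, standing in for tweet.tokenList values.
pro_token_lists = [["help", "refugee", "children"],
                   ["support", "refugee", "family"],
                   ["help", "family"]]
anti_token_lists = [["deport", "refugee"],
                    ["illegal", "refugee", "deport"]]

def count_tweets_containing(token_lists):
    """For each word, count the number of tweets it appears in."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(set(tokens))  # a word counts once per tweet
    return counts

pro_counts = count_tweets_containing(pro_token_lists)
anti_counts = count_tweets_containing(anti_token_lists)

def pro_share(word):
    """1.0 means the word only shows up in pro tweets, 0.0 only in anti tweets."""
    pro_rate = pro_counts[word] / len(pro_token_lists)
    anti_rate = anti_counts[word] / len(anti_token_lists)
    total = pro_rate + anti_rate
    return pro_rate / total if total else 0.5  # unseen words are uninformative

for word in ["help", "deport", "refugee"]:
    print(word, round(pro_share(word), 2))
```

Words with a score near 0.5, such as `refugee` in this toy data, appear in both classes and are poor candidates for a rule.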


Play around with the interactive form below to see the count of a given property (i.e. hashtag, mention, or just a word), and how often it shows up in the pro or anti refugee tweets. This may give you some indication of what specific ones might work better to categorize tweets. 

In [None]:
#@title Query { run: "auto", vertical-output: true, display-mode: "form" }

examine_tweet = Tweet_counts(tweet_data) 

prop = 'Word' #@param ["Hashtags", "Mentions", "Word"]
string = 'Iraq' #@param {type:"string"}

if prop=='Hashtags':
  if not string.startswith('#'): string = '#' + string  # also safe when the field is empty
  print(examine_tweet.query_hashtag(string.lower()))
elif prop=='Mentions':
  if not string.startswith('@'): string = '@' + string  # also safe when the field is empty
  print(examine_tweet.query_mentions(string.lower()))
elif prop=='Word':
  print(examine_tweet.query_words(string.lower()))

#@markdown Mentions are tags in twitter - @blah, @realdonaldtrump. 
#@markdown <br><br>**Code result**:


{'pro': 4, 'anti': 1}


**When you're happy with your lists, discuss with your instructor, then write your hashtags, mentions, and words in the lists below. These will be the lists you'll be using today to classify the tweets as anti or pro refugee!**

In [None]:
# note: the tokens shown earlier are all lowercase, so capitalized entries such as
# 'Iraq' and 'USA' probably never match; a bare '#' is unlikely to match either
pro_hashtags = ['#rap','#immigration']
anti_hashtags = ['#buildthewall','#']
pro_mentions = ['@refugeecouncil','@independent','@appgrefugees']
anti_mentions = ['@potus','@100percfedup']
pro_words = ['welcome','life','save','syrian','children','help','family','story','new','support','camp','camps','Iraq']
anti_words = ['american','flag','muslim','rape','labor','country','islam','illegal','illegals','terrorists','leave','USA']

# Milestone 3: Coding up our classifiers

### Exercise (Coding)

We have three types of information that we get from tweets: hashtags, mentions, and the actual text. As we saw, we can build lists of terms that we think indicate a pro or anti tweet. Each list gives us a single classifier. For example, a pro hashtag classifier checks whether a tweet has any hashtag in our `pro_hashtags` list and, if it does, decides that the tweet is `pro refugee`. In the same way, we can build five other classifiers, one for each of our remaining lists. Each classifier is defined by the feature it looks at (hashtags, mentions, or text) and the category it tries to find (pro or anti).
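
As a sketch of the single-list idea, the helper below turns any one list into its own classifier. The names `make_list_classifier` and `pro_hashtag_classifier`, the example hashtag, and the use of `SimpleNamespace` as a stand-in for a `Tweet` object are all assumptions made for illustration.

```python
from types import SimpleNamespace

def make_list_classifier(feature_name, terms, label_if_match):
    """Build a classifier that checks one feature of a tweet against one list."""
    terms = set(terms)
    def classifier(tweet):
        values = getattr(tweet, feature_name)  # e.g. tweet.hashtags
        if any(value in terms for value in values):
            return label_if_match
        return None  # this list has nothing to say about the tweet
    return classifier

# Illustrative list with a single made-up entry.
pro_hashtag_classifier = make_list_classifier("hashtags", ["#refugeeswelcome"], "pro refugee")

# Minimal stand-ins exposing the same attributes as a Tweet object.
toy_tweet = SimpleNamespace(hashtags=["#refugeeswelcome"], mentions=[], tokenList=["welcome"])
other_tweet = SimpleNamespace(hashtags=[], mentions=[], tokenList=["hello"])

print(pro_hashtag_classifier(toy_tweet))
print(pro_hashtag_classifier(other_tweet))
```

Returning `None` when nothing matches keeps "no evidence" separate from "evidence for the other class", which a plain True/False classifier cannot express.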

Let us build a classifier based on anti-refugee features.


In [None]:
def anti_classifier(tweet):
  for hashtag in tweet.hashtags:
    if hashtag in anti_hashtags:
      return True # an anti hashtag matched: predict anti-refugee
  for mention in tweet.mentions:
    if mention in anti_mentions:
      return True # an anti mention matched: predict anti-refugee
  for word in tweet.tokenList:
    if word in anti_words:
      return True # an anti word matched: predict anti-refugee
  return False # if none of the hashtags, mentions, or words are in our anti lists, then the tweet does not express anti-refugee sentiment

Once we have made our rule based classifier, we can make predictions!

In [None]:
# first, make our predictions
predicted = []
for tweet in tweet_data:
  predicted.append(anti_classifier(tweet))
print(predicted)


[True, True, True, False, True, True, True, True, False, False, False, False, False, True, False, False, False, False, False, False, True, True, True, False, True, True, True, True, True, True, True, False, False, True, False, False, False, False, True, False, True, False, False, False, False, False, False, False, True, False, True, True, False, False, True, False, False, False, True, True, True, False, True, True, True, True, True, False, False, False, True, False, True, False, False, False, False, True, False, False, False, False, False, True, False, False, False, True, False, True, False, False, False, True, True, True, False, True, False, False, False, False, True, True, True, False, False, False, False, False, False, True, False, False, True, False, True, False, False, True, True, True, False, True, False, True, False, True, False, True, False, False, True, False, True, False, False, False, False, True, False, True, True, False, True, False, True, False, False, False, False, False

We need to compare our rule-based classifier's predictions with the real data. Since the `classification` value in the original data was a string, we need a helper function to convert those to Boolean values that we can use to compare to our predictions.

In [None]:
# helper func to convert string "TRUE" or "FALSE" to boolean values
def make_boolean(s):
  if s.lower() == "false":
    return False
  if s.lower() == "true":
    return True
  
# make the test data  
correct = [make_boolean(i['classification']) for i in data]

**Get this classifier's accuracy below!**

In [None]:

accuracy_score(correct,predicted)

0.660377358490566
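
Accuracy is the fraction of tweets for which the prediction equals the true label. The sketch below computes it by hand on a made-up example and also counts each combination of true and predicted label, which shows what kind of mistakes a classifier makes. The helper names and the toy lists are assumptions for illustration.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that equal the true label."""
    matches = sum(truth == guess for truth, guess in zip(true_labels, predicted_labels))
    return matches / len(true_labels)

def confusion_counts(true_labels, predicted_labels):
    """Count each (true label, predicted label) combination."""
    counts = {}
    for truth, guess in zip(true_labels, predicted_labels):
        counts[(truth, guess)] = counts.get((truth, guess), 0) + 1
    return counts

# Toy labels: True means anti-refugee, False means pro-refugee.
toy_correct = [True, True, False, False, False]
toy_predicted = [True, False, False, True, False]

print(accuracy(toy_correct, toy_predicted))
print(confusion_counts(toy_correct, toy_predicted))
```

A key such as `(True, False)` counts anti-refugee tweets that the classifier missed, while `(False, True)` counts pro-refugee tweets that it wrongly flagged.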

**Now try building a pro-classifier to only select for pro-refugee tweets! Then, test its accuracy.**

In [None]:
### YOUR CODE HERE
def pro_classifier(tweet):
  for hashtag in tweet.hashtags:
    if hashtag in pro_hashtags:
      return False # a pro hashtag matched: predict pro-refugee (not anti)
  for mention in tweet.mentions:
    if mention in pro_mentions:
      return False # a pro mention matched: predict pro-refugee (not anti)
  for word in tweet.tokenList:
    if word in pro_words:
      return False # a pro word matched: predict pro-refugee (not anti)
  return True # if none of the hashtags, mentions, or words are in our pro lists, then the tweet is predicted to be anti-refugee

predicted = []
for tweet in tweet_data:
  predicted.append(pro_classifier(tweet))
print(predicted)

# reuse the make_boolean helper defined above to rebuild the true labels
correct = [make_boolean(i['classification']) for i in data]
accuracy_score(correct,predicted)
### END CODE

[True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, True, True, True, True, False, True, True, False, False, False, True, False, True, True, True, True, True, True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, True, True, True, True, True, True, True, True, True, True, False, True, False, True, True, True, True, True, True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, True, True, True, False, False, False, True, True, True, True, True, False, True, True, Tru

0.7155297532656023