# Dawn Bali 
# MSDS 696 Project
## Exploration of the US/Iran Crisis Through Social Media

## Objectives
1. Collect and clean Twitter data, focusing on the US/Iran Crisis that occurred December 2019 - January 2020
2. Explore Data
3. Visualize results

### Research Questions
 - Were Americans in favor of assassinating Soleimani and by proxy, President Trump’s Iran foreign policy?To what extent, if any, were Iranians in favor of the assassination?
 - Aside from the US and Iran, what was the sentiment of the rest of the world?  
 - Does it vary by region (Middle East, Far East, Europe, Africa, Americas) or follow along the lines of other divides (Commonwealth states, former Soviet, Islamic)?
 - How did the sentiment vary over the span of collection, noting the key events below?

## Data
### Twitter content related to the US/Iran crisis collected via a Premium search with full archive access
 - Keywords:  Iran, protests, Tehran, Khamenei, Qassem Soleimani, Ukraine International Airlines flight PS752 
 - Hashtages: #iranprotests2020, #iranprotesters, #iran, #tehran, #khamenei, #qassemsoleimani, #uk752
 - Timeframe: December 27, 2019, through January 25, 2020 

In [101]:
# Packages needed
import os
import twitter 
import pandas as pd
import tweepy as tw
import json
import searchtweets as st #recommended package for Premium searches 
#https://twitterdev.github.io/search-tweets-python/index.html
import csv
import re #regular expression
from textblob import TextBlob
import string

### Authentication
Setting up the authentication was not trivial.  The authentication used in previous classes was insufficient for the type of search I wanted to do.  It required a different token.

### Premium search authentication
Required a lot of research:
 - https://developer.twitter.com/en/docs/tweets/search/overview/premium
 - my environ: https://developer.twitter.com/en/account/environments
 - my app:  https://developer.twitter.com/en/apps/17160618
 - my full-archive endpoint: https://api.twitter.com/1.1/tweets/search/fullarchive/USIRCrisis.json

#### Searchtweets Credential Documentation
- https://github.com/twitterdev/search-tweets-python/issues/64
- https://github.com/twitterdev/search-tweets-python/pull/68
- https://github.com/twitterdev/search-tweets-python/blob/master/searchtweets/credentials.py
- https://developer.twitter.com/en/docs/api-reference-index
- https://tweepy.readthedocs.io/en/latest/
- https://twitterdev.github.io/search-tweets-python/
- https://twitterdev.github.io/search-tweets-python/searchtweets.html

In [1]:
#https://twitterdev.github.io/search-tweets-python/
from searchtweets import load_credentials
from searchtweets import ResultStream
from searchtweets import gen_rule_payload
from searchtweets import collect_results

### Required the use of a yaml file to set up configuration
#### This code is for an environmental variable override.  In case it was necessary.  I added it to the .bashrc file.
- os.environ["SEARCHTWEETS_USERNAME"] = "USIRCrisis"
- os.environ["SEARCHTWEETS_ENDPOINT"] = "https://api.twitter.com/1.1/tweets/search/fullarchive/USIRCrisis.json"
- os.environ['SEARCHTWEETS_ACCOUNT_TYPE']= 'premium'
- os.environ['SEARCHTWEETS_BEARER_TOKEN']= ''

### Using the Twitter Search API's Python Wrapper: Searchtweets 
### API Parameters
- https://developer.twitter.com/en/docs/tweets/search/guides/integrating-premium
- Technical Rate limits
    - Requests per second: 10
    - Requests per minute: 30
    - Time Frame:  full history
    - Counts vs Data:  Counts only
    - Query length: 128 characters
    - Operator availability: standard
    - Tweets per request: 100
#### Search Documentation
- https://github.com/twitterdev/search-tweets-python/blob/master/examples/api_example.ipynb
- https://developer.twitter.com/en/docs/tweets/search/api-reference/premium-search
- https://developer.twitter.com/en/docs/tweets/search/guides/premium-operators
- https://developer.twitter.com/en/docs/tweets/search/api-reference/premium-search#DataEndpoint
- https://developer.twitter.com/en/docs/tweets/search/guides/integrating-premium
- https://github.com/twitterdev/search-tweets-python/blob/master/examples/api_example.ipynb

### Creating the Query

In [2]:
search_words = 'iran OR iranprotesters OR #iran OR #tehran OR #khamenei OR #QassemSoleimani OR flight 752 or #uk752 or #pb752'

In [3]:
premium_search_args = load_credentials("~/school/Practicums/twitter_keys.yaml", 
                                       yaml_key="search_tweets_api", 
                                       env_overwrite=False)

  search_creds = yaml.load(f)[yaml_key]
Grabbing bearer token from OAUTH


In [4]:
rule = gen_rule_payload(search_words, from_date='2019-12-27', to_date='2020-01-25')

In [6]:
tweets = collect_results(rule, max_results=100, result_stream_args=premium_search_args)

In [7]:
print(tweets)

[{'created_at': 'Fri Jan 24 23:59:59 +0000 2020', 'id': 1220858817807822853, 'id_str': '1220858817807822853', 'text': 'RT @jameshohmann: Mike Pompeo was angry at NPR because he wanted to talk about IRAN, not Ukraine.\n\nBut when she asked straightforward quest…', 'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>', 'truncated': False, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 26722859, 'id_str': '26722859', 'name': 'Helring, God of Hammers', 'screen_name': 'helring', 'location': None, 'url': None, 'description': 'RSN=Helring. Also known as Zarosian_Emissary on reddit. Runescape player, BioWare and Bethesda fan. Rogue Bitch Bot.', 'translator_type': 'none', 'protected': False, 'verified': False, 'followers_count': 938, 'friends_count': 1219, 'listed_count': 6, 'favourites_count': 42247, 'statuses_count': 57840, '

In [8]:
#This was important to do.  The max return will always be only 100 tweets.  
#I will need to do multiple searches as well as filter. 
#For the purpose of this research, I only want original posts.  Filtering retweets, should help considerably.
#I also need to filter for only english text.
#2020-01-24 23:59:59 --> 2020-01-24 23:59:11 (4) = 100 tweets
[print(tweet.created_at_datetime) for tweet in tweets[0:100]];

2020-01-24 23:59:59
2020-01-24 23:59:59
2020-01-24 23:59:58
2020-01-24 23:59:58
2020-01-24 23:59:58
2020-01-24 23:59:58
2020-01-24 23:59:58
2020-01-24 23:59:57
2020-01-24 23:59:57
2020-01-24 23:59:57
2020-01-24 23:59:57
2020-01-24 23:59:57
2020-01-24 23:59:55
2020-01-24 23:59:55
2020-01-24 23:59:55
2020-01-24 23:59:54
2020-01-24 23:59:54
2020-01-24 23:59:54
2020-01-24 23:59:53
2020-01-24 23:59:53
2020-01-24 23:59:53
2020-01-24 23:59:53
2020-01-24 23:59:52
2020-01-24 23:59:51
2020-01-24 23:59:51
2020-01-24 23:59:50
2020-01-24 23:59:49
2020-01-24 23:59:49
2020-01-24 23:59:48
2020-01-24 23:59:48
2020-01-24 23:59:48
2020-01-24 23:59:47
2020-01-24 23:59:47
2020-01-24 23:59:47
2020-01-24 23:59:46
2020-01-24 23:59:45
2020-01-24 23:59:44
2020-01-24 23:59:44
2020-01-24 23:59:44
2020-01-24 23:59:44
2020-01-24 23:59:44
2020-01-24 23:59:43
2020-01-24 23:59:43
2020-01-24 23:59:42
2020-01-24 23:59:42
2020-01-24 23:59:42
2020-01-24 23:59:41
2020-01-24 23:59:41
2020-01-24 23:59:41
2020-01-24 23:59:39


In [57]:
Newrule = gen_rule_payload("search_words lang:EN", from_date='2019-12-27', to_date='2020-01-25')

In [58]:
Ftweets = collect_results(Newrule, max_results=100, result_stream_args=premium_search_args)

In [59]:
#2020-01-24 23:59:59 --> 2020-01-19 20:49:45 = 100 tweets
[print(tweet.created_at_datetime) for tweet in Ftweets[0:100]];

2020-01-24 22:11:47
2020-01-24 21:59:13
2020-01-24 21:18:31
2020-01-24 17:21:23
2020-01-24 17:18:28
2020-01-24 16:04:43
2020-01-24 15:12:24
2020-01-24 13:16:19
2020-01-24 13:15:47
2020-01-24 12:00:01
2020-01-24 10:18:53
2020-01-24 03:25:43
2020-01-23 20:21:04
2020-01-23 19:58:43
2020-01-23 19:25:02
2020-01-23 19:15:52
2020-01-23 16:53:08
2020-01-23 15:41:21
2020-01-23 14:26:41
2020-01-23 12:55:31
2020-01-23 12:39:27
2020-01-23 10:08:58
2020-01-23 08:34:39
2020-01-23 08:26:59
2020-01-23 08:16:15
2020-01-23 04:27:21
2020-01-23 02:02:28
2020-01-23 01:39:19
2020-01-23 01:15:46
2020-01-22 22:27:48
2020-01-22 20:47:28
2020-01-22 19:55:47
2020-01-22 19:53:37
2020-01-22 16:51:25
2020-01-22 16:32:01
2020-01-22 14:44:34
2020-01-22 14:34:42
2020-01-22 12:32:32
2020-01-22 12:26:21
2020-01-22 09:05:18
2020-01-22 07:57:09
2020-01-22 07:15:48
2020-01-22 02:34:26
2020-01-22 02:24:53
2020-01-22 01:56:28
2020-01-22 01:33:38
2020-01-21 23:32:37
2020-01-21 21:13:48
2020-01-21 20:49:13
2020-01-21 20:36:52


##### Filtering
Figuring out the syntax to filter retweets and a language took 4 hours.  In spite of listing "filter:" as a standard operator, I am unable to get it to work. I kept receiving a 422 error. I also tried including it in the keyword variable.  I think it was unsuccessful there because it was being read as part of the string.  I suspect also that filter was an older operator that was phased out as different paid levels of subscriptions were introduced.  the capability "is:retweets" is available for the paid Premium account.
- https://discovertext.com/2018/07/01/the-premium-twitter-api/
- https://stackoverflow.com/questions/35938188/twitter-api-how-to-exclude-retweets-when-searching-tweets-using-twython/36022286

I was able to cover more time with one pull by limiting the results to English.  It looks like I may be able to do a couple of days as a time and then merge the data somehow.

###### HTTP Error code: 422 Unprocessable Entity
There were errors processing your request: Reference to invalid operator 'filter'. Operator is not available in current product or product packaging. Please refer to complete available operator list at http://t.co/operators. (at position 15), Reference to invalid field 'filter' (at position 15)

Rule payload: {'query': 'search_words -filter:retweets lang:EN', 'toDate': '202001250000', 'fromDate': '201912270000'}

##### Query timeframes
- r1 = 2020-01-24 22:11:47 --> 2020-01-20 00:00:25
- r2 = 2020-01-19 23:54:15 --> 2020-01-17 01:15:46 (maxes out at 100, excludes 1 tweet @2020-01-17 00:46:42)
- r3 = 2020-01-16 20:09:46 --> 2020-01-10 01:20:13
- r4 = 2020-01-09 23:56:43 --> 2020-01-09 00:41:03
- r5 = 2020-01-08 23:38:58 --> 2020-01-08 00:21:55
- r6 = 2020-01-07 23:50:12 --> 2020-01-07 05:38:46 (Missing 5 hours.  UTC time?)
- r7 = 2020-01-06 21:13:55 --> 2020-01-01 02:33:54
- r8 = 2019-12-31 21:14:08 --> 2020-01-08 05:39:23
- r9 = 2020-01-09 23:56:43 --> 2019-12-29 02:24:35
- r10 = 2019-12-28 23:36:55 -> 2019-12-28 14:50:43 (Missing 14 hours.  UTC time?)
- r11 = 2019-12-27 23:55:32 -> 2019-12-27 13:11:10 (Missing 13 hours.  UTC time?)

In [98]:
rule1 = gen_rule_payload("search_words lang:EN", from_date='2020-01-20', to_date='2020-01-25')
rule2 = gen_rule_payload("search_words lang:EN", from_date='2019-12-27', to_date='2020-01-20')
rule3 = gen_rule_payload("search_words lang:EN", from_date='2020-01-10', to_date='2020-01-17')
rule4 = gen_rule_payload("search_words lang:EN", from_date='2020-01-09', to_date='2020-01-10')
rule5 = gen_rule_payload("search_words lang:EN", from_date='2020-01-08', to_date='2020-01-09')

rule6 = gen_rule_payload("search_words lang:EN", from_date='2020-01-07', to_date='2020-01-08')
rule7 = gen_rule_payload("search_words lang:EN", from_date='2020-01-01', to_date='2020-01-07')

rule8 = gen_rule_payload("search_words lang:EN", from_date='2020-01-08', to_date='2020-01-01')
rule9 = gen_rule_payload("search_words lang:EN", from_date='2019-12-29', to_date='2020-01-01')

rule10 = gen_rule_payload("search_words lang:EN", from_date='2019-12-28', to_date='2019-12-29')
rule11 = gen_rule_payload("search_words lang:EN", from_date='2019-12-27', to_date='2019-12-28')

In [100]:
tweets1 = collect_results(rule1, max_results=100, result_stream_args=premium_search_args)
tweets2 = collect_results(rule2, max_results=100, result_stream_args=premium_search_args)
tweets3 = collect_results(rule3, max_results=100, result_stream_args=premium_search_args)
tweets4 = collect_results(rule4, max_results=100, result_stream_args=premium_search_args)
tweets5 = collect_results(rule5, max_results=100, result_stream_args=premium_search_args)
tweets6 = collect_results(rule6, max_results=100, result_stream_args=premium_search_args)
tweets7 = collect_results(rule7, max_results=100, result_stream_args=premium_search_args)
tweets8 = collect_results(rule8, max_results=100, result_stream_args=premium_search_args)
tweets9 = collect_results(rule9, max_results=100, result_stream_args=premium_search_args)
tweets10 = collect_results(rule10, max_results=100, result_stream_args=premium_search_args)
tweets11 = collect_results(rule11, max_results=100, result_stream_args=premium_search_args)

In [99]:
Ttweets = collect_results(rule11, max_results=100, result_stream_args=premium_search_args)
[print(tweet.created_at_datetime) for tweet in Ttweets[0:100]];

2019-12-27 23:55:32
2019-12-27 22:31:17
2019-12-27 22:17:51
2019-12-27 21:53:09
2019-12-27 21:41:59
2019-12-27 21:41:52
2019-12-27 21:40:11
2019-12-27 21:39:34
2019-12-27 21:39:13
2019-12-27 21:38:13
2019-12-27 21:23:58
2019-12-27 20:40:32
2019-12-27 20:34:45
2019-12-27 19:27:04
2019-12-27 19:15:47
2019-12-27 19:05:10
2019-12-27 18:20:12
2019-12-27 18:16:59
2019-12-27 18:12:04
2019-12-27 17:56:28
2019-12-27 17:48:06
2019-12-27 17:39:28
2019-12-27 17:26:30
2019-12-27 17:21:33
2019-12-27 17:17:34
2019-12-27 17:09:36
2019-12-27 16:53:49
2019-12-27 16:53:14
2019-12-27 16:43:48
2019-12-27 16:39:16
2019-12-27 16:26:24
2019-12-27 16:24:31
2019-12-27 16:23:48
2019-12-27 16:08:26
2019-12-27 16:02:43
2019-12-27 15:45:48
2019-12-27 15:35:58
2019-12-27 15:33:07
2019-12-27 15:29:36
2019-12-27 15:28:16
2019-12-27 15:20:41
2019-12-27 15:20:36
2019-12-27 15:13:04
2019-12-27 15:10:01
2019-12-27 15:03:37
2019-12-27 15:01:18
2019-12-27 14:53:05
2019-12-27 14:52:55
2019-12-27 14:50:56
2019-12-27 14:43:15


- https://stackoverflow.com/questions/27941940/how-to-exclude-retweets-and-replies-in-a-search-api
- https://twitterdev.github.io/do_more_with_twitter_data/finding_the_right_data.html
- https://towardsdatascience.com/extracting-twitter-data-pre-processing-and-sentiment-analysis-using-python-3-0-7192bd8b47cf
- https://www.toptal.com/python/twitter-data-mining-using-python
- https://medium.com/@mpuig/twitter-101-ae045999c7fe
- https://dev.to/twitterdev/how-i-solved-my-nyc-parking-problem-with-python-the-search-tweets-api-and-twilio-1chp
- https://www.freecodecamp.org/news/basic-data-analysis-on-twitter-with-python-251c2a85062e/
- https://lucahammer.com/2019/11/05/collecting-old-tweets-with-the-twitter-premium-api-and-python/

## Methods and Results

### Preparing the environment

In [None]:
import nltk
import wordcloud
import spacy #conda install -c conda-forge spacy-model-en_core_web_lg
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.attrs import ORTH
nlp = spacy.load('en_core_web_lg')
import string
import collections
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import silhouette_score
import pandas as pd
import numpy as np
import json
import pandas as pd
import csv
import re #regular expression
from textblob import TextBlob
import string
from nltk.cax_results=500orpus import stopwords
from nltk.tokenize import TweetTokenizer

### Text Preprocessing

##### Tweet parser
from tweet_parser.tweet import Tweet

from tweet_parser.tweet_parser_errors import NotATweetError

import fileinput

import json

for line in fileinput.FileInput("gnip_tweet_data.json"):
    try:
        tweet_dict = json.loads(line)
        tweet = Tweet(tweet_dict)
    except (json.JSONDecodeError,NotATweetError):
        pass
    print(tweet.created_at_string, tweet.all_text)

In [None]:
#HappyEmoticons
emoticons_happy = set([
    ':-)', ':)', ';)', ':o)', ':]', ':3', ':c)', ':>', '=]', '8)', '=)', ':}',
    ':^)', ':-D', ':D', '8-D', '8D', 'x-D', 'xD', 'X-D', 'XD', '=-D', '=D',
    '=-3', '=3', ':-))', ":'-)", ":')", ':*', ':^*', '>:P', ':-P', ':P', 'X-P',
    'x-p', 'xp', 'XP', ':-p', ':p', '=p', ':-b', ':b', '>:)', '>;)', '>:-)',
    '<3'
    ])
# Sad Emoticons
emoticons_sad = set([
    ':L', ':-/', '>:/', ':S', '>:[', ':@', ':-(', ':[', ':-||', '=L', ':<',
    ':-[', ':-<', '=\\', '=/', '>:(', ':(', '>.<', ":'-(", ":'(", ':\\', ':-c',
    ':c', ':{', '>:\\', ';('data
    ])
#Emoji patterns
emoji_pattern = re.compile("["
         u"\U0001F600-\U0001F64F"  # emoticons
         u"\U0001F300-\U0001F5FF"  # symbols & pictographs
         u"\U0001F680-\U0001F6FF"  # transport & map symbols
         u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
         u"\U00002702-\U000027B0"
         u"\U000024C2-\U0001F251"
         "]+", flags=re.UNICODE)
emoticons = emoticons_happy.union(emoticons_sad)

In [None]:
def clean_tweets(tweet):
    tokenZR = TweetTokenizer()
    stop_words = set(stopwords.words('english') and ['RT', '...', '📣', '🎄', 'Merry Christmas', "MERRY CHRISTMAS"])
    word_tokax_results=500ens = tokenZR.tokenize(tweet)
#after tweepy preprocessing the colon symbol left remain after      #removing mentions
    tweet = re.sub(r':', '', tweet)
    tweet = re.sub(r'‚Ä¶', '', tweet)
#replace consecutive non-ASCII characters with a space
    tweet = re.sub(r'[^\x00-\x7F]+',' ', tweet)
#to remove the hyperlinks
    tweet = re.sub(r'http\S+','', tweet)
#remove emojis from tweet
    tweet = emoji_pattern.sub(r'', tweet)
#filter using NLTK library append it to a string
    filtered_tweet = [w for w in word_tokens if not w in stop_words]
    filtered_tweet = []
#looping through conditions
    for w in word_tokens:
#check tokens against stop words , emoticons and punctuations
        if w not in stop_words and w not in emoticons and w not in string.punctuation:
            filtered_tweet.append(w)
    return ' '.join(filtered_tweet)
    #print(word_tokens)
    #print(filtered_sentence)
    return tweet

In [None]:
def clean_twitter(docs):
# remove punctuation and numbers
# I do this before lemmatizing, so things like "act's" turn into 'act' instead of 'act s'
    print('removing punctuation and digits')
    table = str.maketrans({key: None for key in string.punctuation + string.digits})
    clean_text = [d.translate(table) for d in docs]
    
    print('spacy nlp...')
    nlp_docs = [nlp(d) for d in clean_text]
    
    # keep the word if it's a pronoun, otherwise use the lemma
    # otherwise spacy substitutes '-PRON-' for pronouns
    print('getting lemmas')
    lemmatized_docs = [[w.lemma_ if w.lemma_ != '-PRON-'
                           else w.lower_
                           for w in d]
                      for d in nlp_docs]
    
    # remove stopwords
    print('removing stopwords')
    stop_words = set(stopwords.words('english') and ['RT', '...', '📣', '🎄', 'Merry Christmas', "MERRY CHRISTMAS"])
    
    lemmatized_docs = [[lemma for lemma in doc if lemma not in stop_words] for doc in lemmatized_docs]
    
    # join tokens back into doc
    clean_text = [' '.join(l) for l in lemmatized_docs]
        
    return clean_text

## Conclusion