# Trump - a 140 character insight

---

## Initial research outline

Because presidents lead their nations, we believe their actions can set an example and encourage certain behaviors. We will investigate whether the sentiment such figures convey via social media correlates with negative social behavior in a nation.

The presidency of Donald Trump has been marked by many controversies, including the rise of supremacist groups and numerous nationwide conflicts. Using Trump's tweets, we would explore whether there is a significant temporal correlation between the sentiment expressed in a tweet and the number of conflicts in the nation, as recorded in the GDELT dataset. To reduce the bias of such an analysis, we would repeat it on the social media activity of his predecessor and compare the results, and we would add the socio-economic aspects that may affect such behaviors.

What is the power of a presidential tweet? We hope to gain more insight into this question and to raise awareness of the impact 140 characters can make.

### Research questions
* Do violent events or crimes occur more frequently after presidential tweets?
* Is the correlation higher with negative sentiment of the tweet?
* Does a significant difference exist between such correlations under the Trump and Obama presidencies?
* Is there a difference in the sentiment expressed in tweets before and after Trump became a presidential candidate?
* How do socio-economic aspects influence the correlation?
* Does the political regime affect such negative behavior on a larger scale?

### Dataset

#### Trump tweets
* Downloading the complete dataset locally and processing the JSON files
* We would use mainly the timestamps and tweet content for our analysis:
  * timestamps for temporal correlation with events
  * tweet content for sentiment analysis
* We could follow the impact of tweet by analyzing:
  * number of retweets
  * number of favorites
  * number of followers
  
#### GDELT
* We would use the GDELT Global Knowledge Graph to gain insights on events:
  * geolocalized to USA
  * in specified timeframe, starting from 2013
  * specific events, such as protests or violent manifestations
  * will need knowledge on working on the cluster and accessing the data
  
#### Wikidata
* Obtaining data for socio-economic indicators:
  * for a specific period of time
  * for a specific region of the USA
  * data wrangling and processing

#### Obama tweets *
Available at: http://obamawhitehouse.gov.archivesocial.com/

* It would enrich our analysis by providing an insight in differences between the two leaders
* We strive for a more complete and less biased analysis

---

## Evolution of research topic

The initial project proposal has evolved through iterations of data exploration, evaluation of project complexity and consultation with the teaching team in ADA.

### Main concerns that have been raised
* project required many areas of expertise
    * from social studies, behavioral studies, to mechanisms of USA economy
    * unable to reach the level of expertise on our own
    * for this type of project having external scientists would be an overkill
* timeframe for this project is limited
    * practically impossible to have a meaningful insight in about a month
* **signal detection and correlation would be very hard to prove since many underlying factors exist - complexity is high**
* GDELT could not provide the necessary granularity for this survey

A comment that described our situation and prospect of success:
> You will probably need a PhD to be able to come close to a meaningful result.

Even though we were passionate about our idea, we realized that we needed to reduce the scope and pinpoint specific research questions that are feasible with our current expertise and available time. We are thankful for the very insightful comments from Professor Bob West, who steered us toward our current project. The project is reasonable and potentially insightful, and we hope to squeeze as much information from the dataset as possible.

---

# TODO: current outline, abstract, research questions and datasets

# Data analysis

---

## General overview

We will perform an exploratory data analysis to gain deeper insight into the available data and the information it contains. The two datasets we have decided to use are:

* Trump Twitter archive: *main dataset*
* Internet archive - Trump TV news factchecks: *enrichment dataset*


### Trump Twitter archive

The Trump Twitter archive is the complete collection of Tweets from the account of Donald Trump, starting with the first Tweet in 2009. The dataset has been obtained from the [Trump Twitter Archive github repository](https://github.com/bpb27/trump_tweet_data_archive).

The archive was built by combining a [Twitter scraper](https://github.com/bpb27/twitter_scraping) with the official Twitter API. This combination makes it possible to collect Tweets from any user account without the official premium Twitter API.

The result set is compliant with the Twitter API output, and the meaning of the columns in our dataset is described in the [official documentation](https://developer.twitter.com/en/docs/tweets/tweet-updates).

More information and exploratory data analysis will follow.

### Internet archive - Trump TV news factchecks

In further research we would like to understand how Donald Trump reacts to the news about him. Building our own dataset of news about Donald Trump would be difficult because:
* There are many sources online, so it is unclear which outlets to select
* Covering a significant portion of the news could introduce bias
* The same news is duplicated across different sources
* Publication times differ significantly between sources
* The main topics of each news item would need to be extracted
* It is unclear what criterion marks a news item as being about Donald Trump, since he appears in many news stories
* There is no universal API or method for this task, so scraping multiple sites would be necessary

Luckily, [The Internet Archive](https://archive.org/details/tv?factchecks) provides a ready-made dataset of televised news clips linked to Donald Trump, from 2009 until today. Most importantly, each clip comes with its air date and a short list of the topics covered. This should make it easier to pinpoint a reaction in the Tweets, if there is one.
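One possible way to look for such reactions is a time-window join: for each news clip, find the first Tweet posted shortly afterwards. The following is only a minimal sketch on toy data. The column names `clip_time` and `topics`, the 6-hour window and the helper `first_tweet_after` are assumptions made for illustration; the real column names of the factchecks CSV are not shown here.

```python
import pandas as pd

# Toy stand-ins for the two datasets; `clip_time` and `topics` are hypothetical names.
news_demo = pd.DataFrame({
    'clip_time': pd.to_datetime(['2017-01-10 20:00', '2017-01-12 09:00']),
    'topics': ['healthcare', 'trade'],
})
tweets_demo = pd.DataFrame({
    'created_at': pd.to_datetime(['2017-01-10 21:30', '2017-01-11 08:00',
                                  '2017-01-14 10:00']),
    'text': ['tweet A', 'tweet B', 'tweet C'],
})


def first_tweet_after(news_df, tweets_df, window='6h'):
    """For each clip, attach the first tweet posted within `window` after it."""
    return pd.merge_asof(
        news_df.sort_values('clip_time'),      # merge_asof needs sorted keys
        tweets_df.sort_values('created_at'),
        left_on='clip_time',
        right_on='created_at',
        direction='forward',                   # only look at later tweets
        tolerance=pd.Timedelta(window),        # ignore tweets outside the window
    )


matched = first_tweet_after(news_demo, tweets_demo)
print(matched[['clip_time', 'topics', 'created_at', 'text']])
```

Clips without any Tweet inside the window keep missing values in the Tweet columns, which makes them easy to filter out. Whether a temporal match reflects an actual reaction would still have to be checked against the topics.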

More information about the dataset will follow in the exploratory data analysis.


### Where to process: in cluster or locally?
Both **Trump Twitter archive** and **Internet archive - Trump TV news factchecks** are small enough for local, in-memory processing and analysis. The characteristics of the datasets that led to this decision are:

* Trump Twitter archive:
    * multiple JSON files, split by the year of collection
    * content of the JSON files is specified by Twitter API
    * 9 uncondensed JSON files, in total less than 100MB (92MB)

* Internet archive - Trump TV news factchecks
    * single CSV file
    * less than 1 MB in size
    
---

## Exploratory data analysis

We will look in depth into the datasets and show the possible uses of the information, as well as comments on usability of certain parts of our data.

Since the dataset size allows for working in-memory, we will use standard utilities such as *Pandas* and *NumPy*.

In [19]:
import pandas as pd
import numpy as np

In [20]:
TWEETS_PATH = 'C:/Users/Viktor/Desktop/trump_tweet_data_archive-master/trump_tweet_data_archive-master/master_' 
NEWS_PATH = 'data/factchecks.csv'

### Trump Twitter Archive

We combine all the available Tweets into a single `DataFrame` for easier use.

In [24]:
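# Read one JSON file per year (2009-2017) and concatenate them.
# pd.concat is used because DataFrame.append was removed in pandas 2.0.
tweets = pd.concat(
    [pd.read_json(TWEETS_PATH + str(year) + '.json/master_' + str(year) + '.json')
     for year in range(2009, 2018)]
)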

We perform an introductory description of the dataframe, as provided by Pandas:

In [73]:
tweets.describe()

Unnamed: 0,contributors,favorite_count,id,id_str,in_reply_to_status_id,in_reply_to_status_id_str,in_reply_to_user_id,in_reply_to_user_id_str,possibly_sensitive,quoted_status_id,quoted_status_id_str,retweet_count,withheld_copyright
count,0.0,32532.0,32532.0,32532.0,1999.0,1999.0,2440.0,2440.0,7489.0,286.0,286.0,32532.0,1.0
mean,,8414.287809,5.031478e+17,5.031478e+17,3.347165e+17,3.347165e+17,351638100.0,351638100.0,0.0,7.254326e+17,7.254326e+17,2572.550012,1.0
std,,25144.002721,2.027952e+17,2.027952e+17,9.525507e+16,9.525507e+16,400445500.0,400445500.0,0.0,1.097156e+17,1.097156e+17,7422.315792,
min,,0.0,1698309000.0,1698309000.0,1.672774e+17,1.672774e+17,7425.0,7425.0,0.0,5.427798e+17,5.427798e+17,0.0,1.0
25%,,20.0,3.350412e+17,3.350412e+17,2.937718e+17,2.937718e+17,42519610.0,42519610.0,0.0,6.293375e+17,6.293375e+17,16.0,1.0
50%,,68.0,5.097998e+17,5.097998e+17,3.138054e+17,3.138054e+17,239272900.0,239272900.0,0.0,6.973068e+17,6.973068e+17,107.0,1.0
75%,,1910.0,6.551359e+17,6.551359e+17,3.543118e+17,3.543118e+17,485295100.0,485295100.0,0.0,7.88895e+17,7.88895e+17,1151.25,1.0
max,,633253.0,9.341318e+17,9.341318e+17,9.336626e+17,9.336626e+17,3412873000.0,3412873000.0,0.0,9.339594e+17,9.339594e+17,369530.0,1.0


We also list all the columns present in our `DataFrame`:

In [29]:
tweets.columns

Index(['contributors', 'coordinates', 'created_at', 'display_text_range',
       'entities', 'extended_entities', 'favorite_count', 'favorited',
       'full_text', 'geo', 'id', 'id_str', 'in_reply_to_screen_name',
       'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str', 'is_quote_status',
       'lang', 'place', 'possibly_sensitive', 'quoted_status',
       'quoted_status_id', 'quoted_status_id_str', 'retweet_count',
       'retweeted', 'retweeted_status', 'scopes', 'source', 'text',
       'truncated', 'user', 'withheld_copyright', 'withheld_in_countries',
       'withheld_scope'],
      dtype='object')

Documentation provides more information about the semantics behind each attribute of a [Tweet object](https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object).

We are mainly interested in the following general attributes of Tweets:
* textual content
* number of retweets
* number of favorites

Such Tweet context could be obtained from the fields:
* `created_at` - timestamp of the tweet
* `entities` - entities parsed out of the tweet, dict of:
    * `hashtags`
    * `symbols` 
    * `urls`
    * `user_mentions`
* `favorite_count` - number of favorites
* `id` - Tweet id
* `is_quote_status` - indicates whether this is a quoted tweet
* `lang` - language of the tweet (could help in the NLP)
* `retweet_count` - number of retweets
* `source` - utility used to post the Tweet
* `user` - information about the [user](https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/user-object), we are possibly interested in:
    * followers count
    * statuses_count

In [138]:
tw = tweets.copy()

In [181]:
tw.loc[tw.is_quote_status==True].count()

created_at         303
entities           303
favorite_count     303
favorited          303
id                 303
id_str             303
is_quote_status    303
lang               303
retweet_count      303
retweeted          303
source             303
truncated          303
user               303
dtype: int64

In [144]:
tw.dropna(axis = 1, how='any', inplace=True)

In [182]:
# Grab the `entities` dict of the second Tweet as an example
a = tw['entities'].iloc[1]

In [183]:
a

{'hashtags': [], 'symbols': [], 'urls': [], 'user_mentions': []}

In [165]:
tw.columns

Index(['created_at', 'entities', 'favorite_count', 'favorited', 'id', 'id_str',
       'is_quote_status', 'lang', 'retweet_count', 'retweeted', 'source',
       'truncated', 'user'],
      dtype='object')

In [157]:
for i, el in enumerate(tw['entities']):
    print(el['hashtags'])
    
    if(i>2000):
        break

(output abridged: almost all printed entries are empty lists `[]`; the non-empty entries that survive in the truncated output are listed below)
[{'indices': [117, 127], 'text': 'EvanForSI'}]
[{'indices': [124, 139], 'text': 'TimeToGetTough'}]
[{'indices': [125, 140], 'text': 'TimeToGetTough'}]
[{'indices': [123, 138], 'text': 'TimeToGetTough'}]
[{'indices': [12, 27], 'text': 'TimeToGetTough'}]
[{'indices': [98, 113], 'text': 'TimeToGetTough'}]
[{'indices': [116, 128], 'text': 'SandyRelief'}]
[{'indices': [118, 124], 'text': 'TRUMP'}]
[{'indices': [49, 53], 'text': 'NYE'}]
[{'indices': [23, 29], 'text': 'Sandy'}, {'indices': [99, 111], 'text': 'sandyrelief'}]
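Printing every `hashtags` list is hard to read, because most Tweets carry no hashtag at all. A frequency count gives a more compact view. The sketch below is one way to do it; the helper `count_hashtags` and the choice to lowercase tags (so that `SandyRelief` and `sandyrelief` are counted together) are assumptions, and the sample reuses a few entries from the output above.

```python
from collections import Counter


def count_hashtags(entities_iterable):
    """Count hashtag occurrences, case-insensitively, over `entities` dicts."""
    counts = Counter()
    for entities in entities_iterable:
        counts.update(tag['text'].lower() for tag in entities['hashtags'])
    return counts


# A small sample shaped like the `entities` column.
sample_entities = [
    {'hashtags': []},
    {'hashtags': [{'indices': [124, 139], 'text': 'TimeToGetTough'}]},
    {'hashtags': [{'indices': [12, 27], 'text': 'TimeToGetTough'}]},
    {'hashtags': [{'indices': [23, 29], 'text': 'Sandy'},
                  {'indices': [99, 111], 'text': 'sandyrelief'}]},
    {'hashtags': [{'indices': [116, 128], 'text': 'SandyRelief'}]},
]

hashtag_counts = count_hashtags(sample_entities)
print(hashtag_counts.most_common(3))
```

On the full dataset this would be called as `count_hashtags(tw['entities'])`.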

#### General characteristics of the dataset

The dataset contains 32,532 tweets.
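A natural first characteristic to look at is how the tweets are distributed over time. The sketch below counts tweets per calendar year on a toy frame; the helper `tweets_per_year` is an assumption made for illustration, and the third timestamp is made up.

```python
import pandas as pd

# Toy frame; the first two timestamps are the earliest tweets in the archive,
# the third one is invented for the example.
demo_tweets = pd.DataFrame({'created_at': pd.to_datetime([
    '2009-05-04 18:54:25',
    '2009-05-05 01:00:10',
    '2012-11-06 10:00:00',
])})


def tweets_per_year(df):
    """Number of tweets per calendar year, ordered by year."""
    return df['created_at'].dt.year.value_counts().sort_index()


per_year = tweets_per_year(demo_tweets)
print(per_year)
```

On the real data, `tweets_per_year(tweets)` would give one count per year, assuming `created_at` was parsed as a datetime column.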

In [72]:
tweets.sort_values(by='created_at')

Unnamed: 0,contributors,coordinates,created_at,display_text_range,entities,extended_entities,favorite_count,favorited,full_text,geo,...,retweeted,retweeted_status,scopes,source,text,truncated,user,withheld_copyright,withheld_in_countries,withheld_scope
55,,,2009-05-04 18:54:25,,"{'symbols': [], 'user_mentions': [], 'hashtags...",,202,False,,,...,False,,,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",Be sure to tune in and watch Donald Trump on L...,False,"{'follow_request_sent': False, 'has_extended_p...",,,
54,,,2009-05-05 01:00:10,,"{'symbols': [], 'user_mentions': [], 'hashtags...",,3,False,,,...,False,,,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",Donald Trump will be appearing on The View tom...,False,"{'follow_request_sent': False, 'has_extended_p...",,,
53,,,2009-05-08 13:38:08,,"{'symbols': [], 'user_mentions': [], 'hashtags...",,2,False,,,...,False,,,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",Donald Trump reads Top Ten Financial Tips on L...,False,"{'follow_request_sent': False, 'has_extended_p...",,,
52,,,2009-05-08 20:40:15,,"{'symbols': [], 'user_mentions': [], 'hashtags...",,27,False,,,...,False,,,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",New Blog Post: Celebrity Apprentice Finale and...,False,"{'follow_request_sent': False, 'has_extended_p...",,,
51,,,2009-05-12 14:07:28,,"{'symbols': [], 'user_mentions': [], 'hashtags...",,1950,False,,,...,False,,,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...","""My persona will never be that of a wallflower...",False,"{'follow_request_sent': False, 'has_extended_p...",,,
50,,,2009-05-12 19:21:55,,"{'symbols': [], 'user_mentions': [], 'hashtags...",,13,False,,,...,False,,,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...","Miss USA Tara Conner will not be fired - ""I've...",False,"{'follow_request_sent': False, 'has_extended_p...",,,
49,,,2009-05-13 17:38:28,,"{'symbols': [], 'user_mentions': [], 'hashtags...",,10,False,,,...,False,,,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",Listen to an interview with Donald Trump discu...,False,"{'follow_request_sent': False, 'has_extended_p...",,,
48,,,2009-05-14 16:30:40,,"{'symbols': [], 'user_mentions': [], 'hashtags...",,6,False,,,...,False,,,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...","""Strive for wholeness and keep your sense of w...",False,"{'follow_request_sent': False, 'has_extended_p...",,,
47,,,2009-05-15 14:13:13,,"{'symbols': [], 'user_mentions': [], 'hashtags...",,2,False,,,...,False,,,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...","Enter the ""Think Like A Champion"" signed book ...",False,"{'follow_request_sent': False, 'has_extended_p...",,,
46,,,2009-05-16 22:22:45,,"{'symbols': [], 'user_mentions': [], 'hashtags...",,5,False,,,...,False,,,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...","""When the achiever achieves, it's not a platea...",False,"{'follow_request_sent': False, 'has_extended_p...",,,


In [13]:
tweets.head()

Unnamed: 0,contributors,coordinates,created_at,display_text_range,entities,extended_entities,favorite_count,favorited,full_text,geo,...,retweeted,retweeted_status,scopes,source,text,truncated,user,withheld_copyright,withheld_in_countries,withheld_scope
0,,,2009-12-23 17:38:18,,"{'symbols': [], 'user_mentions': [], 'hashtags...",,12,False,,,...,False,,,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",From Donald Trump: Wishing everyone a wonderfu...,False,"{'follow_request_sent': False, 'has_extended_p...",,,
1,,,2009-12-03 19:39:09,,"{'symbols': [], 'user_mentions': [], 'hashtags...",,6,False,,,...,False,,,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",Trump International Tower in Chicago ranked 6t...,False,"{'follow_request_sent': False, 'has_extended_p...",,,
2,,,2009-11-26 19:55:38,,"{'symbols': [], 'user_mentions': [], 'hashtags...",,11,False,,,...,False,,,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",Wishing you and yours a very Happy and Bountif...,False,"{'follow_request_sent': False, 'has_extended_p...",,,
3,,,2009-11-16 21:06:10,,"{'symbols': [], 'user_mentions': [], 'hashtags...",,3,False,,,...,False,,,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",Donald Trump Partners with TV1 on New Reality ...,False,"{'follow_request_sent': False, 'has_extended_p...",,,
4,,,2009-11-02 14:57:56,,"{'symbols': [], 'user_mentions': [], 'hashtags...",,6,False,,,...,False,,,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...","--Work has begun, ahead of schedule, to build ...",False,"{'follow_request_sent': False, 'has_extended_p...",,,


In [8]:
news = pd.read_csv(NEWS_PATH)

In [9]:
coord = tweets.loc[tweets.coordinates.notnull()]['coordinates']

In [7]:
import folium

In [8]:
m = folium.Map(location=[40.7624658, -73.9754123], zoom_start=8)

In [9]:
m

In [10]:
def my_reverse(my_list):
    my_list.reverse()
    return my_list

In [11]:
def not_in_list(my_list, check):
    for lst in my_list:
        if(np.array_equal(check, lst)):
            return False
    return True

In [12]:
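def remove_similar_coord(coord_list, decimal):
    # Deduplicate GeoJSON points after rounding them to `decimal` places.
    # Start from an empty list so that no placeholder point ends up on the map.
    cleaned = []
    
    for el in coord_list:
        rounded = np.around(el['coordinates'], decimal)
        if not_in_list(cleaned, rounded):
            cleaned.append(rounded)
    
    return np.array(cleaned)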

In [13]:
cleaned = remove_similar_coord(coord, 2)
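As an alternative to the explicit loop, the same rounding-and-deduplication idea can be sketched with `np.unique` along `axis=0`. This is only an illustration under assumptions: the helper name `unique_rounded_points` is hypothetical, the points are given in GeoJSON order (`[longitude, latitude]`), and the third point is made up. Unlike the loop above, `np.unique` returns the points sorted rather than in first-seen order.

```python
import numpy as np


def unique_rounded_points(points, decimals=2):
    """Round [lon, lat] pairs and drop duplicate rows (result is sorted)."""
    rounded = np.round(np.asarray(points, dtype=float), decimals)
    return np.unique(rounded, axis=0)


# Two nearby Manhattan points in [lon, lat] order plus one made-up distant point.
demo_points = [
    [-73.975321, 40.7625069],
    [-73.9754123, 40.7624658],
    [-80.0369, 26.677],
]

unique_points = unique_rounded_points(demo_points)
print(unique_points)
```

On the real data the input could be built as `[c['coordinates'] for c in coord]`.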

In [19]:
for el in cleaned:
    # GeoJSON stores [longitude, latitude]; folium expects [latitude, longitude]
    folium.Marker(location=el[::-1].tolist()).add_to(m)

In [20]:
m

In [2]:
from geopy.geocoders import Nominatim

In [3]:
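# Recent geopy versions require an explicit user_agent; the name below is a placeholder
geolocator = Nominatim(user_agent='trump-tweets-eda')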

In [4]:
print(geolocator.reverse([40.7625069, -73.975321]))

712 Fifth Avenue, 712, 5th Avenue, Diamond District, Manhattan Community Board 5, New York County, NYC, New York, 10019, United States of America
