# Capstone 2 Project - Russian Troll Tweets

### Initial tweet data exploration
First, I'll read the 13 tweet data files and combine them into a single dataset.

I will use the Tweets to explore questions about the nature of the disinformation campaign, such as:
* Did the tweets increase in frequency or volume around the time of major events? 
* Did other trolls retweet and amplify troll tweets?
* Can Twitter handles ('users') be clustered by shared features?
* Can common topics or themes be identified?
* What were the most-used hashtags?
* Did the tweets predominantly support one candidate or political party, or seek to undermine the other?


Header | Definition
-------|---------
`external_author_id` | An author account ID from Twitter 
`author` | The handle sending the tweet
`content` | The text of the tweet
`region` | A region classification, as [determined by Social Studio](https://help.salesforce.com/articleView?id=000199367&type=1)
`language` | The language of the tweet
`publish_date` | The date and time the tweet was sent
`harvested_date` | The date and time the tweet was collected by Social Studio
`following` | The number of accounts the handle was following at the time of the tweet
`followers` | The number of followers the handle had at the time of the tweet
`updates` | The number of “update actions” on the account that authored the tweet, including tweets, retweets and likes
`post_type` | Indicates if the tweet was a retweet or a quote-tweet
`account_type` | Specific account theme, as coded by Linvill and Warren
`retweet` | A binary indicator of whether or not the tweet is a retweet
`account_category` | General account theme, as coded by Linvill and Warren
`new_june_2018` | A binary indicator of whether the handle was newly listed in June 2018

In [2]:
import pandas as pd
import numpy as np

Consider dropping the label columns (`account_type`, `account_category`, `new_june_2018`) so I can perform my own classification?

In [3]:
#troll.head()
#troll.columns

In [4]:
#Done one time

# IRAhandle_tweets_1
#import glob, os
 
#os.chdir("../data")
#results = pd.DataFrame([])
 
#for counter, file in enumerate(glob.glob("IRAhandle_tweets*")):
#    namedf = pd.read_csv(file, skiprows=0)
#    results = results.append(namedf)
 
#results.to_csv('../data/all_IRAhandle_tweets.csv')
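The one-time combine step above relies on `DataFrame.append`, which was removed in pandas 2.0. A minimal sketch of the same idea with `pd.concat` follows. The helper name `combine_csvs` and the tiny stand-in files are assumptions made so the example runs without the real data. Writing with `index=False` also avoids the extra `Unnamed: 0` column on re-read.

```python
import glob
import os
import tempfile

import pandas as pd


def combine_csvs(pattern):
    """Read every CSV matching `pattern` and stack them into one DataFrame."""
    files = sorted(glob.glob(pattern))
    frames = [pd.read_csv(f) for f in files]
    # ignore_index=True rebuilds a clean 0..n-1 index across all files
    return pd.concat(frames, ignore_index=True)


# Tiny stand-in files (hypothetical contents) so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    for i in range(1, 4):
        part = pd.DataFrame({"author": [f"HANDLE_{i}"], "content": [f"tweet {i}"]})
        part.to_csv(os.path.join(tmp, f"IRAhandle_tweets_{i}.csv"), index=False)

    combined = combine_csvs(os.path.join(tmp, "IRAhandle_tweets_*.csv"))

    # index=False keeps the row index out of the file, so no 'Unnamed: 0'
    # column shows up when the combined file is read back in
    out_path = os.path.join(tmp, "all_IRAhandle_tweets.csv")
    combined.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(combined.shape)
print(list(reloaded.columns))
```

Building a list of frames and concatenating once is usually faster than growing a DataFrame inside the loop, since each append-style step copies the accumulated data.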

In [5]:
# Read in the dataset - large, ~3 million records. Consider Spark/AWS?
ivan = pd.read_csv('../data/all_IRAhandle_tweets.csv', encoding = "iso-8859-1", parse_dates = ['publish_date', 'harvested_date'])
print(ivan.shape)

# Need to chunk the data?
# https://towardsdatascience.com/why-and-how-to-use-pandas-with-large-data-9594dda2ea4c



(2946207, 22)
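On the chunking question: if memory becomes a problem, one option is to pass `chunksize` to `pd.read_csv` and aggregate as each piece arrives. The sketch below is only an illustration. The function name `count_by_author` and the in-memory sample are assumptions, not part of this project's code.

```python
import io
from collections import Counter

import pandas as pd


def count_by_author(csv_source, chunksize=100_000):
    """Tally tweets per author without loading the whole file at once."""
    totals = Counter()
    # With chunksize set, read_csv yields DataFrames of at most `chunksize` rows
    for chunk in pd.read_csv(csv_source, usecols=["author"], chunksize=chunksize):
        totals.update(chunk["author"].value_counts().to_dict())
    return pd.Series(totals).sort_values(ascending=False)


# In-memory stand-in for the real CSV (hypothetical rows)
sample = io.StringIO("author,content\nA,x\nB,y\nA,z\nA,w\n")
counts = count_by_author(sample, chunksize=2)
print(counts)
```

Reading only the columns needed (`usecols`) reduces memory further, and the same loop shape works for other per-chunk aggregates.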


## Explore, clean up data

In [24]:
ivan.dtypes

external_author_id            object
author                        object
content                       object
region                        object
language                      object
publish_date          datetime64[ns]
following                      int64
followers                      int64
updates                        int64
post_type                     object
account_type                  object
retweet                        int64
account_category              object
new_june_2018                  int64
alt_external_id               object
tweet_id                       int64
article_url                   object
tco1_step1                    object
tco2_step1                    object
tco3_step1                    object
dtype: object

In [25]:
ivan.head()

Unnamed: 0,external_author_id,author,content,region,language,publish_date,following,followers,updates,post_type,account_type,retweet,account_category,new_june_2018,alt_external_id,tweet_id,article_url,tco1_step1,tco2_step1,tco3_step1
0,906000000000000000,10_GOP,"""We have a sitting Democrat US Senator on tria...",Unknown,English,2017-10-01 19:58:00,1052,9636,253,,Right,0,RightTroll,0,905874659358453760,914580356430536707,http://twitter.com/905874659358453760/statuses...,https://twitter.com/10_gop/status/914580356430...,,
1,906000000000000000,10_GOP,Marshawn Lynch arrives to game in anti-Trump s...,Unknown,English,2017-10-01 22:43:00,1054,9637,254,,Right,0,RightTroll,0,905874659358453760,914621840496189440,http://twitter.com/905874659358453760/statuses...,https://twitter.com/damienwoody/status/9145685...,,
2,906000000000000000,10_GOP,Daughter of fallen Navy Sailor delivers powerf...,Unknown,English,2017-10-01 22:50:00,1054,9637,255,RETWEET,Right,1,RightTroll,0,905874659358453760,914623490375979008,http://twitter.com/905874659358453760/statuses...,https://twitter.com/10_gop/status/913231923715...,,
3,906000000000000000,10_GOP,JUST IN: President Trump dedicates Presidents ...,Unknown,English,2017-10-01 23:52:00,1062,9642,256,,Right,0,RightTroll,0,905874659358453760,914639143690555392,http://twitter.com/905874659358453760/statuses...,https://twitter.com/10_gop/status/914639143690...,,
4,906000000000000000,10_GOP,"19,000 RESPECTING our National Anthem! #StandF...",Unknown,English,2017-10-01 02:13:00,1050,9645,246,RETWEET,Right,1,RightTroll,0,905874659358453760,914312219952861184,http://twitter.com/905874659358453760/statuses...,https://twitter.com/realDonaldTrump/status/914...,,


In [26]:
ivan.columns

Index(['external_author_id', 'author', 'content', 'region', 'language',
       'publish_date', 'following', 'followers', 'updates', 'post_type',
       'account_type', 'retweet', 'account_category', 'new_june_2018',
       'alt_external_id', 'tweet_id', 'article_url', 'tco1_step1',
       'tco2_step1', 'tco3_step1'],
      dtype='object')

In [27]:
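# Drop the stray 'Unnamed: 0' column (likely the row index written out by to_csv).
# errors='ignore' makes the cell safe to re-run; without it, a second run
# raises KeyError: "['Unnamed: 0'] not found in axis".
ivan.drop(columns=['Unnamed: 0'], errors='ignore', inplace=True)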

In [None]:
# drop unnecessary "harvested_date"
ivan.drop("harvested_date",axis=1,inplace=True)

In [11]:
# Do some date formatting, set the election date
# edit code to work with DF, rather than list?
#import datetime
#election_day_datetime = datetime.datetime(2016, 11, 8)
#tweets = []
#for t in all_tweets:
#    date_array = t[5].split(' ')[0].split('/')
#    year = int(date_array[2])
#    month = int(date_array[0])
#    day = int(date_array[1])
#    if t[4] == 'English' and datetime.datetime(year, month, day) <= election_day_datetime:
#        tweets.append(t)
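The commented-out loop above filters a list of rows. A DataFrame version might look like the following sketch, which assumes `publish_date` has already been parsed as a datetime (as in the `read_csv` call earlier). The function name `english_pre_election` and the demo rows are made up for illustration.

```python
import pandas as pd

ELECTION_DAY = pd.Timestamp("2016-11-08")


def english_pre_election(df, cutoff=ELECTION_DAY):
    """Keep English tweets published on or before the cutoff date."""
    is_english = df["language"] == "English"
    # normalize() zeroes the time of day, so tweets sent at any hour
    # on election day itself are still kept (matches the date-only loop)
    on_or_before = df["publish_date"].dt.normalize() <= cutoff
    return df[is_english & on_or_before]


# Hypothetical rows standing in for the real data
demo = pd.DataFrame({
    "language": ["English", "English", "Russian", "English"],
    "publish_date": pd.to_datetime([
        "2016-11-08 19:58", "2017-10-01 22:43",
        "2016-01-05 10:00", "2015-12-13 22:52",
    ]),
    "content": ["a", "b", "c", "d"],
})

pre_election = english_pre_election(demo)
print(pre_election["content"].tolist())
```

A boolean mask avoids the manual string splitting and runs as a vectorized operation over the full column.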

In [None]:
ivan.describe(include="all")

In [None]:
# Check for/count NAs
ivan.isna().sum()

In [None]:
# count duplicate tweets
n_dupes = ivan.duplicated(subset="content").sum()
print(n_dupes, "duplicate tweets.")

In [15]:
# Drop duplicate tweets
ivan.drop_duplicates(subset="content",inplace=True)
print("Rows after dropping dupes:",ivan.shape[0])

Rows after dropping dupes: 2365553


## EDA

In [36]:
# how many unique authors?
# df.author.value_counts().shape[0]
ivan.author.nunique()

2808

### Authors
After dropping duplicates, the roughly 2.4 million remaining tweets were created by only 2,808 authors.

In [17]:
# get the number of tweets by the top 25 authors
ivan.author.value_counts().head(25)

EXQUOTE            59172
SCREAMYMONKEY      40644
WORLDNEWSPOLI      35154
AMELIEBALDWIN      34576
TODAYPITTSBURGH    31460
SPECIALAFFAIR      30802
SEATTLE_POST       29467
FINDDIET           29037
KANSASDAILYNEWS    28762
DAILYSANFRAN       27702
ROOMOFRUMOR        26285
JENN_ABRAMS        22261
CHICAGODAILYNEW    22066
COVFEFENATIONUS    21721
POLITICS_T0DAY     21360
RIAFANRU           20972
FUNDDIET           19992
BERLINBOTE         19546
TODAYNYCITY        18482
ONLINECLEVELAND    17658
WORLDOFHASHTAGS    17262
TODAYINSYRIA       15360
HYDDROX            15096
OLD_NEW_POLICY     14779
ARM_2_ALAN         14402
Name: author, dtype: int64

In [18]:
ivan.author.value_counts().head(25).sum()

634018

### Top Authors
The top 25 handles accounted for 634,018 tweets, about 27% of the de-duplicated total.

In [19]:
ivan['following'].describe()  

count    2.365553e+06
mean     3.858790e+03
std      5.934234e+03
min     -1.000000e+00
25%      4.090000e+02
50%      1.809000e+03
75%      5.199000e+03
max      7.621000e+04
Name: following, dtype: float64

In [20]:
ivan['followers'].describe()

count    2.365553e+06
mean     8.170018e+03
std      1.558168e+04
min     -1.000000e+00
25%      4.950000e+02
50%      1.794000e+03
75%      1.271000e+04
max      2.512760e+05
Name: followers, dtype: float64

In [21]:
# get followers/following avg count for Top 25
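One way to fill in this cell is sketched below. It is not a finished analysis: the helper name `follower_stats_for_top_authors` and the demo frame are assumptions. Because `followers` and `following` are recorded per tweet, the sketch averages them within each handle. The `describe()` output above shows a minimum of -1, which may be a placeholder value, and the sketch does not filter it out.

```python
import pandas as pd


def follower_stats_for_top_authors(df, n=25):
    """Mean followers/following per handle for the n most active handles."""
    top = df["author"].value_counts().head(n).index
    subset = df[df["author"].isin(top)]
    # Counts are stored on every tweet, so average them within each handle
    stats = subset.groupby("author")[["followers", "following"]].mean()
    # Assignment aligns on the author index
    stats["tweets"] = subset["author"].value_counts()
    return stats.sort_values("tweets", ascending=False)


# Hypothetical rows standing in for the real data
demo = pd.DataFrame({
    "author": ["A", "A", "A", "B", "B", "C"],
    "followers": [10, 20, 30, 100, 200, 5],
    "following": [1, 2, 3, 10, 20, 7],
})

top2 = follower_stats_for_top_authors(demo, n=2)
print(top2)
```

On the real data this would be called as `follower_stats_for_top_authors(ivan)`. Taking the mean of the resulting `followers` and `following` columns then gives a single average across the top 25.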

In [42]:
# Look at 'new_june_2018': what share of accounts were newly listed in June 2018?
new18 = ivan.loc[ivan['new_june_2018'] == 1]

In [35]:
new18.head()

Unnamed: 0,external_author_id,author,content,region,language,publish_date,following,followers,updates,post_type,account_type,retweet,account_category,new_june_2018,alt_external_id,tweet_id,article_url,tco1_step1,tco2_step1,tco3_step1
5118,1670773585,4EVER_SUSAN,#Raiders defense playing hungry .. Bending and...,United States,English,2015-12-13 22:52:00,76,59,9,RETWEET,Right,1,RightTroll,1,1670773585,676172915725897728,http://twitter.com/4ever_susan/statuses/676172...,,,
5119,1670773585,4EVER_SUSAN,Let's go offense !!!! Start Up the #Carr !!!! ...,United States,English,2015-12-13 22:52:00,76,59,11,RETWEET,Right,1,RightTroll,1,1670773585,676172972047003649,http://twitter.com/4ever_susan/statuses/676172...,,,
5120,1670773585,4EVER_SUSAN,I was shocked and heartbroken when @CBS cancel...,United States,English,2015-12-14 20:34:00,76,58,15,RETWEET,Right,1,RightTroll,1,1670773585,676500598980657152,http://twitter.com/4ever_susan/statuses/676500...,,,
5121,1670773585,4EVER_SUSAN,I used to call Eden Hazard 'overrated' as a jo...,United States,English,2015-12-14 20:34:00,76,58,17,RETWEET,Right,1,RightTroll,1,1670773585,676500644597858304,http://twitter.com/4ever_susan/statuses/676500...,,,
5122,1670773585,4EVER_SUSAN,The Holidays are in full swing. Need gift ide...,United States,English,2015-12-14 20:34:00,76,58,16,RETWEET,Right,1,RightTroll,1,1670773585,676500621457920000,http://twitter.com/4ever_susan/statuses/676500...,http://aol.it/1RIKcxc,,


In [23]:
# get number of rows with 'new_june_2018' == 1
new18['new_june_2018'].value_counts()

1    442088
Name: new_june_2018, dtype: int64

In [32]:
# Get number of unique authors
len(ivan['author'].unique())

2808

In [46]:
# get unique authors with ['new_june_2018'] == 1
#len(new18['author'].unique()) # 916
new18_handles = new18['author'].unique()

In [57]:
# Share of handles that were newly listed in June 2018
new_pct = new18['author'].nunique() / ivan['author'].nunique()
new_pct = round(new_pct, 2) * 100
print(new_pct,"% of users were 'new' as of June, 2018")

33.0 % of users were 'new' as of June, 2018


In [37]:
# get counts of account_type
# `account_type` - Specific account theme, as coded by Linvill and Warren
ivan['account_type'].value_counts()

Right         610470
Russian       495166
local         447342
Left          327502
news          131348
Commercial    121677
Hashtager     109743
German         80686
Italian        14858
?              11350
Koch            7675
Arabic          6201
French           976
Spanish          282
ZAPOROSHIA       108
Portuguese        98
Ebola             65
Ukranian           4
Uzbek              2
Name: account_type, dtype: int64