## AMOD 5410 Big Data - Weekly Assignment \# 1
## Detecting Russian Twitter Bots

### By: Matt Emmons and Edgar Khachatryan

* Find a data source, write/use a tool that gathers up some data (e.g. scrapes Twitter or uses a Google API)
* Prepare somewhere between ½ and 1 page describing what your data is, and a simple hypothesis on why it might let us do something interesting

The goal of this dataset is to explore the tweets of known Russian Twitter 'trolls' and bots, looking at their keywords, tweet frequency, and other recurring tendencies. These accounts often pose as pro-right American individuals but pursue highly coordinated agendas, with the goal of spreading information or disinformation among other social media users. We focus on Twitter because it is easy to scrape the relevant tweets of known deviant users. Notably, Twitter itself released a new set of statistics from its internal investigation into Russian efforts to influence the 2016 presidential election, revealing that more than 50,000 automated accounts had links to Russian government ministries and Russia-linked organizations, specifically the IRA (Internet Research Agency).

The data will mainly consist of individual tweets from users who commonly participate in online discourse surrounding American politics, with a focus on talking points and hashtags related to current affairs. Twitter's API provides user-controlled geolocation information and the date and time of every tweet involved, including lists of users who liked, retweeted, and responded to questionable posts. The dataset can be expanded continually as long as Twitter's rate limit is not exceeded, but for now it will remain small while we prototype techniques for identifying trends.

In [2]:
import json
import datetime
import time
import os
import sys
import config as cfg
import pandas as pd
from pprint import pprint
from twython import Twython, TwythonError
from IPython.display import display

# authenticate with Twitter API
twitter = Twython(cfg.APP_KEY, cfg.APP_SECRET, oauth_version=2)
ACCESS_TOKEN = twitter.obtain_access_token()
twitter = Twython(cfg.APP_KEY, access_token = ACCESS_TOKEN)

# tweets storage file
CSV_FILE = "data/tweets.csv"

def get_rate_limit():
    '''Function that returns current Twitter API rate limit'''
    return twitter.get_application_rate_limit_status()['resources']['search']

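def load_tweets(file):
    '''Function that loads tweets from a file with one JSON object per line'''
    with open(file, 'r') as f:
        return [json.loads(line) for line in f if line.strip()]

def write_tweets(tweets, filename):
    '''
    Function that appends tweets to a file, one JSON object per line,
    so that repeated appends remain readable by load_tweets
    '''
    with open(filename, 'a') as f:
        for tweet in tweets:
            f.write(json.dumps(tweet) + '\n')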
            
def write_csv(data_frame, filename):
    ''' Function to write dataframe to CSV file'''
    data_frame.to_csv(filename, sep = ',', encoding = 'utf-8')
    
def read_csv(filename):
    '''Function that returns a dataframe read from filename'''
    return pd.read_csv(filename, header = 0, index_col = 0)
    
def get_tweets(user, num = 25):
    '''
    Function that retrieves up to _num_ recent tweets from user by username
    Returns a list of tweets (empty if the request fails)
    '''
    tweet_array = []
    try:
        user_timeline = twitter.get_user_timeline(
            screen_name = user,
            count = num
        )
        tweet_array.extend(user_timeline)
    except TwythonError as e:
        print("Error with {}, {}".format(user, e))
    return tweet_array
    
def delete_tweets_file(filename):
    '''
    Function to remove tweets file
    USE WITH CAUTION!
    '''
    os.remove(filename)

In [3]:
# TEMP: delete the tweets file
# delete_tweets_file(CSV_FILE)
get_rate_limit()

{'/search/tweets': {'limit': 450, 'remaining': 450, 'reset': 1517436107}}
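
The `reset` value in the output above appears to be a Unix timestamp marking when the request window refreshes. As a minimal sketch, assuming a status dictionary shaped like that output, the helper below works out how long a scraper would need to pause once `remaining` reaches zero. The name `seconds_until_reset` is hypothetical and not part of Twython, and the `now` values are made up for illustration.

```python
import time

def seconds_until_reset(status, endpoint='/search/tweets', now=None):
    '''
    Hypothetical helper: returns how many seconds to wait before calling
    endpoint again, given a dict shaped like the get_rate_limit() output.
    Returns 0 while requests remain in the current window.
    '''
    limits = status[endpoint]
    if limits['remaining'] > 0:
        return 0
    now = time.time() if now is None else now
    # 'reset' is assumed to be a Unix timestamp in seconds
    return max(0, int(limits['reset'] - now))

# sample values copied from the output above, with 'remaining' varied
fresh = {'/search/tweets': {'limit': 450, 'remaining': 450, 'reset': 1517436107}}
spent = {'/search/tweets': {'limit': 450, 'remaining': 0, 'reset': 1517436107}}

print(seconds_until_reset(fresh, now=1517435207))  # requests remain, no wait
print(seconds_until_reset(spent, now=1517435207))  # 15 minutes before reset
```

The timeline requests made by `get_tweets` are likely tracked under a different entry of the rate limit status than `/search/tweets`, so the `endpoint` argument is left configurable.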

In [4]:
# TODO: functionize this stuff
# known Russian Twitter agents
users = [
    'smartdissent',
    'SparkleSoup45',
    'bbusa617',
    'charlieJuliet',
    'ChrisFromWI',
    'SCroixFreePress',
    'wienerherzog2',
    'PeggyRuppe',
    'remleona',
    'Answers2b4u',
]

fields  = {
    'tweet_id':     [],
    'user_id':      [],
    'screen_name':  [],
    'created_at':   [],
    'text':         [],
}

# prevents casting user_id and tweet_id fields to float
df = pd.DataFrame(fields, dtype = int)

In [5]:
def add_tweets_to_df(tweet_array, fields_dict, data_frame):
    '''
    Function to add tweets to existing dataframe
    Drops duplicate tweets (same tweet_id)
    '''
    # build fresh lists so fields_dict does not accumulate rows between calls
    rows = {key: [] for key in fields_dict}
    for tweet in tweet_array:
        rows['tweet_id'].append(tweet['id'])
        rows['user_id'].append(tweet['user']['id'])
        rows['screen_name'].append(tweet['user']['screen_name'])
        rows['created_at'].append(tweet['created_at'])
        rows['text'].append(tweet['text'])

    temp_df = pd.DataFrame(rows)
    temp_df['created_at'] = pd.to_datetime(temp_df['created_at'])
    data_frame = pd.concat([temp_df, data_frame], ignore_index = True)
    # drop_duplicates returns a new dataframe, so the result must be reassigned
    data_frame = data_frame.drop_duplicates(subset = 'tweet_id')
    return data_frame.reset_index(drop = True)


for user in users:
    tweets = get_tweets(user, num = 20)
    # df = add_tweets_to_df(tweets, fields, df)  
# debug: this loop runs after the loop above, so it prints only the last user's raw tweets
for tweet in tweets:
    print(tweet)
    print()
# write_csv(df, CSV_FILE)

# pd.set_option('display.max_colwidth', None)  # None disables truncation in recent pandas
# display(df.dtypes)
# display(df.head(100))

{'created_at': 'Fri Jan 19 18:45:53 +0000 2018', 'id': 954424691447312384, 'id_str': '954424691447312384', 'text': '@SierraWhiskee @JeffFlake Blackmailed with #BlackmailedWithPedoPics', 'truncated': False, 'entities': {'hashtags': [{'text': 'BlackmailedWithPedoPics', 'indices': [43, 67]}], 'symbols': [], 'user_mentions': [{'screen_name': 'SierraWhiskee', 'name': '💋 SIᕮᖇᖇᗩ ᗯᕼISKᕮᕮ 💋', 'id': 379690254, 'id_str': '379690254', 'indices': [0, 14]}, {'screen_name': 'JeffFlake', 'name': 'Jeff Flake', 'id': 16056306, 'id_str': '16056306', 'indices': [15, 25]}], 'urls': []}, 'source': '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>', 'in_reply_to_status_id': 954415161900544000, 'in_reply_to_status_id_str': '954415161900544000', 'in_reply_to_user_id': 379690254, 'in_reply_to_user_id_str': '379690254', 'in_reply_to_screen_name': 'SierraWhiskee', 'user': {'id': 951917461309288448, 'id_str': '951917461309288448', 'name': 'Answers4U', 'screen_name': 'Answers2b4u
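
The raw tweet printed above carries more than the five columns kept in the dataframe. As a rough sketch, assuming tweets shaped like that output, the snippet below parses the `created_at` string into a timezone-aware `datetime` and pulls hashtags from the `entities` field. The `sample_tweet` dictionary is a hand-trimmed copy of fields visible above, and `summarize_tweet` is a hypothetical helper name.

```python
from datetime import datetime

# format of the 'created_at' strings seen in the raw output
TWITTER_TIME_FORMAT = '%a %b %d %H:%M:%S %z %Y'

def summarize_tweet(tweet):
    '''Hypothetical helper: returns creation time, hashtags and reply flag'''
    created = datetime.strptime(tweet['created_at'], TWITTER_TIME_FORMAT)
    entities = tweet.get('entities', {})
    hashtags = [tag['text'] for tag in entities.get('hashtags', [])]
    return {
        'created_at': created,
        'hashtags': hashtags,
        'is_reply': tweet.get('in_reply_to_status_id') is not None,
    }

# trimmed copy of the tweet shown above; other fields are omitted
sample_tweet = {
    'created_at': 'Fri Jan 19 18:45:53 +0000 2018',
    'id': 954424691447312384,
    'entities': {
        'hashtags': [{'text': 'BlackmailedWithPedoPics', 'indices': [43, 67]}],
    },
    'in_reply_to_status_id': 954415161900544000,
}

summary = summarize_tweet(sample_tweet)
print(summary['created_at'].isoformat())
print(summary['hashtags'], summary['is_reply'])
```

Reading hashtags from `entities` avoids re-parsing the tweet text, and the reply flag is one possible feature for separating original posts from replies.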

In [5]:
# load existing CSV and perform analysis
# df = read_csv(CSV_FILE)
# display(df.head(50))

| | created_at | screen_name | text | tweet_id | user_id |
|---|---|---|---|---|---|
| 0 | 2018-01-23 20:10:11 | smartdissent | NEW: @realDonaldTrump Restarted the #Sabotage ... | 955895456080519169 | 826982179204915200 |
| 1 | 2018-01-23 20:05:10 | smartdissent | NEW: Hidden between #Christmas &amp; #NewYears... | 955894196455530497 | 826982179204915200 |
| 2 | 2018-01-23 19:52:02 | smartdissent | RT @ddale8: Some more details: the man charged... | 955890889930485761 | 826982179204915200 |
| 3 | 2018-01-23 19:49:28 | smartdissent | RT @USATODAY: #BREAKING 2 dead, 17 injured in ... | 955890245744103424 | 826982179204915200 |
| 4 | 2018-01-23 19:48:37 | smartdissent | RT @kylegriffin1: Justin Trudeau says that Can... | 955890031524286465 | 826982179204915200 |
| 5 | 2018-01-23 19:46:59 | smartdissent | RT @kylegriffin1: James Comey was interviewed ... | 955889620654444544 | 826982179204915200 |
| 6 | 2018-01-23 19:40:11 | smartdissent | #SmartDissent is a Database tracking actions o... | 955887909151887361 | 826982179204915200 |
| 7 | 2018-01-23 19:40:11 | smartdissent | #SmartDissent is a Database tracking actions o... | 955887907679719424 | 826982179204915200 |
| 8 | 2018-01-23 19:38:32 | smartdissent | RT @nancyleong: I am sure GOP twitter would be... | 955887493257355265 | 826982179204915200 |
| 9 | 2018-01-23 19:10:08 | smartdissent | NEW: @realDonaldTrump Restarted the #Sabotage ... | 955880344573661185 | 826982179204915200 |
