## AMOD 5410 Big Data - Weekly Assignment #1
## Detecting Russian Twitter Bots

### By: Matt Emmons and Edgar Khackatryan

* Find a data source, write/use a tool that gathers up some data (e.g. scrapes Twitter or uses a Google API)
* Prepare somewhere between ½ and 1 page describing what your data is, along with a simple hypothesis on why it might contain something interesting we can work with

This dataset collects the tweets of known Russian Twitter 'trolls' and bots so that we can examine their keywords, tweet frequency, and other tendencies. These accounts often pose as right-leaning American individuals but follow highly coordinated agendas, with the goal of spreading information or disinformation among other social media users. We focus on Twitter because tweets from known deviant users are easy to scrape. Twitter itself has released statistics from its internal investigation into Russian efforts to influence the 2016 presidential election, revealing that more than 50,000 automated accounts have links to Russian government ministries and Russia-linked organizations, specifically the IRA (Internet Research Agency).

The data will consist mainly of individual tweets from users who commonly participate in online discourse surrounding American politics, with a focus on talking points and hashtags related to current affairs. Twitter's API provides user-controlled geolocation information and the date and time of every tweet, including lists of users who liked, retweeted, and responded to questionable posts. The dataset can be expanded continually as long as Twitter's rate limit is not exceeded, but for now it will remain small while we prototype techniques for identifying trends.
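
To make the fields mentioned above concrete, here is a minimal sketch of pulling them out of a single tweet. It assumes the tweet is a dictionary shaped like a Twitter API v1.1 status object; `summarize_tweet` and the sample values are hypothetical and not part of the notebook. It only reads like and retweet counts, not lists of users.

```python
def summarize_tweet(tweet):
    """Pick out the fields of interest from one tweet dictionary."""
    user = tweet.get("user") or {}
    place = tweet.get("place") or {}  # geolocation is optional and user-controlled
    return {
        "screen_name": user.get("screen_name"),
        "created_at": tweet.get("created_at"),
        "text": tweet.get("text"),
        "place": place.get("full_name"),
        "coordinates": tweet.get("coordinates"),
        "retweet_count": tweet.get("retweet_count", 0),
        "favorite_count": tweet.get("favorite_count", 0),
    }


# Hypothetical sample; a real tweet object carries many more keys.
sample_tweet = {
    "created_at": "Mon Jan 22 23:25:04 +0000 2018",
    "text": "#SmartDissent #MLKDay Week In Review",
    "user": {"screen_name": "example_user"},
    "place": None,
    "coordinates": None,
    "retweet_count": 4,
    "favorite_count": 2,
}

summary = summarize_tweet(sample_tweet)
print(summary)
```

Using `.get` throughout keeps the helper from failing on tweets that have no location attached.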

In [1]:
import json
import datetime
import time
import os
import sys
import config as cfg
import pandas as pd
from twython import Twython, TwythonError
from IPython.display import display

# authenticate with Twitter API
twitter = Twython(cfg.APP_KEY, cfg.APP_SECRET, oauth_version=2)
ACCESS_TOKEN = twitter.obtain_access_token()
twitter = Twython(cfg.APP_KEY, access_token = ACCESS_TOKEN)

# tweets storage file
JSON_FILE = "data/tweets.json"    

# returns current twitter rate limit information
def get_rate_limit():
    '''Function that returns current Twitter API rate limit'''
    return twitter.get_application_rate_limit_status()['resources']['search']

# def load_tweets(file, skip = 0):
#     '''Function that loads tweets from JSON file'''
#     with open(file, 'r') as f:
#         tweets = (json.loads(line) for i, line in enumerate(f.readlines()) if i%skip==0)
#     return tweets

def load_tweets(file):
    '''Function that loads tweets from JSON file'''
    with open(file, 'r') as f:
        tweets = json.load(f)
        return tweets

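def write_tweets(tweets, filename):
    '''Function that appends tweets to the JSON list stored in a file.'''
    # Appending a second json.dump to the same file would produce invalid JSON,
    # so merge with any existing list and rewrite the whole file instead.
    existing = load_tweets(filename) if os.path.isfile(filename) else []
    with open(filename, 'w') as f:
        json.dump(existing + tweets, f)
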
def get_tweets(user, num = 25):
    '''
    Function that retrieves _num_ tweets from user by username
    Returns a list of tweets (empty if the request fails)
    '''
    try:
        user_timeline = twitter.get_user_timeline(
            screen_name = user,
            count = num
        )
    except TwythonError as e:
        # without this early return, user_timeline would be undefined below
        print(e)
        return []

    return list(user_timeline)
    
def delete_tweets_file(filename):
    '''
    Function to remove tweets file
    USE WITH CAUTION!
    '''
    os.remove(filename)
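
The rate limit information can be used to decide how long to pause before the next request. The sketch below is one possible way to do that. It assumes the dictionary returned by `get_rate_limit()` follows the Twitter API v1.1 layout, keyed by endpoint path with `limit`, `remaining`, and `reset` (a Unix timestamp) for each entry. `seconds_until_reset` and the example values are hypothetical.

```python
import time


def seconds_until_reset(search_limits, endpoint="/search/tweets", now=None):
    """Return how many seconds to wait before calling `endpoint` again.

    Returns 0.0 while requests remain in the current rate limit window.
    """
    now = time.time() if now is None else now
    status = search_limits[endpoint]
    if status["remaining"] > 0:
        return 0.0
    return float(max(0, status["reset"] - now))


# Hypothetical status: the window is exhausted and resets 600 seconds from `now`.
example_limits = {
    "/search/tweets": {"limit": 450, "remaining": 0, "reset": 1516665600},
}
wait = seconds_until_reset(example_limits, now=1516665000)
print(wait)
```

In the notebook, the result could be passed to `time.sleep` before the next call.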

In [2]:
# TEMP: delete the tweets.json file
# delete_tweets_file(JSON_FILE)

In [3]:
# known russian twitter agents
users = [
    'smartdissent',
    'TEN_GOP',
    'SparkleSoup45',
    'MariaBartiromo',
    'antischool_ftw',
    'OPWolverines',
    '55true4u',
    'bbusa617',
    'charlieJuliet',
    'ChrisFromWI',
    'SCroixFreePress',
    'wienerherzog2',
    'PeggyRuppe',
    'remleona',
    'Answers2b4u',
]


# get tweets and write to JSON
tweets_memory = get_tweets(users[0], num = 3)

# only create tweets.json file if it doesn't exist
if not os.path.isfile(JSON_FILE):
    write_tweets(tweets_memory, JSON_FILE)    
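
The cell above fetches tweets for the first account only. A rough sketch of extending it to the whole list follows. It assumes each tweet dictionary carries an `id_str` key, as Twitter API v1.1 status objects do. `collect_tweets` and `fake_fetch` are hypothetical names; in the notebook, `get_tweets` could be passed as `fetch`.

```python
import time


def collect_tweets(users, fetch, per_user=25, pause_seconds=1.0, sleep=time.sleep):
    """Fetch tweets for every user, pausing between users and skipping duplicates."""
    seen_ids = set()
    collected = []
    for index, user in enumerate(users):
        if index > 0:
            sleep(pause_seconds)  # stay well under the rate limit
        for tweet in fetch(user, per_user):
            if tweet["id_str"] not in seen_ids:
                seen_ids.add(tweet["id_str"])
                collected.append(tweet)
    return collected


def fake_fetch(user, num):
    """Stand-in for get_tweets so the sketch runs without network access."""
    return [{"id_str": f"{user}-{i}", "text": f"tweet {i} from {user}"} for i in range(num)]


tweets = collect_tweets(["user_a", "user_b"], fake_fetch, per_user=2, pause_seconds=0)
print(len(tweets))
```

Passing the fetch function as an argument keeps the loop testable offline and separates pacing from the API call itself.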

In [4]:
# TODO: functionize this stuff

# data fields we wish to extract into a dataframe from JSON
# see data/example.json for output example
fields  = {
    'created_at': [],
    'text':       [],
}

tweets_file = load_tweets(JSON_FILE)

for tweet in tweets_file:
    fields['created_at'].append(tweet['created_at'])
    fields['text'].append(tweet['text'])
    
df = pd.DataFrame(fields)

# Convert created_at to datetimes
df['created_at'] = pd.to_datetime(df['created_at'])

display(df.head())

|   | created_at | text |
|---|---|---|
| 0 | 2018-01-22 23:25:04 | #SmartDissent #MLKDay Week In Review (3/3): Tr... |
| 1 | 2018-01-22 23:15:10 | #SmartDissent #MLKDay Week In Review (2/3): #S... |
| 2 | 2018-01-22 23:05:08 | #SmartDissent #MLKDay Week In Review (1/3): Tr... |
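
As a first step toward exploring keywords and tweet frequency, the sketch below counts hashtags and tweets per hour using only the standard library. It assumes `created_at` uses the Twitter API v1.1 timestamp layout. `hashtag_counts`, `tweets_per_hour`, and the shortened sample tweets are hypothetical and not part of the notebook.

```python
import re
from collections import Counter
from datetime import datetime

# Twitter API v1.1 timestamp layout, e.g. "Mon Jan 22 23:25:04 +0000 2018"
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"
HASHTAG_PATTERN = re.compile(r"#(\w+)")


def hashtag_counts(tweets):
    """Count hashtags, ignoring case, across a list of tweet dictionaries."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tag.lower() for tag in HASHTAG_PATTERN.findall(tweet["text"]))
    return counts


def tweets_per_hour(tweets):
    """Count tweets by the hour of day (0-23) in which they were posted."""
    counts = Counter()
    for tweet in tweets:
        created = datetime.strptime(tweet["created_at"], TWITTER_TIME_FORMAT)
        counts[created.hour] += 1
    return counts


# Hypothetical, shortened sample data for illustration only.
sample_tweets = [
    {"created_at": "Mon Jan 22 23:25:04 +0000 2018", "text": "#SmartDissent #MLKDay Week In Review (3/3)"},
    {"created_at": "Mon Jan 22 23:15:10 +0000 2018", "text": "#SmartDissent #MLKDay Week In Review (2/3)"},
    {"created_at": "Mon Jan 22 23:05:08 +0000 2018", "text": "#SmartDissent #MLKDay Week In Review (1/3)"},
]

tags = hashtag_counts(sample_tweets)
hours = tweets_per_hour(sample_tweets)
print(tags.most_common(5))
print(sorted(hours.items()))
```

Posts spaced at regular intervals, like the ten-minute gaps in the output above, are the kind of pattern an hourly or per-minute count could help surface once the dataset grows.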
