## AMOD 5410 Big Data - Weekly Assignment \# 1
## Detecting Russian Twitter Bots

### By: Matt Emmons and Edgar Khachatryan

* Find a data source, write/use a tool that gathers up some data (e.g. scrapes Twitter or uses a Google API)
* Prepare somewhere between ½ and 1 page describing what your data is, plus a simple hypothesis about why there might be something interesting we can do with it

The goal of this dataset is to examine the tweets of certain known Russian Twitter 'trolls' and bots, looking at their keywords, tweet frequency, and recurring tendencies. These accounts often pose as right-leaning American individuals but pursue highly coordinated agendas, the goal being to spread information or disinformation among other social media users. We focus on Twitter because tweets from known deviant users are easy to scrape. Interestingly, Twitter itself released a new set of statistics from its internal investigation into Russian efforts to influence the 2016 Presidential election, revealing that more than 50,000 automated accounts have links to Russian government ministries and Russia-linked organizations, specifically the IRA (Internet Research Agency).

The data will mainly be individual tweets from users who commonly participate in online discourse surrounding American politics, with a focus on talking points and hashtags related to current affairs. Twitter's API provides user-controlled geolocation information and the date and time of every tweet, including lists of users who liked, retweeted, and responded to questionable posts. The dataset can be expanded continually as long as Twitter's rate limit is not exceeded, but for now it will remain small while we prototype techniques for identifying trends.

In [84]:
import json
import datetime
import time
import os
import sys
import config as cfg
import pandas as pd
from pprint import pprint
from twython import Twython, TwythonError
from IPython.display import display

# authenticate with Twitter API
twitter = Twython(cfg.APP_KEY, cfg.APP_SECRET, oauth_version=2)
ACCESS_TOKEN = twitter.obtain_access_token()
twitter = Twython(cfg.APP_KEY, access_token = ACCESS_TOKEN)

# tweets storage file
CSV_FILE = "data/tweets.csv"

def get_rate_limit():
    '''Function that returns current Twitter API rate limit'''
    return twitter.get_application_rate_limit_status()['resources']['search']

def load_tweets(file):
    '''Function that loads tweets from JSON file'''
    with open(file, 'r') as f:
        tweets = json.load(f)
        return tweets

def write_tweets(tweets, filename):
    ''' Function that appends tweets to a file. '''
    with open(filename, 'a') as f:
        json.dump(tweets, f)
            
def write_csv(data_frame, filename):
    ''' Function to write dataframe to CSV file'''
    data_frame.to_csv(filename, sep = ',', encoding = 'utf-8')
    
def read_csv(filename):
    '''Function that returns a dataframe read from filename'''
    return pd.read_csv(filename, header = 0, index_col = 0)
    
def get_tweets(user, num = 25):
    '''
    Function that retrieves _num_ tweets from user by username
    Returns an array of tweets
    '''
    tweet_array = []
    try:
        user_timeline = twitter.get_user_timeline(
            screen_name = user,
            count = num
        )
        for tweets in user_timeline:
            tweet_array.append(tweets)
    except TwythonError as e:
        print("Error with {}, {}".format(user, e))
    return tweet_array
    
def delete_tweets_file(filename):
    '''
    Function to remove tweets file
    USE WITH CAUTION!
    '''
    os.remove(filename)
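
One caveat with `write_tweets` above: calling `json.dump` repeatedly in append mode writes several JSON documents back to back, and `json.load` cannot read such a file in one call. A minimal sketch of one alternative, storing one tweet per line (JSON Lines), is shown below. The names `append_tweets_jsonl` and `load_tweets_jsonl` and the toy tweet dictionaries are hypothetical and not part of the notebook.

```python
import json
import os
import tempfile

def append_tweets_jsonl(tweets, filename):
    '''Append each tweet as one JSON document per line (hypothetical helper).'''
    with open(filename, 'a', encoding='utf-8') as f:
        for tweet in tweets:
            f.write(json.dumps(tweet) + '\n')

def load_tweets_jsonl(filename):
    '''Read back every non-empty line as one tweet (hypothetical helper).'''
    with open(filename, 'r', encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]

# Toy data standing in for real API responses
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'tweets.jsonl')
    append_tweets_jsonl([{'id': 1, 'text': 'first'}], path)
    append_tweets_jsonl([{'id': 2, 'text': 'second'}, {'id': 3, 'text': 'third'}], path)
    loaded = load_tweets_jsonl(path)

print(len(loaded))
```

Because every line is a complete JSON document, the file stays readable no matter how many times it is appended to.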

In [44]:
# TEMP: delete the stored tweets file
# delete_tweets_file(CSV_FILE)
get_rate_limit()

{'/search/tweets': {'limit': 450, 'remaining': 450, 'reset': 1516729203}}
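
The `reset` value in this output is a Unix timestamp marking when the request window reopens. As a rough sketch of how a collection loop could use it, the hypothetical helper `seconds_until_reset` below computes how long to pause once `remaining` reaches zero. The `exhausted` dictionary is made up for illustration. Note also that `get_rate_limit()` reports the search resource, while `get_tweets()` uses the user timeline endpoint, which is probably tracked under a different resource key; that is an assumption to check against Twitter's documentation.

```python
import time

def seconds_until_reset(limit_info, now=None):
    '''Return seconds to wait before the next request (0 if calls remain).'''
    if limit_info['remaining'] > 0:
        return 0
    current = time.time() if now is None else now
    return max(0, limit_info['reset'] - current)

# Shaped like the value under '/search/tweets' in the output above
search_limit = {'limit': 450, 'remaining': 450, 'reset': 1516729203}
# Hypothetical state after the quota has been used up
exhausted = {'limit': 450, 'remaining': 0, 'reset': 1516729203}

# A fixed "now" keeps the example deterministic
wait_when_available = seconds_until_reset(search_limit, now=1516729000)
wait_when_exhausted = seconds_until_reset(exhausted, now=1516729000)
print(wait_when_available, wait_when_exhausted)
```

In a real loop the result would be passed to `time.sleep()` before the next request.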

In [62]:
# TODO: functionize this stuff
# known Russian Twitter agents
users = [
    'smartdissent',
    'SparkleSoup45',
    'bbusa617',
    'charlieJuliet',
    'ChrisFromWI',
    'SCroixFreePress',
    'wienerherzog2',
    'PeggyRuppe',
    'remleona',
    'Answers2b4u',
]

fields  = {
    'tweet_id':     [],
    'user_id':      [],
    'screen_name':  [],
    'created_at':   [],
    'text':         [],
}

# prevents casting user_id and tweet_id fields to float
df = pd.DataFrame(fields, dtype = int)
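
The dtype comment above matters more than it may seem. Tweet and user IDs are 64-bit integers that exceed what a float64 can represent exactly, so a silent cast to float changes the ID. The short sketch below illustrates this with an ID taken from this notebook's sample output; it uses plain Python rather than pandas. Keeping the columns as integers, or storing IDs as strings, avoids the problem.

```python
# Tweet ID copied from the sample output in this notebook
tweet_id = 955852472173711361

# A float64 has a 53-bit significand, so not every integer above 2**53 fits
via_float = int(float(tweet_id))
via_string = int(str(tweet_id))

print(tweet_id > 2**53)        # True
print(via_float == tweet_id)   # False: the low digits were rounded away
print(via_string == tweet_id)  # True
```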

In [90]:
def add_tweets_to_df(tweet_array, fields_dict, data_frame):
    '''
    Function to add tweets to existing dataframe
    Drops duplicate tweets (matched on tweet_id)
    '''
    # build fresh columns on every call: appending to the shared fields_dict
    # would re-add every previously collected tweet
    rows = {key: [] for key in fields_dict}
    for tweet in tweet_array:
        rows['tweet_id'].append(tweet['id'])
        rows['user_id'].append(tweet['user']['id'])
        rows['screen_name'].append(tweet['user']['screen_name'])
        rows['created_at'].append(tweet['created_at'])
        rows['text'].append(tweet['text'])
    temp_df = pd.DataFrame(rows)
    # parse timestamps before concatenating so the column never mixes types
    temp_df['created_at'] = pd.to_datetime(temp_df['created_at'])
    data_frame = pd.concat([data_frame, temp_df], ignore_index = True)
    # drop_duplicates returns a new dataframe, so the result must be kept
    data_frame = data_frame.drop_duplicates(subset = 'tweet_id')
    return data_frame.reset_index(drop = True)


for user in users:
    tweets = get_tweets(user, num = 20)
    df = add_tweets_to_df(tweets, fields, df)  

write_csv(df, CSV_FILE)

# pd.set_option('display.max_colwidth', None)
# display(df.dtypes)
# display(df.head(100))

In [92]:
# load existing CSV and perform analysis
df = read_csv(CSV_FILE)
display(df.head(50))

| Unnamed: 0 | created_at | screen_name | text | tweet_id | user_id |
|---|---|---|---|---|---|
| 0.0 | 2018-01-23 17:19:22 | smartdissent | RT @nytpolitics: You probably don’t realize just how much influence Nafta has on your daily life — even the products we think of as quintes… | 955852472173711361 | 826982179204915200 |
| 1.0 | 2018-01-23 17:18:50 | smartdissent | RT @NAACP: One of the many daughters of the civil rights movement! Ms. Fannie Lou Hamer. https://t.co/k8qDS1z0Dk | 955852334793416704 | 826982179204915200 |
| 2.0 | 2018-01-23 17:18:42 | smartdissent | RT @kylegriffin1: The Trump administration is reportedly waiving dozens of environmental regulations to speed up construction of the border… | 955852304414109696 | 826982179204915200 |
| 3.0 | 2018-01-23 17:10:12 | smartdissent | NEW: @realDonaldTrump Restarted the #Sabotage of #ObamaCare in early January https://t.co/S3sVvbzLKd #SmartDissent… https://t.co/gzH6KIvNoG | 955850162819293184 | 826982179204915200 |
| 4.0 | 2018-01-23 17:05:12 | smartdissent | NEW: Hidden between #Christmas &amp; #NewYearsEve, @realDonaldTrump's Admin Granted Anti-Environment Favors to… https://t.co/oKZu043Goj | 955848906998472705 | 826982179204915200 |
| 5.0 | 2018-01-23 16:57:16 | smartdissent | RT @thehill: Member of Trump’s opioid commission calls the commission a “sham": https://t.co/6oR2MZLZZv https://t.co/i0D8nLB3Tm | 955846907724779520 | 826982179204915200 |
| 6.0 | 2018-01-23 16:52:44 | smartdissent | RT @tedlieu: We knew Mueller was investigating potential conspiracy between @realDonaldTrump officials &amp; Russia. Dept of Justice spokespers… | 955845769248083968 | 826982179204915200 |
| 7.0 | 2018-01-23 16:52:19 | smartdissent | RT @KamalaHarris: Retaliatory raids are an abuse of power, which is why Senator @DianneFeinstein and I called on ICE to detail how their ra… | 955845663702573056 | 826982179204915200 |
| 8.0 | 2018-01-23 16:51:52 | smartdissent | RT @NAACP: We need members now more than ever! #NAACP #Vote #Membership https://t.co/hE1yOf187x | 955845551848902658 | 826982179204915200 |
| 9.0 | 2018-01-23 16:50:10 | smartdissent | Every single @HouseGOP &amp; @HouseDemocrats seat and 33 Senate seats are up for election THIS NOVEMBER 6, 2018, 9 mont… https://t.co/2z3sTC5sU2 | 955845122683437056 | 826982179204915200 |
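
As a possible first step toward the keyword and frequency analysis described in the introduction, the sketch below counts hashtags and measures the share of retweets in a list of tweet texts. It uses only the standard library; the helper names `count_hashtags` and `retweet_share` are hypothetical, and the three sample texts are copied from the output above. In the notebook, the list could instead be built from the `text` column of `df`.

```python
import re
from collections import Counter

# A hashtag is '#' followed by word characters
HASHTAG_RE = re.compile(r'#(\w+)')

def count_hashtags(texts):
    '''Count hashtags case-insensitively across all texts (hypothetical helper).'''
    counts = Counter()
    for text in texts:
        counts.update(tag.lower() for tag in HASHTAG_RE.findall(text))
    return counts

def retweet_share(texts):
    '''Fraction of texts that start with the retweet prefix "RT @".'''
    if not texts:
        return 0.0
    return sum(text.startswith('RT @') for text in texts) / len(texts)

sample_texts = [
    'NEW: @realDonaldTrump Restarted the #Sabotage of #ObamaCare in early January https://t.co/S3sVvbzLKd #SmartDissent… https://t.co/gzH6KIvNoG',
    'RT @NAACP: We need members now more than ever! #NAACP #Vote #Membership https://t.co/hE1yOf187x',
    'RT @NAACP: One of the many daughters of the civil rights movement! Ms. Fannie Lou Hamer. https://t.co/k8qDS1z0Dk',
]

hashtag_counts = count_hashtags(sample_texts)
share = retweet_share(sample_texts)
print(hashtag_counts.most_common(3))
print(round(share, 2))
```

Lower-casing the tags treats `#NAACP` and `#naacp` as the same keyword, which is a design choice rather than a requirement.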
