#### Hello everyone, and welcome to my first blog post! 

The purpose of this blog is to document my progress on a self-designed school project at the University of New Hampshire, where I will be examining [tweets](https://about.twitter.com/en_us/values/elections-integrity.html#data) identified as being part of the Russian disinformation campaign during the 2016 US presidential election. Twitter released this data in October 2018; it includes 1.24 GB of tweet information and 296 GB of image, GIF, video, and Periscope broadcast data.

Since this data was released, a number of reports have been published examining the methods of the Internet Research Agency, an organization run by a close Putin ally that is believed to be behind the disinformation campaign on various social media platforms, including Twitter. These reports include two commissioned by the Senate Intelligence Committee: [*The IRA Social Media and Political Polarization in the United States, 2012-2018*](https://int.nyt.com/data/documenthelper/534-oxford-russia-internet-research-agency/c6588b4a7b940c551c38/optimized/full.pdf#page=1) by the Computational Propaganda Research Project at the Oxford Internet Institute, and [*The Tactics and Tropes of the Internet Research Agency*](https://int.nyt.com/data/documenthelper/533-read-report-internet-research-agency/7871ea6d5b7bedafbf19/optimized/full.pdf#page=1) by New Knowledge. For the most part, I will be attempting to replicate the findings of these and other reports as I learn more about different data science techniques, including natural language processing, network analysis, and image processing.

With that said, let's get started with some exploratory data analysis (EDA)!

In [18]:
import numpy as np
import pandas as pd
import re
pd.set_option('display.max_columns', None)

import os
os.chdir('/Users/benjaminforleo/Box/spring_project/')

In [24]:
df = pd.read_csv('ira_tweets_csv_hashed.csv', low_memory = False)

In [29]:
print(df.shape)
print("")
print(df.columns)

(9041308, 31)

Index(['tweetid', 'userid', 'user_display_name', 'user_screen_name',
       'user_reported_location', 'user_profile_description',
       'user_profile_url', 'follower_count', 'following_count',
       'account_creation_date', 'account_language', 'tweet_language',
       'tweet_text', 'tweet_time', 'tweet_client_name', 'in_reply_to_tweetid',
       'in_reply_to_userid', 'quoted_tweet_tweetid', 'is_retweet',
       'retweet_userid', 'retweet_tweetid', 'latitude', 'longitude',
       'quote_count', 'reply_count', 'like_count', 'retweet_count', 'hashtags',
       'urls', 'user_mentions', 'poll_choices'],
      dtype='object')
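
Several of these columns are identifiers (`tweetid`, `retweet_tweetid`, `in_reply_to_tweetid`, and so on). When a numeric column contains missing values, pandas parses it as `float64` by default, and a 64-bit float cannot represent every 18-digit tweet ID exactly. One possible safeguard, sketched here on a tiny in-memory CSV with made-up IDs (not rows from the real dataset), is to read the ID columns as strings through the `dtype` argument of `pd.read_csv`:

```python
import io

import pandas as pd

# Made-up sample rows for illustration only; not taken from the IRA dataset.
SAMPLE_CSV = (
    "tweetid,retweet_tweetid,tweet_language\n"
    "877917212345678901,877917198765432109,ru\n"
    "492388766930444289,,en\n"
)

# Default parsing: the missing value forces retweet_tweetid to float64.
as_float = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Reading the ID columns as strings keeps every digit intact.
id_columns = ["tweetid", "retweet_tweetid"]
as_text = pd.read_csv(io.StringIO(SAMPLE_CSV), dtype={col: str for col in id_columns})

print(as_float["retweet_tweetid"].dtype)
print(as_text.loc[0, "retweet_tweetid"])
```

Whether this matters depends on the analysis: for counting tweets it makes no difference, but for joining retweets back to their originals the exact ID is needed.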


In [30]:
df.iloc[:,7:].head()

| | follower_count | following_count | account_creation_date | account_language | tweet_language | tweet_text | tweet_time | tweet_client_name | in_reply_to_tweetid | in_reply_to_userid | quoted_tweet_tweetid | is_retweet | retweet_userid | retweet_tweetid | latitude | longitude | quote_count | reply_count | like_count | retweet_count | hashtags | urls | user_mentions | poll_choices |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 132 | 120 | 2013-12-07 | ru | ru | RT @ruopentwit: ⚡️У НАС НОВОЕ ВИДЕО! Американе... | 2017-06-22 16:03 | TweetDeck | | | | True | 2572896396.0 | 8.779172e+17 | | | 0.0 | 0.0 | 0.0 | 0.0 | [] | [http://ru-open.livejournal.com/374284.html] | [2572896396] | |
| 1 | 74 | 8 | 2014-03-15 | en | ru | Серебром отколоколило http://t.co/Jaa4v4IFpM | 2014-07-24 19:20 | generationπ | | | | False | | | | | 0.0 | 0.0 | 0.0 | 0.0 | | [http://pyypilg33.livejournal.com/11069.html] | | |
| 2 | 165 | 454 | 2014-04-29 | en | bg | @kpru С-300 в Иране https://t.co/elnu3qLUW7 | 2016-04-11 09:20 | TweetDeck | 7.194399e+17 | 40807205.0 | | False | | | | | 0.0 | 0.0 | 0.0 | 0.0 | [] | [https://www.youtube.com/watch?v=9GvpImWxTJc] | [40807205] | |
| 3 | 165 | 454 | 2014-04-29 | en | ru | Предлагаю судить их за поддержку нацизма, т.к.... | 2014-11-22 15:28 | Twitter Web Client | | | | False | | | | | 0.0 | 0.0 | 0.0 | 0.0 | [STOPNazi] | | | |
| 4 | 4430 | 4413 | 2012-02-25 | ru | bg | Предостережение американского дипломата https:... | 2017-03-13 22:08 | Twitter Web Client | | | | False | | | | | 0.0 | 0.0 | 3.0 | 4.0 | [] | [https://goo.gl/fBp94X] | | |


It looks like many of these tweets are in Russian. For the natural language processing portion of the project, I'll work entirely with English-language tweets.
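
To go beyond eyeballing five rows, one option is to count tweets per `tweet_language` value. The sketch below uses a small made-up frame standing in for `df` (the `toy` name and its contents are assumptions for illustration); on the real data the same `value_counts` calls would be applied to `df.tweet_language`.

```python
import pandas as pd

# Toy stand-in for df; these language codes are invented for illustration.
toy = pd.DataFrame({"tweet_language": ["ru", "ru", "en", "bg", "ru", "en"]})

# Absolute number of tweets per language tag, most common first.
language_counts = toy["tweet_language"].value_counts()

# The same tally expressed as a fraction of all tweets.
language_share = toy["tweet_language"].value_counts(normalize=True)

print(language_counts)
print(language_share.round(2))
```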

In [8]:
en_tweets = df[df.tweet_language == 'en']

In [12]:
en_tweets.shape

(3261931, 31)
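
For a sense of scale, the two shapes reported so far imply that a little over a third of the tweets carry the `en` language tag. A quick check, using only the row counts printed above:

```python
total_tweets = 9_041_308    # rows in df, from df.shape above
english_tweets = 3_261_931  # rows in en_tweets, from en_tweets.shape above

# Fraction of all tweets whose tweet_language is 'en'.
english_share = english_tweets / total_tweets
print(f"{english_share:.1%}")  # prints 36.1%
```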

In [25]:
en_tweets.iloc[:,7:].head()

| | follower_count | following_count | account_creation_date | account_language | tweet_language | tweet_text | tweet_time | tweet_client_name | in_reply_to_tweetid | in_reply_to_userid | quoted_tweet_tweetid | is_retweet | retweet_userid | retweet_tweetid | latitude | longitude | quote_count | reply_count | like_count | retweet_count | hashtags | urls | user_mentions | poll_choices |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 696 | 863 | 2013-08-06 | en | en | As sun and cloud give way to moon and shadow, ... | 2015-02-16 16:19 | Twitter Web Client | | | | False | | | | | 0.0 | 0.0 | 0.0 | 0.0 | | | | |
| 10 | 103 | 218 | 2014-03-24 | en | en | Down in the comfort of strangers, I... | 2014-07-28 23:02 | vavilonX | | | | False | | | | | 0.0 | 0.0 | 0.0 | 0.0 | | | | |
| 11 | 103 | 218 | 2014-03-24 | en | en | Im laughing more than i should #USA | 2014-07-28 09:24 | vavilonX | | | | False | | | | | 0.0 | 0.0 | 0.0 | 0.0 | [USA] | | | |
| 12 | 103 | 218 | 2014-03-24 | en | en | No, I'm not saying I'm sorry | 2014-08-08 00:43 | vavilonX | | | | False | | | | | 0.0 | 0.0 | 0.0 | 0.0 | | | | |
| 32 | 63 | 77 | 2014-05-23 | en | en | Laugh it all off in your face | 2014-08-17 10:46 | vavilonX | | | | False | | | | | 0.0 | 0.0 | 0.0 | 0.0 | | | | |


Great! We've filtered down to tweets whose `tweet_language` is `en`. Let's see how many unique accounts are in this group.

In [27]:
print("Number of unique accounts with english text:", len(set(en_tweets.userid)))

Number of unique accounts with english text: 3259
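
A natural follow-up is how unevenly those tweets are spread across accounts. The sketch below runs on a made-up frame rather than the real `en_tweets` (the `toy_tweets` name and its user IDs are assumptions for illustration). It shows `nunique()` as an equivalent of `len(set(...))` when no IDs are missing, and `value_counts()` for tweets per account.

```python
import pandas as pd

# Toy stand-in for en_tweets; the user IDs are invented for illustration.
toy_tweets = pd.DataFrame({"userid": ["a1", "b2", "a1", "c3", "a1", "b2"]})

# Same count as len(set(...)) when no userid is missing.
n_accounts = toy_tweets["userid"].nunique()

# Tweets per account, most active first.
tweets_per_account = toy_tweets["userid"].value_counts()

print(n_accounts)
print(tweets_per_account.head())
```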


Interesting! In the coming weeks, I will be digging into natural language processing and trying to find patterns in the text posted by these accounts. 