# Botometer Processing Notebook

We used a Python package called Botometer to help with our analysis. Botometer is a tool developed by researchers at the Indiana University Network Science Institute (IUNI) and the Center for Complex Networks and Systems Research (CNetS). Scores can be read as percentages. Each score is the estimated probability that a Twitter account is a bot: the closer a score is to 0, the more likely the account is human, and the closer it is to 1, the more likely it is a bot. According to the Botometer website, the “probability calculation uses Bayes’ theorem to take into account an estimate of the overall prevalence of bots, so as to balance false positives with false negatives” (https://botometer.iuni.iu.edu/#!/faq#what-is-cap). For more information, see Maninder's blog post about Botometer: https://medium.com/@m.virk1/botometer-eac76a270516.
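
To make the Bayes' theorem quote concrete, here is a minimal sketch of how an assumed bot prevalence changes the probability that a flagged account really is a bot. The function name and all three input numbers are invented for illustration; this is not Botometer's actual computation or its actual parameters.

```python
def posterior_bot_probability(prevalence, true_positive_rate, false_positive_rate):
    """P(bot | flagged) from Bayes' theorem."""
    flagged_and_bot = true_positive_rate * prevalence
    flagged_and_human = false_positive_rate * (1 - prevalence)
    return flagged_and_bot / (flagged_and_bot + flagged_and_human)

# Hypothetical classifier that flags 90% of bots and 10% of humans.
rare = posterior_bot_probability(0.05, 0.90, 0.10)    # bots assumed to be 5% of accounts
common = posterior_bot_probability(0.50, 0.90, 0.10)  # bots assumed to be 50% of accounts
print(round(rare, 3), round(common, 3))  # 0.321 0.9
```

With the same classifier, a flagged account is far less likely to be a bot when bots are assumed to be rare, which is why the prevalence estimate matters for balancing false positives against false negatives.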

### Contents:
- [Reading in and Inspecting Data](#Reading-in-and-Inspecting-Data)
- [Getting the Botometer Running](#Getting-the-Botometer-Running)
- [Making a Usable DataFrame from Botometer Data](#Making-a-Usable-DataFrame-from-Botometer-Data)
- [Merging, Inspecting, and Preparing the DataFrame](#Merging,-Inspecting,-and-Preparing-the-DataFrame)
- [Prepping Data for NLP Classification Modeling](#Prepping-Data-for-NLP-Classification-Modeling)

In [1]:
# Importing packages needed for Data Cleaning and EDA
import pandas as pd 
import matplotlib.pyplot as plt
import botometer

### Reading in and Inspecting Data

In [2]:
# Reading in my preprocessed csv to pandas
twitter = pd.read_csv('./data/data_for_bots.csv')

In [3]:
# Checking the shape of my dataframe 
twitter.shape

(28479, 10)

In [4]:
# Seeing what my dataframe looks like 
twitter.head()

Unnamed: 0,date,text,username,id,link,tweet_to,times_retweeted,times_favorited,mentions,hashtags
0,2019-10-10 23:12:29+00:00,California Wildfires &amp; Power Outage - LIVE...,@mary_tanasy,1182433749511921670,https://twitter.com/mary_tanasy/status/1182433...,_,0,0,@YouTube,_
1,2019-11-02 21:48:05+00:00,#California #Wildfires Signal the Arrival of a...,@desirablefuture,1190747430104522752,https://twitter.com/desirablefuture/status/119...,_,0,0,_,#California #Wildfires
2,2019-10-26 22:18:34+00:00,California prepares for biggest blackout yet h...,@oneconcerninc,1188218386666401793,https://twitter.com/oneconcerninc/status/11882...,_,0,1,_,#california #wildfires #calfire
3,2019-11-02 22:56:01+00:00,THANK YOU FIREFIGHTERS: Dramatic photos illust...,@renewalof48,1190764526494502913,https://twitter.com/renewalof48/status/1190764...,_,0,0,_,_
4,2019-11-09 15:40:52+00:00,Trump says he has ordered FEMA to cut off fund...,@GabyDore,1193191734701776896,https://twitter.com/GabyDore/status/1193191734...,_,0,0,_,_


In [5]:
# Seeing how many unique user names there are in my dataframe 
twitter['username'].nunique()

18765

### Getting the Botometer Running 

In [6]:
# Putting my usernames in a list for processing in the botometer 
username_list = twitter['username'].tolist()
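
The dataframe has 28,479 rows but only 18,765 unique usernames, so the list above contains repeats. If API calls are scarce, one option is to check each account once. The helper below is a hypothetical sketch (`unique_handles` is not part of the notebook) that drops repeats while keeping first-seen order. Note that the index-based merge later in this notebook relies on one result per row, so a de-duplicated list would need a merge on `username` instead.

```python
def unique_handles(usernames):
    """Drop repeated handles, keeping the first occurrence of each."""
    seen = set()
    handles = []
    for name in usernames:
        key = name.lower()  # Twitter handles are case-insensitive
        if key not in seen:
            seen.add(key)
            handles.append(name)
    return handles

# Made-up sample in the same '@handle' form as the username column
sample = ['@mary_tanasy', '@GabyDore', '@mary_tanasy', '@gabydore']
deduped = unique_handles(sample)
print(deduped)  # ['@mary_tanasy', '@GabyDore']
```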

In [7]:
# Placeholder Twitter API credentials and RapidAPI key; replace them with your own before instantiating Botometer
rapidapi_key = "XXXXXXXXXXX" # Botometer now calls this the RapidAPI key
twitter_app_auth = {
    'consumer_key': 'XXXXXXXXXX',
    'consumer_secret': 'XXXXXXXXXX',
    'access_token': 'XXXXXXXXXX',
    'access_token_secret': 'XXXXXXXXXX',
}
bom = botometer.Botometer(wait_on_ratelimit=True,
                          rapidapi_key=rapidapi_key,
                          **twitter_app_auth)
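
Rather than pasting keys into the notebook, the credentials could be read from environment variables. This is only a sketch: the environment variable names and the `load_credentials` helper are hypothetical, and the demo uses a stand-in mapping so it runs without real keys.

```python
import os

# Hypothetical variable names; use whatever names you export in your shell.
ENV_NAMES = {
    'consumer_key': 'TWITTER_CONSUMER_KEY',
    'consumer_secret': 'TWITTER_CONSUMER_SECRET',
    'access_token': 'TWITTER_ACCESS_TOKEN',
    'access_token_secret': 'TWITTER_ACCESS_TOKEN_SECRET',
}

def load_credentials(env=None):
    """Return (rapidapi_key, twitter_app_auth) read from environment variables."""
    env = os.environ if env is None else env
    wanted = dict(ENV_NAMES, rapidapi_key='RAPIDAPI_KEY')
    missing = [var for var in wanted.values() if not env.get(var)]
    if missing:
        raise KeyError('Missing environment variables: ' + ', '.join(sorted(missing)))
    twitter_app_auth = {field: env[var] for field, var in ENV_NAMES.items()}
    return env['RAPIDAPI_KEY'], twitter_app_auth

# Demo with a stand-in mapping instead of the real environment
demo_env = {var: 'XXXXXXXXXX' for var in list(ENV_NAMES.values()) + ['RAPIDAPI_KEY']}
rapidapi_key, twitter_app_auth = load_credentials(demo_env)
print(sorted(twitter_app_auth))
```

The returned values have the same shape as `rapidapi_key` and `twitter_app_auth` in the cell above, so they can be passed to `botometer.Botometer` in the same way.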

In [59]:
# Check a sequence of accounts
results = []    
accounts = username_list
for screen_name, result in bom.check_accounts_in(accounts):
    results.append(result)
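
Checking tens of thousands of accounts in one loop is slow and loses everything if the run is interrupted. One way to work through the list in slices is sketched below. The `batches` helper and the batch size of 1000 are assumptions for illustration, and the usernames are stand-ins so the sketch runs without API access.

```python
def batches(items, size):
    """Yield (start_index, chunk) pairs covering `items` in order."""
    for start in range(0, len(items), size):
        yield start, items[start:start + size]

usernames = ['user{}'.format(i) for i in range(2500)]  # stand-in for username_list
for start, chunk in batches(usernames, 1000):
    # In the notebook, this is where bom.check_accounts_in(chunk) would run,
    # followed by saving that chunk's results before moving on.
    print(start, len(chunk))
```

Keeping the start index alongside each chunk makes it possible to line the results back up with the matching rows of the original dataframe.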

In [60]:
# Checking the length of my results to make sure I got what I was expecting 
len(results)

1000

### Making a Usable DataFrame from Botometer Data

In [61]:
# Taking my result list and making it into a dataframe called users_and_scores
# Going through a series of pandas steps to reduce the dataframe to just the username and bot rating
users_and_scores = pd.DataFrame(results)
# Slicing the score out of the stringified 'cap' dict (assumes it starts with "{'english': ", which is 12 characters)
users_and_scores['cap'] = users_and_scores['cap'].astype(str)
users_and_scores['bot_rating'] = users_and_scores['cap'].str.slice(12,30)
# Pulling the screen name out of the stringified 'user' dict and stripping the leftover punctuation
users_and_scores['user'] = users_and_scores['user'].astype(str)
users_and_scores['user'] = [data.split('screen_name')[-1] for data in users_and_scores['user']]
for char in ["'", " ", ":", "}"]:
    users_and_scores['user'] = users_and_scores['user'].str.replace(char, "", regex=False)
users_and_scores['username'] = users_and_scores['user']
users_and_scores = users_and_scores.drop(columns=['cap', 'categories', 'display_scores', 'scores', 'user', 'error'])
users_and_scores['bot_rating'] = pd.to_numeric(users_and_scores['bot_rating'], errors='coerce')
users_and_scores.head()

Unnamed: 0,bot_rating,username
0,0.001419,charlenecolem15
1,0.001828,AndreaCarlaSM
2,0.011259,Livingstrong67
3,0.00208,BoboSalish
4,0.014544,JstJayne
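
The string slicing above depends on the exact text layout of each result, and a score with fewer digits would pull in part of the next key and be coerced to a null. A less fragile alternative is to read the values straight from each result dict before building the dataframe. The sketch below assumes each successful result has a `cap` dict with an `english` entry and a `user` dict with a `screen_name` entry (inferred from the column names and the slice offset used above), and that failed lookups carry an `error` key. The sample values, including the error message, are made up.

```python
def extract_score(result):
    """Return (screen_name, English CAP score) for one result, or (None, None)."""
    if 'error' in result:
        return None, None
    screen_name = result.get('user', {}).get('screen_name')
    cap = result.get('cap', {}).get('english')
    return screen_name, cap

# Made-up results in the assumed layout
sample_results = [
    {'cap': {'english': 0.0014, 'universal': 0.0021},
     'user': {'id_str': '1', 'screen_name': 'example_user'}},
    {'error': 'lookup failed'},
]
rows = [extract_score(r) for r in sample_results]
print(rows)  # [('example_user', 0.0014), (None, None)]
```

Passing `rows` to `pd.DataFrame(rows, columns=['username', 'bot_rating'])` would give a two-column frame with `bot_rating` already numeric, keeping one row per checked account.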


In [62]:
# Checking the shape of my dataframe 
users_and_scores.shape

(1000, 2)

In [64]:
# Making a dataframe called twitter2 with the same indexing as I used on my username list
# Resetting the index and removing the '@' from the username
twitter2 = twitter
twitter2 = twitter2.reset_index()
twitter2['username'] = twitter2['username'].str.replace('@', '')
twitter2.head()

Unnamed: 0,index,date,text,username,id,link,tweet_to,times_retweeted,times_favorited,mentions,hashtags
0,4000,2019-10-27 00:04:41+00:00,#californiawildfires,charlenecolem15,1188245093079105536,https://twitter.com/charlenecolem15/status/118...,_,0,1,_,#californiawildfires
1,4001,2019-11-04 14:44:37+00:00,Exactly. Get your shit together .@realDonaldTr...,AndreaCarlaSM,1191365638628839424,https://twitter.com/AndreaCarlaSM/status/11913...,latimes,0,1,@realDonaldTrump,#MakeAmericaCompetentAgain #MondayMood #Califo...
2,4002,2019-11-04 01:55:20+00:00,Did I miss Nancy and Kamala's press conference...,Livingstrong67,1191172044441956352,https://twitter.com/Livingstrong67/status/1191...,_,16,28,_,_
3,4003,2019-10-31 03:02:09+00:00,#californiawildfires,BoboSalish,1189739305515618306,https://twitter.com/BoboSalish/status/11897393...,OC_Scanner,0,0,_,#californiawildfires
4,4004,2019-11-04 19:32:56+00:00,Trump threatens to pull federal aid for Califo...,JstJayne,1191438196602744834,https://twitter.com/JstJayne/status/1191438196...,_,0,0,@NBCNews,_


### Merging, Inspecting, and Preparing the DataFrame

In [65]:
# Merging my dataframe on the index, also doing .head to make sure the usernames match on both sides 
twitter_bots = twitter2.merge(users_and_scores, left_index=True, right_index=True)
twitter_bots.head()

Unnamed: 0,index,date,text,username_x,id,link,tweet_to,times_retweeted,times_favorited,mentions,hashtags,bot_rating,username_y
0,4000,2019-10-27 00:04:41+00:00,#californiawildfires,charlenecolem15,1188245093079105536,https://twitter.com/charlenecolem15/status/118...,_,0,1,_,#californiawildfires,0.001419,charlenecolem15
1,4001,2019-11-04 14:44:37+00:00,Exactly. Get your shit together .@realDonaldTr...,AndreaCarlaSM,1191365638628839424,https://twitter.com/AndreaCarlaSM/status/11913...,latimes,0,1,@realDonaldTrump,#MakeAmericaCompetentAgain #MondayMood #Califo...,0.001828,AndreaCarlaSM
2,4002,2019-11-04 01:55:20+00:00,Did I miss Nancy and Kamala's press conference...,Livingstrong67,1191172044441956352,https://twitter.com/Livingstrong67/status/1191...,_,16,28,_,_,0.011259,Livingstrong67
3,4003,2019-10-31 03:02:09+00:00,#californiawildfires,BoboSalish,1189739305515618306,https://twitter.com/BoboSalish/status/11897393...,OC_Scanner,0,0,_,#californiawildfires,0.00208,BoboSalish
4,4004,2019-11-04 19:32:56+00:00,Trump threatens to pull federal aid for Califo...,JstJayne,1191438196602744834,https://twitter.com/JstJayne/status/1191438196...,_,0,0,@NBCNews,_,0.014544,JstJayne


In [66]:
# Doing tail to make sure the usernames match on both sides
twitter_bots.tail()

Unnamed: 0,index,date,text,username_x,id,link,tweet_to,times_retweeted,times_favorited,mentions,hashtags,bot_rating,username_y
995,4995,2019-10-30 14:48:47+00:00,My appearance on Fox & Friends this morning: D...,ChuckDeVore,1189554748074135554,https://twitter.com/ChuckDeVore/status/1189554...,garysteveneaton,3,3,_,_,0.001725,ChuckDeVore
996,4996,2019-10-30 09:53:25+00:00,California wildfires: What you need to know - ...,KoltovskoyYakov,1189480414802522114,https://twitter.com/KoltovskoyYakov/status/118...,_,0,0,_,_,0.286869,KoltovskoyYakov
997,4997,2019-11-04 13:30:13+00:00,'No more': Trump says he'll cut off federal fu...,puffandwhit,1191346916321300482,https://twitter.com/puffandwhit/status/1191346...,_,0,0,_,_,0.085904,puffandwhit
998,4998,2019-10-29 20:34:47+00:00,"""Oh Maria, you've been missing from Twitter fo...",mariakzurek,1189279434148384770,https://twitter.com/mariakzurek/status/1189279...,_,1,50,@BerkeleyLab,#PSPS,0.024702,mariakzurek
999,4999,2019-11-06 14:31:13+00:00,And you can be sure that the people who make t...,IMMAlab,1192087042185928704,https://twitter.com/IMMAlab/status/11920870421...,JMPyper,0,0,_,#ClimateChange #CaliforniaWildfires,0.00208,IMMAlab
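
Looking at `.head()` and `.tail()` only confirms the first and last five rows. Because the merge is positional, a quick way to check every row is to compare the two username columns after merging. The frames below are tiny made-up stand-ins for `twitter2` and `users_and_scores`, not the real data.

```python
import pandas as pd

# Made-up stand-ins; the last row is deliberately misaligned
left = pd.DataFrame({'username': ['a', 'b', 'c'], 'text': ['t1', 't2', 't3']})
right = pd.DataFrame({'username': ['a', 'b', 'x'], 'bot_rating': [0.1, 0.2, 0.3]})

merged = left.merge(right, left_index=True, right_index=True)

# The shared column name is suffixed to username_x and username_y, as in the notebook
mismatched = merged[merged['username_x'] != merged['username_y']]
print(len(mismatched))  # 1
```

An empty `mismatched` frame would mean every score is attached to the account it was computed for.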


In [67]:
# Checking the shape 
twitter_bots.shape

(1000, 13)

In [68]:
# Dropping unnecessary columns, renaming others, and dropping null values
twitter_bots = twitter_bots.drop(columns=['username_y', 'id', 'link', 'index'])
twitter_bots = twitter_bots.rename(columns={"username_x": "username"})
twitter_bots.dropna(inplace=True)
twitter_bots.head()

Unnamed: 0,date,text,username,tweet_to,times_retweeted,times_favorited,mentions,hashtags,bot_rating
0,2019-10-27 00:04:41+00:00,#californiawildfires,charlenecolem15,_,0,1,_,#californiawildfires,0.001419
1,2019-11-04 14:44:37+00:00,Exactly. Get your shit together .@realDonaldTr...,AndreaCarlaSM,latimes,0,1,@realDonaldTrump,#MakeAmericaCompetentAgain #MondayMood #Califo...,0.001828
2,2019-11-04 01:55:20+00:00,Did I miss Nancy and Kamala's press conference...,Livingstrong67,_,16,28,_,_,0.011259
3,2019-10-31 03:02:09+00:00,#californiawildfires,BoboSalish,OC_Scanner,0,0,_,#californiawildfires,0.00208
4,2019-11-04 19:32:56+00:00,Trump threatens to pull federal aid for Califo...,JstJayne,_,0,0,@NBCNews,_,0.014544


In [69]:
# Checking the shape after nulls dropped
twitter_bots.shape

(996, 9)

In [70]:
# Looking at the info 
twitter_bots.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 996 entries, 0 to 999
Data columns (total 9 columns):
date               996 non-null object
text               996 non-null object
username           996 non-null object
tweet_to           996 non-null object
times_retweeted    996 non-null int64
times_favorited    996 non-null int64
mentions           996 non-null object
hashtags           996 non-null object
bot_rating         996 non-null float64
dtypes: float64(1), int64(2), object(6)
memory usage: 77.8+ KB


### Prepping Data for NLP Classification Modeling 

In [71]:
# Making one column for text variables, dropping the original columns, and replacing underscores with a space
# (the underscore is the placeholder for empty hashtags, mentions, and tweet_to values)
twitter_bots['words'] = twitter_bots['username'] + ' ' + twitter_bots['hashtags'] + ' ' + twitter_bots['text'] + ' ' + twitter_bots['mentions'] + ' ' + twitter_bots['tweet_to']
twitter_bots.drop(columns=['username', 'text', 'hashtags', 'mentions', 'tweet_to'], inplace=True)
twitter_bots['words'] = twitter_bots['words'].str.replace('_', ' ')
twitter_bots.head()

Unnamed: 0,date,times_retweeted,times_favorited,bot_rating,words
0,2019-10-27 00:04:41+00:00,0,1,0.001419,charlenecolem15 #californiawildfires #californ...
1,2019-11-04 14:44:37+00:00,0,1,0.001828,AndreaCarlaSM #MakeAmericaCompetentAgain #Mond...
2,2019-11-04 01:55:20+00:00,16,28,0.011259,Livingstrong67 Did I miss Nancy and Kamala's...
3,2019-10-31 03:02:09+00:00,0,0,0.00208,BoboSalish #californiawildfires #californiawil...
4,2019-11-04 19:32:56+00:00,0,0,0.014544,JstJayne Trump threatens to pull federal aid...
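
Replacing every underscore also splits usernames that contain one (for example `mary_tanasy` becomes `mary tanasy`). If the intent is only to clear the `_` placeholder, a regular expression that matches stand-alone underscores is one option. The `blank_placeholders` helper is a hypothetical sketch, not part of the notebook.

```python
import re

def blank_placeholders(text):
    """Replace stand-alone '_' tokens with a space, leaving names like 'mary_tanasy' intact."""
    # (?<!\S) and (?!\S) require whitespace or a string boundary on both sides
    return re.sub(r'(?<!\S)_(?!\S)', ' ', text)

cleaned = blank_placeholders('mary_tanasy _ some text @YouTube _')
print(cleaned.split())
```

On a pandas column the same pattern could be applied with `twitter_bots['words'].str.replace(r'(?<!\S)_(?!\S)', ' ', regex=True)`.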


In [74]:
# Dropping duplicates 
twitter_bots = twitter_bots.drop_duplicates()

In [75]:
# Checking out the nulls and object types 
twitter_bots.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 28108 entries, 0 to 30290
Data columns (total 5 columns):
date               28108 non-null object
times_retweeted    28108 non-null int64
times_favorited    28108 non-null int64
bot_rating         28106 non-null object
words              28108 non-null object
dtypes: int64(2), object(3)
memory usage: 1.3+ MB


In [76]:
# Checking the shape of my dataframe 
twitter_bots.shape

(28108, 5)

In [77]:
# Dropping null values
twitter_bots.dropna(inplace=True)

In [78]:
# Making sure all the bot_ratings are numeric, since I made them strings to manipulate the dataframe 
twitter_bots['bot_rating'] = pd.to_numeric(twitter_bots['bot_rating'], errors='coerce')

In [79]:
twitter_bots.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 28106 entries, 0 to 30290
Data columns (total 5 columns):
date               28106 non-null object
times_retweeted    28106 non-null int64
times_favorited    28106 non-null int64
bot_rating         28105 non-null float64
words              28106 non-null object
dtypes: float64(1), int64(2), object(2)
memory usage: 1.3+ MB


In [80]:
# Seeing how my data looks one last time before saving it to a csv
twitter_bots.head()

Unnamed: 0,date,times_retweeted,times_favorited,bot_rating,words
0,2019-10-28 16:04:00+00:00,0,1,0.005364,AMPMUZIC #CaliforniaFires #californiawildfires...
1,2019-11-12 03:06:00+00:00,2,1,0.014544,"dwatchnews nam Rebirth, angst and the 'new n..."
2,2019-11-03 20:10:28+00:00,0,0,0.036578,WaterSolarWind Trump melts down on Pelosi du...
3,2019-10-26 08:48:42+00:00,2,2,0.097414,BombayHeadlines #CaliforniaWildfire #californi...
4,2019-11-02 21:57:37+00:00,1,1,0.008751,studentveronica California Wildfires Signal ...


In [81]:
# Checking the shape one last time 
twitter_bots.shape

(28106, 5)

In [83]:
# Saving my mega dataframe to a csv
twitter_bots.to_csv('./data/twitter_preprocessed_all.csv', index=False)