# Detecting Twitter Bots Using Machine Learning


### In this notebook we build a RF classifier with a different subset of features, and also implement a Multilayer Perceptron NN.

The predictions from this model will be fed into the final pipeline, presented in the Expert's Advice ipython notebook.


Anantha Natarajn Selvaganapathy<br/>
N16989511<br/>
ans599<br/>
http://ananth.co.in

The objective of this project is to use machine learning techniques to detect weather a given Twitter account is a bot or not. 

We will be using various machine learning algorithms and compare and analyze their predictions. We will also explore the use of deep learning techniques and compare their results with regression and classification algorithms. 



Before we begin, we import all the required libraries. We will be using sklearn for all the machine learning models, and pandas and numpy for data manipulation and cleaning.

We also use matplotlib and seaborn for plots.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib

from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.utils import shuffle
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer
import seaborn as sns


% matplotlib inline

We load the datasets into a pandas dataframe. The `describe` method gives us a quick glimpse of the dataset.

In [2]:
newbot_data = pd.read_csv('./training_data_2_csv_UTF.csv', encoding = "utf-8")

bot_data = pd.read_csv('bots_data.csv', encoding = "ISO-8859-1")
nonbot_data = pd.read_csv('nonbots_data.csv', encoding = "ISO-8859-1")

print("Bot data shape:", bot_data.shape)
print("Non bot data shape:", nonbot_data.shape)

print(nonbot_data.columns)

print("BOT DATA:")
print(bot_data.describe())

print("\nNON BOT DATA:")
print(nonbot_data.describe())

print("\nNEWBOT DATA:")
print(newbot_data.shape)


Bot data shape: (1056, 20)
Non bot data shape: (1176, 20)
Index(['id', 'id_str', 'screen_name', 'location', 'description', 'url',
       'followers_count', 'friends_count', 'listedcount', 'created_at',
       'favourites_count', 'verified', 'statuses_count', 'lang', 'status',
       'default_profile', 'default_profile_image', 'has_extended_profile',
       'name', 'bot'],
      dtype='object')
BOT DATA:
                 id  followers_count  friends_count   listedcount  \
count  1.056000e+03     1.056000e+03    1056.000000   1056.000000   
mean   2.527841e+17     1.557358e+04    1353.546402    129.678977   
std    3.709023e+17     2.402553e+05   12972.825548    890.896498   
min    2.425231e+06     0.000000e+00       0.000000      0.000000   
25%    2.434705e+09     1.075000e+01       1.000000      0.000000   
50%    3.029672e+09     9.950000e+01      14.000000     10.000000   
75%    7.562500e+17     6.012500e+02     259.250000     31.000000   
max    8.410000e+17     7.154703e+06  355

In [3]:
frames = [ newbot_data]
dataset = pd.concat(frames)

df1 = pd.concat(frames)
vectorizer = CountVectorizer(min_df=1)

from sklearn.feature_extraction.text import CountVectorizer
 
vectorizer = CountVectorizer(analyzer = "word",   \
                             tokenizer = None,    \
                             preprocessor = None, \
                             stop_words = None,   \
                             max_features = 5000) 

dataset['location'] = dataset['location'].fillna('none')
train_data_features = vectorizer.fit_transform(dataset['location'])

train_data_features = train_data_features.toarray()
train_data_features = pd.DataFrame(data=train_data_features[:,:],    
              index=train_data_features[:,0])


In [6]:
names = dataset['screen_name']


In [7]:
dataset['status'] = dataset['status'].apply(lambda x: int(len(str(x))))

dataset['screen_name'] = dataset['screen_name'].apply(lambda x: 'bot' in x.lower())
dataset['description'] = dataset['description'].apply(lambda x: 'bot' in str(x).lower())
dataset['name'] = dataset['name'].apply(lambda x: int('bot' in str(x).lower()))

dataset = shuffle(dataset)

randomForrest = RandomForestClassifier(n_estimators=100, 
                                       min_samples_split=5, 
                                       random_state=0)

X = dataset[['screen_name', 'description','followers_count', 'friends_count', 
             'listedcount', 'favourites_count', 'verified', 
             'statuses_count', 'status', 'default_profile', 
             'default_profile_image']]

y = dataset[['bot']]

x_testing = X[-200:]
y_testing = y[-200:]
X = X[:-200]
y = y[:-200]


In [8]:
predictX = pd.read_csv('./test_data_4_students.csv', encoding = "ISO-8859-1")


x_test = predictX[['screen_name', 'description','followers_count', 'friends_count', 
             'listed_count', 'favorites_count', 'verified', 
             'statuses_count', 'status', 'default_profile',
             'default_profile_image', 'name']]

x_test = x_test[:575]

x_test['screen_name'] = x_test['screen_name'].apply(lambda x: int('bot' in x.lower()))
x_test['description'] = x_test['description'].apply(lambda x: int('bot' in str(x).lower()))
x_test['listed_count'] = x_test['listed_count'].apply(lambda x: 0 if str(x) == "None" else int(x))
x_test['favorites_count'] = x_test['favorites_count'].apply(lambda x: 0 if str(x) == "None" else int(x))
x_test['followers_count'] = x_test['followers_count'].apply(lambda x: 0 if str(x) == "None" else int(x))
x_test['friends_count'] = x_test['friends_count'].apply(lambda x: 0 if str(x) == "None" else int(x))

x_test['verified'] = x_test['verified'].apply(lambda x: 1 if str(x) == "TRUE" else 0)
x_test['statuses_count'] = x_test['statuses_count'].apply(lambda x: int(x))
x_test['status'] = x_test['status'].apply(lambda x: int(len(str(x))))
x_test['default_profile'] = x_test['default_profile'].apply(lambda x: 1 if x == "TRUE" else 0)
x_test['default_profile_image'] = x_test['default_profile_image'].apply(lambda x: 1 if x == "TRUE" else 0)
x_test['name'] = x_test['name'].apply(lambda x: int('bot' in x.lower()))


In [9]:
#Gradient Boosted Classifier
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(n_estimators=200, learning_rate=0.99900, 
                                 max_depth=15, random_state=1, min_samples_split=12).fit(X.values, y.values.ravel())

print(clf.score(x_testing, y_testing))
# pred2 = clf.predict(x_test.values)
# ids = predictX['id'][:575]
# with open('ans599_3_usingnewdata.csv', 'w') as the_file:
#     the_file.write("id,bot\n")    
#     for i in range(575):
#         the_file.write(str(int(ids[i])) +","+ str(pred2[i])+"\n") 

0.855


In [10]:
randomForrest.fit(X.values, y.values.ravel())

# pred = randomForrest.predict(x_test.values)
# ids = predictX['id'][:575]
# with open('ans599_2_usingBothdata_useName.csv', 'w') as the_file:
#     the_file.write("id,bot\n")    
#     for i in range(575):
#         the_file.write(str(int(ids[i])) +","+ str(pred[i])+"\n") 

scores = cross_val_score(randomForrest, X.values, y.values.ravel())
print (scores.mean())

pred = randomForrest.predict(x_testing)
y_test = y_testing.values
# y_test = y_testing.ravel()
print(accuracy_score(y_testing, pred))

0.912200954279
0.88


#### Result

The EF classifier with ['screen_name', 'description','followers_count', 'friends_count', 
             'listed_count', 'favorites_count', 'verified', 
             'statuses_count', 'status', 'default_profile',
             'default_profile_image', 'name'] features gives us 0.91 - 0.92 cross validation accuracy.

# Multi-layer Perceptron regressor

This model optimizes the squared-loss using LBFGS or stochastic gradient descent.

I experimented with 15-25 layers with 33 neurons per layer.

In [4]:
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.neural_network import MLPRegressor

frames = [ newbot_data]
dataset = pd.concat(frames)

dataset['status'] = dataset['status'].apply(lambda x: int(len(str(x))))
dataset['status_hasBot'] = dataset['status'].apply(lambda x: 'bot' in str(x).lower())

dataset['screen_name'] = dataset['screen_name'].apply(lambda x: 'bot' in x.lower())
dataset['description'] = dataset['description'].apply(lambda x: 'bot' in str(x).lower())
dataset['name'] = dataset['name'].apply(lambda x: int('bot' in str(x).lower()))
dataset["has_extended_profile"] = dataset["has_extended_profile"].apply(lambda x:  0 if str(x) == "False" else 1)
dataset["has_extended_profile"].fillna(0)
dataset['location'] = dataset['location'].apply(lambda x: 1 if len(str(x)) > 0 else 0)

dataset = shuffle(dataset)


X = dataset[['screen_name', 'description','followers_count', 'friends_count', 
             'listedcount', 'favourites_count', 'verified', 'name',
             'statuses_count', 'status', 'default_profile', 'has_extended_profile', 'location', 'status_hasBot',
             'default_profile_image']]

y = dataset[['bot']]

x_testing = X[-500:]
y_testing = y[-500:]
X = X[:-500]
y_train = y[:-500]


scaler = StandardScaler()
# Fit only to the training data
scaler.fit(X)

X_train = scaler.transform(X)
X_test = scaler.transform(x_testing)

# X_train
mlp = MLPClassifier(hidden_layer_sizes=(33,33,33,33,33,33,33,33,33,33,33,33,33,33,33, 2))
mlp.fit(X_train,y_train.values.ravel())

predictions = mlp.predict(X_test)
print(confusion_matrix(y_testing,predictions))

print(classification_report(y_testing,predictions))

predictX = pd.read_csv('./test_data_4_students.csv', encoding = "ISO-8859-1")

x_test = predictX[['screen_name', 'description','followers_count', 'friends_count', 
             'listed_count', 'favorites_count', 'verified', 
             'statuses_count', 'status', 'default_profile', 'has_extended_profile', 'location', 'name',
             'default_profile_image']]


x_test = x_test[:575]
print(predictX.columns)
x_test['screen_name'] = x_test['screen_name'].apply(lambda x: int('bot' in str(x).lower()))
x_test['description'] = x_test['description'].apply(lambda x: int('bot' in str(x).lower()))
x_test['listed_count'] = x_test['listed_count'].apply(lambda x: 0 if str(x) == "None" else int(x))
x_test['favorites_count'] = x_test['favorites_count'].apply(lambda x: 0 if str(x) == "None" else int(x))
x_test['followers_count'] = x_test['followers_count'].apply(lambda x: 0 if str(x) == "None" else int(x))
x_test['friends_count'] = x_test['friends_count'].apply(lambda x: 0 if str(x) == "None" else int(x))

x_test['verified'] = x_test['verified'].apply(lambda x: 1 if str(x) == "TRUE" else 0)
x_test['statuses_count'] = x_test['statuses_count'].apply(lambda x: int(x))
x_test['status'] = x_test['status'].apply(lambda x: int(len(str(x))))
x_test['default_profile'] = x_test['default_profile'].apply(lambda x: 1 if x == "TRUE" else 0)
x_test['default_profile_image'] = x_test['default_profile_image'].apply(lambda x: 1 if x == "TRUE" else 0)
x_test['name'] = x_test['name'].apply(lambda x: int('bot' in str(x).lower()))
x_test['status_hasBot'] = x_test['status'].apply(lambda x: int('bot' in str(x).lower()))
x_test['location'] = x_test['location'].apply(lambda x: 1 if len(str(x)) > 0 else 0)

x_test["has_extended_profile"] = x_test["has_extended_profile"].apply(lambda x:  0 if str(x) == "False" else 1)
x_test["has_extended_profile"].fillna(0)

randomForrest = RandomForestClassifier(n_estimators=200, 
                                       min_samples_split=2, 
                                       random_state=1, max_depth=13)

randomForrest.fit(X, y_train.values.ravel())

scores = cross_val_score(randomForrest, X, y_train.values.ravel())
print (scores.mean())
predictions = randomForrest.predict(x_testing)
print(accuracy_score(y_testing, predictions))

x_test.dtypes
pred = randomForrest.predict(x_test)
ids = predictX['id'][:575]
with open('ans599__1_useAllFea.csv', 'w') as the_file:
    the_file.write("id,bot\n")    
    for i in range(575):
        the_file.write(str(int(ids[i])) +","+ str(pred[i])+"\n")
print("Wrote to file!")

[[282   0]
 [218   0]]
             precision    recall  f1-score   support

          0       0.56      1.00      0.72       282
          1       0.00      0.00      0.00       218

avg / total       0.32      0.56      0.41       500

Index(['id', 'id_str', 'screen_name', 'location', 'description', 'url',
       'followers_count', 'friends_count', 'listed_count', 'created_at',
       'favorites_count', 'verified', 'statuses_count', 'lang', 'status',
       'default_profile', 'default_profile_image', 'has_extended_profile',
       'name', 'bot'],
      dtype='object')


  'precision', 'predicted', average, warn_for)


0.909442140651
0.922
Wrote to file!


In [172]:
dataset.columns

Index(['id', 'id_str', 'screen_name', 'location', 'description', 'url',
       'followers_count', 'friends_count', 'listedcount', 'created_at',
       'favourites_count', 'verified', 'statuses_count', 'lang', 'status',
       'default_profile', 'default_profile_image', 'has_extended_profile',
       'name', 'bot', 'status_hasBot'],
      dtype='object')

In [22]:
old_result = pd.read_csv('ans599_1.csv')
new_result = pd.read_csv('ans599_2_tunePrams_addLocNameUrl_idLen.csv')
new2_result = pd.read_csv('ans599_1_tunePrams.csv')
allf_result = pd.read_csv('ans599__1_useAllFea.csv')
name_result = pd.read_csv('ans599_2_usingBothdata_useName.csv')

# merged = old_result.merge(new_result, indicator=True, how='outer')
# merged[merged['_merge'] == 'right_only']
a = old_result["bot"]
b = new_result["bot"]
c = new2_result["bot"]
d = allf_result["bot"]
e = name_result["bot"]

for i in range(575):
    vote = a[i]+b[i]+c[i]+d[i]+e[i]
    vb = 0 if vote <= 2 else 1
    if vb != b[i]:
        print( str(i)+ "  old: "+str(a[i])+" new: "+str(b[i])+ " tune: "+ str(c[i])+ " allf: "+ str(d[i]) +"  vote:"+str(vote))
print(sum(a))
print(sum(b))
print(sum(c))
print(sum(d))

102  old: 1 new: 0 tune: 1 allf: 0  vote:3
103  old: 0 new: 1 tune: 0 allf: 1  vote:2
144  old: 0 new: 1 tune: 0 allf: 0  vote:1
159  old: 1 new: 0 tune: 1 allf: 1  vote:3
281  old: 0 new: 1 tune: 0 allf: 0  vote:1
288  old: 1 new: 0 tune: 1 allf: 1  vote:4
299  old: 1 new: 0 tune: 1 allf: 1  vote:4
401  old: 0 new: 1 tune: 0 allf: 0  vote:1
441  old: 1 new: 0 tune: 1 allf: 1  vote:4
483  old: 1 new: 1 tune: 0 allf: 0  vote:2
493  old: 0 new: 1 tune: 0 allf: 0  vote:1
494  old: 0 new: 1 tune: 0 allf: 0  vote:1
495  old: 0 new: 1 tune: 0 allf: 1  vote:2
496  old: 0 new: 1 tune: 0 allf: 1  vote:2
537  old: 0 new: 1 tune: 0 allf: 1  vote:2
539  old: 0 new: 1 tune: 0 allf: 0  vote:1
557  old: 0 new: 1 tune: 0 allf: 0  vote:1
261
270
252
283


## Understanding the differences in models

We can see from above that the predictions match for more than 95% data points.

We will be using the output of this model, and other generated models along with agressive tuning to come to a vote. This approach is called Expert's Advice, other also known as Randomized weighted majority algorithm.

https://en.wikipedia.org/wiki/Randomized_weighted_majority_algorithm