# Detecting Fake Accounts

**Analysis of the account details showed that known spam-bot accounts share many similarities with fake accounts. Using this data, we now train a FAKE+SPAM vs GENUINE account classifier, to be used later for Social Network Analysis.**


In [1]:
import pandas as pd

In [2]:
useful_colums = [
    "statuses_count", "followers_count", "friends_count", "favourites_count",
    "listed_count", "default_profile", "profile_banner_url",
    "profile_background_tile", "profile_background_color", "verified",
]

## Load Data

In [3]:
df_fake = pd.read_csv("data_mib/fake_followers.csv/fake_followers.csv/users.csv")
df_spam_1 = pd.read_csv("data_mib/social_spambots_1.csv/users.csv")
df_spam_2 = pd.read_csv("data_mib/social_spambots_2.csv/users.csv")
df_spam_3 = pd.read_csv("data_mib/social_spambots_3.csv/users.csv")
df_genuine = pd.read_csv("data_mib/genuine_accounts.csv/users.csv")

In [5]:
df_fake.mean(numeric_only=True)

id                              7.442362e+08
statuses_count                  7.189824e+01
followers_count                 1.774038e+01
friends_count                   3.700597e+02
favourites_count                4.299612e+00
listed_count                    7.311250e-02
default_profile                 1.000000e+00
default_profile_image           1.000000e+00
geo_enabled                     1.000000e+00
profile_use_background_image    1.000000e+00
profile_background_tile         1.000000e+00
utc_offset                     -8.602388e+03
is_translator                            NaN
follow_request_sent                      NaN
protected                                NaN
verified                                 NaN
notifications                            NaN
contributors_enabled                     NaN
following                                NaN
dtype: float64

In [6]:
df_genuine.mean(numeric_only=True)

id                              9.519675e+08
statuses_count                  1.695822e+04
followers_count                 1.393220e+03
friends_count                   6.332424e+02
favourites_count                4.669620e+03
listed_count                    1.949655e+01
default_profile                 1.000000e+00
default_profile_image           1.000000e+00
geo_enabled                     1.000000e+00
profile_use_background_image    1.000000e+00
profile_background_tile         1.000000e+00
utc_offset                     -4.386545e+03
is_translator                   1.000000e+00
follow_request_sent                      NaN
protected                       1.000000e+00
verified                        1.000000e+00
notifications                            NaN
contributors_enabled                     NaN
following                                NaN
test_set_1                      2.878526e-01
test_set_2                      1.410478e-01
dtype: float64
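
The two outputs above are easier to compare side by side. The following is a minimal sketch on invented miniature frames (`fake_demo` and `genuine_demo` are hypothetical stand-ins for `df_fake` and `df_genuine`, and the numbers are made up). It passes `numeric_only=True` so that text columns are skipped, which recent pandas versions require.

```python
import pandas as pd

# Invented miniature frames standing in for df_fake and df_genuine.
fake_demo = pd.DataFrame({
    "statuses_count": [50, 90],
    "followers_count": [10, 30],
    "lang": ["en", "it"],
})
genuine_demo = pd.DataFrame({
    "statuses_count": [12000, 20000],
    "followers_count": [900, 1500],
    "lang": ["en", "en"],
})

# numeric_only=True skips the text column "lang" when averaging.
comparison = pd.DataFrame({
    "fake": fake_demo.mean(numeric_only=True),
    "genuine": genuine_demo.mean(numeric_only=True),
})

# Ratio of class means: large values mark features that separate the classes.
comparison["genuine/fake"] = comparison["genuine"] / comparison["fake"]
print(comparison)
```

Applied to the real frames, the same table would put `statuses_count`, `followers_count` and `favourites_count` next to each other for both classes.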

In [14]:
df_spam = pd.concat([df_spam_1, df_spam_2, df_spam_3], sort=False)


## Preprocess Data into trainable format

In [15]:
def preprocess(df):
    # extract relevant columns; copy so the caller's frame is left untouched
    df = df[useful_colums].copy()
    # fill missing values with 0
    df = df.fillna(0)
    # profile_banner_url: 1 == present, 0 == absent
    df["profile_banner_url"] = (df["profile_banner_url"] != 0).astype(int)
    # profile_background_color: 1 == not default, 0 == default ("C0DEED")
    df["profile_background_color"] = (df["profile_background_color"] != "C0DEED").astype(int)
    # convert any remaining non-numeric columns to numeric types
    df = df.apply(pd.to_numeric)
    return df

In [16]:
df_fake = preprocess(df_fake)
df_spam = preprocess(df_spam)
df_genuine = preprocess(df_genuine)
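
The encoding rules used in `preprocess` can be checked on a small example. This is only a sketch: the frame `toy` and its values are invented, and it reproduces the three rules (banner present or absent, default or non-default background colour, missing flags filled with 0) on a reduced set of columns rather than calling `preprocess` itself.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a users.csv frame; the values are invented for illustration.
toy = pd.DataFrame({
    "followers_count": [12, 3400],
    "profile_banner_url": [np.nan, "https://example.com/banner.png"],
    "profile_background_color": ["C0DEED", "1A1B1F"],
    "verified": [np.nan, 1.0],
})

encoded = pd.DataFrame({
    "followers_count": toy["followers_count"],
    # 1 when a banner URL is present, 0 when it is missing
    "profile_banner_url": toy["profile_banner_url"].notna().astype(int),
    # 1 when the colour differs from the default "C0DEED", 0 otherwise
    "profile_background_color": (toy["profile_background_color"] != "C0DEED").astype(int),
    # missing flags are treated as 0, matching the fill-with-zero step
    "verified": toy["verified"].fillna(0).astype(int),
})
print(encoded)
```

The first row (no banner, default colour, not verified) encodes to all zeros apart from its follower count, while the second row encodes to ones.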


## Preparing Data

In [26]:
import numpy as np

In [17]:
df_fake_spam = pd.concat([df_fake, df_spam])

In [18]:
# Prior Values
len(df_fake_spam), len(df_genuine)

(8263, 3474)
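
These counts show that the two classes are imbalanced, which matters when reading accuracy figures later on. As a small worked calculation using only the two numbers printed above, the class priors and the accuracy of a trivial "always predict the majority class" baseline are:

```python
# Class sizes reported by the cell above.
n_fake_spam = 8263
n_genuine = 3474
total = n_fake_spam + n_genuine

prior_fake_spam = n_fake_spam / total
prior_genuine = n_genuine / total

# Accuracy of a trivial classifier that always predicts the majority class.
majority_baseline = max(prior_fake_spam, prior_genuine)

print(f"P(fake+spam) = {prior_fake_spam:.3f}")
print(f"P(genuine)   = {prior_genuine:.3f}")
print(f"majority-class baseline accuracy = {majority_baseline:.3f}")
```

Roughly 70% of the rows are fake or spam accounts, so any model should be judged against a baseline of about 0.70 rather than 0.50.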

In [20]:
# generate training labels: 0 = fake/spam, 1 = genuine
Y_train = [0] * len(df_fake_spam) + [1] * len(df_genuine)

In [24]:
fake_spam_values = df_fake_spam.values.copy()
genuine_values = df_genuine.values.copy()
fake_spam_values.shape, genuine_values.shape

((8263, 10), (3474, 10))

In [27]:
X_train = np.concatenate((fake_spam_values, genuine_values), axis=0)
X_train.shape

(11737, 10)

In [28]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

In [35]:
# feature scaling
mms = MinMaxScaler()
X_train_scaled = mms.fit_transform(X_train)

In [41]:
# train-test split
X_tr, X_te, Y_tr, Y_te = train_test_split(X_train_scaled, Y_train, test_size=0.25, random_state=101) 

In [38]:
X_tr.shape, X_te.shape

((8802, 10), (2935, 10))
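
Note that the cells above fit `MinMaxScaler` on all rows before splitting, so the test rows contribute to the min/max values used for scaling. One possible alternative, sketched here on synthetic data, is to split first and let a scikit-learn `Pipeline` fit the scaler on the training rows only. The names `X_demo`, `y_demo` and `pipe` are hypothetical, the data is random, and `DecisionTreeClassifier` is only a stand-in model; `stratify` is an optional extra that keeps the class ratio equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 200 rows, 10 features, imbalanced labels.
rng = np.random.default_rng(101)
X_demo = rng.random((200, 10)) * 1000
y_demo = np.array([0] * 140 + [1] * 60)

# Split first; stratify keeps the class ratio the same in both parts.
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=101, stratify=y_demo
)

# The pipeline fits the scaler on the training rows only and reuses
# those min/max values when transforming the test rows.
pipe = make_pipeline(MinMaxScaler(), DecisionTreeClassifier(random_state=101))
pipe.fit(X_tr_demo, y_tr_demo)
print(pipe.score(X_te_demo, y_te_demo))
```

Because the demo labels are random, the printed score is not meaningful; the point is only the order of operations.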

## Training Model

In [43]:
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from xgboost import XGBClassifier

In [50]:
models = {
    "svm" : LinearSVC(),
    "dt" : DecisionTreeClassifier(),
    "knn" : KNeighborsClassifier(),
    "rf" : RandomForestClassifier(),
    "ada" : AdaBoostClassifier(),
    "xgb" : XGBClassifier()
}

In [51]:
for name, clf in models.items():
    clf.fit(X_tr, Y_tr)
    print("{} gets {} accuracy on testing data".format(name, clf.score(X_te, Y_te)))

svm gets 0.9560477001703578 accuracy on testing data
dt gets 0.9781942078364566 accuracy on testing data
knn gets 0.9713798977853493 accuracy on testing data
rf gets 0.9812606473594548 accuracy on testing data
ada gets 0.9805792163543441 accuracy on testing data
xgb gets 0.9816013628620103 accuracy on testing data
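
Since the classes are imbalanced, accuracy alone can hide how well the smaller genuine class is recognised. A confusion matrix with per-class precision and recall gives a fuller picture. The sketch below uses invented labels (`y_true` and `y_pred` are hypothetical, not the notebook's predictions) to show the scikit-learn calls involved.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Invented labels for illustration: 0 = fake/spam, 1 = genuine.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)

# Precision and recall for the genuine class (label 1).
precision_genuine = precision_score(y_true, y_pred, pos_label=1)
recall_genuine = recall_score(y_true, y_pred, pos_label=1)
print(precision_genuine, recall_genuine)
```

With the notebook's own objects, the same calls could be made on `Y_te` and `clf.predict(X_te)` for each fitted model.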


## Save model

In [53]:
def save_model():
    import pickle
    # store the XGBoost classifier together with the fitted scaler
    with open("F_vs_G_model.data", "wb") as f:
        pickle.dump({"classifier": models["xgb"], "scaler": mms}, f)

In [54]:
save_model()
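
To use the saved file later, the stored scaler has to be applied to new feature rows before they reach the classifier. The following is a sketch of that round trip under stated assumptions: it trains a small `DecisionTreeClassifier` on invented two-feature data as a stand-in for the XGBoost model, writes to a temporary file named `demo_model.data` (a hypothetical name), and reuses the dictionary keys `"classifier"` and `"scaler"` from `save_model()`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Stand-in scaler and classifier trained on invented data.
X_demo = np.array([[10.0, 5.0], [2000.0, 800.0], [15.0, 2.0], [3500.0, 900.0]])
y_demo = [0, 1, 0, 1]
scaler = MinMaxScaler().fit(X_demo)
clf = DecisionTreeClassifier(random_state=0).fit(scaler.transform(X_demo), y_demo)

# Save the pair in one dictionary, using the same keys as save_model().
path = os.path.join(tempfile.mkdtemp(), "demo_model.data")
with open(path, "wb") as f:
    pickle.dump({"classifier": clf, "scaler": scaler}, f)

# Later: load the bundle and apply the stored scaler before predicting.
with open(path, "rb") as f:
    bundle = pickle.load(f)

new_accounts = np.array([[12.0, 4.0], [3000.0, 850.0]])
predictions = bundle["classifier"].predict(bundle["scaler"].transform(new_accounts))
print(predictions)
```

Pickle files can execute arbitrary code when loaded, so a model file should only be loaded if it comes from a trusted source.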