# Spam Bots
**Analyze the features of a SPAM Bot account**

In [1]:
import pandas as pd
import numpy as np

In [2]:
useful_columns = ["statuses_count", "followers_count", "friends_count", "favourites_count", "listed_count", "default_profile", "profile_banner_url", "profile_background_tile", "profile_background_color" ,"verified"]

In [3]:
df_spambots_1 = pd.read_csv("data_mib/social_spambots_1.csv/users.csv")
df_spambots_2 = pd.read_csv("data_mib/social_spambots_2.csv/users.csv")
df_spambots_3 = pd.read_csv("data_mib/social_spambots_3.csv/users.csv")

In [11]:
tot_len = len(df_spambots_1), len(df_spambots_2), len(df_spambots_3)
tot_len

(991, 3457, 464)

In [6]:
df_spambots = pd.concat([df_spambots_1, df_spambots_2, df_spambots_3], sort=False)


In [12]:
len(df_spambots) == sum(tot_len)

True

* Extract desired columns

In [13]:
# .copy() gives an independent frame, so later assignments do not touch a slice
df_spambots = df_spambots[useful_columns].copy()

In [14]:
df_spambots.size

49120
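`DataFrame.size` counts cells rather than rows, so the figure above is 4912 rows × 10 columns. A minimal sketch on toy frames (the frame contents and the `lang` column are invented for illustration) walks through the same concatenate, select, and count sequence:

```python
import pandas as pd

# Toy stand-ins for the three spambot files; all values are invented.
part_1 = pd.DataFrame({"followers_count": [1, 2], "friends_count": [3, 4], "lang": ["en", "it"]})
part_2 = pd.DataFrame({"followers_count": [5], "friends_count": [6], "lang": ["en"]})
part_3 = pd.DataFrame({"followers_count": [7, 8, 9], "friends_count": [0, 1, 2], "lang": ["es", "en", "en"]})

parts = [part_1, part_2, part_3]
combined = pd.concat(parts, sort=False)

# Concatenation should not lose any rows.
assert len(combined) == sum(len(p) for p in parts)

# Keep only the columns of interest; size == rows * columns.
selected = combined[["followers_count", "friends_count"]]
n_rows, n_cols = selected.shape
print(selected.size, n_rows * n_cols)
```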

## Check if any of our desired columns have null values.

In [15]:
# isnull() is an alias of isna(), so one call covers both.
# .sum() counts the True entries; len() would only return the number of rows.
for col in useful_columns:
    null_count = df_spambots[col].isnull().sum()
    print("COLUMN : {} has {} null/NaN values".format(col, null_count))

In [16]:
# Percentage of cells in the selected columns that are null
df_spambots.isnull().sum().sum() / df_spambots.size * 100

Note: in the recorded run these cells used `len()` and `.size` on the boolean mask, so they printed the row count (4912) for every column and 4912 / 49120 × 100 = 10.0 as the percentage. Neither figure measures missing values.
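When counting missing values, the length of a boolean mask and the number of `True` entries in it are different quantities. A small sketch on an invented toy frame (the values and the example URL are not from the dataset) makes the distinction concrete:

```python
import numpy as np
import pandas as pd

# Toy frame with known gaps; all values are invented.
toy = pd.DataFrame({
    "followers_count": [10, np.nan, 30, 40],
    "profile_banner_url": [None, "https://example.com/banner", None, None],
})

mask = toy["profile_banner_url"].isna()
rows_in_mask = len(mask)          # number of rows, regardless of nulls
nulls_in_mask = int(mask.sum())   # number of True entries, i.e. actual nulls

# Per-column null counts and the overall share of null cells.
null_counts = toy.isna().sum()
null_pct = toy.isna().sum().sum() / toy.size * 100
print(rows_in_mask, nulls_in_mask, null_pct)
```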

* Convert Null values to 0

In [17]:
df_spambots = df_spambots.fillna(0)

In [19]:
df_spambots.size

49120

In [20]:
df_spambots.head()

| | statuses_count | followers_count | friends_count | favourites_count | listed_count | default_profile | profile_banner_url | profile_background_tile | profile_background_color | verified |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1299 | 22 | 40 | 1 | 0 | 1.0 | 0 | 0.0 | C0DEED | 0.0 |
| 1 | 18665 | 12561 | 3442 | 16358 | 110 | 0.0 | https://pbs.twimg.com/profile_banners/33212890... | 1.0 | EBEBEB | 0.0 |
| 2 | 22987 | 600 | 755 | 14 | 6 | 0.0 | https://pbs.twimg.com/profile_banners/39773427... | 1.0 | 131516 | 0.0 |
| 3 | 7975 | 398 | 350 | 11 | 2 | 0.0 | https://pbs.twimg.com/profile_banners/57007623... | 1.0 | E60584 | 0.0 |
| 4 | 20218 | 413 | 405 | 162 | 8 | 0.0 | https://pbs.twimg.com/profile_banners/63258466... | 0.0 | EBEBEB | 0.0 |


## Test with pretrained classifier
> Which class do spambots fit into?

In [21]:
def load_model():
    import pickle
    # Use a context manager so the file handle is closed after loading
    with open("fake_account_model.dat", "rb") as model_file:
        loaded_model = pickle.load(model_file)
    model = loaded_model["classifier"]
    scaler = loaded_model["scaler"]
    return model, scaler

In [22]:
model, scaler = load_model()
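`load_model` expects the pickle file to hold a dictionary with `"classifier"` and `"scaler"` keys. How `fake_account_model.dat` was produced is not shown in this notebook, so the following is only a sketch of a compatible save/load round trip. The helper names, the placeholder objects standing in for the fitted estimator and scaler, and the temporary file name are all assumptions made for illustration:

```python
import os
import pickle
import tempfile

# Placeholders only: the real bundle would hold a fitted classifier and scaler.
bundle = {
    "classifier": {"kind": "placeholder-classifier"},
    "scaler": {"kind": "placeholder-scaler"},
}

def save_bundle(obj, path):
    """Write the bundle dictionary to disk with pickle."""
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)

def load_bundle(path):
    """Read the bundle back and return (classifier, scaler)."""
    with open(path, "rb") as fh:
        loaded = pickle.load(fh)
    return loaded["classifier"], loaded["scaler"]

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_model.dat")  # hypothetical file name
    save_bundle(bundle, demo_path)
    demo_model, demo_scaler = load_bundle(demo_path)

print(demo_model["kind"], demo_scaler["kind"])
```

Because unpickling can execute arbitrary code, a bundle like this should only be loaded from a source you trust.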

In [23]:
def fix_df(df):
    df = df.copy()  # work on a copy rather than on a slice of another DataFrame
    # profile_banner_url: 1 == present, 0 == absent
    df["profile_banner_url"] = (df["profile_banner_url"] != 0).astype(int)
    # profile_background_color: 1 == not default, 0 == default ("C0DEED")
    df["profile_background_color"] = (df["profile_background_color"] != "C0DEED").astype(int)
    return df

In [24]:
df_spambots = fix_df(df_spambots)


In [25]:
df_spambots = df_spambots.apply(pd.to_numeric)
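The two categorical columns are reduced to binary flags before everything is cast to numbers. A minimal sketch of that encoding on an invented toy frame (the rows and the example URL are assumptions, chosen to mimic the columns after `fillna(0)`) uses whole-column boolean comparisons instead of chained indexing:

```python
import pandas as pd

# Toy rows mimicking the columns after fillna(0); all values are invented.
toy = pd.DataFrame({
    "profile_banner_url": [0, "https://example.com/banner", 0],
    "profile_background_color": ["C0DEED", "EBEBEB", 0],
    "verified": [0.0, 0.0, 1.0],
})

encoded = toy.copy()  # work on an explicit copy
# 1 if a banner URL is present, 0 if it was missing (filled with 0).
encoded["profile_banner_url"] = (encoded["profile_banner_url"] != 0).astype(int)
# 1 if the background color differs from the default "C0DEED", else 0.
encoded["profile_background_color"] = (encoded["profile_background_color"] != "C0DEED").astype(int)

# Every column is numeric at this point, so to_numeric leaves the values unchanged.
encoded = encoded.apply(pd.to_numeric)
print(encoded)
```

Note that a missing background color (filled with 0) is not equal to `"C0DEED"` and is therefore encoded as 1, the same as in the notebook's `fix_df`.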

In [26]:
X_test = df_spambots.values
X_test

array([[1.2990e+03, 2.2000e+01, 4.0000e+01, ..., 0.0000e+00, 0.0000e+00,
        0.0000e+00],
       [1.8665e+04, 1.2561e+04, 3.4420e+03, ..., 1.0000e+00, 1.0000e+00,
        0.0000e+00],
       [2.2987e+04, 6.0000e+02, 7.5500e+02, ..., 1.0000e+00, 1.0000e+00,
        0.0000e+00],
       ...,
       [1.3700e+02, 2.9000e+01, 1.2400e+02, ..., 0.0000e+00, 1.0000e+00,
        0.0000e+00],
       [1.7000e+02, 1.1500e+02, 3.5300e+02, ..., 0.0000e+00, 1.0000e+00,
        0.0000e+00],
       [6.5000e+01, 1.3500e+02, 9.9800e+02, ..., 0.0000e+00, 1.0000e+00,
        0.0000e+00]])

In [27]:
X_test.shape

(4912, 10)

In [28]:
X_test = scaler.transform(X_test)

In [29]:
predictions = model.predict(X_test)

In [32]:
pred_counts = np.bincount(predictions)
pred_counts

array([4564,  348], dtype=int64)

In [34]:
pred_counts[0] / sum(pred_counts)

0.9291530944625407
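The share above is simply the class-0 count divided by the total number of predictions. As a quick arithmetic check, this sketch rebuilds a label vector from the counts reported by `np.bincount` (the vector itself is synthetic; only the two counts come from the notebook):

```python
import numpy as np

# Synthetic label vector with the class counts reported above.
labels = np.concatenate([np.zeros(4564, dtype=int), np.ones(348, dtype=int)])

counts = np.bincount(labels)              # counts[i] = number of predictions equal to i
share_class_0 = counts[0] / counts.sum()  # fraction of accounts assigned to class 0
print(counts.tolist(), round(share_class_0 * 100, 2))  # [4564, 348] 92.92
```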

## CONCLUSION

**Spam accounts share features with fake accounts**

Using a classifier trained on fake and genuine accounts, 4564 of the 4912 spam bot accounts (92.92%) were classified as fake accounts.