### BotDetecter

The goal of this project is to develop a machine learning model that classifies social media accounts as bot or human based on their metadata or behavior.

## Loading the dataset

The dataset was downloaded from [Botometer](https://botometer.osome.iu.edu/bot-repository/datasets.html). The original file is a `.tar.gz` archive. I will start by extracting its contents and saving them as a `.csv` file that I can manipulate.

In [1]:
import tarfile 
import pandas as pd    

# Path to the tar.gz file.
tar_path = '../data/botometer-feedback-2019.tar.gz'

# Opening the tar file.
with tarfile.open(tar_path, 'r:gz') as tar:
    # Showing the files.
    print(tar.getnames())

['botometer-feedback-2019.tsv', 'botometer-feedback-2019_tweets.json']
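Besides the member names, the archive also records each member's size, which helps decide whether a file can be read into memory in one go. The following is a minimal sketch on a tiny in-memory archive; `build_demo_archive`, `list_members`, and the `demo*` file names are hypothetical stand-ins, not part of the real dataset.

```python
import io
import tarfile


def build_demo_archive() -> bytes:
    """Build a tiny tar.gz in memory (hypothetical stand-in for the real archive)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode='w:gz') as tar:
        members = [
            ('demo.tsv', b'1\thuman\n2\tbot\n'),
            ('demo_tweets.json', b'[]'),
        ]
        for name, payload in members:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def list_members(archive_bytes: bytes) -> dict:
    """Return {member name: size in bytes} for the regular files in the archive."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode='r:gz') as tar:
        return {m.name: m.size for m in tar.getmembers() if m.isfile()}


sizes = list_members(build_demo_archive())
print(sizes)
```

On the real archive, the same `getmembers()` loop could be run inside the `with tarfile.open(tar_path, 'r:gz')` block shown above.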


The archive contains a `.tsv` file, which is like a `.csv` file but uses tabs as separators. Now I will extract the TSV file and load it into a DataFrame.

In [11]:
with tarfile.open(tar_path, 'r:gz') as tar:
    # extractfile returns a file-like object without unpacking the archive to disk.
    tsv_file = tar.extractfile('botometer-feedback-2019.tsv')
    print('TSV file extracted successfully!')
    df = pd.read_csv(tsv_file, sep='\t', header=0)
    print('DataFrame created successfully!')
    print(f'Column names are: {df.columns}')
df.head()

TSV file extracted successfully!
DataFrame created successfully!
Column names are: Index(['2718436417', 'human'], dtype='object')


|   | 2718436417 | human |
|---|---|---|
| 0 | 792615528791703553 | human |
| 1 | 3287012484 | human |
| 2 | 93816184 | human |
| 3 | 754884880996020225 | bot |
| 4 | 3027809025 | bot |


The file has no header row, so pandas used the first record as the column names. I will supply the column names manually.

In [12]:
column_names = ['user_id', 'label']
with tarfile.open(tar_path, 'r:gz') as tar:
    tsv_file = tar.extractfile('botometer-feedback-2019.tsv')
    print('TSV file extracted successfully!')
    # header=None keeps the first row as data; names supplies the column labels.
    df = pd.read_csv(tsv_file, sep='\t', header=None, names=column_names)
    print('DataFrame created successfully!')
    print(f'Column names are: {df.columns}')
df.head()

TSV file extracted successfully!
DataFrame created successfully!
Column names are: Index(['user_id', 'label'], dtype='object')


|   | user_id | label |
|---|---|---|
| 0 | 2718436417 | human |
| 1 | 792615528791703553 | human |
| 2 | 3287012484 | human |
| 3 | 93816184 | human |
| 4 | 754884880996020225 | bot |
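The difference between the two `read_csv` calls can be checked on a small in-memory sample. This is a minimal sketch: `sample_tsv` is a hand-built three-row string that reuses IDs shown above, not the real file.

```python
import io

import pandas as pd

# Three tab-separated records and no header row (illustrative sample).
sample_tsv = (
    "2718436417\thuman\n"
    "792615528791703553\thuman\n"
    "754884880996020225\tbot\n"
)

# header=0 treats the first record as column names, so one data row is lost.
with_header = pd.read_csv(io.StringIO(sample_tsv), sep='\t', header=0)

# header=None keeps every record as data; names supplies the column labels.
no_header = pd.read_csv(
    io.StringIO(sample_tsv), sep='\t', header=None, names=['user_id', 'label']
)

print(f'header=0    -> {len(with_header)} rows, columns {list(with_header.columns)}')
print(f'header=None -> {len(no_header)} rows, columns {list(no_header.columns)}')
```

Comparing the row counts is a quick way to confirm that no account was silently turned into a column name.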


I will also load the JSON file for future use.

In [17]:
import json

with tarfile.open(tar_path, 'r:gz') as tar:
    json_file = tar.extractfile('botometer-feedback-2019_tweets.json')
    tweets_data = json.load(json_file)
    print('JSON file loaded successfully!')

# Show the first n records; slicing works because the data loads as a list.
n = 1
print(f'First {n} tweets are: {tweets_data[:n]}')

JSON file loaded successfully!
First 1 tweets are: [{'created_at': 'Mon Apr 16 19:28:33 +0000 2018', 'user': {'follow_request_sent': False, 'has_extended_profile': False, 'profile_use_background_image': False, 'default_profile_image': False, 'id': 602249341, 'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme4/bg.gif', 'verified': False, 'translator_type': 'none', 'profile_text_color': '000000', 'profile_image_url_https': 'https://pbs.twimg.com/profile_images/923924342974578688/k5RCrlSQ_normal.jpg', 'profile_sidebar_fill_color': '000000', 'entities': {'url': {'urls': [{'url': 'https://t.co/e5t6p9w7D8', 'indices': [0, 23], 'expanded_url': 'http://www.socialresultsltd.com', 'display_url': 'socialresultsltd.com'}]}, 'description': {'urls': []}}, 'followers_count': 790, 'profile_sidebar_border_color': '000000', 'id_str': '602249341', 'profile_background_color': '000000', 'listed_count': 42, 'is_translation_enabled': False, 'utc_offset': 3600, 'statuses_count': 6
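The printed record suggests that each element is a tweet object with a nested `user` dictionary holding account fields such as `id`, `followers_count`, `listed_count`, and `verified`. Assuming that structure holds for every record (not verified here), one possible way to collect a metadata row per account is sketched below. The helper name `collect_user_metadata` and the toy tweets are hypothetical.

```python
def collect_user_metadata(tweets, fields=('followers_count', 'listed_count', 'verified')):
    """Return {user id: {field: value}} using the first tweet seen for each user."""
    users = {}
    for tweet in tweets:
        user = tweet.get('user') or {}
        user_id = user.get('id')
        if user_id is None:
            # Skip records without a usable user object.
            continue
        # setdefault keeps the first occurrence and ignores later tweets by the same user.
        users.setdefault(user_id, {field: user.get(field) for field in fields})
    return users


# Toy records that mimic the assumed structure; the values are made up.
toy_tweets = [
    {'user': {'id': 1, 'followers_count': 10, 'listed_count': 0, 'verified': False}},
    {'user': {'id': 1, 'followers_count': 12, 'listed_count': 0, 'verified': False}},
    {'user': {'id': 2, 'followers_count': 500, 'listed_count': 3, 'verified': True}},
    {'text': 'record without a user object'},
]

metadata = collect_user_metadata(toy_tweets)
print(metadata)
```

The resulting keys could later be matched against the `user_id` column of the labels DataFrame, provided the IDs in both files refer to the same accounts.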

Now I will save both files so they are easier to load in later steps.

In [18]:
# Save the DataFrame as CSV.
df.to_csv('../data/bot_df.csv', index=False)
# Save the tweets as JSON.
with open('../data/bot_tweets.json', 'w', encoding='utf-8') as f:
    json.dump(tweets_data, f)
    print('JSON file saved successfully!')

JSON file saved successfully!
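A quick round-trip check can confirm that the saved files reload with the same content. The sketch below writes toy data to a temporary directory rather than to `../data`, so the toy DataFrame, toy tweets, and temporary paths are all assumptions for illustration only.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-ins for df and tweets_data.
toy_df = pd.DataFrame(
    {'user_id': [2718436417, 792615528791703553], 'label': ['human', 'human']}
)
toy_tweets = [{'user': {'id': 1}, 'text': 'hello'}]

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / 'bot_df.csv'
    json_path = Path(tmp) / 'bot_tweets.json'

    # Save both objects the same way as in the cell above.
    toy_df.to_csv(csv_path, index=False)
    with open(json_path, 'w', encoding='utf-8') as f:
        json.dump(toy_tweets, f)

    # Reload them before the temporary directory is removed.
    reloaded_df = pd.read_csv(csv_path)
    with open(json_path, encoding='utf-8') as f:
        reloaded_tweets = json.load(f)

ids_match = reloaded_df['user_id'].tolist() == toy_df['user_id'].tolist()
labels_match = reloaded_df['label'].tolist() == toy_df['label'].tolist()
print(ids_match, labels_match, reloaded_tweets == toy_tweets)
```

Checking the `user_id` values explicitly is worthwhile because the IDs are large integers, and any loss of precision would break later joins with the tweet data.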
