<a href="https://colab.research.google.com/github/adrianaleticiamartinez/MCD/blob/main/App.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Social Media Bot Detection App

GitHub link: https://github.com/adrianaleticiamartinez/MCD/tree/main

## Project Metadata

- **University:** Universidad Panamericana
- **Course:** Machine Learning I
- **Team Members:**
  - David Arturo Hernández Gómez
  - Adriana Leticia Martínez Estrada
- **Date:** December 5th, 2023
- **Code Version:** 2.1

## Project Overview

### Description
The advent of social media has been accompanied by the proliferation of automated accounts or 'bots' that can significantly influence the dissemination of information. These bots can be benign, serving to automate repetitive tasks, or malicious, spreading misinformation or spam. The goal of this project is to create a supervised machine learning model that can accurately distinguish between human users and bots based on their behavior on social media platforms.

### Objectives
- To understand the patterns and characteristics that differentiate bot behavior from human behavior.
- To implement a binary classification model that can predict whether a social media account is a bot.
- To evaluate the model's performance using metrics such as AUC-ROC, accuracy, and sensitivity.
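
The three metrics named in the last objective can all be computed from the true labels and the predicted bot probabilities. The following is a minimal pure-Python sketch, not the project's evaluation code; the labels, scores, and the 0.5 decision threshold are made-up illustration values. In practice, `sklearn.metrics` offers `roc_auc_score`, `accuracy_score`, and `recall_score` for the same quantities.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def sensitivity(y_true, y_pred):
    """True positive rate: share of real bots (label 1) that were flagged."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)


def auc_roc(y_true, scores):
    """AUC as the probability that a random bot outranks a random human.

    A tie between a bot score and a human score counts as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = bot) and predicted bot probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(accuracy(y_true, y_pred))     # 0.75
print(sensitivity(y_true, y_pred))  # 0.5
print(auc_roc(y_true, scores))      # 0.75
```

Sensitivity is lower than accuracy here because one of the two bots scores below the threshold, which is the kind of gap that accuracy alone would hide.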

## Methodology
Before running any cells, install all the libraries listed in the requirements.txt file; this ensures that the environment is the same on every computer.
The requirements file is hosted in the GitHub repository, but all the dependencies are also installed by the Requirements cell of this notebook.
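
One way to confirm that an environment really matches the pinned versions is to compare each pin against the installed distribution. This is a minimal sketch, assuming the pins are written as `name==version` (as in the install command of this notebook); `parse_pins` and `find_mismatches` are hypothetical helper names, not part of the project.

```python
from importlib import metadata


def parse_pins(text):
    """Turn whitespace-separated 'name==version' pins into a dict."""
    pins = {}
    for token in text.split():
        name, sep, version = token.partition("==")
        if sep:  # skip anything that is not an exact pin
            pins[name] = version
    return pins


def find_mismatches(pins, get_version=metadata.version):
    """Return {name: (expected, installed)} for pins that are not satisfied.

    A package that is not installed is reported with installed=None.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


PINS = "pandas==1.5.3 numpy==1.23.5 matplotlib==3.7.1 scikit-learn==1.2.2 gradio==4.8.0"
for name, (expected, installed) in find_mismatches(parse_pins(PINS)).items():
    print(f"{name}: expected {expected}, found {installed}")
```

The loop prints nothing when every pinned package is installed at exactly the pinned version.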

## Execution Instructions
Please execute all the notebook cells in sequential order. Each cell is documented to describe the processes being performed, from data preprocessing to model evaluation.

---

*For detailed analysis and discussion on the results, please refer to the subsequent sections of this notebook.*



# Requirements

In [None]:
!pip install pandas==1.5.3  numpy==1.23.5 matplotlib==3.7.1 scikit-learn==1.2.2 gradio==4.8.0


In [None]:
import gradio as gr
import numpy as np
from sklearn.linear_model import LogisticRegression
import joblib
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
import urllib.request
import requests

# Load bot data

In [None]:
DATASET_BOT_PATH = 'https://raw.githubusercontent.com/adrianaleticiamartinez/MCD/main/ML1_Project/datasets/botwiki-2019/botwiki-2019.tsv'
DATASET_BOT_COMPLEMENT_PATH = 'https://raw.githubusercontent.com/adrianaleticiamartinez/MCD/main/ML1_Project/datasets/botwiki-2019/botwiki-2019_tweets.json'
BEST_MODEL_PATH = "https://raw.githubusercontent.com/adrianaleticiamartinez/MCD/main/ML1_Project/DecisionTreeClassifier_gridsearch.sav"

In [None]:
data_raw_bot = pd.read_csv(DATASET_BOT_PATH,sep='\t', header=0,  names=['id', 'label'])
data_raw_bot_complement = pd.read_json(DATASET_BOT_COMPLEMENT_PATH)
df_unpacked_bot = pd.json_normalize(data_raw_bot_complement['user'])
joined_bot_data = pd.merge(data_raw_bot, df_unpacked_bot, on="id")

In [None]:
full_data = joined_bot_data

In [None]:
full_data = full_data[["location", "followers_count", "friends_count", "listed_count", "favourites_count", "geo_enabled", "verified", "statuses_count", "profile_background_tile", "profile_use_background_image", "has_extended_profile", "default_profile"]]

In [None]:
#Take one record to test the app
testing = full_data.head(1)

In [None]:
row_list = full_data.loc[1, :].values.flatten().tolist()

In [None]:
row_list

['Quee', 5, 0, 0, 0, False, False, 270, False, False, True, False]

# Preprocess the raw data

In [None]:
def preproced_raw_data(data):
  """
    Preprocess the raw data before it enters the model app.

    Numerical variables: SimpleImputer (mean), then StandardScaler.
    Categorical variables: SimpleImputer (most frequent), then one-hot encoding.

    Note: the transformers are fit on the DataFrame passed in, and only
    columns with a 'number' or 'category' dtype are selected; any other
    column is dropped by the ColumnTransformer.

    Args:
        data (DataFrame): DataFrame with a single row to be tested

    Returns:
        ndarray: Array with the preprocessed data.
    """
  # Pipeline to impute and scale all numerical data in the dataset
  attributes_number = Pipeline(steps = [
      ('null_replacement', SimpleImputer(strategy = 'mean')),
      ('scaling', StandardScaler())
  ])
  # Pipeline to impute and encode all categorical data in the dataset
  attributes_category = Pipeline(steps = [
      ('null_replacement', SimpleImputer(strategy = 'most_frequent')),
      ('encoding', OneHotEncoder(handle_unknown = 'ignore', sparse_output = False))
  ])

  # Route each column to the pipeline that matches its dtype
  preprocessor = ColumnTransformer(transformers=[
      ('number', attributes_number, data.select_dtypes(include='number').columns.tolist()),
      ('category', attributes_category, data.select_dtypes(include='category').columns.tolist())
  ])
  return preprocessor.fit_transform(data)
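
To make the scaling step concrete: standardization subtracts a mean and divides by a standard deviation, so the result depends on which rows those statistics were learned from. The sketch below uses pure Python rather than scikit-learn, and the status counts are invented for illustration. It shows that statistics learned from several rows give an informative z-score, whereas statistics learned from the single row being scored always give 0, which is also what `StandardScaler` returns when it is fit on one row.

```python
from statistics import fmean, pstdev


def fit_scaler(values):
    """Learn (mean, std) from a column; a zero std is replaced by 1.0."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation
    return mean, (std if std > 0 else 1.0)


def scale(value, params):
    """Apply previously learned statistics to one value."""
    mean, std = params
    return (value - mean) / std


# Hypothetical statuses_count values seen at training time.
training_statuses = [10, 270, 1500, 42000]
new_value = 270

fitted_on_training = fit_scaler(training_statuses)
fitted_on_itself = fit_scaler([new_value])

print(scale(new_value, fitted_on_training))  # negative: below the training mean
print(scale(new_value, fitted_on_itself))    # 0.0
```

If a preprocessor fitted during training were saved alongside the model, it could be loaded and applied with `transform` instead of `fit_transform`; whether such a file exists in the repository is not stated in this notebook.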


In [None]:
def get_model_from_repository(URL):
  """
    Get the model file from a GitHub repository.
    It sends a request to the URL and downloads the content to a local file
    named champion_model.sav.
    Args:
        URL (String): URL of the model file saved in the GitHub repository

    Returns:
        model_file_name (String): Name of the saved model file.

    Raises:
        RuntimeError: If the download does not return status code 200.
  """
  response = requests.get(URL)
  if response.status_code == 200:
    with open('champion_model.sav', 'wb') as file:
        file.write(response.content)
    print("Model downloaded successfully.")
    return "champion_model.sav"
  else:
    # Fail loudly so that the caller never passes None to joblib.load
    raise RuntimeError(f"Failed to download the model file. Status code: {response.status_code}")
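
Because the prediction function fetches the model each time it is called, it can help to download the file only when no local copy exists yet. This is a minimal sketch using only the standard library; `fetch_once` and its `download` parameter are hypothetical names introduced here, and the default downloader uses `urllib.request.urlopen` rather than `requests`.

```python
from pathlib import Path
from urllib.request import urlopen


def _download(url):
    """Default downloader: return the response body as bytes."""
    with urlopen(url) as response:
        return response.read()


def fetch_once(url, dest="champion_model.sav", download=_download):
    """Download url to dest unless dest already exists; return the path."""
    path = Path(dest)
    if not path.exists():
        path.write_bytes(download(url))
    return str(path)
```

With this helper, `fetch_once(BEST_MODEL_PATH)` would return the local file name on every call but reach the network only on the first one. Passing a custom `download` callable makes the function easy to exercise without network access.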

# App code to create the whole pipeline

In [None]:
def predict_bot_account(location, followers_count, friends_count, listed_count,
         favourites_count, geo_enabled, verified, statuses_count, profile_background_tile,
                  profile_use_background_image, has_extended_profile,
                  default_profile):
    """
    Receive the value of every column for a single row from the app inputs,
    load the downloaded model, and make a prediction.
    Args:
        location        (str)    Location of the account
        followers_count (Number) Number of followers of the Twitter account
        friends_count   (Number) Number of friends of the account
        listed_count    (Number) Number of lists that include the account
        favourites_count (Number) Number of favourites of the account
        geo_enabled     (str)    "True"/"False": is geolocation enabled?
        verified        (str)    "True"/"False": is the account verified?
        statuses_count  (Number) Number of statuses posted by the account
        profile_background_tile (str) "True"/"False": is the profile background tiled?
        profile_use_background_image (str) "True"/"False": does the profile use a background image?
        has_extended_profile (str) "True"/"False": does the account have an extended profile?
        default_profile (str)    "True"/"False": does the account use the default profile?
    Returns:
        list: A single (value, message) tuple, where the value is "1" for a
        bot account and "0" otherwise.
    """
    local_model_path = get_model_from_repository(BEST_MODEL_PATH)
    loaded_model = joblib.load(local_model_path)
    input_data = [location, followers_count, friends_count, listed_count,
         favourites_count, geo_enabled, verified, statuses_count, profile_background_tile,
                  profile_use_background_image, has_extended_profile,
                  default_profile]
    input_df = pd.DataFrame([input_data], columns=["location",
    "followers_count", "friends_count", "listed_count", "favourites_count",
    "geo_enabled", "verified", "statuses_count", "profile_background_tile",
    "profile_use_background_image", "has_extended_profile", "default_profile"])
    data_ready = preproced_raw_data(input_df)
    prediction = loaded_model.predict(data_ready)

    # predict returns an array with one element for the single input row
    if prediction[0] == 0:
        return [("0", "Not a bot account")]
    else:
        return [("1", "Oops, a bot account")]

In [None]:
output = gr.HighlightedText(color_map={
    "0": "green",
    "1": "red"
})


interface = gr.Interface(title= "Twitter bot classification",
    fn=predict_bot_account,
    inputs=[
        gr.Dropdown(['Quee', 'San Francisco, CA', 'Ontario, Canada', 'Austria',
       'Somewhere in the Rainbow', 'The Library of Babel', 'Ireland'], label="location"),
        gr.Number(label="followers number"),
        gr.Number(label="friends number"),
        gr.Number(label="listed count"),
        gr.Number(label="favourites number"),
        gr.Dropdown(["True", "False"], label="geo enabled?"),
        gr.Dropdown(["True", "False"], label="verified account?"),
        gr.Number(label="statuses count"),
        gr.Dropdown(["True", "False"], label="profile background tile?"),
        gr.Dropdown(["True", "False"], label="use profile background image?"),
        gr.Dropdown(["True", "False"], label="has extended profile?"),
        gr.Dropdown(["True", "False"], label="has default profile?")

       ],
    outputs=output,  theme="freddyaboulton/dracula_revamped"
)

interface.launch(share=True, debug=True)

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Running on public URL: https://dc784344a0e66582bc.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


Model downloaded successfully.
Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7860 <> https://dc784344a0e66582bc.gradio.live




# Data to test

In [None]:
['Quee', 5, 0, 0, 0, False, False, 270, False, False, True, False]

['Quee', 5, 0, 0, 0, False, False, 270, False, False, True, False]