# Notebook Overview

## Goal
* The goal of this notebook is to apply the preprocessing steps we developed in the paper for the SBERT configuration.
## Configuration
* Concatenating the tweets of each author produced documents of more than 512 tokens ([the maximum context length of SBERT](https://www.sbert.net/examples/applications/computing-embeddings/README.html)). Therefore, in this notebook, we split each document into chunks of 300 words. Each chunk is then converted into a dense vector using [SBERT](https://www.sbert.net/).

# Imports

In [None]:
!pip install transformers

import pandas as pd
import numpy as np
import seaborn as sns

import torch
import torch.nn as nn

import transformers
from transformers import ElectraTokenizer, ElectraModel, ElectraConfig

# Data loading

In [None]:
# load the dataset
df = pd.read_csv("data/100_tweets.csv")
print(df.shape)
df.head()

(578, 4)


Unnamed: 0.1,Unnamed: 0,authors,Tweets,Labels
0,0,10_GOP,Here is video of woman approaching the man say...,RightTroll
1,1,4MYSQUAD,'@filsdetafa @ShaunKing yeah we now how it wor...,LeftTroll
2,2,AANTIRACIST,"Marry Justin, Fuck Drake & Kill Theo https://...",LeftTroll
3,3,ABIGAILSSILK,The only time I've seen guys excited about blu...,HashtagGamer
4,4,ABIISSROSB,"Yemen 94', Somali 82', Angolia 75-2002, Ethiop...",RightTroll


## Tweets length analysis
We can see that concatenating the tweets of each author resulted in an average document length of 1308 words. This configuration is poorly suited to SBERT, which cannot handle sequences of more than 512 tokens. Thus, to keep all the information, we split each document into chunks of 300 words.

In [None]:
# check the number of words per author
df["tweet_len"] = df.Tweets.apply(lambda x: len(x.split()))
print(df.tweet_len.mean())
print(df.tweet_len.median())
print(df.tweet_len.max())
print(df.tweet_len.min())

1307.8892733564014
1325.5
2674
378
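
As a rough check on these figures, the sketch below estimates how many chunks a document yields when it is cut every 300 words and the final chunk keeps the remainder. The helper `n_chunks` is a hypothetical name introduced for illustration; it is not part of the notebook.

```python
import math

CHUNK_SIZE = 300  # words per chunk, as used in this notebook


def n_chunks(n_words: int, chunk_size: int = CHUNK_SIZE) -> int:
    """Number of chunks when the last chunk may be shorter than chunk_size."""
    return math.ceil(n_words / chunk_size)


# Shortest and longest documents reported above
print(n_chunks(378))   # 2
print(n_chunks(2674))  # 9
```

Under this assumption, every author contributes between 2 and 9 chunks.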


# Preprocessing

In [None]:
# split the tweets into chunks of 300 words.
def split_tweets(tweet: str, n: int) -> list:
  """
  Separate the tweets from the same author into different paragraphs of n words.

  Args:
  -------
  tweet: the concatenated tweets of one author, as a single string.
  n: the number of words per paragraph (the last paragraph may be shorter).

  Returns:
  -------
  tweets: list of the paragraphs generated.

  """
  tweet = tweet.split()
  tweets = [" ".join(tweet[i:i+n]) for i in range(0, len(tweet), n)]
  
  return tweets
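
To illustrate the behaviour of the function, here is a minimal usage sketch on a made-up string with a small chunk size. The toy text and `n=3` are assumptions chosen for readability; the notebook itself uses `n=300`.

```python
def split_tweets(tweet: str, n: int) -> list:
    """Split a string into consecutive chunks of n whitespace-separated words."""
    words = tweet.split()
    return [" ".join(words[i:i + n]) for i in range(0, len(words), n)]


toy_document = "one two three four five six seven"  # made-up text, not from the dataset
chunks = split_tweets(toy_document, 3)
print(chunks)  # ['one two three', 'four five six', 'seven']
```

The last chunk simply keeps the leftover words, so no text is discarded. An empty string yields an empty list.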

## Chunks analysis
* After splitting each document into chunks of 300 words, we treat each chunk as a separate sample. We will cluster these samples independently and then use a majority vote to assign each author to a specific cluster.
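
The majority vote mentioned above could look like the following sketch. The `(author, cluster_id)` pairs are hypothetical placeholders, not results from the paper, and the tie-breaking rule (the cluster seen first wins) is an arbitrary assumption.

```python
from collections import Counter, defaultdict

# Hypothetical (author, cluster_id) pairs, one per chunk; not real results.
chunk_assignments = [
    ("author_a", 0), ("author_a", 0), ("author_a", 1),
    ("author_b", 2), ("author_b", 2),
]


def majority_vote(assignments):
    """Map each author to the cluster that most of their chunks fall into."""
    votes = defaultdict(Counter)
    for author, cluster in assignments:
        votes[author][cluster] += 1
    # most_common(1) returns the most frequent cluster; on ties, the cluster
    # encountered first is kept (an arbitrary choice for this sketch).
    return {author: counts.most_common(1)[0][0] for author, counts in votes.items()}


author_clusters = majority_vote(chunk_assignments)
print(author_clusters)  # {'author_a': 0, 'author_b': 2}
```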

In [None]:
# apply split_tweets to all the rows
df.Tweets = df.Tweets.apply(lambda x: split_tweets(x,300))

In [None]:
# create a row for each chunk (part of an author's concatenated tweets)
df = df.explode("Tweets",ignore_index=True)
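
To make the effect of `explode` concrete, here is a small sketch on a made-up DataFrame. The author names and chunk strings are placeholders; only the column names mirror the dataset.

```python
import pandas as pd

# Toy frame: the "Tweets" column holds a list of chunks per author.
toy = pd.DataFrame({
    "authors": ["author_a", "author_b"],
    "Tweets": [["chunk 1", "chunk 2", "chunk 3"], ["chunk 1"]],
    "Labels": ["LeftTroll", "RightTroll"],
})

# One row per chunk; the other columns are repeated for each chunk.
# ignore_index=True renumbers the rows from 0 instead of repeating the old index.
exploded = toy.explode("Tweets", ignore_index=True)
print(exploded.shape)  # (4, 3)
```

Every column other than `Tweets` is copied onto each new row, so a per-author column computed before this step (such as `tweet_len`) still describes the whole document, not the individual chunk.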

In [None]:
df.head()

Unnamed: 0.1,Unnamed: 0,authors,Tweets,Labels,tweet_len
0,0,10_GOP,Here is video of woman approaching the man say...,RightTroll,6
1,0,10_GOP,"brilliant, quick-witted, and knows how to dest...",RightTroll,6
2,0,10_GOP,speechwriters President Trump tweeted this bac...,RightTroll,6
3,0,10_GOP,was literally standing when everyone was layin...,RightTroll,6
4,0,10_GOP,"book signing, asks her about Seth Rich, emails...",RightTroll,6


In [None]:
# save the new dataset
print(df.shape)
df.to_csv("data/100_tweets_explode.csv", index=False)

(2807, 5)
