## hate speech classification demo

In this notebook, we test our model on a few sample tweets. You'll need to [download our model](https://drive.google.com/drive/folders/1UE1MiiNbXWgalJ1UIDW7mTLsSH15UIfX?usp=sharing) and place it in the current directory as `model.bin`, the filename the loading cell below expects.

## installation

In [11]:
%pip install --upgrade git+https://github.com/twintproject/twint.git@origin/master#egg=twint
%pip install torch
%pip install transformers

Collecting twint
  Cloning https://github.com/twintproject/twint.git (to revision origin/master) to /private/var/folders/j_/ygb_gxx970b1bfpk3v12gc980000gp/T/pip-install-9tpx93kl/twint_2176c824128d4ca9ac78bc92fca5855d
You should consider upgrading via the '/usr/local/Cellar/jupyterlab/3.0.5/libexec/bin/python3 -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.
Collecting bs4
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
Using legacy 'setup.py install' for bs4, since package 'wheel' is not installed.
Installing collected packages: bs4
    Running setup.py install for bs4 ... done
Successfully installed bs4-0.0.1


## imports

In [17]:
import re

import nest_asyncio
import torch
import twint
from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast

# twint drives its own asyncio loop; nest_asyncio lets it run inside Jupyter's already-running loop
nest_asyncio.apply()

Let's load our model and tokenizer.

In [4]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device

device(type='cpu')

In [48]:
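model = torch.load('model.bin', map_location=device)  # model.bin holds the whole pickled model, not just a state_dict
_ = model.eval()  # inference mode: disables dropout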

In [49]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

Now, let's fetch some tweets and clean them up.

In [50]:
c = twint.Config()

c.Username = "noneprivacy"
c.Custom["tweet"] = ["id"]
c.Custom["user"] = ["bio"]
c.Limit = 10
c.Pandas = True

twint.run.Search(c)  # returns None; with c.Pandas = True the results are kept in twint.storage.panda.Tweets_df

1384929804017885187 2021-04-21 12:59:30 -0500 <noneprivacy> @FiloSottile It's a privacy violation, not just a "bug". Still, Whatsapp says that's a feature.
1381958323986124801 2021-04-13 08:11:54 -0500 <noneprivacy> The FB, Linkedin and Clubhouse "leaks" are not actual leaks  Then, if we want to discuss if being able to get that amount of data is wrong or not, let's chat but it's a totally different topic
1380461578257104896 2021-04-09 05:04:22 -0500 <noneprivacy> @Cloudflare updated their infrastructure IPs list, make the following changes to your iptables rules  Remove:  104.16.0.0/12   Add:  104.16.0.0/13 104.24.0.0/14
1380109111581364225 2021-04-08 05:43:48 -0500 <noneprivacy> @Zewensec @Ginger__T Well, keeping the analogy, I hope that you put the X on your map yourself In my personal vision, it's like your goal
1380106045251588098 2021-04-08 05:31:36 -0500 <noneprivacy> @Ginger__T Aaah damn, felt something was missing and that's it!
1380091641898360832 2021-04-08 04:34:22 -0500 <n

With `c.Pandas = True`, twint stores the fetched tweets in a pandas DataFrame, which we can grab directly:

In [51]:
tweets = twint.storage.panda.Tweets_df
tweets.head()

Unnamed: 0,id,conversation_id,created_at,date,timezone,place,tweet,language,hashtags,cashtags,...,geo,source,user_rt_id,user_rt,retweet_id,reply_to,retweet_date,translate,trans_src,trans_dest
0,1384929804017885187,1384925456156200964,1619028000000.0,2021-04-21 12:59:30,-500,,"@FiloSottile It's a privacy violation, not jus...",en,[],[],...,,,,,,"[{'screen_name': 'FiloSottile', 'name': 'Filip...",,,,
1,1381958323986124801,1381958323986124801,1618320000000.0,2021-04-13 08:11:54,-500,,"The FB, Linkedin and Clubhouse ""leaks"" are not...",en,[],[],...,,,,,,[],,,,
2,1380461578257104896,1380461578257104896,1617963000000.0,2021-04-09 05:04:22,-500,,@Cloudflare updated their infrastructure IPs l...,en,[],[],...,,,,,,[],,,,
3,1380109111581364225,1380091641898360832,1617879000000.0,2021-04-08 05:43:48,-500,,"@Zewensec @Ginger__T Well, keeping the analogy...",en,[],[],...,,,,,,"[{'screen_name': 'Zewensec', 'name': 'zewen', ...",,,,
4,1380106045251588098,1380091641898360832,1617878000000.0,2021-04-08 05:31:36,-500,,"@Ginger__T Aaah damn, felt something was missi...",en,[],[],...,,,,,,"[{'screen_name': 'Ginger__T', 'name': 'Ginger ...",,,,


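Here is a cleaning function that removes usernames and URLs, then replaces every remaining non-letter character (the `#` of a hashtag, digits, punctuation) with a space:

In [52]:
def clean(tweet):
    '''
    Strips usernames and URLs from a tweet, then replaces non-letters with spaces
    '''
    # Remove usernames (handles may contain underscores)
    tweet = re.sub(r'@[A-Za-z0-9_]+', '', tweet)
    # Remove URLs
    tweet = re.sub(r'https?://[A-Za-z0-9./]+', '', tweet)
    # Replace everything that is not a letter with a space
    tweet = re.sub(r'[^a-zA-Z]', ' ', tweet)
    return tweet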
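
Twitter handles can contain underscores, which matters for the username pattern. The sketch below compares a letters-and-digits pattern with one that also allows `_`, using a handle that appears in the fetched tweets. The variable names are ours and only illustrate the difference.

```python
import re

# A tweet from the fetched sample whose handle contains underscores
sample = "@Ginger__T Aaah damn, felt something was missing and that's it!"

# Pattern limited to letters and digits: matching stops at the first underscore
narrow = re.sub(r'@[A-Za-z0-9]+', '', sample)

# Pattern that also accepts underscores: the whole handle is removed
wide = re.sub(r'@[A-Za-z0-9_]+', '', sample)

print(repr(narrow))
print(repr(wide))
```

With the narrower pattern, the tail of the handle (`__T`) stays in the text and reaches the tokenizer as a stray letter once non-letters are replaced by spaces.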

Clean the first tweet, pass it through the tokenizer, and give the result to the model.

In [53]:
test_tweet = clean(tweets.loc[0]['tweet'])
test_tweet

' It s a privacy violation  not just a  bug   Still  Whatsapp says that s a feature '

In [54]:
test_token = tokenizer(test_tweet, truncation=True, padding=True, return_tensors='pt')
test_token

{'input_ids': tensor([[  101,  2009,  1055,  1037,  9394, 11371,  2025,  2074,  1037, 11829,
          2145,  2054,  3736,  9397,  2758,  2008,  1055,  1037,  3444,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
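
With a single tweet, `padding=True` has nothing to pad, which is why the attention mask above is all ones. When several tweets are tokenized together, shorter sequences are padded up to the longest one and the mask marks the padded positions with zeros. Here is a plain-Python sketch of that idea. `pad_batch` is a hypothetical helper, not the tokenizer's implementation, and the second id list is made up; the pad id of 0 corresponds to `[PAD]` in the `distilbert-base-uncased` vocabulary.

```python
def pad_batch(sequences, pad_id=0):
    """Pad lists of token ids to a common length and build matching attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)       # real ids, then padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, attention_mask


batch = [
    [101, 2009, 1055, 1037, 3444, 102],  # a few ids borrowed from the output above
    [101, 2145, 102],                    # made-up shorter sequence
]
ids, mask = pad_batch(batch)
print(ids[1])   # [101, 2145, 102, 0, 0, 0]
print(mask[1])  # [1, 1, 1, 0, 0, 0]
```

The model ignores positions where the mask is 0, so padding does not change the prediction for the shorter tweet.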

In [61]:
result = model(**test_token)['logits']
print("Got: {0}".format(result))
# The index of the largest logit is the predicted class
torch.argmax(result, dim=1)

Got: tensor([[ 2.1648, -3.0080]], grad_fn=<AddmmBackward>)


tensor([0])
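
The logits are raw scores, not probabilities. A softmax turns them into values that sum to 1, which makes the model's confidence easier to read. Below is a minimal plain-Python sketch using the two logits printed above. The `softmax` helper is ours, and this notebook does not state which class index corresponds to which label, so the sketch reports only the index.

```python
import math


def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.1648, -3.0080]  # values printed by the cell above
probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)  # same idea as argmax

print(probs)
print(predicted)
```

For these logits, class 0 receives a probability above 0.99. On the tensor itself, `torch.softmax(result, dim=1)` performs the same computation.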