# Toxicity metrics data generation
In this notebook I generate toxicity metrics with the Detoxify library, which measures the toxicity of texts, in our case tweets.   
   
This is meant as a supplementary approach to the Perspective API, since we were limited by the number of queries when using it.

Make sure CUDA is installed: it gives at least a 5x speed-up.

In [None]:
%reload_ext autoreload
%autoreload 2

import os 
import sys
import pandas as pd
import numpy as np
import plotly  
import plotly.graph_objects as go
import time
import random
# progress monitoring
from tqdm import tqdm

from sentence_splitter import split_text_into_sentences
# for dealing with messed up data
from unidecode import unidecode

import torch
import nltk
from detoxify import Detoxify
# nltk.download('stopwords')

try:
    print(run_only_once)
except Exception as e:
    print(os.getcwd())
    os.chdir("./../../")
    print(os.getcwd())
    run_only_once = "Dir has already been changed"

In [None]:
# check whether CUDA is available; without it this is very slow (about 33 hours for 1.2 million tweets)
cuda_present = torch.cuda.is_available()
print(f"Cuda present: {cuda_present}")
device = "cuda" if cuda_present else "cpu"

if cuda_present:
    # clear cached GPU memory to reduce out-of-memory errors
    torch.cuda.empty_cache()
    print(torch.cuda.memory_summary(device=None, abbreviated=False))

# load the model on the selected device
model = Detoxify('original', device=device)

In [None]:
# for single predictions testing
# model.predict("Love on the Spectrum is the cutest show on Netflix rn 🥹💓")

## Test how the text is split into sentences

In [None]:
# test for sentence splitting
sentences = split_text_into_sentences(
    text='This is a paragraph. It contains several sentences. "But why," you ask?',
    language='en'
)

for sent in sentences:
    print(sent)

## Generating toxicity scores for each tweet
The code below needed 19092 seconds (5.3 hours) to run the last time, with CUDA on ~1 million tweets.

This function is the initial implementation of toxicity generation, which was used when we did no preprocessing on the text.
Its code is kept for reference only and should not be used!  

```from src.toxicity_metric_generation_functions import upgraded_generate_toxicity_for_tweet_file```   


Here is the new implementation of the function above. It splits the text into sentences and computes the average scores over them, and it is also generally more robust.
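
The real implementation lives in `src.toxicity_metric_generation_functions` and is not shown here. As a rough illustration of the sentence-averaging idea only, here is a minimal sketch. The helper name `average_sentence_scores` and the stub predictor are hypothetical; the sketch assumes the predictor behaves like `Detoxify.predict` called on a list of strings, i.e. it returns a dict mapping each label to a list with one score per sentence.

```python
from statistics import fmean
from typing import Callable, Dict, List


def average_sentence_scores(
    sentences: List[str],
    predict: Callable[[List[str]], Dict[str, List[float]]],
) -> Dict[str, float]:
    """Average per-sentence scores into one score per label (hypothetical helper)."""
    # Drop empty or whitespace-only sentences before scoring.
    sentences = [s for s in sentences if s.strip()]
    if not sentences:
        return {}
    # Assumption: predict returns {label: [score for each sentence]}.
    per_label = predict(sentences)
    return {label: fmean(float(s) for s in scores) for label, scores in per_label.items()}


def stub_predict(sentences: List[str]) -> Dict[str, List[float]]:
    """Stand-in for a real model so the sketch runs without a GPU or model download."""
    return {"toxicity": [0.1 for _ in sentences]}


print(average_sentence_scores(["First sentence.", "Second one."], stub_predict))
```

With the objects defined in this notebook, the call would look something like `average_sentence_scores(split_text_into_sentences(text=tweet, language='en'), model.predict)`, where `tweet` is a single tweet string.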

In [None]:
from src.toxicity_metric_generation_functions import upgraded_generate_toxicity_for_tweet_file

Here we run the toxicity generation on each lemmatized hashtag file.

In [None]:
hashtag_files_lemmatized = ["vegetarian_hashtag_6_1_2023_lemmatized.csv", "netflix_hashtag_08_01_2023_lemmatized.csv", 
                            "uno_hashtag_09_01_2023_lemmatized.csv", "vegan_hashtag_6_1_2023_lemmatized.csv", 
                            "fitness_hashtag_08_01_2023_lemmatized.csv", "musk_hashtag_03_01_2023_lemmatized.csv", 
                            "trump_hashtag_13_01_2023_lemmatized.csv"]
# for trump, check that the right file was added, since its data was collected later ("trump_hashtag_13_01_2023.csv")

# random suffix so that existing output files are not overwritten by mistake
hash_int = random.randrange(1000)
for file_name in hashtag_files_lemmatized:
    replaced_str = file_name.replace('.csv', '').replace('_lemmatized', '')
    output_file = f"./data/detoxify_toxicity_added_hashtags/lemmatized_{replaced_str}_detoxify_toxicity_{hash_int}.csv"
    print(f"Saving to {output_file}")
    
    upgraded_generate_toxicity_for_tweet_file(model=model, input_file=file_name, output_file=output_file)
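
A random suffix in the range 0 to 999 can still collide with an earlier run. As an optional extra safeguard, the output path could be checked before writing. This is a minimal sketch, not part of the original pipeline; the helper name `build_output_path` is hypothetical, and the file naming simply mirrors the loop above.

```python
from pathlib import Path


def build_output_path(input_name: str, output_dir: str, suffix: int) -> Path:
    """Build the output path and refuse to reuse an existing file (hypothetical helper)."""
    # Same name clean-up as in the loop above.
    stem = input_name.replace(".csv", "").replace("_lemmatized", "")
    path = Path(output_dir) / f"lemmatized_{stem}_detoxify_toxicity_{suffix}.csv"
    if path.exists():
        raise FileExistsError(f"Refusing to overwrite {path}")
    return path
```

In the loop, `output_file` could then be set with `build_output_path(file_name, "./data/detoxify_toxicity_added_hashtags", hash_int)`, so a collision stops the run instead of silently replacing earlier results.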

## Unicode
We used the **unidecode** function to automatically convert strings to be more ASCII-compliant. The problem occurred in the Netflix file at around line 595628.
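
For illustration, here is a minimal sketch of this kind of clean-up. The wrapper name `to_ascii` is hypothetical, and the standard-library fallback is only an assumption so the snippet still runs where `unidecode` is not installed; it drops characters that `unidecode` would transliterate.

```python
import unicodedata

try:
    from unidecode import unidecode  # third-party package used in this notebook
except ImportError:
    def unidecode(text: str) -> str:
        # Rough stand-in (assumption): decompose accents, then drop anything non-ASCII.
        return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")


def to_ascii(text: str) -> str:
    """Return a more ASCII-compliant version of the text (hypothetical wrapper)."""
    return unidecode(text)


print(to_ascii("café naïve"))
```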