# Report: Has Twitter become more toxic?
*Žan Stanonik, Damla Cinel, Lorenz Mangold*   
-> TODO spell check the whole thing

In light of the events that took place around the start of the Computational Social Science seminar, our team decided to examine how Elon Musk's takeover of Twitter affected the platform (the acquisition happened on 27.10.2022).   
We focused specifically on toxicity, since it was the most discussed topic in the media and many people pointed to an increase in racism, bullying, and other forms of toxicity.

## 1. Theory and data selection
We could not simply download every tweet posted within a given time span, due to the limitations of the Twitter API and the lack of compute and storage resources on our side.    

Keeping these limitations in mind, we decided to only look at certain "subsets" of Twitter data, which we approximated with hashtags/keywords. We collected only the Tweets that fulfilled the following criteria:
1. They contained at least one of the following hashtags/keywords:
  * trump
  * musk 
  * fitness
  * netflix
  * vegan
  * vegetarian
  * uno
2. They were in English
3. They weren't retweets
4. They didn't have links
5. They weren't replies 
6. They weren't quoting other tweets   
   
With these restrictions we wanted to remove tweets that we couldn't process properly, tweets posted by bots, tweets taken out of context, and scam tweets with links to other webpages. Our hope was that these restrictions would leave only tweets in which people genuinely voice their opinions, and that the remaining volume would be small enough for us to gather all of them.
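
As a rough sketch, the criteria above can be expressed as one search query per keyword. The operator string is the same one used in the example query in section 2; the helper `build_query` and the constants are hypothetical names introduced only for illustration and are not taken from the project's source.

```python
# Hypothetical helper: builds one search query per keyword.
KEYWORDS = ["trump", "musk", "fitness", "netflix", "vegan", "vegetarian", "uno"]

# Criteria 2-6 written as search operators (same operators as the query in section 2).
FILTERS = "lang:en -is:retweet -is:quote -has:links -is:reply"


def build_query(keyword: str) -> str:
    """Match the keyword as plain text or as a hashtag, then apply the filters."""
    return f"( {keyword} OR #{keyword} ) {FILTERS}"


# One query per keyword, keyed by the keyword itself.
queries = {keyword: build_query(keyword) for keyword in KEYWORDS}
print(queries["vegan"])
```

Keeping the filters in a single constant means every subset is collected under identical conditions, so differences between keywords are less likely to stem from the query itself.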

We only looked at the period from 01/06/2022 to 03/01/2023, which amounts to 216 days of data. We chose this window so that we could gather every tweet in our data subsets while staying within the limits of the Twitter Academic API (and its 10 million tweets).



## 2. Data gathering 
Working within the constraints set in the previous chapter, we researched how to actually gather the tweets we were interested in. We tested various libraries such as [Tweepy](https://www.tweepy.org/), but found them too limiting for our purposes. Hence we decided to write our own function for retrieving the tweets, which resulted in the following main function:

In [None]:
from src.twitter_download_functions import download_tweets

search_query = "( vegan OR #vegan ) lang:en -is:retweet -is:quote -has:links -is:reply"
download_tweets(search_query=search_query, 
                tweets_per_day=18000, 
                start_time="01/06/2022 00:00", 
                end_time="03/01/2023 00:00")

With this setup, we specified the query to be passed to the Twitter Academic API in the function parameter `search_query`, while the other parameters were processed locally on our side and added further information to the query.   

The information they added was mainly the time window from which the tweets should be gathered. Our method split each day into `tweets_per_day // 500` intervals, because the API documentation at the time stated that a single request returns at most 500 tweets. In the example above, each day is split into 36 equal time intervals.
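
The splitting of a day into equal intervals can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of `download_tweets`: the function name `split_day` and the constant name are hypothetical, and only the limit of 500 tweets per request and the integer division come from the text.

```python
from datetime import datetime, timedelta

MAX_TWEETS_PER_REQUEST = 500  # per-request limit mentioned in the text


def split_day(day_start: datetime, tweets_per_day: int):
    """Split the 24 hours after day_start into equal (start, end) intervals."""
    # At least one interval, even if tweets_per_day is below the request limit.
    n_intervals = max(1, tweets_per_day // MAX_TWEETS_PER_REQUEST)
    step = timedelta(days=1) / n_intervals
    return [(day_start + i * step, day_start + (i + 1) * step)
            for i in range(n_intervals)]


intervals = split_day(datetime(2022, 6, 1), tweets_per_day=18000)
print(len(intervals), intervals[0][1] - intervals[0][0])
```

With `tweets_per_day=18000` this yields 36 intervals of 40 minutes each, and each interval can then be sent as the time range of one request.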
   
Another limitation of the Twitter API is that no more than 300 requests can be sent per 15-minute window, so collecting all the tweets took a few days. Our final dataset comprised 7 data files with a combined size of 1.14GB and contained ~5.5 million tweets.
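
A back-of-the-envelope estimate shows how the rate limit translates into collection time. It assumes that every keyword used 36 intervals per day (the text only states this for the example call) and that each interval needed exactly one request, so the real number of requests may well have been different.

```python
import math

DAYS = 216                 # collection period stated in the text
INTERVALS_PER_DAY = 36     # 18000 // 500, taken from the example call (assumed for all keywords)
KEYWORDS = 7               # one data file per keyword
REQUESTS_PER_WINDOW = 300  # rate limit stated in the text
WINDOW_MINUTES = 15

# Assumption: exactly one request per interval.
total_requests = DAYS * INTERVALS_PER_DAY * KEYWORDS
# Number of 15-minute windows needed to send all requests.
windows = math.ceil(total_requests / REQUESTS_PER_WINDOW)
hours = windows * WINDOW_MINUTES / 60

print(total_requests, windows, hours)
```

Under these assumptions the rate limit alone accounts for roughly 45 hours of waiting, which is of the same order as the few days the collection took.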


## 3. Text preprocessing 

Once we had all the data we needed, we first fed it directly to the Detoxify model. We realized this was bad practice, however, and decided to preprocess the text of each tweet to make the predictions more robust and to discard any unusual symbols in the text. 
   
The text was preprocessed in the following steps:  
1. First, we used a regex to replace all non-standard characters, such as emojis and non-English letters.
2. Then we collapsed repeated spaces and removed newlines so that each tweet forms one continuous string.
3. Finally, we passed each tweet to the spaCy `en_core_web_sm` model, which lemmatized the words into their base forms.  

All of the steps mentioned were packaged in a function used below:

In [None]:
import spacy, re
from src.text_preparation import optimized_prepare_text_for_tweet_file

# Example pattern: matches runs of the characters listed below
test_reg = re.compile("["
                      u"\U0001F600-\U0001F64F"  # emoticons block
                      u"\u3030"                 # wavy dash
                      "]+", re.UNICODE)
optimized_prepare_text_for_tweet_file(replace_symbols_regex=test_reg, 
                                      input_file="vegetarian_hashtag_6_1_2023.csv", 
                                      output_file_name="test_vegetarian_hashtag_6_1_2023.csv", 
                                      nlp_model=spacy.load('en_core_web_sm'))

We ran the function above on all the tweets we gathered and saved the preprocessed text as an additional column in the CSV files, which increased the size of the data to 1.9GB.
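
The three preprocessing steps can be sketched as follows. This is a minimal illustration, not the body of `optimized_prepare_text_for_tweet_file`, which is not shown here and may differ. Two things are assumptions: treating every non-ASCII character as non-standard, and replacing it with a space. The names `clean_text` and `lemmatize` are hypothetical.

```python
import re

# Assumption: every character outside the ASCII range counts as "non-standard".
NON_STANDARD = re.compile(r"[^\x00-\x7F]+")
WHITESPACE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Steps 1 and 2: replace non-standard characters, then collapse whitespace."""
    text = NON_STANDARD.sub(" ", text)
    return WHITESPACE.sub(" ", text).strip()


def lemmatize(text: str, nlp) -> str:
    """Step 3: join the lemma of every token produced by a spaCy pipeline."""
    return " ".join(token.lemma_ for token in nlp(text))


print(clean_text("I love   tofu 😀\n\nso much"))
```

With spaCy and the model installed, the full chain would be `lemmatize(clean_text(tweet), spacy.load("en_core_web_sm"))`. Loading the model once and reusing it for all tweets avoids repeating the expensive load.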

## 4. Toxicity metric generation

### 4.1 Perspective API
Initially, our goal was to use the [Perspective API](https://perspectiveapi.com/) offered by Google to obtain toxicity metrics for the tweets we gathered. It turned out that the API is quite limited in the number of requests per second it accepts. Even though we were granted a higher request rate, we calculated that we would need at least 8 days to get the metrics for all 5.5 million tweets. We therefore abandoned this approach, both because of time constraints and because of the additional work needed to send more requests per second to the API.
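
The kind of calculation behind such an estimate can be sketched as follows. The quota of 8 requests per second is a hypothetical value chosen for illustration; the actual quota is not stated in this report. The sketch also assumes one request per tweet.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_needed(n_requests: int, requests_per_second: float) -> float:
    """Time needed to send n_requests at a sustained rate, in days."""
    return n_requests / requests_per_second / SECONDS_PER_DAY


# Hypothetical quota of 8 requests per second; the real quota is not given in the text.
print(round(days_needed(5_500_000, requests_per_second=8), 1))
```

Any pause, retry, or failed request adds to this figure, so an estimate of this kind is best read as a lower bound.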

-> TODO Lorenz add something if you think it is needed


### 4.2 Detoxify library
After abandoning the initial approach with the Perspective API, we searched for alternatives and found the [Detoxify Python library](https://pypi.org/project/detoxify/), which seemed straightforward to use and is built on top of models that won multiple Kaggle competitions.

We then designed a function that uses the `bert-base-uncased` model provided by the library to predict the toxicity metric. When feeding the tweets to the model, we split each tweet into sentences, obtained the toxicity of each sentence separately, and averaged the toxicity over all sentences of the tweet to make the predictions more robust.
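
The sentence-level averaging can be sketched as follows. This is an illustration under assumptions, not the project's function: the regex-based sentence splitter is a naive stand-in (the splitter actually used is not described here), and the names `split_sentences`, `tweet_toxicity` and `fake_scorer` are hypothetical. The scorer is passed in as a callable so that the sketch runs without downloading a model.

```python
import re
from statistics import mean
from typing import Callable, List, Optional


def split_sentences(text: str) -> List[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def tweet_toxicity(text: str,
                   score_sentences: Callable[[List[str]], List[float]]) -> Optional[float]:
    """Average the per-sentence toxicity scores; returns None for an empty tweet."""
    sentences = split_sentences(text)
    if not sentences:
        return None
    return mean(score_sentences(sentences))


# Stand-in scorer for illustration only. With Detoxify installed, one could load
# model = Detoxify("original") once and pass lambda s: model.predict(s)["toxicity"].
def fake_scorer(sentences: List[str]) -> List[float]:
    return [1.0 if "awful" in s else 0.0 for s in sentences]


print(tweet_toxicity("This is awful. This is fine.", fake_scorer))
```

Scoring all sentences of a tweet in a single call keeps the number of model invocations at one per tweet, which matters at the scale of several million tweets.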

Due to the large amount of data, we added intermediate saving of the tweets and metrics during execution to reduce the memory load. We also loaded the model onto the GPU with the help of CUDA, which sped up execution by a factor of 5. An example of the developed function can be seen below:

In [None]:
from detoxify import Detoxify
from src.toxicity_metric_generation_functions import upgraded_generate_toxicity_for_tweet_file

upgraded_generate_toxicity_for_tweet_file(model=Detoxify("original", device="cuda"), 
                                          input_file="vegetarian_hashtag_6_1_2023_lemmatized.csv", 
                                          output_file="vegetarian_hashtag_6_1_2023_lemm_test.csv")

This was run for all the tweets in our dataset, once on the raw text and once on the preprocessed, lemmatized text. We did this to compare the results and to see what effect the text preprocessing had on the toxicity metrics. 
With the added metrics we were now up to 2.59GB of data for both texts.
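
The intermediate saving described in this section can be sketched with the standard library alone. This is a minimal illustration, not the body of `upgraded_generate_toxicity_for_tweet_file`. Three things are assumptions: the column name `text`, the output column name `toxicity`, and the chunk size. The names `add_scores_in_chunks` and `_write_chunk` are hypothetical.

```python
import csv
from typing import Callable, List


def _write_chunk(chunk, writer, score_texts, text_column) -> int:
    """Score one chunk of rows and write it out with the added column."""
    scores = score_texts([row[text_column] for row in chunk])
    for row, score in zip(chunk, scores):
        writer.writerow({**row, "toxicity": score})
    return len(chunk)


def add_scores_in_chunks(input_path: str, output_path: str,
                         score_texts: Callable[[List[str]], List[float]],
                         text_column: str = "text", chunk_size: int = 1000) -> int:
    """Score a CSV chunk by chunk, so only one chunk is held in memory at a time."""
    written = 0
    with open(input_path, newline="", encoding="utf-8") as src, \
         open(output_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        fieldnames = list(reader.fieldnames or []) + ["toxicity"]
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        chunk = []
        for row in reader:
            chunk.append(row)
            if len(chunk) == chunk_size:
                written += _write_chunk(chunk, writer, score_texts, text_column)
                dst.flush()  # intermediate save: push finished rows to disk
                chunk = []
        if chunk:  # remaining rows that did not fill a whole chunk
            written += _write_chunk(chunk, writer, score_texts, text_column)
    return written
```

Because finished rows are flushed after every chunk, an interrupted run loses at most one chunk of work, and memory use stays bounded regardless of the file size.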

## 5. Analysis 
-> TODO start from here add results charts etc.

## 6. Conclusions
