# Identify (Users and/or Tweets) to Filter

**Authors:** Simon Todd & Chloe Willis  
**Date:** July 2023

---

The purpose of this notebook is to identify Tweets that need to be removed from the dataset on the basis of (1) the user who posted the Tweet or (2) the content of the Tweet. These Tweets are not removed from the dataset in this notebook; rather, they are labeled here according to the level at which they would be filtered out, to facilitate the pairing of Tweets in the study and reference corpora by timebin and filter level in `align_corpora.py`. The actual filtering occurs in the calculation of keyness scores, in `score_keyness.py`.

We have included this as a notebook rather than a standalone .py file because the criteria for filtering are dependent upon your purpose, and are often defined interactively through data exploration. You may use separate criteria to label your Tweets for level of filtering; labels will be carried across for pairing in `align_corpora.py`, and can be flexibly provided when calculating keyness scores in `score_keyness.py`.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import csv
import glob
from collections import Counter
from count_tweet_words import extract_words

## Filtering on the basis of users

To facilitate filtering on the basis of users, we will keep a set of users that we need to filter out. We'll update this set as we build our filtering criteria.

In [None]:
users_to_filter = set()

### Filtering out bots using tweetbotornot2

In our analysis, we used [tweetbotornot2](https://github.com/mkearney/tweetbotornot2) to evaluate the probability that each unique user in our dataset was a bot, based on characteristics of their profile and most recent 200 Tweets. This section demonstrates how the output of tweetbotornot2 can be used to filter out likely bots.

First, we load the tweetbotornot2 output. We don't need the user ID column, since our harvested Tweets store usernames directly.

In [None]:
# Path to the tweetbotornot2 output file
tweetbotornot_output_path = "path/to/your/file.csv"
tweetbotornot_output = pd.read_csv(tweetbotornot_output_path, usecols=["screen_name", "prob_bot"])

We need to identify a cutoff point for deciding that a user is a bot. To facilitate this, we can plot a histogram of bot probabilities:

In [None]:
tweetbotornot_output.hist(column="prob_bot", bins=100)

In our data, the vast majority of users are assessed by tweetbotornot2 to have a probability of 0.1 or lower of being a bot, while there is a pocket of users concentrated around probability 0.99 of being a bot. We decided to label for filtering all users with a bot probability of 0.9 or greater.

In [None]:
prob_threshold = 0.9 # Can be changed
users_to_filter.update(tweetbotornot_output.loc[tweetbotornot_output["prob_bot"] >= prob_threshold]["screen_name"])
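Before settling on a cutoff, it can help to check how many users each candidate threshold would flag. The following is a minimal sketch on made-up data; `toy_output` and `candidate_thresholds` are hypothetical names, and the probabilities are invented for illustration rather than taken from our dataset.

```python
import pandas as pd

# Toy stand-in for the tweetbotornot2 output (values are made up)
toy_output = pd.DataFrame({
    "screen_name": ["a", "b", "c", "d", "e"],
    "prob_bot": [0.02, 0.05, 0.50, 0.93, 0.99],
})

# Count how many users each candidate threshold would flag as likely bots
candidate_thresholds = [0.5, 0.7, 0.9]
flagged_counts = {
    threshold: int((toy_output["prob_bot"] >= threshold).sum())
    for threshold in candidate_thresholds
}
print(flagged_counts)
```

If the count barely changes across a range of thresholds, as it would for a distribution with a clear gap between the low and high pockets, the exact cutoff chosen within that range matters little.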

### Filtering out templatic users on the basis of type:token ratio

If bot detection is not available (e.g. because of changes in the Twitter API), or is not extensive enough, users can be further filtered on the basis of the type:token ratio (TTR) of the Tweets on their timeline: the number of distinct words (types) divided by the total number of words (tokens). A low TTR indicates less lexical variation, meaning that the user posts many templatic Tweets. This may be because the user is actually a bot, because they frequently use automation to generate Tweets (e.g. using Instagram or Etsy to auto-generate Tweets), and/or because they tend to post spam. We aim to exclude users whose TTR is too low.

For our analysis, we harvested the 200 most recent Tweets from each user's timeline using the `rtweet` library in `R`, which is invoked in [tweetbotornot2](https://github.com/mkearney/tweetbotornot2) (see the `tweetbotornot.R` script). We note that timelines can also be harvested through [Twarc2 in Python](https://twarc-project.readthedocs.io/en/latest/twarc2_en_us/#timelines).

The function below calculates TTR for all unique users whose timelines have been harvested, and saves the results to a CSV file. In this calculation, it:  
- normalizes words by removing punctuation and lowercasing (except for the @ and # characters, which designate mentions and hashtags, respectively);  
- treats emoji as regular words, and separates emoji that are written as a sequence with no spaces;  
- treats all media as tokens of a single type (`"MEDIA"`), and also calculates the number of media in the user's Tweets;  
- treats all (non-media) links as tokens of a single type (`"LINK"`), and also calculates the number of links in the user's Tweets;  
- also calculates a separate TTR over hashtags only, to see if the user repeatedly posts the same hashtags. 

The function assumes that the timelines have been saved in CSV format. Users may be batched across different files, but all of the Tweets for a single user must appear in consecutive rows of a single CSV file; otherwise, the same user will be written to the output more than once.
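As a toy illustration of the TTR calculation, the sketch below works through three invented Tweets from one user. It makes simplifying assumptions: `toy_tokenize` is a hypothetical stand-in for `extract_words` (it only lowercases and splits on whitespace), and a single counter is used for words, media, and links.

```python
from collections import Counter

def toy_tokenize(text):
    """Hypothetical stand-in for extract_words: lowercases and splits on whitespace."""
    return text.lower().split()

# Three toy Tweets from one user; the first two are identical (templatic)
tweets = [
    {"text": "new listing on my shop", "media": 1, "links": 1},
    {"text": "new listing on my shop", "media": 1, "links": 1},
    {"text": "thanks for the follow", "media": 0, "links": 0},
]

type_counter = Counter()
for tweet in tweets:
    type_counter.update(toy_tokenize(tweet["text"]))
    # All media collapse into one "MEDIA" type; all links into one "LINK" type
    type_counter.update(["MEDIA"] * tweet["media"])
    type_counter.update(["LINK"] * tweet["links"])

type_count = len(type_counter)            # number of distinct types
token_count = sum(type_counter.values())  # total number of tokens
ttr = type_count / token_count
print(type_count, token_count, round(ttr, 2))
```

Because the repeated Tweet adds tokens without adding types, it pulls the ratio down; a user who posts only the templatic Tweet would score lower still.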

In [None]:
def type_token(input_csv_filepaths, output_csv_filepath,
              username_col="screen_name", text_col="text",
              hashtag_col="hashtags", media_col="media_expanded_url",
              links_col="urls_expanded_url"):
    """Writes a CSV file containing type:token ratio information for each 
    user in CSV files containing recent Tweets from users' timelines.
    
    Arguments
    ---------
    input_csv_filepaths: list(str); filepaths where your timeline data is stored 
                         (see use of glob.glob below)
    output_csv_filepath: str; desired file name & filepath for output CSV
    username_col: str (default: "screen_name"); name of the column in input CSVs
                  where usernames are stored
    text_col: str (default: "text"); name of the column in input CSVs
              where Tweet text is stored
    hashtag_col: str (default: "hashtags"); name of the column in input CSVs
                 where hashtags are stored
    media_col: str (default: "media_expanded_url"); name of the column in input
               CSVs where media URLs are stored
    links_col: str (default: "urls_expanded_url"); name of the column in input
               CSVs where URLs of links in the Tweet are stored
    """
    
    with open(output_csv_filepath, "w", encoding="utf-8", newline="") as out_file:
        writer = csv.writer(out_file)
        writer.writerow(["USERNAME", "TYPE_COUNT", "TOKEN_COUNT", "RATIO", "TOTAL_TWEETS", 
                         "HASHTAG_TYPES", "HASHTAG_TOKENS", "HASHTAG_RATIO", 
                         "MEDIA_COUNT", "LINK_COUNT"])
    
        # Iterate through timeline files
        for input_csv_filepath in input_csv_filepaths:
            with open(input_csv_filepath, encoding="utf-8", newline="") as in_file:
                reader = csv.DictReader(in_file)

                # Initialize information trackers
                username = None                
                (type_counter, total_tweets, hashtag_counter, media_counter, link_counter) = \
                    initialize_trackers()
                
                # Iterate through Tweets in file
                for row in reader: 
                    
                    # Check if this Tweet is from the same user as the previous one
                    previous_username = username
                    username = row[username_col]
                    if previous_username is not None and previous_username != username:
                            
                        # When moving to a new user, save the results for the previous user
                        user_results = calculate_user_results(previous_username, type_counter, total_tweets, 
                                                              hashtag_counter, media_counter, link_counter)
                        writer.writerow(user_results)

                        # Reset information trackers for the new user
                        (type_counter, total_tweets, hashtag_counter, media_counter, link_counter) = \
                            initialize_trackers()

                    # Count types and tokens in the current Tweet
                    (type_counter, total_tweets, hashtag_counter, media_counter, link_counter) = \
                        update_trackers(row, type_counter, total_tweets, hashtag_counter, media_counter,
                                        link_counter, text_col=text_col, hashtag_col=hashtag_col,
                                        media_col=media_col, links_col=links_col)

                # Save the results for the final user (skipped if the file had no rows)
                if username is not None:
                    user_results = calculate_user_results(username, type_counter, total_tweets, 
                                                          hashtag_counter, media_counter, link_counter)
                    writer.writerow(user_results)

def initialize_trackers():
    """Returns empty trackers to count types, hashtags, media, links, and tweets"""
    type_counter = Counter()
    hashtag_counter = Counter()
    media_counter = Counter()
    link_counter = Counter()
    total_tweets = 0
    return (type_counter, total_tweets, hashtag_counter, media_counter, link_counter)
                
def update_trackers(row, type_counter, total_tweets, hashtag_counter, media_counter, link_counter,
                    text_col="text", hashtag_col="hashtags", 
                    media_col="media_expanded_url", links_col="urls_expanded_url"):
    """Updates the trackers to incorporate a Tweet.
    Note: media and links are all conflated to tokens of a single type.
    
    Arguments
    ---------
    row: dict; a row of the input CSV, representing a Tweet
    type_counter: Counter; a counter over the types used by this user
    total_tweets: int; the number of Tweets analyzed for this user
    hashtag_counter: Counter; a counter over the hashtag types used by this user
    media_counter: Counter; a counter over the media used by this user
    link_counter: Counter; a counter over the links used by this user
    text_col, hashtag_col, media_col, links_col: str; names of the columns in
        the input CSV (same meanings and defaults as in type_token)
    """
    # Count the word types in the Tweet text
    tweet = row[text_col]
    words = extract_words(tweet)
    type_counter.update(words)

    # Count the hashtag types separately
    hashtags = row[hashtag_col]
    tags = extract_words(hashtags)
    hashtag_counter.update(tags)

    # Conflate all media to tokens of the single type "MEDIA"
    media = row[media_col].split()
    media_conflated = ["MEDIA"] * len(media)
    media_counter.update(media_conflated)

    # Conflate all (non-media) links to tokens of the single type "LINK"
    links = row[links_col].split()
    links_conflated = ["LINK"] * len(links)
    link_counter.update(links_conflated)
    
    total_tweets += 1
    
    return (type_counter, total_tweets, hashtag_counter, media_counter, link_counter)


def calculate_user_results(username, type_counter, total_tweets, hashtag_counter, 
                           media_counter, link_counter):
    """Calculates the results for a user, based on completed counters"""
    total_media = sum(media_counter.values())
    total_links = sum(link_counter.values())
    type_count = len(type_counter) + len(media_counter) + len(link_counter)
    token_count = sum(type_counter.values()) + total_media + total_links
    # Guard against users whose Tweets yielded no tokens at all
    ratio = type_count / token_count if token_count > 0 else ""

    hashtag_type = len(hashtag_counter)
    hashtag_token = sum(hashtag_counter.values())
    if hashtag_token > 0:
        hashtag_ratio = hashtag_type / hashtag_token
    else:
        hashtag_ratio = ""

    return [username, type_count, token_count, ratio, total_tweets,
            hashtag_type, hashtag_token, hashtag_ratio, total_media, total_links]

The code cell below demonstrates how to use this function to generate user TTR statistics:

In [None]:
# Use glob.glob to get a list of all of the input CSV files,
# based on a template for their path/filename
timeline_data = glob.glob("timeline_chunks/timelines_*.csv")  

# Calculate type:token ratios and save them as "user_ttr_statistics.csv"
type_token(timeline_data, "user_ttr_statistics.csv")

It is useful to get a summary of the data, to inform filtering decisions.

In [None]:
# Load the output as a pandas dataframe
timeline_statistics = pd.read_csv("user_ttr_statistics.csv", encoding = "utf-8")

# Inspect your results
timeline_statistics.describe()

One way to filter users is based on their overall productivity, as indicated by the number of Tweets on their timeline. Since a timeline draws on all publicly available Tweets posted by the user over the history of their account (here, up to the 200 most recent), users who do not have many Tweets on their timeline are not productive posters and thus cannot be taken to be reflective of broader community norms.

In our data, 25% of users had fewer than 196 Tweets on their timeline; in order to focus on users who are maximally productive, we chose to exclude these users.

In [None]:
timeline_post_threshold = 196 # Can be changed
users_to_filter.update(timeline_statistics.loc[timeline_statistics["TOTAL_TWEETS"] < timeline_post_threshold]["USERNAME"])
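A quartile-based threshold like this one can be read from the `25%` row of `describe()`, or computed directly. The sketch below shows the direct computation on made-up data; `toy_statistics`, `first_quartile`, and `low_activity_users` are hypothetical names, and the counts are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the TTR statistics file (values are made up)
toy_statistics = pd.DataFrame({
    "USERNAME": ["a", "b", "c", "d", "e"],
    "TOTAL_TWEETS": [10, 20, 30, 40, 50],
})

# First quartile of the number of Tweets per timeline
# (pandas interpolates linearly between data points by default)
first_quartile = toy_statistics["TOTAL_TWEETS"].quantile(0.25)

# Users strictly below the first quartile would be marked for filtering
low_activity_users = set(
    toy_statistics.loc[toy_statistics["TOTAL_TWEETS"] < first_quartile, "USERNAME"]
)
print(first_quartile, low_activity_users)
```

Note that the comparison is strict, so users sitting exactly at the quartile are kept.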

Another way to filter users is based on their TTR, as discussed above. For this, it is useful to visualize the distribution of TTR values across the users, to identify a TTR threshold that seems like it would not cut off regular users.

In [None]:
# A histogram showing the distribution of TTR across users
timeline_statistics.hist(column="RATIO", bins=100)

In our data, the majority of users have TTR values greater than 0.2. We decided to exclude users with a TTR of less than 0.2.

In [None]:
ttr_threshold = 0.2 # Can be changed
users_to_filter.update(timeline_statistics.loc[timeline_statistics["RATIO"] < ttr_threshold]["USERNAME"])

## Filtering on the basis of Tweets

To facilitate filtering on the basis of Tweets, we will keep a set of Tweet IDs that we need to filter out. We'll update this set as we build our filtering criteria.

In [None]:
tweets_to_filter = set()

### Filtering out Tweets with NSFW images

In our analysis, we used an [NSFW image detector](https://github.com/GantMan/nsfw_model) to evaluate the probability that a Tweet contains NSFW images. To use this on your own data, you will need to install the image detector (v1.2) and then run our script `classify_images.py` on the JSONL file(s) you harvested from Twitter. This will yield a CSV file with the following columns:  
- `tweet.id`: the unique ID number of the Tweet  
- `image`: the URL of the image examined by the detector (only the first image in each Tweet is checked)  
- `p_drawing`: the probability that the image is a neutral (SFW) drawing  
- `p_hentai`: the probability that the image is a pornographic or sexually explicit (NSFW) drawing  
- `p_neutral`: the probability that the image is a neutral (SFW) photograph  
- `p_porn`: the probability that the image is a pornographic (NSFW) photograph  
- `p_sexy`: the probability that the image is a sexually explicit (NSFW) photograph (but not necessarily pornographic in nature)  

These probabilities form a distribution: for any image, the probabilities all add up to 1.

Based on inspection of our data, we decided to exclude any Tweet for which any of `p_hentai`, `p_porn`, or `p_sexy` was 0.6 or higher. The following code cell demonstrates how the IDs of such Tweets can be loaded into the exclusion set:

In [None]:
with open("image_probs.csv", encoding = "utf-8") as in_file:
    reader = csv.DictReader(in_file)
    
    for row in reader:
        p_hentai = float(row["p_hentai"])
        p_porn = float(row["p_porn"])
        p_sexy = float(row["p_sexy"])
        tweet_id = row["tweet.id"]
    
        # The criteria below can be changed
        if p_hentai >= 0.6 or p_porn >= 0.6 or p_sexy >= 0.6:
            tweets_to_filter.add(tweet_id)

### Filtering out Tweets on the basis of stopwords

The script `count_tweet_words.py` can be used to get counts of all the unique words in a CSV file of Tweets. Manual inspection of the wordlist can reveal common terms that seem to indicate material that should be filtered out (especially if using the `--keep-links` option, to keep links in the Tweet text). These terms, or *stopwords*, can be saved to a .txt file (with one term per line), and the following code cell can be used to identify Tweets containing any of these terms and mark them for filtering.

In [None]:
# Enter the path to the stopword file:
stopwords_path = "stopwords.txt"

# Enter the paths to the Tweet CSVs, which contain columns for tweet.id and tweet.text
tweet_csvs = ["study.csv", "reference.csv"]

# Read the stopwords into a set
stopwords = set()
with open(stopwords_path, encoding="utf-8") as in_file:
    for line in in_file:
        stopwords.add(line.strip())

# Find stopwords in Tweets and use them to catch Tweet IDs
for tweet_filepath in tweet_csvs:
    with open(tweet_filepath, encoding="utf-8", newline="") as in_file:
        # Pass the open file object (not the path string) to the reader
        reader = csv.DictReader(in_file)
        
        for row in reader:
            tweet_id = row["tweet.id"]
            tweet_terms = extract_words(row["tweet.text"], remove_links=False)
            if stopwords.intersection(tweet_terms):
                tweets_to_filter.add(tweet_id)

In our analysis, manual inspection of the data revealed that a large proportion of Tweets containing the query terms #bi, #bisexuals, #bisexuality, and #bisexualpride contained sexually explicit content. Tweets containing the query term #bi were also drawn from two irrelevant domains (Business Intelligence, and the Korean rapper B.I). We split our study corpus into NSFW Tweets, which contained any of these query terms, and SFW Tweets, which did not. We used `count_tweet_words.py` (with the `--keep-links` option) to count how many times each word occurred in each subset of the corpus, and from these counts we calculated an NSFW:SFW ratio for each word. We marked as stopwords all words that occurred at least 100 times in the NSFW subset and had an NSFW:SFW ratio of at least 224 (which was the ratio for #bi, the query term with the highest NSFW:SFW ratio).
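The selection rule described above can be sketched as follows. This is an illustration under stated assumptions, not the exact procedure we ran: `select_stopwords` and its parameter names are hypothetical, the ratio is taken over raw counts, and a word that never occurs in the SFW subset is treated as having an infinite ratio. The toy counts are invented.

```python
from collections import Counter

def select_stopwords(nsfw_counts, sfw_counts, min_nsfw_count=100, min_ratio=224):
    """Returns the words that are frequent in the NSFW subset and have
    a high NSFW:SFW ratio.
    Assumption: the ratio is over raw counts, and a word absent from
    the SFW subset is treated as having an infinite ratio."""
    stopwords = set()
    for word, nsfw_count in nsfw_counts.items():
        # Skip words that are too rare in the NSFW subset
        if nsfw_count < min_nsfw_count:
            continue
        sfw_count = sfw_counts.get(word, 0)
        ratio = nsfw_count / sfw_count if sfw_count > 0 else float("inf")
        if ratio >= min_ratio:
            stopwords.add(word)
    return stopwords

# Toy word counts for each subset (values are made up)
toy_nsfw = Counter({"promo": 2240, "link": 500, "love": 900, "rare": 50})
toy_sfw = Counter({"promo": 10, "love": 800})

toy_stopwords = select_stopwords(toy_nsfw, toy_sfw)
print(sorted(toy_stopwords))
```

The resulting set could then be written to a .txt file with one term per line, for use with the stopword cell above.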

## Labeling Tweets for filtering

Now we have a set of usernames and Tweet IDs to use for filtering. The final step is to use these sets to assign a label to each Tweet, indicating the level at which it should be filtered out. We use the following labels:  
- `user-excluded`: Tweets that are filtered out because they were posted by users who are in `users_to_filter`  
- `tweet-excluded`: Tweets that survive user filtering, but are filtered out because they have an ID in `tweets_to_filter`  
- `included`: Tweets that survive user and Tweet filtering  

In our analysis, we designate Tweets with the *included* label as the *Tweet-filtered* data (i.e., the data after filtering to the Tweet level), and Tweets with the *included* or *tweet-excluded* labels as the *user-filtered* data (i.e., the data after filtering to the user level).

We first define a function that takes a row of a Tweet CSV as input and returns a label for it. The ordering is important here: the highest-level filtering should be applied first.

In [None]:
def label_tweet(row, users_to_filter, tweets_to_filter):
    """Returns a label (str) based on the stage at which a Tweet
    should be filtered out of the data.
    
    Arguments
    ---------
    row: dict; a row of the Tweet CSV file, mapping from column
         headings to column values
    users_to_filter: set(str); a set of usernames to filter out
    tweets_to_filter: set(str); a set of Tweet IDs to filter out
    """
    if row["user.username"] in users_to_filter:
        return "user-excluded"
    elif row["tweet.id"] in tweets_to_filter:
        return "tweet-excluded"
    else:
        return "included"

The following code cell labels each Tweet in the Tweet CSVs, saving the results in new CSV files with the `_labeled` suffix.

In [None]:
# Enter the paths to the Tweet CSVs
tweet_csvs = ["study.csv", "reference.csv"]

# Read each CSV file and write a labeled version
for tweet_filepath in tweet_csvs:
    with open(tweet_filepath, encoding="utf-8", newline="") as in_file:
        # Pass the open file object (not the path string) to the reader
        reader = csv.DictReader(in_file)
        
        with open(tweet_filepath.replace(".csv", "_labeled.csv"), "w", encoding="utf-8", newline="") as out_file:
            fields = reader.fieldnames + ["label"]
            writer = csv.DictWriter(out_file, fields)
            writer.writeheader()
        
            for row in reader:
                row["label"] = label_tweet(row, users_to_filter, tweets_to_filter)
                writer.writerow(row)
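Given labeled rows, the *Tweet-filtered* and *user-filtered* subsets described above can be recovered from the labels alone. The sketch below illustrates this on invented rows; `tweet_filtered`, `user_filtered`, and `toy_rows` are hypothetical names, and in the actual pipeline this selection is handled when calculating keyness scores in `score_keyness.py`.

```python
def tweet_filtered(rows):
    """Keeps only Tweets that survive both user and Tweet filtering."""
    return [row for row in rows if row["label"] == "included"]

def user_filtered(rows):
    """Keeps Tweets that survive user filtering, whether or not
    they would also survive Tweet filtering."""
    return [row for row in rows if row["label"] in {"included", "tweet-excluded"}]

# Toy labeled rows (values are made up)
toy_rows = [
    {"tweet.id": "1", "label": "included"},
    {"tweet.id": "2", "label": "tweet-excluded"},
    {"tweet.id": "3", "label": "user-excluded"},
    {"tweet.id": "4", "label": "included"},
]

tweet_level = tweet_filtered(toy_rows)
user_level = user_filtered(toy_rows)
print(len(tweet_level), len(user_level))
```

The Tweet-filtered data is always a subset of the user-filtered data, which is why the labels form a hierarchy.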

The labeled Tweets can be paired across corpora using `align_corpora.py`, as in the following:

```
python align_corpora.py study_labeled.csv reference_labeled.csv paired.csv --label-hierarchy included tweet-excluded user-excluded
```

Notice that the labels are provided in hierarchical order from lowest- to highest-level (separated by spaces), using the `--label-hierarchy` argument. Running this command creates a file `paired.csv` that pairs each study Tweet with a reference Tweet.