In [None]:
import pandas as pd
import os
import utilities.tweet_utils as tweet_utils

## 1. Load State Information Operator Tweets from Twitter Data Dump
In this notebook, we take a selection of tweets from known state operators, released by Twitter as part of its information operations archives. The published archives are available [here](https://transparency.twitter.com/en/reports/information-operations.html). Our objective is to clean and process these tweets, removing retweets, so that we can train a model to detect tweet sequences that identify state operators.

The intended use case is that, given any tweet, we can take that tweet together with the N preceding tweets from the same user (N = 9 here, for a sequence of 10) and classify the user as a state operator or a normal user.

Data dumps should be downloaded to the `raw_downloads/[language]` folder and extracted there. The `merge_csvs_on_columns` function will load all the CSV files in this folder (assuming each file has the same column names) and concatenate them into a dataframe. Empty tweets will be dropped.

In this case, I have downloaded and extracted all of the Russian and Chinese data dumps back to 2019. Where the ZIP folders included files going back several years, I only included 2019-2022.

### 1.1 Basic Functions

Our first order of business is to create a number of helper functions that we can use to load, combine, clean, and assemble data for more thorough analysis and training.

In [None]:
def merge_csvs_on_columns(data_dir, columns):
    """Merges downloaded Tweet CSVs in target folder, dropping NAs

    Args:
        data_dir (string): Name of directory in which CSVs are located
        columns (list of strings): List of columns in target CSVs

    Returns:
        Pandas dataframe: The merged CSVs in the target directory    
    """
    filenames = [name for name in os.listdir(data_dir) 
                 if os.path.splitext(name)[-1]=='.csv']
    
    frames = []
    
    for fname in filenames:
        # remove any rows with blank Tweets
        tmp_df = pd.read_csv(os.path.join(data_dir, fname)).dropna(
            subset=['tweet_text'])
        frames.append(tmp_df[columns])
    
    # DataFrame.append was removed in pandas 2.0, so concatenate once at the end
    if not frames:
        return pd.DataFrame(columns=columns)
    return pd.concat(frames, ignore_index=True)

### 1.2 Process Russian Tweets

In [None]:
# named data_dir to avoid shadowing the built-in dir()
data_dir = '../raw_downloads/russian/'
cols = ['userid','user_profile_description','tweet_text','tweet_time',
        'tweet_language','is_retweet','hashtags','urls']
df = merge_csvs_on_columns(data_dir, cols)

#### 1.2.1 Clean Text

In [None]:
# remove all non-English Tweets
df = df[df['tweet_language']=='en']

In [None]:
df.head()

In [None]:
df.shape, len(pd.unique(df['userid']))

This leaves us with a sizable set of English-language tweets, all from 2019-2021. That should be enough to test whether we can train a model at all.

As a next step, we will remove all retweets, since the text of retweets is likely to come from natural sources (news and so forth). We want to exclude these from the training so that the language model is trained only on the language patterns that come from text the state operators themselves wrote.

In [None]:
crit1 = df["is_retweet"] == False
crit2 = ~df["tweet_text"].str.startswith("RT")

df = df[crit1 & crit2].copy()
df.shape, len(pd.unique(df['userid']))

Next, we'll clean the text using the custom `clean_text` function implemented in the `tweet_utils` module. This applies a series of regex operations to clean up special characters, @ mentions, newlines, URLs, and so forth. We'll also add a word count.

In [None]:
df["clean_tweets"] = df["tweet_text"].map(tweet_utils.clean_text)
df['word_count'] = df['clean_tweets'].str.count(' ') + 1
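
`clean_text` is defined in `tweet_utils` and is not reproduced in this notebook. As a rough illustration of the kind of regex pass described above, here is a minimal sketch. The name `clean_text_sketch` and every pattern in it are assumptions made for illustration; the real function may clean more, less, or differently.

```python
import re

# Hypothetical patterns; the real tweet_utils.clean_text may use different ones.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_TEXT_RE = re.compile(r"[^A-Za-z0-9\s.,!?'#-]")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text_sketch(text):
    """Illustrative cleaner: strips URLs, @ mentions, odd symbols, and newlines."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_TEXT_RE.sub(" ", text)
    # Collapsing runs of whitespace also removes newlines and tabs.
    return WHITESPACE_RE.sub(" ", text).strip()


sample = "Breaking news @user:\nread more at https://t.co/abc123 #politics"
cleaned = clean_text_sketch(sample)
print(cleaned)
```

Because the cleaned text is separated by single spaces, counting spaces and adding one (as in the cell above) gives the number of tokens.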

Finally, we'll get rid of empty or almost-empty tweets with the following filters.

In [None]:
crit1 = ~df["clean_tweets"].isnull()
crit2 = df["clean_tweets"] != ""
crit3 = df["word_count"] > 3

df = df[crit1 & crit2 & crit3].copy()
df.shape, len(pd.unique(df['userid']))

#### 1.2.2 Create Tweet Sequences
If we want to determine whether a user is a state operator, it is unlikely that one single tweet will provide us with enough data. State operators often post on a variety of topics (including sharing memes and links likely to be popular) in order to disguise their activity and gain followers. In order to effectively assess a user, we'll train our model on multiple sequences of N tweets. For this iteration, we've chosen `N=10`. The custom function below does the following:
 1. Sort tweets by user and then date.
 2. For each tweet, find the 9 previous tweets from the same user and concatenate them in reverse-chronological order.
 3. Add the sequence of 10 tweets to the dataframe on the row of the most recent tweet in the sequence.

In [None]:
df = tweet_utils.combine_tweets(df, 10)
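
`combine_tweets` lives in `tweet_utils` and is not shown in this notebook. The three steps listed above could be sketched roughly as follows. This is a sketch under stated assumptions, not the actual implementation: the name `combine_tweets_sketch`, the space separator, the choice to build sequences from `clean_tweets`, and the decision to drop tweets with fewer than `n - 1` predecessors are all guesses. Only the `recent_tweets` column name comes from the notebook itself.

```python
import pandas as pd


def combine_tweets_sketch(df, n, text_col="clean_tweets", sep=" "):
    """Illustrative stand-in for tweet_utils.combine_tweets (assumed behavior)."""
    # Step 1: sort by user, then by time, so each user's tweets are contiguous.
    out = df.sort_values(["userid", "tweet_time"]).reset_index(drop=True)

    # Step 2: lag the text column within each user. Lag 0 is the current tweet,
    # lag 1 the one before it, and so on, so the column order is newest-first.
    grouped = out.groupby("userid")[text_col]
    lags = [out[text_col]] + [grouped.shift(k) for k in range(1, n)]
    lagged = pd.concat(lags, axis=1, keys=list(range(n)))

    # Assumption: keep only rows that have a full window of n tweets.
    complete = lagged.notna().all(axis=1)
    out = out[complete].copy()

    # Step 3: store the joined sequence on the row of the most recent tweet.
    out["recent_tweets"] = [sep.join(row) for row in lagged[complete].to_numpy()]
    return out


demo = pd.DataFrame({
    "userid": ["a", "a", "a", "b", "b"],
    "tweet_time": ["2020-01-03", "2020-01-01", "2020-01-02",
                   "2020-01-01", "2020-01-02"],
    "clean_tweets": ["third", "first", "second", "only one", "only two"],
})
demo_seq = combine_tweets_sketch(demo, n=3)
print(demo_seq[["userid", "tweet_time", "recent_tweets"]])
```

Grouping before shifting keeps one user's tweets from leaking into the next user's sequence, which a plain `shift` on the sorted frame would not guarantee.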

In [None]:
# Spot-check a single cell by position (row 11444, column 10)
df.iloc[11444,10]

In [None]:
final_cols = ['userid','tweet_text','tweet_time','clean_tweets','recent_tweets']

df = df[final_cols].copy()

In [None]:
df.head()

In [None]:
df.to_csv('../working_files/russian_tweet_sequences.csv',sep=',', quotechar='"',header=True)

### 1.3 Process Chinese Tweets

In [None]:
# named data_dir to avoid shadowing the built-in dir()
data_dir = '../raw_downloads/chinese/'
cols = ['userid','user_profile_description','tweet_text','tweet_time','tweet_language',
        'is_retweet','hashtags','urls']
df = merge_csvs_on_columns(data_dir, cols)

In [None]:
df = df[df['tweet_language']=='en']

In [None]:
df.head()

In [None]:
df.shape, len(pd.unique(df['userid']))

The criteria used to filter the Russian tweets above have been added to the `tweet_utils` module as the `apply_filters` function, so we will use that here for brevity. `apply_filters` also calls `clean_text`.

In [None]:
df = tweet_utils.apply_filters(df)
df.shape, len(pd.unique(df['userid']))
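
For readers without access to `tweet_utils`, the filtering from section 1.2.1 could be bundled into one function along these lines. This is only a sketch: the name `apply_filters_sketch`, its `clean_fn` and `min_words` parameters, and the order of operations are assumptions, and the real `apply_filters` may differ.

```python
import pandas as pd


def apply_filters_sketch(df, clean_fn, min_words=4):
    """Illustrative stand-in for tweet_utils.apply_filters (assumed behavior)."""
    # Drop flagged retweets and manual retweets that start with "RT".
    not_flagged = df["is_retweet"].eq(False)
    not_manual_rt = ~df["tweet_text"].str.startswith("RT")
    out = df[not_flagged & not_manual_rt].copy()

    # Clean the text and count words, as in the Russian section.
    out["clean_tweets"] = out["tweet_text"].map(clean_fn)
    out["word_count"] = out["clean_tweets"].str.count(" ") + 1

    # Drop empty or almost-empty tweets (word_count > 3 in the cells above).
    keep = (
        out["clean_tweets"].notna()
        & (out["clean_tweets"] != "")
        & (out["word_count"] >= min_words)
    )
    return out[keep].copy()


demo = pd.DataFrame({
    "is_retweet": [False, True, False, False],
    "tweet_text": [
        "A long enough original tweet",
        "Flagged retweet of some other account",
        "RT manual retweet of another user",
        "Too short",
    ],
})
# str.strip stands in for the real cleaning function here.
filtered = apply_filters_sketch(demo, clean_fn=str.strip)
print(filtered[["tweet_text", "word_count"]])
```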

We also apply the same combination logic to the Chinese operator tweets.

In [None]:
df = tweet_utils.combine_tweets(df, 10)

In [None]:
# Spot-check a single cell by position (row 11444, column 10)
df.iloc[11444,10]

In [None]:
final_cols = ['userid','tweet_text','tweet_time','clean_tweets','recent_tweets']

df = df[final_cols].copy()

In [None]:
df.head()

In [None]:
df.to_csv('../working_files/chinese_tweet_sequences.csv',sep=',', quotechar='"',header=True)