# Tweet Turing Test: Detecting Disinformation on Twitter  

|          | Group #2 - Disinformation Detectors                     |
|---------:|---------------------------------------------------------|
| Members  | John Johnson, Katy Matulay, Justin Minnion, Jared Rubin |
| Notebook | `02_preprocess.ipynb`                                   |
| Purpose  | Apply a pre-processing pipeline to merged data.         |

> (TODO - write more explaining notebook)

# 1. Setup

In [5]:
# imports from Python standard library
import json
import logging
import os

# imports requiring installation
#   connection to Google Cloud Storage
from google.cloud import storage            # pip install google-cloud-storage
from google.oauth2 import service_account   # pip install google-auth

#  data science packages
import demoji                               # pip install demoji
import numpy as np                          # pip install numpy
import pandas as pd                         # pip install pandas

In [6]:
# imports from tweet_turing.py
from tweet_turing import get_json_files, load_local_json, get_gcp_storage_client, get_gcp_bucket, \
    list_gcp_objects, get_gcp_object_as_json, get_gcp_object_as_text, \
    set_gcp_object_from_json

# imports from tweet_turing_paths.py
from tweet_turing_paths import local_data_paths, local_snapshot_paths, gcp_data_paths, \
    gcp_snapshot_paths, gcp_project_name, gcp_bucket_name, gcp_key_file

## Local or Cloud?

Decide here whether to run the notebook with local data or GCP bucket data:
 - If the working directory of this notebook has a "../data/" folder with data loaded (e.g. working on a local computer, or data files copied to a cloud VM), use the "local files" option and comment out the "gcp bucket files" option.
 - If this notebook is being run from a GCP VM (preferably in the `us-central1` location), use the "gcp bucket files" option and comment out the "local files" option.

In [7]:
# option: local files
local_or_cloud: str = "local"   # comment/uncomment this line or next

# option: gcp bucket files
#local_or_cloud: str = "cloud"   # comment/uncomment this line or previous

# don't comment/uncomment for remainder of cell
if (local_or_cloud == "local"):
    data_paths = local_data_paths
    snapshot_paths = local_snapshot_paths
elif (local_or_cloud == "cloud"):
    data_paths = gcp_data_paths
    snapshot_paths = gcp_snapshot_paths
else:
    raise ValueError("Variable 'local_or_cloud' can only take on one of two values, 'local' or 'cloud'.")
    # subsequent cells will not do this final "else" check

In [8]:
# this cell only needs to run its code if local_or_cloud=="cloud"
#   (though it is harmless if run when local_or_cloud=="local")
gcp_storage_client: storage.Client = None
gcp_bucket: storage.Bucket = None

if (local_or_cloud == "cloud"):
    gcp_storage_client = get_gcp_storage_client(project_name=gcp_project_name, key_file=gcp_key_file)
    gcp_bucket = get_gcp_bucket(storage_client=gcp_storage_client, bucket_name=gcp_bucket_name)

# 2. Troll Tweets (CSV) Pre-processing

## 2.1 Load CSV Snapshot (from prior merge step)

In [9]:
# load the merged troll tweet CSV snapshot file
csv_filename: str = "csv_snapshot.csv"
csv_path: str = f"{snapshot_paths['csv_snapshot']}{csv_filename}"
troll_df_raw: pd.DataFrame = pd.DataFrame()

if (local_or_cloud == "local"):
    troll_df_raw = pd.read_csv(csv_path, encoding='utf-8', low_memory=False)
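elif (local_or_cloud == "cloud"):
    # cloud loading of the CSV snapshot is not implemented yet; fail loudly here
    #   rather than continue with an empty dataframe
    raise NotImplementedError("Loading the CSV snapshot from the GCP bucket is not implemented yet.")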
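
The cloud branch above does not load anything yet. One possible way to fill it in is sketched here. It assumes `gcp_bucket` is a `google.cloud.storage.Bucket`, that the object name inside the bucket equals `csv_path`, and that the file fits in memory. The helper name `read_csv_from_bucket` is hypothetical and is not part of `tweet_turing.py`.

```python
import io

import pandas as pd


def read_csv_from_bucket(bucket, object_name: str) -> pd.DataFrame:
    """Download a CSV object from a GCS bucket and parse it with pandas.

    Hypothetical helper (not part of tweet_turing.py); `bucket` is assumed
    to be a google.cloud.storage.Bucket.
    """
    blob = bucket.blob(object_name)                      # reference only; nothing is downloaded yet
    csv_text = blob.download_as_text(encoding="utf-8")   # reads the whole object into memory
    return pd.read_csv(io.StringIO(csv_text), low_memory=False)
```

Under those assumptions, the cloud branch could call `troll_df_raw = read_csv_from_bucket(gcp_bucket, csv_path)`.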

## 2.2 Filter for *"Only English language tweets"*

In [10]:
# filter for english language tweets only
#   - relevant dataframe column is `language`
mask_lang_en: pd.Series = (troll_df_raw['language'] == 'English')

troll_df = troll_df_raw[mask_lang_en]

## 2.3 Extract columns of interest

In [11]:
# extract only the columns we will use for later steps
cols_to_keep = [
    'external_author_id',
    'author',
    'content',
    'region',
    'language',
    'publish_date',
    'following',
    'followers',
    'updates',
    'retweet',
    'account_category',
    'tweet_id',
    'tco1_step1'
    ]

# take an explicit copy so later column assignments do not act on a slice of troll_df_raw
troll_df = troll_df[cols_to_keep].copy()

print("Troll Dataframe Shape (rows, cols):", troll_df.shape)

Troll Dataframe Shape (rows, cols): (2116867, 13)


## 2.4 Derive new feature: `data_source`

This feature is set to the constant value __"Troll"__ for this subset of the dataset, indicating that the data originates from the troll tweets CSV snapshot file. The tweets obtained from the Twitter API (in JSON files) have the same feature added by the `01_merge` notebook, but their values are either __"verified_user"__ or __"verified_random"__.

In [12]:
troll_df['data_source'] = 'Troll'

## 2.5 Align column names

In [13]:
# set up rename mapping
#   key = old column name; value = new column name
col_name_mapping = {
    "retweet": "is_retweet",
    "tco1_step1": "full_url",   # key must match the column name kept in section 2.3
}

troll_df.rename(columns=col_name_mapping, inplace=True)
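
By default, `DataFrame.rename` ignores mapping keys that do not match an existing column, so a misspelled key leaves the column unrenamed without any warning. The sketch below uses made-up data (the frame and mapping names are illustrative only) to show the difference when `errors="raise"` is passed.

```python
import pandas as pd

# made-up two-row frame with the two columns that get renamed in this section
demo_df = pd.DataFrame({
    "retweet": [0, 1],
    "tco1_step1": ["https://example.com/a", None],
})

# mapping with a deliberately misspelled key
mapping_with_typo = {"retweet": "is_retweet", "tcol1_step1": "full_url"}

# default behaviour: the unknown key is skipped silently
partially_renamed = demo_df.rename(columns=mapping_with_typo)
print(list(partially_renamed.columns))

# errors="raise" turns the unknown key into a KeyError
try:
    demo_df.rename(columns=mapping_with_typo, errors="raise")
    rename_raised = False
except KeyError as err:
    rename_raised = True
    print("rename failed:", err)
```

Passing `errors="raise"` in the cell above would surface a mismatch between the mapping and the kept columns immediately.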

# 3. Authentic Tweets (JSON) Pre-processing

## 3.1 Load JSON Snapshot (from prior merge step)

In [14]:
# load the merged authentic tweet JSON snapshot file
json_filename: str = "json_snapshot.json"
json_path: str = f"{snapshot_paths['json_snapshot']}{json_filename}"

json_data: list = []

if (local_or_cloud == "local"):
    json_data = load_local_json(json_path)
elif (local_or_cloud == "cloud"):
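    # `load_gcp_json` is not among the imports above; `get_gcp_object_as_json` is the
    #   imported helper (argument order is assumed here, check against tweet_turing.py)
    json_data = get_gcp_object_as_json(gcp_bucket, json_path)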

## 3.2 Transform from JSON to tabular form

Apply the pandas function `json_normalize()` to flatten the nested JSON records into a table, with one column per (dotted) key path.

In [16]:
# convert json to pandas dataframe using normalize to flatten dict
authentic_df_raw = pd.json_normalize(json_data)
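
As a small illustration of what the flattening step does, the sketch below runs `pd.json_normalize()` on two made-up records. The field names are hypothetical and are not taken from the actual JSON snapshot.

```python
import pandas as pd

# two made-up nested records; field names are illustrative only
demo_records = [
    {"id": "1", "text": "hello",
     "author": {"username": "user_a", "public_metrics": {"followers_count": 10}}},
    {"id": "2", "text": "world",
     "author": {"username": "user_b", "public_metrics": {"followers_count": 25}}},
]

# nested dicts become columns whose names join the key path with "."
demo_flat_df = pd.json_normalize(demo_records)
print(sorted(demo_flat_df.columns))
print(demo_flat_df["author.public_metrics.followers_count"].tolist())
```

Each nested key path, such as `author.username`, ends up as its own column, which is why a later step selects only the columns of interest.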

## 3.X Transform: Encode as 'utf-8'

Pipeline step skipped: the data is already UTF-8. Noted here so the pipeline diagram can be updated; delete this cell afterwards.

## 3.3 Extract columns of interest

In [None]:
#

## 3.4 Derive new feature: `is_retweet`

In [None]:
#

## 3.5 Derive new feature: `updates`

In [None]:
#

## 3.6 Derive new feature: `account_category`

In [None]:
#

## 3.7 Align column names 

In [None]:
#

# 4. Merge (Partially) Pre-processed Tweets

At this stage, the two separate datasets can be merged. Additional pre-processing will still be performed but can be applied to the entire dataset.

In [None]:
#
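
The notebook does not yet say how the merge is performed. One plausible approach, assuming both dataframes have been brought to identical column names by the alignment steps, is a row-wise concatenation with `pd.concat`. The sketch below uses made-up rows, and the `*_demo` names are illustrative only.

```python
import pandas as pd

# made-up stand-ins for the two partially pre-processed datasets
troll_demo = pd.DataFrame({
    "author": ["troll_account_1"],
    "content": ["example troll tweet"],
    "is_retweet": [0],
    "data_source": ["Troll"],
})
authentic_demo = pd.DataFrame({
    "author": ["verified_account_1", "verified_account_2"],
    "content": ["example tweet A", "example tweet B"],
    "is_retweet": [0, 1],
    "data_source": ["verified_user", "verified_random"],
})

# concatenation only lines up cleanly if both frames share the same columns
mismatched_cols = set(troll_demo.columns) ^ set(authentic_demo.columns)
if mismatched_cols:
    raise ValueError(f"Columns not aligned: {sorted(mismatched_cols)}")

# stack the rows; reorder the second frame's columns to match the first
merged_demo = pd.concat(
    [troll_demo, authentic_demo[troll_demo.columns]],
    ignore_index=True,
)
print(merged_demo["data_source"].value_counts())
```

Checking for mismatched columns first avoids `pd.concat` silently introducing NaN-filled columns when a name differs between the two sources.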