# Tweet Turing Test: Detecting Disinformation on Twitter  

|          | Group #2 - Disinformation Detectors                     |
|---------:|---------------------------------------------------------|
| Members  | John Johnson, Katy Matulay, Justin Minnion, Jared Rubin |
| Notebook | `xx_modelA_nlp_preprocess.ipynb`                        |
| Purpose  | NLP-specific preprocessing for Model A                  |

(todo: description)

# 1 - Setup

In [1]:
# imports from Python standard library

# imports requiring installation
#   connection to Google Cloud Storage
from google.cloud import storage            # pip install google-cloud-storage
from google.oauth2 import service_account   # pip install google-auth

#  data science packages
import numpy as np                          # pip install numpy
import pandas as pd                         # pip install pandas

In [2]:
# imports from tweet_turing.py
import tweet_turing as tur      # note - different import approach from prior notebooks

# imports from tweet_turing_paths.py
from tweet_turing_paths import local_data_paths, local_snapshot_paths, gcp_data_paths, \
    gcp_snapshot_paths, gcp_project_name, gcp_bucket_name, gcp_key_file

In [3]:
# pandas options
pd.set_option('display.max_colwidth', None)

## Local or Cloud?

Decide here whether to run this notebook against local data or against data in the GCP bucket:
 - If the working directory of this notebook has a "../data/" folder with data loaded (e.g. working on a local computer, or data files copied to a cloud VM), use the "local files" option and comment out the "gcp bucket files" option.
 - If this notebook is being run from a GCP VM (preferably in the `us-central1` location), use the "gcp bucket files" option and comment out the "local files" option.

In [4]:
# option: local files
local_or_cloud: str = "local"   # comment/uncomment this line or next

# option: gcp bucket files
#local_or_cloud: str = "cloud"   # comment/uncomment this line or previous

# don't comment/uncomment for remainder of cell
if (local_or_cloud == "local"):
    data_paths = local_data_paths
    snapshot_paths = local_snapshot_paths
elif (local_or_cloud == "cloud"):
    data_paths = gcp_data_paths
    snapshot_paths = gcp_snapshot_paths
else:
    raise ValueError("Variable 'local_or_cloud' can only take on one of two values, 'local' or 'cloud'.")
    # subsequent cells will not do this final "else" check
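
The selection logic above can also be written as a dictionary lookup, which keeps the set of valid options in one place. The sketch below is only an illustration under stated assumptions: `select_paths` and the placeholder path dictionaries are hypothetical stand-ins, not the contents of `tweet_turing_paths.py`.

```python
# Hypothetical placeholder path dictionaries; the real ones are imported
# from tweet_turing_paths.py and are not reproduced here.
example_local_paths = {"parq_snapshot": "../data/snapshots/"}
example_cloud_paths = {"parq_snapshot": "snapshots/"}


def select_paths(mode: str, options: dict) -> dict:
    """Return the path dictionary registered for `mode`, or raise ValueError."""
    try:
        return options[mode]
    except KeyError:
        valid = ", ".join(repr(key) for key in options)
        raise ValueError(f"'local_or_cloud' must be one of: {valid}") from None


path_options = {"local": example_local_paths, "cloud": example_cloud_paths}
selected = select_paths("local", path_options)
```

An unknown mode fails immediately with a message listing the accepted values, which is the same behavior as the final `else` branch in the cell above.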

In [5]:
# this cell only needs to run its code if local_or_cloud=="cloud"
#   (though it is harmless if run when local_or_cloud=="local")
gcp_storage_client: storage.Client = None
gcp_bucket: storage.Bucket = None

if (local_or_cloud == "cloud"):
    gcp_storage_client = tur.get_gcp_storage_client(project_name=gcp_project_name, key_file=gcp_key_file)
    gcp_bucket = tur.get_gcp_bucket(storage_client=gcp_storage_client, bucket_name=gcp_bucket_name)

# 2 - Load Dataset

The core dataset, as prepared by the prior notebook `03_eda.ipynb`, is loaded as `df_full`.

In [21]:
# note this cell requires package `pyarrow` to be installed in environment
parq_filename: str = "data_after_03_eda.parquet.gz"
parq_path: str = f"{snapshot_paths['parq_snapshot']}{parq_filename}"

if (local_or_cloud == "local"):
    df_full = pd.read_parquet(parq_path, engine='pyarrow')
elif (local_or_cloud == "cloud"):
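    # TODO: implement loading of cloud file
    # fail fast here, otherwise later cells raise a NameError on `df_full`
    raise NotImplementedError("Loading the dataset from the GCP bucket is not implemented yet.")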

# 3 - Subset Data

The data subset is named simply `df` for brevity.

In [22]:
# subset parameters
sample_fraction = 0.10  # within range (0.0, 1.0)
random_seed = 3         # for reproducibility, and "the number of the counting shall be three"

# generate sample
df = df_full.sample(frac=sample_fraction, random_state=random_seed).copy()
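
Fixing `random_state` is what makes the subset reproducible: repeated calls with the same fraction and seed return the same rows in the same order. The sketch below illustrates this on a hypothetical toy frame; the `tweet_id` column and the `A`/`B` labels are made up for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Toy stand-in for df_full: 1,000 rows with a two-value `class` column.
toy_full = pd.DataFrame({
    "tweet_id": range(1000),
    "class": ["A"] * 600 + ["B"] * 400,
})

sample_fraction = 0.10
random_seed = 3

# Same fraction and same seed: both draws select identical rows.
first_draw = toy_full.sample(frac=sample_fraction, random_state=random_seed)
second_draw = toy_full.sample(frac=sample_fraction, random_state=random_seed)

print(len(first_draw))                  # number of rows drawn
print(first_draw.equals(second_draw))   # whether the two draws match
```

By default `sample` draws without replacement, so no row appears twice in the subset.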

In [34]:
BYTES_PER_GIGABYTE = 10**9  # using IEC-recommended conversion; https://en.wikipedia.org/wiki/Gigabyte#Base_10_(decimal)

df_full_size_gb = df_full.memory_usage(deep=True).sum() / BYTES_PER_GIGABYTE
df_size_gb = df.memory_usage(deep=True).sum() / BYTES_PER_GIGABYTE

print(f"Full dataframe size:\t{df_full_size_gb:8.2f} GB")
print(f"Sampled dataframe size:\t{df_size_gb:8.2f} GB\n")

print(f"Full dataframe rows:\t{len(df_full.index):>11,}")
print(f"Sampled dataframe rows:\t{len(df.index):>11,}\n")

class_split_full = [f"{x*100:0.1f}%" for x in df_full['class'].value_counts().div(len(df_full.index)).tolist()]
class_split_samp = [f"{x*100:0.1f}%" for x in df['class'].value_counts().div(len(df.index)).tolist()]

print(f"Full df class split:\t{class_split_full}")
print(f"Sampled df class split:\t{class_split_samp}\n")


Full dataframe size:	    2.74 GB
Sampled dataframe size:	    0.28 GB

Full dataframe rows:	  3,624,894
Sampled dataframe rows:	    362,489

Full df class split:	['58.4%', '41.6%']
Sampled df class split:	['58.4%', '41.6%']
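
The output above shows that simple random sampling reproduced the class split to one decimal place. That is expected at this sample size, but it is not guaranteed in general. If an exact split were ever required, one alternative (not what this notebook does) is stratified sampling with pandas' `DataFrameGroupBy.sample`. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data: 600 rows of one class, 400 of another.
toy_full = pd.DataFrame({
    "tweet_id": range(1000),
    "class": ["A"] * 600 + ["B"] * 400,
})

# Draw 10% from each class separately, so class proportions are preserved
# (up to rounding of each group's sample size).
toy_stratified = toy_full.groupby("class").sample(frac=0.10, random_state=3)

class_counts = toy_stratified["class"].value_counts().to_dict()
print(class_counts)
```

The trade-off is that the sample is no longer a simple random sample of the full frame, which may or may not matter for downstream modeling.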



In [35]:
# save a copy of sampled df so above steps don't need to be repeated every time
# note this cell requires package `pyarrow` to be installed in environment
parq_filename: str = "data_sample_ten_percent.parquet.gz"
parq_path: str = f"{snapshot_paths['parq_snapshot']}{parq_filename}"

if (local_or_cloud == "local"):
    df.to_parquet(parq_path, engine='pyarrow', index=False, compression='gzip')
elif (local_or_cloud == "cloud"):
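    # TODO: implement saving of cloud file
    # fail loudly rather than silently skipping the save
    raise NotImplementedError("Saving the sample to the GCP bucket is not implemented yet.")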