# Tweet Turing Test: Detecting Disinformation on Twitter  

|          | Group #2 - Disinformation Detectors                     |
|---------:|---------------------------------------------------------|
| Members  | John Johnson, Katy Matulay, Justin Minnion, Jared Rubin |
| Notebook | `03_eda.ipynb`                                          |
| Purpose  | Conduct EDA of pre-processed data.                      |

> (TODO - write more explaining notebook)

# 1 - Setup

In [1]:
# imports from Python standard library
import json
import logging
import os

# imports requiring installation
#   connection to Google Cloud Storage
from google.cloud import storage            # pip install google-cloud-storage
from google.oauth2 import service_account   # pip install google-auth

#  data science packages
import demoji                               # pip install demoji
import numpy as np                          # pip install numpy
import pandas as pd                         # pip install pandas

In [2]:
# imports from tweet_turing.py
from tweet_turing import get_json_files, load_local_json, get_gcp_storage_client, get_gcp_bucket, \
    list_gcp_objects, get_gcp_object_as_json, get_gcp_object_as_text, set_gcp_object_from_json, \
    is_retweet, is_retweet_alt

# imports from tweet_turing_paths.py
from tweet_turing_paths import local_data_paths, local_snapshot_paths, gcp_data_paths, \
    gcp_snapshot_paths, gcp_project_name, gcp_bucket_name, gcp_key_file

In [3]:
# pandas options
pd.set_option('display.max_colwidth', None)

## Local or Cloud?

Choose here whether the notebook reads local data or data from the GCP bucket:
 - If the working directory of this notebook has a `../data/` folder with the data loaded (e.g. on a local computer, or on a cloud VM with the data files copied over), use the "local files" option and comment out the "gcp bucket files" option.
 - If this notebook is running on a GCP VM (preferably in the `us-central1` location), use the "gcp bucket files" option and comment out the "local files" option.

In [4]:
# option: local files
local_or_cloud: str = "local"   # comment/uncomment this line or next

# option: gcp bucket files
#local_or_cloud: str = "cloud"   # comment/uncomment this line or previous

# don't comment/uncomment for remainder of cell
if (local_or_cloud == "local"):
    data_paths = local_data_paths
    snapshot_paths = local_snapshot_paths
elif (local_or_cloud == "cloud"):
    data_paths = gcp_data_paths
    snapshot_paths = gcp_snapshot_paths
else:
    raise ValueError("Variable 'local_or_cloud' can only take on one of two values, 'local' or 'cloud'.")
    # subsequent cells will not do this final "else" check

In [5]:
# this cell only needs to run its code if local_or_cloud=="cloud"
#   (though it is harmless if run when local_or_cloud=="local")
from typing import Optional

# both stay None in "local" mode, so the annotations are Optional
gcp_storage_client: Optional[storage.Client] = None
gcp_bucket: Optional[storage.Bucket] = None

if (local_or_cloud == "cloud"):
    gcp_storage_client = get_gcp_storage_client(project_name=gcp_project_name, key_file=gcp_key_file)
    gcp_bucket = get_gcp_bucket(storage_client=gcp_storage_client, bucket_name=gcp_bucket_name)

# 2 - Load dataset

In [6]:
# note this cell requires package `pyarrow` to be installed in environment
parq_filename: str = "merged_df_preprocessed.parquet.snappy"
parq_path: str = f"{snapshot_paths['parq_snapshot']}{parq_filename}"

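if (local_or_cloud == "local"):
    merged_df = pd.read_parquet(parq_path, engine='pyarrow')
    # TODO -> confirm index is maintained from before snapshot
elif (local_or_cloud == "cloud"):
    # loading from the GCP bucket is not implemented yet; fail loudly here
    # rather than leaving `merged_df` undefined for the cells below
    raise NotImplementedError("Loading the parquet snapshot from the GCP bucket is not implemented yet.")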
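
The `cloud` branch above does not load anything yet. The following is a minimal sketch of one way it could be filled in, not the project's actual method. It assumes the snapshot's object name inside the bucket is known and that `gcp_bucket` is the `storage.Bucket` created in the setup section. The helper names `fetch_object_buffer` and `load_parquet_from_bucket` are hypothetical and are not part of `tweet_turing.py`.

```python
import io

import pandas as pd


def fetch_object_buffer(bucket, object_name: str) -> io.BytesIO:
    """Download one object from a google-cloud-storage bucket into memory."""
    blob = bucket.blob(object_name)        # Bucket.blob() returns a handle, no network call yet
    raw: bytes = blob.download_as_bytes()  # Blob.download_as_bytes() fetches the object's content
    return io.BytesIO(raw)


def load_parquet_from_bucket(bucket, object_name: str) -> pd.DataFrame:
    """Read a parquet snapshot stored in the bucket (requires `pyarrow`)."""
    buffer = fetch_object_buffer(bucket, object_name)
    return pd.read_parquet(buffer, engine="pyarrow")
```

Under those assumptions the `cloud` branch could call `merged_df = load_parquet_from_bucket(gcp_bucket, parq_path)`, provided `parq_path` resolves to an object name rather than a full URL. The raw bytes and the resulting DataFrame are both held in memory for a moment, which may matter for a snapshot of this size.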

# 3 - EDA Basics

In [7]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3624894 entries, 0 to 1508027
Data columns (total 18 columns):
 #   Column              Dtype         
---  ------              -----         
 0   external_author_id  object        
 1   author              object        
 2   content             object        
 3   region              object        
 4   language            object        
 5   publish_date        datetime64[ns]
 6   following           int64         
 7   followers           int64         
 8   updates             int64         
 9   is_retweet          int64         
 10  account_category    object        
 11  tweet_id            string        
 12  full_url            string        
 13  data_source         object        
 14  is_retweet_alt      float64       
 15  has_url             int64         
 16  emoji_text          object        
 17  emoji_count         int64         
dtypes: datetime64[ns](1), float64(1), int64(6), object(8), string(2)
memory usage: 525
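
In the summary above, `is_retweet_alt` is `float64` while the other flag columns are `int64`. In pandas this often (though not always) means the column contains missing values, because `NaN` cannot be stored in an `int64` column. A minimal sketch for checking this is shown below on a small invented frame rather than on `merged_df`; the helper name `summarize_missing` is hypothetical.

```python
import numpy as np
import pandas as pd


def summarize_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return dtype and missing-value count per column, most-missing first."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
    })
    return summary.sort_values("n_missing", ascending=False)


# toy data standing in for merged_df (values are invented for illustration)
toy_df = pd.DataFrame({
    "is_retweet": [0, 1, 0],
    "is_retweet_alt": [0.0, np.nan, 1.0],
})
missing_summary = summarize_missing(toy_df)
```

Running `summarize_missing(merged_df)` on the real data would show whether `is_retweet_alt` actually has gaps, or whether its `float64` dtype has another cause.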

In [13]:
merged_df.index.duplicated().sum()

1159916
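
A nonzero count here means index labels repeat. One plausible cause (an assumption, not confirmed in this notebook) is that the source datasets were concatenated with each keeping its own 0-based index. That would also be consistent with `info()` reporting 3,624,894 entries whose labels only run up to 1,508,027. A minimal sketch with invented toy frames shows the effect and one way to restore a unique index:

```python
import pandas as pd

# two toy frames, each with its own default 0-based index (values invented)
part_a = pd.DataFrame({"content": ["a0", "a1", "a2"]})
part_b = pd.DataFrame({"content": ["b0", "b1"]})

# concatenating without ignore_index keeps the original labels: 0, 1, 2, 0, 1
combined = pd.concat([part_a, part_b])
n_duplicated = int(combined.index.duplicated().sum())

# reset_index(drop=True) discards the old labels and assigns 0..n-1
combined_unique = combined.reset_index(drop=True)
```

Whether resetting is appropriate for `merged_df` depends on whether anything downstream relies on the original labels. While duplicates remain, label-based lookups such as `.loc[0]` can return several rows instead of one.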