# Tweet Turing Test: Detecting Disinformation on Twitter  

|          | Group #2 - Disinformation Detectors                     |
|---------:|---------------------------------------------------------|
| Members  | John Johnson, Katy Matulay, Justin Minnion, Jared Rubin |
| Notebook | `04_modelA_nlp_preprocess.ipynb`                        |
| Purpose  | NLP-specific preprocessing for Model A                  |

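This notebook applies NLP-specific preprocessing for Model A. It loads the dataset produced by `03_eda.ipynb`, keeps English-language tweets, draws a 10% random sample, and adds two text columns derived from `content`: `content_demoji` (emoji replaced by their text descriptions) and `content_no_emoji` (emoji removed).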

# 1 - Setup

In [1]:
# imports from Python standard library

# imports requiring installation
#   connection to Google Cloud Storage
from google.cloud import storage            # pip install google-cloud-storage
from google.oauth2 import service_account   # pip install google-auth

#  data science packages
import numpy as np                          # pip install numpy
import pandas as pd                         # pip install pandas

In [2]:
# imports from tweet_turing.py
import tweet_turing as tur      # note - different import approach from prior notebooks

# imports from tweet_turing_paths.py
from tweet_turing_paths import local_data_paths, local_snapshot_paths, gcp_data_paths, \
    gcp_snapshot_paths, gcp_project_name, gcp_bucket_name, gcp_key_file

In [3]:
# pandas options
pd.set_option('display.max_colwidth', None)

## Local or Cloud?

Choose whether to run this notebook with local data or with GCP bucket data:
 - If this notebook's working directory has a "../data/" folder with the data loaded (e.g. on a local computer, or on a cloud VM with the data files copied over), use the "local files" option and comment out the "gcp bucket files" option.
 - If this notebook is running on a GCP VM (preferably in the `us-central1` location), use the "gcp bucket files" option and comment out the "local files" option.
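
As an alternative to commenting and uncommenting lines, the choice could be read from the environment. The following is a minimal sketch, not part of the project code: the variable name `TWEET_TURING_DATA_SOURCE` and the helper `resolve_data_source` are hypothetical, and the default of `"cloud"` simply mirrors the setting used in this notebook.

```python
import os

# The two values the notebook accepts for `local_or_cloud`.
VALID_SOURCES = ("local", "cloud")


def resolve_data_source(default: str = "cloud") -> str:
    """Return 'local' or 'cloud', read from a (hypothetical) environment variable."""
    # TWEET_TURING_DATA_SOURCE is an assumed name, not defined by the project.
    value = os.environ.get("TWEET_TURING_DATA_SOURCE", default).strip().lower()
    if value not in VALID_SOURCES:
        raise ValueError(
            f"Data source must be one of {VALID_SOURCES}, got {value!r}."
        )
    return value


local_or_cloud: str = resolve_data_source()
```

With this approach the same notebook file can run unmodified on a laptop and on a VM, at the cost of hiding the setting outside the notebook.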

In [4]:
# option: local files
#local_or_cloud: str = "local"   # comment/uncomment this line or next

# option: gcp bucket files
local_or_cloud: str = "cloud"   # comment/uncomment this line or previous

# don't comment/uncomment for remainder of cell
if (local_or_cloud == "local"):
    data_paths = local_data_paths
    snapshot_paths = local_snapshot_paths
elif (local_or_cloud == "cloud"):
    data_paths = gcp_data_paths
    snapshot_paths = gcp_snapshot_paths
else:
    raise ValueError("Variable 'local_or_cloud' can only take on one of two values, 'local' or 'cloud'.")
    # subsequent cells will not do this final "else" check

In [5]:
# this cell only needs to run its code if local_or_cloud=="cloud"
#   (though it is harmless if run when local_or_cloud=="local")
gcp_storage_client: storage.Client = None
gcp_bucket: storage.Bucket = None

if (local_or_cloud == "cloud"):
    #gcp_storage_client = tur.get_gcp_storage_client(project_name=gcp_project_name, key_file=gcp_key_file)
    gcp_storage_client = tur.get_gcp_storage_client(project_name=gcp_project_name)
    gcp_bucket = tur.get_gcp_bucket(storage_client=gcp_storage_client, bucket_name=gcp_bucket_name)

# 2 - Load Dataset

Core dataset, as prepared by prior notebook `03_eda.ipynb`, will be loaded as "`df_full`".

In [10]:
# note this cell requires package `pyarrow` to be installed in environment
parq_filename: str = "data_after_03_eda.parquet.gz"
parq_path: str = f"{snapshot_paths['parq_snapshot']}{parq_filename}"

if (local_or_cloud == "local"):
    df_full = pd.read_parquet(parq_path, engine='pyarrow')
elif (local_or_cloud == "cloud"):
    df_full = tur.get_gcp_object_from_parq_as_df(bucket=gcp_bucket, object_name=parq_path)

# 3 - Filter and Subset Data

Data subset will be created as simply "`df`" for brevity.

## 3.1 Create filtered subset

In [13]:
# filter for English language only
df_full = df_full.loc[df_full['language'] == 'en']

In [14]:
# subset parameters
sample_fraction = 0.10  # within range (0.0, 1.0)
random_seed = 3         # for reproducibility, and "the number of the counting shall be three"

# generate sample
df = df_full.sample(frac=sample_fraction, random_state=random_seed).copy()    # using .copy() for clean-ish copy
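
The effect of `random_state` can be checked on a toy frame. This is a sketch under stated assumptions: the class labels `"A"` and `"B"` and the 60/40 split are made up for illustration, and the stratified variant with `groupby(...).sample(...)` is an alternative, not what this notebook does.

```python
import pandas as pd

# Toy data: 100 rows with a 60/40 class split (labels are placeholders).
toy = pd.DataFrame({
    "content": [f"tweet {i}" for i in range(100)],
    "class": ["A"] * 60 + ["B"] * 40,
})

# The same fraction and seed select the same rows on every run.
s1 = toy.sample(frac=0.10, random_state=3)
s2 = toy.sample(frac=0.10, random_state=3)
same_rows = s1.index.equals(s2.index)

# Stratified alternative: sample 10% within each class, so the class
# proportions of the sample match the full frame by construction.
strat = toy.groupby("class").sample(frac=0.10, random_state=3)
strat_counts = strat["class"].value_counts().to_dict()

print(same_rows, strat_counts)
```

A plain random sample only preserves the class split approximately, so the comparison printed in the next cell is a useful check; at this sample size the approximation is typically very close.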

In [15]:
BYTES_PER_GIGABYTE = 10**9  # using IEC-recommended conversion; https://en.wikipedia.org/wiki/Gigabyte#Base_10_(decimal)

df_full_size_gb = df_full.memory_usage(deep=True).sum() / BYTES_PER_GIGABYTE
df_size_gb = df.memory_usage(deep=True).sum() / BYTES_PER_GIGABYTE

print(f"Full dataframe size:\t{df_full_size_gb:8.2f} GB")
print(f"Sampled dataframe size:\t{df_size_gb:8.2f} GB\n")

print(f"Full dataframe rows:\t{len(df_full.index):>11,}")
print(f"Sampled dataframe rows:\t{len(df.index):>11,}\n")

class_split_full = [f"{x*100:0.1f}%" for x in df_full['class'].value_counts().div(len(df_full.index)).tolist()]
class_split_samp = [f"{x*100:0.1f}%" for x in df['class'].value_counts().div(len(df.index)).tolist()]

print(f"Full df class split:\t{class_split_full}")
print(f"Sampled df class split:\t{class_split_samp}\n")


Full dataframe size:	    2.74 GB
Sampled dataframe size:	    0.27 GB

Full dataframe rows:	  3,623,140
Sampled dataframe rows:	    362,314

Full df class split:	['58.4%', '41.6%']
Sampled df class split:	['58.4%', '41.6%']
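
The sizes above use the decimal gigabyte (10^9 bytes). For comparison with tools that report binary units, the short sketch below converts the rounded 2.74 GB figure to gibibytes (2^30 bytes). The input is the rounded value from the printed output, so the result is approximate.

```python
BYTES_PER_GIGABYTE = 10**9   # decimal unit, as used in the cell above
BYTES_PER_GIBIBYTE = 2**30   # binary unit (GiB)

# Rounded size taken from the printed output above.
size_bytes = 2.74 * BYTES_PER_GIGABYTE
size_gib = size_bytes / BYTES_PER_GIBIBYTE

print(f"{size_gib:.2f} GiB")
```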



In [16]:
# save a copy of sampled df so above steps don't need to be repeated every time
# note this cell requires package `pyarrow` to be installed in environment
parq_filename: str = "data_sample_ten_percent.parquet.gz"
parq_path: str = f"{snapshot_paths['parq_snapshot']}{parq_filename}"

if (local_or_cloud == "local"):
    df.to_parquet(parq_path, engine='pyarrow', index=False, compression='gzip')
elif (local_or_cloud == "cloud"):
    tur.set_gcp_object_from_df_as_parq(bucket=gcp_bucket, object_name=parq_path, df=df)

## 3.2 - Reload subset (already filtered)

In [6]:
# reload the sampled data
# note this cell requires package `pyarrow` to be installed in environment
parq_filename: str = "data_sample_ten_percent.parquet.gz"
parq_path: str = f"{snapshot_paths['parq_snapshot']}{parq_filename}"

if (local_or_cloud == "local"):
    df = pd.read_parquet(parq_path, engine='pyarrow')
elif (local_or_cloud == "cloud"):
    df = tur.get_gcp_object_from_parq_as_df(bucket=gcp_bucket, object_name=parq_path)

# 4 - NLP Preprocess

## 4.1 - Convert emoji characters *in situ* to their natural language equivalent

In [12]:
# sanity check of function
test_text = "I bet you didn't know that 🙋, 🙋‍♂️, and 🙋‍♀️ are three different emojis."
test_text_df = pd.DataFrame([[test_text]], columns=['content'])

# apply function
test_text_df.apply(lambda row: tur.convert_emoji_text(row, enclosing_char=':'), axis='columns').tolist()

["I bet you didn't know that :person raising hand:, :person raising hand:\u200d♂️, and :person raising hand:\u200d♀️ are three different emojis."]
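
The leftover `\u200d♂️` in the output above comes from how gendered emoji are encoded: each is a sequence of several code points joined by a zero-width joiner, and here only the base emoji was replaced. The sketch below uses only the standard library to list the code points of one such sequence. The string is written with escapes so that it does not depend on how the emoji is rendered.

```python
import unicodedata

# "Man raising hand" as a zero-width-joiner (ZWJ) sequence:
# base emoji + ZWJ + male sign + variation selector.
man_raising_hand = "\U0001F64B\u200d\u2642\ufe0f"

for ch in man_raising_hand:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

names = [unicodedata.name(ch) for ch in man_raising_hand]
```

Because the base emoji is the same code point in all three variants, all three map to `:person raising hand:`, and the joiner, gender sign, and variation selector remain in the text.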

In [19]:
def selective_convert_emoji_text(tweet_series: pd.Series, enclosing_char: str = ":") -> str:
    """Selectively applies the tur.convert_emoji_text(...) function.
        When a row's `emoji_count` value is > 0, apply function to `content`.
        Otherwise, return the original `content` string."""
    if (tweet_series['emoji_count'] > 0):
        return tur.convert_emoji_text(tweet_series, enclosing_char)
    else:
        return tweet_series['content']

In [20]:
df['content_demoji'] = df.apply(lambda row: selective_convert_emoji_text(row), axis='columns')
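
The same selective logic can also be written with a boolean mask, so that the converter only runs on rows that contain emoji. This is a sketch on toy data: `fake_convert` is a hypothetical stand-in that takes a plain string, whereas `tur.convert_emoji_text` takes a row, so the real call would need adapting.

```python
import pandas as pd


def fake_convert(text: str) -> str:
    """Hypothetical stand-in for the real converter; handles one emoji only."""
    return text.replace("\u2764", ":red heart:")


toy = pd.DataFrame({
    "content": ["Love it \u2764", "Wake up!"],
    "emoji_count": [1, 0],
})

# Start from the original text, then overwrite only the rows with emoji.
has_emoji = toy["emoji_count"] > 0
toy["content_demoji"] = toy["content"]
toy.loc[has_emoji, "content_demoji"] = toy.loc[has_emoji, "content"].map(fake_convert)

print(toy["content_demoji"].tolist())
```

Rows without emoji are never passed to the converter, which matters when most tweets in the sample have an `emoji_count` of zero.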

In [21]:
# check output
pd.concat([
    df.loc[df['emoji_count'] > 0, ['content', 'content_demoji']].sample(5),
    df.loc[df['emoji_count'] == 0, ['content', 'content_demoji']].sample(5)
])

index,content,content_demoji
112513,A new display at our @amhistorymuseum explores Hispanic advertising through the story of Selena: https://t.co/iS6NqzvjSL 📷: Al Rendon #HHM https://t.co/3xXYgPLVsY,A new display at our @amhistorymuseum explores Hispanic advertising through the story of Selena: https://t.co/iS6NqzvjSL :camera:: Al Rendon #HHM https://t.co/3xXYgPLVsY
158232,@maraantonoff @WomenInThoracic @WomenSurgeons @Inspire_WIS @LoggheMD @susieQP8 @LisaBrownMD @EADavidMD @wtspres Love it Mara ❤️,@maraantonoff @WomenInThoracic @WomenSurgeons @Inspire_WIS @LoggheMD @susieQP8 @LisaBrownMD @EADavidMD @wtspres Love it Mara :red heart:
74125,@GyanMano here is the poster for Rocky Handsome. Do not forget to watch the teaser on the 20th 🙂 @IAmAzure https://t.co/t9snvhV1BB,@GyanMano here is the poster for Rocky Handsome. Do not forget to watch the teaser on the 20th :slightly smiling face: @IAmAzure https://t.co/t9snvhV1BB
169657,.@marcale_lotts12 EasyMoneyBigga @AutumnElla1 Autumn @ItsmeCelesteP Celeste @JustinTense_ Justin @meeeeechhhh_ ❤️⚓️ http://t.co/nT7MmBfWin,.@marcale_lotts12 EasyMoneyBigga @AutumnElla1 Autumn @ItsmeCelesteP Celeste @JustinTense_ Justin @meeeeechhhh_ :red heart::anchor:️ http://t.co/nT7MmBfWin
176550,Filmed my first Minecraft Pocket Edition video with @RageElixir today!! Parkour on your phone = 😡😡😡,Filmed my first Minecraft Pocket Edition video with @RageElixir today!! Parkour on your phone = :pouting face::pouting face::pouting face:
338934,Wake up! https://t.co/nzyuMCLFB5,Wake up! https://t.co/nzyuMCLFB5
26831,#kah SICK! Islamic “Refugees” Have Started Flying a New Flag And It’s BAD https://t.co/v83wzqy6KJ #ka https://t.co/BlLDk82njj,#kah SICK! Islamic “Refugees” Have Started Flying a New Flag And It’s BAD https://t.co/v83wzqy6KJ #ka https://t.co/BlLDk82njj
189723,"Bob Quinn on the Matthew Stafford extension: ""We're in the early stages... It takes two sides to do a deal."" Quinn likes Stafford a lot.","Bob Quinn on the Matthew Stafford extension: ""We're in the early stages... It takes two sides to do a deal."" Quinn likes Stafford a lot."
240824,The inner sanctum #MyBedroomIn3Words,The inner sanctum #MyBedroomIn3Words
149182,"Why, yes, I'd like that endangered rhino with a side of arugula salad. #ThingsNotToDoAtTheZoo","Why, yes, I'd like that endangered rhino with a side of arugula salad. #ThingsNotToDoAtTheZoo"


## 4.2 - Create column containing tweet text with all emoji characters removed

In [33]:
# sanity check of function
test_text = "I bet you didn't know that 🙋, 🙋‍♂️, and 🙋‍♀️ are three different emojis."
test_text_df = pd.DataFrame([[test_text]], columns=['content'])

# apply function
test_text_df.apply(lambda row: tur.remove_emoji_text(row), axis='columns').tolist()

["I bet you didn't know that , , and  are three different emojis."]
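
The stray commas and the double space in the output above are what remains once each emoji is deleted from between the surrounding punctuation. The sketch below reproduces that residue with a crude standard-library stand-in. It is not the `tur.remove_emoji_text` implementation (which is not shown in this notebook), and the rule it uses is an assumption that would also strip non-emoji symbols such as ©.

```python
import unicodedata

ZWJ = "\u200d"    # zero-width joiner
VS16 = "\ufe0f"   # variation selector-16


def crude_remove_emoji(text: str) -> str:
    """Drop 'Symbol, other' characters plus ZWJ and VS16 (rough approximation)."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch) != "So" and ch not in (ZWJ, VS16)
    )


sample = (
    "I bet you didn't know that \U0001F64B, "
    "\U0001F64B\u200d\u2642\ufe0f, and "
    "\U0001F64B\u200d\u2640\ufe0f are three different emojis."
)
cleaned = crude_remove_emoji(sample)
print(repr(cleaned))
```

Nothing is inserted in place of the removed characters, so any whitespace on either side of an emoji is kept as-is.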

In [29]:
def selective_remove_emoji_text(tweet_series: pd.Series) -> str:
    """Selectively applies the tur.remove_emoji_text(...) function.
        When a row's `emoji_count` value is > 0, apply function to `content`.
        Otherwise, return the original `content` string."""
    if (tweet_series['emoji_count'] > 0):
        return tur.remove_emoji_text(tweet_series)
    else:
        return tweet_series['content']

In [30]:
# make a column with the text but no emojis
df['content_no_emoji'] = df.apply(lambda row: selective_remove_emoji_text(row), axis='columns')

In [31]:
# check output
pd.concat([
    df.loc[df['emoji_count'] > 0, ['content', 'content_no_emoji']].sample(5),
    df.loc[df['emoji_count'] == 0, ['content', 'content_no_emoji']].sample(5)
])

index,content,content_no_emoji
223028,RT @badman_og: Here for this📌 https://t.co/WCG9VSyDNh,RT @badman_og: Here for this https://t.co/WCG9VSyDNh
13315,'@zubovnik ❤❤❤ Sexy Russians known how to ride bikes. ❤❤❤','@zubovnik Sexy Russians known how to ride bikes. '
30391,@CurfewUKMUSIC @mtvex @talithaminnis @ShelbyBilingham I really don't know bro.. They'll deffo be repeats through the week! 👊,@CurfewUKMUSIC @mtvex @talithaminnis @ShelbyBilingham I really don't know bro.. They'll deffo be repeats through the week!
75237,RT @FaZe_Rain: You are truly the best. Thank you so much for 5 Million Subscribers. ❤️💧,RT @FaZe_Rain: You are truly the best. Thank you so much for 5 Million Subscribers.
330919,We apparently show compassion as a country—by gutting programs for the needy 🙄 #FeedTheChildren https://t.co/x5pJmTuy4x,We apparently show compassion as a country—by gutting programs for the needy \n\n#FeedTheChildren https://t.co/x5pJmTuy4x
305439,Listening to the #Indy 911 scanner is like a sixth sense for me. I can hear things even when the dispatcher mumbles,Listening to the #Indy 911 scanner is like a sixth sense for me. I can hear things even when the dispatcher mumbles
128261,.@realisticstud29 RealisticLove @Terrijyl Tj @fonr_yfollowrs ??? @realtalkVEEZY Veezus @SportsDPT Alex http://t.co/9VDpi18wa9,.@realisticstud29 RealisticLove @Terrijyl Tj @fonr_yfollowrs ??? @realtalkVEEZY Veezus @SportsDPT Alex http://t.co/9VDpi18wa9
179913,Help for a new candidate? Hello all.I am currently in the beginning steps of running for school board in Minnesota. I'm really excited to a…,Help for a new candidate? Hello all.I am currently in the beginning steps of running for school board in Minnesota. I'm really excited to a…
90572,Nearly all Americans reset their clocks this past weekend. The history and reasoning behind this practice is here: http://t.co/wZHLH1Ycay,Nearly all Americans reset their clocks this past weekend. The history and reasoning behind this practice is here: http://t.co/wZHLH1Ycay
358045,"@omeraziz12 listened to your debate with @SamHarrisOrg, you make the claim that Iran does not punish apostacy. What is your basis for that?","@omeraziz12 listened to your debate with @SamHarrisOrg, you make the claim that Iran does not punish apostacy. What is your basis for that?"


# 5 - Export NLP-preprocessed dataset

In [32]:
# save a copy of sampled df so above steps don't need to be repeated every time
# note this cell requires package `pyarrow` to be installed in environment
parq_filename: str = "data_sample_ten_percent_NLP_preprocessed.parquet.gz"
parq_path: str = f"{snapshot_paths['parq_snapshot']}{parq_filename}"

if (local_or_cloud == "local"):
    df.to_parquet(parq_path, engine='pyarrow', index=False, compression='gzip')
elif (local_or_cloud == "cloud"):
    tur.set_gcp_object_from_df_as_parq(bucket=gcp_bucket, object_name=parq_path, df=df)