# Tweet Turing Test: Detecting Disinformation on Twitter  

|          | Group #2 - Disinformation Detectors                     |
|---------:|---------------------------------------------------------|
| Members  | John Johnson, Katy Matulay, Justin Minnion, Jared Rubin |
| Notebook | `xx_modelA_nlp_preprocess.ipynb`                        |
| Purpose  | NLP-specific preprocessing for Model A                  |

(todo: description)

# 1 - Setup

In [40]:
# imports from Python standard library

# imports requiring installation
#   connection to Google Cloud Storage
from google.cloud import storage            # pip install google-cloud-storage
from google.oauth2 import service_account   # pip install google-auth

#  data science packages
import numpy as np                          # pip install numpy
import pandas as pd                         # pip install pandas

In [41]:
# imports from tweet_turing.py
import tweet_turing as tur      # note - different import approach from prior notebooks

# imports from tweet_turing_paths.py
from tweet_turing_paths import local_data_paths, local_snapshot_paths, gcp_data_paths, \
    gcp_snapshot_paths, gcp_project_name, gcp_bucket_name, gcp_key_file

In [42]:
# pandas options
pd.set_option('display.max_colwidth', None)

## Local or Cloud?

Choose here whether the notebook runs against local data or GCP bucket data:
 - if the working directory of this notebook has a "../data/" folder with the data loaded (e.g. working on a local computer, or data files copied to a cloud VM), use the "local files" option and comment out the "gcp bucket files" option
 - if this notebook is being run from a GCP VM (preferably in the `us-central1` location), use the "gcp bucket files" option and comment out the "local files" option

In [13]:
# option: local files
local_or_cloud: str = "local"   # comment/uncomment this line or next

# option: gcp bucket files
#local_or_cloud: str = "cloud"   # comment/uncomment this line or previous

# don't comment/uncomment for remainder of cell
if (local_or_cloud == "local"):
    data_paths = local_data_paths
    snapshot_paths = local_snapshot_paths
elif (local_or_cloud == "cloud"):
    data_paths = gcp_data_paths
    snapshot_paths = gcp_snapshot_paths
else:
    raise ValueError("Variable 'local_or_cloud' can only take on one of two values, 'local' or 'cloud'.")
    # subsequent cells will not do this final "else" check

In [5]:
# this cell only needs to run its code if local_or_cloud=="cloud"
#   (though it is harmless if run when local_or_cloud=="local")
gcp_storage_client: storage.Client = None
gcp_bucket: storage.Bucket = None

if (local_or_cloud == "cloud"):
    gcp_storage_client = tur.get_gcp_storage_client(project_name=gcp_project_name, key_file=gcp_key_file)
    gcp_bucket = tur.get_gcp_bucket(storage_client=gcp_storage_client, bucket_name=gcp_bucket_name)

# 2 - Load Dataset

Core dataset, as prepared by prior notebook `03_eda.ipynb`, will be loaded as "`df_full`".

In [17]:
#workaround- delete

import os
os.chdir('/Users/katymatulay/Documents/Drexel - Grad School/08 Winter 2023/DSCI592/data')
df_full = pd.read_parquet("../data/data_after_03_eda.parquet.gz")

In [14]:
# note this cell requires package `pyarrow` to be installed in environment
parq_filename: str = "data_after_03_eda.parquet.gz"
parq_path: str = f"{snapshot_paths['parq_snapshot']}{parq_filename}"

if (local_or_cloud == "local"):
    df_full = pd.read_parquet(parq_path, engine='pyarrow')
elif (local_or_cloud == "cloud"):
    pass    # TODO: implement loading of cloud file

FileNotFoundError: ../data/snapshot/data_after_03_eda.parquet.gz
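The `FileNotFoundError` above is raised because the snapshot path does not exist relative to the current working directory. A minimal sketch of a pre-flight check follows, assuming a hypothetical helper named `resolve_snapshot_path` (it is not part of `tweet_turing.py`); it fails early and reports the absolute path that was tried:

```python
from pathlib import Path


def resolve_snapshot_path(snapshot_dir: str, filename: str) -> Path:
    """Return the absolute path to a snapshot file, or raise a descriptive error.

    Hypothetical helper for illustration only; not part of tweet_turing.py.
    """
    candidate = (Path(snapshot_dir) / filename).resolve()
    if not candidate.is_file():
        # include the working directory, since relative paths depend on it
        raise FileNotFoundError(
            f"Snapshot not found: {candidate} (working directory: {Path.cwd()})"
        )
    return candidate
```

The returned path could then be passed to `pd.read_parquet(..., engine='pyarrow')` in place of `parq_path`.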

# 3 - Subset Data

Data subset will be created as simply "`df`" for brevity.

In [25]:
df_full[df_full['language']=='lt']

Unnamed: 0,external_author_id,author,content,region,language,following,followers,updates,post_type,is_retweet,...,tco1_step1,data_source,has_url,emoji_text,emoji_count,publish_date,class,following_ratio,class_numeric,RUS_lett_count
2812169,23297570,ProfJamesLogan,#W1A is genius,"London, UK",lt,5130,14835,3,,0.0,...,,verified_random,0,[],0,2015-04-26 21:55:02+00:00,Verified,0.345781,0,0
3492472,119709295,clairybrowne,RT @PenelopeAAustin: S T A R 󾭪 M A N ! Into this.. Px https://t.co/mQ0oxPaJyE,Your heart,lt,1349,3194,1,retweeted,1.0,...,https://www.theguardian.com/music/2016/jan/18/david-bowie-astronomers-give-the-starman-his-own-constellation?CMP=share_btn_fb,verified_random,1,[],0,2016-01-19 10:33:25+00:00,Verified,0.422222,0,0
3502564,704344867,PenelopeAAustin,S T A R 󾭪 M A N ! Into this.. Px https://t.co/mQ0oxPaJyE,,lt,714,726,4,,0.0,...,https://www.theguardian.com/music/2016/jan/18/david-bowie-astronomers-give-the-starman-his-own-constellation?CMP=share_btn_fb,verified_random,1,[],0,2016-01-19 09:59:18+00:00,Verified,0.982118,0,0
3576325,307596372,LanhNguyenFilms,LTE is 4G. (@YouTube http://t.co/qmAJDDv1M8),Kansas City,lt,199,2973,0,,0.0,...,http://youtu.be/FPrA5TmN7wc?a,verified_random,1,[],0,2013-07-06 01:56:10+00:00,Verified,0.066913,0,0


In [28]:
df_full[df_full['account_category']=='NonEnglish'].head()

Unnamed: 0,external_author_id,author,content,region,language,following,followers,updates,post_type,is_retweet,...,tco1_step1,data_source,has_url,emoji_text,emoji_count,publish_date,class,following_ratio,class_numeric,RUS_lett_count
415,839000000000000000,1LORENAFAVA1,Come vedere Juventus-Milan in streaming o in tv https://t.co/NHlb4OgXXY,Italy,en,416,61,249,RETWEET,1.0,...,http://ift.tt/2nnaPwn,Troll,1,[],0,2017-03-10 18:21:00+00:00,Troll,6.709677,1,0
416,839000000000000000,1LORENAFAVA1,#SerieA in campo #JuventusMilan LIVE e FOTO https://t.co/zR8rrsmSPL,Italy,en,416,62,255,RETWEET,1.0,...,http://ow.ly/luNF309Nh2L,Troll,1,[],0,2017-03-10 20:01:00+00:00,Troll,6.603175,1,0
417,839000000000000000,1LORENAFAVA1,"#Privacy, come difenderla on line (e con le #App) https://t.co/t4A4GisVyE https://t.co/ET8FkCSct9",Italy,en,416,62,261,RETWEET,1.0,...,https://twitter.com/Adnkronos/status/840315104440680448/photo/1,Troll,1,[],0,2017-03-10 21:37:00+00:00,Troll,6.603175,1,0
418,839000000000000000,1LORENAFAVA1,Come vedere Italia-Francia di rugby in tv e in streaming https://t.co/aKFgLcyljK,Italy,en,415,64,292,RETWEET,1.0,...,http://ift.tt/2ng5dVp,Troll,1,[],0,2017-03-11 12:59:00+00:00,Troll,6.384615,1,0
419,839000000000000000,1LORENAFAVA1,"Come vedere Genoa-Sampdoria, in tv o in streaming https://t.co/ew3hENEsYE",Italy,en,414,68,313,RETWEET,1.0,...,http://ift.tt/2nqCe06,Troll,1,[],0,2017-03-11 18:58:00+00:00,Troll,6.0,1,0


In [20]:
df_full.columns

Index(['external_author_id', 'author', 'content', 'region', 'language',
       'following', 'followers', 'updates', 'post_type', 'is_retweet',
       'account_category', 'tweet_id', 'tco1_step1', 'data_source', 'has_url',
       'emoji_text', 'emoji_count', 'publish_date', 'class', 'following_ratio',
       'class_numeric', 'RUS_lett_count'],
      dtype='object')

In [34]:
en_df = df_full[df_full['language'] == 'en']

In [35]:
en_df = en_df[en_df['account_category']!='NonEnglish']

In [36]:
en_df['class'].value_counts()

Troll       2090304
Verified    1506274
Name: class, dtype: int64

In [39]:
en_df['account_category'].value_counts()

Verified_User    1470028
RightTroll        704953
NewsFeed          596593
LeftTroll         422141
HashtagGamer      236091
Commercial        112580
Unknown            43191
Fearmonger         11001
NonEnglish             0
Name: account_category, dtype: int64
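The `NonEnglish    0` row above suggests that `account_category` is stored as a categorical column, so a category that has been filtered out still shows up in `value_counts()`. A minimal sketch on toy data (the frame `toy` and its values are made up for illustration) combines both filters into one boolean mask and then drops the unused category:

```python
import pandas as pd

# toy stand-in for df_full; values are illustrative only
toy = pd.DataFrame({
    "language": ["en", "en", "it", "en"],
    "account_category": pd.Categorical(
        ["RightTroll", "NonEnglish", "NonEnglish", "Verified_User"]
    ),
})

# keep rows that are tagged English AND not in the NonEnglish account category
mask = (toy["language"] == "en") & (toy["account_category"] != "NonEnglish")
en_toy = toy.loc[mask].copy()

# drop categories that no longer occur so they stop appearing with a zero count
en_toy["account_category"] = en_toy["account_category"].cat.remove_unused_categories()
remaining = sorted(en_toy["account_category"].cat.categories)
print(remaining)
```

Taking a `.copy()` of the filtered frame also avoids chained-assignment warnings when columns are added or replaced afterwards.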

In [22]:
# subset parameters
sample_fraction = 0.10  # within range (0.0, 1.0)
random_seed = 3         # for reproducibility, and "the number of the counting shall be three"

# generate sample
df = df_full.sample(frac=sample_fraction, random_state=random_seed).copy()

In [34]:
BYTES_PER_GIGABYTE = 10**9  # using IEC-recommended conversion; https://en.wikipedia.org/wiki/Gigabyte#Base_10_(decimal)

df_full_size_gb = df_full.memory_usage(deep=True).sum() / BYTES_PER_GIGABYTE
df_size_gb = df.memory_usage(deep=True).sum() / BYTES_PER_GIGABYTE

print(f"Full dataframe size:\t{df_full_size_gb:8.2f} GB")
print(f"Sampled dataframe size:\t{df_size_gb:8.2f} GB\n")

print(f"Full dataframe rows:\t{len(df_full.index):>11,}")
print(f"Sampled dataframe rows:\t{len(df.index):>11,}\n")

class_split_full = [f"{x*100:0.1f}%" for x in df_full['class'].value_counts().div(len(df_full.index)).tolist()]
class_split_samp = [f"{x*100:0.1f}%" for x in df['class'].value_counts().div(len(df.index)).tolist()]

print(f"Full df class split:\t{class_split_full}")
print(f"Sampled df class split:\t{class_split_samp}\n")


Full dataframe size:	    2.74 GB
Sampled dataframe size:	    0.28 GB

Full dataframe rows:	  3,624,894
Sampled dataframe rows:	    362,489

Full df class split:	['58.4%', '41.6%']
Sampled df class split:	['58.4%', '41.6%']
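The simple random sample above happens to reproduce the full class split, which is expected at this sample size but not guaranteed. If an exact per-class fraction were ever needed, one option is stratified sampling. The sketch below uses toy data and a hypothetical helper `stratified_sample`; it is an alternative technique, not what the notebook does:

```python
import pandas as pd


def stratified_sample(frame: pd.DataFrame, label_col: str, frac: float, seed: int) -> pd.DataFrame:
    """Sample the same fraction of rows from every class in `label_col`."""
    return frame.groupby(label_col, group_keys=False).sample(frac=frac, random_state=seed)


# toy frame with a 60/40 class split; values are illustrative only
toy = pd.DataFrame({
    "class": ["Troll"] * 60 + ["Verified"] * 40,
    "content": [f"tweet {i}" for i in range(100)],
})

toy_sample = stratified_sample(toy, "class", frac=0.10, seed=3)
print(toy_sample["class"].value_counts().to_dict())
```

Each class contributes exactly 10% of its rows here, so the 60/40 split is preserved by construction rather than by chance.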



In [35]:
# save a copy of the sampled df so the above steps don't need to be repeated every time
# note this cell requires package `pyarrow` to be installed in environment
parq_filename: str = "data_sample_ten_percent.parquet.gz"
parq_path: str = f"{snapshot_paths['parq_snapshot']}{parq_filename}"

if (local_or_cloud == "local"):
    df.to_parquet(parq_path, engine='pyarrow', index=False, compression='gzip')
elif (local_or_cloud == "cloud"):
    pass

In [43]:
df_sample = pd.read_parquet("../data/data_sample_ten_percent.parquet.gz")

In [44]:
df_sample['class'].value_counts()

Troll       211873
Verified    150616
Name: class, dtype: int64
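As a quick consistency check, the counts above can be converted to percentages and compared against the class split printed for the sampled dataframe earlier. This sketch uses only the two counts shown in the output:

```python
# class counts copied from the value_counts() output above
sample_counts = {"Troll": 211_873, "Verified": 150_616}

total_rows = sum(sample_counts.values())
class_split = {
    label: f"{count / total_rows * 100:0.1f}%"
    for label, count in sample_counts.items()
}

print(total_rows)
print(class_split)
```

The total is 362,489 rows and the split works out to 58.4% Troll and 41.6% Verified, matching the sampled dataframe reported earlier.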

In [45]:
import demoji

In [52]:
test="I bet you didn't know that 🙋, 🙋‍♂️, and 🙋‍♀️ are three different emojis."
test_replaced = demoji.replace_with_desc(test, "'") 

In [53]:
test_replaced

"I bet you didn't know that 'person raising hand', 'man raising hand', and 'woman raising hand' are three different emojis."

In [57]:
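# take an explicit copy so that adding columns to df_mini later does not
# trigger a SettingWithCopyWarning
df_mini = df_sample[df_sample['emoji_count'] > 1].head(5).copy()
df_mini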

Unnamed: 0,external_author_id,author,content,region,language,following,followers,updates,post_type,is_retweet,...,tco1_step1,data_source,has_url,emoji_text,emoji_count,publish_date,class,following_ratio,class_numeric,RUS_lett_count
44,3272640600,EXQUOTE,'@J_cranee Doozling @jamiieeubanks Janel' @RamzelInDistres ☹R a m s l e e p y☹ @sebass_field Staff Zenji http://t.co/j71SGSGouC',United States,en,2,356,30376,,0.0,...,https://twitter.com/safety/unsafe_link_warning?unsafe_link=http%3A%2F%2Fwww.FatLossAdvice.pw%2FTips%2FThe-best-compliment-you-dont-workout-and-you-look-like-that.asp,Troll,1,"[frowning face, frowning face]",2,2015-08-04 18:59:00+00:00,Troll,0.005602,1,0
78,785341092862525440,youFamousEnough,RT @SLIKNATIONPROAM: 🚨SLIK WIT IT MATCHING UP FOR SEASON 7🚨https://t.co/Pq0JpQQUTe @youFamousEnough @MPBA2K,"Houston, TX",en,5954,32934,1,retweeted,1.0,...,https://www.twitch.tv/anewman33,verified_random,1,"[police car light, police car light]",2,2017-03-20 03:05:13+00:00,Verified,0.18078,0,0
83,362787242,OttoMatticBaby,RT @Joey_0806: @OttoMatticBaby listening to your music before a game 🔥🔥🔥,"Pennsylvania, USA",en,737,203692,1,retweeted,1.0,...,,verified_random,0,"[fire, fire, fire]",3,2015-04-26 21:21:55+00:00,Verified,0.003618,0,0
128,66369181,DeptofDefense,Water delivery. Members of the 🇵🇷 #NationalGuard distribute 🚰 for the #Utuado community following #HurricaneMaria. https://t.co/OGGKrLVZ8j,"The Pentagon, Washington, D.C.",en,471,6534141,1335,,0.0,...,https://twitter.com/DeptofDefense/status/914339088416825344/photo/1,verified_user,1,"[flag: Puerto Rico, potable water]",2,2017-10-01 04:00:00+00:00,Verified,7.2e-05,0,0
171,26145732,AdamMcKola,@vex1zgooner @ManUtd @Arsenal Well it is cos United fans are Wenger In. 😂😂😂,,en,974,212108,1,replied_to,0.0,...,,verified_random,0,"[face with tears of joy, face with tears of joy, face with tears of joy]",3,2017-04-14 15:20:14+00:00,Verified,0.004592,0,0


In [88]:
def convert_emoji_text(tweet_series: pd.Series) -> str:
    '''Replace each emoji in a row's `content` with its text description, delimited by spaces.'''
    # alternative: wrap each description in single quotes instead of spaces
    # return demoji.replace_with_desc(tweet_series['content'], "'")
    return demoji.replace_with_desc(tweet_series['content'], " ")

In [89]:
# apply convert_emoji_text
new_column = df_mini.apply(convert_emoji_text, axis='columns')
df_mini.loc[:, 'content2'] = new_column

In [99]:
# apply convert_emoji_text
from tweet_turing import convert_emoji_text
new_column = df_mini.apply(convert_emoji_text, axis='columns')
df_mini.loc[:, 'content2'] = new_column

ImportError: cannot import name 'convert_emoji_text' from 'tweet_turing' (/Users/katymatulay/Documents/GitHub/tweet-turing-test/src/tweet_turing.py)

In [90]:
df_mini[['content','content2']]

Unnamed: 0,content,content2
44,'@J_cranee Doozling @jamiieeubanks Janel' @RamzelInDistres ☹R a m s l e e p y☹ @sebass_field Staff Zenji http://t.co/j71SGSGouC','@J_cranee Doozling @jamiieeubanks Janel' @RamzelInDistres frowning face R a m s l e e p y frowning face @sebass_field Staff Zenji http://t.co/j71SGSGouC'
78,RT @SLIKNATIONPROAM: 🚨SLIK WIT IT MATCHING UP FOR SEASON 7🚨https://t.co/Pq0JpQQUTe @youFamousEnough @MPBA2K,RT @SLIKNATIONPROAM: police car light SLIK WIT IT MATCHING UP FOR SEASON 7 police car light https://t.co/Pq0JpQQUTe @youFamousEnough @MPBA2K
83,RT @Joey_0806: @OttoMatticBaby listening to your music before a game 🔥🔥🔥,RT @Joey_0806: @OttoMatticBaby listening to your music before a game fire fire fire
128,Water delivery. Members of the 🇵🇷 #NationalGuard distribute 🚰 for the #Utuado community following #HurricaneMaria. https://t.co/OGGKrLVZ8j,Water delivery. Members of the flag: Puerto Rico #NationalGuard distribute potable water for the #Utuado community following #HurricaneMaria. https://t.co/OGGKrLVZ8j
171,@vex1zgooner @ManUtd @Arsenal Well it is cos United fans are Wenger In. 😂😂😂,@vex1zgooner @ManUtd @Arsenal Well it is cos United fans are Wenger In. face with tears of joy face with tears of joy face with tears of joy
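Because the description is wrapped in the delimiter on both sides, replacing emojis with a `" "` delimiter may leave doubled or trailing spaces that the rendered table above does not show. A minimal sketch of a follow-up cleanup step, assuming a hypothetical helper `normalize_whitespace` and a hand-written stand-in string rather than real `demoji` output:

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


# stand-in for a tweet after emoji replacement with a " " delimiter
converted = "listening to your music before a game  fire  fire  fire "
print(normalize_whitespace(converted))
```

Whether this step is needed depends on the tokenizer used downstream; many tokenizers already ignore repeated whitespace.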


# 4 - Transformer Testing

Testing transformer on '`df_mini`'

In [100]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.26.0-py3-none-any.whl (6.3 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-macosx_10_11_x86_64.whl (3.8 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.12.0-py3-none-any.whl (190 kB)
Collecting packaging>=20.0
  Downloading packaging-23.0-py3-none-any.whl (42 kB)
Installing collected packages: tokenizers, packaging, huggingface-hub, tran

In [101]:
import transformers

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


In [102]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



RuntimeError: At least one of TensorFlow 2.0 or PyTorch should be installed. To install TensorFlow 2.0, read the instructions at https://www.tensorflow.org/install/ To install PyTorch, read the instructions at https://pytorch.org/.
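The `RuntimeError` above indicates that `transformers` was installed without a deep-learning backend, so `pipeline(...)` cannot load a model. A minimal sketch of a check that could run before building the pipeline, assuming a hypothetical helper `available_backends` (the import names `torch`, `tensorflow`, and `flax` are the packages named in the earlier warning):

```python
import importlib.util


def available_backends(candidates=("torch", "tensorflow", "flax")) -> list:
    """Return the candidate backend packages that are importable in this environment."""
    return [name for name in candidates if importlib.util.find_spec(name) is not None]


backends = available_backends()
if not backends:
    print("No backend found; install one first, e.g. `pip install torch`.")
else:
    print(f"Backends available: {backends}")
```

Installing any one of the backends and restarting the kernel should let the `pipeline("sentiment-analysis")` call proceed past this error.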