# Preface

My research suggested that a favourable approach is to continue pre-training the existing BERT model checkpoint on within-task or in-domain data.

I've already collected around 150,000 tweets that are categorized as either hate speech, offensive or benign, so this roughly satisfies the requirement that the pre-training data is within-task or in-domain.

However, there are far more tweet datasets that can be used to further pre-train my BERT model. There is a wealth of unsupervised tweet datasets online that I can make use of. These datasets often do not come in text form; instead, the tweets are represented by their tweet IDs. Below I describe the tweet datasets and retrieve their associated text using the tweepy module.

# Retrieving more pre-training data from unsupervised tweet datasets
By using as wide a variety of sources as possible, we will increase the knowledge of our further-pretrained model. My goal when sourcing these datasets was to find tweet datasets which are largely user-generated, as hate speech online largely comes from user-generated content.

Where I could find them, I used tweet datasets likely to contain aggressive or abusive content, possibly including sexist and racial slurs. Even tweets that just talk about racial, gender or sexuality-based issues are beneficial for my model to be trained on, as they use the language associated with these issues.

The tweets come from the following sources, and most can be retrieved by ID:


*   ### <b>#UniteTheRight tweet database:</b>

The Unite the Right rally (also known as the Charlottesville rally) was a protest in Charlottesville, Virginia, United States from August 11–12, 2017, to oppose the removal of a statue of Robert E. Lee in Emancipation Park, which itself was renamed from Lee Park two months earlier. Protesters included white supremacists, white nationalists, neo-Confederates, neo-Nazis, and militias. This dataset contains 200,113 tweet ids collected with the #unitetheright hashtag. The time ranges for the tweets are from 2017-08-04 11:44:12 to 2017-08-15 16:03:30 GMT.

*   ### <b>#Charlottesville:</b>

The same event as above, but the tweets were sourced from the hashtag #charlottesville. Many tweets are expected to overlap with the dataset above; these duplicates will be removed in pre-processing. 200,000 tweets.

*   ### <b>Bill 10 Twitter IDs:</b>

A list of 24,876 Twitter IDs for tweets harvested between Nov. 28 and Dec. 6, 2014, containing the hashtag #bill10. Bill 10 in the Alberta legislature would have given public and Catholic school boards the right to refuse student requests to form gay-straight alliances in schools. Under intense public interest it was withdrawn by the Conservative government.

*  ### <b>BLMKidnapping:</b>

These 136,990 tweet ids represent reaction to a Facebook Live video that was posted on January 3rd, 2017, showing four African American men violently attacking a white, mentally disabled man. The tweets were collected on 01/05/2017. After the video surfaced, the Twitter hashtag, #BLMkidnapping, was created and used to incorrectly attribute the violent attack to members of the Black Lives Matter movement. Police in Chicago, where the attack took place, have found no evidence the attack has any connection to the Black Lives Matter movement. This link is to a CNN story documenting the police denial of Black Lives Matter connection: http://www.cnn.com/2017/01/05/us/black-lives-matter-chicago-facebook-live-beating/index.html


*   ### <b>Replies to Ocasio-Cortez Tweets:</b>

Replies to Rep. Alexandria Ocasio-Cortez’s Tweets and Retweets in March 2019. Whilst many tweets to Ms. Ocasio-Cortez may be glowing praise, I'm counting on a sizable portion of the tweets directed at her being abusive and perhaps sexist and racist. I feel this is a safe enough hypothesis, as America's political climate has become much more toxic in recent years.


Importing and Installing Dependencies ...

In [1]:
!pip install tweepy
!pip install google-cloud-storage
!pip install gcsfs



In [3]:
import os
import pandas as pd
import re
import json
import tweepy

#Below is to authenticate google bucket access for local machines

#Put GCS service account credentials json in current working directory
#Not in github repo because it's private info
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]= "C:\\Users\\fionn\\Downloads\\storageCreds.json"
from google.cloud import storage
storage_client = storage.Client()
buckets = list(storage_client.list_buckets())
print(buckets) # Testing if access to GCS has been granted

pd.set_option('display.max_colwidth', -1) # Set col width to -1 so we can see entire text column

#Below is for google colab environment
"""from google.colab import auth
auth.authenticate_user()
!gcloud config set project 'my-project-csc3002'"""

[<Bucket: csc3002>]


"from google.colab import auth\nauth.authenticate_user()\n!gcloud config set project 'my-project-csc3002'"

**Combining tweet ID dataframes**

In [4]:
#UniteTheRight
utr = 'gs://csc3002/pretrain_data/tweetIDs/UniteTheRight.txt'
utr = pd.read_csv(utr, sep=',',  index_col = False, encoding = 'utf-8', header = None, names = ['id'])
print("#UniteTheRight tweet database contains", len(utr.index), "tweet IDs")

#Bill 10
b10 = 'gs://csc3002/pretrain_data/tweetIDs/bill10tweets.txt'
b10 = pd.read_csv(b10, sep=',',  index_col = False, encoding = 'utf-8', header = None, names = ['id'])
print("\nBill 10 tweet database contains", len(b10.index), "tweet IDs")

#BlmKidnapping
blm = 'gs://csc3002/pretrain_data/tweetIDs/blmkidnapping_tweet_ids_v1.txt'
blm = pd.read_csv(blm, sep=',',  index_col = False, encoding = 'utf-8', header = None, names = ['id'])
print("\n#blmkidnapping tweet database contains", len(blm.index), "tweet IDs")

#Ocasio-Cortez Replies
aoc = 'gs://csc3002/pretrain_data/tweetIDs/ocasio_cortez_replies.csv'
aoc = pd.read_csv(aoc, sep=',',  index_col = False, encoding = 'utf-8', header = 0, names = ['id'])
print("\nAOC tweet database contains", len(aoc.index), "tweet IDs")



#UniteTheRight tweet database contains 200113 tweet IDs

Bill 10 tweet database contains 24876 tweet IDs

#blmkidnapping tweet database contains 136990 tweet IDs

AOC tweet database contains 109201 tweet IDs


Now combining all of the tweet IDs into a single dataframe.

In [7]:
full = pd.concat([aoc, blm, b10, utr], axis =0)
full.drop_duplicates(subset='id',inplace =True ) #Important to drop duplicates

#Shuffle Data
full = full.sample(frac=1)
full.reset_index(drop = True, inplace = True)

print("There are", len(full.index), "tweet IDs in this dataset")
full.head()

There are 471180 tweet IDs in this dataset


Unnamed: 0,id
0,1103682676794552320
1,896382208172113921
2,896755993769410560
3,540344228853719040
4,816863154391158784


Below is a helper function that uses tweepy to obtain tweets by ID. The Twitter API returns at most 100 tweets per lookup request and is also rate limited, so we set the wait_on_rate_limit parameter to True to make tweepy wait until the limit resets.
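The batching of IDs into groups of 100 can be sketched on its own, without calling the API. This is a minimal illustration rather than the function used in this notebook: `iter_batches` is a hypothetical helper name, and the list of integers stands in for real tweet IDs. Stepping through the list in strides of 100 never produces an empty final batch, even when the number of IDs is an exact multiple of 100.

```python
def iter_batches(ids, batch_size=100):
    """Yield consecutive slices of `ids`, each holding at most `batch_size` items."""
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]


# Stand-in IDs for illustration; real tweet IDs would come from the dataframe.
example_ids = list(range(250))
batch_sizes = [len(batch) for batch in iter_batches(example_ids)]
print(batch_sizes)  # [100, 100, 50]
```

Each batch could then be passed to the lookup call in place of manual index arithmetic.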

In [7]:
def lookup_tweets(tweet_IDs, api):
    
    full_tweets = []
    tweet_count = len(tweet_IDs)
    print("\nThere are", tweet_count, "tweet IDs to fetch")
    
    #Catching error if empty list
    if tweet_count < 1:
        return full_tweets
    
    #Below code to monitor progress
    #It's divided by 500 because we retrieve tweets via API call 100 at a time,
    #and we're looking to monitor progress each 5th of the way complete
    x = int(tweet_count/500) 
    progress = {x: '20%', x*2: '40%', x*3: '60%', x*4: '80%'}
    print("Fetching tweets...")
    try:
        for i in range((tweet_count // 100) + 1):
            if i in list(progress.keys()):
                print(progress[i], "complete")

        
            # Catch the last group if it is less than 100 tweets
            end_loc = min((i + 1) * 100, tweet_count)
            
            full_tweets.append(
                api.statuses_lookup(id_=tweet_IDs[i * 100:end_loc], map_ = True))
            
        return full_tweets
    
    except tweepy.TweepError as e:
        print("Around index:", i*100, "\n", e.reason) 
        
        #Recursive call to continue even with exception
        return full_tweets + lookup_tweets(tweet_IDs[(i+1)*100:tweet_count], api)
                

#Google colab
"""from google.colab import drive
drive.mount('/content/drive')
with open('/content/drive/My Drive/twitter_credentials.json', "r") as f:
  creds = json.load(f)"""

#Again, like storageCreds.json, this file is not in the github repo. It will have to be put in
#a local directory in order to use the tweepy API
with open('C://Users/fionn/Downloads/twitter_credentials.json', "r") as f:
    creds = json.load(f)


auth = tweepy.OAuthHandler(creds['CONSUMER_KEY'], creds['CONSUMER_SECRET'])
auth.set_access_token(creds['ACCESS_TOKEN'], creds['ACCESS_SECRET'])

api = tweepy.API(auth, wait_on_rate_limit=True,  wait_on_rate_limit_notify=True, \
                 retry_count=10, retry_delay=5, retry_errors=set([503])) # These last three params catch over-capacity error

Applying the custom tweepy function to this dataframe to retrieve the text content corresponding to each tweet ID, then wrangling all of the text data into a single dataframe.
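As a rough sketch of the wrangling step, the nested list of batches can be flattened into `(id, text)` rows before a dataframe is built. The example uses plain dictionaries as stand-ins for the `_json` payload of each returned status, and assumes that a tweet which is no longer available carries an `id` but no `text` key. `flatten_batches` is a hypothetical helper, not part of tweepy.

```python
def flatten_batches(batches):
    """Turn a list of batches of status payloads into a flat list of (id, text) rows."""
    rows = []
    for batch in batches:
        for payload in batch:
            # .get returns None when 'text' is absent, mirroring a NaN in the dataframe
            rows.append((payload["id"], payload.get("text")))
    return rows


# Stand-in payloads for illustration only.
example_batches = [
    [{"id": 1, "text": "first tweet"}, {"id": 2}],
    [{"id": 3, "text": "third tweet"}],
]
rows = flatten_batches(example_batches)
print(rows)  # [(1, 'first tweet'), (2, None), (3, 'third tweet')]
```

A list like this could be handed to `pd.DataFrame(rows, columns=['id', 'text'])` in a single step.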

<b>The below cell may take a while to run.</b>

In [105]:
tweet_ids = list(full['id'])

#Below works as long as len(tweet_ids) is not a multiple of 100. Takes a while
results = lookup_tweets(tweet_ids, api)

temp = ""
final = pd.DataFrame() 
for i, obj in enumerate(results):
    
    temp = json.dumps([status._json for status in results[i]])#create JSON string
    temp_df = pd.read_json(temp, orient='records')
    temp_df = temp_df[['id','text']]
    final = final.append(temp_df, ignore_index = True)
    
pd.set_option('display.max_colwidth', -1)
print(final.info())
final.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 471180 entries, 0 to 471179
Data columns (total 2 columns):
id      471180 non-null int64
text    230163 non-null object
dtypes: int64(1), object(1)
memory usage: 7.2+ MB
None


Unnamed: 0,id,text
0,816849890789752832,"White kid, kid napped by black people outrage, black person shot by white cops silence #BLMKidnapping"
1,816831990338945024,RT @hottiesfortrump: Instances like the #BLMKidnapping is what got trump elected.
2,1102870750367752192,@thehill @AOC https://t.co/IJmyar9HAR
3,896524739753132032,
4,1103682676794552320,@Mylife47778820 @RashidaTlaib @IlhanMN Yes I think you have to say that to your leader trump he like no one non whi… https://t.co/Njoqm9OKh8


It'll be interesting to see whether the number of NaNs is the result of an error in my code, a problem with the database, or just banned accounts.

In [106]:
#Quick function to beautify the error message
def getExceptionMessage(msg):
    words = msg.split(' ')

    errorMsg = ""
    for index, word in enumerate(words):
        if index not in [0,1,2]:
            errorMsg = errorMsg + ' ' + word
    errorMsg = errorMsg.rstrip("\'}]")
    errorMsg = errorMsg.lstrip(" \'")

    return errorMsg

def getErrorDesc(df): 
    for row in range(0, len(df.index)):
        current = df.loc[row]
        if pd.isna(current.text) == True:  
            try:    
                twt = api.get_status(current.id)
            except tweepy.TweepError as err:
                df.loc[row, 'text'] = getExceptionMessage(err.reason)
    return df

#If I attempt this with the whole dataset it'll take ages,
#So I'll just randomly sample the dataset to get a rough idea
df = final.sample(1000)
df.reset_index(drop = True, inplace = True)
print(df.info(), '\n\n')

df = getErrorDesc(df)
print("\n", df.text.value_counts()[1:10])
df.head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
id      1000 non-null int64
text    505 non-null object
dtypes: int64(1), object(1)
memory usage: 15.8+ KB
None 



 No status found with that ID.                                                                                                                        229
Sorry, you are not authorized to see this status.                                                                                                    21 
RT @AynRandPaulRyan: David Duke in #Charlottesville saying this Nazi, #UniteTheRight fiasco "fulfills the promises of Donald Trump." \nhttps…        20 
Sorry, that page does not exist.                                                                                                                     10 
RT @RVAwonk: Here's former KKK leader David Duke explicitly stating that Trump motivated the white supremacist #UniteTheRight rally in #Cha…         8  
RT @RealAlexR

Unnamed: 0,id,text
0,1110674175960776704,@AOC What an overwhelming show of support on your vote. You must be so proud!\n\nHahahahahahahahaaaahahhhhhhaaaaa Ouch I broke a rib
1,1108431292394336256,@storyseen @kylegriffin1 An article from 2002. It was so pressing it only took 17 years to introduce legislation. N… https://t.co/sJxhiwcaFG
2,896488360482488320,RT @LawyersComm: #UniteTheRight's #Charlottesville rally is an attempt to divide this country. Help us #StopHate by visiting https://t.co/3…
3,816876638231298049,No status found with that ID.
4,1107062014764036098,@AOC I traveled an hour and a half up and back to go from the end of the IRT line in Brooklyn to attend engineering… https://t.co/TUiSCYU8NH
5,816873484378722305,User has been suspended.
6,896437970705821697,User has been suspended.
7,1109959913353216000,@LanceAHerring1 @judithfinnemore @AOC Excellent!
8,1104549953005793280,No status found with that ID.
9,896443192828211200,RT @JDeanSeal: This is the car that plowed through a crowd at the #UniteTheRight rally. Stopped along Monticello Ave. https://t.co/7Uf4lHfA…


Interestingly, the function above has not only surfaced the errors behind the missing text, but also shown that there are many duplicate tweets as a result of RTs. We can remove these with the drop_duplicates method.

<b>Now concatenating the charlottesville dataset which already has the tweets associated with each ID so no text retrieval function via tweepy is necessary</b>

In [107]:
nullvals = final.text.isna().sum()
print("Amount of tweets IDs not returning text:", nullvals)
per = (nullvals/len(final.index)) * 100
print(("Which is %.2f%% of the overall dataset") % (per))


Amount of tweets IDs not returning text: 241017
Which is 51.15% of the overall dataset


In [12]:
final.dropna(inplace = True)
final.reset_index(drop=True, inplace = True)
final.tail()

Unnamed: 0,id,text
235546,896201447838187520,happening now at uva. our people on the march. will you be at #unitetheright tomorrow?
235547,896835566624546816,jason kessler organized the #unitetheright rally. he deserves the shaming for organizing &amp; the violence it incited. https
235548,1102729934894653440,i 2nd that shout out!
235549,1103276581060005889,perhaps it is just a
235550,896247350808784896,"alt right #unitetheright woman tells #antifa counter-protester that he ""sounds like a n-----"" #charlottesville"


In [108]:
#Charlottesville - Contains full text in columns
charl = 'gs://csc3002/pretrain_data/tweetIDs/charlottesville_aug15_sample.csv'
charl = pd.read_csv(charl, sep=',',  index_col = False, encoding = 'utf-8', header = 0,)
charl1 = 'gs://csc3002/pretrain_data/tweetIDs/aug16_sample.csv'
charl1 = pd.read_csv(charl1, sep=',',  index_col = False, encoding = 'utf-8', header = 0,)
charl2 = 'gs://csc3002/pretrain_data/tweetIDs/aug17_sample.csv'
charl2 = pd.read_csv(charl2, sep=',',  index_col = False, encoding = 'utf-8', header = 0,)
charl3 = 'gs://csc3002/pretrain_data/tweetIDs/aug18_sample.csv'
charl3 = pd.read_csv(charl3, sep=',',  index_col = False, encoding = 'utf-8', header = 0,)

dfs = [charl[['id','full_text']], charl1[['id','full_text']], \
       charl2[['id','full_text']], charl3[['id','full_text']]]
       
charlot = pd.concat(dfs, axis = 0)
charlot.rename(columns = {'full_text':'text'}, inplace = True)
print("\n#Charlottesville tweet database contains", len(charlot.index), "tweet IDs")

print(charlot.info())
charlot.head()


#Charlottesville tweet database contains 200000 tweet IDs
<class 'pandas.core.frame.DataFrame'>
Int64Index: 200000 entries, 0 to 49999
Data columns (total 2 columns):
id      200000 non-null int64
text    200000 non-null object
dtypes: int64(1), object(1)
memory usage: 4.6+ MB
None


Unnamed: 0,id,text
0,897661668787982336,It's almost as if people are exactly who they say they are https://t.co/MnWFXZd9c3
1,897654901534228480,"@Slate Conservative media: Yes, Trump's response to Charlottesville was bad, but what about Obama? https://t.co/jjINXL5Qp0 via @slate"
2,897659748597870592,👀 https://t.co/qeyzYeblwu
3,897660496656179202,😂 😂 😂 Karma really isn't wasting time.. https://t.co/JYRqf6vlSX
4,897642311903055872,"After Charlottesville, Black Lives Matter Issues New Demand - https://t.co/Vuw3IvrhL2"


In [121]:
final1 = pd.concat([final, charlot], axis = 0)
final1 = final1.sample(frac=1) #shuffle
final1['text'] = final1['text'].astype(str)
final1['text'] = final1['text'].map(lambda x: x.encode('unicode-escape').decode('utf-8'))
final1.reset_index(drop = True, inplace = True)
final1.drop_duplicates(subset='text',inplace =True )
print("There are ", len(final1.index), "unique tweets in this database\n")
final1.to_csv('gs://csc3002/pretrain_data/tweetText/full.csv', sep = ',', encoding='utf-8', \
                 index = False, header = True)
final1.info()

There are  329720 unique tweets in this database
<class 'pandas.core.frame.DataFrame'>
Int64Index: 329720 entries, 0 to 671175
Data columns (total 2 columns):
id      329720 non-null int64
text    329720 non-null object
dtypes: int64(1), object(1)
memory usage: 7.5+ MB
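
The `encode('unicode-escape').decode('utf-8')` step in the cell above rewrites non-ASCII characters and newlines as ASCII escape sequences, which has the side effect of keeping each tweet on a single line of the CSV. The sketch below shows the effect on a made-up string and one way to reverse it. `escape_text` and `unescape_text` are hypothetical helper names. It also shows why the step should be applied only once, since a second pass escapes the backslashes again.

```python
def escape_text(text):
    """Replace non-ASCII characters and control characters with ASCII escapes."""
    return text.encode("unicode-escape").decode("utf-8")


def unescape_text(text):
    """Reverse escape_text for strings that it produced."""
    return text.encode("utf-8").decode("unicode-escape")


original = "Karma\nisn't wasting time\u2026"   # contains a newline and an ellipsis
escaped = escape_text(original)

print(escaped)                              # Karma\nisn't wasting time\u2026
print(unescape_text(escaped) == original)   # True
print(escape_text(escaped) == escaped)      # False: backslashes are doubled
```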


## More pre-training data... Specifically datasets likely to have women or immigrants as the subject

The main hate speech dataset I'll be testing with will be the HatEval dataset, which has women and immigrants as its targets. With this in mind, I sought to source more tweet datasets which were likely to have women or immigrants as the subject.

*   ### <b>#thechalkening tweet database:</b>

The Chalkening is a campaign launched by Donald Trump supporters on college campuses that involves writing pro-Trump messages in chalk on campus facilities. This mass chalk-based protest happened alongside an outpouring of media criticism of an incident at Emory University in March 2016. An Emory University administrator sent an email expressing support for students who claimed to feel threatened and unsafe because of hate speech in the form of pro-Trump chalkings on the campus.

In [122]:
chalkening = pd.read_csv('gs://csc3002/pretrain_data/tweetIDs/thechalkening-ids-20160412.txt', sep=',',  index_col = False, header=None, names =['id'])

chalkening1 = pd.read_csv('gs://csc3002/pretrain_data/tweetIDs/thechalkening-ids-20160615.txt', sep=',',  index_col = False, header=None, names =['id'])

chalk = pd.concat([chalkening, chalkening1], axis = 0)
print("There are", len(chalk.index), "tweets with the #chalkening hashtag")

There are 115524 tweets with the #chalkening hashtag


In [130]:
tweet_ids = list(chalk['id'])

#Below works as long as it's not a multiple of 100. takes a while
results = lookup_tweets(tweet_ids, api)
                                  
temp = ""
final = pd.DataFrame() 
for i, obj in enumerate(results):
    
    temp = json.dumps([status._json for status in results[i]])#create JSON string
    temp_df = pd.read_json(temp, orient='records')
    if 'text' in list(temp_df):
        temp_df = temp_df[['id','text']]
        final = final.append(temp_df, ignore_index = True)
    
pd.set_option('display.max_colwidth', -1)
print(final.info())
final.head()


There are 115524 tweet IDs to fetch
Fetching tweets...
20% complete
40% complete
60% complete
80% complete
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 115324 entries, 0 to 115323
Data columns (total 2 columns):
id      115324 non-null int64
text    50984 non-null object
dtypes: int64(1), object(1)
memory usage: 1.8+ MB
None


Unnamed: 0,id,text
0,720103525401931777,RT @DanScavino: #TheChalkening- thank you! #Trump2016 #StudentsForTrump https://t.co/W6aTj6TzzL
1,720118772808474624,
2,720112336632168448,
3,720107218495016960,
4,720123193269338112,


In [131]:
final.dropna(inplace = True)
final.drop_duplicates(subset='text',inplace =True ) #Important to drop duplicates
final.reset_index(drop=True, inplace = True)

print("After Dropping nulls and duplicate tweets (such as retweets) there are",\
     len(final.index), "tweets")
final['text'] = final['text'].astype(str)
final['text'] = final['text'].map(lambda x: x.encode('unicode-escape').decode('utf-8'))
final.to_csv('gs://csc3002/pretrain_data/tweetText/chalkTweets.csv', sep = ',', encoding='utf-8', \
                 index = False, header = True)
final.tail(15)

After Dropping nulls and duplicate tweets (such as retweets) there are 8720 tweets


Unnamed: 0,id,text
8705,739671052398305281,#TheChalkening #GameofThrones #OBAMAGATE #SoccerAid2016 https://t.co/m0stELlf8B
8706,739804103702945792,RT @RockjackOne: Chicago and Detroit are plantations where liberals keep there voters https://t.co/c10RtXmqUL
8707,739677331342696448,#TheChalkening #GameofThrones https://t.co/D5JTNMKnet
8708,739840239057805312,Choose a side. Probably our last chance. #WomenForTrump #TheChalkening #StudentsForTrump #BernieSanders https://t.co/W2sn6jtUJ9
8709,739674668454514688,#TheChalkening #ThronesYall #SoccerAid2016 https://t.co/CXhBgLcN4S
8710,739675025742123008,#ThronesYall #TheChalkening #GameofThrones https://t.co/TtJq8tMjLP
8711,739672908163846144,RT @LDJT2016: #TheChalkening #GameofThrones #OBAMAGATE #SoccerAid2016 https://t.co/m0stELlf8B
8712,739668137021440000,#TheChalkening #GameofThrones #OBAMAGATE #SoccerAid2016 #FrenchOpenFinal https://t.co/QQrZP19Ntf
8713,739815224946102272,#mondaymotivation #TheChalkening #GameofThrones https://t.co/qCfAcDocpl
8714,739716451779497984,Undeniable. #WomenForTrump #TheChalkening #StudentsForTrump https://t.co/WRjXl74vpK


<i>Surprisingly few tweets were returned. There must have been an overload of retweets in the dataset. Still, we'll use them in the pre-training regardless.</i>


In [132]:
final.dropna(inplace = True)
final.to_csv('gs://csc3002/pretrain_data/tweetText/chalkTweets.csv', sep = ',', encoding='utf-8', \
                 index = False, header = True)
final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8720 entries, 0 to 8719
Data columns (total 2 columns):
id      8720 non-null int64
text    8720 non-null object
dtypes: int64(1), object(1)
memory usage: 204.4+ KB



*   ### <b>#NotAllMen Twitter IDs:</b>

Around 70,000 tweets with the #NotAllMen hashtag. A Time magazine article on the subject states that "Not all men" was previously an object of frustration, but from early 2014 it was usually used as an object of mockery. The phrase is intended to counter generalizations about men's behaviour, but some critics claim it deflects conversations from uncomfortable topics, such as sexual assault.

*   ### <b>#NotAllWomen Twitter IDs:</b>

A counter to the #YesAllWomen protest (explained in more detail later). At a glance, many tweets with this hashtag seem to be quite sexist or at least subversive. It is basically a vacant protest against somebody else's protest, like #AllLivesMatter.

In [133]:
notallmen = pd.read_csv('gs://csc3002/pretrain_data/tweetIDs/NotAllMen.ids.txt', sep=',',  index_col = False, header=None, names =['id'])
print("There are", len(notallmen.index), "tweets with the #NotAllMen hashtag")

notAllWomen = pd.read_csv('gs://csc3002/pretrain_data/tweetIDs/NotAllWomen.ids.txt', sep=',',  index_col = False, header=None, names =['id'])
print("\nThere are", len(notAllWomen.index), "tweets with the #NotAllWomen hashtag")

sexism = pd.concat([notallmen, notAllWomen], axis = 0)
print("\nThere are", len(sexism.index), "tweets total")
sexism = sexism.sample(frac=1)

tweet_ids = list(sexism['id'])

#Below works as long as it's not a multiple of 100. takes a while
results = lookup_tweets(tweet_ids, api)

There are 69873 tweets with the #NotAllMen hashtag

There are 1827 tweets with the #NotAllWomen hashtag

There are 71700 tweets total

There are 71700 tweet IDs to fetch
Fetching tweets...
20% complete
40% complete
60% complete
80% complete
Around index: 71700 
 [{'code': 38, 'message': 'id parameter is missing.'}]

There are 0 tweet IDs to fetch
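
The `id parameter is missing` message above appears consistent with the loop arithmetic in `lookup_tweets`: 71,700 is an exact multiple of 100, so `(tweet_count // 100) + 1` schedules one request more than is needed, and that last request receives an empty slice of IDs. The short check below only reproduces the arithmetic. It makes no API calls, and the list of zeros stands in for real tweet IDs.

```python
import math

tweet_count = 71700
tweet_ids = [0] * tweet_count  # placeholder IDs, for illustration only

loop_iterations = (tweet_count // 100) + 1      # what the loop in lookup_tweets runs
batches_needed = math.ceil(tweet_count / 100)   # how many non-empty batches exist

last_i = loop_iterations - 1
last_slice = tweet_ids[last_i * 100:min((last_i + 1) * 100, tweet_count)]

print(loop_iterations, batches_needed, len(last_slice))  # 718 717 0
```

Because all 717 real batches had already been requested by that point, the exception handler simply recursed on an empty remainder, which matches the "There are 0 tweet IDs to fetch" line.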


In [135]:
temp = ""
final = pd.DataFrame() 
for i, obj in enumerate(results):
    
    temp = json.dumps([status._json for status in results[i]])#create JSON string
    temp_df = pd.read_json(temp, orient='records')
    temp_df = temp_df[['id','text']]
    final = final.append(temp_df, ignore_index = True)
    
pd.set_option('display.max_colwidth', -1)
print(final.info())
final.head()

final.dropna(inplace = True)
final.reset_index(drop=True, inplace = True)
final.drop_duplicates(subset='text',inplace =True ) #Important to drop duplicates

print("After Dropping nulls and duplicate tweets (such as retweets) there are",\
     len(final.index), "tweets")
final['text'] = final['text'].astype(str)
final['text'] = final['text'].map(lambda x: x.encode('unicode-escape').decode('utf-8'))
final.to_csv('gs://csc3002/pretrain_data/tweetText/sexismTweets.csv', sep = ',', encoding='utf-8',\
              index = False, header = True)
final.tail()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71700 entries, 0 to 71699
Data columns (total 2 columns):
id      71700 non-null int64
text    40352 non-null object
dtypes: int64(1), object(1)
memory usage: 1.1+ MB
None
After Dropping nulls and duplicate tweets (such as retweets) there are 19030 tweets


Unnamed: 0,id,text
40330,471202846256668672,"#YesAllWomen have had to define and defend feminism, even to #NotAllMen."
40338,475397749538848768,"@adkarabinus #notallmen need us but #yesallwomen do #imgoingtohellforsayingthat #shouldvetakenthepitch Still, we must #feedBrendanBenson"
40339,471233196664053760,#YesAllWomen Is Brilliant Response 2 #NotAllMen http://t.co/9AvtGy6Z21 via @bustle #Feminism = #equality4all #StandWithWomen
40341,473904513699835904,@MisterSchaffner I don't know if I hate it or #NotAllWomen more
40347,470675133255139328,RT @JasonCarlen: I'm starting to think that the #NotAllMen and #YesAllWomen hashtag activism is the beginning of a gender war the likes we'\u2026


## Large tweet datasets

The next couple of cells handle very large tweet ID datasets. I'm not sure I'll be able to retrieve all of the tweets in one session, as it takes so long. Converting a very large dataframe to JSON via pandas also throws a MemoryError.

I'll slightly alter the previous lookup_tweets function so that it checkpoints by saving the tweet text to a designated Google bucket file path after a set number of IDs have been fetched.
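
The checkpointing idea can be sketched independently of tweepy and Google Cloud Storage. The example below is an illustration under simplifying assumptions: it writes numbered CSV files to a local temporary directory rather than a bucket, `fetch_batch` is a hypothetical stand-in for the API call, and `save_with_checkpoints` is not the function defined in this notebook.

```python
import csv
import os
import tempfile


def fetch_batch(ids):
    """Hypothetical stand-in for an API lookup: returns (id, text) rows."""
    return [(i, "tweet %d" % i) for i in ids]


def save_with_checkpoints(ids, out_dir, batch_size=100, checkpoint=200):
    """Fetch ids in batches and flush rows to a numbered CSV every `checkpoint` ids."""
    buffer, file_no, paths = [], 1, []

    def flush():
        nonlocal buffer, file_no
        path = os.path.join(out_dir, "%d.csv" % file_no)
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["id", "text"])
            writer.writerows(buffer)
        paths.append(path)
        buffer, file_no = [], file_no + 1

    for start in range(0, len(ids), batch_size):
        buffer.extend(fetch_batch(ids[start:start + batch_size]))
        if len(buffer) >= checkpoint:
            flush()   # rows leave memory once they are on disk
    if buffer:
        flush()       # final partial checkpoint
    return paths


with tempfile.TemporaryDirectory() as tmp:
    saved = save_with_checkpoints(list(range(500)), tmp)
    print([os.path.basename(p) for p in saved])  # ['1.csv', '2.csv', '3.csv']
```

Emptying the buffer after each save is what keeps memory use bounded, at the cost of having to merge the numbered files afterwards.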

These datasets, while much larger in volume than the others, contain tweets that are less desirable for my pre-training. For one, they are likely to contain more non-user-generated tweets and possibly spam, as they're based on huge movements or global issues, unlike the niche subjects before.

Also, for the immigration executive order tweets it is much harder to anticipate which common terms should be removed so that they don't distort learning on the word-masking task. For example, in the #Chalkening tweets I'll remove #chalkening from each tweet, because otherwise the model could simply learn that #chalkening is likely to be the missing word in each sequence, just because it appears in every tweet.

Therefore, I'll not use all of the tweets from each of these sets, but rather a sample. Still, I might as well retrieve as many as I can, in case my strategy changes later and I find that the more tweets I do further pre-training on the better my model performs.
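
The hashtag removal mentioned above could look roughly like the sketch below. It is an assumption about how the cleaning might be done, not the pre-processing code used later. `strip_hashtags` is a hypothetical helper, and the example tweet follows the style of the #TheChalkening samples shown earlier.

```python
import re


def strip_hashtags(text, tags):
    """Remove the given hashtags (case-insensitive) and tidy leftover whitespace."""
    for tag in tags:
        # \b stops a short tag from matching the start of a longer hashtag
        text = re.sub(r"#" + re.escape(tag) + r"\b", "", text, flags=re.IGNORECASE)
    # Collapsing whitespace also turns newlines into single spaces
    return re.sub(r"\s+", " ", text).strip()


tweet = "Undeniable. #WomenForTrump #TheChalkening #StudentsForTrump"
print(strip_hashtags(tweet, ["thechalkening"]))
# Undeniable. #WomenForTrump #StudentsForTrump
```

Passing the tags as a list makes it easy to add further dataset-specific hashtags once they have been identified.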

In [8]:
def lookup_tweets_ckpt(tweet_IDs, api, dirc, checkpoint = 500000):
    print("Saving tweet text in directory at path", dirc)
    full_tweets = []
    tweet_count = len(tweet_IDs)
    
    print("\nThere are", tweet_count, "tweet IDs to fetch")
    print("Checkpoint saved every", checkpoint, "IDs")
    
    if tweet_count < 1:
        return full_tweets
    
    #Below code to monitor progress
    #It's divided by 500 because we retrieve tweets via API call 100 at a time,
    #and we're looking to monitor progress each 5th of the way complete
    x = int(tweet_count/500) 
    progress = {x: '20%', x*2: '40%', x*3: '60%', x*4: '80%'}
    
    #Value for checkpoint saves. Dictates how many files there are
    ckpt = 5
    print("Fetching tweets...")
    try:
        for i in range((tweet_count // 100) + 1):
            if i in list(progress.keys()):
                print(progress[i], "complete")

        
            # Catch the last group if it is less than 100 tweets
            end_loc = min((i + 1) * 100, tweet_count)
            #It was .extend but it is slower than .append
            full_tweets.append(
                api.statuses_lookup(id_=tweet_IDs[i * 100:end_loc], map_ = True))
                
            #Checkpointing every time the tweet ID dataset reaches a multiple of 500,000
            if (i*100)%checkpoint == 0 and i != 0:
                    
                print("Checkpoint at tweet ID No.", i*100)
                temp = ""
                interim = pd.DataFrame() 
                # Iterate over batches directly so the outer loop variable i is not reused
                for batch in full_tweets:

                    temp = json.dumps([status._json for status in batch])#create JSON string
                    temp_df = pd.read_json(temp, orient='records')
                    temp_df = temp_df[['id','text']]
                    interim = interim.append(temp_df, ignore_index = True)

                

                interim.dropna(inplace = True)
                interim.reset_index(drop=True, inplace = True)
                interim.drop_duplicates(subset='text',inplace =True ) #Important to drop duplicates

                interim['text'] = interim['text'].astype(str)
                interim['text'] = interim['text'].map(lambda x: x.encode('unicode-escape').decode('utf-8'))
                
                path = dirc + '/' + str(ckpt) + '.csv'
                print("Saving at path", path)
                interim.to_csv(path, sep = ',', encoding='utf-8',\
                              index = False, header = True)
                ckpt = ckpt + 1
                full_tweets = [] # Reinitialise to an empty list
        
        
        #Outside of the loop
        print("Final save")
        temp = ""
        interim = pd.DataFrame() 
        for i, obj in enumerate(full_tweets):

            temp = json.dumps([status._json for status in full_tweets[i]])#create JSON string
            temp_df = pd.read_json(temp, orient='records')
            temp_df = temp_df[['id','text']]
            interim = interim.append(temp_df, ignore_index = True)



        interim.dropna(inplace = True)
        interim.reset_index(drop=True, inplace = True)
        interim.drop_duplicates(subset='text',inplace =True ) #Important to drop duplicates
        interim['text'] = interim['text'].astype(str)
        interim['text'] = interim['text'].map(lambda x: x.encode('unicode-escape').decode('utf-8'))
        path = dirc + '/' + str(ckpt) + '.csv'
        print("Saving at path", path)
        # Text was already escaped above; a second unicode-escape pass would double every backslash
        interim.to_csv(path, sep = ',', encoding='utf-8',\
                      index = False, header = True)
        
# Keep this return statement in the case of a TweepError so it can be recursively called and continue the function
        return full_tweets 
    
    except tweepy.TweepError as e:
        print("Around index:", i*100, "\n", e.reason)       
        #Recursive call to continue even with exception
        return full_tweets + lookup_tweets_ckpt(tweet_IDs[(i+1)*100:tweet_count], api, dirc, checkpoint)
                
    
    

*   ### <b>#YesAllWomen Twitter IDs:</b>

This hashtag was popular in May 2014 and was created partly in response to the Twitter hashtag #NotAllMen. #YesAllWomen reflected a grassroots campaign in which women shared their personal stories about harassment and discrimination. The campaign attempted to raise awareness of sexism that women experience, often from people they know.

<b>There are around 2.7 million tweet IDs in this database</b>

In [None]:
yesAllWomen = pd.read_csv('gs://csc3002/pretrain_data/tweetIDs/YesAllWomen.ids.txt', sep=',',  index_col = False, header=None, names =['id'])
print("There are", len(yesAllWomen.index), "tweets with the #YesAllWomen hashtag")
tweet_ids = list(yesAllWomen['id'])

#Path to directory holding all saves
path = 'gs://csc3002/pretrain_data/tweetText/YesAllWomenTweets' 
results = lookup_tweets_ckpt(tweet_ids[2000000:], api, path, checkpoint = 500000)

There are 2705985 tweets with the #YesAllWomen hashtag
Saving tweet text in directory at path gs://csc3002/pretrain_data/tweetText/YesAllWomenTweets

There are 705985 tweet IDs to fetch
Checkpoint saved every 500000 IDs
Fetching tweets...


<i>I fetched the first 2 million of these tweets in an unattended Jupyter notebook session. We can combine all of the tweet text files into one CSV and see how many tweets we have fetched overall after dropping duplicates and NaNs.</i>

In [95]:
wom = pd.read_csv('gs://csc3002/pretrain_data/tweetText/YesAllWomenTweets/1.csv', sep=',',  index_col = False, header=0)
wom2 = pd.read_csv('gs://csc3002/pretrain_data/tweetText/YesAllWomenTweets/2.csv', sep=',',  index_col = False, header=0)
wom3 = pd.read_csv('gs://csc3002/pretrain_data/tweetText/YesAllWomenTweets/3.csv', sep=',',  index_col = False, header=0)
wom4 = pd.read_csv('gs://csc3002/pretrain_data/tweetText/YesAllWomenTweets/4.csv', sep=',',  index_col = False, header=0)
wom5 = pd.read_csv('gs://csc3002/pretrain_data/tweetText/YesAllWomenTweets/5.csv', sep=',',  index_col = False, header=0)
wom6 = pd.read_csv('gs://csc3002/pretrain_data/tweetText/YesAllWomenTweets/6.csv', sep=',',  index_col = False, header=0)

womTweets = pd.concat([wom, wom2, wom3, wom4, wom5, wom6], axis =0)
womTweets.drop_duplicates(subset='text',inplace =True )
womTweets.dropna(inplace = True)
print("\nThere are", len(womTweets.index), "unique tweets in this csv file after dropping duplicates over the entire database")
womTweets.to_csv('gs://csc3002/pretrain_data/tweetText/YesAllWomenTweets.csv', sep = ',', \
                 encoding='utf-8', index = False, header = 0)
womTweets.head(50)


There are 388799 unique tweets in this csv file after dropping duplicates over the entire database


Unnamed: 0,id,text
0,470315730706399232,"because i get in an elevator with a guy and think ""what's my escape plan going to be?"" #yesallwomen"
1,470317902776647680,"because it starts earlyschool dress codes punish girls for wearing clothes that are ""distracting"" to boys. #yesallwomen"
2,470317461024555009,"because ""boys will be boys"" is a phrase that still exists. #yesallwomen"
3,470316502445744130,"because we're taught ""never leave your drink alone,"" instead of ""don't drug someone."" #yesallwomen"
4,470317309022978048,#yesallwomen have the right to set boundaries. prevalent misogyny has led to the extinction of private space for women.
5,470315790885863424,because women are taught to carry our keys like a weapon in case we're attacked in a parking lot. #yesallwomen
6,470315890060189696,hell yeah. #yesallwomen deserve the right to say no to a man without a given reason. they owe you
7,470316527959670784,"because there is a moment, daily, weekly, monthly, where you're in a situation where you think: ""is today the day i get rap"
8,470314968492302338,sounds like something that needs to get shared right now. #yesallwomen
9,470315514980757504,hell yeah. #yesallwomen deserve the right to say no to a man without a given reason. they owe you nothing!
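
The six `read_csv` calls above all follow one pattern, so the combining step can also be expressed as a loop over the checkpoint files. The following is a minimal sketch under stated assumptions: `combine_checkpoints` is a hypothetical helper, and small in-memory buffers with made-up rows stand in for the real checkpoint CSVs so that the example runs anywhere.

```python
import io

import pandas as pd


def combine_checkpoints(sources):
    """Concatenate checkpoint CSVs and keep one row per unique tweet text."""
    frames = [pd.read_csv(src, sep=',', index_col=False, header=0) for src in sources]
    combined = pd.concat(frames, axis=0, ignore_index=True)
    # Drop rows with missing values, then rows whose text was already seen
    combined = combined.dropna().drop_duplicates(subset='text')
    return combined.reset_index(drop=True)


# Hypothetical stand-ins for two checkpoint files
part1 = io.StringIO("id,text\n1,hello\n2,world\n")
part2 = io.StringIO("id,text\n3,hello\n4,\n")

combined = combine_checkpoints([part1, part2])
print(len(combined), "unique tweets")
```

With real data, `sources` would be the list of checkpoint paths, for example built with a list comprehension over the checkpoint numbers.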


*   ### <b>Immigration and Travel Ban Tweet Ids:</b>

This dataset contains the tweet ids of 16,875,766 tweets related to the immigration and travel ban executive order announced by the Trump Administration in January 2017. They were collected between January 30, 2017 and April 20, 2017. 

The terms used for the filter were: #MuslimBan, #NoBanNoWall, #NoMuslimBan, #JFKTerminal4, #RefugeesWelcome, muslim ban, immigrant ban, immigration ban, travel ban, immigration order, #ImmigrationBan, #TravelBan.

In [None]:
imm = pd.read_csv('gs://csc3002/pretrain_data/tweetIDs/immigration_exec_order.txt', sep=',',  index_col = False, header=None, names =['id'])
print("There are", len(imm.index), "tweets which are in this dataset. \n\nThe subject of this dataset is the executive order restricting immigration which Trump signed;", \
     "many believe the intended targets were Muslims. \nHence the predominant hashtag in this dataset is #MuslimBan or #NoMuslimBan\n\n")
tweet_ids = list(imm['id'])

#Path to directory holding all saves
path = 'gs://csc3002/pretrain_data/tweetText/immigrationTweets' 
results = lookup_tweets_ckpt(tweet_ids, api, path)

There are 16875766 tweets which are in this dataset. 
The subject of this dataset is the executive order restricting immigration which Trump signed; many believe the intended targets were Muslims. 
Hence the predominant hashtag in this dataset is #MuslimBan or #NoMuslimBan
Saving tweet text in directory at path gs://csc3002/pretrain_data/tweetText/immigrationTweets

There are 16875766 tweet IDs to fetch
Checkpoint saved every 500000 IDs
Fetching tweets...


Rate limit reached. Sleeping for: 294
Rate limit reached. Sleeping for: 331
Rate limit reached. Sleeping for: 302
Rate limit reached. Sleeping for: 273
Rate limit reached. Sleeping for: 283


Checkpoint at tweet ID No. 500000
Saving at path gs://csc3002/pretrain_data/tweetText/immigrationTweets/1.csv


Rate limit reached. Sleeping for: 267
Rate limit reached. Sleeping for: 337
Rate limit reached. Sleeping for: 278
Rate limit reached. Sleeping for: 303


In [6]:
im = pd.read_csv('gs://csc3002/pretrain_data/tweetText/immigrationTweets/0.csv', sep=',',  index_col = False, header=0)
im1 = pd.read_csv('gs://csc3002/pretrain_data/tweetText/immigrationTweets/1.csv', sep=',',  index_col = False, header=0)
im2 = pd.read_csv('gs://csc3002/pretrain_data/tweetText/immigrationTweets/2.csv', sep=',',  index_col = False, header=0)
im3 = pd.read_csv('gs://csc3002/pretrain_data/tweetText/immigrationTweets/3.csv', sep=',',  index_col = False, header=0)
im4 = pd.read_csv('gs://csc3002/pretrain_data/tweetText/immigrationTweets/4.csv', sep=',',  index_col = False, header=0)
im5 = pd.read_csv('gs://csc3002/pretrain_data/tweetText/immigrationTweets/5.csv', sep=',',  index_col = False, header=0)
im6 = pd.read_csv('gs://csc3002/pretrain_data/tweetText/immigrationTweets/6.csv', sep=',',  index_col = False, header=0)
im7 = pd.read_csv('gs://csc3002/pretrain_data/tweetText/immigrationTweets/7.csv', sep=',',  index_col = False, header=0)

imTweets = pd.concat([im, im1, im2, im3, im4, im5, im6, im7], axis =0)
imTweets.drop_duplicates(subset='text',inplace =True ) #Important to drop duplicates
imTweets.dropna(inplace = True)
print("\nThere are ", len(imTweets.index), "unique tweets in this csv file after dropping duplicates again")
imTweets.head(50)


There are  852459 unique tweets in this csv file after dropping duplicates again


Unnamed: 0,id,text
0,836058186796847108,RT @thehill: Iranian director\u2019s Oscar acceptance message hits Trump travel ban https://t.co/hv88n3dhKq https://t.co/m8k65UKcZ6
1,836058145990434816,"RT @brianstelter: Best foreign language film: ""The Salesman,"" directed by Asghar Farhadi, who boycotted the #Oscars due to Trump's travel b\u2026"
2,836058233672327168,RT @ACLU: Asghar Farhadi did not attend the #Oscars out of respect for the Iranians and people in six other countries affected by the #Musl\u2026
3,836058203213422594,"RT @BostonGlobe: Watch: Iranian director wins #Oscars, but skips awards show over Trump\u2019s \u2018inhumane\u2019 travel ban https://t.co/dPi2V1zcUg htt\u2026"
4,836058161786183680,RT @ACLU: More from Asghar Farhadi on #Muslimban at @UTAFoundation #UnitedVoices rally https://t.co/30EkrRTJYJ
5,836058249753350145,"RT @CNN: The Salesman wins best Foreign Language Film, directed by Asghar Farhadi who boycotted #Oscars over Trump travel ban https://t.co/\u2026"
6,836058248646180865,"RT @Nick_Offerman: You can just stop at ""Trump rejects intelligence"" https://t.co/lRlN9ws657"
7,836058240420999170,RT @Rosie: jail should be FANTASTIC !!! #russiagate #resist #FUN https://t.co/uPqzw7nzpk
8,836058153225699329,"RT @CNNent: Oscar winner #TheSalesman was directed by Asghar Farhadi, who boycotted the #Oscars over Trump's travel ban. https://t.co/qrL95\u2026"
9,836058141854994433,#AsgharFarhadi \U0001f339\U0001f339\U0001f339 https://t.co/WdPheCOHXY
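
The `\u2019` and `\U0001f339` sequences in the output above are the result of the `unicode-escape` encoding applied to the text before each save. If the original characters are needed again, the escaping can be reversed. A minimal sketch, in which `escape_text` and `unescape_text` are hypothetical helper names:

```python
def escape_text(text):
    """Replace non-ASCII characters and control characters with escape sequences."""
    return text.encode('unicode-escape').decode('utf-8')


def unescape_text(text):
    """Reverse escape_text; assumes the input is pure ASCII, as escape_text produces."""
    return text.encode('ascii').decode('unicode-escape')


original = "Trump\u2019s \U0001f339\nok"
escaped = escape_text(original)
restored = unescape_text(escaped)
print(escaped)
print(restored == original)
```

Applying `escape_text` twice doubles every backslash, so the escaping should be applied exactly once per save.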


In [7]:
imTweets.to_csv('gs://csc3002/pretrain_data/tweetText/immigrationTweets.csv', sep = ',', \
                encoding='utf-8', index = False, header = 0)

# Removing common terms for word-masking
It has occurred to me that, because these tweet datasets were sourced by filtering on a particular term or hashtag via the Twitter API, the same terms or hashtags will recur in almost every sequence. This may be counter-productive for the word-masking exercise later in further BERT pre-training.

<i>(Removing these terms may, however, render some sequences nonsensical; I'm not sure how concerning this is from an NLP standpoint.)</i>

Therefore I'll develop a function to remove these common terms from my tweet data. These terms are often followed by punctuation, so I'll first fill a list with the terms I'd like to remove, and from that create an augmented list containing every term plus a version of it with each punctuation mark appended.
    
<i>(I'll do further pre-training both on data that has had common terms removed and on data where the terms are kept but hashtags are segmented to mitigate the commonality of terms.)</i>

I'll convert this list into a set at the end of the function, as membership tests on a set are much faster than on a list, which will be useful later on.
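
The hashtag segmentation mentioned above can be illustrated with a small regex-based splitter. This is only a sketch of one possible approach, not necessarily the method used later: `split_hashtag` and `segment_hashtags` are hypothetical names, and the splitter relies on capitalisation and digits, so an all-lower-case hashtag such as `#yesallwomen` is left as a single word.

```python
import re

# Runs of capitals (acronyms), capitalised or lower-case words, and digit runs
_PIECES = re.compile(r'[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+')


def split_hashtag(tag):
    """Split a hashtag body such as 'UniteTheRight' into space-separated words."""
    pieces = _PIECES.findall(tag)
    return ' '.join(pieces) if pieces else tag


def segment_hashtags(text):
    """Replace every ASCII hashtag in text with its segmented words."""
    return re.sub(r'#([A-Za-z0-9_]+)', lambda m: split_hashtag(m.group(1)), text)


print(segment_hashtags("Loves #Trump #TheChalkening #Trump2016 #MAGA"))
print(segment_hashtags("#NoBanNoWall at #JFKTerminal4"))
```

Segmenting lower-case hashtags would need a dictionary-based word segmenter, which is beyond this sketch.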

In [8]:
# This dataset combines the AOC replies with the #charlottesville, #blmkidnapping, #bill10 and #uniteTheRight tweets
dat = pd.read_csv('gs://csc3002/pretrain_data/tweetText/full.csv',\
                  sep=',',  index_col = False, header=0)

list1 = ['#unitetheright', '#charlottesville', '#utr', '#blmkidnapping',\
            '#bill10', 'charlottesville']

def addPunc(wordlist):
    
    # Suffixes to try: the bare term, then the term followed by each punctuation mark
    suffixes = ['', '.', ',', '?', '!', '-', ':', ';']
    
    # Build the set directly; a set gives faster lookup later
    wordset = {word + suffix for word in wordlist for suffix in suffixes}
    return wordset

print(addPunc(list1))

{'charlottesville,', '#charlottesville;', '#utr:', '#charlottesville', '#charlottesville?', '#unitetheright', '#unitetheright;', '#utr!', '#blmkidnapping,', 'charlottesville!', '#charlottesville!', '#blmkidnapping?', '#bill10:', '#utr;', 'charlottesville', '#unitetheright.', 'charlottesville:', '#unitetheright:', '#unitetheright!', '#blmkidnapping.', '#utr,', '#utr?', '#blmkidnapping-', '#bill10;', 'charlottesville;', '#bill10', '#charlottesville,', '#bill10,', 'charlottesville?', '#bill10-', '#bill10?', 'charlottesville-', '#charlottesville:', '#charlottesville.', '#utr', '#utr-', '#bill10!', '#utr.', '#blmkidnapping:', '#charlottesville-', '#bill10.', '#unitetheright,', 'charlottesville.', '#blmkidnapping!', '#blmkidnapping', '#unitetheright-', '#unitetheright?', '#blmkidnapping;'}


In [13]:
def removeWords(text_string, wordlist):
    
    # Make sure the row entry is a string
    text_string = str(text_string)
    
    # Add punctuation variants of each term so that occurrences followed by a punctuation mark are matched too.
    # Terms are lower-cased because each token is lower-cased before the lookup below.
    wordSet = addPunc([word.lower() for word in wordlist]) 
    
    # Add a space before each hashtag. This ensures strings with consecutive hashtags are split properly
    text_string = text_string.replace('#', ' #') 
    
    querywords = text_string.split()
    resultwords  = [word for word in querywords if word.lower() not in wordSet]
    resultwords = ' '.join(resultwords)
    return resultwords

stopwords = ['#unitetheright', '#charlottesville', '#utr', '#blmkidnapping',\
            '#bill10', 'charlottesville']

tweet = dat.iloc[10]['text']
print("Removing #unitetheright:\n")
print("Original:\n", tweet)
print("\n\nNew:\n ", removeWords(tweet, stopwords))

tweet = dat.iloc[20]['text']
print("\n\nRemoving \"charlottesville\":\n")
print("Original:\n", tweet)
print("\n\nNew:\n ", removeWords(tweet, stopwords))

Removing #unitetheright:

Original:
 RT @RealAlexRubi: Tons of gas deployed. Police declare an unlawful assembly outside #antifa counter-protest of #UniteTheRight #Charlottesvi\u2026


New:
  RT @RealAlexRubi: Tons of gas deployed. Police declare an unlawful assembly outside #antifa counter-protest of #Charlottesvi\u2026


Removing "charlottesville":

Original:
 Trump Disbands Advisory Panels As CEOs Quit Over Charlottesville Remarks https://t.co/YuzB7mBDfq


New:
  Trump Disbands Advisory Panels As CEOs Quit Over Remarks https://t.co/YuzB7mBDfq


<b>Let's see the effect this function has</b>

In [14]:
dat['text'] = dat['text'].apply(removeWords, wordlist = stopwords)
dat.head(40)

Unnamed: 0,id,text
0,1112695958540902402,@AOC They pay 7 billion a year in taxes and cost us 13 billion. Do the math
1,1112477224412672001,@AOC How do you pay taxes on free?
2,816807362363265032,
3,896356223649333248,RT @AndreaChalupa: Bannon and Gorka are actual Nazis in the White House paid by US tax payers as they spread lies and hate.
4,896553831118454784,"RT @RealAlexRubi: ""Tight formation!"": rally lights torches, preps security for march https://t.co/s0btqHigrj"
5,896574089741058048,RT @irmahinojosa_: Discussing deceitful politicians with @Johnny__MAGA and the rally in #ThisIsNotUs https\u2026
6,898220671054086145,VICE @vicenews has produced an impeccable gem of journalism on WATCH #Nazis #ImpeachTrump #ViceNews https://t.co/NGN2OMff8v
7,896365557959856128,"At this point, singing clergy in street most vocal here in But rally not due to start for another 2 hrs."
8,1105931607783890944,@JamieRio @pramsey342 @ClintSmithIII @AOC You mean a rate closer to what they paid for the 50 years that everyone r\u2026 https://t.co/qir3Ur3Xue
9,897992781490270209,"@FoxNews America I like to believe YOU are smart, but YOU""RE NOT; if you heard Trump say only one thing on whose at fault"


In [15]:
dat.to_csv('gs://csc3002/pretrain_data/tweetText/removedTerms/full.csv', sep = ',', encoding='utf-8', index = False, header = 0)

<b>Likewise, we'll remove the common term from the #thechalkening tweet database, where the hashtag #thechalkening itself comes up often.</b>

In [16]:
#This is the #thechalkening database
chalk = pd.read_csv('gs://csc3002/pretrain_data/tweetText/chalkTweets.csv', sep=',',  index_col = False, header=0)
chalk.head(10)

Unnamed: 0,id,text
0,720103525401931777,RT @DanScavino: #TheChalkening- thank you! #Trump2016 #StudentsForTrump https://t.co/W6aTj6TzzL
1,720107446149193728,RT @2AFight: 19 yr old Kurdish girl celebrates 1 year of fighting #ISIS... needs no #SafeSpace from #TheChalkening\n\n#tcot\n#PJNET https://t.\u2026
2,720102290749841408,Sullivan County NY Loves #Trump #TheChalkening #Trump2016 #MAGA \U0001f1fa\U0001f1f8\U0001f1fa\U0001f1f8\U0001f1fa\U0001f1f8 https://t.co/u4k7jUogB8
3,720103135440715776,RT @RoyBatty010816: Next thing you know there will be a background check and waiting period to buy chalk.\n#TheChalkening https://t.co/L9puO\u2026
4,720100823729102848,"RT @anamericanfam: HUGE props to @AndersonU students who conducted #TheChalkening on campus! Together, WE will #MakeAmericaGreatAgain https\u2026"
5,720104336202821632,RT @LainieYennie: Sullivan County NY Loves #Trump #TheChalkening #Trump2016 #MAGA \U0001f1fa\U0001f1f8\U0001f1fa\U0001f1f8\U0001f1fa\U0001f1f8 https://t.co/u4k7jUogB8
6,720101963812233216,RT @TomatoPie1: .@DanScavino #TheChalkening At shopping malls! Libraries! Especially Starbucks! Everywhere\n#Trump2016 #TrumpNewYork https:/\u2026
7,720140123141185537,RT @TrumpStudents: We will not be silenced! WE WILL WIN! #TeamTrump #Trump2016 #TheChalkening #StudentsForTrump #StandWithStudents https://\u2026
8,720116749765582848,"a trump supporter once said, ""in real life, there are no safe spaces"" #thechalkening https://t.co/5sAkx3zZTd"
9,720102767130472450,"You can wash it off, but you can't erase Trump support at NU! #TheChalkening @TrumpStudentsIL @ILStudentsforTrump https://t.co/iAGZbF14X9"


Remove common terms...

In [17]:
stopwords = ['#thechalkening']
chalk['text'] = chalk['text'].apply(removeWords, wordlist = stopwords)
chalk.head(10)

Unnamed: 0,id,text
0,720103525401931777,RT @DanScavino: thank you! #Trump2016 #StudentsForTrump https://t.co/W6aTj6TzzL
1,720107446149193728,RT @2AFight: 19 yr old Kurdish girl celebrates 1 year of fighting #ISIS... needs no #SafeSpace from #TheChalkening\n\n #tcot\n #PJNET https://t.\u2026
2,720102290749841408,Sullivan County NY Loves #Trump #Trump2016 #MAGA \U0001f1fa\U0001f1f8\U0001f1fa\U0001f1f8\U0001f1fa\U0001f1f8 https://t.co/u4k7jUogB8
3,720103135440715776,RT @RoyBatty010816: Next thing you know there will be a background check and waiting period to buy chalk.\n https://t.co/L9puO\u2026
4,720100823729102848,"RT @anamericanfam: HUGE props to @AndersonU students who conducted on campus! Together, WE will #MakeAmericaGreatAgain https\u2026"
5,720104336202821632,RT @LainieYennie: Sullivan County NY Loves #Trump #Trump2016 #MAGA \U0001f1fa\U0001f1f8\U0001f1fa\U0001f1f8\U0001f1fa\U0001f1f8 https://t.co/u4k7jUogB8
6,720101963812233216,RT @TomatoPie1: .@DanScavino At shopping malls! Libraries! Especially Starbucks! Everywhere\n #Trump2016 #TrumpNewYork https:/\u2026
7,720140123141185537,RT @TrumpStudents: We will not be silenced! WE WILL WIN! #TeamTrump #Trump2016 #StudentsForTrump #StandWithStudents https://\u2026
8,720116749765582848,"a trump supporter once said, ""in real life, there are no safe spaces"" https://t.co/5sAkx3zZTd"
9,720102767130472450,"You can wash it off, but you can't erase Trump support at NU! @TrumpStudentsIL @ILStudentsforTrump https://t.co/iAGZbF14X9"
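
Row 1 above still contains `#TheChalkening` because the escaped `\n` characters are glued to the hashtag, so the whitespace-split token never matches the set of terms. One way to catch such cases is to match the terms with a regular expression instead of splitting on whitespace. This is a sketch under stated assumptions rather than a drop-in replacement: `build_term_pattern` and `remove_terms` are hypothetical names, and the optional trailing punctuation mirrors the marks handled by `addPunc`.

```python
import re


def build_term_pattern(terms):
    """Compile one case-insensitive pattern that matches any of the given terms."""
    # Longest terms first so a longer term wins over a shorter one at the same position
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = '|'.join(re.escape(term) for term in ordered)
    # Start: not preceded by a word character, or the term itself begins with '#'.
    # End: optional punctuation mark, then no further word character.
    return re.compile(r'(?:(?<!\w)|(?=#))(?:' + alternatives + r')[.,?!:;-]?(?!\w)',
                      re.IGNORECASE)


def remove_terms(text, terms):
    """Remove every occurrence of the terms and normalise the remaining whitespace."""
    if not terms:
        return ' '.join(str(text).split())
    cleaned = build_term_pattern(terms).sub(' ', str(text))
    return ' '.join(cleaned.split())


sample = r"needs no #SafeSpace from #TheChalkening\n\n#tcot"
print(remove_terms(sample, ['#thechalkening']))
print(remove_terms("RT @DanScavino: #TheChalkening- thank you!", ['#thechalkening']))
```

Compiling the pattern once and reusing it would be preferable when applying the function to a whole column.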


And Save

In [18]:
chalk.to_csv('gs://csc3002/pretrain_data/tweetText/removedTerms/chalkTweets.csv', sep = ',', \
             encoding='utf-8', index = False, header = 0)

<b>And for the #NotAllMen and #NotAllWomen databases</b>

In [29]:
final = pd.read_csv('gs://csc3002/pretrain_data/tweetText/sexismTweets.csv',sep=',',  index_col = False)
final.head(10)

Unnamed: 0,id,text
0,470578066662895616,"RT @Eidelonn: #notallmen are rapists but ANY MAN could be,from anywhere and at any time, so #YesAllWomen live in fear."
1,471061943965720577,"RT @so_unimpressed: No, #notallmen are rapists and murderers, but #yesallwomen are still in danger of being raped or murdered just because \u2026"
2,470472133622390784,RT @schemaly: #notallmen practice violence against women but #YesAllWomen live with the threat of male violence. Every. Single. Day. All ov\u2026
3,470573365195857921,"RT @karinjr: No, #NotAllMen are violent against women, but #YesAllWomen have to navigate a world where those who are look the same as those\u2026"
4,471516144093122560,"RT @alliasan: Because #YesAllWomen have to endure near daily harassment in their adult lives, but #notallmen can handle hearing this for 4 \u2026"
5,471321319833735168,"RT @kpmiracle: #NotAllMen are violent against women, but if we are just passive bystanders then we're still part of the problem. #YesAllWom\u2026"
6,470836290204991488,I hate this fad but this gif is great #NotAllMen #Castlevania #vampires http://t.co/4qYILg2tWV
7,470976100760158209,"RT @GenericPsycho: Maybe #notallmen are rapists, but #YesAllWomen are terrified of passing men on the street, walking home alone, going out\u2026"
8,472034665843687425,"Enough w/the #notallmen defense, men. Women KNOW we're not all like that. Stop stealing spotlight w/whining &amp; support women. #YesAllWomen"
9,470814489017729024,"RT @dxtehsecks: In all seriousness though, the contrast between #YesAllWomen and #NotAllMen has been an eye opener"


In [30]:
stopwords = ['#notallmen', '#notallwomen', '#yesallwomen']
tweet = final.iloc[650]['text']

print("Original:\n", tweet)
print("\n\nNew:\n ", removeWords(tweet, stopwords))

final['text'] = final['text'].apply(removeWords, wordlist = stopwords)
final.head(10)

Original:
 And I hate it. #YesAllWomen have dealt with shitty men doing shitty things, &amp; yes, #NotAllMen are awful. #RiseAboveHate


New:
  And I hate it. have dealt with shitty men doing shitty things, &amp; yes, are awful. #RiseAboveHate


Unnamed: 0,id,text
0,470578066662895616,"RT @Eidelonn: are rapists but ANY MAN could be,from anywhere and at any time, so live in fear."
1,471061943965720577,"RT @so_unimpressed: No, are rapists and murderers, but are still in danger of being raped or murdered just because \u2026"
2,470472133622390784,RT @schemaly: practice violence against women but live with the threat of male violence. Every. Single. Day. All ov\u2026
3,470573365195857921,"RT @karinjr: No, are violent against women, but have to navigate a world where those who are look the same as those\u2026"
4,471516144093122560,"RT @alliasan: Because have to endure near daily harassment in their adult lives, but can handle hearing this for 4 \u2026"
5,471321319833735168,"RT @kpmiracle: are violent against women, but if we are just passive bystanders then we're still part of the problem. #YesAllWom\u2026"
6,470836290204991488,I hate this fad but this gif is great #Castlevania #vampires http://t.co/4qYILg2tWV
7,470976100760158209,"RT @GenericPsycho: Maybe are rapists, but are terrified of passing men on the street, walking home alone, going out\u2026"
8,472034665843687425,"Enough w/the defense, men. Women KNOW we're not all like that. Stop stealing spotlight w/whining &amp; support women."
9,470814489017729024,"RT @dxtehsecks: In all seriousness though, the contrast between and has been an eye opener"


In [31]:
final.to_csv('gs://csc3002/pretrain_data/tweetText/removedTerms/sexismTweets.csv', sep = ',', \
             encoding='utf-8', index = False, header = 0)

<b>#YesAllWomen Dataset</b>

In [37]:
womTweets = pd.read_csv('gs://csc3002/pretrain_data/tweetText/YesAllWomenTweets.csv',sep=',',  index_col = False, names =['id', 'text'])
womTweets.head(10)

Unnamed: 0,id,text
0,470315730706399232,"Because I get in an elevator with a guy and think ""what's my escape plan going to be?"" #YesAllWomen"
1,470317902776647680,"RT @Ava_Jae: Because it starts early\u2014school dress codes punish girls for wearing clothes that are ""distracting"" to boys. #YesAllWomen"
2,470315567233372160,RT @anniecardi: Because women are taught to carry our keys like a weapon in case we're attacked in a parking lot. #YesAllWomen
3,470315805222404096,"RT @anniecardi: Because I get in an elevator with a guy and think ""what's my escape plan going to be?"" #YesAllWomen"
4,470317461024555009,"RT @anniecardi: Because ""boys will be boys"" is a phrase that still exists. #YesAllWomen"
5,470317579920506880,"Because it starts early\u2014school dress codes punish girls for wearing clothes that are ""distracting"" to boys. #YesAllWomen"
6,470316502445744130,"RT @anniecardi: Because we're taught ""never leave your drink alone,"" instead of ""don't drug someone."" #YesAllWomen"
7,470317309022978048,#YesAllWomen have the right to set boundaries. Prevalent misogyny has led to the extinction of private space for women.
8,470315890060189696,RT @Ceilidhann: @anniecardi @gildedspine Hell yeah. #YesAllWomen deserve the right to say no to a man without a given reason. They owe you \u2026
9,470314968492302338,@gildedspine @Ceilidhann Sounds like something that needs to get shared right now. #YesAllWomen


In [38]:
# Many of the tweets follow the format "Because ...", so the word "because" comes up too often
tweet = womTweets.iloc[650]['text']
stopwords = ['because', '#yesallwomen']

print("Original:\n", tweet)
print("\n\nNew:\n ", removeWords(tweet, stopwords))

womTweets['text'] = womTweets['text'].apply(removeWords, wordlist = stopwords)

Original:
 FYI #YesAllWomen includes WOC, trans women, women w/ disabilities, sex workers, celibate women, and non-binary people read as women.


New:
  FYI includes WOC, trans women, women w/ disabilities, sex workers, celibate women, and non-binary people read as women.


In [39]:
womTweets.to_csv('gs://csc3002/pretrain_data/tweetText/removedTerms/YesAllWomenTweets.csv', sep = ',', \
                 encoding='utf-8', index = False, header = 0)
womTweets.head(10)

Unnamed: 0,id,text
0,470315730706399232,"I get in an elevator with a guy and think ""what's my escape plan going to be?"""
1,470317902776647680,"RT @Ava_Jae: it starts early\u2014school dress codes punish girls for wearing clothes that are ""distracting"" to boys."
2,470315567233372160,RT @anniecardi: women are taught to carry our keys like a weapon in case we're attacked in a parking lot.
3,470315805222404096,"RT @anniecardi: I get in an elevator with a guy and think ""what's my escape plan going to be?"""
4,470317461024555009,"RT @anniecardi: ""boys will be boys"" is a phrase that still exists."
5,470317579920506880,"it starts early\u2014school dress codes punish girls for wearing clothes that are ""distracting"" to boys."
6,470316502445744130,"RT @anniecardi: we're taught ""never leave your drink alone,"" instead of ""don't drug someone."""
7,470317309022978048,have the right to set boundaries. Prevalent misogyny has led to the extinction of private space for women.
8,470315890060189696,RT @Ceilidhann: @anniecardi @gildedspine Hell yeah. deserve the right to say no to a man without a given reason. They owe you \u2026
9,470314968492302338,@gildedspine @Ceilidhann Sounds like something that needs to get shared right now.


<b>Immigration Executive Order Dataset</b>

In [48]:
imTweets = pd.read_csv('gs://csc3002/pretrain_data/tweetText/immigrationTweets.csv',sep=',',  index_col = False, names = ['id', 'text'])
imTweets.head(10)

Unnamed: 0,id,text
0,836058186796847108,RT @thehill: Iranian director\u2019s Oscar acceptance message hits Trump travel ban https://t.co/hv88n3dhKq https://t.co/m8k65UKcZ6
1,836058145990434816,"RT @brianstelter: Best foreign language film: ""The Salesman,"" directed by Asghar Farhadi, who boycotted the #Oscars due to Trump's travel b\u2026"
2,836058233672327168,RT @ACLU: Asghar Farhadi did not attend the #Oscars out of respect for the Iranians and people in six other countries affected by the #Musl\u2026
3,836058203213422594,"RT @BostonGlobe: Watch: Iranian director wins #Oscars, but skips awards show over Trump\u2019s \u2018inhumane\u2019 travel ban https://t.co/dPi2V1zcUg htt\u2026"
4,836058161786183680,RT @ACLU: More from Asghar Farhadi on #Muslimban at @UTAFoundation #UnitedVoices rally https://t.co/30EkrRTJYJ
5,836058249753350145,"RT @CNN: The Salesman wins best Foreign Language Film, directed by Asghar Farhadi who boycotted #Oscars over Trump travel ban https://t.co/\u2026"
6,836058248646180865,"RT @Nick_Offerman: You can just stop at ""Trump rejects intelligence"" https://t.co/lRlN9ws657"
7,836058240420999170,RT @Rosie: jail should be FANTASTIC !!! #russiagate #resist #FUN https://t.co/uPqzw7nzpk
8,836058153225699329,"RT @CNNent: Oscar winner #TheSalesman was directed by Asghar Farhadi, who boycotted the #Oscars over Trump's travel ban. https://t.co/qrL95\u2026"
9,836058141854994433,#AsgharFarhadi \U0001f339\U0001f339\U0001f339 https://t.co/WdPheCOHXY


In [49]:
tweet = imTweets.iloc[4]['text']
# Lower-case terms, matching the lower-cased tokens that removeWords compares against
stopwords = ['#muslimban', '#nobannowall', '#nomuslimban','#jfkterminal4',\
             '#refugeeswelcome', '#immigrationban', '#travelban']
             
print("Original:\n", tweet)
print("\n\nNew:\n ", removeWords(tweet, stopwords))

imTweets['text'] = imTweets['text'].apply(removeWords, wordlist = stopwords)

Original:
 RT @ACLU: More from Asghar Farhadi on #Muslimban at @UTAFoundation  #UnitedVoices rally https://t.co/30EkrRTJYJ


New:
  RT @ACLU: More from Asghar Farhadi on at @UTAFoundation #UnitedVoices rally https://t.co/30EkrRTJYJ


In [50]:
imTweets.to_csv('gs://csc3002/pretrain_data/tweetText/removedTerms/immigrationTweets.csv', sep = ',', \
                 encoding='utf-8', index = False, header = 0)
imTweets.head(10)

Unnamed: 0,id,text
0,836058186796847108,RT @thehill: Iranian director\u2019s Oscar acceptance message hits Trump travel ban https://t.co/hv88n3dhKq https://t.co/m8k65UKcZ6
1,836058145990434816,"RT @brianstelter: Best foreign language film: ""The Salesman,"" directed by Asghar Farhadi, who boycotted the #Oscars due to Trump's travel b\u2026"
2,836058233672327168,RT @ACLU: Asghar Farhadi did not attend the #Oscars out of respect for the Iranians and people in six other countries affected by the #Musl\u2026
3,836058203213422594,"RT @BostonGlobe: Watch: Iranian director wins #Oscars, but skips awards show over Trump\u2019s \u2018inhumane\u2019 travel ban https://t.co/dPi2V1zcUg htt\u2026"
4,836058161786183680,RT @ACLU: More from Asghar Farhadi on at @UTAFoundation #UnitedVoices rally https://t.co/30EkrRTJYJ
5,836058249753350145,"RT @CNN: The Salesman wins best Foreign Language Film, directed by Asghar Farhadi who boycotted #Oscars over Trump travel ban https://t.co/\u2026"
6,836058248646180865,"RT @Nick_Offerman: You can just stop at ""Trump rejects intelligence"" https://t.co/lRlN9ws657"
7,836058240420999170,RT @Rosie: jail should be FANTASTIC !!! #russiagate #resist #FUN https://t.co/uPqzw7nzpk
8,836058153225699329,"RT @CNNent: Oscar winner #TheSalesman was directed by Asghar Farhadi, who boycotted the #Oscars over Trump's travel ban. https://t.co/qrL95\u2026"
9,836058141854994433,#AsgharFarhadi \U0001f339\U0001f339\U0001f339 https://t.co/WdPheCOHXY
