# Preface

After some research, I found that a favourable approach to further pre-training is to continue within-task or in-domain pretraining from the existing BERT model checkpoint.

I've already collected around 150,000 tweets that are categorized as either hate speech, offensive or benign so this roughly satisfies the requirement that the pretraining data is within-task or in-domain.

However, there are far more tweet databases that can be used to further pre-train my BERT model. There is a wealth of unsupervised tweet datasets online that I can make use of. These datasets often do not come in text form; instead, the tweets are represented by their tweet IDs. Below I describe the tweet datasets and retrieve their associated text using the tweepy module.

# Retrieving more pre-training data from unsupervised tweet datasets
Using as wide a variety of sources as possible, we will increase the knowledge of our further-pretrained model. My goal when sourcing these datasets was to try and find tweet datasets which are largely user-generated, as hate speech online largely comes from user generated content.

Where I can find them, I will use tweet datasets likely to contain aggressive or abusive content, possibly including sexist and racial slurs. Even tweets that simply discuss issues of race, sex or homophobia would be beneficial for my model to be trained on, as they may use the language associated with these issues.

The tweets come from the following sources. All except the #Charlottesville set, which already includes the tweet text, are distributed as tweet IDs and must be retrieved by ID:


*   ### <b>#UniteTheRight tweet database:</b>

The Unite the Right rally (also known as the Charlottesville rally) was a protest in Charlottesville, Virginia, United States from August 11–12, 2017, to oppose the removal of a statue of Robert E. Lee in Emancipation Park, which itself was renamed from Lee Park two months earlier. Protesters included white supremacists, white nationalists, neo-Confederates, neo-Nazis, and militias. This dataset contains 200,113 tweet ids collected with the #unitetheright hashtag. The time ranges for the tweets are from 2017-08-04 11:44:12 to 2017-08-15 16:03:30 GMT.

*   ### <b>Bill 10 Twitter IDs:</b>

A list of 24,876 Twitter IDs for tweets harvested between Nov. 28 and Dec. 6 2014 containing the hashtag #bill10. Bill 10 in the Alberta legislature would have given public and Catholic school boards the right to refuse student requests to form gay-straight alliances in schools. Under intense public interest it was withdrawn by the Conservative government.

*  ### <b>BLMKidnapping:</b>

These 136,990 tweet ids represent reaction to a Facebook Live video that was posted on January 3rd, 2017, showing four African American men violently attacking a white, mentally disabled man. The tweets were collected on 01/05/2017. After the video surfaced, the Twitter hashtag, #BLMkidnapping, was created and used to incorrectly attribute the violent attack to members of the Black Lives Matter movement. Police in Chicago, where the attack took place, have found no evidence the attack has any connection to the Black Lives Matter movement. This link is to a CNN story documenting the police denial of Black Lives Matter connection: http://www.cnn.com/2017/01/05/us/black-lives-matter-chicago-facebook-live-beating/index.html

*   ### <b>#Charlottesville:</b>

On Friday, August 11th, 2017 large groups of racist white nationalists carrying torches marched on the University of Virginia campus in Charlottesville, VA as an intimidation tactic against proponents for the removal of confederate statues of Robert E. Lee. The Friday evening march was held ahead of a much larger racist white nationalist rally in the center of Charlottesville planned for Saturday, August 12th, 2017. 

*   ### <b>Replies to Ocasio-Cortez Tweets:</b>

Replies to Rep. Alexandria Ocasio-Cortez’s Tweets and Retweets in March 2019. Whilst many tweets to Ms. Ocasio-Cortez may be glowing praise, I'm counting on a sizable portion of the tweets directed at her being abusive and perhaps sexist and racist. I feel this is a reasonable hypothesis, given how toxic America's political climate has become in recent years.


<b>Importing and Installing Dependencies ... </b>

In [1]:
!pip install tweepy
!pip install google-cloud-storage



In [2]:
import os
import pandas as pd
import re
import json
import tweepy

#Below is to authenticate google bucket access for local machines

#Put GCS service account credentials json in current working directory
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]= "./storageCreds.json"
from google.cloud import storage
storage_client = storage.Client()
buckets = list(storage_client.list_buckets())
print(buckets) # Testing if access to GCS has been granted

pd.set_option('display.max_colwidth', -1) # Set col width to -1 so we can see entire text column

#Below is for google colab environment
"""from google.colab import auth
auth.authenticate_user()
!gcloud config set project 'my-project-csc3002'"""

!pip install gcsfs

[<Bucket: csc3002>]


**Combining tweet ID dataframes**

In [2]:
#UniteTheRight
utr = 'gs://csc3002/pretrain_data/UniteTheRight.txt'
utr = pd.read_csv(utr, sep=',',  index_col = False, encoding = 'utf-8', header = None, names = ['id'])
print("#UniteTheRight tweet database contains", len(utr.index), "tweet IDs")

#Bill 10
b10 = 'gs://csc3002/pretrain_data/bill10tweets.txt'
b10 = pd.read_csv(b10, sep=',',  index_col = False, encoding = 'utf-8', header = None, names = ['id'])
print("\nBill 10 tweet database contains", len(b10.index), "tweet IDs")

#BlmKidnapping
blm = 'gs://csc3002/pretrain_data/blmkidnapping_tweet_ids_v1.txt'
blm = pd.read_csv(blm, sep=',',  index_col = False, encoding = 'utf-8', header = None, names = ['id'])
print("\n#blmkidnapping tweet database contains", len(blm.index), "tweet IDs")

#Ocasio-Cortez Replies
aoc = 'gs://csc3002/pretrain_data/ocasio_cortez_replies.csv'
aoc = pd.read_csv(aoc, sep=',',  index_col = False, encoding = 'utf-8', header = 0, names = ['id'])
print("\nAOC tweet database contains", len(aoc.index), "tweet IDs")



#UniteTheRight tweet database contains 200113 tweet IDs

Bill 10 tweet database contains 24876 tweet IDs

#blmkidnapping tweet database contains 136990 tweet IDs

AOC tweet database contains 109201 tweet IDs


Now combining all of the tweet IDs into a single dataframe.

In [3]:
full = pd.concat([aoc, blm, b10, utr], axis =0)
full.drop_duplicates(subset='id',inplace =True ) #Important to drop duplicates

#Shuffle Data
full = full.sample(frac=1)
full.reset_index(drop = True, inplace = True)

print("There are", len(full.index), "tweet IDs in this dataset")
full.head()

There are 471180 tweet IDs in this dataset


Unnamed: 0,id
0,896436979776143361
1,896565219501232128
2,897079831506153472
3,895836384468054017
4,896198992710717445


Below is a tweepy helper to obtain tweets via ID. The Twitter API only accepts up to 100 tweet IDs per lookup request and also enforces rate limits; therefore, we fetch the IDs in batches of 100 and set the wait_on_rate_limit parameter to True.
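The batching step can be sketched on its own, without any API calls. This is a minimal illustration rather than the function used in this notebook: the helper name `batch_ids` is hypothetical. Stepping through the list with `range(0, n, 100)` never produces an empty final batch, whereas looping over `(n // 100) + 1` batch indices yields an empty last slice whenever the number of IDs is an exact multiple of 100.

```python
def batch_ids(tweet_ids, batch_size=100):
    """Yield consecutive slices of at most batch_size IDs (never an empty slice)."""
    for start in range(0, len(tweet_ids), batch_size):
        yield tweet_ids[start:start + batch_size]


# Hypothetical IDs, purely for illustration
example_ids = list(range(250))
batch_sizes = [len(batch) for batch in batch_ids(example_ids)]
```

Each batch could then be passed to a single lookup request.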

In [3]:
def lookup_tweets(tweet_IDs, api):
    
    full_tweets = []
    tweet_count = len(tweet_IDs)
    print("\nThere are", tweet_count, "tweet IDs to fetch")
    
    #Catching error if empty list
    if tweet_count < 1:
        return full_tweets
    
    #Below code to monitor progress
    #It's divided by 500 because we retrieve tweets via API call 100 at a time
    #and we want to report progress every fifth of the way through
    x = int(tweet_count/500) 
    progress = {x: '20%', x*2: '40%', x*3: '60%', x*4: '80%'}
    print("Fetching tweets...")
    try:
        for i in range((tweet_count // 100) + 1):
            if i in list(progress.keys()):
                print(progress[i], "complete")

        
            # Catch the last group if it is less than 100 tweets
            end_loc = min((i + 1) * 100, tweet_count)
            #.extend keeps full_tweets a flat list of Status objects;
            #.append would nest each batch as a list, breaking status._json later on
            full_tweets.extend(
                api.statuses_lookup(id_=tweet_IDs[i * 100:end_loc], map_ = True))
            
        return full_tweets
    
    except tweepy.TweepError as e:
        print("Around index:", i*100, "\n", e.reason)       
        #Recursive call to continue even with exception
        return full_tweets + lookup_tweets(tweet_IDs[(i+1)*100:tweet_count], api)
                

#Google colab
"""from google.colab import drive
drive.mount('/content/drive')
with open('/content/drive/My Drive/twitter_credentials.json', "r") as f:
  creds = json.load(f)"""

#Local machine
with open('./twitter_credentials.json', "r") as f:
    creds = json.load(f)


auth = tweepy.OAuthHandler(creds['CONSUMER_KEY'], creds['CONSUMER_SECRET'])
auth.set_access_token(creds['ACCESS_TOKEN'], creds['ACCESS_SECRET'])

api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True, \
                 retry_count=10, retry_delay=5, retry_errors=set([503])) # These last three params catch over-capacity error

Applying the custom tweepy function to this dataframe to retrieve the text content corresponding to each tweet ID, then wrangling all of the text data into a singular dataframe.

<b>The below cell may take a while to run.</b>

In [5]:
tweet_ids = list(full['id'])

#If the number of IDs is an exact multiple of 100, the final (empty) batch raises
#an error, which lookup_tweets catches. Takes a while
results = lookup_tweets(tweet_ids, api)
                                  
temp = json.dumps([status._json for status in results]) #create JSON
final = pd.read_json(temp, orient='records')
final = final[['id','text']]
pd.set_option('display.max_colwidth', -1)
print(final.info())
final.head()

Rate limit reached. Sleeping for: 177
Rate limit reached. Sleeping for: 137
Rate limit reached. Sleeping for: 207
Rate limit reached. Sleeping for: 29


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 471180 entries, 0 to 471179
Data columns (total 2 columns):
id      471180 non-null int64
text    235551 non-null object
dtypes: int64(1), object(1)
memory usage: 7.2+ MB
None


Unnamed: 0,id,text
0,1108908334445322241,@AOC Says the bartender...
1,1102941794491351040,@AOC @bungarsargon Right now all you are doing is hurting our own party with division. Why can you just acknowledge… https://t.co/OYpLKjt6bM
2,896445976617132032,
3,1111591492567613440,@dustbusterz @hmcghee @AOC @MSNBC Flopped as in her pushing her green new deal. But nuance takes some thought
4,540340474972626944,


It'll be interesting to see whether the number of NaNs is the result of an error in my code, an error in the database, or simply banned accounts.

In [6]:
#Quick function to beautify the error message
def getExceptionMessage(msg):
    words = msg.split(' ')

    errorMsg = ""
    for index, word in enumerate(words):
        if index not in [0,1,2]:
            errorMsg = errorMsg + ' ' + word
    errorMsg = errorMsg.rstrip("\'}]")
    errorMsg = errorMsg.lstrip(" \'")

    return errorMsg

def getErrorDesc(df): 
    for row in range(0, len(df.index)):
        current = df.loc[row]
        if pd.isna(current.text) == True:  
            try:    
                twt = api.get_status(current.id)
            except tweepy.TweepError as err:
                df.loc[row, 'text'] = getExceptionMessage(err.reason)
    return df

#If I attempt this with the whole dataset it'll take ages,
#So I'll just randomly sample the dataset to get a rough idea
df = final.sample(1000)
df.reset_index(drop = True, inplace = True)
print(df.info())
df = getErrorDesc(df)
print("\n", df.text.value_counts()[1:10])
df.head(20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
id      1000 non-null int64
text    515 non-null object
dtypes: int64(1), object(1)
memory usage: 15.8+ KB
None
User has been suspended.                                                                                                                         204
Sorry, you are not authorized to see this status.                                                                                                28 
RT @AynRandPaulRyan: David Duke in #Charlottesville saying this Nazi, #UniteTheRight fiasco "fulfills the promises of Donald Trump." \nhttps…    24 
RT @RealAlexRubi: Massive brawl breaks out, tiki torches thrown as #UniteTheRight reaches Jefferson monument in #Charlottesville. Chemicals…     7  
RT @RVAwonk: Here's former KKK leader David Duke explicitly stating that Trump motivated the white supremacist #UniteTheRight rally in #Cha…     7  
RT @PrisonPlanet: Tucker is the only n

Unnamed: 0,id,text
0,896630257977085952,User has been suspended.
1,896375834415697920,User has been suspended.
2,897199189775527939,User has been suspended.
3,816865867371679744,RT @MattWalshBlog: Will @deray and @ShaunKing apologize for fomenting the kind of racial hate that led to the #BLMKidnapping?
4,1103074785859297280,User has been suspended.
5,893978535081201668,User has been suspended.
6,1111105464744493056,"@AOC This is not infrastructure @AOC this is building maintenance. \n\nInfrastructure is roads, airports, bridges..… https://t.co/hMqbzbc0Z4"
7,896476851572310016,"RT @AynRandPaulRyan: David Duke in #Charlottesville saying this Nazi, #UniteTheRight fiasco ""fulfills the promises of Donald Trump."" \nhttps…"
8,1107334308451237888,@my_surreality @JoeySalads @AOC Lol this guy... https://t.co/SlsnzlerQc
9,540531308183572480,RT @mikesbloggity: If you want to send an email to your MLA about #Bill10. Visit: http://t.co/C7onNIA7HH It takes less than ten seconds.


Interestingly, the function above has not only surfaced the errors associated with the missing tweets, but also shown that there are many duplicate tweets as a result of retweets (RTs). We can remove these with the drop_duplicates method.

The text must be preprocessed before it is exported to a CSV, otherwise the export throws an error. The error is caused by the function not being able to encode unicode characters such as emojis.
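As a small illustration of the encoding issue, the sketch below decodes HTML entities with the standard-library `html.unescape` and then drops non-ASCII characters such as emojis. The helper name `to_ascii_text` is hypothetical, and this is only one possible way to make the text safe to export; it is not a copy of the preprocessing cell in this notebook.

```python
import html
import re


def to_ascii_text(text):
    """Decode HTML entities, then remove every non-ASCII character."""
    decoded = html.unescape(text)                # e.g. "&amp;" becomes "&"
    return re.sub(r'[^\x00-\x7F]', '', decoded)  # drops emojis, accents, etc.


# The emoji is removed, leaving the two spaces that surrounded it
cleaned = to_ascii_text("rage &amp; that 😂 life")
```

Note that decoding entities first also turns numeric entities into real characters, which the second step then removes if they are non-ASCII.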

In [23]:
def preprocess(text_string):
    """
    Accepts a text string and:
    1) Removes URLs
    2) Removes the retweet marker (e.g. "RT @user") - knowing the tweet is
       a retweet does not help towards our classification task
    3) Removes mentions
    4) Removes non-ASCII characters (such as emojis) and numeric HTML
       entities (e.g. &#128514;)
    5) Replaces runs of whitespace with a single space
    6) Replaces &amp; with and
    7) Lowercases and strips the text
    This standardizes the text without caring about
    the specific people mentioned
    """
    space_pattern = r'\s+'
    giant_url_regex = (r'http[s]?://(?:[a-zA-Z]|[0-9]|[#$-_@.&+]|'
        r'[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
    mention_regex = r'@[\w\-]+:'
    mention_regex1 = r'@[\w\-]+'
    RT_regex = r'(RT|rt)[ ]*@[ ]*[\S]+'
    
    # Removes urls (and any literal 'URL' placeholder tokens)
    parsed_text = re.sub(giant_url_regex, '', text_string)
    parsed_text = re.sub('URL', '', parsed_text)
    
    # Remove the fact the tweet is a retweet. 
    # (we're only interested in the language of the tweet here)
    parsed_text = re.sub(RT_regex, ' ', parsed_text) 
    
    # Removes mentions followed by a colon first - this seems to come up often
    parsed_text = re.sub(mention_regex, '',  parsed_text)
    # Then removes all remaining mentions as they're redundant information
    parsed_text = re.sub(mention_regex1, '',  parsed_text)  

    # Remove non-ASCII characters and numeric HTML entities
    parsed_text = re.sub(r'[^\x00-\x7F]','', parsed_text) 
    parsed_text = re.sub(r'&#[0-9]+;', '', parsed_text)  

    # Collapse runs of whitespace into a single space
    parsed_text = re.sub(space_pattern, ' ', parsed_text) 

    #Replace &amp; with and
    parsed_text = re.sub('&amp;', 'and', parsed_text)
    
    # Set text to lowercase and strip
    parsed_text = parsed_text.lower()
    parsed_text = parsed_text.strip()

    return parsed_text

<b>Before concatenating the Charlottesville dataset (which already has the text associated with each ID, so no text retrieval via tweepy is necessary), we check how many tweet IDs returned no text and drop them.</b>

In [None]:
nullvals = final.text.isna().sum()
print("Amount of tweets IDs not returning text:", nullvals)
per = (nullvals/len(final.index)) * 100
print(("Which is %.2f%% of the overall dataset") % (per))


In [12]:
final.dropna(inplace = True)
final.reset_index(drop=True, inplace = True)
final['text'] = final['text'].apply(preprocess)

#Interim save
#final.to_csv('gs://csc3002/pretrain_data/idstext.csv', sep = ',', encoding='utf-8', \
#               index = False, header = True)
final.tail()

Unnamed: 0,id,text
235546,896201447838187520,happening now at uva. our people on the march. will you be at #unitetheright tomorrow?
235547,896835566624546816,jason kessler organized the #unitetheright rally. he deserves the shaming for organizing &amp; the violence it incited. https
235548,1102729934894653440,i 2nd that shout out!
235549,1103276581060005889,perhaps it is just a
235550,896247350808784896,"alt right #unitetheright woman tells #antifa counter-protester that he ""sounds like a n-----"" #charlottesville"


In [20]:
#Charlottesville - Contains full text in columns
charl = 'gs://csc3002/pretrain_data/charlottesville_aug15_sample.csv'
charl = pd.read_csv(charl, sep=',',  index_col = False, encoding = 'utf-8', header = 0,)
charl1 = 'gs://csc3002/pretrain_data/aug16_sample.csv'
charl1 = pd.read_csv(charl1, sep=',',  index_col = False, encoding = 'utf-8', header = 0,)
charl2 = 'gs://csc3002/pretrain_data/aug17_sample.csv'
charl2 = pd.read_csv(charl2, sep=',',  index_col = False, encoding = 'utf-8', header = 0,)
charl3 = 'gs://csc3002/pretrain_data/aug18_sample.csv'
charl3 = pd.read_csv(charl3, sep=',',  index_col = False, encoding = 'utf-8', header = 0,)

dfs = [charl[['id','full_text']], charl1[['id','full_text']], \
       charl2[['id','full_text']], charl3[['id','full_text']]]
       
charlot = pd.concat(dfs, axis = 0)
charlot.rename(columns = {'full_text':'text'}, inplace = True)
print("\n#Charlottesville tweet database contains", len(charlot.index), "tweet IDs")

print(charlot.info())
charlot.head()


#Charlottesville tweet database contains 200000 tweet IDs
<class 'pandas.core.frame.DataFrame'>
Int64Index: 200000 entries, 0 to 49999
Data columns (total 2 columns):
id      200000 non-null int64
text    200000 non-null object
dtypes: int64(1), object(1)
memory usage: 4.6+ MB
None


Unnamed: 0,id,text
0,897661668787982336,It's almost as if people are exactly who they say they are https://t.co/MnWFXZd9c3
1,897654901534228480,"@Slate Conservative media: Yes, Trump's response to Charlottesville was bad, but what about Obama? https://t.co/jjINXL5Qp0 via @slate"
2,897659748597870592,👀 https://t.co/qeyzYeblwu
3,897660496656179202,😂 😂 😂 Karma really isn't wasting time.. https://t.co/JYRqf6vlSX
4,897642311903055872,"After Charlottesville, Black Lives Matter Issues New Demand - https://t.co/Vuw3IvrhL2"


In [23]:
final1 = pd.concat([final, charlot], axis = 0)
final1 = final1.sample(frac=1) #shuffle
final1['text'] = final1['text'].apply(preprocess)
final1.reset_index(drop = True, inplace = True)
final1.to_csv('gs://csc3002/pretrain_data/tweetText/full.csv', sep = ',', encoding='utf-8', \
                 index = False, header = True)
final1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 435551 entries, 0 to 435550
Data columns (total 2 columns):
id      435551 non-null int64
text    435551 non-null object
dtypes: int64(1), object(1)
memory usage: 6.6+ MB


<b>Forgot to remove duplicates from the tweets in the last section. Will reload the tweet text dataset and remove duplicates. We'll then view what effect this has overall</b>

In [37]:
dat = pd.read_csv('gs://csc3002/pretrain_data/tweetText/full.csv', sep=',',  index_col = False, header=0)

dat.drop_duplicates(subset='text',inplace =True )
dat['text'] = dat['text'].astype(str)
dat['text'] = dat['text'].apply(preprocess)
pd.set_option('display.max_colwidth', -1)
print("There are ", len(dat.index), "unique tweets in this database")
dat.to_csv('gs://csc3002/pretrain_data/tweetText/full.csv', sep = ',', encoding='utf-8', \
                 index = False, header = True)
dat.head(20)

There are  272215 unique tweets in this database


Unnamed: 0,id,text
0,896252155799244800,reading leftist tweets about #charlottesville is a great morale boost. they're afraid. #unitetheright
1,898207292105125892,tim cook announces $2 million in donations after charlottesville - the hill
2,896421621002645504,the mra who blamed elliot rodger's killing spree on feminists is amongst the #unitetheright nazis in charlottesville. htt
3,1103960270072762373,all on socialism talking points?
4,897647699662696448,"20-year-old deandre harris speaks out about being assaulted by white supremacists in charlottesville, va."
5,898585342877380608,how will the church grapple with charlottesville? reports:
6,898266246185181184,#nazi #charlottesville #whitesupremacists #altright
7,893620507785834496,"guys there is a war coming, i am a jew and i stand with , and it's time to #unitetheright."
8,897664391428141057,"""as said before: the news media make sound pres trump is racist. pres trump say, charlotteville violence need investigation. didn't he? """
9,1104237137636012032,"that was amazing! hell, she even had me feeling ashamed of myself, and"


## More pre-training data... Specifically datasets likely to have women or immigrants as the subject

The main hate speech dataset I'll be testing with is the HatEval dataset, which has women and immigrants as its targets. With this in mind, I sought to source more tweet datasets which were likely to have women or immigrants as the subject.

*   ### <b>#thechalkening tweet database:</b>

The Chalkening is a campaign launched by Donald Trump supporters on college campuses that involves writing pro-Trump messages in chalk on campus facilities. This mass, chalk-based, protest happened alongside an outpouring of media criticism of an incident at Emory University in March 2016. An Emory university administrator sent an email expressing support for students who claimed to feel threatened and unsafe by hate speech in the form of pro-Trump chalkings on the campus.

In [15]:
chalkening = pd.read_csv('gs://csc3002/pretrain_data/thechalkening-ids-20160412.txt', sep=',',  index_col = False, header=None, names =['id'])

chalkening1 = pd.read_csv('gs://csc3002/pretrain_data/thechalkening-ids-20160615.txt', sep=',',  index_col = False, header=None, names =['id'])

chalk = pd.concat([chalkening, chalkening1], axis = 0)
print("There are", len(chalk.index), "tweets with the #chalkening hashtag")

There are 115524 tweets with the #chalkening hashtag


In [19]:
tweet_ids = list(chalk['id'])

#If the number of IDs is an exact multiple of 100, the final (empty) batch raises
#an error, which lookup_tweets catches. Takes a while
results = lookup_tweets(tweet_ids, api)
                                  
temp = json.dumps([status._json for status in results]) #create JSON
final = pd.read_json(temp, orient='records')
final = final[['id','text']]
pd.set_option('display.max_colwidth', -1)
print(final.info())
final.head()


There are 115524 tweet IDs to fetch
Fetching tweets...
20% complete
40% complete
60% complete


Rate limit reached. Sleeping for: 302


80% complete
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 115524 entries, 0 to 115523
Data columns (total 2 columns):
id      115524 non-null int64
text    52040 non-null object
dtypes: int64(1), object(1)
memory usage: 1.8+ MB
None


Unnamed: 0,id,text
0,720103525401931777,RT @DanScavino: #TheChalkening- thank you! #Trump2016 #StudentsForTrump https://t.co/W6aTj6TzzL
1,720118772808474624,
2,720112336632168448,
3,720107218495016960,
4,720123193269338112,


In [20]:
final.dropna(inplace = True)
final.reset_index(drop=True, inplace = True)
final['text'] = final['text'].apply(preprocess)
final.drop_duplicates(subset='text',inplace =True ) #Important to drop duplicates

print("After Dropping nulls and duplicate tweets (such as retweets) there are",\
     len(final.index), "tweets")
final.to_csv('gs://csc3002/pretrain_data/tweetText/chalkTweets.csv', sep = ',', encoding='utf-8', \
                 index = False, header = True)
final.tail()

After Dropping nulls and duplicate tweets (such as retweets) there are 4726 tweets


Unnamed: 0,id,text
52033,739815224946102272,#mondaymotivation #thechalkening #gameofthrones
52034,739716451779497984,undeniable. #womenfortrump #thechalkening #studentsfortrump
52036,739667266086817792,#thechalkening #gameofthrones #obamagate #frenchopenfinal
52037,739867836294660097,la times article admits it. #womenfortrump #thechalkening #studentsfortrump
52039,739580349462777856,"i'm like 3 months late, but here's my retarded addition to #thechalkening"


In [21]:
final.tail(100)

Unnamed: 0,id,text
51796,732245621495746562,women* love donald trump!!! *cis gendered upper class white women who lack empathy and general reasoning skills
51803,732065372505939968,they assimilate so well no saw pics like this in nyc i say normal or nuts. nuts. not in usa #trump2016
51805,732211730776985600,#thechalkening coming to a campus near you.
51806,732379415510802433,dear sweet pea coming soon... #thechalkening irritate conservatives with lies irritate liberals with the truth.
51808,731587511986704384,first #thechalkening now #thebathrooming result?
51809,732311274332127232,#thechalkening look out !!! get to a safe space now! #thechalkening
51817,731765859291672576,i accept your apology and was showing everyone his whiny liberal classmates by pouring some water on the ground #thechalkening
51822,732260636789334017,#thechalkening #dobbs #maga
51828,731904534704840704,#thechalkening revisited for #dc: student support for #trump is beyond campus groups
51829,731590719215943681,retreat to your safe spaces people. people wrote the next presidents name in chalk #safespace #itsnotradicaltosay


<b>Surprisingly, very few tweets are returned. There must have been an overload of retweets in the dataset. Still, we'll use them in the pre-training regardless.</b>
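One way to check the retweet hypothesis is to measure what share of the raw, unprocessed tweet texts start with a retweet marker. The sketch below assumes the texts are available as a plain list in which missing tweets are `None`; the helper name `retweet_share` is hypothetical and not part of this notebook.

```python
import re


def retweet_share(texts):
    """Return the fraction of available tweet texts that begin with 'RT @' or 'rt @'."""
    available = [t for t in texts if isinstance(t, str)]  # skip missing tweets
    if not available:
        return 0.0
    retweets = sum(1 for t in available if re.match(r'(RT|rt)\s*@', t))
    return retweets / len(available)


# Hypothetical sample: two of the three available texts are retweets
share = retweet_share(["RT @a: hi", "hello", None, "rt @b yo"])
```

A high share would support the idea that most of the rows lost to `drop_duplicates` were retweets of the same few tweets.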


In [36]:
chalk.dropna(inplace = True)
chalk.to_csv('gs://csc3002/pretrain_data/tweetText//removedTerms/chalkTweets.csv', sep = ',', encoding='utf-8', \
                 index = False, header = True)
chalk.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4726 entries, 0 to 4725
Data columns (total 2 columns):
id      4726 non-null int64
text    4726 non-null object
dtypes: int64(1), object(1)
memory usage: 74.0+ KB



*   ### <b>#NotAllMen Twitter IDs:</b>

Around 70,000 tweets with the #NotAllMen hashtag. A Time magazine article on the subject states that "Not all men" was previously an object of frustration, but in early 2014 it came to be used mostly as an object of mockery. The phrase is intended to counter generalizations about men's behavior, but some critics claim it deflects conversations from uncomfortable topics, such as sexual assault.

*   ### <b>#NotAllWomen Twitter IDs:</b>

A counter to the #YesAllWomen protest (explained in more detail later). Many tweets with this hashtag seem, at a glance, to be quite sexist or at least subversive. It is essentially a vacant protest against somebody else's protest, like #AllLivesMatter.

In [42]:
notallmen = pd.read_csv('gs://csc3002/pretrain_data/NotAllMen.ids.txt', sep=',',  index_col = False, header=None, names =['id'])
print("There are", len(notallmen.index), "tweets with the #NotAllMen hashtag")

notAllWomen = pd.read_csv('gs://csc3002/pretrain_data/NotAllWomen.ids.txt', sep=',',  index_col = False, header=None, names =['id'])
print("\nThere are", len(notAllWomen.index), "tweets with the #NotAllWomen hashtag")

sexism = pd.concat([notallmen, notAllWomen], axis = 0)
print("\nThere are", len(sexism.index), "tweets total")
sexism = sexism.sample(frac=1)

tweet_ids = list(sexism['id'])

#If the number of IDs is an exact multiple of 100, the final (empty) batch raises
#an error, which lookup_tweets catches. Takes a while
results = lookup_tweets(tweet_ids, api)
                                  
temp = json.dumps([status._json for status in results]) #create JSON
final1 = pd.read_json(temp, orient='records')
final1 = final1[['id','text']]
pd.set_option('display.max_colwidth', -1)
print(final1.info())
final1.head()

There are 69873 tweets with the #NotAllMen hashtag

There are 1827 tweets with the #NotAllWomen hashtag

There are 71700 tweets total

There are 71700 tweet IDs to fetch
Fetching tweets...
20% complete
40% complete
60% complete
80% complete


Rate limit reached. Sleeping for: 67


Around index: 71700 
 [{'code': 38, 'message': 'id parameter is missing.'}]

There are 0 tweet IDs to fetch
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71700 entries, 0 to 71699
Data columns (total 2 columns):
id      71700 non-null int64
text    40559 non-null object
dtypes: int64(1), object(1)
memory usage: 1.1+ MB
None


Unnamed: 0,id,text
0,472492987516465152,RT @mensrightsrdt: #Rapeculture #yesallmen #yesallwomen #Yesallpeople #killallmen #Notallmen ---&gt;&gt;&gt;&gt;#ExpectShowtrials
1,470591049211985920,"RT @AmandaMagee: If the idea of #YesAllWomen threatens you and makes you feel panic, rage &amp; that life is unfair, congrats, you just got a t…"
2,470441116433543168,RT @schemaly: #notallmen practice violence against women but #YesAllWomen live with the threat of male violence. Every. Single. Day. All ov…
3,470799043564941312,
4,472382404087123968,"'@k8yk Yes, but some men's feelings getting hurt &amp; they want to stop. We forgot 4 a moment #notallmen more important than #YesALLWhiteWomen"


In [48]:
final1.dropna(inplace = True)
final1.reset_index(drop=True, inplace = True)
final1['text'] = final1['text'].apply(preprocess)
final1.drop_duplicates(subset='text',inplace =True ) #Important to drop duplicates

print("After Dropping nulls and duplicate tweets (such as retweets) there are",\
     len(final1.index), "tweets")
final1.to_csv('gs://csc3002/pretrain_data/tweetText/sexismTweets.csv', sep = ',', encoding='utf-8',\
              index = False, header = True)
final1.tail()

After Dropping nulls and duplicate tweets (such as retweets) there are 14860 tweets


Unnamed: 0,id,text
14855,473481837369495552,"sure #notallmen, but that's never been the point. don't get it twisted, apologists: #enoughmenare. #yesallwomen"
14856,472076611601592320,#notallwomen see men as the enemy.
14857,471123309724454913,mras are not a hate group. one individual acting out violent tendencies does not represent an entire group. #notallmen #notallmra
14858,470393553643122688,"#notallmen on twitter are bumming me out today, but some of them..."
14859,471463238350430208,"#notallmen were lucky enough to have a father, like mine, who demanded i treat women with respect and a mother who exemplified it every day."


## Large tweet datasets

The next couple of cells contain very large tweet ID datasets. I'm not sure I'll be able to retrieve all of the tweets in one session, as it takes so long. Also, converting a very large dataframe to JSON via pandas throws a MemoryError.

I'll slightly alter the previous lookup_tweets function to checkpoint by saving the tweet text to a designated Google bucket file path after a set number of IDs has been fetched.
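The checkpointing idea can be sketched independently of the Twitter API. The version below is a simplified variation, not the function used in this notebook: it writes to a local directory standing in for the bucket path, the names `save_in_checkpoints` and `_flush` are hypothetical, and it clears its buffer after each save so that memory use stays bounded.

```python
import csv
import os
import tempfile


def _flush(buffer, out_dir, number):
    """Write the buffered (id, text) rows to <out_dir>/<number>.csv."""
    path = os.path.join(out_dir, str(number) + '.csv')
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['id', 'text'])
        writer.writerows(buffer)
    return path


def save_in_checkpoints(rows, out_dir, checkpoint=500000):
    """Save rows in numbered CSV files holding at most `checkpoint` rows each."""
    os.makedirs(out_dir, exist_ok=True)
    paths, buffer = [], []
    for row in rows:
        buffer.append(row)
        if len(buffer) == checkpoint:
            paths.append(_flush(buffer, out_dir, len(paths) + 1))
            buffer = []  # start a fresh buffer after every checkpoint
    if buffer:  # save whatever is left after the last full checkpoint
        paths.append(_flush(buffer, out_dir, len(paths) + 1))
    return paths


# Illustration with three hypothetical rows and a tiny checkpoint size
demo_dir = tempfile.mkdtemp()
demo_paths = save_in_checkpoints([(1, 'a'), (2, 'b'), (3, 'c')], demo_dir, checkpoint=2)
```

Because each file only holds the rows fetched since the previous checkpoint, the files would need to be concatenated afterwards to rebuild the full dataset.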

These datasets, while much larger in volume than the others, contain tweets that are less desirable for my pretraining. For one, they are likely to have more non-user-generated tweets and possibly spam, as they're based on huge movements or global issues, unlike the niche subjects before.

Also, for the immigration executive order tweets it is much more difficult to anticipate which common terms should be removed because they might affect the learning of the word-masking task. (For example, in the #Chalkening tweets I'll remove #chalkening from each tweet, because the model could just learn that #chalkening is likely to be the missing word in each sequence simply because it comes up in each tweet.)

Therefore, I'll not use all of the tweets from each of these sets, but rather a sample. Still, I might as well retrieve as many as I can, in case my strategy changes later and I find that the more tweets I do further pre-training on the better my model performs.
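Removing a dataset-defining term before masked-word pre-training could look roughly like the sketch below. It is an illustration under assumptions: the helper name `remove_terms` is hypothetical, matching is case-insensitive, and a term is only removed when it is not followed by another word character, so longer hashtags that merely start with the term are left alone.

```python
import re


def remove_terms(text, terms):
    """Delete every occurrence of the given terms, then tidy the whitespace."""
    if not terms:
        return text
    alternatives = '|'.join(re.escape(term) for term in terms)
    pattern = '(?:' + alternatives + r')(?!\w)'  # do not cut into longer tokens
    cleaned = re.sub(pattern, '', text, flags=re.IGNORECASE)
    return re.sub(r'\s+', ' ', cleaned).strip()


example = remove_terms("first #TheChalkening now #thebathrooming result?",
                       ["#thechalkening"])
```

For the larger datasets, the list of terms would have to be chosen by inspecting which hashtags or phrases appear in nearly every tweet.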

In [8]:
def lookup_tweets_ckpt(tweet_IDs, api, dirc, checkpoint = 500000):
    print("Saving tweet text in directory at path", dirc)
    full_tweets = []
    tweet_count = len(tweet_IDs)
    
    print("\nThere are", tweet_count, "tweet IDs to fetch")
    print("Checkpoint saved every", checkpoint, "IDs")
    
    if tweet_count < 1:
        return full_tweets
    
    #The code below monitors progress.
    #It's divided by 500 because we retrieve tweets via API call 100 at a time
    #and we want to report progress at each fifth of the way to completion
    x = int(tweet_count/500) 
    progress = {x: '20%', x*2: '40%', x*3: '60%', x*4: '80%'}
    
    #Value for checkpoint saves. Dictates how many files there are
    ckpt = 1
    print("Fetching tweets...")
    try:
        for i in range((tweet_count // 100) + 1):
            if i in list(progress.keys()):
                print(progress[i], "complete")

        
            # Catch the last group if it is less than 100 tweets
            end_loc = min((i + 1) * 100, tweet_count)
            #Use .extend so full_tweets stays a flat list of Status objects;
            #.append would nest each batch as a list and break the ._json access below
            full_tweets.extend(
                api.statuses_lookup(id_=tweet_IDs[i * 100:end_loc], map_ = True))
                
            #Checkpoint every time the number of IDs processed reaches a multiple of checkpoint (500,000 by default)
            if (i*100)%checkpoint == 0 and i != 0:
                    
                print("Checkpoint at tweet ID No.", i*100)
                temp = json.dumps([status._json for status in full_tweets]) #create JSON
                interim = pd.read_json(temp, orient='records')
                interim = interim[['id','text']]
                interim.dropna(inplace = True)
                interim.reset_index(drop=True, inplace = True)
                interim['text'] = interim['text'].apply(preprocess)
                interim.drop_duplicates(subset='text',inplace =True ) #Important to drop duplicates
                path = dirc + '/' + str(ckpt) + '.csv'
                print("Saving at path", path)
                interim.to_csv(path, sep = ',', encoding='utf-8',\
                              index = False, header = True)
                ckpt = ckpt + 1
                full_tweets = [] # Reinitialise to an empty list
        
        
        #Outside of the loop
        print("Final save")
        temp = json.dumps([status._json for status in full_tweets]) #create JSON
        interim = pd.read_json(temp, orient='records')
        interim = interim[['id','text']]
        interim.dropna(inplace = True)
        interim.reset_index(drop=True, inplace = True)
        interim['text'] = interim['text'].apply(preprocess)
        interim.drop_duplicates(subset='text',inplace =True ) #Important to drop duplicates
        path = dirc + '/' + str(ckpt) + '.csv'
        print("Saving at path", path)
        interim.to_csv(path, sep = ',', encoding='utf-8',\
                      index = False, header = True)
        
        # Keep this return statement so that, after a TweepError, the recursive call below can continue the function
        return full_tweets 
    
    except tweepy.TweepError as e:
        print("Around index:", i*100, "\n", e.reason)       
        #Recursive call to continue even with exception
        return full_tweets + lookup_tweets_ckpt(tweet_IDs[(i+1)*100:tweet_count], api, dirc, checkpoint)
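
In the function above, the recursive call after an exception starts again with `ckpt = 1`, so it looks as though a retry could overwrite checkpoint files that were already saved. A minimal sketch of an iterative alternative is below. It is not the notebook's implementation: `fetch_batch` and `save_chunk` are hypothetical callables standing in for `api.statuses_lookup` and the CSV save, and the broad `except Exception` is a placeholder for `tweepy.TweepError`.

```python
def fetch_in_batches(ids, fetch_batch, save_chunk, batch_size=100, checkpoint=500000):
    """Fetch `ids` batch by batch, flushing to `save_chunk` every `checkpoint` IDs."""
    buffer = []  # rows fetched since the last checkpoint
    ckpt = 1     # number of the next checkpoint file
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        try:
            buffer.extend(fetch_batch(batch))
        except Exception as err:  # assumption: tweepy.TweepError in the notebook
            print("Skipping batch at index", start, "-", err)
        # Flush whenever a multiple of `checkpoint` IDs has been processed
        if (start + batch_size) % checkpoint == 0 and buffer:
            save_chunk(ckpt, buffer)
            buffer, ckpt = [], ckpt + 1
    if buffer:  # final partial chunk
        save_chunk(ckpt, buffer)
        ckpt += 1
    return ckpt - 1  # number of checkpoint files written

# Demonstration with a fake fetcher that fails on the second batch
saved = {}

def fake_fetch(batch):
    if batch[0] == 100:
        raise RuntimeError("simulated API error")
    return ["tweet %d" % i for i in batch]

n_files = fetch_in_batches(list(range(1050)), fake_fetch,
                           lambda n, rows: saved.update({n: list(rows)}),
                           checkpoint=500)
print(n_files, [len(saved[k]) for k in sorted(saved)])
```

Because the loop never restarts, the checkpoint counter keeps increasing after a failed batch, and the failed batch is simply skipped.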
                
    
    

*   ### <b>#YesAllWomen Twitter IDs:</b>

This hashtag was popular in May 2014, and was created partly in response to the Twitter hashtag #NotAllMen. #YesAllWomen reflected a grassroots campaign in which women shared their personal stories about harassment and discrimination. The campaign attempted to raise awareness of sexism that women experience, often from people they know.

<b>There are around 2.7 million tweet IDs in this database</b>

In [None]:
yesAllWomen = pd.read_csv('gs://csc3002/pretrain_data/YesAllWomen.ids.txt', sep=',',  index_col = False, header=None, names =['id'])
print("There are", len(yesAllWomen.index), "tweets with the #YesAllWomen hashtag")
tweet_ids = list(yesAllWomen['id'])

#Path to directory holding all saves
path = 'gs://csc3002/pretrain_data/tweetText/YesAllWomenTweets' 
results = lookup_tweets_ckpt(tweet_ids, api, path)

There are 2705985 tweets with the #YesAllWomen hashtag
Saving tweet text in directory at path gs://csc3002/pretrain_data/tweetText/YesAllWomenTweets

There are 2705985 tweet IDs to fetch
Checkpoint saved every 500000 IDs
Fetching tweets...


Rate limit reached. Sleeping for: 36
Rate limit reached. Sleeping for: 420


<b>We've finished retrieving the tweets from this dataset in a separate, unattended Jupyter notebook session. We can now combine all of the tweet text files into one CSV and see how many tweets we have fetched.</b>

In [95]:
wom = pd.read_csv('gs://csc3002/pretrain_data/tweetText/YesAllWomenTweets/1.csv', sep=',',  index_col = False, header=0)
wom2 = pd.read_csv('gs://csc3002/pretrain_data/tweetText/YesAllWomenTweets/2.csv', sep=',',  index_col = False, header=0)
wom3 = pd.read_csv('gs://csc3002/pretrain_data/tweetText/YesAllWomenTweets/3.csv', sep=',',  index_col = False, header=0)
wom4 = pd.read_csv('gs://csc3002/pretrain_data/tweetText/YesAllWomenTweets/4.csv', sep=',',  index_col = False, header=0)
wom5 = pd.read_csv('gs://csc3002/pretrain_data/tweetText/YesAllWomenTweets/5.csv', sep=',',  index_col = False, header=0)
wom6 = pd.read_csv('gs://csc3002/pretrain_data/tweetText/YesAllWomenTweets/6.csv', sep=',',  index_col = False, header=0)

womTweets = pd.concat([wom, wom2, wom3, wom4, wom5, wom6], axis =0)
womTweets.drop_duplicates(subset='text',inplace =True )
womTweets.dropna(inplace = True)
print("\nThere are", len(womTweets.index), "unique tweets in this csv file after dropping duplicates over the entire database")
womTweets['text'] = womTweets['text'].apply(preprocess)
womTweets.head(50)


There are 388799 unique tweets in this csv file after dropping duplicates over the entire database


Unnamed: 0,id,text
0,470315730706399232,"because i get in an elevator with a guy and think ""what's my escape plan going to be?"" #yesallwomen"
1,470317902776647680,"because it starts earlyschool dress codes punish girls for wearing clothes that are ""distracting"" to boys. #yesallwomen"
2,470317461024555009,"because ""boys will be boys"" is a phrase that still exists. #yesallwomen"
3,470316502445744130,"because we're taught ""never leave your drink alone,"" instead of ""don't drug someone."" #yesallwomen"
4,470317309022978048,#yesallwomen have the right to set boundaries. prevalent misogyny has led to the extinction of private space for women.
5,470315790885863424,because women are taught to carry our keys like a weapon in case we're attacked in a parking lot. #yesallwomen
6,470315890060189696,hell yeah. #yesallwomen deserve the right to say no to a man without a given reason. they owe you
7,470316527959670784,"because there is a moment, daily, weekly, monthly, where you're in a situation where you think: ""is today the day i get rap"
8,470314968492302338,sounds like something that needs to get shared right now. #yesallwomen
9,470315514980757504,hell yeah. #yesallwomen deserve the right to say no to a man without a given reason. they owe you nothing!
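
The six `read_csv` calls above could also be written as a loop over the numbered checkpoint files. The following is only a sketch under assumptions: it reads from a local directory with the standard `csv` module rather than from the `gs://` bucket with pandas, it assumes every file in the directory is named `<number>.csv` with `id` and `text` columns, and `merge_checkpoints` is a hypothetical helper name.

```python
import csv
import glob
import os
import tempfile

def merge_checkpoints(directory):
    """Merge numbered checkpoint CSVs (id,text) and drop duplicate texts."""
    seen, merged = set(), []
    paths = glob.glob(os.path.join(directory, '*.csv'))
    # Sort numerically so that 10.csv comes after 9.csv
    paths.sort(key=lambda p: int(os.path.splitext(os.path.basename(p))[0]))
    for path in paths:
        with open(path, newline='', encoding='utf-8') as handle:
            for row in csv.DictReader(handle):
                text = row['text']
                if text and text not in seen:  # skip empty and repeated texts
                    seen.add(text)
                    merged.append((row['id'], text))
    return merged

# Tiny demonstration with two made-up checkpoint files
with tempfile.TemporaryDirectory() as tmp:
    demo = [('1.csv', [('1', 'a'), ('2', 'b')]),
            ('2.csv', [('3', 'b'), ('4', 'c')])]
    for name, rows in demo:
        with open(os.path.join(tmp, name), 'w', newline='', encoding='utf-8') as handle:
            writer = csv.writer(handle)
            writer.writerow(['id', 'text'])
            writer.writerows(rows)
    merged = merge_checkpoints(tmp)
print(merged)
```

The first occurrence of each text is kept, which mirrors what `drop_duplicates(subset='text')` does by default.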


Finally saving these tweets to file

In [None]:
womTweets.to_csv('gs://csc3002/pretrain_data/tweetText/YesAllWomenTweets.csv', sep = ',', \
                 encoding='utf-8', index = False, header = 0)

*   ### <b>Immigration and Travel Ban Tweet Ids:</b>

This dataset contains the tweet ids of 16,875,766 tweets related to the immigration and travel ban executive order announced by the Trump Administration in January 2017. They were collected between January 30, 2017 and April 20, 2017. 

The terms used for the filter were: #MuslimBan, #NoBanNoWall, #NoMuslimBan, #JFKTerminal4, #RefugeesWelcome, muslim ban, immigrant ban, immigration ban, travel ban, immigration order, #ImmigrationBan, #TravelBan.

In [None]:
imm = pd.read_csv('gs://csc3002/pretrain_data/immigration_exec_order.txt', sep=',',  index_col = False, header=None, names =['id'])
print("There are", len(imm.index), "tweets which are in this dataset. \nThe subject of this dataset is the excutive order restricting immigration which trump signed,", \
     "many believe the intentional target were muslims. \nHence the predominant hashtag in this dataset is #MuslimBan or #NoMuslimBan")
tweet_ids = list(imm['id'])

#Path to directory holding all saves
path = 'gs://csc3002/pretrain_data/tweetText/immigrationTweets' 
results = lookup_tweets_ckpt(tweet_ids, api, path)

There are 16875766 tweets which are in this dataset. 
The subject of this dataset is the excutive order restricting immigration which trump signed, many believe the intentional target were muslims. 
Hence the predominant hashtag in this dataset is #MuslimBan or #NoMuslimBan
Saving tweet text in directory at path gs://csc3002/pretrain_data/tweetText/immigrationTweets

There are 16875766 tweet IDs to fetch
Checkpoint saved every 500000 IDs
Fetching tweets...


Rate limit reached. Sleeping for: 294
Rate limit reached. Sleeping for: 331
Rate limit reached. Sleeping for: 302
Rate limit reached. Sleeping for: 273
Rate limit reached. Sleeping for: 283


Checkpoint at tweet ID No. 500000
Saving at path gs://csc3002/pretrain_data/tweetText/immigrationTweets/1.csv


Rate limit reached. Sleeping for: 267
Rate limit reached. Sleeping for: 337
Rate limit reached. Sleeping for: 278
Rate limit reached. Sleeping for: 303


In [101]:
im = pd.read_csv('gs://csc3002/pretrain_data/tweetText/immigrationTweets/1.csv', sep=',',  index_col = False, header=0)
im1 = pd.read_csv('gs://csc3002/pretrain_data/tweetText/immigrationTweets/2.csv', sep=',',  index_col = False, header=0)
im2 = pd.read_csv('gs://csc3002/pretrain_data/tweetText/immigrationTweets/3.csv', sep=',',  index_col = False, header=0)
im3 = pd.read_csv('gs://csc3002/pretrain_data/tweetText/immigrationTweets/4.csv', sep=',',  index_col = False, header=0)
im4 = pd.read_csv('gs://csc3002/pretrain_data/tweetText/immigrationTweets/5.csv', sep=',',  index_col = False, header=0)

imTweets = pd.concat([im, im1, im2, im3, im4], axis =0)
imTweets.drop_duplicates(subset='text',inplace =True ) #Important to drop duplicates
imTweets.dropna(inplace = True)
print("\nThere are ", len(imTweets.index), "unique tweets in this csv file after dropping duplicates again")
imTweets.head(50)


There are  362847 unique tweets in this csv file after dropping duplicates again


Unnamed: 0,id,text
0,830392337662554113,aka judge roberts and 9thcircuit court in big ass trouble.
1,830392257861865472,born and raised in canada. not a refugee. not an immigrant. no ties to the 7 counties of the eo. it's a #muslimban https
2,830392310701502464,trey gowdy blasts the liberal 9th circuit over trump immigration order
3,830392243928301568,more than 300 protest trump's immigration ban at pittsburgh internationa
4,830392409691451393,"there is no #muslimban just arrived lax, no one asked me about my religious faith even though my passport has my muslim name"
6,830392357329715200,"trump to sign brand new immigration executive order, not appeal to scotus by #rt_com via"
7,830392395216855040,and all have been through vetting for years so stop fear mongering because you didn't win your muslim b
8,830392360349659136,#iran #isis #israel #muslimban #nomuslimban #trump #alqaeeda
9,830392412321218560,if the u.s. were at war with iran would liberal judges understand how important a travel ban would be?
10,830392240631586819,trump travel ban kills surgeon's life-saving trip to iran


In [None]:
imTweets.to_csv('gs://csc3002/pretrain_data/tweetText/immigrationTweets.csv', sep = ',', \
                encoding='utf-8', index = False, header = 0)

# Removing common terms for word-masking
It has occurred to me that, because of the way these tweet datasets were sourced (by filtering on a particular term or hashtag via the Twitter API), there will be recurring terms or hashtags in each sequence, which may be counter-productive to the word-masking exercise later in further BERT pre-training.

<i>(Removing these terms may render some sequences nonsensical, though I'm not sure how concerning this might be from an NLP standpoint.) </i>

Therefore, I'll develop a function to remove these common terms from my tweet data. These terms are often followed by punctuation, so I'll fill an initial list with the terms I'd like to remove and then create from it an augmented list which contains all the terms, plus versions of them with punctuation at the end.
    
<i> (I'll do further pre-training both on data that has had common terms removed and on data that hasn't, where hashtags are instead segmented to mitigate the commonality of terms) </i>

I'll convert this list into a set at the end of the function, since membership tests on a set are much quicker than on a list, which will be useful later on.

In [9]:
#This dataset is the combination of the AOC replies, #charlottesville, #blmkidnapping, #bill10 and #uniteTheRight tweets
dat = pd.read_csv('gs://csc3002/pretrain_data/tweetText/full.csv',\
                  sep=',',  index_col = False, header=0)

list1 = ['#unitetheright', '#charlottesville', '#utr', '#blmkidnapping',\
            '#bill10', 'charlottesville']

def addPunc(wordlist):
    
    
    newlist = wordlist + [word + "." for word in wordlist] + [word + "," for word in wordlist] \
     + [word + "?" for word in wordlist] + [word + "!" for word in wordlist] + \
    [word + "-" for word in wordlist] + [word + ":" for word in wordlist] + \
    [word + ";" for word in wordlist]
    
    wordset = set(newlist) # Convert to set for faster lookup later
    return wordset

print(addPunc(list1))

{'#utr-', '#blmkidnapping.', '#utr?', '#unitetheright.', '#blmkidnapping!', '#unitetheright?', '#charlottesville:', '#charlottesville!', '#unitetheright-', '#bill10!', '#charlottesville?', '#charlottesville-', '#blmkidnapping-', '#unitetheright;', '#blmkidnapping:', 'charlottesville,', 'charlottesville', '#unitetheright', '#blmkidnapping;', 'charlottesville;', '#charlottesville.', '#unitetheright!', '#charlottesville,', 'charlottesville-', '#utr', '#blmkidnapping?', '#bill10-', '#bill10', '#unitetheright,', '#bill10,', '#utr!', '#utr:', '#charlottesville;', '#bill10:', '#blmkidnapping,', '#utr.', 'charlottesville?', '#utr;', '#blmkidnapping', '#charlottesville', '#bill10.', '#utr,', '#bill10?', 'charlottesville!', '#bill10;', '#unitetheright:', 'charlottesville.', 'charlottesville:'}


In [10]:
def removeWords(text_string, wordlist):
    
    #Make sure row entry is a string
    text_string = str(text_string)
    
    # Add punctuation to each word so the function accounts for occurrences where the term is followed by a punctuation mark
    wordSet = addPunc(wordlist) 
    
    #Add spaces between hashtags. This ensures strings with consecutive hashtags are processed properly
    text_string = text_string.replace('#', ' #') 
    
    querywords = text_string.split()
    resultwords  = [word for word in querywords if word.lower() not in wordSet]
    resultwords = ' '.join(resultwords)
    return resultwords

stopwords = ['#unitetheright', '#charlottesville', '#utr', '#blmkidnapping',\
            '#bill10', 'charlottesville']
tweet = dat.iloc[0]['text']
print("Original:\n", tweet)
print("\n\nNew:\n ", removeWords(tweet, stopwords))

Original:
 reading leftist tweets about #charlottesville is a great morale boost. they're afraid. #unitetheright


New:
  reading leftist tweets about is a great morale boost. they're afraid.
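
As an alternative to enumerating punctuation variants with `addPunc`, the trailing punctuation could be stripped from each token before the comparison. This is a sketch of that variation rather than the function used in this notebook; `remove_terms` is a hypothetical name, and it also lower-cases the stop terms so that mixed-case entries still match.

```python
import string

def remove_terms(text, terms):
    """Drop tokens that equal a stop term once trailing punctuation is stripped."""
    term_set = {t.lower() for t in terms}
    # Space out hashtags so consecutive ones ('#a#b') split into separate tokens
    tokens = str(text).replace('#', ' #').split()
    kept = [tok for tok in tokens
            if tok.lower().rstrip(string.punctuation) not in term_set]
    return ' '.join(kept)

cleaned = remove_terms("tweets about #Charlottesville!! are #unitetheright... ok",
                       ['#charlottesville', '#UniteTheRight'])
print(cleaned)
```

This also catches repeated punctuation such as `!!` or `...`, which a fixed list of single-character suffixes would miss.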


<b>Let's see the effect this function has</b>

In [11]:
dat['text'] = dat['text'].apply(removeWords, wordlist = stopwords)
dat.head(40)

Unnamed: 0,id,text
0,896252155799244800,reading leftist tweets about is a great morale boost. they're afraid.
1,898207292105125892,tim cook announces $2 million in donations after - the hill
2,896421621002645504,the mra who blamed elliot rodger's killing spree on feminists is amongst the nazis in htt
3,1103960270072762373,all on socialism talking points?
4,897647699662696448,20-year-old deandre harris speaks out about being assaulted by white supremacists in va.
5,898585342877380608,how will the church grapple with reports:
6,898266246185181184,#nazi #whitesupremacists #altright
7,893620507785834496,"guys there is a war coming, i am a jew and i stand with , and it's time to"
8,897664391428141057,"""as said before: the news media make sound pres trump is racist. pres trump say, charlotteville violence need investigation. didn't he? """
9,1104237137636012032,"that was amazing! hell, she even had me feeling ashamed of myself, and"


<b>Let's try hashtag segmentation as a method and see if it can accurately segment hashtags in tweets</b>

In [12]:
!pip install wordsegment
import wordsegment as ws
from wordsegment import load, segment

load()
#The bigram values below reflect the number of Google search results for each phrase
#Update the wordsegment dict so it recognises altright as "alt right" rather than "salt right"
ws.BIGRAMS['alt right'] = 1.17e8
ws.BIGRAMS['white supremacists'] = 3.86e6
ws.BIGRAMS['anti semitism'] = 4.1e6
ws.UNIGRAMS['tweets'] = 6.26e10 # 'tweets' is a single word, so it belongs in the unigram table

def hashtagSegment(text_string):
    
    #We target hashtags so that we only segment the hashtag strings.
    #Otherwise the segment function may operate on misspelled words also; which
    #often appear in hate speech tweets owing to the ill education of those spewing it
    temp_str = []
    for word in text_string.split(' '):
        if not word.startswith('#'):
            temp_str.append(word)
        else:
            temp_str = temp_str + segment(word)
            
    text_string = ' '.join(temp_str)       
    return text_string

teststr = dat.iloc[0]['text']
teststr1 = dat.iloc[6]['text']

print('Normal:\n',teststr,'\n')
print("Hashtag-Segmented:\n", hashtagSegment(teststr))

print('\n\nNormal:\n', teststr1,'\n')
print("Hashtag-Segmented:\n", hashtagSegment(teststr1))

Collecting wordsegment
  Downloading https://files.pythonhosted.org/packages/cf/6c/e6f4734d6f7d28305f52ec81377d7ce7d1856b97b814278e9960183235ad/wordsegment-1.3.1-py2.py3-none-any.whl (4.8MB)
Installing collected packages: wordsegment
Successfully installed wordsegment-1.3.1
Normal:
 reading leftist tweets about is a great morale boost. they're afraid. 

Hashtag-Segmented:
 reading leftist tweets about is a great morale boost. they're afraid.


Normal:
 #nazi #whitesupremacists #altright 

Hashtag-Segmented:
 nazi white supremacists alt right
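
To give some intuition for why adding counts changes how a hashtag is split, here is a toy unigram-only segmenter. It is a simplified sketch and not how `wordsegment` is implemented internally: the counts in `COUNTS` are invented for illustration, bigrams are ignored, and `segment_toy` is a hypothetical name.

```python
import math
from functools import lru_cache

# Hypothetical toy counts; the real library ships corpus-derived tables
COUNTS = {'alt': 50, 'right': 900, 'salt': 400, 'a': 5000, 'l': 1, 't': 1}
TOTAL = sum(COUNTS.values())

def word_score(word):
    # Log-probability of a known word; unknown words get a length-based penalty
    if word in COUNTS:
        return math.log(COUNTS[word] / TOTAL)
    return math.log(1.0 / (TOTAL * 10 ** len(word)))

def segment_toy(text):
    """Return the split of `text` with the highest summed log-probability."""
    @lru_cache(maxsize=None)
    def best(rest):
        if not rest:
            return 0.0, ()
        candidates = []
        for i in range(1, len(rest) + 1):
            head, tail = rest[:i], rest[i:]
            tail_score, tail_words = best(tail)
            candidates.append((word_score(head) + tail_score, (head,) + tail_words))
        return max(candidates)
    return list(best(text)[1])

print(segment_toy('altright'))
print(segment_toy('saltright'))
```

Every candidate split is scored by summing word log-probabilities, so raising the count of a word or phrase makes splits that contain it more likely to win.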


Let's now see what effect segmenting hashtags has

In [13]:
dat['text'] = dat['text'].apply(hashtagSegment)
dat.head(30)

Unnamed: 0,id,text
0,896252155799244800,reading leftist tweets about is a great morale boost. they're afraid.
1,898207292105125892,tim cook announces $2 million in donations after - the hill
2,896421621002645504,the mra who blamed elliot rodger's killing spree on feminists is amongst the nazis in htt
3,1103960270072762373,all on socialism talking points?
4,897647699662696448,20-year-old deandre harris speaks out about being assaulted by white supremacists in va.
5,898585342877380608,how will the church grapple with reports:
6,898266246185181184,nazi white supremacists alt right
7,893620507785834496,"guys there is a war coming, i am a jew and i stand with , and it's time to"
8,897664391428141057,"""as said before: the news media make sound pres trump is racist. pres trump say, charlotteville violence need investigation. didn't he? """
9,1104237137636012032,"that was amazing! hell, she even had me feeling ashamed of myself, and"


In [16]:
dat.to_csv('gs://csc3002/pretrain_data/tweetText/removedTerms/full.csv', sep = ',', encoding='utf-8', index = False, header = 0)

<b>Likewise, we'll remove common terms from the #thechalkening tweet database. The term #thechalkening itself comes up often.</b>

In [17]:
#This is the #thechalkening database
chalk = pd.read_csv('gs://csc3002/pretrain_data/tweetText/chalkTweets.csv', sep=',',  index_col = False, header=0)
chalk.head(10)

Unnamed: 0,id,text
0,720103525401931777,#thechalkening- thank you! #trump2016 #studentsfortrump
1,720107446149193728,19 yr old kurdish girl celebrates 1 year of fighting #isis... needs no #safespace from #thechalkening #tcot #pjnet
2,720102290749841408,sullivan county ny loves #trump #thechalkening #trump2016 #maga
3,720103135440715776,next thing you know there will be a background check and waiting period to buy chalk. #thechalkening
4,720100823729102848,"huge props to students who conducted #thechalkening on campus! together, we will #makeamericagreatagain https"
5,720101963812233216,. #thechalkening at shopping malls! libraries! especially starbucks! everywhere #trump2016 #trumpnewyork https:/
6,720140123141185537,we will not be silenced! we will win! #teamtrump #trump2016 #thechalkening #studentsfortrump #standwithstudents https://
7,720116749765582848,"a trump supporter once said, ""in real life, there are no safe spaces"" #thechalkening"
8,720102767130472450,"you can wash it off, but you can't erase trump support at nu! #thechalkening"
9,720111911908548610,#thechalkening


Apply pre-processing...

In [18]:
stopwords = ['#thechalkening']
chalk['text'] = chalk['text'].apply(removeWords, wordlist = stopwords)
chalk['text'] = chalk['text'].apply(hashtagSegment)
chalk.head(10)

Unnamed: 0,id,text
0,720103525401931777,thank you! trump2016 students for trump
1,720107446149193728,19 yr old kurdish girl celebrates 1 year of fighting isis needs no safe space from t cot pj net
2,720102290749841408,sullivan county ny loves trump trump2016 maga
3,720103135440715776,next thing you know there will be a background check and waiting period to buy chalk.
4,720100823729102848,"huge props to students who conducted on campus! together, we will make america great again https"
5,720101963812233216,. at shopping malls! libraries! especially starbucks! everywhere trump2016 trump new york https:/
6,720140123141185537,we will not be silenced! we will win! team trump trump2016 students for trump stand with students https://
7,720116749765582848,"a trump supporter once said, ""in real life, there are no safe spaces"""
8,720102767130472450,"you can wash it off, but you can't erase trump support at nu!"
9,720111911908548610,
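
Row 9 above is left with no text once #thechalkening is removed, and such rows carry nothing for the masking task. A minimal sketch of filtering them out before saving is below; `drop_empty` and its `min_tokens` parameter are hypothetical, and the rows are plain `(id, text)` tuples rather than the notebook's DataFrame.

```python
def drop_empty(rows, min_tokens=1):
    """Keep (id, text) pairs whose text still has at least `min_tokens` tokens."""
    kept = []
    for tweet_id, text in rows:
        tokens = str(text).split()
        if len(tokens) >= min_tokens:
            # Re-join to normalise any leftover runs of whitespace
            kept.append((tweet_id, ' '.join(tokens)))
    return kept

rows = [
    (720102767130472450, "you can wash it off, but you can't erase trump support at nu!"),
    (720111911908548610, ""),  # was just '#thechalkening' before removal
]
filtered = drop_empty(rows)
print(filtered)
```

Raising `min_tokens` would also discard very short leftovers, at the cost of losing some genuine short tweets.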


And Save

In [19]:
chalk.to_csv('gs://csc3002/pretrain_data/tweetText/removedTerms/chalkTweets.csv', sep = ',', \
             encoding='utf-8', index = False, header = 0)

<b>And for the #NotAllMen and #NotAllWomen databases</b>

In [24]:
final = pd.read_csv('gs://csc3002/pretrain_data/tweetText/sexismTweets.csv',sep=',',  index_col = False)
final.text = final.text.apply(preprocess)
final.head(10)

Unnamed: 0,id,text
0,472492987516465152,#rapeculture #yesallmen #yesallwomen #yesallpeople #killallmen #notallmen ---&gt;&gt;&gt;&gt;#expectshowtrials
1,470591049211985920,"if the idea of #yesallwomen threatens you and makes you feel panic, rage and that life is unfair, congrats, you just got a t"
2,470441116433543168,#notallmen practice violence against women but #yesallwomen live with the threat of male violence. every. single. day. all ov
3,472382404087123968,"' yes, but some men's feelings getting hurt and they want to stop. we forgot 4 a moment #notallmen more important than #yesallwhitewomen"
4,471771588182827008,that's #notallmen hashtag. that you say men don't have real issues to deal with proves we need it. #misandry
5,470363513303887872,"if #notallmen want to be seen as dangerous misogynists, the ones who aren't had better start doing something about the ones w"
6,470408018966753280,"#notallmen victimize women, but #yesallwomen feel victimized by men sometime in their lives."
7,470296669536268289,we can scapegoat the mentally-ill and mentally-disabled forever but it's really important to scream #notallmen. got it!
8,470784178166824960,"it's nice that #notallmen get defensive when they see #yesallwomen in their timeline. for those that do, it's really not a"
9,475013829353435136,getting a lot of sideways looks walking around in an army surplus coat. don't these people know that #notallmen in surplus gear are killers?


In [25]:
stopwords = ['#notallmen', '#notallwomen', '#yesallwomen']
tweet = final.iloc[3]['text']

print("Original:\n", tweet)
print("\n\nNew:\n ", removeWords(tweet, stopwords))

final['text'] = final['text'].apply(removeWords, wordlist = stopwords)
final['text'] = final['text'].apply(hashtagSegment)
final.head(10)

Original:
 ' yes, but some men's feelings getting hurt and they want to stop. we forgot 4 a moment #notallmen more important than #yesallwhitewomen


New:
  ' yes, but some men's feelings getting hurt and they want to stop. we forgot 4 a moment more important than #yesallwhitewomen


Unnamed: 0,id,text
0,472492987516465152,rape culture yes all men yes all people kill all men ---&gt;&gt;&gt;&gt; expect show trials
1,470591049211985920,"if the idea of threatens you and makes you feel panic, rage and that life is unfair, congrats, you just got a t"
2,470441116433543168,practice violence against women but live with the threat of male violence. every. single. day. all ov
3,472382404087123968,"' yes, but some men's feelings getting hurt and they want to stop. we forgot 4 a moment more important than yes all white women"
4,471771588182827008,that's hashtag. that you say men don't have real issues to deal with proves we need it. mi sandry
5,470363513303887872,"if want to be seen as dangerous misogynists, the ones who aren't had better start doing something about the ones w"
6,470408018966753280,"victimize women, but feel victimized by men sometime in their lives."
7,470296669536268289,we can scapegoat the mentally-ill and mentally-disabled forever but it's really important to scream got it!
8,470784178166824960,"it's nice that get defensive when they see in their timeline. for those that do, it's really not a"
9,475013829353435136,getting a lot of sideways looks walking around in an army surplus coat. don't these people know that in surplus gear are killers?


In [26]:
final.to_csv('gs://csc3002/pretrain_data/tweetText/removedTerms/sexismTweets.csv', sep = ',', \
             encoding='utf-8', index = False, header = 0)

<b>#YesAllWomen Dataset</b>

In [30]:
womTweets = pd.read_csv('gs://csc3002/pretrain_data/tweetText/YesAllWomenTweets.csv',sep=',',  index_col = False, names =['id', 'text'])
womTweets.text = womTweets.text.apply(preprocess)
womTweets.head(10)

Unnamed: 0,id,text
0,470315730706399232,"because i get in an elevator with a guy and think ""what's my escape plan going to be?"" #yesallwomen"
1,470317902776647680,"because it starts earlyschool dress codes punish girls for wearing clothes that are ""distracting"" to boys. #yesallwomen"
2,470317461024555009,"because ""boys will be boys"" is a phrase that still exists. #yesallwomen"
3,470316502445744130,"because we're taught ""never leave your drink alone,"" instead of ""don't drug someone."" #yesallwomen"
4,470317309022978048,#yesallwomen have the right to set boundaries. prevalent misogyny has led to the extinction of private space for women.
5,470315790885863424,because women are taught to carry our keys like a weapon in case we're attacked in a parking lot. #yesallwomen
6,470315890060189696,hell yeah. #yesallwomen deserve the right to say no to a man without a given reason. they owe you
7,470316527959670784,"because there is a moment, daily, weekly, monthly, where you're in a situation where you think: ""is today the day i get rap"
8,470314968492302338,sounds like something that needs to get shared right now. #yesallwomen
9,470315514980757504,hell yeah. #yesallwomen deserve the right to say no to a man without a given reason. they owe you nothing!


In [31]:
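#A lot of the tweets follow the format "because ...", so that word comes up too often
stopwords = ['because', '#yesallwomen']

#Note: `tweet` still holds the #NotAllMen example from the previous cell,
#which contains neither stop term, so the text printed below is unchanged
print("Original:\n", tweet)
print("\n\nNew:\n ", removeWords(tweet, stopwords))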

womTweets['text'] = womTweets['text'].apply(removeWords, wordlist = stopwords)
womTweets['text'] = womTweets['text'].apply(hashtagSegment)

Original:
 ' yes, but some men's feelings getting hurt and they want to stop. we forgot 4 a moment #notallmen more important than #yesallwhitewomen


New:
  ' yes, but some men's feelings getting hurt and they want to stop. we forgot 4 a moment #notallmen more important than #yesallwhitewomen


In [32]:
womTweets.to_csv('gs://csc3002/pretrain_data/tweetText/removedTerms/YesAllWomenTweets.csv', sep = ',', \
                 encoding='utf-8', index = False, header = 0)
womTweets.head(10)

Unnamed: 0,id,text
0,470315730706399232,"i get in an elevator with a guy and think ""what's my escape plan going to be?"""
1,470317902776647680,"it starts earlyschool dress codes punish girls for wearing clothes that are ""distracting"" to boys."
2,470317461024555009,"""boys will be boys"" is a phrase that still exists."
3,470316502445744130,"we're taught ""never leave your drink alone,"" instead of ""don't drug someone."""
4,470317309022978048,have the right to set boundaries. prevalent misogyny has led to the extinction of private space for women.
5,470315790885863424,women are taught to carry our keys like a weapon in case we're attacked in a parking lot.
6,470315890060189696,hell yeah. deserve the right to say no to a man without a given reason. they owe you
7,470316527959670784,"there is a moment, daily, weekly, monthly, where you're in a situation where you think: ""is today the day i get rap"
8,470314968492302338,sounds like something that needs to get shared right now.
9,470315514980757504,hell yeah. deserve the right to say no to a man without a given reason. they owe you nothing!


<b>Immigration Executive Order Dataset</b>

In [34]:
imTweets = pd.read_csv('gs://csc3002/pretrain_data/tweetText/immigrationTweets.csv',sep=',',  index_col = False, names = ['id', 'text'])
imTweets.text = imTweets.text.apply(preprocess)
imTweets.head(10)

Unnamed: 0,id,text
0,830392337662554113,aka judge roberts and 9thcircuit court in big ass trouble.
1,830392257861865472,born and raised in canada. not a refugee. not an immigrant. no ties to the 7 counties of the eo. it's a #muslimban https
2,830392310701502464,trey gowdy blasts the liberal 9th circuit over trump immigration order
3,830392243928301568,more than 300 protest trump's immigration ban at pittsburgh internationa
4,830392409691451393,"there is no #muslimban just arrived lax, no one asked me about my religious faith even though my passport has my muslim name"
5,830392357329715200,"trump to sign brand new immigration executive order, not appeal to scotus by #rt_com via"
6,830392395216855040,and all have been through vetting for years so stop fear mongering because you didn't win your muslim b
7,830392360349659136,#iran #isis #israel #muslimban #nomuslimban #trump #alqaeeda
8,830392412321218560,if the u.s. were at war with iran would liberal judges understand how important a travel ban would be?
9,830392240631586819,trump travel ban kills surgeon's life-saving trip to iran


In [35]:
tweet = imTweets.iloc[1]['text']
#Stop terms must be lower-case because removeWords compares them against word.lower()
stopwords = ['#muslimban', '#nobannowall', '#nomuslimban','#jfkterminal4',\
             '#refugeeswelcome', '#immigrationban', '#travelban']
             
print("Original:\n", tweet)
print("\n\nNew:\n ", removeWords(tweet, stopwords))

imTweets['text'] = imTweets['text'].apply(removeWords, wordlist = stopwords)
imTweets['text'] = imTweets['text'].apply(hashtagSegment)

Original:
 born and raised in canada. not a refugee. not an immigrant. no ties to the 7 counties of the eo. it's a #muslimban https


New:
  born and raised in canada. not a refugee. not an immigrant. no ties to the 7 counties of the eo. it's a https


In [36]:
imTweets.to_csv('gs://csc3002/pretrain_data/tweetText/removedTerms/immigrationTweets.csv', sep = ',', \
                 encoding='utf-8', index = False, header = 0)
imTweets.head(10)

Unnamed: 0,id,text
0,830392337662554113,aka judge roberts and 9thcircuit court in big ass trouble.
1,830392257861865472,born and raised in canada. not a refugee. not an immigrant. no ties to the 7 counties of the eo. it's a https
2,830392310701502464,trey gowdy blasts the liberal 9th circuit over trump immigration order
3,830392243928301568,more than 300 protest trump's immigration ban at pittsburgh internationa
4,830392409691451393,"there is no just arrived lax, no one asked me about my religious faith even though my passport has my muslim name"
5,830392357329715200,"trump to sign brand new immigration executive order, not appeal to scotus by rt com via"
6,830392395216855040,and all have been through vetting for years so stop fear mongering because you didn't win your muslim b
7,830392360349659136,iran isis israel trump al qa eeda
8,830392412321218560,if the u.s. were at war with iran would liberal judges understand how important a travel ban would be?
9,830392240631586819,trump travel ban kills surgeon's life-saving trip to iran
