# 2. Dataset Creation - Mining Mood Tweets from Twitter API

Now that our proof of concept has shown that the sentiment analysis training method works, we will collate our dataset of mood tweets. I will refer to the sad/happy tweets as "mood tweets" from here on.

We will be using the Twitter Premium Search API (Sandbox version). The Sandbox version is free; its functionality is limited, but it is just enough for us to gather our dataset. The Twitter Premium Search API is a RESTful API that requires us to create:

1. A Twitter App - https://developer.twitter.com/en/apps
2. A Dev Environment for the App - https://developer.twitter.com/en/account/environments

API Reference: https://developer.twitter.com/en/docs/tweets/search/api-reference/premium-search

**Why the Twitter Premium Search API:**
- Gives us the ability to refine our search with operators and get cleaner data (for example, English-only suicidal tweets can be expressed as `suicidal OR suicide OR "kill myself" lang:en`)
- The free version gives us 250 queries a month, and each query returns up to 100 tweets. 25,000 tweets should form a sufficiently large dataset
- The API reference is really quite good
- I tried the free "Standard Search API", but it doesn't return full tweets

You will have to register for a developer account to get the API keys in order to follow along with this notebook. Otherwise, you can just follow my output here.

In [5]:
import requests
import my_api_keys as api_keys

# to get your bearer token, go to https://developer.twitter.com/en/docs/basics/authentication/guides/bearer-tokens
# rename my_api_keys_SAMPLE.py to my_api_keys.py and put your token in it

bearer_access_token = api_keys.bearer_access_token

Here, we will set up the POST request to query the premium search API. Our query will include these premium operators:
- `lang:en` - to search only English tweets
- `-has:links` - the hyphen is a negation operator (NOT); we only want tweets WITHOUT links, so that we can filter out advertisements
- `-has:media` - tweets WITHOUT media; I thought we would be more likely to get real "mood tweets" if we filtered out the "meme" or "gif" tweets
- `-is:retweet` - filters out retweets. I wanted to add this operator too but could not, because the free Sandbox version of the API does not allow it. Retweets should be filtered because they are very likely to come from "emo quotes" Twitter accounts or to be advertisements
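The operator list above can be folded into a small helper. This is only a sketch: `build_query`, `COMMON_FILTERS` and `MAX_QUERY_CHARS` are hypothetical names that do not appear in the notebook. It rests on two assumptions that should be checked against the API reference: that implicit AND binds more tightly than OR (which is why the keywords are wrapped in parentheses, so the filters apply to every keyword and not only the last one), and that sandbox queries are limited to 256 characters.

```python
# Filters from the list above; appended to every class query.
COMMON_FILTERS = ["-has:links", "-has:media", "lang:en"]

# Assumed sandbox limit on query length; verify against the API reference.
MAX_QUERY_CHARS = 256


def build_query(keywords, filters=COMMON_FILTERS, max_chars=MAX_QUERY_CHARS):
    """Combine a keyword expression with the shared filters (hypothetical helper)."""
    # Parentheses keep the OR group together so the filters apply to all keywords.
    query = "({}) {}".format(keywords, " ".join(filters))
    if len(query) > max_chars:
        raise ValueError("query is {} characters, limit is {}".format(len(query), max_chars))
    return query


print(build_query('suicidal OR suicide OR "kill myself"'))
```

Failing before the request is sent is the point of the length check: a rejected query would still count against the monthly limit.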

In [95]:
# endpoint URI is https://api.twitter.com/1.1/tweets/search/30day/<YOUR DEV ENV NAME>.json
endpoint = "https://api.twitter.com/1.1/tweets/search/30day/dev.json"

headers = {
    "Authorization" : "Bearer " + bearer_access_token,
    "Content-Type": "application/json",
}

queries = {
    "suicidal" : 'suicidal OR suicide OR "kill myself"',
    "depressed" : "depressed OR depression",
    "sad" : "sad -depressed -suicidal -suicide",
    "happy" : "happy OR great -sadness -depression -disappointment",
    "cheerful" : "cheerful OR awesome OR glad OR pleased",
    "overjoyed" : "overjoyed OR elated OR ecstatic OR thrilled",
}

In [7]:
# do your math and calculate how many sets of 100 tweets per class you want
# you have a limit of 250 queries in a month for the sandbox version of the API
# DON'T SCREW THIS UP OR YOU WILL WASTE QUERIES
NO_OF_QUERIES_PER_CLASS = 40
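The arithmetic behind this constant can be checked before any request is sent. A minimal sketch, assuming the six classes defined in `queries` and the 250-request monthly limit quoted above; `check_budget` is a hypothetical helper, not part of the notebook.

```python
MONTHLY_QUERY_LIMIT = 250   # sandbox limit quoted in this notebook
TWEETS_PER_QUERY = 100      # maxResults used in the requests


def check_budget(n_classes, queries_per_class, limit=MONTHLY_QUERY_LIMIT):
    """Return (total requests, maximum tweets) or raise if the plan exceeds the limit."""
    total = n_classes * queries_per_class
    if total > limit:
        raise ValueError("{} requests planned, but only {} are available".format(total, limit))
    return total, total * TWEETS_PER_QUERY


total, max_tweets = check_budget(n_classes=6, queries_per_class=40)
print(total, "requests, up to", max_tweets, "tweets")
```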

In [8]:
from datetime import datetime
import os
import time
import json

# Directory creation for dataset
# makedirs also creates "./datasets" if it does not exist yet
os.makedirs("./datasets/mood_tweets/", exist_ok=True)
# compute the timestamp once so the checked path and the created path always match
DATASET_ROOT_DIR = "./datasets/mood_tweets/" + datetime.now().strftime("%Y%m%d-%H%M")
if os.path.exists(DATASET_ROOT_DIR):
    # refuse to overwrite the files of an earlier run
    raise FileExistsError("You sure? The directory already exists: " + DATASET_ROOT_DIR)
os.mkdir(DATASET_ROOT_DIR)

# Actual API Querying Part
query_params = {
    'query' : '', # blank string because we populate it later in the loop
    'maxResults' : '100',
}

for class_name, current_query in queries.items():
    print(class_name)
    
    # set the right query term
    query_params["query"] = current_query + ' -has:links -has:media lang:en'
    
    # remove "next" parameter from previous term's pagination
    if "next" in query_params:
        query_params.pop("next")
        
    # load initial query
    data = json.dumps(query_params)
    
    for n in range(0, NO_OF_QUERIES_PER_CLASS):

        # query API
        response = requests.post(endpoint, data=data, headers=headers)
        response_json = json.loads(response.text)
        
        # write response to file
        filename = "{}/{}-{}.json".format(DATASET_ROOT_DIR, class_name, format(n, '03'))
        with open(filename, "w") as file:
            json.dump(response_json, file, ensure_ascii=False, indent=4)
        print(filename)
        
        # pause to query API slower
        time.sleep(1)
        
        # handle pagination for next call
        if "next" in response_json:
            query_params["next"] = response_json["next"]
            data = json.dumps(query_params)
        else:
            # we maxed out the number of pages for this available term, this is the last "page" of results
            print("Ended prematurely at {} because no further pages are available".format(filename))
            break

print("DONE")

happy
./datasets/mood_tweets/20190405-0528/happy-000.json
./datasets/mood_tweets/20190405-0528/happy-001.json
./datasets/mood_tweets/20190405-0528/happy-002.json
./datasets/mood_tweets/20190405-0528/happy-003.json
./datasets/mood_tweets/20190405-0528/happy-004.json
./datasets/mood_tweets/20190405-0528/happy-005.json
./datasets/mood_tweets/20190405-0528/happy-006.json
./datasets/mood_tweets/20190405-0528/happy-007.json
./datasets/mood_tweets/20190405-0528/happy-008.json
./datasets/mood_tweets/20190405-0528/happy-009.json
./datasets/mood_tweets/20190405-0528/happy-010.json
./datasets/mood_tweets/20190405-0528/happy-011.json
./datasets/mood_tweets/20190405-0528/happy-012.json
./datasets/mood_tweets/20190405-0528/happy-013.json
./datasets/mood_tweets/20190405-0528/happy-014.json
./datasets/mood_tweets/20190405-0528/happy-015.json
./datasets/mood_tweets/20190405-0528/happy-016.json
./datasets/mood_tweets/20190405-0528/happy-017.json
./datasets/mood_tweets/20190405-0528/happy-018.json
./data

./datasets/mood_tweets/20190405-0528/cheerful-074.json
./datasets/mood_tweets/20190405-0528/cheerful-075.json
./datasets/mood_tweets/20190405-0528/cheerful-076.json
./datasets/mood_tweets/20190405-0528/cheerful-077.json
./datasets/mood_tweets/20190405-0528/cheerful-078.json
./datasets/mood_tweets/20190405-0528/cheerful-079.json
overjoyed
./datasets/mood_tweets/20190405-0528/overjoyed-000.json
./datasets/mood_tweets/20190405-0528/overjoyed-001.json
./datasets/mood_tweets/20190405-0528/overjoyed-002.json
./datasets/mood_tweets/20190405-0528/overjoyed-003.json
./datasets/mood_tweets/20190405-0528/overjoyed-004.json
./datasets/mood_tweets/20190405-0528/overjoyed-005.json
./datasets/mood_tweets/20190405-0528/overjoyed-006.json
./datasets/mood_tweets/20190405-0528/overjoyed-007.json
./datasets/mood_tweets/20190405-0528/overjoyed-008.json
./datasets/mood_tweets/20190405-0528/overjoyed-009.json
./datasets/mood_tweets/20190405-0528/overjoyed-010.json
./datasets/mood_tweets/20190405-0528/overjoy
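
The pagination pattern in the loop above (send a request, save the page, copy the `next` token into the following request, stop when it is absent) can be isolated into a generator. This is a sketch under assumptions: `iter_pages` is a hypothetical name, the request function is passed in so the logic can be exercised without spending real queries, and stopping on a non-200 status is an addition that the loop above does not have.

```python
import json


def iter_pages(post, endpoint, headers, query, max_pages, max_results="100"):
    """Yield decoded result pages, following the "next" token (hypothetical helper).

    `post` is any callable that can be called like requests.post.
    """
    params = {"query": query, "maxResults": max_results}
    for _ in range(max_pages):
        response = post(endpoint, data=json.dumps(params), headers=headers)
        if response.status_code != 200:
            # Stop early so that a bad token or query does not burn more requests.
            raise RuntimeError("request failed with status {}".format(response.status_code))
        page = json.loads(response.text)
        yield page
        if "next" not in page:
            break  # last available page for this query
        params["next"] = page["next"]


class FakeResponse:
    """Stand-in for requests.Response, used only for this demonstration."""

    def __init__(self, payload, status_code=200):
        self.text = json.dumps(payload)
        self.status_code = status_code


def fake_post(endpoint, data, headers):
    # Serve two pages: the first carries a "next" token, the second does not.
    sent = json.loads(data)
    if "next" not in sent:
        return FakeResponse({"results": ["a", "b"], "next": "token-1"})
    return FakeResponse({"results": ["c"]})


pages = list(iter_pages(fake_post, "https://example.invalid", {}, "happy lang:en", max_pages=5))
print([len(p["results"]) for p in pages])
```

Against the real API the call would look like `iter_pages(requests.post, endpoint, headers, query, NO_OF_QUERIES_PER_CLASS)`, with each yielded page written to its own file as in the loop above.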

## Cleaning up the dataset

Before mass-processing all the data, we will process one sample file to check that the files are complete and can be extracted without issues.

In [26]:
sample_file = './datasets/mood_tweets/20190405-0528/overjoyed-079.json'

with open(sample_file, "r") as file:
    result = json.loads(file.read())
    filename = os.path.basename(file.name)
    classname = filename.split("-")[0]
classname

'overjoyed'

In [22]:
print(len(result["results"]), "tweets found")

100 tweets found


In [21]:
for x in result["results"]:
    if "retweeted_status" in x:
        if "extended_tweet" in x["retweeted_status"]:
            print("\nRETWEET - LONG")
            print(x["retweeted_status"]["extended_tweet"]["full_text"])
        else:
            print("\nRETWEET - SHORT")
            print(x["retweeted_status"]["text"])
    elif "extended_tweet" in x:
        print("\nNORMAL TWEET - LONG")
        print(x["extended_tweet"]["full_text"])
    else:
        print("\nNORMAL TWEET - SHORT")
        print(x["text"])


RETWEET - SHORT
Nerves broken, almost a heart attack, but overjoyed ❤️ #FCBFCH #DFBPokal https://t.co/sMWG0h5GaV

NORMAL TWEET - SHORT
@ecstatic_shocks koreabooism at it’s finest🤡

RETWEET - LONG
Special to hit helicopter shot with @msdhoni watching: Hardik

"Hoped MS would congratulate me after that shot 😜"
An overjoyed @hardikpandya7 talks about emulating inspiration MSD's pet stroke against CSK. Interview by @Moulinparikh #MIvCSK @mipaltan 

📹 https://t.co/jLLWXuZRYe https://t.co/aci6s6cPBF

RETWEET - LONG
Thrilled to hear @UnitedWaySummit is expanding their diversity, equity, &amp; inclusion work! @SummitArtsNow partnered w/ UWSC in March on a social service+arts, culture, and environment collaboration. Looking forward to the next phase! Thank you Andre Campbell and Adrienne Bradley!

RETWEET - LONG
Thrilled to join @PKMackie &amp; @cardiffbusiness for breakfast briefing on homelessness. Thank you for such a warm &amp;  positive response - @LlamauUK &amp;  EYHC’s message is that e
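
To see how much of a file consists of retweets before deciding how to clean it, the four cases above can be tallied. A sketch, assuming only the payload keys used in the loop (`retweeted_status`, `extended_tweet`); `tweet_kind` and the sample dictionaries are made up for illustration.

```python
from collections import Counter


def tweet_kind(tweet):
    """Label a tweet payload with one of the four cases printed above."""
    if "retweeted_status" in tweet:
        inner = tweet["retweeted_status"]
        return "retweet-long" if "extended_tweet" in inner else "retweet-short"
    return "normal-long" if "extended_tweet" in tweet else "normal-short"


# Made-up payloads containing only the keys the function inspects.
sample = [
    {"text": "short"},
    {"text": "cut off...", "extended_tweet": {"full_text": "the whole text"}},
    {"text": "RT ...", "retweeted_status": {"text": "original"}},
    {"text": "RT ...", "retweeted_status": {"text": "cut...",
                                            "extended_tweet": {"full_text": "original in full"}}},
    {"text": "RT ...", "retweeted_status": {"text": "another"}},
]

counts = Counter(tweet_kind(t) for t in sample)
print(counts)
```

On the real data the same tally would be `Counter(tweet_kind(x) for x in result["results"])`.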

Awesome! Now we need to think about how to clean up the data **on one file first**. By observing the tweets, you can see that:
1. Some tweets still have URLs in them. We should get rid of them
2. We should replace '&amp;' with '&'
3. Multi-line tweets should be collapsed into single lines
4. The retweets give us a lot of duplicate tweets (the sad part of being too poor to afford the paid API, where we could use the `-is:retweet` operator in our search)
5. Commas should be removed, because I'm not training for grammar and I want to process these tweets in a CSV (this is specific to my use case, to make my work easier)
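
Steps 1, 2, 3 and 5 can be tried on a plain string before touching the JSON. This is a sketch with some deliberate differences from the cell that follows, all of them assumptions rather than the notebook's method: `html.unescape` decodes every HTML entity and not only `&amp;`, a simplified URL pattern stands in for a full URL regex, and line breaks become spaces so that neighbouring words are not fused together. `clean_text` is a hypothetical name.

```python
import html
import re

URL_PATTERN = re.compile(r"https?://\S+")   # simplified; misses URLs without a scheme
WHITESPACE = re.compile(r"\s+")


def clean_text(text):
    """Apply cleanup steps 1, 2, 3 and 5 from the list above to a raw tweet string."""
    text = URL_PATTERN.sub("", text)        # 1. drop URLs
    text = html.unescape(text)              # 2. "&amp;" -> "&", "&lt;" -> "<", ...
    text = text.replace(",", "")            # 5. drop commas
    text = WHITESPACE.sub(" ", text)        # 3. newlines and repeated spaces -> one space
    return text.strip()


# Made-up input; the URL is a placeholder.
raw = "Thrilled to join @PKMackie &amp; friends,\nthank you! https://t.co/abc123"
print(clean_text(raw))
```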

In [67]:
import re, random

def clean_tweet(x):
    if "retweeted_status" in x:
        if "extended_tweet" in x["retweeted_status"]:
            # RETWEET - LONG
            tweet = x["retweeted_status"]["extended_tweet"]["full_text"]
        else:
            # RETWEET - SHORT
            tweet = x["retweeted_status"]["text"]
    elif "extended_tweet" in x:
        # NORMAL TWEET - LONG
        tweet = x["extended_tweet"]["full_text"]
    else:
        # NORMAL TWEET - SHORT
        tweet = x["text"]
        
    # clean up tweet here
    
    # removing all URLs
    tweet = re.sub(r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))''',
                   '', tweet, flags=re.MULTILINE)
    
    # replacing all HTML-encoded ampersands (&amp;) with proper ampersands
    tweet = tweet.replace("&amp;", "&")
    
    # removing newlines
    tweet = tweet.replace("\n", "").replace("\r", "")
    
    # removing commas
    tweet = tweet.replace(",", "")
    
    # Disabled: replace @handles with randomly chosen placeholder names.
    # names = ["TOM", "DICK", "HARRY"]
    # tweet = re.sub(r"^@\w+", random.choice(names), tweet)
    # pattern = re.compile(r" {1}@\w+")
    # while pattern.search(tweet):
    #     tweet = re.sub(r" {1}@\w+", " {}".format(random.choice(names)), tweet, count=1)
    
    return tweet

for x in result["results"]:
    print(clean_tweet(x), "\n")

Nerves broken almost a heart attack but overjoyed ❤️ #FCBFCH #DFBPokal  

@ecstatic_shocks koreabooism at it’s finest🤡 

Special to hit helicopter shot with @msdhoni watching: Hardik"Hoped MS would congratulate me after that shot 😜"An overjoyed @hardikpandya7 talks about emulating inspiration MSD's pet stroke against CSK. Interview by @Moulinparikh #MIvCSK @mipaltan 📹   

Thrilled to hear @UnitedWaySummit is expanding their diversity equity & inclusion work! @SummitArtsNow partnered w/ UWSC in March on a social service+arts culture and environment collaboration. Looking forward to the next phase! Thank you Andre Campbell and Adrienne Bradley! 

Thrilled to join @PKMackie & @cardiffbusiness for breakfast briefing on homelessness. Thank you for such a warm &  positive response - @LlamauUK &  EYHC’s message is that each and everyone of us in Wales needs to commit to playing our part in preventing and ending homelessness 💪 

WATCH: PBB Otso Batch 2 Housemates are all elated meeting each ot

Cleaning up the tweets in one file was successful! Time to mass-process the rest of the files!

In [29]:
from glob import glob

dataset = glob(DATASET_ROOT_DIR + "/*")
len(dataset)

480

In [78]:
tweets, labels, duplicates = [], [], 0
seen = set()  # set lookups are O(1); scanning the list for every tweet is O(n)

for f in dataset:
    with open(f, "r") as file:
        result = json.loads(file.read())
        filename = os.path.basename(file.name)

    # the class name is the part of the filename before the first hyphen
    classname = filename.split("-")[0]
    for x in result["results"]:
        tweet = clean_tweet(x)
        if tweet not in seen:
            # unique tweet, add to collections
            seen.add(tweet)
            tweets.append(tweet)
            labels.append(classname)
        else:
            duplicates += 1

print("{} unique tweets found".format(len(tweets)))
print("{} duplicate tweets".format(duplicates))

25758 unique tweets found
22239 duplicate tweets


In [91]:
import pandas as pd

df = pd.DataFrame({
    "class" : labels,
    "tweet" : tweets
})
df.head()

|   | class | tweet |
|---|---|---|
| 0 | suicidal | "He was lost & scared" says a Newport woman ab... |
| 1 | suicidal | @TheDumbMedico Ameen. Or ye suicide wali bat a... |
| 2 | suicidal | You showed the leak in @HouseofCommons today @... |
| 3 | suicidal | The number in my bio is for a Suicide hotline ... |
| 4 | suicidal | cw: suicidal ideation it's unsurprisingly hard... |


In [92]:
df.describe()

|   | class | tweet |
|---|---|---|
| count | 25758 | 25758 |
| unique | 6 | 25758 |
| top | cheerful | Why is this limited to girls? Young everyones.... |
| freq | 5885 | 1 |


In [104]:
for key in queries.keys():
    print("{}: {} unique tweets".format(key, (df["class"] == key).sum()))

suicidal: 3204 unique tweets
depressed: 2561 unique tweets
sad: 5464 unique tweets
happy: 5266 unique tweets
cheerful: 5885 unique tweets
overjoyed: 3378 unique tweets


In [105]:
df.to_csv("./datasets/mood_tweets.csv")
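
As a side note on the CSV format: quoting lets a field contain commas, so a tweet's commas can survive a round trip through a CSV writer. A sketch using Python's standard `csv` module on made-up rows (the notebook itself writes the file with pandas, whose `to_csv` applies the same minimal quoting by default):

```python
import csv
import io

rows = [
    {"class": "happy", "tweet": "So happy, so glad, so thrilled"},   # made-up example
    {"class": "sad", "tweet": 'He said "not today" and left'},        # made-up example
]

# newline="" leaves line endings to the csv module, as its documentation recommends.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["class", "tweet"])
writer.writeheader()
writer.writerows(rows)

# Read the text back and confirm the fields are unchanged.
buffer.seek(0)
restored = [dict(row) for row in csv.DictReader(buffer)]
print(restored == rows)
```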