By Alexander Stratton / als15@illinois.edu / Copyright 2022. All Rights Reserved.

References:
    - https://github.com/echen102/ukraine-russia/tree/master/2022-04

In [89]:
import requests
import pandas as pd
from csv import writer
import json
import time
import glob
import os
from pathlib import Path
import tqdm
from twarc import Twarc
from nltk.tokenize import TweetTokenizer

tk = TweetTokenizer()
tw = Twarc('consumer_key', 'consumer_key_secret', 'access_token', 'access_token_secret')
path = str(Path.cwd()) + '/data/'

In [64]:
# This function hydrates the tweets and writes them to a CSV file.
# Twarc ships with a prepackaged hydrate script, but I wanted to keep only part of
# the information it retrieves, and I wanted it as CSV rather than JSON.

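def hydrate(path, file):
    # `path` is the text file of tweet ids; `file` is the CSV file to append to.
    # The context managers close both files even if hydration raises.
    with open(path) as idFile, \
         open(file, "a", newline="", encoding='utf-8') as csvFile:
        csvWriter = writer(csvFile)

        for tweet in tw.hydrate(idFile):
            # Keep only the tweet id and its full text.
            csvWriter.writerow([tweet['id'], tweet['full_text']])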

In [90]:
all_files = glob.glob(os.path.join(path, "*.txt"))
li = [pd.read_csv(filename, sep=" ", header=None) for filename in all_files]
all_tweets = pd.concat(li, axis=0, ignore_index=True)

In [91]:
# This is how many tweets the dataset has for one day (2022-04-01).

len(all_tweets)

3029712

In [36]:
need_verbs = ['needs', 'need', 'needing', 'require', 'requiring', 'needed',
              'required', 'demand', 'demands', 'demanding', 'request', 'requesting', 'requests']

The academic API limits how many tweets we can pull per month. I am therefore taking a subset of this one day to estimate how many usable tweets the dataset contains: I will pull 10,000 tweets and count how many contain at least one of the "need verbs" listed above.
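As a toy illustration of the check described above, the sketch below flags a tweet as usable when any of its tokens is one of the need verbs. It is only a sketch: `contains_need_verb` is a hypothetical helper, the two example sentences are made up, and the regular expression is a stand-in for the `TweetTokenizer` used in this notebook, so tokenization may differ in edge cases. Like the notebook's check, it is case-sensitive.

```python
import re

# Same verbs as the need_verbs list above, stored as a set for fast lookup.
NEED_VERBS = {'needs', 'need', 'needing', 'require', 'requiring', 'needed',
              'required', 'demand', 'demands', 'demanding', 'request',
              'requesting', 'requests'}


def contains_need_verb(text, verbs=NEED_VERBS):
    """Return True if any whole token of `text` is a need verb (case-sensitive)."""
    # Assumption: a simple regex tokenizer stands in for nltk's TweetTokenizer.
    tokens = re.findall(r"[A-Za-z']+", text)
    return any(token in verbs for token in tokens)


# Made-up examples, not taken from the dataset.
examples = [
    "We need fighter jets to defend our sky.",
    "Needless to say, the talks continued.",
]
flags = [contains_need_verb(text) for text in examples]
print(flags)
```

Matching whole tokens rather than substrings is what keeps a word such as "Needless" from counting as a hit.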

In [39]:
tweet_sample = all_tweets.sample(10000)

# Writing my sample into a txt file so I can rehydrate the tweets from their ids.
# Mode 'w' overwrites the file, so re-running this cell does not append a second
# sample directly onto the last id of the previous one.
with open(os.path.join(path, "sample/sample.txt"), 'w') as f:
    tweets_as_string = tweet_sample.to_string(header=False, index=False)
    f.write(tweets_as_string)

In [65]:
# Start a fresh CSV with just the header row; hydrate() then appends the tweets.
with open(os.path.join(path, "sample/sample_tweets.csv"), "w", newline="", encoding='utf-8') as csvFile:
    csvWriter = writer(csvFile)
    csvWriter.writerow(['id', 'text'])

hydrate(os.path.join(path, "sample/sample.txt"), os.path.join(path, "sample/sample_tweets.csv"))

In [76]:
tweets = pd.read_csv(os.path.join(path, "sample/sample_tweets.csv"))
tagged = [tk.tokenize(text) for text in tweets['text']]

In [133]:
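usable_tweets = []

# A tweet counts as usable if any of its tokens is a need verb.
# The comparison is case-sensitive, so e.g. 'Need' is not matched.
for idx, tweet in enumerate(tagged):
    for need in need_verbs:
        if need in tweet:
            usable_tweets.append(tweets.iloc[idx, 1])
            break

# The denominator is the number of tweets that were actually hydrated.
ratio = len(usable_tweets) / len(tweets)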

In [136]:
usable_tweets[0:10]

['RT @SteveSchmidtSES: Where are these Ukrainians? The world must demand immediate answers. This is ominous and reeks of the stench of an evi…',
 'RT @KpsZSU: We have not received the tools we need to defend our sky and achieve victory.\r\nIn the sky, the greatest need is for fighter jet…',
 'RT @watch_waste: This #FridaysForFuture, we demand you to #StopWar #UkraineWar, @WhiteHouse @EU_Commission @EUCouncil #Kremlin @JoeBiden @v…',
 'RT @jcokechukwu: BREAKING: Putin signs into law, a decree requiring foreign buyers of Russian gas to pay in Rubles starting April 1. This m…',
 '@thehill At the ways Putin is going about the war with Ukraine,  it’s not going to halt until Ukraine totally surrender and give in to all demands from Russia.  Biden is now caught in a dilemma and NATO may be fragmented in continuing with the war.',
 'RT @SUNNYLAND24: Sounds like there is a MAJOR need for $BYRG @buyergroupinc 👀👇\n\n#UnitedStates based and owned #Platinum #Palladium #Rhodium…',
 'RT @KpsZSU: We h

In [135]:
ratio

0.02449078564500485

If this sample is representative, then roughly 2.4491% of tweets are usable. Applied to the dataset's almost 454.5 million tweets, that gives approximately 11.13 million usable tweets. In reality the number will be lower: all of these tweets contain a need verb, but not all of them will be relevant.
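The arithmetic behind that estimate can be checked with a short sketch. The ratio is the value printed above and the total is the "almost 454.5 million" figure from the text, rounded to 454,500,000 here as an assumption. `extrapolate_usable` is a hypothetical helper name, not part of the notebook.

```python
def extrapolate_usable(ratio, total_tweets):
    """Scale the sample ratio up to the full dataset (assumes a representative sample)."""
    return ratio * total_tweets


sample_ratio = 0.02449078564500485  # value of `ratio` printed by the notebook
total_tweets = 454_500_000          # assumption: "almost 454.5 million" taken as exact

estimate = extrapolate_usable(sample_ratio, total_tweets)
print(f"{sample_ratio:.4%} of {total_tweets:,} tweets is about {estimate / 1e6:.2f} million")
```

This is an upper bound on relevance rather than a count of relevant tweets, since it only measures how many tweets contain a need verb.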