# Data collection
May 5, 2022
## Description:
In this notebook we collect data from Twitter, looking for tweets about new arXiv submissions that were written by one of the paper's authors.

To access the Twitter database, we use the Twitter API (v2) through the tweepy package. We use Academic Research access, whose main advantage over the standard access plans is that it searches the entire database rather than only the last week. The bearer_token is saved in twitter_key.txt.

To access arXiv, we use a simple HTML parser defined in arxiv_utils.py. Its job is to fetch export.arxiv.org/abs/xxxx.xxxxx pages and read the title, author list and abstract. Currently we do not access the paper itself.
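
A minimal sketch of what such a parser might look like. It is not the code in arxiv_utils.py, which is not shown here. It assumes that the abstract page exposes `citation_title`, `citation_author` and `citation_abstract` `<meta>` tags, which is an assumption about the page markup that should be checked against a live page. The names `ArxivMetaParser` and `parse_arxiv_meta` are hypothetical, and the sample HTML is hand-written.

```python
from html.parser import HTMLParser

class ArxivMetaParser(HTMLParser):
    """Collects title, authors and abstract from <meta name="citation_*"> tags."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.authors = []
        self.abstract = None

    def handle_starttag(self, tag, attrs):
        if tag != 'meta':
            return
        attrs = dict(attrs)
        name, content = attrs.get('name'), attrs.get('content')
        if content is None:
            return
        if name == 'citation_title':
            self.title = content
        elif name == 'citation_author':
            self.authors.append(content)
        elif name == 'citation_abstract':
            self.abstract = content.strip()

def parse_arxiv_meta(html):
    """Return (title, author_list, abstract) found in an HTML string."""
    parser = ArxivMetaParser()
    parser.feed(html)
    return parser.title, parser.authors, parser.abstract

# Hand-written sample, not a real arXiv page
sample_html = '''<html><head>
<meta name="citation_title" content="An Example Title" />
<meta name="citation_author" content="Doe, Jane" />
<meta name="citation_author" content="Smith, John" />
<meta name="citation_abstract" content="We study an example." />
</head><body></body></html>'''

title, authors, abstract = parse_arxiv_meta(sample_html)
print(title, authors, abstract)
```

Parsing from a string keeps the sketch independent of the network; fetching the page would be a separate step.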

The mining algorithm is the following:

<code>
for each month in range of years (~2016 to 2022)
    search Twitter for tweets in month (plus 15 days into next month) 
        use query 'arxiv.org/abs/yymm' + 'new paper' or 'we find', etc.
        append found tweets to tweets list
for every tweet in tweets list
    find arXiv link and mine arXiv data
    retain tweet if it was written by one of the authors listed on arXiv (by some similarity metric)
        search Twitter for subsequent thread of the tweet, if exists
save dataset of {TweetID, AuthorID, AuthorName, Tweet thread, arXiv link, abstract}
</code>
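
As a small illustration of the first loop, the sketch below builds the `yymm` query prefix and the search window for one month. The function names `arxiv_query_prefix` and `month_window` are hypothetical; the notebook's own helpers appear in the code cells further down.

```python
from datetime import datetime

def arxiv_query_prefix(year, month, kind='abs'):
    """arXiv identifiers start with yymm, so this prefix selects one month of submissions."""
    return f'arxiv.org/{kind}/{year % 100:02d}{month:02d}'

def month_window(year, month, extra_days=15):
    """Start of the month up to day `extra_days` of the following month (UTC)."""
    next_year, next_month = (year, month + 1) if month < 12 else (year + 1, 1)
    fmt = '%Y-%m-%dT%H:%M:%SZ'
    start = datetime(year, month, 1).strftime(fmt)
    end = datetime(next_year, next_month, extra_days).strftime(fmt)
    return start, end

print(arxiv_query_prefix(2021, 12))  # arxiv.org/abs/2112
print(month_window(2021, 12))        # ('2021-12-01T00:00:00Z', '2022-01-15T00:00:00Z')
```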

### Reasoning behind some design choices:
We search Twitter month by month because (i) we find that the Twitter search function does not handle queries for old papers well unless a time range of about a month is specified; (ii) queries typically return < 500 results, which stays below the maximum number of results per query; (iii) the search can fail for several reasons (max limit, connection error, etc.), so it is convenient to restrict each search to a month.

We require early on that the tweet was written by one of the paper's authors, because many tweets are written by bots or unrelated people, and we are not interested in that data.

## Ways to improve:
The methodology here is fairly basic and could be made more sophisticated, with the end goals of collecting data that is currently overlooked and avoiding contamination by irrelevant data.
1) We currently do not use Pagination, which allows for more than 500 results per query. The main reason is that it seemed tricky to get the author names for tweets with Pagination. In any case, the current queries (per month) seem to cap at ~200, therefore it is not a serious limitation.
2) The queries can be revised and be made more comprehensive.
3) To determine whether a tweet was written by a paper's author, we compare the tweet author's name to each name in the arXiv author list using a similarity metric (see author_overlap_metric, used in clean_coarse_list). This similarity metric could be improved.
4) We try to expand links in various ways (see method link_entities_to_clean_arxiv_list in tweet_utils), but some expansions fail. For example, the URL expanders fail with Facebook redirection (e.g. https://t.co/tr7qNaQjMd).
5) It would be much more efficient if the initial query also downloaded the entire thread for each tweet. Currently, every tweet from the initial query that passes the quality criteria has to be sent in its own query (see method get_tweets_thread). This is quite expensive because each query is followed by 1 second of downtime to respect Twitter's rate limits.
6) Save more data, perhaps even the raw tweets, just in case.
7) Access other forms of data from tweets: links to other websites (e.g. GitHub), pictures and videos.
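
On point 1, a hedged sketch of how author names could be kept when paginating: each page of a v2 search response carries its own `includes['users']`, so the users can be merged into one lookup while iterating page by page. The function `collect_pages` is hypothetical and accepts any iterable of response-like objects. With tweepy, `tweepy.Paginator(client.search_all_tweets, query=..., expansions=['author_id'], user_fields=['name'], max_results=500)` should yield such pages, but this has not been run against the API here; the stub pages below only stand in for real responses.

```python
from types import SimpleNamespace

def collect_pages(pages):
    """Merge tweets and an author_id -> name lookup across paginated responses."""
    tweets, names = [], {}
    for page in pages:
        if page.data is None:  # an empty page has no data
            continue
        for user in (page.includes or {}).get('users', []):
            names[user.id] = user.name
        tweets.extend(page.data)
    return tweets, names

# Stub pages standing in for API responses (made-up ids and names)
page_1 = SimpleNamespace(
    data=[SimpleNamespace(id=1, author_id=10, text='new paper')],
    includes={'users': [SimpleNamespace(id=10, name='Jane Doe')]})
page_2 = SimpleNamespace(
    data=[SimpleNamespace(id=2, author_id=11, text='we find')],
    includes={'users': [SimpleNamespace(id=11, name='John Smith')]})
empty_page = SimpleNamespace(data=None, includes={})

tweets, names = collect_pages([page_1, page_2, empty_page])
for tweet in tweets:
    print(tweet.id, names.get(tweet.author_id))
```

A dictionary lookup also replaces the linear scan over the user list for every tweet.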

In [1]:
import tweepy
import tweet_utils
import arxiv_utils
import pandas as pd
import time
import difflib
from datetime import datetime, timedelta

In [17]:
# define Twitter client with proper authentication
import os
from pathlib import Path

# we save the Twitter project "bearer token" in twitter_key.txt two folders up. Can easily change to another directory.
path_here = os.getcwd()
path = Path(path_here)
with open(path.parent.parent.absolute() / 'twitter_key.txt') as key_f:
    bearer_token = key_f.read().strip() # strip the trailing newline, which would otherwise end up in the auth header

client = tweepy.Client(bearer_token=bearer_token)

In [3]:
# some helper functions, may move it to separate .py file

def next_month_yr_mon(year,month):
    if month<12:
        return year, month+1
    else:
        return year+1, 1
        
# converts a year and month to start_time and end_time (UTC timestamps in the YYYY-MM-DDTHH:MM:SSZ format)
def year_month_to_start_end_time(year, month, extra_days=15, hour_diff=-3):
    
    new_yr, new_mon = next_month_yr_mon(year,month)
    # standard values for looking into the past
    start = datetime(year, month, 1)
    end = datetime(new_yr, new_mon, extra_days)

    # if end lies in the future, clamp it to hour_diff hours before the current UTC time;
    # datetime arithmetic keeps the hour valid (tm_hour+hour_diff could go negative) and zero-pads all fields
    latest = datetime(*time.gmtime()[:6]) + timedelta(hours=hour_diff)
    if end > latest:
        end = latest
    
    fmt = '%Y-%m-%dT%H:%M:%SZ'
    return start.strftime(fmt), end.strftime(fmt)

# generates a list of [yr, month] from [first_yr, first_mon] to [last_yr, last_mon]
def get_year_month_list(first_yr, first_mon, last_yr, last_mon):
    mon_block = []
    for yr in range(first_yr, last_yr+1):
        last_mon_i = 12 if yr<last_yr else last_mon
        first_mon_i = 1 if yr>first_yr else first_mon
        
        mon_block.append([[yr,month] for month in range(first_mon_i,last_mon_i+1)])
    mon_block = [num for sublist in mon_block for num in sublist]
    return mon_block

def author_overlap_metric(str1,str2):
    d = difflib.SequenceMatcher(None,str1,str2)
    match = max(d.get_matching_blocks(),key=lambda x:x[2])
    return match.size/max(len(str1),len(str2))

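def clean_tweet_from_links(tweet, link_token='<LINK>'):
    # replace every shortened URL in the tweet text with link_token
    s = tweet['text']
    if tweet.entities is not None:
        # entities may contain only hashtags or mentions, hence .get
        for url_entity in tweet.entities.get('urls', []):
            s = s.replace(url_entity['url'], link_token)
    return s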

def get_username(user_id,user_list):
    for user in user_list:
        if user.id == user_id:
            return user.name
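
`author_overlap_metric` above scores the longest shared block of characters, so it is case-sensitive and depends on word order. As one possible direction for improving it, here is a hedged sketch of a token-based alternative. The names `name_tokens` and `token_overlap_metric` are hypothetical, the example names are made up, and the metric has not been evaluated on the dataset.

```python
import re

def name_tokens(name):
    """Lowercase alphabetic tokens longer than one letter (drops initials, emoji, punctuation)."""
    return {tok for tok in re.findall(r"[^\W\d_]+", name.lower()) if len(tok) > 1}

def token_overlap_metric(str1, str2):
    """Fraction of the shorter name's tokens that also occur in the other name."""
    t1, t2 = name_tokens(str1), name_tokens(str2)
    if not t1 or not t2:
        return 0.0
    return len(t1 & t2) / min(len(t1), len(t2))

print(token_overlap_metric('Jane Q. Doe', 'jane doe 🚀'))
print(token_overlap_metric('Jane Doe', 'John Smith'))
```

Dividing by the smaller token set lets a display name with extra words (titles, affiliations) still match; the price is more false positives for people who share a single name token.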
        

In [4]:
def obtain_coarse_data_for(year=2022,month=4):
    coarse_tweets = []
    start_time, end_time = year_month_to_start_end_time(year, month)
    arxiv_prompt_1 = 'arxiv.org/abs/'+str((year-2000))+(str(month) if month>=10 else '0'+str(month)) ## set /abs/yymm format
    arxiv_prompt_2 = 'arxiv.org/pdf/'+str((year-2000))+(str(month) if month>=10 else '0'+str(month)) ## set /pdf/yymm format
    
    tweets_1 = client.search_all_tweets(query=arxiv_prompt_1+' new paper -replaced -is:retweet -is:reply -from:750791077012107264', 
                                  tweet_fields=['author_id','entities','conversation_id','created_at'], user_fields=['name'], expansions=['author_id'], 
                                  max_results=500, start_time=start_time, end_time=end_time)
    time.sleep(1.1) # to avoid exceeding the rate
    tweets_2 = client.search_all_tweets(query=arxiv_prompt_2+' new paper -replaced -is:retweet -is:reply', 
                                  tweet_fields=['author_id','entities','conversation_id','created_at'], user_fields=['name'], expansions=['author_id'], 
                                  max_results=500, start_time=start_time, end_time=end_time)
    time.sleep(1.1)
    tweets_3 = client.search_all_tweets(query=arxiv_prompt_1+' we (study OR studied OR find OR propose) -is:retweet -is:reply', 
                                  tweet_fields=['author_id','entities','conversation_id','created_at'], user_fields=['name'], expansions=['author_id'], 
                                  max_results=500, start_time=start_time, end_time=end_time)
    time.sleep(1.1)
    tweets_4 = client.search_all_tweets(query=arxiv_prompt_2+' we (study OR studied OR find OR found OR propose) -is:retweet -is:reply', 
                                  tweet_fields=['author_id','entities','conversation_id','created_at'], user_fields=['name'], expansions=['author_id'], 
                                  max_results=500, start_time=start_time, end_time=end_time)
    time.sleep(1.1)

    coarse_tweets.append(tweets_1)
    coarse_tweets.append(tweets_2)
    coarse_tweets.append(tweets_3)
    coarse_tweets.append(tweets_4)
    return coarse_tweets

def clean_coarse_list(coarse_tweets, author_closeness_cut = 0.3):
    # 2nd run on Twitter: keep only tweets whose authors appear on arXiv link (by some similarity metric)
    # rearrange data to one list of (tweet, user_name, url_list, abstract, title)

    tweets = []

    for coarse_tweets_i in coarse_tweets:
        print('new query')
        if coarse_tweets_i.data is None:
            print('Empty query')
            continue
            
        user_list = coarse_tweets_i.includes['users']
        tweet_data = coarse_tweets_i.data

        for j, tweet in enumerate(coarse_tweets_i.data):
            user_name = get_username(tweet.author_id, user_list)
            
            #tweet_urls = tweet_utils.get_urls_single_tweet(tweet)
            tweet_urls_dict = tweet.entities['urls']
            url_list = tweet_utils.link_entities_to_clean_arxiv_list(tweet_urls_dict)

            #new_url_list = tweet_utils.remove_non_arxiv_and_replace_pdf_single(tweet_urls)

            if len(url_list) == 0:
                print('EMPTY!')
                print(tweet.entities['urls'])
                continue

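            # try to get arXiv details, retrying up to 10 times on failure
            arxiv_details = None
            for attempt in range(10):
                try:
                    arxiv_details = arxiv_utils.get_arXiv_details(url_list[0])
                    break
                except Exception:
                    print('arXiv connection failed. Trying again. Attempt: ' + str(attempt+1))
                    time.sleep(1)
            if arxiv_details is None: # all attempts failed: skip the tweet instead of reusing values from a previous tweet
                continue
            arxiv_title, author_list, paper_abstract = arxiv_details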
            
            #print('Compare: ' + ",".join(author_list) + ' vs. ' + user_name)
            author_closeness_metric = [author_overlap_metric(author,user_name) for author in author_list]

            if len(author_closeness_metric) == 0: # somehow link passed previous checks but still broken somehow... avoid
                continue
            if max(author_closeness_metric) > author_closeness_cut:
                #print('Passed author cut!')
                tweets.append((tweet, user_name, url_list, paper_abstract, arxiv_title))
    return tweets


def get_tweets_thread(tweets, add_threads_flag=True, single_tweet_token='<SNGTWT>',
                      multi_tweet_token='<MLTTWT>', LINK_TOKEN = '<LINK>'):
# 3rd run: look for underlying threads in Twitter

    tweetIDs = []
    texts = []
    links = []
    authorID = []
    authorName = []
    abstracts = []
    arxivTitle = []

    for i, (tweet, user_name, url_list, abstract, arxiv_title) in enumerate(tweets):

        # keep some useful information
        authorID.append(tweet.author_id)
        tweetIDs.append(tweet['id'])
        abstracts.append(abstract)
        authorName.append(user_name)
        arxivTitle.append(arxiv_title)
        
        first_tweet_text = tweet['text']
        first_tweet_text = clean_tweet_from_links(tweet, link_token=LINK_TOKEN) # in case we want to remove links

        # Look for additional tweets in the thread
        if add_threads_flag:
            query = 'conversation_id:' + str(tweet['id']) + ' from:' + str(tweet.author_id)
            #thread_tweets = client.search_all_tweets(query=query, tweet_fields=['author_id','entities','conversation_id'], 
            #                                         user_fields=['name'], expansions=['author_id'], max_results=10)
            create_date = tweet.created_at
            start_time = (create_date + timedelta(hours=-5)).strftime("%Y-%m-%dT%H:%M:%SZ")
            end_time = (create_date + timedelta(hours=15)).strftime("%Y-%m-%dT%H:%M:%SZ")

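            # request 'entities' as well, otherwise links in the thread tweets cannot be replaced by LINK_TOKEN
            thread_tweets = client.search_all_tweets(query=query, tweet_fields=['author_id','entities'], max_results=500, start_time=start_time, end_time=end_time)

            thread_text = ''
            if thread_tweets.data is not None: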
                for tweet_thread in thread_tweets.data:
                    thread_text = clean_tweet_from_links(tweet_thread, link_token=LINK_TOKEN) + thread_text #tweet_thread['text'] + thread_text
                texts.append(multi_tweet_token + first_tweet_text + thread_text)
            else:
                texts.append(single_tweet_token + first_tweet_text)
        else:
            texts.append(first_tweet_text)

        links.append(url_list[0])

        time.sleep(1) # somehow works better...

    texts = [text.replace('\n',' ') for text in texts]
    
    data = {'TweetID': tweetIDs,
        'AuthorID': authorID,
        'AuthorName': authorName,
        'Tweet': texts,
        'arxiv': links,
        'Abstract': abstracts,
        'Title': arxivTitle}

    df = pd.DataFrame(data)

    return df

def get_tweets_thread_list(tweets, add_threads_flag=True, single_tweet_token='<SNGTWT>',
                      multi_tweet_token='<MLTTWT>', LINK_TOKEN = '<LINK>'):
# 3rd run: look for underlying threads in Twitter

    tweetIDs = []
    tweets_text = []
    links = []
    authorID = []
    authorName = []
    abstracts = []
    arxivTitle = []

    for i, (tweet, user_name, url_list, abstract, arxiv_title) in enumerate(tweets):

        # keep some useful information
        authorID.append(tweet.author_id)
        tweetIDs.append(tweet['id'])
        abstracts.append(abstract)
        authorName.append(user_name)
        arxivTitle.append(arxiv_title)
        
        first_tweet_text = tweet['text']
        first_tweet_text = clean_tweet_from_links(tweet, link_token=LINK_TOKEN) # in case we want to remove links
        
        tweet_list = []
        # Look for additional tweets in the thread
        if add_threads_flag:
            query = 'conversation_id:' + str(tweet['id']) + ' from:' + str(tweet.author_id)
            #thread_tweets = client.search_all_tweets(query=query, tweet_fields=['author_id','entities','conversation_id'], 
            #                                         user_fields=['name'], expansions=['author_id'], max_results=10)
            create_date = tweet.created_at
            start_time = (create_date + timedelta(hours=-5)).strftime("%Y-%m-%dT%H:%M:%SZ")
            end_time = (create_date + timedelta(hours=15)).strftime("%Y-%m-%dT%H:%M:%SZ")

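            # request 'entities' as well, otherwise links in the thread tweets cannot be replaced by LINK_TOKEN
            thread_tweets = client.search_all_tweets(query=query, tweet_fields=['author_id','entities'], max_results=500, start_time=start_time, end_time=end_time)

            thread_text = ''
            if thread_tweets.data is not None: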
                for tweet_thread in thread_tweets.data:
                    tweet_list.insert(0, clean_tweet_from_links(tweet_thread, link_token=LINK_TOKEN)) #tweet_thread['text'] + thread_text
                #texts.append(multi_tweet_token + first_tweet_text + thread_text)
            #else:
                #texts.append(single_tweet_token + first_tweet_text)
        #else:
            #texts.append(first_tweet_text)
        tweet_list.insert(0, first_tweet_text)

        links.append(url_list[0])
        tweets_text.append(tweet_list)

        time.sleep(1) # somehow works better...

    #texts = [text.replace('\n',' ') for text in texts]
    
    data = {'TweetID': tweetIDs,
        'AuthorID': authorID,
        'AuthorName': authorName,
        'Tweets': tweets_text,
        'arxiv_link': links,
        'Abstract': abstracts,
        'Title': arxivTitle}

    df = pd.DataFrame(data)

    return df


In [5]:
years = [[2016, [1,12]],
         [2017, [1,12]],
         [2018, [1,12]],
         [2019, [1,12]],
         [2020, [1,12]],
         [2021, [1,12]],
         [2022, [1,4]]]

to_do_yr_mon = []
for year, months in years:
    for month in range(months[0],months[1]+1):
        to_do_yr_mon.append([year,month])

In [28]:
# Because of rate limits or connection problems, the process sometimes crashes. Months that fail stay in to_do_yr_mon, so this cell can be re-run until the list is empty.
count = 0
yr_mon = to_do_yr_mon.copy()

for yr, month in yr_mon:
    try:
        coarse_tweets = obtain_coarse_data_for(year=yr,month=month)
        tweets = clean_coarse_list(coarse_tweets)
        df = get_tweets_thread_list(tweets)
        df.to_csv('output_n/data'+str(yr)+'_'+str(month)+'.csv')
        to_do_yr_mon.remove([yr, month])
        time.sleep(1)
    except ConnectionError as err:
        print('------Error encountered. Will try again later-------')
        print(str(err)) # print the actual error, not the Exception class
    count = count + 1
    if count > 100:
        break
        
to_do_yr_mon

new query
Gave up on https://github.com/drprojects/DeepViewAgg
Gave up on https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model
Gave up on https://twitter.com/DeepMind/status/1519686445258231811
Gave up on https://www.linkedin.com/feed/update/urn:li:share:6927525467763716096
Gave up on https://github.com/jwwangchn/BONAI
Gave up on https://github.com/christophebedard/ros2-message-flow-analysis
Gave up on https://isorc2022.github.io/
Gave up on https://twitter.com/yayitsamyzhang/status/1520105865365270529
Gave up on https://twitter.com/PhilippSiedler/status/1519788983852711937/video/1
Gave up on https://twitter.com/PlotAstro/status/1518603424845152257
Gave up on https://tweetedtimes.com/v/9144
Gave up on https://schemathesis.io:443/
Gave up on https://github.com/philipph77/ACSE-Framework
Gave up on https://twitter.com/astroash42/status/1508718414055059458
Gave up on https://twitter.com/KhoaVuUmn/status/1484634165714853890
Gave up on https://www.youtube

[]
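
Since failed months simply stay in `to_do_yr_mon`, the cell above has to be re-run by hand until the list is empty. A minimal sketch of automating those passes, with a hypothetical `process_with_retries` helper and a stub worker in place of the real Twitter and arXiv calls:

```python
def process_with_retries(tasks, worker, max_passes=3, retry_on=(ConnectionError,)):
    """Run worker(task) for every task, retrying failed tasks for up to max_passes passes."""
    remaining = list(tasks)
    for _ in range(max_passes):
        failed = []
        for task in remaining:
            try:
                worker(task)
            except retry_on as err:
                print('Error for', task, ':', err)
                failed.append(task)
        remaining = failed
        if not remaining:
            break
    return remaining

# Stub worker: fails once for April 2022, then succeeds
attempts = {}
def flaky_worker(task):
    attempts[task] = attempts.get(task, 0) + 1
    if task == (2022, 4) and attempts[task] == 1:
        raise ConnectionError('simulated connection error')

left_over = process_with_retries([(2022, 3), (2022, 4)], flaky_worker)
print(left_over)
```

In practice a pause between passes would be sensible, given the rate limits mentioned above.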

In [128]:
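# The 'Tweets' column is stored in the CSV as the string representation of a list;
# ast.literal_eval turns it back into a list.
# `file` is not defined in this notebook; presumably it is a DataFrame loaded from one of the saved CSV files.

import ast
ast.literal_eval(file['Tweets'][0])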

['I have published a new blog entry on our recent paper (<LINK>) on #2HDM at <LINK> #np3']
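
The reason `ast.literal_eval` is needed: a list stored in a CSV cell is written as its string representation, so it comes back as a `str`. Below is a self-contained illustration using only the standard library, with a made-up row. With pandas, passing `converters={'Tweets': ast.literal_eval}` to `pd.read_csv` is one way to apply the same conversion while loading.

```python
import ast
import csv
import io

thread = ['First tweet with <LINK>', 'Second tweet, with a comma']

# Write one row whose 'Tweets' cell holds the list's string representation
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['TweetID', 'Tweets'])
writer.writeheader()
writer.writerow({'TweetID': 1, 'Tweets': str(thread)})

# Read it back: the cell is a plain string until it is parsed
buffer.seek(0)
row = next(csv.DictReader(buffer))
raw_cell = row['Tweets']
restored = ast.literal_eval(raw_cell)

print(type(raw_cell).__name__, type(restored).__name__)
print(restored == thread)
```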

# Load and put together

In [42]:
import glob
# All .csv files in output/ (note that the mining loop above writes to output_n/):
files = glob.glob("output/*.csv")

In [81]:
df_all = pd.read_csv(files[0])
for i in range(1,len(files)):
    df_i = pd.read_csv(files[i])
    if len(df_i) > 0:
        #df_all.append(pd.read_csv(files[i]))
        df_all = pd.concat([df_all, df_i])


In [85]:
df_all.to_csv('data.csv')


# Save to file

In [13]:
df = df.drop(columns=['index'])

In [14]:
df_all.to_csv('data_x.csv')
