# Using snscrape to Get Tweets

## Step 1: Make sure you are running Python 3.8 or higher.
If you are not, try updating your environment with:
!conda upgrade notebook 
or 
!conda update jupyter

In [1]:
# check version
!python -V

Python 3.8.1
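
The same check can be made from inside Python, which avoids depending on which `python` executable the shell resolves. A minimal sketch; the helper name `meets_minimum` is hypothetical:

```python
import sys

def meets_minimum(version_info, minimum=(3, 8)):
    """Return True when the (major, minor, ...) tuple is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)

# sys.version_info describes the interpreter running this notebook kernel
print(meets_minimum(sys.version_info))
```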


## Step 2: First-time use requires installing the dev version of snscrape
Uncomment and run the code below.

In [4]:
# !pip install git+https://github.com/JustAnotherArchivist/snscrape.git
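
Before reinstalling, it may help to check whether the package is already importable. A small sketch using only the standard library; `is_installed` is a hypothetical helper name:

```python
import importlib.util

def is_installed(module_name):
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None

print("snscrape installed:", is_installed("snscrape"))
```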

## Step 3: Imports and Functions
Please run the code blocks below.

In [3]:
import snscrape.modules.twitter as sntwitter
import pandas as pd
from datetime import date
import time

### Functions

In [338]:
def getPosts(newsOutlet, maxTweets, start, end):
    '''
    Based on MartinBeckUT's python wrapper for snscrape.
    Get an outlet's original tweets (no retweets or replies)
    posted between start and end.
    
    Inputs:
        - newsOutlet: string, name of twitter news source user
        - maxTweets: int, cut off for search results
        - start: string ('YYYY-MM-DD'), first date to include in search
        - end: string ('YYYY-MM-DD'), first date to exclude in search
        
    Returns: tuple ( tweets_df, replies_list )
        - tweets_df: a pandas dataframe with columns: 
            Datetime, TweetId, Text, Username, 
            MentionedUsers, ConversationId
        - replies_list: list, all unique conversationIds from the tweets
        
    '''     
    # Creating list to append tweet data to
    tweets_list = []
    replies_list = []

    # Using TwitterSearchScraper to scrape data and append tweets to list
    for i, tweet in enumerate(sntwitter
                              .TwitterSearchScraper(f'from:{ newsOutlet } since:{start} ' +
                                               f'until:{end} -is:retweet -is:reply')
                              .get_items()):
        # stop once maxTweets results have been collected
        if i >= maxTweets:
            break
            
        mentions = ''

        if tweet.id != tweet.conversationId:
            if tweet.mentionedUsers:
                for user in tweet.mentionedUsers:
                    mentions += (' ' + user.username)

        text = tweet.content
        text = text.replace(',', '')
        text = text.replace('"', '')
        text = text.replace("'", '')
        
        tweets_list.append([tweet.date, 
                            str(tweet.id), 
                            text, 
                            tweet.user.username,
                            mentions,
                            str(tweet.conversationId)])
        
        if tweet.conversationId not in replies_list:
            replies_list.append(tweet.conversationId)
    
    # Creating a dataframes from the tweets lists above
    tweets_df = pd.DataFrame(tweets_list, columns=['Datetime', 
                                                    'TweetId', 
                                                    'Text', 
                                                    'Username',
                                                    'MentionedUsers',
                                                    'ConversationId'])
    
    return ( tweets_df, replies_list )
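
The `enumerate`/`break` cut-off used in `getPosts` can also be written with `itertools.islice`, which caps any iterator at a fixed number of items. A minimal sketch, with a plain generator standing in for `get_items()` (the names `fake_items` and `take` are made up for illustration):

```python
from itertools import islice

def fake_items():
    """Stand-in for scraper.get_items(): an endless stream of results."""
    n = 0
    while True:
        yield n
        n += 1

def take(items, max_items):
    """Collect at most `max_items` results from any iterator."""
    return list(islice(items, max_items))

print(take(fake_items(), 3))  # [0, 1, 2]
```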


In [67]:
def getReplies(newsOutlet, convsID, start, end, maxTweets=2000):
    '''
    Extension of the original scraper code
    to pull all replies in a conversation.
    
    Inputs:
         - newsOutlet: string, username of the account that posted the original tweet
         - convsID: string or int, conversationId from original post
         - start: string ('YYYY-MM-DD'), currently unused (the query has no date filter)
         - end: string ('YYYY-MM-DD'), currently unused (the query has no date filter)
         - maxTweets: int, cut off for search results
     
    Returns: tweets_list: list of lists,
         representing each reply in the original thread.
         Lowest level lists contain:
             Datetime, TweetId, text, username, 
             newsOutlet, MentionedUsers, conversationId
    '''
       
    # Creating list to append tweet data to
    tweets_list = []

    # Using TwitterSearchScraper to scrape data and append tweets to list

    scraper_instance = sntwitter.TwitterSearchScraper(f'lang:en conversation_id:{convsID} ' + 
                                                      f'(filter:safe OR -filter:safe) -is:retweet')

    
    for i, tweet in enumerate(scraper_instance.get_items()):
        # stop once maxTweets results have been collected
        if i >= maxTweets:
            break
        
        mentions = []

        if tweet.id != tweet.conversationId:
            if tweet.mentionedUsers is not None:
                for user in tweet.mentionedUsers:
                    mentions.append(user.username)
            
            text = tweet.content
            text = text.replace(',', '')
            text = text.replace('"', '')
            text = text.replace("'", '')
            
            tweets_list.append([tweet.date,
                                str(tweet.id), 
                                text, 
                                tweet.user.username,
                                newsOutlet,
                                " ".join(mentions),
                                str(tweet.conversationId)])
    
    
    #scraper_instance._unset_guest_token()
    
    return tweets_list
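
The mention handling can be isolated into a small helper, which makes the `None` case easy to test. A sketch under the assumption that each mentioned user exposes a `username` attribute, as the code above uses; `SimpleNamespace` objects stand in for snscrape's user objects, and `join_mentions` is a hypothetical name:

```python
from types import SimpleNamespace

def join_mentions(mentioned_users):
    """Return space-separated usernames, or '' when there are no mentions."""
    if not mentioned_users:          # covers both None and an empty list
        return ''
    return ' '.join(user.username for user in mentioned_users)

# Stand-in user objects for illustration only
users = [SimpleNamespace(username='nytimes'), SimpleNamespace(username='Reuters')]
print(join_mentions(users))       # nytimes Reuters
print(repr(join_mentions(None)))  # ''
```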
    

## Step 4: Define your search ranges under "Globals"
The START_DATE is included. The END_DATE is excluded.
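
Because the end date is exclusive, the two values describe a half-open interval. A quick sketch of how many days a given pair covers (the helper name `days_covered` is hypothetical):

```python
from datetime import date

def days_covered(start, end):
    """Number of days in the half-open range [start, end), given 'YYYY-MM-DD' strings."""
    return (date.fromisoformat(end) - date.fromisoformat(start)).days

# April has 30 days; May 1 itself is not searched
print(days_covered('2021-04-01', '2021-05-01'))  # 30
```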

### Globals

In [639]:
MAX_TWEETS = 50000
START_DATE = '2021-04-01'
END_DATE = '2021-05-01'

FILE_SUFFIX = "_Apr2021"

## Step 5: Run the scraper for each news source

Due to issues with the scraper instance, I'm running each source separately and saving the results each time in case of interruptions. In the final step I merge the dataframes.

Each news source takes considerable time to run.
An error of "unable to find guest token" may indicate an issue with the scraper instance.
This happens frequently in retrieving replies.

#### What to do when you get "guest token" error in replies:
Processed conversation ids are recorded. If an error occurs, run the test code
directly below the replies loop to see if the full list of conversations was captured.
If not, run the replies code again until all have been saved.
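
The check described above amounts to a set difference between the conversation IDs collected in 5.1 and the IDs already processed. A minimal sketch with made-up IDs; `remaining_ids` is a hypothetical helper:

```python
def remaining_ids(all_ids, processed_ids):
    """Return the IDs still to be fetched, preserving their original order."""
    done = set(processed_ids)
    return [conv_id for conv_id in all_ids if conv_id not in done]

all_ids = [101, 102, 103, 104]   # made-up conversation IDs
processed = [101, 103]
print(remaining_ids(all_ids, processed))  # [102, 104]
```

Unlike comparing list lengths, this also reports which conversations are still missing.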

#### Immediate "guest token" errors without any scraper activity:
Sometimes successive runs are strangely sticky.
A short little request sometimes unsticks it.

Try something like:
getReplies("nytimes", 1377405371279568898, START_DATE, END_DATE, 5)

Or even just run something else, like:
print(MAX_TWEETS)

Keep doing this for a few minutes until you get a response, then resume.
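
The manual wait-and-try-again routine could also be automated. The sketch below is one possible approach, not part of snscrape: `retry` is a hypothetical helper that calls a function, waits after a failure, and gives up after a fixed number of attempts. It assumes the guest-token error surfaces as a Python exception. The `sleep` argument is injectable so the demonstration runs instantly.

```python
import time

def retry(func, attempts=3, wait_seconds=400, sleep=time.sleep):
    """Call func(); on an exception, wait and try again, up to `attempts` times."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as error:   # assumption: scraper failures raise exceptions
            last_error = error
            print(f'Attempt {attempt} failed: {error}')
            if attempt < attempts:
                sleep(wait_seconds)
    raise last_error

# Demonstration with a function that fails twice before succeeding
calls = {'count': 0}

def flaky():
    calls['count'] += 1
    if calls['count'] < 3:
        raise RuntimeError('unable to find guest token')
    return 'ok'

print(retry(flaky, sleep=lambda seconds: None))  # ok
```

In the notebook, the wrapped call would be something like `retry(lambda: getReplies("nytimes", conv_id, START_DATE, today, MAX_TWEETS))`.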


### Data Gathering:

#### 5.1: All Sources Original Posts

In [587]:
# pull fox news data
fox_tweets, fox_conv_ids = getPosts("FoxNews", 
                                   MAX_TWEETS, 
                                   START_DATE, 
                                   END_DATE)

print("Fox Posts: ", fox_tweets.shape)

Fox Posts:  (841, 6)


In [588]:
# pull new york times post data
nyt_tweets, nyt_conv_ids = getPosts("nytimes", 
                                     MAX_TWEETS, 
                                     START_DATE, 
                                     END_DATE)

print("NY Times Posts: ", nyt_tweets.shape)

NY Times Posts:  (2682, 6)


In [640]:
# pull reuters post data
reuters_tweets, reuters_conv_ids = getPosts("Reuters", 
                                     MAX_TWEETS, 
                                     START_DATE, 
                                     END_DATE)

print("Reuters Posts: ", reuters_tweets.shape)

Reuters Posts:  (12870, 6)


In [590]:
# merging news sources
posts_df = pd.concat([fox_tweets, nyt_tweets, reuters_tweets], ignore_index=True)

# export dataframe into a CSV
posts_df.to_csv(f'./data/tweets{FILE_SUFFIX}.csv',
                index=False, quotechar='"')

#### 5.2: Replies Set up

In [591]:
# set up replies list
replies = []
processed_ids = []

# Today's date as 'YYYY-MM-DD'
today = date.today().isoformat()


# if this is a continuation of a prior run
# uncomment and run the below
# proc_ids = pd.read_csv("./data/processed_ids.csv")
# processed_ids = proc_ids['conv_id'].tolist()

#### 5.2.1: New York Times Replies

In [595]:
#time.sleep(600)
# add nyt replies 
# NOTE: if this breaks, run again until all ids are cleared

for conv_id in nyt_conv_ids:
    if conv_id not in processed_ids:

        replies.extend(getReplies("nytimes",
                                  conv_id,
                                  START_DATE,
                                  today,
                                  MAX_TWEETS))

        processed_ids.append(conv_id)
    
        time.sleep(0.05)


In [596]:
# Test for done-ness

if len(nyt_conv_ids) == len(processed_ids):
    print("All NY Times replies retrieved. Move forward!")
else:
    print(f'{len(nyt_conv_ids) - len(processed_ids)} conversations ' +
         'left to process. Run Again.')

All NY Times replies retrieved. Move forward!


In [597]:
# convert reply list to dataframe
replies_df = pd.DataFrame(replies, columns=['Datetime',
                                            'TweetId',
                                            'Text',
                                            'Username',
                                            'NewsOutlet',
                                            'MentionedUsers',
                                            'ConversationId'])

# export
replies_df.to_csv(f'./data/replies{FILE_SUFFIX}_nyt.csv', sep=',', index=False)
nyt = replies_df

# clear
replies = []
processed_ids = []

In [598]:
nyt_articles = nyt.groupby('ConversationId')['TweetId'].count()
nyt_articles = nyt_articles.reset_index()

print("unique posts: ", nyt_tweets.shape[0])
print("unique reply threads: ", nyt_articles.shape[0])
print("matches: ", nyt_tweets.merge(nyt_articles, 
                                    on='ConversationId', 
                                    how='inner').shape[0])

unique posts:  2682
unique reply threads:  2347
matches:  2674
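
A note on reading these counts: `matches` counts posts, not threads, so it can exceed the number of unique reply threads when several posts share one conversation ID (presumably when an outlet continues its own thread). A pure-Python sketch of the same bookkeeping with made-up IDs:

```python
# Made-up conversation IDs: posts 'a' and 'b' belong to the same conversation
post_conversations = {'a': 1, 'b': 1, 'c': 2, 'd': 3}
reply_conversations = [1, 1, 1, 2]   # conversation 3 has no replies

threads_with_replies = set(reply_conversations)
matches = [post for post, conv in post_conversations.items()
           if conv in threads_with_replies]

print(len(post_conversations))    # 4 unique posts
print(len(threads_with_replies))  # 2 unique reply threads
print(len(matches))               # 3 matches
```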


#### 5.2.2: Reuters Replies

In [665]:
#time.sleep(600)
# add reuters replies
# NOTE: if this breaks, run again until all ids are cleared

for conv_id in reuters_conv_ids:
    if conv_id not in processed_ids:
        replies.extend(getReplies("Reuters",
                                  conv_id,
                                  START_DATE,
                                  today,
                                  MAX_TWEETS))

        processed_ids.append(conv_id)
    
        time.sleep(0.05)


In [666]:
# test for done-ness again

if len(reuters_conv_ids) == len(processed_ids):
    print("All Reuters replies retrieved. Move forward!")
else:
    print(f'{len(reuters_conv_ids) - len(processed_ids)} conversations ' +
         'left to process. Run Again.')

All Reuters replies retrieved. Move forward!


In [667]:
# convert reply list to dataframe
replies_df = pd.DataFrame(replies, columns=['Datetime',
                                            'TweetId',
                                            'Text',
                                            'Username',
                                            'NewsOutlet',
                                            'MentionedUsers',
                                            'ConversationId'])

# export
replies_df.to_csv(f'./data/replies{FILE_SUFFIX}_reu.csv', sep=',', index=False)
reu = replies_df

#clear
replies = []
processed_ids = []

In [668]:
reu_articles = reu.groupby('ConversationId')['TweetId'].count()
reu_articles = reu_articles.reset_index()

print("unique posts: ", reuters_tweets.shape[0])
print("unique reply threads: ", reu_articles.shape[0])
print("matches: ", reuters_tweets.merge(reu_articles, 
                                    on='ConversationId', 
                                    how='inner').shape[0])

unique posts:  12870
unique reply threads:  11392
matches:  11560


#### 5.2.3: Fox News Replies

In [599]:
#time.sleep(500)
# add fox replies
# NOTE: if this breaks, run again until all ids are cleared

for conv_id in fox_conv_ids:
    if conv_id not in processed_ids:
        replies.extend(getReplies("FoxNews",
                                  conv_id,
                                  START_DATE,
                                  today,
                                  MAX_TWEETS))

        processed_ids.append(conv_id)
    
        time.sleep(0.05)


In [600]:
# test for done-ness again

if (len(fox_conv_ids) == len(processed_ids)):
    print("All Fox News replies retrieved. Move forward!")
else:
    print(f'{(len(fox_conv_ids) - len(processed_ids))} conversations ' +
         'left to process. Run Again.')

All Fox News replies retrieved. Move forward!


In [601]:
# convert reply list to dataframe
replies_df = pd.DataFrame(replies, columns=['Datetime',
                                            'TweetId',
                                            'Text',
                                            'Username',
                                            'NewsOutlet',
                                            'MentionedUsers',
                                            'ConversationId'])

# export
replies_df.to_csv(f'./data/replies{FILE_SUFFIX}_fox.csv', sep=',', index=False)
fox = replies_df

# clear
replies = []
processed_ids = []

In [602]:
fox_articles = fox.groupby('ConversationId')['TweetId'].count()
fox_articles = fox_articles.reset_index()

print("unique posts: ", fox_tweets.shape[0])
print("unique reply threads: ", fox_articles.shape[0])
print("matches: ", fox_tweets.merge(fox_articles, 
                                    on='ConversationId', 
                                    how='inner').shape[0])

unique posts:  841
unique reply threads:  841
matches:  841


# Other notes and code snippets

## If you need to stop during Step 5, save the processed IDs and resume later

In [304]:
# save processed ids in case of an interrupted run
proc_ids = pd.DataFrame({"conv_id": processed_ids})
proc_ids.to_csv("./data/processed_ids.csv", sep=',', index=False)
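
For reference, the same checkpoint can be written and read back with only the standard library. A sketch that uses a temporary directory rather than `./data`; the function names are hypothetical, and the `conv_id` column name follows the cell above:

```python
import csv
import os
import tempfile

def save_ids(path, ids):
    """Write one conversation ID per row under a 'conv_id' header."""
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['conv_id'])
        writer.writerows([conv_id] for conv_id in ids)

def load_ids(path):
    """Read the IDs back as integers."""
    with open(path, newline='') as f:
        return [int(row['conv_id']) for row in csv.DictReader(f)]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'processed_ids.csv')
    save_ids(path, [1377405371279568898, 1290794516312739840])
    print(load_ids(path))
```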

## For ease of use, you can join them all now (or later)
...but without compression this file is too big to share.

In [423]:
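# remove characters that sometimes break the CSV file
for dataset in [nyt, fox, reu]:
    # regex=True is passed explicitly because the default differs between pandas versions
    dataset['Text'] = dataset['Text'].str.replace(r"[,'\"]+", "", regex=True)
    dataset['Text'] = dataset['Text'].str.replace(r"[\r\n]+", "", regex=True)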


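
A sketch of what the two patterns do to a single string, using the standard `re` module instead of pandas (the helper name `clean_text` is hypothetical):

```python
import re

def clean_text(text):
    """Strip commas, quotes, and line breaks, mirroring the two patterns above."""
    text = re.sub(r"[,'\"]+", "", text)
    text = re.sub(r"[\r\n]+", "", text)
    return text

print(clean_text('He said, "it\'s fine"\r\nreally'))  # He said its finereally
```

Note that line breaks are removed without inserting a space, so words on either side of a break are joined.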


In [None]:
# import all if needed
reu = pd.read_csv(f'./data/replies{FILE_SUFFIX}_reu.csv', quotechar='"')
fox = pd.read_csv(f'./data/replies{FILE_SUFFIX}_fox.csv', quotechar='"')
nyt = pd.read_csv(f'./data/replies{FILE_SUFFIX}_nyt.csv', quotechar='"')

In [532]:
# export all together
all_replies = pd.concat([reu, fox, nyt], ignore_index=True)
all_replies.to_csv(f'./data/replies{FILE_SUFFIX}.csv', sep=',', index=False)

## Get Tweets filtered by Topic/Keywords

In [80]:
def getPosts_custom_query(maxTweets, start, end):
    # Creating list to append tweet data to
    tweets_list = []
    
    query = f'(from:nytimes OR from:FoxNews OR from:CNN \
    OR from:BBCBreaking OR from:Reuters OR from:washingtonpost \
    OR from:realDailyWire) AND (black OR Trump OR Chauvin OR Floyd OR Biden OR \
    asian OR attacked OR killed OR arrested) -is:retweet -is:reply since:{start} until:{end}'
    
    try:
        # Using TwitterSearchScraper to scrape data and append tweets to list
        for i, tweet in enumerate(sntwitter
                                  .TwitterSearchScraper(query)
                                  .get_items()):
            # stop once maxTweets results have been collected
            if i >= maxTweets:
                break

            mentions = ''

            if tweet.id != tweet.conversationId:
                if tweet.mentionedUsers:
                    for user in tweet.mentionedUsers:
                        mentions += (' ' + user.username)

            text = tweet.content
            text = text.replace(',', '')
            text = text.replace('"', '')
            text = text.replace("'", '')

            tweets_list.append([tweet.date, 
                                str(tweet.id), 
                                text, 
                                tweet.user.username,
                                mentions,
                                str(tweet.conversationId)])
    
    except Exception as e:
        # report the error, then pause before the caller tries again
        print(e)
        time.sleep(400)
        return ([], '500')
    
    # Creating a dataframes from the tweets lists above
    tweets_df = pd.DataFrame(tweets_list, columns=['Datetime', 
                                                    'TweetId', 
                                                    'Text', 
                                                    'Username',
                                                    'MentionedUsers',
                                                    'ConversationId'])
    
    return (tweets_df, '200')
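
The hard-coded query could also be assembled from lists, which makes it easier to change the accounts or keywords. A sketch that only builds the string; `build_query` is a hypothetical helper and the operators are copied from the query above:

```python
def build_query(accounts, keywords, start, end):
    """Build a search string of the form used in getPosts_custom_query."""
    sources = ' OR '.join(f'from:{name}' for name in accounts)
    terms = ' OR '.join(keywords)
    return (f'({sources}) AND ({terms}) '
            f'-is:retweet -is:reply since:{start} until:{end}')

print(build_query(['nytimes', 'Reuters'], ['Biden', 'Trump'],
                  '2020-08-01', '2020-08-05'))
```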

In [81]:
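import time
import datetime as dt
from datetime import datetime

epochs = 60
data_frames ={}
start_time = datetime.strptime('2020-08-01', '%Y-%m-%d')
# NOTE: `until:` is exclusive, so each window covers 4 days and every
# fifth day (2020-08-05, 2020-08-10, ...) falls between windows
end_time = datetime.strptime('2020-08-05', '%Y-%m-%d')

data = pd.DataFrame([], columns=['Datetime', 'TweetId','Text', 'Username','MentionedUsers','ConversationId'])


for i in range(epochs):
    start_string = datetime.strftime(start_time + dt.timedelta(days=5*i), '%Y-%m-%d')
    end_string = datetime.strftime(end_time + dt.timedelta(days=5*i), '%Y-%m-%d')
    
    print('Requesting: {} epoch'.format(i))
    print('Start date: {}'.format(start_string))
    print('End date: {}'.format(end_string))
    
    # the cut-off is the global MAX_TWEETS defined in Step 4
    tweets_df, code = getPosts_custom_query(MAX_TWEETS, start_string, end_string)
    
    # the status code is a string; on failure, retry the same window
    # (getPosts_custom_query has already slept before returning '500')
    while code == '500':
        tweets_df, code = getPosts_custom_query(MAX_TWEETS, start_string, end_string)
    
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    data = pd.concat([data, tweets_df])
    print('Observations added:{}'.format(len(tweets_df)))
    
    print('')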
        
    
    

Requesting: 0 epoch
Start date: 2020-08-01
End date: 2020-08-05
Observations added:411

Requesting: 1 epoch
Start date: 2020-08-06
End date: 2020-08-10
Observations added:447

Requesting: 2 epoch
Start date: 2020-08-11
End date: 2020-08-15
Observations added:661

Requesting: 3 epoch
Start date: 2020-08-16
End date: 2020-08-20
Observations added:561

Requesting: 4 epoch
Start date: 2020-08-21
End date: 2020-08-25
Observations added:511

Requesting: 5 epoch
Start date: 2020-08-26
End date: 2020-08-30
Observations added:679

Requesting: 6 epoch
Start date: 2020-08-31
End date: 2020-09-04
Observations added:530

Requesting: 7 epoch
Start date: 2020-09-05
End date: 2020-09-09
Observations added:318

Requesting: 8 epoch
Start date: 2020-09-10
End date: 2020-09-14
Observations added:417

Requesting: 9 epoch
Start date: 2020-09-15
End date: 2020-09-19
Observations added:552

Requesting: 10 epoch
Start date: 2020-09-20
End date: 2020-09-24
Observations added:499

Requesting: 11 epoch
Start date
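
Since `until:` excludes its date, consecutive windows only tile the calendar without gaps if each window's end equals the next window's start. A sketch of generating such windows (the helper name `date_windows` is hypothetical):

```python
from datetime import date, timedelta

def date_windows(first_day, window_days, count):
    """Yield (start, end) strings for back-to-back half-open windows [start, end)."""
    start = date.fromisoformat(first_day)
    step = timedelta(days=window_days)
    for _ in range(count):
        yield start.isoformat(), (start + step).isoformat()
        start += step

for start, end in date_windows('2020-08-01', 5, 3):
    print(start, end)
# 2020-08-01 2020-08-06
# 2020-08-06 2020-08-11
# 2020-08-11 2020-08-16
```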

In [83]:
data.to_csv('data/tweets_first_round.csv', index=False)

In [86]:
data

Unnamed: 0,Datetime,TweetId,Text,Username,MentionedUsers,ConversationId
0,2020-08-04 23:39:29+00:00,1290794516312739840,Dramatic video shows a massive explosion near ...,CNN,,1290794516312739840
1,2020-08-04 23:15:55+00:00,1290788587873730561,Sen. Lindsey Graham was tasked with talking Pr...,CNN,,1290788587873730561
2,2020-08-04 23:15:32+00:00,1290788491282898944,🎥 WALSH: George Floyd Bodycam Footage Is Out. ...,realDailyWire,,1290788491282898944
3,2020-08-04 22:59:03+00:00,1290784342806089730,Many thought Trump’s Axios interview felt like...,washingtonpost,,1290784342806089730
4,2020-08-04 22:47:10+00:00,1290781350228762624,“When a woman is in a role that’s new people u...,nytimes,,1290009491354710017
...,...,...,...,...,...,...
40,2021-05-23 01:35:00+00:00,1396278456493789185,WARNING: GRAPHIC CONTENT — Disturbing video of...,Reuters,,1396278456493789185
41,2021-05-23 01:25:00+00:00,1396275939684745218,ICYMI: President Biden went to a Ford test tra...,Reuters,,1396275939684745218
42,2021-05-23 01:00:00+00:00,1396269649268989957,Biden administration grants temporary protecte...,FoxNews,,1396269649268989957
43,2021-05-23 00:01:07+00:00,1396254831606747138,President Biden took an electric Ford pickup t...,CNN,,1396254831606747138


In [87]:
#conversations = list(data.ConversationId.unique())
#processed_ids =[]
#replies = []

In [113]:
for conversation in conversations:
    
    if conversation not in processed_ids:
        
        try:
            replies.extend(getReplies("",conversation,
                                  '','',2000))
            processed_ids.append(conversation)
            
        except Exception as e:
            print('First attempt unsuccessful')
            print(e)
            time.sleep(400)
            try:
                replies.extend(getReplies("",conversation,
                                  '','',2000))
                processed_ids.append(conversation)
            except Exception as e:
                print('Second attempt unsuccessful')
                print(e)
                time.sleep(400)
            
        
    

In [115]:
replies_df = pd.DataFrame(replies, columns=['Datetime',
                                            'TweetId',
                                            'Text',
                                            'Username',
                                            'NewsOutlet',
                                            'MentionedUsers',
                                            'ConversationId'])

In [91]:
#replies_df.to_csv('data/replies_first_round.csv', index=False)

In [114]:
len(conversations) - len(processed_ids)

0

In [116]:
len(replies)

4578898