# Using snscrape to Get Tweets

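## Step 1: Make sure you are running Python 3.8 or higher.
If you are not, update your environment. The commands below update the notebook tooling:
!conda upgrade notebook 
or 
!conda update jupyter
If the version check still reports an older Python, installing the interpreter directly may be needed (for example `conda install python=3.8` from a terminal).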

In [1]:
# check version
!python -V

Python 3.8.2
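
The same check can be done from inside Python, which helps when the notebook kernel uses a different interpreter than the `python` found on the shell path. This is a minimal sketch; the helper name `meets_minimum` is made up for illustration.

```python
import sys

def meets_minimum(version_info, minimum=(3, 8)):
    """Return True if the (major, minor, ...) tuple is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)

# sys.version_info describes the interpreter running this kernel
status = "OK" if meets_minimum(sys.version_info) else "too old"
print(sys.version.split()[0], status)
```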


## Step 2: First-time use requires installing the dev version of snscrape
Uncomment and run the code below.

In [2]:
# !pip install git+https://github.com/JustAnotherArchivist/snscrape.git
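
To decide whether the install cell needs to be run at all, the import system can be asked whether the package is already present. This is a sketch using only the standard library; `is_installed` is a hypothetical helper name, not part of snscrape.

```python
import importlib.util

def is_installed(package_name):
    """Return True if a top-level package can be found by the import system."""
    return importlib.util.find_spec(package_name) is not None

# False means the pip install cell above still needs to be run
print("snscrape installed:", is_installed("snscrape"))
```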

## Step 3: Imports and Functions
Please run the code blocks below.

In [3]:
import snscrape.modules.twitter as sntwitter
import pandas as pd
from datetime import date
import time

### Functions

In [4]:
def getPosts(newsOutlet, maxTweets, start, end):
    '''
    Based on MartinBeckUT's python wrapper for snscrape.
    Get original tweets (no retweets or replies) from one account in a date range.
    
    Inputs:
        - newsOutlet: string, name of twitter news source user
        - maxTweets: int, cut off for search results
        - start: string ('YYYY-MM-DD'), first date to include in search
        - end: string ('YYYY-MM-DD'), first date to exclude in search
        
    Returns: tuple ( tweets_df, replies_list )
        - tweets_df: a pandas dataframe with columns: 
            Datetime, TweetId, Text, Username, 
            MentionedUsers, ConversationId
        - replies_list: list, unique conversationIds from the tweets
        
    '''     
    # Creating list to append tweet data to
    tweets_list = []
    replies_list = []

    # Using TwitterSearchScraper to scrape data and append tweets to list
    for i, tweet in enumerate(sntwitter
                              .TwitterSearchScraper(f'from:{ newsOutlet } since:{start} ' +
                                               f'until:{end} -is:retweet -is:reply')
                              .get_items()):
        # stop once maxTweets tweets have been collected
        if i >= maxTweets:
            break
            
        # space-separated usernames mentioned in the tweet
        # (left empty when the tweet is the root of its conversation)
        mentions = ''

        if tweet.id != tweet.conversationId and tweet.mentionedUsers:
            mentions = ' '.join(user.username for user in tweet.mentionedUsers)

        text = tweet.content
        text = text.replace(',', '')
        text = text.replace('"', '')
        text = text.replace("'", '')
        
        tweets_list.append([tweet.date, 
                            str(tweet.id), 
                            text, 
                            tweet.user.username,
                            mentions,
                            str(tweet.conversationId)])
        
        if tweet.conversationId not in replies_list:
            replies_list.append(tweet.conversationId)
    
    # Creating a dataframe from the tweets list above
    tweets_df = pd.DataFrame(tweets_list, columns=['Datetime', 
                                                    'TweetId', 
                                                    'Text', 
                                                    'Username',
                                                    'MentionedUsers',
                                                    'ConversationId'])
    
    return ( tweets_df, replies_list )


In [5]:
def getReplies(newsOutlet, convsID, start, end, maxTweets=2000):
    '''
    Extension of the original scraper code
    to pull all replies in a conversation.
    
    Inputs:
         - newsOutlet: string, username of the account that posted the original tweet
         - convsID: string or int, conversationId from the original post
         - start: string ('YYYY-MM-DD'), first date to include in search
           (not used in the current query)
         - end: string ('YYYY-MM-DD'), first date to exclude in search
           (not used in the current query)
         - maxTweets: int, cut off for search results
     
    Returns: tweets_list: list of lists,
        representing each reply in the original thread.
        Lowest level lists contain:
            Datetime, TweetId, Text, Username, 
            NewsOutlet, MentionedUsers, ConversationId
    '''
       
    # Creating list to append tweet data to
    tweets_list = []

    # Using TwitterSearchScraper to scrape data and append tweets to list

    scraper_instance = sntwitter.TwitterSearchScraper(f'lang:en conversation_id:{convsID} ' + 
                                                      f'(filter:safe OR -filter:safe) -is:retweet')

    
    for i, tweet in enumerate(scraper_instance.get_items()):
        # stop once maxTweets tweets have been seen
        if i >= maxTweets:
            break
        
        mentions = []

        if tweet.id != tweet.conversationId:
            if tweet.mentionedUsers is not None:
                for user in tweet.mentionedUsers:
                    mentions.append(user.username)
            
            text = tweet.content
            text = text.replace(',', '')
            text = text.replace('"', '')
            text = text.replace("'", '')
            
            tweets_list.append([tweet.date,
                                str(tweet.id), 
                                text, 
                                tweet.user.username,
                                newsOutlet,
                                " ".join(mentions),
                                str(tweet.conversationId)])
    
    
    #scraper_instance._unset_guest_token()
    
    return tweets_list
    

## Step 4: Define your search ranges under "Globals"
The START_DATE is included. The END_DATE is excluded.
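
The include/exclude rule describes a half-open range: with the values used in this notebook, 2021-01-31 falls inside the window and 2021-02-01 does not. A small sketch of that rule with the standard library follows; `in_search_window` is a made-up helper for illustration and is not something the scraper calls.

```python
from datetime import date

def in_search_window(day, start, end):
    """True if `day` falls in the half-open range [start, end)."""
    first = date.fromisoformat(start)  # included
    stop = date.fromisoformat(end)     # excluded
    return first <= day < stop

print(in_search_window(date(2021, 1, 31), "2021-01-01", "2021-02-01"))  # True
print(in_search_window(date(2021, 2, 1), "2021-01-01", "2021-02-01"))   # False
```

To cover a full calendar month, set END_DATE to the first day of the following month.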

### Globals

In [6]:
MAX_TWEETS = 50000
START_DATE = '2021-01-01'
END_DATE = '2021-02-01'

FILE_SUFFIX = "_Jan2021"

## Step 5: Run the scraper for each news source

Due to issues with the scraper instance, I'm running each source separately and saving the results each time in case of interruptions. In the final step I merge the dataframes.

Each news source takes considerable time to run.
An error of "unable to find guest token" may indicate an issue with the scraper instance.
This happens frequently in retrieving replies.

#### What to do when you get "guest token" error in replies:
Processed conversation ids are recorded. If an error occurs, run the "Test for done-ness" cell
that follows the replies loop to see if the full list of conversations was captured.
If not, run the replies code again until all have been saved.
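
Re-running by hand can also be approximated with a small retry wrapper. The sketch below is one possible shape, not part of snscrape: `retry_call` is a hypothetical helper, and it catches every exception because the exact exception type raised for the guest token error is not assumed here.

```python
import time

def retry_call(func, *args, attempts=3, wait_seconds=1.0, sleep=time.sleep):
    """Call func(*args); on an exception wait and try again, up to `attempts` times."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func(*args)
        except Exception as error:  # broad on purpose, see the note above
            last_error = error
            if attempt < attempts:
                # wait a little longer after each failed attempt
                sleep(wait_seconds * attempt)
    raise last_error
```

In this notebook it could wrap a single conversation, for example `retry_call(getReplies, "nytimes", conv_id, START_DATE, today, MAX_TWEETS)`, so that one failed request does not end the whole loop.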

#### Immediate "guest token" errors without any scraper activity:
Sometimes successive runs fail immediately, as if the scraper were stuck.
A short request sometimes unsticks it.

Try something like:
getReplies("nytimes", 1377405371279568898, START_DATE, END_DATE, 5)

Or even just run something else, like:
print(MAX_TWEETS)

Keep trying for a few minutes until you get a response, then resume.


### Data Gathering:

#### 5.1: All Sources Original Posts

In [7]:
# pull fox news data
fox_tweets, fox_conv_ids = getPosts("FoxNews", 
                                   MAX_TWEETS, 
                                   START_DATE, 
                                   END_DATE)

print("Fox Posts: ", fox_tweets.shape)

Fox Posts:  (637, 6)


In [8]:
# pull new york times post data
nyt_tweets, nyt_conv_ids = getPosts("nytimes", 
                                     MAX_TWEETS, 
                                     START_DATE, 
                                     END_DATE)

print("NY Times Posts: ", nyt_tweets.shape)

NY Times Posts:  (2647, 6)


In [9]:
# pull reuters post data
reuters_tweets, reuters_conv_ids = getPosts("Reuters", 
                                     MAX_TWEETS, 
                                     START_DATE, 
                                     END_DATE)

print("Reuters Posts: ", reuters_tweets.shape)

Reuters Posts:  (13252, 6)


In [10]:
# merging news sources
posts_df = pd.concat([fox_tweets, nyt_tweets, reuters_tweets], ignore_index=True)

# export dataframe into a CSV
posts_df.to_csv(f'./data/tweets{FILE_SUFFIX}.csv',
                index=False, quotechar='"')

#### 5.2: Replies Set up

In [11]:
# set up replies list
replies = []
processed_ids = []

# Today as 'YYYY-MM-DD'
today = date.today().isoformat()


# if this is a continuation of a prior run
# uncomment and run the below
# proc_ids = pd.read_csv("./data/processed_ids.csv")
# processed_ids = proc_ids['conv_id'].tolist()

#### 5.2.1: New York Times Replies

In [14]:
time.sleep(600)
# add nyt replies 
# NOTE: if this breaks, run again until all ids are cleared

for conv_id in nyt_conv_ids:
    if conv_id not in processed_ids:

        replies.extend(getReplies("nytimes",
                                  conv_id,
                                  START_DATE,
                                  today,
                                  MAX_TWEETS))

        processed_ids.append(conv_id)
    
        time.sleep(0.05)


In [15]:
# Test for done-ness

if len(nyt_conv_ids) == len(processed_ids):
    print("All NY Times replies retrieved. Move forward!")
else:
    print(f'{len(nyt_conv_ids) - len(processed_ids)} conversations ' +
         'left to process. Run Again.')

All NY Times replies retrieved. Move forward!
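
Comparing lengths works when `processed_ids` starts empty for each source. When continuing a prior run, a set difference names the exact conversations still missing. This is a sketch with made-up IDs; `remaining_ids` is a hypothetical helper.

```python
def remaining_ids(all_ids, processed_ids):
    """Return the IDs in all_ids that are not yet in processed_ids, keeping order."""
    done = set(processed_ids)
    return [conv_id for conv_id in all_ids if conv_id not in done]

# Toy values for illustration only
todo = remaining_ids([101, 102, 103, 104], [101, 103])
print(todo)  # [102, 104]
```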


In [16]:
# convert reply list to dataframe
replies_df = pd.DataFrame(replies, columns=['Datetime',
                                            'TweetId',
                                            'Text',
                                            'Username',
                                            'NewsOutlet',
                                            'MentionedUsers',
                                            'ConversationId'])

# export
replies_df.to_csv(f'./data/replies{FILE_SUFFIX}_nyt.csv', sep=',', index=False)
nyt = replies_df

# clear
replies = []
processed_ids = []

In [17]:
nyt_articles = nyt.groupby('ConversationId')['TweetId'].count()
nyt_articles = nyt_articles.reset_index()

print("unique posts: ", nyt_tweets.shape[0])
print("unique reply threads: ", nyt_articles.shape[0])
print("matches: ", nyt_tweets.merge(nyt_articles, 
                                    on='ConversationId', 
                                    how='inner').shape[0])

unique posts:  2647
unique reply threads:  2357
matches:  2647


#### 5.2.2: Reuters Replies

In [44]:
time.sleep(600)
# add reuters replies
# NOTE: if this breaks, run again until all ids are cleared

for conv_id in reuters_conv_ids:
    if conv_id not in processed_ids:
        replies.extend(getReplies("Reuters",
                                  conv_id,
                                  START_DATE,
                                  today,
                                  MAX_TWEETS))

        processed_ids.append(conv_id)
    
        time.sleep(0.05)


In [45]:
# test for done-ness again

if len(reuters_conv_ids) == len(processed_ids):
    print("All Reuters replies retrieved. Move forward!")
else:
    print(f'{len(reuters_conv_ids) - len(processed_ids)} conversations ' +
         'left to process. Run Again.')

All Reuters replies retrieved. Move forward!


In [46]:
# convert reply list to dataframe
replies_df = pd.DataFrame(replies, columns=['Datetime',
                                            'TweetId',
                                            'Text',
                                            'Username',
                                            'NewsOutlet',
                                            'MentionedUsers',
                                            'ConversationId'])

# export
replies_df.to_csv(f'./data/replies{FILE_SUFFIX}_reu.csv', sep=',', index=False)
reu = replies_df

#clear
replies = []
processed_ids = []

In [47]:
reu_articles = reu.groupby('ConversationId')['TweetId'].count()
reu_articles = reu_articles.reset_index()

print("unique posts: ", reuters_tweets.shape[0])
print("unique reply threads: ", reu_articles.shape[0])
print("matches: ", reuters_tweets.merge(reu_articles, 
                                    on='ConversationId', 
                                    how='inner').shape[0])

unique posts:  13252
unique reply threads:  11920
matches:  12373
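
In the output above, the number of matches is larger than the number of unique reply threads. The grouped replies hold each ConversationId once, so an inner merge can only return more rows than there are threads when several posts share a ConversationId, presumably because the outlet threads some of its own posts. A pandas-free sketch of that effect, with made-up IDs:

```python
from collections import Counter

# Toy data, made up for illustration: one conversation ID per post
post_conv_ids = ["a", "a", "b", "c", "d"]
# Conversation IDs with at least one reply (unique, like the groupby result)
reply_threads = {"a", "b", "c"}

# An inner merge keeps every post whose conversation ID has replies
matches = sum(1 for conv_id in post_conv_ids if conv_id in reply_threads)
shared = {cid: n for cid, n in Counter(post_conv_ids).items() if n > 1}

print("unique posts:", len(post_conv_ids))          # 5
print("unique reply threads:", len(reply_threads))  # 3
print("matches:", matches)                          # 4
print("IDs shared by several posts:", shared)       # {'a': 2}
```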


#### 5.2.3: Fox News Replies

In [48]:
#time.sleep(500)
# add fox replies
# NOTE: if this breaks, run again until all ids are cleared

for conv_id in fox_conv_ids:
    if conv_id not in processed_ids:
        replies.extend(getReplies("FoxNews",
                                  conv_id,
                                  START_DATE,
                                  today,
                                  MAX_TWEETS))

        processed_ids.append(conv_id)
    
        time.sleep(0.05)


In [49]:
# test for done-ness again

if (len(fox_conv_ids) == len(processed_ids)):
    print("All Fox News replies retrieved. Move forward!")
else:
    print(f'{(len(fox_conv_ids) - len(processed_ids))} conversations ' +
         'left to process. Run Again.')

All Fox News replies retrieved. Move forward!


In [50]:
# convert reply list to dataframe
replies_df = pd.DataFrame(replies, columns=['Datetime',
                                            'TweetId',
                                            'Text',
                                            'Username',
                                            'NewsOutlet',
                                            'MentionedUsers',
                                            'ConversationId'])

# export
replies_df.to_csv(f'./data/replies{FILE_SUFFIX}_fox.csv', sep=',', index=False)
fox = replies_df

# clear
replies = []
processed_ids = []

In [51]:
fox_articles = fox.groupby('ConversationId')['TweetId'].count()
fox_articles = fox_articles.reset_index()

print("unique posts: ", fox_tweets.shape[0])
print("unique reply threads: ", fox_articles.shape[0])
print("matches: ", fox_tweets.merge(fox_articles, 
                                    on='ConversationId', 
                                    how='inner').shape[0])

unique posts:  637
unique reply threads:  637
matches:  637


# Other notes and code snippets

## If you need to stop during step 5, save processed IDs and begin again later

In [304]:
# save processed ids in case of an interrupted run
proc_ids = pd.DataFrame({"conv_id": processed_ids})
proc_ids.to_csv("./data/processed_ids.csv", sep=',', index=False)
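
The save step can also be paired with a loader that tolerates a missing file, so the same code serves a first run and a continued run. The sketch below uses only the standard `csv` module; `save_ids` and `load_ids` are hypothetical helpers, and the file name mirrors the one used above.

```python
import csv
import os
import tempfile

def save_ids(path, ids):
    """Write conversation IDs to a one-column CSV with a conv_id header."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["conv_id"])
        writer.writerows([conv_id] for conv_id in ids)

def load_ids(path):
    """Read IDs back as ints; return an empty list if no checkpoint exists."""
    if not os.path.exists(path):
        return []
    with open(path, newline="") as handle:
        return [int(row["conv_id"]) for row in csv.DictReader(handle)]

# Demo in a temporary directory so nothing in ./data is touched
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "processed_ids.csv")
    save_ids(demo_path, [1377405371279568898, 42])
    print(load_ids(demo_path))  # [1377405371279568898, 42]
```

Calling the save helper inside the replies loop, after each conversation, would mean an interruption loses at most the conversation in progress.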

## For ease of use, you can join them all now (or later)
...but without compression this file is too big to share.

In [423]:
# remove characters that sometimes break the file
for dataset in [nyt, fox, reu]:
    dataset['Text'] = dataset['Text'].str.replace(r"[,'\"]+", "", regex=True)
    dataset['Text'] = dataset['Text'].str.replace(r"[\r\n]+", "", regex=True)
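
Stripping these characters changes the tweet text. An alternative worth considering is to leave the text untouched and rely on CSV quoting, which is designed to carry commas, quotes, and line breaks inside a field. The sketch below shows the round trip with the standard `csv` module and a made-up string. As far as I know, pandas `to_csv` and `read_csv` apply comparable quoting rules by default, but that is an assumption to verify against your pandas version.

```python
import csv
import io

# Made-up text containing a comma, double quotes, and a newline
rows = [["Text"], ['He said, "wait"\nthen left']]

buffer = io.StringIO(newline="")
csv.writer(buffer, quoting=csv.QUOTE_ALL).writerows(rows)

# Read the same data back and compare with what was written
buffer.seek(0)
restored = list(csv.reader(buffer))
print(restored == rows)  # True
```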


In [None]:
# import all if needed
reu = pd.read_csv(f'./data/replies{FILE_SUFFIX}_reu.csv', quotechar='"')
fox = pd.read_csv(f'./data/replies{FILE_SUFFIX}_fox.csv', quotechar='"')
nyt = pd.read_csv(f'./data/replies{FILE_SUFFIX}_nyt.csv', quotechar='"')

In [532]:
# export all together
all_replies = pd.concat([reu, fox, nyt], ignore_index=True)
all_replies.to_csv(f'./data/replies{FILE_SUFFIX}.csv', sep=',', index=False)