## 3. Natural Tweet Acquisition
To obtain a selection of natural tweets that we can use for a control group, we will use the Tweepy library, which provides access to the Twitter API. It includes the ability to search for specific keywords, get tweets from specific users, and a number of other useful functions.

Our path here will be as follows:
1. Iterate through the named entities we identified in the previous notebook, using the search API to get a decent number of tweets on each entity.
2. For each user identified in step #1, pull a number of past tweets from that user to create a sequence.

### 3.1 Setup

In [None]:
import pandas as pd
import os
import re
import csv
import utilities.tweet_utils as tweet_utilities

Access to the Twitter API through Tweepy requires several keys and tokens. To obtain them, sign up at the [Twitter Developer Portal](https://developer.twitter.com/en/portal/dashboard). Twitter requires an application for access, which is generally granted for a legitimate project.

In [None]:
# Read credentials from environment variables instead of hardcoding them.
# The variable names below are placeholders; use whatever names you export.
# Any secret that was ever committed to a notebook should be rotated.
API_KEY = os.environ['TWITTER_API_KEY']
API_KEY_SECRET = os.environ['TWITTER_API_KEY_SECRET']
BEARER_TOKEN = os.environ['TWITTER_BEARER_TOKEN']
ACCESS_TOKEN = os.environ['TWITTER_ACCESS_TOKEN']
ACCESS_TOKEN_SECRET = os.environ['TWITTER_ACCESS_TOKEN_SECRET']

Here we set up API access and authenticate.

In [None]:
import tweepy

# Authenticate to Twitter
auth = tweepy.OAuthHandler(API_KEY, API_KEY_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

api = tweepy.API(auth)

try:
    api.verify_credentials()
    print("Authentication OK")
except Exception as e:
    print(f"Error during authentication: {e}")

### 3.2 Load Topics

We use this function to clean up the entities identified from the NER model. We will then use those to search Twitter for related tweets.

In [None]:
def filter_clean_entities(entity_df,min_count=10):
    """Cleans up entities from NER step

    Args:
        entity_df (dataframe): Contains entities, entity type, and count
        min_count (int): Minimum count of mentions of each entity to include

    Returns:
        entity_df (dataframe): Cleaned dataframe
    """
    entity_df = entity_df.sort_values(by='count',ascending=False)
    entity_df['len'] = entity_df['word'].map(lambda x: len(str(x)))
    entity_df = entity_df.drop(entity_df[entity_df['count']<min_count].index)
    entity_df = entity_df.drop(entity_df[entity_df['len']<2].index)
    entity_df = entity_df.drop(entity_df[entity_df['word']=='RT'].index)
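    # Escape the hyphen so it is a literal character rather than the range "+" to "/"
    # (which would also match "," and "."); na=False treats missing words as non-matches
    entity_df = entity_df.drop(entity_df[entity_df['word'].str.contains(r'[@#&$%+\-/*]', na=False)].index)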

    return entity_df
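
The symbol filter relies on a regex character class, where an unescaped hyphen between two characters denotes a range: `+-/` covers `+`, `,`, `-`, `.`, and `/`. The following minimal sketch compares the two spellings on made-up sample strings (the pattern names are illustrative, not part of the notebook). This is worth checking before dropping entities such as abbreviations that contain periods.

```python
import re

# Unescaped hyphen: "+-/" is a range covering the characters + , - . /
RANGE_CLASS = re.compile(r'[@#&$%+-/*]')
# Escaped hyphen: only the literal characters @ # & $ % + - / *
LITERAL_CLASS = re.compile(r'[@#&$%+\-/*]')

# Made-up sample entities for illustration only
samples = ['U.S.', 'Hong Kong', '#China', 'AT&T', 'Xi-Putin']

for word in samples:
    dropped_by_range = bool(RANGE_CLASS.search(word))
    dropped_by_literal = bool(LITERAL_CLASS.search(word))
    print(f"{word!r}: range={dropped_by_range}, literal={dropped_by_literal}")
```

Only the entry containing periods is treated differently by the two patterns; the other samples are kept or dropped identically.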

#### 3.2.1 Chinese Tweet Entities
The following lines simply gather and display the entities to be included in the search.

In [None]:
chinese_entities = pd.read_csv('../working_files/chinese_entities.csv')
chinese_entities = filter_clean_entities(chinese_entities,min_count=100)

In [None]:
len(chinese_entities)

In [None]:
chinese_entities.head()

#### 3.2.2 Russian Tweet Entities

In [None]:
russian_entities = pd.read_csv('../working_files/russian_entities.csv')
russian_entities = filter_clean_entities(russian_entities,min_count=100)

In [None]:
len(russian_entities)

In [None]:
russian_entities.head()

#### 3.2.3 Combined Entities

In [None]:
combined_entities = pd.concat([chinese_entities, russian_entities],axis=0)
combined_entities = combined_entities.drop_duplicates(subset=['word'])
combined_entities.to_csv('../working_files/state_operator_entities.csv')

#### 3.2.4 Final Processing
After creating this list, I manually flagged several duplicate and/or misspelled entries in Excel. The manually edited list is saved as `state_operator_entities_edited.csv`; entries to be excluded are marked with a `1` in the `exclude` column.

### 3.3 Gather Search Sample
We will now run through the entities collected above and use each one as a search query via the Twitter API. Below we collect up to 100 tweets for each query and append them all to a single file. Once we identify the users who posted these tweets, we will pull recent tweets from their timelines to build the tweet sequences used in training.

In [None]:
def tweet_searcher(query, client, filename, max_results=100):
    """Searches and writes tweets from the last 7 days using given search term
    
    Args:
        query (string): Query to be passed to the Twitter API
        client (tweepy.Client): Client object
        filename (string): Name of file to append tweets to
        max_results (int): Maximum number of tweets to request (10 to 100)
    
    """
    response = client.search_recent_tweets(query, max_results=max_results,
                tweet_fields=['id','author_id','created_at','lang','text'])
    # The search_recent_tweets method returns a Response object, a named tuple 
    # with data, includes, errors, and meta fields
    print(query)

    # The data field of the Response is a list of Tweet objects, or None when
    # the query matched nothing
    tweets = response.data or []

    # Open the file once and append one row per tweet
    with open(filename, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        for tweet in tweets:
            writer.writerow(
                [
                    tweet.id,
                    tweet.author_id,
                    tweet.created_at,
                    tweet.lang,
                    tweet.text
                ]
            )

In [None]:
ent = pd.read_csv('../working_files/state_operator_entities_edited.csv',index_col=0)
ent = ent[ent['exclude']==0]
len(ent)

This code will loop through the topics and conduct a search for each one, excluding retweets in the results.

In [None]:
client = tweepy.Client(BEARER_TOKEN)
SAMPLE_DESTINATION = '../working_files/tweets_05.16.2022.csv'

for t in ent['word']:
    q = t + ' -is:retweet lang:en' # "-is:retweet" excludes retweets
    tweet_searcher(q, client, SAMPLE_DESTINATION,max_results=100)
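
The query string combines the entity with v2 search operators. As a minimal sketch, the hypothetical helper `build_query` below (not part of the original notebook) also wraps multi-word entities in double quotes, on the assumption that an exact-phrase match is wanted. Without quotes, the search API treats space-separated words as separate keywords that must all appear somewhere in the tweet.

```python
def build_query(entity, lang='en', exclude_retweets=True):
    """Build a recent-search query string for one entity (illustrative helper)."""
    term = str(entity).strip()
    # Quote multi-word entities so they are matched as an exact phrase
    if ' ' in term:
        term = f'"{term}"'
    parts = [term]
    if exclude_retweets:
        parts.append('-is:retweet')  # "-is:retweet" excludes retweets
    if lang:
        parts.append(f'lang:{lang}')
    return ' '.join(parts)


print(build_query('Hong Kong'))
print(build_query('Taiwan'))
```

This sketch does not handle entities that themselves contain double quotes; those would need to be cleaned or skipped first.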


### 3.4 Pull Recent Tweets From Users
The following cells will pull the last N tweets from the unique users identified in the query results above. This will be used to build the Tweet sequences we will use for training.

#### 3.4.1 Load Query Results

In [None]:
# Load query results
df = pd.read_csv(SAMPLE_DESTINATION,
                 names=['id','author_id','created_at','lang','text'], 
                 index_col=False)
df = df[df['lang']=='en']

In [None]:
df.head()

In [None]:
users = pd.unique(df['author_id'])

#### 3.4.2 Pull Screennames
The search calls above return each tweet's `author_id` but not the author's screen name, and the timeline code below looks users up by screen name. We therefore need to resolve the author IDs found above to screen names first.

Here we will carry out that process. This step will take over 2.5 hours using the default settings, due to Twitter's API rate limits.
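
The runtime follows from the rate limit: lookups proceed in batches of 900, and each batch occupies up to one 15-minute window. A rough sketch of that arithmetic is shown here, using a hypothetical helper name and an illustrative user count (the actual count depends on the search results).

```python
import math


def estimate_lookup_minutes(n_users, batch_size=900, window_minutes=15):
    """Upper bound on minutes spent waiting on rate-limit windows."""
    batches = math.ceil(n_users / batch_size)
    return batches * window_minutes


# Illustrative count only; substitute len(users) from the notebook
minutes = estimate_lookup_minutes(9500)
print(f"{minutes} minutes (~{minutes / 60:.2f} hours)")
```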

In [None]:
import time

In [None]:
def collect_twitter_user_screennames(author_ids):
    """Pulls screennames from Twitter API based on author_id

    Args:
        author_ids (list of ints): author_ids as provided by Twitter

    Returns:
        screen_names (list of strings): screennames for each author that
            could be resolved

    """
    screen_names = []
    for s in range(0,len(author_ids),900):
        # API is rate-limited to 900 user lookups every 15 minutes 
        end = min(s+900, len(author_ids))
        print(f"Acquiring range {s} to {end}")
        # Slice the argument, not the global `users` variable
        for author_id in author_ids[s:end]:
            try:
                # Tweepy 4.x names this parameter user_id
                screen_names.append(api.get_user(user_id=author_id).screen_name)
            except Exception as e:
                print(f"Error collecting user name ({e}). User omitted.")
        # Wait out the rate-limit window, except after the final batch
        if end < len(author_ids):
            time.sleep(60*15)
    return screen_names

In [None]:
screennames = collect_twitter_user_screennames(users)

In [None]:
len(screennames)

In [None]:
import pickle

# Save list to use later, if needed
with open('../working_files/screen_names.pkl', 'wb') as handle:
    pickle.dump(screennames, handle, protocol=pickle.HIGHEST_PROTOCOL)
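
In a later session, the saved list can be restored without repeating the lookups. The following is a minimal sketch: `load_screen_names` is a hypothetical helper, and the demo uses a temporary file and made-up names rather than the notebook's real path. Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.

```python
import os
import pickle
import tempfile


def load_screen_names(path):
    """Load a pickled list of screen names saved earlier."""
    with open(path, 'rb') as handle:
        return pickle.load(handle)


# Round-trip demo with made-up names in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'screen_names.pkl')
    with open(demo_path, 'wb') as handle:
        pickle.dump(['example_user_a', 'example_user_b'], handle,
                    protocol=pickle.HIGHEST_PROTOCOL)
    restored = load_screen_names(demo_path)

print(restored)
```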

#### 3.4.3 Pull Recent Tweets

In [None]:
def get_tweets(username, filename, tweet_limit=50):
    """Pulls the last N tweets from a user's timeline and writes them to a file
    
    Args:
        username (string): Screen name of user whose timeline is to be searched
        filename (string): Name of file to which tweets are to be appended
        tweet_limit (int): Maximum number of tweets to retrieve from user
    
    Returns:
        Nothing
    
    """
    # Authorization to consumer key and consumer secret
    auth = tweepy.OAuthHandler(API_KEY, API_KEY_SECRET)

    # Access to user's access key and access secret
    auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

    # Calling api; wait_on_rate_limit makes Tweepy sleep until the limit resets
    api = tweepy.API(auth, wait_on_rate_limit=True)

    print(username)
    # Get tweets; the with-block closes the file even if a request fails
    with open(filename, "a", newline="", encoding="utf-8") as csv_file:
        csv_writer = csv.writer(csv_file)
        for tweet in tweepy.Cursor(
            api.user_timeline, 
            screen_name=username,
            include_rts=False
        ).items(limit=tweet_limit):
            csv_writer.writerow(
                [
                    tweet.id,
                    tweet.author.screen_name,
                    tweet.created_at,
                    tweet.lang,
                    tweet.source,
                    tweet.retweet_count,
                    tweet.favorited,
                    tweet.retweeted,
                    tweet.text
                ]
            )

This loop will run for a while and then sleep once it reaches the Twitter API rate limit. After sleeping for a certain amount of time, it will resume. This process will repeat until all the Tweets are downloaded.

In [None]:
TIMELINE_TWEETS_DESTINATION = "../working_files/timeline_tweets_05.16.2022.csv"

for name in screennames:
    try:
        get_tweets(name, TIMELINE_TWEETS_DESTINATION, 50)
    except Exception as e:
        # Report the failure (the original string was never printed) and continue
        print(f"Error detected for {name}: {e}. Moving to next user.")

### 3.5 Clean & Combine Tweets
Now that we have obtained a large sample of user tweets on the same topics discussed by our state operators, we will apply the same cleaning and filtering process to them as we did to the state operator tweets.

In [None]:
import utilities.tweet_utils as tweet_utils

rts = pd.read_csv(TIMELINE_TWEETS_DESTINATION,
                  names=['id','userid','tweet_time','tweet_language','source',
                         'retweet_count','favorited','retweeted','tweet_text'], 
                  index_col=False)

In [None]:
rts.head()

In [None]:
rts = rts[rts['tweet_language']=='en']
rts.shape, len(pd.unique(rts['userid']))

In [None]:
rts = tweet_utils.apply_filters(rts)
rts.shape, len(pd.unique(rts['userid']))

In [None]:
rts.head()

In [None]:
rts = tweet_utils.combine_tweets(rts, 10)

In [None]:
rts.iloc[11444,11]

In [None]:
final_cols = ['userid','tweet_text','tweet_time','clean_tweets','recent_tweets']

rts = rts[final_cols].copy()

In [None]:
rts.sample(100).head(50)

In [None]:
rts.to_csv('../working_files/real_tweet_sequences.csv',sep=',', quotechar='"',header=True)