# Tweets gathering

## Why Twitter?
We gather tweets because Twitter is known to be a hub for financial information.
It is also worth emphasizing that, in general, investor behavior is reflected in the market.

More details can be found in these papers:
> Bollen, Johan, Huina Mao, and Xiaojun Zeng. "Twitter mood predicts the stock market." *Journal of Computational Science* 2.1 (2011): 1-8.

> Atkins, Adam, Mahesan Niranjan, and Enrico Gerding. "Financial news predicts stock market volatility better than close price." *The Journal of Finance and Data Science* 4.2 (2018): 120-137.


# Part 1 - Fetching the tweets
## Comments
In this part we gather the tweets using the **tweepy** library. Since the Twitter API rate limit is 180 calls every 15 minutes, this process can take a while, as we have around 3200 companies.
We save the raw tweets to a csv file, **twitter_data.csv**, in *7-Data/1-RawTweets*.
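
As a rough back-of-the-envelope check, the figures above can be turned into a waiting-time estimate. This is only a sketch: it assumes one search call per company and that each window's quota is used up almost instantly; the helper name `estimate_minutes` is illustrative and not part of the notebook.

```python
import math

CALLS_PER_WINDOW = 180   # calls allowed per rate-limit window (figure from the text)
WINDOW_MINUTES = 15      # length of one window, in minutes
N_COMPANIES = 3200       # approximate number of companies (figure from the text)


def estimate_minutes(n_calls, calls_per_window=CALLS_PER_WINDOW,
                     window_minutes=WINDOW_MINUTES):
    """Estimate the waiting time, assuming one call per company (an assumption)."""
    windows = math.ceil(n_calls / calls_per_window)
    # The first window needs no wait; every later window costs one full wait.
    return max(windows - 1, 0) * window_minutes


print(estimate_minutes(N_COMPANIES))
```

Under these assumptions the wait is a little over four hours, which is consistent with the process taking a while.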

### Libraries

In [4]:
import pandas as pd
import numpy as np
import tweepy
import time
import sys

# We import the companies' names and symbols
We import our dataset of companies that are publicly listed on SIX and NYSE.

Data sources:
- https://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NYSE for NYSE
- https://www.six-swiss-exchange.com/shares/companies/issuer_list_en.html for SIX

In [8]:
# We import the csv and select the relevant columns
# Cryptocurrency list: the symbol column is used both as symbol and as name
df_crypto = pd.read_csv("./data/crypto.csv")
df_crypto = df_crypto[['symbol','symbol']]
df_crypto.columns = ['Symbol','Name']

## We authenticate on Twitter with our credentials

In [9]:
import os
import tweepy

# Credentials are read from environment variables instead of being hardcoded.
# The variable names below are assumptions; adapt them to your own setup.
consumer_key = os.environ["TWITTER_CONSUMER_KEY"]
consumer_secret = os.environ["TWITTER_CONSUMER_SECRET"]

access_token = os.environ["TWITTER_ACCESS_TOKEN"]
access_token_secret = os.environ["TWITTER_ACCESS_TOKEN_SECRET"]


auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

## We define our main functions
1. **parse_tweet**: used to parse each tweet
2. **format_response**: used to transform a response into a company/tweet format

In [10]:
# If there is an error creating the api instance
if not api:
    print("Can't Authenticate")
    sys.exit(-1)

# define parsing functions
def parse_tweet(tweet):
    """
    Takes result object from tweepy and parses it
    """

    # initialize dict
    parsed_tweet = {}

    # extract relevant info
    parsed_tweet['Author Name'] = tweet.author.name
    parsed_tweet['Text'] = tweet.text
    parsed_tweet['Message ID'] = tweet.id
    parsed_tweet['Published At'] = tweet.created_at
    parsed_tweet['Retweet Count'] = tweet.retweet_count
    parsed_tweet['Favorite Count'] = tweet.favorite_count

    return parsed_tweet

def format_response(response, company):
    """
    Takes list of result objects from tweepy and formats it
    """
    try:
        parsed_tweets = pd.DataFrame([parse_tweet(tweet) for tweet in response], columns=[ 'Company Name', 'Author Name', 'Text', 'Message ID', 'Published At', 'Retweet Count', 'Favorite Count'])
        parsed_tweets['Company Name'] = str(company)
    except TypeError as e:
        print(e)
        parsed_tweets = pd.DataFrame(columns=[ 'Company Name', 'Author Name', 'Text', 'Message ID', 'Published At', 'Retweet Count', 'Favorite Count'])

    return parsed_tweets
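
To try the parsing step without calling the API, a stand-in object with the same attributes can be passed in. The sketch below repeats the field extraction of `parse_tweet` so that it runs on its own; `fake_tweet` and all of its values are made up for illustration and do not come from Twitter.

```python
from datetime import datetime
from types import SimpleNamespace


def parse_tweet(tweet):
    """Same field extraction as parse_tweet above, repeated so the sketch runs alone."""
    return {
        'Author Name': tweet.author.name,
        'Text': tweet.text,
        'Message ID': tweet.id,
        'Published At': tweet.created_at,
        'Retweet Count': tweet.retweet_count,
        'Favorite Count': tweet.favorite_count,
    }


# Hypothetical stand-in for a tweepy result object; every value is invented.
fake_tweet = SimpleNamespace(
    author=SimpleNamespace(name='example_user'),
    text='NEXO cryptocurrency is trending',
    id=1234567890,
    created_at=datetime(2020, 1, 1, 12, 0, 0),
    retweet_count=3,
    favorite_count=7,
)

row = parse_tweet(fake_tweet)
print(row['Author Name'], row['Message ID'])
```

Each parsed tweet becomes one dictionary, and `format_response` stacks these dictionaries into a DataFrame with an extra `Company Name` column.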

## Main algorithm
We now run the tweet-fetching process, then save the result to a csv file, **twitter_data.csv**, in the folder */7-Data/1-RawTweets*.

This process is **quite time intensive**, so the user may **skip** it entirely, since the tweets have already been fetched beforehand.

In [26]:
# main program
if __name__ == '__main__':
    
    # We create a timestamp
    ts = int(time.time())
    
    # load previously collected twitter data, if the file exists
    try:
        data = pd.read_csv('./data/update.csv')
    except FileNotFoundError:
        data = False

    # loop over companies
    for idx,crypto in enumerate(df_crypto['Name']):

        print('Processing {0} {1} of {2}'.format(crypto, str(idx), str(len(df_crypto['Name']))))

        # define params dict
        params={
            'q' : crypto+" cryptocurrency"
        }

        # add since_id if prior data file exists
        if data is not False:

            # if this company exists in our dataset
            # (.values is needed: `in` on a Series checks the index, not the values)
            if crypto in data['Company Name'].values:

                # add the since_id param so we don't collect redundant tweets
                params['since_id'] = data[data['Company Name']==crypto]['Message ID'].max()
        
        try:
            
            print(params)
            # make the call to twitter
            response = api.search(**params)

        # handle error
        except tweepy.error.TweepError as e:

            print(e)
            
            # Retry the call once; with wait_on_rate_limit=True, tweepy waits until the 15-minute rate limit window resets.
            response = api.search(**params)

        # format response
        formatted_response = format_response(response, crypto)
        
        # write out the result: start a new file (with header) only for the first
        # company of a fresh run, otherwise append without header
        out_path = './data/1-raw-tweets/twitter_data_{}.csv'.format(ts)
        if data is False and idx == 0:
            formatted_response.to_csv(out_path, encoding='utf-8')
        else:
            formatted_response.to_csv(out_path, mode='a', header=False, encoding='utf-8')

Processing CHSB 0 of 7
{'q': 'CHSB cryptocurrency'}
Processing MCO 1 of 7
{'q': 'MCO cryptocurrency'}
Processing EDO 2 of 7
{'q': 'EDO cryptocurrency'}
Processing CRPT 3 of 7
{'q': 'CRPT cryptocurrency'}
Processing NEXO 4 of 7
{'q': 'NEXO cryptocurrency'}
Processing SXP 5 of 7
{'q': 'SXP cryptocurrency'}
Processing DROP 6 of 7
{'q': 'DROP cryptocurrency'}
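
The incremental part of the loop (only asking for tweets newer than those already stored) can be illustrated without pandas or the API. This is a minimal sketch using plain dictionaries; the helpers `latest_ids` and `build_params` and the sample rows are hypothetical and only mirror the `since_id` logic of the cell above.

```python
def latest_ids(rows):
    """Map each company to the largest message ID seen so far."""
    latest = {}
    for row in rows:
        company = row['Company Name']
        message_id = int(row['Message ID'])
        if company not in latest or message_id > latest[company]:
            latest[company] = message_id
    return latest


def build_params(company, latest):
    """Build search parameters, adding since_id only for known companies."""
    params = {'q': company + ' cryptocurrency'}
    if company in latest:
        params['since_id'] = latest[company]
    return params


# Made-up rows in the same shape as the saved csv.
previous_rows = [
    {'Company Name': 'NEXO', 'Message ID': 100},
    {'Company Name': 'NEXO', 'Message ID': 250},
    {'Company Name': 'SXP', 'Message ID': 180},
]

latest = latest_ids(previous_rows)
print(build_params('NEXO', latest))
print(build_params('MCO', latest))
```

A company with stored tweets gets a `since_id` equal to its newest stored message ID, while a company seen for the first time is searched without that parameter.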

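
The error handling in the loop retries a failed call once. A bounded retry with a short pause is one possible alternative; the sketch below is generic and does not depend on tweepy. The helper `call_with_retry` and the `flaky_search` function are hypothetical names introduced only for this illustration.

```python
import time


def call_with_retry(func, max_attempts=3, delay_seconds=1.0, exceptions=(Exception,)):
    """Call func(), retrying on the given exceptions up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except exceptions as error:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            print('Attempt {0} failed: {1}'.format(attempt, error))
            time.sleep(delay_seconds)


# Hypothetical flaky call that fails twice before succeeding.
attempts = {'count': 0}


def flaky_search():
    attempts['count'] += 1
    if attempts['count'] < 3:
        raise RuntimeError('temporary failure')
    return ['tweet']


result = call_with_retry(flaky_search, delay_seconds=0.0)
print(result)
```

In the notebook, the wrapped call would presumably be something like `lambda: api.search(**params)` with `exceptions=(tweepy.error.TweepError,)`, but that wiring is an assumption and has not been tested here.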

# Next step: Part 2 - Tweets cleaning
See the file *clean_tweets.ipynb* in the folder **2-Clean-Data**.