## Penn State DS 200 Fall 2019
## Lab 4A Tweets Gathering
In this lab, you will learn to gather tweets that match keywords and hashtags, using the API keys and access tokens you
obtained from your Twitter Developer account.

### Install Tweepy
First, we install tweepy, a Python library for gathering tweets through the Twitter API.

In [1]:
!pip install tweepy
!pip install datascience

Collecting tweepy
  Downloading https://files.pythonhosted.org/packages/36/1b/2bd38043d22ade352fc3d3902cf30ce0e2f4bf285be3b304a2782a767aec/tweepy-3.8.0-py2.py3-none-any.whl
Collecting PySocks>=1.5.7 (from tweepy)
  Downloading https://files.pythonhosted.org/packages/8d/59/b4572118e098ac8e46e399a1dd0f2d85403ce8bbaad9ec79373ed6badaf9/PySocks-1.7.1-py3-none-any.whl
Collecting requests-oauthlib>=0.7.0 (from tweepy)
  Downloading https://files.pythonhosted.org/packages/a3/12/b92740d845ab62ea4edf04d2f4164d82532b5a0b03836d4d4e71c6f3d379/requests_oauthlib-1.3.0-py2.py3-none-any.whl
Collecting oauthlib>=3.0.0 (from requests-oauthlib>=0.7.0->tweepy)
  Downloading https://files.pythonhosted.org/packages/05/57/ce2e7a8fa7c0afb54a0581b14a65b56e62b5759dbc98e80627142b8a3704/oauthlib-3.1.0-py2.py3-none-any.whl (147kB)
Installing collected packages: PySocks, oauthlib, requests-oauthlib, tweepy
Successfully installed PySocks

Building wheels for collected packages: datascience, docopt
  Building wheel for datascience (setup.py) ... done
  Created wheel for datascience: filename=datascience-0.15.3-cp35-none-any.whl size=44582 sha256=9fbc7638de8baac43a1f536b24f4a4c6d57fbc8e72499a8d5d7a473510b54ac1
  Stored in directory: /home/nbuser/.cache/pip/wheels/b8/37/0a/80274866028f6485c5957f0e1acf8e2b755fbe9dd0fd4ad275
  Building wheel for docopt (setup.py) ... done
  Created wheel for docopt: filename=docopt-0.6.2-py2.py3-none-any.whl size=19851 sha256=f6d3863ff68572585c9e3e48503a90db8c4f149591d7dc192de7eef70cbf4115
  Stored in directory: /home/nbuser/.cache/pip/wheels/9b/04/dd/7daf4150b6d9b12949298737de9431a324d4b797ffd63f526e
Successfully built datascience docopt
Installing collected packages: jinja2, branca, folium, kiwisolver, matplotlib, coverage, docopt, coveralls, datascience
  Found existing installation: Jinja2 2.8
    Uninstalling Jinja2-2.8:
      Successfully uninstalled Jinja2-2.8
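
The log above shows tweepy 3.8.0 being installed. The import cell that follows relies on `tweepy.streaming.StreamListener`, which exists in the 3.x series; tweepy 4.0 folded the listener into `Stream`, so that import fails on newer releases. A minimal sketch of a version check, where `uses_legacy_streaming` is a hypothetical helper name and the version string would normally come from `tweepy.__version__`:

```python
def uses_legacy_streaming(version):
    """Return True if this tweepy version string is from the 3.x (or older) series."""
    major = int(version.split('.')[0])
    return major < 4

# In the notebook you would pass tweepy.__version__ instead of these literals.
for version in ('3.8.0', '4.0.0'):
    if uses_legacy_streaming(version):
        print(version, '-> the StreamListener-based code in this lab should import')
    else:
        print(version, '-> consider: pip install "tweepy<4"')
```

If the check reports a 4.x release, pinning the install to a 3.x version is one way to keep the rest of this lab unchanged.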


In [2]:
import tweepy
from tweepy import OAuthHandler
from tweepy import Stream
from tweepy.streaming import StreamListener



import sys
import os
import json
import time
import datetime
import re

import pandas as pd

### Python Code for Gathering Tweets
The following code defines a listener class that "listens" (responds) to tweets sent from the Twitter API that match the specified keywords and hashtags. The class also filters out non-English tweets and performs some simple preprocessing (e.g., removing non-ASCII characters from the body of the tweet), so that we do not need to worry about them later.

In [3]:
class MyListener(StreamListener):
    def __init__(self, raw_file, csv_file, text_file, max_num=300):
        super().__init__()
        self.raw_file = raw_file
        self.csv_file = csv_file
        self.text_file = text_file
        self.max_num = max_num
        self.count = 0
        self.start_time = time.time()

    def on_data(self, data):
        # Filter out special cases
        if data.startswith('{"limit":'):
            return

        # Filter out non-English tweets, as well as stream messages
        # that carry no 'lang' field (e.g., delete notices)
        tweet = json.loads(data)
        if tweet.get('lang') != 'en':
            return
        # if 'retweeted_status' in tweet:
        #     return

        # Extract fields from tweet and write to csv_file
        user_id = tweet['user']['id']
        user_name = tweet['user']['name']
        tweet_time = tweet['created_at']
        location = tweet['user']['location']
        text = tweet['text'].strip().replace('\n', ' ').replace('\t', ' ')

        # Remove non-ASCII characters and commas in user_name and location
        if user_name is not None:
            user_name = ''.join([c if ord(c) < 128 else '' for c in user_name])
            user_name = user_name.replace(',', '')
        if location is not None:
            location = ''.join([c if ord(c) < 128 else '' for c in location])
            location = location.replace(',', '')

        # Remove non-ASCII characters in text
        text = ''.join([c if ord(c) < 128 else '' for c in text])
        # Replace commas with space
        text = text.replace(',', ' ')
        # Remove double quotes
        text = re.sub(r'\"', '', text)
        # Replace runs of two or more underscores with a space
        text = re.sub(r'[_]{2,}', ' ', text)
        # Collapse runs of whitespace into a single space
        text = ' '.join(text.split())

        # Check whether csv_file and text_file exist
        # If not, create them and write the header rows
        if not os.path.isfile(self.csv_file):
            with open(self.csv_file, 'w') as f:
                f.write(','.join(['user_id', 'user_name', 'tweet_time', 'location', 'text']) + '\n')
        if not os.path.isfile(self.text_file):
            with open(self.text_file, 'w') as f:
                f.write('text\n')

        with open(self.raw_file, 'a') as f_raw, open(self.csv_file, 'a') as f_csv, open(self.text_file, 'a') as f_text:
            # Write to files
            f_raw.write(data.strip() + '\n')
            f_csv.write(','.join(map(str, [user_id, user_name, tweet_time, location, text])) + '\n')
            f_text.write(text + '\n')

            # Increment count
            self.count += 1
            # if self.count % 10 == 0 and self.count > 0:
            sys.stdout.write('\r{}/{} tweets downloaded'.format(self.count, self.max_num))
            sys.stdout.flush()

            # Check whether the maximum number of tweets has been reached
            if self.count == self.max_num:
                print('\nMaximum number reached.')
                end_time = time.time()
                elapse = end_time - self.start_time
                print('It took {} seconds to download {} tweets'.format(elapse, self.max_num))
                sys.exit(0)

    def on_error(self, status):
        print(status)
        return True

# Get the str representation of the current date and time    
def current_datetime_str():
    return format(datetime.datetime.now(), "%Y-%m-%d_%H-%M-%S")
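
The text-cleaning steps inside `on_data` can be tried in isolation before connecting to Twitter. The sketch below repeats the same steps in a standalone function; the name `clean_text` and the sample string are made up for illustration and are not part of the lab code.

```python
import re

def clean_text(text):
    """Apply the same cleaning steps that MyListener.on_data applies to tweet text."""
    # Trim the ends and flatten newlines and tabs into spaces
    text = text.strip().replace('\n', ' ').replace('\t', ' ')
    # Drop non-ASCII characters
    text = ''.join(c for c in text if ord(c) < 128)
    # Commas would break the hand-written CSV rows, so turn them into spaces
    text = text.replace(',', ' ')
    # Remove double quotes
    text = text.replace('"', '')
    # Replace runs of two or more underscores with a space
    text = re.sub(r'_{2,}', ' ', text)
    # Collapse runs of whitespace into a single space
    return ' '.join(text.split())

sample = 'Trade war,  "again"\n#tradewar __ caf\u00e9'
print(clean_text(sample))
```

Because commas are removed from every field, each row written to the CSV file keeps exactly five columns even though the file is written with plain string joins.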

## Exercise 4.1 Paste your API Keys and Access Tokens into the Tweet Gathering Code
#### Note: Make sure you copy each key and token exactly as it is. In particular, check the first and last characters to make sure none are missing. Also, double-check that you did not accidentally include a space or a left parenthesis when copying the keys and tokens.
#### Create a keywords.txt file directly in Jupyter Notebook or upload it from your computer.
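
One way to create `keywords.txt` from a notebook cell is sketched below. It assumes the file should live in the current working directory and uses the three keywords that appear in this lab's sample run; substitute your own.

```python
from pathlib import Path

# Keywords taken from the sample run in this lab; replace with your own.
keywords = ['#tradewar', 'tradewar', 'Trade War']

# Write one comma-separated line, the format the gathering code expects.
keywords_path = Path('keywords.txt')
keywords_path.write_text(', '.join(keywords) + '\n', encoding='ascii')

# Read the file back to confirm its contents.
print(keywords_path.read_text(encoding='ascii'))
```

Writing with `encoding='ascii'` raises an error right away if a keyword contains a non-ASCII character, which the gathering code would otherwise reject later.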

In [6]:
def main():
    # Paste your keys and tokens below, replacing each placeholder.
    consumer_key = 'YOUR_CONSUMER_KEY'
    consumer_secret = 'YOUR_CONSUMER_SECRET'
    access_token = 'YOUR_ACCESS_TOKEN'
    access_secret = 'YOUR_ACCESS_SECRET'

    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)
    api = tweepy.API(auth)

    # Welcome
    print('===========================================================')
    print('Welcome to the user interface of gathering tweets pipeline!')
    print('You can press "Ctrl+C" at anytime to abort the program.')
    print('===========================================================')
    print()

    # Prompt for input keywords
    methods = ['manual', 'file']
    print('How do you want to specify your key words?')
    while True:
        m = input('Type "manual" or "file" >>> ')
        if m in methods:
            break
        else:
            print('\"{}\" is an invalid input! Please try again.\n'.format(m))

    # Choose keywords:
    if m == 'file':
        print('===========================================================')
        print('Please input the file name that contains your key words.')
        print('Notes:')
        print('    The file should contain key words in one or multiple lines, and multiple key words should be separated by *COMMA*.')
        print('        For example: NBA, basketball, Lebron James')
        print('    If the file is under the current directory, you can directly type the file name, e.g., "keywords.txt".')
        print('    If the file is in another directory, please type the full file name, e.g., "C:\\Downloads\\keywords.txt" (for Windows), or "/Users/xy/Downloads/keywords.txt" (for MacOS/Linux).')

        while True:
            file_name = input('Type your file name >>> ')
            if os.path.isfile(file_name):
                break
            else:
                print('"{}" is not a valid file name! Please check if the file exists.\n'.format(file_name))

        # Check the content of keywords file
        key_words = []
        with open(file_name, 'r') as f:
            lines = f.readlines()
            if len(lines) == 0:
                print('\n{} is an empty file!\nTask aborted!'.format(file_name))
                sys.exit(1)

            for line in lines:
                line = line.strip()
                # Detect non-ASCII characters
                for c in line:
                    if ord(c) >= 128:
                        print('\n{} contains non-ASCII characters: "{}" \nPlease remove them and try again'.format(file_name, c))
                        sys.exit(1)
                # Check delimiters
                if line.count(' ') > 1 and ',' not in line:
                    print('\nThe key words file contains more than one <space> but no comma.')
                    print('I\'m confused about your keywords. Please separate your key words by commas.')
                    sys.exit(1)

                words = line.split(',')
                for w in words:
                    if len(w.strip()) > 0:
                        key_words.append(w.strip())

        # Check key_words
        if len(key_words) == 0:
            print('\nZero key words are found in {}! Please check your key words file.'.format(file_name))
            sys.exit(1)

    elif m == 'manual':
        print('===========================================================')
        print('Please input your key words (separated by comma), and hit <ENTER> when done.')

        while True:
            line = input('Type the key words >>> ')
            line = line.strip()

            invalid_flag = False
            # Check empty
            if len(line) == 0:
                print('\nYour input is empty! Please try again.')
                invalid_flag = True
            # Detect non-ASCII characters
            for c in line:
                if ord(c) >= 128:
                    print('\nYour input contains non-ASCII characters: "{}"! Please try again.'.format(c))
                    invalid_flag = True
                    break
            # Check delimiters
            if line.count(' ') > 1 and ',' not in line:
                print('\nYour input contains more than one <space> but no comma.')
                print('I\'m confused about your keywords. Please try again')
                invalid_flag = True

            if invalid_flag:
                continue
            else:
                break

        # Process input
        key_words = []
        for w in line.split(','):
            if len(w.strip()) > 0:
                key_words.append(w.strip())

    # Print valid key words
    key_words = list(set(key_words))
    print('\n{} unique key words being used: '.format(len(key_words)), key_words)

    # Prompt for number of tweets to be gathered
    print('===========================================================')
    print('How many tweets do you want to gather? \nInput an integer number, or just hit <ENTER> to use the default number 300.')
    num_tweets = 300
    while True:
        s = input('Input an integer >>> ')
        s = s.strip()
        if len(s) == 0:
            break
        elif s.isdigit():
            num = int(s)
            if num > 0:
                num_tweets = num
                break
            else:
                print('\nPlease input a number that is greater than 0.')
        else:
            print('\nPlease input a valid integer number.')

    print('{} tweets to be gathered.'.format(num_tweets))

    # Streaming
    # TODO: remove '\t', '\n' and ',' in text field, also remove empty text
    print('===========================================================')
    print('Start gathering tweets ...')

    postfix = current_datetime_str()
    raw_file = 'raw_{}.json'.format(postfix)
    csv_file = 'data_{}.csv'.format(postfix)
    text_file = 'text_{}.csv'.format(postfix)

    twitter_stream = Stream(auth, MyListener(raw_file=raw_file, csv_file=csv_file, text_file=text_file, max_num=num_tweets))
    twitter_stream.filter(track=key_words)


if __name__ == '__main__':
    try:
        main()
    except KeyboardInterrupt:
        print('\nTask aborted!')
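
The keyword checks in `main()` are interleaved with prompts and `sys.exit` calls, which makes them hard to try on their own. The sketch below pulls the same rules (ASCII only, comma-separated, blanks ignored) into a standalone function. `parse_keywords` is a hypothetical name, and unlike `main()`, which deduplicates with `set`, this version keeps the keywords in first-seen order and raises `ValueError` instead of exiting.

```python
def parse_keywords(lines):
    """Parse comma-separated keywords from an iterable of text lines."""
    keywords = []
    for line in lines:
        line = line.strip()
        # Reject non-ASCII characters, as main() does
        for c in line:
            if ord(c) >= 128:
                raise ValueError('non-ASCII character: {!r}'.format(c))
        # Several spaces but no comma usually means the commas were forgotten
        if line.count(' ') > 1 and ',' not in line:
            raise ValueError('separate key words with commas: {!r}'.format(line))
        for word in line.split(','):
            word = word.strip()
            # Skip empty entries and duplicates
            if word and word not in keywords:
                keywords.append(word)
    return keywords

print(parse_keywords(['#tradewar, tradewar', 'Trade War, tradewar']))
```

The same function works for both input modes: pass `f.readlines()` for a file, or a one-element list for a manually typed line.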
        


Welcome to the user interface of gathering tweets pipeline!
You can press "Ctrl+C" at anytime to abort the program.

How do you want to specify your key words?
Type "manual" or "file" >>> file
Please input the file name that contains your key words.
Notes:
    The file should contain key words in one or multiple lines, and multiple key words should be separated by *COMMA*.
        For example: NBA, basketball, Lebron James
    If the file is under the current directory, you can directly type the file name, e.g., "keywords.txt".
    If the file is in another directory, please type the full file name, e.g., "C:\Downloads\keywords.txt" (for Windows), or "/Users/xy/Downloads/keywords.txt" (for MacOS/Linux).
Type your file name >>> keywords.txt

3 unique key words being used:  ['#tradewar', 'tradewar', 'Trade War']
How many tweets do you want to gather? 
Input an integer number, or just hit <ENTER> to use the default number 300.
Input an integer >>> 
300 tweets to be gathered.
Start gatheri

SystemExit: 0

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)
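
The `SystemExit: 0` message above comes from the `sys.exit(0)` call in `on_data`, which Jupyter reports as an exception even though the download finished normally. In tweepy 3.x, returning `False` from `on_data` also stops the stream, without that message. A minimal sketch of the stopping pattern follows; it uses a stand-in loop instead of a live stream so it runs without credentials, and `CountingHandler` and `run_fake_stream` are hypothetical names, not tweepy APIs.

```python
class CountingHandler:
    """Stand-in for the counting logic in MyListener.on_data (no tweepy needed)."""

    def __init__(self, max_num):
        self.max_num = max_num
        self.count = 0

    def on_data(self, data):
        self.count += 1
        # Returning False asks the stream loop to stop
        if self.count >= self.max_num:
            return False
        return True


def run_fake_stream(handler, messages):
    """Mimic a stream loop that stops once the handler returns False."""
    delivered = 0
    for message in messages:
        delivered += 1
        if handler.on_data(message) is False:
            break
    return delivered


handler = CountingHandler(max_num=3)
print(run_fake_stream(handler, ['tweet'] * 10), 'messages delivered before stopping')
```

In `MyListener`, the equivalent change would be to replace `sys.exit(0)` with `return False`; treat this as a suggestion to verify against the tweepy version you installed.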
