## Twitter Data Gathering 

The goal of this step is to collect enough posts from Twitter for analysis.

### Links:

https://developer.twitter.com/en/docs <br>
http://docs.tweepy.org/en/v3.5.0/api.html <br>


### Import Libraries

In [None]:
import time
import json
import tweepy
from tweepy import OAuthHandler
from tweepy import API

import datetime as dt
import pandas as pd
import os
import csv

pd.set_option('display.max_colwidth', None) # show full column contents (pandas >= 1.0; older versions used -1)

### User credentials

In [None]:
# Variables that contain the user credentials to access the Twitter API
ACCESS_TOKEN = '***'
ACCESS_SECRET = '***'
CONSUMER_KEY = '***'
CONSUMER_SECRET = '***'
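
Hard-coded credentials are easy to leak when a notebook is shared. As a minimal sketch of one alternative, the credentials could be read from environment variables instead. The variable names and the `load_credentials` helper below are assumptions made for illustration; they are not part of this notebook or of tweepy.

```python
import os

# Hypothetical environment variable names; any naming scheme would work.
REQUIRED_VARS = (
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_SECRET",
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
)


def load_credentials(environ=os.environ):
    """Return a dict of credentials, failing early if any variable is missing."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}
```

The returned values could then be assigned to `ACCESS_TOKEN`, `ACCESS_SECRET`, `CONSUMER_KEY` and `CONSUMER_SECRET` before authentication.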

### Authentication

In [None]:
# Setup tweepy to authenticate with Twitter credentials:
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)

# Create the API object to connect to Twitter with your credentials
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

### Data dictionary

The table below is based on Twitter's Tweet data dictionary:
https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object

|Field|Type|Description|
|---|---|---|
|id|Integer|The integer representation of the unique identifier for this Tweet. This number is greater than 53 bits and some programming languages may have difficulty/silent defects in interpreting it. Using a signed 64 bit integer for storing this identifier is safe|
|created_at|String|UTC time when this Tweet was created|
|text|String|The actual UTF-8 text of the Tweet|
|user|Dictionary|Twitter account metadata describing the author of the Tweet (the full user object)|
|user_id|Integer|User's id|
|user_name|String|User's name|
|user_location|String|User's location from profile|
|user_description|String|User's description from profile|
|user_followers|Integer|Number of user's followers|
|retweet_count|Integer|Number of times this Tweet has been retweeted|
|favorite_count|Integer|Indicates approximately how many times this Tweet has been liked by Twitter users|
|lang|String| When present, indicates a BCP 47 language identifier corresponding to the machine-detected language of the Tweet text, or und if no language could be detected|
|is_quote_status|Boolean|Indicates whether this is a Quoted Tweet|
|place|Dictionary|Geotag data from tweet. When posting Tweets, users have the option to geotag their Tweet with an exact location or a Twitter Place|
|place_name|String|Short human-readable representation of the place’s name|
|place_country|String|Shortened country code representing the country containing this place|
|coordinates|List|Approximate location of the Tweet, taken from the first corner of the place's bounding box, in the form [longitude, latitude]|
|coordinates_longitude|Float|The longitude of that bounding-box corner|
|coordinates_latitude|Float|The latitude of that bounding-box corner|
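
Since `created_at` arrives as a string, it usually has to be parsed before any time-based analysis. The sketch below shows one way to do that with the standard library, assuming the v1.1 timestamp layout such as `Wed Oct 10 20:19:24 +0000 2018`. The helper name `parse_created_at` and the sample value are illustrative only.

```python
from datetime import datetime, timezone

# Assumed layout of the `created_at` string in v1.1 Tweet objects,
# e.g. "Wed Oct 10 20:19:24 +0000 2018".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_created_at(value):
    """Convert a Tweet `created_at` string into a timezone-aware UTC datetime."""
    return datetime.strptime(value, CREATED_AT_FORMAT).astimezone(timezone.utc)


sample = "Wed Oct 10 20:19:24 +0000 2018"  # illustrative value, not collected data
print(parse_created_at(sample).isoformat())
```

Note that `%a` and `%b` depend on the active locale, so this assumes the default English day and month abbreviations.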


### Collecting data

The function below parses a single tweet. It accepts a tweet ID as an argument, requests that tweet from Twitter's API, and returns the selected fields as a dictionary:

In [None]:
def tweet_collection(tweet_id):
    json = api.get_status(tweet_id)._json
    tweet = {}
    tweet['id'] = json['id']
    tweet['created_at'] = json['created_at']
    tweet['text'] = json['text']
    #information about user
    tweet['user'] = json['user']
    tweet['user_id'] = json['user']['id']
    tweet['user_name'] = json['user']['name']
    tweet['user_location'] = json['user']['location']
    tweet['user_description'] = json['user']['description']
    tweet['user_followers'] = json['user']['followers_count']
    #tweet info
    tweet['retweet_count'] = json['retweet_count']
    tweet['favorite_count'] = json['favorite_count']
    tweet['lang'] = json['lang']
    tweet['is_quote_status'] = json['is_quote_status']
    #location data, present only if the tweet is geotagged with a Twitter Place
    if json['place'] is not None:
        tweet['place'] = json['place']
        tweet['place_name'] = json['place']['full_name']
        tweet['place_country'] = json['place']['country_code']
        #first corner of the place's bounding box, as [longitude, latitude]
        tweet['coordinates'] = json['place']['bounding_box']['coordinates'][0][0]
        tweet['coordinates_longitude'] = json['place']['bounding_box']['coordinates'][0][0][0]
        tweet['coordinates_latitude'] = json['place']['bounding_box']['coordinates'][0][0][1]
    else:
        tweet['place']= None
        tweet['place_name'] = None 
        tweet['place_country'] = None 
        tweet['coordinates'] = None 
        tweet['coordinates_longitude'] = None 
        tweet['coordinates_latitude'] = None 
    return tweet
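
The function above indexes directly into the payload, so a missing key raises `KeyError`. As a hedged sketch of a more defensive variant, the place fields could be read with `dict.get` so that absent data yields `None`. The helper `flatten_place` and the sample payload in the usage note are hypothetical and only mirror the field names used above.

```python
def flatten_place(payload):
    """Extract place fields from a tweet payload, returning None values when absent."""
    place = payload.get("place") or {}
    # Fall back to a single empty ring when no bounding box is available.
    box = (place.get("bounding_box") or {}).get("coordinates") or [[]]
    corner = box[0][0] if box[0] else None  # first corner as [longitude, latitude]
    return {
        "place_name": place.get("full_name"),
        "place_country": place.get("country_code"),
        "coordinates": corner,
        "coordinates_longitude": corner[0] if corner else None,
        "coordinates_latitude": corner[1] if corner else None,
    }
```

Calling `flatten_place({"place": None})` returns a dictionary whose values are all `None`, which matches the `else` branch of `tweet_collection`.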

Collecting tweets from a list of IDs:

In [None]:
with open('tweet_ids.txt','r') as ids_file, open('tweets_results.csv','a+') as output: #open the id list and the results file
    header_written = False
    for raw_id in ids_file: #loop through list of ids
        tweet_id = raw_id.strip()
        if not tweet_id: #skip blank lines
            continue
        try:
            tweet_dict = tweet_collection(tweet_id) #get data for selected tweet
            if not header_written: #write the header once, from the first successful result
                output.write(",".join(tweet_dict.keys()) + '\n')
                header_written = True
            line = ",".join(["\"" + "".join(str(e).splitlines()).replace("\"", "") + "\"" for e in tweet_dict.values()])
            output.write(line + '\n') #write it down
        except Exception as e:
            with open('errors.txt','a+') as f: #save errors to errors.txt
                f.write(time.ctime() + ': ' + str(e) + ' ' + tweet_id + '\n')
        time.sleep(1) #wait 1 sec after each request to follow API rules (up to 900 requests per 15 minutes)
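
The loop above builds each CSV line by hand and drops quotes and line breaks from the values to keep the file parseable. As a sketch of an alternative, the standard `csv` module (already imported at the top of this notebook) can quote such values instead of removing them. The helper `write_tweets_csv` and the sample rows are illustrative assumptions, not collected data.

```python
import csv
import io


def write_tweets_csv(rows, handle):
    """Write a list of flat tweet dictionaries to an open text handle as CSV."""
    if not rows:
        return
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


# Illustrative rows, not collected data.
rows = [
    {"id": 1, "text": 'He said "run", then left', "lang": "en"},
    {"id": 2, "text": "line one\nline two", "lang": "en"},
]
buffer = io.StringIO()
write_tweets_csv(rows, buffer)

# Reading the data back recovers the original text, quotes and newlines included.
buffer.seek(0)
restored = list(csv.DictReader(buffer))
```

When writing to a real file rather than an in-memory buffer, the `csv` documentation recommends opening it with `newline=''`.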

### Collecting data - for client

The second function collects data for the client. It accepts a list of disaster-related search terms, queries Twitter's search API for up to 100 English-language tweets per term, and returns the geotagged tweets as a list of dictionaries:

In [None]:
def geo_tweets_collection(disaster_list):
    result = []
    for word in disaster_list:
        for status in tweepy.Cursor(api.search, q=word,lang='en').items(100): #up to 100 tweets per search term
            json = status._json #raw payload of the tweet
            if json['place'] is not None: #keep only geotagged tweets
                tweet = {}
                tweet['id'] = json['id']
                tweet['created_at'] = json['created_at']
                tweet['text'] = json['text']
                #information about user
                tweet['user'] = json['user']
                tweet['user_id'] = json['user']['id']
                tweet['user_name'] = json['user']['name']
                tweet['user_location'] = json['user']['location']
                tweet['user_description'] = json['user']['description']
                tweet['user_followers'] = json['user']['followers_count']
                #tweet info
                tweet['retweet_count'] = json['retweet_count']
                tweet['favorite_count'] = json['favorite_count']
                tweet['lang'] = json['lang']
                tweet['is_quote_status'] = json['is_quote_status']
                #about location if user uses geotags 
                tweet['place'] = json['place']
                tweet['place_name'] = json['place']['full_name']
                tweet['place_country'] = json['place']['country_code']
                tweet['coordinates'] = json['place']['bounding_box']['coordinates'][0][0]
                tweet['coordinates_longitude'] = json['place']['bounding_box']['coordinates'][0][0][0]
                tweet['coordinates_latitude'] = json['place']['bounding_box']['coordinates'][0][0][1]
                result.append(tweet)
            else:
                continue
            time.sleep(3) 
    return result
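
Both functions store the first corner of the place's bounding box as the tweet's coordinates. If a single representative point is needed, the center of the box may be a closer approximation. The sketch below assumes the bounding box has the shape used above (a `coordinates` list holding one ring of `[longitude, latitude]` corners). The helper `bounding_box_center` is hypothetical, and it does not handle boxes that cross the antimeridian.

```python
def bounding_box_center(bounding_box):
    """Return (longitude, latitude) of the center of a place bounding box."""
    corners = bounding_box["coordinates"][0]  # one ring of [longitude, latitude] pairs
    longitudes = [point[0] for point in corners]
    latitudes = [point[1] for point in corners]
    return (
        (min(longitudes) + max(longitudes)) / 2,
        (min(latitudes) + max(latitudes)) / 2,
    )


# Illustrative box, not collected data.
example_box = {
    "coordinates": [[[-74.0, 40.0], [-72.0, 40.0], [-72.0, 42.0], [-74.0, 42.0]]]
}
center_longitude, center_latitude = bounding_box_center(example_box)
```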

### Automation

To collect more data, I automated the collection: I put the script 'twitter_collect.py' on an AWS EC2 instance and ran it there.
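
An unattended script has to pace its own requests. The following is a minimal sketch of such pacing, assuming the budget of 900 requests per 15 minutes mentioned in the collection loop. The `paced` generator is a hypothetical helper and does not reproduce the contents of 'twitter_collect.py'.

```python
import time

WINDOW_SECONDS = 15 * 60  # length of the rate-limit window
MAX_REQUESTS = 900        # assumed request budget per window


def paced(items, max_requests=MAX_REQUESTS, window=WINDOW_SECONDS, sleep=time.sleep):
    """Yield items one at a time, sleeping between them to stay within the budget."""
    delay = window / max_requests  # seconds to wait between consecutive requests
    for index, item in enumerate(items):
        if index:  # no need to wait before the first item
            sleep(delay)
        yield item
```

A collection loop could then iterate with `for tweet_id in paced(tweet_ids):` instead of calling `time.sleep` directly. The `sleep` parameter exists so the pacing can be checked without actually waiting.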


### Executive Summary of Data Collection 

- Testing data was gathered from Twitter's API using the Python 'tweepy' library. Twitter's API returns a JSON object for each post. Lists of posts were taken from http://crisislex.org/, a website that provides repositories of crisis-related social media data.

- We defined a function for parsing data. The function accepts a tweet ID as an argument, sends a request to Twitter's API, and saves the selected information from the tweet to a dictionary.

- From that, we created a script that loops through a list of IDs and writes the results to a CSV file.

- We automated collection using an AWS EC2 instance.

- We created a data dictionary for the collected data.

