# Step 1: Data Acquisition
Series on Designing with Twitter Data 

In this tutorial: 
* Task 1: Retrieving data from Twitter
* Task 2: Managing data

### Output: tweets.csv -> Step 5

For Education Only

@ Wolf & Jacky - SST - IO - TUDelft 

V1.0: April 20, 2020

# Task 1: Retrieving Data from Twitter

The objective of this task is to get you familiar with the Twitter REST API. To achieve this, you will first create a data retrieval application. Next, you will use the tweepy library to interact with Twitter and access tweets.

## 1.1 Data Retrieving Application

To access tweets, you first need to register an 'application' with Twitter as a developer.

If you do not have one yet, create a Twitter account: https://twitter.com

Then, you need to upgrade this account to a developer account: https://dev.twitter.com

Finally, you can access the application page via the following link: https://dev.twitter.com/apps

Click the blue 'Create an app' button and provide some basic information about the application.

After that, you will be able to get the following parameters:

* Consumer key;
* Consumer secret;
* Access token;
* Access token secret.

We need these four parameters to authenticate our code with Twitter via OAuth. Let's create a .env file in the project folder and write the following four lines, replacing YOUR-... with your own keys and secrets.

```
CONSUMER_KEY=YOUR-KEY
CONSUMER_SECRET=YOUR-SECRET

ACCESS_TOKEN=YOUR-TOKEN
ACCESS_TOKEN_SECRET=YOUR-TOKEN-SECRET
```

With this, we will be able to share our code openly without sharing our credentials.

## 1.2 Retrieving Data

It is time to start coding! The first thing we need is to tell our code where to find our keys and secrets. We use the library 'dotenv' to extract this information from the .env file.

In [None]:
# Install the library
!pip install python-dotenv
# Load it on the Notebook
from dotenv import load_dotenv
import os
load_dotenv()

# Use it to retrieve our four Twitter parameters
consumer_key = os.environ['CONSUMER_KEY']
consumer_secret = os.environ['CONSUMER_SECRET']
access_token = os.environ['ACCESS_TOKEN']
access_token_secret = os.environ['ACCESS_TOKEN_SECRET']
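
If one of the four entries is missing from the .env file, `os.environ[...]` raises a `KeyError` that names only the first missing key. The following is a minimal sketch of a check that lists every missing name at once. The helper `missing_credentials` and the constant `REQUIRED_KEYS` are hypothetical names introduced for illustration; they are not part of dotenv.

```python
import os

# Names of the four parameters expected in the .env file
REQUIRED_KEYS = [
    "CONSUMER_KEY",
    "CONSUMER_SECRET",
    "ACCESS_TOKEN",
    "ACCESS_TOKEN_SECRET",
]

def missing_credentials(env=None):
    """Return the required names that are absent or empty in env.

    Hypothetical helper for illustration; env defaults to os.environ.
    """
    if env is None:
        env = os.environ
    return [key for key in REQUIRED_KEYS if not env.get(key)]

# Example with a plain dictionary standing in for the environment
example_env = {"CONSUMER_KEY": "abc", "CONSUMER_SECRET": "def"}
print(missing_credentials(example_env))
```

Calling `missing_credentials()` without an argument right after `load_dotenv()` would check the real environment instead of the example dictionary.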

For the next step, we use another Python library, tweepy, to facilitate our interaction with Twitter. Let's install and load the library.

In [None]:
!pip install tweepy
from tweepy import OAuthHandler, API
import json

Once you have the tokens and secrets stored in four variables as above, you can run the following code to establish the connection with Twitter via the OAuthHandler and API objects from tweepy.

In [None]:
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = API(auth)

Let's retrieve the tweets from your own timeline. The following code uses the API object to get tweets from your timeline (function 'home_timeline') and stores them in a variable called public_tweets.

In [None]:
public_tweets = api.home_timeline()

Printing public_tweets directly gives us a long JSON structure (starting with _json=) that is not very easy to read. Instead, we loop over the tweets and print only the text of each one.

In [None]:
# For each tweet in the result
for tweet in public_tweets:
   # printing the text stored inside the tweet object
   print(tweet.text)

## 1.3 Searching per Twitter Id

Our Twitter timeline is a good test, but it does not get us far in the exploration of a given topic. A more effective way is to look at a specific account (e.g. a person or a company). This time, we use the function 'user_timeline' with a Twitter account id and the number of tweets we want. In the following example, we try to pull 20 tweets from Tesla's Twitter account.

In [None]:
# Twitter account to look at
twitter_id = "Tesla"
# Number of tweets to pull
tweetCount = 20
# Calling the user_timeline function with our parameters
# ("Tesla" is a screen name, so we pass it as screen_name)
timeline_results = api.user_timeline(screen_name=twitter_id, count=tweetCount)
# For each tweet of the result
for tweet in timeline_results:
   # printing the text stored inside the tweet object
   print(tweet.text)

## 1.4 Searching per Query

Finally, going into more detail, we can write a more specific query on a topic. In the following example, we use the function 'search' to try to pull 20 tweets in English that include the keywords 'prius' and 'car'.

In [None]:
# Simple query
query = "prius car"
# Number of tweets to pull
tweetCount = 20
# Language code (follows ISO 639-1 standards)
language = "en"
# Calling the search function with our parameters
query_results = api.search(q=query, lang=language, count=tweetCount)
# Loop through all tweets pulled
for tweet in query_results:
   # printing the text stored inside the tweet object
   print(tweet.text)

You can find more options and documentation in the [Tweepy documentation](http://docs.tweepy.org/en/latest/api.html#API.search).

# Task 2: Managing Data

In this second task, we will look at the result itself: how it is formatted, what it contains and how to store the tweets.

## 2.1 Navigating the Results

Looking back at the previous example of a query with 'prius' and 'car', let's print the raw JSON result of the query. JSON (JavaScript Object Notation) is a common data format for exchanging information. The function json.dumps helps us handle this structure by formatting it as text. You can try with and without the parameter 'indent=2' to compare readability.

In [None]:
query_results = api.search(q="prius car", lang="en", count=1)
for tweet in query_results:
   # printing the raw json
   print(json.dumps(tweet._json, indent=2))

For this query, we set the number of tweets to 1 and still the result is long: it contains much more than the text of the tweet. Browsing through the key/value pairs, you will certainly spot the creation date (created_at), the unique id of the tweet and its text. Further down, a nested data structure reveals information about the user, including their name, description and so on. Towards the end, there is also information about the number of retweets and favorites. Any of these attributes can be extracted, but keep in mind that not every tweet carries every possible attribute.

To navigate the results and extract only the values we want, we use the dot '.' to enter an attribute. Here is an example that extracts the name of the tweet author. Try changing the code to extract another attribute.

In [None]:
test_results = api.search(q="prius car", lang="en", count=5)
for tweet in test_results:
   # printing the author's name of each tweet
   print(tweet.user.name)
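
Because not every tweet carries every attribute, it can help to read optional values defensively. The following is a minimal sketch that works on the dictionary form of a tweet (what `tweet._json` holds). The `sample_tweet` dictionary is a hypothetical, heavily shortened stand-in rather than real data, and `get_nested` is a helper name introduced here for illustration.

```python
# Hypothetical, heavily shortened stand-in for tweet._json (not real data)
sample_tweet = {
    "id_str": "1",
    "text": "Example tweet text",
    "user": {"name": "Example Author", "description": ""},
    "place": None,
}

def get_nested(data, *keys, default=None):
    """Follow keys through nested dictionaries.

    Returns default as soon as a key is missing or its value is None.
    """
    current = data
    for key in keys:
        if not isinstance(current, dict) or current.get(key) is None:
            return default
        current = current[key]
    return current

# Present attribute: prints the author's name
print(get_nested(sample_tweet, "user", "name"))
# Attribute present but empty (None): falls back to the default
print(get_nested(sample_tweet, "place", "full_name", default="unknown"))
# Attribute absent altogether: falls back to the default
print(get_nested(sample_tweet, "coordinates", default="no coordinates"))
```

The same idea applies to the dot notation: `getattr(tweet, "place", None)` returns `None` instead of raising an error when the attribute does not exist.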

## 2.2 Storing and Retrieving Data from JSON

We can also store the result in a file. To do this, we use the function open() with the name of the file and the option 'a' (append mode) to continuously add at the end of the file.

In [None]:
query_results = api.search(q="prius car", lang="en", count=100)
# Open file with option 'a' for 'append' new content at the end of the file.
json_file = open("tweets.json","a")
count=0
for tweet in query_results:
    # write tweet in file
    json_file.write(json.dumps(tweet._json))
    # Create a new line
    json_file.write('\n')
    count=count+1
json_file.close()
print(count, 'tweets stored as JSON.')
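
To retrieve the tweets later, the file can be read back line by line, since each line holds one JSON object. The following is a minimal sketch of that round trip. It writes two hypothetical records (stand-ins for `tweet._json`, not real data) to a temporary file so that it runs without a Twitter connection; `read_json_lines` and `demo_path` are names introduced here for illustration.

```python
import json
import os
import tempfile

# Hypothetical records standing in for tweet._json dictionaries
sample_records = [
    {"id_str": "1", "text": "first example"},
    {"id_str": "2", "text": "second example\nwith a line break"},
]

def read_json_lines(path):
    """Read a file with one JSON object per line into a list of dictionaries."""
    tweets = []
    with open(path, "r", encoding="utf-8") as json_file:
        for line in json_file:
            line = line.strip()
            # Skip empty lines, e.g. a trailing blank line
            if line:
                tweets.append(json.loads(line))
    return tweets

# Write the sample in the same one-object-per-line layout, then read it back
demo_path = os.path.join(tempfile.mkdtemp(), "tweets_demo.json")
with open(demo_path, "a", encoding="utf-8") as json_file:
    for record in sample_records:
        json_file.write(json.dumps(record) + "\n")

loaded = read_json_lines(demo_path)
print(len(loaded), "tweets loaded from JSON.")
```

Line breaks inside a tweet do not break this layout, because json.dumps escapes them inside the string. Calling `read_json_lines("tweets.json")` would load the file written in the cell above.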

## 2.3 Storing and Retrieving Data from CSV

Finally, we can combine the two previous steps (selecting attributes and storing into files) to store our data in CSV format. CSV stands for Comma Separated Values. It is a common format to store tabular data such as spreadsheets. Instead of storing all the data retrieved from the query, we will select only the ID and the text. Feel free to experiment and store more fields.

In [None]:
query_results = api.search(q="prius car", lang="en", count=100)
# Open file with option 'a' for 'append' new content at the end of the file.
csv_file = open("tweets.csv","a")
count=0
for tweet in query_results:
    # Compose a line with the tweet id and the text.
    # Note the surrounding double quotes to protect commas in the tweet,
    # the doubling of any double quote inside the text (the CSV way to escape it),
    # as well as the replacement of every new line by a space
    text = tweet.text.replace('\n', ' ').replace('"', '""')
    line = tweet.id_str + ',"' + text + '"\n'
    # Write tweet in file
    csv_file.write(line)
    # Count the number of line for the end message
    count=count+1
csv_file.close()
print(count, 'tweets stored as CSV.')
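
The line above is assembled by hand. As an alternative, Python's built-in csv module takes care of quoting commas and double quotes, and it can read the file back as well. The following is a minimal sketch of that approach. The `rows` list holds hypothetical id/text pairs (not real tweets) and the file is written to a temporary folder, so the example runs without a Twitter connection.

```python
import csv
import os
import tempfile

# Hypothetical (id, text) pairs standing in for tweet.id_str and tweet.text
rows = [
    ("1", 'He said "nice car", twice'),
    ("2", "Line one\nline two"),
]

demo_path = os.path.join(tempfile.mkdtemp(), "tweets_demo.csv")

# Store: csv.writer quotes fields containing commas or double quotes
with open(demo_path, "a", newline="", encoding="utf-8") as csv_file:
    writer = csv.writer(csv_file)
    for tweet_id, text in rows:
        # Replace new lines by a space, as in the cell above
        writer.writerow([tweet_id, text.replace("\n", " ")])

# Retrieve: csv.reader undoes the quoting and returns each row as a list
with open(demo_path, "r", newline="", encoding="utf-8") as csv_file:
    loaded = [tuple(row) for row in csv.reader(csv_file)]

print(loaded)
```

With real results, the loop would iterate over `query_results` and write `[tweet.id_str, tweet.text.replace("\n", " ")]` for each tweet.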

This file can be used as input data in Step 5.