## Five Steps to Transforming Any Twitter User's Timeline into a csv or Dictionary with Python

###### By Eleanor Stribling, July 2018

If you enjoy Natural Language Processing like I do, you might have some ideas for analyses you could do using Twitter. What are the word patterns and hashtags that the tweeter uses?  How do these change over time?  What sites do they link to?  What words appear around others?  Who or what do they talk about?

This is part 1 of a 2 part project to show you how to:

A. Obtain and clean the data from a Twitter user's timeline

B. Export into a csv file and a Python dictionary

C. Apply analysis tools from NLTK to get summary data and explore hypotheses about the user's tweets

D. Use Python libraries to visualize the data you collect

In this workbook, we'll cover A and B from the list above.  First, I'll show you how to gather any user's timeline using just Python and a few libraries. Then, we'll get ready to do the fun part of the project - the analysis - by cleaning the data and putting it into your choice of a csv file or Python dictionary.

I'm using Python 3.6 in this project, as well as an [Anaconda virtual environment](https://conda.io/docs/user-guide/tasks/manage-environments.html).  To complete the steps below, you should already have Python 3.6 and Anaconda installed and the virtual environment activated. You will also need a Twitter account and an internet connection to complete this workbook. 

Since I code on a Mac, some of the terminology I use doesn't translate directly to a Windows machine; for example, I say "terminal" rather than "shell".

If you're not too familiar with Jupyter Notebooks, they are awesome, and [this quick start guide](http://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/) is well worth your time.  Knowing the basics will help you through this notebook and be a useful tool when you do your own projects!

This code lives on Github, and is the subject of [this post](https://medium.com/p/ff41b941ed6) on my technical blog, [agatha.codes](https://medium.com/agatha-codes)

### Step 1: Install the required packages 👩🏻‍💻
From your virtual environment, you can install all of the packages listed in the `requirements.txt` file in this repo at once, rather than searching out and installing each one separately.  

To do this, go to the folder you'll be working from in your terminal, move the `requirements.txt` file over to that folder, and run the command `pip install -r requirements.txt`.

Once that's installed, type `jupyter notebook twitter_timeline_to_csv`, and the notebook will open in the browser.

Running the cell below will import all of the needed packages into your script and make them ready for use.

In [1]:
import csv
from datetime import datetime
import os
import re
import requests
import twitter
import lxml.html

### Step 2: Get ready to talk to the Twitter API 📞
If you don't already have a Twitter account, you can create one [here](https://twitter.com/i/flow/signup).

Once you have a Twitter account, you'll need to create an application to get the keys needed to access the API.  Navigate to the [apps page](https://apps.twitter.com) while signed into your account and click on the `Create New App` button on the top right.
<img src="images/applications_page.png"/>

Once that's done, click on your app's name from the list and then the "Keys and access tokens" tab. 
<img src="images/keys.png"/>

Here, you'll need to get four things that will identify you and give you access to the Twitter API:

From the top of the page:
- Consumer key (API key)
- Consumer secret key (API secret key)

<img src="images/consumer_api.png" />

From the bottom of the page:
- Access token
- Access secret token

<img src="images/access_token.png" />

If you are going to put your code online anywhere, you should set environment variables for all four of these values; pasting them directly in your code is likely to result in you accidentally committing them and pushing them into your online repo at some point.  If someone gets a hold of these values they can access Twitter's API as you, which you don't ever want.

Instructions for setting environment variables in an Anaconda virtual environment are [here](https://conda.io/docs/user-guide/tasks/build-packages/environment-variables.html).

The code below reads the environment variables I've created to hold my consumer keys and access tokens and saves each one as a variable in my code.  If you copy this workbook and run it locally on your computer, change the names in the quotes to the names you've given the variables in your virtual environment.

In [2]:
TWITTER_CONS_KEY = os.environ.get('T_CONS_')
TWITTER_CONS_SEC = os.environ.get('T_CONS_SECRET')
TWITTER_ACCESS_TOKEN = os.environ.get('T_ACCESS_')
TWITTER_ACCESS_SEC = os.environ.get('T_ACCESS_SECRET')
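
One thing to watch for: `os.environ.get` returns `None` when a variable isn't set, so a typo in a variable name won't raise an error at this point. Here's a minimal sketch of a check you could run before going further. The `missing_env_vars` helper is hypothetical (it isn't part of this repo), and the names in `REQUIRED_VARS` are just the ones I used above, so swap in your own.

```python
import os

# Assumption: these are the variable names used in the cell above; change them to match yours.
REQUIRED_VARS = ['T_CONS_', 'T_CONS_SECRET', 'T_ACCESS_', 'T_ACCESS_SECRET']

def missing_env_vars(names, environ=None):
    """Hypothetical helper: return the names that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Not set in this environment:", ", ".join(missing))
else:
    print("All four credentials were found.")
```

If anything is listed as missing, fix the environment variables before moving on.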

Now that's done, we're going to use the python-twitter library to access the Twitter API.  I find this a lot easier than writing the API requests from scratch. 

We imported the python-twitter library as `import twitter` above.  It shows up in the [requirements.txt](requirements.txt) file as `python-twitter==3.4.2`, and you can read the documentation [here](https://python-twitter.readthedocs.io/en/latest/), which also happens to have a great step-by-step guide to getting started with the Twitter API!

The first thing we'll do is create an object that will hold all of the credentials for accessing the API that we defined above.

In [3]:
t = twitter.Api(
    consumer_key = TWITTER_CONS_KEY,
    consumer_secret = TWITTER_CONS_SEC,
    access_token_key = TWITTER_ACCESS_TOKEN, 
    access_token_secret = TWITTER_ACCESS_SEC,
    tweet_mode='extended' # this ensures that we get the full text of the users' original tweets
)

That's it!  You're ready to talk to the Twitter API!

### Step 3: Get some Tweets! 📦

This is where things start getting fun.

If you haven't already, choose an account you want to analyze.  

Next, you'll need to create a variable to hold that account's Twitter handle.  Input this in the cell below without the '@' (e.g. use NASA, not @NASA).  If you're starting with the URL of the account, take only the handle after the last '/' (e.g. from https://twitter.com/NASA, use 'NASA').

In [4]:
screen_name = "NASA"
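
If you'd rather not trim the handle by hand, the rule described above can be sketched as a small function. The `to_screen_name` helper is hypothetical, and it only covers the two cases mentioned here: a leading '@' and a plain profile URL. It doesn't handle URLs with query strings.

```python
def to_screen_name(text):
    """Hypothetical helper: reduce a profile URL or '@handle' to the bare handle."""
    text = text.strip().rstrip('/')      # drop surrounding spaces and any trailing '/'
    handle = text.rsplit('/', 1)[-1]     # keep only what follows the last '/'
    return handle.lstrip('@')            # drop a leading '@' if there is one

print(to_screen_name("https://twitter.com/NASA"))  # NASA
print(to_screen_name("@NASA"))                     # NASA
print(to_screen_name("NASA"))                      # NASA
```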

Now that we have that, let's get some data! We'll call the Twitter API with the `GetUserTimeline` method on our `t` object that we created in the last step.  

Notice that we are passing in the `screen_name` variable, and setting the `count` variable to `200`.  You must have the screen_name to make the API call and if you omit the count variable, you'll get the latest 20 tweets back by default.  You can't get more than 200 tweets back at a time.

In [5]:
# Call the Twitter API using the python-twitter library
first_200 = t.GetUserTimeline(screen_name=screen_name, count=200)

In [6]:
# Let's explore the data!
print("There are %d tweets in our dataset, which has the type %s." % (len(first_200), type(first_200)))
print()
print("Here's a sample of what the data from one tweet looks like:")
print(first_200[0])
print()
print("Each tweet is stored in a %s data structure." % type(first_200[0]))

There are 200 tweets in our dataset, which has the type <class 'list'>.

Here's a sample of what the data from one tweet looks like:
{"created_at": "Sun Jul 01 23:04:00 +0000 2018", "full_text": "RT @NASA_Johnson: This week on \"Houston, We Have a Podcast,\" Harry Roberts, Flight Operation Supervisor for the Aircraft Operations Divisio\u2026", "hashtags": [], "id": 1013558869728071685, "id_str": "1013558869728071685", "lang": "en", "retweet_count": 96, "retweeted_status": {"created_at": "Sun Jul 01 16:47:15 +0000 2018", "favorite_count": 511, "full_text": "This week on \"Houston, We Have a Podcast,\" Harry Roberts, Flight Operation Supervisor for the Aircraft Operations Division, talks about operations out at Ellington Field and the aircraft that helped make human spaceflight possible. https://t.co/nGdx8Lck02 https://t.co/L15ESgugoE", "hashtags": [], "id": 1013464059197493249, "id_str": "1013464059197493249", "lang": "en", "media": [{"display_url": "pic.twitter.com/L15ESgugoE", "expand

So we can see we got the first 200 tweets as a list of objects, and that there's a LOT of data in each one.  We'll come back to that in a moment, but first, let's get some more sweet, sweet data!

The function below will help us get more data for our dataset.  The limitation we're working with, at the time of this writing, is that the Twitter API is rate limited, meaning Twitter puts restrictions on how much data you can take at a time.  For my app, I'm rate limited to 900 requests every 15 minutes.  

Here's what the `get_tweets` function below does:
1. Takes the `first_200`, and `screen_name` variables as arguments, as well as something called `last_id`.  As you'll see when we call the function, this is the ID number of the last/oldest tweet in the `first_200` list.
2. Makes a new list and adds the `first_200` tweets to it
3. For 900 iterations (because of my rate limit):
   - Gets 200 of the user's tweets, starting with a `max_id` one smaller than `last_id` in the previous list of tweets; the `new` variable this data is stored in gets overwritten each time
   - Adds the list of tweet objects obtained to the `all_tweets` list
   - If there's anything in the list (i.e. if we got any data back), grab a new `last_id` value to feed into the API call in the next iteration; if the list is empty, we've reached the end of the available tweets, so stop looping.

In [7]:
def get_tweets(first_200, screen_name, last_id):
    all_tweets = []
    all_tweets.extend(first_200)
    for i in range(900):
        # ask for 200 tweets per call; without count, the API returns only 20
        new = t.GetUserTimeline(screen_name=screen_name, count=200, max_id=last_id-1)
        if len(new) > 0:
            all_tweets.extend(new)
            last_id = new[-1].id
        else:
            break
    
    return all_tweets
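
To get a feel for how far below the rate limit this loop stays, here's a rough back-of-the-envelope sketch. It assumes every call asks for the maximum of 200 tweets, uses my limit of 900 requests per 15 minutes, and picks 3,100 tweets purely as an illustrative timeline size.

```python
import math

TWEETS_PER_REQUEST = 200    # the most the API returns per call, as described above
REQUESTS_PER_WINDOW = 900   # my app's rate limit per 15-minute window

def requests_needed(total_tweets, per_request=TWEETS_PER_REQUEST):
    """Number of API calls needed to page through total_tweets tweets."""
    return math.ceil(total_tweets / per_request)

calls = requests_needed(3100)        # 3,100 is a made-up timeline size
print(calls)                         # 16
print(calls <= REQUESTS_PER_WINDOW)  # True
```

In other words, the 900 in the loop is an upper bound; the `break` should kick in long before we get there.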

<img src="images/question.png" align="left">  <b>Why is max_id set to last_id MINUS 1?</b>

In the `GetUserTimeline` call inside the `get_tweets` function, the `max_id` is set to `last_id-1`.  Why not just set it to `last_id`?

Each tweet has a unique whole number, stored as the `id` attribute; in the sample printed above it looks like this: `"id": 1013558869728071685`.   The API returns tweets with an ID less than or equal to `max_id`, so if we set the `max_id` to the ID of the last tweet we already have, the first tweet in the next batch will be that exact tweet, meaning it will appear twice in the dataset.  By subtracting one from this ID, we are saying the max ID can be anything less than the ID of the last tweet in our previous batch, which is exactly what we want. 👏
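If you want to see the difference without touching the API, here's a small simulation. Everything in it is a stand-in: `FAKE_TIMELINE` is a made-up list of ten tweet IDs, and `fake_get_timeline` only imitates the "less than or equal to `max_id`" behavior, four tweets at a time.

```python
# Stand-in for a user's timeline: made-up tweet IDs, newest (largest) first.
FAKE_TIMELINE = list(range(110, 100, -1))  # 110, 109, ..., 101

def fake_get_timeline(max_id=None, count=4):
    """Imitate max_id semantics: return up to `count` IDs that are <= max_id."""
    matching = [i for i in FAKE_TIMELINE if max_id is None or i <= max_id]
    return matching[:count]

def collect(step_back):
    """Page through the fake timeline, subtracting step_back from the last ID seen."""
    collected = fake_get_timeline()
    while True:
        page = fake_get_timeline(max_id=collected[-1] - step_back)
        # with step_back=0, a page holding only the tweet we already have means we're done
        if not page or page == collected[-1:]:
            break
        collected.extend(page)
    return collected

with_minus_one = collect(step_back=1)
without_minus_one = collect(step_back=0)

print(len(with_minus_one), len(set(with_minus_one)))        # 10 10
print(len(without_minus_one), len(set(without_minus_one)))  # 12 10
```

Without the minus one, the tweet at every page boundary is collected twice, so the list ends up with 12 entries for 10 unique tweets.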


In [8]:
# let's run our function!
all_tweets = get_tweets(first_200, screen_name, first_200[-1].id)

In [9]:
# now to check the data - we should have more than 200 tweets.
print("There are %d tweets stored in a list as the all_tweets variable." % len(all_tweets))
print("The most recent tweet in our collection was sent %s and the oldest tweet was sent %s." % (
    all_tweets[0].created_at,
    all_tweets[-1].created_at,
))

There are 3245 tweets stored in a list as the all_tweets variable.
The most recent tweet in our collection was sent Sun Jul 01 23:04:00 +0000 2018 and the oldest tweet was sent Wed Oct 11 15:31:37 +0000 2017.


### Step 4: Clean the data! 🛁

In this step, we're going to condense and clean our data to get it into a more analysis-friendly format.

First, the condensing part.  We need to decide what's important for the analysis.

Let's take another look at the data that comes back with each tweet from the Twitter API.  There's a lot here, much of it not specific to the tweet itself.  For example, since we're grabbing all of the tweets from a single person's timeline, the whole `"user"` attribute isn't very useful to us, as it's repeated every time.  

In this tutorial we'll focus on keeping and cleaning the following attributes, but you can choose your own and modify the code:
- `id`: the unique identifier for the tweet
- `created_at`: when the tweet was sent
- `full_text`: the text included in the tweet
- `hashtags`: the hashtags (e.g. "#space" appears as "space") included in the tweet
- `urls`: the expanded version of urls included in the tweet (e.g. "https://t.co/sYCFHKxzBf" is the shortened URL in the tweet but we'll get the full url of what it points to, https://twitter.com/NASA/status/1013529203990581249/video/1)
- `favorite_count`: number of times the tweet was favorited
- `retweet_count`: number of times the tweet was retweeted
- `source`: from what platform/app the tweet was posted

In [10]:
print(all_tweets[0])

{"created_at": "Sun Jul 01 23:04:00 +0000 2018", "full_text": "RT @NASA_Johnson: This week on \"Houston, We Have a Podcast,\" Harry Roberts, Flight Operation Supervisor for the Aircraft Operations Divisio\u2026", "hashtags": [], "id": 1013558869728071685, "id_str": "1013558869728071685", "lang": "en", "retweet_count": 96, "retweeted_status": {"created_at": "Sun Jul 01 16:47:15 +0000 2018", "favorite_count": 511, "full_text": "This week on \"Houston, We Have a Podcast,\" Harry Roberts, Flight Operation Supervisor for the Aircraft Operations Division, talks about operations out at Ellington Field and the aircraft that helped make human spaceflight possible. https://t.co/nGdx8Lck02 https://t.co/L15ESgugoE", "hashtags": [], "id": 1013464059197493249, "id_str": "1013464059197493249", "lang": "en", "media": [{"display_url": "pic.twitter.com/L15ESgugoE", "expanded_url": "https://twitter.com/NASA_Johnson/status/1013464059197493249/photo/1", "id": 1013463571764633602, "media_url": "http://pbs.t

One problem is that the data we get back isn't totally clean, so we need to process it a little bit first.  Here are some examples from the data above that could create problems for us later because they include formatting we don't need.  In every case except the `created_at` attribute, we want a string, or list of strings, with just the important parts; we don't need `<a href="https://www.sprinklr.com" rel="nofollow">Sprinklr</a>`, just `Sprinklr`.

For the `created_at` attribute, when using it in Python, we'll want to convert it to a datetime object.

Let's take a look at some fields that need cleaning.

In [11]:
print("Data in the created_at attribute looks like this:", all_tweets[0].created_at)
print("Data in the hashtags attribute looks like this:", all_tweets[0].hashtags)
print("Data in the urls attribute looks like this:", all_tweets[0].urls)
print("Data in the source attribute looks like this:", all_tweets[0].source)

Data in the created_at attribute looks like this: Sun Jul 01 23:04:00 +0000 2018
Data in the hashtags attribute looks like this: []
Data in the urls attribute looks like this: []
Data in the source attribute looks like this: <a href="https://www.sprinklr.com" rel="nofollow">Sprinklr</a>


To get these fields into an easier to use format, I've written some helper functions, each one explained below.

In [12]:
def clean_hashtags(hashtags):
    """
    Turns data with any number of hashtags like this - [Hashtag(Text='STEMonStation')] - to a list like this -
    ['STEMonStation']
    """
    # an empty list of hashtags simply produces an empty list
    return [hashtag.text for hashtag in hashtags]

def clean_urls(urls):
    """
    Turns data with any number of expanded urls like this - 
    [URL(URL=https://t.co/sYCFHKxzBf, ExpandedURL=https://youtu.be/34bFgA3H3hQ)]- to a list like this - 
    ["https://youtu.be/34bFgA3H3hQ"]
    """
    # an empty list of urls simply produces an empty list
    return [url.expanded_url for url in urls]
        

def clean_source(source):
    """
    Turns data including the source and some html like this - 
    <a href="https://www.sprinklr.com" rel="nofollow">Sprinklr</a> - to a string like this -
    'Sprinklr'
    """
    raw = lxml.html.document_fromstring(source)
    return raw.cssselect('body')[0].text_content()


def string_to_datetime(date_str):
    """
    Turns a string including date and time like this - Sun Jul 01 21:06:07 +0000 2018 - to a Python datetime object
    like this - datetime.datetime(2018, 7, 1, 21, 6, 7, tzinfo=datetime.timezone.utc)
    """
    return datetime.strptime(date_str, '%a %b %d %H:%M:%S %z %Y')


We'll use all of these helper functions in the next step!
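Before moving on, here's a quick sketch of why the datetime conversion is worth the trouble. It uses the same format string as `string_to_datetime` and the `created_at` value printed above; the comparison date is just an example I picked.

```python
from datetime import datetime, timezone

# Same format string as in string_to_datetime above
TWITTER_DATE_FORMAT = '%a %b %d %H:%M:%S %z %Y'

sent = datetime.strptime('Sun Jul 01 23:04:00 +0000 2018', TWITTER_DATE_FORMAT)

print(sent.isoformat())   # 2018-07-01T23:04:00+00:00
print(sent.weekday())     # 6 (Monday is 0, so 6 is Sunday)
print(sent < datetime(2018, 7, 2, tzinfo=timezone.utc))  # True
```

Once the value is a datetime object, sorting, filtering by date range, and grouping by weekday all become one-liners, which is much harder with the raw string.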

### Step 5: Reformat the data as a csv or Python dictionary 🗃
This is the last step of this tutorial, where I'll show you how to get this data, which we now know how to retrieve and clean, into a format that we need to start the analysis.

I'm showing both how to make this into a csv and a Python dictionary because:
- A lot of people like seeing the data all at once as a csv - there are neat ways to print it in Python but this is easier to absorb for a lot of people
- csv is a great format to use if you need to share the data with people who don't program, as they can open it up in any notepad or spreadsheet program, no coding required
- If you want to continue the analysis with Python but in a different project file, you can always use the dictionary or read in a csv in a couple of lines of code

So let's start with the csv file! The `write_to_csv` function will create the file and pull the data we want out of the `all_tweets` list of tweet objects we made in step 3.

In [13]:
def write_to_csv(tweets, filename):
    # the headers are the fields that we identified in step 4
    headers = ['id', 'full_text', 'hashtags', 'urls', 'created_at', 'favorite_count', 'retweet_count', 'source']
    
    # here we create the file and write the header row with the headers list
    # note that the 'filename' argument will be the name of the csv file
    with open(filename + '.csv', 'w', newline='') as csvfile:
        writer = csv.writer(csvfile, delimiter=',')
        writer.writerow(headers)
        
        # in this loop, we write a new row for each tweet object, with the data taken from the tweet object in 
        # the order we listed the headers
        # note where we call the helper functions from step 4 on hashtags, urls, and source
        for item in tweets:
            writer.writerow([item.id, 
                             item.full_text, 
                             clean_hashtags(item.hashtags), 
                             clean_urls(item.urls), 
                             item.created_at, 
                             item.favorite_count, 
                             item.retweet_count, 
                             clean_source(item.source)])
    # no explicit close needed: the with block closes the file for us

In [14]:
# now we call the function, passing in the all_tweets list
# here the filename will be the screen_name variable defined in step 3 with "_tweets" after it (e.g. NASA_tweets.csv),
# but you can change it to whatever you want
write_to_csv(all_tweets, screen_name + '_tweets')

Now you should see your csv file in the same directory as this notebook!  Check that it has all of the right fields and the right number of rows; there should be one for every tweet in the list plus a header row.  

You can see the file, current as of July 1, 2018, on Google Drive [here](https://docs.google.com/spreadsheets/d/1CaYfh9xJgJRPiN-bYn_ewaMCkp1uNkC1kC7E6pgy1lg/edit?usp=sharing), as well as in this repo.
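One thing to keep in mind if you read the csv back into Python later: the `csv` module writes non-string values using `str()`, so the hashtag and url lists come back as strings that merely look like lists. Here's a minimal sketch of the round trip, using an in-memory buffer and a single trimmed-down row instead of the real file; `ast.literal_eval` is one way to turn those strings back into lists.

```python
import ast
import csv
import io

# One row shaped like the output of write_to_csv, written to memory instead of disk
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['id', 'hashtags', 'urls'])
writer.writerow([1013529203990581249, ['STEMonStation'], ['https://youtu.be/34bFgA3H3hQ']])

buffer.seek(0)
rows = list(csv.DictReader(buffer))

print(repr(rows[0]['hashtags']))   # "['STEMonStation']" - a string, not a list
hashtags = ast.literal_eval(rows[0]['hashtags'])
print(hashtags)                    # ['STEMonStation']
```

The `id` column comes back as a string too, so convert it with `int()` if you need the number.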

If you're more interested in continuing to analyze the data in Python, I'd recommend putting it in a dictionary. Some of this is pretty similar to the process for writing the data to a csv file, although notice that we need to put each tweet under a key (here I've chosen the string version of the ID field).  Notice I'm running all four of the cleaning functions from step 4 on the data as I add it.

In [15]:
def create_dict(tweets):
    # named tweets_by_id rather than dict, which would shadow Python's built-in dict type
    tweets_by_id = {}
    for item in tweets:
        tweets_by_id[item.id_str] = {
            'full_text': item.full_text,
            'hashtags': clean_hashtags(item.hashtags),
            'urls': clean_urls(item.urls),
            'created_at': string_to_datetime(item.created_at),
            'favorite_count': item.favorite_count,
            'retweet_count' : item.retweet_count,
            'source': clean_source(item.source)
        }
    return tweets_by_id

In [16]:
tweet_dict = create_dict(all_tweets)

In [17]:
tweet_dict["1013529203990581249"]

{'full_text': "In this week's #STEMonStation, @astro_ricky demonstrates how water's molecular properties behave in microgravity and the unique opportunities it creates on @Space_Station: https://t.co/sYCFHKxzBf https://t.co/KZrx0H5HNH",
 'hashtags': ['STEMonStation'],
 'urls': ['https://youtu.be/34bFgA3H3hQ'],
 'created_at': datetime.datetime(2018, 7, 1, 21, 6, 7, tzinfo=datetime.timezone.utc),
 'favorite_count': 1587,
 'retweet_count': 429,
 'source': 'Sprinklr'}
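
As a small taste of what this structure makes easy, here's a sketch that counts hashtags and finds the most retweeted tweet. The `sample` dictionary is a made-up miniature of `tweet_dict` with invented IDs and only the keys used here, so you can run it on its own.

```python
from collections import Counter

# Hypothetical miniature of tweet_dict: invented IDs, only the keys used below
sample = {
    '1': {'hashtags': ['STEMonStation'], 'retweet_count': 429},
    '2': {'hashtags': [], 'retweet_count': 96},
    '3': {'hashtags': ['STEMonStation', 'space'], 'retweet_count': 10},
}

# Count every hashtag across all tweets
hashtag_counts = Counter(tag for tweet in sample.values() for tag in tweet['hashtags'])
print(hashtag_counts.most_common(1))   # [('STEMonStation', 2)]

# Find the key of the tweet with the highest retweet count
most_retweeted = max(sample, key=lambda tweet_id: sample[tweet_id]['retweet_count'])
print(most_retweeted)                  # 1
```

Swapping `sample` for `tweet_dict` should work the same way, since both use the tweet ID as the key and the same field names.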

That's it for this tutorial!  Stay tuned for Part 2 in [this repo](https://github.com/eleanorstrib/twitter_timeline_analysis_2) and on [agatha.codes](https://medium.com/agatha-codes) very soon.