## Six Steps to Transforming Any Twitter User's Timeline into a csv or Dictionary with Python

###### By Eleanor Stribling, July 2018

If you enjoy Natural Language Processing like I do, you might have some ideas for analyses you could do using Twitter. The goal of this two-part project is to show you how to:

1. Obtain and clean the data from a Twitter user's timeline

2. Export into a csv file and a Python dictionary

3. Apply analysis tools from NLTK to get summary data and explore hypotheses about the user's tweets

4. Use Python libraries to visualize the data you collect

In this workbook, we'll learn how to gather any user's timeline using just Python and a few libraries. Then, we'll get ready to do the fun part of the project - the analysis - by cleaning the data and putting it into your choice of a csv file or Python dictionary.

I'm using Python 3.6 in this project, as well as an [Anaconda virtual environment](https://conda.io/docs/user-guide/tasks/manage-environments.html).  To complete the steps below, you should already have Python 3.6 and Anaconda installed and the virtual environment activated. You will also need a Twitter account and an internet connection to complete this workbook. 

I code on a Mac, so some of the terminology I use doesn't translate directly to a Windows machine; for example, I say "terminal" instead of "shell".

If you're not too familiar with Jupyter Notebooks, they are awesome, and [this quick start guide](http://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/) is well worth your time.  Knowing the basics will help you through this notebook and will be useful when you do your own projects!

### Step 1: Install the required packages
From your virtual environment, you can install everything listed in the `requirements.txt` file in this repo in one go, rather than searching out and installing each package separately.

To do this, go to the folder you'll be working from in your terminal, move the `requirements.txt` file over to that folder, and run the command `pip install -r requirements.txt`.

Once the packages are installed, type `jupyter notebook twitter_timeline_to_csv` in your terminal, and the notebook will open in your browser.
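If the install step reported errors, a quick check like the one below can show which packages are still missing. This is a minimal sketch: `missing_packages` is a hypothetical helper name, and the module names are simply the third-party ones imported in the first cell of this notebook.

```python
import importlib.util


def missing_packages(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# Third-party modules imported in this notebook's first cell
required = ['requests', 'twitter', 'lxml']
print("Missing packages:", missing_packages(required))
```

An empty list means every module can be located; anything listed needs to be installed before the import cell will run.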

Running the cell below will import all of the needed packages into your script and make them ready for use.

In [1]:
import csv
from datetime import datetime
import os
import re
import requests
import twitter
import lxml.html

### Step 2: Get ready to talk to the Twitter API
If you don't already have a Twitter account, you can create one [here](https://twitter.com/i/flow/signup).

If you do have a Twitter account, head over to the Twitter API documentation. To get the keys needed to access the API, you'll need to create an application.  Navigate to the [apps page](https://apps.twitter.com) while signed into your account and click the `Create New App` button at the top right.
<img src="images/applications_page.png"/>

Once that's done, click on your app's name from the list and 

Next, you need to get four things:
- Consumer key
- Consumer secret
- Access token
- Access token secret

In [2]:
TWITTER_CONS_KEY = os.environ.get('T_CONS_')
TWITTER_CONS_SEC = os.environ.get('T_CONS_SECRET')
TWITTER_ACCESS_TOKEN = os.environ.get('T_ACCESS_')
TWITTER_ACCESS_SEC = os.environ.get('T_ACCESS_SECRET')
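Because `os.environ.get` returns `None` for a variable that isn't set, a typo in a variable name only shows up later as an authentication error. The sketch below is one way to catch that early; `missing_env_vars` and the `EXAMPLE_*` names are hypothetical and not part of this notebook, so swap in whichever variable names you actually exported.

```python
import os


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


# Hypothetical environment used for illustration only
example_env = {'EXAMPLE_KEY': 'abc123', 'EXAMPLE_SECRET': ''}
print(missing_env_vars(['EXAMPLE_KEY', 'EXAMPLE_SECRET', 'EXAMPLE_TOKEN'], example_env))
```

Calling it without the second argument checks the real environment of the running notebook.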

In [3]:
t = twitter.Api(
    consumer_key = TWITTER_CONS_KEY,
    consumer_secret = TWITTER_CONS_SEC,
    access_token_key = TWITTER_ACCESS_TOKEN, 
    access_token_secret = TWITTER_ACCESS_SEC,
    tweet_mode='extended'
)

Next, you'll need to create a variable for the account that you want to analyze.  Use the Twitter handle on the account, without the '@' (e.g. use NASA, not @NASA).  If you're starting with the URL of the account, take only the handle after the last '/' (e.g. from https://twitter.com/NASA, use 'NASA').
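If you'd rather not trim the handle by hand, the rule above can be sketched as a small function. `handle_from_input` is a hypothetical helper, not something the notebook defines, and it assumes the input is either a bare handle, a handle with '@', or a profile URL.

```python
from urllib.parse import urlparse


def handle_from_input(text):
    """Return a bare Twitter handle from '@NASA', 'NASA', or a profile URL."""
    text = text.strip()
    if '://' in text:
        # Keep only the last path segment, ignoring any trailing slash or query string
        text = urlparse(text).path.rstrip('/').split('/')[-1]
    return text.lstrip('@')


print(handle_from_input('https://twitter.com/NASA'))
print(handle_from_input('@NASA'))
```

Both calls print `NASA`, which is the value to assign to `screen_name`.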

In [4]:
screen_name = "NASA"

In [5]:
# The Twitter API returns at most 200 tweets per timeline request, so start with the newest 200
first_200 = t.GetUserTimeline(screen_name=screen_name, count=200)

In [6]:
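def get_tweets(first_200, screen_name, last_id):
    """Page backwards through the timeline, starting just below last_id."""
    all_tweets = []
    all_tweets.extend(first_200)
    # Cap the number of requests so the loop always terminates
    for _ in range(900):
        # max_id is inclusive, so subtract 1 to avoid fetching the same tweet twice
        new = t.GetUserTimeline(screen_name=screen_name, count=200, max_id=last_id - 1)
        if len(new) == 0:
            break
        all_tweets.extend(new)
        last_id = new[-1].id

    return all_tweets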

In [7]:
all_tweets = get_tweets(first_200, screen_name, first_200[-1].id)
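To see how the `max_id` paging works without spending any API calls, here is a minimal offline sketch. Everything in it is a stand-in: `FakeTweet`, `FAKE_TIMELINE`, `fake_get_user_timeline`, and `collect_all` are hypothetical names, and the fake fetch function only imitates the part of the timeline contract that matters here (newest first, at most `count` tweets, IDs no greater than `max_id`).

```python
from collections import namedtuple

FakeTweet = namedtuple('FakeTweet', ['id'])

# Hypothetical timeline of 1000 tweets, newest (highest ID) first
FAKE_TIMELINE = [FakeTweet(id=n) for n in range(1000, 0, -1)]


def fake_get_user_timeline(count=200, max_id=None):
    """Return up to `count` tweets whose ID is <= max_id, newest first."""
    eligible = [tw for tw in FAKE_TIMELINE if max_id is None or tw.id <= max_id]
    return eligible[:count]


def collect_all(fetch, count=200, max_requests=900):
    """Keep requesting older pages until a page comes back empty."""
    tweets = fetch(count=count)
    requests_made = 1
    while tweets and requests_made < max_requests:
        # Ask only for tweets strictly older than the oldest one we already hold
        page = fetch(count=count, max_id=tweets[-1].id - 1)
        requests_made += 1
        if not page:
            break
        tweets.extend(page)
    return tweets


collected = collect_all(fake_get_user_timeline)
print(len(collected), collected[0].id, collected[-1].id)
```

The important detail is the `- 1`: without it, each page would start with the last tweet of the previous page and the collection would contain duplicates.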

In [8]:
print("We collected %d tweets." % len(all_tweets))
print("The most recent tweet in our collection was sent %s and the oldest tweet was sent %s." % (
                                                                            all_tweets[0].created_at, 
                                                                            all_tweets[-1].created_at)
     )

We collected 3238 tweets.
The most recent tweet in our collection was sent Sun Jul 01 01:25:01 +0000 2018 and the oldest tweet was sent Wed Oct 11 15:31:37 +0000 2017.


In [9]:
print("The 'created_at' parameter is in the form of a %s and looks like this: %s." % (
                                        type(all_tweets[0].created_at),
                                        all_tweets[0].created_at)
     )
print("The 'source' parameter is in the form of a %s and looks like this: %s." % (
                                        type(all_tweets[0].source), 
                                        all_tweets[0].source)
     )

The 'created_at' parameter is in the form of a <class 'str'> and looks like this: Sun Jul 01 01:25:01 +0000 2018.
The 'source' parameter is in the form of a <class 'str'> and looks like this: <a href="https://www.sprinklr.com" rel="nofollow">Sprinklr</a>.
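The `source` value is an HTML anchor tag, so the client name has to be pulled out of the markup. The notebook's own `clean_source` helper does this with `lxml`; the sketch below is a standard-library alternative, shown only to illustrate the idea. `_TextCollector` and `source_text` are hypothetical names.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect the text that appears between tags, ignoring the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def source_text(source_html):
    collector = _TextCollector()
    collector.feed(source_html)
    collector.close()
    return ''.join(collector.parts)


print(source_text('<a href="https://www.sprinklr.com" rel="nofollow">Sprinklr</a>'))
```

For the value shown above, this prints `Sprinklr`.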


In [43]:
def clean_hashtags(hashtags):
    # Keep only the text of each hashtag object
    return [hashtag.text for hashtag in hashtags]

def clean_urls(urls):
    # Keep only the expanded (un-shortened) form of each URL
    return [url.expanded_url for url in urls]

def clean_source(source):
    raw = lxml.html.document_fromstring(source)
    return raw.cssselect('body')[0].text_content()


def string_to_datetime(date_str):
    return datetime.strptime(date_str, '%a %b %d %H:%M:%S %z %Y')


# print('Our cleaned text; original was "<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>)":')
# print(clean_source('<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'))
# print('***')
# print('Our date string converted to an object; original was "Sat Jun 30 11:37:03 +0000 2018":')
# print(string_to_datetime('Sat Jun 30 11:37:03 +0000 2018'))
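Once the date strings are real `datetime` objects, you can do arithmetic on them. As a small worked example, the sketch below reuses the same format string as `string_to_datetime` and the two timestamps printed earlier in this notebook to estimate how many tweets per day the collection covers. The variable names are mine, and the result depends on the 3238 tweets collected in this particular run.

```python
from datetime import datetime


def string_to_datetime(date_str):
    # Same format string as the helper defined in this notebook
    return datetime.strptime(date_str, '%a %b %d %H:%M:%S %z %Y')


newest = string_to_datetime('Sun Jul 01 01:25:01 +0000 2018')
oldest = string_to_datetime('Wed Oct 11 15:31:37 +0000 2017')

span = newest - oldest                       # a timedelta
days_covered = span.total_seconds() / 86400  # fractional days
tweets_per_day = 3238 / days_covered

print("Whole days covered:", span.days)
print("Tweets per day: %.1f" % tweets_per_day)
```

Because both strings carry a `+0000` offset, the parsed objects are timezone-aware and can be subtracted safely.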


In [40]:
def write_to_csv(tweets, filename):
    headers = ['id', 'full_text', 'hashtags', 'urls', 'created_at', 'favorite_count', 'retweet_count', 'source']

    # An explicit encoding avoids depending on the platform default when tweets contain emoji
    with open(filename + '.csv', 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile, delimiter=',')
        writer.writerow(headers)

        for item in tweets:
            writer.writerow([item.id, 
                             item.full_text, 
                             clean_hashtags(item.hashtags), 
                             clean_urls(item.urls), 
                             item.created_at, 
                             item.favorite_count, 
                             item.retweet_count, 
                             clean_source(item.source)])
    # No explicit close() is needed: the with block closes the file

In [44]:
write_to_csv(all_tweets, screen_name + '_tweets')
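One thing to keep in mind when you read the file back: the `hashtags` and `urls` columns were written as Python lists, so they come back as strings like `['NASA', 'Moon']`. The sketch below shows one way to turn them back into lists with `ast.literal_eval`. It uses an in-memory sample with invented rows rather than the real file, so the data here is purely illustrative.

```python
import ast
import csv
import io

# Hypothetical sample in the same shape write_to_csv produces (only three columns shown)
sample = io.StringIO(newline='')
writer = csv.writer(sample)
writer.writerow(['id', 'full_text', 'hashtags'])
writer.writerow([1, 'Hello, "world"', ['NASA', 'Moon']])
writer.writerow([2, 'No tags here', []])
sample.seek(0)

rows = []
for row in csv.DictReader(sample):
    # The list was stored as its string representation; parse it back safely
    row['hashtags'] = ast.literal_eval(row['hashtags'])
    rows.append(row)

print(rows[0]['hashtags'])
```

Note that every other column, including `id`, also comes back as a string, so convert numeric columns with `int()` if you need to do math on them.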

In [None]:
def create_dict(tweets):
    # Avoid naming the variable `dict`, which would shadow the built-in type
    tweets_by_id = {}
    for item in tweets:
        tweets_by_id[str(item.id)] = {
            'id': item.id,
            'full_text': item.full_text,
            # Store plain strings, matching what write_to_csv puts in the csv
            'hashtags': clean_hashtags(item.hashtags),
            'urls': clean_urls(item.urls),
            'created_at': string_to_datetime(item.created_at),
            'favorite_count': item.favorite_count,
            'retweet_count': item.retweet_count,
            'source': clean_source(item.source)
        }
    return tweets_by_id

In [None]:
tweet_dict = create_dict(all_tweets)

In [None]:
tweet_dict['1013023608040513537']
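With the tweets in a dictionary, simple summaries take only a few lines. The sketch below ranks tweets by favorite count and tallies which clients were used to post them. It runs on a tiny invented dictionary shaped like the values `create_dict` builds; `sample_dict`, `most_liked`, and `source_counts` are hypothetical names, and the numbers are made up.

```python
from collections import Counter

# Hypothetical entries for illustration; real entries carry more fields
sample_dict = {
    '101': {'id': 101, 'full_text': 'Launch day!', 'favorite_count': 50, 'source': 'Sprinklr'},
    '102': {'id': 102, 'full_text': 'Moon photo', 'favorite_count': 120, 'source': 'Twitter for iPhone'},
    '103': {'id': 103, 'full_text': 'Mission update', 'favorite_count': 80, 'source': 'Sprinklr'},
}


def most_liked(tweet_dict, n=2):
    """Return the n entries with the highest favorite_count."""
    return sorted(tweet_dict.values(),
                  key=lambda tw: tw['favorite_count'],
                  reverse=True)[:n]


def source_counts(tweet_dict):
    """Count how many tweets came from each client."""
    return Counter(tw['source'] for tw in tweet_dict.values())


top = most_liked(sample_dict)
counts = source_counts(sample_dict)
print([tw['full_text'] for tw in top])
print(counts.most_common())
```

The same two functions should work unchanged on `tweet_dict`, since they only rely on the `favorite_count` and `source` keys.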