# Clustering my twitter likes

I joined Twitter because of the network of researchers and programmers who post there. Because of that, I have also been quite selective with my 'favs', with an occasional like on some anime, cat or political tweet. As a "data scientist" (I almost choked saying that) I want to cluster my favorite tweets based on their text. I reckon it could be necessary to scrape the content that some of these tweets link to, but I am keeping it simple because, to be honest, I only faved what met the eye.

## Connecting to the twitter api

The first step is to go to the 'developers' section on twitter and create an app for personal use. This is to obtain the credentials to connect to the api - it's kind of like username/password but slightly more complicated. Instead of coding http requests from scratch, we are going to use an open source api wrapper, found at [bear/python-twitter](https://github.com/bear/python-twitter). After following the installation instructions you can import the module as follows.

In [1]:
import twitter # has to be installed with 'pip install python-twitter'
import pickle
import sys # so as to know the error

My API credentials are not shown here, as this is on the open web. Note that there is a running joke that you can find free api keys on github (for services that are paid!). Mine are read in from a `.gitignore`d file, but you can imagine that these api keys look like something your cat would type on the keyboard.
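In case it helps, here is a minimal sketch of how such a credentials file could be created once, outside the notebook. The four key names are the ones the connection call expects; the values are placeholders I made up, and the real ones come from the app page in the developers section.

```python
import pickle

# Placeholder values only: replace them with the keys of your own twitter app.
creds = {
    'consumer_key': 'YOUR_CONSUMER_KEY',
    'consumer_secret': 'YOUR_CONSUMER_SECRET',
    'access_token_key': 'YOUR_ACCESS_TOKEN_KEY',
    'access_token_secret': 'YOUR_ACCESS_TOKEN_SECRET',
}

# Write the dict once; keep 'creds.pkl' listed in .gitignore so it never gets committed.
with open('creds.pkl', 'wb') as handle:
    pickle.dump(creds, handle)
```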

In [2]:
with open('creds.pkl', 'rb') as handle:
    creds = pickle.load(handle)

# connect to the twitter api with your twitter app credentials
api = twitter.Api(consumer_key=creds['consumer_key'],
                  consumer_secret=creds['consumer_secret'],
                  access_token_key=creds['access_token_key'],
                  access_token_secret=creds['access_token_secret'])

And with this last call, we are connected to the api that wraps our http requests into functions :). I can then proceed to fetch some data. After sifting through some of the documentation of the api wrapper, I found what I wanted: the [`GetFavorites()`](https://python-twitter.readthedocs.io/en/latest/twitter.html#twitter.api.Api.GetFavorites) method that wraps the http request to the corresponding [endpoint](https://developer.twitter.com/en/docs/tweets/post-and-engage/api-reference/get-favorites-list).

In [7]:
recent_favs = api.GetFavorites(screen_name='burnie093', count = 200) # this is a list

In [25]:
recent_favs[0].AsDict() # recent_favs[*] is a twitter.models.Status with a method AsDict()
recent_favs[0].AsDict()['text'] # you can now access the dict element text

'Steps to better code\n{ author: @isaacandsuch }\nhttps://t.co/lHL37fyC8e'

## Saving the texts to a file

Part of the drill is to save data to files so it can be read back in later. Note that we only have the 200 most recent tweets so far, and we want all of the ~900 favorited tweets. As a rule of thumb, if you're getting more than 100 MB of data, you should use a database engine (SQL or NoSQL). For this text data that would be overkill, so saving it to files will suffice.

### Formats
Ok, here are possible formats to save the file:
- `json` ("JavaScript Object Notation"), this is a stringified version of the notation every js object uses and can be seen as the counterpart of the python dict. Just a mention so that you know what this means when reading other documentation.
- plain text or csv, need I say more? It's a classic and often used as well. But I want to leverage a Python feature, so I'll go with the next one.
- **`pickle`**, this is a serialization mechanism to save a **python object** such as the list of tweets we currently have. Unlike the other two formats, this reads the data back in as a python object, saving me from dealing with type conversion. [Here](https://stackoverflow.com/a/11218504)'s how to use it.
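To make the difference between the first and the last option a bit more concrete, here is a small sketch. The record is made up and only loosely shaped like a tweet dict, so treat it as an illustration, not as the real data.

```python
import json
import pickle

# A made-up record, loosely shaped like one entry of a list of tweet dicts.
favs = [{'id': 1, 'text': 'Steps to better code', 'faved': True}]

as_json = json.dumps(favs)      # a str: readable text, portable across languages
as_pickle = pickle.dumps(favs)  # bytes: Python-only, but keeps Python types as they are

print(type(as_json).__name__, type(as_pickle).__name__)

# Both round-trip this simple structure unchanged.
assert json.loads(as_json) == favs
assert pickle.loads(as_pickle) == favs

# The difference shows with types JSON does not have: a tuple comes back as a list.
print(json.loads(json.dumps((1, 2))))      # [1, 2]
print(pickle.loads(pickle.dumps((1, 2))))  # (1, 2)
```

One caveat worth knowing: unpickling runs code from the file, so only load pickle files you created yourself or otherwise trust.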

And for future reference for beginners: what a file is gets determined by its content, not by its extension. Extensions are just conventions among us computer people. If you want to give some wings to your evil genius, you could write a python program and save it with a java extension, e.g. `definitely_not_python.java`, and the command `python definitely_not_python.java` would still run it, because the interpreter parses the content as python regardless of the name.
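If you don't believe me, here is a small sketch that tries it out in a temporary folder. It assumes Python 3.7 or newer (for `capture_output`) and uses `sys.executable` to call the same interpreter that runs the script.

```python
import os
import subprocess
import sys
import tempfile

# Write a tiny Python program into a file that carries a .java extension.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'definitely_not_python.java')
    with open(path, 'w') as handle:
        handle.write('print("still python")\n')

    # The interpreter parses the file content as Python, whatever the file is called.
    result = subprocess.run([sys.executable, path], capture_output=True, text=True)

print(result.stdout.strip())
```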

In [29]:
to_save = [fav.AsDict() for fav in recent_favs] # I changed my mind
with open('data/favs_0.pkl', 'wb') as handle: # 'wb' stands for 'write binary'
    pickle.dump(to_save, handle)

To read it back in again

In [30]:
with open('data/favs_0.pkl', 'rb') as handle:
    hi = pickle.load(handle)
hi[0]['text']

'Steps to better code\n{ author: @isaacandsuch }\nhttps://t.co/lHL37fyC8e'

## Collecting all the faved tweets

There is a maximum number of tweets you can retrieve from twitter per request: 200. To collect the remaining ones, we need to know the `max_id` to pass to the function. It should be the id of the last element...

In [51]:
maxi = hi[-1]['id']
maxi # 896645982854754304

896645982854754304

The following loop stores the rest of my faved tweets, one request per file, and because the list is finite, it will stop (or throw an error) when it can't retrieve any more tweets. And to Marcel: **you don't need to touch this, the data is stored already** ^\_^ The data is in files `favs_0.pkl` through `favs_4.pkl`.

In [4]:
maxi = 896645982854754304
for f in range(1, 10): # at most nine more requests; stops early once nothing new comes back
    try:
        favs = api.GetFavorites(screen_name='burnie093', count=200, max_id=maxi)
        to_save = [fav.AsDict() for fav in favs]
        with open('favs_%d.pkl' % f, 'wb') as handle: # 'wb' stands for 'write binary'
            pickle.dump(to_save, handle)

        # prepare for the next iteration (the for loop already increments f)
        # note: max_id is inclusive, so the boundary tweet is fetched again in the next batch
        if maxi == to_save[-1]['id']:
            break
        maxi = to_save[-1]['id']
    except Exception: # e.g. an IndexError on an empty batch, or an api error
        print(sys.exc_info()[0])
        break

In [22]:
with open('favs_4.pkl', 'rb') as handle:
    some = pickle.load(handle)
len(some)

48
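For the curious, here is a sketch of the same paging idea that runs without the api. The `fake_get_favorites` function and the ids are made up by me as a stand-in for `api.GetFavorites`; the one thing taken from Twitter's documentation is that `max_id` is inclusive, which is why the sketch asks for ids strictly below the oldest one already seen to avoid duplicates at the batch boundaries.

```python
# Stand-in for the favorites timeline: tweet ids sorted newest first (made-up numbers).
ALL_IDS = list(range(1000, 0, -3))

def fake_get_favorites(count, max_id=None):
    """Hypothetical stand-in for api.GetFavorites: newest `count` ids that are <= max_id."""
    ids = [i for i in ALL_IDS if max_id is None or i <= max_id]
    return ids[:count]

def fetch_all(count=200):
    collected = []
    max_id = None
    while True:
        batch = fake_get_favorites(count, max_id)
        if not batch:
            break  # nothing left to fetch
        collected.extend(batch)
        # max_id is inclusive, so ask for ids strictly below the oldest one seen so far.
        max_id = batch[-1] - 1
    return collected

ids = fetch_all(count=200)
print(len(ids), len(set(ids)))  # both numbers match when there are no duplicates
```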

## Loading and extracting only text
By the way, here is a pointer on [nested list comprehensions](https://stackoverflow.com/a/8050243), as the following could probably have been one line.

In [2]:
texts = []
for i in range(5): # i belongs to [0, 5[
    with open('favs_%d.pkl' % i, 'rb') as handle:
        texts.extend([t['text'] for t in pickle.load(handle)])
len(texts)

844

That matches the number of faved tweets on my twitter profile, except that it is ahead of the website server by 2 faved tweets? ![](./data/faved.png)
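For what it's worth, the one-line version of the loading step might look something like the sketch below. To keep it runnable on its own it writes two tiny made-up pickle files to a temporary folder first, and the `load` helper is my own addition so that every file still gets closed properly.

```python
import os
import pickle
import tempfile

def load(path):
    """Read one pickle file and close it again."""
    with open(path, 'rb') as handle:
        return pickle.load(handle)

with tempfile.TemporaryDirectory() as tmp:
    # Two small made-up files standing in for favs_0.pkl ... favs_4.pkl.
    for i, batch in enumerate([[{'text': 'a'}, {'text': 'b'}], [{'text': 'c'}]]):
        with open(os.path.join(tmp, 'favs_%d.pkl' % i), 'wb') as handle:
            pickle.dump(batch, handle)

    # Outer loop over the files, inner loop over the tweets in each file.
    texts_demo = [t['text']
                  for i in range(2)
                  for t in load(os.path.join(tmp, 'favs_%d.pkl' % i))]

print(texts_demo)  # ['a', 'b', 'c']
```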

## Finally! Oh wait, now we clean the data

Most of this data has really weird characters... For example:

In [42]:
texts[0]

'Steps to better code\n{ author: @isaacandsuch }\nhttps://t.co/lHL37fyC8e'

And actually, I think only the first part would be interesting. We could keep the author... to see if I fav more from the same author.. And by the way, big coincidence that this is from dev.to! The first step is to remove special characters such as `\n`. This is an escape character `\` followed by a letter that names a command, `n`, which in this case corresponds to a new line. This is obviously put there by the dudes and dudettes from dev.to to programmatically compose the tweet in a nice manner; you don't see the `\n` itself in the browser or app because it is basically a rendering instruction.

About links... Since these can be found in the pickle files mapped to key `link` or similar, it does not seem interesting to keep them here for the sake of this one analysis. So we need a regex that removes everything from `https://t.co/` up to the next space or newline. And my god do I fear regex.
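The regex turns out to be less scary than feared. Here is a sketch of one way to do it, tried on two of my faved tweets; the pattern and the `strip_links` helper are my own guess at a reasonable approach, not something taken from a guide.

```python
import re

samples = [
    'Steps to better code\n{ author: @isaacandsuch }\nhttps://t.co/lHL37fyC8e',
    'I just published “Understanding your energy bill” https://t.co/Q2MutjiFjE',
]

# The link prefix, then every non-whitespace character up to the next space or newline.
LINK = re.compile(r'https://t\.co/\S+')

def strip_links(text):
    """Drop t.co links, then collapse newlines and repeated spaces into single spaces."""
    without_links = LINK.sub('', text)
    return ' '.join(without_links.split())

cleaned = [strip_links(s) for s in samples]
print(cleaned[0])  # Steps to better code { author: @isaacandsuch }
print(cleaned[1])  # I just published “Understanding your energy bill”
```

Splitting on whitespace and joining with single spaces takes care of the `\n` characters in the same pass.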

For the sake of it, I will show some faved tweets.

In [3]:
texts[:5]

['Steps to better code\n{ author: @isaacandsuch }\nhttps://t.co/lHL37fyC8e',
 'I just published “Understanding your energy bill” https://t.co/Q2MutjiFjE',
 'Yes I am making sure adequate cat photo benchmarks are in our talk and am currently comparing the performance of different cats',
 'Coding = thinking in several dimensions\n{ author: @andreasklinger }\nhttps://t.co/3WHV4x6XEZ',
 "@pewdiepie Kimi no na wa used heavily japanese ideas in the story that was pretty vital to the plot. Hollywood's go… https://t.co/OPFp64DLHg"]

## Cleaning data
The following just follows this very nice [guide from Analytics Vidhya](https://www.analyticsvidhya.com/blog/2015/06/quick-guide-text-data-cleaning-python/).

In [5]:
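# ESCAPING HTML CHARACTERS
# The guide targets Python 2, where this lived in the `HTMLParser` module. On Python 3
# `import HTMLParser` raises ModuleNotFoundError; the standard library's `html.unescape` does the job.
import html
texts = [html.unescape(t) for t in texts]
texts[:5]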

In [6]:
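# DECODING DATA
# Python 3 strings are already unicode, so there is nothing to decode. To keep only ASCII
# characters, encode with errors ignored and decode back to str.
texts = [t.encode("ascii", "ignore").decode("ascii") for t in texts]
texts[:5]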
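To see what these two cleaning steps actually do on Python 3, here is a small sketch on a made-up string (not one of my tweets). It assumes the goal is the same as in the guide: turn HTML entities back into characters, then drop anything that is not ASCII.

```python
import html

raw = 'Cats &amp; benchmarks “quoted” &lt;3'

unescaped = html.unescape(raw)  # HTML entities back to characters
ascii_only = unescaped.encode('ascii', 'ignore').decode('ascii')  # drop anything non-ASCII

print(unescaped)   # Cats & benchmarks “quoted” <3
print(ascii_only)  # Cats & benchmarks quoted <3
```

Note that the second step silently throws away curly quotes, accents and emoji, which may or may not be what you want for clustering.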