# Pre-processing tweets

Twitter is a popular microblogging service where users communicate through short, 140-character status messages called "tweets". Users share links, pictures, and thoughts or opinions about different topics; journalists comment on live events; companies promote products and engage with customers; and so on.

With about half a billion tweets per day, there’s a lot of data to analyse and to play with.
Twitter data has been used in various analysis tasks, such as **sentiment analysis**, **social network analysis**, etc.
Topics that you will learn about in this chapter include:
* Collecting tweets via **API** requests, and storing them in **JSON** files
* Extracting **emoticons**
* Tokenizing **tweets** 
* Generating **word feature vectors**

## 1. Collecting Tweets with Python 
Many web services provide APIs to developers to interact with their services and to access data in a programmatic way.
Twitter is one of them.
It has taken great care to maintain a well-documented and elegantly simple API that is intuitive and easy to use. 

There are **two types** of APIs, i.e., [the Streaming API](https://dev.twitter.com/streaming/overview) and [the REST API](https://dev.twitter.com/rest/public/search), available for Twitter developers.

According to the documentation on the Twitter developers' website,
the former only sends out real-time tweets. 
It gives developers low-latency access to Twitter’s global stream of Tweet data.

The latter searches against a sample of recent Tweets published in the past 7 days. 
It is more suitable for singular searches, such as searching historic tweets, reading user profile information, or posting Tweets.
While the Streaming API is focused on matching for **completeness**, the REST API, like the Twitter search API, is focused on **relevance**. 
In this section, you will learn the basic usage of the Twitter search API.

There are [many great libraries available](https://dev.twitter.com/overview/api/twitter-libraries) in different programming languages to further ease the work involved in making API requests. 
Using those libraries, you can easily make API requests without having to know too much
about the Twitter API details.
Here we will use [TwitterSearch](https://github.com/ckoepp/TwitterSearch), 
a Python library to easily iterate tweets found by the [Twitter search API](https://dev.twitter.com/rest/public/search),
to demonstrate how to make Twitter API requests and download the data we are interested in.
We chose the TwitterSearch library because it is **simple to use**, yet supports the Twitter search API well.
If you don't have TwitterSearch installed on your machine, go to its GitHub page and follow
the installation instructions.

To start with, you need a Twitter account and credentials (i.e., consumer key, consumer secret, access token and access token secret) obtained on the Twitter developer site to access the Twitter API.
Follow the steps below to get all four credentials:
* Create a Twitter user account if you do not already have one.
* Go to https://apps.twitter.com/ and log in with your Twitter user account. This step gives you a Twitter developer account under the same name as your user account.
* Click “Create New App”, and then fill out the form (i.e., App name, App description, and so on), agree to the terms, and click “Create your Twitter application”.
Creating an application is the standard way for a developer to gain API access.
The process of creating an application is simple, and all that's needed is read-only access to the API.
* On the next page, click the “Keys and Access Tokens” tab at the top, 
    and copy your “consumer key” and “consumer secret”. 
* Scroll down if necessary and click “Create my access token”, and copy your “Access token” and “Access token secret”.
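
Once you have the four credentials, it is safer to keep them out of your source code. The sketch below shows one possible approach: it reads them from environment variables. The variable names (`TWITTER_CONSUMER_KEY` and so on) and the helper `load_twitter_credentials` are assumptions made for this illustration; they are not part of the TwitterSearch library or of the Twitter API.

```python
import os

# Hypothetical environment variable names chosen for this illustration
CREDENTIAL_VARS = {
    'consumer_key': 'TWITTER_CONSUMER_KEY',
    'consumer_secret': 'TWITTER_CONSUMER_SECRET',
    'access_token': 'TWITTER_ACCESS_TOKEN',
    'access_token_secret': 'TWITTER_ACCESS_TOKEN_SECRET',
}

def load_twitter_credentials(environ=None):
    """Return the four credentials as a dict, or raise KeyError if any is missing."""
    environ = os.environ if environ is None else environ
    missing = [var for var in CREDENTIAL_VARS.values() if not environ.get(var)]
    if missing:
        raise KeyError('missing environment variables: ' + ', '.join(missing))
    return {name: environ[var] for name, var in CREDENTIAL_VARS.items()}
```

The keys of the returned dictionary match the four credential names listed above, so the dictionary can be unpacked into keyword arguments wherever those credentials are expected.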
 
Below is example code that searches Twitter for tweets containing the keyword 'metrotrains'
and saves all the retrieved tweets in a list.
To run the code, substitute the credentials that you obtained above when creating the TwitterSearch object.
<font color="red">Note: if you don't have Twitter API access and you don't want to apply for it, you can skip the following scripts and go to section 2.</font>

In [2]:
from TwitterSearch import *

In [3]:
tweets = []

try:
    # create a TwitterSearchOrder object
    tso = TwitterSearchOrder()
    # define the keywords we would like to search for
    tso.set_keywords(['metrotrains'])
    # restrict the results to tweets written in English
    tso.set_language('en')
    # do not request the entity information (URLs, hashtags, mentions)
    tso.set_include_entities(False)

    # create a TwitterSearch object with your own credentials
    # (the strings below are placeholders, not working keys)
    ts = TwitterSearch(
        consumer_key = 'YOUR_CONSUMER_KEY',
        consumer_secret = 'YOUR_CONSUMER_SECRET',
        access_token = 'YOUR_ACCESS_TOKEN',
        access_token_secret = 'YOUR_ACCESS_TOKEN_SECRET'
    )

    for tweet in ts.search_tweets_iterable(tso):
        tweets.append(tweet)

# report any error raised by the TwitterSearch library
except TwitterSearchException as e:
    print(e)

If you run the program above, you will retrieve a few thousand tweets that contain 'metrotrains' (2,548 in the run shown below). 
Note that the number of tweets retrieved might vary from time to time, 
as the whole Twitter database is dynamic due to thousands of tweets being posted every second.
Each tweet is stored in a large Python dictionary.
The output further below shows what such a dictionary looks like for
the first tweet returned by the code above.

In [4]:
len(tweets)

2548

In [9]:
for k, v in tweets[0].items():
    if v is not None:
        print(k, ':', v, '\n')
# try the following print out, what do you find?
# print(tweets[0])

created_at : Sat Jun 02 00:47:11 +0000 2018 

id : 1002713203929804800 

id_str : 1002713203929804800 

text : RT @metrotrains: Frankston line: Buses replace trains FlindersSt - Moorabbin until 4:00pm Sat 2 June due to @levelcrossings works at Caulfi… 

truncated : False 

metadata : {'iso_language_code': 'en', 'result_type': 'recent'} 

source : <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a> 

user : {'id': 924085684712833024, 'id_str': '924085684712833024', 'name': 'Dank Trains Of Melbourne', 'screen_name': 'Thetramspotter', 'location': 'Melbourne, Victoria', 'description': '15 y.o. bisexual boy🏳️\u200d🌈, likes Trains and trams and makes cancerous memes', 'url': None, 'entities': {'description': {'urls': []}}, 'protected': False, 'followers_count': 93, 'friends_count': 599, 'listed_count': 0, 'created_at': 'Sat Oct 28 01:29:30 +0000 2017', 'favourites_count': 6440, 'utc_offset': None, 'time_zone': None, 'geo_enabled': False, 'verified': False, 'statuses_count': 15

From the above example, you can see that the Twitter search API returns not only the content of tweets but also rich meta information, e.g., user information, when the tweet was posted, the number of retweets and so on.
All the data is stored in **JSON format**.

It is clear that each tweet object contains far more data than the 140 characters of text that is normally thought of as a tweet! 
Click [here](http://www.slaw.ca/wp-content/uploads/2011/11/map-of-a-tweet-copy.pdf) to view the map made by Raffi Krikorian, which explains a tweet in JSON format.
This map is a good visualization of the tweet JSON format even though it is a bit out-of-date. 
You can find an up-to-date description of the tweet format [here](https://dev.twitter.com/overview/api/tweets).
We will discuss the tweet structure in a bit more detail later.
The following code uses the Python library **json** (or **simplejson** as a fallback) to dump all the tweets into a JSON file.

In [10]:
# Import the necessary package to process data in JSON format
try:
    import json
except ImportError:
    import simplejson as json

In [11]:
# dump all the tweets into a JSON file, one tweet per line
with open('tweetSamples_1.json', 'w') as jsonFile:
    for tweet in tweets:
        # The TwitterSearch library wraps the data returned by Twitter as a dictionary object.
        # We first convert it back to the JSON format, and then save all tweets in a JSON file
        jsonFile.write(json.dumps(tweet) + "\n")
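
To see why writing one JSON document per line is convenient, here is a minimal round-trip sketch. The two records are made up for this illustration and only mimic the shape of real tweets; the file is written to a temporary directory so nothing in your working folder is touched.

```python
import json
import os
import tempfile

# Two made-up records standing in for tweets returned by the API
sample_tweets = [
    {'id': 1, 'text': 'Train on time today \U0001f44d', 'user': {'screen_name': 'alice'}},
    {'id': 2, 'text': 'Delayed again', 'user': {'screen_name': 'bob'}},
]

# Write one JSON document per line
path = os.path.join(tempfile.mkdtemp(), 'sample_tweets.json')
with open(path, 'w', encoding='utf-8') as out:
    for tweet in sample_tweets:
        out.write(json.dumps(tweet) + '\n')

# Read the file back, parsing each line separately
with open(path, 'r', encoding='utf-8') as src:
    loaded = [json.loads(line) for line in src]

print(loaded == sample_tweets)
```

Because every line is a complete JSON document, the file can be processed line by line without loading everything into memory first.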

Note that Twitter imposes rate limits on how many requests an application can make to any given API resource within a given time window. 
Twitter's rate limits are well documented (see [here](https://dev.twitter.com/rest/public/rate-limits)).
For the purpose of following along in this chapter, it is unlikely that you will hit the rate limits.
We have introduced the use of the Twitter search API. As mentioned earlier in this section, there are other APIs that you can use, for example the Twitter Streaming API. 
If you would like to learn more about the Twitter APIs, you can go to 
[Twitter's website for developers](https://dev.twitter.com/overview/documentation), 
or read some online tutorials such as [3].
- - -

## 2. Reading and Processing Tweets in JSON Format
Here we demonstrate how to read and process tweets in a bit more detail.
We will use the `json` library to parse the dumped tweets (please refer to chapter 2 of Module 2 for a detailed discussion of JSON), and then show how to pre-process the text content, which includes handling emoticons, tokenizing tweets, and generating feature words.

### 2.1. Loading Tweets from a Dump File
Let's first load the tweets from a dump file:

In [1]:
# Import the necessary package to process data in JSON format
try:
    import json
except ImportError:
    import simplejson as json

In [12]:
tweets = []
# each line of the dump file holds one tweet in JSON format
with open("tweetSamples.json", "r", encoding="utf-8") as f:
    for line in f:
        tweet = json.loads(line)
        tweets.append(tweet)


In the previous section, we have seen that each tweet object contains not only the text content but also 
other related information. 
Let's have a look at the structure of a tweet.
Recall that each tweet is stored in a Python dictionary.
Thus, to view all the attributes (or fields), simply type:

In [13]:
tweets[0].keys()

dict_keys(['contributors', 'truncated', 'text', 'is_quote_status', 'in_reply_to_status_id', 'id', 'favorite_count', 'retweeted', 'coordinates', 'source', 'in_reply_to_screen_name', 'in_reply_to_user_id', 'retweet_count', 'id_str', 'favorited', 'user', 'geo', 'in_reply_to_user_id_str', 'lang', 'created_at', 'in_reply_to_status_id_str', 'place', 'metadata'])

The key fields are the following:
* **text**: the text content of the tweet itself
* **created_at**: when the tweet was posted
* **favorite_count, retweet_count**: the number of favourites and retweets
* **favorited, retweeted**: Boolean stating whether the authenticated user has favourited or retweeted this tweet
* **lang**: the language used for the tweet (e.g. “en” for English)
* **id**: the unique tweet identifier
* **id_str**: The string representation of the tweet identifier
* **place, coordinates, geo**: geo-location information if available
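* **user**: the full profile of the user, which contains the user's id, name, screen_name and so on
* **entities**: list of entities like URLs, @-mentions, hashtags and symbols (returned only when the search request asks for entities, so it is absent from the keys shown above)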

As you can see there’s a lot of information we can use in analysing tweets.
You can imagine how the data stored in those fields already allows for some interesting analysis.
For example, you can check who is most favourited or retweeted, what the most popular hashtags are, etc. 
Assume we are going to extract the data stored in the following fields
* id: `tweet['id']`
* created_at: `tweet['created_at']`
* text: `tweet['text']`
* user's id: `tweet['user']['id']`
* user's name: `tweet['user']['name']`
* user's screen_name: `tweet['user']['screen_name']`

and store them in a Pandas DataFrame.

In [16]:
tweets[0]['user']['screen_name']

'luke_sabatini'

In [17]:
import pandas as pd
tweets_pddf = pd.DataFrame()
tweets_pddf['id'] = list(map(lambda tweet: tweet['id'], tweets))
tweets_pddf['user_id'] = list(map(lambda tweet: tweet['user']['id'], tweets))
tweets_pddf['user_name'] = list(map(lambda tweet: tweet['user']['name'], tweets))
tweets_pddf['user_sname'] = list(map(lambda tweet: tweet['user']['screen_name'], tweets))
tweets_pddf['created_at'] = list(map(lambda tweet: tweet['created_at'], tweets))
tweets_pddf['text'] = list(map(lambda tweet: tweet['text'], tweets))
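
Direct indexing such as `tweet['user']['screen_name']` raises a `KeyError` when a field is absent. The sketch below shows one defensive way to extract the same six fields. The helper name `get_nested` and the sample record are hypothetical and serve only as an illustration.

```python
def get_nested(record, *keys, default=None):
    """Follow a chain of dictionary keys, returning default if any is missing."""
    current = record
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current

# A made-up, deliberately incomplete tweet used only for illustration
tweet = {'id': 42, 'text': 'hello', 'user': {'id': 7, 'name': 'Example User'}}

row = {
    'id': get_nested(tweet, 'id'),
    'user_id': get_nested(tweet, 'user', 'id'),
    'user_name': get_nested(tweet, 'user', 'name'),
    'user_sname': get_nested(tweet, 'user', 'screen_name'),  # missing in this record
    'created_at': get_nested(tweet, 'created_at'),           # missing in this record
    'text': get_nested(tweet, 'text'),
}
print(row)
```

Missing fields simply become `None`, and a list of such row dictionaries can be handed to `pd.DataFrame` to build the same table.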

In [18]:
tweets_pddf.head()

| | id | user_id | user_name | user_sname | created_at | text |
|---|---|---|---|---|---|---|
| 0 | 710240637497380864 | 534729059 | Luke Sabatini | luke_sabatini | Wed Mar 16 23:05:38 +0000 2016 | .@metrotrains, another day another diverted la... |
| 1 | 710240382311731200 | 215322466 | Sina Marandian | myCroon | Wed Mar 16 23:04:37 +0000 2016 | @metrotrains 12 mins delay YET AGAIN on Flinde... |
| 2 | 710239778554257408 | 947522706 | Brett Keleher | thebrickcleaner | Wed Mar 16 23:02:13 +0000 2016 | @danielbowen @VLine @jimbob_prod @metrotrains ... |
| 3 | 710238746688364545 | 215322466 | Sina Marandian | myCroon | Wed Mar 16 22:58:07 +0000 2016 | @metrotrains not sure if guys who run Melbourn... |
| 4 | 710238698428706816 | 94544311 | Ant | AntB77 | Wed Mar 16 22:57:55 +0000 2016 | @metrotrains and we're away 10 min late |


Most of the material we’re looking for, i.e. the content of a tweet, is embedded in the text, and that’s where we’re starting our analysis.
Twitter only allows 140 characters of textual content for each tweet, which can roughly correspond to thoughts or ideas of Twitter users.
The 140 characters may include one or more entities and reference one or more places that map to locations in the real world. 
To make it a bit more concrete, let's have a look at the textual content of the first tweet:

In [19]:
print (tweets_pddf['text'][0])

.@metrotrains, another day another diverted late city loop train from Caulfield straight to Flinders! @9NewsMelb! #cheers


The tweet is 121 characters long and contains three tweet entities: the user mentions '@metrotrains' and '@9NewsMelb', and the hashtag '#cheers'. Besides user mentions and hashtags, some tweets also contain emoticons, for instance:

In [20]:
print (tweets_pddf['text'][2162])

Less than 10kms from the city and it has taken @metrotrains an hour and 20 minutes to get me there. Hats off all 🎩👏🏻 keep up the good work
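
The user mentions and the hashtag in the first tweet shown earlier can be pulled out with two short regular expressions. This is only a simplified sketch: the patterns assume that a mention or hashtag consists of letters, digits and underscores, which is an approximation rather than Twitter's exact rule.

```python
import re

text = ('.@metrotrains, another day another diverted late city loop train '
        'from Caulfield straight to Flinders! @9NewsMelb! #cheers')

# Simplified patterns: '@' or '#' followed by letters, digits or underscores
mentions = re.findall(r'@\w+', text)
hashtags = re.findall(r'#\w+', text)

print(mentions)   # ['@metrotrains', '@9NewsMelb']
print(hashtags)   # ['#cheers']
```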


Unlike formally written English, e.g., newswire-like text, 
tweets often do not conform to the rules of spelling, grammar, and punctuation.
They often contain acronyms, typos, emoticons and other characters that express special meanings. 
Therefore, pre-processing tweets needs special treatment, compared with pre-processing formally written English text.
In the rest of this section, you will learn how to manipulate tweets into a form that can be digested by the text analysis algorithms used in tasks such as sentiment analysis.

### 2.2. Looking for Emoticons 
[Emoticons](https://en.wikipedia.org/wiki/Emoticon), such as 😂, 😊, 😡, 😀, etc., are frequently used in tweets and other kinds of online social interactions. They are designed to add emotional flavor to plain text messages, especially in short messages like tweets.
Because they are often direct signals of sentiment, emoticons in text have been widely used as features in 
sentiment analysis or as entries of sentiment lexicons.
Given the large number of tweets posted every day, it would be interesting for businesses and researchers 
to understand the prevalence of emoticons on Twitter, how users express and perceive sentiment
through emoticons, and whether emoticons can be used as a reliable cue for identifying sentiment polarity.
In many of the existing sentiment analysis algorithms, emoticons play an important
role both in building sentiment lexicons and in training classifiers.
Discussing sentiment analysis itself, however, goes beyond our scope.
Instead we will focus on identifying emoticons while pre-processing tweets.

Let's start with looking for a tweet that contains "Hats off all 🎩👏🏻":

In [21]:
tweets_text = list(map(lambda tweet: tweet['text'], tweets))

In [22]:
t =''
for text in tweets_text:
    if "Hats off all" in text:
        t = text
print (t)

Less than 10kms from the city and it has taken @metrotrains an hour and 20 minutes to get me there. Hats off all 🎩👏🏻 keep up the good work


What do the emoticons really look like? 

In [35]:
'👏'.encode('unicode_escape')

b'\\U0001f44f'

In [37]:
'👏'.encode('unicode_escape').decode("utf-8")

'\\U0001f44f'

In [10]:
t.encode('unicode_escape').decode("utf-8")

'Less than 10kms from the city and it has taken @metrotrains an hour and 20 minutes to get me there. Hats off all \\U0001f3a9\\U0001f44f\\U0001f3fb keep up the good work'

In [11]:
print(t.encode("raw_unicode_escape").decode("utf-8"))

Less than 10kms from the city and it has taken @metrotrains an hour and 20 minutes to get me there. Hats off all \U0001f3a9\U0001f44f\U0001f3fb keep up the good work


Emoticons are conventionally represented by punctuation marks, numbers and letters, such as :-), :D, :(, etc.
See the Wikipedia entry on "[List of emoticons](https://en.wikipedia.org/wiki/List_of_emoticons)".
Pictographic emoticons have been part of Unicode since 2010.
As you can see above, all the emoticons in the tweet are represented by Unicode strings, such as '\U0001f44f' corresponding to 👏.
[The standard emoticons](http://www.unicode.org/charts/PDF/U1F600.pdf) cover the Unicode range from 1F600 to 1F64F,
i.e., \U0001F600 to \U0001F64F. For example,

In [23]:
print(u'\U0001f600')
print(u'\U0001f601')
print(u'\U0001f60A')
print(u'\U0001f640')
print(u'\U0001f64f')
print(u'\U0001f3a9')

😀
😁
😊
🙀
🙏
🎩


In [24]:
print(u'\U0001f3a9')
print(u'\U0001f44f')
print(u'\U0001f3fb')

🎩
👏
🏻
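
The escape sequences printed above are simply code points written in hexadecimal. The following sketch, which uses only the standard library, prints the code point and the official Unicode name of a few characters and checks whether each one falls inside the 1F600 to 1F64F range mentioned earlier. The helper name `in_emoticons_block` is made up for this illustration.

```python
import unicodedata

def in_emoticons_block(char):
    """True if char lies in the Unicode Emoticons block, U+1F600 to U+1F64F."""
    return 0x1F600 <= ord(char) <= 0x1F64F

for char in ['\U0001f600', '\U0001f64f', '\U0001f3a9', '\U0001f44f', '\U0001f3fb']:
    print('U+%04X' % ord(char), unicodedata.name(char, 'unknown'), in_emoticons_block(char))
```

The top hat, the clapping hands and the skin-tone modifier from the tweet above all lie outside that block, which is one reason why matching a single code point range is not enough to find every emoticon.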


A set of emoticons that can be used by Twitter users can be found [here](http://www.secret-emoticons.com/twitter-emoticons). It is far richer than the standard set. 
We now know how emoticons are represented in tweets.
Next, we are going to **extract emoticons** and save them in one column of our DataFrame `tweets_pddf`.

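Instead of matching the Unicode strings ourselves, we will use `emoji`, a Python library that **supports the entire set of Emoji codes** as defined by [the Unicode consortium](http://www.unicode.org/emoji/charts/full-emoji-list.html).
This library can run with Python 2.7.
The examples in this chapter call `emoji.get_emoji_regexp()`, which was deprecated in version 1.7 of the library and removed in version 2.0, so they need a release older than 2.0.

To install such a release, type the following `pip` command into your command window:
```
    pip install "emoji<2.0"
```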
Try to print some emoticons from its [Cheat Sheet](http://www.emoji-cheat-sheet.com/):

In [31]:
import emoji 

print(emoji.emojize('smile :smile:', use_aliases=True))
print(emoji.emojize('heart_eyes :heart_eyes:', use_aliases=True))
print(emoji.emojize('smiling_imp :smiling_imp:', use_aliases=True))
print(emoji.emojize('book :book:', use_aliases=True))

smile 😄
heart_eyes 😍
smiling_imp 😈
book 📖


To extract all the emoticons in a tweet, we are going to use the `emoji.get_emoji_regexp()` function,
which returns a compiled regular expression that matches all the emoticons defined in `emoji`,
and then pass this regular expression to the `re.findall` function as follows:

In [32]:
import re

emoticon_regexp = emoji.get_emoji_regexp() # get the regular expressions for all emoticons
ems = re.findall(emoticon_regexp, t) # find all emoticons
for e in ems:
    print (e, ":", e.encode('unicode_escape').decode("utf-8"))

🎩 : \U0001f3a9
👏🏻 : \U0001f44f\U0001f3fb


Now wrap the two lines that build the pattern and find the matches in a reusable Python function.

In [38]:
def findEmoticons(text):
    emoticon_regexp = emoji.get_emoji_regexp() # RegExp pattern
    emoticons = re.findall(emoticon_regexp, text) # re.findall(ptn, text)
    return emoticons

This function can be applied to each tweet to check whether it contains any emoticons.
Let's find all tweets that contain one or more emoticons:

In [39]:
count = 0
for tweet in tweets_text:
    emoticons = findEmoticons(tweet)
    if len(emoticons) > 0:
        print (tweet)
        print (', '.join(emoticons))
        count = count + 1
print ("\n#tweets containing emoticon: ", count)

@metrotrains every night at like 1am?! 😂 scares me to death. Theres a brand new apartment block close by. Could they stop sounding the horn?
😂
@nudge87 @metrotrains it sounds like it 🤓
🤓
@nudge87 @metrotrains tell them what you really think of them, don't hold back 😉😉
😉, 😉
@teganvictoria @metrotrains oh perfect 😒😒
😒, 😒
@sassypastry @metrotrains day I left work early specifically to make it to an appointment on time 😤
😤
@metrotrains broken glass on the floor of the 9.58 south morang leaving from Parliament. Carriage no. is 124m 😊 https://t.co/4Peh3isFC6
😊
@metrotrains this is the train at noble park station at at 9:07 or so. Just letting you know the lights are out 😊 https://t.co/krTahmhE1l
😊
@metrotrains I think there's a better chance of hell freezing over than you running a service on time.... 👎🏻💩
👎🏻, 💩
Really @metrotrains? Really?🖕🏼
🖕🏼
@metrotrains thanks to the driver on the 7.31 Sandy to city service - great hump day commentary, especially liked the dad joke 👍
👍
Melbourne CBD 😆.. 

The total number of tweets containing at least one emoticon defined in `emoji` is 67. 
Compared with the total number of tweets that we have, only a small fraction of the tweets contain emoticons.
We might also need to consider the emoticons that are represented by punctuation marks, numbers and letters, such as :-), :D, :(, etc. We will leave it as an exercise for you to extract all those conventional emoticons.

**Counting observable things** is the starting point for any kind of statistical analysis or manipulation that strives to find what may be a faint signal in raw data [1]. Now that we have extracted all the emoticons in all the tweets loaded from the dump file, let's take a closer look at **the frequency distribution** of those emoticons and print out **the most frequent emoticons**. 

In the previous chapter, you learnt how to use the `FreqDist` class in NLTK. 
Here we are going to use the `collections` module in Python, which provides a `Counter` class that
can compute the frequency distribution of a given collection of items. 
Indeed, the `FreqDist` class is implemented with the `Counter` class.
The code below demonstrates how to use a `Counter` object to compute the frequency distribution as a ranked list of emoticons.  

Counting frequency distributions is **one of the simplest techniques** used in analysing Twitter data.

In [42]:
import collections
em_list = []
for tweet in tweets_text:
     em_list += findEmoticons(tweet)
print(em_list)
em_counter = collections.Counter(em_list)
em_counter.most_common(20)

['😂', '🤓', '😉', '😉', '😒', '😒', '😤', '😊', '😊', '👎🏻', '💩', '🖕🏼', '👍', '😆', '😋', '😡', '😡', '😡', '👍', '😩', '😊', '😁', '👌🏽', '🚉', '🤔', '😡', '🆓', '😑', '🐕', '💩', '💩', '💩', '💩', '⚡', '⚡', '⚡', '⚡', '⚡', '⚡', '😡', '😡', '😎', '😊', '💪🏼', '🚂', '🚂', '😀', '😀', '☺', '😂', '😂', '😂', '😡', '😡', '😜', '😊', '😡', '😅', '😡', '😡', '👎', '👍', '😆', '👊🏻', '😊', '👎🏻', '👏', '😡', '😖', '🔥', '🚉', '💤', '👎🏻', '😡', '👍', '😊', '😎', '💦', '💦', '😡', '😡', '🎩', '👏🏻', '🚋', '👭', '👬', '👫', '🚋', '👫', '👬', '👭', '🚋', '😳', '😡', '☕', '👍', '👍', '👍', '👍', '👍', '😬']


[('😡', 16),
 ('👍', 9),
 ('😊', 7),
 ('⚡', 6),
 ('💩', 5),
 ('😂', 4),
 ('👎🏻', 3),
 ('🚋', 3),
 ('😉', 2),
 ('😒', 2),
 ('😆', 2),
 ('🚉', 2),
 ('😎', 2),
 ('🚂', 2),
 ('😀', 2),
 ('💦', 2),
 ('👭', 2),
 ('👬', 2),
 ('👫', 2),
 ('🤓', 1)]

As shown above, the frequency distribution is represented by a list of key/value pairs corresponding to emoticons in Unicode and their frequencies. 
Following [1], let's make the distribution a little easier to eyeball by tabulating those key/value pairs.
To produce a tabular format, you can install a Python package called [`prettytable`](https://pypi.python.org/pypi/PrettyTable) by typing
```
    pip install prettytable
```
in your command window. 
It is a simple Python library designed to make it **quick and easy to represent tabular data** in visually appealing ASCII tables. 
The following code shows how to display the same result in a nicely formatted text-based table that 
is easy for humans to skim.

In [49]:
em_counter.most_common(5)

[('😡', 16), ('👍', 9), ('😊', 7), ('⚡', 6), ('💩', 5)]

In [50]:
from prettytable import PrettyTable
pt = PrettyTable(field_names=['Emoticons','Count'])
# add one table row per (emoticon, count) pair
for kv in em_counter.most_common(5):
    pt.add_row(kv)
#pt.align['Emoticons'], pt.align['Count'] = 'l', 'r'
print(pt)

+-----------+-------+
| Emoticons | Count |
+-----------+-------+
|     😡     |   16  |
|     👍     |   9   |
|     😊     |   7   |
|     ⚡     |   6   |
|     💩     |   5   |
+-----------+-------+


A quick skim of the result reveals that the most frequent emoticon is 😡, an angry face. 
Among those tweets containing at least one emoticon, there are more than 25 that express negative sentiment towards metrotrains, where we assume that 💩 indicates negative sentiment.
It is quite common in sentiment analysis to use emoticons as clues in determining the sentiment polarity of tweets. 
However, some research work [4] on the relationship between emoticons and sentiment polarity shows that 
only a few emoticons are strong and reliable signals of sentiment polarity, while a large group of emoticons
conveys complicated sentiment and should be treated carefully; 
see Figure 1 of [4], a survey of emotion expressed by emoticons.
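
To make this kind of reasoning explicit, the sketch below computes the share of negative emoticons among the five most frequent ones, using the counts from the table above. Which emoticons count as negative is an assumption of this sketch, not a result from [4].

```python
from collections import Counter

# The five most frequent emoticons and their counts, copied from the table above
em_counter = Counter({'😡': 16, '👍': 9, '😊': 7, '⚡': 6, '💩': 5})

# Assumption for this sketch: these two emoticons signal negative sentiment
negative = {'😡', '💩'}

negative_count = sum(count for em, count in em_counter.items() if em in negative)
total_count = sum(em_counter.values())
print(negative_count, total_count, round(negative_count / total_count, 2))
# 21 43 0.49
```

Note that this counts emoticon occurrences rather than tweets, so a tweet that repeats the same emoticon several times contributes more than once.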

Our last step is to save those emoticons in one column in the DataFrame, `tweets_pddf`, so that the downstream text analyser can directly make use of them. 

In [53]:
tweets_pddf.head()

| | id | user_id | user_name | user_sname | created_at | text | emoticons |
|---|---|---|---|---|---|---|---|
| 0 | 710240637497380864 | 534729059 | Luke Sabatini | luke_sabatini | Wed Mar 16 23:05:38 +0000 2016 | .@metrotrains, another day another diverted la... | |
| 1 | 710240382311731200 | 215322466 | Sina Marandian | myCroon | Wed Mar 16 23:04:37 +0000 2016 | @metrotrains 12 mins delay YET AGAIN on Flinde... | |
| 2 | 710239778554257408 | 947522706 | Brett Keleher | thebrickcleaner | Wed Mar 16 23:02:13 +0000 2016 | @danielbowen @VLine @jimbob_prod @metrotrains ... | |
| 3 | 710238746688364545 | 215322466 | Sina Marandian | myCroon | Wed Mar 16 22:58:07 +0000 2016 | @metrotrains not sure if guys who run Melbourn... | |
| 4 | 710238698428706816 | 94544311 | Ant | AntB77 | Wed Mar 16 22:57:55 +0000 2016 | @metrotrains and we're away 10 min late | |


In [55]:
import numpy as np
emoticon_list = []
for tweet in tweets_text:
    em = findEmoticons(tweet)
    # mark tweets without emoticons as missing values
    if len(em) == 0:
        em = np.nan
    emoticon_list.append(em)
tweets_pddf['emoticons'] = emoticon_list
# view the tweet records in the DataFrame that contain at least one emoticon
tweets_pddf.dropna(subset=['emoticons'])

| | id | user_id | user_name | user_sname | created_at | text | emoticons |
|---|---|---|---|---|---|---|---|
| 114 | 710033743679922176 | 2954409301 | Lee P | leelufc | Wed Mar 16 09:23:30 +0000 2016 | @metrotrains every night at like 1am?! 😂 scare... | [😂] |
| 147 | 709996461098967040 | 618374806 | John Murray | SidebySide | Wed Mar 16 06:55:22 +0000 2016 | @nudge87 @metrotrains it sounds like it 🤓 | [🤓] |
| 149 | 709995610074644480 | 618374806 | John Murray | SidebySide | Wed Mar 16 06:51:59 +0000 2016 | @nudge87 @metrotrains tell them what you reall... | [😉, 😉] |
| 167 | 709976740320382977 | 128062445 | \~\*\~\~luxury witch\~\~\*\~ | sassypastry | Wed Mar 16 05:37:00 +0000 2016 | @teganvictoria @metrotrains oh perfect 😒😒 | [😒, 😒] |
| 168 | 709976646053466113 | 29664096 | Tegan Victoria | teganvictoria | Wed Mar 16 05:36:37 +0000 2016 | @sassypastry @metrotrains day I left work earl... | [😤] |
| 199 | 709884186081951744 | 27848004 | Jessica Falzon | jesssicaef | Tue Mar 15 23:29:13 +0000 2016 | @metrotrains broken glass on the floor of the ... | [😊] |
| 217 | 709868580339187712 | 3225947413 | Aaron Quick | Azzmang | Tue Mar 15 22:27:12 +0000 2016 | @metrotrains this is the train at noble park s... | [😊] |
| 220 | 709867747593703424 | 33170156 | jason mcmahon | jasemcmahon | Tue Mar 15 22:23:54 +0000 2016 | @metrotrains I think there's a better chance o... | [👎🏻, 💩] |
| 234 | 709862278145806337 | 2208982531 | Haze Soboh | HazeSoboh | Tue Mar 15 22:02:10 +0000 2016 | Really @metrotrains? Really?🖕🏼 | [🖕🏼] |
| 279 | 709846711678599168 | 719612448 | p13 | pietsch13 | Tue Mar 15 21:00:19 +0000 2016 | @metrotrains thanks to the driver on the 7.31 ... | [👍] |


### 2.3. Tokenizing Tweet Text
We have discussed the basic steps in pre-processing text in chapter 1. 
Can those steps be directly applied to Tweet tokenization?
Let's see some examples, using the popular NLTK library to tokenise the following tweet:
```
u'@HawthornFC Howz this?? \U0001f4aa\U0001f3fc #PTV #metrotrains #hawthornalways https://t.co/LECNCiTcN5' 
```

In [58]:
import nltk
tweet = u'@HawthornFC Howz this?? \U0001f4aa\U0001f3fc #PTV #metrotrains #hawthornalways https://t.co/LECNCiTcN5'
print(nltk.tokenize.word_tokenize(tweet))

['@', 'HawthornFC', 'Howz', 'this', '?', '?', '💪🏼', '#', 'PTV', '#', 'metrotrains', '#', 'hawthornalways', 'https', ':', '//t.co/LECNCiTcN5']


You will notice some peculiarities that are not captured by the NLTK built-in English tokenizer.  
For instance, @usernames, hashtags and URLs **are not recognised as single tokens**.
"@HawthornFC" was split into two parts, i.e., "@" and "HawthornFC", each hashtag was split into "#" and the tag name, and the URL was split into three parts. The Unicode strings of the emoticon and its skin-tone modifier, on the other hand, are kept together as one token.

Therefore, general-purpose English tokenizers are not directly applicable to tweets.
Fortunately, NLTK provides another built-in tokenizer, called `TweetTokenizer`.
Let's try it out:

In [59]:
from nltk.tokenize import TweetTokenizer
tt = TweetTokenizer()
tt.tokenize(tweet) 

['@HawthornFC',
 'Howz',
 'this',
 '?',
 '?',
 '💪',
 '🏼',
 '#PTV',
 '#metrotrains',
 '#hawthornalways',
 'https://t.co/LECNCiTcN5']

`TweetTokenizer(preserve_case=True, reduce_len=False, strip_handles=False)`

**Three arguments**:
* **preserve_case**: `True` by default. If `False`, the tokenizer will downcase everything except for emoticons.
* **reduce_len**: `False` by default. If it is set to `True`, the tokenizer will **shorten repeated character sequences** of length >= 3 to exactly three characters. For example, "waaaaayyyy" is replaced with "waaayyy", and "cooooool" with "coool".
* **strip_handles**: `False` by default. All the `@usernames` will be **removed** if it is set to `True`.
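
The effect of `reduce_len` can be imitated with a single regular expression. The sketch below is a standard-library illustration of the behaviour described above, not NLTK's own code; the function name `reduce_lengthening` is chosen for this example.

```python
import re

def reduce_lengthening(text):
    """Shorten any run of three or more identical characters to exactly three."""
    return re.sub(r'(.)\1{2,}', r'\1\1\1', text)

print(reduce_lengthening('waaaaayyyy'))  # waaayyy
print(reduce_lengthening('cooooool'))    # coool
```

Normalising such elongated words reduces the number of distinct spellings that a downstream analysis has to deal with.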

Is `TweetTokenizer` good enough?  
It works much better than the general-purpose English tokenizer. It tokenizes @usernames, hashtags and URLs **as single tokens**. However, Unicode strings for emoticons are still a problem for `TweetTokenizer`: in the output above, the emoticon 💪 is separated from its skin-tone modifier.
If we would like to preserve @usernames, emoticons, URLs and hashtags as individual tokens,
we can try the code discussed in Part 2 of [2] with some modification.

> flag `(?x)`:
  This flag allows you to write regular expressions that **look nicer and are more readable** by allowing you to **visually separate logical sections of the pattern** and **add comments**. Whitespace within the pattern is ignored. When a line contains a # that is not in a character class and is not preceded by an unescaped backslash, **all characters from the leftmost such # through the end of the line are ignored**.
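
To illustrate the flag, the sketch below writes the same small pattern twice, once in compact form and once in verbose form with comments. The pattern itself is a simplified example made up for this illustration.

```python
import re

compact = re.compile(r"@\w+|\#\w+")

verbose = re.compile(r"""(?x)
      @\w+     # a user mention such as @HawthornFC
    | \#\w+    # a hashtag; the leading hash sign is escaped so it is matched literally
""")

text = '@HawthornFC Howz this?? #PTV'
print(compact.findall(text))  # ['@HawthornFC', '#PTV']
print(verbose.findall(text))  # ['@HawthornFC', '#PTV']
```

Both patterns match exactly the same strings; the verbose form is only easier to read and to maintain.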

In [61]:
token_re = r'''(?x)
            (?:@[\w_]+) # matches @username
            |(?:\#+[\w_]+[\w\'_\-]*[\w_]+) # matches hash-tags
            |http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+ # matches URLs
            |(?:(?:\d+,?)+(?:\.?\d+)?) # matches numbers
            |(?:[a-z][a-z'\-_]+[a-z]) # matches words with hyphens and apostrophes 
            |(?:[\w_]+) # matches other words
'''

def tokenize(s):
    tokens = re.findall(emoji.get_emoji_regexp(), s) # first find all the emoticons
    tokens += re.findall(token_re, s)
    return tokens
print(tweet)
tokenize(tweet)

@HawthornFC Howz this?? 💪🏼 #PTV #metrotrains #hawthornalways https://t.co/LECNCiTcN5


['💪🏼',
 '@HawthornFC',
 'Howz',
 'this',
 '#PTV',
 '#metrotrains',
 '#hawthornalways',
 'https://t.co/LECNCiTcN5']

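Compared with the results given by the general-purpose tokenizer and the tweet tokenizer in NLTK, the customised tokenizer keeps the @username, the hashtags, the URL and the emoticon (together with its skin-tone modifier) as single tokens. 
Note, however, that it drops the punctuation marks and lists the emoticons before all other tokens rather than at their original positions.
Please do take a moment to study those regular expressions and check whether you can understand all of them.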
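
One way to study the alternatives is to try them one at a time. The sketch below copies the hashtag and number alternatives from `token_re` into two separate patterns; the variable names and test sentences are made up for this illustration.

```python
import re

# The hashtag and number alternatives of token_re, tested in isolation
hashtag_re = re.compile(r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)")
number_re = re.compile(r"(?:(?:\d+,?)+(?:\.?\d+)?)")

print(hashtag_re.findall("#PTV #metrotrains #a"))
# ['#PTV', '#metrotrains']
print(number_re.findall("1,200 passengers delayed by 12.5 minutes"))
# ['1,200', '12.5']
```

The first result shows a limitation worth knowing: the hashtag alternative needs at least two characters after the hash sign, so a one-letter tag such as `#a` is not matched by it.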

Note that the tokenizer is probably far from perfect for handling tweets, as the language used in tweets is very informal. Twitter users often use arbitrary abbreviations, repeat letters in words, and so on. It is very challenging to develop a tokenizer that can perfectly tokenize tweets.

After tokenizing the tweets, we can use some of the procedures introduced in Chapter 2 to further customise
the list of tokens we are interested in by counting word frequencies, removing stopwords, generating bigrams or even n-grams, etc.
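
As a reminder of what these steps look like, here is a minimal sketch that uses only the standard library. The token list and the stop-word set are hypothetical, and the code is not necessarily the same as the procedures used in Chapter 2.

```python
from collections import Counter

# Hypothetical token list and stop-word set, for illustration only.
tokens = ['the', 'train', 'is', 'late', 'the', 'train', 'is', 'full']
stop_words = {'the', 'is'}

# Remove stop words, then count what is left.
content = [t for t in tokens if t not in stop_words]
freq = Counter(content)

# Bigrams: pair each remaining token with its successor.
bigrams = list(zip(content, content[1:]))

print(freq.most_common(1))
print(bigrams)
```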

### 2.3 Pre-processing Tweets for Sentiment Analysis
We have been mentioning sentiment analysis since the beginning of this chapter. 
Now let's have a look at a simple example of how to pre-process tweets for sentiment analysis by adapting the Python code in [5]. 

As in [5], we assume that:
1. all the words in tweets are converted to lower case; 
2. all URLs and @usernames are eliminated by replacing them with "URL" and "AT_USER" respectively; 
3. hashtags are replaced with the exact name without the hash symbol; 
4. punctuation at the start and end of the tweets is removed.

The following Python function implements the above tasks.

`re.sub(pattern, repl, string, count=0, flags=0)`:  
Return the string obtained by replacing the leftmost non-overlapping occurrences of the pattern in string by the replacement `repl`.  `repl` can be either a string or a callable;
1. If a string, backslash escapes in it are processed.  
2. If a callable, it's passed the match object and must return a replacement string to be used.
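
The two kinds of `repl` can be compared on a small made-up string. The sketch below is only an illustration and is not part of the code in [5]; the helper `double` is a hypothetical name.

```python
import re

text = "delay of 12 mins, then 5 mins"  # made-up example

# repl as a string: every number becomes the same placeholder
as_string = re.sub(r"\d+", "NUM", text)

# repl as a callable: it receives the match object and returns the replacement
def double(match):  # hypothetical helper, for illustration only
    return str(int(match.group(0)) * 2)

as_callable = re.sub(r"\d+", double, text)

# count=1 replaces only the leftmost occurrence
first_only = re.sub(r"\d+", "NUM", text, count=1)

print(as_string)
print(as_callable)
print(first_only)
```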

In [63]:
def processTweet(tweet): #pre-processing
    #Convert to lower case
    tweet = tweet.lower()
    #remove emoticons
    tweet = re.sub(emoji.get_emoji_regexp(), '', tweet)
    #Convert www.* or https?://* to URL
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)
    #Convert @username to AT_USER
    tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)
    #Remove additional white spaces, single space only
    tweet = re.sub(r'[\s]+', ' ', tweet)
    #Replace #hashtag with hashtag
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
    #trim surrounding quotes
    tweet = tweet.strip('\'"')
    return tweet

You might have noticed the regular expression used to replace a hashtag such as "#PTV" with "PTV".
What does `r'\1'` mean? 

It indicates a **backreference** in regular expressions.
`'\1'` means **replacing** the match with the substring matched by **the first group** in the pattern, i.e., `[^\s]+`.
Instead of using a backreference with a sequence number, you can use named groups, such as
```python
      tweet = re.sub(r'#(?P<name>[^\s]+)', r'\g<name>', tweet)
```
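
Both spellings give the same result. A minimal check on a made-up string, which also shows `\g<1>` as an equivalent way of writing `\1`:

```python
import re

text = "#PTV #metrotrains on time"  # made-up example

numbered = re.sub(r'#([^\s]+)', r'\1', text)             # group referenced by number
named = re.sub(r'#(?P<name>[^\s]+)', r'\g<name>', text)  # group referenced by name
explicit = re.sub(r'#([^\s]+)', r'\g<1>', text)          # \g<1> is the same as \1

print(numbered)
print(named == numbered, explicit == numbered)
```
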
Let's try the function on the tweet we have been using so far as follows:

In [64]:
print (tweet)
processTweet(tweet)

@HawthornFC Howz this?? 💪🏼 #PTV #metrotrains #hawthornalways https://t.co/LECNCiTcN5


'AT_USER howz this?? ptv metrotrains hawthornalways URL'

In order to generate a good word vector for a sentiment analysis algorithm,
we need to **filter out the tweet words that are not of interest**.
These words may include **stop words**, words **with repeated letters**,
words **not starting with a letter**, and so on.
Let's start with handling repeated letters.
If we set `reduce_len` to `True`, `TweetTokenizer` can automatically reduce the number of times a letter repeats in a single token. 

Here we show you how to do it with regular expressions.

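`re.compile(pattern, flags=0)`:  
Compile a regular expression pattern into a regular expression object, which can then be used for matching and substitution through methods such as `sub()`.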

In [69]:
#look for 2 or more repetitions of a character and replace them with two occurrences of that character
def replaceTwoOrMore(s):
    pattern = re.compile(r"(.)\1{1,}", re.DOTALL)
    return pattern.sub(r"\1\1", s)

The pattern matches any substring in which the same character is repeated two or more times in a row.
The `re.DOTALL` flag tells Python to make the `.` special character match all characters, including newline characters.
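
The effect of `re.DOTALL` only shows up when the repeated character is a newline. A minimal sketch on a made-up string:

```python
import re

text = "sooo\n\n\nlate"  # three 'o's followed by three newlines

without_dotall = re.compile(r"(.)\1{1,}")
with_dotall = re.compile(r"(.)\1{1,}", re.DOTALL)

# Without DOTALL, '.' never matches '\n', so the run of newlines is left alone.
plain = without_dotall.sub(r"\1\1", text)
# With DOTALL, the run of newlines is reduced to two, just like the letters.
dotall = with_dotall.sub(r"\1\1", text)

print(repr(plain))
print(repr(dotall))
```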

In [70]:
print (replaceTwoOrMore('cooooool'))
print (replaceTwoOrMore('oooooooooops'))
print (replaceTwoOrMore('gooooood'))
print (replaceTwoOrMore('coooooold'))

cool
oops
good
coold


As you can see, the `replaceTwoOrMore` function is **not perfect**. The words recovered by this function do not always have correct lexical forms, such as "coold".

Mapping **"ill-formed" out-of-vocabulary words** to their standard lexical forms is known as **lexical normalization**, which is a very challenging research problem in natural language processing.

It has similarities with **spell checking**, but differs in that ill-formedness in tweets, for example, is often **intentional due to the 140-character limit**. 

If you would like to know more about lexical normalization, you should read the research paper on "[Lexical Normalisation of Short Text Messages: Makn Sens a #twitter](http://www.aclweb.org/anthology/P11-1038)" by Bo Han and Timothy Baldwin.
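
To give a flavour of the idea, the following is a deliberately naive sketch that collapses repeated letters and prefers a form found in a vocabulary. It is not the method of Han and Baldwin; the function `normalise` and the tiny vocabulary `VOCAB` are hypothetical and only serve as an illustration.

```python
import re

# Hypothetical mini-vocabulary; a real system would use a full lexicon.
VOCAB = {"cool", "cold", "good", "oops"}

def normalise(word, vocab=VOCAB):
    """Collapse repeated letters, preferring a form found in vocab."""
    two = re.sub(r"(.)\1+", r"\1\1", word)  # each run becomes two letters
    one = re.sub(r"(.)\1+", r"\1", word)    # each run becomes one letter
    for candidate in (two, one):
        if candidate in vocab:
            return candidate
    return two  # fall back to keeping two occurrences

for w in ["cooooool", "coooooold", "gooooood", "sooooo"]:
    print(w, "->", normalise(w))
```

With this vocabulary, "coooooold" is mapped to "cold" rather than "coold", while a word with no match in the vocabulary, such as "sooooo", still falls back to the two-letter form.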

Let's now move to removing stopwords from tweets. We will use the same stopword list used in the previous two chapters, and load and store all the stopwords in a Python set object.

In [71]:
with open('./stopwords_en.txt') as f:
    stopwords = set(f.read().splitlines())

We further remove punctuation such as commas, single/double quotes and question marks at the start and end of each word, so that, for example, "this??" is replaced with "this". We also remove words starting with non-alphabetic characters, e.g., "124m" and "7.07am".
Putting all the code together, we derive the following function for extracting feature words from a tweet:

In [76]:
def getFeatureVector(tweet):
    featureVector = []
    #split tweet into a list of words
    words = tweet.split()
    for w in words:
        #replace two or more with two occurrences
        w = replaceTwoOrMore(w)
        #strip punctuation
        w = w.strip('\'"?,.')
        #check if the word starts with a letter and contains only letters and digits
        val = re.search(r"^[a-zA-Z][a-zA-Z0-9]*$", w)
        #ignore if it is a stop word or does not pass the check above
        if(w in stopwords or val is None):
            #ignore, go for the next
            continue
        else:
            #store it as featureVector
            featureVector.append(w.lower())
    return list(set(featureVector))
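
To see what the stripping and matching steps do on their own, the sketch below applies them to a few made-up words. The helper `check` is a hypothetical name used only for this illustration; it repeats the two steps from `getFeatureVector` without the stop-word test.

```python
import re

def check(raw):
    """Return the stripped word and whether the regular expression accepts it."""
    w = raw.strip('\'"?,.')
    keep = re.search(r"^[a-zA-Z][a-zA-Z0-9]*$", w) is not None
    return w, keep

for raw in ['this??', '"shame"', '124m', '7.07am', 'b4', 'high-capacity']:
    w, keep = check(raw)
    print(raw, '->', w, 'kept' if keep else 'dropped')
```

Note that the regular expression is stricter than "starts with a letter": a word such as "high-capacity" is also dropped, because the hyphen is neither a letter nor a digit.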

Let's look at the feature words extracted for the tweets.

In [77]:
for t in tweets_text:
    print (t)
    print (getFeatureVector(processTweet(t)), "\n")

.@metrotrains, another day another diverted late city loop train from Caulfield straight to Flinders! @9NewsMelb! #cheers
['caulfield', 'late', 'train', 'straight', 'cheers', 'city', 'diverted', 'loop', 'day'] 

@metrotrains 12 mins delay YET AGAIN on Flinders service in Belgrave/Lilydale line which initially was supposed to be 5 mins.
['flinders', 'service', 'initially', 'line', 'mins', 'supposed', 'delay'] 

@danielbowen @VLine @jimbob_prod @metrotrains LMAO but a skyrail will make the trains run more efficiently! Oh deary me!
['lmao', 'run', 'deary', 'trains', 'skyrail', 'make'] 

@metrotrains not sure if guys who run Melbourne train services have "shame" in their dictionary. Sick of apologizing 2 my boss 4 being late.
['train', 'run', 'late', 'shame', 'apologizing', 'sick', 'dictionary', 'services', 'melbourne', 'boss', 'guys'] 

@metrotrains and we're away 10 min late
['late', 'min'] 

Amateur hr by @metrotrains  Flinders st, 9:49 Southern cross train on Platform 1 has no driver.


@metrotrains @3AW693 @ptua suggestions on dealing with obnoxious punks playing load music on busy train?
['suggestions', 'train', 'dealing', 'punks', 'obnoxious', 'load', 'playing', 'busy', 'music'] 

RT @theheraldsun: Victoria to build new high-capacity trains that will create 800 jobs https://t.co/XyqVsdSjke #metrotrains #Melbourne http…
['rt', 'url', 'victoria', 'metrotrains', 'create', 'trains', 'build', 'melbourne', 'jobs'] 

This @metrotrains driver refused to load wheelchair as there was 'no room' even when passengers were making room. https://t.co/wPOlQruyXh
['making', 'passengers', 'url', 'load', 'driver', 'wheelchair', 'refused', 'room'] 

@metrotrains last carriage of Pakenham train just arriving at flagstaff now - offensive music playing loud
['flagstaff', 'train', 'loud', 'playing', 'pakenham', 'arriving', 'carriage', 'music', 'offensive'] 

I already have a gf @metrotrains and that is all the validation i need. Xx
['xx', 'validation', 'gf'] 

@metrotrains cranbourne train

['broken', 'months', 'forms', 'warm', 'winter', 'character', 'heater', 'metrotrains', 'relish', 'brings'] 

@metrotrains only as we were about to leave.
['leave'] 

@metrotrains and... the next train is 5min earlier and left the platform at 16:23 at pascoe vale. what's the point of having a timetable.
['train', 'earlier', 'point', 'pascoe', 'timetable', 'left', 'platform', 'vale'] 

@metrotrains good job metrotrains, city bound craigieburn line was 2min early and left the platform at 4:06pm. It was so on time I missed it
['job', 'craigieburn', 'early', 'line', 'city', 'good', 'missed', 'left', 'platform', 'metrotrains', 'time', 'bound'] 

@metrotrains My train has been stuck at Blackburn station for 20 minutes now. Is there a reason why ?
['reason', 'train', 'blackburn', 'minutes', 'station', 'stuck'] 

@togiatomaisally @metrotrains the individual timetable for your bus route  by visiting https://t.co/hKMBpm3Qpv
['visiting', 'route', 'url', 'timetable', 'bus', 'individual'] 

@togiatom

@SLibba There is also express services to Flemington after 4:40pm.
['services', 'express', 'flemington'] 

RT @FlemingtonVRC: Extra train services are running to #Flemington today for #SuperSaturday.
https://t.co/kLzdUywunX https://t.co/kg2x1nmmxs
['train', 'rt', 'supersaturday', 'url', 'services', 'extra', 'today', 'running', 'flemington'] 

@metrotrains @FlemingtonVRC what time is the last service there, 1.04 seems very early
['time', 'service', 'early'] 

RT @FlemingtonVRC: Extra train services are running to #Flemington today for #SuperSaturday.
https://t.co/kLzdUywunX https://t.co/kg2x1nmmxs
['train', 'rt', 'supersaturday', 'url', 'services', 'extra', 'today', 'running', 'flemington'] 

Extra train services are running to #Flemington today for #SuperSaturday.
https://t.co/kLzdUywunX https://t.co/kg2x1nmmxs
['train', 'supersaturday', 'url', 'services', 'extra', 'today', 'running', 'flemington'] 

Next stage (@ Southern Cross Station - @metrotrains in Melbourne, VIC) https://t.co/NW

['dear', 'great', 'pakenham', 'driver', 'wanttobehome', 'stops', 'make', 'hurry'] 

Massive shout out to the flinders st station staff for chasing me from the entrance to the platform to return my dropped myki @metrotrains
['chasing', 'dropped', 'flinders', 'myki', 'massive', 'entrance', 'station', 'platform', 'return', 'st', 'shout', 'staff'] 

@metrotrains cranbourne train, carriage 526M, abusive guy smoking cigarettes and swearing at fellow commuters. Drinking too...
['abusive', 'train', 'smoking', 'swearing', 'fellow', 'cigarettes', 'guy', 'carriage', 'cranbourne', 'commuters', 'drinking'] 

Another delay in Frankston line #metrotrains what are u doing about it to fix it
['fix', 'line', 'metrotrains', 'frankston', 'delay'] 

Hi @metrotrains why did my 5:59 frankston from southern cross dissappear off the board?
['southern', 'board', 'dissappear', 'cross', 'frankston'] 

Gotta say #1 and #8 tram and the Upfield line have been unacceptable of late. @yarratrams @metrotrains canceled o

@metrotrains thanks! I'm on way to Parliament &amp; last time this happened I was issued with a ticket, hope inspectors aren't there today.
['hope', 'inspectors', 'happened', 'issued', 'parliament', 'time', 'today', 'ticket'] 

Not sure why, but no trains to city at Platform 1 in Footscray...platform 5 is completely rammed @metrotrains
['completely', 'city', 'rammed', 'platform', 'trains'] 

@metrotrains thanks, fun times
['times', 'fun'] 

@metrotrains so what's your excuse for not running 8.20 upfield to city? Inclement weather lol?
['excuse', 'weather', 'lol', 'city', 'inclement', 'running', 'upfield'] 

Pretty sure Fury Road was based on a woman's experience trying to get into the city via a @metrotrains service.
['pretty', 'experience', 'fury', 'service', 'city', 'based', 'road'] 

@metrotrains No. That is my point. The first notification from the app was an hour after the trains stopped.
['point', 'app', 'notification', 'hour', 'stopped', 'trains'] 

@NeilTDownUnder Yes, trains m

Free coffee's today compliments of @metrotrains Now that's a nice gesture ! Totally appreciated and your really trying THANKS 😆
['free', 'nice', 'appreciated', 'compliments', 'gesture', 'totally', 'today'] 

@metrotrains it's every carriage on the Craigieburn line every morning until we enter the loop. The air comes on as we enter :(
['morning', 'craigieburn', 'air', 'line', 'loop', 'enter', 'carriage'] 

@kixystix Hi Kim, sorry for the discomfort, can you give us a carriage number? so we can get it checked.
['checked', 'number', 'give', 'discomfort', 'kim', 'carriage'] 

You can grab a free commuter cuppa today courtesy of @metrotrains this morning at Aspendale Station #nice #metronotsobad #free
['morning', 'nice', 'today', 'courtesy', 'cuppa', 'station', 'commuter', 'metronotsobad', 'aspendale', 'grab', 'free'] 

Another stuffy train with zero air con. Thanks @metrotrains :( I love being a sweaty mess on the train from lack of air movement
['lack', 'train', 'stuffy', 'air', 'love', '

['lift', 'bloody', 'spring', 'booking', 'st', 'hall', 'slow'] 

@eirene_b Sorry for delays, after staff making announcements via the P.A. for latest info?
['delays', 'info', 'making', 'latest', 'announcements', 'staff'] 

@metrotrains I have the app-It doesn't work. No-it's not my phone settings. We have to rely on the tweets of other passengers. #MetroTrains
['phone', 'tweets', 'rely', 'work', 'passengers', 'settings', 'metrotrains'] 

@metrotrains guys is the Frankston line expected to return to normal any time soon?
['normal', 'expected', 'line', 'time', 'return', 'frankston', 'guys'] 

@SmurfingBeer This is about getting the best first aid assistance to the ill customer.
['aid', 'ill', 'assistance', 'customer'] 

@metrotrains so I guess announcing that a train will run express while people still have a chance to get off is an outdated courtesy now?
['guess', 'announcing', 'train', 'run', 'outdated', 'chance', 'express', 'courtesy', 'people'] 

@metrotrains So I'm paying for a servi

['tweet', 'run', 'service', 'correct', 'give', 'people', 'bullshit'] 

Driver of train between Patterson and Moorabbin: We're going to go, but may stop, so hold on #Metrotrains #Frankston
['stop', 'train', 'frankston', 'driver', 'metrotrains', 'hold', 'patterson'] 

RT @pamnanijatin: 40 mins delay going into the city, 40 mins delay on the way back home. Got to admire the consistency of @metrotrains #fra…
['rt', 'city', 'mins', 'back', 'home', 'admire', 'consistency', 'delay'] 

@metrotrains this is the worst commute ever. It's bad enough that we have to live a thousand miles away #frankstonlinewoes
['frankstonlinewoes', 'bad', 'commute', 'live', 'worst', 'miles', 'thousand'] 

@metrotrains 45min late for work due to signalling issues. Now at least 45min late home due to signalling issues. I want my life back...
['signalling', 'late', 'work', 'home', 'life', 'back', 'issues', 'due'] 

@metrotrains stuck in a stationery Frankston train for the past 20 mins nowhere near any platform..woul

['twitter', 'signed'] 

@metrotrains ahhh ok, but drivers should really keep passengers informed don't you think. Good customer service and all that
['drivers', 'passengers', 'service', 'good', 'customer', 'ahh', 'informed'] 

@sashaja97928039 Hi Sasha. There is a track equipment fault at Caulfiled causing delays.
['delays', 'track', 'equipment', 'caulfiled', 'causing', 'fault', 'sasha'] 

@metrotrains why is the 5:13pm Mordiallioc from Southern Cross travelling so slow? just left Caulfield
['caulfield', 'travelling', 'southern', 'cross', 'left', 'mordiallioc', 'slow'] 

@kerriemussert Hi Kerrie. There is a track equipment fault at Caulfield causing delays.
['caulfield', 'delays', 'track', 'equipment', 'kerrie', 'causing', 'fault'] 

@metrotrains  another shite evening service on the Frankston line. Shite service this morning too. At least you're consistent. Disgraceful.
['morning', 'evening', 'disgraceful', 'service', 'line', 'consistent', 'frankston', 'shite'] 

@metrotrains why so s

@metrotrains no driver announcement at all regarding the delay or skipping stations, had to buzz the driver to find out what was going on
['stations', 'buzz', 'announcement', 'driver', 'skipping', 'find', 'delay'] 

@Skaf41 Can you give us the station name so that we can check it?
['station', 'check', 'give'] 

maybe @metrotrains  should have a rule that people can't sleep across multiple seats on peek hour trains
['seats', 'sleep', 'rule', 'multiple', 'hour', 'people', 'trains', 'peek'] 

@metrotrains seems a common problem at Mentone. Waited five minutes in the end. Boom gates down. Train crawling through.
['end', 'mentone', 'problem', 'train', 'minutes', 'common', 'gates', 'waited', 'crawling', 'boom'] 

@dannipenguin Did the driver make any announcement to this effect?
['effect', 'announcement', 'driver', 'make'] 

@Tobymckinnon Thanks Toby, we'll pass on the message for necessary rectification.
['message', 'pass', 'toby', 'rectification'] 

@metrotrains Can we get #findskafsshit t

['delays', 'causing', 'line', 'issues', 'trains', 'signal', 'werribee'] 

#Melbourne! Where you must allow an extra hour b4 work, coz #MetroTrains WANTS you to be late. #EvilMotives #YesComplainingOnTwitterAgain
['evilmotives', 'late', 'b4', 'work', 'coz', 'yescomplainingontwitteragain', 'hour', 'metrotrains', 'extra'] 

@NotMetroNotify @metrotrains don't you dare cancel trains
['trains', 'cancel', 'dare'] 

@metrotrains another cancelled train on the Pakenham line what a tremendous service you provide #patheticmetrotrains
['tremendous', 'train', 'service', 'line', 'provide', 'patheticmetrotrains', 'pakenham', 'cancelled'] 

@metrotrains why is the air con never on? Just b/c there are clouds? Squashing ppl on a train without air con is cruel! Turn it on!!!!
['train', 'air', 'clouds', 'ppl', 'squashing', 'turn', 'con'] 

RT @Camwhite51: @metrotrains how can this keep happening?you closed the line down for weeks this year and a couple years back for maintenan…
['rt', 'couple', 'year', 'l

Similar to counting the frequencies of emoticons, we can also count the frequencies of words in the pre-processed tweets:

In [79]:
f_list = []
for t in tweets_text:
    f_list += getFeatureVector(processTweet(t))

f_counter = collections.Counter(f_list)
pt = PrettyTable(field_names=['Words','Count'])
for kv in f_counter.most_common(10): #kv is a tuple: (key, value)
    pt.add_row(kv)

pt.align['Words'], pt.align['Count'] = 'l', 'r'
print (pt)

+-------------+-------+
| Words       | Count |
+-------------+-------+
| url         |   486 |
| train       |   479 |
| rt          |   330 |
| line        |   300 |
| metrotrains |   286 |
| trains      |   238 |
| frankston   |   192 |
| station     |   160 |
| service     |   150 |
| good        |   135 |
+-------------+-------+


Apart from the "url" placeholder, "train" was the most frequent word, which is not surprising as all the tweets are about metrotrains.
"rt" was also a very common token, which implies that there were a number of retweets. 
Finally, let's save the pre-processed tweets as one column in the dataframe.

In [81]:
feature_list = []
for tweet in tweets_text:
    preprocessed = processTweet(tweet)
    FeatureVector = getFeatureVector(preprocessed)
    feature_list.append(FeatureVector)

tweets_pddf['feature_words'] = feature_list

In [82]:
tweets_pddf.head(50)

| Unnamed: 0 | id | user_id | user_name | user_sname | created_at | text | emoticons | feature_words |
|---|---|---|---|---|---|---|---|---|
| 0 | 710240637497380864 | 534729059 | Luke Sabatini | luke_sabatini | Wed Mar 16 23:05:38 +0000 2016 | .@metrotrains, another day another diverted la... | | [caulfield, late, train, straight, cheers, cit... |
| 1 | 710240382311731200 | 215322466 | Sina Marandian | myCroon | Wed Mar 16 23:04:37 +0000 2016 | @metrotrains 12 mins delay YET AGAIN on Flinde... | | [flinders, service, initially, line, mins, sup... |
| 2 | 710239778554257408 | 947522706 | Brett Keleher | thebrickcleaner | Wed Mar 16 23:02:13 +0000 2016 | @danielbowen @VLine @jimbob_prod @metrotrains ... | | [lmao, run, deary, trains, skyrail, make] |
| 3 | 710238746688364545 | 215322466 | Sina Marandian | myCroon | Wed Mar 16 22:58:07 +0000 2016 | @metrotrains not sure if guys who run Melbourn... | | [train, run, late, shame, apologizing, sick, d... |
| 4 | 710238698428706816 | 94544311 | Ant | AntB77 | Wed Mar 16 22:57:55 +0000 2016 | @metrotrains and we're away 10 min late | | [late, min] |
| 5 | 710237588318085120 | 94544311 | Ant | AntB77 | Wed Mar 16 22:53:31 +0000 2016 | Amateur hr by @metrotrains Flinders st, 9:49 ... | | [train, southern, hr, flinders, cross, driver,... |
| 6 | 710234915384590336 | 310818887 | Brad Cook | PkmtBrad | Wed Mar 16 22:42:54 +0000 2016 | @danielbowen @jimbob_prod @metrotrains @VLine ... | | [ago, happened, good, decade, narre, time, rec... |
| 7 | 710233767881748480 | 182766018 | Mark Stilve | stilves | Wed Mar 16 22:38:20 +0000 2016 | @metrotrains hope our train driver's day gets ... | | [hope, notcool, clear, train, angryman, spray,... |
| 8 | 710232721625206784 | 387479230 | Emily | emtoone | Wed Mar 16 22:34:10 +0000 2016 | @metrotrains can you ever have trains running ... | | [trains, time, running] |
| 9 | 710229717870186496 | 93570145 | Fake Metro Trains | fakemetrotrains | Wed Mar 16 22:22:14 +0000 2016 | @esayche Thanks for your feedback! Just do wha... | | [metrotrains, vandalise, train] |


## 3. Conclusion
We started this chapter by learning how to create an authenticated connection and then progressed through a series of code examples that illustrated how to pre-process tweets and make them ready for analysis.
Besides, there are a couple of good tutorials on handling tweets, which are useful for review. They
are listed in the following section.

## 4. Reference reading materials
1. "[Mining Twitter: Exploring Trending Topics, Discovering What People Are Talking About, and More](https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/blob/master/ipynb/Chapter%201%20-%20Mining%20Twitter.ipynb)" in "Mining the Social Web". 📖 
2. "[Mining Twitter Data with Python](http://marcobonzanini.com/2015/03/02/mining-twitter-data-with-python-part-1/)", a tutorial by Marco Bonzanini. 📖 
3. "[Twitter API tutorial](http://socialmedia-class.org/twittertutorial.html)" by Wei Xu, which gives a quick tutorial on making API requests through two types of Twitter APIs. 📖 
4. "[Sentiment Expression via Emoticons on Social Media](http://arxiv.org/pdf/1511.02556.pdf)" by Hao Wang and Jorge A. Castanon (reading this paper is optional).
5. "[How to Build a Twitter Sentiment Analyzer](http://ravikiranj.net/posts/2012/code/how-build-twitter-sentiment-analyzer/)" by Ravikiran Janardhana. Read two subsections, which are "Preprocess tweets" and "Filtering tweet words". 📖

## 5. Exercises
1. Extract all the conventional emoticons that are represented by punctuation marks, numbers and letters. Hint: you might need to construct your own regular expressions that can handle the following emoticons: :), :-), :(, :O, :-(, and so on.