### Before Starting
This tutorial teaches the basics of using Python's `twitter` package to extract and analyse tweets, with a focus on emoji usage. We first describe how to extract data from the Twitter API, then how to detect emojis, and lastly perform some preliminary analysis.

If you have not already done so, run `pip install twitter` from a terminal to install Python's `twitter` package.

In [None]:
%matplotlib inline
from datascience import *
import numpy as np
import matplotlib.pyplot as plt
import twitter

Now, log into your Twitter account, head to https://dev.twitter.com/apps, and create a new app. This is required to gain API access on Twitter. The $4$ blurred fields are required for the package to gain access to the Twitter API, i.e. to make calls to the API. Be sure not to share your access keys with others, as they are unique to your Twitter account.
<img src="./twitter_settings.png" style="width: 600px;"/>
In the next cell, fill in the values corresponding to the blurred fields from your app. We can then begin accessing the API and running analysis through your app.

In [None]:
CONSUMER_KEY = ''
CONSUMER_SECRET = ''
OAUTH_TOKEN = ''
OAUTH_TOKEN_SECRET = ''

auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
                           CONSUMER_KEY, CONSUMER_SECRET)

twitter_api = twitter.Twitter(auth=auth)

# Nothing to see by displaying twitter_api except that it's now a 
# defined variable

print(twitter_api)

If the above outcome looks something like `<twitter.api.Twitter object at 0x1165df908>`, that means we have created a connection without error.
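
Since the access keys should not be shared, one option is to keep them out of the notebook entirely and read them from environment variables. The sketch below is only an illustration: the variable names and the `load_credentials` helper are hypothetical and not part of the `twitter` package, and the demonstration uses a stand-in mapping rather than real keys.

```python
import os

# Hypothetical variable names; use whatever names you exported in your shell.
CREDENTIAL_NAMES = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_OAUTH_TOKEN",
    "TWITTER_OAUTH_TOKEN_SECRET",
)

def load_credentials(env=os.environ):
    """Return the four credentials as a dict, or raise if any is missing or empty."""
    missing = [name for name in CREDENTIAL_NAMES if not env.get(name)]
    if missing:
        raise KeyError("Missing credentials: " + ", ".join(missing))
    return {name: env[name] for name in CREDENTIAL_NAMES}

# Demonstration with a stand-in mapping instead of the real environment.
demo_env = {name: "placeholder" for name in CREDENTIAL_NAMES}
creds = load_credentials(demo_env)
print(sorted(creds))
```

The values in `creds` could then be passed to `twitter.oauth.OAuth` in the same order as in the cell above.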

### Pulling Tweets using search
Now we can parse the results of a search query on Twitter. Feel free to play around with the value of the `query` string. For the purpose of the tutorial, we have used '#nobannowall'. If you do experiment, note that this function only works for terms that would be valid on twitter.com/search.

Make sure not to set `count` too high. For values ~$100000$, the query could take a very long time to run. Here, we have kept it at $100$, and the cell should run in less than a second.

Run the cell below to understand the format of the results.

The result is a JSON (JavaScript Object Notation) object. You can think of it as a dictionary in Python. `search_results` is a dictionary containing `statuses`, which is a list of dictionaries, each one corresponding to a different tweet containing the search term '#nobannowall'.
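
To make that nesting concrete without calling the API, here is a small hand-written stand-in for a response. The contents of `mock_results` are invented for illustration, and a real response carries many more keys per tweet.

```python
# Hand-written stand-in for an API response; the values are made up.
mock_results = {
    "statuses": [
        {"text": "first tweet #nobannowall", "retweet_count": 3,
         "user": {"screen_name": "user_a"}},
        {"text": "second tweet #nobannowall", "retweet_count": 0,
         "user": {"screen_name": "user_b"}},
    ],
}

mock_statuses = mock_results["statuses"]   # a list with one dictionary per tweet
first = mock_statuses[0]                   # the dictionary for the first tweet

print(sorted(first.keys()))
print(first["user"]["screen_name"])        # nested lookup: tweet -> user -> screen name

# Collect one field across all tweets
screen_names = [status["user"]["screen_name"] for status in mock_statuses]
print(screen_names)
```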

In [None]:
query = "#nobannowall"
count = 100

search_results = twitter_api.search.tweets(q=query, count=count)

statuses = search_results['statuses']

statuses[0]

Within each `status`, the keys correspond to data regarding the tweet. Let's try to extract some relevant data into a Table containing information regarding text, date and time posted, screen name of poster and number of retweets.

To do this, let's first write a function that takes in a tweet and returns an array consisting of this information corresponding to that tweet. Feel free to include more information if you'd like. Make sure you pick the correct path from the JSON object you viewed earlier.

In [None]:
def get_data_from_tweet(t):
    text = t['text']
    date_time = t['created_at']
    name = t['user']['screen_name']
    rt_count = t['retweet_count']
    return [name, date_time, text, rt_count]

# Test if it works on statuses[0]

get_data_from_tweet(statuses[0])
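
The `created_at` value is returned as a string. If you want to filter or sort by time later, it can help to convert it to a `datetime`. The sketch below assumes the string has the layout shown in `sample`; check this against the output of the cell above before relying on it. The helper name `parse_created_at` is our own.

```python
from datetime import datetime

# Assumed layout of the 'created_at' string, e.g. "Sat Jan 28 17:05:12 +0000 2017"
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def parse_created_at(created_at):
    """Convert a 'created_at' string into a timezone-aware datetime."""
    return datetime.strptime(created_at, CREATED_AT_FORMAT)

sample = "Sat Jan 28 17:05:12 +0000 2017"
posted = parse_created_at(sample)
print(posted.isoformat())
```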

Now that we have a function that extracts the relevant data from $1$ tweet, let's apply this function on all the tweets we have and we will then have an array of rows for our table.

Using this array, we can define our Table called Tweets. The output should look something like:
<img src="./TweetsTable.png" style="width: 600px;"/>

In [None]:
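# Keep the rows as a plain list so that retweet counts stay integers;
# np.array would convert every value in a mixed-type row to a string.
tweets = [get_data_from_tweet(status) for status in statuses]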

Tweets = Table(['User', 'Time Posted', 'Text', 'Retweet Count']).with_rows(tweets)

Tweets

### Emoji Usage Analysis
Now, we will restrict our analysis to the Table Final_Tweets, which contains tweets posted on Jan 28th or 29th containing one or more of the following hashtags: #NoBanNoWall, #NoMuslimBan, #NotMyPresident, #TheResistance or #WomensMarch. Within these, duplicate names/urls were removed. These tweets were gathered using the method above. They were extracted, removed if duplicate and then saved into a `csv` file.

We've also imported the emoji dataset, which contains data regarding $842$ emojis, so as to recognize them in the tweets.

In [None]:
Final_Tweets = Table.read_table("./tutorial_tweets.csv")
Final_Tweets

In [None]:
emojis = Table.read_table("./complete_emoji.csv")
emojis = emojis.relabel('R-encoding', 'String Representation')

emojis

Next, we compute how many tweets each emoji appears in and then rank each emoji. Since this requires checking the presence of $842$ emojis in $57552$ tweets, it'll take a while to run. The `apply` method from the `datascience` package and NumPy's `np.sum` keep the code compact, but the cell still takes about $2$ minutes to run.

In [None]:
emojis['count'] = np.sum(Final_Tweets.apply(lambda y : emojis.apply(lambda x : x in y, 'String Representation')\
                               , 'text'), axis = 0)
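
To see what the cell above computes, here is the same idea on toy data using only NumPy. The tweets and symbols are made up, and ASCII emoticons stand in for the emoji strings. Each row of `presence` is a tweet, each column is a symbol, and summing along `axis=0` gives the number of tweets containing each symbol.

```python
import numpy as np

# Made-up data: ASCII emoticons stand in for the emoji strings
toy_tweets = ["good morning :)", "so sad :(", ":) and :( together", "no symbols here"]
toy_symbols = [":)", ":(", "<3"]

# presence[i, j] is True when symbol j occurs in tweet i
presence = np.array([[symbol in tweet for symbol in toy_symbols]
                     for tweet in toy_tweets])

# Column sums: number of tweets containing each symbol
toy_counts = presence.sum(axis=0)
print(toy_counts)   # [2 2 0]
```

Note that this counts tweets containing a symbol at least once, not the total number of occurrences.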

In [None]:
args = np.argsort(emojis['count'])

arr = np.zeros(len(args))

# Rank 1 goes to the emoji with the highest count
for i in np.arange(len(args), 0, -1):
    arr[args[i - 1]] = len(args) + 1 - i

emojis['rank'] = arr
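
The ranking loop can also be written without an explicit loop. The sketch below compares both versions on a small made-up array of counts. When counts are tied, the order returned by `np.argsort` decides which emoji gets the better rank.

```python
import numpy as np

toy_counts = np.array([5, 1, 9, 3])
order = np.argsort(toy_counts)      # indices from smallest to largest count
n = len(toy_counts)

# Loop version, mirroring the cell above (rank 1 = highest count)
loop_ranks = np.zeros(n)
for i in np.arange(n, 0, -1):
    loop_ranks[order[i - 1]] = n + 1 - i

# Vectorized version: walk the indices from largest to smallest count
# and hand out ranks 1, 2, ..., n
vec_ranks = np.empty(n)
vec_ranks[order[::-1]] = np.arange(1, n + 1)

print(loop_ranks)   # [2. 4. 1. 3.]
print(vec_ranks)    # [2. 4. 1. 3.]
```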

In [None]:
emojis.sort('count', descending = True)

### Visualising Emojis
The simplest and most elegant way to visualize categorical data is through a bar graph. Let us create a table `top10` that contains the $10$ most tweeted emojis in our dataset along with their `count` and `rank` values.

Visualize the tweet counts of these emojis using a bar graph.

In [None]:
top10 = emojis.where('rank', are.below_or_equal_to(10)).select(['Native', 'count', 'rank'])\
    .sort('count', descending = True)

In [None]:
top10.barh(0, 1, width=12, height=8)

### Advanced Visualisation
For those interested in a more advanced visualisation, here we compare emoji frequency between $2$ subsets of the data.

First, we create a matrix of $0$s and $1$s, where `mat`$_{ij}$ represents the presence of the $j^{th}$ emoji in the $i^{th}$ tweet.

In [None]:
mat = 1*(Final_Tweets.apply(lambda y : emojis.apply(lambda x : x in y, 'String Representation')\
                               , 'text'))

Now, we define two different subsets of the data, then count the emojis in those subsets based on `mat`. For example, let's compare emoji usage between tweets mentioning `#womensmarch` and tweets mentioning `#theresistance`. First, we create the two subsets, count emojis in each subset, and create a combined dataset to facilitate comparisons.

In [None]:
# Which rows correspond to tweets with each hashtag?

# Lowercase the text first so that e.g. "#WomensMarch" also matches
womensmarch_rows = Final_Tweets.apply(lambda x : "#womensmarch" in x.lower(), 'text')
theresistance_rows = Final_Tweets.apply(lambda x : "#theresistance" in x.lower(), 'text')

# Matrix subsets for each hashtag

womensmarch_mat = mat[womensmarch_rows, :]
theresistance_mat = mat[theresistance_rows, :]

# Convert to sums of occurrence of each emoji in each subset

emoji_womensmarch = np.apply_along_axis(np.sum, arr = womensmarch_mat, axis = 0)
emoji_theresistance = np.apply_along_axis(np.sum, arr = theresistance_mat, axis = 0)

# Add columns to our emoji dataset corresponding to these values

emojis['#theresistance Density'] = emoji_theresistance
emojis['#womensmarch Density'] = emoji_womensmarch
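
The row selection and column sums can be checked on a toy presence matrix. All data below is made up. The sketch lowercases each text before matching, which is an assumption about how hashtags should be matched rather than something the dataset guarantees.

```python
import numpy as np

# toy_mat[i, j] == 1 when symbol j occurs in tweet i (made-up values)
toy_mat = np.array([[1, 0, 0],
                    [0, 1, 0],
                    [1, 1, 0],
                    [0, 0, 1]])
toy_texts = ["#WomensMarch today", "#TheResistance now",
             "#womensmarch and #theresistance", "unrelated"]

# Boolean masks: which tweets mention each hashtag (case-insensitive)
in_a = np.array(["#womensmarch" in t.lower() for t in toy_texts])
in_b = np.array(["#theresistance" in t.lower() for t in toy_texts])

# Keep only the matching rows, then sum each column
counts_a = toy_mat[in_a].sum(axis=0)
counts_b = toy_mat[in_b].sum(axis=0)

print(counts_a)   # [2 1 0]
print(counts_b)   # [1 2 0]
```

A tweet that mentions both hashtags, like the third one here, contributes to both subsets.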

Now, we need to choose which emojis to consider for our analysis. In the following, `thresh` is the minimum count an emoji must have in both subsets, and `thresh_for_each` is the minimum count it must reach in at least $1$ of them. In our example we restrict our attention to the top `k` emojis in each subset, setting `k = 50`, `thresh = 1` and `thresh_for_each = 3`. Feel free to play around with these values.

In [None]:
k = 50
thresh = 1
thresh_for_each = 3

# Subset top k for each category

keep = np.zeros(emojis.num_rows)
keep[np.argsort(emojis['#theresistance Density'])[-k:]] = 1
keep[np.argsort(emojis['#womensmarch Density'])[-k:]] = 1

# Subset minimum threshold

keep = np.logical_and(keep, np.logical_and(emojis['#theresistance Density'] >= thresh, \
                                          emojis['#womensmarch Density'] >= thresh))

# Subset the minimum for at least one subset

keep = np.logical_and(keep, np.logical_or(emojis['#theresistance Density'] >= thresh_for_each, \
                                          emojis['#womensmarch Density'] >= thresh_for_each))

# Make the subset

dataset_to_analyse = emojis.take(np.arange(emojis.num_rows)[keep])

print(dataset_to_analyse.num_rows)

dataset_to_analyse
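
To see how the three filters interact, here is the same selection logic wrapped in a function and run on small made-up counts. The function name `select_rows` and the toy arrays are our own and are only for illustration.

```python
import numpy as np

def select_rows(a, b, k, thresh, thresh_for_each):
    """Boolean mask of entries that pass the top-k and threshold filters."""
    a = np.asarray(a)
    b = np.asarray(b)
    keep = np.zeros(len(a), dtype=bool)
    # In the top k of either subset
    keep[np.argsort(a)[-k:]] = True
    keep[np.argsort(b)[-k:]] = True
    # At least `thresh` in both subsets
    keep &= (a >= thresh) & (b >= thresh)
    # At least `thresh_for_each` in one of the subsets
    keep &= (a >= thresh_for_each) | (b >= thresh_for_each)
    return keep

toy_a = [10, 0, 2, 7, 1]
toy_b = [1, 8, 2, 0, 9]
toy_keep = select_rows(toy_a, toy_b, k=2, thresh=1, thresh_for_each=3)
print(toy_keep)   # [ True False False False  True]
```

Entries 1 and 3 are in a top 2 but have a count of 0 in the other subset, so `thresh = 1` removes them. Entry 2 is in neither top 2.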

Before proceeding, make sure the number of rows in the final dataset isn't too small (we don't want to miss information) or too large (we don't want irrelevant information). Somewhere between $5$ and $10$ should be fine.

Lastly, we plot the `log odds` ratio of emoji counts in tweets containing `#theresistance` vs. those containing `#womensmarch` on the x-axis, and the overall frequency of each emoji per $1000$ tweets on the y-axis.

In [None]:
logOdds = np.log(dataset_to_analyse['#theresistance Density']/dataset_to_analyse['#womensmarch Density'])
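# Frequency per 1000 tweets: share of all tweets containing the emoji, times 1000
Overall_Frequency_per1000 = 1000 * dataset_to_analyse['count'] / Final_Tweets.num_rows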

plt.plot(logOdds, Overall_Frequency_per1000, 'ro')

# Label each point with the emoji

labels = dataset_to_analyse['Native']

for label, x, y in zip(labels, logOdds, Overall_Frequency_per1000):
    plt.annotate(
        label,
        xy=(x, y), xytext=(-20, 20),
        textcoords='offset points', ha='right', va='bottom',
        bbox=dict(boxstyle='round,pad=0.5', fc='yellow', alpha=0.5),
        arrowprops=dict(arrowstyle = '->', connectionstyle='arc3,rad=0'),
        fontname='symbola',
        fontsize=20)

plt.show()
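
The two plotted quantities can be worked through by hand on small numbers. The counts below are made up, and the helper names `log_ratio` and `per_1000` are our own. Note that the x-axis value in this tutorial is the log of a ratio of raw counts, so it is positive when an emoji is more common in the first subset, negative when it is more common in the second, and $0$ when the counts are equal.

```python
import math

def log_ratio(count_a, count_b):
    """Natural log of the ratio of two positive counts."""
    return math.log(count_a / count_b)

def per_1000(count, n_tweets):
    """Number of tweets containing the emoji per 1000 tweets overall."""
    return 1000 * count / n_tweets

print(log_ratio(20, 5))     # positive: more common in the first subset
print(log_ratio(5, 5))      # 0.0: equally common
print(log_ratio(5, 20))     # negative: more common in the second subset
print(per_1000(25, 5000))   # 5.0
```

Both counts must be at least $1$ for the logarithm to be defined, which is what `thresh = 1` ensures.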

Note some of the interesting trends here. When the `log odds` ratio is close to $0$, intuitively what should the overall frequency of those emojis be compared to the others? Do emojis to the far left and far right represent emotions you would associate with `#womensmarch` tweets and `#theresistance` tweets respectively?