<img style="float: left; vertical-align: middle; margin: 1em;" src="images/surf.png" >
<img style="float: right; height: 5em; vertical-align: middle; margin: 1em;" src="images/twitter.png">

<hr style="clear: both;" />

# Twi-XL TwiNL collection demo

Twi-XL contains a Twitter archive called TwiNL, which has been maintained by [SURF](https://www.surf.nl/en) since 2011. The Twi-XL API exposes functionality for searching tweets and counting word frequencies. This notebook provides a brief overview of this functionality through the [Twi-XL Python library](https://gitlab.com/twi-xl-surf-nl/twi-xl-python).

Check the [Architecture Overview](https://twi-xl-python.readthedocs.io/en/latest/architecture_overview.html) of the Twi-XL API.

Check the [Read the Docs documentation](https://twi-xl-python.readthedocs.io/en/latest/api.html#main-interface) of the Twi-XL Python library.

## Installing the Python Twi-XL library
To install the Twi-XL Python library and the dependencies we will need in this notebook, run the following cell:

(**Warning: this might take a while**)

In [None]:
!pip3 install --quiet seaborn spacy tqdm tweepy
!pip3 install --quiet git+https://gitlab.com/twi-xl-surf-nl/twi-xl-python.git@5c3d4e9a38ccdf5826d2a832026be96a2f7a0e0d

## Importing the TwiNL package
The `twinl` package from the Python Twi-XL library provides all functionality for interfacing with the TwiNL archive. Run the following cell to import it:

In [None]:
from twixl.collections import twinl

## Configuring the API
The Twi-XL Python library needs some information to communicate with the Twi-XL API. We will set two environment variables, one containing the endpoint (URL) of the Twi-XL API, and another containing an API key that is used to authenticate with the API:

In [None]:
%env TWIXL_API_ENDPOINT=https://api.twi-xl.sda-projects.nl
%env TWIXL_API_KEY=<TWIXL API KEY>
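
Before calling the API it can help to confirm that both variables are actually set. The snippet below is a minimal sketch using only the standard library; the helper name `missing_settings` is hypothetical and not part of the Twi-XL Python library.

```python
import os

# Variable names as used in the configuration cell of this notebook.
REQUIRED_SETTINGS = ("TWIXL_API_ENDPOINT", "TWIXL_API_KEY")


def missing_settings(env=None):
    """Return the required variable names that are unset or empty in `env`.

    Hypothetical helper for illustration; defaults to the process environment.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
example_env = {"TWIXL_API_ENDPOINT": "https://api.twi-xl.sda-projects.nl"}
print(missing_settings(example_env))  # ['TWIXL_API_KEY']
```

Calling `missing_settings()` without arguments checks the real environment of the notebook kernel.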

## Archive metrics
We're all set! First, let's have a look at the number of tweets collected since the beginning. We'll use the `tweet_metrics()` function to retrieve the number of tweets for each day in the archive and convert them to a Pandas series:

In [None]:
tweet_metrics = twinl.tweet_metrics()
tweet_metrics.to_pandas()

We can plot the tweet metrics using the `plot_tweet_metrics()` function:

In [None]:
import seaborn as sns

import twixl.collections.twinl.plotting

# The lines below configure the environment for plotting, we only need to do this once

sns.set(rc={'figure.figsize': (14, 8)})
sns.set_context('notebook', font_scale=2)
sns.set_style('whitegrid')

twinl.plotting.plot_tweet_metrics(tweet_metrics);

**Note: some tweets in the archive are marked as being created before the inception of the TwiNL archive, some of them even from 1994! The archive contains raw data from Twitter and can contain these kinds of inconsistencies as a result.**
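
If these early dates get in the way of an analysis, they can be filtered out before plotting. The following is a rough sketch on a plain dictionary of made-up daily counts. The cutoff of January 1, 2011 is an assumption (the text above only says "since 2011"), and `drop_pre_archive_days` is a hypothetical helper.

```python
from datetime import date

# Assumption: the exact start date of the archive is not stated, so this is a placeholder.
ARCHIVE_START = date(2011, 1, 1)


def drop_pre_archive_days(counts_per_day, start=ARCHIVE_START):
    """Keep only the days on or after `start`."""
    return {day: n for day, n in counts_per_day.items() if day >= start}


# Made-up counts, for illustration only.
example_counts = {
    date(1994, 6, 1): 1,
    date(2010, 12, 31): 3,
    date(2017, 1, 4): 1200,
}
filtered = drop_pre_archive_days(example_counts)
print(filtered)  # {datetime.date(2017, 1, 4): 1200}
```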

## Query example 1: OR query
In order to search for tweets we will first need to construct a query. We can do this with the `Query` class in the `twinl` module. As an example, consider the following query designed to find tweets containing the words 'elfstedentocht' **or** 'schaatsen':

In [None]:
query_example_1 = (
    twinl.Query()
        .or_(keywords=["elfstedentocht", "schaatsen"])
)

query_example_1.print()

The `or_()` method used above specifies that a query matches if a tweet contains any of the words in the list provided by the `keywords` parameter. The `print()` method prints the contents of the query that will be sent to the Twi-XL API endpoint.

## Query example 2: AND query
We can also write queries where all keywords must be present in a tweet. Consider the following example, where the words 'elfsteden' and 'tocht' must both appear in the tweet:

In [None]:
query_example_2 = (
    twinl.Query()
        .and_(keywords=["elfsteden", "tocht"])
)

query_example_2.print()

## Query example 3: AND + OR query
Query criteria such as AND and OR can be combined by chaining operators such as `and_()` and `or_()` on the `Query` object. When specifying a query with multiple criteria the query will match if **any** of the criteria matches. For example, consider the following query where we will match tweets if any of the following criteria apply:

1. The tweet contains the word 'elfstedentocht' OR 'schaatsen';
2. The tweet contains the word 'elfsteden' AND 'tocht'.

In [None]:
query_example_3 = (
    twinl.Query()
        .or_(keywords=["elfstedentocht", "schaatsen"])
        .and_(keywords=["elfsteden", "tocht"])
)

query_example_3.print()
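
To make the matching rule concrete, here is a small pure-Python sketch of the semantics described above: a tweet matches if any single criterion matches. It is an illustration only, not the way the Twi-XL API evaluates queries; the `matches` helper and the whitespace tokenisation are assumptions.

```python
def matches(text, criteria):
    """Return True if `text` satisfies at least one criterion.

    `criteria` is a list of (mode, keywords) pairs, where mode is "or" or "and".
    Assumption: words are compared after lowercasing and splitting on whitespace.
    """
    words = set(text.lower().split())
    for mode, keywords in criteria:
        hits = [keyword in words for keyword in keywords]
        if (mode == "or" and any(hits)) or (mode == "and" and all(hits)):
            return True
    return False


criteria = [
    ("or", ["elfstedentocht", "schaatsen"]),
    ("and", ["elfsteden", "tocht"]),
]

print(matches("morgen gaan we schaatsen", criteria))  # True (OR criterion)
print(matches("de elfsteden tocht komt er", criteria))  # True (AND criterion)
print(matches("de tocht was lang", criteria))  # False (neither criterion)
```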

## Query example 4: regular expressions
With regular expressions we can create more flexible queries. Consider the following example, where we search for any tweet containing words starting with 'elf' or 'schaats':

In [None]:
query_example_4 = (
    twinl.Query()
        # Raw strings keep "\b" as a word boundary instead of a backspace character
        .or_(keywords=[r"\belf\w+", r"\bschaats\w+"], regex=True)
)

query_example_4.print()
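
When writing such patterns as Python string literals, keep in mind that `"\b"` in a regular string is the backspace character; a raw string (`r"..."`) preserves it as the regex word boundary. The sketch below tries the two patterns locally with Python's `re` module. Whether the Twi-XL backend uses exactly the same regex dialect is an assumption.

```python
import re

# In a regular string literal "\b" is a backspace; in a raw string it stays backslash + b.
print("\b" == "\x08")   # True
print(r"\b" == "\\b")   # True

patterns = [re.compile(r"\belf\w+"), re.compile(r"\bschaats\w+")]


def matches_any(text):
    """Return True if any of the patterns is found in `text`."""
    return any(pattern.search(text) for pattern in patterns)


print(matches_any("de elfstedentocht komt eraan"))  # True: word starts with 'elf'
print(matches_any("we gaan schaatsen"))             # True: word starts with 'schaats'
print(matches_any("dat is hetzelfde"))              # False: 'elf' is not at a word start
print(matches_any("elf"))                           # False: \w+ needs at least one more character
```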

## Query example 5: combining AND, OR and regular expressions
As a last example, consider this query that finds tweets matching any of the following criteria:

1. The tweet contains the words 'elfstedentocht' or 'schaatsen';
1. The tweet contains words starting with 'elf' or 'schaats';
1. The tweet contains the words 'elfsteden' and 'tocht'.

In [None]:
query_example_5 = (
    twinl.Query()
        .or_(keywords=["elfstedentocht", "schaatsen"])
        # Raw strings keep "\b" as a word boundary instead of a backspace character
        .or_(keywords=[r"\belf\w+", r"\bschaats\w+"], regex=True)
        .and_(keywords=["elfsteden", "tocht"])
)

query_example_5.print()

## Searching with the API
Apart from a search query (we will use query example 5), we also need to provide the API with a time range, consisting of a start date and time and an end date and time. Optionally, we can specify the maximum number of tweets we want returned for our query.

The following cell runs our last query on the TwiNL archive for the period from January 1, 2017 through January 22, 2017 using the `search()` function. Because searching can take a while, we provide a so-called callback function to the `search()` function. This function will be called every few seconds with the current status of the query. The `twinl.print_callback` in the cell below is a default callback that simply prints the current query status.

In [None]:
from datetime import datetime

search_results = twinl.search(
    query=query_example_5,
    start_time=datetime(2017, 1, 1, 0, 0),
    end_time=datetime(2017, 1, 22, 23, 59, 59),
    max_results=100,
    callback=twinl.print_callback
)
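
The start and end of such a time range can also be derived from calendar dates. The helper below is a small sketch with a hypothetical name (`day_range`); it assumes, as in the cell above, that the range runs from midnight on the first day up to and including 23:59:59 on the last day.

```python
from datetime import date, datetime, time


def day_range(first_day, last_day):
    """Return (start, end) datetimes covering whole days from first_day through last_day."""
    start = datetime.combine(first_day, time.min)       # 00:00:00 on the first day
    end = datetime.combine(last_day, time(23, 59, 59))  # 23:59:59 on the last day
    return start, end


start_time, end_time = day_range(date(2017, 1, 1), date(2017, 1, 22))
print(start_time)  # 2017-01-01 00:00:00
print(end_time)    # 2017-01-22 23:59:59
```

The resulting values could then be passed as `start_time` and `end_time` to `twinl.search()`.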

## Pandas integration
Due to Twitter user policy restrictions the Twi-XL API returns only metadata of the tweets such as their IDs and timestamps. The search results can be converted to a Pandas data frame using the `to_pandas()` method:

In [None]:
search_results.to_pandas()

## Frequency plot
The `twinl.plotting` module contains some example functions to plot search results. In order to see how tweets are distributed over time we can plot the frequencies of the tweets using the `plot_tweet_frequencies()` function. 

In [None]:
twinl.plotting.plot_tweet_frequencies(search_results, title="Number of 'Elfstedentocht' tweets per day");
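
Conceptually, a per-day frequency plot counts how many tweet timestamps fall on each calendar day. The sketch below shows that aggregation with the standard library on a few made-up timestamps; it is not the implementation used by `plot_tweet_frequencies()`, and `tweets_per_day` is a hypothetical helper.

```python
from collections import Counter
from datetime import datetime


def tweets_per_day(timestamps):
    """Count how many timestamps fall on each calendar day."""
    return Counter(ts.date() for ts in timestamps)


# Made-up timestamps, for illustration only.
example_timestamps = [
    datetime(2017, 1, 4, 9, 30),
    datetime(2017, 1, 4, 18, 5),
    datetime(2017, 1, 5, 7, 45),
]
per_day = tweets_per_day(example_timestamps)
for day in sorted(per_day):
    print(day, per_day[day])
# 2017-01-04 2
# 2017-01-05 1
```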

## Retrieving tweets from Twitter
To access the full tweets we will need to download the tweets (or statuses, as Twitter calls them) directly from Twitter. For this you will need a Twitter account and developer credentials. Developer credentials can be created in the [apps page](https://developer.twitter.com/en/apps) of the Twitter developer section. 
There are different levels of access to the Twitter API. For this demo we assume you have 'Essential' access, which works with the `Twitter API v2`.
To get access, generate a bearer token through your app's Keys and Tokens tab under the [Twitter Developer Portal Projects & Apps page](https://developer.twitter.com/en/portal/projects-and-apps).

We will use the popular [Tweepy](https://www.tweepy.org/) library to download a small subset of the tweets in our search results:

In [None]:
import tweepy

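# Twitter API v2: Essential access level
client = tweepy.Client("<BEARER TOKEN>")

# Look up the first 25 tweet IDs from the search results
tweet_ids = [tweet_metadata.tweet_id for tweet_metadata in search_results.tweets[:25]]
statuses = client.get_tweets(tweet_ids)

# The response's `data` field holds the tweets that could be retrieved (None if there are none)
for status in statuses.data or []:
    print(status.text)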
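
The cell above requests only 25 tweets. The Twitter tweet lookup endpoint accepts at most 100 IDs per request, so larger sets of IDs need to be split into batches first. A minimal sketch of such a helper follows; the name `batches` is hypothetical.

```python
def batches(items, size=100):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Made-up IDs, for illustration only.
example_ids = list(range(1, 251))
print([len(batch) for batch in batches(example_ids)])  # [100, 100, 50]

# Possible usage with the client from this notebook (not executed here):
# for batch in batches(tweet_ids):
#     response = client.get_tweets(batch)
```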

## Retrieving tweets from Twitter API v1.1
Use the code below if you have 'Elevated' or 'Elevated+' access and want to use the `Twitter API v1.1`.
You will need a consumer API key and API secret key, as well as an access token and access token secret.

In [None]:
#Twitter API v1.1: Elevated, Academic Research
#import tweepy

# def get_tweepy_api() -> tweepy.API:
#     auth = tweepy.OAuthHandler(
#         <CONSUMER API KEY>,
#         <CONSUMER API SECRET KEY>
#     )
#     auth.set_access_token(
#         <ACCESS TOKEN>,
#         <ACCESS TOKEN SECRET>
#     )
#     return tweepy.API(auth)

# tweepy_api = get_tweepy_api()
# tweet_ids = [tweet_metadata.tweet_id for tweet_metadata in search_results.tweets[:25]]
# statuses = tweepy_api.lookup_statuses(tweet_ids)

Because some of the tweets might have been deleted or otherwise made inaccessible, not all tweets might be available for download. We asked for 25 tweets, and got fewer than 25 from Twitter:

In [None]:
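# With the v2 client the tweets are in `statuses.data`; with the v1.1 code above, use len(statuses)
len(statuses.data or [])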
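
To see which tweets could not be retrieved, the requested IDs can be compared with the IDs that came back. The sketch below uses made-up IDs and a hypothetical helper, `missing_ids`. With the Tweepy v2 client, the returned IDs could presumably be collected with something like `[tweet.id for tweet in statuses.data]`.

```python
def missing_ids(requested, returned):
    """Return the requested IDs that are absent from `returned`, in request order."""
    returned_set = set(returned)
    return [tweet_id for tweet_id in requested if tweet_id not in returned_set]


# Made-up IDs, for illustration only.
requested = [101, 102, 103, 104]
returned = [101, 104]
print(missing_ids(requested, returned))  # [102, 103]
```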

## Word frequencies
Apart from searching for tweets, we can also look up daily word frequencies using the `twinl.word_frequency()` function. For a word frequency search we have to specify a date and, optionally, the minimum word length (default is 1), the number of words returned (default is all) and the minimum occurrence rate of words (default is 1).

In the following cell, we retrieve word frequencies for January 4 2017 with a minimum word length of 5, a minimum occurrence rate (`frequency_limit`) of 100, and 50,000 maximum words returned:

In [None]:
word_frequencies = twinl.word_frequency(
    date=datetime(2017, 1, 4),
    min_length_words=5,
    max_results=50000,
    frequency_limit=100,
    callback=twinl.print_callback
)
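
To illustrate what these three parameters mean, here is a local sketch that applies the same kind of filtering to a small made-up word list. It only illustrates the semantics as described above and is not how the Twi-XL API computes frequencies; the function name `word_frequencies_sketch` is hypothetical.

```python
from collections import Counter


def word_frequencies_sketch(words, min_length=1, frequency_limit=1, max_results=None):
    """Count words, keeping those that are long enough and frequent enough.

    Returns (word, count) pairs, most frequent first, truncated to `max_results`.
    """
    counts = Counter(word for word in words if len(word) >= min_length)
    kept = [(word, n) for word, n in counts.most_common() if n >= frequency_limit]
    return kept[:max_results]


# Made-up words, for illustration only.
example_words = ["schaatsen"] * 3 + ["ijs"] * 5 + ["winter"] * 2 + ["elfstedentocht"]
print(word_frequencies_sketch(example_words, min_length=5, frequency_limit=2))
# [('schaatsen', 3), ('winter', 2)]
```

Here 'ijs' is dropped because it is shorter than five characters, and 'elfstedentocht' because it occurs only once.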

## Pandas integration
The word frequencies results consist of a list of words and their frequencies. The results can be converted to a Pandas data frame with the `to_pandas()` method:

In [None]:
word_frequencies.to_pandas()

## Word cloud
To visualize word frequencies we can create a word cloud. We will first filter the results with a list of known Dutch stopwords provided by the [Spacy](https://spacy.io/usage/spacy-101) natural language processing library, and plot a word cloud with the `plotting.plot_word_cloud()` function:

In [None]:
!python3 -m spacy download nl_core_news_sm  # this will download the Dutch language pipeline for Spacy

In [None]:
import spacy

nl = spacy.load("nl_core_news_sm")
stopwords = nl.Defaults.stop_words

twinl.plotting.plot_word_cloud(word_frequencies, stopwords=stopwords, max_words=100, min_word_length=4);
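
The stopword filtering itself is simple to picture: words in the stopword set are dropped before plotting. The sketch below shows the idea on a made-up frequency dictionary with a tiny hand-written stopword set; the case-insensitive comparison is an assumption and `plot_word_cloud()` may behave differently.

```python
def without_stopwords(frequencies, stopwords):
    """Return a copy of `frequencies` without entries whose word is a stopword.

    Assumption: the comparison ignores case.
    """
    lowered = {word.lower() for word in stopwords}
    return {word: n for word, n in frequencies.items() if word.lower() not in lowered}


# Made-up frequencies and a tiny stopword set, for illustration only.
example_frequencies = {"De": 40, "schaatsen": 12, "het": 35, "ijs": 9}
print(without_stopwords(example_frequencies, {"de", "het", "een"}))
# {'schaatsen': 12, 'ijs': 9}
```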

## Circular plot
To visualize the most-tweeted words per hour in a day we can use the `plotting.plot_circular_bars()` function:

In [None]:
twinl.plotting.plot_circular_bars(word_frequencies, stopwords=stopwords, group_size=2);