# **Evaluating Content**

Although the search process the team used to retrieve the Tweet ID list was rigorous, some materials in the list may be unrelated. Some posts may match a hashtag yet discuss subjects unrelated to the historical events of the #RickyRenuncia movement of summer 2019.



## Objectives

This notebook presents a minimal IPython graphical user interface (GUI) where participants, researchers and members of the original team can interact with content and classify it.

### Learning Goals
- Interact with Twitter embeddings.
- Update SQLite3 database.
- Classify content (minimum of 20 tweets).
- Visualize state of the database.

## Requirements

**Twitter API Credentials**

The user will need to have created the `twitter_secrets.py` file based on `twitter_secrets_example.py` and set the variables to their own API credentials. See the [Twitter API Credentials](./Developer_Registration.ipynb) section.

## Optional Requirements


**Google API Credentials**

A `google.oauth2.service_account.Credentials` object is required to interact with the Google Translate API and see automatic translations of text. This should help non-Spanish speakers interact with content in Spanish.

The user will need to have created or edited the `google_translate_keys.json` file following the [Google API Credentials](./Developer_Registration.ipynb#Google-API-Credentials) section. This is **optional**, but it offers automatic translation of tweet text to English (or another language).

## Import Libraries

Add Library justifications

In [2]:
import ipywidgets as widgets
from IPython.display import display, HTML, update_display
import json, os, pickle
from random import seed, randint
# from tweet_requester.analysis import TweetJLAnalyzer, TweetAnalyzer
from tweet_requester.display import TweetInteractiveClassifier, \
JsonLInteractiveClassifier, TSess, prepare_google_credentials, PROCESSING_STAGES
# from twitter_secrets import C_BEARER_TOKEN 
from twitter_secrets_testing import C_BEARER_TOKEN 
JL_DATA="./tweetsRickyRenuncia-final.jsonl"
CACHE_DIR="tweet_cache"
SQLite_DB="tweets.db"

## Download Data Set

A [dataset](https://ia601005.us.archive.org/31/items/tweetsRickyRenuncia-final/tweetsRickyRenuncia-final.txt) of the tweets related to the investigation is publicly available online. This list will be used as the basis for the research.

In [3]:
import requests
from os.path import isfile, isdir

tweet_list_url = "https://ia601005.us.archive.org/31/items/tweetsRickyRenuncia-final/tweetsRickyRenuncia-final.txt"

# Download dataset if not present
if not isfile(JL_DATA):
    response = requests.get(tweet_list_url)
    with open(JL_DATA, 'wb') as handler:
        handler.write(response.content)
    print(f"Data downloaded at {JL_DATA}.")
else:
    print(f"Data available at {JL_DATA}.")

Data available at ./tweetsRickyRenuncia-final.jsonl.


## Download Database and Cache

The code below uses a combination of Python logic and terminal commands to download the *compressed* **database** and **cache**, and then extract them only if needed. Terminal commands can be identified by the `!` symbol at the beginning of the line. The commands `wget`, `tar` and `gzip` are command-line programs run from a shell such as [Bash](https://en.wikipedia.org/wiki/Bash_(Unix_shell)), not Python functions.

A version of the database and cache is being shared publicly through the [#RickyRenuncia Project Scalar Book](https://libarchivist.com/rrp/rickyrenuncia/index). The database and cache should make the experience

In [5]:
from os.path import isfile, isdir

tmp_file = "tweet_cache.tar.gz"
tweets_db_url = "https://libarchivist.com/rrp/rickyrenuncia//tweets.db.gz"
tweet_cache_url = "https://libarchivist.com/rrp/rickyrenuncia//tweet_cache.tar.gz"

# Download database if not present
if not isfile("tweets.db.gz"):
    !wget "$tweets_db_url"
else:
    print("Database already available")

# Extract database if not present
if not isfile("tweets.db"):
    !gzip -kd "tweets.db.gz"
else:
    print("Database already extracted.")

# Download compressed cache if not present
if not isfile(tmp_file):
    !wget "$tweet_cache_url"
else:
    print("Compressed Cache already available!")

# Extract the cache if not present
if not isdir(CACHE_DIR):
    !tar -xf $tmp_file
    print("Cache extracted!")
else:
    print("Cache already extracted!")


Database already available
Database already extracted.
Compressed Cache already available!
Cache already extracted!
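On systems where `gzip` and `tar` are not available as shell commands, the extraction steps above can be approximated with the Python standard library. The sketch below is an illustration, not part of the original notebook: the helper names `gunzip_keep` and `untar` are hypothetical, and the file names are the ones used in the cell above. Only extract archives from sources you trust.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def gunzip_keep(src: str, dest: str) -> None:
    """Decompress a .gz file into dest and keep the original (like `gzip -kd`)."""
    with gzip.open(src, "rb") as compressed, open(dest, "wb") as out:
        shutil.copyfileobj(compressed, out)


def untar(src: str, target_dir: str = ".") -> None:
    """Extract a (possibly compressed) tar archive into target_dir (like `tar -xf`)."""
    with tarfile.open(src, "r:*") as archive:
        archive.extractall(target_dir)


# Mirror the "only if needed" checks from the notebook cell.
if Path("tweets.db.gz").is_file() and not Path("tweets.db").is_file():
    gunzip_keep("tweets.db.gz", "tweets.db")
if Path("tweet_cache.tar.gz").is_file() and not Path("tweet_cache").is_dir():
    untar("tweet_cache.tar.gz")
```

The downloads themselves could follow the same pattern as the `requests.get` call used for the dataset earlier in this notebook.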


## Create a TSess
The `TSess` object stores configuration and controls the connection used to retrieve content from the Twitter API. It is this object that requires your Twitter credentials to create a connection.

**Twitter API Credentials** are required to create the session.

In [4]:
tweet_session = TSess(
        C_BEARER_TOKEN, 
        compression_level=5, 
        sleep_time=3, # Minimal sleep between requests to avoid hitting rate limits
        cache_dir=CACHE_DIR, 
        hash_split=True
    )

The session also includes rate limiting for requests. With bearer token (app) authentication the limit is 300 tweet lookups every 15 minutes (900 seconds), in other words one lookup every 3 seconds. Read more at "[Rate limits | Docs | Twitter Developer Platform](https://developer.twitter.com/en/docs/twitter-api/rate-limits)".
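The arithmetic behind the `sleep_time=3` setting can be checked directly. The sketch below also shows one simple way a client might space out its requests; the `Throttle` class is a hypothetical helper written for illustration and is not how `TSess` is necessarily implemented.

```python
import time

WINDOW_SECONDS = 15 * 60  # length of the rate-limit window
MAX_LOOKUPS = 300         # lookups allowed per window, as stated above

# Minimum spacing between requests that never exceeds the limit.
min_interval = WINDOW_SECONDS / MAX_LOOKUPS
print(min_interval)  # 3.0


class Throttle:
    """Hypothetical helper: keep at least `interval` seconds between calls."""

    def __init__(self, interval: float):
        self.interval = interval
        self._last = None  # monotonic timestamp of the previous call

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return the seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = time.monotonic() - self._last
            delay = max(0.0, self.interval - elapsed)
        if delay:
            time.sleep(delay)
        self._last = time.monotonic()
        return delay
```

Calling `Throttle(min_interval).wait()` before each request would keep a loop at or below 300 lookups per window.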

## Create Google Translate Credentials

After following the **optional** instructions from [Google API Credentials](./Developer_Registration.ipynb#Google-API-Credentials), run the code below. If the user did not acquire any credentials, the code will default to no credentials.

In [5]:
google_credentials = prepare_google_credentials(
    credentials_file="./google_translate_keys_testing.json"
)

## Create a JsonLInteractiveClassifier

A JsonLInteractiveClassifier object handles interactions with a local SQLite database, Twitter API (importing a TSess) and a GUI for early data curation. 

In terms of the data integration process, capturing the data from the API falls under **Extract**; the GUI for additional metadata and the SQLite database fall under **Transform**, as information is being stored in a different format for easier analysis. Any methods used to visualize the data or access attributes with less effort fall under **Load**.

In that sense the `JsonLInteractiveClassifier` handles multiple stages of the **ETL** (Extract, Transform, Load) process.

<div class="alert alert-info">
If you downloaded the database and cache in the previous steps, continue with the next code cell. The parameter `pre_initialized` is set to `True` to continue working with the previous version of the database.
</div>

In [6]:
classifier = JsonLInteractiveClassifier(
    tweet_ids_file=JL_DATA, 
    session=tweet_session, mute=True, 
    google_credentials=google_credentials,
    sqlite_db=SQLite_DB, pre_initialized=True)

# classifier = JsonLInteractiveClassifier(
#     tweet_ids_file=JL_DATA, 
#     session=tweet_session, mute=True, 
#     google_credentials=google_credentials,
#     sqlite_db=SQLite_DB, pre_initialized=False)

# **JsonLInteractiveClassifier**

## Where are results and details stored?

The `JsonLInteractiveClassifier` object includes a builtin SQLite3 database connection in the attribute `db`. It is possible to directly access the database after a connection is made.

The code below displays the tables and columns of the relational database.

In [None]:
# Connect to the database
classifier.connect()

# Extract database object and create a cursor
database = classifier.db
cursor = database.cursor()

# SQL command: list the tables, then read the column names of each table
cursor.execute("""SELECT name FROM sqlite_master WHERE type='table';""")
tweet_schema = cursor.fetchall()
for table in tweet_schema:
    print("Table:",table[0])
    # Get Column Names
    cursor.execute(f"""SELECT * FROM {table[0]} LIMIT 1""")
    column_names = list(map(lambda x: x[0],cursor.description))
    print("Columns:", column_names, "\n\n")
    
# Close the cursor
cursor.close()

The database is composed of 3 tables: `tweet`, `tweet_details`, and `tweet_traduction`.

1. Table `tweet` includes only 2 columns: `tweet_id`, a string holding the unique tweet ID, and `state`, an integer representing the processing stage of the tweet.

2. Table `tweet_details` includes 6 columns. Below you can see the SQL command used to create the table.
```
CREATE TABLE tweet_detail (
            tweet_id TEXT,
            has_media INTEGER,
            description TEXT,
            is_meme INTEGER,
            language TEXT,
            has_slang INTEGER,
            PRIMARY KEY("tweet_id"))
```

3. Table `tweet_traduction` includes 3 columns: `tweet_id`, `target_language_code` and `traduction`. This table works as a cache that stores Google translations of a tweet's text in one or more languages.
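As an alternative to reading `cursor.description`, SQLite can report a table's columns through `PRAGMA table_info`, which does not read any rows. The sketch below runs against a toy in-memory database rather than `tweets.db`; the column names come from the description above, while the column types, the composite primary key and the `list_schema` helper are assumptions made for illustration.

```python
import sqlite3

# Toy in-memory database; column types and keys are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tweet (tweet_id TEXT PRIMARY KEY, state INTEGER);
CREATE TABLE tweet_traduction (
    tweet_id TEXT,
    target_language_code TEXT,
    traduction TEXT,
    PRIMARY KEY (tweet_id, target_language_code)
);
""")


def list_schema(connection):
    """Return {table name: [column names]} from SQLite's catalog."""
    tables = connection.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
    ).fetchall()
    schema = {}
    for (name,) in tables:
        # Each PRAGMA row is (cid, name, type, notnull, dflt_value, pk).
        columns = connection.execute(f'PRAGMA table_info("{name}")').fetchall()
        schema[name] = [column[1] for column in columns]
    return schema


schema = list_schema(conn)
for table, columns in schema.items():
    print(table, columns)
```

With the real database, passing `classifier.db` after `classifier.connect()` should work the same way, assuming `db` is a standard `sqlite3` connection as the text above indicates.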

## **Generate a Tweet State Report**

It is possible to interact directly with the built-in database by accessing the SQLite3 database connection. Below, the user queries the `tweet` table to get totals for each processing stage using a `GROUP BY` SQL command.

In [None]:
# Connect to the database
classifier.connect()

# Extract database object and create a cursor
database = classifier.db
cursor = database.cursor()

# Execute an sql command
cursor.execute("""
SELECT state, count(*) FROM tweet GROUP BY state;
""")
rows = cursor.fetchall()
cursor.close()
print("{:>20} : {:<12}".format("STATE", "TOTAL"))
for state, count in rows:
    state_name = PROCESSING_STAGES(state).name
    print("{:>20} : {:<12}".format(state_name, count))
    

Although it would be possible to store the state of the tweets as text fields in the database, that would increase storage and reduce efficiency. For that reason our team used an enumeration class called `PROCESSING_STAGES` to identify each stage with an integer value and a name.

The code below prints a table of the values and names used to represent each stage. In the database they are stored as numeric values.

In [None]:
print("{:>5} | {:20} | {:}".format("VALUE", "NAME", "OBJECT"))
for stage in PROCESSING_STAGES:
    print("{:5} | {:20} | {:}".format(stage.value, stage.name, stage))
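To see how an enumeration maps between stored integers and readable names, here is a minimal sketch using Python's `enum.IntEnum`. The class `ToyStage`, its member names and its values are hypothetical and do not reproduce the real `PROCESSING_STAGES`; whether the real class derives from `IntEnum` or plain `Enum` is also an assumption.

```python
from enum import IntEnum


class ToyStage(IntEnum):
    """Hypothetical stages; NOT the real PROCESSING_STAGES names or values."""
    UNPROCESSED = 0
    PREPROCESSED = 1
    FINALIZED = 2


# Integer read from the database -> enum member -> readable name
stored_value = 1
stage = ToyStage(stored_value)
print(stage.name)  # PREPROCESSED

# Enum member -> integer suitable for an INTEGER column
print(int(ToyStage.FINALIZED))  # 2

# Integers that match no member raise ValueError, which surfaces bad data early
try:
    ToyStage(99)
except ValueError as error:
    print("invalid stage:", error)
```

The database stores only the small integer, while the code keeps working with descriptive names.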

## **View some of the accepted tweets**

The `display_accepted` method displays accepted tweets in pages. The number of tweets per page is set with the `per_page` parameter, and a page is selected with the `page` parameter.

Try it! Change the page number. We recommend pages 5, 12 and 22.

In [None]:
# Change the page number to see different tweets from the database
PAGE=3
classifier.display_accepted(page=PAGE, per_page=2)
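Paging over a database table is commonly done with `LIMIT` and `OFFSET`. The sketch below shows that general technique on toy data; the `fetch_page` function is hypothetical, pages are 1-indexed here by choice, and whether `display_accepted` works this way internally is an assumption.

```python
import sqlite3


def fetch_page(connection, page: int, per_page: int):
    """Return one page of tweet IDs; pages are 1-indexed in this sketch."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be positive")
    offset = (page - 1) * per_page  # rows to skip before this page starts
    rows = connection.execute(
        "SELECT tweet_id FROM tweet ORDER BY tweet_id LIMIT ? OFFSET ?",
        (per_page, offset),
    ).fetchall()
    return [tweet_id for (tweet_id,) in rows]


# Toy data: ten fake IDs in an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweet (tweet_id TEXT PRIMARY KEY, state INTEGER)")
conn.executemany(
    "INSERT INTO tweet VALUES (?, ?)", [(f"id{n:02d}", 0) for n in range(10)]
)

print(fetch_page(conn, page=3, per_page=2))  # ['id04', 'id05']
```

A page past the end of the data simply returns an empty list.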

## **Evaluate Some Tweets**

Now it is your turn!

Using the method `StartEvaluations` you can process some of the unprocessed tweets. This should display a tweet embedding on screen and ask some questions whose answers are stored as metadata about the tweet.

By default `StartEvaluations` only processes tweets in the `PREPROCESSED` stage. Using the `preprocess_batch` method, some tweets can be preprocessed before evaluation begins.

In [7]:
classifier.preprocess_batch(n=20)

In [None]:
classifier.StartEvaluations()

## **Verify if the totals have changed.**

In [None]:
# Connect to the database
classifier.connect()

# Extract database object and create a cursor
database = classifier.db
cursor = database.cursor()

# Execute an sql command
cursor.execute("""
SELECT state, count(*) FROM tweet GROUP BY state;
""")
rows = cursor.fetchall()
cursor.close()
print("{:>20} : {:<12}".format("STATE", "TOTAL"))
for state, count in rows:
    state_name = PROCESSING_STAGES(state).name
    print("{:>20} : {:<12}".format(state_name, count))


How many tweets are missing?... **TODO**

## **Exercise 1**

Using the example above, run 2 different queries to find:
1. the number of tweets that have slang among the processed tweets.
2. the number of tweets that have multimedia among the processed tweets.

Observe the column names of the tables `tweet` and `tweet_details`. The code in the section "Where are results and details stored?" iterates through all tables and shows the column names.

**Consider**

- What two tables will you need to join together to get both the tweet state and the details about slang and media?
- What table has the tweet `state`? 
- What table has the media and slang information?
- Visit [SQL Inner Join Tutorial](https://www.sqlitetutorial.net/sqlite-inner-join/) to get a better notion about what an "Inner Join" is.

The column `state` is an integer and matches the `PROCESSING_STAGES` values seen before.
The columns `has_media` and `has_slang` are integers but can be treated as boolean.
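Before working on the real tables, it may help to see an inner join with a grouped count on toy data. Everything in this sketch is illustrative: the table `toy_detail`, the sample rows and the state value `2` are hypothetical and are not taken from `tweets.db`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tweet (tweet_id TEXT PRIMARY KEY, state INTEGER);
CREATE TABLE toy_detail (tweet_id TEXT PRIMARY KEY, has_slang INTEGER);
INSERT INTO tweet VALUES ('a', 2), ('b', 2), ('c', 2), ('d', 0);
INSERT INTO toy_detail VALUES ('a', 1), ('b', 0), ('c', 1), ('d', 1);
""")

FINALIZED = 2  # hypothetical state value, valid for this toy data only

rows = conn.execute(
    """
    SELECT d.has_slang, count(*)
    FROM tweet t
    INNER JOIN toy_detail d ON t.tweet_id = d.tweet_id
    WHERE t.state = ?
    GROUP BY d.has_slang
    ORDER BY d.has_slang
    """,
    (FINALIZED,),
).fetchall()

# SQLite stores booleans as 0/1, so bool() converts them for display
for has_slang, total in rows:
    print(bool(has_slang), total)
# False 1
# True 2
```

Tweet `d` is excluded by the `WHERE` clause because its state differs, even though it has a matching detail row.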

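Replace every `<VALUE-DESCRIPTION>` placeholder to make the commands work:
- \<ENTER-GROUPING-COLUMN\>
- \<TABLE-NAME\>
- \<DETAIL-TABLE-NAME\>
- \<STATE-VALUE-FOR-FINALIZED\>
- \<STATE-VALUE-FOR-PREPROCESSED\>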


In [None]:
# Connect to the database
classifier.connect()

# Extract database object and create a cursor
database = classifier.db
cursor = database.cursor()

In [None]:
# SQL Command 1
# Total of tweets with slang among the processed.
# cursor.execute("""
# SELECT <ENTER-GROUPING-COLUMN>, count(*) 
# FROM <TABLE-NAME> a
# INNER JOIN <DETAIL-TABLE-NAME> b
# ON a.tweet_id=b.tweet_id
# WHERE a.state=<STATE-VALUE-FOR-FINALIZED> 
# GROUP BY <ENTER-GROUPING-COLUMN>;
# """)
cursor.execute("""
SELECT has_slang, count(*) 
FROM tweet a
INNER JOIN tweet_user_detail b
ON a.tweet_id=b.tweet_id
WHERE a.state in (2,6)
GROUP BY has_slang;
""")

slang_rows = cursor.fetchall()

# Print Results to screen
print("{:>15} : {:<12}".format("Slang?", "TOTAL"))
for slang, count in slang_rows:
    if slang:
        state="Has Slang"
    else:
        state="No Slang"
    print("{:>15} : {:<12}".format(state, count))
    

In [None]:
# SQL Command 2
# Total of tweets with multimedia among the processed and preprocessed.
# cursor.execute("""
# SELECT
#     <ENTER-GROUPING-COLUMN>,
#     count(*) 
# FROM
#     <TABLE-NAME> a
# INNER JOIN
#     <DETAIL-TABLE-NAME> b
# ON
#     a.tweet_id=b.tweet_id
# WHERE
#     a.state in (<STATE-VALUE-FOR-FINALIZED>, <STATE-VALUE-FOR-PREPROCESSED>)
# GROUP BY
#     <ENTER-GROUPING-COLUMN>
# ORDER BY
#     <ENTER-GROUPING-COLUMN> DESC;
# """)
cursor.execute("""
SELECT
    has_media,
    count(*) 
FROM
    tweet a
INNER JOIN
    tweet_auto_detail b
    ON
        a.tweet_id=b.tweet_id
WHERE
    a.state in (2,6) 
GROUP BY
    has_media
ORDER BY
    has_media DESC;
""")

multimedia_rows = cursor.fetchall()

# Print Results to screen
print("{:>12} : {:<12}".format("HAS MEDIA?", "TOTAL"))
for has_media, count in multimedia_rows:
    if has_media:
        state="Yes"
    else:
        state="No"
    print("{:>12} : {:<12}".format(state, count))
    


## **Exercise 2**

Inspired by the commands above, write a command that lets you visualize the languages of the tweets. 

To complete this exercise you will need to answer the next questions:
- What **table** has language information?
- What **column** holds the language information?

Remember to output the results.

In [None]:
# SQL Command 3
# Total of tweets by language
cursor.execute("""
SELECT <ENTER-GROUPING-COLUMN>, count(*) 
FROM  <DETAIL-TABLE-NAME>
GROUP BY <ENTER-GROUPING-COLUMN>;
""")

language_rows = cursor.fetchall()

# Print Results to screen


## Where is information stored?

In [None]:
print(type(classifier))

The `classifier` is of type `JsonLInteractiveClassifier`. This class has two attributes that record where information is stored: `original_filename`, the file holding the list of Tweet IDs, and `sqlite_filename`, the file holding the SQLite database.

In [None]:
print(classifier.original_filename)
print(classifier.sqlite_filename)

In [None]:
# Close the cursor
cursor.close()

## What's next?

Proceed to [Media Rating](./4-Media_Rating.ipynb) to continue with the experience and learn how to use `pandas` to interact with an SQLite database.