# **Evaluating Content**

Although the search process the team used to retrieve the Tweet ID list was rigorous, some items in the list may be unrelated. A post can match a hashtag yet discuss subjects unrelated to the historical events of the #RickyRenuncia movement of summer 2019.



## Objectives

This notebook presents a minimal IPython graphical user interface (GUI) where participants, researchers and members of the original team could interact with content and classify it.

### Learning Goals
- Interact with Twitter embeddings.
- Update SQLite3 database.
- Classify content (minimum of 20 tweets).
- Visualize state of the database.

## Requirements

**Twitter API Credentials**

The user will need to have created the `twitter_secrets.py` file based on `twitter_secrets_example.py` and set the variables to match their own API credentials. See the [Twitter API Credentials](./Developer_Registration.ipynb) section.

## Optional Requirements

**OPTIONAL**

**Google API Credentials**

A `google.oauth2.service_account.Credentials` object is required to interact with the Google Translate API and see automatic translations of text. This should help non-Spanish speakers interact with content in Spanish.

The user will need to have created or edited the `google_translate_keys.json` file following the [Google API Credentials](./Developer_Registration.ipynb#Google-API-Credentials) section. This is **optional**, but it gives the user automatic translation of tweet text into English (or another language).


## Import Libraries

Add Library justifications

In [10]:
import ipywidgets as widgets
from IPython.display import display, HTML, update_display
import json, os, pickle
from random import seed, randint
from tweet_requester.analysis import TweetJLAnalyzer, TweetAnalyzer
from tweet_requester.display import TweetInteractiveClassifier, \
JsonLInteractiveClassifier, TSess, prepare_google_credentials, PROCESSING_STAGES
from twitter_secrets import C_BEARER_TOKEN 
JL_DATA="./tweetsRickyRenuncia-final.jsonl"
BASE_DIR="./Evaluating Content"

## Create a TSess
The `TSess` object stores configuration and controls the connection used to retrieve content from the Twitter API. It is this object that requires your Twitter credentials to create a connection.

**Twitter API Credentials** are required to create the session.

In [11]:
tweet_session = TSess(
        C_BEARER_TOKEN, 
        compression_level=5, 
        sleep_time=3, # Minimal sleep between requests to avoid hitting rate limits
        cache_dir="./.tweet_cache_split/", 
        hash_split=True
    )

The session also rate-limits its requests. With bearer token (app) authentication the limit is 300 tweet lookups every 15 minutes (900 seconds), or one lookup every 3 seconds on average. Read more at "[Rate limits | Docs | Twitter Developer Platform](https://developer.twitter.com/en/docs/twitter-api/rate-limits)".
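
The arithmetic behind the 3-second figure can be checked with a short sketch. The helper name `min_sleep_seconds` is hypothetical and is not part of `tweet_requester`; it only divides the window length by the number of allowed requests.

```python
def min_sleep_seconds(max_requests: int, window_seconds: int) -> float:
    """Return the smallest pause between requests that stays within a rate limit.

    Hypothetical helper for illustration; not part of tweet_requester.
    """
    if max_requests <= 0:
        raise ValueError("max_requests must be positive")
    return window_seconds / max_requests


# 300 lookups per 15-minute window (900 seconds)
pause = min_sleep_seconds(max_requests=300, window_seconds=900)
print(pause)  # 3.0
```

This is the value passed as `sleep_time` when the session was created.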

## Create Google Translate Credentials

After following the **optional** instructions from [Google API Credentials](./Developer_Registration.ipynb#Google-API-Credentials), run the code below. If the user did not acquire any credentials, the code will default to no credentials.

In [12]:
google_credentials = prepare_google_credentials(
    credentials_file="./google_translate_keys.json"
)
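
The fallback to "no credentials" can be pictured with the sketch below. It is not the implementation of `prepare_google_credentials`, which may differ; `load_credentials_or_none` is a hypothetical name, and the only real API used is `Credentials.from_service_account_file` from the `google-auth` package.

```python
import os


def load_credentials_or_none(credentials_file: str):
    """Return service-account credentials, or None when the key file is absent.

    Hypothetical helper for illustration; the real logic lives in
    tweet_requester's prepare_google_credentials and may differ.
    """
    if not os.path.isfile(credentials_file):
        return None  # no key file: translation is simply unavailable
    # Imported lazily so the sketch runs without google-auth installed
    # when no credentials file is present.
    from google.oauth2 import service_account

    return service_account.Credentials.from_service_account_file(credentials_file)


credentials = load_credentials_or_none("./this_file_does_not_exist.json")
print(credentials is None)  # True
```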

## Create a JsonLInteractiveClassifier

A JsonLInteractiveClassifier object handles interactions with a local SQLite database, Twitter API (importing a TSess) and a GUI for interaction. EXPAND ON THE OBJECT AND HOW TO USE IT

In [31]:
classifier = JsonLInteractiveClassifier(
    tweet_ids_file="tweetsRickyRenuncia-final.txt", 
    session=tweet_session, mute=True, 
    google_credentials=google_credentials)

# **JsonLInteractiveClassifier**

## Where are results and details stored?

The `JsonLInteractiveClassifier` object includes a built-in SQLite3 database connection in the attribute `db`. It is possible to access the database directly after a connection is made.

The code below displays the tables and columns of the relational database.

In [34]:
# Connect to the database
classifier.connect()

# Extract database object and create a cursor
database = classifier.db
cursor = database.cursor()

# SQL command: list the tables, then read the column names of each one
cursor.execute("""SELECT name FROM sqlite_master WHERE type='table';""")
tweet_schema = cursor.fetchall()
for table in tweet_schema:
    print("Table:",table[0])
    # Get Column Names
    cursor.execute(f"""SELECT * FROM {table[0]} LIMIT 1""")
    column_names = list(map(lambda x: x[0],cursor.description))
    print("Columns:", column_names, "\n\n")
    
# Close the cursor
cursor.close()

Table: tweet
Columns: ['tweet_id', 'state'] 


Table: tweet_detail
Columns: ['tweet_id', 'has_media', 'description', 'is_meme', 'language', 'has_slang'] 


Table: tweet_traduction
Columns: ['tweet_id', 'target_language_code', 'traduction'] 


Table: tweet_user_detail
Columns: ['tweet_id', 'description', 'is_meme', 'has_slang'] 


Table: tweet_auto_detail
Columns: ['tweet_id', 'isBasedOn', 'identifier', 'url', 'dateCreated', 'datePublished', 'user_id', 'has_media', 'language', 'retweetCount', 'quoteCount', 'text'] 


Table: tweet_user
Columns: ['user_id', 'user_url', 'screen_name'] 


Table: tweet_match_media
Columns: ['tweet_id', 'media_id'] 


Table: tweet_media
Columns: ['media_id', 'media_url', 'type'] 


Table: db_update
Columns: ['version', 'git_commit', 'timestamp'] 




The tables described here are `tweet`, `tweet_detail`, and `tweet_traduction`, three of the nine listed above.

1. Table `tweet` includes only 2 columns: `tweet_id`, a string holding the unique tweet ID, and `state`, an integer representing the processing stage of the tweet.

2. Table `tweet_detail` includes 6 columns. Below you can see the SQL command used to create the table.
```
CREATE TABLE tweet_detail (
            tweet_id TEXT,
            has_media INTEGER,
            description TEXT,
            is_meme INTEGER,
            language TEXT,
            has_slang INTEGER,
            PRIMARY KEY("tweet_id"))
```

3. Table `tweet_traduction` includes 3 columns: `tweet_id`, `target_language_code` and `traduction`. This table works as a cache that stores Google translations of a tweet's text in one or more languages.
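
A minimal sketch of how such a translation cache could be queried, using an in-memory database. The column names come from the table listing above, but the column types, the composite primary key, and the helper names `cached_translation` and `store_translation` are assumptions made for illustration; the library's own code may work differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: column names come from the listing above, but the types and
# the composite primary key are guesses made for this sketch.
conn.execute("""
CREATE TABLE tweet_traduction (
    tweet_id TEXT,
    target_language_code TEXT,
    traduction TEXT,
    PRIMARY KEY (tweet_id, target_language_code))
""")


def cached_translation(conn, tweet_id, language_code):
    """Return the stored translation, or None if it has not been cached yet."""
    row = conn.execute(
        "SELECT traduction FROM tweet_traduction "
        "WHERE tweet_id = ? AND target_language_code = ?",
        (tweet_id, language_code),
    ).fetchone()
    return row[0] if row else None


def store_translation(conn, tweet_id, language_code, text):
    """Insert or overwrite the cached translation for one tweet and language."""
    conn.execute(
        "INSERT OR REPLACE INTO tweet_traduction VALUES (?, ?, ?)",
        (tweet_id, language_code, text),
    )


# Made-up tweet ID and text, for demonstration only
print(cached_translation(conn, "123", "en"))  # None (cache miss)
store_translation(conn, "123", "en", "Example translated text")
print(cached_translation(conn, "123", "en"))  # Example translated text
```

Keying the cache on both the tweet ID and the language code lets one tweet hold translations into several languages.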

## **Generate a Tweet State Report**

It is possible to interact directly with the built-in database by accessing the SQLite3 database connection. Below, the user queries the `tweet` table to get totals for each processing stage using a 'Group By' SQL command.

In [14]:
# Connect to the database
classifier.connect()

# Extract database object and create a cursor
database = classifier.db
cursor = database.cursor()

# Execute an sql command
cursor.execute("""
SELECT state, count(*) FROM tweet GROUP BY state;
""")
rows = cursor.fetchall()
cursor.close()
print("{:>20} : {:<12}".format("STATE", "TOTAL"))
for state, count in rows:
    state_name = PROCESSING_STAGES(state).name
    print("{:>20} : {:<12}".format(state_name, count))
    

               STATE : TOTAL       
         UNPROCESSED : 493703      
           REVIEWING : 348         
           FINALIZED : 67          
UNAVAILABLE_EMBEDING : 1179        
             RETWEET : 2528        
        PREPROCESSED : 2476        
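
The same 'Group By' pattern can be tried without the classifier. The sketch below uses an in-memory database with made-up rows; only the table and column names (`tweet`, `tweet_id`, `state`) are taken from the notebook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweet (tweet_id TEXT PRIMARY KEY, state INTEGER)")
# Made-up rows: three tweets in state 0 and one in state 2
conn.executemany(
    "INSERT INTO tweet VALUES (?, ?)",
    [("a", 0), ("b", 0), ("c", 2), ("d", 0)],
)

# One output row per distinct state, with the number of tweets in it
rows = conn.execute(
    "SELECT state, count(*) FROM tweet GROUP BY state ORDER BY state;"
).fetchall()
print(rows)  # [(0, 3), (2, 1)]
conn.close()
```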


Although it would be possible to store the state of the tweets as text fields in the database, that would increase storage and reduce efficiency. For that reason our team used an enumeration class called `PROCESSING_STAGES` to identify each stage with an integer value and a name.
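
A minimal sketch of the idea, using a hypothetical `Stage` enumeration rather than the library's `PROCESSING_STAGES` class. It shows how an integer read from the database maps back to a readable name, and how a name maps to the integer that gets stored.

```python
from enum import Enum


class Stage(Enum):
    """Hypothetical stand-in for PROCESSING_STAGES with three of its members."""
    UNPROCESSED = 0
    REVIEWING = 1
    FINALIZED = 2


stored_value = 2                 # what the database column would hold
stage = Stage(stored_value)      # integer -> enum member
print(stage.name)                # FINALIZED
print(Stage["REVIEWING"].value)  # 1
```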

The code below prints a table of the values and names used to represent each stage. In the database they are stored as numeric values.

In [15]:
print("{:>5} | {:20} | {:}".format("VALUE", "NAME", "OBJECT"))
for stage in PROCESSING_STAGES:
    print("{:5} | {:20} | {:}".format(stage.value, stage.name, stage))

VALUE | NAME                 | OBJECT
    0 | UNPROCESSED          | PROCESSING_STAGES.UNPROCESSED
    1 | REVIEWING            | PROCESSING_STAGES.REVIEWING
    2 | FINALIZED            | PROCESSING_STAGES.FINALIZED
    3 | REJECTED             | PROCESSING_STAGES.REJECTED
    4 | UNAVAILABLE_EMBEDING | PROCESSING_STAGES.UNAVAILABLE_EMBEDING
    5 | RETWEET              | PROCESSING_STAGES.RETWEET
    6 | PREPROCESSED         | PROCESSING_STAGES.PREPROCESSED


## **View some of the accepted tweets**

The `display_accepted` method will display accepted tweets in pages. Pages can have arbitrary amounts of tweets and pages can be selected by changing the `page` parameter.

Try it! Change the page number. We recommend pages 5, 12 and 22.

In [7]:
# Change the page number to see different tweets from the database
PAGE=5
classifier.display_accepted(page=PAGE)
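
To picture what selecting a page means, here is a small sketch of paging over a list. It is not how `display_accepted` is implemented: the helper `page_slice`, the page size of 5, and the 1-based page numbering are all assumptions made for illustration.

```python
def page_slice(items, page, per_page=5):
    """Return the items on a given page, counting pages from 1.

    Hypothetical helper: the page size and 1-based numbering are assumptions,
    not the behaviour of display_accepted.
    """
    if page < 1:
        raise ValueError("page must be 1 or greater")
    start = (page - 1) * per_page
    return items[start:start + per_page]


tweet_ids = [f"id{n}" for n in range(1, 13)]  # 12 made-up IDs
print(page_slice(tweet_ids, page=1))  # ['id1', 'id2', 'id3', 'id4', 'id5']
print(page_slice(tweet_ids, page=3))  # ['id11', 'id12']
```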

## **Evaluate Some Tweets**

Now it is your turn!

Using the method `StartEvaluations` you can process some of the unprocessed tweets. This should display a tweet embedding on screen and ask some questions.

In [10]:
classifier.StartEvaluations()

## **Verify if the totals have changed.**

In [32]:
# Connect to the database
classifier.connect()

# Extract database object and create a cursor
database = classifier.db
cursor = database.cursor()

# Execute an sql command
cursor.execute("""
SELECT state, count(*) FROM tweet GROUP BY state;
""")
rows = cursor.fetchall()
cursor.close()
print("{:>20} : {:<12}".format("STATE", "TOTAL"))
for state, count in rows:
    state_name = PROCESSING_STAGES(state).name
    print("{:>20} : {:<12}".format(state_name, count))


               STATE : TOTAL       
         UNPROCESSED : 493703      
           REVIEWING : 348         
           FINALIZED : 67          
UNAVAILABLE_EMBEDING : 1179        
             RETWEET : 2528        
        PREPROCESSED : 2476        


How many tweets are missing?... **TODO**

## **Exercise 1**

Using the example above, run 2 different queries to visualize:
1. the number of tweets that have slang among the processed tweets.
2. the number of tweets that have multimedia among the processed tweets.

Observe the column names of the tables `tweet` and `tweet_detail`. The cell under "Where are results and details stored?" iterates through all tables and shows their column names.

**Consider**

- What two tables will you need to join together to get both the tweet state and the details about slang and media?
- What table has the tweet `state`? 
- What table has the media and slang information?
- Visit [SQL Inner Join Tutorial](https://www.sqlitetutorial.net/sqlite-inner-join/) to get a better notion about what an "Inner Join" is.

The column `state` is an integer and matches the `PROCESSING_STAGES` values seen before.
The columns `has_media` and `has_slang` are integers but can be treated as booleans.
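
As a warm-up, the sketch below runs an inner join with a filter and a grouping on toy data. The tables `post` and `post_info`, their columns, and their rows are hypothetical and deliberately different from the notebook's database, so the exercise itself is left open.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables and rows, unrelated to the notebook's database
conn.execute("CREATE TABLE post (post_id TEXT PRIMARY KEY, status INTEGER)")
conn.execute("CREATE TABLE post_info (post_id TEXT PRIMARY KEY, has_image INTEGER)")
conn.executemany("INSERT INTO post VALUES (?, ?)",
                 [("a", 1), ("b", 1), ("c", 0), ("d", 1)])
conn.executemany("INSERT INTO post_info VALUES (?, ?)",
                 [("a", 1), ("b", 0), ("c", 1)])  # "d" has no info row

# Keep only posts with status 1 that also have a matching info row,
# then count them per value of the integer flag.
rows = conn.execute("""
SELECT i.has_image, count(*)
FROM post p
INNER JOIN post_info i ON p.post_id = i.post_id
WHERE p.status = 1
GROUP BY i.has_image
ORDER BY i.has_image DESC;
""").fetchall()
print(rows)  # [(1, 1), (0, 1)]
conn.close()
```

Post "d" has status 1 but no row in `post_info`, so the inner join drops it from the counts.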

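Replace each `<VALUE-DESCRIPTION>` placeholder to make the commands work:
- \<ENTER-GROUPING-COLUMN\>
- \<TABLE-NAME\>
- \<DETAIL-TABLE-NAME\>
- \<STATE-VALUE-FOR-FINALIZED\>
- \<STATE-VALUE-FOR-PREPROCESSED\>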


In [24]:
# Connect to the database
classifier.connect()

# Extract database object and create a cursor
database = classifier.db
cursor = database.cursor()

In [25]:
# SQL Command 1
# Total of tweets with slang among the processed.
# cursor.execute("""
# SELECT <ENTER-GROUPING-COLUMN>, count(*) 
# FROM <TABLE-NAME> a
# INNER JOIN <DETAIL-TABLE-NAME> b
# ON a.tweet_id=b.tweet_id
# WHERE a.state=<STATE-VALUE-FOR-FINALIZED> 
# GROUP BY <ENTER-GROUPING-COLUMN>;
# """)
cursor.execute("""
SELECT has_slang, count(*) 
FROM tweet a
INNER JOIN tweet_user_detail b
ON a.tweet_id=b.tweet_id
WHERE a.state in (2,6)
GROUP BY has_slang;
""")

slang_rows = cursor.fetchall()

# Print Results to screen
print("{:>15} : {:<12}".format("Slang?", "TOTAL"))
for slang, count in slang_rows:
    if slang:
        state="Has Slang"
    else:
        state="No Slang"
    print("{:>15} : {:<12}".format(state, count))
    

         Slang? : TOTAL       
       No Slang : 60          
      Has Slang : 7           


In [30]:
# SQL Command 2
# Total of tweets with multimedia among the processed and preprocessed.
# cursor.execute("""
# SELECT
#     <ENTER-GROUPING-COLUMN>,
#     count(*) 
# FROM
#     <TABLE-NAME> a
# INNER JOIN
#     <DETAIL-TABLE-NAME> b
# ON
#     a.tweet_id=b.tweet_id
# WHERE
#     a.state in (<STATE-VALUE-FOR-FINALIZED>, <STATE-VALUE-FOR-PREPROCESSED>)
# GROUP BY
#     <ENTER-GROUPING-COLUMN>
# ORDER BY
#     <ENTER-GROUPING-COLUMN> DESC;
# """)
cursor.execute("""
SELECT
    has_media,
    count(*) 
FROM
    tweet a
INNER JOIN
    tweet_auto_detail b
    ON
        a.tweet_id=b.tweet_id
WHERE
    a.state in (2,6) 
GROUP BY
    has_media
ORDER BY
    has_media DESC;
""")

multimedia_rows = cursor.fetchall()

# Print Results to screen
print("{:>12} : {:<12}".format("HAS MEDIA?", "TOTAL"))
for has_media, count in multimedia_rows:
    if has_media:
        state="Yes"
    else:
        state="No"
    print("{:>12} : {:<12}".format(state, count))
    


  HAS MEDIA? : TOTAL       
         Yes : 1181        
          No : 1362        


In [None]:
# Close the cursor
cursor.close()

## **Exercise 2**

Inspired by the commands above, write a command that lets you visualize the languages of the tweets.

In [None]:
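# SQL Command 3
# Total of tweets by language
# The cursor was closed in the previous cell, so open a new one first
classifier.connect()
cursor = classifier.db.cursor()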
cursor.execute("""
SELECT <ENTER-GROUPING-COLUMN>, count(*) 
FROM  <DETAIL-TABLE-NAME>
GROUP BY <ENTER-GROUPING-COLUMN>;
""")

language_rows = cursor.fetchall()

# Print Results to screen


## Where is information stored?

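The classifier keeps the name of the tweet ID file in `original_filename` and the name of its SQLite database file in `sqlite_filename`.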

In [16]:

print(classifier.original_filename)
print(classifier.sqlite_filename)

tweetsRickyRenuncia-final.txt
.tweetsRickyRenuncia-final.txt.db


**TODO**
Suggest continue to multimedia_popularity...
Need to modify a couple of things there...