# Analyze Tweets
## Overview
This notebook prototypes NLP analysis on AWS, using the output of the `extract-tweets` notebook as input. The result will be fed into TigerGraph to build our model of a user's confirmation bias.

## Set up dependencies
Run the cell below to install the dependencies.

In [1]:
import sys
!{sys.executable} -m pip install -r requirements.txt

DEPRECATION: Configuring installation scheme with distutils config files is deprecated and will no longer work in the near future. If you are using a Homebrew or Linuxbrew Python, please see discussion at https://github.com/Homebrew/homebrew-core/issues/76621

## Prepare our data

We assume that `user-tweets.csv` is in the current directory. If it is missing, run the `extract-tweets` notebook first to generate it.

In [6]:
import pandas as pd
import re

# Load all user tweets
user_tweets = pd.read_csv('./user-tweets.csv')

# Drop index column attached to user tweets csv
user_tweets = user_tweets.drop(columns=['Unnamed: 0'])

# Cut each tweet at its first link (links are useless to us)
user_tweets['text'] = user_tweets['text'].apply(lambda x: re.split(r'https://.*', str(x))[0])

# Remove emojis by dropping every non-ASCII character (all columns become strings)
user_tweets = user_tweets.astype(str).apply(lambda x: x.str.encode('ascii', 'ignore').str.decode('ascii'))

# Remove blank tweets
user_tweets = user_tweets[user_tweets.text.str.strip().str.len() != 0]

# Ensure that all text is in a single line
user_tweets.text = user_tweets.text.str.replace('\n', ' ')
user_tweets.text = user_tweets.text.str.replace('\r', ' ')

# Print out results
user_tweets.head()

Unnamed: 0,tweet_id,username,text
1,1511297746661253120,thesheetztweetz,Breaking - Amazon $AMZN signed the biggest roc...
2,1511153708654120963,thesheetztweetz,The U.S. Air Force's 388th Fighter Wing tested...
3,1511137391263715331,thesheetztweetz,U.S. Space Force Brig. Gen. Stephen Purdy rece...
4,1511087590832758789,thesheetztweetz,"Due the vent valve issue, the launch director ..."
5,1510994152175149062,thesheetztweetz,The countdown clock has now resumed at T-6:40 ...
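
The cleaning steps above can be checked on a single string without loading the CSV. The following is a minimal sketch, assuming a hypothetical helper `clean_tweet` (it is not part of the notebook) that mirrors the same steps using only the standard library. It also makes two side effects visible: everything after the first link is dropped, and accented letters disappear together with the emojis.

```python
import re

def clean_tweet(text):
    """Hypothetical helper mirroring the notebook's cleaning steps for one tweet."""
    # Cut the text at the first https:// link (the rest of that line is dropped too)
    text = re.split(r'https://.*', str(text))[0]
    # Drop every non-ASCII character, which removes emojis but also accented letters
    text = text.encode('ascii', 'ignore').decode('ascii')
    # Replace line breaks with spaces so the tweet fits on a single line
    return text.replace('\n', ' ').replace('\r', ' ')

# Made-up sample tweet, not taken from the dataset
sample = 'Liftoff! 🚀\nCafé update https://t.co/abc123 more text'
cleaned = clean_tweet(sample)
print(repr(cleaned))
```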


## Analyse tweets using AWS

For our backend infrastructure, we'll use Amazon Comprehend as the machine learning component. It provides pre-trained models for key phrase extraction, sentiment analysis and topic modelling, so no custom model is required at this stage.

In [13]:
import boto3
import json

region = 'ap-southeast-1'
language_code = 'en'
username = 'elonmusk'

comprehend = boto3.client('comprehend', region_name=region)

def detect_key_phrases(text, language_code):
    response = comprehend.detect_key_phrases(Text=text, LanguageCode=language_code)
    return response

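def detect_sentiment(text, language_code):
    response = comprehend.detect_sentiment(Text=text, LanguageCode=language_code)
    return response

# Entity detection, following the same pattern as the wrappers above
def detect_entities(text, language_code):
    response = comprehend.detect_entities(Text=text, LanguageCode=language_code)
    return response

text = user_tweets.iloc[20].text

sentiment = detect_sentiment(text, language_code)
key_phrases = detect_key_phrases(text, language_code)
entities = detect_entities(text, language_code)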

print(text)
print(sentiment)
print(key_phrases)

The sounds of @inspiration4x DragonResilience during a phasing burn.I described in moment as an orchestra but its a more percussion-like rhythm &amp; very pleasant. So thankful for @SpaceX's talented team &amp; all the giants @NASA whose shoulders we stand on. @PolarisProgram up soon 
{'Sentiment': 'POSITIVE', 'SentimentScore': {'Positive': 0.9946348667144775, 'Negative': 0.000329022848745808, 'Neutral': 0.004913660231977701, 'Mixed': 0.00012249790597707033}, 'ResponseMetadata': {'RequestId': '44459cf0-e7f9-4c4c-8004-03746be6a3c5', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '44459cf0-e7f9-4c4c-8004-03746be6a3c5', 'content-type': 'application/x-amz-json-1.1', 'content-length': '165', 'date': 'Sat, 09 Apr 2022 01:00:26 GMT'}, 'RetryAttempts': 0}}
{'KeyPhrases': [{'Score': 0.9979216456413269, 'Text': 'The sounds', 'BeginOffset': 0, 'EndOffset': 10}, {'Score': 0.9858503937721252, 'Text': '@inspiration4x DragonResilience', 'BeginOffset': 14, 'EndOffset': 45}, {'Score': 0.990
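
The raw responses above carry more than we need for later steps. As a rough sketch, the two hypothetical helpers below (`summarize_sentiment` and `confident_phrases`, neither of which is in the notebook) reduce a response to the sentiment label with its score and to the key phrases above a score threshold. The `0.99` threshold is an arbitrary assumption, and the sample dictionaries are trimmed copies of the output shown above.

```python
def summarize_sentiment(response):
    """Return the sentiment label and its score from a detect_sentiment response."""
    label = response['Sentiment']
    # The label is upper case ('POSITIVE') while score keys are capitalised ('Positive')
    score = response['SentimentScore'][label.capitalize()]
    return label, score

def confident_phrases(response, min_score=0.99):
    """Keep key phrases whose score is at least min_score (threshold is an assumption)."""
    return [p['Text'] for p in response['KeyPhrases'] if p['Score'] >= min_score]

# Values copied from the output above, trimmed to the fields used here
sample_sentiment = {
    'Sentiment': 'POSITIVE',
    'SentimentScore': {'Positive': 0.9946348667144775, 'Negative': 0.000329022848745808,
                       'Neutral': 0.004913660231977701, 'Mixed': 0.00012249790597707033},
}
sample_key_phrases = {'KeyPhrases': [
    {'Score': 0.9979216456413269, 'Text': 'The sounds', 'BeginOffset': 0, 'EndOffset': 10},
    {'Score': 0.9858503937721252, 'Text': '@inspiration4x DragonResilience',
     'BeginOffset': 14, 'EndOffset': 45},
]}

print(summarize_sentiment(sample_sentiment))
print(confident_phrases(sample_key_phrases))
```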

## Upload input data to S3

This code uploads the data to our input S3 bucket. Make sure that the `DATA_ACCESS_ROLE` environment variable is defined in the `.env` file; it is read into `data_access_role_arn` and passed to the topic detection job later.

In [16]:
import os
from io import StringIO
from dotenv import load_dotenv

load_dotenv()

input_bucket = 'marites-comprehend-input'
output_bucket = 'marites-comprehend-output'
data_access_role_arn = os.environ.get("DATA_ACCESS_ROLE")

input_filename = '{}/{}-input.csv'.format(username, username)

# convert data frame into buffer
csv_buffer = StringIO()
user_tweets.to_csv(csv_buffer)

# upload data frame to S3
s3_resource = boto3.resource('s3')
s3_resource.Object(input_bucket, input_filename).put(Body=csv_buffer.getvalue())

{'ResponseMetadata': {'RequestId': 'Z1EJDSPTGF85BBP2',
  'HostId': '/rwAABDz/v3bXee2AquXKUQpqm3cf3JxO6arZaUridluoewRFRSRQdpRf58YbK1veCr4ZtzvtSY=',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': '/rwAABDz/v3bXee2AquXKUQpqm3cf3JxO6arZaUridluoewRFRSRQdpRf58YbK1veCr4ZtzvtSY=',
   'x-amz-request-id': 'Z1EJDSPTGF85BBP2',
   'date': 'Sat, 09 Apr 2022 01:03:26 GMT',
   'x-amz-expiration': 'expiry-date="Mon, 11 Apr 2022 00:00:00 GMT", rule-id="comprehend-bucket-lifecycle"',
   'etag': '"731e5065042c5e376fcecdd25b1b3fc2"',
   'server': 'AmazonS3',
   'content-length': '0'},
  'RetryAttempts': 0},
 'Expiration': 'expiry-date="Mon, 11 Apr 2022 00:00:00 GMT", rule-id="comprehend-bucket-lifecycle"',
 'ETag': '"731e5065042c5e376fcecdd25b1b3fc2"'}
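
Note that `to_csv` writes a header row and the index and id columns by default, and with the `ONE_DOC_PER_LINE` input format used in the topic detection step every line of the file is treated as a separate document. If only the tweet text should be analysed, the upload body could be built differently. The sketch below is one possible approach, not what the notebook does; `to_one_doc_per_line` is a hypothetical helper.

```python
def to_one_doc_per_line(texts):
    """Hypothetical helper: build a payload with exactly one tweet per line."""
    lines = []
    for text in texts:
        # Collapse all whitespace, including line breaks, into single spaces
        flat = ' '.join(str(text).split())
        if flat:  # skip tweets that are empty after flattening
            lines.append(flat)
    # Terminate every document with a newline; an empty input gives an empty payload
    return ''.join(line + '\n' for line in lines)

# Made-up sample tweets
payload = to_one_doc_per_line(['First tweet', 'Second\ntweet', '   '])
print(payload)
```

In the notebook, `to_one_doc_per_line(user_tweets.text)` could then be passed as `Body` to the same `put` call in place of `csv_buffer.getvalue()`.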

## Topic Detection

This step analyses the tweets and groups them into 10 topics. The job takes roughly 30 minutes.

In [19]:
import time

input_s3_url = 's3://{}/{}'.format(input_bucket, username)
output_s3_url = 's3://{}/{}'.format(output_bucket, username)

input_doc_format = 'ONE_DOC_PER_LINE'
number_of_topics = 10

input_data_config = {
    'S3Uri': input_s3_url,
    'InputFormat': input_doc_format
}

output_data_config = {
    'S3Uri': output_s3_url
}

job_name = 'Topic_Analysis_Job_{}'.format(username)

response = comprehend.start_topics_detection_job(NumberOfTopics=number_of_topics,
                                                 InputDataConfig=input_data_config,
                                                 OutputDataConfig=output_data_config, 
                                                 DataAccessRoleArn=data_access_role_arn, 
                                                 JobName=job_name)

job_id = response['JobId']
print('job_id: ' + job_id)

# Poll once a minute until the job reaches a terminal state
while True:
    result = comprehend.describe_topics_detection_job(JobId=job_id)
    job_status = result["TopicsDetectionJobProperties"]["JobStatus"]
    print("job_status: " + job_status)

    if job_status in ['COMPLETED', 'FAILED']:
        break
    time.sleep(60)


job_id: 637218a390140d1cc1f8ac78ef4dc8cd
job_status: SUBMITTED
job_status: IN_PROGRESS
job_status: IN_PROGRESS
job_status: IN_PROGRESS
job_status: IN_PROGRESS
job_status: IN_PROGRESS
job_status: IN_PROGRESS
job_status: IN_PROGRESS
job_status: IN_PROGRESS
job_status: IN_PROGRESS
job_status: IN_PROGRESS
job_status: IN_PROGRESS
job_status: IN_PROGRESS
job_status: IN_PROGRESS
job_status: IN_PROGRESS
job_status: IN_PROGRESS
job_status: IN_PROGRESS
job_status: IN_PROGRESS
job_status: IN_PROGRESS
job_status: IN_PROGRESS
job_status: IN_PROGRESS
job_status: IN_PROGRESS
job_status: IN_PROGRESS
job_status: IN_PROGRESS
job_status: IN_PROGRESS
job_status: IN_PROGRESS
job_status: IN_PROGRESS
job_status: IN_PROGRESS
job_status: IN_PROGRESS
job_status: IN_PROGRESS
job_status: IN_PROGRESS
job_status: COMPLETED
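
The polling loop above runs until the job finishes, however long that takes. A possible variation, sketched here under the assumption that an hour is a reasonable upper bound, adds a timeout. `wait_for_job` is a hypothetical helper that takes the status lookup as a callable, so the example runs with a simulated status sequence instead of a real AWS call.

```python
import time

def wait_for_job(get_status, poll_seconds=60, timeout_seconds=3600,
                 terminal_states=('COMPLETED', 'FAILED'), sleep=time.sleep):
    """Poll get_status() until a terminal state is reached or the timeout elapses."""
    waited = 0
    while True:
        status = get_status()
        print('job_status: ' + status)
        if status in terminal_states:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError('job still {} after {} seconds'.format(status, waited))
        sleep(poll_seconds)
        waited += poll_seconds

# Simulated status sequence so the sketch runs without AWS or real waiting
statuses = iter(['SUBMITTED', 'IN_PROGRESS', 'COMPLETED'])
final_status = wait_for_job(lambda: next(statuses), sleep=lambda seconds: None)
```

In the notebook, `get_status` could be a lambda that calls `comprehend.describe_topics_detection_job(JobId=job_id)` and returns `["TopicsDetectionJobProperties"]["JobStatus"]` from the result.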


## Download the data locally

Download the topic modelling results locally so that we can explore what the dataset offers.

In [26]:
result = comprehend.describe_topics_detection_job(JobId=job_id)
results_s3_url = result['TopicsDetectionJobProperties']['OutputDataConfig']['S3Uri']


s3_name = 's3://{}/'.format(output_bucket)

local_results_filename = './topic-model.tar.gz'
results_aws_filename = results_s3_url.replace(s3_name, '')

s3 = boto3.client('s3')
s3.download_file(output_bucket, results_aws_filename, local_results_filename)
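
The object key above is derived by stripping the bucket prefix from the S3 URI with `str.replace`. An alternative that does not depend on knowing the bucket name in advance is to parse the URI. This is a small sketch using the standard library; `split_s3_uri` is a hypothetical helper and the URI in the example is made up.

```python
from urllib.parse import urlparse

def split_s3_uri(uri):
    """Split 's3://bucket/key' into a (bucket, key) tuple."""
    parsed = urlparse(uri)
    if parsed.scheme != 's3' or not parsed.netloc:
        raise ValueError('not an S3 URI: ' + uri)
    # The path starts with '/', which is not part of the object key
    return parsed.netloc, parsed.path.lstrip('/')

# Hypothetical URI, shaped like the value stored in results_s3_url
bucket, key = split_s3_uri('s3://example-output-bucket/someuser/output/output.tar.gz')
print(bucket, key)
```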

Extract the downloaded archive into a local directory.

In [27]:
import tarfile

def extract_targz(targz_file, output_path=''):
    # Pick the tarfile mode from the file extension; other files are ignored
    if targz_file.endswith("tar.gz"):
        mode = "r:gz"
    elif targz_file.endswith("tar"):
        mode = "r:"
    else:
        return
    # The context manager closes the archive even if extraction fails
    with tarfile.open(targz_file, mode) as tar:
        tar.extractall(path=output_path)

# Unzip the results file
output_path = 'extracted'
extract_targz(local_results_filename, output_path)
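
Once the archive is extracted, the topics can be inspected. The sketch below assumes that the archive contains a `topic-terms.csv` file with `topic`, `term` and `weight` columns; this should be verified against the actual contents of the `extracted` directory. `top_terms` is a hypothetical helper, and the sample rows are made up rather than real job output.

```python
import csv
from collections import defaultdict
from io import StringIO

def top_terms(csv_file, limit=3):
    """Group terms by topic, keeping the `limit` highest-weighted terms per topic."""
    terms = defaultdict(list)
    for row in csv.DictReader(csv_file):
        terms[row['topic']].append((float(row['weight']), row['term']))
    # Sort each topic's (weight, term) pairs from highest to lowest weight
    return {topic: [term for _, term in sorted(pairs, reverse=True)[:limit]]
            for topic, pairs in terms.items()}

# Made-up sample rows, not real job output
sample_csv = StringIO('topic,term,weight\n000,launch,0.31\n000,rocket,0.52\n001,market,0.40\n')
topics = top_terms(sample_csv)
print(topics)
```

With real results, the same function could be given an open file handle, for example from `open('extracted/topic-terms.csv', newline='')`, assuming that file name matches what the job produced.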