# Sentiment Analysis in the banking sector using Twitter API, AWS Comprehend and AWS SageMaker

In [33]:
from IPython.display import Image
from IPython.display import HTML  # IPython.core.display is deprecated as an import path

## INTRODUCTION
This article shares the main findings of a sentiment analysis of the banking sector carried out with the Twitter API, AWS Comprehend and AWS SageMaker.

The study covers three Colombian banks. The data was collected through the Twitter API from a Jupyter notebook on AWS SageMaker, using specific hashtags for each bank to gather customer feedback. AWS Comprehend then analyzed the collected tweets, using NLP algorithms to extract key phrases, entities and sentiment automatically **(SEE FIGURE 1)**. Because these AWS models are pre-trained, each comment could be scored as positive, negative or neutral without training a model. (Comprehend also returns a mixed score, which this notebook does not use.)

The analysis set out to answer the following questions:
* Which bank has the highest number of tweets?
* What is the customer's perception of the city where the bank is headquartered?
* What is the perception of the clients classified by city?


**Note**: *The source code is adapted from AWS (see references); some modifications were made for this practical exercise.*

In [34]:
Image(url= "https://s3.amazonaws.com/public-ps-datasets/comprehend.png")

## Import libraries

In [35]:
# Import libraries
import pandas as pd
from collections import OrderedDict
import requests
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt



### Connect to the Comprehend API using boto3


In [36]:
import boto3
comprehend = boto3.client('comprehend', region_name='us-east-1')

In [37]:
sample_tweet="It’s always a great day when I can randomly put my equestrian knowledge to good use at work! #AWS #BePeculiar"   

# Key phrases
phrases = comprehend.detect_key_phrases(Text=sample_tweet, LanguageCode='en')

# Entities
entities = comprehend.detect_entities(Text=sample_tweet, LanguageCode='en')

#Sentiments
sentiments = comprehend.detect_sentiment(Text=sample_tweet, LanguageCode='en')
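
Each of the calls above returns a plain dictionary. As a minimal sketch, the hypothetical helper below (`summarize_sentiment` is not part of boto3) reads the label and its score from a response shaped like the one `detect_sentiment` returns. The values in `example_response` are made up for illustration, so the snippet runs without AWS credentials.

```python
def summarize_sentiment(response):
    """Return (label, score of that label) from a detect_sentiment-style response."""
    label = response['Sentiment']            # e.g. 'POSITIVE'
    scores = response['SentimentScore']      # keys: Positive, Negative, Neutral, Mixed
    return label, scores[label.capitalize()]

# Illustrative values only, not real Comprehend output.
example_response = {
    'Sentiment': 'POSITIVE',
    'SentimentScore': {'Positive': 0.91, 'Negative': 0.01, 'Neutral': 0.07, 'Mixed': 0.01},
}

label, confidence = summarize_sentiment(example_response)
print(label, confidence)
```

In the notebook, the same helper could be applied to the response returned for `sample_tweet`.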

# Query the Twitter API

In [38]:
# Twitter api information 
# You can get the access token from your twitter developer account (https://developer.twitter.com/). 
# DO NOT SHARE YOUR ACCESS TOKEN WITH ANYONE.

api_key =  '' #API key
api_secret = '' #API secret
access_token = '' #Access token
access_secret = '' #Access token secret
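
One way to keep the tokens out of the notebook itself is to read them from environment variables. The sketch below assumes four variable names (`TWITTER_API_KEY` and so on) that are purely hypothetical, and `load_twitter_credentials` is an illustrative helper, not part of Tweepy.

```python
import os

# Hypothetical variable names; use whatever names your environment defines.
REQUIRED_VARS = ('TWITTER_API_KEY', 'TWITTER_API_SECRET',
                 'TWITTER_ACCESS_TOKEN', 'TWITTER_ACCESS_SECRET')

def load_twitter_credentials(env=None):
    """Read the four Twitter credentials from a mapping (default: os.environ)."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError('Missing environment variables: ' + ', '.join(missing))
    return tuple(env[name] for name in REQUIRED_VARS)

# Demonstration with a stand-in mapping so the example runs anywhere.
fake_env = {name: 'placeholder-' + str(i) for i, name in enumerate(REQUIRED_VARS)}
demo_credentials = load_twitter_credentials(fake_env)
print(len(demo_credentials))
```

In the notebook, `api_key, api_secret, access_token, access_secret = load_twitter_credentials()` could then replace the hard-coded strings.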


Let's query Twitter for tweets that contain a given hashtag, for example **#bancolombia**.

#### NOTE: Run the code below ONLY the first time you open the notebook, to install Tweepy.


In [39]:
!pip install --upgrade pip tweepy

Requirement already up-to-date: tweepy in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (3.8.0)
Requirement already up-to-date: pip in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (20.1.1)


In [40]:
# Once installed, Tweepy can be imported directly.
import tweepy

auth = tweepy.OAuthHandler(api_key, api_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)

In [41]:
# Let's search for a tag

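tag = '#bancolombia'  # Example hashtag; change it to the one you want to investigate.
# count=100 caps the number of tweets returned by this request. See pricing: https://developer.twitter.com/en/pricing.html
# Note: Tweepy 4.x renamed this method to api.search_tweets; api.search matches the 3.8.0 version installed above.
tweets = api.search(q=tag, count=100)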

Let's see what a tweet looks like!
It carries a large amount of metadata along with the tweet text itself. The metadata captures valuable information about the user, such as the account id, profile picture, profile description and location.


In [42]:
tweets[0]

Status(_api=<tweepy.api.API object at 0x7f29430d6f28>, _json={'created_at': 'Sun Jul 05 16:33:25 +0000 2020', 'id': 1279815659627728897, 'id_str': '1279815659627728897', 'text': 'Que bueno es escuchar cada vez más a grandes marcas en #Colombia como el Grupo #Bancolombia hablando de sus casos d… https://t.co/1XjR4ecnjb', 'truncated': True, 'entities': {'hashtags': [{'text': 'Colombia', 'indices': [55, 64]}, {'text': 'Bancolombia', 'indices': [79, 91]}], 'symbols': [], 'user_mentions': [], 'urls': [{'url': 'https://t.co/1XjR4ecnjb', 'expanded_url': 'https://twitter.com/i/web/status/1279815659627728897', 'display_url': 'twitter.com/i/web/status/1…', 'indices': [117, 140]}]}, 'metadata': {'iso_language_code': 'es', 'result_type': 'recent'}, 'source': '<a href="http://www.linkedin.com/" rel="nofollow">LinkedIn</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 1

From the tweet's metadata we can extract the fields we need, e.g. **description** and **location**.


In [43]:
print("author's profile description: " , tweets[0].user.description)
print("author's location: ", tweets[0].user.location)
print("author's tweet: ", tweets[0].text)
print("Timestamp: ", tweets[0].created_at)

author's profile description:  ING Informático con Esp. en Desarrollo de Software y MBA con Esp. G-PYs. Apasionado de la Fotografía (https://t.co/U1FxkdihW4). Son mis opiniones
author's location:  
author's tweet:  Que bueno es escuchar cada vez más a grandes marcas en #Colombia como el Grupo #Bancolombia hablando de sus casos d… https://t.co/1XjR4ecnjb
Timestamp:  2020-07-05 16:33:25


### Extract tweet content and location from each tweet and analyze sentiment of the post.

In [44]:
# Extract the sentiment of each tweet with the Amazon Comprehend API
posts = []
timestamp = []
locations = []
sentiments = []
positive = []
negative = []
neutral = []

for tweet in tweets:
    text = tweet.text
    if text == '':
        continue  # Skip empty tweets so that all lists stay aligned

    # The sample tweets are mostly in Spanish; Comprehend also accepts LanguageCode='es'.
    # 'en' is kept here because the outputs shown below were produced with it.
    res = comprehend.detect_sentiment(Text=text, LanguageCode='en')  # Call AWS Comprehend
    scores = res['SentimentScore']

    timestamp.append(tweet.created_at)
    posts.append(text)
    locations.append(tweet.user.location)
    sentiments.append(res['Sentiment'])
    positive.append(scores['Positive'])
    negative.append(scores['Negative'])
    neutral.append(scores['Neutral'])
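
The loop above makes one Comprehend request per tweet. A possible alternative, sketched here under assumptions, is `batch_detect_sentiment`, which accepts up to 25 texts per request. `batch_sentiments` and `FakeComprehend` are hypothetical names; the fake client only stands in for the boto3 client so the sketch runs offline.

```python
def batch_sentiments(client, texts, language_code='en', batch_size=25):
    """Return one result dict per text (None where no result came back)."""
    results = [None] * len(texts)
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        response = client.batch_detect_sentiment(TextList=batch, LanguageCode=language_code)
        for item in response['ResultList']:
            # 'Index' is the position of the text inside the current batch
            results[start + item['Index']] = item
    return results

class FakeComprehend:
    """Offline stand-in for the boto3 client; always answers NEUTRAL."""
    def __init__(self):
        self.batch_sizes = []

    def batch_detect_sentiment(self, TextList, LanguageCode):
        self.batch_sizes.append(len(TextList))
        neutral_scores = {'Positive': 0.0, 'Negative': 0.0, 'Neutral': 1.0, 'Mixed': 0.0}
        return {'ResultList': [{'Index': i, 'Sentiment': 'NEUTRAL', 'SentimentScore': neutral_scores}
                               for i in range(len(TextList))],
                'ErrorList': []}

fake_client = FakeComprehend()
demo_results = batch_sentiments(fake_client, ['tweet %d' % i for i in range(60)])
print(fake_client.batch_sizes)
```

With the real client, empty strings would need to be filtered out first, and any entries reported in `ErrorList` would stay `None` in the returned list.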
    


### Build a dataframe to view the data in tabular form so that the information is easy to consume. 

In [45]:
import pandas as pd
from collections import OrderedDict

# Collect the per-tweet lists into one table; long locations are wrapped at 15 characters
result = pd.DataFrame(OrderedDict({
    'tweets': posts,
    'location': pd.Series(locations).str.wrap(15),
    'timestamp': timestamp,
    'sentiment': sentiments,
    'positiveScore': positive,
    'negativeScore': negative,
    'neutralScore': neutral,
}))

In [46]:
# Display first rows
result.head()

| | tweets | location | timestamp | sentiment | positiveScore | negativeScore | neutralScore |
|---|---|---|---|---|---|---|---|
| 0 | Que bueno es escuchar cada vez más a grandes m... | | 2020-07-05 16:33:25 | POSITIVE | 0.758265 | 0.026999 | 0.214699 |
| 1 | RT @hogar_abuelitos: Ayúdanos a ayudar, 24 Abu... | Vitoria-Gasteiz\n- País Vasco | 2020-07-05 12:24:45 | NEUTRAL | 0.203069 | 0.02619 | 0.77073 |
| 2 | RT @hogar_abuelitos: Ayúdanos a ayudar, 24 Abu... | Vitoria-Gasteiz\n- País Vasco | 2020-07-05 12:24:38 | NEUTRAL | 0.430123 | 0.0147 | 0.555164 |
| 3 | Ya #bancolombia me cobra de nuevo la cuota del... | @carlosfernando\nposadat | 2020-07-05 01:48:21 | NEUTRAL | 0.373988 | 0.006424 | 0.619572 |
| 4 | @diegoamoreno04 @larepublica_co @Bancolombia @... | | 2020-07-04 20:23:31 | NEUTRAL | 0.091123 | 0.184147 | 0.724692 |


In [47]:
# Check null values 
result.isnull().sum()

tweets           0
location         0
timestamp        0
sentiment        0
positiveScore    0
negativeScore    0
neutralScore     0
dtype: int64
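
The check reports zero nulls for `location`, yet several rows in the table above have a blank location. Those are empty strings, which `isnull()` does not count. A minimal sketch of one way to surface them, using a small illustrative frame rather than the real `result`:

```python
import pandas as pd

# Small stand-in for the `location` column; values are illustrative.
sample = pd.DataFrame({'location': ['', 'Cali, Colombia', '   ', 'Cúcuta']})

stripped = sample['location'].str.strip()
# mask() replaces entries where the condition is True with NaN
sample['location_clean'] = stripped.mask(stripped == '')

print(sample['location'].isnull().sum())        # empty strings are not counted
print(sample['location_clean'].isnull().sum())  # blank entries now count as missing
```

Applied to `result['location']`, the same two lines would make the missing locations visible to `isnull()` and `dropna()`.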

In [57]:
# Delete rows with null values (dropna returns a new DataFrame, so assign it back)
result = result.dropna(axis=0)
result.head()

| | tweets | location | timestamp | sentiment | positiveScore | negativeScore | neutralScore |
|---|---|---|---|---|---|---|---|
| 0 | Que bueno es escuchar cada vez más a grandes m... | | 2020-07-05 16:33:25 | POSITIVE | 0.758265 | 0.026999 | 0.214699 |
| 1 | RT @hogar_abuelitos: Ayúdanos a ayudar, 24 Abu... | Vitoria-Gasteiz\n- País Vasco | 2020-07-05 12:24:45 | NEUTRAL | 0.203069 | 0.02619 | 0.77073 |
| 2 | RT @hogar_abuelitos: Ayúdanos a ayudar, 24 Abu... | Vitoria-Gasteiz\n- País Vasco | 2020-07-05 12:24:38 | NEUTRAL | 0.430123 | 0.0147 | 0.555164 |
| 3 | Ya #bancolombia me cobra de nuevo la cuota del... | @carlosfernando\nposadat | 2020-07-05 01:48:21 | NEUTRAL | 0.373988 | 0.006424 | 0.619572 |
| 4 | @diegoamoreno04 @larepublica_co @Bancolombia @... | | 2020-07-04 20:23:31 | NEUTRAL | 0.091123 | 0.184147 | 0.724692 |


### Statistics 

In [49]:
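# Mean of the numeric score columns (numeric_only avoids errors on the text columns in recent pandas)
result.mean(numeric_only=True)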

positiveScore    0.126079
negativeScore    0.374257
neutralScore     0.486181
dtype: float64

In [50]:
# Standard deviation of the numeric score columns
result.std(numeric_only=True)

positiveScore    0.227420
negativeScore    0.389013
neutralScore     0.352415
dtype: float64

In [51]:
# Summary statistics 
result.describe()

| | positiveScore | negativeScore | neutralScore |
|---|---|---|---|
| count | 45.0 | 45.0 | 45.0 |
| mean | 0.126079 | 0.374257 | 0.486181 |
| std | 0.22742 | 0.389013 | 0.352415 |
| min | 9.1e-05 | 6.3e-05 | 0.017035 |
| 25% | 0.002784 | 0.018085 | 0.152461 |
| 50% | 0.009615 | 0.184147 | 0.504587 |
| 75% | 0.147081 | 0.845264 | 0.798434 |
| max | 0.787369 | 0.982872 | 0.999806 |
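
The statistics above summarize the scores. A complementary view is how often each label was predicted. The sketch below uses only the five labels shown by `result.head()` earlier, so its numbers describe that small sample, not the full dataset.

```python
import pandas as pd

# Labels of the five rows displayed by result.head() above.
labels = pd.Series(['POSITIVE', 'NEUTRAL', 'NEUTRAL', 'NEUTRAL', 'NEUTRAL'], name='sentiment')

label_counts = labels.value_counts()               # absolute count per label
label_share = labels.value_counts(normalize=True)  # fraction of tweets per label
print(label_counts)
print(label_share)
```

In the notebook, `result['sentiment'].value_counts(normalize=True)` would give the same breakdown over all 45 tweets.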


In [52]:
#Save to csv
result.to_csv('bancolombia-analysis.csv')

We can now run an exploratory data analysis on the result dataset to find out, for example, which location generated the most negative sentiment, which location generated the most tweets, and whether a certain time of day generates more negative sentiment.
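
The time-of-day question can be approached by grouping on the hour of the timestamp. This is a minimal sketch on the first four rows shown by `result.head()`, so the sample is far too small to draw conclusions. The timestamps come from the Twitter API in UTC, so the hours are UTC hours.

```python
import pandas as pd

# First four rows of the table above (timestamp and negative score only).
sample = pd.DataFrame({
    'timestamp': pd.to_datetime(['2020-07-05 16:33:25', '2020-07-05 12:24:45',
                                 '2020-07-05 12:24:38', '2020-07-05 01:48:21']),
    'negativeScore': [0.026999, 0.02619, 0.0147, 0.006424],
})

# Mean negative score per hour of day, highest first
negative_by_hour = (sample.groupby(sample['timestamp'].dt.hour)['negativeScore']
                          .mean()
                          .sort_values(ascending=False))
print(negative_by_hour)
```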

In [53]:
print("Mean positive score by location, in descending order: ")
result.groupby(by='location')['positiveScore'].mean().sort_values(ascending=False)


Mean positive score by location, in descending order: 


location
Bogotá, D.C.,\nColombia          0.787369
Cali, Colombia                   0.729407
@carlosfernando\nposadat         0.373988
Chaparral,\nTolima               0.327975
Vitoria-Gasteiz\n- País Vasco    0.316596
MADRID, ESPAÑA                   0.222527
mas caleña q el\nchontaduro      0.147081
                                 0.108228
Cúcuta                           0.099056
Bogotá,\nColombia                0.061338
📍Colombia                        0.044267
COLOMBIA                         0.007085
Barranquilla,\nColombia          0.006885
#Colombia                        0.006754
Medellín,\nColombia              0.005673
Bogota                           0.004743
Valledupar -\nColombia           0.004351
Medellín                         0.004234
Colombia                         0.002161
Name: positiveScore, dtype: float64

In [54]:
result.groupby(by='location', sort = True)['tweets'].count().sort_values(ascending=False)

location
                                 17
📍Colombia                         4
Colombia                          3
Bogotá,\nColombia                 3
Vitoria-Gasteiz\n- País Vasco     2
#Colombia                         2
Cúcuta                            2
Bogotá, D.C.,\nColombia           1
@carlosfernando\nposadat          1
Barranquilla,\nColombia           1
Bogota                            1
Cali, Colombia                    1
COLOMBIA                          1
Valledupar -\nColombia            1
mas caleña q el\nchontaduro       1
MADRID, ESPAÑA                    1
Medellín                          1
Medellín,\nColombia               1
Chaparral,\nTolima                1
Name: tweets, dtype: int64
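
Because `location` is free text typed by each user, the same place appears under several spellings (`Colombia`, `COLOMBIA`, `#Colombia`, `📍Colombia`). The sketch below shows one possible normalization before counting. `normalize_location` is a hypothetical helper, and the sample values are copied from the output above.

```python
import re
import unicodedata
import pandas as pd

def normalize_location(raw):
    """Lowercase, strip accents, and drop symbols so spelling variants collapse together."""
    decomposed = unicodedata.normalize('NFKD', raw)
    no_accents = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    letters_only = re.sub(r'[^a-z, ]+', ' ', no_accents.lower())
    return re.sub(r'\s+', ' ', letters_only).strip()

raw_locations = pd.Series(['Colombia', 'COLOMBIA', '#Colombia', '📍Colombia', 'Bogota', 'Bogotá'])
normalized_counts = raw_locations.map(normalize_location).value_counts()
print(normalized_counts)
```

This only merges spelling variants. Mapping entries such as `Bogotá, Colombia` and `Bogota` to a single city would still need a lookup table or a geocoding step.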

In [55]:
# Keep only the tweet text (this is a pandas Series, not a full DataFrame)
df = result['tweets']
df.head()

0    Que bueno es escuchar cada vez más a grandes m...
1    RT @hogar_abuelitos: Ayúdanos a ayudar, 24 Abu...
2    RT @hogar_abuelitos: Ayúdanos a ayudar, 24 Abu...
3    Ya #bancolombia me cobra de nuevo la cuota del...
4    @diegoamoreno04 @larepublica_co @Bancolombia @...
Name: tweets, dtype: object

#### Upload the data to S3 on AWS

In [56]:
from io import StringIO
import boto3

def write_pd_s3_csv(df, bucket, filepath):
    csv_buffer = StringIO()
    df.to_csv(csv_buffer)
    s3_resource = boto3.resource('s3')
    s3_resource.Object(bucket, filepath).put(Body=csv_buffer.getvalue())
    print("The data is successfully written to S3 path:", bucket+"/"+filepath)

    
s3_bucket = ''  # Bucket name
file_path = ''  # Object key (path) under which the CSV is saved in the bucket
write_pd_s3_csv(df, s3_bucket, file_path)

### REFERENCES
[1] "Domo Resource - Data Never Sleeps 7.0", Domo.com, 2020. [Online]. Available: https://www.domo.com/learn/data-never-sleeps-7. [Accessed: 28- Jun- 2020].

[2] M. Pejić Bach, Ž. Krstić, S. Seljan and L. Turulja, "Text Mining for Big Data Analysis in Financial Sector: A Literature Review", Sustainability, vol. 11, no. 5, p. 1277, 2019. Available: 10.3390/su11051277. [Accessed: 26- Jun- 2020].

[3] "Amazon Comprehend - Natural Language Processing (NLP) and Machine Learning (ML)", Amazon Web Services, Inc., 2020. [Online]. Available: https://aws.amazon.com/comprehend/?nc1=h_ls. [Accessed: 27- Jun- 2020].

[4] "Analyze content with Amazon Comprehend and Amazon SageMaker notebooks | Amazon Web Services", Amazon Web Services, 2020. [Online]. Available: https://aws.amazon.com/blogs/machine-learning/analyze-content-with-amazon-comprehend-and-amazon-sagemaker-notebooks/. [Accessed: 27- Jun- 2020].
