# Twitter Analysis using Elasticsearch and Python: Initial Discovery with Kibana
## OSINT with Python and ELKstack [Part 2a]
> Andrew Eng | 2020-10-09

## Get Synched

In [part 1](https://medium.com/swlh/open-source-intelligence-with-elasticsearch-analyzing-twitter-feeds-part-1-of-3-21a8b65dde03) we:
1. Built our infrastructure using a containerized version of ELKSTACK (Elasticsearch, Logstash, Kibana)
2. Explored and built a python script that reaches out to Twitter using APIs
3. Transferred the collection data into Elasticsearch for analyzing

In part 2, we are going to explore the data that we ingested into Elasticsearch using Kibana.

**While writing this part, I realized that this is going to be a huge topic.  While going through EDA, I found so many useful tools and techniques that could be used to speed up the process.  I'm going to break part 2 into multiple sub-sections.  This sub-section is going to be focused on the initial EDA and building out simple visualizations with Kibana.  The next sections will focus on cleaning up the data using machine learning and tap into other APIs that will help us answer some questions.  There may be additional quesitons that we come up with through the process.**

Since [part 1](https://medium.com/swlh/open-source-intelligence-with-elasticsearch-analyzing-twitter-feeds-part-1-of-3-21a8b65dde03), I expanded the search criteria to include all of the metadata from each tweet.  That way, I can go back to it and answer possible new questions around the data I'm exploring.  I need to clean it up a bit more.  **Note** I need to create TRY exceptions around KeyError codes (some fields do not have data, but I still want to collect that field).

Since the code was getting larger than I expected.  I took out some uneeded things.  I simplified the feed to elastic function:

```python

while count < len(feed):
        doc = {
            '@timestamp': dt.now(),
            'created_at': str(feed[count]['created_at']),
            'twitter_id' : int(feed[count]['id']),
            'id_str' : int(feed[count]['id_str']),
            'full_text' : str(feed[count]['full_text']),
            'truncated' : str(feed[count]['truncated']),
            'display_text_range' : str(feed[count]['display_text_range']),
            'entities' : str(feed[count]['entities']), # Already split the dictionary, no longer needed 
            'metadata' : str(feed[count]['metadata']),
            'source' : str(feed[count]['source']),
            'in_reply_to_status_id' : str(feed[count]['in_reply_to_status_id']),
            'in_reply_to_status_id_str' : str(feed[count]['in_reply_to_status_id_str']),
            'in_reply_to_user_id' : str(feed[count]['in_reply_to_user_id']),
            'in_reply_to_user_id_str' : str(feed[count]['in_reply_to_user_id_str']),
            'in_reply_to_screen_name' : str(feed[count]['in_reply_to_screen_name']),
            'user' : str(feed[count]['user']), # Already split the dictionary, no longer needed
            'geo' : str(feed[count]['geo']),
            ... [truncated]
```

This clean up was pretty fun.  I was doing it manually, but it was taking forever.  [**VERY** tedius] I had to open the dictionary, extract the keys, format the text so it creates variables and then those variables are called.  So intead...  I turned to python.  I had to think of ways I can keep iterating through dictionaries and sub-dictionaries to extract the items for each key.  I want a simple dictionary that I can stat on.  **Note** Clean up the collection function:

I moved some code blocks around and wrapped it in a function that I can use to call.  I did this because it provides better flexibility when try/exceptions are being executed.

**This code block imports modules, sets server and API configurations, and objects**

In [17]:
# Import tweepy as tw
import tweepy as tw
import sys
from datetime import datetime as dt
from elasticsearch import Elasticsearch

# Initializing objects
twitter_cred = dict()
api = ''
es = ''

def setConfig(server):
    # Import keys from a saved file instead of inputting it directly into the script.  
    # Strip whitespaces and split on = as I only want the key values
    # Server argument is the elasticsearch node
    
    key_location = 'twitter.keys'
    apikeys = []

    global api
    global es

    with open(key_location) as keys:
        for i in keys:
            apikeys.append(i.split("=")[1].strip(" ").strip("\n"))
    keys.close()

    # Initialize dictionary
    #twitter_cred = dict()

    # Enter API keys
    twitter_cred["CONSUMER_KEY"] = apikeys[0]
    twitter_cred["CONSUMER_SECRET"] = apikeys[1]

    # Access Tokens
    twitter_cred["ACCESS_KEY"] = apikeys[2]
    twitter_cred["ACCESS_SECRET"] = apikeys[3]

    # Set authentication object
    auth = tw.OAuthHandler(twitter_cred["CONSUMER_KEY"], twitter_cred["CONSUMER_SECRET"])
    auth.set_access_token(twitter_cred["ACCESS_KEY"], twitter_cred["ACCESS_SECRET"])

    # Create api object with authentication
    api = tw.API(auth, wait_on_rate_limit=True)

    # Set Elasticsearch Server
    es = Elasticsearch(server, port=9200)

# Execute function with the elasticsearch ip address
setConfig('127.0.0.1')

Pull a tweet so we can play around with the record

In [21]:
feed = {}

for tweet in tw.Cursor(api.search, q='palantir OR pltr', tweet_mode='extended').items(1):
    feed.update(tweet._json)

Iterate the dictionary and sub dictionary to generate the line(s) of code I need

In [22]:
# What keys are used
parentList = []
subList = []

# Add dictionaries to sub process list
def subProcess(dictionary):
    subList.append(dictionary)
    
for item in feed.keys():
    parentList.append(item)

for i in parentList:
    if type(feed[i]) is not dict:
        print(f"{i} = feed['{i}']")
        
    else:
        subProcess(i)

created_at = feed['created_at']
id = feed['id']
id_str = feed['id_str']
full_text = feed['full_text']
truncated = feed['truncated']
display_text_range = feed['display_text_range']
source = feed['source']
in_reply_to_status_id = feed['in_reply_to_status_id']
in_reply_to_status_id_str = feed['in_reply_to_status_id_str']
in_reply_to_user_id = feed['in_reply_to_user_id']
in_reply_to_user_id_str = feed['in_reply_to_user_id_str']
in_reply_to_screen_name = feed['in_reply_to_screen_name']
geo = feed['geo']
coordinates = feed['coordinates']
place = feed['place']
contributors = feed['contributors']
is_quote_status = feed['is_quote_status']
retweet_count = feed['retweet_count']
favorite_count = feed['favorite_count']
favorited = feed['favorited']
retweeted = feed['retweeted']
possibly_sensitive = feed['possibly_sensitive']
lang = feed['lang']


## Exploratory Data Analysis with Kibana

### Getting Started

**Exploratory Data Analysis (EDA)** is the process for performing initial investigations on data.  The goal is to discover patterns, outliers, and develop hypothesis.  Visualizations can be created to help streamline this process.  

> "**[Kibana](https://www.elastic.co/kibana)** is a free and open user interface that lets you visualize your Elasticsearch data and navigate the Elastic Stack. Do anything from tracking query load to understanding the way requests flow through your apps."

![Kibana Interface sourced from elastic.co](images/2_what_is_kibana.png)

I use Kibana exclusively for EDA and creating visualizations in order to try and understand data.  As I progress, I imagine I will be using python pandas and matplotlib more to quickly sift through data.  For now, I'll stick with Kibana.

[Access the notebook](https://github.com/andreweng)

In [11]:
# Code snippet will not be included in the blog to reduce the read time.  This will be a supplement in my github repository

import tweepy as tw
import sys
from datetime import datetime as dt
from elasticsearch import Elasticsearch

# Initializing objects
twitter_cred = dict()
api = ''
es = ''

def acqData(search, acq):

    index_name = idx + dt.today().strftime('%Y-%m-%d')
    feed = []
    
    print(': :Acquiring Data::')
   
    for tweet in tw.Cursor(api.search, q=search, tweet_mode='extended').items(acq):
        feed.append(tweet._json)

    count = 0
    
    print(': :Transferring to Elasticsearch Search::')
    
    while count < len(feed):
        doc = {
            '@timestamp': dt.now(),
            'created_at': str(feed[count]['created_at']),
            'twitter_id' : int(feed[count]['id']),
            'id_str' : int(feed[count]['id_str']),
            'full_text' : str(feed[count]['full_text']),
            'truncated' : str(feed[count]['truncated']),
            'display_text_range' : str(feed[count]['display_text_range']),
            'entities' : str(feed[count]['entities']), # Already split the dictionary, no longer needed 
            'metadata' : str(feed[count]['metadata']),
            'source' : str(feed[count]['source']),
            'in_reply_to_status_id' : str(feed[count]['in_reply_to_status_id']),
            'in_reply_to_status_id_str' : str(feed[count]['in_reply_to_status_id_str']),
            'in_reply_to_user_id' : str(feed[count]['in_reply_to_user_id']),
            'in_reply_to_user_id_str' : str(feed[count]['in_reply_to_user_id_str']),
            'in_reply_to_screen_name' : str(feed[count]['in_reply_to_screen_name']),
            'user' : str(feed[count]['user']), # Already split the dictionary, no longer needed
            'geo' : str(feed[count]['geo']),
            'coordinates' : str(feed[count]['coordinates']),
            'place' : str(feed[count]['place']),
            'contributors' : str(feed[count]['contributors']),
            #'retweeted_status' : str(feed[count]['retweeted_status']),
            'is_quote_status' : str(feed[count]['is_quote_status']),
            'retweet_count' : str(feed[count]['retweet_count']),
            'favorite_count' : str(feed[count]['favorite_count']),
            'favorited' : str(feed[count]['favorited']),
            'retweeted' : str(feed[count]['retweeted']),
            'lang' : str(feed[count]['lang']),
            'user_id' : str(feed[count]['user']['id']),
            'user_id_str' : str(feed[count]['user']['id_str']),
            'user_name' : str(feed[count]['user']['name']),
            'user_screen_name' : str(feed[count]['user']['screen_name']),
            'user_location' : str(feed[count]['user']['location']),
            'user_description' : str(feed[count]['user']['description']),
            'user_url' : str(feed[count]['user']['url']),
            'user_protected' : str(feed[count]['user']['protected']),
            'user_followers_count' : str(feed[count]['user']['followers_count']),
            'user_friends_count' : str(feed[count]['user']['friends_count']),
            'user_listed_count' : str(feed[count]['user']['listed_count']),
            'user_created_at' : str(feed[count]['user']['created_at']),
            'user_favourites_count' : str(feed[count]['user']['favourites_count']),
            'user_utc_offset' : str(feed[count]['user']['utc_offset']),
            'user_time_zone' : str(feed[count]['user']['time_zone']),
            'user_geo_enabled' : str(feed[count]['user']['geo_enabled']),
            'user_verified' : str(feed[count]['user']['verified']),
            'user_statuses_count' : str(feed[count]['user']['statuses_count']),
            'user_lang' : str(feed[count]['user']['lang']),
            'user_contributors_enabled' : str(feed[count]['user']['contributors_enabled']),
            'user_is_translator' : str(feed[count]['user']['is_translator']),
            'user_is_translation_enabled' : str(feed[count]['user']['is_translation_enabled']),
            'user_profile_background_color' : str(feed[count]['user']['profile_background_color']),
            'user_profile_background_image_url' : str(feed[count]['user']['profile_background_image_url']),
            'user_profile_background_image_url_https' : str(feed[count]['user']['profile_background_image_url_https']),
            'user_profile_background_tile' : str(feed[count]['user']['profile_background_tile']),
            'user_profile_image_url' : str(feed[count]['user']['profile_image_url']),
            'user_profile_image_url_https' : str(feed[count]['user']['profile_image_url_https']),
            #'user_profile_banner_url' : str(feed[count]['user']['profile_banner_url']),
            'user_profile_link_color' : str(feed[count]['user']['profile_link_color']),
            'user_profile_sidebar_border_color' : str(feed[count]['user']['profile_sidebar_border_color']),
            'user_profile_sidebar_fill_color' : str(feed[count]['user']['profile_sidebar_fill_color']),
            'user_profile_text_color' : str(feed[count]['user']['profile_text_color']),
            'user_profile_use_background_image' : str(feed[count]['user']['profile_use_background_image']),
            'user_has_extended_profile' : str(feed[count]['user']['has_extended_profile']),
            'user_default_profile' : str(feed[count]['user']['default_profile']),
            'user_default_profile_image' : str(feed[count]['user']['default_profile_image']),
            'user_following' : str(feed[count]['user']['following']),
            'user_follow_request_sent' : str(feed[count]['user']['follow_request_sent']),
            'user_notifications' : str(feed[count]['user']['notifications']),
            'user_translator_type' : str(feed[count]['user']['translator_type']),
            'entities_hashtags' : str(feed[count]['entities']['hashtags']),
            'entities_symbols' : str(feed[count]['entities']['symbols']),
            'entities_user_mentions' : str(feed[count]['entities']['user_mentions']),
            'entities_urls' : str(feed[count]['entities']['urls']),
            #'entities_media' : str(feed[count]['entities']['media']),
            #'extended_entities_media' : str(feed[count]['extended_entities']['media']),
            'metadata_iso_language_code' : str(feed[count]['metadata']['iso_language_code']),
            'metadata_result_type' : str(feed[count]['metadata']['result_type']),
            #'retweeted_status_created_at' : str(feed[count]['retweeted_status']['created_at']),
            #'retweeted_status_id' : str(feed[count]['retweeted_status']['id']),
            #'retweeted_status_id_str' : str(feed[count]['retweeted_status']['id_str']),
            #'retweeted_status_full_text' : str(feed[count]['retweeted_status']['full_text']),
            #'retweeted_status_truncated' : str(feed[count]['retweeted_status']['truncated']),
            #'retweeted_status_display_text_range' : str(feed[count]['retweeted_status']['display_text_range']),
            #'retweeted_status_source' : str(feed[count]['retweeted_status']['source']),
            #'retweeted_status_in_reply_to_status_id' : str(feed[count]['retweeted_status']['in_reply_to_status_id']),
            #'retweeted_status_in_reply_to_status_id_str' : str(feed[count]['retweeted_status']['in_reply_to_status_id_str']),
            #'retweeted_status_in_reply_to_user_id' : str(feed[count]['retweeted_status']['in_reply_to_user_id']),
            #'retweeted_status_in_reply_to_user_id_str' : str(feed[count]['retweeted_status']['in_reply_to_user_id_str']),
            #'retweeted_status_in_reply_to_screen_name' : str(feed[count]['retweeted_status']['in_reply_to_screen_name']),
            #'retweeted_status_geo' : str(feed[count]['retweeted_status']['geo']),
            #'retweeted_status_coordinates' : str(feed[count]['retweeted_status']['coordinates']),
            #'retweeted_status_place' : str(feed[count]['retweeted_status']['place']),
            #'retweeted_status_contributors' : str(feed[count]['retweeted_status']['contributors']),
            #'retweeted_status_is_quote_status' : str(feed[count]['retweeted_status']['is_quote_status']),
            #'retweeted_status_retweet_count' : str(feed[count]['retweeted_status']['retweet_count']),
            #'retweeted_status_favorite_count' : str(feed[count]['retweeted_status']['favorite_count']),
            #'retweeted_status_favorited' : str(feed[count]['retweeted_status']['favorited']),
            #'retweeted_status_retweeted' : str(feed[count]['retweeted_status']['retweeted']),
            #'retweeted_status_possibly_sensitive' : str(feed[count]['retweeted_status']['possibly_sensitive']),
            #'retweeted_status_lang' : str(feed[count]['retweeted_status']['lang']),
            #'retweeted_status_entities_hashtags' : str(feed[count]['retweeted_status']['entities']['hashtags']),
            #'retweeted_status_entities_symbols' : str(feed[count]['retweeted_status']['entities']['symbols']),
            #'retweeted_status_entities_user_mentions' : str(feed[count]['retweeted_status']['entities']['user_mentions']),
            #'retweeted_status_entities_urls' : str(feed[count]['retweeted_status']['entities']['urls']),
            #'retweeted_status_entities_media' : str(feed[count]['retweeted_status']['entities']['media']),
            #'user_entities_description_urls': str(feed['user']['entities']['description']['urls'])
            'word_list':  str(feed[count]['full_text']).split(' ')
        }

        es.index(index=index_name, body=doc)
        
        count +=1
    
    print(f'Processed {tweet_count} records of {search} to {server}')
    
# Set credentials 
def setConfig(server):
    # Import keys from a saved file instead of inputting it directly into the script.  
    # Strip whitespaces and split on = as I only want the key values
    key_location = 'twitter.keys'
    apikeys = []

    global api
    global es

    with open(key_location) as keys:
        for i in keys:
            apikeys.append(i.split("=")[1].strip(" ").strip("\n"))
    keys.close()

    # Initialize dictionary
    #twitter_cred = dict()

    # Enter API keys
    twitter_cred["CONSUMER_KEY"] = apikeys[0]
    twitter_cred["CONSUMER_SECRET"] = apikeys[1]

    # Access Tokens
    twitter_cred["ACCESS_KEY"] = apikeys[2]
    twitter_cred["ACCESS_SECRET"] = apikeys[3]

    # Set authentication object
    auth = tw.OAuthHandler(twitter_cred["CONSUMER_KEY"], twitter_cred["CONSUMER_SECRET"])
    auth.set_access_token(twitter_cred["ACCESS_KEY"], twitter_cred["ACCESS_SECRET"])

    # Create api object with authentication
    api = tw.API(auth, wait_on_rate_limit=True)

    # Set Elasticsearch Server
    es = Elasticsearch(server, port=9200)

For the purpose of this notebook, I the below cell is a modified version that allows me to just execute the functions.

In [8]:
# You can modify this cell

try:
    idx = 'default-'
    tweet_count = 100
    search = 'palantir OR PLTR'

    setConfig('127.0.0.1')
    acqData(str(search), int(tweet_count))
    
except:
    pass

: :Acquiring Data::
: :Transferring to Elasticsearch Search::


## Setup Kibana Index

Great!  We got data into Elasticsearch.  Now, let's create a Kibana index of it:

1. Browse to http://127.0.0.1:5601 > Click on "Explore on my own"

![1_initial.png](images/kibana/1_initial.png)

2. Click "Stack management"

![2_stack_management.png](images/kibana/2_stack_management.png)

3. Create Index Patterns

![3_index_patterns.png](images/kibana/3_index_patterns.png)

4. Create Index

![4_create_index.png](images/kibana/4_create_index.png)

5. Match "default"

![5_default.png](images/kibana/5_default.png)

6. Select Time Filter

![6_time_filter.png](images/kibana/6_time_filter.png)

7. Complete and finish up

![7_complete.png](images/kibana/7_complete.png)

8. Go to "Discover"

![8_discover.png](images/kibana/8_discover.png)

9. Select Time Period

![9_thisWeek.png](images/kibana/9_thisWeek.png)

The initial questions we are trying to answer are:

- What are the most retweets of the specific search term? (Tracking topic popularity)
- How many unique users are tweeting about a given search term? (Is it trending?)
- What other tweets are they posting about the search term? (Are they a troll / bot?)
- How many likes does the person have on the search term? (Trending / Sentiments)?

The following fields seems interesting when Scanning through the "Available Fields":

- full_text
- lang
- retweet_count
- user_created_at
- user_screen_name
- user_followers_count
- user_friends_count
- user_statuses_count
- user_created_at

I added lang field in so I can later figure how what is the sentiments per "language" ratio.  For now, I'm filtering out non-english.

![1_parsing_english](images/eda_kibana/1_parse_english.png)

Filtering english-only reduces my sample from 400 to 262.

Scanning through the full_text, I see there are a bunch of retweets that I want to aggregate into 1.  The stats show that the top 5 values are retweets and accounted for X percentage of the 262 tweets.  We see that the top retweet accounted for 19.1%.  Let's explore that.

![2_full_text_stats.png](images/eda_kibana/2_full_text_stats.png)

I got super excited as I thought it was that easy and found all the bots.  However, a closer look.  retweet_count is NOT the amount of times the user_screen_name retweeted.  It's how many times the full_text was retweeted.

![3_retweet_count.png](images/eda_kibana/3_retweet_count.png)



![2_kibana_grabbing_relevent_info.png](images/2_kibana_grabbing_relevent_info.png)

## What are the most retweets of the specific search term? (Tracking topic popularity)
Scanning through the defaults fields list we see:
    
![retweet](images/eda_kibana/retweet.png)

Now, let's create a visualization:

Click on "Visualize" > "Create Visualization" > select "Data Table" > select the index to look in "default*" > Click "Add Bucket" > "Split rows" > Select Aggregation > Terms:
- Field: full_text.keyword
- Change size from 5 to 25

Click "> Update"

![buckets.png](images/eda_kibana/4_buckets.png)

**The answer to the question is: "RT @_whitneywebb: NEW ARTICLE: A secretive AI platform powered by Palantir will soon be fed data from a new national "smart sewer" networ…" with a count of 50 retweets during the time of collection**

## How many unique users are tweeting about a given search term? (Is it trending?)
We're going to dive right into this with a pie graph and a counter:

Visualize > Create New Visualization > Metric > Default* > Metrics > Aggregation: Unique Count > Field: user_screen_name.keyword > Save the Visualization

Visualize > Create new Visualization > Pie > Buckets: Split Slices > Aggregation: Terms > Field: user_screen_name.keyword > size 10

Visualize > Create new Visualization > Line > Metrics (Y-axis): Count
Buckets (X-axis): Aggregation: Terms > Field: user_screen_name.keyword > size 10

Now that we have the visualizations, we can create a dashboard:

Create new dashboard:
- Add all the visualizations

Search for "A secretive AI platform powered by Palantir will soon be fed data from a new national"

![dashboard](images/eda_kibana/dashboard.png)

**The answer is 14 unique users retweeting the same thing during this time frame.  They tweeted the same amount of times (4), maybe they are bots?**

## What other tweets are they posting about the search term? (Are they a troll / bot?)
Since we have the visualizations and dashboard setup already, we can just modify the search criteria to:

```text
"purofierro666" OR "pennys_shevy" OR "nevrsurrendr" OR "eustache_luigy" OR "eric_davao" OR "Inanna432" OR "LisaMP925" OR "SultryRobin" OR "UsernameNAB" OR "WarmongerExpose"
```

![img_user_count](images/eda_kibana/users_count.png)

It's all the same retweet.  But there's a few things that we have to keep in mind:
1. Our collection script is only acquiring the search keywords and nothing else, if the users didn't tweet about the search keywords, we wouldn't pick it up.  
2. My search timeframe is pretty narrow, I collected only 100 tweets few minutes apart.
3. I have to play around with the #tweet count.  Users reportedly tweeted 4 times, but I can't seem to find it.

**I would still list this as unanswered because of the limitation of our collection script and perhaps how we are processing**

## How many likes does the person have on the search term? (Trending / Sentiments)?
Looking at the data, it looks like I have to adjust the script to take in additional metadata.  This question will remain unanswered.  This may not even be relevent.  I added this question as a technique to detecting potential bots and perform chain link analysis of the bot network.  But I don't know if this will work out the way I wanted. 

## Parting Thoughts

Out of the initial questions that I had, I was able to definitively answer 2 of 4 for my time period.  I discovered that I am missing some fields that I may need and that some questions may not be relevent.  Once I build out the visualizations and the dashboard, I can modify the search parameter and all visualizations will update.  Kibana is nice to visualize data fast and efficiently.

part2b will be going back to the script to see what we can modify and adjust to take in additional information and to enrich the records further.