# Gradient Twitter Analysis Examples

**Author: Newton Campbell** - newtonh20@ieee.org

In this notebook, we demonstrate how to load the offline Twitter data for the summer semester Gradient project into a Python notebook. We then show a couple of examples of cleaning, sentiment analysis and how to form a graph based on semantic similarity of Tweets. Enjoy!

## First, some references
Okay, for those of you who haven't gotten a chance to play with NLP before, we are providing you a list of resources that we hope will be helpful:

#### Libraries and open source resources
* spaCy ([website](https://spacy.io/), [blog](https://explosion.ai/blog)) \[Python; emerging open-source library with [fantastic usage examples](https://spacy.io/usage/spacy-101), [API documentation](https://spacy.io/api), and [demo applications](https://spacy.io/universe)\]
* Natural Language Toolkit (NLTK) ([website](https://www.nltk.org/), [book](https://www.nltk.org/book/)) \[Python; practical intro to programming for NLP, mainly used for teaching\]
* Stanza CoreNLP ([website](https://stanfordnlp.github.io/stanza/)) \[Python; high-quality analysis toolkit\]
* AllenNLP ([website](https://allennlp.org/)) \[Python; NLP research library built on PyTorch\]
* TensorFlow Tutorials ([website](https://www.tensorflow.org/hub/tutorials)) \[Python; not the first thing folks think of for this kind of NLP, but it has some interesting classification capabilities that may come in handy\]
* R Text Mining Libraries ([website](https://cran.r-project.org/web/views/NaturalLanguageProcessing.html)) \[You read that right; R has tons of open-source libraries that you can use for Text Analysis, [even some that are ports of Python libraries](https://towardsdatascience.com/r-packages-for-text-analysis-ad8d86684adb)\]

#### Oh, and can't forget graph libraries
You will likely need a graph library for this project. I would give you a list, but it would just be a subset of the one that [you will find here](https://wiki.python.org/moin/PythonGraphLibraries). NetworkX and igraph are two of my go-to libraries. They include community detection (clustering for graphs) algorithms and are fairly straightforward to use.


You will also want to play with some of the basic examples that <a href="https://towardsdatascience.com/getting-started-with-natural-language-processing-nlp-2c482420cc05">can be found here</a> and look into the "DIY projects and data sets" section <a href="https://towardsdatascience.com/how-to-get-started-in-nlp-6a62aa4eaeff">at the bottom of this page to really get your feet wet.</a> The sheer amount there is to know can seem a little daunting at first. Just try to get a couple of working examples running, and remember to keep exploring through the semester in tandem with your work.

# Install and import libraries

Let's start by importing some libraries that will help with an analysis of Twitter data.

In [None]:
# !pip install langdetect
# !pip install pycountry
# !pip install emoji
# !python -m spacy download en_core_web_md

# Import Libraries
from textblob import TextBlob
import sys
import tweepy
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import os
import nltk
# VADER's lexicon is needed by SentimentIntensityAnalyzer below (you may need to download other NLTK resources for other datasets)
nltk.download('vader_lexicon')
import spacy
import en_core_web_md
nlp = en_core_web_md.load()
# nlp = spacy.load('en_core_web_sm')           # Alternative: a smaller, faster model; it ships without word vectors, so similarity scores will be less reliable
import networkx as nx                        # a really useful network analysis library
import pycountry
import emoji
import re
import string
from wordcloud import WordCloud, STOPWORDS
from PIL import Image
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from langdetect import detect
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer

# Load an Offline Dataset

Now, let's load one of the offline datasets for the project. As stated in the project launch document, you will not have to use the Twitter API for this project. We've downloaded offline versions for you.

**To run this code, you will have to change the value of the offline_tweets variable to the smallest file from Challenge 2:**

In [None]:
# Change this
offline_tweets = 'Infrastructure BillSearchTerm - Infrastructure BillSearchTerm.csv'

offline_tweets_df = pd.read_csv(offline_tweets)
num_tweets = len(offline_tweets_df.index)
display(offline_tweets_df)

## A Little Data Cleaning

To properly evaluate the Tweets' semantic meaning, you usually have to clean up the text a little. It's just easier for most third-party libraries to work with. But you should also check whether a third-party library that you're using has its own cleaning function.

**Here, we clean the text with a helper function applied through a lambda: it strips the retweet marker (RT), @-mentions, links and punctuation characters, and finally converts everything to lowercase.**

In [None]:
from bs4 import BeautifulSoup
from nltk.tokenize import WordPunctTokenizer
import ast

tok = WordPunctTokenizer()
pat1 = r'@[A-Za-z0-9_]+'            # @-mentions (Twitter handles may contain underscores)
pat2 = r'https?://[A-Za-z0-9./]+'   # links
pat3 = r'\bRT\b'                    # retweet marker
combined_pat = r'|'.join((pat1, pat2, pat3))

def tweet_cleaner(text):
    # Strip any HTML markup and decode entities such as &amp;
    soup = BeautifulSoup(text, 'lxml')
    souped = soup.get_text()
    stripped = re.sub(combined_pat, '', souped)
    # No byte decoding is needed here: in Python 3 the text is already a str,
    # and the next step drops every non-letter character anyway
    letters_only = re.sub("[^a-zA-Z]", " ", stripped)
    lower_case = letters_only.lower()
    # The letters_only step above leaves unnecessary white space behind,
    # so tokenize and re-join to collapse it
    words = tok.tokenize(lower_case)
    return (" ".join(words)).strip()

# This assumes the CSV stores each Tweet as the printed form of a bytes literal (b'...'),
# so parse it back into bytes and decode it before cleaning
offline_tweets_df['text'] = offline_tweets_df['text'].apply(lambda x: ast.literal_eval(x).decode('utf-8'))
offline_tweets_df['text'] = offline_tweets_df['text'].map(tweet_cleaner)   # tweet_cleaner already lowercases
offline_tweets_df['text']
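
To make the two preprocessing steps above concrete, here is a minimal, standard-library-only sketch run on a made-up Tweet. It assumes, as the notebook code does, that each CSV cell holds the printed form of a Python bytes literal (`b'...'`). The sample string and the `simple_clean` helper are purely illustrative and are not part of the project data or code.

```python
import ast
import re

# Hypothetical CSV cell: the printed form of a bytes literal
raw_cell = "b'RT @SenatorX: The #Infrastructure Bill passed! https://t.co/abc123'"

# Step 1: parse the literal back into bytes, then decode the bytes into a str
decoded = ast.literal_eval(raw_cell).decode('utf-8')

def simple_clean(text):
    """Illustrative cleaner: drop RT, mentions and links, then keep letters only."""
    text = re.sub(r'\bRT\b', ' ', text)          # retweet marker
    text = re.sub(r'@\w+', ' ', text)            # @-mentions
    text = re.sub(r'https?://\S+', ' ', text)    # links
    text = re.sub(r'[^a-zA-Z]', ' ', text)       # digits, punctuation, symbols
    return ' '.join(text.lower().split())        # lowercase and collapse spaces

print(decoded)
print(simple_clean(decoded))
```

Notice that the hashtag's text survives while the `#` itself is dropped. Whether that is what you want depends on your analysis, so treat the patterns as a starting point.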

# Sentiment Analysis

Sentiment analysis can help us decipher the mood and emotions of the general public and gather insightful information about the context. It is the process of analyzing data and classifying it according to the needs of the research. These sentiments can then be used to better understand various events and their impact. [Bing Liu](https://www.cs.uic.edu/~liub/FBS/SentimentAnalysis-and-OpinionMining.pdf) highlights that the research literature uses many different names, e.g. “sentiment analysis, opinion mining, opinion extraction, sentiment mining, subjectivity analysis, affect analysis, emotion analysis, review mining”; however, all of them have similar purposes and belong to the subject of sentiment analysis or opinion mining. By analysing these sentiments, we may find what people like, what they want and what their major concerns are.

**Now we have a set of Tweets, loaded into a data frame, that we can mine for various purposes. Next, we will use TextBlob to calculate polarity, and NLTK's VADER analyzer to calculate the positive, negative, neutral and compound scores from the text.**

In [None]:
# Sentiment Analysis
def percentage(part, whole):
    return 100 * float(part) / float(whole)

positive = 0
negative = 0
neutral = 0
polarity = 0
tweet_list = []
neutral_list = []
negative_list = []
positive_list = []

# Build the VADER analyzer once and reuse it, rather than constructing it for every Tweet
analyzer = SentimentIntensityAnalyzer()

for index, tweet in offline_tweets_df.iterrows():
    tweet_list.append(tweet.text)
    analysis = TextBlob(tweet.text)                # TextBlob supplies the polarity
    score = analyzer.polarity_scores(tweet.text)   # VADER supplies neg/neu/pos/compound
    neg = score['neg']
    neu = score['neu']
    pos = score['pos']
    comp = score['compound']
    polarity += analysis.sentiment.polarity

    # Label each Tweet by whichever of VADER's pos/neg scores is larger
    if neg > pos:
        negative_list.append(tweet.text)
        negative += 1
    elif pos > neg:
        positive_list.append(tweet.text)
        positive += 1
    else:
        neutral_list.append(tweet.text)
        neutral += 1

positive = percentage(positive, num_tweets)
negative = percentage(negative, num_tweets)
neutral = percentage(neutral, num_tweets)
polarity = percentage(polarity, num_tweets)   # mean TextBlob polarity, scaled by 100
# Note: after formatting, these three hold strings such as '42.0'
positive = format(positive, '.1f')
negative = format(negative, '.1f')
neutral = format(neutral, '.1f')
#Number of Tweets (Total, Positive, Negative, Neutral)
tweet_list = pd.DataFrame(tweet_list)
neutral_list = pd.DataFrame(neutral_list)
negative_list = pd.DataFrame(negative_list)
positive_list = pd.DataFrame(positive_list)
print("Total Tweets: ",len(tweet_list))
print("positive number: ",len(positive_list))
print("negative number: ", len(negative_list))
print("neutral number: ",len(neutral_list))
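
The labelling rule in the loop above compares VADER's `pos` and `neg` scores directly. A common alternative, suggested in VADER's own documentation, is to threshold the `compound` score at ±0.05. The sketch below contrasts the two rules. The score dictionaries are made-up values in the shape that `polarity_scores` returns, not real analyzer output, and both function names are just for illustration.

```python
def label_by_pos_neg(score):
    """Rule used in this notebook: whichever of pos/neg is larger wins."""
    if score['neg'] > score['pos']:
        return 'negative'
    if score['pos'] > score['neg']:
        return 'positive'
    return 'neutral'

def label_by_compound(score, threshold=0.05):
    """Alternative rule: threshold the compound score."""
    if score['compound'] >= threshold:
        return 'positive'
    if score['compound'] <= -threshold:
        return 'negative'
    return 'neutral'

# Hypothetical scores, invented for this example
examples = [
    {'neg': 0.00, 'neu': 1.00, 'pos': 0.00, 'compound': 0.00},
    {'neg': 0.10, 'neu': 0.78, 'pos': 0.12, 'compound': 0.03},
    {'neg': 0.40, 'neu': 0.60, 'pos': 0.00, 'compound': -0.60},
]

for s in examples:
    print(label_by_pos_neg(s), '|', label_by_compound(s))
```

The second example is the interesting one: the first rule calls it positive because `pos` barely edges out `neg`, while the compound rule treats it as neutral. With the first rule, very few Tweets end up in the neutral bucket, which is worth keeping in mind when you read the pie charts.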

**We can create a straightforward pie chart to profile the data in a more meaningful way:**

In [None]:
# Creating the pie chart
plt.style.use('default')   # set the style before drawing so it applies to this figure
labels = ['Positive ['+str(positive)+'%]' , 'Neutral ['+str(neutral)+'%]','Negative ['+str(negative)+'%]']
sizes = [float(positive), float(neutral), float(negative)]   # the percentages were formatted as strings above
colors = ['yellowgreen', 'blue','red']
patches, texts = plt.pie(sizes, colors=colors, startangle=90)
plt.legend(labels)
plt.title("Sentiment Analysis Results for a short list of Infrastructure Bill Tweets")
plt.axis('equal')
plt.show()

**Using the Piper SME Political Typology described in the Project Launch document, let's take a look at this chart for Democrats (ID=-1) and Republicans (ID=2):**

In [None]:
# First, let's establish the Typology dictionary
typology_dict = {'Fringe Left' : -3, 'Progressive' : -2, 'Democrat' : -1, 'Centrist' : 0, 'Libertarian' : 1, 'Republican' : 2, 'Trump-Republican' : 2.5, 'Fringe Right' : 3}
# Reverse lookup (ID -> name), used for the chart titles
typology_names = {value: name for name, value in typology_dict.items()}

analyzer = SentimentIntensityAnalyzer()   # build once and reuse for every Tweet
plt.style.use('default')                  # set the style before drawing

for group in [typology_dict['Democrat'], typology_dict['Republican']]:
    group_df = offline_tweets_df[offline_tweets_df['tweet category'] == group]
    rows_in_group = len(group_df.index)
    if rows_in_group == 0:
        # Avoid dividing by zero when a group has no Tweets in this file
        print("No Tweets found for", typology_names[group])
        continue

    positive = 0
    negative = 0
    neutral = 0
    polarity = 0
    tweet_list = []
    neutral_list = []
    negative_list = []
    positive_list = []

    for index, tweet in group_df.iterrows():
        tweet_list.append(tweet.text)
        analysis = TextBlob(tweet.text)
        score = analyzer.polarity_scores(tweet.text)
        neg = score['neg']
        neu = score['neu']
        pos = score['pos']
        comp = score['compound']
        polarity += analysis.sentiment.polarity

        if neg > pos:
            negative_list.append(tweet.text)
            negative += 1
        elif pos > neg:
            positive_list.append(tweet.text)
            positive += 1
        else:
            neutral_list.append(tweet.text)
            neutral += 1

    positive = percentage(positive, rows_in_group)
    negative = percentage(negative, rows_in_group)
    neutral = percentage(neutral, rows_in_group)
    polarity = percentage(polarity, rows_in_group)

    # Number of Tweets (Total, Positive, Negative, Neutral)
    tweet_list = pd.DataFrame(tweet_list)
    neutral_list = pd.DataFrame(neutral_list)
    negative_list = pd.DataFrame(negative_list)
    positive_list = pd.DataFrame(positive_list)

    # Creating the pie chart
    labels = ['Positive [' + format(positive, '.1f') + '%]',
              'Neutral [' + format(neutral, '.1f') + '%]',
              'Negative [' + format(negative, '.1f') + '%]']
    sizes = [positive, neutral, negative]
    colors = ['yellowgreen', 'blue', 'red']
    patches, texts = plt.pie(sizes, colors=colors, startangle=90)
    plt.legend(labels)
    plt.title("Sentiment Analysis Results for short list of Infrastructure Bill Tweets: " + typology_names[group])
    plt.axis('equal')
    plt.show()

In this dataset, there seems to be a higher percentage of Negative Tweets about the Infrastructure Bill for the Democrats than for the Republicans. You will see why that is when you scan through the Tweets yourself.
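
The overall cell and the per-group cell above repeat the same counting logic. If you plan to compare more groups, one option is to pull that logic into a small helper. This is only a sketch: `sentiment_percentages` is a hypothetical name, and it expects a list of labels ('positive', 'neutral', 'negative') that you have already computed for each Tweet.

```python
from collections import Counter

def sentiment_percentages(labels):
    """Return the share of each sentiment label, in percent, rounded to one decimal."""
    names = ('positive', 'neutral', 'negative')
    total = len(labels)
    if total == 0:
        # An empty group has no meaningful percentages; report zeros instead of dividing by zero
        return {name: 0.0 for name in names}
    counts = Counter(labels)
    return {name: round(100 * counts[name] / total, 1) for name in names}

# Made-up labels for a group of four Tweets
demo_labels = ['positive', 'negative', 'negative', 'neutral']
print(sentiment_percentages(demo_labels))
```

You could then call the helper once per typology group and pass the resulting values straight to `plt.pie`.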

# Some Graph Theory

Here, we will use spaCy to parse the Tweets and get an understanding of how similar they are. Much of this was derived [from this Kaggle example.](https://www.kaggle.com/caractacus/thematic-text-analysis-using-spacy-networkx)

In [None]:
tokens = []
lemma = []
pos = []
parsed_doc = []
col_to_parse = 'text'

for doc in nlp.pipe(offline_tweets_df[col_to_parse].astype(str).values, batch_size=50,
                        n_process=3):
    # spaCy v3 replaced Doc.is_parsed with Doc.has_annotation("DEP"); support either version
    is_parsed = doc.has_annotation("DEP") if hasattr(doc, "has_annotation") else doc.is_parsed
    if is_parsed:
        parsed_doc.append(doc)
        tokens.append([n.text for n in doc])
        lemma.append([n.lemma_ for n in doc])
        pos.append([n.pos_ for n in doc])
    else:
        # We want to make sure that the lists of parsed results have the same number of
        # entries as the original DataFrame, so add some blanks in case the parse fails
        parsed_doc.append(None)
        tokens.append(None)
        lemma.append(None)
        pos.append(None)


offline_tweets_df['parsed_doc'] = parsed_doc
offline_tweets_df['comment_tokens'] = tokens
offline_tweets_df['comment_lemma'] = lemma
offline_tweets_df['pos_pos'] = pos
offline_tweets_df.head()

## Remove Stopwords

We can increase the signal-to-noise ratio in these Tweets by removing some of the more common words (or stopwords). Removing them from the Tweets prevents them from influencing the analysis of whether two Tweets are similar.

**For now, let's look at what words are included in spaCy's stopword list.**

In [None]:
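from spacy.lang.en.stop_words import STOP_WORDS   # explicit import, so this cell does not rely on the model having loaded the submodule
stop_words = STOP_WORDS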
print('Number of stopwords: %d' % len(stop_words))
print(list(stop_words))
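
The cell above only prints the list. As a minimal sketch of the removal step itself, the helper below filters a token list against any set of stopwords. The `remove_stopwords` name and the tiny `demo_stop_words` set are made up for illustration; in the notebook you would pass spaCy's `stop_words` instead.

```python
def remove_stopwords(tokens, stop_words):
    """Keep only the tokens whose lowercase form is not a stopword."""
    return [t for t in tokens if t.lower() not in stop_words]

# Tiny illustrative stopword set (spaCy's real list is much longer)
demo_stop_words = {'the', 'is', 'a', 'of', 'for', 'and'}

tokens = ['The', 'infrastructure', 'bill', 'is', 'a', 'win', 'for', 'roads', 'and', 'bridges']
print(remove_stopwords(tokens, demo_stop_words))
```

Applied to this notebook, something like `offline_tweets_df['comment_lemma'].apply(lambda l: remove_stopwords(l, stop_words))` should work, assuming every Tweet parsed successfully so that no entry is `None`.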

**Now, let's take a look at spaCy's similarity function:**

In [None]:
print(offline_tweets_df['parsed_doc'][0].similarity(offline_tweets_df['parsed_doc'][1]))
print(offline_tweets_df['parsed_doc'][0].similarity(offline_tweets_df['parsed_doc'][10]))
print(offline_tweets_df['parsed_doc'][1].similarity(offline_tweets_df['parsed_doc'][10]))
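
By default, spaCy's `Doc.similarity` estimates similarity as the cosine of the angle between the two documents' vectors, where a document's vector is the average of its token vectors. The sketch below reproduces that arithmetic in plain Python on made-up 3-dimensional token vectors (the real `en_core_web_md` vectors have 300 dimensions). The helper names and the numbers are invented for illustration.

```python
from math import sqrt

def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical token vectors for three very short "Tweets"
tweet_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # two tokens
tweet_b = [[1.0, 1.0, 0.0]]                    # one token pointing the same way as tweet_a's average
tweet_c = [[0.0, 0.0, 1.0]]                    # one token orthogonal to both

print(cosine(average(tweet_a), average(tweet_b)))   # close to 1: same direction
print(cosine(average(tweet_a), average(tweet_c)))   # 0: nothing in common
```

Because the token vectors are averaged, word order is ignored, and common words pull every Tweet's vector toward the same region. That is one reason the stopword discussion above matters for similarity scores.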

Now, we can form a graph where each node represents an individual Tweet and each edge is weighted by the similarity between the two Tweets it connects. We start out by making the graph fully connected (every node connects to every other node), and then we remove the edges whose similarity falls below a certain threshold.

*Sidenote: I quickly grabbed this example from towardsdatascience.com and ran with it to show you the basics. But there is a much, much better way to do this, performance-wise. Can you figure out what it is?*

In [None]:
# Every pair of Tweets gets compared, so the work grows quadratically with the number of Tweets!
raw_G = nx.Graph() # undirected
n = 0

for i in offline_tweets_df['parsed_doc']:        # sure, it's inefficient, but it will do
    for j in offline_tweets_df['parsed_doc']:
        if i != j:
            if not (raw_G.has_edge(j, i)):
                sim = i.similarity(j)
                raw_G.add_edge(i, j, weight = sim)
                n = n + 1

print(raw_G.number_of_nodes(), "nodes, and", raw_G.number_of_edges(), "edges created.")

**Now, let's remove edges with similarity below 0.85**

In [None]:
min_wt = 0.85      # this is our cutoff value for a minimum edge-weight

# edges(data='weight') yields each undirected edge exactly once as (u, v, weight),
# so there is no need to de-duplicate (u, v) against (v, u)
edges_to_kill = [(u, v) for u, v, wt in raw_G.edges(data='weight') if wt < min_wt]

print("\n", len(edges_to_kill), "edges to kill (of", raw_G.number_of_edges(), ")")
raw_G.remove_edges_from(edges_to_kill)

**Now let's visualize this Graph to see how connected it is:**

In [None]:
strong_G = raw_G   # note: this is a second name for the same graph object, not a copy
nx.draw(strong_G, node_size=20, edge_color='gray')

Visualising the whole graph, but only those links of weights above a certain cutoff, allows us to get a feel for a good cutoff level to use when visualising the structure. Having filtered out these lower-weighted links, we can clean up the graph by removing the isolates. This will enable the layout engine to show us more of the structure of the components.

In [None]:
strong_G.remove_nodes_from(list(nx.isolates(strong_G)))
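
If the build-threshold-prune sequence feels abstract, here it is on a toy example in plain Python, with no graph library involved. The four Tweet IDs and their pairwise similarity scores are made up. The point is only to show what survives a cutoff of 0.85 and which nodes end up as isolates.

```python
from itertools import combinations

tweets = ['t0', 't1', 't2', 't3']

# Hypothetical similarity score for every unordered pair of Tweets
similarity = {
    ('t0', 't1'): 0.92, ('t0', 't2'): 0.40, ('t0', 't3'): 0.30,
    ('t1', 't2'): 0.88, ('t1', 't3'): 0.20, ('t2', 't3'): 0.50,
}
min_wt = 0.85

# Step 1: the fully connected graph has one edge per unordered pair
all_edges = list(combinations(tweets, 2))

# Step 2: keep only the edges at or above the cutoff
strong_edges = [pair for pair in all_edges if similarity[pair] >= min_wt]

# Step 3: a node with no surviving edge is an isolate
connected = {node for pair in strong_edges for node in pair}
isolates = [t for t in tweets if t not in connected]

print(len(all_edges), "edges before filtering")
print("strong edges:", strong_edges)
print("isolates:", isolates)
```

Here `t3` is not similar enough to anything, so it drops out, while `t0`, `t1` and `t2` stay linked in a chain even though `t0` and `t2` are not directly similar.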

We can also tweak the layout algorithm by, for example, changing the ideal distance at which the repulsive and attractive forces are in equilibrium.

In [None]:
from math import sqrt
count = strong_G.number_of_nodes()
equilibrium = 10 / sqrt(count)    # default for this is 1/sqrt(n), but this will 'blow out' the layout for better visibility
pos = nx.fruchterman_reingold_layout(strong_G, k=equilibrium, iterations=300)
nx.draw(strong_G, pos=pos, node_size=10, edge_color='gray')
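
To get a feel for the `k` value used above, it helps to look at the arithmetic. NetworkX documents `k` as the optimal distance between nodes, defaulting to `1/sqrt(n)`. The short sketch below compares that default with the ten-times-larger value from the cell above for a few graph sizes. The two helper names are hypothetical.

```python
from math import sqrt

def default_k(n):
    """NetworkX's documented default optimal distance for n nodes."""
    return 1 / sqrt(n)

def spread_k(n, factor=10):
    """The scaled-up distance used in the cell above (factor / sqrt(n))."""
    return factor / sqrt(n)

for n in (25, 100, 400):
    print(n, "nodes -> default k:", default_k(n), " spread k:", spread_k(n))
```

Both values shrink as the graph grows, so the factor of 10 is a knob to experiment with rather than a fixed rule.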

In [None]:
nx.draw(strong_G, node_size = 10)