### Pre-talk notes for Speaker!
During talk:
* Minimise file browser
* Move to this folder: `cd .\Documents\GitHub\working-with-twitter-data\`
* Zoom in
* Clear cells
* Share public link -> https://github.com/UKDataServiceOpen/working-with-twitter-data/blob/main/TidyingDemo.ipynb

Talk time - 20 minutes

# Twarc Tidying and Analysis
This notebook will cover the exploration, tidying up and some basic analysis of the data collected by the [TwarcDemo in this repo](https://github.com/UKDataServiceOpen/working-with-twitter-data/blob/main/TwarcDemo.ipynb)

We will be using 1000 vegan tweets from "Veganuary" 2019. If you followed along with the Twarc demo you will have collected your own; if I didn't show the demo, or you didn't follow along, a provided version of this data is in `data/demoData.csv`.

So let's import some packages and read it in. 

In [None]:
import pandas as pd # Our data manipulation library
import numpy as np # Support for matrices, and other table-like shapes

In [None]:
# Tweak default plotting styles
import matplotlib.pyplot as plt
# Note: matplotlib 3.6+ renamed this style to 'seaborn-v0_8-whitegrid'
plt.style.use('seaborn-whitegrid')
plt.rcParams.update({'font.size': 22,
                    'figure.figsize':(24,8)})

In [None]:
# Read our data into a dataframe using pandas
data = pd.read_csv('data/demoData.csv')

# The head function prints out the first 5 rows.
data.head()

We can test that our tweets are real by taking an ID from the first column and substituting it for the ID in the URL of any tweet we can find, which I will demo!
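As a rough sketch of that check, the helper below builds a link from a tweet ID. The function name `tweet_url` and the example ID are hypothetical, and the `/i/web/status/` path is an assumption about how Twitter resolves bare IDs, so treat it as illustrative rather than guaranteed.

```python
def tweet_url(tweet_id):
    """Build a URL that should resolve to the tweet with this ID."""
    # Assumption: the generic /i/web/status/ path redirects to the tweet,
    # so we do not need to know the author's username.
    return f"https://twitter.com/i/web/status/{int(tweet_id)}"


# Hypothetical ID, for illustration only
print(tweet_url(1234567890))
```

With the real dataframe this could be called as `tweet_url(data['id'][0])`.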

Our data has been read in successfully, so let's print out some of the tweet text to make sure it has something to do with veganism.

In [None]:
for index in [1,2,3,4,5]:
    print(data['text'][index])
    print('\n')


In [None]:
# I always recommend running info() for basic type information.
data.info()

In [None]:
# and describe() for statistical info.
# data.describe()
# Or to suppress scientific notation
data[['author.public_metrics.followers_count','public_metrics.like_count','public_metrics.retweet_count']].describe().apply(lambda s: s.apply('{0:.0f}'.format))

At this point I am thinking we have too many columns to analyse. It's worth asking whether there is anything we could remove now, though if we are still exploring this may be premature.

So that looks good to me. You might notice there are some retweets in here. I personally prefer removing retweets and replies where possible, as they complicate our analysis with duplicates and with tweets that don't make sense without their context.

## Removing retweets and replies.
We need a way of detecting these. You might notice retweets start with the two characters "RT". There is also a column that might help, so let's check out the type column.

We can call the value_counts function on any column.

In [None]:
data['type'].value_counts()

In [None]:
# We can also quickly call a plot function on any of these generated dataframes or value_counts.
data['type'].value_counts().plot(kind='bar')

# Though we are missing lots of Tweets here
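The tweets go missing because `value_counts` skips missing values by default. A small sketch on a toy series (the type labels here are made up for illustration) shows how `dropna=False` brings them back:

```python
import pandas as pd

# Toy stand-in for the 'type' column; None marks a regular tweet
types = pd.Series(['retweeted', None, 'replied_to', None, None, 'quoted'])

# By default, rows with a missing value are left out of the count
default_counts = types.value_counts()

# With dropna=False, missing values get a row of their own
all_counts = types.value_counts(dropna=False)

print(default_counts.sum())
print(all_counts.sum())
```

On the real data, `data['type'].value_counts(dropna=False)` should account for every row.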

In [None]:
# Regular tweets have a type "null"
# We can use boolean indexing to select only the rows that match this null condition.
len(data[data['type'].isnull()])

In [None]:
# So let's select all the tweets that are typed as null.
data = data[data['type'].isnull()]
len(data)

In [None]:
# The length is correct, but let's check things look okay
data.head()

In [None]:
# We need to reindex as well, dropping the old index rather than keeping it as a column
data = data.reset_index(drop=True)

In [None]:
# Let's check the first few tweets again
for index in [1,2,3,4,5]:
    print(data['text'][index])
    print('\n')

Annoyingly, removing tweets marked with type retweet doesn't seem to get them all. This is one of those many things with the Twitter API I can't seem to find an answer to.

Luckily they are prepended with RT, which we can search for and remove.

In [None]:
# remove tweets with RT string
data = data[~data['text'].str.contains('RT')]
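One caveat: `str.contains('RT')` matches those two letters anywhere in the text, so it can also drop ordinary tweets. The sketch below, on made-up example tweets, compares it with a stricter pattern that only matches the `RT @` prefix. The stricter pattern is a suggestion, not what the rest of this notebook uses.

```python
import pandas as pd

# Made-up example tweets
texts = pd.Series([
    'RT @someone: Veganuary starts today',  # a retweet
    'My SMART watch says walk more',        # contains "RT" but is not a retweet
    'Trying a vegan sausage roll',
])

# Matches "RT" anywhere in the text
loose = texts.str.contains('RT')

# Only matches text that begins with the retweet prefix;
# na=False treats missing text as "not a retweet"
strict = texts.str.contains(r'^RT @', na=False)

print(texts[~loose].tolist())
print(texts[~strict].tolist())
```

Switching to the stricter filter would keep more tweets, so any row positions used later would need checking again.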

In [None]:
# Again, reset the index (dropping the old one) and let's test again.
data = data.reset_index(drop=True)

# Let's check the first few tweets again
for index in [1,2,3,4,5]:
    print(data['text'][index])
    print('\n')

At this point it's probably time to narrow down what we are doing, as there is more interesting information in this dataset than we can cover.

I am going to keep the following:
* id - The Tweet ID
* created_at - The time the tweet was created
* text - the text that makes up a tweet
* author.id - the author ID
* author.created_at - when the user's account was created
* author.username - the Twitter user's username
* author.location - a self-defined location
* author.public_metrics.followers_count - number of followers a user has
* geo.full_name - the full name describing a tweet's geolocation
* public_metrics.like_count - number of likes on this tweet
* public_metrics.retweet_count - number of retweets on this tweet

In [None]:
data = data[['id','created_at', 'text','author.id','author.created_at', 'author.username','author.location','author.public_metrics.followers_count','geo.full_name','public_metrics.like_count','public_metrics.retweet_count']]
data.head()

## How is Veganism perceived on Twitter?
To answer this question we need to introduce sentiment analysis. As complicated as it sounds, this is quite easy to do in Python. As with many complicated things, somebody has written a package to make this easy for us.


In [None]:
# Import NLTK, the Natural Language package
import nltk
# Download the popular vader lexicon of words and sentiments.
nltk.download([
    "vader_lexicon",
])

# import the sentiment analyser.
from nltk.sentiment import SentimentIntensityAnalyzer

# Create a new sentiment analyser.
sia = SentimentIntensityAnalyzer()

# And write a function we can pass to our pandas function
def get_sentiment(string):
    return sia.polarity_scores(string)['compound']

With this package our sentiment scores are returned on a scale from -1 for fully negative to +1 for fully positive.
So our sentence below, "I love cats", has a sentiment of around 0.6, which is a high positive sentiment, whereas "I hate cats" is lower in sentiment.

In [None]:
# Test our sentiment package
get_sentiment('I love cats')

In [None]:
# Test our sentiment package
get_sentiment('I hate cats')

In [None]:
# Test our sentiment package
get_sentiment('I am cats')

In [None]:
# sentiment by word demo function
def sentiment_by_word(string):
    for word in string.split(' '):
        print(word + ' -- ' + str(get_sentiment(word)))

sentiment_by_word('I love cats')

In [None]:
# So let's apply this to our entire dataframe
data['sentiment'] = data['text'].apply(get_sentiment)

# print a few rows
for index in [0,1,2,3,4]:
    print('sentiment ' + str(data['sentiment'][index]))
    print(data['text'][index])
    print('\n')

### What can we do with our sentiment scores
To start with, let's find our highest sentiment tweet.


In [None]:
data.sort_values(by='sentiment', ascending=False).head()

In [None]:
topSentimentIndex = 156
print(data['id'][topSentimentIndex])
print(data['text'][topSentimentIndex])
print(data['sentiment'][topSentimentIndex])
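Hard-coded row numbers like the one above only hold for one particular copy of the data. As a hedged alternative, `idxmax` and `idxmin` look the rows up directly. The sketch below uses a toy dataframe with invented scores.

```python
import pandas as pd

# Toy dataframe with invented sentiment scores
scores = pd.DataFrame({
    'text': ['meh', 'lovely', 'awful'],
    'sentiment': [0.0, 0.8, -0.7],
})

# Index labels of the highest and lowest sentiment rows
top_index = scores['sentiment'].idxmax()
bottom_index = scores['sentiment'].idxmin()

print(scores.loc[top_index, 'text'])
print(scores.loc[bottom_index, 'text'])
```

If several tweets tie for the top score, `idxmax` returns the first one it finds.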

In [None]:
sentiment_by_word(data['text'][topSentimentIndex])

In [None]:
# And how about the lowest sentiment?
data.sort_values(by='sentiment', ascending=True).head()

In [None]:
bottomSentimentIndex = 104
print(data['id'][bottomSentimentIndex])
print(data['text'][bottomSentimentIndex])
print(data['sentiment'][bottomSentimentIndex])

In [None]:
sentiment_by_word(data['text'][bottomSentimentIndex])

In [None]:
# Let's also grab some neutral tweets.
data[data['sentiment'] == 0]

In [None]:
neutralIndex = 3
print(data['id'][neutralIndex])
print(data['text'][neutralIndex])
print(data['sentiment'][neutralIndex])

In [None]:
sentiment_by_word(data['text'][neutralIndex])

In [None]:
# Often the users at each end of this spectrum are quite different, so let's see what our tweeters look like in general
data.sentiment.hist()

At a glance there are three different kinds of tweets here:
1. Negative tweets, these are likely complaints from vegans or complaints about vegans.
2. Neutral tweets. Most of the tweets appear to be neutral. This is usually a symptom that our sentiment analyser wasn't trained on the language it's predicting on, so it is seeing words it has never classified before and tagging them as neutral.
3. Positive tweets, these appear in abundance in comparison to negativity. Could this be a sign of positivity, marketing, bias?

My hunch is that this grouping is quite naive. I could imagine big differences within group 1:
* People complaining about vegans in a hateful way
* Vegans complaining about non-vegans in a hateful way
* Vegans complaining about vegan difficulties.

Group 3 likely contains:
* Inflated self-promotion from vegan business owners
* Inflated promotion and feedback from large brands launching vegan products such as Greggs and the Vegan sausage roll.

These are all much larger project ideas. We don't even have any level of confidence that these users are vegan.

### Most liked content
We have access to likes and retweets, let's check out what the most liked content is.

In [None]:
# A fairly familiar graph of likes being geometrically hard to gain, with outliers from "viral" tweets
data['public_metrics.like_count'].hist(bins=100)

In [None]:
# And we see similar with retweets, even harder to come by as an echo of messaging rather than approval.
data['public_metrics.retweet_count'].hist(bins=20)

In [None]:
# And how about the most liked tweet?
data.sort_values(by='public_metrics.like_count', ascending=False).head()

In [None]:
mostLikedIndex = 1012
print(data['id'][mostLikedIndex])
print(data['text'][mostLikedIndex])
print(data['sentiment'][mostLikedIndex])

In [None]:
sentiment_by_word(data['text'][mostLikedIndex])

In [None]:
# And how about the most retweeted?
data.sort_values(by='public_metrics.retweet_count', ascending=False).head()

In [None]:
mostRetweetedIndex = 1012
print(data['id'][mostRetweetedIndex])
print(data['text'][mostRetweetedIndex])
print(data['sentiment'][mostRetweetedIndex])

The most liked and most retweeted tweet came from Boy George. There is a correlation there that we will look at later.

## Does the perception of Veganism change over time?
This is a toy example of what we saw in the presentation. How does a rolling sentiment change over time?

Next let's sort this dataframe by date. Looking now we seem to have only a handful of minutes between our Tweets.

In [None]:
# Sort by date
# Convert created_at into a datetime object
data['created_at'] = pd.to_datetime(data['created_at'])
# Sort our dataframe by date
data = data.sort_values(by='created_at', ascending=True)
# Reset the index, dropping the old one
data = data.reset_index(drop=True)

# print the head
data.head()

In [None]:
# To start with let's plot sentiment over index. Although this is linear, rather than time based it can be useful.
data['sentiment'].plot()

In [None]:
# As each tweet is relatively unconnected we can get quite erratic plots, so some smoothing can help.
data['sentiment'].rolling(30).mean().plot()

In [None]:
# positive tweets
len(data[data['sentiment'] > 0])

In [None]:
# neutral tweets
len(data[data['sentiment'] == 0])

In [None]:
# negative tweets
len(data[data['sentiment'] < 0])
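The three counts above split exactly at zero. A slightly softer split is sketched below, where scores very close to zero also count as neutral. The `label_sentiment` helper is hypothetical, and the 0.05 cut-off is the convention usually quoted for VADER compound scores, so treat it as an assumption to tune rather than a rule.

```python
import pandas as pd


def label_sentiment(score, threshold=0.05):
    """Bucket a compound score into positive, neutral or negative."""
    # Assumption: scores within +/- threshold of zero count as neutral
    if score >= threshold:
        return 'positive'
    if score <= -threshold:
        return 'negative'
    return 'neutral'


# Invented scores for illustration
scores = pd.Series([0.64, 0.0, -0.57, 0.02, 0.9])
labels = scores.apply(label_sentiment)

print(labels.value_counts())
```

On the real data this could be applied as `data['sentiment'].apply(label_sentiment).value_counts()`.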

Most of our tweets are positive, some are neutral, and few are negative.

There is not much to see here. In the full Veganuary dataset we see a kickoff of positivity, a trend downward and then a celebration at the end of the month.

In [None]:
# Our neutral tweets often contain high-sentiment content that our model doesn't understand yet, so let's try removing them.
data[data.sentiment != 0]['sentiment'].rolling(30).mean().plot()

In [None]:
# plot with realistic time axis
data.plot(kind='scatter',x='created_at', y='sentiment')

As we are only looking at such a small sample it's hard to draw any conclusions from this data. Depending on what we follow this can be a very clear line that somewhat represents the sentiment of a topic over time. Diving into our neutral tweets to better classify could be a good next step.
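One way to get a line against real time, rather than against row position, is to average sentiment within fixed time windows. The sketch below does this with `resample` on a toy dataframe. The timestamps, scores and the five-minute window are all invented for illustration.

```python
import pandas as pd

# Toy data with invented timestamps and scores
toy = pd.DataFrame({
    'created_at': pd.to_datetime([
        '2019-01-01 00:00:10', '2019-01-01 00:03:00',
        '2019-01-01 00:06:00', '2019-01-01 00:08:30',
    ]),
    'sentiment': [0.5, -0.1, 0.4, 0.0],
})

# Use the timestamp as the index, then average within 5-minute windows
by_time = toy.set_index('created_at')['sentiment'].resample('5min').mean()

print(by_time)
```

With a longer collection, a wider window such as `'1D'` would give one point per day, and `by_time.plot()` draws the result.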

### Does sentiment correlate with success?
Now we have quantified sentiment, do high sentiment messages get engagement?
Let's plot our:
* follower counts
* likes count
* retweet count
* sentiment 

And see what we find.

In [None]:
corr = data[['author.public_metrics.followers_count','public_metrics.like_count','public_metrics.retweet_count','sentiment']].corr()
corr.style.background_gradient(cmap ='coolwarm')

My hunch is that, as so many results have neutral sentiment, this is probably shifting our correlations quite heavily. Let's remove them for now.

In [None]:
corr = data[['author.public_metrics.followers_count','public_metrics.like_count','public_metrics.retweet_count','sentiment']][data['sentiment'] != 0].corr()
corr.style.background_gradient(cmap ='coolwarm')

It seems that the more followers a user has, the more likely their content is to be liked and retweeted. Follower count does not correlate with the sentiment of tweets though.

Likes and retweets have a strong correlation: content that is likely to be retweeted is also likely to be liked.

In this case sentiment doesn't seem to correlate with any of these features though.
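The default `corr()` is a Pearson correlation, which a single viral tweet can dominate. As a hedged cross-check, a rank-based Spearman correlation only asks whether one value tends to rise with the other. The toy numbers below are invented to show the difference.

```python
import pandas as pd

# Invented numbers: likes always rise with followers, but one tweet went "viral"
toy = pd.DataFrame({
    'followers': [100, 200, 300, 400, 500],
    'likes': [1, 2, 4, 8, 500],
})

# Pearson measures how close the relationship is to a straight line
pearson = toy.corr(method='pearson').loc['followers', 'likes']

# Spearman works on ranks, so it only measures whether the order agrees
spearman = toy.corr(method='spearman').loc['followers', 'likes']

print(round(pearson, 2), round(spearman, 2))
```

On the real data the same `method='spearman'` argument can be passed to the `corr()` calls above.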

By this point, we have noticed that our neutral sentiment tweets are a bit of a missed opportunity. We understand sentiment generally, but do not understand the terms within our topic area. The word "carnivore" might be neutral in everyday use, but in vegan circles it can be used in disgust or even as an insult, and our sentiment analyser doesn't understand this.

A good next step would be trying to figure out what these words are, but I will leave this here for now.

## Future Work
* Getting a full word count from tweets
* Dealing with stop words, punctuation and hashtags
* Removing duplicate words through case sensitivity, fuzzy matching and stemming
* Making word clouds with [wordclouds.co.uk](https://www.wordclouds.co.uk/)
* Classifying types of tweet into marketing, self-promotion and true opinion.
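As a starting point for the first two items, here is a minimal word count sketch using only the standard library. The `count_words` helper, the stop word list and the sample tweets are all hypothetical. A real project would want a fuller stop word list and some stemming.

```python
import re
from collections import Counter

# Tiny hypothetical stop word list, for illustration only
STOP_WORDS = {'the', 'a', 'is', 'and', 'to', 'of', 'i'}


def count_words(texts):
    """Count lower-cased words across texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        # Keep runs of letters and apostrophes; this drops punctuation and '#'
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(word for word in words if word not in STOP_WORDS)
    return counts


# Made-up sample tweets
sample = [
    'I love the vegan sausage roll',
    'Vegan food is the BEST',
    '#Vegan and proud',
]
word_counts = count_words(sample)

print(word_counts.most_common(3))
```

With the real dataframe this could be called as `count_words(data['text'])`.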

## Useful Links
* Word cloud builder - [wordclouds.co.uk](https://www.wordclouds.co.uk/)
* An intro to basic NLP and word clouds with WhatsApp data - [What can I do with WhatsApp?](https://towardsdatascience.com/what-can-i-do-with-whatsapp-661fc3cdd5c5)
* Use machine learning to understand and leverage text. - [Solving 90% of NLP](https://www.kdnuggets.com/2019/01/solve-90-nlp-problems-step-by-step-guide.html)