### Pre-talk notes for Speaker!
During talk:
* Minimise file browser
* move to this folder cd .\Documents\GitHub\working-with-twitter-data\HealthDemo
* Zoom in
* Clear cells
* Share public link -> https://github.com/UKDataServiceOpen/working-with-twitter-data/blob/main/HealthDemo/HealthTidyDemo.ipynb
* Share [binder link](https://mybinder.org/v2/gh/UKDataServiceOpen/working-with-twitter-data/HEAD?labpath=%2FHealthDemo%2FHealthTidyDemo.ipynb)

TODO - Time this talk

# Twarc Tidying and Analysis
This notebook covers my initial exploration of 7000 tweets from October 2021 to January 2022, all located in the UK. These tweets were scraped on the 16th Feb 2022 and each contains at least one of these keywords:
- cough
- coughing
- sneeze
- sneezing
- fatigue
- headache

These are all regarded as common Covid-19 symptoms.
This data was collected using the [HealthTwarcDemo notebook in this repo](https://github.com/UKDataServiceOpen/working-with-twitter-data/HealthDemo/HealthTwarcDemo.ipynb)

Throughout this notebook we cover:
- Initial exploration of a dataset from Twitter
- Visualising the increase in term usage over time
- Investigating connected symptoms with some entry-level natural language processing
- Building a wordcloud from these words

So let's get started.

In [None]:
# Import our packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

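# Change the default style to be bigger, and clearer colored.
# matplotlib 3.6 renamed 'seaborn-whitegrid' to 'seaborn-v0_8-whitegrid', so use whichever name is installed.
style = 'seaborn-v0_8-whitegrid' if 'seaborn-v0_8-whitegrid' in plt.style.available else 'seaborn-whitegrid'
plt.style.use(style)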
plt.rcParams.update({
    'font.size': 28,
    'figure.figsize':(28,12)
})

# set seaborn style
sns.set(rc={'figure.figsize':(12,8)})
sns.set(font_scale=1.5)

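# Create a cool UK Data Service color palette for our plots
colors = ['#E03A6C', '#F5AD42', '#ECE64B', '#449858', '#43A6C6', '#6C2B76']
# set_palette() returns None, so keep the palette itself in a variable and set it separately
palette = sns.color_palette(colors)
sns.set_palette(palette, n_colors=100)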

In case you couldn't scrape any new data in the [HealthTwarcDemo notebook in this repo](https://github.com/UKDataServiceOpen/working-with-twitter-data/HealthDemo/HealthTwarcDemo.ipynb), I have included a dataset, [3monthCoughUK.csv](https://github.com/UKDataServiceOpen/working-with-twitter-data/HealthDemo/3monthCoughUK.csv), which you can use.

In [None]:
# Read our data into a dataframe using pandas, convert our dates to datetime objects so our plots can use them!
data = pd.read_csv('3monthCoughUK.csv', parse_dates=['created_at'])

# The head function prints out the first 5 rows.
data.head()

## How to check a Tweet
We can grab the ID from the first column here and use it to replace the ID in the URL of any existing tweet.

For example here is the URL for one of my tweets about this webinar -> https://twitter.com/JosephAllen1234/status/1493911499047419907
That messy number at the end is the Tweet ID. 
https://twitter.com/JosephAllen1234/status/Replace_me_with_a_tweet_ID

So the first row above has tweet ID 1476704208074289156.

Even pasting the URL below, which has the wrong username in it, will redirect to the correct user.
https://twitter.com/JosephAllen1234/status/1476704208074289156
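
The same substitution can be scripted. This is a small sketch rather than part of the original workflow: `tweet_url` is a hypothetical helper, and it relies on the behaviour described above, where Twitter redirects to the real author whatever username is in the path.

```python
def tweet_url(tweet_id, username="JosephAllen1234"):
    """Build a status URL for a tweet ID.

    The username is only a placeholder: as described above, Twitter
    redirects to the real author based on the ID.
    """
    return f"https://twitter.com/{username}/status/{int(tweet_id)}"


# Tweet ID taken from the first row of the dataset
print(tweet_url(1476704208074289156))
```

Tweet IDs are too large to be stored exactly as floats, so it is worth checking that `data['id']` has an integer dtype before building URLs from it.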

After this redirect we should see some mention of covid or one of our symptoms. An unfortunate side effect of Twitter's search is that a user called "coughsneeze" may have all their tweets returned in our search too!

In [None]:
data.loc[data['author.username'].str.contains("cough", case=False)]

## Checking some Tweet text
So we've got our data read in successfully, let's print out some of the tweet text to make sure they have something to do with our symptoms.

In [None]:
# print out the first 5 tweets' text for visual inspection
for text in data['text'].head(5):
    print(text)
    print('\n')


In [None]:
# I always recommend running info() for basic type information.
# Here we are looking for any weird types, or largely missing values
data.info()

In [None]:
# and describe() for statistical info.
# data.describe()
# Or to suppress scientific notation
data[['author.public_metrics.followers_count','public_metrics.like_count','public_metrics.retweet_count']].describe().apply(lambda s: s.apply('{0:.0f}'.format))

At this point we have too many columns to analyse comfortably, so it's worth asking whether there is anything we can remove now. If you are still exploring, this may be premature.

Considering what I am here to look for, the text is the main thing I care about. You may wish to retain the author ID if, for example, you want to track somebody who reports symptoms over time.

I am going to keep the following:
* id - The Tweet ID
* created_at - The time the tweet was created
* text - the text that makes up a tweet
* author.id - the author ID
* author.created_at - when the users account was created
* author.username - the Twitter users username
* author.location - a self-defined location
* author.public_metrics.followers_count - Number of followers a user has
* geo.full_name - the full name describing a tweets geolocation
* public_metrics.like_count - number of likes on this tweet
* public_metrics.retweet_count - number of retweets on this tweet

In [None]:
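# Keep only the columns listed above. copy() gives us an independent dataframe,
# so adding columns to it later doesn't trigger pandas' SettingWithCopyWarning.
data = data[['id','created_at', 'text','author.id','author.created_at', 'author.username','author.location','author.public_metrics.followers_count','geo.full_name','public_metrics.like_count','public_metrics.retweet_count']].copy()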
data.head()

# Have covid-19 symptoms grown?
Let's start by checking if there is evidence in these tweets that COVID has grown in the UK over this period.

Those of us still checking the news will know that the Omicron variant emerged in late November. I have included a dataset, "covid_data", which contains new case numbers in the UK from https://ourworldindata.org/explorers/coronavirus-data-explorer

In [None]:
# read in official covid data, convert dates
covid = pd.read_csv('covid_data.csv', parse_dates=['date'])

# Set the date as our index so our plotting libraries format them
covid = covid.set_index('date')

In [None]:
sns.lineplot(data=covid).set(title='Number of covid cases over time')

A very upsetting and familiar winter graph for Covid-19. But is it replicated in our very small Twitter dataset?

In [None]:
sns.lineplot(data=data['created_at'].groupby(data.created_at.dt.date).count().rolling(10).mean()).set(title='Number of tweets containing covid symptoms in UK over time')
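
The cell above packs several steps into one line: group the tweets by calendar day, count them, then smooth the daily counts with a 10-day rolling mean. The sketch below walks through the same steps on made-up timestamps, with a window of 3 so the numbers are easy to check by hand. The toy data and variable names are assumptions for illustration only.

```python
import pandas as pd

# Made-up timestamps: 1, 3, 2 and 2 tweets on four consecutive days.
created_at = pd.to_datetime([
    "2021-10-01 09:00",
    "2021-10-02 08:00", "2021-10-02 12:00", "2021-10-02 20:00",
    "2021-10-03 10:00", "2021-10-03 11:00",
    "2021-10-04 07:00", "2021-10-04 19:00",
])
tweets = pd.DataFrame({"created_at": created_at})

# Step 1: count tweets per calendar day.
daily_counts = tweets.groupby(tweets["created_at"].dt.date).size()

# Step 2: average each day with the two days before it.
# The first two days have no full window, so they come out as NaN.
smoothed = daily_counts.rolling(3).mean()

print(pd.DataFrame({"daily": daily_counts, "rolling_3": smoothed}))
```

A wider window gives a smoother line but leaves more missing values at the start of the series, which is why the plot above begins a little after the first tweet.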

Sort of. There seems to have been more reporting of covid symptoms back in October that wasn't matched by a rise in cases.
Do the Tweets reflect covid numbers, covid paranoia or both?

The two series do appear to move together, but at this point we can't really tell why. Social media is infamous for "look at me" behavior, which adds a huge bias here.
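
One way to put a number on how closely two daily series move together is a Pearson correlation. The sketch below uses made-up counts, not the real datasets, and the column names `tweets` and `cases` are placeholders. It also shows a shifted comparison, which is one way to ask whether tweets tend to rise before cases do.

```python
import pandas as pd

# Toy stand-ins for the two datasets: the numbers are made up for illustration.
dates = pd.date_range("2021-12-01", periods=8, freq="D")
tweets_per_day = pd.Series([3, 4, 4, 6, 7, 9, 12, 13], index=dates, name="tweets")
cases_per_day = pd.Series([100, 110, 130, 170, 220, 300, 390, 500], index=dates, name="cases")

# Align the two series on their shared dates, dropping days missing from either.
combined = pd.concat([tweets_per_day, cases_per_day], axis=1, join="inner")

# Pearson correlation: values near +1 mean the series rise and fall together.
r = combined["tweets"].corr(combined["cases"])
print(f"Pearson r = {r:.2f}")

# Shift the tweets forward by 2 days to compare each day's cases
# with the tweets from two days earlier.
lagged_r = combined["tweets"].shift(2).corr(combined["cases"])
print(f"Pearson r with tweets shifted by 2 days = {lagged_r:.2f}")
```

A high value only says the two lines move together. It says nothing about why, so the bias mentioned above still applies.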

# Symptoms over time

We can go a step further, though. We could, for example, break down counts per day for tweets which contain each of our named symptoms. This could help us separate out the Omicron hype from the Delta hype. Omicron sufferers generally seem to have more of a sneeze and headache illness than the coughs we previously associated with covid-19. We can try to visualize this too.

In [None]:
# Build columns to flag whether text contains our keywords.
# case=False also catches capitalised mentions such as "Cough"; na=False treats missing text as no match.
# 'cough' already matches 'coughing', and 'sneez' matches both 'sneeze' and 'sneezing'.
data['has_cough'] = data.text.str.contains('cough', case=False, na=False)
data['has_sneeze'] = data.text.str.contains('sneez', case=False, na=False)
data['has_fatigue'] = data.text.str.contains('fatigue', case=False, na=False)
data['has_headache'] = data.text.str.contains('headache', case=False, na=False)

# Summarise how many tweets were flagged for each symptom
data[['has_cough','has_sneeze','has_fatigue','has_headache']].describe()
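
Substring matching is simple, but it is worth knowing what it does and does not catch. The sketch below compares a plain substring match with a case-insensitive, whole-word regular expression on a few made-up tweets. The example texts and the pattern are assumptions for illustration, not the rules used to collect the data.

```python
import pandas as pd

# Made-up tweets for illustration only
tweets = pd.Series([
    "Cough won't go away",
    "been coughing all night",
    "coughsneeze is a great username",
    "no symptoms here",
    None,
])

# Plain substring match: case-sensitive, so "Cough" is missed, and
# missing text is not reliably False (it is NaN in many pandas versions).
substring = tweets.str.contains("cough")

# Case-insensitive regex with word boundaries: matches cough, coughs,
# coughed and coughing as whole words, and treats missing text as False.
pattern = r"\bcough(?:s|ed|ing)?\b"
whole_word = tweets.str.contains(pattern, case=False, regex=True, na=False)

print(pd.DataFrame({"tweet": tweets, "substring": substring, "whole_word": whole_word}))
```

The whole-word version skips text such as "coughsneeze", which the substring match counts as a mention of cough.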

In [None]:
# Print some tweets that should contain coughs
for text in data.loc[data.has_cough == True, 'text'].head(3):
    print(text)
    print('\n')

In [None]:
# Print some tweets that should contain sneezes
for text in data.loc[data.has_sneeze == True, 'text'].head(3):
    print(text)
    print('\n')

In [None]:
# Print some tweets that should contain fatigue
for text in data.loc[data.has_fatigue == True, 'text'].head(3):
    print(text)
    print('\n')

In [None]:
# Print some tweets that should contain headache
for text in data.loc[data.has_headache == True, 'text'].head(3):
    print(text)
    print('\n')

In [None]:
# create a 2x2 plot
fig, axes = plt.subplots(2, 2, figsize=(25, 18))

# group once by calendar day and reuse it for every panel
daily = data.groupby(data.created_at.dt.date)

# use the busiest day for cough mentions as a shared y-axis limit so the panels are comparable
# (named y_max so we don't shadow Python's built-in max())
y_max = daily.has_cough.sum().max()

# set big title
fig.suptitle('Covid symptoms mentioned on Twitter over time')

# pair each panel with the flag column it plots and its title
panels = [
    (axes[0, 0], 'has_cough', 'Mentions of cough or coughing'),
    (axes[0, 1], 'has_sneeze', 'Mentions of sneeze or sneezing'),
    (axes[1, 0], 'has_fatigue', 'Mentions of fatigue'),
    (axes[1, 1], 'has_headache', 'Mentions of headache'),
]

# create lineplots of the 10-day rolling mean of daily mentions
for ax, column, title in panels:
    ax.set_ylim(0, y_max)
    ax.set_title(title)
    sns.lineplot(ax=ax, data=daily[column].sum().rolling(10).mean())

So we can see that there seems to be a baseline of fatigue and sneezing mentions that isn't really growing with covid cases.

On the other hand coughing and headaches appear at a higher frequency, and also seem to surge just before the number of covid cases do.

# Have we missed a symptom?

My next hunch comes from tweets like these:
"Fever, shaking, fatigue, swelling under armpit, heart pounding. Was quite scary at one point mate. Starting to feel bit better now"
"slept,slept,slept headache gone, aches and pains gone, cough gone"

Users seem to be reporting symptoms I wasn't looking for like:
- sleeping
- heart pounding
- shaking
- swelling

To find these, it might be worth looking for words which appear with our terms. To begin, let's simply take a word frequency count.

In [None]:
# import our NLP library
import nltk
nltk.download('punkt')

In [None]:
# Join every tweet into one long string.
# str.cat skips missing values, so there is no need to drop empty rows first.
wordlist = data['text'].str.cat(sep=' ')
# split the string into word and punctuation tokens
words = nltk.tokenize.word_tokenize(wordlist)
word_dist = nltk.FreqDist(words)
wordCount = pd.DataFrame(word_dist.most_common(),
                    columns=['Word', 'Frequency'])

In [None]:
# What are our top 10 words?
wordCount.head(10)

Our most common words are largely punctuation and what NLP calls "stop words": words that traditionally add little meaning to a sentence, like I, you, a, the and but.
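
Before doing this with NLTK, the two filtering steps can be seen on a tiny example. This sketch uses made-up text, a simple whitespace split in place of a real tokenizer, and a small hand-written stop word set in place of NLTK's list.

```python
from collections import Counter

# Made-up text and a tiny hand-written stop word list, for illustration only.
text = "I have a cough , and the cough is bad . But I slept !"
tiny_stop_words = {"i", "a", "and", "the", "is", "but", "have"}

tokens = text.split()

# Step 1: keep only alphanumeric tokens, which drops punctuation.
alnum_tokens = [t for t in tokens if t.isalnum()]

# Step 2: drop stop words, comparing in lowercase so "I" and "But" are caught.
content_tokens = [t.lower() for t in alnum_tokens if t.lower() not in tiny_stop_words]

print(Counter(content_tokens).most_common())
```

What is left after both steps are the words that carry the meaning of the sentence, which is what we want to count.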

In [None]:
# Join every tweet into one long string (str.cat skips missing values), then tokenize
wordlist = data['text'].str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(wordlist)

# keep only alphanumeric tokens, which drops punctuation
new_words = [word for word in words if word.isalnum()]

word_dist = nltk.FreqDist(new_words)
wordCount = pd.DataFrame(word_dist.most_common(),
                    columns=['Word', 'Frequency'])

In [None]:
# What are our top ten words, without non alphanumeric characters
wordCount.head(10)

Punctuation is gone. Now let's remove the stop words; NLTK has a list of these.

In [None]:
# stopwords are the words that add 'nothing' to a sentence, let's remove them. NLTK can help here.
nltk.download('stopwords')

In [None]:
# import and print stop words for demo's sake
from nltk.corpus import stopwords
print(stopwords.words('english'))

In [None]:
# Join every tweet into one long string (str.cat skips missing values), then tokenize
wordlist = data['text'].str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(wordlist)

# keep only alphanumeric tokens, which drops punctuation
new_words = [word for word in words if word.isalnum()]

# remove stop words, using the NLTK list imported above
stop_words = set(stopwords.words('english'))
filtered_new_words = [w for w in new_words if w.lower() not in stop_words]

word_dist = nltk.FreqDist(filtered_new_words)
wordCount = pd.DataFrame(word_dist.most_common(),
                    columns=['Word', 'Frequency'])



In [None]:
# Log words without stop words
wordCount.head(10)

That removes most of the stop words, but not all of them. Capitalised and lowercase forms of a word (Cough and cough, for example) are also still counted separately, because we haven't lowercased the text before tokenizing it.

In [None]:
# Lowercase every tweet, join them into one long string, then tokenize
wordlist = data['text'].str.lower().str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(wordlist)

# keep only alphanumeric tokens, which drops punctuation
new_words = [word for word in words if word.isalnum()]

# remove stop words, using the NLTK list imported above
stop_words = set(stopwords.words('english'))
filtered_new_words = [w for w in new_words if w not in stop_words]

word_dist = nltk.FreqDist(filtered_new_words)
wordCount = pd.DataFrame(word_dist.most_common(),
                    columns=['Word', 'Frequency'])



In [None]:
# lowercase all words, to merge Cough and cough for example.
wordCount.head(10)

Finally, we are likely undercounting any word here that can be conjugated.
Think of words like:
- swim
- swam
- swimming

These all refer to the "stem" swim. We will see the same thing with:
- cough
- coughed
- coughing

So it's worth merging these.
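
To get a feel for what a stemmer does before applying it to the whole dataset, here is a short sketch using NLTK's `PorterStemmer` on the example words above. The word list is just for illustration.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Regular endings are stripped, so these collapse to a shared stem...
for word in ["cough", "coughed", "coughing", "swim", "swimming"]:
    print(word, "->", ps.stem(word))

# ...but irregular forms are not merged: the stemmer only strips
# suffixes, so "swam" is left as it is.
print("swam", "->", ps.stem("swam"))
```

Stems are not always real words, so expect some odd-looking entries in the frequency table once stemming is applied.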

In [None]:
# stem all words where possible
from nltk.stem import PorterStemmer
ps = PorterStemmer()

In [None]:
# Lowercase every tweet, join them into one long string, then tokenize
wordlist = data['text'].str.lower().str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(wordlist)

# keep only alphanumeric tokens, which drops punctuation
new_words = [word for word in words if word.isalnum()]

# remove stop words, using the NLTK list imported above
stop_words = set(stopwords.words('english'))
filtered_new_words = [w for w in new_words if w not in stop_words]

# reduce each remaining word to its stem
stems = [ps.stem(word) for word in filtered_new_words]

word_dist = nltk.FreqDist(stems)
wordCount = pd.DataFrame(word_dist.most_common(),
                    columns=['Word', 'Frequency'])



In [None]:
# demo stemming of fatigued
ps.stem('fatigued')

In [None]:
# show top 50 occurring stems
wordCount.head(50)

We can see some web language leaking in now:
- http comes from the links in tweets; it is the protocol used to request assets on the web
- amp is left over from `&amp;`, the HTML code for an ampersand (&).
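
An alternative to filtering these out of the counts afterwards is to clean each tweet before tokenizing it. The sketch below uses only the Python standard library on a made-up tweet. The URL pattern is deliberately simple and is an assumption, not a complete link matcher.

```python
import html
import re

# Made-up tweet for illustration only
tweet = "Cough &amp; headache again https://t.co/abc123 ugh"

# Replace HTML entities such as &amp; with the characters they stand for.
unescaped = html.unescape(tweet)

# Remove anything that looks like a link (a deliberately simple pattern).
without_links = re.sub(r"https?://\S+", "", unescaped)

# Collapse the leftover double spaces.
cleaned = " ".join(without_links.split())
print(cleaned)
```

Cleaning first means the link fragments and entity names never reach the tokenizer, so nothing has to be removed by name later.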

In [None]:
# remove web terminology
wordCount = wordCount[wordCount.Word != 'http']
wordCount = wordCount[wordCount.Word != 'amp']

In [None]:
# plot the 30 most frequent stems; passing column names keeps the data and the axes in sync
top_words = wordCount.head(30)
sns.barplot(data=top_words, y='Word', x='Frequency')

In this alone we can see some common terms:
- cold
- sore
- throat
- back
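
The overall count mixes every tweet together. To see which words appear alongside one particular symptom, one option is to count words only in the tweets that mention it. This is a minimal sketch on made-up, already-cleaned tweets; the `ignore` set is a tiny stand-in for a real stop word list.

```python
from collections import Counter

# Made-up, already-cleaned tweets for illustration only
tweets = [
    "cough sore throat cold",
    "headache slept all day",
    "cough and a cold again",
    "sore throat and cough",
]
symptom = "cough"
ignore = {"and", "a", "all"}  # tiny stand-in for a real stop word list

co_occurring = Counter()
for tweet in tweets:
    words = set(tweet.split())  # a set counts each word once per tweet
    if symptom in words:
        # count every other word that shares a tweet with the symptom
        co_occurring.update(words - {symptom} - ignore)

print(co_occurring.most_common(3))
```

Repeating this for each symptom gives a separate list of companions for cough, sneeze, fatigue and headache, which makes unexpected symptoms easier to spot.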

# Wordclouds
Looking at this list is great, but a wordcloud is always a fun and easy addition once you are at this point!
We can use our wordCount object to build the format the wordcloud package wants.


In [None]:
# prepare the format the wordcloud package expects:
# a dictionary mapping each word to its frequency
d = dict(zip(wordCount['Word'], wordCount['Frequency']))

In [None]:
import matplotlib.pyplot as plt
from wordcloud import WordCloud

wordcloud = WordCloud()
wordcloud.generate_from_frequencies(frequencies=d)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

In [None]:
# lower max_font_size, change the maximum number of word and lighten the background:
wordcloud = WordCloud(width=1000, height= 700, max_font_size=200, max_words=100, background_color="white").generate_from_frequencies(d)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

In [None]:
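# PIL (from the Pillow package) loads the image we use as a mask
from PIL import Image

# load the mask as an array; words are only drawn where the mask is not pure white
covid_mask = np.array(Image.open("mask.png"))

# same settings as before, plus the mask.
# When a mask is given, WordCloud takes the canvas size from it and ignores width and height.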
wordcloud = WordCloud(width=1000, height= 700, max_font_size=500, max_words=100, background_color="white", mask=covid_mask).generate_from_frequencies(d)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

### Most liked content
We have access to likes and retweets, let's check out what the most liked content is.

In [None]:
# And how about the most liked tweet?
data.sort_values(by='public_metrics.like_count', ascending=False).head()

In [None]:
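# look up the index label of the most liked tweet rather than hard-coding it
mostLikedIndex = data['public_metrics.like_count'].idxmax()
print(data['id'][mostLikedIndex])
print(data['text'][mostLikedIndex])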

In [None]:
# And how about the most retweeted?
data.sort_values(by='public_metrics.retweet_count', ascending=False).head()

In [None]:
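# look up the index label of the most retweeted tweet rather than hard-coding it
mostRetweetedIndex = data['public_metrics.retweet_count'].idxmax()
print(data['id'][mostRetweetedIndex])
print(data['text'][mostRetweetedIndex])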