# Exploratory Spatial Data Analysis of Disaster-Tweets

This notebook lets you analyse tweets that are related to two natural disasters.
How are they related? They were posted within the area affected by the disaster, after the disaster occurred.
How can we analyse them? First, get an overview of the dataset: is it complete, and does it contain outliers? Then pose specific questions and try to extract the information from the data that helps you answer them.

The datasets are located in '../tweets'.
There are two datasets:
* Napa Earthquake tweets -> https://en.wikipedia.org/wiki/2014_South_Napa_earthquake
* Hurricane Harvey tweets -> https://en.wikipedia.org/wiki/Hurricane_Harvey
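
The overview step mentioned above (completeness and outliers) could look roughly like the following. This is a minimal sketch on a made-up `sample` table, not on the real data. It assumes the columns `tweet_text`, `latitude` and `longitude` that the later cells use, and it treats coordinates outside the valid latitude/longitude range as outliers, which is only one possible definition.

```python
import pandas as pd

# Made-up stand-in for the tweet table; the real data comes from the CSV files
sample = pd.DataFrame({
    'tweet_text': ['earthquake!', None, 'all fine here', 'napa shaking'],
    'latitude': [38.3, 38.2, 138.0, 38.4],      # 138.0 is not a valid latitude
    'longitude': [-122.3, -122.2, -122.4, None],
})

# Completeness: number of missing values per column
missing_per_column = sample.isnull().sum()

# Outliers (one possible definition): coordinates outside the valid ranges
invalid_coords = sample[
    (sample['latitude'].abs() > 90) | (sample['longitude'].abs() > 180)
]

print(missing_per_column)
print(invalid_coords)
```

On the real data, `df.describe()` gives a similar quick summary of the numeric columns.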

TODO:
* Structure the notebook: include statistics about the dataset at the beginning and add more fields to the notebook.
* Add more descriptions.

## Load dataset

In [None]:
%matplotlib inline

In [None]:
import pandas as pd

# Skip malformed rows. pandas >= 1.3 uses on_bad_lines='skip'; older versions
# used error_bad_lines=False and warn_bad_lines=False instead.
df = pd.read_csv('../tweets/napa_tweets.csv', sep=',', on_bad_lines='skip', index_col=False)
#df = pd.read_csv('Ressources_and_Results/hurricane_harvey_tweets.csv', sep=',', on_bad_lines='skip', index_col=False, encoding='utf8', header=0)
df.rename(columns={'Unnamed: 0': 'id'}, inplace=True)
df.head(5)


In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Drop missing texts first; str.join fails on NaN values
wordcloud = WordCloud(background_color='white').generate(' '.join(df['tweet_text'].dropna().astype(str)))
plt.figure(figsize=(20, 20))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

Download the NLTK resources.


<span style="color:red">Do this only once!</span>


In [None]:
import nltk 
nltk.download('stopwords')
nltk.download('punkt')

Pre-processing

Try several pre-processing procedures and observe how they affect the analysis results.

In [None]:
from preprocessing import *
import gensim
import nltk
from gensim import corpora
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

#tweets' text as list
tweets_text = df['tweet_text'].tolist()
#lowercase
tweets_text=[tweet.lower() for tweet in tweets_text]

# Remove URLs
remove_url_regex = r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b'
tweets_text = filter_tweets_before_tokenization(tweets_text, remove_url_regex)

# Remove all user names (@mentions) in the tweet text. This must run before
# tokenization and before special characters are stripped; afterwards no '@'
# is left for the pattern to match.
user_names_regex = r"@\S+"
tweets_text = filter_tweets_before_tokenization(tweets_text, user_names_regex)

# Tokenization
tweets_text = [nltk.word_tokenize(tweet) for tweet in tweets_text]

# Remove special characters
remove_sc_regex = r'[^A-Za-z ]+'
tweets_text = filter_tweets_after_tokenization(tweets_text, remove_sc_regex)

# Remove short words (1 to 3 characters)
remove_short_words_regex = r'\W*\b\w{1,3}\b'
tweets_text = filter_tweets_after_tokenization(tweets_text, remove_short_words_regex)

# Increase keyword frequency by aggregating similar keywords
# Check the order of the preprocessing routine! E.g. stemming would affect the performance of synonym handling
#disaster = 'hurrican'
#disaster_terms = ['hurricane', 'hurricaneharvey', 'hurricane_harvey', 'flood', 'storm']
#tweets_text = synonym_handling(tweets_text, disaster, disaster_terms)

# Remove words that appear only once in the dataset
# (keep tokens that occur at least min_frequency_words times)
frequency = getFrequency(tweets_text)
min_frequency_words = 2
tweets_text = [[token for token in tweet if frequency[token] >= min_frequency_words] for tweet in tweets_text]

# Remove stop words
# You need to download the NLTK stopwords first (see the download cell above)
stoplist = set(stopwords.words('english'))
tweets_text = [[word for word in document if word not in stoplist] for document in tweets_text]

# Custom stop word list
custom_stopwords_path = r'Ressources_and_Results\Stopwordlist_English.txt'
with open(custom_stopwords_path, 'r') as sw:
    # Build the set once instead of once per token; the file is closed automatically
    custom_stop_words = {line.rstrip('\r\n') for line in sw}

tweets_text = [[word for word in document if word not in custom_stop_words] for document in tweets_text]

#Stemming
stemmer = PorterStemmer()
#stemmer = SnowballStemmer("english")
tweets_text = [[stemmer.stem(word) for word in sub_list] for sub_list in tweets_text]

#remove empty strings
tweets_text = [[word for word in document if word] for document in tweets_text]

tweets_text[:10]
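
The helper functions imported from the `preprocessing` module are not shown in this notebook. To give a rough idea of what such regex filters and the frequency count might do, here is a minimal sketch. The names `remove_pattern_from_tweets`, `remove_pattern_from_tokens` and `count_tokens` are hypothetical stand-ins, and the actual module may behave differently.

```python
import re
from collections import Counter

def remove_pattern_from_tweets(tweets, pattern):
    """Delete every match of `pattern` from each raw (untokenized) tweet."""
    regex = re.compile(pattern)
    return [regex.sub('', tweet) for tweet in tweets]

def remove_pattern_from_tokens(tokenized_tweets, pattern):
    """Delete every match of `pattern` inside each token; tokens may end up empty."""
    regex = re.compile(pattern)
    return [[regex.sub('', token) for token in tweet] for tweet in tokenized_tweets]

def count_tokens(tokenized_tweets):
    """Count how often each token occurs across all tweets."""
    return Counter(token for tweet in tokenized_tweets for token in tweet)

raw = ['napa quake @usgs', 'big quake in napa!']
no_mentions = remove_pattern_from_tweets(raw, r'@\S+')
tokens = [tweet.split() for tweet in no_mentions]   # simple stand-in for nltk.word_tokenize
letters_only = remove_pattern_from_tokens(tokens, r'[^A-Za-z ]+')
frequency = count_tokens(letters_only)

print(letters_only)
print(frequency['quake'])
```

Because a token-level filter can leave empty strings behind, a final pass that drops empty tokens (as in the cell above) is still needed.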

Create corpus and dictionary for LDA

In [None]:
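# Map every distinct token to an integer id.
# Note: the name `dict` shadows Python's built-in type; it is kept because the following cells refer to it.
dict = gensim.corpora.Dictionary(tweets_text)
# Convert every tweet into a bag-of-words: a list of (token_id, count) pairs
corpus = [dict.doc2bow(text) for text in tweets_text]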
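
To make the corpus format more concrete: `doc2bow` turns each tokenized tweet into a sparse bag-of-words, that is, a list of `(token_id, count)` pairs. The plain-Python sketch below mimics that idea without gensim. The helper names `build_vocabulary` and `to_bow` are hypothetical, and the ids assigned here need not match the ones gensim would choose.

```python
from collections import Counter

def build_vocabulary(tokenized_tweets):
    """Map every distinct token to an integer id (sorted so the result is deterministic)."""
    tokens = sorted({token for tweet in tokenized_tweets for token in tweet})
    return {token: idx for idx, token in enumerate(tokens)}

def to_bow(tweet, vocabulary):
    """Return a sorted list of (token_id, count) pairs for one tokenized tweet."""
    counts = Counter(vocabulary[token] for token in tweet if token in vocabulary)
    return sorted(counts.items())

sample_tweets = [['napa', 'quake', 'quake'], ['napa', 'wine']]
vocabulary = build_vocabulary(sample_tweets)
sample_corpus = [to_bow(tweet, vocabulary) for tweet in sample_tweets]

print(vocabulary)
print(sample_corpus)
```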

Train Model

In [None]:
num_topics = 10
alpha = 0.0001
eta = 0.0001
passes = 10
lda = gensim.models.LdaMulticore(corpus, id2word=dict, num_topics=num_topics, alpha=alpha, eta=eta, passes=passes)

Show top words of topics

In [None]:
top_words = 5

#show top words of topics
for t in range(lda.num_topics):
    print('topic {}: '.format(t) + ', '.join([v[0] for v in lda.show_topic(t, top_words)]))

#show top words of topics with probabilities  
#for t in range(lda.num_topics):
#   print('topic {}: '.format(t) + ', '.join([v[0] + " (" + str(v[1]) + ")" for v in lda.show_topic(t, top_words)]))


Visualise topics and check relation between them

In [None]:
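# Newer pyLDAvis releases name this module gensim_models; older ones use pyLDAvis.gensim
import pyLDAvis.gensim_models
vis = pyLDAvis.gensim_models.prepare(lda, corpus, dict)
pyLDAvis.display(vis)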

Identify disaster-related topic and classify tweets accordingly

In [None]:
# Topic distribution of every tweet as a list of (topic_id, probability) pairs
document_topic_list = list(lda.get_document_topics(corpus))
# Keep only the most probable topic of each tweet
classified_tweets = [max(document, key=lambda x: x[1]) for document in document_topic_list]
topics = [top_prob[0] for top_prob in classified_tweets]
probabilities = [top_prob[1] for top_prob in classified_tweets]
df['topics'] = topics
df['probabilities'] = probabilities
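
The classification step above assigns each tweet to its most probable topic. The sketch below shows the same selection on hand-written topic distributions and also counts the tweets per topic. The data in `sample_document_topics` is invented for illustration, and the `min_probability` cut-off is an assumption that is not part of this notebook.

```python
from collections import Counter

# Invented example: per tweet, a list of (topic_id, probability) pairs
sample_document_topics = [
    [(0, 0.10), (4, 0.85), (9, 0.05)],
    [(0, 0.55), (4, 0.45)],
    [(9, 0.98), (0, 0.02)],
]

def most_probable_topic(topic_distribution):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(topic_distribution, key=lambda pair: pair[1])

best = [most_probable_topic(doc) for doc in sample_document_topics]
tweets_per_topic = Counter(topic for topic, _ in best)

# Hypothetical cut-off: keep only assignments the model is fairly sure about
min_probability = 0.6
confident = [(topic, prob) for topic, prob in best if prob >= min_probability]

print(best)
print(tweets_per_topic)
print(confident)
```

Whether such a cut-off is useful depends on the data; tweets with a nearly flat topic distribution would otherwise be assigned to a topic almost arbitrarily.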

Check new axes and null values in data frame

In [None]:
from shapely.geometry import Point

print(df.axes)
print('Number of null values: ' + str(df.isnull().sum().sum()))

# Build a point geometry (longitude, latitude) for every tweet
df['Coordinates'] = list(zip(df.longitude, df.latitude))
df['Coordinates'] = df['Coordinates'].apply(Point)

Check classified tweets

In [None]:
# Set this to the number of the topic you identified as disaster-related
topic_number = 4
df.loc[df['topics'] == topic_number]

Check geospatial distribution of disaster-related tweets

In [None]:
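import geopandas

# Set this to the number of the topic you identified as disaster-related
topic_number = 9
# Assumes the coordinates are WGS84 longitude/latitude (EPSG:4326)
gdf = geopandas.GeoDataFrame(df, geometry='Coordinates', crs='EPSG:4326')
print(gdf.head())
# Note: geopandas.datasets is deprecated since GeoPandas 0.14 and removed in 1.0;
# with newer versions, read a Natural Earth countries file from disk instead.
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
ax = world[world.continent == 'North America'].plot(
    color='white', edgecolor='black')
gdf_topic = gdf.loc[gdf['topics'] == topic_number]
gdf_topic.plot(ax=ax, color='green', markersize=0.3)
# Zoom to the tweets of the selected topic, with a margin of one degree
minx, miny, maxx, maxy = gdf_topic.total_bounds
ax.set_xlim(minx - 1, maxx + 1)
ax.set_ylim(miny - 1, maxy + 1)
plt.show()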

More information

* [Latent Dirichlet Allocation](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf)
* [Combining machine-learning topic models and spatiotemporal analysis of social media data for disaster footprint and damage assessment](https://www.tandfonline.com/doi/full/10.1080/15230406.2017.1356242)
* [Gensim](https://radimrehurek.com/gensim/)