# Kaggle: Disaster Tweets

This dataset comes from a Kaggle challenge on detecting tweets that refer to a disaster:
https://www.kaggle.com/c/nlp-getting-started/overview

The notebook's goal is to derive features from natural-language text that can subsequently be used in statistical methods.

The notebook conducts the following steps:

    1) Import libraries and data
    2) Define functions for data preparation
    3) Prepare data
    4) Apply ML model

### 1) Import libraries and data

In [1]:
# import libraries

import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# download the NLTK resources used below
nltk.download('stopwords')
nltk.download('punkt')

stop_words = stopwords.words('english')


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
# load data

train = pd.read_csv('nlp-getting-started/train.csv')
test = pd.read_csv('nlp-getting-started/test.csv')

In [3]:
train.columns.tolist()  # list the column names

['id', 'keyword', 'location', 'text', 'target']

In [4]:
print(train.describe())

                 id      target
count   7613.000000  7613.00000
mean    5441.934848     0.42966
std     3137.116090     0.49506
min        1.000000     0.00000
25%     2734.000000     0.00000
50%     5408.000000     0.00000
75%     8146.000000     1.00000
max    10873.000000     1.00000


    The sample is slightly imbalanced: about 43% of the tweets are about real disasters (mean of target = 0.43) and 57% are not.
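
For a 0/1 label, the mean of the column equals the share of the positive class, so the class balance can also be read off directly. A minimal sketch, using a small made-up `target` column (`toy` is a hypothetical stand-in, not the real `train` frame):

```python
import pandas as pd

# Toy stand-in for train['target'] (made-up values, not the real data)
toy = pd.DataFrame({'target': [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]})

# Relative frequency of each class
shares = toy['target'].value_counts(normalize=True)
print(shares)

# For a 0/1 label the share of 1s equals the column mean
print(toy['target'].mean())
```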


In [5]:
# merge test and train
dfs = [train, test]
df = pd.concat(dfs)
df = df.reset_index(drop=True)

In [6]:
# print examples of tweets 

loop = np.arange(10)

for i in loop:
    print(train.loc[train.target == 1].reset_index().text[i])
    print('')

Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all

Forest fire near La Ronge Sask. Canada

All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected

13,000 people receive #wildfires evacuation orders in California 

Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school 

#RockyFire Update => California Hwy. 20 closed in both directions due to Lake County fire - #CAfire #wildfires

#flood #disaster Heavy rain causes flash flooding of streets in Manitou, Colorado Springs areas

I'm on top of the hill and I can see a fire in the woods...

There's an emergency evacuation happening now in the building across the street

I'm afraid that the tornado is coming to our area...



In [7]:
# non disaster tweets 

for i in loop:
    print(train.loc[train.target == 0].reset_index().text[i])
    print('')

What's up man?

I love fruits

Summer is lovely

My car is so fast

What a goooooooaaaaaal!!!!!!

this is ridiculous....

London is cool ;)

Love skiing

What a wonderful day!

LOOOOOOL



Note: The displayed samples show that the documents contain words that carry little meaning in themselves (e.g., 'are', 'the', 'by'), special characters, and numbers.

While these elements might not support the analysis on their own, they could be used to derive other features that facilitate our task.
For instance, articles and numbers might describe concrete events indicating a real disaster, while certain special characters could hint at urgency.

Thus, before this potential information is removed, variables are derived to capture some of the assumed indicators:

    1) Concrete language: the use of articles ('a', 'an', etc.), prepositions ('to', 'with', etc.), and quantifiers (numbers, 'many', 'few', etc.) might indicate a highly contextualized tweet.
    2) Excitement: exclamation marks might indicate exciting situations.

These categories could be captured with more advanced psycholinguistic software such as LIWC or DICTION. For the purpose of this notebook, I chose a simple counting algorithm to obtain the respective word counts.

### 2) Define functions for data preparation

In [8]:
# create a dictionary holding wordcounts. 

def dictionary_empty(wordlist):
    """
    Creates an empty dictionary containing words specified in the input list.
    Input: List of words
    Return: Empty dictionary 
    """
    # one zero-initialised entry per distinct word in the list
    return dict.fromkeys(wordlist, 0)

In [9]:
# count the words of a wordlist
def count_words(wordlist, corpus):
    """
    Searches corpus for words in dictionary and returns a sum of these words for each document.
    Input: wordlist with words, corpus of documents
    Return: Dataframe with category word counts
    """
    
    di = dictionary_empty(wordlist)

    counts = []
    for txt in corpus:
        # tally listed words among the whitespace-separated tokens
        for w in txt.split():
            if w in di:
                di[w] += 1

        # total number of matches in this document
        counts.append(sum(di.values()))

        # reset the counts before the next document
        di = dictionary_empty(wordlist)

    # single-column DataFrame (column 0), one row per document
    df = pd.DataFrame(counts).reset_index(drop=True)
    return df

In [10]:
# clean corpus

def clean_txt(corpus):
    """
    This function takes a Series of text strings and returns a cleaned corpus.
    Steps, in order of execution:
    Remove special characters (everything except letters)
    Convert letters to lowercase
    Remove words with less than three letters
    Remove stop words (nltk)
    
    Inputs: Series with text strings
    Returns: Series with cleaned text strings
    """

    df = pd.DataFrame()
    
    # rm special characters (regex=True is needed since pandas 2.0, where the default became False)
    df['clean'] = corpus.fillna('').str.replace("[^a-zA-Z]", " ", regex=True)
    # lower case 
    df['clean'] = df['clean'].str.lower()
    # rm words with less than 3 letters
    df['clean'] = df['clean'].apply(lambda x: ' '.join([w for w in x.split() if len(w) > 2]))
    # stop words (as a set for fast membership tests)
    stop_words = set(stopwords.words('english'))
    # tokenize 
    tokenized = df['clean'].fillna('').apply(lambda x: x.split())
    # rm stop words
    tokenized = tokenized.apply(lambda x: [w for w in x if w not in stop_words])
    # de-tokenize
    detokenized = []
    for i in range(len(df)):
        t = ' '.join(tokenized[i])
        detokenized.append(t)
    df['clean'] = detokenized

    # define dfclean as corpus
    corpus = df['clean']
    
    return corpus    
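
As a quick sanity check, the first three cleaning steps can be tried on a toy Series. This is only a sketch: stop-word removal is left out so that it runs without the NLTK data, and `toy` is a made-up input rather than part of the notebook.

```python
import pandas as pd

# Made-up input: one tweet-like string and one missing value
toy = pd.Series(['13,000 people receive #wildfires evacuation orders!', None])

clean = (
    toy.fillna('')                                   # missing text becomes ''
       .str.replace('[^a-zA-Z]', ' ', regex=True)    # keep letters only
       .str.lower()                                  # lowercase
       .apply(lambda x: ' '.join(w for w in x.split() if len(w) > 2))  # drop short words
)
print(clean.tolist())
```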

### 3) Prepare data

In [11]:
# define dictionaries

words_concrete = 'a an the to with on in at for from by some many few all most more less'
words_concrete = words_concrete.split() # split to create a list

words_excitement = '! '
words_excitement = words_excitement.split()
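
One caveat: `count_words` splits each tweet on whitespace and matches tokens exactly, so '!' is only counted when it stands alone, and capitalised words such as 'The' do not match the lowercase word list. A possible alternative is sketched below. The helpers `count_char` and `count_tokens` are hypothetical names for illustration and are not used elsewhere in this notebook.

```python
import re

def count_char(text, char='!'):
    """Count occurrences of a single character anywhere in the text."""
    return text.count(char)

def count_tokens(text, wordlist):
    """Count lowercase alphabetic tokens that appear in wordlist."""
    words = set(wordlist)
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(1 for t in tokens if t in words)

sample = "What a goal!!! The crowd is on fire"

# Exact token matching on whitespace finds no standalone '!'
print(sample.split().count('!'))

# Character-level counting picks up the attached exclamation marks
print(count_char(sample))

# Case-insensitive token matching also catches 'The'
print(count_tokens(sample, ['a', 'the', 'on']))
```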

In [12]:
# define corpus

corpus = df.text

# create dictionary counts
di_concrete = count_words(words_concrete, corpus)
di_excitement = count_words(words_excitement, corpus)

# create clean corpus
corpus_clean = clean_txt(corpus)

# create tfidf tokens
vectorizer = TfidfVectorizer(max_features=2000, # consider only top N features          
                             max_df=0.9)        # ignore terms with a document frequency >0.9
tfidf = vectorizer.fit_transform(corpus_clean)
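
To illustrate what the two `TfidfVectorizer` arguments do, here is a small sketch on four made-up documents (not taken from the dataset). The thresholds are chosen for the toy corpus only and differ from the values used above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents; 'fire' occurs in 3 of 4 (75%)
docs = [
    'fire near forest',
    'fire in building',
    'fire and smoke',
    'lovely summer day',
]

# max_df=0.7 drops terms that occur in more than 70% of the documents
vec = TfidfVectorizer(max_df=0.7)
matrix = vec.fit_transform(docs)
print(sorted(vec.vocabulary_))   # 'fire' is no longer part of the vocabulary
print(matrix.shape)              # (number of documents, number of remaining terms)

# max_features caps the vocabulary size at the most frequent terms
vec_small = TfidfVectorizer(max_features=3)
vec_small.fit(docs)
print(len(vec_small.vocabulary_))
```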

In [13]:
# calculate topic model

svd = TruncatedSVD(n_components=3,
                   algorithm='randomized',
                   n_iter=100,
                   random_state=37)
lsa = svd.fit_transform(tfidf)
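
The result of `fit_transform` has one row per document and one column per component, which is what later serves as the topic features. A minimal sketch on made-up documents, assuming two components instead of three because the toy corpus is tiny:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents, not taken from the dataset
docs = [
    'forest fire spreading near town',
    'wildfire evacuation ordered near forest',
    'flood water rising in streets',
    'heavy rain causes flood in town',
    'lovely summer day at the beach',
    'sunny day and lovely weather',
]

tfidf_toy = TfidfVectorizer().fit_transform(docs)

svd_toy = TruncatedSVD(n_components=2, random_state=37)
lsa_toy = svd_toy.fit_transform(tfidf_toy)

print(lsa_toy.shape)                      # (documents, components)
print(svd_toy.explained_variance_ratio_)  # share of variance captured per component
```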

In [14]:
# check topics 

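# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vocabulary = vectorizer.get_feature_names_out()

for i, comp in enumerate(svd.components_):
    # pair each term with its weight and keep the five largest weights
    terms_comp = zip(vocabulary, comp)
    sorted_terms = sorted(terms_comp, key=lambda x: x[1], reverse=True)[:5]
    print("Topic " + str(i + 1) + ": ")
    for t in sorted_terms:
        print(t[0])
        print("")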

Topic 1: 
http

via

fire

new

news

Topic 2: 
https

like

amp

fire

burning

Topic 3: 
https

http

via

youtube

disaster



### 4) Apply ML model

In [None]:
# merge features

X = pd.DataFrame(lsa, 
                 columns=['topic_1', 'topic_2','topic_3'])
X['concrete'] = di_concrete
X['excitement'] = di_excitement

X_train = X.loc[:len(train)-1]
y_train = df['target'].loc[:len(train)-1]
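
Because train and test were concatenated and the index was reset, the first `len(train)` rows are the training rows, and the test rows carry no target. A minimal sketch with made-up frames (`train_toy` and `test_toy` are hypothetical stand-ins) shows the split by position:

```python
import pandas as pd

# Made-up stand-ins for the train and test frames
train_toy = pd.DataFrame({'text': ['a', 'b', 'c'], 'target': [1, 0, 1]})
test_toy = pd.DataFrame({'text': ['d', 'e']})

both = pd.concat([train_toy, test_toy]).reset_index(drop=True)

n_train = len(train_toy)
train_part = both.iloc[:n_train]   # rows that have a label
test_part = both.iloc[n_train:]    # rows without a label (target is NaN)

print(train_part)
print(test_part)
```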



In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

clf = SVC(kernel='linear', C=1, random_state=37)
scores = cross_val_score(clf, X_train, y_train, cv=10)
print(scores)
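
`cross_val_score` returns one score per fold (accuracy by default for a classifier), so the mean and standard deviation are a compact summary. A sketch on synthetic data generated with `make_classification`, which is an assumption made so the example runs on its own and is not the tweet features:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic data with five features, standing in for X_train / y_train
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=37)

clf_toy = SVC(kernel='linear', C=1, random_state=37)
scores_toy = cross_val_score(clf_toy, X_toy, y_toy, cv=10)

# Summarise the ten fold scores
print(f"accuracy: {scores_toy.mean():.3f} +/- {scores_toy.std():.3f}")
```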

### 5) Wrap-up

This notebook uses tweets published in the "Disaster Tweets" challenge on Kaggle to demonstrate feature engineering on natural-language text.

Word counts, TF-IDF vectorization, and latent semantic analysis were applied to derive variables that can serve as input features for machine learning algorithms.