<h1>Haiti 2 - Cleaning and Processing Haiti Data</h1>

In the last notebook you explored the dataset. In this notebook, you will load in the same [text dataset](https://github.com/rmunro/disaster_response_messages). You'll then clean and process the data using several Natural Language Processing (NLP) techniques.

**In this lab you will apply the following steps:**
1. Import Python Packages
2. Load the data
3. Text cleaning and processing
4. Explore the number of tokens
5. Represent a word as a count-based vector (Bag of Words)
6. Explore the top words

## 1. Import Python Packages

In [None]:
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from termcolor import colored
import gensim
import nltk
nltk.download('averaged_perceptron_tagger')

import utils

pd.options.mode.chained_assignment = None
print('All packages imported successfully!')

## 2. Load the data

In [None]:
# Load the datasets
training_data = pd.read_csv("data/disaster_response_training.csv", low_memory=False)
validation_data = pd.read_csv("data/disaster_response_validation.csv", low_memory=False)
test_data = pd.read_csv("data/disaster_response_test.csv", low_memory=False)
print('Datasets loaded successfully!')

# Merge the three datasets
full_data = pd.concat([training_data, validation_data, test_data])  # DataFrame.append was removed in pandas 2.0
# Fix column data type
full_data['original'] = full_data['original'].astype(str)

# Select only the Haiti data
haiti_df = full_data[full_data.event == 'haiti_earthquake'].copy()  # copy so new columns are not written to a slice
# Fix column data type
haiti_df['actionable_haiti'] = haiti_df.actionable_haiti.astype('int64')
haiti_df['date_haiti'] = pd.to_datetime(haiti_df.date_haiti)
print('Haiti data selected!')
haiti_df.head()

## 3. Text cleaning and processing

Before you can perform any modelling on the text, you need to go through a series of steps to clean and process the data. The steps are:

1. **Tokenize:** This splits strings into tokens based on whitespace and punctuation. It also splits contractions (e.g. `can't` becomes `ca` and `n't`).
2. **Remove punctuation:** To remove punctuation tokens, you will use `string.punctuation`, which is a string containing the ASCII punctuation characters. It is not exhaustive, so you may need to add further punctuation marks for your specific dataset (check the `utils` file for details).
3. **Standardize letter case:** You also want to ensure that all words follow the same format in order to recognize duplicate words in a message. To do this, you'll convert all words to lowercase. 
4. **Remove stop words:** These are words that occur very often in speech or text, such as "and". You may not want to include them in your final analysis because they can skew the results of what you're trying to analyze.
5. **Lemmatize each word:** Lemmatization reduces each word to its dictionary form (lemma). In this case, you want to know what general topics are being discussed in relation to the Haiti earthquake, so it doesn't matter much whether someone mentions "help", "helped", or "helping". Lemmatization is not the best choice for every task, though, because it discards distinctions such as tense.
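
The five steps can be sketched end to end with only the standard library. This is a toy illustration, not the notebook's implementation: the regex tokenizer, the tiny `TOY_STOP_WORDS` set, and the `TOY_LEMMAS` lookup are all assumptions standing in for `word_tokenize`, `utils.STOP_WORDS`, and `WordNetLemmatizer`.

```python
import re
import string

# Hypothetical stand-ins for the real resources used later in the notebook
TOY_STOP_WORDS = {"we", "are", "the", "and", "in", "a", "of", "to"}
TOY_LEMMAS = {"helping": "help", "helped": "help", "tents": "tent", "needs": "need"}


def toy_process(message):
    # Step 1: tokenize into runs of word characters or single punctuation marks
    tokens = re.findall(r"\w+|[^\w\s]", message)
    # Step 2: drop tokens that are punctuation characters
    tokens = [t for t in tokens if t not in string.punctuation]
    # Step 3: standardize letter case
    tokens = [t.lower() for t in tokens]
    # Step 4: remove stop words
    tokens = [t for t in tokens if t not in TOY_STOP_WORDS]
    # Step 5: replace each token with its lemma when the toy lookup knows it
    return [TOY_LEMMAS.get(t, t) for t in tokens]


print(toy_process("We are helping, and the village needs tents!"))
```

A real lemmatizer covers far more words than a lookup table can, which is why the notebook relies on WordNet rather than a hand-written dictionary.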

### 3.1 Clean and process a single message
Before you process the entire dataset, let's look at a random message and see how each step cleans and processes the words in it. The cell below contains the code for each step. In the cell after it, you can run that code interactively.

In [None]:
# Get stopwords and punctuation from the utils file
STOP_WORDS = utils.STOP_WORDS
punctuation = utils.punctuation
# Instantiate the lemmatizer
lemmatizer = WordNetLemmatizer()


def process_random_message(df):
    message = utils.get_random_message(df)
    
    # Step 1: Tokenize and get POS tags
    tokens = nltk.pos_tag(word_tokenize(message))
    print(colored("Step 1: Tokenize\n", "blue"), "{}\n".format([w[0] for w in tokens]))
    
    # Step 2: Standardize Lettercase
    tokens = [(w[0].lower(), w[1]) for w in tokens]
    print(colored("Step 2: Lowercase\n", "blue"), "{}\n".format([w[0] for w in tokens]))

    # Step 3: Remove punctuation
    tokens = [w for w in tokens if w[0] not in punctuation]
    print(colored("Step 3: Remove punctuation\n", "blue"), "{}\n".format([w[0] for w in tokens]))

    # Step 4: Remove stop words
    tokens = [w for w in tokens if w[0] not in STOP_WORDS]
    print(colored("Step 4: Remove stop words\n", "blue"), "{}\n".format([w[0] for w in tokens]))

    # Step 5: Lemmatize each word
    tokens = [lemmatizer.lemmatize(w[0], utils.pos_tag_convert(w[1])) for w in tokens]
    print(colored("Step 5: Lemmatize\n", "blue"), "{}\n".format(tokens))
    
    return tokens
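
The lemmatizer expects a WordNet part-of-speech code, while `nltk.pos_tag` returns Penn Treebank tags, so a conversion step sits between the two. The implementation of `utils.pos_tag_convert` is not shown in this notebook. The sketch below is one common way to write such a mapping, and the function name `penn_to_wordnet` is hypothetical.

```python
def penn_to_wordnet(penn_tag):
    """Map a Penn Treebank tag to a WordNet POS code (assumed mapping)."""
    # WordNet uses single letters: 'a' adjective, 'v' verb, 'r' adverb, 'n' noun
    if penn_tag.startswith("JJ"):
        return "a"
    if penn_tag.startswith("VB"):
        return "v"
    if penn_tag.startswith("RB"):
        return "r"
    # Fall back to noun, which is also WordNetLemmatizer's default POS
    return "n"


for tag in ["NNS", "VBD", "JJ", "RB", "IN"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Passing the right part of speech matters because the same surface form can have different lemmas as a noun and as a verb.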


In [None]:
utils.interact_with_filters(process_random_message, haiti_df)

### 3.2 Process the entire dataset

You can put all of the preceding cleaning steps together into one function (defined as `process_text` in the `utils` file). Now you can run it on the whole dataframe and save the tokenized messages to a new column.

In [None]:
# Process all messages and save tokens to a new column
haiti_df["message_tokens"] = haiti_df.message.apply(
    utils.process_text,
    tokenizer=word_tokenize, pos_tagger=nltk.pos_tag, lemmatizer=lemmatizer, stopwords=STOP_WORDS, punctuation=punctuation
)

haiti_df.head()

You can now look at some of the messages and their tokens directly from the dataframe.

In [None]:
utils.interact_with_filters(utils.print_messages, haiti_df, number_of_messages=5)

## 4. Explore the number of tokens

### 4.1 Count the number of words and tokens
What is the distribution of the length of each message? You can calculate that and use a histogram to plot it.

In [None]:
# Count number of words and save as a column
haiti_df["num_words"] = haiti_df.message.apply(lambda x: len(x.split()))  # split() ignores repeated spaces

# Count number of tokens and save as a column
haiti_df["num_tokens"] = haiti_df.message_tokens.apply(lambda x: len(x))

utils.interact_with_filters(utils.histogram_number_of_words, haiti_df)
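
To see what the histogram summarizes, here is a small sketch on made-up data. The three messages and their token lists are invented for illustration. The word count comes from the raw text while the token count comes from the processed list, so the second is usually smaller.

```python
from collections import Counter

# Invented examples: a raw message paired with its processed tokens
examples = [
    ("We need water and food in the camp", ["need", "water", "food", "camp"]),
    ("Thank you", ["thank"]),
    ("Where can I find a hospital ?", ["find", "hospital"]),
]

num_words = [len(message.split()) for message, _ in examples]
num_tokens = [len(tokens) for _, tokens in examples]

print("words: ", num_words)
print("tokens:", num_tokens)

# A text-only histogram: how many messages have each token count
token_histogram = Counter(num_tokens)
for length in sorted(token_histogram):
    print(f"{length} token(s): {'#' * token_histogram[length]}")
```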

Several of your messages are quite short. Let's see which messages have only 1 token after processing.

In [None]:
haiti_df[haiti_df.num_tokens==1][['message', 'message_tokens']].sample(5)

As you can see, this helps you find messages that may be anomalous or strange.

## 5. Represent a word as a count-based vector (Bag of Words)

### 5.1 Explore a mini corpus
To make things easy to visualize and comprehend, you can look at a small corpus consisting of just a few messages below.

In [None]:
utils.interact_with_filters(
    utils.mini_corpus, haiti_df,
    corpus_size=3
)
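
The idea behind Bag of Words can also be worked through by hand. The sketch below uses an invented three-message corpus and plain Python rather than `gensim`: it builds a vocabulary and then turns each message into a vector of counts, one position per vocabulary word.

```python
from collections import Counter

# Invented mini corpus of already-processed messages
mini_corpus = [
    ["need", "water", "food"],
    ["need", "tent", "need", "water"],
    ["food", "hospital"],
]

# Vocabulary: every distinct token, sorted so the vector positions are stable
vocabulary = sorted({token for doc in mini_corpus for token in doc})


def to_count_vector(doc):
    # Count each token in the message, then read the counts in vocabulary order
    counts = Counter(doc)
    return [counts[token] for token in vocabulary]


vectors = [to_count_vector(doc) for doc in mini_corpus]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Word order is lost in this representation; only how often each word occurs is kept. `gensim`'s `doc2bow` stores the same information sparsely, as a list of `(token_id, count)` pairs that leaves out the zeros.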

### 5.2 Apply BoW to full corpus

In [None]:
# All of the messages in the dataset
corpus = haiti_df.message_tokens

# Create the Dictionary
corpus_dictionary = gensim.corpora.Dictionary(corpus)

print(colored("Before filter: ", "blue"), f"Dictionary contains a total of {len(corpus_dictionary)} unique words")

# Keep only words that appear in at least 5 messages (the defaults also drop
# words found in more than 50% of messages and keep at most 100,000 words)
corpus_dictionary.filter_extremes(no_below=5)

# Create bag of words out of the full corpus
corpus_bow = [corpus_dictionary.doc2bow(doc) for doc in corpus]

print(colored("After filter: ", "blue"), f"Dictionary contains a total of {len(corpus_dictionary)} unique words")
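
`filter_extremes` works on document frequency, meaning the number of messages a token appears in, not on the total number of occurrences. The plain-Python sketch below imitates only the `no_below` part of that behavior on an invented corpus. It is an approximation for illustration, not gensim's implementation, and the threshold of 2 is chosen only to suit the tiny corpus.

```python
from collections import Counter

# Invented corpus of processed messages
corpus = [
    ["need", "water", "water", "water"],
    ["need", "food"],
    ["need", "tent"],
]

# Document frequency: count each token at most once per message
document_frequency = Counter(token for doc in corpus for token in set(doc))

no_below = 2  # hypothetical threshold for this tiny corpus
kept = sorted(t for t, df in document_frequency.items() if df >= no_below)

# "water" occurs three times overall but in only one message
print(document_frequency["water"])
print(kept)
```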

## 6. Explore the top words

Finally, you can look at the words used most often across all messages. You can do this by combining all of the messages into one long document and then counting the most common words.

In [None]:
utils.interact_with_filters(utils.explore_top_tokens, haiti_df)
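
The counting itself can be sketched in a few lines. The token lists below are invented, and this is only an assumption about the general approach, since the implementation of `utils.explore_top_tokens` is not shown here.

```python
from collections import Counter

# Invented token lists, one per message
messages_tokens = [
    ["need", "water", "food"],
    ["need", "tent"],
    ["water", "need"],
]

# Flatten all messages into one long list of tokens, then count
all_tokens = [token for tokens in messages_tokens for token in tokens]
top_tokens = Counter(all_tokens).most_common(2)

print(top_tokens)
```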

You can use a visualization tool such as a word cloud to plot the top words used in all of the messages overall.

In [None]:
utils.interact_with_filters(utils.wordcloud_from_top_words, haiti_df, n=200)

In [None]:
utils.interact_with_filters(utils.relative_words_visualization, haiti_df, n=50, show_other=False)

You can save the updated dataframe as a Python pickle below. The `pickle` module serializes and de-serializes Python object structures, so you can save an object and later load it back exactly as it was. If you save the dataframe as .csv instead, some information about column types may be lost; for example, the token lists would be written out as plain text.

In [None]:
#import pickle
#with open('haiti_df.pkl', 'wb') as f:
#    pickle.dump(haiti_df, f)
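
The difference between the two formats can be seen on a single invented record, using only the standard library. This is a simplified sketch: it round-trips a plain dictionary rather than the dataframe, but the same loss of type information applies to a column of token lists.

```python
import csv
import io
import pickle

# Invented record with a list-valued field, like the message_tokens column
row = {"message": "We need water", "message_tokens": ["need", "water"]}

# Pickle round trip: the list of tokens comes back as a list
restored = pickle.loads(pickle.dumps(row))

# CSV round trip: every value is written as text and read back as a string
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
buffer.seek(0)
from_csv = next(csv.DictReader(buffer))

print(type(restored["message_tokens"]).__name__)
print(type(from_csv["message_tokens"]).__name__)
```

Only load pickle files that you created yourself or that come from a source you trust, because unpickling can execute arbitrary code.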

Citation:
- Robert Munro. 2012. Processing short message communications in low-resource languages. [PhD dissertation, Stanford University]. Stanford Digital Repository. Retrieved from https://purl.stanford.edu/cg721hb0673