<div class="alert alert-info alertinfo" style="margin-top: 0px">
<h1> Natural Language Processing with Disaster Tweets </h1>
part 2 - data cleaning
</div>

<div class="alert-success" style="margin-top: 0px">
<h1> Imports </h1>
</div> 

In [1]:
# standard
import pandas as pd

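# NLP
# note: the NLTK resources used below ('words', 'stopwords', 'punkt') must be available locally;
# if one is missing, fetch it once with nltk.download('<resource name>')
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt')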
import re
import nltk
from nltk.corpus import words
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk import word_tokenize, pos_tag, ne_chunk
from collections import Counter

### 1. read data set

In [2]:
df = pd.read_csv('train.csv')
test_set = pd.read_csv('test.csv')
df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


<div class="alert-success" style="margin-top: 0px">
<h1> Data cleaning </h1>

- To keep the code clean, the full investigation into which cleaning functions are needed is not included here.
- Each text may require different cleaning steps, so it is important to inspect the data and adjust the cleaning steps accordingly.
- Cleaning functions are also adjusted or added when evaluating model accuracy, as that is when we can see why some texts are misclassified.
- This process typically involves many back-and-forth adjustments; here I show just one iteration.
- I also show the logic behind those changes.

</div> 

### 1. missing values
Since we are interested in strings, we replace each missing value with ''.

In [3]:
# helper that summarises a data frame (dtypes, missing values, unique values)
def summary_table(df):
    '''
    Creates a summary info table about a given data frame
    
    Args:
        df: data frame
        
    Returns:
        summary: info data frame
    '''
    print('There are {} rows in the original data'.format(df.shape[0]))
    summary = pd.DataFrame(df.dtypes, columns = ['dtypes'])
    summary = summary.reset_index()
    summary['Name'] = summary['index']
    summary = summary[['Name', 'dtypes']]
    summary['Missing'] = df.isnull().sum().values
    summary['Uniques'] = df.nunique().values
    return summary

# replace every missing value with an empty string before summarising
df.fillna('', inplace=True)
summary_table(df)

There are 7613 rows in the original data


Unnamed: 0,Name,dtypes,Missing,Uniques
0,id,int64,0,7613
1,keyword,object,0,222
2,location,object,0,3342
3,text,object,0,7503
4,target,int64,0,2


### 2. pre-clean and corpus set up
A first version of the cleaning functions; they will be updated later on.

In [4]:
# Definitions 1 (those are initial definitions - we might adjust them later when re-evaluating our models)
STOPWORDS = set(stopwords.words('english'))

def remove_urls(text):
    # drop links: any token starting with http/https or www
    text = re.sub(r"http\S+|www\S+", '', text)
    # drop HTML tags such as <br>
    text = re.sub(r'<.*?>', '', text)
    return text

def remove_emails(text):
    # drop e-mail addresses
    text = re.sub(r'\S+@\S+', '', text)
    # drop @mentions (user handles)
    text = re.sub(r'(^|\s)@\w+', '', text)
    return text.strip()

def remove_foreign_characters(text):
    text = re.sub(r'([^\x00-\x7F])+', '', text)
    return text

def remove_short_words(text):
    text = ' '.join([word for word in text.split() if len(word) > 2])
    return text

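def remove_stopwords(text):
    # use the module-level STOPWORDS set directly; this avoids rebuilding the set
    # on every call and shadowing the imported `stopwords` module
    return ' '.join([word for word in text.split() if word.lower() not in STOPWORDS])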

def remove_symbols_and_numbers(text):
    text = ''.join(' ' if not c.isalpha() else c for c in text)
    text = ' '.join(text.split())
    return text

def clean_phase_1(text):
    text = text.lower()
    text = remove_urls(text)
    text = remove_emails(text)
    text = remove_foreign_characters(text)
    text = remove_symbols_and_numbers(text)
    text = remove_stopwords(text)
    text = remove_short_words(text)
    return text
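
To make the order of the steps concrete, here is a minimal, self-contained sketch that mirrors the main steps of `clean_phase_1` on a single tweet. The names `demo_clean` and `DEMO_STOPWORDS` and the sample string are made up for illustration, and the tiny stopword set only stands in for NLTK's English list, so the real function may drop more words.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list (assumption, not the real list)
DEMO_STOPWORDS = {"the", "are", "of", "this", "our", "to", "in", "at"}

def demo_clean(text):
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", "", text)    # links
    text = re.sub(r"\S+@\S+", "", text)           # e-mail addresses
    text = re.sub(r"(^|\s)@\w+", "", text)        # @mentions
    text = re.sub(r"[^\x00-\x7F]+", "", text)     # non-ASCII characters
    # every non-letter (digits, punctuation, '#') becomes a space
    text = "".join(c if c.isalpha() else " " for c in text)
    # drop stopwords and words of length <= 2
    tokens = [t for t in text.split() if t not in DEMO_STOPWORDS and len(t) > 2]
    return " ".join(tokens)

sample = "Our Deeds are the Reason of this #earthquake http://t.co/abc @user 13,000"
print(demo_clean(sample))  # deeds reason earthquake
```

The order matters: links and mentions are removed before symbols are replaced by spaces, otherwise `http://t.co/abc` would be split into leftover tokens such as `http` and `abc`.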

In [5]:
english_words = set(words.words())

# find 'important words' - 'keywords' in our training set
data = df.copy()
feature_name = 'keyword'
data[feature_name] = data[feature_name].apply(clean_phase_1)
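set_values = set(data[feature_name].values)
set_values.discard('')  # rows without a keyword; discard avoids a KeyError if there are none
important_words = set()
for value in set_values:
    # split inline so the imported nltk `words` corpus is not overwritten
    important_words.update(value.split())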

# find 'unimportant words' - 'locations' in our training set
feature_name = 'location'
data[feature_name] = data[feature_name].apply(clean_phase_1)
location_tokens = ' '.join(data[feature_name]).split()
# keep each distinct location token longer than 2 characters
unimportant_words = {word for word in location_tokens if len(word) > 2}
words_to_keep = unimportant_words.intersection(important_words)
unimportant_words = unimportant_words - words_to_keep

# updating corpus
english_words.update(important_words)
english_words = english_words - unimportant_words
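
The corpus update above is plain set algebra. A toy sketch with made-up words (all names and values here are illustrative, not taken from the data set) shows the effect: keyword words are always added, location words are removed unless they are also keyword words.

```python
# Toy vocabularies (assumed values, for illustration only)
english_vocab = {"fire", "forest", "london", "paris", "near"}
keyword_words = {"fire", "wildfire", "flood"}
location_words = {"london", "paris", "fire", "texas"}

# words that appear both as a location and as a keyword are protected
words_to_keep = location_words & keyword_words      # {"fire"}
location_only = location_words - words_to_keep      # {"london", "paris", "texas"}

english_vocab |= keyword_words    # add every keyword word
english_vocab -= location_only    # drop pure location words

print(sorted(english_vocab))  # ['fire', 'flood', 'forest', 'near', 'wildfire']
```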

In [6]:
# train data words
feature_name = 'text'
data[feature_name] = data[feature_name].apply(clean_phase_1)
all_text = ' '.join(data['text'].dropna())
tokens = all_text.split()  # named `tokens` so the imported nltk `words` corpus is not overwritten
word_counts = Counter(tokens)
word_counts_df = pd.DataFrame.from_dict(word_counts, orient='index', columns=['count']).reset_index()
word_counts_df.columns = ['word', 'count']
word_counts_df = word_counts_df.sort_values(by='count', ascending=False).reset_index(drop=True)
data_words_df = word_counts_df[word_counts_df['count'] >= 5]
train_data_words = set(data_words_df['word'].values)

train_words_common = train_data_words.intersection(english_words)
train_words_uncommon = train_data_words - train_words_common

# updating corpus
english_words.update(train_data_words) # all words on the first phase
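
The frequency cut-off can be sketched on a toy corpus. The texts and the lower threshold `min_count = 2` are assumptions chosen so the example stays small; the cell above uses a threshold of 5.

```python
from collections import Counter

# Toy cleaned texts (made up for illustration)
texts = ["fire near forest", "fire in canada", "canada fire alert", "rare word"]

counts = Counter(" ".join(texts).split())
min_count = 2  # assumption for this sketch; the notebook keeps words with count >= 5
frequent = {word for word, count in counts.items() if count >= min_count}

print(sorted(frequent))  # ['canada', 'fire']
```

Because every frequent training word is added to the corpus at this point, a location word that was removed in the previous step re-enters the corpus if it occurs often enough in the tweets.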

### 3. text cleaning

In [7]:
# prepare_for_ml
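def filter_words(text, english_words, apply_filter=True):
    # keep only words found in the corpus; `apply_filter` replaces the old
    # parameter name `filter`, which shadowed the Python built-in
    if apply_filter:
        words = text.split()
        text = ' '.join([word for word in words if word.lower() in english_words])
    return text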

def stem_text(text):
    porter = PorterStemmer()
    words = word_tokenize(text)
    stemmed_words = [porter.stem(word) for word in words]
    stemmed_text = ' '.join(stemmed_words)
    return stemmed_text

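def prepare_for_ml(data, english_words, apply_filter=True):
    # note: modifies `data` in place and also returns it.
    # The original passed the built-in `filter` (always truthy) as the flag,
    # so filtering was always on; the default True keeps that behaviour.
    data.fillna('', inplace=True)
    data['keyword'] = data['keyword'].apply(clean_phase_1)
    data['keyword'] = data['keyword'].apply(lambda x: filter_words(x, english_words, apply_filter)).apply(stem_text)
    data['processed text'] = data['text'].apply(clean_phase_1)
    data['processed text'] = data['processed text'].apply(lambda x: filter_words(x, english_words, apply_filter)).apply(stem_text)
    return data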

df = prepare_for_ml(df, english_words)
test_set = prepare_for_ml(test_set, english_words)
df.head()

Unnamed: 0,id,keyword,location,text,target,processed text
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,reason earthquak may allah forgiv
1,4,,,Forest fire near La Ronge Sask. Canada,1,forest fire near canada
2,5,,,All residents asked to 'shelter in place' are ...,1,resid ask shelter place notifi offic evacu she...
3,6,,,"13,000 people receive #wildfires evacuation or...",1,peopl receiv wildfir evacu order california
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,got sent photo rubi alaska smoke wildfir school
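
As a rough illustration of why row 1 shrinks from "Forest fire near La Ronge Sask. Canada" to "forest fire near canada", the sketch below applies only the vocabulary filter to an already cleaned string. `keep_known_words` and `demo_vocabulary` are hypothetical stand-ins for `filter_words` and the corpus built above, and stemming is left out.

```python
def keep_known_words(text, vocabulary):
    # keep only the tokens that appear in the vocabulary
    return " ".join(word for word in text.split() if word.lower() in vocabulary)

# Assumed toy vocabulary; the real corpus is far larger
demo_vocabulary = {"forest", "fire", "near", "canada"}

# what the text looks like after lowercasing, symbol removal and dropping
# short words ("la" has only two characters, so it is already gone)
cleaned = "forest fire near ronge sask canada"

print(keep_known_words(cleaned, demo_vocabulary))  # forest fire near canada
```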


### we are now ready for part 3 - building ML models