# Data cleaning 
This notebook performs the text preprocessing and linguistic processing steps that are relevant for the message data:
- noise cleaning
- text normalization
- tokenization
- lemmatization

Before finalising this notebook, I tried to translate the messages that are not in English and ran into several problems along the way, such as messages that end up as completely empty strings. A supplementary file, '/data/preprocess_help_functions.py', contains the helper functions used for that exploration.
The final input for the machine learning model is the lemmatized tokens of the message data.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick 
import matplotlib.dates as mdates
from matplotlib.ticker import PercentFormatter, FuncFormatter
%matplotlib inline

import sqlalchemy

from cycler import cycler

import seaborn as sns
sns.set()

import googletrans
from googletrans import Translator

import regex as re
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')

from nltk import tokenize # word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk import pos_tag, ne_chunk 
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from textacy import preprocessing
import textacy
from langdetect import detect

import spacy
nlp = spacy.load('en_core_web_sm')
import os 
# environment settings
pd.set_option('display.max_columns',None)
pd.set_option('display.max_rows',None)

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/asyagadzhalova/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/asyagadzhalova/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/asyagadzhalova/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/asyagadzhalova/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [2]:
os.getcwd()

'/Users/asyagadzhalova/Documents/GitHub/disaster_messages_classification/notebooks'

In [3]:
os.chdir('..')

In [4]:
df=pd.read_pickle(os.getcwd()+'/data/processed/data_after_eda.pkl')
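The `os.chdir('..')` cell above moves up one more directory every time it is re-run, so the relative path to the pickle can silently break. A minimal sketch of an alternative, assuming the project root is the first parent directory that contains a `data` folder (the helper name `find_project_root` and the `marker` argument are mine, not part of the notebook):

```python
from pathlib import Path


def find_project_root(start, marker="data"):
    """Walk upwards from `start` and return the first directory that has a `marker` subfolder."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"no parent of {start} contains a '{marker}' directory")


def processed_file(root, name):
    """Build the path to a file under <root>/data/processed without string concatenation."""
    return Path(root) / "data" / "processed" / name
```

With this, the load could be written as `pd.read_pickle(processed_file(find_project_root(Path.cwd()), 'data_after_eda.pkl'))`, which gives the same result whether the notebook runs from the repository root or from `notebooks/`.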

In [5]:
#df.isnull().sum()

## 1. Text processing 

### 1.1. Text preprocessing - text cleaning 
This step covers translation (where the text is not in English), noise removal and normalization.

In [6]:
df.head()

Unnamed: 0,id,message,genre,trans_ind,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,direct,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,7,Is the Hurricane over or is it not over,direct,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0
2,8,Looking for someone but no name,direct,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,0,1,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0
4,12,"says: west side of Haiti, rest of the country ...",direct,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


#### 1.1.1. Noise cleaning 

In [8]:
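def text_cleaner(serie):
    '''Function to normalize data: lower-case the text, remove special characters and noise,
    and mark whitespace-only rows with the placeholder 'empty_string'
    '''
    #lower case
    serie = serie.astype(str).str.lower()
    #cleaning (regex=True is explicit: newer pandas versions treat the pattern as a literal string by default)
    serie= serie.str.replace(r'://www.([\w\-\.]+\S+)','', regex=True) #remove the '://www...' part of a URL
    serie= serie.str.replace(r'[^\w\s]|\b\w{1,2}\b|\d+','', regex=True) #remove punctuation, words of 1-2 chars, digits
    serie= serie.str.replace(r'\s{3,}','empty_string', regex=True) #replace runs of 3+ whitespace chars with a placeholder
    return serie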
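One side effect of the function above is that removed characters are replaced by nothing and whitespace runs by the placeholder, so neighbouring words can end up glued together (for example `updateempty_stringcold` in the cleaned column further down). A minimal sketch of a variant that substitutes a space instead and only uses the placeholder when nothing is left. This is an illustration, not the code that produced the outputs in this notebook; the name `clean_message` and the broader URL pattern are my assumptions:

```python
import re

PLACEHOLDER = "empty_string"  # same marker the notebook uses for messages with no content

_URL = re.compile(r"(?:https?://|www\.)\S+")   # whole URL, including the scheme
_PUNCT_DIGITS = re.compile(r"[^\w\s]|\d+")     # punctuation and digits
_SHORT_WORDS = re.compile(r"\b\w{1,2}\b")      # words of one or two characters
_SPACES = re.compile(r"\s+")


def clean_message(text):
    """Lower-case one message and strip noise, keeping word boundaries intact."""
    text = str(text).lower()
    text = _URL.sub(" ", text)
    text = _PUNCT_DIGITS.sub(" ", text)
    text = _SHORT_WORDS.sub(" ", text)
    text = _SPACES.sub(" ", text).strip()   # collapse the gaps left by the removals
    return text if text else PLACEHOLDER
```

It works on a single string, so on the dataframe it would be applied as `df['message'].map(clean_message)`. Rows with no remaining content still come out as exactly `'empty_string'`, so the drop step that follows would work unchanged.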

In [9]:
df['message_clean'] = text_cleaner(df['message'])



In [10]:
df['message_clean'].isna().sum()

0

In [11]:
#find the rows that contain only the empty-string placeholder - they carry no information and are dropped in the next cell
df.index[df['message_clean']=='empty_string'].values

array([ 7534, 12185, 12189, 12222])

In [12]:
df.drop(df.index[df['message_clean']=='empty_string'].values,axis=0, inplace=True)

In [13]:
df.shape

(26176, 40)

In [14]:
df.reset_index(inplace=True)

In [15]:
df.drop('index',axis=1,inplace=True)

#### 1.1.2. Language detection and translation of non-English text to English

In [16]:
def detect_language(text):
    '''Function to detect the language of a given text.
    Texts of 10 characters or fewer are skipped and return None.
    '''
    if len(text)>10:
        lang = detect(text)
        return lang
    return None
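The function above lets any detection error stop the whole `map` call. A minimal sketch of a more defensive wrapper, assuming the detector is passed in as a callable so the logic can be tried without the library installed (the names `detect_language_safe`, `detector` and `min_chars` are hypothetical):

```python
def detect_language_safe(text, detector, min_chars=10):
    """Return the detected language code, or None when detection is skipped or fails."""
    # Same length rule as the notebook: only texts longer than `min_chars` are checked.
    if not isinstance(text, str) or len(text) <= min_chars:
        return None
    try:
        return detector(text)
    except Exception:
        # Broad on purpose in this sketch; with langdetect the specific error is
        # LangDetectException (raised, for example, when the text has no usable features).
        return None
```

In the notebook this could be used as `df['message_clean'].map(lambda t: detect_language_safe(t, detect))`. langdetect is not deterministic by default, so setting `DetectorFactory.seed = 0` (imported from `langdetect`) before detecting should make the language counts reproducible between runs.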

In [17]:
df['lang'] = df['message_clean'].map(detect_language)

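The translator function did not work, so its code has been removed from this notebook. The only remaining option is to drop the rows whose detected language is not English. They make up less than 2% of the total data (476 rows, including the few messages too short for detection, where lang is missing), so we can drop them.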

In [18]:
df[df['lang']!='en'].shape[0]

476
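As a quick sanity check on the decision to drop these rows, the share can be worked out from the counts reported in this notebook (26176 rows after noise cleaning, 476 of them not detected as English). A minimal sketch with the numbers copied by hand rather than read from `df`:

```python
# Counts copied from the notebook outputs, not recomputed from the dataframe.
rows_before = 26176
rows_not_english = 476

share = rows_not_english / rows_before
rows_after = rows_before - rows_not_english

print(f"{share:.2%} of rows dropped, {rows_after} rows kept")
```

This gives about 1.82% dropped and 25700 rows kept, which matches the number of rows labelled `en`.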

In [19]:
df['lang'].value_counts()

en    25700
fr       88
id       56
af       50
nl       35
da       30
it       28
pt       26
no       25
so       19
es       19
et       15
ca       14
cy       13
tl        9
sq        9
sv        6
ro        6
sl        3
fi        3
pl        3
cs        2
de        2
tr        2
lt        1
hr        1
sw        1
sk        1
Name: lang, dtype: int64

In [20]:
df.shape

(26176, 41)

In [22]:
df.drop(df.index[df['lang']!='en'].values,axis=0, inplace=True)

In [23]:
df.shape

(25700, 41)

In [24]:
def text_cleaner_tokens(serie, words=('thank you','thanks','thank')):
    '''Remove the empty-string placeholder and stop words, then lemmatize the remaining words
    Input: messages already cleaned from noise
    Output: messages reduced to space-separated lemmatized tokens'''
    #the stemmer is only used to match words against the stop list; the lemmatizer produces the output tokens
    st = PorterStemmer()
    lemmatizer = WordNetLemmatizer()
    #NLTK English stop words plus the extra courtesy words, compared in stemmed form
    stop = stopwords.words('english') + list(words)
    #stop = list(words) #disabled alternative: keep the NLTK stop words since they convey some meaning
    stop = [st.stem(x) for x in stop]
    #cleaning
    serie = serie.str.replace('empty_string','', regex=False) #remove the placeholder
    #split on whitespace, drop stop words, lemmatize what is left
    serie = serie.apply(lambda x: " ".join([lemmatizer.lemmatize(word) for word in x.split()
                                           if st.stem(word) not in stop]))
    return serie
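The stop-word filter above compares one whitespace-separated word at a time with the stop list, so a two-word entry such as 'thank you' cannot match by itself. The phrase still disappears from the tokens only because 'thank' is in the extra words and 'you' is an NLTK stop word. A minimal sketch of that behaviour, using plain lower-casing in place of the Porter stemmer and a hand-written stop list (both simplifications are mine):

```python
def filter_tokens(text, stop_words):
    """Drop every whitespace-separated word of `text` that appears in `stop_words`."""
    stop = {word.lower() for word in stop_words}
    return [word for word in text.split() if word.lower() not in stop]


message = "please need tents and water thank you"

# A multi-word entry never equals a single token, so nothing is removed.
print(filter_tokens(message, ["thank you"]))

# Listing the words separately removes both of them.
print(filter_tokens(message, ["thank", "you"]))
```

If whole phrases have to be removed, one option is to delete them from the message string before it is split into words.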

In [25]:
df['tokens']=text_cleaner_tokens(df['message_clean'])

In [26]:
df.head(10)

Unnamed: 0,id,message,genre,trans_ind,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report,message_clean,lang,tokens
0,2,Weather update - a cold front from Cuba that c...,direct,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,weather updateempty_stringcold front from cuba...,en,weather updatecold front cuba could pas haiti
1,7,Is the Hurricane over or is it not over,direct,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,the hurricane overempty_stringnot over,en,hurricane overnot
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,0,1,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,reports leogane destroyed only hospital cro...,en,report leogane destroyed hospital croix functi...
4,12,"says: west side of Haiti, rest of the country ...",direct,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,says west side haiti rest the country today ...,en,say west side haiti rest country today tonight
5,14,Information about the National Palace-,direct,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,information about the national palace,en,information national palace
6,15,Storm at sacred heart of jesus,direct,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,storm sacred heart jesus,en,storm sacred heart jesus
7,16,"Please, we need tents and water. We are in Sil...",direct,1,1,1,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,please need tents and water are silo thank you,en,please need tent water silo
8,17,"I would like to receive the messages, thank you",direct,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,would like receive the messages thank you,en,would like receive message
9,18,I am in Croix-des-Bouquets. We have health iss...,direct,1,1,1,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,empty_stringcroixdesbouquets have health issu...,en,croixdesbouquets health issue worker santoarea...
10,20,"There's nothing to eat and water, we starving ...",direct,1,1,1,0,1,1,1,0,0,0,1,1,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,1,1,1,0,0,0,0,0,1,there nothing eat and water starving and thi...,en,nothing eat water starving thirsty


In [27]:
df.to_pickle(os.getcwd()+'/data/processed/data_after_text_processing.pkl')