# Preprocessing Documentation

<insert why preprocessing is important here>

_________________________________________________________________
### Step 1: Import necessary libraries:

In [1]:
import spacy
from TagalogStemmer import TglStemmer
import json
import re

 #### spaCy
 
The spaCy library is used to preprocess and lemmatize English words. It has a built-in English dictionary and is comparatively faster than NLTK at word tokenization.
 <br />
 
 #### TagalogStemmer
 Because Tagalog is currently not supported by the NLP libraries in Python, we used the Tagalog Words Stemmer (https://github.com/crlwingen/TagalogStemmerPython/blob/master/TglStemmer.py) to process Tagalog words. It is meant to remove the affixes of a Tagalog word and return the root word.
 
 *Note: I modified it slightly by removing the print statements and cleaning the input.*
 <br />
 
 ___________________
 Modifications
<ul>
<li>
*When reading from a string:*

From:
```python
tokens = source.split(' ')
```
To:
```python
tokens = source.strip().split(' ')
```
</li>
<li>
*When reading from a file:*

From:
```python
with open(source, 'r') as myfile:
    data = myfile.read().replace('\n', ' ')

return data.split(' ')
```
To:
```python
with open(source, 'r', encoding='utf-8', errors='ignore') as myfile:
    data = myfile.read()
data = re.sub(r'[^\w]', ' ', data)
data = re.sub(r'\s+', ' ', data).strip()

return data.split(' ')
```
</li>
</ul>
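The file-reading change above can be tried in isolation. The following is a minimal sketch, not the stemmer's actual code: `read_tokens` is a hypothetical helper name, and returning an empty list for an empty file is an added assumption (the modified snippet would return `['']`).

```python
import re


def read_tokens(path):
    """Read a UTF-8 text file and return its words as a list.

    Hypothetical helper that mirrors the modified file-reading logic.
    """
    with open(path, 'r', encoding='utf-8', errors='ignore') as myfile:
        data = myfile.read()
    # Replace every non-word character (punctuation, newlines) with a space
    data = re.sub(r'[^\w]', ' ', data)
    # Collapse runs of whitespace and trim both ends
    data = re.sub(r'\s+', ' ', data).strip()
    # Assumption: an empty file should give [] rather than ['']
    return data.split(' ') if data else []
```

Stripping before splitting matters because `'a b '.split(' ')` would otherwise produce a trailing empty token that the stemmer would have to handle.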

______________________
### Step 2: Read the JSON data 
Load the posts produced by the fb-scraper.

In [2]:
with open('data.json') as json_data:
    data = json.load(json_data)

<br />
Sample data:

In [3]:
print(data[0])

{'name': 'Poor Old Man Accidentally Scratches Luxurious Car, Leaves Note That Touches Everyone’s Heart', 'created_time': '2017-12-24T20:29:19+0000', 'like': 52, 'love': 2, 'wow': 0, 'haha': 0, 'sad': 2, 'angry': 0, 'thankful': 0, 'total_reacts': 56, 'comments': 1, 'shares': 4, 'id': '979290192127575_1645818285474759', 'page_id': 'ClassifiedTrends'}
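
The sample record above has no `message` field, so it can help to check how many records carry one before cleaning. This is a small sketch on made-up records shaped like the scraper output; `split_by_message` and the `sample` list are hypothetical and not part of the original notebook.

```python
# Hypothetical records shaped like the scraper output; only some carry 'message'
sample = [
    {'name': 'A shared link', 'like': 52, 'id': '1'},
    {'message': 'Panoorin nyo po', 'like': 3, 'id': '2'},
]


def split_by_message(records):
    """Return (records with a non-empty 'message', records without one)."""
    with_message = [r for r in records if r.get('message')]
    without_message = [r for r in records if not r.get('message')]
    return with_message, without_message


with_message, without_message = split_by_message(sample)
print(len(with_message), len(without_message))
```

On the real data, the same call would be `split_by_message(data)`.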


______________
### Step 3: Clean the string
For each message in the data, use regular expressions to remove URLs, special characters, numbers, and extra whitespace.

In [4]:
messages = []
for d in data:
    try:
        #Remove URLs
        message = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', ' ', d['message'])
        #Replace special characters, digits, and underscores with a space
        message = re.sub(r'([^\w]|[0-9_])+', ' ', message)
        #Collapse extra whitespace
        message = re.sub(r'\s+', ' ', message).strip()
        messages.append(message)
    except (KeyError, TypeError):
        #Skip if there is no message
        continue

Sample messages:

In [5]:
print(messages[0])

Sobrang nakakaawa itong sinapit ng batang ito na kung saan ay naipit ang kanyang kaliwang kamay sa escalator sa Gaisano Mall Panoorin nyo po ang buong pangyayari


In [6]:
print(messages[-1])

Top stories you might have missed Plan to get ship shape and competitive Philippine exporters are not only challenged by competition from traders in other Asean countries they are also challenged by the tired approach to shipping goods in their own All that may change though with the floating of the Philippine Export Development Plan PEDP a four pronged strategy devised by the Philippine Export Marketing Bureau EMB to put some wind in the sails of the country s flagging export sector
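
To see what each cleaning pass does, the three substitutions from Step 3 can be wrapped in a function and run on a short string. This is a sketch: `clean_message`, `URL_PATTERN`, and the input strings are made up for illustration, while the regular expressions are the ones used above.

```python
import re

# Same URL pattern as in Step 3, compiled once
URL_PATTERN = re.compile(
    r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
)


def clean_message(text):
    """Remove URLs, special characters, digits, and extra whitespace."""
    text = URL_PATTERN.sub(' ', text)
    # Non-word characters, digits, and underscores become a single space
    text = re.sub(r'([^\w]|[0-9_])+', ' ', text)
    # Collapse whitespace and trim the ends
    return re.sub(r'\s+', ' ', text).strip()


print(clean_message('Sale!!! 50% off at https://example.com/promo?id=1 ngayong 2018'))
# -> Sale off at ngayong
print(clean_message("the country's export sector"))
# -> the country s export sector
```

The second call shows why the last sample message contains `country s`: the apostrophe is treated as a special character, so possessives are split into two tokens.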


_______________________
### Step 4: Consider Tagalog stop words
Because the TagalogStemmer only returns the roots of Tagalog words and does not filter stop words, we add the Tagalog stop words to spaCy's dictionary.

*Tagalog stop words are from* https://github.com/stopwords-iso/stopwords-tl/blob/master/stopwords-tl.json

In [7]:
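#Load English Dictionary
#Note: the 'en' shortcut works in spaCy v2; spaCy v3 needs a full model name such as 'en_core_web_sm'
nlp = spacy.load('en')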

#Add new words to dictionary as stop word
tl_stop = ['lang', 'eto', 'kanila', 'de', 'nang', 'ito', 'di','silang', 'kanilang', 'at', 'rin', 'yan', 'pa', 'sa', 'mga', 'ay', 'din', 'na', 'ng', 'nag', 'mag', 'pag', 'ang', 'nya', 'nyo', 'sya', 'niyo', 'siya', 'kung', 'po', 'ito', 'iyan', 'akin','aking','ako','alin','am','amin','aming','ano','anumang','apat','at','atin','ating','ay','bababa','bago','bakit','bawat','bilang','dahil','dalawa','dapat','din','dito','doon','gagawin','gayunman','ginagawa','ginawa','ginawang','gumawa','gusto','habang','hanggang','hindi','huwag','iba','ibaba','ibabaw','ibig','ikaw','ilagay','ilalim','ilan','inyong','isa','isang','itaas','ito','iyo','iyon','iyong','ka','kahit','kailangan','kailanman','kami','kanila','kanilang','kanino','kanya','kanyang','kapag','kapwa','karamihan','katiyakan','katulad','kaya','kaysa','ko','kong','kulang','kumuha','kung','laban','lahat','lamang','likod','lima','maaari','maaaring','maging','mahusay','makita','marami','marapat','masyado','may','mayroon','mga','minsan','mismo','mula','muli','na','nabanggit','naging','nagkaroon','nais','nakita','namin','napaka','narito','nasaan','ng','ngayon','ni','nila','nilang','nito','niya','niyang','noon','o','pa','paano','pababa','paggawa','pagitan','pagkakaroon','pagkatapos','palabas','pamamagitan','panahon','pangalawa','para','paraan','pareho','pataas','pero','pumunta','pumupunta','sa','saan','sabi','sabihin','sarili','sila','sino','siya','tatlo','tayo','tulad','tungkol','una','walang']

for word in tl_stop:
    lexeme = nlp.vocab[word]
    lexeme.is_stop = True
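
As an alternative to hardcoding the list, the stop words could be read from a local copy of the JSON file and merged with any custom additions. This is only a sketch: `load_stop_words` is a hypothetical helper, and it assumes the file is a flat JSON array of strings.

```python
import json


def load_stop_words(path, extra=()):
    """Load a JSON array of stop words and merge it with extra words.

    Hypothetical helper; assumes the file holds a flat list of strings.
    """
    with open(path, encoding='utf-8') as f:
        words = json.load(f)
    # A set removes duplicates such as the repeated 'ito' in the list above
    return sorted(set(words) | set(extra))
```

A call such as `load_stop_words('stopwords-tl.json', extra=['lang', 'eto', 'nyo'])` would then give the list to loop over when setting `is_stop`.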

___________

### Step 5: Generate bag of words

Split the words of each message into three separate lists, because each list is cleaned differently. This accounts for messages written in Tagalog, Taglish, or English. *Other Filipino languages (e.g., Bisaya) are included in the Tagalog list.*

Since we will only be using bag of words, order does not matter.
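
A quick way to see that order does not matter is to compare word counts. This sketch uses made-up tokens and the standard library `Counter`.

```python
from collections import Counter

# Two orderings of the same made-up tokens
tokens_a = ['mall', 'kamay', 'escalator', 'kamay']
tokens_b = ['kamay', 'escalator', 'kamay', 'mall']

# A bag of words keeps only how often each word occurs
same_bag = Counter(tokens_a) == Counter(tokens_b)
print(same_bag)  # True
```

This is why the three lists can simply be concatenated in any order later on.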

<br />

Lists:
<ul>
    <li>```english``` is parsed using the standard lemma_ of spaCy's nlp.</li>
    <li>```proper``` is also parsed using the standard lemma_ of spaCy's nlp.
        <ul>
            <li>It contains the proper nouns: words that are not part of the English dictionary and whose first letter is uppercase (checked using .isupper).</li>
            <li>Keeping them separate prevents the proper nouns from being parsed with the Tagalog stemmer.</li>
        </ul>
    </li>
    <li>```tagalog``` is parsed using the TagalogStemmer.
        <ul>
            <li>```tagalog_str``` serves as the input to the TagalogStemmer.</li>
            <li>The TagalogStemmer then outputs a list of the root words.</li>
        </ul>
    </li>
</ul>

In [8]:
#Instantiate bag of words
bows = []

for message in messages:
    english = []
    proper = []
    tagalog_str = ''

    #Tokenize the message and skip short words and stop words
    for word in nlp(message):
        if len(word.text) > 2 and not word.is_stop:
            if word.text in nlp.vocab:
                #Word found in the English dictionary
                english.append(word.lemma_)
            elif word.text[0].isupper():
                #Capitalized word outside the dictionary: treat as a proper noun
                proper.append(word.lemma_)
            else:
                #Everything else is passed to the Tagalog stemmer
                tagalog_str += word.text + ' '

    try:
        tagalog = TglStemmer.stemmer('2', tagalog_str, '2')[1]
    except Exception:
        #Skip the whole message if the stemmer fails (e.g. tagalog_str is empty)
        continue

    #Append the list of all the clean words to the bag of words
    bows.append(english + proper + tagalog)

Sample parsed entries in the bag of words:

In [9]:
print(bows[0])

['mall', 'sobrang', 'gaisano', 'panoorin', 'nakakaawa', 'ito', 'sapit', 'batang', 'ipit', 'liwa', 'kamay', 'escalatod', 'buo', 'yari']


In [10]:
print(bows[-1])

['top', 'story', 'miss', 'the', 'philippines', 'web', 'mass', 'right', 'filipinos', 'access', 'internet', 'albeit', 'exactly', 'trouble', 'free', 'give', 'dismal', 'download', 'speed', 'scant', 'provision', 'nevertheless', 'represent', 'country', 'people', 'glance', 'look', 'impressive', 'lot', 'netizen', 'place', 'but', 'figure', 'big', 'proportion', 'country', 'internet', 'absent', 'whopping', 'population', 'fact', 'that', 'mean', 'non', 'filipinos', 'large', 'and', 'lot', 'place', 'let', 'perspective', 'big', 'combine', 'population', 'south', 'africa', 'city', 'chicago', 'wifi', 'netizen']
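
The three-way split from Step 5 can be illustrated without spaCy or the stemmer. The sketch below is a simplification, not the notebook's code: `route_tokens`, the tiny `english_vocab` set, and the `stop_words` set are hypothetical stand-ins, lookups are lowercased for convenience, and no lemmatizing or stemming is done.

```python
def route_tokens(tokens, english_vocab, stop_words):
    """Sort tokens into (english, proper, tagalog) lists."""
    english, proper, tagalog = [], [], []
    for token in tokens:
        # Skip short words and stop words, as in Step 5
        if len(token) <= 2 or token.lower() in stop_words:
            continue
        if token.lower() in english_vocab:
            english.append(token.lower())
        elif token[0].isupper():
            # Capitalized and unknown: treated as a proper noun
            proper.append(token.lower())
        else:
            # Everything else would go to the Tagalog stemmer
            tagalog.append(token)
    return english, proper, tagalog


english, proper, tagalog = route_tokens(
    'naipit ang kamay sa escalator sa Gaisano Mall'.split(),
    english_vocab={'escalator', 'mall'},
    stop_words={'ang', 'sa'},
)
print(english, proper, tagalog)
# -> ['escalator', 'mall'] ['gaisano'] ['naipit', 'kamay']
```

In the notebook itself, the `tagalog` list would still be joined into `tagalog_str` and stemmed before the three lists are concatenated.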


_________________________________________
### Step 6: Finally, save the bag of words to a file

To a JSON file:

In [11]:
with open('bows.json', 'w') as f:
    json.dump(bows, f)

To text file:

In [14]:
#Join the words of each message, then join all messages into one string
bows_text = ' '.join(' '.join(bow) for bow in bows)

with open('bows.txt', 'w') as f:
    f.write(bows_text)
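
The two output formats differ in what they preserve. The sketch below uses a made-up two-message `bows` list and a temporary directory (both assumptions for illustration) to show that the JSON file round-trips the per-message lists, while the text form flattens everything into one string.

```python
import json
import os
import tempfile

# Made-up bag of words for two messages
bows = [['mall', 'gaisano', 'kamay'], ['top', 'story', 'miss']]

with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = os.path.join(tmp_dir, 'bows.json')
    with open(json_path, 'w', encoding='utf-8') as f:
        json.dump(bows, f)
    # Reading the file back gives the same nested lists
    with open(json_path, encoding='utf-8') as f:
        restored = json.load(f)

# The text form loses the boundaries between messages
flat_text = ' '.join(' '.join(bow) for bow in bows)

print(restored == bows)  # True
print(flat_text)         # mall gaisano kamay top story miss
```

If later steps need per-message counts, the JSON file is the one to load; the text file suits tools that expect a single corpus string.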