# Data Cleaning and Saving

The goal of this notebook is to clean the corpus and save the cleaned, tokenized version as text files.

These text files are used by other models (e.g. Naive Bayes, Support Vector Machine).

For detailed training set information, please refer to `Data_processing.ipynb` in the RNN section.

## 1. Loading Corpus from Disk

The first step is to locate the corpus directory on the local disk.

In [42]:
# Corpus directory in local disk
dir = 'data/rt-polaritydata'

Find the positive and negative sample files.

In [43]:
import os

# Walk the corpus directory and pick out the negative and positive files by name
for rootpath, dirnames, filenames in os.walk(dir):
    for filename in filenames:
        full_path = os.path.join(rootpath, filename)
        if 'neg' in filename:
            neg_path = full_path
        elif 'pos' in filename:
            pos_path = full_path
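
The loop above leaves `neg_path` or `pos_path` undefined if no matching file is found, and the error only surfaces later when the file is opened. A minimal sketch of the same name-based lookup that fails early instead, assuming a hypothetical helper `find_polarity_files` and made-up file names for the demo:

```python
import tempfile
from pathlib import Path

def find_polarity_files(corpus_dir):
    """Return (pos_path, neg_path) found under corpus_dir, or raise if either is missing."""
    found = {}
    for path in sorted(Path(corpus_dir).rglob('*')):
        if not path.is_file():
            continue
        # Same name-based rule as the notebook cell: 'neg' is checked first, then 'pos'
        if 'neg' in path.name:
            found['neg'] = path
        elif 'pos' in path.name:
            found['pos'] = path
    missing = [label for label in ('neg', 'pos') if label not in found]
    if missing:
        raise FileNotFoundError(f'No file for {missing} under {corpus_dir}')
    return found['pos'], found['neg']

# Demo on a throwaway directory; the file names here are invented for illustration
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'sample.pos').write_text('good\n')
    (Path(tmp) / 'sample.neg').write_text('bad\n')
    pos_file, neg_file = find_polarity_files(tmp)
    print(pos_file.name, neg_file.name)
```

In the notebook this would be called with the corpus directory defined in the first cell.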

Read files.

In [45]:
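def open_file(path):
    # No encoding is given, so the platform default applies;
    # errors='replace' substitutes undecodable bytes instead of raising
    with open(path, mode='r', errors='replace') as f:
        sentence_list = f.readlines()
    return sentence_list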

# Open positive text
pos_text_list = open_file(pos_path)
# Open negative text
neg_text_list = open_file(neg_path)
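
To show what `errors='replace'` does in `open_file`, here is a small sketch on a throwaway file. It assumes UTF-8 decoding for the demo; the notebook does not state the corpus encoding, so whether the real files contain such bytes is not known from this section.

```python
import os
import tempfile

# 0xE9 is 'é' in Latin-1 but is not a valid byte sequence here under UTF-8
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'a caf\xe9 scene\n')
    tmp_path = tmp.name

try:
    # Undecodable bytes become U+FFFD (the replacement character) instead of raising
    with open(tmp_path, mode='r', encoding='utf-8', errors='replace') as f:
        replaced = f.read()
    # Decoding with the matching encoding keeps the original character
    with open(tmp_path, mode='r', encoding='latin-1') as f:
        decoded = f.read()
finally:
    os.remove(tmp_path)

print(repr(replaced))
print(repr(decoded))
```

The replacement character is not alphabetic, so a token containing it would be dropped by an `isalpha()` filter; passing an explicit `encoding` avoids losing such words when the encoding is known.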

Number of sentences in the positive and negative corpora.

In [46]:
print(f'Number of sentences in positive corpus: {len(pos_text_list)}')
print(f'Number of sentences in negative corpus: {len(neg_text_list)}')

Number of sentences in positive corpus: 5331
Number of sentences in negative corpus: 5331


Preview of the text format.

In [47]:
count = 3
pos_preview = pos_text_list[:count]
neg_preview = neg_text_list[:count]
print('\nPositive corpus preview:')
print(*pos_preview)
print('Negative corpus preview:')
print(*neg_preview)


Positive corpus preview:
the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal . 
 the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth . 
 effective but too-tepid biopic

Negative corpus preview:
simplistic , silly and tedious . 
 it's so laddish and juvenile , only teenage boys could possibly find it funny . 
 exploitative and largely devoid of the depth or sophistication that would make watching such a graphic treatment of the crimes bearable . 
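
The leading space on the second and third preview lines comes from `print` itself, not from the corpus: each line already ends with a newline, and `print(*lines)` then inserts its default separator, a single space, between items. A short sketch with made-up lines:

```python
import io

# Made-up lines that end in newlines, like the output of readlines()
lines = ['first line\n', 'second line\n']

# Default separator: a space is inserted after each embedded newline
default_out = io.StringIO()
print(*lines, file=default_out)

# Empty separator and end: the lines are printed exactly as stored
clean_out = io.StringIO()
print(*lines, sep='', end='', file=clean_out)

print(repr(default_out.getvalue()))
print(repr(clean_out.getvalue()))
```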



## 2. Cleaning Corpus

Because the original corpus contains a lot of punctuation, uninformative words, and inflectional variation (e.g. tense), we clean the data sets by:
* Removing punctuation;
* Removing stop words;
* Stemming (the code below uses the Porter stemmer, which truncates words to a stem rather than mapping them to dictionary lemmas).

These steps can be done with the `nltk` package.
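
The three steps can be sketched without any dependency, as a chain of generator expressions. This is a toy illustration only: the stop list and the suffix-stripping rule are hypothetical stand-ins, not what NLTK does.

```python
import re

# Hypothetical toy stop list; the notebook uses NLTK's English stop words instead
TOY_STOP_WORDS = frozenset({'i', 'he', 'this', 'that', 'was', 'the', 'a'})
# Naive suffixes to strip; a crude stand-in for a real stemmer
TOY_SUFFIXES = ('ing', 'ed', 's')

def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in TOY_SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token

def toy_preprocess(sentence):
    # Step 1: keep only alphabetic runs, which discards punctuation and digits
    tokens = (token for token in re.findall(r'[a-z]+', sentence.lower()))
    # Step 2: drop stop words
    tokens = (token for token in tokens if token not in TOY_STOP_WORDS)
    # Step 3: reduce words to a crude stem
    tokens = (toy_stem(token) for token in tokens)
    return list(tokens)

first = toy_preprocess('I thought this movie was great!')
second = toy_preprocess('He thinks that this movie was miserable!')
print(first)
print(second)
```

Each generator is lazy, so the sentence is traversed once when `list(tokens)` is called. A real stemmer is more aggressive than this toy rule: the Porter stemmer output later in this notebook shows `movi` and `miser` where the sketch keeps `movie` and `miserable`.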

In [48]:
import nltk
from typing import List

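# Download the NLTK 'punkt' tokenizer models and the stop word lists
# (newer NLTK releases may also need the 'punkt_tab' resource for word_tokenize)
nltk.download('punkt', quiet=True, raise_on_error=True)
nltk.download('stopwords', quiet=True, raise_on_error=True)
# A set gives constant-time membership tests when filtering tokens
NLTK_STOP_WORDS = set(nltk.corpus.stopwords.words('english'))
nltk_porter_stemmer = nltk.stem.PorterStemmer()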

def preprocess_sentence(sentence: str) -> List[str]:
    # Tokenize a sentence to words
    tokens = nltk.word_tokenize(sentence)
    # Convert all tokens to lower case
    tokens = (token.lower() for token in tokens)
    # Keep only purely alphabetic tokens (drops punctuation, numbers, and tokens such as "'s")
    tokens = (token for token in tokens if token.isalpha())
    # Remove stop words
    tokens = (token for token in tokens if token not in NLTK_STOP_WORDS)
    # Stem words with the Porter stemmer (stemming, not true lemmatization)
    tokens = (nltk_porter_stemmer.stem(token) for token in tokens)
    return list(tokens)

def preprocess_sentence_list(sentence_list: List[str]) -> List[List[str]]:
    # Walk through all sentences in sentence_list
    token_list = (preprocess_sentence(sentence) for sentence in sentence_list)
    # Remove empty lists
    return [tokens for tokens in token_list if tokens]

Here is a comparison of the same sentences before and after preprocessing.

In [51]:
# Test sentences
test_text_list = ['I thought this movie was great!', 'He thinks that this movie was miserable!']
print('Example sentences before preprocessing:', *test_text_list, sep='\n')
print()
test_tokens_list = preprocess_sentence_list(test_text_list)
print('Example sentences after preprocessing:', *test_tokens_list, sep='\n')

Example sentences before preprocessing:
I thought this movie was great!
He thinks that this movie was miserable!

Example sentences after preprocessing:
['thought', 'movi', 'great']
['think', 'movi', 'miser']


Number of sentences after preprocessing.

The number shrinks slightly because sentences left with no tokens after cleaning (for example, sentences made up only of stop words or non-alphabetic tokens) are dropped.

In [53]:
pos_token_list = preprocess_sentence_list(pos_text_list)
neg_token_list = preprocess_sentence_list(neg_text_list)
print(f'Number of sentences in positive corpus after preprocessing: {len(pos_token_list)}')
print(f'Number of sentences in negative corpus after preprocessing: {len(neg_token_list)}')

Number of sentences in positive corpus after preprocessing: 5327
Number of sentences in negative corpus after preprocessing: 5328
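
To inspect which raw sentences were dropped, one option is to re-run the preprocessing function and keep the sentences that come back empty. A minimal sketch, assuming a hypothetical helper `find_dropped` and a simplified stand-in preprocessor for the demo:

```python
from typing import Callable, List

def find_dropped(sentence_list: List[str],
                 preprocess: Callable[[str], List[str]]) -> List[str]:
    """Return the raw sentences for which preprocess() yields no tokens."""
    return [sentence for sentence in sentence_list if not preprocess(sentence)]

# Stand-in preprocessor for the demo; the notebook would pass preprocess_sentence
def demo_preprocess(sentence):
    stop_words = {'it', 'is', 'what'}
    return [word for word in sentence.lower().split()
            if word.isalpha() and word not in stop_words]

demo_sentences = ['it is what it is .\n', 'a gripping drama .\n', '. . .\n']
dropped = find_dropped(demo_sentences, demo_preprocess)
print(dropped)
```

In the notebook, the call would be `find_dropped(pos_text_list, preprocess_sentence)`. Going by the counts above, it should return 4 sentences for the positive corpus and 3 for the negative one.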


## 3. Saving the Tokenized Clean Corpus

Save the tokenized clean corpus as text files.

In [55]:
# Token file names
token_filename_pos = 'data/pos_sample_tokenized.txt'
token_filename_neg = 'data/neg_sample_tokenized.txt'
# Combine 2 token lists
all_token_list = pos_token_list + neg_token_list
token_filename_all = 'data/all_sample_tokenized.txt'

# Write each token list as one line of space-separated tokens
def save_token(token_list, filename):
    with open(filename, 'w') as f:
        for tokens in token_list:
            f.write(' '.join(tokens) + '\n')

save_token(token_list=pos_token_list, filename=token_filename_pos)
save_token(token_list=neg_token_list, filename=token_filename_neg)
save_token(token_list=all_token_list, filename=token_filename_all)
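
A downstream model can read these files back by splitting each line on whitespace. The sketch below assumes two hypothetical helpers, `load_tokens` and `build_labeled_dataset`, and an assumed label convention of 1 for positive and 0 for negative; neither is defined elsewhere in this notebook.

```python
import os
import tempfile

def load_tokens(filename):
    """Read a file with one sentence per line and tokens separated by spaces."""
    with open(filename, 'r') as f:
        return [line.split() for line in f if line.strip()]

def build_labeled_dataset(pos_tokens, neg_tokens):
    """Pair token lists with labels: 1 = positive, 0 = negative (assumed convention)."""
    samples = pos_tokens + neg_tokens
    labels = [1] * len(pos_tokens) + [0] * len(neg_tokens)
    return samples, labels

# Round-trip demo on a throwaway file written in the same format as save_token
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo_tokenized.txt')
    with open(demo_path, 'w') as f:
        f.write('thought movi great\n')
        f.write('think movi miser\n')
    loaded = load_tokens(demo_path)

samples, labels = build_labeled_dataset(loaded[:1], loaded[1:])
print(samples)
print(labels)
```

The combined file stores positive sentences first and negative sentences after them without any label column, so labels for it can only be recovered from the per-class sentence counts.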