## Raw data pre-processing 
Data was collected from 11 MOOCs in the fields of computer science, robotics, mathematics, and physics; a total of 563 TXT files were processed.

#### Preprocessing steps: 

- sentence by sentence split
- lower case
- noise removal 
    - SOME punctuation STAYS 
        - point, single space and comma
    - SOME stopwords stay 
        - between,we,i,in,here,that,you,it,this,there,few,if,so,to,a,an,is,until,while
    - mention removal
- word normalization
    - tokenization 
    - lemmatization 
    - stemming 
- word standardization
    - regex
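
The lower-casing and noise-removal steps can be sketched with the standard library alone. This is a minimal illustration, not the notebook's actual implementation: `clean_sentence`, `ALLOWED_CHARS`, and `MENTION_RE` are hypothetical names, and the mention format (`@` followed by word characters) is an assumption.

```python
import re
import string

# Characters that survive noise removal: lowercase letters plus point, comma, space.
ALLOWED_CHARS = set(string.ascii_lowercase) | {".", ",", " "}

# Assumed mention format: "@" followed by word characters, e.g. "@user".
MENTION_RE = re.compile(r"@\w+")


def clean_sentence(sentence):
    """Lower-case a sentence, drop mentions, and keep only allowed characters."""
    text = sentence.lower()
    text = MENTION_RE.sub(" ", text)
    text = "".join(ch for ch in text if ch in ALLOWED_CHARS)
    # Collapse the runs of spaces left behind by removed characters.
    return re.sub(r" +", " ", text).strip()


print(clean_sentence("Thanks @anna, the LIMIT exists!"))
```

Note that digits are dropped too, because only letters, point, comma, and space are on the allowed list.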

#### Overall logic:
1. Traverse all folders and files recursively.
2. When a file is found, save its name to a file list.
3. For each file in the file list, apply all the **preprocessing steps** and save the result as a new file with "\_PREPROCESSED" appended to the name, in the same folder.
    3.1. Changed: I now keep a separate file after each preprocessing step, as I may need certain files for some algorithms.

#### Next steps (In the Data Analysis file) :  
1. Apply TF-IDF
2. Try Wikipedia linking
3. Try linking with WordNet
4. Try Bag of Words
5. Try other algorithms? 
6. Define a clear dictionary with words for each category
7. Other Classification algorithms?
8. Try finding n-grams
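
For the TF-IDF step, a minimal pure-Python sketch may help fix the idea before picking a library. It uses the plain `tf * log(N / df)` variant, which is only one of several common weightings; the function names and the toy corpus are assumptions made for illustration.

```python
import math
from collections import Counter


def term_frequency(term, doc_tokens):
    """Share of the document's tokens that are equal to term."""
    if not doc_tokens:
        return 0.0
    return Counter(doc_tokens)[term] / len(doc_tokens)


def inverse_document_frequency(term, corpus):
    """log(N / df), or 0.0 when the term appears in no document."""
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0
    return math.log(len(corpus) / containing)


def tfidf(term, doc_tokens, corpus):
    """TF-IDF score of term for one tokenized document of the corpus."""
    return term_frequency(term, doc_tokens) * inverse_document_frequency(term, corpus)


# Toy corpus: each document is a list of preprocessed tokens.
corpus = [["derivative", "limit"], ["limit", "function"], ["limit", "slope"]]
print(tfidf("derivative", corpus[0], corpus))
print(tfidf("limit", corpus[0], corpus))
```

A term that occurs in every document ("limit" here) scores zero, which is the behaviour that makes TF-IDF useful for spotting document-specific vocabulary.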

In [29]:
# Import all necessary modules for EVERYTHING here
import os
import sys
import os.path
import string
import time

# ---- for TF-IDF & NLTK
import math
from textblob import TextBlob as tb
import nltk
from nltk.corpus import wordnet as wn
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
#nltk.download('stopwords')
from nltk.stem import LancasterStemmer
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
#nltk.download('punkt')
#nltk.download('wordnet')
# nltk.download('averaged_perceptron_tagger')
from pathlib import Path
from beautifultable import BeautifulTable

import re, unicodedata
import contractions
import inflect
from bs4 import BeautifulSoup
from tabulate import tabulate

In [2]:
def is_noun(tag):
    return tag in ['NN', 'NNS', 'NNP', 'NNPS', 'FW']


def is_verb(tag):
    return tag in ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'MD']


def is_adverb(tag):
    return tag in ['RB', 'RBR', 'RBS', 'WRB']


def is_adjective(tag):
    return tag in ['JJ', 'JJR', 'JJS']

def penn_to_wn(tag):
    if is_adjective(tag):
        # returns a
        return wn.ADJ
    elif is_noun(tag):
        # returns n
        return wn.NOUN
    elif is_adverb(tag):
        # returns r
        return wn.ADV
    elif is_verb(tag):
        # returns v
        return wn.VERB
    return None
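
The four predicates and `penn_to_wn` can also be expressed as one lookup table. This is a minimal sketch, assuming the WordNet constants are the single letters noted in the comments (`a`, `n`, `r`, `v`); `PENN_TO_WN`, `TAG_LOOKUP`, and `penn_to_wn_letter` are hypothetical names, and plain strings are used so the snippet runs without NLTK data.

```python
# WordNet POS letter -> Penn Treebank tags, same groups as the helpers above.
PENN_TO_WN = {
    "a": ("JJ", "JJR", "JJS"),
    "n": ("NN", "NNS", "NNP", "NNPS", "FW"),
    "r": ("RB", "RBR", "RBS", "WRB"),
    "v": ("VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "MD"),
}

# Invert once into a flat tag -> letter dictionary for constant-time lookups.
TAG_LOOKUP = {tag: letter for letter, tags in PENN_TO_WN.items() for tag in tags}


def penn_to_wn_letter(tag):
    """Return 'a', 'n', 'r' or 'v' for a Penn tag, or None for any other tag."""
    return TAG_LOOKUP.get(tag)


print(penn_to_wn_letter("NNS"), penn_to_wn_letter("DT"))
```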

In [None]:
# ------------------------------------------------ METHODS  --------------------------------------------------------

# Create a set of all allowed characters.
# {...} is the syntax for a set literal in Python.
allowedPunct = {",", ".", " "}.union(string.ascii_lowercase)

# Use the full list for BoF and others.
# Use the partial list to keep some important words if we need to match tuples from the text based on natural speech
path_swAni_full = "/media/sf_Shared_Folder/stopwordsALL.txt"
path_swAni_part = "/media/sf_Shared_Folder/stopwords.txt"

# the WITH keyword closes the stop word file automatically after reading
with open(path_swAni_full, 'r', encoding = "ISO-8859-1") as swFile:
    stopWords = swFile.read().split()
#print(stopWords)

#stopWords = stopwords.words('english')
#print(stopWords)

# function #1    
def splitSentences(iFile, oPathNoExt):
    if iFile.name.endswith("_Raw.txt"):
        print("[Splitting sentences on file:] " + iFile.name)
        baseName = oPathNoExt.split(".en", 1)[0]
        OFName = baseName + ".en_sent.txt"
        
        # the WITH keyword makes it possible to omit the file.close() function at the end to close the file
        with open(OFName,"w") as oFile:
            print(oFile.name)
            text = iFile.read()    
            # initial text is full of new lines, so we have to remove them first. 
            text = text.replace("\n", " ")    
            # have the sentences split and print them one by one
            sentences = sent_tokenize(text)
            # write to the output file
            for sent in sentences:
                oFile.write(sent+"\n")
    
"""
# -------------- discarded -----------------
# function #2 (optional)
# only leaves comma, space, dot
# read input file, save to output file
def removePunctuation(iFile, oPathNoExt):
    #text = re.sub("[\[\].;@#$%^&*:()][^,]", " ", file)
    if filePath.endswith("_sent.txt"):
        print("[Punctuation removal on file: ] " + iFile.name)
        with open(oPathNoExt + "_noPunct.txt","w") as oFile:
            oFile.write("".join([letter for letter in iFile if letter in allowed]))
        oFile.close()
    else:
        pass
"""
      
# function #3
def sentTokenize(iFile, oPathNoExt):
    if iFile.name.endswith("_sent.txt"):
        print("[Tokenizing file: ] " + iFile.name)
        baseName = oPathNoExt.split(".en", 1)[0]
        # print("BASE NAME: ", baseName)
        OFName = baseName + ".en_tokens.txt"

        # punctuation tokens and single lowercase letters are dropped from the output;
        # the set is built once here instead of on every loop iteration
        UNallowedPunct = {",", ".", "[", "]", "=", "*", "/", "+",
                          "-", "%", "#", "?", "(", ")", "_",
                          ";", ":", "'", "\"", "^", "`"}.union(string.ascii_lowercase)

        # the WITH block closes the output file, so no oFile.close() is needed
        with open(OFName, "w") as oFile:
            tokens = word_tokenize(iFile.read())
            for tok in tokens:
                if tok not in UNallowedPunct:
                    oFile.write(tok+"\n")
    
    
# function #4 - stop word removal
def rmStopWords(iFile, oPathNoExt):
    
    if not iFile.name.endswith("_tokens.txt"):
        pass
    else:
        print("[Removing StopWords: ] " + iFile.name)
        baseName = oPathNoExt.split(".en", 1)[0]
        # print("BASE NAME: ", baseName)
        OFName = baseName + ".en_noStopWordsALL.txt"
        
        tokenList = iFile.read().split()
        
        with open(OFName, "w") as oFile:     
            final = []
            
            for tok in tokenList:
                if tok in stopWords:
                    # if token is a stop word, don't save in the output (skip)
                    pass
                else:
                    final.append(tok)
                    oFile.write(tok+"\n")
            oFile.close()
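
The difference between the full and the partial stop word list can be illustrated without any files. This is a sketch under assumptions: `remove_stop_words` is a hypothetical helper, `KEEP` repeats the words listed in the preprocessing steps, and `sample_stop_words` is a made-up stand-in for the real stop word files.

```python
# Stop words that the preprocessing notes say should stay in the text.
KEEP = {"between", "we", "i", "in", "here", "that", "you", "it", "this", "there",
        "few", "if", "so", "to", "a", "an", "is", "until", "while"}


def remove_stop_words(tokens, stop_words, keep=frozenset()):
    """Drop tokens found in stop_words unless they are also listed in keep."""
    # A set gives constant-time membership tests, unlike a list.
    blocked = set(stop_words) - set(keep)
    return [tok for tok in tokens if tok not in blocked]


# Made-up stop word list, standing in for the contents of the real files.
sample_stop_words = ["the", "of", "is", "a", "we"]
tokens = ["we", "take", "the", "derivative", "of", "a", "function"]

print(remove_stop_words(tokens, sample_stop_words))        # full removal
print(remove_stop_words(tokens, sample_stop_words, KEEP))  # partial removal
```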
            

In [83]:
# function #5 - Part-of-Speech tagging
def POStag(iFile, oPathNoExt):
    if iFile.name.endswith("_noStopWordsALL.txt"):
        print("[Part of Speech tagging: ] " + iFile.name)
        baseName = oPathNoExt.split(".en", 1)[0]
        # print("BASE NAME: ", baseName)
        OFName = baseName + ".en_tokPOStag.txt"
        
        tokenList = iFile.read().split()
        
        # punctuation tokens and single lowercase letters are not written to the output;
        # the set is built once here instead of on every loop iteration
        UNallowedPunct = {",", ".", "[", "]"}.union(string.ascii_lowercase)

        # the WITH block closes the output file, so no oFile.close() is needed
        with open(OFName, "w") as oFile:
            taggedTok = pos_tag(tokenList)

            for tok in taggedTok:
                # tok is a tuple such as ('now', 'RB')
                if tok[0] not in UNallowedPunct:
                    # output line format: now,RB
                    oFile.write(tok[0]+","+tok[1]+"\n")
    

    
"""
Outputs data as a tuple (Word,WordNetPOSTag,Stem)

POS TAG: read line by line, split each line by (","), write line[0] to the file with comma, following it
STEM: take line[1] and compare POS tag, return the appropriate POS tag as per Wordnet, i.e. 
r for adverb
a for adjective
n for noun
v for verb
"""
def stem(iFile, oPathNoExt):
    stemmer = LancasterStemmer()
    
    #if not iFile.name.endswith("_tokPOStag.txt"):
    if not iFile.name.endswith("NounVerbAdjectiveAdverb.txt"):
        pass
    else:
        print("[Stemming: ] " + iFile.name)
        baseName = oPathNoExt.split(".en", 1)[0]
        OFName = baseName + ".en_stemmedbyPOS.txt"
        
        content = iFile.read().split()
        
        with open(OFName, "w") as oFile:   
           
            for line in content:
                curline = line.split(",")    # list with 2 elements: word, Penn POS tag
                word = curline[0]
                posTag = curline[1]

                wordStem = stemmer.stem(word)
                wnTag = penn_to_wn(posTag)   # a, n, r, v or None

                # only adjectives, nouns, adverbs and verbs are written to the output
                if wnTag is not None:
                    oFile.write(word+","+wnTag+","+wordStem+"\n")
            
# Requires the output from stem(), and more specifically the POS tag, in order to know what
# word to make out of the stem. If no POS tag is specified, the lemmatizer assumes NOUN
def lemmatize(iFile, oPathNoExt):
    lemmatizer = WordNetLemmatizer()
    
    if not iFile.name.endswith("en_stemmedbyPOS.txt"):
    #if not iFile.name.endswith("NounVerbAdjectiveAdverb.en_stemmedbyPOS.txt"):
        pass
    else:
        print("[Lemmatizing: ] " + iFile.name)
        baseName = oPathNoExt.split(".en", 1)[0]
        # print("BASE NAME: ", baseName)
        OFName = baseName + ".en_lemmatized.txt"
        
        content = iFile.read().split()
        
        #print("Word | Stem | Lemma | LemmaPOS")
        #print("------------------------------")
        with open(OFName, "w") as oFile:    
            for line in content:
                curline = line.split(",")    # list with 3 elements: word, WordNet POS tag, stem
                word = curline[0]
                posTag = curline[1]
                stem = curline[2]

                # reuse the lemmatizer created above instead of building a new one per line;
                # every stem is currently lemmatized as a verb ('v'), posTag is not used yet
                lemmaPOS = lemmatizer.lemmatize(stem, 'v')
                #print(word+" | "+stem+" | "+lemmaPOS)
                oFile.write(lemmaPOS+"\n")
            
            oFile.close()
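
`stem()` reads the `word,tag` lines that `POStag()` writes and splits them on every comma, which breaks for tokens that contain a comma themselves (for example `1,000`). A more defensive parser might look like the sketch below; `parse_tagged_line` is a hypothetical helper, not part of the notebook.

```python
def parse_tagged_line(line):
    """Return (word, tag) from a 'word,tag' line, or None if it is malformed."""
    # rpartition splits at the LAST comma, so commas inside the word survive.
    word, sep, tag = line.strip().rpartition(",")
    if not sep or not word or not tag:
        return None
    return word, tag


for sample in ["now,RB", "1,000,CD", "broken"]:
    print(sample, "->", parse_tagged_line(sample))
```

Returning `None` for malformed lines lets the caller skip them instead of failing with an `IndexError`.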

In [84]:
# ----------------------------------------------- PROGRAM  -------------------------------------------------------

path = "/media/sf_Shared_Folder/TEST/one file"   # TEST DATA PATH
#path = "/media/sf_Shared_Folder/Coursera Downloads PreProcessed"   # REAL DATA PATH

counter = 0

for root, subdirs, files in os.walk(path):

    for curFile in os.listdir(root):

        filePath = os.path.join(root, curFile)

        if os.path.isdir(filePath):
            pass

        else:
            # check for file extension and if not TXT, continue and disregard the current file
            if not filePath.endswith(".txt"):
                pass
            else: 
                # else open the file and pass it to the processing functions, which write their own output files
                try: 
                    counter += 1
                    #fileName = print(os.path.abspath(filePath))
                    curFile = open(filePath, 'r', encoding = "ISO-8859-1") #IMPORTANT ENCODING! UTF8 DOESN'T WORK
                    #outpFile = open(os.path.abspath)

                    fileExtRemoved = os.path.splitext(os.path.abspath(filePath))[0]
                    #outpFileBase = open(FileExtRemoved + "_PROC.txt","w")

                    """
                    call each processing function here and pass it the file
                    First argument: current input file
                    Second argument: path without extension of the current file
                    the path will be used to save the output file with the same name and same location
                    but with different file ending based on what the fuunction did
                    
                    Functions take input files ending on:
                    splitSentences()	>>	_Raw.txt
                    sentTokenize()		>>	_sent.txt
                    POStag()			>>	_noStopWordsALL.txt
                    rmStopWords()		>>	_tokens.txt
                    stem()				>>	_tokPOStag.txt
                    lemmatize()			>>	_stemmed.txt
                    """  
                    #removePunctuation(curFile, fileExtRemoved)
                    #splitSentences(curFile, fileExtRemoved)
                    #sentTokenize(curFile, fileExtRemoved)
                    #rmStopWords(curFile, fileExtRemoved)
                    #POStag(curFile, fileExtRemoved)
                    #stem(curFile, fileExtRemoved)
                    lemmatize(curFile, fileExtRemoved)
                    
                    if curFile.name.endswith("_Raw.txt"):
                        counter += 1
                    
                finally: 
                    curFile.close()
        
print("\nTotal number of {} {} files found.".format(counter, "TXT"))

[Lemmatizing: ] /media/sf_Shared_Folder/TEST/one file/01_what-is-the-definition-of-derivative.en_stemmedbyPOS.txt
[Lemmatizing: ] /media/sf_Shared_Folder/TEST/one file/04_how-does-wiggling-x-affect-f-x.en_stemmedbyPOS.txt
[Lemmatizing: ] /media/sf_Shared_Folder/TEST/one file/NounVerbAdjectiveAdverb.en_stemmedbyPOS.txt

Total number of 21 TXT files found.


### TODO next:
- get the _tokPOStag.txt files
- read and save every line as key-value pair or list of 2 elements
- compare the second element of each line, i.e the POS, match with the wordnet pos tags
- process stemming correctly
- [lemmatize](https://stackoverflow.com/questions/25534214/nltk-wordnet-lemmatizer-shouldnt-it-lemmatize-all-inflections-of-a-word)
- [find n-grams](https://stackoverflow.com/questions/17531684/n-grams-in-python-four-five-six-grams)
- Perform [Bag of Words](https://pythonprogramminglanguage.com/bag-of-words/)
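
Both n-grams and a bag of words can be prototyped on a token list with the standard library before reaching for NLTK. This is a minimal sketch; `ngrams` and `bag_of_words` are hypothetical helper names, and the token list is made up.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(tokens):
    """Count how often each token occurs, ignoring word order."""
    return Counter(tokens)


tokens = ["limit", "of", "a", "function", "limit"]
print(ngrams(tokens, 2))
print(bag_of_words(tokens).most_common(1))
```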

---------- offtopic ---------

- [Python theory](http://xahlee.info/python/python_basics.html)
- [Text classification](https://gallery.azure.ai/Experiment/Text-Classification-Step-2-of-5-text-preprocessing-2)
- [Preprocessing steps](https://www.kdnuggets.com/2018/03/text-data-preprocessing-walkthrough-python.html)

Jupyter Notebook Shortcuts:
- **A**: Insert cell --**ABOVE**--
- **B**: Insert cell --**BELOW**--
- **M**: Change cell to --**MARKDOWN**--
- **Y**: Change cell to --**CODE**--
    
- **Shift + Tab** shows the docstring (**documentation**) for the object you have just typed in a code cell; keep pressing the shortcut to cycle through a few modes of documentation.
- **Ctrl + Shift + -** splits the current cell in two at the cursor position.
- **Esc + O** Toggle cell output.
- **Esc + F** Find and replace on your code but not the outputs.

[MORE SHORTCUTS](https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/)

### ----------------------------------------------------------------------------------------------------------------------------------------------------