---
## 1. Utilities

There are several Python bindings available for __fastText__, including:
1. __fastText__ by __Facebook__ (https://github.com/facebookresearch/fastText/tree/master/python)
2. __fasttext__ by __Bayu Aldi Yansyah__ (https://pypi.org/project/fasttext/) 

We will use the __fastText__ Python binding provided by __Facebook__.

#### Necessary Imports

In [5]:
import csv
import fasttext
import pandas as pd
from prettytable import PrettyTable

#### Function to process text
Text processing steps include:
1. Converting to lower-case
2. Removing punctuation
3. Replacing digits with words (_e.g._ '1' becomes 'one')

In [2]:
def processText(rawText):
    """
    input: string, raw text
    output: string, processed text
    """ 
    # convert to lower-case
    text = rawText.lower() 
    
    # replace most punctuation with spaces and delete . ? ; , " -
    # https://www.geeksforgeeks.org/python-maketrans-translate-functions/
    # https://stackoverflow.com/a/31482417/7551231
    punctuation = '!#$%&\\()*+/<=>?@[]^_`{|}~'
    table = str.maketrans(punctuation, ' ' * len(punctuation), '.?;,"-')
    text = text.translate(table)
    # delete single quotes
    table = str.maketrans({"'": None})
    text = text.translate(table)
        
    # replace numbers with text 
    table = text.maketrans({'1':' one ',
                            '2':' two ',
                            '3':' three ',
                            '4':' four ',
                            '5':' five ',
                            '6':' six ',
                            '7':' seven ',
                            '8':' eight ',
                            '9':' nine ',
                            '0':' zero '})
    text = text.translate(table)    
    return text
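
The function relies on two forms of `str.maketrans`. A minimal sketch of both, using a shortened, illustrative character set rather than the full one above:

```python
# Three-argument form: each character of the first string maps to the
# character at the same position in the second string; every character
# of the third string is deleted.
punct_table = str.maketrans('!@', '  ', '.,')

# Dict form: a single character can map to a longer string (or to None).
digit_table = str.maketrans({'1': ' one ', '2': ' two '})

sample = 'Hi, there! Call me @ 12.'
step1 = sample.lower().translate(punct_table)
step2 = step1.translate(digit_table)

print(repr(step1))
print(repr(step2))
```

The result contains runs of consecutive spaces. That should be harmless here, since fastText splits its input on whitespace.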

---
## 2. Preparing data for fastText

#### Training & Validation Data

In [3]:
# read training data into a dataframe
df = pd.read_csv('train.csv', sep = ',')

# process the raw text of every question
processed = [processText(rawText) for rawText in df['question_text']]
    
# modify training data frame with processed text
df['question_text'] = processed

# drop the 'qid' column
df = df.drop(['qid'], axis = 1)

# prefix class labels with '__label__' tag
# eg. 0 becomes __label__0
# eg. 1 becomes __label__1
df['target'] = df['target'].replace(0, '__label__0')
df['target'] = df['target'].replace(1, '__label__1')

# shuffle the dataframe
df = df.sample(frac = 1, random_state = 9999).reset_index(drop = True)

# training and validation sets (80-20 split)
splitPoint = int(0.8 * df.shape[0])
train, valid = df[0:splitPoint][['target', 'question_text']], df[splitPoint:][['target', 'question_text']]

# write to text file in the form "__label__labelName  content"
train.to_csv('fasttext_train.txt', 
             sep = ' ', 
             encoding = 'utf-8', 
             index = False, 
             header = False, 
             quoting = csv.QUOTE_NONE,             
             escapechar = ' ',
             mode = 'w')  # overwrite, so re-running the cell does not duplicate rows

valid.to_csv('fasttext_valid.txt', 
             sep = ' ', 
             encoding = 'utf-8', 
             index = False, 
             header = False,
             quoting = csv.QUOTE_NONE,
             escapechar = ' ',
             mode = 'w')  # overwrite, so re-running the cell does not duplicate rows
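
fastText's supervised mode expects one example per line, with the label prefixed by `__label__` and separated from the text by whitespace. As a possible alternative to the `to_csv` call with `escapechar = ' '`, the same layout can be produced with plain file writes. The sketch below uses made-up rows and a temporary file; `write_fasttext_file` is a hypothetical helper, not part of fastText or pandas.

```python
import os
import tempfile

def write_fasttext_file(rows, path):
    """Write (label, text) pairs as '__label__<label> <text>' lines."""
    with open(path, 'w', encoding='utf-8') as f:
        for label, text in rows:
            # collapse internal whitespace so each example stays on one line
            f.write('__label__{} {}\n'.format(label, ' '.join(text.split())))

# made-up examples, for illustration only
rows = [(0, 'how do i learn python'), (1, 'why are  people so rude')]
path = os.path.join(tempfile.mkdtemp(), 'fasttext_demo.txt')
write_fasttext_file(rows, path)

with open(path, encoding='utf-8') as f:
    lines = f.read().splitlines()
print(lines)
```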

---
## 3. Text Classification using fastText

### A. Unigram Model (Default Parameters)

In [4]:
# Default parameters are listed at https://github.com/facebookresearch/fastText/blob/master/python/fastText/FastText.py
# Parameters: learning rate = 0.1, negative words = 5, n-gram = 1, #epochs = 5
# Note: 'minn' and 'maxn' default to 0
model = fasttext.train_supervised(input = 'fasttext_train.txt')

# saving the model
model.save_model(path = 'unigram.bin')

# prediction on validation set
# http://ir-ratio.blogspot.com/2012/03/precision-at-1-and-reciprocal-rank.html 
validationParams = model.test(path = 'fasttext_valid.txt')
print('Number of examples: ', validationParams[0]) 
print('P@1: ', validationParams[1])
print('R@1: ', validationParams[2])

Number of examples:  261225
P@1:  0.953006029285099
R@1:  0.953006029285099
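
P@1 and R@1 are identical here because every question carries exactly one label and the model returns exactly one prediction per question, so both reduce to plain accuracy. A small sketch of the two ratios, using made-up labels (`precision_recall_at_k` is a hypothetical helper, not a fastText function):

```python
def precision_recall_at_k(gold, predicted):
    """gold, predicted: lists of label lists, one entry per example."""
    # number of predicted labels that also appear in the gold labels
    correct = sum(len(set(g) & set(p)) for g, p in zip(gold, predicted))
    n_predicted = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    return correct / n_predicted, correct / n_gold

# made-up data: one gold label and one prediction per example
gold = [['__label__0'], ['__label__1'], ['__label__0'], ['__label__0']]
pred = [['__label__0'], ['__label__0'], ['__label__0'], ['__label__0']]

p_at_1, r_at_1 = precision_recall_at_k(gold, pred)
print(p_at_1, r_at_1)
```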


### B. Unigram Model (minn = 2, maxn = 6)

In [6]:
# Note: minn = 2, maxn = 6
model = fasttext.train_supervised(input = 'fasttext_train.txt', minn = 2, maxn = 6)

# saving the model
model.save_model(path = 'unigram26.bin')

# prediction on validation set
# http://ir-ratio.blogspot.com/2012/03/precision-at-1-and-reciprocal-rank.html 
validationParams = model.test(path = 'fasttext_valid.txt')
print('Number of examples: ', validationParams[0])
print('P@1: ', validationParams[1])
print('R@1: ', validationParams[2])

Number of examples:  261225
P@1:  0.9526653268255335
R@1:  0.9526653268255335
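
`minn` and `maxn` set the shortest and longest character n-grams that fastText extracts from each word, after wrapping the word in `<` and `>` boundary markers. A rough sketch of that extraction follows; the library itself hashes the n-grams into buckets and handles some edge cases differently, and `char_ngrams` is an illustrative helper, not a library function.

```python
def char_ngrams(word, minn, maxn):
    """List the character n-grams of a word for minn <= n <= maxn."""
    wrapped = '<' + word + '>'
    grams = []
    for n in range(minn, maxn + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

print(char_ngrams('where', 3, 3))
```

With `minn = 2` and `maxn = 6`, each word therefore contributes subword features of length 2 through 6 in addition to the word itself.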


### C. Bigram Model

In [7]:
# Parameters: learning rate = 0.1, negative words = 5, n-gram = 2, #epochs = 5  
model = fasttext.train_supervised(input = 'fasttext_train.txt', wordNgrams = 2)

# saving the model
model.save_model(path = 'bigram.bin')

# prediction on validation set
# http://ir-ratio.blogspot.com/2012/03/precision-at-1-and-reciprocal-rank.html 
validationParams = model.test(path = 'fasttext_valid.txt')
print('Number of examples: ', validationParams[0])
print('P@1: ', validationParams[1])
print('R@1: ', validationParams[2])

Number of examples:  261225
P@1:  0.953905636903053
R@1:  0.953905636903053
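
With `wordNgrams = 2`, fastText adds features for adjacent word pairs on top of the single words, which captures some local word order. A simplified sketch of which n-grams are involved (the library hashes them rather than storing the strings, and `word_ngrams` is a hypothetical helper):

```python
def word_ngrams(text, max_n):
    """List all word n-grams of a sentence for 1 <= n <= max_n."""
    tokens = text.split()
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams

print(word_ngrams('why is the sky blue', 2))
```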


### D. Trigram Model

In [8]:
# Parameters: learning rate = 0.1, negative words = 5, n-gram = 3, #epochs = 5  
model = fasttext.train_supervised(input = 'fasttext_train.txt', wordNgrams = 3)

# saving the model
model.save_model(path = 'trigram.bin')

# prediction on validation set
# http://ir-ratio.blogspot.com/2012/03/precision-at-1-and-reciprocal-rank.html 
validationParams = model.test(path = 'fasttext_valid.txt')
print('Number of examples: ', validationParams[0])
print('P@1: ', validationParams[1])
print('R@1: ', validationParams[2])

Number of examples:  261225
P@1:  0.9546521198200785
R@1:  0.9546521198200785


### E. Determining Dimensions of Hidden Layer

In [11]:
# PrettyTable for displaying results
results = PrettyTable()
results.field_names = ['Dimension', 'P@1', 'R@1']

# perform iteration along varying dimensions
dimensions = [10, 25, 50, 75, 100, 150, 200, 250, 300]
for dim in dimensions:
    # Parameters: learning rate = 0.1, negative words = 5, #epochs = 5  
    model = fasttext.train_supervised(input = 'fasttext_train.txt', dim = dim)    
    # prediction on validation set
    validationParams = model.test(path = 'fasttext_valid.txt')
    # add entries to results
    results.add_row([dim, validationParams[1], validationParams[2]])
    
# print results
print(results)    

+-----------+--------------------+--------------------+
| Dimension |        P@1         |        R@1         |
+-----------+--------------------+--------------------+
|     10    | 0.9530634510479472 | 0.9530634510479472 |
|     25    | 0.9530519666953775 | 0.9530519666953775 |
|     50    | 0.9530940759881328 | 0.9530940759881328 |
|     75    | 0.9530979041056561 | 0.9530979041056561 |
|    100    | 0.9531208728107953 | 0.9531208728107953 |
|    150    | 0.9531668102210737 | 0.9531668102210737 |
|    200    | 0.9531706383385969 | 0.9531706383385969 |
|    250    | 0.9531706383385969 | 0.9531706383385969 |
|    300    |  0.95328165374677  |  0.95328165374677  |
+-----------+--------------------+--------------------+


The highest recall (0.95328) is achieved with a hidden layer of 300 dimensions.
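
To put the result in perspective, the reported values can be compared directly. The snippet below only re-reads the numbers from the table above and does not retrain anything.

```python
# (dimension, R@1) pairs copied from the results table
recall_by_dim = {
    10: 0.9530634510479472,
    25: 0.9530519666953775,
    50: 0.9530940759881328,
    75: 0.9530979041056561,
    100: 0.9531208728107953,
    150: 0.9531668102210737,
    200: 0.9531706383385969,
    250: 0.9531706383385969,
    300: 0.95328165374677,
}

best_dim = max(recall_by_dim, key=recall_by_dim.get)
spread = max(recall_by_dim.values()) - min(recall_by_dim.values())
print(best_dim, round(spread, 6))
```

The gap between the best and worst setting is below 0.0003, so whether the choice of dimension matters for this task would probably need repeated runs to confirm.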

### F. Determining 'n' in n-gram

In [13]:
# PrettyTable for displaying results
results = PrettyTable()
results.field_names = ['n', 'P@1', 'R@1']

# perform iteration along varying 'n' in n-gram
nGram = [1, 2, 3, 4, 5]
for n in nGram:
    # Parameters: learning rate = 0.1, negative words = 5, #epochs = 5  
    model = fasttext.train_supervised(input = 'fasttext_train.txt', wordNgrams = n)    
    # prediction on validation set
    validationParams = model.test(path = 'fasttext_valid.txt')
    # add entries to results
    results.add_row([n, validationParams[1], validationParams[2]])
    
# print results
print(results)

+---+--------------------+--------------------+
| n |        P@1         |        R@1         |
+---+--------------------+--------------------+
| 1 | 0.9531208728107953 | 0.9531208728107953 |
| 2 | 0.9541353239544454 | 0.9541353239544454 |
| 3 | 0.9545717293520911 | 0.9545717293520911 |
| 4 | 0.9545411044119054 | 0.9545411044119054 |
| 5 | 0.9545564168819983 | 0.9545564168819983 |
+---+--------------------+--------------------+


Word n-grams with n = 3 achieve the highest recall (0.95457); n = 4 and n = 5 bring no further gain.

### G. Determining 'minn' and 'maxn'

In [15]:
# PrettyTable for displaying results
precision, recall = PrettyTable(), PrettyTable()
precision.field_names = ['', 'maxn = 2', 'maxn = 3', 'maxn = 4', 'maxn = 5', 'maxn = 6']
recall.field_names = ['', 'maxn = 2', 'maxn = 3', 'maxn = 4', 'maxn = 5', 'maxn = 6']

# perform iteration along 'minn' and 'maxn'
for minn in range(2,7):
    precisionTemp, recallTemp = ['minn = ' + str(minn)], ['minn = ' + str(minn)]
    for maxn in range(2,7):
        if maxn < minn:
            precisionTemp.append('')
            recallTemp.append('')
        else:
            # Parameters: learning rate = 0.1, negative words = 5, #epochs = 5  
            model = fasttext.train_supervised(input = 'fasttext_train.txt', 
                                          minn = minn,
                                          maxn = maxn)    
            # prediction on validation set
            validationParams = model.test(path = 'fasttext_valid.txt')    
            # add entries to results
            precisionTemp.append(validationParams[1])
            recallTemp.append(validationParams[2])
    precision.add_row(precisionTemp)
    recall.add_row(recallTemp)    
    
# print results
print(precision)
print(recall)

+----------+--------------------+--------------------+--------------------+--------------------+--------------------+
|          |      maxn = 2      |      maxn = 3      |      maxn = 4      |      maxn = 5      |      maxn = 6      |
+----------+--------------------+--------------------+--------------------+--------------------+--------------------+
| minn = 2 | 0.9511111111111111 | 0.9510115800555077 | 0.9521714996650397 | 0.9523399368360609 | 0.9527457172935209 |
| minn = 3 |                    | 0.9524088429514787 | 0.9525007177720356 | 0.9528911857594028 | 0.952764857881137  |
| minn = 4 |                    |                    | 0.9531629821035505 | 0.9530672791654704 | 0.9530098574026222 |
| minn = 5 |                    |                    |                    | 0.953354387979711  | 0.9531361852808882 |
| minn = 6 |                    |                    |                    |                    | 0.9532242319839219 |
+----------+--------------------+--------------------+--------------------+--------------------+--------------------+

The highest score (0.95335) is at minn = 5 and maxn = 5.
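
The empty cells in the table are the combinations with `maxn < minn`, which the loop skips. A short sketch enumerating the pairs that are actually trained:

```python
# every (minn, maxn) pair in 2..6 with maxn >= minn
valid_pairs = [(minn, maxn)
               for minn in range(2, 7)
               for maxn in range(minn, 7)]

print(len(valid_pairs))
print(valid_pairs[:5])
```

Each pair corresponds to one full training run, so the size of this list gives a rough idea of the cost of the sweep.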

### H. Checking Combination

In [16]:
# Parameters: learning rate = 0.1, negative words = 5, minn = 5, maxn = 5, n-gram = 3, #epochs = 5, dim = 300  
model = fasttext.train_supervised(input = 'fasttext_train.txt', 
                                  wordNgrams = 3,
                                  minn = 5,
                                  maxn = 5,
                                  dim = 300)

# saving the model
model.save_model(path = 'finalModel.bin')

# prediction on validation set
# http://ir-ratio.blogspot.com/2012/03/precision-at-1-and-reciprocal-rank.html 
validationParams = model.test(path = 'fasttext_valid.txt')
print('Number of examples: ', validationParams[0])
print('P@1: ', validationParams[1])
print('R@1: ', validationParams[2])

Number of examples:  261225
P@1:  0.955582352378218
R@1:  0.955582352378218


---
## 4. Processing Test Data 

In [17]:
# read test data into a dataframe
df = pd.read_csv('test.csv', sep = ',')

# process the raw text of every question
processed = [processText(rawText) for rawText in df['question_text']]

# modify test data frame with processed text
df['question_text'] = processed

---
## 5. Test Data Prediction

In [18]:
# predict
questions_list = df['question_text'].tolist() 
predictions = []
for question in questions_list:
    predictions.append(model.predict(question.strip())[0][0])
    
# add the predictions to the dataframe as a new column
df['prediction'] = predictions

# replace '__label__0' with 0 and '__label__1' with 1
df['prediction'] = df['prediction'].replace('__label__0', 0, regex = True)
df['prediction'] = df['prediction'].replace('__label__1', 1, regex = True)

# write to CSV
df.to_csv('sample_submission.csv', 
          sep = ',',
          columns = ['qid', 'prediction'],
          index = False)    
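
`model.predict` returns labels that still carry the `__label__` prefix, which is why the replacement step above is needed. An equivalent conversion, sketched on made-up predictions (`label_to_int` is a hypothetical helper):

```python
PREFIX = '__label__'

def label_to_int(label):
    """Turn a fastText label such as '__label__1' into the integer 1."""
    if not label.startswith(PREFIX):
        raise ValueError('unexpected label: ' + label)
    return int(label[len(PREFIX):])

# made-up predictions, for illustration only
predicted = ['__label__0', '__label__1', '__label__0']
converted = [label_to_int(p) for p in predicted]
print(converted)
```

Raising on an unexpected label makes a malformed prediction visible immediately instead of letting it pass silently into the submission file.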