#### Solution Thought Process:

#### 1. Read in all books with known authors (Dataset)

#### 2. Identify writing style and commonly used words by authors (NLP Engine)

#### 3. Build rules from the identified styles to classify unknown books (ML Engine)

# 1. Importing Necessary Libraries

In [2]:
#Libraries for Data Manipulation
import pandas as pd
import numpy as np

#Library for Progress Bar
from tqdm import tqdm

#Libraries for Different ML Algorithms
import xgboost as xgb
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

from keras.models import Sequential
from keras.layers.recurrent import LSTM, GRU
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.embeddings import Embedding
from keras.layers.normalization import BatchNormalization
from keras.utils import np_utils
from keras.layers import GlobalMaxPooling1D, Conv1D, MaxPooling1D, Flatten, Bidirectional, SpatialDropout1D
from keras.callbacks import EarlyStopping

#Library for Dimensionality Reduction
from sklearn.decomposition import TruncatedSVD

#Libraries for Miscellaneous Steps 
from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

#Libraries for NLP
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from keras.preprocessing import sequence, text

from nltk import word_tokenize
from nltk.corpus import stopwords

Using TensorFlow backend.


#### Loading English Stopwords Dictionary from the NLTK package.

In [3]:
english_stopwords = stopwords.words('english')

# 2. Reading in Dataset

#### There has been recent buzz around using the basic free-to-use tier of the import.io platform to scrape information from websites. However, I have yet to learn that platform, so for this practice I am using a pre-scraped dataset found on Kaggle.

In [4]:
#Reading in Train Dataset
author_identification_train = pd.read_csv('C:/Users/esnxwng/Desktop/Fun Stuff/Author_Identification/Author_Identification_Train.csv')

#Reading in Test Dataset
author_identification_test = pd.read_csv('C:/Users/esnxwng/Desktop/Fun Stuff/Author_Identification/Author_Identification_Test.csv')

#Reading in Sample Desired Output
author_identification_sample = pd.read_csv('C:/Users/esnxwng/Desktop/Fun Stuff/Author_Identification/Author_Identification_Sample.csv')

## 2a. Having a quick look at the Dataset

#### In the train section of the dataset, there are 3 columns. The ID column carries little or no information for us as ML practitioners. The key idea is to use the chunk of text in the second column to predict the author in the third column. This becomes clearer when you look at the test dataset, where the third column is missing.

In [6]:
author_identification_train.head()

Unnamed: 0,id,text,author
0,id26305,"This process, however, afforded me no means of...",EAP
1,id17569,It never once occurred to me that the fumbling...,HPL
2,id11008,"In his left hand was a gold snuff box, from wh...",EAP
3,id27763,How lovely is spring As we looked from Windsor...,MWS
4,id12958,"Finding nothing else, not even gold, the Super...",HPL


In [7]:
author_identification_test.head()

Unnamed: 0,id,text
0,id02310,"Still, as I urged our leaving Ireland with suc..."
1,id24541,"If a fire wanted fanning, it could readily be ..."
2,id00134,And when they had broken down the frail door t...
3,id27757,While I was thinking how I should possibly man...
4,id04081,I am not sure to what limit his knowledge may ...


#### The desired output contains, for each text, the predicted probability of each of the 3 authors (the 3 classes shown in the training dataset). They are EAP, HPL and MWS.

In [10]:
author_identification_sample.head()

Unnamed: 0,id,EAP,HPL,MWS
0,id02310,0.403494,0.287808,0.308698
1,id24541,0.403494,0.287808,0.308698
2,id00134,0.403494,0.287808,0.308698
3,id27757,0.403494,0.287808,0.308698
4,id04081,0.403494,0.287808,0.308698
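
#### A quick sanity check, as a minimal sketch: each row of the desired output should be a valid probability distribution over the 3 authors. The variable names below are illustrative; the values are copied from the sample rows shown above.

```python
import numpy as np

# Probabilities copied from the sample submission rows shown above
# (columns: EAP, HPL, MWS).
sample_row = np.array([0.403494, 0.287808, 0.308698])

# A valid prediction assigns non-negative probabilities that sum to 1.
row_total = sample_row.sum()
is_valid = bool(np.all(sample_row >= 0) and np.isclose(row_total, 1.0))

print(row_total, is_valid)
```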


#### For Fun Stuff: When a Jupyter cell ends with the name of a variable, Jupyter-Notebook displays that variable without the need for print(). We can modify the 'ast_node_interactivity' option to make Jupyter-Notebook do this for every variable or expression that sits on its own line, not just the last one.

In [11]:
from IPython.core.interactiveshell import InteractiveShell

#Modifying the ast_node_interactivity kernel option
InteractiveShell.ast_node_interactivity = "all"

In [12]:
author_identification_train.head()
author_identification_test.head()
author_identification_sample.head()

Unnamed: 0,id,text,author
0,id26305,"This process, however, afforded me no means of...",EAP
1,id17569,It never once occurred to me that the fumbling...,HPL
2,id11008,"In his left hand was a gold snuff box, from wh...",EAP
3,id27763,How lovely is spring As we looked from Windsor...,MWS
4,id12958,"Finding nothing else, not even gold, the Super...",HPL


Unnamed: 0,id,text
0,id02310,"Still, as I urged our leaving Ireland with suc..."
1,id24541,"If a fire wanted fanning, it could readily be ..."
2,id00134,And when they had broken down the frail door t...
3,id27757,While I was thinking how I should possibly man...
4,id04081,I am not sure to what limit his knowledge may ...


Unnamed: 0,id,EAP,HPL,MWS
0,id02310,0.403494,0.287808,0.308698
1,id24541,0.403494,0.287808,0.308698
2,id00134,0.403494,0.287808,0.308698
3,id27757,0.403494,0.287808,0.308698
4,id04081,0.403494,0.287808,0.308698


# 3. Creating Evaluation Metric

#### Logarithmic loss (related to cross-entropy) measures the performance of a classification model where the prediction output is a probability value between 0 and 1. The goal of our machine learning models is to minimize Log Loss. A perfect model would have a log loss of 0. Log loss increases as the predicted probability diverges from the actual label. So predicting a probability of .012 when the actual observation label is 1 would be bad and result in a high log loss.
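
#### As a sketch of the description above, the multi-class log loss can be written as follows. The symbols are notation introduced here for illustration: N is the number of rows, M the number of classes, y_ij is 1 if row i belongs to class j (0 otherwise), and p_ij is the predicted probability of class j for row i.

```latex
\text{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\,\log(p_{ij})
```

#### For a single row whose true class was given a probability of 0.012, the contribution is -ln(0.012) ≈ 4.42, whereas a probability of 0.99 for the true class contributes only -ln(0.99) ≈ 0.01.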

#### The difference between Log Loss vs Accuracy:

#### 1. Accuracy - Count of predictions where predicted values == actual values. May not always be a good indicator because of its yes/no nature.

#### 2. Log Loss - Takes into account the uncertainty of predictions based on how much they vary from the actual label, which provides a more nuanced view of performance. For example, a predicted probability of 0.5 for class 1 (when the actual label is 1) is penalized less than a predicted probability of 0.3 for class 1 (when the actual label is 1).
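
#### To make the difference concrete, here is a small sketch with two made-up models that reach the same accuracy but different log loss. The data, the 0.5 threshold and the helper names (accuracy, binary_log_loss) are assumptions chosen for illustration only.

```python
import numpy as np

# Hypothetical binary example: the true label is 1 for all four samples.
y_true = np.array([1, 1, 1, 1])

# Predicted probability of class 1 from two made-up models.
confident_model = np.array([0.9, 0.9, 0.9, 0.4])
hesitant_model = np.array([0.6, 0.6, 0.6, 0.4])

def accuracy(y, p, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    return float(np.mean((p >= threshold).astype(int) == y))

def binary_log_loss(y, p, eps=1e-15):
    """Mean negative log-likelihood of the true labels."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

acc_confident = accuracy(y_true, confident_model)
acc_hesitant = accuracy(y_true, hesitant_model)
ll_confident = binary_log_loss(y_true, confident_model)
ll_hesitant = binary_log_loss(y_true, hesitant_model)

print(acc_confident, acc_hesitant)  # identical accuracy
print(ll_confident, ll_hesitant)    # the more confident model scores lower
```

#### Both models get three of the four samples right, so accuracy cannot tell them apart, while log loss rewards the model that put more probability on the correct label.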

<img src="Log_loss_graph.png">

## 3a. Writing our own customized function

In [15]:
def multiclass_logloss(actual, predicted, eps = 1e-15):
    
    """Multi-class Logarithmic-Loss Metric.
    1. actual = array containing actual target classes
    2. predicted = matrix with class predictions, one probability per class
    """
    
    #To visualize the following logic, note that 'actual' is an n x 1 array whose values identify the different authors.
    
    #Converting 'actual' to a binary array if it is not already:
    if len(actual.shape) == 1:
        
        #This creates a numpy array filled with 0's that has the number of rows in the actual dataset while following the number of columns from the predicted dataset
        actual2 = np.zeros((actual.shape[0], predicted.shape[1]))
        
        #The enumerate function gives you an index and the value for each row in the array.
        for index, val in enumerate(actual):
            
            #The new array, actual2, which was previously an n x 3 matrix filled with zeros, now gets a 1 in the column of the true class.
            actual2[index, val] = 1 #For example, if row 300 of 'actual' holds the encoded label for 'EAP' (0), then row 300 of actual2 gets a 1 in the 'EAP' column (column 0).
        
        #Replacing the array with the new binary array.    
        actual = actual2

    #######This is how you calculate the multiclass-logloss function##########

    #The function np.clip() limits the values in an array. Its arguments are np.clip(a, a_min, a_max, out=None).
    clip = np.clip(predicted, eps, 1-eps) #Therefore eps is the minimum (and 1-eps the maximum) probability in the predicted matrix, which avoids taking log(0).
    rows = actual.shape[0]
    vsota = np.sum(actual * np.log(clip))
    return -1.0 / rows * vsota

# We use -1.0 instead of 1.0 because the negative sign provides an easy metric for comparison. The log of a number < 1 is negative, so the negative log of a number < 1 is positive, which is intuitively easier to handle.

<img src = 'Log_vs_neglog.gif'>

## 3b. Using a pre-defined function in a library

#### Alternatively, if you want to avoid the hassle, just use the ready-made log_loss function in sklearn.metrics. It takes the actual labels and the predicted probabilities in the same order as the customized function above.

In [17]:
from sklearn.metrics import log_loss
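
#### As a rough check that the two approaches agree, the sketch below compares sklearn's log_loss with a compact, vectorized re-implementation of the same metric. The function compact_multiclass_logloss and the toy arrays are assumptions introduced here so the example runs on its own; they are not part of the original notebook.

```python
import numpy as np
from sklearn.metrics import log_loss

def compact_multiclass_logloss(actual, predicted, eps=1e-15):
    """Vectorized sketch of the metric; 'actual' holds integer class ids."""
    # Row k of the identity matrix is the one-hot vector for class k.
    one_hot = np.eye(predicted.shape[1])[actual]
    clipped = np.clip(predicted, eps, 1 - eps)
    return -np.sum(one_hot * np.log(clipped)) / actual.shape[0]

# Toy data: 4 samples, 3 classes, each row of probabilities sums to 1.
actual = np.array([0, 1, 2, 0])
predicted = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
    [0.4, 0.4, 0.2],
])

custom_score = compact_multiclass_logloss(actual, predicted)
sklearn_score = log_loss(actual, predicted, labels=[0, 1, 2])

print(custom_score, sklearn_score)
```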

# 4. Pre-processing Dataset

## 4a. Label Encoding

#### The intention of using label encoding is to convert the text labels, in this case the authors' initials, into the integers 0, 1 and 2.

In [18]:
label_encoder = preprocessing.LabelEncoder()

In [21]:
#Take the 'author' column of 'author_identification_train' as a plain array of values (without the surrounding pandas structure) and encode each label as an integer.
actual_y = label_encoder.fit_transform(author_identification_train.author.values)
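
#### To see what the encoder produces, here is a minimal sketch on a handful of made-up labels in the same format as the 'author' column. LabelEncoder assigns codes in sorted label order; the variable names are illustrative.

```python
import numpy as np
from sklearn import preprocessing

# A handful of made-up author labels in the same format as the 'author' column.
authors = np.array(['EAP', 'HPL', 'EAP', 'MWS', 'HPL'])

encoder = preprocessing.LabelEncoder()
encoded = encoder.fit_transform(authors)

# classes_ lists the original labels; a label's position is its integer code.
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}

# inverse_transform turns integer codes back into the author initials.
decoded = encoder.inverse_transform(encoded)

print(mapping)
print(encoded, decoded)
```

#### Keeping the fitted encoder around is useful later, because inverse_transform maps predicted class indices back to author initials.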

## 4b. Train Test Split

In [24]:
#train_test_split returns the splits in the order: X train, X validation, y train, y validation.
X_train, X_valid, y_train, y_valid = train_test_split(author_identification_train.text.values, 
                                                      actual_y, 
                                                      stratify = actual_y, 
                                                      random_state = 777, 
                                                      test_size = 0.3, 
                                                      shuffle = True)
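
#### The stratify argument keeps the class proportions roughly the same in both splits. The sketch below illustrates this on made-up labels; the class counts, the placeholder texts and the helper class_shares are assumptions for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 400, 290 and 310 samples for classes 0, 1 and 2.
labels = np.repeat([0, 1, 2], [400, 290, 310])
texts = np.array(['text %d' % i for i in range(labels.shape[0])])

# Return order: features (train, validation), then labels (train, validation).
X_tr, X_va, y_tr, y_va = train_test_split(
    texts, labels, stratify=labels, random_state=777, test_size=0.3, shuffle=True)

def class_shares(y):
    """Share of each of the 3 classes in a label array."""
    return np.bincount(y, minlength=3) / y.shape[0]

full_shares = class_shares(labels)
train_shares = class_shares(y_tr)
valid_shares = class_shares(y_va)

print(full_shares, train_shares, valid_shares)
```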