## 0. Understanding the Problem

The process of classifying the authors for a given text is called authorship attribution. Each author writes about different topics and has their own style of writing (author fingerprint) which allows for the identification. Applications of this kind of model include plaigarism detection and resolving the disputed authorship. 

In the dataset given there are 2 columns: Author and Text
This makes it a supervised learning problem since there is data and a assigned label  to each text. 

In the problem, the cost of false positive and false negatives both carry significant consequences. Therefore, a good model should have a balance of sensitivity and specificity. F1-score would be the ideal metric for the model evaluation

## 1. Installing and Importing Packages

In [1]:
# %pip install openpyxl --upgrade
# %pip install textstat

In [38]:
import re
import pandas as pd
from pprint import pprint
import numpy as np
import seaborn as sns
import string
from nltk.corpus import stopwords
from nltk import bigrams, trigrams, FreqDist
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import MWETokenizer
import matplotlib.pyplot as plt
import textstat
import warnings
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
warnings.filterwarnings("ignore")
sns.set_style("darkgrid")
palette = "cool"

## DATA UNDERSTANDING

In [3]:
df_train = pd.read_excel("Assignment_Data/Data.xlsx")

df_train.head()

Unnamed: 0,Text,Author
0,Scoring in PROC DISCRIM is as easy as validati...,AM
1,"In the GLM procedure, you may have used LSMEAN...",AM
2,"The first problem, accuracy of the data file, ...",AM
3,If the homogeneity of covariance matrices assu...,AM
4,"With a CONTRAST statement, you specify L, in t...",AM


In [4]:
df_train.shape

(1922, 2)

In [5]:
df_train.info()
# no null values in the dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1922 entries, 0 to 1921
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Text    1922 non-null   object
 1   Author  1922 non-null   object
dtypes: object(2)
memory usage: 30.2+ KB


In [6]:
# check for duplicates in the text data 
duplicate_data = df_train.duplicated(keep="first")

print(duplicate_data.sum(), "duplicate rows are present within the data")


display(df_train[duplicate_data].sort_values(by="Text").head(8))

1106 duplicate rows are present within the data


Unnamed: 0,Text,Author
1367,"%distribution(data=&data,out=&report_name,cont...",DM
1419,"%distribution(data=&data,out=&report_name,cont...",DM
1414,"%generate_grouping(from=work.profile_codes,val...",DM
1243,"%generate_grouping(from=work.profile_codes,val...",DM
1362,*\tTemporal infidelity occurs when model input...,DM
1370,*\tTemporal infidelity occurs when model input...,DM
248,*\texamining group differences on predictor va...,AM
44,*\texamining group differences on predictor va...,AM


In [7]:
# remove the duplicate rows within the dataset
df_train.drop_duplicates(inplace=True)
display(df_train.shape)

(816, 2)

<div class="alert alert-info" role="alert">
    There's quite a lot of duplicated rows present. These will be unhelpful for training the model and need to be removed.
</div>

In [8]:
df_train["sentence_length"] = df_train.Text.apply(len)
df_train.head()

Unnamed: 0,Text,Author,sentence_length
0,Scoring in PROC DISCRIM is as easy as validati...,AM,215
1,"In the GLM procedure, you may have used LSMEAN...",AM,782
2,"The first problem, accuracy of the data file, ...",AM,990
3,If the homogeneity of covariance matrices assu...,AM,934
4,"With a CONTRAST statement, you specify L, in t...",AM,1490


In [9]:
df_train["sentence_length"].describe()

count     816.000000
mean      688.286765
std       532.106926
min        61.000000
25%       304.000000
50%       526.500000
75%       922.000000
max      4096.000000
Name: sentence_length, dtype: float64

<div class="alert alert-info" role="alert">
    --> The smallest text is 61 characters long and the longest text is 4096 characters long.<br>
    --> The standard deviation is 547 characters which shows that the lengths are not uniform and that they vary widely.<br>
    --> Mean is greater than median, this means that it is a right skewed distribution with majority of the texts are short and it is the longer texts which are pulling the mean up, indicating the presence of outliers.<br>
</div>

In [10]:
# function to count the number of words inside a sentence
def word_counter(sent):
    return len(sent.split(" "))

df_train["word_count"] = df_train["Text"].apply(word_counter)

df_train.head()

Unnamed: 0,Text,Author,sentence_length,word_count
0,Scoring in PROC DISCRIM is as easy as validati...,AM,215,37
1,"In the GLM procedure, you may have used LSMEAN...",AM,782,129
2,"The first problem, accuracy of the data file, ...",AM,990,159
3,If the homogeneity of covariance matrices assu...,AM,934,146
4,"With a CONTRAST statement, you specify L, in t...",AM,1490,247


In [11]:
# getting the average word length

def avg_word_length(sent): 
    sent_len = len(sent.split(" "))
    return sum([len(wrd)  for wrd in sent]) / sent_len

df_train["avg_word_length"] = df_train["Text"].apply(avg_word_length)

df_train.head()

Unnamed: 0,Text,Author,sentence_length,word_count,avg_word_length
0,Scoring in PROC DISCRIM is as easy as validati...,AM,215,37,5.810811
1,"In the GLM procedure, you may have used LSMEAN...",AM,782,129,6.062016
2,"The first problem, accuracy of the data file, ...",AM,990,159,6.226415
3,If the homogeneity of covariance matrices assu...,AM,934,146,6.39726
4,"With a CONTRAST statement, you specify L, in t...",AM,1490,247,6.032389


In [12]:
df_train[["word_count", "avg_word_length"]].describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
word_count,816.0,108.006127,84.159438,12.0,47.0,81.0,147.0,704.0
avg_word_length,816.0,6.429311,0.597206,5.0,6.056834,6.349312,6.710606,10.9375


In [26]:
# adding a type-token ratio feature to check the lexical richness of the text, it is measured by looking at the ratio of the unique tokens to the total number of tokens used within the text. 

# the aim is to look at the lexical richness of each author's texts by grouping by the author name.


def text_tokenizer(sent):
    lemmatizer = WordNetLemmatizer()
    sent = sent.lower() # convert the text to lowercase
    tokens = re.split(r'\W+', sent) # split text based on non word characters
    clean_tokens = [lemmatizer.lemmatize(i) for i in tokens if i not in string.punctuation and not i.isdigit()] # removing number, punctuation and lemmatizing the text
    return set(clean_tokens) # return set of clean tokens
    
    
def lexical_diversity(text):
    tokens = text_tokenizer(text)
    types = len(set(tokens))
    return types / len(tokens) if tokens else 0


df_train['Lexical_Diversity'] = df_train['Text'].apply(lexical_diversity)
author_lexical_diversity = df_train.groupby('Author')['Lexical_Diversity'].mean()


# Display the results
author_lexical_diversity.sort_values(ascending=False)

Author
AM    1.0
CD    1.0
DM    1.0
DO    1.0
FE    1.0
TK    1.0
Name: Lexical_Diversity, dtype: float64

In [50]:
# looking at the row with the smallest type to token ratio. 
small_ttt = df_train["type_to_token"].argmin()
df_train.iloc[small_ttt]

KeyError: 'type_to_token'

In [28]:
df_train["comma_count"] = df_train["Text"].str.count(",")
df_train.head()

Unnamed: 0,Text,Author,sentence_length,word_count,avg_word_length,type_to_token_by_author,Lexical_Diversity,comma_count
0,Scoring in PROC DISCRIM is as easy as validati...,AM,215,37,5.810811,0.756757,1.0,0
1,"In the GLM procedure, you may have used LSMEAN...",AM,782,129,6.062016,0.542636,1.0,4
2,"The first problem, accuracy of the data file, ...",AM,990,159,6.226415,0.583851,1.0,6
3,If the homogeneity of covariance matrices assu...,AM,934,146,6.39726,0.463087,1.0,4
4,"With a CONTRAST statement, you specify L, in t...",AM,1490,247,6.032389,0.370518,1.0,8


In [31]:
def avg_sentence_length(txt):
    sents = re.split(r'[.!?]+', txt)
    sents = [sent.strip() for sent in sents if sent.strip()]
    word_counts = [len(sent.split()) for sent in sents]

    if len(word_counts) > 0:
        return sum(word_counts) / len(word_counts)
    else:
        return 0
    
df_train['avg_sentence_length'] = df_train['Text'].apply(avg_sentence_length)

In [32]:
df_train

Unnamed: 0,Text,Author,sentence_length,word_count,avg_word_length,type_to_token_by_author,Lexical_Diversity,comma_count,Avg_Sentence_Length,avg_sentence_length
0,Scoring in PROC DISCRIM is as easy as validati...,AM,215,37,5.810811,0.756757,1.0,0,12.333333,12.333333
1,"In the GLM procedure, you may have used LSMEAN...",AM,782,129,6.062016,0.542636,1.0,4,18.428571,18.428571
2,"The first problem, accuracy of the data file, ...",AM,990,159,6.226415,0.583851,1.0,6,16.000000,16.000000
3,If the homogeneity of covariance matrices assu...,AM,934,146,6.397260,0.463087,1.0,4,18.250000,18.250000
4,"With a CONTRAST statement, you specify L, in t...",AM,1490,247,6.032389,0.370518,1.0,8,15.687500,15.687500
...,...,...,...,...,...,...,...,...,...,...
1917,6. Almost everyone will agree that we live in ...,TK,294,51,5.764706,0.680000,1.0,1,17.000000,17.000000
1918,17. Art forms that appeal to modern leftist in...,TK,331,55,6.018182,0.851852,1.0,3,27.500000,27.500000
1919,201. Suppose for example that the revolutionar...,TK,837,133,6.293233,0.563910,1.0,3,14.777778,14.777778
1920,71. People have many transitory drives or impu...,TK,881,156,5.647436,0.602564,1.0,7,19.500000,19.500000


In [34]:
import spacy

# Load the SpaCy English model
nlp = spacy.load("en_core_web_sm")


def pos_proportions(text, pos_tag):
    doc = nlp(text)
    pos_counts = sum(1 for token in doc if token.pos_ == pos_tag)
    total_words = sum(1 for token in doc if token.is_alpha)
    return pos_counts / total_words if total_words > 0 else 0


df_train['POS_Nouns'] = df_train['Text'].apply(lambda x: pos_proportions(x, "NOUN"))
df_train['POS_Verbs'] = df_train['Text'].apply(lambda x: pos_proportions(x, "VERB"))


df_train.head()

Unnamed: 0,Text,Author,sentence_length,word_count,avg_word_length,type_to_token_by_author,Lexical_Diversity,comma_count,Avg_Sentence_Length,avg_sentence_length,POS_Nouns,POS_Verbs
0,Scoring in PROC DISCRIM is as easy as validati...,AM,215,37,5.810811,0.756757,1.0,0,12.333333,12.333333,0.305556,0.166667
1,"In the GLM procedure, you may have used LSMEAN...",AM,782,129,6.062016,0.542636,1.0,4,18.428571,18.428571,0.286822,0.093023
2,"The first problem, accuracy of the data file, ...",AM,990,159,6.226415,0.583851,1.0,6,16.0,16.0,0.234177,0.158228
3,If the homogeneity of covariance matrices assu...,AM,934,146,6.39726,0.463087,1.0,4,18.25,18.25,0.312925,0.081633
4,"With a CONTRAST statement, you specify L, in t...",AM,1490,247,6.032389,0.370518,1.0,8,15.6875,15.6875,0.282869,0.12749


In [39]:
def flesch_reading_score(text):
    return textstat.flesch_reading_ease(text)


df_train['flesch_reading_score'] = df_train['Text'].apply(flesch_reading_score)

df_train.head()

Unnamed: 0,Text,Author,sentence_length,word_count,avg_word_length,type_to_token_by_author,Lexical_Diversity,comma_count,Avg_Sentence_Length,avg_sentence_length,POS_Nouns,POS_Verbs,flesch_reading_score
0,Scoring in PROC DISCRIM is as easy as validati...,AM,215,37,5.810811,0.756757,1.0,0,12.333333,12.333333,0.305556,0.166667,58.99
1,"In the GLM procedure, you may have used LSMEAN...",AM,782,129,6.062016,0.542636,1.0,4,18.428571,18.428571,0.286822,0.093023,52.8
2,"The first problem, accuracy of the data file, ...",AM,990,159,6.226415,0.583851,1.0,6,16.0,16.0,0.234177,0.158228,46.88
3,If the homogeneity of covariance matrices assu...,AM,934,146,6.39726,0.463087,1.0,4,18.25,18.25,0.312925,0.081633,44.44
4,"With a CONTRAST statement, you specify L, in t...",AM,1490,247,6.032389,0.370518,1.0,8,15.6875,15.6875,0.282869,0.12749,55.54


In [43]:
def unique_word_count(text):
    words = text.split()
    return len(set(words))

df_train['unique_word_count'] = df_train['Text'].apply(unique_word_count)
df_train.head()

Unnamed: 0,Text,Author,sentence_length,word_count,avg_word_length,type_to_token_by_author,Lexical_Diversity,comma_count,Avg_Sentence_Length,avg_sentence_length,POS_Nouns,POS_Verbs,flesch_reading_score,Unique_Word_Count,Exclamation_Count,exclamation_count,unique_word_count
0,Scoring in PROC DISCRIM is as easy as validati...,AM,215,37,5.810811,0.756757,1.0,0,12.333333,12.333333,0.305556,0.166667,58.99,32,0,0,32
1,"In the GLM procedure, you may have used LSMEAN...",AM,782,129,6.062016,0.542636,1.0,4,18.428571,18.428571,0.286822,0.093023,52.8,84,0,0,84
2,"The first problem, accuracy of the data file, ...",AM,990,159,6.226415,0.583851,1.0,6,16.0,16.0,0.234177,0.158228,46.88,110,1,1,110
3,If the homogeneity of covariance matrices assu...,AM,934,146,6.39726,0.463087,1.0,4,18.25,18.25,0.312925,0.081633,44.44,82,0,0,82
4,"With a CONTRAST statement, you specify L, in t...",AM,1490,247,6.032389,0.370518,1.0,8,15.6875,15.6875,0.282869,0.12749,55.54,120,0,0,120


In [44]:
def count_exclamations(text):
    return text.count('!')

df_train['exclamation_count'] = df_train['Text'].apply(count_exclamations)
df_train.head()

Unnamed: 0,Text,Author,sentence_length,word_count,avg_word_length,type_to_token_by_author,Lexical_Diversity,comma_count,Avg_Sentence_Length,avg_sentence_length,POS_Nouns,POS_Verbs,flesch_reading_score,Unique_Word_Count,Exclamation_Count,exclamation_count,unique_word_count
0,Scoring in PROC DISCRIM is as easy as validati...,AM,215,37,5.810811,0.756757,1.0,0,12.333333,12.333333,0.305556,0.166667,58.99,32,0,0,32
1,"In the GLM procedure, you may have used LSMEAN...",AM,782,129,6.062016,0.542636,1.0,4,18.428571,18.428571,0.286822,0.093023,52.8,84,0,0,84
2,"The first problem, accuracy of the data file, ...",AM,990,159,6.226415,0.583851,1.0,6,16.0,16.0,0.234177,0.158228,46.88,110,1,1,110
3,If the homogeneity of covariance matrices assu...,AM,934,146,6.39726,0.463087,1.0,4,18.25,18.25,0.312925,0.081633,44.44,82,0,0,82
4,"With a CONTRAST statement, you specify L, in t...",AM,1490,247,6.032389,0.370518,1.0,8,15.6875,15.6875,0.282869,0.12749,55.54,120,0,0,120


In [45]:
# value to check how complex the text is
df_train['gunning_fog_index'] = df_train['Text'].apply(textstat.gunning_fog)
df_train.head()

Unnamed: 0,Text,Author,sentence_length,word_count,avg_word_length,type_to_token_by_author,Lexical_Diversity,comma_count,Avg_Sentence_Length,avg_sentence_length,POS_Nouns,POS_Verbs,flesch_reading_score,Unique_Word_Count,Exclamation_Count,exclamation_count,unique_word_count,Gunning_Fog_Index
0,Scoring in PROC DISCRIM is as easy as validati...,AM,215,37,5.810811,0.756757,1.0,0,12.333333,12.333333,0.305556,0.166667,58.99,32,0,0,32,10.33
1,"In the GLM procedure, you may have used LSMEAN...",AM,782,129,6.062016,0.542636,1.0,4,18.428571,18.428571,0.286822,0.093023,52.8,84,0,0,84,11.39
2,"The first problem, accuracy of the data file, ...",AM,990,159,6.226415,0.583851,1.0,6,16.0,16.0,0.234177,0.158228,46.88,110,1,1,110,11.39
3,If the homogeneity of covariance matrices assu...,AM,934,146,6.39726,0.463087,1.0,4,18.25,18.25,0.312925,0.081633,44.44,82,0,0,82,11.98
4,"With a CONTRAST statement, you specify L, in t...",AM,1490,247,6.032389,0.370518,1.0,8,15.6875,15.6875,0.282869,0.12749,55.54,120,0,0,120,9.15


In [None]:
plt.figure(figsize=(10, 4))
sns.countplot(data=df_train, x="Author", order=df_train["Author"].value_counts().index, palette=palette);
plt.title("Target Variable Distribution");
plt.ylabel("Author Counts");
plt.xlabel("Authors");
plt.xticks();

<div class="alert alert-info" role="alert">
    Data is imbalanced, need to do a combination of oversampling and undersampling to have equal representation of each class (author) and ensure that our model shows no bias. The reason for using combination of oversampling and undersampling is because making use of just undersampling techniques will reduce the amount of data provided and doing just oversampling will run the risk of overfitting the model, as it might start to "memorize" the oversampled data instead of learning to generalize from patterns in the data.
</div>

In [None]:
# sns.boxplot(data=df_train, x=df_train.Author, y=df_train.length)
sns.violinplot(data=df_train, x=df_train.Author, y=df_train.length, inner="points", palette=palette)

<div class="alert alert-info" role="alert">
    Presence of outliers indicates that texts which are unusually long or short could indicate the author's sentiment more strongly. The violin plot shows that majority of the text lengths are around 400 - 500 characters.
</div>
<div class="alert alert-info" role="alert">
    Given the differences in the text lengths, it would be best to normalize the text to control the author specific length effects.
</div>

In [None]:
text = " ".join(df_train.Text)

wordcloud = WordCloud().generate(text)

plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

<div class="alert alert-info" role="alert">
At first glance it can be seen that: 
    <br>
    -> Most important words: "Model", "Data", "Analysis", "Variable" <br>
    -> Stopwords: "One", "May", "Even", "Must"
</div>

In [None]:
# look at the words that each author has written

author_texts = {author: " ".join(df_train[df_train['Author'] == author]['Text']) for author in df_train.Author.unique()}

fig = plt.figure(figsize = (8,4))


for author, text in author_texts.items():
    wordcloud = WordCloud().generate(text)
    plt.figure(figsize = (8, 8), facecolor = None) 
    plt.subplot(122)
    plt.imshow(wordcloud) 
    plt.title(f"Text by {author}")
    plt.axis("off") 
    plt.tight_layout(pad = 0) 
    plt.show()

<div class="alert alert-info" role="alert">
At first glance it can be seen that each author is speaking about different topics: 
    <br>
    -> AM: talks about statistical classification techniques, discussing things such as discriminant analysis, group differences, and variable selection <br>
    -> CD: talks about machine learning and probability concepts like logistic regression, odds ratios, model estimation, predictor variable <br>
    -> DM: talks about data modeling, specifically addressing the management and analysis of transactional data sets <br>
    -> DO: talks about experimental design and topics like block design and factorial experiements <br>
    -> FE: talks about time series forecasting with topics like PROC, ARIMA and forecast <br>
    -> TK: talks about societal impact, with topics like society, psychological, and people.
</div>

## DATA PREPARATION

In [None]:
def text_preprocessing(txt):
    lemmatiser = WordNetLemmatizer()
    txt_clean = txt.replace('\t', ' ').replace('\n', ' ')
    txt_tokens = txt_clean.split(" ")
    txt_lowercase = [word.lower() for word in txt_tokens]
    txt_no_punctuation = [word.translate(str.maketrans('', '', string.punctuation)) for word in txt_lowercase]
    txt_lemmatize = [lemmatiser.lemmatize(word) for word in txt_no_punctuation]
    txt_no_stopwords = [word for word in txt_lemmatize if word not in stopwords.words('english') and word != ""]
    return txt_no_stopwords

In [None]:
df_train["preprocessed_text"] = df_train["Text"].apply(lambda x: text_preprocessing(x))
df_train.head()

In [None]:
grouped = df_train.groupby('Author')['preprocessed_text'].sum()

for author, words in grouped.items():
    text = ' '.join(words)
    wordcloud = WordCloud().generate(text)
    plt.figure(figsize = (8, 8)) 
    plt.subplot(122)
    plt.imshow(wordcloud) 
    plt.title(f"Text by {author}")
    plt.axis("off")
    plt.tight_layout(pad = 0) 
    plt.show()

In [None]:
all_tokens = sum(df_train['preprocessed_text'], [])
unigram_freq = FreqDist(all_tokens)

unigrams, counts = zip(*unigram_freq.most_common(15))
plt.barh(unigrams, counts)
plt.title('TOP 15 UNIGRAMS')
plt.show()

In [None]:
all_bigrams = list(bigrams(all_tokens))
bigram_freq = FreqDist(all_bigrams)
print(bigram_freq.most_common(10))

bigrams, counts = zip(*bigram_freq.most_common(15))
bigram_labels = [' '.join(bigram) for bigram in bigrams]
plt.barh(bigram_labels, counts)
plt.title('Top 15 BIGRAMS')
plt.figure(figsize = (15, 15))
plt.show()

In [None]:
all_trigrams = list(trigrams(all_tokens))
all_trigrams
trigram_freq = FreqDist(all_trigrams)
print(trigram_freq.most_common(15))

trigrams, counts = zip(*trigram_freq.most_common(15))
trigram_labels = [' '.join(trigram) for trigram in trigrams]
plt.barh(trigram_labels, counts)
plt.title('TOP 15 TRIGRAMS')
plt.figure(figsize = (15, 15))
plt.show()

In [None]:
compound_words = [i[0] for i in bigram_freq.most_common(15)]
compound_words += [('logistic', 'regression', 'model')]
mwe_tokenizer = MWETokenizer(compound_words, separator="_")
mwe_tokenizer

In [None]:
# use this to update the tokens with compound words found from the bigram analysis
# also remove special characters and numbers
def apply_mwe_to_tokens(row):
    tokens = mwe_tokenizer.tokenize(row)
    cleaned_tokens = [token.replace('\t', '') for token in tokens if not token.isnumeric()]
    return cleaned_tokens

df_train['preprocessed_text'] = df_train['preprocessed_text'].apply(apply_mwe_to_tokens)
df_train.head()

In [None]:
df_train.to_csv("cleaned_data.csv")