<a href="https://colab.research.google.com/github/aaubs/ds-master/blob/main/notebooks/M2_spookyauthor_identification_nlp_explainer_tm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![](https://storage.googleapis.com/kaggle-media/competitions/spooky-books/dmitrij-paskevic-44124.jpg)

According to the Spooky Author Identification Kaggle competition page, the dataset contains three distinct author initials. The competition provides a mapping from these initials to the actual authors:


 - **EAP - Edgar Allan Poe:** American writer of poetry and short stories centered on mystery, the grisly, and the grim. Arguably his most famous work is the poem "The Raven", and he is widely considered a pioneer of detective fiction.

 - **HPL - H. P. Lovecraft:** Best known for his horror fiction. His most celebrated stories revolve around the fictional mythology of the creature "Cthulhu", a chimera with an octopus-like head, a humanoid body, and wings on its back.

 - **MWS - Mary Shelley:** Pursued a wide range of literary work as a novelist, dramatist, travel writer, and biographer. She is most celebrated for the classic tale *Frankenstein; or, The Modern Prometheus*, in which the scientist Frankenstein creates the Monster that has come to be associated with his name.



***Your Tasks:***

1.  Predict who wrote the text 

2.  Identify the topics on which each author focused




In [163]:
!pip install tweet-preprocessor -q

# Installing Gensim and PyLDAvis
!pip install -qq -U gensim
!pip install -qq pyLDAvis

In [164]:
# explainability (why did the model say it's related to this author)
!pip install eli5



In [165]:
import pandas as pd
import numpy as np
import preprocessor as prepro # text prepro
import tqdm #progress bar

import spacy #spacy for quick language prepro
nlp = spacy.load('en_core_web_sm') #instantiating English module

# sampling, splitting
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split


# loading ML libraries
from sklearn.pipeline import make_pipeline #pipeline creation
from sklearn.feature_extraction.text import TfidfVectorizer #transforms text to sparse matrix
from sklearn.linear_model import LogisticRegression #Logit model
from sklearn.metrics import classification_report #that's self explanatory
from sklearn.decomposition import TruncatedSVD #dimensionality reduction
from xgboost import XGBClassifier

import altair as alt #viz

#explainability
import eli5
from eli5.lime import TextExplainer

# topic modeling

from gensim.corpora.dictionary import Dictionary # Import the dictionary builder
from gensim.models import LdaMulticore # we'll use the faster multicore version of LDA

# Import pyLDAvis
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

%matplotlib inline
pyLDAvis.enable_notebook()

In [166]:
# prepro settings
prepro.set_options(prepro.OPT.URL, prepro.OPT.NUMBER, prepro.OPT.RESERVED, prepro.OPT.MENTION, prepro.OPT.SMILEY)

In [167]:
data = pd.read_csv('https://raw.githubusercontent.com/aaubs/ds-master/main/data/spooky_author_identification.csv')

In [168]:
data.shape

(19579, 3)

In [169]:
data.head()

Unnamed: 0,id,text,author
0,id26305,"This process, however, afforded me no means of...",EAP
1,id17569,It never once occurred to me that the fumbling...,HPL
2,id11008,"In his left hand was a gold snuff box, from wh...",EAP
3,id27763,How lovely is spring As we looked from Windsor...,MWS
4,id12958,"Finding nothing else, not even gold, the Super...",HPL


In [170]:
data['text_clean'] = data['text'].map(lambda t: prepro.clean(t))
data['text_clean'] = data['text_clean'].str.replace('#','')

In [171]:
# run progress bar and clean up using spacy but without some heavy parts of the pipeline

clean_text = []

pbar = tqdm.tqdm(total=len(data['text_clean']),position=0, leave=True)

for text in nlp.pipe(data['text_clean'], disable=["tagger", "parser", "ner"]):

  txt = [token.lemma_.lower() for token in text 
         if token.is_alpha 
         and not token.is_stop 
         and not token.is_punct]

  clean_text.append(" ".join(txt))

  pbar.update(1)

100%|██████████| 19579/19579 [2:25:24<00:00,  2.24it/s]
 99%|█████████▉| 19457/19579 [00:47<00:00, 825.90it/s]

In [172]:
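# write everything into one function that can be re-used later
def text_prepro(texts):
  """
  Takes a pandas Series of raw strings (one DataFrame column).
  Removes Twitter-style artifacts (URLs, numbers, reserved words, mentions, smileys) and '#',
  keeps alphabetic non-stop-word tokens, lowercases them,
  and returns a list of cleaned strings in the original order.
  """
  texts_clean = texts.map(lambda t: prepro.clean(t))
  texts_clean = texts_clean.str.replace('#', '', regex=False)

  clean_container = []

  pbar = tqdm.tqdm(total=len(texts_clean), position=0, leave=True)

  # the tagger, parser and NER are disabled to speed up processing
  for text in nlp.pipe(texts_clean, disable=["tagger", "parser", "ner"]):

    txt = [token.lemma_.lower() for token in text
           if token.is_alpha
           and not token.is_stop
           and not token.is_punct]

    clean_container.append(" ".join(txt))
    pbar.update(1)

  pbar.close() # release the progress bar once all texts are processed
  return clean_container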

In [173]:
# apply all prepro-pipeline to texts
data['text_clean'] = text_prepro(data['text'])

100%|██████████| 19579/19579 [00:26<00:00, 742.16it/s]
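The token filter inside `text_prepro` can be illustrated without spaCy. The sketch below is only an approximation under stated assumptions: `STOP_WORDS` is a tiny hypothetical stop list (spaCy's English list is much larger), whitespace splitting stands in for spaCy's tokenizer, and no lemmatization is attempted. The input sentence is one of the EAP rows shown in this notebook.

```python
import string

# Hypothetical, deliberately tiny stop list; spaCy's English list is far larger.
STOP_WORDS = {"the", "in", "from", "my", "a", "an", "of", "to"}

def simple_clean(text):
    """Lowercase, strip surrounding punctuation, keep alphabetic non-stop words."""
    kept = []
    for raw in text.split():
        word = raw.strip(string.punctuation).lower()
        if word.isalpha() and word not in STOP_WORDS:
            kept.append(word)
    return " ".join(kept)

cleaned = simple_clean("The surcingle hung in ribands from my body.")
print(cleaned)  # surcingle hung ribands body
```

For this sentence the result matches the `text_clean` value in the table, but that will not hold in general because the stop list and tokenizer differ.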


In [174]:
data

Unnamed: 0,id,text,author,text_clean
0,id26305,"This process, however, afforded me no means of...",EAP,process afforded means ascertaining dimensions...
1,id17569,It never once occurred to me that the fumbling...,HPL,occurred fumbling mere mistake
2,id11008,"In his left hand was a gold snuff box, from wh...",EAP,left hand gold snuff box capered hill cutting ...
3,id27763,How lovely is spring As we looked from Windsor...,MWS,lovely spring looked windsor terrace sixteen f...
4,id12958,"Finding nothing else, not even gold, the Super...",HPL,finding gold superintendent abandoned attempts...
...,...,...,...,...
19574,id17718,"I could have fancied, while I looked at it, th...",EAP,fancied looked eminent landscape painter built...
19575,id08973,The lids clenched themselves together as if in...,EAP,lids clenched spasm
19576,id05267,"Mais il faut agir that is to say, a Frenchman ...",EAP,mais il faut agir frenchman faints outright
19577,id17513,"For an item of news like this, it strikes us i...",EAP,item news like strikes coolly received


In [175]:
# renaming and reordering

data_df = pd.DataFrame({'label':data['author'], 'text':data['text_clean']})

In [176]:
data_df

Unnamed: 0,label,text
0,EAP,process afforded means ascertaining dimensions...
1,HPL,occurred fumbling mere mistake
2,EAP,left hand gold snuff box capered hill cutting ...
3,MWS,lovely spring looked windsor terrace sixteen f...
4,HPL,finding gold superintendent abandoned attempts...
...,...,...
19574,EAP,fancied looked eminent landscape painter built...
19575,EAP,lids clenched spasm
19576,EAP,mais il faut agir frenchman faints outright
19577,EAP,item news like strikes coolly received


In [177]:
data_df.label.value_counts().reset_index()

Unnamed: 0,index,label
0,EAP,7900
1,MWS,6044
2,HPL,5635


In [178]:
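# name the columns explicitly so the chart does not depend on the pandas version
label_counts = data_df['label'].value_counts().rename_axis('author').reset_index(name='n_docs')

alt.Chart(label_counts).mark_bar(filled=True).encode(
    alt.X('n_docs:Q', title='N Docs'),
    alt.Y('author:N', title='Author')
)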

In [179]:
# fixing sample imbalance
rus = RandomUnderSampler(random_state=42)
data_df_res, y_res = rus.fit_resample(data_df, data_df['label'])

In [180]:
data_df_res['label'].value_counts()

EAP    5635
HPL    5635
MWS    5635
Name: label, dtype: int64
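The idea behind random undersampling can be sketched in a few lines of plain Python: every class is cut down to the size of the rarest class by drawing rows at random. This is an illustrative sketch, not the `imblearn` implementation; the function name `random_undersample` and the toy labels are assumptions made for the example.

```python
import random
from collections import defaultdict

def random_undersample(labels, seed=42):
    """Return sorted row indices so every class keeps as many rows as the rarest class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    n_min = min(len(rows) for rows in by_class.values())
    rng = random.Random(seed)
    kept = []
    for rows in by_class.values():
        kept.extend(rng.sample(rows, n_min))  # sample without replacement
    return sorted(kept)

# Toy labels with the same ordering of class sizes as the notebook (EAP > MWS > HPL).
labels = ["EAP"] * 5 + ["MWS"] * 4 + ["HPL"] * 3
kept = random_undersample(labels)

counts = {}
for i in kept:
    counts[labels[i]] = counts.get(labels[i], 0) + 1
print(counts)
```

The price of balancing this way is that rows from the larger classes are discarded, which is what reduces the data from 19579 to 3 × 5635 rows above.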

In [181]:
# split the balanced data into a training set (60%) and a test set (40%)
X_train, X_test, y_train, y_test = train_test_split(data_df_res['text'], y_res, test_size = 0.4, random_state = 42)
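The split above shuffles all rows together, so the per-author counts in the two sets can differ slightly even though the resampled data is balanced. scikit-learn's `train_test_split` accepts a `stratify` argument for this purpose; the plain-Python sketch below shows what a stratified split does. The function `stratified_split` and the toy labels are hypothetical and only meant to illustrate the idea.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.4, seed=42):
    """Split row indices so each class contributes the same share to the test set."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for rows in by_class.values():
        rows = rows[:]            # copy so the caller's data is not reordered
        rng.shuffle(rows)
        n_test = round(len(rows) * test_size)
        test_idx.extend(rows[:n_test])
        train_idx.extend(rows[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = ["EAP"] * 10 + ["HPL"] * 10 + ["MWS"] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```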

In [182]:
#instantiate models and "bundle up as pipeline"

tfidf = TfidfVectorizer()
cls = LogisticRegression()

pipe = make_pipeline(tfidf, cls)

In [183]:
pipe.fit(X_train,y_train) # fit model

Pipeline(steps=[('tfidfvectorizer', TfidfVectorizer()),
                ('logisticregression', LogisticRegression())])
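To make the first pipeline step less opaque, here is a minimal sketch of the weighting that scikit-learn documents as the `TfidfVectorizer` default: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. Whitespace splitting is an assumption of the sketch (the real vectorizer uses its own token pattern), and the third toy document is invented for the example.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, with smoothed IDF and L2-normalised rows."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        weights = {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0  # guard empty docs
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

# Two cleaned rows from the notebook plus one invented toy document.
docs = ["lids clenched spasm", "surcingle hung ribands body", "lids body"]
rows = tfidf(docs)
print(rows[0])
```

In the first document, `clenched` receives a higher weight than `lids` because `lids` also occurs in another document and is therefore less distinctive.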

In [184]:
# evaluate model performance on training set

y_eval = pipe.predict(X_train)
report = classification_report(y_train, y_eval)
print(report)

              precision    recall  f1-score   support

         EAP       0.94      0.93      0.94      3378
         HPL       0.95      0.96      0.95      3356
         MWS       0.95      0.95      0.95      3409

    accuracy                           0.95     10143
   macro avg       0.95      0.95      0.95     10143
weighted avg       0.95      0.95      0.95     10143



In [185]:
# evaluate model performance on test set

y_pred = pipe.predict(X_test)
report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

         EAP       0.74      0.76      0.75      2257
         HPL       0.80      0.79      0.80      2279
         MWS       0.80      0.80      0.80      2226

    accuracy                           0.78      6762
   macro avg       0.78      0.78      0.78      6762
weighted avg       0.78      0.78      0.78      6762
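Accuracy drops from 0.95 on the training set to 0.78 on the test set, which suggests that the model partly fits vocabulary specific to the training rows. To read the reports, it helps to see how the per-class columns are computed. The sketch below is a simplified re-implementation for illustration only (the function `per_class_report` and the toy labels are assumptions); `classification_report` remains the tool used in this notebook.

```python
def per_class_report(y_true, y_pred):
    """Compute precision, recall, F1 and support for every label."""
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": tp + fn}
    return report

# Toy example: six texts, two per author.
y_true_toy = ["EAP", "EAP", "HPL", "HPL", "MWS", "MWS"]
y_pred_toy = ["EAP", "HPL", "HPL", "HPL", "MWS", "EAP"]
toy_report = per_class_report(y_true_toy, y_pred_toy)
for label, row in toy_report.items():
    print(label, row)
```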



In [186]:
# run single prediction
# 'I am serious in asserting that my breath was entirely gone.' Edgar Allan Poe

t1 = ['For I am Iranon, who was a Prince in Aira.'] # "The Quest of Iranon" by H. P. Lovecraft

In [187]:
# preprocess

t1_p = text_prepro(pd.Series(t1)) # note, we need to pack text up as pd.Series 

100%|██████████| 1/1 [00:00<00:00, 163.54it/s]


In [188]:
# predict

pipe.predict(t1_p)

array(['HPL'], dtype=object)

In [189]:
# overall weights (works only for linear models)
eli5.show_weights(pipe, top=10, target_names=['Edgar Allan Poe', 'HP Lovecraft', 'Mary Shelley'])



Weight?,Feature
+1.956,character
+1.890,lady
+1.881,dupin
+1.863,fact
+1.829,matter
+1.805,de
… 7272 more positive …,… 7272 more positive …
… 11655 more negative …,… 11655 more negative …
-1.794,father
-1.935,told

Weight?,Feature
+2.866,west
+2.502,gilman
+2.384,later
+2.374,ancient
+2.315,innsmouth
+2.303,street
+2.186,old
+2.100,told
+2.059,stone
… 8526 more positive …,… 8526 more positive …

Weight?,Feature
+4.074,raymond
+3.165,perdita
+3.074,adrian
+2.827,love
+2.461,idris
+2.342,father
+2.293,england
+2.121,life
+2.093,misery
+2.087,tears


In [190]:
pipe[1]

LogisticRegression()

In [191]:
# explain one prediction
eli5.show_prediction(pipe[1], t1_p[0], vec=pipe[0],
                     target_names=['Edgar Allan Poe', 'HP Lovecraft', 'Mary Shelley'])

Contribution?,Feature
0.27,<BIAS>
-0.243,Highlighted in text (sum)

Contribution?,Feature
0.983,Highlighted in text (sum)
-0.091,<BIAS>

Contribution?,Feature
-0.18,<BIAS>
-0.74,Highlighted in text (sum)
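For a linear model, an explanation like the one above boils down to simple arithmetic: each feature contributes its weight times its TF-IDF value, and the class score is the bias plus the sum of those contributions. The sketch below illustrates this with three weights taken from the second `show_weights` table (the HP Lovecraft class under the `target_names` order) and the bias from the matching prediction table. The TF-IDF values and the function `explain_linear` are hypothetical.

```python
def explain_linear(feature_values, class_weights, class_bias):
    """Return per-feature contributions (weight * value) and the resulting class score."""
    contributions = {
        feature: class_weights.get(feature, 0.0) * value
        for feature, value in feature_values.items()
    }
    score = class_bias + sum(contributions.values())
    return contributions, score

# Weights and bias as printed by eli5 above; the feature values are made up.
weights_hpl = {"ancient": 2.374, "stone": 2.059, "old": 2.186}
features = {"ancient": 0.6, "stone": 0.8}   # hypothetical L2-normalised TF-IDF values

contribs, score = explain_linear(features, weights_hpl, class_bias=-0.091)
print(contribs, score)
```

The "Highlighted in text (sum)" row in the eli5 output corresponds to the sum of the per-feature contributions, shown next to the bias term.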


In [192]:
data

Unnamed: 0,id,text,author,text_clean
0,id26305,"This process, however, afforded me no means of...",EAP,process afforded means ascertaining dimensions...
1,id17569,It never once occurred to me that the fumbling...,HPL,occurred fumbling mere mistake
2,id11008,"In his left hand was a gold snuff box, from wh...",EAP,left hand gold snuff box capered hill cutting ...
3,id27763,How lovely is spring As we looked from Windsor...,MWS,lovely spring looked windsor terrace sixteen f...
4,id12958,"Finding nothing else, not even gold, the Super...",HPL,finding gold superintendent abandoned attempts...
...,...,...,...,...
19574,id17718,"I could have fancied, while I looked at it, th...",EAP,fancied looked eminent landscape painter built...
19575,id08973,The lids clenched themselves together as if in...,EAP,lids clenched spasm
19576,id05267,"Mais il faut agir that is to say, a Frenchman ...",EAP,mais il faut agir frenchman faints outright
19577,id17513,"For an item of news like this, it strikes us i...",EAP,item news like strikes coolly received


In [193]:
data[data['author']=='EAP']

Unnamed: 0,id,text,author,text_clean
0,id26305,"This process, however, afforded me no means of...",EAP,process afforded means ascertaining dimensions...
2,id11008,"In his left hand was a gold snuff box, from wh...",EAP,left hand gold snuff box capered hill cutting ...
6,id09674,"The astronomer, perhaps, at this point, took r...",EAP,astronomer point took refuge suggestion non lu...
7,id13515,The surcingle hung in ribands from my body.,EAP,surcingle hung ribands body
8,id19322,I knew that you could not say to yourself 'ste...,EAP,knew stereotomy brought think atomies theories...
...,...,...,...,...
19572,id03325,But these and other difficulties attending res...,EAP,difficulties attending respiration means great...
19574,id17718,"I could have fancied, while I looked at it, th...",EAP,fancied looked eminent landscape painter built...
19575,id08973,The lids clenched themselves together as if in...,EAP,lids clenched spasm
19576,id05267,"Mais il faut agir that is to say, a Frenchman ...",EAP,mais il faut agir frenchman faints outright


In [194]:
eli5.show_prediction(pipe[1], data['text_clean'][100], vec=pipe[0],
                     target_names=['Edgar Allan Poe', 'HP Lovecraft', 'Mary Shelley'])



Contribution?,Feature
0.27,<BIAS>
-0.046,Highlighted in text (sum)

Contribution?,Feature
-0.091,<BIAS>
-0.592,Highlighted in text (sum)

Contribution?,Feature
0.638,Highlighted in text (sum)
-0.18,<BIAS>


## Let's try a complex (black-box) model

In [195]:
#instantiate models and "bundle up as pipeline"
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

tfidf = TfidfVectorizer()
svd = TruncatedSVD(n_components = 800)
cls_xg = XGBClassifier(nthread=10)
# cls_xg = GaussianNB()
# cls_xg = KNeighborsClassifier(n_neighbors=5)

# cls_xg = LogisticRegression()

pipe_xg = make_pipeline(tfidf, svd, cls_xg)

In [196]:
pipe_xg.fit(X_train,y_train) # fit model



Pipeline(steps=[('tfidfvectorizer', TfidfVectorizer()),
                ('truncatedsvd', TruncatedSVD(n_components=800)),
                ('xgbclassifier',
                 XGBClassifier(nthread=10, objective='multi:softprob'))])

In [197]:
# evaluate model performance on training set

y_eval = pipe_xg.predict(X_train)
report = classification_report(y_train, y_eval)
print(report)

              precision    recall  f1-score   support

         EAP       0.72      0.69      0.71      3378
         HPL       0.71      0.78      0.74      3356
         MWS       0.78      0.75      0.76      3409

    accuracy                           0.74     10143
   macro avg       0.74      0.74      0.74     10143
weighted avg       0.74      0.74      0.74     10143



In [198]:
# evaluate model performance on test set

y_pred = pipe_xg.predict(X_test)
report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

         EAP       0.60      0.60      0.60      2257
         HPL       0.62      0.69      0.65      2279
         MWS       0.69      0.60      0.64      2226

    accuracy                           0.63      6762
   macro avg       0.64      0.63      0.63      6762
weighted avg       0.63      0.63      0.63      6762



In [199]:
# explain single prediction
te = TextExplainer(random_state=42)
te.fit(data['text_clean'][100], pipe_xg.predict_proba)
te.show_prediction(target_names=['Edgar Allan Poe', 'HP Lovecraft', 'Mary Shelley'])



Contribution?,Feature
-0.157,<BIAS>
-0.697,Highlighted in text (sum)

Contribution?,Feature
0.039,Highlighted in text (sum)
-0.589,<BIAS>

Contribution?,Feature
-0.277,Highlighted in text (sum)
-0.469,<BIAS>
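`TextExplainer` explains a black-box model by perturbing the text, querying the model on the perturbed versions, and fitting an interpretable surrogate. The sketch below is not LIME; it uses a simpler, related technique (leave-one-word-out occlusion) to show the perturb-and-query idea. `toy_predict_proba` is a hypothetical stand-in for a function such as `pipe_xg.predict_proba`, and its word scores are invented.

```python
def toy_predict_proba(text):
    """Hypothetical black box: returns a made-up probability for the HPL class."""
    spooky = {"innsmouth": 0.4, "ancient": 0.2, "stone": 0.1}
    score = 0.2 + sum(spooky.get(word, 0.0) for word in text.split())
    return min(score, 1.0)

def occlusion_importance(text, predict):
    """Drop each word in turn and record how much the predicted probability falls."""
    words = text.split()
    base = predict(text)
    importance = []
    for i, word in enumerate(words):
        reduced = " ".join(words[:i] + words[i + 1:])
        importance.append((word, base - predict(reduced)))
    return importance

importance = occlusion_importance("ancient stone street innsmouth", toy_predict_proba)
for word, drop in importance:
    print(f"{word:10s} {drop:+.2f}")
```

Words whose removal lowers the probability the most are the ones the model relies on. LIME goes one step further by sampling many perturbations at once and fitting a weighted linear model to them.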


In [200]:
# tokenize for topic modeling: keep lemmas of nouns, proper nouns, adjectives and adverbs
tokens = []

for doc in nlp.pipe(data['text_clean'], disable=["ner"]):
  doc_tokens = [token.lemma_.lower() for token in doc
                if token.pos_ in ['NOUN', 'PROPN', 'ADJ', 'ADV']
                and not token.is_stop
                and not token.is_punct]
  tokens.append(doc_tokens)

In [201]:
data['tokens'] = tokens

In [202]:
data

Unnamed: 0,id,text,author,text_clean,tokens
0,id26305,"This process, however, afforded me no means of...",EAP,process afforded means ascertaining dimensions...,"[process, mean, dimension, dungeon, circuit, r..."
1,id17569,It never once occurred to me that the fumbling...,HPL,occurred fumbling mere mistake,"[mere, mistake]"
2,id11008,"In his left hand was a gold snuff box, from wh...",EAP,left hand gold snuff box capered hill cutting ...,"[hand, gold, snuff, box, hill, manner, fantast..."
3,id27763,How lovely is spring As we looked from Windsor...,MWS,lovely spring looked windsor terrace sixteen f...,"[lovely, spring, windsor, terrace, fertile, co..."
4,id12958,"Finding nothing else, not even gold, the Super...",HPL,finding gold superintendent abandoned attempts...,"[gold, superintendent, attempt, occasionally, ..."
...,...,...,...,...,...
19574,id17718,"I could have fancied, while I looked at it, th...",EAP,fancied looked eminent landscape painter built...,"[eminent, landscape, painter, brush]"
19575,id08973,The lids clenched themselves together as if in...,EAP,lids clenched spasm,"[lid, spasm]"
19576,id05267,"Mais il faut agir that is to say, a Frenchman ...",EAP,mais il faut agir frenchman faints outright,"[mais, il, faut, agir, frenchman, faint, outri..."
19577,id17513,"For an item of news like this, it strikes us i...",EAP,item news like strikes coolly received,"[item, news, strike, coolly]"


In [203]:
data_EAP = data[data['author'] == 'EAP']

In [204]:
data_EAP

Unnamed: 0,id,text,author,text_clean,tokens
0,id26305,"This process, however, afforded me no means of...",EAP,process afforded means ascertaining dimensions...,"[process, mean, dimension, dungeon, circuit, r..."
2,id11008,"In his left hand was a gold snuff box, from wh...",EAP,left hand gold snuff box capered hill cutting ...,"[hand, gold, snuff, box, hill, manner, fantast..."
6,id09674,"The astronomer, perhaps, at this point, took r...",EAP,astronomer point took refuge suggestion non lu...,"[astronomer, point, refuge, suggestion, non, l..."
7,id13515,The surcingle hung in ribands from my body.,EAP,surcingle hung ribands body,"[surcingle, body]"
8,id19322,I knew that you could not say to yourself 'ste...,EAP,knew stereotomy brought think atomies theories...,"[stereotomy, atomie, theory, subject, long, ag..."
...,...,...,...,...,...
19572,id03325,But these and other difficulties attending res...,EAP,difficulties attending respiration means great...,"[difficulty, respiration, great, peril, life, ..."
19574,id17718,"I could have fancied, while I looked at it, th...",EAP,fancied looked eminent landscape painter built...,"[eminent, landscape, painter, brush]"
19575,id08973,The lids clenched themselves together as if in...,EAP,lids clenched spasm,"[lid, spasm]"
19576,id05267,"Mais il faut agir that is to say, a Frenchman ...",EAP,mais il faut agir frenchman faints outright,"[mais, il, faut, agir, frenchman, faint, outri..."


In [205]:
# Create a gensim Dictionary (token -> integer id) from the EAP token lists
dictionary = Dictionary(data_EAP['tokens'])
# filter out low-frequency / high-frequency stuff, also limit the vocabulary to max 1000 words
dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=1000)
# construct corpus using this dictionary
corpus = [dictionary.doc2bow(doc) for doc in data_EAP['tokens']]
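The three steps above (build a vocabulary, filter it by document frequency, convert each document to `(token_id, count)` pairs) can be sketched in plain Python. This is a rough approximation for illustration, not gensim's implementation: the helper names `build_vocab` and `doc2bow`, the alphabetical id assignment, the thresholds, and the toy documents are all assumptions.

```python
from collections import Counter

def build_vocab(docs, no_below=5, no_above=0.5, keep_n=1000):
    """Keep tokens that occur in at least `no_below` docs and at most `no_above` of all docs."""
    df = Counter(tok for doc in docs for tok in set(doc))
    n_docs = len(docs)
    kept = [tok for tok, count in df.items()
            if count >= no_below and count <= no_above * n_docs]
    kept.sort(key=lambda tok: (-df[tok], tok))   # most frequent first
    kept = kept[:keep_n]
    # ids are assigned alphabetically here; gensim assigns its own ids
    return {tok: idx for idx, tok in enumerate(sorted(kept))}

def doc2bow(doc, vocab):
    """Convert a token list into sorted (token_id, count) pairs, ignoring unknown tokens."""
    counts = Counter(tok for tok in doc if tok in vocab)
    return sorted((vocab[tok], n) for tok, n in counts.items())

toy_docs = [["lid", "spasm"], ["lid", "body"], ["body", "surcingle"], ["lid", "lid", "brush"]]
vocab = build_vocab(toy_docs, no_below=2, no_above=0.75)
toy_corpus = [doc2bow(doc, vocab) for doc in toy_docs]
print(vocab)
print(toy_corpus)
```

Each inner list is one document. Tokens that were filtered out of the vocabulary simply disappear, which is why some documents end up with very few pairs.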

In [206]:
corpus

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)],
 [(7, 1),
  (8, 1),
  (9, 1),
  (10, 1),
  (11, 1),
  (12, 1),
  (13, 1),
  (14, 1),
  (15, 1),
  (16, 1)],
 [(4, 1), (17, 1), (18, 1), (19, 1)],
 [(20, 1)],
 [(21, 1),
  (22, 1),
  (23, 1),
  (24, 1),
  (25, 1),
  (26, 1),
  (27, 1),
  (28, 1),
  (29, 1),
  (30, 1),
  (31, 1),
  (32, 1),
  (33, 1),
  (34, 1),
  (35, 1)],
 [(36, 1)],
 [(37, 1), (38, 1)],
 [(27, 1),
  (39, 1),
  (40, 1),
  (41, 1),
  (42, 1),
  (43, 1),
  (44, 1),
  (45, 1),
  (46, 1),
  (47, 1),
  (48, 1),
  (49, 1)],
 [(23, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 1), (55, 1), (56, 1)],
 [(57, 1), (58, 1)],
 [(10, 1)],
 [(10, 1),
  (16, 1),
  (26, 1),
  (49, 1),
  (59, 1),
  (60, 1),
  (61, 1),
  (62, 1),
  (63, 1),
  (64, 1),
  (65, 1),
  (66, 1),
  (67, 1)],
 [(68, 1), (69, 1)],
 [(52, 1), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1), (75, 1), (76, 1)],
 [(9, 2),
  (26, 1),
  (40, 1),
  (77, 1),
  (78, 1),
  (79, 2),
  (80, 1),
  (81, 1),
  (82, 1),
  (83

In [207]:
# Training the model
lda_model = LdaMulticore(corpus, id2word=dictionary, num_topics=5, workers = 4, passes=10)

In [208]:
# Let's try to visualize
lda_display = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)



In [209]:
# display the interactive visualization
pyLDAvis.display(lda_display)