# Chapter 3 - Indexes
This notebook has the code I used to perform the classification experiments on Scottish Enlightenment philosophical histories. There are three different tests, which I describe below. The unique aspect of this chapter is that I ran the classifier on the word categories from the Linguistic Inquiry and Word Count (LIWC) dictionary as a way to examine the distinctive approach to history found in the Scottish texts.

In [None]:
%matplotlib inline 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
import seaborn as sns
import matplotlib.pyplot as plt
import random
import time
import pickle
import numpy as np 
import pandas as pd
import bookFunctions

## Load and organize corpora

I describe three different classification tests in Chapter 3. 
1. Systematic vs. historical texts
2. Philosophical histories vs. narrative histories
3. Indexes for philosophical histories vs. indexes for narrative histories

This experiment uses word categories from the [*Linguistic Inquiry and Word Count 2015*](https://liwc.app/) dictionary, which is proprietary software. You'll need to purchase it or obtain it through your institution to use it. There is now a 2022 version of the word category dictionary. The datasets below include texts that have already been transformed from word lists to LIWC word categories. If you'd like to perform the tokenization and word category transformation yourself, there is another notebook for doing that.

### Scottish Enlightenment Corpus
This is the full Scottish Enlightenment corpus that I use in the chapter. It is stored as a list of dictionaries with keys for the metadata and the text, which is stored as a word list. Here you can examine the texts before they are transformed into lists of word categories.

In [None]:
scottishCorpus = pickle.load(open('scottishCorpus.p', 'rb'))

In [None]:
# Here are the fields for the corpus.
print(scottishCorpus[0].keys())

### Systematic versus Historical Scottish Enlightenment Texts
This test set is organized as a list of tuples: `(text class, liwc word categories lists)`. The classes are `systematic` and `historical`. 
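
A quick way to confirm that a dataset has this shape is to count its class labels. The sketch below uses a tiny made-up list in place of `sysHistCorpus`; the category tokens and the helper name `count_classes` are invented for illustration, and the second tuple element is assumed to be a whitespace-separated string.

```python
from collections import Counter

# Toy stand-in for sysHistCorpus: (text class, LIWC word categories) tuples.
# The category tokens are invented for illustration only.
toy_corpus = [
    ("systematic", "cogproc insight cause function"),
    ("historical", "focuspast social power function"),
    ("historical", "focuspast time space"),
]

def count_classes(corpus):
    """Count how many texts carry each class label (the first tuple element)."""
    return Counter(label for label, _ in corpus)

class_counts = count_classes(toy_corpus)
print(class_counts)
```

Once the real pickle is loaded, `count_classes(sysHistCorpus)` should report only the `systematic` and `historical` labels.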

In [None]:
sysHistCorpus = pickle.load(open('sysVsHistCorp.p', 'rb'))

### Philosophical histories versus Narrative Histories
The following dataset is organized as a list of tuples, with each item like this: `(text class, liwc word category list)`. The text classes are `philhist` and `sample`.

In [None]:
historyWordCats = pickle.load(open('philHistSampCats.p', 'rb'))

### Philosophical history indexes versus narrative history indexes
For the index analysis, the test set is `indexCorpusWordCats`, which holds the word categories for the full set of philosophical history indexes and the sample indexes. Before loading it, I also load `sampleIndexCorpus`, a list of dictionaries with the full texts and metadata for all the sample indexes.

In [None]:
sampleIndexCorpus = pickle.load(open('eccoSampIndex.p', 'rb')) ### These are the indexes used in comparison with the philosophical history indexes. 
indexCorpusWordCats = pickle.load(open('philHistIndexCatsCorp.p', 'rb')) 

### Organize the datasets 
Set `corpus` to the dataset you want to run the classification test on. The cell below assigns `sysHistCorpus` for the first test; replace it with `historyWordCats` or `indexCorpusWordCats` for the second or third.

In [None]:
corpus = sysHistCorpus ## This is the corpus for the first test. Replace it if you want to perform the second or third test.

In [None]:
categories = [] 
wordCatChunkList = [] 
for item in corpus:
    catList = item[1].split()
    splitText = bookFunctions.splitText(catList, 500) ## You can tinker with the chunk size
    for chunk in splitText:
        wordCatChunkList.append(chunk)
        categories.append(item[0])
print(len(wordCatChunkList))
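
`bookFunctions.splitText` is not shown in this notebook, so the following is only a guess at the kind of chunking it performs: cutting a list of category tokens into consecutive pieces of at most `chunk_size` items and joining each piece back into a string. The name `split_into_chunks` and the choice to keep a short final chunk are assumptions, not the author's implementation.

```python
def split_into_chunks(tokens, chunk_size):
    """Hypothetical stand-in for bookFunctions.splitText.

    Returns consecutive chunks of at most `chunk_size` tokens,
    each joined into a whitespace-separated string.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        " ".join(tokens[start:start + chunk_size])
        for start in range(0, len(tokens), chunk_size)
    ]

# Invented category tokens, for illustration only.
toy_tokens = "function pronoun verb focuspast social power time".split()
toy_chunks = split_into_chunks(toy_tokens, 3)
print(toy_chunks)
```

Smaller chunks yield more samples per text but noisier category frequencies, which is why the chunk size is worth experimenting with.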

In [None]:
d = {'category': categories, 'text': wordCatChunkList}
dfFull = pd.DataFrame(d)
dfFull['category_id'] = dfFull['category'].factorize()[0]

## Create word categories vectors
Create term vectors from the word category chunks. 

In [None]:
t0 = time.process_time() 
tfidf = TfidfVectorizer(sublinear_tf=True, ngram_range=(1,1), stop_words=None, min_df=.10, norm='l2', use_idf=False, 
                       max_features=60)
feats = tfidf.fit_transform(dfFull.text).toarray()
vocabulary = list(tfidf.get_feature_names_out())  ## get_feature_names() was removed in scikit-learn 1.2
labels = dfFull.category_id
print(time.process_time() - t0)

## Train classifier

Train four different classifiers to compare their degrees of accuracy before selecting one. 

In [None]:
models = [
    RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0),
    LinearSVC(),
    MultinomialNB(),
    LogisticRegression(random_state=0, multi_class='ovr', solver='liblinear',),
]
CV = 5  ## number of cross-validation folds
entries = []
for model in models:
    model_name = model.__class__.__name__
    accuracies = cross_val_score(model, feats, labels, scoring='accuracy', cv=CV) 
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))
cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])
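
Before plotting, it can help to reduce the per-fold scores to a mean and standard deviation per model. The sketch below does this with pandas on made-up accuracies laid out like `entries`; the numbers and the `toy_` names are illustrative, not results from the chapter.

```python
import pandas as pd

# Made-up fold accuracies in the same (model_name, fold_idx, accuracy) layout.
toy_entries = [
    ("LinearSVC", 0, 0.90), ("LinearSVC", 1, 0.80),
    ("MultinomialNB", 0, 0.70), ("MultinomialNB", 1, 0.60),
]
toy_cv_df = pd.DataFrame(toy_entries, columns=["model_name", "fold_idx", "accuracy"])

# Mean and spread of accuracy per model, best model first.
summary = (
    toy_cv_df.groupby("model_name")["accuracy"]
    .agg(["mean", "std"])
    .sort_values("mean", ascending=False)
)
print(summary)
```

The same `groupby` call should work on `cv_df` once the cross-validation loop has run.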

In [None]:
sns.boxplot(x='model_name', y='accuracy', data=cv_df)
sns.stripplot(x='model_name', y='accuracy', data=cv_df, 
              size=8, jitter=True, edgecolor="gray", linewidth=2)
plt.show()

In [None]:
from sklearn.model_selection import train_test_split

mod = LinearSVC() 
## Below are other models that you might try. Uncomment one to replace LinearSVC.
#mod = MultinomialNB()
#mod = RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0)
#mod = LogisticRegression(random_state=0, multi_class='ovr', solver='lbfgs')
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(feats, labels, dfFull.index, 
                                                                                 test_size=0.33, random_state=0)
mod.fit(X_train, y_train)
y_pred = mod.predict(X_test)

### Results

In [None]:
### Linear SVC
from sklearn import metrics
print(metrics.classification_report(y_test, y_pred, 
                                    target_names=dfFull['category'].unique())) 
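
A confusion matrix complements the classification report by showing which class the errors fall into. The sketch below uses made-up labels in place of `y_test` and `y_pred`; `confusion_matrix` is part of `sklearn.metrics`, and the `toy_` names are illustrative.

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted class ids (0 and 1), for illustration only.
toy_true = [0, 0, 0, 1, 1, 1]
toy_pred = [0, 0, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes.
toy_cm = confusion_matrix(toy_true, toy_pred)
print(toy_cm)
```

With the real split, `confusion_matrix(y_test, y_pred)` would give the same layout, with rows and columns ordered by `category_id`.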

## Examining the classifier results 
These functions create a dataframe for sorting the most informative LIWC word categories for distinguishing the two groups. You need to update the text class label to change it from the first classifier test to the second or third. 
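
The `bookFunctions` helpers are not shown here. For a linear model such as `LinearSVC`, one common way to find informative features is to rank them by the fitted coefficients in `coef_`: in a two-class problem, negative weights pull toward the first class and positive weights toward the second. The sketch below illustrates that idea on invented data and is an assumption about the general approach, not the author's `mostInformTerms`.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Invented feature names and weights: rows are chunks, columns are categories.
toy_vocab = ["cogproc", "focuspast", "function"]
toy_X = np.array([
    [0.9, 0.1, 0.5],
    [0.8, 0.2, 0.5],
    [0.7, 0.1, 0.5],
    [0.1, 0.9, 0.5],
    [0.2, 0.8, 0.5],
    [0.1, 0.7, 0.5],
])
toy_y = np.array([0, 0, 0, 1, 1, 1])  # class 0 leans on cogproc, class 1 on focuspast

toy_model = LinearSVC(random_state=0).fit(toy_X, toy_y)

# For two classes, coef_ has a single row with one weight per feature.
weights = toy_model.coef_[0]
ranked = sorted(zip(toy_vocab, weights), key=lambda pair: pair[1])
print("most indicative of class 0:", ranked[0][0])
print("most indicative of class 1:", ranked[-1][0])
```

This ranking only applies to models that expose `coef_`; a random forest, for example, reports `feature_importances_` instead, which carry no sign.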

In [None]:
modTerms = bookFunctions.mostInformTerms(mod, topn=30)
dfTM = bookFunctions.docTermMatrix (feats, tfidf)
scotMeanDF, sampMeanDF = bookFunctions.buildMeanDF(dfTM, feats, 'systematic', categories) ## Update the label here.
scotWords, sampWords = bookFunctions.topGroupWord (scotMeanDF, sampMeanDF, modTerms, vocabulary)

In [None]:
### phil hist set
print(sorted(scotWords, key=lambda x:x[1], reverse=True))

In [None]:
### sample 2
print(sorted(sampWords, key=lambda x:x[1], reverse=True))