# PCA
- Brigitte Hogan (bwh5v@virginia.edu) & Jason Tiezzi (jbt5am@virginia.edu)  
- DS 5001: Exploratory Text Analytics  
- April 2020  
---


## Overview

This notebook

1. Creates a reduced `TFIDF` table by selecting only the top 5,000 most significant terms.

2. Performs PCA on the reduced `TFIDF` "by hand": creates a covariance matrix of features, applies eigen-decomposition, selects components, and so on. In the process, it generates the `COMPS`, `LOADINGS`, and `DCM` tables from the results (as in the in-class example).

3. Uses visualization libraries to inspect the first three components and answer the following questions:

    (1) What `LIB` feature (author or genre) does the first principal component (PC) separate?

    (2) Based on the first PC, what two novelists are most opposite to (distant from) each other?

    (3) Based on the second PC, what two novelists are most opposite to each other?

    (4) Based on the third PC, what two novelists are most opposite to each other?

    (5) Based on your knowledge of linguistic annotations, what implicit feature do you think accounts for the clear separation of novels in our data?

---
# Set Up

## Config

In [1]:
data_dir = 'Tables/'
OHCO = ['book_id', 'vol_num', 'chap_num', 'recp_num', 'para_num', 'sent_num', 'token_num'] # define OHCO
#OHCO = OHCO[:5]
RECIPES = ['book', 'chapter'] # alternate OHCO

## Import

In [2]:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from scipy.linalg import norm
import plotly_express as px
import seaborn as sns
from scipy.linalg import eigh

In [3]:
sns.set(style='ticks')
%matplotlib inline

## Functions

### tfidf()

In [4]:
def tfidf (token, ohco, bag='CHAPS', count_method='n', item_type='term_id', tf_method='sum', idf_method='standard'):
    ## Arguments -----------------------------------------------------------------------------------------
    # token (pandas dataframe): must have term_str and term_id or stem_porter
    # ohco (list): OHCO column names, ordered from the book level downward
    # bag (string): OHCO level to treat as a document, one of BOOKS, CHAPS, PARAS, SENTS
    # count_method (string): 'n' (default) for raw token counts or 'c' for binary (distinct-token) counts
    # item_type (string): column to count, e.g. 'term_id' (default) for terms or 'stem_porter' for stems
    # tf_method (string): tf method - sum (default), max, log, double_norm, raw, binary
    # idf_method (string): idf method - standard (default), max, or smooth
    
    ## Create OHCO Dictionary for Bag --------------------------------------------------------------------
    #OHCOdict = {
    #    "BOOKS": ['book_id'],
    #    "CHAPS": ['book_id', 'chap_num'],
    #    "PARAS": ['book_id', 'chap_num', 'para_num'],
    #    "SENTS": ['book_id', 'chap_num', 'para_num', 'sent_num']
    #    }
    OHCOdict = {
        "BOOKS": [ohco[0]],
        "CHAPS": [ohco[0], ohco[1]],
        "PARAS": [ohco[0], ohco[1], ohco[2]],
        "SENTS": [ohco[0], ohco[1], ohco[2], ohco[3]]
        }
    theBag = OHCOdict[bag]
    
    ## Create Bag-of-Words/Stems -------------------------------------------------------------------------
    BOW = token.groupby(theBag + [item_type])[item_type].count().to_frame().rename(columns={item_type:'n'})
    
    ## Add Binary Count Column ---------------------------------------------------------------------------
    BOW['c'] = BOW.n.astype('bool').astype('int')
    
    ## Create Document Term Frequency Matrix -------------------------------------------------------------
    #DTCM = BOW[count_method].unstack().fillna(0)
    DTCM = BOW[count_method].unstack(fill_value=0) # Raf's
    
    ## Compute TF ----------------------------------------------------------------------------------------
    if tf_method == 'sum':
        TF = DTCM.T / DTCM.T.sum()
    elif tf_method == 'max':
        TF = DTCM.T / DTCM.T.max()
    elif tf_method == 'log':
        TF = np.log10(1 + DTCM.T)
    elif tf_method == 'raw':
        TF = DTCM.T
    elif tf_method == 'double_norm':
        tf_norm_k = .5
        TF = DTCM.T / DTCM.T.max()
        TF = tf_norm_k + (1 - tf_norm_k) * TF[TF > 0]
    elif tf_method == 'binary':
        TF = DTCM.T.astype('bool').astype('int')
    
    ## Compute IDF ---------------------------------------------------------------------------------------
    N = DTCM.shape[0]
    DF = DTCM[DTCM > 0].count()   
    
    if idf_method == 'standard':
        IDF = np.log10(N / DF)
    elif idf_method == 'max':
        IDF = np.log10(DF.max() / DF) 
    elif idf_method == 'smooth':
        IDF = np.log10((1 + N) / (1 + DF)) + 1 

    ## Compute TFIDF -------------------------------------------------------------------------------------
    TFIDF = TF.T * IDF
    
    return TFIDF
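
As a rough illustration of what the default settings (`tf_method='sum'`, `idf_method='standard'`) compute, the sketch below applies the same arithmetic to a tiny hand-made count matrix. The names `toy_dtcm`, `toy_tf`, `toy_idf`, and `toy_tfidf` and the counts themselves are made up for this example and are not part of the notebook's data.

```python
import numpy as np
import pandas as pd

# Hypothetical document-term count matrix: rows are documents, columns are terms.
toy_dtcm = pd.DataFrame(
    [[2, 1, 0],
     [0, 1, 1],
     [4, 1, 0],
     [0, 1, 3]],
    index=['doc_a', 'doc_b', 'doc_c', 'doc_d'],
    columns=['flour', 'the', 'butter'],
)

# TF ('sum' method): each count divided by the total count of its document.
toy_tf = toy_dtcm.div(toy_dtcm.sum(axis=1), axis=0)

# IDF ('standard' method): log10(number of documents / document frequency).
toy_n_docs = toy_dtcm.shape[0]
toy_df = (toy_dtcm > 0).sum()
toy_idf = np.log10(toy_n_docs / toy_df)

# TFIDF: scale each term column by its IDF weight.
toy_tfidf = toy_tf * toy_idf
print(toy_tfidf.round(3))
```

A term that occurs in every document (here `the`) gets an IDF of zero, so its TFIDF column is zero everywhere, which is why such terms drop out of a TFIDF-based significance ranking.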
    

### get_tfidf()

In [5]:
def get_tfidf(TOKEN, bag=OHCO[:3], count_method='n', tf_method='sum', idf_method='standard', item_type='term_id'):
    # bag: list of OHCO levels that define a document; defaults to OHCO[:3] (book, volume, chapter)
    
    # Create bag of items (terms or stems)
    BOW = TOKEN.groupby(bag + [item_type])[item_type].count().to_frame().rename(columns={item_type:'n'})

    # Add binary count column
    BOW['c'] = BOW.n.astype('bool').astype('int')
    
    # Create document-term matrix
    DTCM = BOW[count_method].unstack(fill_value=0)#.astype('int')
    
    # Compute TF
    if tf_method == 'sum':
        TF = DTCM.T / DTCM.T.sum()
    elif tf_method == 'max':
        TF = DTCM.T / DTCM.T.max()
    elif tf_method == 'log':
        TF = np.log10(1 + DTCM.T)
    elif tf_method == 'raw':
        TF = DTCM.T
    elif tf_method == 'double_norm':
        tf_norm_k = .5  # smoothing constant, same value as in tfidf() above
        TF = DTCM.T / DTCM.T.max()
        TF = tf_norm_k + (1 - tf_norm_k) * TF[TF > 0]
    elif tf_method == 'binary':
        TF = DTCM.T.astype('bool').astype('int')  
    
    # Compute IDF
    N = DTCM.shape[0]
    DF = DTCM[DTCM > 0].count()
    if idf_method == 'standard':
        IDF = np.log10(N / DF)
    elif idf_method == 'max':
        IDF = np.log10(DF.max() / DF) 
    elif idf_method == 'smooth':
        IDF = np.log10((1 + N) / (1 + DF)) + 1
    
    # Compute TF-IDF
    TFIDF = TF.T * IDF
    return TFIDF

### vis_pcs()

In [4]:
def vis_pcs(M, a, b, label='author', prefix='PC'):
    fig = px.scatter(M, prefix + str(a), prefix + str(b), 
                     color=label, 
                     hover_name='doc', 
                     marginal_x='box',
                     marginal_y='box',
                     width=1000, height = 600)
    fig.show()

---
# Prepare the Data

## Import Tables

In [30]:
LIB   = pd.read_csv(data_dir + 'LIB.csv')                          # book_id, author_last, book_year, period
VOCAB = pd.read_csv(data_dir + 'VOCAB.csv').set_index('term_id')   # term_id, term_str, n, stem_porter, stem_snowball
#TOKEN = pd.read_csv(data_dir + 'TOKEN.csv')                        # OHCO, pos, token_str, term_str, (term_id)
TOKENS = pd.read_csv(data_dir + 'TOKEN2.csv')                      # OHCO, pos, token_str, term_str, term_id

In [6]:
TFIDF_book = pd.read_csv(data_dir + 'TFIDF_book.csv')                     # period, book_year, book_id
TFIDF_recp = pd.read_csv(data_dir + 'TFIDF_recp.csv')                     #
TFIDF_time = pd.read_csv(data_dir + 'TFIDF_timeperiod.csv')               # period

## Format Tables

In [31]:
LIB = LIB.set_index('book_id')
TOKENS = TOKENS.set_index(OHCO)

In [8]:
LIB.head()

| book_id | author_last | author_full | book_year | book_title | book_file | period |
|---|---|---|---|---|---|---|
| 9935 | Cookbooks\WIDAS | Woman's Institute of Domestic Arts and Sciences | 1923 | Woman's Institute Library of Cookery, Vol. 1 | Cookbooks\WIDAS1923_WILCV01_pg9935.txt | 1900s |
| 9936 | Cookbooks\WIDAS | Woman's Institute of Domestic Arts and Sciences | 1923 | Woman's Institute Library of Cookery, Vol. 2 | Cookbooks\WIDAS1923_WILCV02_pg9936.txt | 1900s |
| 9937 | Cookbooks\WIDAS | Woman's Institute of Domestic Arts and Sciences | 1923 | Woman's Institute Library of Cookery, Vol. 3 | Cookbooks\WIDAS1923_WILCV03_pg9937.txt | 1900s |
| 9938 | Cookbooks\WIDAS | Woman's Institute of Domestic Arts and Sciences | 1923 | Woman's Institute Library of Cookery, Vol. 4 | Cookbooks\WIDAS1923_WILCV04_pg9938.txt | 1900s |
| 9939 | Cookbooks\WIDAS | Woman's Institute of Domestic Arts and Sciences | 1923 | Woman's Institute Library of Cookery, Vol. 5 | Cookbooks\WIDAS1923_WILCV05_pg9939.txt | 1900s |


In [9]:
TOKENS.head()

| book_id | vol_num | chap_num | recp_num | para_num | sent_num | token_num | pos_tuple | pos | token_str | term_str | term_id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 9935 | 1 | 1 | 1.0 | 0 | 0 | 0 | ('1', 'CD') | CD | 1 | 1 | 14 |
| 9935 | 1 | 1 | 1.0 | 0 | 1 | 0 | ('Without', 'IN') | IN | Without | without | 16577 |
| 9935 | 1 | 1 | 1.0 | 0 | 1 | 1 | ('doubt', 'NN') | NN | doubt | doubt | 5252 |
| 9935 | 1 | 1 | 1.0 | 0 | 1 | 3 | ('the', 'DT') | DT | the | the | 15108 |
| 9935 | 1 | 1 | 1.0 | 0 | 1 | 4 | ('greatest', 'JJS') | JJS | greatest | greatest | 7253 |


In [10]:
VOCAB.head()

| term_id | term_str | n | num | has_int | stop | stem_porter | stem_snowball |
|---|---|---|---|---|---|---|---|
| 15108 | the | 60407 | 0 | 0 | 1 | the | the |
| 10502 | of | 35149 | 0 | 0 | 1 | of | of |
| 1546 | and | 33319 | 0 | 0 | 1 | and | and |
| 1062 | a | 28726 | 0 | 0 | 1 | a | a |
| 8071 | in | 22204 | 0 | 0 | 1 | in | in |


In [11]:
TFIDF_book.head()

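| | period | book_year | book_id | 0 | 000 | 001 | 002 | 01 | 02 | 020 | ... | œuvre | καλον | τεμνω | το | ἁ | ⅓ | ⅔ | ⅕ | ⅙ | ⅜ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1900s | 1909 | 19077 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1900s | 1918 | 15464 | 0.0 | 0.0 | 0.0 | 0.0 | 4.5e-05 | 0.000103 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1900s | 1918 | 32472 | 0.000281 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 1900s | 1923 | 9935 | 1e-05 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 1900s | 1923 | 9936 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |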


# Pre-Processing

In [32]:
VOCAB = VOCAB[~VOCAB.term_str.isna()]

### Add Max POS to VOCAB table

In [33]:
VOCAB['pos_max'] = TOKENS.groupby(['term_id', 'pos']).pos.count().unstack().idxmax(1)
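
The chained call above is dense, so here is a small sketch of the same pattern on a made-up token table: count each (term, POS) pair, spread the POS tags into columns, and keep the tag with the largest count per term. `toy_tokens`, `pos_counts`, and `toy_pos_max` are hypothetical names used only for this illustration.

```python
import pandas as pd

# Hypothetical token table: one row per token occurrence.
toy_tokens = pd.DataFrame({
    'term_id': [1, 1, 1, 2, 2, 3],
    'pos':     ['NN', 'NN', 'VB', 'DT', 'DT', 'JJ'],
})

# Count occurrences of each (term_id, pos) pair, then pivot POS tags into columns.
# Pairs that never occur become NaN, which idxmax skips by default.
pos_counts = toy_tokens.groupby(['term_id', 'pos'])['pos'].count().unstack()

# For each term, pick the POS column holding the largest count.
toy_pos_max = pos_counts.idxmax(axis=1)
print(toy_pos_max)
```

Term 1 is tagged `NN` twice and `VB` once, so its most frequent tag is `NN`.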

### Add Term Rank to VOCAB

In [34]:
if 'term_rank' not in VOCAB.columns:
    VOCAB = VOCAB.sort_values('n', ascending=False).reset_index()
    VOCAB.index.name = 'term_rank'
    VOCAB = VOCAB.reset_index()
    VOCAB = VOCAB.set_index('term_id')
    VOCAB['term_rank'] = VOCAB['term_rank'] + 1

### Add Alternate Term Rank to VOCAB

In [35]:
new_rank = VOCAB.n.value_counts()\
    .sort_index(ascending=False).reset_index().reset_index()\
    .rename(columns={'level_0':'term_rank2', 'index':'n', 'n':'nn'})\
    .set_index('n')

In [36]:
VOCAB['term_rank2'] = VOCAB.n.map(new_rank.term_rank2) + 1
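
The alternate rank gives tied counts the same rank, with no gaps between ranks. A minimal sketch of an equivalent computation, assuming that reading of the cells above, uses the pandas dense-rank method on a made-up count series (`toy_n` and `toy_rank2` are hypothetical names). This form may also be more robust, since the `value_counts().reset_index()` chain relies on column names that have differed between pandas versions.

```python
import pandas as pd

# Hypothetical term counts with ties.
toy_n = pd.Series([50, 20, 20, 7, 7, 7, 1], name='n')

# Dense rank, highest count first: ties share a rank and ranks are consecutive.
toy_rank2 = toy_n.rank(method='dense', ascending=False).astype(int)
print(toy_rank2.tolist())
```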

In [37]:
VOCAB['p'] = VOCAB.n / VOCAB.shape[0]

### Compute Zipf's K

In [38]:
VOCAB['zipf_k'] = VOCAB.n * VOCAB.term_rank
VOCAB['zipf_k2'] = VOCAB.n * VOCAB.term_rank2
VOCAB['zipf_k3'] = VOCAB.p * VOCAB.term_rank2
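
Zipf's law predicts that count times rank is roughly constant. The sketch below checks this on made-up counts that follow the law closely. Note one assumption that differs from the cell above: here `p` is computed as a share of all tokens (`n / n.sum()`), so it sums to 1, whereas the notebook divides by the number of vocabulary rows. `toy_vocab` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical counts that roughly follow n = 1000 / rank.
toy_vocab = pd.DataFrame(
    {'n': [1000, 500, 333, 250, 200]},
    index=['the', 'of', 'and', 'a', 'in'],
)
toy_vocab['term_rank'] = range(1, len(toy_vocab) + 1)

# Assumption: probability as a share of all tokens, so the column sums to 1.
toy_vocab['p'] = toy_vocab.n / toy_vocab.n.sum()

# Zipf's k: count times rank, expected to be roughly constant.
toy_vocab['zipf_k'] = toy_vocab.n * toy_vocab.term_rank
print(toy_vocab)
```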

In [39]:
VOCAB.head()

| term_id | term_rank | term_str | n | num | has_int | stop | stem_porter | stem_snowball | pos_max | term_rank2 | p | zipf_k | zipf_k2 | zipf_k3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15108 | 1 | the | 60407 | 0 | 0 | 1 | the | the | DT | 1 | 3.598654 | 60407 | 60407 | 3.598654 |
| 10502 | 2 | of | 35149 | 0 | 0 | 1 | of | of | IN | 2 | 2.093947 | 70298 | 70298 | 4.187895 |
| 1546 | 3 | and | 33319 | 0 | 0 | 1 | and | and | CC | 3 | 1.984928 | 99957 | 99957 | 5.954784 |
| 1062 | 4 | a | 28726 | 0 | 0 | 1 | a | a | DT | 4 | 1.711307 | 114904 | 114904 | 6.845228 |
| 8071 | 5 | in | 22204 | 0 | 0 | 1 | in | in | IN | 5 | 1.322769 | 111020 | 111020 | 6.613845 |


In [22]:
TFIDF_book.head()

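| | period | book_year | book_id | 0 | 000 | 001 | 002 | 01 | 02 | 020 | ... | œuvre | καλον | τεμνω | το | ἁ | ⅓ | ⅔ | ⅕ | ⅙ | ⅜ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1900s | 1909 | 19077 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1900s | 1918 | 15464 | 0.0 | 0.0 | 0.0 | 0.0 | 4.5e-05 | 0.000103 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1900s | 1918 | 32472 | 0.000281 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 1900s | 1923 | 9935 | 1e-05 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 1900s | 1923 | 9936 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |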


### Add TFIDF sums

In [41]:
VOCAB.shape

(16786, 14)

In [42]:
print(TFIDF_book.sum().shape)
print(TFIDF_recp.sum().shape)
print(TFIDF_time.sum().shape)

(16789,)
(16792,)
(16787,)


In [21]:
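# The TFIDF columns are term strings, so align on term_str rather than on the term_id index
VOCAB['tfidf_sum_book'] = VOCAB.term_str.map(TFIDF_book.sum(numeric_only=True))

In [None]:
VOCAB['tfidf_sum_recp'] = VOCAB.term_str.map(TFIDF_recp.sum(numeric_only=True))

In [None]:
VOCAB['tfidf_sum_perd'] = VOCAB.term_str.map(TFIDF_time.sum(numeric_only=True))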

# 1. Create reduced TFIDF Matrix

### Select Top 5,000 Significant Words

#### Using tfidf_sum for significance measure

In [18]:
VOCAB = VOCAB.sort_values('tfidf_sum_book', ascending=False)[0:5000].sort_index()  # column added above; 'tfidf_sum' does not exist

In [16]:
TOKENS = TOKENS[TOKENS.term_id.isin(VOCAB.index)]  # the token table was loaded as TOKENS above

In [17]:
TFIDF = TFIDF.loc[:, TFIDF.columns.isin(VOCAB.index)]

In [18]:
TFIDF.shape  # reduced TFIDF, just significant vocab

(320, 5000)

# Pre-process TFIDF Matrices

In [None]:
TFIDF.head()

## Normalize doc vector lengths

In [None]:
TFIDF = TFIDF.apply(lambda x: x / norm(x, 2), 1) # L2 normalization
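
To make the normalization step concrete, the following sketch divides each row of a small made-up matrix by its Euclidean length, which is what the `apply` call does row by row. `toy`, `row_lengths`, and `toy_unit` are hypothetical names, and the vectorized `div` form is an alternative shown for illustration rather than the notebook's method.

```python
import numpy as np
import pandas as pd

# Hypothetical document-term weights.
toy = pd.DataFrame(
    [[3.0, 4.0, 0.0],
     [1.0, 2.0, 2.0]],
    index=['doc_a', 'doc_b'],
    columns=['t1', 't2', 't3'],
)

# L2 (Euclidean) length of each document vector.
row_lengths = np.sqrt((toy ** 2).sum(axis=1))

# Divide every row by its own length so each document has length 1.
toy_unit = toy.div(row_lengths, axis=0)
print(toy_unit)
```

After this step, long and short documents are comparable because only the direction of each vector matters, not its magnitude.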

In [None]:
TFIDF.head()

# Compute Covariance Matrix

In [None]:
COV = TFIDF.cov()
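
As a sanity check on what `.cov()` returns, this sketch builds the term-by-term covariance matrix manually on random toy data: center each column, multiply the centered matrix by its transpose, and divide by the number of documents minus one. `toy_x`, `centered`, and `manual_cov` are hypothetical names, and the random data stands in for the real TFIDF table.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 6 documents by 3 terms.
rng = np.random.default_rng(0)
toy_x = pd.DataFrame(rng.random((6, 3)), columns=['t1', 't2', 't3'])

# Center each term column on its mean.
centered = toy_x - toy_x.mean()

# Covariance of features: (X_c^T X_c) / (n_docs - 1).
manual_cov = centered.T.dot(centered) / (len(toy_x) - 1)

# pandas uses the same n - 1 denominator by default.
print(np.allclose(manual_cov, toy_x.cov()))
```

The result is square with one row and one column per term, which is why the covariance matrix for 5,000 terms is large and slow to decompose.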

In [None]:
COV.iloc[:5,:10].style.background_gradient() # limit this so it doesn't crash your system

## Decompose the Matrix

In [None]:
%time eig_vals, eig_vecs = eigh(COV)
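
The sketch below shows what an `eigh` call returns for a tiny symmetric matrix: eigenvalues in ascending order, with the matching eigenvectors stored as columns. It uses `numpy.linalg.eigh` instead of the `scipy.linalg.eigh` imported above, purely to keep the example dependency-light; the 2x2 matrix and the names `toy_cov`, `vals_desc`, and `vecs_desc` are made up.

```python
import numpy as np

# Hypothetical symmetric "covariance" matrix.
toy_cov = np.array([[2.0, 1.0],
                    [1.0, 2.0]])

# eigh is meant for symmetric matrices; eigenvalues come back in ascending order.
vals, vecs = np.linalg.eigh(toy_cov)

# Reorder so the largest eigenvalue (the first principal component) comes first.
order = np.argsort(vals)[::-1]
vals_desc = vals[order]
vecs_desc = vecs[:, order]  # column i pairs with vals_desc[i]

# Defining property: COV @ v equals lambda * v for each pair.
print(np.allclose(toy_cov @ vecs_desc[:, 0], vals_desc[0] * vecs_desc[:, 0]))
```

Because the eigenvectors are columns, any table that pairs eigenvalues with eigenvectors row by row needs the eigenvector matrix transposed first.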

## Convert eigen data to dataframes

In [None]:
TERM_IDX = COV.index # for convenience

In [None]:
EIG_VEC = pd.DataFrame(eig_vecs, index=TERM_IDX, columns=TERM_IDX)

In [None]:
EIG_VAL = pd.DataFrame(eig_vals, index=TERM_IDX, columns=['eig_val'])
EIG_VAL.index.name = 'term_id'

In [None]:
EIG_VEC.iloc[:5, :10].style.background_gradient()

In [None]:
EIG_VAL.iloc[:5] # eigenvalues rank the components; eigh returns them in ascending order

# Select Principal Components

Associate each eigenvalue with its corresponding *column* in the eigenvector matrix by transposing the `EIG_VEC` dataframe.

## Combine eigenvalues and eigenvectors

In [None]:
EIG_PAIRS = EIG_VAL.join(EIG_VEC.T) # join into table

In [None]:
EIG_PAIRS.head()                    # term_ids ~ components

## Compute and Show Explained Variance

We could also have used this value to sort our components.

In [None]:
EIG_PAIRS['exp_var'] = np.round((EIG_PAIRS.eig_val / EIG_PAIRS.eig_val.sum()) * 100, 2)

In [None]:
EIG_PAIRS.exp_var.sort_values(ascending=False).head().plot.bar(rot=45);

## Pick Top 3 Components

We pick these based on explained variance.

In [None]:
COMPS = EIG_PAIRS.sort_values('exp_var', ascending=False).head(3).reset_index(drop=True)
COMPS.index = ["PC{}".format(i) for i in COMPS.index.tolist()]
COMPS.index.name = 'comp_id' # set the name after relabeling so it is not discarded

In [None]:
COMPS # each term associated with component and weight
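
The selection step can be traced on a handful of made-up eigenvalues: convert each to a percentage of the total, sort descending, keep the top three, and relabel them `PC0` to `PC2`. `toy_pairs` and `toy_comps` are hypothetical stand-ins for `EIG_PAIRS` and `COMPS`, and the index values are arbitrary labels.

```python
import numpy as np
import pandas as pd

# Hypothetical eigenvalues, indexed by arbitrary labels.
toy_pairs = pd.DataFrame({'eig_val': [0.5, 4.0, 1.5, 2.0]},
                         index=[101, 102, 103, 104])

# Explained variance: each eigenvalue as a percentage of the total.
toy_pairs['exp_var'] = np.round(
    toy_pairs.eig_val / toy_pairs.eig_val.sum() * 100, 2)

# Keep the three components that explain the most variance.
toy_comps = (toy_pairs.sort_values('exp_var', ascending=False)
             .head(3)
             .reset_index(drop=True))
toy_comps.index = ['PC{}'.format(i) for i in toy_comps.index]
print(toy_comps)
```

Sorting by explained variance and sorting by eigenvalue give the same order, since one is a fixed rescaling of the other.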

# Inspect terms associated with eigenvectors

In [None]:
VOCAB.loc[[int(x) for x in EIG_PAIRS.sort_values('exp_var', ascending=False).head(10).index], 'term_str'] # highest explained variance first

## Show Loadings

In [None]:
LOADINGS = COMPS[TERM_IDX].T
LOADINGS.index.name = 'term_id'

In [None]:
LOADINGS.head(20).style.background_gradient()

In [None]:
LOADINGS['term_str'] = LOADINGS.apply(lambda x: VOCAB.loc[int(x.name)].term_str, 1)

In [None]:
# Ten terms with the largest positive (pos) and largest negative (neg) loadings on each of the first three components
l0_pos = LOADINGS.sort_values('PC0', ascending=False).head(10).term_str.str.cat(sep=' ')
l0_neg = LOADINGS.sort_values('PC0', ascending=True).head(10).term_str.str.cat(sep=' ')
l1_pos = LOADINGS.sort_values('PC1', ascending=False).head(10).term_str.str.cat(sep=' ')
l1_neg = LOADINGS.sort_values('PC1', ascending=True).head(10).term_str.str.cat(sep=' ')
l2_pos = LOADINGS.sort_values('PC2', ascending=False).head(10).term_str.str.cat(sep=' ')
l2_neg = LOADINGS.sort_values('PC2', ascending=True).head(10).term_str.str.cat(sep=' ')

In [None]:
print('Books PC0+', l0_pos)
print('Books PC0-', l0_neg)
print('Books PC1+', l1_pos)
print('Books PC1-', l1_neg)
print('Books PC2+', l2_pos)
print('Books PC2-', l2_neg)

# Project Docs onto New Subspace

Get Document-Component Matrix (DCM)

In [None]:
DCM = TFIDF.dot(COMPS[TERM_IDX].T)
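
The projection is a single matrix product: documents by terms times terms by components. The sketch below runs the whole pipeline on random toy data and checks that the variance of each projected column equals its eigenvalue. One assumption to flag: the toy data is mean-centered before projecting, as in textbook PCA, while the cell above projects the uncentered `TFIDF`; that shifts every score on a component by the same constant and leaves distances between documents unchanged. All names here are hypothetical.

```python
import numpy as np

# Hypothetical data: 8 documents by 4 terms.
rng = np.random.default_rng(1)
toy_x = rng.random((8, 4))

# Assumption: center the columns before projecting (the notebook does not).
toy_centered = toy_x - toy_x.mean(axis=0)

# Covariance of the term features, then its eigen-decomposition.
cov = np.cov(toy_centered, rowvar=False)
vals, vecs = np.linalg.eigh(cov)

# Keep the two eigenvectors with the largest eigenvalues.
order = np.argsort(vals)[::-1][:2]
components = vecs[:, order]            # shape: (4 terms, 2 components)

# Document-component matrix: one row per document, one column per component.
scores = toy_centered @ components     # shape: (8 docs, 2 components)

# Each score column's variance equals the corresponding eigenvalue.
print(np.allclose(scores.var(axis=0, ddof=1), vals[order]))
```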

In [None]:
DCM # each doc/chapter has distribution of components

#### Add Labels for Display

In [None]:
LIB = LIB.reset_index()
LIB["title"] = LIB.book_id
LIB = LIB.set_index('book_id')
LIB.head()

In [None]:
DCM = DCM.join(LIB[['author','genre_full','title']], on='book_id')

In [None]:
DCM['doc'] = DCM.apply(lambda x: "{}-{}-{}".format(x.author, x.title, x.name[1]), 1)

In [None]:
DCM.head()

In [None]:
DCM.head(10).style.background_gradient() # Note: Components become features for VOCAB and DOC tables

# Visualize

## PC 0 and 1

In [None]:
vis_pcs(DCM, 0, 1) # by author

vis_pcs(DCM, 0, 1, label='genre_full')

In [None]:
#vis_pcs(DCM, 0, 1, label='title')

## PC 1 and 2

In [None]:
vis_pcs(DCM, 1, 2) # by author

In [None]:
vis_pcs(DCM, 1, 2, label='genre_full')

In [None]:
#vis_pcs(DCM, 1, 2, label='title')

## PC 0 and 2

In [None]:
vis_pcs(DCM, 0, 2) # author

In [None]:
vis_pcs(DCM, 0, 2, label='genre_full')

In [None]:
#vis_pcs(DCM, 0, 2, label='title')

---
## Results

**1. What `LIB` feature (author or genre) does the first principal component (PC) separate?**  
The first principal component (PC0) separates primarily on genre. The second principal component (PC1) does a better job of separating author. 

**2. Based on the first PC (PC0), what two novelists are most opposite to (distant from) each other?**  
Radcliffe & Christie

**3. Based on the second PC (PC1), what two novelists are most opposite to each other?**  
Austen & Christie

**4. Based on the third PC (PC2), what two novelists are most opposite to each other?**  
Collins & Austen

**5. Based on your knowledge of linguistic annotations, what implicit feature do you think accounts for the clear separation of novels in our data?**  
By looking at the loadings, it appears the novels are being separated by proper nouns, most of which are the names of the principal characters.