# Final Project Notebook

DS 5001 Exploratory Text Analytics | Spring 2024

# Metadata

- Full Name: Ashley Huang
- Userid: ajh5ae
- GitHub Repo URL: https://github.com/ashleyjhuang/DS5001_Final_Project/tree/main
- UVA Box URL: https://virginia.app.box.com/folder/261887899135

# Overview

The goal of the final project is for you to create a **digital analytical edition** of a corpus using the tools, practices, and perspectives you’ve learned in this course. You will select a corpus that has already been digitized and transcribed, parse that into an F-compliant set of tables, and then generate and visualize the results of a series of fitted models. You will also draw some tentative conclusions regarding the linguistic, cultural, psychological, or historical features represented by your corpus. The point of the exercise is to have you work with a corpus through the entire pipeline from ingestion to interpretation.

Specifically, you will acquire a collection of long-form texts and perform the following operations:

- **Convert** the collection from their source formats (F0) into a set of tables that conform to the Standard Text Analytic Data Model (F2).
- **Annotate** these tables with statistical and linguistic features using NLP libraries such as NLTK (F3).
- **Produce** a vector representation of the corpus to generate TFIDF values to add to the TOKEN (aka CORPUS) and VOCAB tables (F4).
- **Model** the annotated and vectorized model with tables and features derived from the application of unsupervised methods, including PCA, LDA, and word2vec (F5).
- **Explore** your results using statistical and visual methods.
- **Present** conclusions about patterns observed in the corpus by means of these operations.

When you are finished, you will make the results of your work available in GitHub (for code) and UVA Box (for data). You will submit to Gradescope (via Canvas) a PDF version of a Jupyter notebook that contains the information listed below.

# Some Details

- Please fill out your answers in each task below by editing the markdown cell. 
- Replace text that asks you to insert something with the thing, i.e. replace `(INSERT IMAGE HERE)` with an image element, e.g. `![](image.png)`.
- For URLs, just paste the raw URL directly into the text area. Don't worry about providing link labels using `[label](link)`.
- Please do not alter the structure of the document or cell, i.e. the bulleted lists. 
- You may add explanatory paragraphs below the bulleted lists.
- Please name your tables as they are named in each task below.
- Tasks are indicated by headers with point values in parentheses.

# Raw Data

## Source Description (1)

The source materials come from Project Gutenberg. They are long-form books written by Lewis Carroll and Fyodor Dostoevsky throughout their lifetimes.

## Source Features (1)

Add values for the following items. (Do this for all following bulleted lists.)

- Source URL: 
1. https://www.gutenberg.org/cache/epub/11/pg11.txt (Alice in Wonderland)
2. https://www.gutenberg.org/files/29888/29888-h/29888-h.htm (Hunting of the Snark)
3. https://www.gutenberg.org/files/48795/48795-h/48795-h.htm (Sylvie and Bruno pt 2)
4. https://www.gutenberg.org/files/48630/48630-h/48630-h.htm (Sylvie and Bruno)
5. https://www.gutenberg.org/files/29042/29042-h/29042-h.htm (A Tangled Tale)
6. https://www.gutenberg.org/files/55040/55040-h/55040-h.htm (The Nursery)
7. https://www.gutenberg.org/cache/epub/12/pg12.txt (Through the Looking Glass)
8. https://www.gutenberg.org/cache/epub/2554/pg2554.txt (Crime and Punishment)
9. https://www.gutenberg.org/cache/epub/28054/pg28054.txt (The Brothers Karamazov)
10. https://www.gutenberg.org/cache/epub/8117/pg8117.txt (The Devils)

- UVA Box URL: https://virginia.app.box.com/folder/261894445533
- Number of raw documents: 10 documents
- Total size of raw documents (e.g. in MB): (170 + 53.6 + 456 + 328 + 173 + 67.2 + 191 + 1140 + 1940 + 1410) = 5928.8 KB, or about 5.9 MB
- File format(s), e.g. XML, plaintext, etc.: TXT files

## Source Document Structure (1)

This is a corpus of different books written by Lewis Carroll and Fyodor Dostoevsky. I chose these two authors since they are some of the most famous authors of the 19th century, and their works overlap chronologically. The books in the corpus have a list of chapter names, followed by a set of content chapters (some with chapter titles, some without), and an ending. Additionally, some of the books written by Lewis Carroll have a preface section before the chapter list.

# Parsed and Annotated Data

Parse the raw data into the three core tables of your edition: the `LIB`, `CORPUS`, and `VOCAB` tables.

These tables will be stored as CSV files with header rows.

You may consider using `|` as a delimiter.

Provide the following information for each.

## LIB (2)

The source documents the corpus comprises. These may be books, plays, newspaper articles, abstracts, blog posts, etc. 

Note that these are *not* documents in the sense used to describe a bag-of-words representation of a text, e.g. chapter.

- UVA Box URL: https://virginia.app.box.com/file/1520175318622
- GitHub URL for notebook used to create: https://github.com/ashleyjhuang/DS5001_Final_Project
- Delimitter: None
- Number of observations: 10 observations
- List of features, including at least three that may be used for model summarization (e.g. date, author, etc.): 

[book_id, source_file_path, popularity, era, author_id, title, chap_regex, book_len, n_chaps]

- Average length of each document in characters: 

[Alice in Wonderland (26559), Hunting of the Snark (5035), Sylvie and Bruno pt 2 (71210), Sylvie and Bruno (65692), A Tangled Tale (27864), The Nursery (8342), Through The Looking Glass (29462), Crime and Punishment (204192), The Brothers Karamazov (349447), The Devils (255110)]

## CORPUS (2)

The sequence of word tokens in the corpus, indexed by their location in the corpus and document structures.

- UVA Box URL: https://virginia.app.box.com/file/1520175066765
- GitHub URL for notebook used to create: https://github.com/ashleyjhuang/DS5001_Final_Project
- Delimitter: None
- Number of observations Between (should be >= 500,000 and <= 2,000,000 observations.): 1042819
- OHCO Structure (as delimitted column names): ['book_id', 'chap_num', 'para_num', 'sent_num', 'token_num']
- Columns (as delimitted column names, including `token_str`, `term_str`, `pos`, and `pos_group`): 
['book_id', 'chap_num', 'para_num', 'sent_num', 'token_num', 'pos_tuple', 'pos', 'token_str', 'term_str', 'pos_group']

## VOCAB (2)

The unique word types (terms) in the corpus.

- UVA Box URL: https://virginia.app.box.com/file/1520169789259
- GitHub URL for notebook used to create: https://github.com/ashleyjhuang/DS5001_Final_Project
- Delimitter: None
- Number of observations: 24344
- Columns (as delimitted names, including `n`, `p`, `i`, `dfidf`, `porter_stem`, `max_pos` and `max_pos_group`, `stop`): 

['n', 'n_chars', 'p', 'i', 'max_pos', 'max_pos_group', 'n_pos_group','cat_pos_group', 'n_pos', 'cat_pos', 'stop', 'stem_porter',
'stem_snowball', 'stem_lancaster', 'dfidf', 'mean_tfidf', 'df']

- Note: Your VOCAB may contain ngrams. If so, add a feature for `ngram_length`.
- List the top 20 significant words in the corpus by DFIDF.

['rushed', 'laugh', 'laughing', 'roubles', 'object', 'youd', 'surprised', 'ground', 'fell', 'friends', 'haste', 'says', 'nonsense', 'soul', 'espicially', 'wonder', 'position', 'person', 'carried', 'hair'] 
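
As a rough illustration of how such a ranking can be produced, here is a minimal sketch that assumes DFIDF is defined as document frequency times inverse document frequency (log base 2). The terms, counts, and variable names below are invented for illustration and are not taken from the project's VOCAB table.

```python
import numpy as np

# Hypothetical terms and document frequencies (invented values).
terms = ["the", "roubles", "laugh", "snark"]
DF = np.array([100, 40, 55, 2])   # number of documents containing each term
N_docs = 100

# Assumed definition: DFIDF = DF * log2(N_docs / DF).
DFIDF = DF * np.log2(N_docs / DF)

# Terms ranked from highest to lowest DFIDF.
ranked = [terms[i] for i in np.argsort(DFIDF)[::-1]]
print(ranked)
```

Under this definition a term that appears in every document scores zero and a very rare term scores low, so the top of the list tends to be made up of mid-frequency words.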

# Derived Tables

## BOW (3)

A bag-of-words representation of the CORPUS.

- UVA Box URL: https://virginia.app.box.com/file/1520175610892
- GitHub URL for notebook used to create: https://github.com/ashleyjhuang/DS5001_Final_Project
- Delimitter: None
- Bag (expressed in terms of OHCO levels): bag = ['book_id', 'chap_id']
- Number of observations: 250288
- Columns (as delimitted names, including `n`, `tfidf`): ['book_id', 'chap_id', 'term_str', 'n']

## DTM (3)

A representation of the BOW as a sparse count matrix.

- UVA Box URL: https://virginia.app.box.com/file/1520171371081
- UVA Box URL of BOW used to generate (if applicable): https://virginia.app.box.com/file/1520175610892
- GitHub URL for notebook used to create: https://github.com/ashleyjhuang/DS5001_Final_Project
- Delimitter: None
- Bag (expressed in terms of OHCO levels): bag = ['book_id', 'chap_id']

## TFIDF (3)

A Document-Term matrix with TFIDF values.

- UVA Box URL: https://virginia.app.box.com/file/1520180248225
- UVA Box URL of DTM or BOW used to create: https://virginia.app.box.com/file/1520175610892
- GitHub URL for notebook used to create: https://github.com/ashleyjhuang/DS5001_Final_Project
- Delimitter: None
- Description of TFIDF formula ($\LaTeX$ OK): `TF = (DTCM.T / DTCM.T.sum()).T`, `IDF = np.log2(N_docs / DF)`, with TFIDF taken as the product `TF * IDF`
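
The formula above can be sketched with NumPy on a toy count matrix. The names `DTCM`, `DF`, and `N_docs` mirror the description, but the counts are invented, and the final product `TF * IDF` is the usual convention assumed here rather than something copied from the notebook.

```python
import numpy as np

# Toy document-term count matrix (rows = bags, columns = terms); invented counts.
DTCM = np.array([
    [2, 0, 1],
    [0, 3, 1],
    [1, 1, 1],
    [4, 0, 1],
], dtype=float)

# TF: each count divided by the total count of its document (row).
# With a NumPy array the axis must be given explicitly; a pandas DataFrame
# sums over columns by default, which is why the original has no axis argument.
TF = (DTCM.T / DTCM.T.sum(axis=0)).T

# DF: number of documents in which each term appears.
DF = (DTCM > 0).sum(axis=0)
N_docs = DTCM.shape[0]

# IDF: log base 2 of the inverse document frequency.
IDF = np.log2(N_docs / DF)

# TFIDF: term frequency weighted by IDF (broadcast across rows).
TFIDF = TF * IDF
```

Note that the third toy term appears in every document, so its IDF and therefore its TFIDF are zero everywhere.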

## Reduced and Normalized TFIDF_L2 (3)

A Document-Term matrix with L2 normalized TFIDF values.

- UVA Box URL: https://virginia.app.box.com/file/1520178479327
- UVA Box URL of source TFIDF table: https://virginia.app.box.com/file/1520180248225
- GitHub URL for notebook used to create: https://github.com/ashleyjhuang/DS5001_Final_Project
- Delimitter: None
- Number of features (i.e. significant words): 1000
- Principle of significant word selection: Euclidean (L2) normalization of each row, `M.apply(lambda x: x / norm(x), 1)`
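
A minimal sketch of the row-wise Euclidean normalization, written with plain NumPy instead of `DataFrame.apply`. The matrix `M` here is a toy stand-in with invented values.

```python
import numpy as np

# Toy TFIDF matrix (rows = documents, columns = terms); invented values.
M = np.array([
    [3.0, 4.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 2.0],
])

# Euclidean (L2) length of each row; keepdims keeps shape (n_docs, 1)
# so the division below broadcasts row by row.
row_norms = np.linalg.norm(M, axis=1, keepdims=True)

# Divide each row by its own length so every document vector has unit length.
M_L2 = M / row_norms
```

After this step, documents of very different lengths become comparable, because only the direction of each vector remains.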

# Models

## PCA Components (4)

- UVA Box URL: https://virginia.app.box.com/file/1520171077409
- UVA Box URL of the source TFIDF_L2 table: https://virginia.app.box.com/file/1520180248225
- GitHub URL for notebook used to create: https://github.com/ashleyjhuang/DS5001_Final_Project
- Delimitter: None
- Number of components: 9
- Library used to generate: TFIDF (and TFIDF[VIDX])
- Top 5 positive terms for first component: bruno, sylvie, lady, replied, song	
- Top 5 negative terms for second component: alyosha, ivan, mitya, pyotr, dmitri	
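
For readers who want to see how components, loadings, and the document-component matrix relate, here is a minimal sketch of PCA by eigendecomposition of the term covariance matrix. It is an assumption that the notebook works this way; the random matrix `X` and the names `LOADINGS` and `DCM` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a TFIDF_L2 matrix: 6 documents x 4 terms (random values).
X = rng.random((6, 4))

# Center each term (column) on its mean.
X_centered = X - X.mean(axis=0)

# Term-by-term covariance matrix.
COV = np.cov(X_centered, rowvar=False)

# Eigendecomposition; eigh is suitable because COV is symmetric.
eig_vals, eig_vecs = np.linalg.eigh(COV)

# eigh returns eigenvalues in ascending order, so reorder to descending.
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

k = 2                          # components kept in this toy example
LOADINGS = eig_vecs[:, :k]     # term x component matrix
DCM = X_centered @ LOADINGS    # document x component matrix

# Share of total variance explained by each kept component.
explained = eig_vals[:k] / eig_vals.sum()
```

The top positive or negative terms of a component are then the terms with the largest or smallest values in the matching column of `LOADINGS`.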

## PCA DCM (4)

The document-component matrix generated.

- UVA Box URL: https://virginia.app.box.com/file/1520171922327
- GitHub URL for notebook used to create: https://github.com/ashleyjhuang/DS5001_Final_Project
- Delimitter: None

## PCA Loadings (4)

The component-term matrix generated.

- UVA Box URL: https://virginia.app.box.com/file/1520175151046
- GitHub URL for notebook used to create: https://github.com/ashleyjhuang/DS5001_Final_Project
- Delimitter: None

## PCA Visualization 1 (4)

Include a scatterplot of documents in the space created by the first two components.

Color the points based on a metadata feature associated with the documents.

Also include a scatterplot of the loadings for the same two components. (This does not need a feature mapped onto color.)

Scatterplot of documents created by the first two components:

![](pca_dcm_v1.png)

Scatterplot of loadings created by the first two components:

![](pca_loadings_v1.png)

Briefly describe the nature of the polarity you see in the first component:

This visualization shows the documents in the space of the first two components. The late group of the era category has the most diverse (least similar) features compared to the other groups, which suggests that there are more differences among the books published late in both authors' writing careers. The other era categories cluster together in the scatterplot, which indicates similarity among them. Finally, the two long "arms" in the plot are made up of three specific books: Sylvie and Bruno and Sylvie and Bruno pt 2 (right arm) and Crime and Punishment (left arm). The nearly orthogonal positioning of the two arms suggests that these documents vary along largely independent directions rather than resembling each other. Unlike the tight point clusters that indicate similarity, this orthogonal structure could arise from the different words used in each set of documents.

## PCA Visualization 2 (4)

Include a scatterplot of documents in the space created by the second two components.

Color the points based on a metadata feature associated with the documents.

Also include a scatterplot of the loadings for the same two components. (This does not need a feature mapped onto color.)

Scatterplot of documents in the space created by the second two components:

![](pca_dcm_v2.png)

Scatterplot of loadings for the second two components:

![](pca_loadings_v2.png)

Briefly describe the nature of the polarity you see in the second component:

In the second PCA document scatterplot, we see trends similar to those in the first document scatterplot. The late era has greater variance than the other eras across the x-axis (PCA component 2), which suggests higher intra-group variance (dissimilarity) than in the other eras. However, across PCA component 2, the middle era also displays higher variance than it did in the previous visualization. There seems to be some clustering (similarity) among the early era documents, while the middle era documents are more dispersed.

In the second PCA loadings scatterplot, the words that contribute positively (positive loading values) and negatively (negative loading values) to PCA component 2 appear roughly balanced, since most of the values are centered around zero. However, there are some outliers: mitya has a relatively high negative contribution to PC2 and alyosha has a relatively high positive contribution to PC2. The same holds for PC3, to a lesser extent: we see only one stark outlier, the term pyotr, which very negatively influences PC3.

## LDA TOPIC (4)

- UVA Box URL: https://virginia.app.box.com/file/1520172138157
- UVA Box URL of count matrix used to create: Not Applicable
- GitHub URL for notebook used to create:
- Delimitter: None
- Library used to compute: `LatentDirichletAllocation` (scikit-learn)
- A description of any filtering, e.g. POS (Nouns and Verbs only): No filtering
- Number of components: 20
- Any other parameters used: None
- Top 5 words and best-guess labels for top five topics by mean document weight:
  - T00: time man way room face
  - T01: man time money yes moment
  - T02: time man course people money
  - T03: gentlemen prosecutor man door time
  - T04: man people men elder life

## LDA THETA (4)

- UVA Box URL: https://virginia.app.box.com/file/1520176213597
- GitHub URL for notebook used to create:
- Delimitter: None

## LDA PHI (4)

- UVA Box URL: https://virginia.app.box.com/file/1520170144260
- GitHub URL for notebook used to create:
- Delimitter: None

## LDA + PCA Visualization (4)

Apply PCA to the PHI table and plot the topics in the space opened by the first two components.

Size the points based on the mean document weight of each topic (using the THETA table).

Provide a brief interpretation of what you see.

![](LDA_PCA_plot.png)

Looking at the scatterplot, we don't see immediate clusters of data points or orthogonal structures, which suggests that the word probability distributions of the topics are not similar to one another in the space of the first two principal components. Across the first principal component (PC0), more topics have negative values, which means that those topics are negatively correlated with that component. Across the second principal component (PC1), however, the majority of data points have positive values, which means that substantially more topics are positively correlated with that component. This contrast between the two principal components is interesting and might be attributed to differences between the books in the corpus.

In the PCA loadings scatterplot, most of the points are centered around zero for the first principal component (PC0). However, the data points leave a negative tail, which means that a number of words are negatively correlated with the first principal component. Examples of these words are Ivan and Mitya. There are also some positive outliers along the first component: words like Bruno and Sylvie contribute positively to (are positively correlated with) it. These general outlying trends are also seen across the second principal component (PC1).

## Sentiment VOCAB_SENT (4)

Sentiment values associated with a subset of the VOCAB from a curated sentiment lexicon.

- UVA Box URL: https://virginia.app.box.com/file/1520171046474
- UVA Box URL for source lexicon: https://virginia.app.box.com/file/1520193278263 
- GitHub URL for notebook used to create: https://github.com/ashleyjhuang/DS5001_Final_Project
- Delimitter: None

## Sentiment BOW_SENT (4)

Sentiment values from VOCAB_SENT mapped onto BOW.

- UVA Box URL: https://virginia.app.box.com/file/1520180439984
- GitHub URL for notebook used to create: https://github.com/ashleyjhuang/DS5001_Final_Project
- Delimitter: None

## Sentiment DOC_SENT (4)

Sentiment per bag computed from BOW_SENT.

- UVA Box URL: https://virginia.app.box.com/file/1520169645317
- GitHub URL for notebook used to create: https://github.com/ashleyjhuang/DS5001_Final_Project
- Delimitter: None
- Document bag expressed in terms of OHCO levels: ['book_id', 'chap_id']
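
The chain from VOCAB_SENT to BOW_SENT to DOC_SENT can be sketched in plain Python. The lexicon entries, the toy BOW rows, and the choice of a TFIDF-weighted mean polarity are all assumptions made for illustration; the notebook's actual lexicon and weighting may differ.

```python
from collections import defaultdict

# Hypothetical lexicon: term -> polarity (+1 positive, -1 negative).
VOCAB_SENT = {"happy": 1, "wonder": 1, "fear": -1, "murder": -1}

# Toy BOW rows: (book_id, chap_id, term_str, tfidf). Values are invented.
BOW = [
    (1, 1, "happy", 0.4),
    (1, 1, "fear", 0.1),
    (1, 1, "rabbit", 0.9),   # not in the lexicon, so it is dropped below
    (8, 1, "murder", 0.6),
    (8, 1, "wonder", 0.2),
]

# BOW_SENT: keep only rows whose term has a lexicon entry and attach polarity.
BOW_SENT = [
    (book, chap, term, weight, VOCAB_SENT[term])
    for book, chap, term, weight in BOW
    if term in VOCAB_SENT
]

# DOC_SENT: TFIDF-weighted mean polarity per (book_id, chap_id) bag.
weighted_sum = defaultdict(float)
weight_total = defaultdict(float)
for book, chap, term, weight, polarity in BOW_SENT:
    weighted_sum[(book, chap)] += weight * polarity
    weight_total[(book, chap)] += weight

DOC_SENT = {bag: weighted_sum[bag] / weight_total[bag] for bag in weighted_sum}
print(DOC_SENT)
```

Terms that are missing from the lexicon contribute nothing to the score, so the coverage of the lexicon directly shapes the result for each bag.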

## Sentiment Plot (4)

Plot sentiment over some metric space, such as time.

If you don't have a metric metadata feature, plot sentiment over a feature of your choice.

You may use a bar chart or a line graph.

![](Sentiment_plot.png)

## VOCAB_W2V (4)

A table of word2vec features associated with terms in the VOCAB table.

- UVA Box URL: https://virginia.app.box.com/file/1520176271146
- GitHub URL for notebook used to create: https://github.com/ashleyjhuang/DS5001_Final_Project
- Delimitter: None
- Document bag expressed in terms of OHCO levels: ['book_id', 'chap_id', 'para_num']
- Number of features generated: 256
- The library used to generate the embeddings: CORPUS

## Word2vec tSNE Plot (4)

Plot word embedding features in two dimensions using t-SNE.

Describe a cluster in the plot that captures your attention.

![](TNSE_plot.png)

![](TNSE_plot2.png)

A cluster in the t-SNE plot that really captures my attention is the character-cluster, as I like to call it. This cluster of points, located at the top of the image, comprises various names of characters that appear throughout the corpus. It caught my eye because of the massive number of data points it contains (i.e. the cluster looks very concentrated and dark) and its observable separation from other points. These characteristics made me curious about what the cluster may mean and what relationships exist within it. One thing I noticed is that it comprises mostly Russian names (e.g. pavlovitch), which likely reflects the many characters mentioned in Dostoevsky's novels in the corpus.

# Riffs

Provide at least three visualizations that combine the preceding model data in interesting ways.

These should provide insight into how features in the LIB table are related. 

The nature of this relationship is left open to you -- it may be correlation, or mutual information, or something less well defined. 

In doing so, consider the following visualization types:

- Hierarchical cluster diagrams
- Heatmaps
- Scatter plots
- KDE plots
- Dispersion plots
- t-SNE plots
- etc.

## Riff 1 (5)

![](Riff_1a.png)

![](Riff_1b.png)

![](Riff_1c.png)

![](Riff_1d.png)

![](Riff_1e.png)

The first riff shows the hierarchical models, depicting the similarity of books in the corpus. Looking at these hierarchical models, we see that the different linkage methods group the books in the corpus differently. All but one of the linkage methods grouped the documents by author (i.e. the books within a cluster were all from one author). This might imply distinctive writing styles between the two authors. The only method that didn't group documents by author was the jaccard-weighted model. The jaccard-weighted method also seems to be the least sensitive, since it picked up only one cluster where the other models picked up three different clusters. The other models clustered generally how we would expect them to. For instance, all of them grouped Sylvie and Bruno and Sylvie and Bruno pt 2 together. This makes sense since they form a duology and would therefore exhibit high levels of similarity.

## Riff 2 (5)

![](Riff_2.png)

The second riff shows a heat-map of the correlation between the different books in the corpus. This is computed from the correlation matrix of the transposed TFIDF table (where the bag is set to book_id). We specifically looked at the Kendall correlation since it is more lenient about variable distributions. Looking at this heat-map, we can see aspects that are shared with the hierarchical models. For instance, two of the most highly correlated books are book_id 1 (Alice in Wonderland) and book_id 7 (Through the Looking Glass). This is also seen in the hierarchical models, where all the methods except jaccard-weighted grouped them together too. In the heat-map, book_id 8 (Crime and Punishment) and book_id 10 (The Devils) also had higher correlation values. This is also expected, since Crime and Punishment and The Devils are unanimously grouped together across all hierarchical model linkage methods.
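
To make the idea of a rank-based correlation between books concrete, here is a small pure-Python sketch of Kendall's tau-a (the simplest variant, with no correction for ties). The per-book TFIDF vectors are invented, and the helper `kendall_tau_a` is a hypothetical illustration, not the function used in the notebook; pandas offers a Kendall option through `DataFrame.corr(method="kendall")`.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Toy TFIDF vectors per book over the same four terms (invented values).
TFIDF_BY_BOOK = {
    "book_1": [0.9, 0.5, 0.1, 0.0],
    "book_7": [0.8, 0.6, 0.2, 0.1],
    "book_8": [0.0, 0.2, 0.7, 0.9],
}

# Book-by-book correlation table, stored as a dict keyed by book pairs.
books = list(TFIDF_BY_BOOK)
CORR = {
    (a, b): kendall_tau_a(TFIDF_BY_BOOK[a], TFIDF_BY_BOOK[b])
    for a in books
    for b in books
}
```

Because only the ordering of term weights matters, two books that rank their terms the same way are perfectly correlated even when the raw TFIDF values differ.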

## Riff 3 (5)

![](Riff_3a.png)

The third riff shows a set of heat-maps of the correlation between topic models for Crime and Punishment. I wanted to look at the models for one book, rather than the whole corpus, for more focused and interpretable topics. I specifically wanted to look at Crime and Punishment since it is one of the lengthiest books in the corpus. This first heat-map shows the correlation between one topic model and itself for Crime and Punishment. There's not much similarity among the topics of the model (not a lot of red squares seen). The diagonal has the highest similarity, which makes sense since each diagonal cell compares a topic with itself. Beyond the diagonal, there is a lot less similarity, although there seems to be some similarity between topic 0 and topic 3 of the model. Next, we will see if there is more similarity across topics from two different models.

![](Riff_3b.png)

Looking at the heat-map, there is still not a lot of similarity between the topics generated by the two models. However, there seems to be more similarity among the topics generated by two different models (more red squares), which is interesting since it seems counter-intuitive that topics from two different models would be more similar than topics from within one model.
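
One way such within-model and cross-model heat-maps could be computed is to correlate the rows of two PHI tables. The sketch below assumes Pearson correlation between topic-term rows and uses random toy matrices; `PHI_A` and `PHI_B` are hypothetical names for two separate model runs.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy PHI tables (topics x terms) from two hypothetical model runs.
PHI_A = rng.random((3, 50))
PHI_B = rng.random((4, 50))

# np.corrcoef stacks the rows of both inputs and correlates every pair of rows.
full = np.corrcoef(PHI_A, PHI_B)

n_a = PHI_A.shape[0]
WITHIN_A = full[:n_a, :n_a]   # topics of model A against themselves
CROSS_AB = full[:n_a, n_a:]   # topics of model A against topics of model B
```

The within-model block always has ones on its diagonal, while the cross-model block has no such guarantee, since topic numbering is arbitrary across separate runs.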

# Interpretation (4)

Describe something interesting about your corpus that you discovered during the process of completing this assignment.

At a minimum, use 250 words, but you may use more. You may also add images if you'd like.

One interesting thing that I've learned about my corpus while doing this assignment is that children read depressing books in the 19th century. I was interested in the differences in sentiment among the books in the corpus, since the primary audiences and genres of the two authors are quite different. Lewis Carroll mostly wrote children's stories with many elements of fantasy, playfulness, and imagination. Therefore, his books should contain more words with positive connotations and more words with high positive connotation values.

Fyodor Dostoevsky mostly wrote books that explored darker and heavier themes such as corruption, murder, and betrayal (e.g. The Devils), so he had an older audience that was familiar with these themes. Therefore, Dostoevsky's books should have more words with negative connotations and more words with high negative connotation values.

To see the trends in the sentiment analysis of each book in the corpus, I plotted a bar-plot looking at the emotions of each book (calculated from an external emotion lexicon and weighted by the average TFIDF). 

![](Interpretation_1.png)

Looking at the bar-plot, I focused specifically on negative and positive sentiment because the external sentiment lexicon had a wide range of emotional representation (e.g. anger, disgust, fear, etc.), so I thought positive/negative numerical values would succinctly summarize the general sentiment without going into very granular details. We see that Lewis Carroll's books tend to have higher values for the positive category than Dostoevsky's books, which is not surprising given his writing genre and audience. However, many of Lewis Carroll's books also had substantially higher negative sentiment values than Dostoevsky's, i.e. many of Lewis Carroll's children's books essentially "out-negatived" Dostoevsky's books, which was really surprising.

One possible explanation lies in the external sentiment lexicon: using a different lexicon might change these trends. However, a different lexicon is more likely to shift the sentiment trends present in the bar-plot than to completely transform them.

A second explanation lies in language itself. Dostoevsky was a Russian author who wrote his books in Russian, while the source data used to derive the sentiment bar-plots is an English translation. It is therefore possible that some Russian words were lost or altered in translation due to the absence of equivalent English words, so the emotional weight behind some words may not be fully captured in the sentiment analysis. This might be one reason for the trends we see in the sentiment bar-plot.

