# Final Project Notebook

DS 5001 Exploratory Text Analytics | Spring 2024

# Metadata

- Full Name: Elizabeth Burrell
- Userid: egb3kqk
- GitHub Repo URL: https://github.com/burrelllizzie/DS5001-Final-Project
- UVA Box URL: https://www.dropbox.com/home/DS5001%20Final%20Project

# Overview

The goal of the final project is for you to create a **digital analytical edition** of a corpus using the tools, practices, and perspectives you’ve learned in this course. You will select a corpus that has already been digitized and transcribed, parse that into an F-compliant set of tables, and then generate and visualize the results of a series of fitted models. You will also draw some tentative conclusions regarding the linguistic, cultural, psychological, or historical features represented by your corpus. The point of the exercise is to have you work with a corpus through the entire pipeline from ingestion to interpretation. 

Specifically, you will acquire a collection of long-form texts and perform the following operations:

- **Convert** the collection from their source formats (F0) into a set of tables that conform to the Standard Text Analytic Data Model (F2).
- **Annotate** these tables with statistical and linguistic features using NLP libraries such as NLTK (F3).
- **Produce** a vector representation of the corpus to generate TFIDF values to add to the TOKEN (aka CORPUS) and VOCAB tables (F4).
- **Model** the annotated and vectorized model with tables and features derived from the application of unsupervised methods, including PCA, LDA, and word2vec (F5).
- **Explore** your results using statistical and visual methods.
- **Present** conclusions about patterns observed in the corpus by means of these operations.

When you are finished, you will make the results of your work available in GitHub (for code) and UVA Box (for data). You will submit to Gradescope (via Canvas) a PDF version of a Jupyter notebook that contains the information listed below.

# Some Details

- Please fill out your answers in each task below by editing the markdown cell. 
- Replace text that asks you to insert something with the thing, i.e. replace `(INSERT IMAGE HERE)` with an image element, e.g. `![](image.png)`.
- For URLs, just paste the raw URL directly into the text area. Don't worry about providing link labels using `[label](link)`.
- Please do not alter the structure of the document or cell, i.e. the bulleted lists. 
- You may add explanatory paragraphs below the bulleted lists.
- Please name your tables as they are named in each task below.
- Tasks are indicated by headers with point values in parentheses.

# Raw Data

## Source Description (1)


The source material includes five gothic novels published around the late 19th century: Bram Stoker's "Dracula"; Emily Brontë's "Wuthering Heights"; Oscar Wilde's "The Picture of Dorian Gray"; and Sir Arthur Conan Doyle's "The Hound of the Baskervilles" and "The Adventures of Sherlock Holmes". Both of Doyle's books feature Sherlock Holmes and Watson as protagonists and could also be placed in the detective genre, along with "Dracula", as all three follow benevolent protagonists who attempt to unravel the sinister plots of an initially mysterious and evil force. While "The Picture of Dorian Gray" is included in the gothic genre, its object of fear is much more elusive and abstract, tied to human degeneracy and moral decay rather than a tangible threat like the one in "Dracula". Among other gothic tropes, "Dracula", "Wuthering Heights", and "The Hound of the Baskervilles" all feature isolated, gloomy estates or castles that contrast with the bustling urban life of cities like London. 

The text for all five novels was sourced from Project Gutenberg, and the book ID numbers correspond to their Gutenberg IDs.

## Source Features (1)

Add values for the following items. (Do this for all following bulleted lists.)

- Source URL: 
    - Dracula: https://www.gutenberg.org/cache/epub/345/pg345.txt
    - The Picture of Dorian Gray: https://www.gutenberg.org/cache/epub/4078/pg4078.txt
    - The Hound of the Baskervilles: https://www.gutenberg.org/cache/epub/3070/pg3070.txt
    - Wuthering Heights: https://www.gutenberg.org/cache/epub/768/pg768.txt
    - The Adventures of Sherlock Holmes: https://www.gutenberg.org/cache/epub/1661/pg1661.txt
- UVA Box URL: https://www.dropbox.com/home/DS5001%20Final%20Project/source%20txt%20files?di=left_nav_browse
- Number of raw documents: 5
- Total size of raw documents (e.g. in MB): 2.9 MB
- File format(s), e.g. XML, plaintext, etc.: plaintext / txt files

## Source Document Structure (1)

All documents follow the structure of a novel and are broken down into chapters, paragraphs, and sentences, following the OHCO structure typically used in labs. The chapter regex for each novel can be found in the LIB table. Some novels include an introduction or table of contents, which is clipped when generating the CORPUS. All texts also contain Project Gutenberg license and disclosure text after the body of the novel, which is clipped as well. 

# Parsed and Annotated Data

Parse the raw data into the three core tables of your edition: the `LIB`, `CORPUS`, and `VOCAB` tables.

These tables will be stored as CSV files with header rows.

You may consider using `|` as a delimitter.

Provide the following information for each.

## LIB (2)

The source documents the corpus comprises. These may be books, plays, newspaper articles, abstracts, blog posts, etc. 

Note that these are *not* documents in the sense used to describe a bag-of-words representation of a text, e.g. chapter.

- UVA Box URL: https://www.dropbox.com/scl/fi/g6wcyc813me7ha7r9mosy/gothic-texts-LIB.csv?rlkey=ltvtwh5mox5qmd9vzgpzgrrjd&st=ztv468sr&dl=0
- GitHub URL for notebook used to create: https://github.com/burrelllizzie/DS5001-Final-Project/blob/main/Parsing%20Gothic%20Texts.ipynb
- Delimitter: none
- Number of observations: 5
- List of features, including at least three that may be used for model summarization (e.g. date, author, etc.): author, title, chapter regex, number of chapters, book length (in terms), date
- Average length of each document in characters: 429,516

## CORPUS (2)

The sequence of word tokens in the corpus, indexed by their location in the corpus and document structures.

- UVA Box URL: https://www.dropbox.com/scl/fi/9qgz3dnwivp9pa7mie88t/gothic-texts-CORPUS.csv?rlkey=mmuredbke3bt62wbizkz58bov&st=ekhyrs19&dl=0
- GitHub URL for notebook used to create: https://github.com/burrelllizzie/DS5001-Final-Project/blob/main/Parsing%20Gothic%20Texts.ipynb
- Delimitter: none
- Number of observations (should be >= 500,000 and <= 2,000,000 observations): 522,688
- OHCO Structure (as delimitted column names): 'book_id', 'chap_id', 'para_num', 'sent_num', 'token_num'
- Columns (as delimitted column names): 'pos_tuple', 'pos', 'token_str', 'term_str', 'pos_group'

## VOCAB (2)

The unique word types (terms) in the corpus.

- UVA Box URL: https://www.dropbox.com/scl/fi/ib4a2uilo0gw9zq125te5/gothic-texts-VOCAB.csv?rlkey=66am9thqykjd52cymt1kq7jd5&st=79yamooq&dl=0
- GitHub URL for notebook used to create: https://github.com/burrelllizzie/DS5001-Final-Project/blob/main/Parsing%20Gothic%20Texts.ipynb
- Delimitter: none
- Number of observations: 19,990
- Columns: 'n','n_chars','p','i','max_pos','max_pos_group', 'stop','stem_porter','dfidf'
- List the top 20 significant words in the corpus by DFIDF.
![image.png](attachment:image.png)

# Derived Tables

## BOW (3)

A bag-of-words representation of the CORPUS.

- UVA Box URL: https://www.dropbox.com/scl/fi/k8ezle04z9m78alg2a8nw/gothic-texts-BOW.csv?rlkey=5f3415x082twi9gj3da4f7g8k&st=7tut2br8&dl=0
- GitHub URL for notebook used to create:
https://github.com/burrelllizzie/DS5001-Final-Project/blob/main/Deriving%20Tables%20Gothic%20Texts.ipynb
- Delimitter: none
- Bag (expressed in terms of OHCO levels):  CHAPS = OHCO[:2]
- Number of observations: 125,153
- Columns: 'n','tfidf'

## DTM (3)

A representation of the BOW as a sparse count matrix.

- UVA Box URL: https://www.dropbox.com/scl/fi/v95kymb0kluuqjod6ym35/gothic-texts-DTM.csv?rlkey=goqourjs8s1p05r378r00b3po&st=o8lbpi5p&dl=0
- UVA Box URL of BOW used to generate (if applicable): https://www.dropbox.com/scl/fi/k8ezle04z9m78alg2a8nw/gothic-texts-BOW.csv?rlkey=5f3415x082twi9gj3da4f7g8k&st=7tut2br8&dl=0
- GitHub URL for notebook used to create:
https://github.com/burrelllizzie/DS5001-Final-Project/blob/main/Deriving%20Tables%20Gothic%20Texts.ipynb
- Delimitter: none
- Bag (expressed in terms of OHCO levels): CHAPS = OHCO[:2]
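The relationship between the CORPUS, BOW, and DTM tables at the chapter bag can be sketched as follows. This is only a minimal illustration on a made-up toy CORPUS, assuming pandas; the book IDs, terms, and counts are invented, and the real notebook may build these tables differently.

```python
import pandas as pd

# Toy CORPUS indexed by the OHCO levels named above (all values are invented).
OHCO = ['book_id', 'chap_id', 'para_num', 'sent_num', 'token_num']
CHAPS = OHCO[:2]  # the chapter-level bag

CORPUS = pd.DataFrame({
    'book_id':   [345, 345, 345, 345, 768, 768],
    'chap_id':   [1, 1, 1, 2, 1, 1],
    'para_num':  [0, 0, 0, 0, 0, 0],
    'sent_num':  [0, 0, 0, 0, 0, 0],
    'token_num': [0, 1, 2, 0, 0, 1],
    'term_str':  ['castle', 'castle', 'night', 'castle', 'moor', 'night'],
}).set_index(OHCO)

# BOW: one row per (bag, term) pair, with the term count n.
BOW = CORPUS.groupby(CHAPS + ['term_str']).size().to_frame('n')

# DTM: pivot the term level into columns; terms absent from a bag get 0.
DTM = BOW['n'].unstack(fill_value=0)

print(BOW)
print(DTM)
```

The BOW stays in a narrow (long) format, which is compact for a sparse corpus, while the DTM is the wide form that the TFIDF and PCA steps operate on.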

## TFIDF (3)

A Document-Term matrix with TFIDF values.

- UVA Box URL: https://www.dropbox.com/scl/fi/myvoixb1qkkjjrnwzrbso/gothic-texts-TFIDF.csv?rlkey=8r88iglnl1947iegr39srt4sa&st=l37fz2s0&dl=0
- UVA Box URL of DTM or BOW used to create: https://www.dropbox.com/scl/fi/v95kymb0kluuqjod6ym35/gothic-texts-DTM.csv?rlkey=goqourjs8s1p05r378r00b3po&st=o8lbpi5p&dl=0
- GitHub URL for notebook used to create:
https://github.com/burrelllizzie/DS5001-Final-Project/blob/main/Deriving%20Tables%20Gothic%20Texts.ipynb
- Delimitter: none
- Description of TFIDF formula ($\LaTeX$ OK): 
    - using the 'max' method: 
    - TF = DTCM.T / DTCM.T.max()
    - IDF = np.log2(N / DF)
    - TFIDF = TF * IDF
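The formula above can be traced on a small example. The sketch below assumes `DTCM` is a document-term count matrix with documents as rows, and it adds a final transpose (not shown in the formula) so that `TF` has the same orientation as `DTCM`; the toy counts are invented.

```python
import numpy as np
import pandas as pd

# Toy document-term count matrix (rows = chapter bags, columns = terms).
DTCM = pd.DataFrame(
    [[4, 0, 2],
     [1, 1, 0],
     [0, 3, 3]],
    index=['doc_a', 'doc_b', 'doc_c'],
    columns=['castle', 'moor', 'night'],
)

# 'max' term frequency: divide each document's counts by its largest count.
# The trailing .T is an assumption that restores the documents-as-rows layout.
TF = (DTCM.T / DTCM.T.max()).T

# Document frequency (number of documents containing each term) and IDF.
N = DTCM.shape[0]
DF = DTCM.astype(bool).sum()
IDF = np.log2(N / DF)

# TFIDF: each column of TF is scaled by that term's IDF.
TFIDF = TF * IDF
print(TFIDF.round(3))
```

With the 'max' method, the most frequent term in every document has a TF of 1, so long and short chapters are placed on a comparable scale before the IDF weighting.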

## Reduced and Normalized TFIDF_L2 (3)

A Document-Term matrix with L2 normalized TFIDF values.

- UVA Box URL: https://www.dropbox.com/scl/fi/z1vt39cqr3y6efv0jcbat/gothic-texts-TFIDF_L2.csv?rlkey=cwsytu8mu73l3302j0avklf9u&st=i7dbhgv7&dl=0
- UVA Box URL of source TFIDF table: https://www.dropbox.com/scl/fi/myvoixb1qkkjjrnwzrbso/gothic-texts-TFIDF.csv?rlkey=8r88iglnl1947iegr39srt4sa&st=l37fz2s0&dl=0
- GitHub URL for notebook used to create:
https://github.com/burrelllizzie/DS5001-Final-Project/blob/main/Deriving%20Tables%20Gothic%20Texts.ipynb
- Delimitter: none
- Number of features (i.e. significant words): 1000
- Principle of significant word selection: Top 1000 significant words, ranked by DFIDF
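A minimal sketch of the reduction and normalization described above, under two assumptions: that DFIDF is computed as DF × log2(N / DF), and that L2 normalization divides each document row by its Euclidean length. The TFIDF values and document frequencies below are invented, and two features are kept instead of 1000.

```python
import numpy as np
import pandas as pd

# Toy TFIDF table (rows = documents, columns = terms); values are invented.
TFIDF = pd.DataFrame(
    [[3.0, 0.0, 4.0, 0.0],
     [6.0, 1.0, 0.0, 0.0],
     [0.0, 2.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]],
    index=['d1', 'd2', 'd3', 'd4', 'd5'],
    columns=['blood', 'moor', 'soul', 'the'],
)

# Made-up document frequencies out of N = 5 documents.
N = 5
DF = pd.Series({'blood': 2, 'moor': 3, 'soul': 1, 'the': 5})

# Assumed DFIDF definition: document frequency times inverse document frequency.
DFIDF = DF * np.log2(N / DF)

# Keep the top-ranked terms (2 here; the project keeps 1000).
n_features = 2
top_terms = DFIDF.sort_values(ascending=False).head(n_features).index
reduced = TFIDF[top_terms]

# L2-normalize each row; all-zero rows are left as zeros instead of dividing by 0.
norms = np.sqrt((reduced ** 2).sum(axis=1))
TFIDF_L2 = reduced.div(norms.replace(0, np.nan), axis=0).fillna(0)
print(TFIDF_L2)
```

A term that appears in every document (like `the` here) gets a DFIDF of 0 and drops out, while the normalization makes every non-empty document a unit-length vector.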

# Models

## PCA Components (4)

- UVA Box URL: https://www.dropbox.com/scl/fi/k7af1fsfdp9emds7oa4q8/gothic-texts-COMPONENTS.csv?rlkey=9ztoig791rckpawrio9rmkg2c&st=0lvign1u&dl=0
- UVA Box URL of the source TFIDF_L2 table: https://www.dropbox.com/scl/fi/z1vt39cqr3y6efv0jcbat/gothic-texts-TFIDF_L2.csv?rlkey=cwsytu8mu73l3302j0avklf9u&st=i7dbhgv7&dl=0
- GitHub URL for notebook used to create:
https://github.com/burrelllizzie/DS5001-Final-Project/blob/main/Deriving%20Tables%20Gothic%20Texts.ipynb
- Delimitter: none
- Number of components: 10
- Library used to generate:(all novels included) https://www.dropbox.com/scl/fi/g6wcyc813me7ha7r9mosy/gothic-texts-LIB.csv?rlkey=ltvtwh5mox5qmd9vzgpzgrrjd&st=ztv468sr&dl=0 
- Top 5 positive terms for first component:
    - moor
    - hound
    - baronet
    - hall
    - hill
- Top 5 negative terms for second component:
    - whilst
    - ship
    - sleep
    - boxes
    - castle

## PCA DCM (4)

The document-component matrix generated.

- UVA Box URL: https://www.dropbox.com/scl/fi/kb5pq9edp0qt6xbkoxgln/gothic-texts-DCM.csv?rlkey=iudoth4d9g0m6tlf7v5eq3o4o&st=6foh6vkd&dl=0
- GitHub URL for notebook used to create:
https://github.com/burrelllizzie/DS5001-Final-Project/blob/main/Deriving%20Tables%20Gothic%20Texts.ipynb
- Delimitter: none

## PCA Loadings (4)

The component-term matrix generated.

- UVA Box URL: https://www.dropbox.com/scl/fi/qclf4dh3s8h6va73zh3ga/gothic-texts-LOADINGS.csv?rlkey=v9g7su3zdjm85i2efub6kre97&st=na6mbws2&dl=0
- GitHub URL for notebook used to create:
https://github.com/burrelllizzie/DS5001-Final-Project/blob/main/Deriving%20Tables%20Gothic%20Texts.ipynb
- Delimitter: none
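How the three PCA tables relate to one another can be sketched with an eigendecomposition of the term covariance matrix. This is a minimal illustration on random data, not the notebook's actual code; the matrix `X` stands in for TFIDF_L2, `k = 2` replaces the 10 components used in the project, and the column names `eig_val` and `exp_var` are hypothetical. Note that the loadings here are stored with terms as rows.

```python
import numpy as np
import pandas as pd

# Random stand-in for the TFIDF_L2 table (rows = documents, columns = terms).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((8, 5)),
                 index=[f'doc_{i}' for i in range(8)],
                 columns=['moor', 'hound', 'ship', 'castle', 'soul'])

k = 2  # number of components to keep
centered = X - X.mean()

# Eigendecomposition of the term-term covariance matrix.
COV = centered.cov()
eig_vals, eig_vecs = np.linalg.eigh(COV.values)

# eigh returns eigenvalues in ascending order; take the k largest.
order = np.argsort(eig_vals)[::-1][:k]
comp_ids = [f'PC{i}' for i in range(k)]

# COMPONENTS: one row per component with its eigenvalue and explained variance.
COMPONENTS = pd.DataFrame({'eig_val': eig_vals[order],
                           'exp_var': eig_vals[order] / eig_vals.sum()},
                          index=comp_ids)

# LOADINGS: contribution of each term to each component.
LOADINGS = pd.DataFrame(eig_vecs[:, order], index=X.columns, columns=comp_ids)

# DCM: documents projected onto the components.
DCM = centered.dot(LOADINGS)

# Terms with the largest positive loadings on the first component.
print(LOADINGS['PC0'].sort_values(ascending=False).head(2))
```

Sorting a LOADINGS column in ascending order instead gives the most negative terms, which is how lists of top positive and top negative terms per component can be read off.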

## PCA Visualization 1 (4)

Include a scatterplot of documents in the space created by the first two components.

Color the points based on a metadata feature associated with the documents.

Also include a scatterplot of the loadings for the same two components. (This does not need a feature mapped onto color.)

![newplot%20%281%29.png](attachment:newplot%20%281%29.png)



![newplot%20%282%29.png](attachment:newplot%20%282%29.png)

Briefly describe the nature of the polarity you see in the first component:

The top left side contains words like "case", "facts", "evidence", "cases", "police", "report", "track", and "points", which indicate associations with deductive work and reasoning from tangible clues. The bottom left corner of the triangle includes words like "passions", "beauty", "senses", "soul", "form", "secret", "sin", "mist", "pain", "romance", and "vanity". All of these words seem associated with the more mysterious, elusive, or malevolent forces within the novels. The words "vanity" and "beauty" in particular relate to Dorian Gray's moral and social downfall in Wilde's novel, as his portrait ages and decays from Dorian's sinful activities while his physical body remains youthful and flawless. The polarity exhibited in the first component seems to be an opposition between the tangible/investigated and the abstract/mysterious. 

## PCA Visualization 2 (4)

Include a scatterplot of documents in the space created by the second two components.

Color the points based on a metadata feature associated with the documents.

Also include a scatterplot of the loadings for the same two components. (This does not need a feature mapped onto color.)

![newplot%20%283%29.png](attachment:newplot%20%283%29.png)



![newplot%20%284%29.png](attachment:newplot%20%284%29.png)

Briefly describe the nature of the polarity you see in the second component:


The left side of the cluster includes words like "blood", "tomb", "coffin", "sunset", "moonlight", "teeth", "fear", "churchyard", "stake", and "monster". All of these words are especially prominent in "Dracula" and in descriptions of vampires and their resting places. The right side of the cluster contains words like "friendship", "laughing", "smiling", "youth", "romance", "pleasure", "passion", and "garden". These words are clearly more positive and lighthearted than those on the opposite side, and are associated with domestic and romantic relationships. This polarity likely does not reflect the conflict within "Dracula" itself, but rather the differing preoccupations of the novels within the library. 

## LDA TOPIC (4)

- UVA Box URL: https://www.dropbox.com/scl/fi/mecfugpudg2ds72rujlxg/gothic-texts-LDA_TOPICS.csv?rlkey=scvk5g0hrtzc5fi6w5239ubfx&st=swdrhrcj&dl=0
- UVA Box URL of count matrix used to create:
https://www.dropbox.com/scl/fi/ex2azus21q4t5rxfaef2f/gothic-texts-LDA_DTM.csv?rlkey=b9kbys6gno6pjd02zxg935wgz&st=qlos0kb4&dl=0
- GitHub URL for notebook used to create:
https://github.com/burrelllizzie/DS5001-Final-Project/blob/main/Generating%20Models.ipynb
- Delimitter: none
- Library used to compute: (all texts included)
https://www.dropbox.com/scl/fi/g6wcyc813me7ha7r9mosy/gothic-texts-LIB.csv?rlkey=ltvtwh5mox5qmd9vzgpzgrrjd&st=ztv468sr&dl=0
- A description of any filtering, e.g. POS (Nouns and Verbs only): no filtering - all words included from CORPUS
- Number of components: 20
- Any other parameters used: none
- Top 5 words and best-guess labels for top five topics by mean document weight:
![image.png](attachment:image.png)
  - T00: ambiguous/existential thematic
  - T01: positive relationships
  - T02: domestic spaces
  - T03: movement
  - T04: clues/evidence

## LDA THETA (4)

- UVA Box URL:
https://www.dropbox.com/scl/fi/fzlycgzbwl857upqkfx4e/gothic-texts-LDA_THETA.csv?rlkey=cv5lcb2wb2fy33w38g8kx4e9s&st=p6evf25f&dl=0
- GitHub URL for notebook used to create:
https://github.com/burrelllizzie/DS5001-Final-Project/blob/main/Generating%20Models.ipynb
- Delimitter:none

## LDA PHI (4)

- UVA Box URL:
https://www.dropbox.com/scl/fi/g936augjejobcbyc7hnic/gothic-texts-LDA_PHI.csv?rlkey=h4ujldzg72vyxfaufnzcsi3kl&st=9tr12bdu&dl=0
- GitHub URL for notebook used to create:
https://github.com/burrelllizzie/DS5001-Final-Project/blob/main/Generating%20Models.ipynb
- Delimitter:none
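The way the TOPIC, THETA, and PHI tables fit together can be sketched with scikit-learn's `LatentDirichletAllocation`. This is a minimal illustration on a random count matrix with 3 topics instead of 20; the terms, topic IDs, and the column names `top_terms` and `doc_weight_mean` are assumptions, and the project notebook may use different settings.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation

# Random stand-in for the count matrix (rows = chapter bags, columns = terms).
rng = np.random.default_rng(0)
terms = ['moor', 'hound', 'blood', 'coffin', 'beauty', 'soul']
DTM = pd.DataFrame(rng.integers(0, 5, size=(12, len(terms))),
                   index=[f'doc_{i}' for i in range(12)],
                   columns=terms)

n_topics = 3  # the project uses 20
topic_ids = [f'T{i:02d}' for i in range(n_topics)]

lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)

# THETA: document-topic weights (each row sums to 1).
THETA = pd.DataFrame(lda.fit_transform(DTM.values),
                     index=DTM.index, columns=topic_ids)

# PHI: topic-term weights (unnormalized pseudo-counts in scikit-learn).
PHI = pd.DataFrame(lda.components_, index=topic_ids, columns=DTM.columns)

# TOPICS: top terms per topic plus the mean document weight used for ranking.
TOPICS = PHI.apply(
    lambda row: ' '.join(row.sort_values(ascending=False).head(3).index),
    axis=1,
).to_frame('top_terms')
TOPICS['doc_weight_mean'] = THETA.mean()

print(TOPICS.sort_values('doc_weight_mean', ascending=False))
```

Sorting TOPICS by the mean document weight is one way to pick the "top five topics" that the labels above refer to.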

## LDA + PCA Visualization (4)

Apply PCA to the PHI table and plot the topics in the space opened by the first two components.

Size the points based on the mean document weight of each topic (using the THETA table).

Color the points based on a metadata feature from the LIB table.

Provide a brief interpretation of what you see.

I was unable to complete this part, and did not find instructions or examples of completing similar tasks in previous labs or homeworks.
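For reference, one possible approach to the task described above is sketched below on synthetic data. It is not output from this project: the PHI and THETA values are random, the book IDs are placeholders, and the choice of "book with the highest mean topic weight" as the color feature is only an assumption about how a LIB feature could be mapped onto topics.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
topic_ids = ['T00', 'T01', 'T02', 'T03']
terms = ['moor', 'hound', 'blood', 'coffin', 'beauty', 'soul']

# Synthetic PHI (topics x terms) and THETA (chapter bags x topics).
PHI = pd.DataFrame(rng.random((len(topic_ids), len(terms))),
                   index=topic_ids, columns=terms)
docs = pd.MultiIndex.from_product([[345, 768], [1, 2, 3]],
                                  names=['book_id', 'chap_id'])
THETA = pd.DataFrame(rng.dirichlet(np.ones(len(topic_ids)), size=len(docs)),
                     index=docs, columns=topic_ids)

# PCA on PHI: normalize each topic to a distribution, center, then use SVD.
P = PHI.div(PHI.sum(axis=1), axis=0)
centered = P - P.mean()
U, S, Vt = np.linalg.svd(centered.values, full_matrices=False)
coords = U[:, :2] * S[:2]  # topic coordinates on the first two components

# One row per topic: position, point size, and a book-level color feature.
PLOT = pd.DataFrame(coords, index=topic_ids, columns=['PC0', 'PC1'])
PLOT['mean_weight'] = THETA.mean()                                    # point size
PLOT['top_book'] = THETA.groupby(level='book_id').mean().idxmax()     # point color

print(PLOT)
```

The resulting table can be handed to any scatter-plot function, with `PC0` and `PC1` as the axes, `mean_weight` as the point size, and `top_book` (joined to LIB for a title or author) as the color.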

## Sentiment VOCAB_SENT (4)

Sentiment values associated with a subset of the VOCAB from a curated sentiment lexicon.

- UVA Box URL:
https://www.dropbox.com/scl/fi/p5thjetty3b1kjbu4wwo6/gothic-texts-VOCAB_SENT.csv?rlkey=w60hmp7mtfb27w2opl2ayak1l&st=et9ojcfk&dl=0
- UVA Box URL for source lexicon:
https://www.dropbox.com/scl/fi/50k6bcn6e6ws9s4waiw0i/gothic-texts-SOURCE_LEXICON_SALEX.csv?rlkey=88wxc3sof3pghg0geum6a08a6&st=0oq6gdk4&dl=0
- GitHub URL for notebook used to create:
https://github.com/burrelllizzie/DS5001-Final-Project/blob/main/Generating%20Models.ipynb
- Delimitter:none

## Sentiment BOW_SENT (4)

Sentiment values from VOCAB_SENT mapped onto BOW.

- UVA Box URL:
https://www.dropbox.com/scl/fi/xe2n5esktnh9ytib3gxn8/gothic-texts-BOW_SENT.csv?rlkey=wt7jke52ahqcnzrvw2rg16sl4&st=ozeo709x&dl=0
- GitHub URL for notebook used to create:
https://github.com/burrelllizzie/DS5001-Final-Project/blob/main/Generating%20Models.ipynb
- Delimitter: none

## Sentiment DOC_SENT (4)

Computed sentiment per bag computed from BOW_SENT.

- UVA Box URL:
https://www.dropbox.com/scl/fi/abvgc5ct08544g01a6afs/gothic-texts-DOC_SENT.csv?rlkey=e1tlvv1ftkocltr4k38fcebow&st=mgxv1nhu&dl=0
- GitHub URL for notebook used to create:
https://github.com/burrelllizzie/DS5001-Final-Project/blob/main/Generating%20Models.ipynb
- Delimitter: none
- Document bag expressed in terms of OHCO levels: CHAPS = OHCO[:2]
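The chain from lexicon to VOCAB_SENT, BOW_SENT, and DOC_SENT can be sketched as below. The lexicon values, terms, and counts are invented, the emotion columns are a hypothetical subset, and weighting each term's value by its count `n` is an assumption; the notebook may weight by TFIDF or another measure instead.

```python
import pandas as pd

CHAPS = ['book_id', 'chap_id']

# Hypothetical sentiment lexicon indexed by term (values are invented).
SALEX = pd.DataFrame(
    {'fear': [1, 0, 1, 0], 'joy': [0, 1, 0, 1], 'sentiment': [-1, 1, -1, 1]},
    index=pd.Index(['castle', 'garden', 'night', 'sunshine'], name='term_str'),
)

# Toy VOCAB and BOW tables.
VOCAB = pd.DataFrame(index=pd.Index(['castle', 'garden', 'night', 'the'],
                                    name='term_str'))
BOW = pd.DataFrame({
    'book_id':  [345, 345, 345, 768, 768],
    'chap_id':  [1, 1, 1, 1, 1],
    'term_str': ['castle', 'the', 'garden', 'night', 'garden'],
    'n':        [2, 5, 2, 1, 3],
}).set_index(CHAPS + ['term_str'])

# VOCAB_SENT: the lexicon restricted to terms that occur in the corpus.
VOCAB_SENT = SALEX.loc[SALEX.index.intersection(VOCAB.index)]

# BOW_SENT: sentiment values mapped onto the BOW; terms without values drop out.
BOW_SENT = (BOW.reset_index()
               .merge(VOCAB_SENT, left_on='term_str', right_index=True, how='inner')
               .set_index(CHAPS + ['term_str']))

# DOC_SENT: count-weighted mean of each emotion per chapter bag.
emo_cols = list(VOCAB_SENT.columns)
weighted = BOW_SENT[emo_cols].mul(BOW_SENT['n'], axis=0)
DOC_SENT = (weighted.groupby(level=CHAPS).sum()
                    .div(BOW_SENT['n'].groupby(level=CHAPS).sum(), axis=0))

print(DOC_SENT)
```

Because only lexicon terms survive the join, each chapter's score is an average over its emotionally marked words rather than over all of its tokens.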

## Sentiment Plot (4)

Sentiment of "Dracula" plotted over the course of the novel: 
![newplot%20%285%29.png](attachment:newplot%20%285%29.png)

## VOCAB_W2V (4)

A table of word2vec features associated with terms in the VOCAB table.

- UVA Box URL:
https://www.dropbox.com/scl/fi/7s4bsghogv5a3kuivadjz/gothic-texts-VOCAB_W2V.csv?rlkey=q51ixocwy2zw2sv9gokz7gohi&st=ubkagocy&dl=0
- GitHub URL for notebook used to create:
https://github.com/burrelllizzie/DS5001-Final-Project/blob/main/Generating%20RIFFS%20%26%20W2V.ipynb
- Delimitter:none
- Document bag expressed in terms of OHCO levels: PARA = OHCO[:3]
- Number of features generated: 189
- The library used to generate the embeddings: Only "The Hound of the Baskervilles" and "The Adventures of Sherlock Holmes" by Sir Arthur Conan Doyle were used to generate this W2V model

## Word2vec tSNE Plot (4)

Plot word embedding features in two-dimensions using t-SNE.

Describe a cluster in the plot that captures your attention.

![newplot%20%286%29.png](attachment:newplot%20%286%29.png)

The bottom right corner (-2, -5) contains a cluster which seems to capture the more haunting, horrific elements of Doyle's gothic scenes. The cluster includes words like "moor", "house", and "night", which all describe settings in which the mysteries occur. The cluster also includes "hound" and "hands" closely plotted, both of which act as perpetrators of the crimes within the text. There are also chilling words like "wind" and "death", which suggest that these words may be used to describe the more eerie crime scenes in the novels. 

# Riffs

Provide at least three visualizations that combine the preceding model data in interesting ways.

These should provide insight into how features in the LIB table are related. 

The nature of this relationship is left open to you -- it may be correlation, or mutual information, or something less well defined. 

In doing so, consider the following visualization types:

- Hierarchical cluster diagrams
- Heatmaps
- Scatter plots
- KDE plots
- Dispersion plots
- t-SNE plots
- etc.

## Riff 1 (5)

### Exploring Sentiment by Book

GitHub URL for notebook used to create: https://github.com/burrelllizzie/DS5001-Final-Project/blob/main/Generating%20Models.ipynb

![sent_books.png](attachment:sent_books.png)

In the bar plot exploring average emotions across the gothic novels in the library, the only novel with an overall positive sentiment (top bar) is "The Adventures of Sherlock Holmes". This is likely because these stories typically feature Holmes' triumph over a mysterious adversary through the power of his deductive reasoning. While this narrative is also exhibited in "The Hound of the Baskervilles", its negative sentiment value could perhaps be attributed to the large amount of text dedicated to describing the desolate moor landscape in which the mystery is set. The latter novel also highlights themes of human degeneracy and atavism, which may also contribute to a negative sentiment value. Across all novels, fear (red bar) has one of the highest values; the only novel in which it does not exceed all other emotions is "The Picture of Dorian Gray". This may be because the protagonist, Gray, is depicted relishing his youth in high society for a large portion of the novel's beginning, before his downfall, which is reflected in "joy" being the highest sentiment value. 

## Riff 2 (5)

### Dendrograms / Distance Measures

GitHub URL for notebook used to create: https://github.com/burrelllizzie/DS5001-Final-Project/blob/main/Deriving%20Tables%20Gothic%20Texts.ipynb

### Jaccard
![image.png](attachment:image.png)

### Cosine
![image-2.png](attachment:image-2.png)

Almost all of the distance measures placed Doyle's novels ("The Hound of the Baskervilles" and "The Adventures of Sherlock Holmes") closest to one another. The Jaccard method was the only one which did not, instead placing "Dracula" and "Wuthering Heights" closest to one another, stemming from the same branch. This may be because of the higher prominence of female characters in these two books, while the remaining novels only include women in minor roles. In most of the dendrograms, "Dracula" is placed in next closest proximity to Doyle's novels. This may be due to the investigative and detective narratives that drive the plots of these novels, much more so than in "Wuthering Heights" and "The Picture of Dorian Gray", which also feature more scenes in domestic spaces. 
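One reason the Jaccard tree can differ from the others is that Jaccard distance only looks at which terms are present, while cosine distance also uses their weights. The sketch below illustrates this on an invented 3-book matrix; the book names, terms, and counts are placeholders, not values from this corpus.

```python
import numpy as np
import pandas as pd

# Invented book-level term counts (rows = books, columns = terms).
BOOKS = pd.DataFrame(
    [[2, 1, 0, 0],
     [4, 2, 0, 0],
     [0, 1, 3, 1]],
    index=['book_a', 'book_b', 'book_c'],
    columns=['case', 'night', 'blood', 'tomb'],
)

def cosine_distance(M):
    """1 - cosine similarity between every pair of rows."""
    unit = M / np.linalg.norm(M, axis=1, keepdims=True)
    return 1 - unit @ unit.T

def jaccard_distance(M):
    """1 - |intersection| / |union| of the sets of terms present in each row."""
    B = (M > 0).astype(float)
    inter = B @ B.T
    sizes = B.sum(axis=1)
    union = sizes[:, None] + sizes[None, :] - inter
    return 1 - inter / union

M = BOOKS.values.astype(float)
COS = pd.DataFrame(cosine_distance(M), index=BOOKS.index, columns=BOOKS.index)
JAC = pd.DataFrame(jaccard_distance(M), index=BOOKS.index, columns=BOOKS.index)

print(COS.round(3))
print(JAC.round(3))
```

Either distance matrix could then be passed to a hierarchical clustering routine to draw a dendrogram; since Jaccard discards term weights, books that share vocabulary but use it at very different rates can end up closer under Jaccard than under cosine.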

## Riff 3 (5)

GitHub URL for notebook used to create: https://github.com/burrelllizzie/DS5001-Final-Project/blob/main/Generating%20RIFFS%20%26%20W2V.ipynb

### Wuthering Heights Sentiment Map
![newplot%20%288%29.png](attachment:newplot%20%288%29.png)


### The Adventures of Sherlock Holmes Map
![newplot%20%289%29.png](attachment:newplot%20%289%29.png)

### The Hound of the Baskervilles Map
![newplot%20%2810%29.png](attachment:newplot%20%2810%29.png)

### The Picture of Dorian Gray Map
![newplot%20%2811%29.png](attachment:newplot%20%2811%29.png)

Nearly all the novels exhibit fluctuating but relatively stable negative sentiment throughout the course of the narrative. "The Picture of Dorian Gray" is the only novel which follows a clear pattern of descending sentiment towards the final third of the book, with fear also reaching its peak at this point. The rapidly fluctuating sentiment levels of the other novels likely reflect a repeating pattern of investigative triumph over malevolent powers, followed by renewed upheaval. This is especially characteristic of the plot of "Dracula", in which the evil Count is typically a few steps ahead of the benevolent protagonists. 

# Interpretation (4)

Describe something interesting about your corpus that you discovered during the process of completing this assignment.

While conducting sentiment analysis across these gothic novels, all published around the same period, it was interesting to find that they all exhibited similar levels of each emotion, as shown in the first Riff, Exploring Sentiment by Book. I was curious whether this emotive pattern of low overall sentiment and high fear was a reflection of the gothic genre, or if it was typical of most novels. To investigate this further, it would be interesting to compare emotion bar charts between gothic and romantic novels written during the 19th century. One curious thing which leads me to question the validity of the emotion bar charts is that the 'surprise' emotion is low across nearly all novels, even though surprise is something that garners reader engagement, especially in detective novels like Doyle's and "Dracula". 

While exploring polarity in the PCA Visualizations, it was not surprising to find the largest distinction between clusters of words which described tangible evidence that could be used to solve mysteries, and those elements of the uncanny that continued to evade the protagonists' deduction and curiosity. In the first visualization, there was also a distinct cluster of words describing domestic spaces and familial/romantic relationships, which is also characteristic of the gothic genre. Since all novels were included in the visualizations, it was difficult to disentangle the unique gothic conflicts (urban life & progression vs rural inhabitants & degeneration; domestic spaces vs wild places; science and reason vs passions and spirituality) that may have been represented in each novel. In further explorations, I would recreate similar visualizations for each novel to better capture the thematic conflicts at the heart of their plots. 