# Exploratory Text Analysis of Theater Performance Reviews

### Ami Kano

## Introduction

Theatrical performance is an art form expressed through rich, live storytelling. But how are those expressions received? What kind of discourse do they generate? This project analyzes approximately 300 theatre performance reviews using Principal Component Analysis, Topic Models, Word Embeddings (word2vec) and Sentiment Analysis, and reports the insights these analyses offer into how performances are reviewed.

## Source Data

The source data for this analysis was collected from the website **https://www.playshakespeare.com/**. This website describes itself as 'the best Shakespeare source for plays, news, reviews, and discussion' and contains, amongst other contents, reviews of theatrical performances based on the works of William Shakespeare.

As the website was not web-scraping friendly, the webpages containing the reviews were manually downloaded as HTML files. In total, there were 292 HTML files, which were later parsed into more analysis-appropriate formats. The code developed to parse the source files is stored in the `parser.ipynb` file. This file, along with the 292 HTML files, is available on the GitHub page for this project:

* `parser.ipynb` : https://github.com/ak7ra/eta_theatre_reviews/blob/main/parser.ipynb
* Source HTML files: https://github.com/ak7ra/eta_theatre_reviews/tree/main/data
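As a rough illustration of the parsing step, the sketch below pulls the visible text out of an HTML string using only the Python standard library. It is not the code in `parser.ipynb`; the class `TextExtractor`, the function `html_to_text`, and the sample markup are all hypothetical, and the real pages would need extra logic to isolate the review body, rating, and title.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Made-up markup standing in for one downloaded review page.
sample = (
    "<html><head><style>p{}</style></head>"
    "<body><h1>Hamlet</h1><p>A fine show.</p></body></html>"
)
review_text = html_to_text(sample)
```

In practice each of the 292 files would be read from disk and passed through a function like this before tokenization.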

The distribution of the Shakespeare works that the reviewed performances were based on, along with their genres, is as follows:

<table><tr>
<td> <img src="output/images/og_work_count.png"  style="width: 200px;"/> </td>
<td> <img src="output/images/genre_count.png"  style="width: 200px;"/> </td>
</tr></table>

The reviews are unevenly distributed across the original works but evenly distributed across the two genres.

Below is a histogram of the word counts of each review, post-parsing:

<img src="output/images/corpus_length_histogram.png"  style="width: 600px;"/>

From the histogram, it can be seen that most reviews are around 1000 words in length. The outliers, such as the review with approximately 5000 words, appear to have been written by reviewers who were more emotionally invested in the performance in question.

## Data Model

The tables generated through the analysis will be discussed here briefly. The finer details and the URLs of the tables are listed in the appendix of this paper. 

#### Core Tables
These tables were generated from the source files with the `parser.ipynb` file:

* `CORPUS.csv`
* `LIB.csv`
* `VOCAB.csv`

It should be noted that the `CORPUS` table is indexed by `review_id`, `sent_id`, and `token_id`, unlike the more commonly used OHCO index, which also includes a paragraph level. This is because the HTML format of the source documents made it impossible to separate paragraphs. In addition, most reviews had two or fewer paragraphs, suggesting that an additional paragraph index would add little to the analysis.

In addition, the `CORPUS` table was filtered to exclude the names of characters present in Shakespeare's works. This was done to improve the results of the analyses, as the character names would be given a disproportionate level of importance due to their frequency.
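The index structure and the name filter can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's parser: the sentence splitter and tokenizer are naive regular expressions, the function `tokenize_review` is hypothetical, and `CHARACTER_NAMES` is a tiny made-up list standing in for the full set of character names.

```python
import re

# Hypothetical, deliberately tiny list of character names to drop.
CHARACTER_NAMES = {"hamlet", "ophelia", "viola"}


def tokenize_review(review_id, text, drop=CHARACTER_NAMES):
    """Return rows of (review_id, sent_id, token_id, term_str)."""
    rows = []
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    for sent_id, sentence in enumerate(sentences):
        tokens = re.findall(r"[A-Za-z']+", sentence)
        # Lowercase and remove character names before numbering tokens.
        kept = [t.lower() for t in tokens if t.lower() not in drop]
        for token_id, term in enumerate(kept):
            rows.append((review_id, sent_id, token_id, term))
    return rows


rows = tokenize_review("r001", "Hamlet was superb. The set felt bare!")
```

Here token numbering happens after the names are removed; numbering before filtering would be an equally reasonable choice and would leave gaps in `token_id`.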

The `LIB` table contains a few unique fields: `Original Work`, `Overall Rating`, `Genre` and `Rating Category`. The first two fields were extracted from the source file. `Original Work` refers to the Shakespeare work that the performance was based on, and contains values such as 'Twelfth Night' and 'MacBeth'. `Overall Rating` is a numerical variable containing integers from 1 to 5, with 1 as the lowest rating and 5 as the highest. The `Overall Rating` variable has a right-skewed distribution that looks like:

<img src="output/images/rating_histogram_plain.png"  style="width: 600px;"/>

And this distribution is shared amongst different genres:

<img src="output/images/rating_histogram.png"  style="width: 600px;"/>

The `Rating Category` variable was generated based on the `Overall Rating` variable. Reviews with `Overall Rating` values of 3 or less were given a `Rating Category` value of 'Negative', and those with `Overall Rating` values of 4 or higher were given a `Rating Category` value of 'Positive'.

The `Genre` variable was generated based on the `Original Work` variable and has the values 'Comedy' and 'Tragedy'. It is a categorical variable indicating the genre of the referenced Shakespeare work.
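The two derived fields amount to a threshold and a lookup. The sketch below shows one way to express them; the function names are hypothetical, and `GENRE_BY_WORK` is a partial, illustrative mapping whose keys may not match the exact spellings stored in `LIB.csv`.

```python
# Hypothetical, partial mapping from play title to genre; the real table
# would cover every play in the corpus.
GENRE_BY_WORK = {
    "Twelfth Night": "Comedy",
    "Midsummer Night's Dream": "Comedy",
    "MacBeth": "Tragedy",
}


def rating_category(overall_rating):
    """Ratings of 3 or less are 'Negative'; 4 or higher are 'Positive'."""
    return "Positive" if overall_rating >= 4 else "Negative"


def genre_of(original_work):
    """Look up the genre; unknown titles return None rather than a guess."""
    return GENRE_BY_WORK.get(original_work)
```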

Lastly, it should be noted that the field `dfidf` was added to the `VOCAB` table during one of the analyses.

#### Tables Generated During Analyses

The tables below were generated through various text analysis methods employed in the `analysis.ipynb` file:

##### Principal Components Analysis (PCA)

* `reduced_tfidf.csv`
* `pca_dcm.csv`
* `pca_loadings.csv`
* `pca_compinf.csv`

##### Topic Models (LDA)

* `topic.csv`
* `topic_theta.csv`
* `topic_phi.csv`

##### Word Embeddings (word2vec)

* `word2vec.csv`

##### Sentiment Analysis

* `sentiment.csv`

## Exploration/Interpretation

Here, the exploration of the corpus is discussed. Each text analysis method employed is discussed in its own section.

### Principal Components Analysis (PCA)

For this analysis, the `VOCAB` table was filtered down to the 1000 most important nouns as ranked by DFIDF. 

The filtered down `VOCAB` table was then used to produce a TFIDF table with `review_id` as the index. 
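The ranking and weighting steps can be sketched in plain Python. This is an illustration under assumptions rather than the code in `analysis.ipynb`: DFIDF is taken to be `df * log2(N / df)`, term frequency is taken to be the relative frequency within a review, the part-of-speech filter is omitted, and the toy `docs` are made up.

```python
import math
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    return df


def dfidf(docs):
    """DFIDF = df * log2(N / df); zero for terms found in every document."""
    n_docs = len(docs)
    return {t: d * math.log2(n_docs / d) for t, d in document_frequency(docs).items()}


def tfidf(docs, vocab):
    """TFIDF per document, restricted to the terms in `vocab`."""
    n_docs = len(docs)
    df = document_frequency(docs)
    table = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero on empty documents
        table[doc_id] = {
            t: (counts[t] / total) * math.log2(n_docs / df[t])
            for t in vocab
            if df[t]
        }
    return table


# Made-up token lists keyed by review_id.
docs = {
    "r1": ["stage", "fairies", "dream"],
    "r2": ["stage", "witches", "murder"],
    "r3": ["stage", "fairies", "light"],
    "r4": ["stage", "witches", "dream"],
}
dfidf_scores = dfidf(docs)
# Keep the top-ranked terms (the report keeps 1000; 3 is enough here).
vocab = sorted(dfidf_scores, key=dfidf_scores.get, reverse=True)[:3]
tfidf_table = tfidf(docs, vocab)
```

A term such as 'stage' that appears in every review gets a DFIDF of zero and drops out of the vocabulary, which is the point of ranking by DFIDF instead of raw frequency.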

With this TFIDF table and the hyperparameters `k=10`, `norm_docs=True`, `center_by_mean=False`, and `center_by_variance=False`, three tables were created: the DCM table, the loadings table, and the compinf table. These three tables contain information on the principal components at either the word or the document level.
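To make the roles of these tables concrete, here is a small sketch of one way such a PCA could work. Several things are assumptions: `norm_docs=True` is read as L2-normalizing each review's row, the components are taken from an eigendecomposition of the term covariance matrix, and only the first component is computed (by power iteration) to keep the example dependency-free. The helper names and the toy matrix are hypothetical; a real run would use a linear algebra library and keep `k=10` components.

```python
import math


def l2_normalize_rows(X):
    """Scale every document row to unit length (assumed meaning of norm_docs)."""
    out = []
    for row in X:
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        out.append([v / norm for v in row])
    return out


def covariance(X):
    """Sample covariance between term columns."""
    n, m = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(m)]
    return [
        [sum((r[i] - means[i]) * (r[j] - means[j]) for r in X) / (n - 1)
         for j in range(m)]
        for i in range(m)
    ]


def matvec(C, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(c * x for c, x in zip(row, v)) for row in C]


def first_component(C, n_iter=200):
    """Power iteration: largest eigenvalue of C and its unit eigenvector."""
    v = [1.0 / (i + 1) for i in range(len(C))]  # arbitrary non-uniform start
    for _ in range(n_iter):
        w = matvec(C, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eig_val = sum(a * b for a, b in zip(v, matvec(C, v)))
    return eig_val, v


# Toy TFIDF rows (made-up values); columns: fairies, witches, stage.
X = l2_normalize_rows([[2, 0, 1], [3, 0, 1], [0, 2, 1], [0, 3, 1]])
C = covariance(X)
eig_val, loadings = first_component(C)
# Document scores on the first component (one column of a DCM-style table).
pc_scores = [sum(x * w for x, w in zip(row, loadings)) for row in X]
```

In this reading, `loadings` corresponds to one column of the loadings table, `eig_val` to one entry of the compinf table, and `pc_scores` to one column of the DCM table. In the toy data the two 'fairies' reviews land on one side of the component and the two 'witches' reviews on the other.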

In order to explore the corpus, visualizations were made based on the DCM table.

Below is a scatterplot of the reviews with PC0 and PC1 as the x-axis and y-axis respectively, with colors indicating the genre of the referenced Shakespeare play:

<img src="output/images/vis_pcs_0_1_Genre.png"  style="width: 600px;"/>

It can be observed that the reviews of plays with different genres overlap to some degree, but those in the 'Comedy' category tend to have higher PC1 values.

This can be further explored in detail by changing the colors to indicate the referenced work:

<img src="output/images/vis_pcs_0_1_Original Work.png"  style="width: 600px;"/>

The non-overlapping reviews seem to be limited to reviews of performances of *Midsummer Night's Dream* and *MacBeth*.

PC0 and PC1 are further examined with information on the loadings table:

<table><tr>
<td> <img src="output/images/pc0_sorted.png"  style="width: 130px;"/> </td>
<td> <img src="output/images/pc1_sorted.png"  style="width: 130px;"/> </td>
</tr></table>

PC0 has less distinguishable themes, but PC1 seems to differentiate between positive, dreamy nouns like 'fairies' and 'dream' and suspenseful ones like 'witches' and 'murder'. It therefore makes sense that PC1 separates reviews of comedy performances from reviews of tragedy performances.

It also makes sense that the works that involve these words the most sit at opposite ends. Although positive, dreamy words like 'fairies' and 'dream' are associated with many Shakespearian comedies to some degree, they are most prevalent in *Midsummer Night's Dream*, especially the word 'fairies'. This placed reviews of *Midsummer Night's Dream* performances on the high end of PC1. Similarly, words that evoke fear and suspense like 'witches' and 'murder' are common in tragedies, but especially so in *MacBeth*.

Similar results are obtained from principal components 2 and 3:

<table><tr>
<td> <img src="output/images/vis_pcs_2_3_Genre.png"  style="width: 350px;"/> </td>
    <td> <img src="output/images/vis_pcs_2_3_Original Work.png"  style="width: 350px;"/> </td>
<td> <img src="output/images/pc2_top10.png"  style="width: 130px;"/> </td>
</tr></table>

PC2 differentiates reviews of tragedies and comedies, as reviews of Shakespearian comedy performances are spread over a noticeably wider range of values than those of tragedies. However, upon further inspection, it can be seen that this differentiation is mostly based on reviews of *Twelfth Night* performances. The lower end of PC2 includes words strongly associated with this particular play. This further suggests that the principal components in this model can distinguish between different categories but are mostly driven by the original works that the reviewed performances are based on.

One side note is that the rating of a performance was not distinguishable by the principal components.

<img src="output/images/vis_pcs_0_1_Rating Category.png"  style="width: 500px;"/>

As one can see, there seems to be no distinction between positive and negative reviews.

### Topic Models (LDA)

For topic modeling, the contents of the `CORPUS` table and `VOCAB` table were filtered to include only nouns and verbs. The LDA model was then fitted with the hyperparameters `max_iter=5`, `learning_offset=50`, and `random_state=0`.

Below are the 5 most prominent topics in the corpus:

<img src="output/images/topic_top5.png"  style="width: 750px;"/>

Reviews of performances primarily discuss the components of a theatre performance like the acting, directing and technical design choices. 

There were minimal differences in the prominence of each topic by category. The importance of each topic by original work and rating category is explored in `analysis.ipynb` but not included in this report.

The importance of the topics to each genre is shown by the heatmap below:

<img src="output/images/topic_by_genre.png"  style="width: 750px;"/>

The only significant difference in importance is that of topic 19, which has the labels 'act', 'fight', 'come', 'play', 'scenes', 'speech', and 'portrays'. Topic 19 is heavily attributed to the Tragedy genre, which may be explained by one of the labels 'fight' as Shakespearian tragedies often depict violence, fighting and death.
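One plausible way to obtain per-genre topic importance is to average each topic's document weight within each genre. The sketch below assumes exactly that; the function `mean_topic_weight_by_group`, the topic labels, and the toy weights are hypothetical, and the rows are treated as whole reviews for simplicity.

```python
from collections import defaultdict


def mean_topic_weight_by_group(theta, group_of):
    """Average each topic's weight over the documents in each group."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for doc_id, weights in theta.items():
        group = group_of[doc_id]
        counts[group] += 1
        for topic_id, w in weights.items():
            sums[group][topic_id] += w
    return {
        g: {t: s / counts[g] for t, s in topics.items()}
        for g, topics in sums.items()
    }


# Made-up document-topic weights and genre labels.
theta = {
    "r1": {"T0": 0.75, "T1": 0.25},
    "r2": {"T0": 0.25, "T1": 0.75},
    "r3": {"T0": 0.25, "T1": 0.75},
}
genre = {"r1": "Comedy", "r2": "Tragedy", "r3": "Tragedy"}
by_genre = mean_topic_weight_by_group(theta, genre)
```

Each inner dictionary corresponds to one column of a genre-by-topic heatmap, so a topic that stands out for one genre shows up as a large gap between the two columns.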

### Word Embeddings (word2vec)

Prior to executing this analysis, the corpus was filtered to contain only nouns and verbs. The word2vec model was created with the hyperparameters `window = 2`, `vector_size = 256`, `min_count = 50`, and `workers = 4`, and the coordinates of each word were generated with the hyperparameters `learning_rate = 200`, `perplexity = 20`, `n_components = 2`, `init = 'random'`, `n_iter = 1000`, and `random_state = 42`.

The plot of the word embeddings is as below:

<img src="output/images/gensim.png"  style="width: 600px;"/>

The word2vec model plots words pertaining to the production, such as those involved with casting and design choices, in vicinity of each other. For example, the word 'design' is most associated with the words below:

<img src="output/images/gensim_similar_design.png"  style="width: 130px;"/>

And the word 'actor' with the ones below:

<img src="output/images/gensim_similar_actor.png"  style="width: 130px;"/>

However, the word2vec model does not quite encapsulate the content of the plays themselves. For example, when the word2vec model attempts to make an analogy between 'comedy' and 'tragedy', it associates the word 'life' with:

<img src="output/images/gensim_comedy_life_tragedy.png"  style="width: 130px;"/>

One would think that words like 'death' would be chosen as an appropriate answer, but the closest word that the model outputs is 'fight'.
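For readers unfamiliar with how such analogies are computed, the sketch below shows the usual vector-arithmetic idea: to solve a : b :: c : ?, rank the remaining words by cosine similarity to vec(b) - vec(a) + vec(c). The function names are hypothetical and the two-dimensional vectors are invented purely to exercise the code; they are not taken from the trained model and say nothing about its actual output. Library implementations differ in details such as vector normalization.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def analogy(vectors, a, b, c, topn=3):
    """Solve a : b :: c : ? by ranking words near vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    ranked = sorted(candidates, key=lambda w: cosine(target, vectors[w]), reverse=True)
    return ranked[:topn]


# Invented 2-D vectors, for illustration only.
toy_vectors = {
    "comedy": [1.0, 0.0],
    "tragedy": [0.0, 1.0],
    "life": [1.0, 1.0],
    "death": [0.1, 1.0],
    "fight": [0.6, 0.8],
    "design": [1.0, -0.5],
}
answers = analogy(toy_vectors, "comedy", "life", "tragedy")
```

Which word wins depends entirely on where the vectors sit, which is why a model trained on reviews, where plot vocabulary is sparse, can return an answer that differs from intuition.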

### Sentiment Analysis

For sentiment analysis, the corpus was filtered to only nouns and verbs and then combined with a pre-made sentiment lexicon that assigns emotions to individual terms (the sentiment 'trust' was omitted from this analysis). The sentiment of each term was then weighted by TFIDF.
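The weighting step can be sketched as follows, under stated assumptions: the lexicon marks each term with 0 or 1 per emotion plus an overall `sentiment` polarity, each value is multiplied by the term's TFIDF in the review, and the results are averaged over the matched terms. The lexicon entries, TFIDF values, and the function `weighted_sentiment` are all hypothetical.

```python
EMOTIONS = ["anger", "anticipation", "disgust", "fear", "joy", "sadness", "surprise"]

# Hypothetical lexicon entries: 1 if the term is tagged with the emotion.
# `sentiment` is the overall polarity of the term (-1, 0 or 1).
LEXICON = {
    "murder": {"anger": 1, "fear": 1, "sadness": 1, "sentiment": -1},
    "dream": {"anticipation": 1, "joy": 1, "sentiment": 1},
}


def weighted_sentiment(doc_tfidf, lexicon=LEXICON):
    """Average TFIDF-weighted emotion scores over the lexicon terms in a review."""
    columns = EMOTIONS + ["sentiment"]
    matched = [t for t in doc_tfidf if t in lexicon]
    if not matched:
        return {c: 0.0 for c in columns}
    return {
        c: sum(lexicon[t].get(c, 0) * doc_tfidf[t] for t in matched) / len(matched)
        for c in columns
    }


# Made-up TFIDF values for one review; 'stage' is not in the lexicon.
review_scores = weighted_sentiment({"murder": 0.5, "dream": 0.25, "stage": 0.1})
```

Because the scores come from the vocabulary alone, a review that simply describes a murder scene is scored as fearful and angry regardless of what the reviewer thought of the production.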

The average sentiments expressed in the entire corpus are shown below:

<img src="output/images/sentiment.png"  style="width: 500px;"/>

When grouped by genre, one is able to see the difference in sentiment between reviews of comedy performances and tragedy performances.

<img src="output/images/sentiment_genre.png"  style="width: 500px;"/>

Reviews of tragedy plays have a negative sentiment, with emotions such as fear and anger having the highest values. In contrast, reviews of comedy plays have an overall positive sentiment with joy having the highest value.

Is this because of the emotions associated with the content of the plays, or is it because of how the performance was perceived by the reviewer? Further analysis suggests the former to be true. Below are sentiments of the corpus grouped by the genre of the referenced Shakespeare play and the rating category.

<img src="output/images/sentiment_rating_genre.png"  style="width: 750px;"/>

For both tragedies and comedies, the sentiment expressed in the reviews does not change in terms of the ratio between individual emotions. In fact, reviews that negatively rate the performances do not express particularly more negative sentiments; instead, they simply do not express as much emotion.

To the author of this paper, this makes sense. Theatrical performances are considered to be of high quality when they evoke strong emotions within the audience, and a reviewer cannot write an emotionally charged review if they did not feel emotions from a performance.

An exception to this is when a performance is extremely negatively received. Below is the sentiment of the corpus, grouped by the rating value of 1 to 5:

<img src="output/images/sentiment_rating_detail.png"  style="width: 500px;"/>

Reviews that gave performances a rating of 1 have an extremely negative overall sentiment value and a high anger value compared to the other lower rating values like 2 and 3. Those who were angered enough to write about a performance that they deemed to be the lowest quality possible would be quite vocal in their reviews.

## Conclusion

As indicated by topic modeling and word embedding, theatre performance reviews are primarily concerned with the components of the performance itself, such as the acting and technical designs, rather than the plot of the show. However, PCA and sentiment analysis were sensitive to the plot of the Shakespeare play that the performance referenced. 

This is because both topic modeling and word embedding are methods for interpreting the 'internal map' of the corpus by using the distributions of words in it. On the other hand, PCA identifies and combines features with maximum variance; it must have captured the difference between reviews of one particular Shakespeare play and another. As for the sentiment analysis, the one employed in this project is lexicon-based, so it makes sense that reviews discussing performances of tragic plays would be interpreted as tragic.

## Appendix

* `CORPUS.csv`
  * URL: https://github.com/ak7ra/eta_theatre_reviews/blob/main/output/CORPUS.csv
  * Columns:
    * `review_id` : string variable used to identify each document in corpus
    * `sent_id` : numerical variable used to identify each sentence in a document
    * `token_id` : numerical variable used to identify each token in a sentence
    * `term_str` : string variable containing the term of each token
    * `pos_tuple` : tuple variable containing the term string and the associated part-of-speech
    * `pos` : string variable indicating the part-of-speech of term string

* `LIB.csv`
  * URL: https://github.com/ak7ra/eta_theatre_reviews/blob/main/output/LIB.csv
  * Columns:
    * `Original Work` : string variable indicating the Shakespeare play that the reviewed performance was referencing
    * `Review Title` : string variable containing the title of each review document
    * `Review Author` : string variable indicating the author of review document
    * `Content` : string variable containing minimally parsed content of review document
    * `Overall Rating` : numerical variable indicating the rating given to the reviewed performance. 1 is the lowest rating, and 5 is the highest rating
    * `Genre` : string variable indicating the genre of the referenced Shakespeare play
    * `Rating Category` : string variable indicating whether a review positively or negatively rated a performance 

* `VOCAB.csv`
  * URL: https://github.com/ak7ra/eta_theatre_reviews/blob/main/output/VOCAB.csv
  * Columns:
    * `term_str` : string variable containing the vocabulary in the corpus
    * `n` : numerical variable indicating the number of occurrences of each term
    * `n_chars` : numerical variable indicating the number of characters in each term
    * `p` : numerical variable indicating the probability of the term occurring in the corpus
    * `i` : numerical variable containing the negative log value of the variable `p`
    * `max_pos` : string variable indicating the most common part-of-speech for each term
    * `n_pos` : numerical variable indicating the number of parts of speech associated with each term
    * `stop` : numerical/boolean variable indicating whether a term is a stop word or not
    * `stem_porter` : string variable containing the stem of each term as defined by Porter
    * `stem_snowball` : string variable containing the stem of each term as defined by Snowball
    * `stem_lancaster` : string variable containing the stem of each term as defined by Lancaster
    * `dfidf` : numerical variable containing the DFIDF value for each term

* `reduced_tfidf.csv`
  * URL: https://github.com/ak7ra/eta_theatre_reviews/blob/main/output/reduced_tfidf.csv
  * Columns:
    * `review_id` : string variable used to identify each document in corpus
    * This table contains as columns 1000 terms that were selected as most important for PCA. They contain the TFIDF values associated with the terms
* `pca_dcm.csv`
  * URL: https://github.com/ak7ra/eta_theatre_reviews/blob/main/output/pca_dcm.csv
  * Columns:
    * `review_id` : string variable used to identify each document in corpus
    * This table contains 10 columns: `PC0`, `PC1`, `PC2`, `PC3`, `PC4`, `PC5`, `PC6`, `PC7`, `PC8`, `PC9`. These columns contain information on the 10 principal components generated through PCA, and indicate the extent to which a review document is associated with a principal component
    * This table contains the columns `Genre`, `Original Work`, `Review Author` and `Rating Category` which are taken from and explained under the `LIB` table
* `pca_loadings.csv`
  * URL: https://github.com/ak7ra/eta_theatre_reviews/blob/main/output/pca_loadings.csv
  * Columns:
    * `term_str` : string variable used to identify 1000 most important terms in corpus
    * This table contains 10 columns: `PC0`, `PC1`, `PC2`, `PC3`, `PC4`, `PC5`, `PC6`, `PC7`, `PC8`, `PC9`. These columns contain information on the 10 principal components generated through PCA, and indicate the extent to which a term is associated with a principal component
* `pca_compinf.csv`
  * URL: https://github.com/ak7ra/eta_theatre_reviews/blob/main/output/pca_compinf.csv
  * Columns:
    * `pc_id` : string variable identifying each unique principal component
    * `eig_val` : numerical variable containing the eigenvalues associated with each principal component
    * This table contains as columns 1000 terms that were selected as most important for PCA. They contain the loadings of each term in relation to the principal component

* `topic.csv`
  * URL: https://github.com/ak7ra/eta_theatre_reviews/blob/main/output/topic.csv
  * Columns:
    * `topic_id` : string variable identifying the topics generated by topic modeling
    * This table contains 7 columns numbered from 0 to 6. Each column contains one string associated with the topic
    * `label` : string variable combining the values of `topic_id` and the 7 numbered columns discussed above
    * `doc_weight_sum` : numerical variable containing the sum of weights of each topic
    * `term_freq` : numerical variable containing the entropy of each topic
    * This table contains as columns the unique values of the columns `Genre`, `Rating Category` and `Original Work` from the `LIB` table. They contain numerical values indicating the prevalence of each topic within each category
* `topic_phi.csv`
  * URL: https://github.com/ak7ra/eta_theatre_reviews/blob/main/output/topic_phi.csv
  * Columns:
    * `topic_id` : string variable identifying the topics generated by topic modeling
    * This table contains as columns the 1000 terms that were selected as most important for topic modeling. They contain numerical values indicating the probability of the term appearing in each topic
* `topic_theta.csv`
  * URL: https://github.com/ak7ra/eta_theatre_reviews/blob/main/output/topic_theta.csv
  * Columns:
    * `review_id` : string variable used to identify each document in corpus
    * `sent_id` : numerical variable used to identify each sentence in a document
    * This table contains 25 columns, each associated with a topic generated by the topic model. They contain numerical values indicating the probability of each topic appearing in a sentence of a review document

* `word2vec.csv`
  * URL: https://github.com/ak7ra/eta_theatre_reviews/blob/main/output/word2vec.csv
  * Columns:
    * `term_str` : string variable containing select terms from corpus
    * `vector` : list variable containing vectors for each term as generated by the word2vec model
    * `x` : numerical variable containing the x-value of each term
    * `y` : numerical variable containing the y-value of each term
    * `n` : numerical variable indicating the number of occurrences for each term
    * `pos_group` : string variable indicating the associated part-of-speech for each term
    * `log(freq)` : numerical variable containing the log value of the number of occurrences of each term

* `sentiment.csv`
  * URL: https://github.com/ak7ra/eta_theatre_reviews/blob/main/output/sentiment.csv
  * Columns:
    * `review_id` : string variable used to identify each document in corpus
    * `term_str` : string variable containing the term of each token
    * `tfidf` : numerical variable containing the TFIDF value of each term
    * This table contains the columns `anger`, `anticipation`, `disgust`, `fear`, `joy`, `sadness`, `surprise` and `sentiment`. The columns contain the numerical value indicating the level of their respective emotions expressed by each term, weighted by TFIDF
    * This table contains the columns `Genre`, `Original Work`, `Review Author`, `Overall Rating` and `Rating Category` from the `LIB` table