
# Analysis of Female Poets with spaCy
## by Yaran Lu

### Introduction

### Research Questions
This project uses spaCy to analyze linguistic features, such as word choice and syntactic structure, in the poetry of four women writing in and around the 19th century. It then explores how these features relate to female themes such as self-perception, emotional experience, and female identity.

This research question matters because a computational reading of women's poetry can deepen our understanding of women's literature, of female perspectives, and of how gender is expressed in literary writing. Such work helps bring the literary achievements of women writers to light, broadens awareness of female voices and perspectives, and may help reduce gender bias and promote gender equality in the literary field.

### Brief Description
(1) The corpus itself
This corpus contains four files, each an excerpt from the poems of one female writer. Poetry was chosen as the object of study because, among literary genres, it arguably shows the characteristics of language and diction most directly. All four writers hold an important place in world literature in and around the 19th century.
The four files are, respectively:
Sonnets from the Portuguese, a collection of 44 love sonnets by the British poet Elizabeth Barrett Browning (6 March 1806 to 29 June 1861).
Poems by the British poet Christina Rossetti (5 December 1830 to 29 December 1894), containing her famous poem "Goblin Market" and other poems.
Poems by the American poet Edna St. Vincent Millay (22 February 1892 to 19 October 1950), who won the 1923 Pulitzer Prize for Poetry.
Poems by Emily Dickinson, Series One, an excerpt from the first series of the collected poems of the American poet Emily Dickinson (10 December 1830 to 15 May 1886).

(2) Target audience and the intended use of the corpus
The target audience of this project is readers and researchers of British and American literature or women's literature.
The analysis is intended to help them understand, through data, what characterizes the poetry of each of these four writers.

(3) Text selection criteria
This project selects four highly representative female writers of the period, two British and two American. The selected works are among the most representative or distinctive of each writer's career. Because of the limited scope of this project, the full collections are not used; only the opening part of each collection is excerpted for study.

(4) The data collection process
The plain-text (.txt) files of the corpus were downloaded from the Project Gutenberg website: https://www.gutenberg.org/

(5) Cleaning and/or preprocessing steps
This project uses a regular expression to collapse each run of whitespace in the text into a single space, and then trims leading and trailing spaces.
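
The cleaning step can be sketched outside pandas as follows. This is a minimal illustration that uses only Python's `re` module; the function name `normalize_whitespace` is an assumption introduced here and does not appear in the notebook.

```python
import re

def normalize_whitespace(text):
    """Collapse every run of whitespace into one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()

sample = "The Project Gutenberg eBook of Poems\n    \nThis ebook is for the use of anyone"
print(normalize_whitespace(sample))
```

Line breaks are whitespace too, so verse line boundaries are not preserved after this step.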

(6) Annotations that I've added and the tools used to add them
The code uses the spaCy library, which provides NLP functions including part-of-speech tagging, lemmatization, and named entity recognition. The following annotations are used in this code:
Part-of-Speech Tagging: token.pos_ gives the part of speech of each token.
Lemmatization: token.lemma_ gives the lemma (base form) of each token.
Named Entity Recognition: doc.ents gives the named entities in a text, and each entity's label_ gives its type.
These tools are applied to each text for part-of-speech tagging, lemmatization, and named entity recognition.

(7) The format of the files in the corpus and the description of the columns in the CSV file
The code handles two file formats: .txt and .csv. Each .txt file holds the text of one collection, while the .csv file holds the metadata. The CSV file has the columns WORK_ID, AUTHOR, AUTHOR'S NATIONALITY, TITLE, RELEASE DATE, and LANGUAGE. The code renames WORK_ID to Filename so that the metadata can be merged with the texts.


(8) The quality checks
The code creates a new DataFrame, final_paper_df, by merging the texts with the metadata on the Filename column. Because pandas merges with an inner join by default, only rows that have both text content and metadata are retained, which acts as a basic check that texts and metadata correspond.
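
An inner join drops unmatched rows silently, so it can help to list them before merging. The sketch below shows one way to do that with plain sets; the helper `report_unmatched` and the sample names are hypothetical and are not part of the notebook.

```python
def report_unmatched(text_names, metadata_names):
    """Return the names that appear on only one side of the merge."""
    text_set, meta_set = set(text_names), set(metadata_names)
    return {
        "text_without_metadata": sorted(text_set - meta_set),
        "metadata_without_text": sorted(meta_set - text_set),
    }

# Hypothetical names for illustration only
text_names = ["Poems A", "Poems B", "Poems C"]
metadata_names = ["Poems A", "Poems B", "Poems D"]
print(report_unmatched(text_names, metadata_names))
```

In the notebook, the two arguments would be the Filename columns of paper_df and metadata_df after the renaming step. Two empty lists would mean that every text found its metadata row.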


In [219]:
!pip install spaCy
!pip install plotly
!pip install nbformat==5.1.2



In [220]:
import spacy

!spacy download en_core_web_sm

import os

from spacy import displacy

import pandas as pd
pd.options.mode.chained_assignment = None 

import plotly.graph_objects as go
import plotly.express as px

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 4.5 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [221]:
texts = []
file_names = []

corpus_dir = 'txt_files_Yaran'

for _file_name in os.listdir(corpus_dir):
    if _file_name.endswith('.txt'):
        # The with-block closes each file as soon as it has been read
        with open(os.path.join(corpus_dir, _file_name), 'r', encoding='utf-8') as f:
            texts.append(f.read())
        file_names.append(_file_name)

In [222]:
d = {'Filename':file_names,'Text':texts}

In [223]:
paper_df = pd.DataFrame(d)

In [224]:
paper_df.head()

Unnamed: 0,Filename,Text
0,Poems by Edna St. Vincent Millay.txt,The Project Gutenberg eBook of Poems\n \nTh...
1,"Poems by Emily Dickinson, Series One by Emily ...",The Project Gutenberg eBook of Poems by Emily ...
2,Sonnets from the Portuguese by Elizabeth Barre...,The Project Gutenberg eBook of Sonnets from th...
3,Poems by Christina Georgina Rossetti.txt,The Project Gutenberg eBook of Poems\n \nTh...


In [225]:
# Raw string so that \s reaches the regex engine as written
paper_df['Text'] = paper_df['Text'].str.replace(r'\s+', ' ', regex=True).str.strip()
paper_df.head()

Unnamed: 0,Filename,Text
0,Poems by Edna St. Vincent Millay.txt,The Project Gutenberg eBook of Poems This eboo...
1,"Poems by Emily Dickinson, Series One by Emily ...",The Project Gutenberg eBook of Poems by Emily ...
2,Sonnets from the Portuguese by Elizabeth Barre...,The Project Gutenberg eBook of Sonnets from th...
3,Poems by Christina Georgina Rossetti.txt,The Project Gutenberg eBook of Poems This eboo...


In [226]:
metadata_df = pd.read_csv('metadata_Yaran.csv')
metadata_df.head()

Unnamed: 0,WORK_ID,AUTHOR,AUTHOR'S NATIONALITY,TITLE,RELEASE DATE,LANGUAGE
0,Poems by Christina Georgina Rossetti,Christina Georgina Rossetti,ENGLAND,Poems,5-Sep-06,English
1,Poems by Edna St. Vincent Millay,Edna St. Vincent Millay,AMERICA,Poems,31-Mar-19,English
2,"Poems by Emily Dickinson, Series One by Emily ...",Emily Dickinson,AMERICA,"Poems by Emily Dickinson, Series One",1-Jun-01,English
3,Sonnets from the Portuguese by Elizabeth Barre...,Elizabeth Barrett Browning,ENGLAND,Sonnets from the Portuguese,1-Dec-99,English


In [228]:
# Escape the dot and anchor at the end so that only the '.txt' extension is removed
paper_df['Filename'] = paper_df['Filename'].str.replace(r'\.txt$', '', regex=True)

metadata_df.rename(columns={"WORK_ID": "Filename"}, inplace=True)

print(metadata_df.columns)
print(paper_df.columns)

Index(['Filename', 'AUTHOR', 'AUTHOR'S NATIONALITY', 'TITLE', 'RELEASE DATE',
       'LANGUAGE'],
      dtype='object')
Index(['Filename', 'Text'], dtype='object')


In [229]:
final_paper_df = metadata_df.merge(paper_df,on='Filename')

In [230]:
final_paper_df.head()

Unnamed: 0,Filename,AUTHOR,AUTHOR'S NATIONALITY,TITLE,RELEASE DATE,LANGUAGE,Text
0,Poems by Christina Georgina Rossetti,Christina Georgina Rossetti,ENGLAND,Poems,5-Sep-06,English,The Project Gutenberg eBook of Poems This eboo...
1,Poems by Edna St. Vincent Millay,Edna St. Vincent Millay,AMERICA,Poems,31-Mar-19,English,The Project Gutenberg eBook of Poems This eboo...
2,"Poems by Emily Dickinson, Series One by Emily ...",Emily Dickinson,AMERICA,"Poems by Emily Dickinson, Series One",1-Jun-01,English,The Project Gutenberg eBook of Poems by Emily ...
3,Sonnets from the Portuguese by Elizabeth Barre...,Elizabeth Barrett Browning,ENGLAND,Sonnets from the Portuguese,1-Dec-99,English,The Project Gutenberg eBook of Sonnets from th...


## Text Enrichment with spaCy

In [231]:
nlp = spacy.load('en_core_web_sm')
print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


In [232]:
def process_text(text):
    return nlp(text)

In [233]:
final_paper_df['Doc'] = final_paper_df['Text'].apply(process_text)

### Text Reduction

In [234]:
def get_token(doc):
    return [(token.text) for token in doc]

In [235]:
final_paper_df['Tokens'] = final_paper_df['Doc'].apply(get_token)
final_paper_df.head()

Unnamed: 0,Filename,AUTHOR,AUTHOR'S NATIONALITY,TITLE,RELEASE DATE,LANGUAGE,Text,Doc,Tokens
0,Poems by Christina Georgina Rossetti,Christina Georgina Rossetti,ENGLAND,Poems,5-Sep-06,English,The Project Gutenberg eBook of Poems This eboo...,"(The, Project, Gutenberg, eBook, of, Poems, Th...","[The, Project, Gutenberg, eBook, of, Poems, Th..."
1,Poems by Edna St. Vincent Millay,Edna St. Vincent Millay,AMERICA,Poems,31-Mar-19,English,The Project Gutenberg eBook of Poems This eboo...,"(The, Project, Gutenberg, eBook, of, Poems, Th...","[The, Project, Gutenberg, eBook, of, Poems, Th..."
2,"Poems by Emily Dickinson, Series One by Emily ...",Emily Dickinson,AMERICA,"Poems by Emily Dickinson, Series One",1-Jun-01,English,The Project Gutenberg eBook of Poems by Emily ...,"(The, Project, Gutenberg, eBook, of, Poems, by...","[The, Project, Gutenberg, eBook, of, Poems, by..."
3,Sonnets from the Portuguese by Elizabeth Barre...,Elizabeth Barrett Browning,ENGLAND,Sonnets from the Portuguese,1-Dec-99,English,The Project Gutenberg eBook of Sonnets from th...,"(The, Project, Gutenberg, eBook, of, Sonnets, ...","[The, Project, Gutenberg, eBook, of, Sonnets, ..."


In [236]:
tokens = final_paper_df[['Text', 'Tokens']].copy()
tokens.head()

Unnamed: 0,Text,Tokens
0,The Project Gutenberg eBook of Poems This eboo...,"[The, Project, Gutenberg, eBook, of, Poems, Th..."
1,The Project Gutenberg eBook of Poems This eboo...,"[The, Project, Gutenberg, eBook, of, Poems, Th..."
2,The Project Gutenberg eBook of Poems by Emily ...,"[The, Project, Gutenberg, eBook, of, Poems, by..."
3,The Project Gutenberg eBook of Sonnets from th...,"[The, Project, Gutenberg, eBook, of, Sonnets, ..."


#### Lemmatization

In [237]:
def get_lemma(doc):
    return [(token.lemma_) for token in doc]

final_paper_df['Lemmas'] = final_paper_df['Doc'].apply(get_lemma)

In [238]:
from collections import Counter

top_100_lemmas_per_doc = []

# Count lemma frequencies separately for each document
for i, lemmas in enumerate(final_paper_df['Lemmas']):
    lemma_counts = Counter(lemmas)
    top_100_lemmas = lemma_counts.most_common(100)
    top_100_lemmas_per_doc.append(top_100_lemmas)
    print(f"Top 100 lemmas in document {i + 1}:")
    print(top_100_lemmas)

Top 100 lemmas in document 1:
[(',', 394), ('the', 181), ('and', 153), ('"', 140), ('.', 95), ('a', 91), ('-', 76), ('in', 75), ('her', 71), (':', 70), ('of', 68), ('to', 64), ('she', 60), (';', 57), ('be', 47), ('one', 47), ('with', 45), ('not', 45), ('like', 44), ('I', 39), ('their', 34), ('they', 32), ('for', 31), ('at', 31), ('come', 27), ('Laura', 27), ('buy', 26), ('but', 25), ('Lizzie', 24), ('by', 23), ('fruit', 23), ('goblin', 22), ('no', 21), ('*', 21), ('you', 20), ('or', 20), ('from', 18), ('as', 18), ('my', 17), ('it', 16), ("'s", 16), ('?', 16), ('we', 16), ('man', 16), ('A', 15), ('day', 15), ('that', 15), ('have', 14), ('golden', 14), ('up', 13), ('all', 12), ('on', 12), ('would', 12), ('should', 12), ('kiss', 12), ('cry', 11), ('down', 11), ('night', 11), ('this', 10), ('an', 10), ('Song', 10), ('when', 10), ('life', 10), ('say', 10), ('hear', 10), ('look', 10), ('if', 9), ('will', 9), ('head', 9), ('our', 9), ('glen', 9), ('then', 9), ('out', 8), ('heart', 8), ('do', 

#### Summary
Document 1 (Poems by Christina Georgina Rossetti):

Commas, articles, conjunctions, and other punctuation marks (such as double quotation marks and periods) top the list, which may indicate that the author pays attention to the flow of the storyline and the connection between sentences. Female characters are prominent: the names "Laura" and "Lizzie" appear 27 and 24 times, which may reflect the important role women play in the poem.
Words about feelings and relationships (for example "kiss", "cry", and "heart") also appear, possibly referring to feelings between women, friendship, or sisterhood.

Document 2 (Poems by Edna St. Vincent Millay):
Commas, articles, conjunctions, and punctuation marks are again frequent, suggesting similar attention to how sentences connect.
A high frequency of personal pronouns (e.g., "I," "you," "she") may indicate that the poems focus on the personal emotions and experiences of the author, reader, or other characters.


Document 3 (Poems by Emily Dickinson, Series One):

Commas and other punctuation marks appear more frequently, while articles and conjunctions appear less frequently.
Time-related words such as "time", together with pronouns and exclamation points, are used often, which may point to reflection on time, emotion, and life.

Document 4 (Sonnets from the Portuguese by Elizabeth Barrett Browning):
The vocabulary highlights love and emotion. The frequent occurrence of words such as "love", "soul", and "heart" reflects the author's strong focus on emotions and inner experience.
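
The top of each list is dominated by punctuation and function words, which makes thematic vocabulary harder to see. One possible refinement, sketched here with the standard library only, is to drop non-alphabetic items and a stop list before counting. The function `top_content_lemmas` and the tiny `STOP_WORDS` set are assumptions made for this illustration. In the notebook itself, spaCy's `token.is_punct` and `token.is_stop` attributes could serve the same purpose.

```python
from collections import Counter

# Tiny illustrative stop list (an assumption, not spaCy's built-in list)
STOP_WORDS = {"the", "and", "a", "in", "of", "to", "be", "with", "not", "her", "she", "i"}

def top_content_lemmas(lemmas, n=10, stop_words=STOP_WORDS):
    """Count lemmas after dropping punctuation, symbols, and stop words."""
    counts = Counter(
        lemma for lemma in lemmas
        if lemma.isalpha() and lemma.lower() not in stop_words
    )
    return counts.most_common(n)

# Hypothetical lemma list for illustration only
sample = [",", "the", "goblin", "fruit", ",", "and", "Laura", "goblin", ".", "fruit", "goblin"]
print(top_content_lemmas(sample, n=2))
```

Applied to each entry of the Lemmas column, a filter of this kind would bring words such as "goblin", "fruit", or "love" to the top of the lists.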

In [239]:
love_in_tokens = final_paper_df['Tokens'].apply(lambda x: x.count('love')).sum()
love_in_lemmas = final_paper_df['Lemmas'].apply(lambda x: x.count('love')).sum()
print(f'"love" appears in the text tokens column {love_in_tokens} times.')
print(f'"love" appears in the lemmas column {love_in_lemmas} times.')

"love" appears in the text tokens column 75 times.
"love" appears in the lemmas column 91 times.

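The gap between the two counts is consistent with how lemmatization works: inflected forms are counted under their base form, so the lemma count can only match or exceed the count of the exact token. The toy example below illustrates the effect. The (token, lemma) pairs are hypothetical and are not taken from the corpus.

```python
# Hypothetical (token, lemma) pairs for illustration only
pairs = [
    ("love", "love"),
    ("loved", "love"),
    ("loves", "love"),
    ("soul", "soul"),
    ("loving", "love"),
]

# Exact-match count on the surface form
token_hits = sum(1 for token, _ in pairs if token == "love")
# Count on the base form, which also collects the inflected forms
lemma_hits = sum(1 for _, lemma in pairs if lemma == "love")

print(token_hits, lemma_hits)
```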

In [240]:
def get_pos(doc):
    return [(token.pos_, token.tag_) for token in doc]

final_paper_df['POS'] = final_paper_df['Doc'].apply(get_pos)

In [241]:
list(final_paper_df['POS'])

[[('DET', 'DT'),
  ('PROPN', 'NNP'),
  ('PROPN', 'NNP'),
  ('PROPN', 'NNP'),
  ('ADP', 'IN'),
  ('PROPN', 'NNPS'),
  ('DET', 'DT'),
  ('NOUN', 'NN'),
  ('AUX', 'VBZ'),
  ('ADP', 'IN'),
  ('DET', 'DT'),
  ('NOUN', 'NN'),
  ('ADP', 'IN'),
  ('PRON', 'NN'),
  ('ADV', 'RB'),
  ('ADP', 'IN'),
  ('DET', 'DT'),
  ('PROPN', 'NNP'),
  ('PROPN', 'NNP'),
  ('CCONJ', 'CC'),
  ('ADJ', 'JJS'),
  ('ADJ', 'JJ'),
  ('NOUN', 'NNS'),
  ('ADP', 'IN'),
  ('DET', 'DT'),
  ('NOUN', 'NN'),
  ('ADP', 'IN'),
  ('DET', 'DT'),
  ('NOUN', 'NN'),
  ('CCONJ', 'CC'),
  ('ADP', 'IN'),
  ('ADV', 'RB'),
  ('PRON', 'DT'),
  ('NOUN', 'NNS'),
  ('ADV', 'RB'),
  ('PUNCT', '.'),
  ('PRON', 'PRP'),
  ('AUX', 'MD'),
  ('VERB', 'VB'),
  ('PRON', 'PRP'),
  ('PUNCT', ','),
  ('VERB', 'VB'),
  ('PRON', 'PRP'),
  ('ADV', 'RB'),
  ('CCONJ', 'CC'),
  ('VERB', 'VB'),
  ('VERB', 'VB'),
  ('VERB', 'VB'),
  ('PRON', 'PRP'),
  ('ADP', 'IN'),
  ('DET', 'DT'),
  ('NOUN', 'NNS'),
  ('ADP', 'IN'),
  ('DET', 'DT'),
  ('PROPN', 'NNP'),
  ('PROP

In [242]:
spacy.explain("IN")

'conjunction, subordinating or preposition'

In [243]:
def extract_proper_nouns(doc):
    return [token.text for token in doc if token.pos_ == 'PROPN']

final_paper_df['Proper_Nouns'] = final_paper_df['Doc'].apply(extract_proper_nouns)

In [244]:
list(final_paper_df.loc[[1], 'Proper_Nouns'])

[['Project',
  'Gutenberg',
  'eBook',
  'Poems',
  'United',
  'States',
  'Project',
  'Gutenberg',
  'License',
  'United',
  'States',
  'eBook',
  'Edna',
  'St.',
  'Vincent',
  'Millay',
  'Release',
  'March',
  'English',
  'Tim',
  'Lindell',
  'Charlie',
  'Howard',
  'Online',
  'Proofreading',
  'Team',
  'Internet',
  'Archive',
  'Canadian',
  'Libraries',
  'PROJECT',
  'GUTENBERG',
  'Tim',
  'Lindell',
  'Charlie',
  'Howard',
  'Online',
  'Proofreading',
  'Team',
  'Internet',
  'Archive',
  'Canadian',
  'Libraries',
  '_',
  'Edna',
  'St.',
  'Vincent',
  'Millay',
  'Poems',
  '_',
  'Edna',
  'St.',
  'Vincent',
  'Millay',
  'London',
  'Martin',
  'Secker',
  'Great',
  'Britain',
  'London',
  'Martin',
  'Secker',
  'Ltd.',
  '_',
  'Renascence',
  'God',
  'World',
  'Afternoon',
  'Hill',
  'Journey',
  'Sorrow',
  'Tavern',
  'Life',
  'Little',
  'Ghost',
  'Kin',
  'Sorrow',
  'Songs',
  'Shattering',
  'Shroud',
  'Dream',
  'Indifference',
  'Witch'
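
Many of the proper nouns listed above ('Project', 'Gutenberg', 'License', 'Online', 'Proofreading', 'Team') come from the Project Gutenberg header rather than from the poems. One optional way to keep them out of the counts is to cut each text at the start and end markers that Project Gutenberg files usually carry. This is a sketch and not a step the notebook performs: the function name is hypothetical, and the marker wording is an assumption that should be checked against each file.

```python
import re

# Assumed marker wording; check it against the downloaded files
START_RE = re.compile(
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*",
    re.IGNORECASE | re.DOTALL,
)
END_RE = re.compile(
    r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*",
    re.IGNORECASE | re.DOTALL,
)

def strip_gutenberg_boilerplate(text):
    """Return the text between the start and end markers, if they are present."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    begin = start.end() if start else 0
    finish = end.start() if end and end.start() >= begin else len(text)
    return text[begin:finish].strip()

sample = (
    "The Project Gutenberg eBook of Poems "
    "*** START OF THE PROJECT GUTENBERG EBOOK POEMS *** "
    "Body text here. "
    "*** END OF THE PROJECT GUTENBERG EBOOK POEMS *** License text"
)
print(strip_gutenberg_boilerplate(sample))
```

If a marker is missing, the function keeps the text from the beginning or up to the end, so nothing is removed by accident.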

#### Named Entity Recognition

In [245]:
labels = nlp.get_pipe("ner").labels

for label in labels:
    print(label + ' : ' + spacy.explain(label))

CARDINAL : Numerals that do not fall under another type
DATE : Absolute or relative dates or periods
EVENT : Named hurricanes, battles, wars, sports events, etc.
FAC : Buildings, airports, highways, bridges, etc.
GPE : Countries, cities, states
LANGUAGE : Any named language
LAW : Named documents made into laws.
LOC : Non-GPE locations, mountain ranges, bodies of water
MONEY : Monetary values, including unit
NORP : Nationalities or religious or political groups
ORDINAL : "first", "second", etc.
ORG : Companies, agencies, institutions, etc.
PERCENT : Percentage, including "%"
PERSON : People, including fictional
PRODUCT : Objects, vehicles, foods, etc. (not services)
QUANTITY : Measurements, as of weight or distance
TIME : Times smaller than a day
WORK_OF_ART : Titles of books, songs, etc.


In [246]:
def extract_named_entities(doc):
    return [ent.label_ for ent in doc.ents]
final_paper_df['Named_Entities'] = final_paper_df['Doc'].apply(extract_named_entities)
final_paper_df['Named_Entities']

0    [GPE, ORG, GPE, PRODUCT, PERSON, DATE, LAW, PE...
1    [GPE, ORG, GPE, PRODUCT, DATE, LAW, PERSON, PE...
2    [GPE, ORG, GPE, PRODUCT, PERSON, PERSON, PERSO...
3    [NORP, GPE, ORG, GPE, PRODUCT, NORP, PERSON, D...
Name: Named_Entities, dtype: object

In [247]:
def extract_named_entity_spans(doc):
    # Return the entity spans themselves (the previous cell returned only their labels)
    return [ent for ent in doc.ents]

final_paper_df['NE_Words'] = final_paper_df['Doc'].apply(extract_named_entity_spans)
final_paper_df['NE_Words']

0    [(the, United, States), (the, Project, Gutenbe...
1    [(the, United, States), (the, Project, Gutenbe...
2    [(the, United, States), (the, Project, Gutenbe...
3    [(Portuguese), (the, United, States), (the, Pr...
Name: NE_Words, dtype: object

In [248]:
doc = final_paper_df['Doc'][2]

displacy.render(doc, style='ent', jupyter=True)

### Download Enriched Dataset

In [249]:
final_paper_df.to_csv('Yaran_MICUSP_papers_with_spaCy_tags.csv')

## Analysis of Linguistic Annotations

### Part of Speech Analysis

In [250]:
doc = nlp("This is 'an' example? sentence")

print(doc.count_by(spacy.attrs.POS))

{95: 1, 87: 1, 97: 3, 90: 1, 92: 2}


In [251]:
num_pos = doc.count_by(spacy.attrs.POS)

dictionary = {}
for k,v in sorted(num_pos.items()):
  dictionary[doc.vocab[k].text] = v

dictionary

{'AUX': 1, 'DET': 1, 'NOUN': 2, 'PRON': 1, 'PUNCT': 3}

In [253]:
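pos_analysis_df = final_paper_df[['Filename','Doc']].copy()

num_list = []

def get_pos_tags(doc):
    # Map spaCy's integer POS ids to readable tag names with their counts
    dictionary = {}
    num_pos = doc.count_by(spacy.attrs.POS)
    for k,v in sorted(num_pos.items()):
        dictionary[doc.vocab[k].text] = v
    num_list.append(dictionary)
    return dictionary

# Assign a new column; .loc['C_POS'] would add a row labelled 'C_POS' instead
pos_analysis_df['C_POS'] = pos_analysis_df['Doc'].apply(get_pos_tags)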

In [254]:
pos_counts = pd.DataFrame(num_list)
columns = list(pos_counts.columns)
idx = 0
new_col = pos_analysis_df['Doc']
pos_counts.insert(loc=idx, column='Doc', value=new_col)

pos_counts

Unnamed: 0,Doc,ADJ,ADP,ADV,AUX,CCONJ,DET,INTJ,NOUN,NUM,PART,PRON,PROPN,PUNCT,SCONJ,SYM,VERB,X
0,"(The, Project, Gutenberg, eBook, of, Poems, Th...",270,445,128,110,202,330,13,856,59,103,399,537,883,59,1,616,8
1,"(The, Project, Gutenberg, eBook, of, Poems, Th...",371,567,261,338,389,594,31,1109,116,126,759,380,1005,143,3,770,4
2,"(The, Project, Gutenberg, eBook, of, Poems, by...",214,294,157,193,137,341,6,677,25,79,345,212,739,120,1,414,5
3,"(The, Project, Gutenberg, eBook, of, Sonnets, ...",304,478,243,163,185,314,18,827,27,113,557,256,723,136,1,559,2


In [255]:
print(pos_counts.columns)

Index(['Doc', 'ADJ', 'ADP', 'ADV', 'AUX', 'CCONJ', 'DET', 'INTJ', 'NOUN',
       'NUM', 'PART', 'PRON', 'PROPN', 'PUNCT', 'SCONJ', 'SYM', 'VERB', 'X'],
      dtype='object')


In [256]:
# Every document has a distinct NOUN count, so each group holds a single row
# and the "mean" simply reorders the per-document counts by NOUN.
# numeric_only=True leaves out the non-numeric 'Doc' column.
average_pos_df = pos_counts.groupby(['NOUN']).mean(numeric_only=True)

average_pos_df = average_pos_df.round(0)

average_pos_df = average_pos_df.reset_index()

average_pos_df



Unnamed: 0,NOUN,ADJ,ADP,ADV,AUX,CCONJ,DET,INTJ,NUM,PART,PRON,PROPN,PUNCT,SCONJ,SYM,VERB,X
0,677,214.0,294.0,157.0,193.0,137.0,341.0,6.0,25.0,79.0,345.0,212.0,739.0,120.0,1.0,414.0,5.0
1,827,304.0,478.0,243.0,163.0,185.0,314.0,18.0,27.0,113.0,557.0,256.0,723.0,136.0,1.0,559.0,2.0
2,856,270.0,445.0,128.0,110.0,202.0,330.0,13.0,59.0,103.0,399.0,537.0,883.0,59.0,1.0,616.0,8.0
3,1109,371.0,567.0,261.0,338.0,389.0,594.0,31.0,116.0,126.0,759.0,380.0,1005.0,143.0,3.0,770.0,4.0


### Summary
The document numbers below are the row indices of average_pos_df, which is ordered by noun count and therefore does not follow the order of final_paper_df. Matching the counts against pos_counts shows that Document 0 is Poems by Emily Dickinson, Series One; Document 1 is Sonnets from the Portuguese; Document 2 is Poems by Christina Georgina Rossetti; and Document 3 is Poems by Edna St. Vincent Millay.

Document 0:
Nouns (NOUN) clearly outnumber verbs, which may indicate that the document focuses on describing and depicting things and concepts.
The relatively low number of verbs (VERB) may mean fewer descriptions of behaviors or actions.
Frequent punctuation marks (PUNCT) may indicate that the document has a strong sense of rhythm or a complex sentence structure.
The use of pronouns (PRON) and proper nouns (PROPN) may indicate that the document refers to specific names and persons.


Document 1:
Punctuation marks (PUNCT) are used frequently, which may imply that the sentence structure of the document is varied and has a certain sense of rhythm.
The number of verbs (VERB) may indicate that more of the document describes behaviors or actions.
The frequent use of pronouns (PRON) may imply that specific persons or items are referred to often in the document.


Document 2:
Punctuation marks (PUNCT) outnumber every other part of speech, which may imply that the sentence structure of the document is relatively complex.
Proper nouns (PROPN) are used frequently, which may reflect frequent references to specific names, places, or people in the document.
The relatively high number of verbs (VERB) and adjectives (ADJ) may indicate that the document contains more actions and more adjectival modification.


Document 3:
The number of nouns (NOUN) and punctuation marks (PUNCT) is high, which may imply that the document describes many things and has a strong sense of sentence rhythm.
The number of pronouns (PRON) and verbs (VERB) is also relatively high, which may indicate that the document contains more descriptions of persons and actions.

These analyses can help reveal differences in grammatical features between the documents and hint at the style and content of each.
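
Because the four excerpts differ in length, raw counts are hard to compare directly. The sketch below converts the counts from the pos_counts table into shares of each document's total. The helper `relative_frequencies` is an assumption introduced for illustration. The numbers are copied from the table, and the author names follow the row order of final_paper_df.

```python
POS_TAGS = ["ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
            "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X"]

# Raw counts copied from the pos_counts table, one list per document
RAW_COUNTS = {
    "Christina Georgina Rossetti": [270, 445, 128, 110, 202, 330, 13, 856, 59, 103, 399, 537, 883, 59, 1, 616, 8],
    "Edna St. Vincent Millay": [371, 567, 261, 338, 389, 594, 31, 1109, 116, 126, 759, 380, 1005, 143, 3, 770, 4],
    "Emily Dickinson": [214, 294, 157, 193, 137, 341, 6, 677, 25, 79, 345, 212, 739, 120, 1, 414, 5],
    "Elizabeth Barrett Browning": [304, 478, 243, 163, 185, 314, 18, 827, 27, 113, 557, 256, 723, 136, 1, 559, 2],
}

def relative_frequencies(counts):
    """Divide each POS count by the document's total number of tokens."""
    total = sum(counts)
    return {tag: count / total for tag, count in zip(POS_TAGS, counts)}

shares = {author: relative_frequencies(counts) for author, counts in RAW_COUNTS.items()}

for author, share in shares.items():
    print(f"{author}: NOUN {share['NOUN']:.1%}, VERB {share['VERB']:.1%}, PUNCT {share['PUNCT']:.1%}")
```

With these counts, the noun share falls between roughly 16% and 17% for all four collections, which suggests that much of the difference in raw noun counts comes from text length.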

In [257]:
fig = px.bar(average_pos_df, x="NOUN", y=["ADJ", 'VERB', "NUM"], title="Average Part-of-Speech Use in Poems Written by Female Writers", barmode='group')
fig.show()

### Analysis of ```DATE``` Named Entities

In [258]:
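def extract_date_named_entities(doc):
    return [ent for ent in doc.ents if ent.label_ == 'DATE']

# ner_analysis_df is not defined earlier in the notebook;
# this column selection is an assumed reconstruction
ner_analysis_df = final_paper_df[['Filename', "AUTHOR'S NATIONALITY", 'Named_Entities', 'NE_Words']].copy()

ner_analysis_df['Date_Named_Entities'] = final_paper_df['Doc'].apply(extract_date_named_entities)

ner_analysis_df['Date_Named_Entities'] = [', '.join(map(str, l)) for l in ner_analysis_df['Date_Named_Entities']]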

In [259]:
# This corpus has no 'Proposal' label, so the mask matches no rows and the result is empty
date_word_counts_df = ner_analysis_df[(ner_analysis_df == 'Proposal').any(axis=1)]

date_word_frequencies = date_word_counts_df.Date_Named_Entities.str.split(expand=True).stack().value_counts()

date_word_frequencies[:10]

Series([], dtype: int64)

In [260]:
# This corpus has no 'Critique/Evaluation' label either, so the result is again empty
date_word_counts_df = ner_analysis_df[(ner_analysis_df == 'Critique/Evaluation').any(axis=1)]

date_word_frequencies = date_word_counts_df.Date_Named_Entities.str.split(expand=True).stack().value_counts()

date_word_frequencies[:10]

Series([], dtype: int64)
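
Both results are empty because neither label occurs in this corpus. A grouping that does exist in the metadata is the AUTHOR'S NATIONALITY column, with the values ENGLAND and AMERICA. The sketch below shows how DATE words could be counted per nationality using only the standard library. The function `date_word_frequencies` and the sample rows are hypothetical and do not represent results from the corpus.

```python
from collections import Counter

def date_word_frequencies(rows, nationality, n=10):
    """Count the words in the DATE entities of rows with the given nationality."""
    words = []
    for row in rows:
        if row["AUTHOR'S NATIONALITY"] == nationality:
            # Entities were joined with ', ', so drop the commas before splitting
            words.extend(row["Date_Named_Entities"].replace(",", " ").split())
    return Counter(words).most_common(n)

# Hypothetical rows for illustration only
rows = [
    {"AUTHOR'S NATIONALITY": "ENGLAND", "Date_Named_Entities": "December, a summer day"},
    {"AUTHOR'S NATIONALITY": "AMERICA", "Date_Named_Entities": "March, a day"},
    {"AUTHOR'S NATIONALITY": "ENGLAND", "Date_Named_Entities": "one day"},
]
print(date_word_frequencies(rows, "ENGLAND", n=3))
```

The same idea carries over to the DataFrame by filtering on the nationality column in place of the 'Proposal' and 'Critique/Evaluation' labels.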

### Conclusions

Poems by Christina Georgina Rossetti:

Commas, articles, conjunctions, and other punctuation marks (such as double quotation marks and periods) are the most frequent items, which may indicate that the author pays attention to the flow of the storyline and the connection between sentences. Female characters are prominent, with names such as "Laura" and "Lizzie" appearing frequently, which may reflect the important role women play in the poem. Words about feelings and relationships also appear, possibly referring to feelings between women, friendship, or sisterhood.


Poems by Edna St. Vincent Millay:

Commas, articles, conjunctions, and punctuation marks are again frequent, suggesting similar attention to how sentences connect. A high frequency of personal pronouns (e.g., "I," "you," "she") may indicate that the poems focus on the personal emotions and experiences of the author, reader, or other characters.


Poems by Emily Dickinson, Series One by Emily Dickinson:

Commas and other punctuation marks appear more frequently, while articles and conjunctions appear less frequently. Time-related words such as "time", together with pronouns and exclamation points, are used often, which may point to reflection on time, emotion, and life.


Sonnets from the Portuguese by Elizabeth Barrett Browning:

The vocabulary highlights love and emotion. The frequent occurrence of words such as "love", "soul", and "heart" reflects the author's strong focus on emotions and inner experience.

Additionally, counting the parts of speech gives some clues to the grammatical characteristics and style of each document. The figures are raw counts over excerpts of different lengths, so they indicate tendencies rather than firm differences.

First, nouns (NOUN) and verbs (VERB) are probably the most informative categories because they directly reflect the content and action of a document. Of the four documents, Poems by Edna St. Vincent Millay has the highest number of nouns and verbs (1109 and 770), which may indicate that it is more action- and description-focused, while Poems by Emily Dickinson, Series One has the fewest (677 and 414), which may point to a sparser, more abstract style.

Adjectives (ADJ) and adverbs (ADV) are also important parts of speech. They usually modify nouns and verbs and help readers perceive the text more clearly. Poems by Edna St. Vincent Millay has the highest number of adjectives and adverbs (371 and 261), which may indicate that it is richer in description and modification. Sonnets from the Portuguese by Elizabeth Barrett Browning has the second-highest number of adverbs (243), which may add detail to actions and descriptions.

The number of punctuation marks (PUNCT) is also important. Poems by Edna St. Vincent Millay has the highest number of punctuation marks (1005), possibly indicating a more complex or expressive syntactic structure, while Sonnets from the Portuguese by Elizabeth Barrett Browning has the fewest (723) and may have a more concise or direct sentence structure.

Finally, the number of proper nouns (PROPN) may reflect the use of specific names in a document. Poems by Christina Georgina Rossetti has the highest number of proper nouns (537) and may contain more specific places and names of people, whereas the other documents have fewer (between 212 and 380).