# Part-of-Speech Tagging and Analysis with spaCy and pandas

This notebook demonstrates part-of-speech (POS) tagging with spaCy on an excerpt from "Emma" by Jane Austen, supplied here with its punctuation removed.
It extracts each token with its POS tag, counts how often each token/tag pair occurs, summarizes how many distinct tokens fall under each tag, and identifies the most common nouns, using pandas for the data manipulation.

**Key steps:**
- Load the English language model (`en_core_web_sm`) with spaCy
- Process a literary text sample and extract tokens with their POS tags
- Count the most frequent token/tag pairs and the number of distinct tokens per POS category
- List the 10 most common nouns

The example combines an NLP library (spaCy) with a data-processing library (pandas) for basic text analysis.

In [1]:
import spacy
import pandas as pd

In [2]:
nlp = spacy.load("en_core_web_sm")

In [21]:
emma_ja = "Emma Woodhouse handsome clever and rich with a comfortable home and happy disposition seemed to unite some of the best blessings of existence and had lived nearly twenty one years in the world with very little to distress or vex her She was the youngest of the two daughters of a most affectionate indulgent father and had in consequence of her sisters marriage been mistress of his house from a very early period Her mother had died too long ago for her to have more than an indistinct remembrance of her caresses and her place had been supplied by an excellent woman as governess who had fallen little short of a mother in affection Sixteen years had Miss Taylor been in Mr Woodhouses family less as a governess than a friend very fond of both daughters but particularly of Emma Between them it was more the intimacy of sisters Even before Miss Taylor had ceased to hold the nominal office of governess the mildness of her temper had hardly allowed her to impose any restraint and the shadow of authority being now long passed away they had been living together as friend and friend very mutually attached and Emma doing just what she liked highly esteeming Miss Taylors judgment but directed chiefly by her own"

In [22]:
print(emma_ja)

Emma Woodhouse handsome clever and rich with a comfortable home and happy disposition seemed to unite some of the best blessings of existence and had lived nearly twenty one years in the world with very little to distress or vex her She was the youngest of the two daughters of a most affectionate indulgent father and had in consequence of her sisters marriage been mistress of his house from a very early period Her mother had died too long ago for her to have more than an indistinct remembrance of her caresses and her place had been supplied by an excellent woman as governess who had fallen little short of a mother in affection Sixteen years had Miss Taylor been in Mr Woodhouses family less as a governess than a friend very fond of both daughters but particularly of Emma Between them it was more the intimacy of sisters Even before Miss Taylor had ceased to hold the nominal office of governess the mildness of her temper had hardly allowed her to impose any restraint and the shadow of aut

In [23]:
spacy_doc = nlp(emma_ja)

In [24]:
pos_df = pd.DataFrame(columns=['token','pos_tag'])

In [25]:
# Build all rows in one pass; calling pd.concat inside a loop copies the frame on every iteration.
pos_df = pd.DataFrame(
    [{'token': token.text, 'pos_tag': token.pos_} for token in spacy_doc],
    columns=['token', 'pos_tag'])

In [26]:
pos_df.head(10)

         token  pos_tag
0         Emma    PROPN
1    Woodhouse    PROPN
2     handsome      ADJ
3       clever      ADJ
4          and    CCONJ
5         rich      ADJ
6         with      ADP
7            a      DET
8  comfortable      ADJ
9         home     NOUN


In [27]:
pos_df_counts = pos_df.groupby(['token', 'pos_tag']).size().reset_index(name='counts').sort_values(by='counts',ascending=False)

In [29]:
pos_df_counts.head(10)

     token  pos_tag  counts
93      of      ADP      14
62     her     PRON       8
56     had      AUX       8
18     and    CCONJ       8
114    the      DET       8
12       a      DET       6
117     to     PART       5
69      in      ADP       4
24    been      AUX       4
123   very      ADV       4
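
Because the grouping above keys on the raw token text, capitalized and lowercase forms such as "Her" and "her" end up in separate rows. The following is a minimal sketch of case-insensitive counting. The small `sample` frame and the `token_lower` column are illustrative assumptions, not the notebook's actual data.

```python
import pandas as pd

# Illustrative sample only: a few (token, pos_tag) rows in the same shape as pos_df.
sample = pd.DataFrame(
    {
        "token": ["Her", "mother", "her", "caresses", "her", "place"],
        "pos_tag": ["PRON", "NOUN", "PRON", "NOUN", "PRON", "NOUN"],
    }
)

# Lowercase the token text so "Her" and "her" fall into the same group.
sample["token_lower"] = sample["token"].str.lower()

case_insensitive_counts = (
    sample.groupby(["token_lower", "pos_tag"])
    .size()
    .reset_index(name="counts")
    .sort_values(by="counts", ascending=False)
)
print(case_insensitive_counts.head())
```

Whether to lowercase depends on the goal: it merges sentence-initial forms, but it would also merge a proper noun with an identically spelled common word.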


In [30]:
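# Counts the rows of pos_df_counts per tag, i.e. the number of distinct tokens carrying each tag (not total occurrences).
pos_df_poscounts = pos_df_counts.groupby(['pos_tag'])['token'].count().sort_values(ascending=False)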

In [31]:
pos_df_poscounts.head(10)

pos_tag
NOUN     33
VERB     19
ADJ      18
ADV      18
PRON     11
ADP       8
PROPN     7
DET       5
AUX       4
NUM       4
Name: token, dtype: int64
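
The figures above are taken over the rows of `pos_df_counts`, so each distinct token contributes once per tag regardless of how often it occurs. If total occurrences per tag are wanted instead, one option is to sum the `counts` column. The sketch below uses four rows copied from the token frequency table earlier in the notebook. It is a hand-picked subset, so its totals are not those of the full excerpt, and the names `subset`, `distinct_per_tag`, and `occurrences_per_tag` are illustrative.

```python
import pandas as pd

# Four rows copied from the token frequency table (a subset, for illustration only).
subset = pd.DataFrame(
    {
        "token": ["of", "in", "the", "a"],
        "pos_tag": ["ADP", "ADP", "DET", "DET"],
        "counts": [14, 4, 8, 6],
    }
)

# Distinct tokens per tag: what groupby(...)['token'].count() measures.
distinct_per_tag = subset.groupby("pos_tag")["token"].count()

# Total occurrences per tag: sum the per-token counts instead.
occurrences_per_tag = subset.groupby("pos_tag")["counts"].sum()

print(distinct_per_tag)
print(occurrences_per_tag)
```

On the full data, `pos_df['pos_tag'].value_counts()` should give the same per-tag totals directly from the token-level frame.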

In [32]:
nouns = pos_df_counts[pos_df_counts.pos_tag == 'NOUN'][:10]

In [33]:
nouns

           token  pos_tag  counts
55     governess     NOUN       3
53        friend     NOUN       3
88        mother     NOUN       2
38     daughters     NOUN       2
109      sisters     NOUN       2
131        years     NOUN       2
32      caresses     NOUN       1
37   consequence     NOUN       1
47     existence     NOUN       1
41   disposition     NOUN       1
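
As a cross-check that does not depend on pandas, the same kind of top-nouns list can be sketched with `collections.Counter` over (token, tag) pairs. The `pairs` list below is a hand-written stand-in for `[(t.text, t.pos_) for t in spacy_doc]`, and `top_nouns` is a hypothetical helper name, not part of the notebook.

```python
from collections import Counter


def top_nouns(pairs, n=10):
    """Return the n most frequent tokens tagged NOUN as (token, count) tuples."""
    noun_counts = Counter(text for text, pos in pairs if pos == "NOUN")
    return noun_counts.most_common(n)


# Hand-written stand-in for [(t.text, t.pos_) for t in spacy_doc].
pairs = [
    ("governess", "NOUN"), ("who", "PRON"), ("friend", "NOUN"),
    ("governess", "NOUN"), ("friend", "NOUN"), ("mother", "NOUN"),
    ("governess", "NOUN"), ("Emma", "PROPN"),
]

print(top_nouns(pairs, n=2))
```

Tokens with equal counts come back in first-seen order here, whereas the order of tied rows after `sort_values` in the pandas version is not guaranteed, so the two lists may order ties differently.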
