# POS tags - Nouns, Verbs, Adjectives in Spacy
* Notebook by Adam Lang
* Date: 8/8/2024

# Overview
* In this notebook we will go over POS tags and ways to find the most common words in Spacy.

In [1]:
## imports
import spacy
from collections import Counter

In [3]:
## load spacy language model
nlp = spacy.load('en_core_web_sm')

In [4]:
## load data file path
data_path = '/content/drive/MyDrive/Colab Notebooks/Classical NLP/moon.txt'

In [5]:
# load spacy doc object
doc = nlp(open(data_path).read())

In [7]:
# print doc
doc

The moon is the satellite of the earth. It moves round the earth. It shines at night by light reflected from the Sun. It looks beautiful. The bright Moonlight is very soothing. The earthly objects shine like silver in the moonlight. We are fascinated by the enchanting beauty of the Moon. The moon is not as beautiful as it looks. It seems to be lovely when it shines in the sky at night. As a matter of fact it is devoid of plants and animals. The moon is not a suitable place for plants and animals. Therefore, no form of life can be found on the moon. Unlike the earth, the moon has got no atmosphere. Therefore, the lunar days are very hot and the lunar nights are intensely cold. The moon looks beautiful from the earth but in fact it has up forbidding appearance. It is full of rocks and craters. When we look at the moon at night we see some dark spots on it. These dark spots are dangerous rocks and craters. The gravitational pull of the moon is less than that of the earth, so it is difficu

## Most Common NOUNs
* First we want to preprocess the text and remove stop words and punctuation.
* Then we will have the nouns and we can utilize the Counter.

In [13]:
# remove punctuation, stop words --> nouns
nouns = [token.text for token in doc if token.is_stop != True and token.is_punct != True and token.pos_ == 'NOUN']

In [14]:
## now this will extract ONLY the nouns
nouns

['moon',
 'satellite',
 'earth',
 'earth',
 'night',
 'light',
 'objects',
 'silver',
 'moonlight',
 'beauty',
 'moon',
 'sky',
 'night',
 'matter',
 'fact',
 'plants',
 'animals',
 'moon',
 'place',
 'plants',
 'animals',
 'form',
 'life',
 'moon',
 'earth',
 'moon',
 'atmosphere',
 'days',
 'nights',
 'moon',
 'earth',
 'fact',
 'appearance',
 'rocks',
 'craters',
 'moon',
 'night',
 'spots',
 'spots',
 'rocks',
 'craters',
 'pull',
 'moon',
 'earth',
 'surface',
 'moon',
 'man',
 'beginning',
 'life',
 'earth',
 'wonder',
 'poets',
 'poems',
 'moon',
 'Scientists',
 'mystery',
 'moon',
 'human',
 'moon',
 'attempts',
 'man',
 'moon',
 'moon',
 'surface',
 'moon',
 'rocks',
 'earth',
 'scientists',
 'men',
 'moon',
 'times',
 'moon',
 'man',
 'object',
 'journey',
 'moon',
 'life',
 'earth',
 'life',
 'earth',
 'moon']

## Word Frequency of NOUNs

In [15]:
# setup counter
word_freq = Counter(nouns)

## extract top 10 most common nouns
common_nouns = word_freq.most_common(10)

## output top 10 most common nouns
common_nouns

[('moon', 19),
 ('earth', 9),
 ('life', 4),
 ('night', 3),
 ('rocks', 3),
 ('man', 3),
 ('fact', 2),
 ('plants', 2),
 ('animals', 2),
 ('craters', 2)]

## Most Common Verbs
* We will follow the same process as above but we will leave the stop words in at first.
* Use the python counter to find the frequencies.

In [21]:
verbs = [token.text for token in doc if token.is_punct != True and token.is_stop == True and token.pos_ == 'VERB']

In [22]:
verb_freq = Counter(verbs)

## get most common
common_verbs = verb_freq.most_common(10)

# print common_verbs
common_verbs

[('has', 2),
 ('seems', 1),
 ('see', 1),
 ('made', 1),
 ('make', 1),
 ('have', 1),
 ('go', 1)]

## Verbs without stop words
* Lets see what the differences are if we remove the stop words.

In [23]:
verb_no_stop = [token.text for token in doc if token.is_punct != True and token.is_stop !=True and token.pos_ == 'VERB']


## counter
verb_no_stop_freq = Counter(verb_no_stop).most_common(10)

# output most common
verb_no_stop_freq

[('looks', 3),
 ('shines', 2),
 ('fascinated', 2),
 ('moves', 1),
 ('reflected', 1),
 ('shine', 1),
 ('found', 1),
 ('got', 1),
 ('forbidding', 1),
 ('look', 1)]

Summary:
* Clearly removing the stop words allows us to see the actual most common verbs, whereas if we leave the stop words then it will count those that are verbs as verbs.

## Most Common Adjectives
* We will follow the same process above but remove the stop words.

In [24]:
## adjectives
adj = [token.text for token in doc if token.is_punct != True and token.is_stop != True and token.pos_ == 'ADJ']

# count most common adjectives
adj_freq = Counter(adj).most_common(10)

# output most common adjectives
adj_freq

[('beautiful', 4),
 ('lunar', 3),
 ('dark', 2),
 ('happy', 2),
 ('bright', 1),
 ('soothing', 1),
 ('earthly', 1),
 ('enchanting', 1),
 ('lovely', 1),
 ('devoid', 1)]

## Most common Adverbs


In [25]:
adverbs = [token.text for token in doc if token.is_punct != True and token.is_stop != True and token.pos_ == 'ADV']

# counter
adverb_freq = Counter(adverbs).most_common(10)

# output
adverb_freq

[('intensely', 1), ('safely', 1), ('longer', 1)]

## Most common lemma?


In [27]:
lemma = [token.lemma_ for token in doc if token.is_punct != True and token.is_stop != True]

lemma_freq = Counter(lemma).most_common(10)

lemma_freq

[('moon', 19),
 ('earth', 9),
 ('look', 5),
 ('night', 4),
 ('beautiful', 4),
 ('life', 4),
 ('man', 4),
 ('shine', 3),
 ('lunar', 3),
 ('rock', 3)]

## Functionizing these processes

In [43]:
# need to import pandas
import pandas as pd

## write a function
def pos_freq(tokens):

  # use list comprehension in series - count verbs
  freq_noun = pd.Series([token.text for token in doc if token.is_punct != True and token.is_stop != True and token.pos_ == 'NOUN']).value_counts()

  # use list comprehension in series - count nouns
  freq_verbs = pd.Series([token.text for token in doc if token.is_punct != True and token.is_stop != True and token.pos_ == 'VERB']).value_counts()

  # use list comprehension in series - count adjectives
  freq_adj = pd.Series([token.text for token in doc if token.is_punct != True and token.is_stop != True and token.pos_ == 'ADJ']).value_counts()

  # print out results
  print(f"Top 10 most frequent nouns in moon.txt file are:\n {freq_noun[:9]}")
  print("\n")
  print(f"Top 10 most frequent verbs in moon.txt file are:\n {freq_verbs[:9]}")
  print("\n")
  print(f"Top 10 most frequent adjectives in moon.text file are:\n {freq_adj[:9]}")

  #return freq_verb, freq_nouns

In [44]:
## lets test it out
pos_freq(doc)

Top 10 most frequent nouns in moon.txt file are:
 moon       19
earth       9
life        4
night       3
rocks       3
man         3
fact        2
craters     2
animals     2
Name: count, dtype: int64


Top 10 most frequent verbs in moon.txt file are:
 looks         3
fascinated    2
shines        2
moves         1
reveal        1
conquered     1
sent          1
returned      1
collected     1
Name: count, dtype: int64


Top 10 most frequent adjectives in moon.text file are:
 beautiful        4
lunar            3
happy            2
dark             2
cold             1
mysterious       1
American         1
difficult        1
gravitational    1
Name: count, dtype: int64


In [45]:
adj_freq

[('beautiful', 4),
 ('lunar', 3),
 ('dark', 2),
 ('happy', 2),
 ('bright', 1),
 ('soothing', 1),
 ('earthly', 1),
 ('enchanting', 1),
 ('lovely', 1),
 ('devoid', 1)]