# CS315 Mini Lecture: Part of Speech Tagging

## Data

A short paragraph describing Wellesley College.

In [1]:
sentence = "Wellesley College is a private women's liberal arts college in Wellesley, Massachusetts. Founded in 1870 by Henry and Pauline Durant as a female seminary, it is a member of the Seven Sisters Colleges, an unofficial grouping of current and former women's colleges in the northeastern United States."

## POS Tagging Usage
Several libraries can identify the part of speech of each word in a text. Here are two well-known ones.

"spaCy has support for word vectors whereas NLTK does not. As spaCy uses the latest and best algorithms, its performance is usually good as compared to NLTK. In word tokenization and POS-tagging spaCy performs better, but in sentence tokenization, NLTK outperforms spaCy."-- Kaggle

#### Model 1: NLTK

In [None]:
pip install nltk

In [2]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk import Tree

In [3]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# getTags below also reads the stop word list; if it is missing, run nltk.download('stopwords')

[nltk_data] Downloading package punkt to /Users/dorajyl/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/dorajyl/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [4]:
def getTags(txt):
    """
    Given a description string, return its adjectives and adverbs
    as a list of words
    """
    result = []

    # Penn Treebank tags to keep: adjectives (JJ, JJR, JJS) and adverbs (RB, RBR, RBS)
    wanted = ['JJ', 'JJS', 'JJR', 'RB', 'RBR', 'RBS']

    # build the stop word set once, outside the loop
    stop_words = set(stopwords.words('english'))

    # split the text into sentences
    for sent in sent_tokenize(txt):

        # the word tokenizer splits a sentence into words and punctuation
        wordsList = word_tokenize(sent)

        # remove stop words from wordsList
        # (the tagger then sees each word with less surrounding context)
        wordsList = [w for w in wordsList if w not in stop_words]

        # tag each remaining token with its part of speech
        tagged = nltk.pos_tag(wordsList)

        # save the word if its tag is one of the wanted tags
        for word, tag in tagged:
            if tag in wanted:
                result.append(word)

    return result

In [5]:
getTags(sentence)

['private',
 'liberal',
 'Founded',
 'female',
 'seminary',
 'unofficial',
 'current',
 'former']
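
The six tags kept by the NLTK version of `getTags` are the Penn Treebank adjective tags (JJ, JJR, JJS) and adverb tags (RB, RBR, RBS), so the membership test can also be written as a prefix check. The sketch below shows that idea on its own. The helper name `keep_adj_adv` and the `(word, tag)` pairs are made up for illustration and are not output from the tagger.

```python
# Hand-written (word, tag) pairs for illustration only; NOT real tagger output
tagged = [
    ("most", "RBS"),
    ("academically", "RB"),
    ("challenging", "JJ"),
    ("institutions", "NNS"),
    ("higher", "JJR"),
    ("education", "NN"),
]


def keep_adj_adv(pairs):
    """Return the words whose Penn Treebank tag starts with JJ or RB."""
    # str.startswith accepts a tuple of prefixes
    return [word for word, tag in pairs if tag.startswith(("JJ", "RB"))]


kept = keep_adj_adv(tagged)
print(kept)
```

The prefix check keeps the same six tags as the explicit list, since no other Penn Treebank tag begins with JJ or RB.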

#### Model 2: spaCy

In [6]:
pip install spacy

In [7]:
import spacy
import spacy.cli
from spacy import displacy

# download the medium English model (only needed once)
spacy.cli.download("en_core_web_md")

In [8]:
nlp = spacy.load('en_core_web_md')

In [9]:
def getTags(txt):
    """
    Given a description string, return its adjectives and adverbs
    as a list of spaCy tokens
    """
    result = []

    # apply nlp to the txt
    doc = nlp(txt)

    # coarse-grained POS tags to keep
    wanted = ['ADJ', 'ADV']

    # save the token if its tag is ADJ or ADV
    for token in doc:
        if token.pos_ in wanted:
            result.append(token)

    return result


def get_ent_label_list(txt):
    """
    Given a description string, return its named entities as a list of tuples
    in the form of (entity, label)
    """
    result = []

    # apply nlp to the txt
    doc = nlp(txt)

    # save the entities and their labels
    for ent in doc.ents:
        result.append((ent, ent.label_))

    return result


In [10]:
getTags(sentence)

[private, liberal, female, unofficial, current, former, northeastern]
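
To see where the two taggers disagree on this paragraph, the two result lists can be compared as sets. The sketch below copies the words by hand from the NLTK and spaCy outputs shown above into plain Python lists. The variable names are illustrative and not part of the notebook.

```python
# Words copied by hand from the two getTags outputs shown above
nltk_words = ['private', 'liberal', 'Founded', 'female',
              'seminary', 'unofficial', 'current', 'former']
spacy_words = ['private', 'liberal', 'female', 'unofficial',
               'current', 'former', 'northeastern']

# Words returned by one tagger but not by the other
only_nltk = sorted(set(nltk_words) - set(spacy_words))
only_spacy = sorted(set(spacy_words) - set(nltk_words))

# Words returned by both, in their original order
both = [w for w in nltk_words if w in spacy_words]

print("NLTK only: ", only_nltk)
print("spaCy only:", only_spacy)
print("Both:      ", both)
```

In this paragraph "Founded" is a participle and "seminary" is a noun, so the two NLTK-only words look like tagging errors. One possible cause is that the NLTK version removes stop words before tagging, which changes the context the tagger sees, but that is a guess and has not been checked here.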

In [11]:
get_ent_label_list(sentence)

[(Wellesley College, 'ORG'),
 (Wellesley, 'GPE'),
 (Massachusetts, 'GPE'),
 (1870, 'DATE'),
 (Henry, 'PERSON'),
 (Pauline Durant, 'PERSON'),
 (the Seven Sisters Colleges, 'ORG'),
 (United States, 'GPE')]
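
A list of (entity, label) pairs is often easier to read once it is grouped by label. The sketch below does that with the pairs from the output above, written out by hand as plain strings. The helper name `group_by_label` is illustrative; with live spaCy output, `ent.text` would give the string for each entity.

```python
from collections import defaultdict

# (text, label) pairs copied from the get_ent_label_list output above,
# with each entity written as a plain string
entities = [
    ("Wellesley College", "ORG"),
    ("Wellesley", "GPE"),
    ("Massachusetts", "GPE"),
    ("1870", "DATE"),
    ("Henry", "PERSON"),
    ("Pauline Durant", "PERSON"),
    ("the Seven Sisters Colleges", "ORG"),
    ("United States", "GPE"),
]


def group_by_label(pairs):
    """Return a dict mapping each label to the entity texts that carry it."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


grouped = group_by_label(entities)
for label, texts in grouped.items():
    print(label, texts)
```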

## Dependency Graph

In [15]:
sentence = "Wellesley College is one of the most academically challenging institutions of higher education in the country"
doc = nlp(sentence)

In [16]:
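def to_nltk_tree(node):
    """
    Recursively convert a spaCy token and its dependents into an nltk Tree
    """
    # if the node is not a leaf
    if node.n_lefts + node.n_rights > 0:
        return Tree(node.orth_, [to_nltk_tree(child) for child in node.children])
    # if the node is a leaf
    else:
        return node.orth_


# pretty_print() prints each tree and returns None, so the cell's value is [None]
[to_nltk_tree(sent.root).pretty_print() for sent in doc.sents]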

               is                                            
     __________|____________________                          
    |                              one                       
    |                               |                         
    |                               of                       
    |                               |                         
    |                          institutions                  
    |       ________________________|____________________     
    |      |            |                       of       in  
    |      |            |                       |        |    
 College   |       challenging              education country
    |      |    ________|___________            |        |    
Wellesley the most             academically   higher    the  



[None]
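
The printed tree can also be read as a set of head-to-dependent arcs. The sketch below writes the tree above out by hand as nested `(word, children)` tuples, then lists its arcs and computes its depth. The tuple encoding and the helper names `edges` and `depth` are assumptions made for this example; they are not part of spaCy or NLTK.

```python
# The tree printed above, transcribed by hand as (word, [children]) tuples
tree = ("is", [
    ("College", [("Wellesley", [])]),
    ("one", [
        ("of", [
            ("institutions", [
                ("the", []),
                ("challenging", [("most", []), ("academically", [])]),
                ("of", [("education", [("higher", [])])]),
                ("in", [("country", [("the", [])])]),
            ]),
        ]),
    ]),
])


def edges(node):
    """Return (head, dependent) pairs for every arc in the tree, depth first."""
    word, children = node
    pairs = []
    for child in children:
        pairs.append((word, child[0]))
        pairs.extend(edges(child))
    return pairs


def depth(node):
    """Return the number of levels in the tree (a single word has depth 1)."""
    _, children = node
    return 1 + max((depth(c) for c in children), default=0)


for head, dependent in edges(tree):
    print(f"{head} -> {dependent}")
print("depth:", depth(tree))
```

A sentence of n words has n - 1 arcs, since every word except the root has exactly one head.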