### Simple One-liners

<b>Python one-liners</b> are short programs that perform powerful operations, doing a lot within a single line of code.

One-liners are very common in text processing, which is why we’re introducing a few examples here to help you better understand the code we are going to show today. Run the code cell below, can you explain what is happening in it?

In [148]:
# Simple for loop
result = [x * 2 for x in range(5)]
print(result)

[0, 2, 4, 6, 8]


A list can also be created as a one-liner with conditions.

In [145]:
letters = list("abCdEfG")

lower = []
for letter in letters:
    if letter.islower():
        lower.append(letter)

# The code above does the same thing as this one-liner:
lower = [letter for letter in letters if letter.islower()]
print(lower)


['a', 'b', 'd', 'f']


One can even add transformations on top of it:

In [154]:
words = "Hello World from Python !".split(" ")
lower = [w.lower() for w in words if w.isalpha()]
print(lower)

['hello', 'world', 'from', 'python']


In general such one-liners follow the following template:

In [None]:
[expression for item in iterable if condition]

#### Exercise 1:  One-liner
1. From a list of integers, write a one-liner that keeps only the even numbers.

In [153]:
numbers = [3, 4, 7, 10, 11, 14]
# even_number =

[4, 10, 14]


2. Take a list of words, convert each to uppercase, and concatenate them into one string.

In [255]:
words = ["how","was","the","mensa","today"]
# sentence =

WHAT IS THE MENSA MENU TODAY


---
### spaCy
#### Installation
Before you install spaCy and its dependencies, make sure that your `pip`, `setuptools` and `wheel` are up to date by clicking the following cell:


In [20]:
import sys
!{sys.executable} -m pip install setuptools wheel


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


Then install spaCy

In [None]:
import sys
!{sys.executable} -m pip install spacy

Or you can do the steps above directly in terminal

In [None]:
pip install -U pip setuptools wheel
pip install -U spacy

After installing spaCy, you will also need to download a language model. Use the following cell to download a basic English language model.

In [21]:
import sys
!{sys.executable} -m spacy download en_core_web_md

Collecting en-core-web-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.8.0/en_core_web_md-3.8.0-py3-none-any.whl (33.5 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m33.5/33.5 MB[0m [31m13.2 MB/s[0m  [33m0:00:02[0mm0:00:01[0m00:01[0m
[?25hInstalling collected packages: en-core-web-md
Successfully installed en-core-web-md-3.8.0
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_md')


Now let's create a new spaCy object using `spacy.load()`. What you put as the parameter here shall match with the model you downloaded earlier. If you want to try another model, you shall change the name as well.

In [26]:
import spacy
nlp = spacy.load('en_core_web_md')

After that we can use this spaCy object to parse a short piece of text:

In [30]:
# We are taking here one paragraph from The Picture of Dorian Gray as an example
text = "Dorian Gray hurried along the quay through the drizzling rain. His meeting with Adrian Singleton had strangely moved him, and he wondered if the ruin of that young life was really to be laid at his door, as Basil Hallward had said to him with such infamy of insult. He bit his lip, and for a few seconds his eyes grew sad. Yet, after all, what did it matter to him? One's days were too brief to take the burden of another's errors on one's shoulders. Each man lived his own life, and paid his own price for living it. The only pity was one had to pay so often for a single fault. One had to pay over and over again, indeed. In her dealings with man Destiny never closed her accounts."

doc = nlp(text)


---
#### Tokenization and Word Counter

By simply passing our text to spaCy, it is going to tokenize the text and give us the following basic information about the text:
1. All of the sentences (doc.sents)
1. All of the words (doc)
1. All of the "named entities,": names of places, people, #brands, etc. (doc.ents)
1. All none phrases or "noun_chunks": nouns in the text plus surrounding matter like adjectives and articles



In [126]:
import random

# 1. sentences
sentences = list(doc.sents)
print("There are in total", len(sentences), "sentences in the text.")

print("Sample sentences:")
for item in random.sample(sentences, 3):
    print(item.text.strip().replace("\n", " "))


There are in total 9 sentences in the text.
Sample sentences:
In her dealings with man Destiny never closed her accounts.
Dorian Gray hurried along the quay through the drizzling rain.
One had to pay over and over again, indeed.


In [127]:
# 2. words
# tokens = [token for token in doc]
words = [token for token in doc if token.is_alpha]
#print(tokens)
print(words)


[Dorian, Gray, hurried, along, the, quay, through, the, drizzling, rain, His, meeting, with, Adrian, Singleton, had, strangely, moved, him, and, he, wondered, if, the, ruin, of, that, young, life, was, really, to, be, laid, at, his, door, as, Basil, Hallward, had, said, to, him, with, such, infamy, of, insult, He, bit, his, lip, and, for, a, few, seconds, his, eyes, grew, sad, Yet, after, all, what, did, it, matter, to, him, One, days, were, too, brief, to, take, the, burden, of, another, errors, on, one, shoulders, Each, man, lived, his, own, life, and, paid, his, own, price, for, living, it, The, only, pity, was, one, had, to, pay, so, often, for, a, single, fault, One, had, to, pay, over, and, over, again, indeed, In, her, dealings, with, man, Destiny, never, closed, her, accounts]


The list of words that we are composing here is actually a list of spaCy [Token](https://spacy.io/api/token) objects. To compose a list of words, we are using `.is_alpha` attribute from the Token class, which returns true if the token consist of alphabetic characters. In the code cell above, compare `tokens` and `words`, what's the difference?

<b>Entities</b> are important in NLP because they usually contain information about the “who/what/where,” making them the most information-dense parts of a text. Identifying them helps us quickly understand the main topic. It also automatically group multi-word concepts for you. The process of extracting entities from text is called <b>Named Entity Recognition (NER)</b>. The NER label scheme varies by language and depends heavily on the training data available. You can find all available entity types for `en_core_web_sm` [here](https://spacy.io/models/en#en_core_web_sm-labels).

`noun_chunks` are also very useful in this regard. They are commonly used for text summarization and keyword extraction.


In [54]:
# 3. Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)

# 4. Noun phrases
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])

Dorian Gray PERSON
Adrian Singleton PERSON
Basil Hallward PERSON
a few seconds TIME
One CARDINAL
days DATE
Noun phrases: ['Dorian Gray', 'the quay', 'the drizzling rain', 'His meeting', 'Adrian Singleton', 'him', 'he', 'the ruin', 'that young life', 'his door', 'Basil Hallward', 'him', 'such infamy', 'insult', 'He', 'his lip', 'a few seconds', 'his eyes', 'what', 'it', 'him', "One's days", 'the burden', "another's errors", "one's shoulders", 'Each man', 'his own life', 'his own price', 'it', 'The only pity', 'a single fault', 'her dealings', 'man Destiny', 'her accounts']


We can also visualise entities from a given text.

In [220]:
doc_demo = nlp('"As We May Think" is an essay written by Vannevar Bush in 1945. He is an American engineer.')
displacy.render(doc_demo, style="ent")

This is more useful if we load the whole text and try to take a look in it. This can take a while to run.

In [71]:
text = open("../week3/pg26740.txt").read()
full_doc = nlp(text)

And let's make a word counter and print the 10 most common words from the text. What do you expect the result to be like?

In [77]:
from collections import Counter

all_words =  [token for token in full_doc if token.is_alpha]
word_count = Counter([w.text for w in all_words])

print(word_count.most_common(10))

[('the', 3558), ('of', 2286), ('and', 2195), ('to', 2153), ('I', 1694), ('a', 1629), ('that', 1302), ('in', 1233), ('you', 1146), ('was', 1066)]
52


We see that it's not so useful to analyse these words as they don't represent the content of the text. These words are called <b>stop words</b> and there is a list for such words in spaCy.

In [84]:
from spacy.lang.en.stop_words import STOP_WORDS
print(STOP_WORDS)

{'once', 'others', 'here', 'third', 'elsewhere', 'throughout', 'call', 'his', 'done', 'nevertheless', 'i', 'onto', 'being', '’ll', 'somehow', '’re', 'not', 'as', 'very', "n't", 'less', 'why', 'nobody', 'well', 'whenever', 'is', 'whereupon', 'at', 'however', 'latterly', 'few', 'then', 'should', 'get', 'due', 'last', 'sixty', 'just', '‘d', 'if', 'four', 'keep', 'n’t', 'forty', 'before', 'again', 'over', 'both', 'afterwards', 'than', 'amongst', 'against', 'while', 'used', 'my', 'and', 'everything', 'who', 'same', 'since', 'wherever', 'did', 'full', 'himself', 'are', 'make', 'one', 'thereupon', 'anything', 'besides', 'everyone', 'another', 'some', 'serious', 'any', 'was', 'itself', 'noone', 'whence', 'can', '‘ll', 'were', 'around', 'next', 'you', 'perhaps', 'seemed', 'hundred', 'into', 'fifty', 'whereby', 'our', 'top', 'many', 'mine', 'also', 'anyhow', 'herein', 'except', 'bottom', 'would', 'because', 'hence', 'too', 'part', 'herself', 'though', 'about', 'anywhere', 'else', 'those', 'had',

We can then go ahead to remove these words from our word counter to see some more meaningful statistics from our word counter.

In [178]:
# create a list of words without stop words, also considering different cases
all_words_without_sw = [word for word in all_words if word.text.lower() not in STOP_WORDS]
print(len(all_words), len(all_words_without_sw))
word_count = Counter([w.text for w in all_words_without_sw])
print(word_count.most_common(20))
most_common_adj = [adj for word in all_words_without_sw if word.pos_ == "ADJECTIVE" ]
print(most_common_adj)

83533 33956
[('Dorian', 417), ('said', 262), ('Lord', 245), ('Henry', 236), ('like', 221), ('life', 217), ('Gray', 204), ('man', 179), ('know', 175), ('Harry', 175), ('Basil', 158), ('things', 126), ('think', 126), ('thing', 121), ('eyes', 109), ('good', 107), ('come', 107), ('face', 106), ('want', 105), ('time', 103)]
[]


It is also possible to check if a token is part of stop words by using the `.is_stop` attribute from `Token` class.

In [111]:
sample = words[:5]
for word in sample:
    print(type(word), word.text, word.is_stop)

<class 'spacy.tokens.token.Token'> Dorian False
<class 'spacy.tokens.token.Token'> Gray False
<class 'spacy.tokens.token.Token'> hurried False
<class 'spacy.tokens.token.Token'> along True
<class 'spacy.tokens.token.Token'> the True


Depending on the text you want to analyze, you can also customize the default stop-word list by adding or removing words as needed.

In [122]:
STOP_WORDS.remove('again')
print("again" in STOP_WORDS)
STOP_WORDS.add('again')
print("again" in STOP_WORDS)

False
True


#### Exercise 2
- Load a text file of your choice
- Create a word frequency counter
- Improve the results by applying preprocessing steps (lowercasing, removing stop words, etc.)
- Use spaCy to extract additional information from the text, such as named entities.

---
#### POS(Parts of speech) Tagging
After tokenization, spaCy can parse and tag a given Doc. POS tags provide information about a word in its context. spaCy provides such tagging in two systemsg: `.pos_` uses [universal POS tags](https://universaldependencies.org/u/pos/), while `.tag_` follows [Penn Treebank](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) system. The Penn Treebank POS tags are more detailed than the universal ones as it has different tags for verb in different tenses or noun is plural/singular forms.

In [216]:

doc_demo = nlp("The quick brown fox jumps over one of the lazy dogs.")
for token in doc_demo:
    print(token.text, "/", token.pos_, "/", token.tag_)

# Printing the sentence in tag
demo_in_tag = " ".join([token.tag_ for token in doc_demo])
print(demo_in_tag)

The / DET / DT
quick / ADJ / JJ
brown / ADJ / JJ
fox / NOUN / NN
jumps / VERB / VBZ
over / ADP / IN
one / NUM / CD
of / ADP / IN
the / DET / DT
lazy / ADJ / JJ
dogs / NOUN / NNS
. / PUNCT / .
DT JJ JJ NN VBZ IN CD IN DT JJ NNS .


Most of the tags and labels look pretty abstract, and they vary between languages. You can use `spacy.explain()` to show you a short description of what a tag means.

In [186]:
spacy.explain("VBZ")

'verb, 3rd person singular present'

We can use `pos_` or `tag_` to filter a particular type of words that we want from a text corpus.

In [190]:
print("Nouns:", [token.text for token in doc if token.pos_ == "NOUN"])
print("Verbs in past tense:", [token.text for token in doc if token.tag_ == "VBD"])

Nouns: ['quay', 'drizzling', 'rain', 'meeting', 'ruin', 'life', 'door', 'infamy', 'insult', 'lip', 'seconds', 'eyes', 'days', 'burden', 'errors', 'shoulders', 'man', 'life', 'price', 'pity', 'fault', 'dealings', 'man', 'accounts']
Verbs in past tense: ['hurried', 'had', 'wondered', 'was', 'had', 'bit', 'grew', 'did', 'were', 'lived', 'paid', 'was', 'had', 'had', 'closed']


---
#### Lemmatization and Inflection
When we are compling a list of words from a text corpus, sometimes it makes sense to save the words in their most basic form. For that we will need the lemma of a word, or the process of lemmatization.

A word's "lemma" is its most "basic" form, the form without any morphology applied to it. In the example above we have the word "moved", the past tense of "move", or "seconds", the plural form of "second".

For example, we can get all the verbs, nouns, adjectives and adverbs without morphology from the example text with the following code:

In [181]:
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])
print("Nouns:", [token.lemma_ for token in doc if token.pos_ == "NOUN"])
print("Adjectives:", [token.lemma_ for token in doc if token.pos_ == "ADJ"])
print("Adverbs:", [token.lemma_ for token in doc if token.pos_ == "ADV"])

Verbs: ['hurry', 'move', 'wonder', 'lay', 'say', 'bite', 'grow', 'matter', 'take', 'live', 'pay', 'live', 'have', 'pay', 'have', 'pay', 'close']
Nouns: ['quay', 'drizzling', 'rain', 'meeting', 'ruin', 'life', 'door', 'infamy', 'insult', 'lip', 'second', 'eye', 'day', 'burden', 'error', 'shoulder', 'man', 'life', 'price', 'pity', 'fault', 'dealing', 'man', 'account']
Adjectives: ['young', 'such', 'few', 'sad', 'brief', 'own', 'own', 'only', 'single']
Adverbs: ['strangely', 'really', 'yet', 'after', 'all', 'too', 'so', 'often', 'over', 'again', 'indeed', 'never']


The opposite process of lemmatization is called <b>Inflection</b>, when we want to change the form of a verb/noun according to its current context (like number, tense, case...).

If we want to make sure that the words in a generated sentence is grammatically correct no matter what words got randomly chosen from a list, we could use a python library called `LemmInflect`. The system acts as a standalone module or as an extension to spaCy and it only works with English words.


In [101]:
!{sys.executable} -m pip install LemmInflect



Here is a demo of some commonly used transformations.

In [69]:
from lemminflect import getInflection

# Verb Demo
print(getInflection('be', tag='VBD'))
print(getInflection('be', tag='VBG'))
print(getInflection('be', tag='VBN'))
print(getInflection('be', tag='VBP'))
print(getInflection('be', tag='VBZ'))

# Noun Demo
print(getInflection('tooth', tag='NNS'))
print(getInflection('medium', tag='NNS'))

# Noun Demo
print(getInflection('good', tag='JJR'))
print(getInflection('good', tag='JJS'))

# pos_type = 'A'
# * JJ      Adjective
# * JJR     Adjective, comparative
# * JJS     Adjective, superlative
# * RB      Adverb
# * RBR     Adverb, comparative
# * RBS     Adverb, superlative
#
# pos_type = 'N'
# * NN      Noun, singular or mass
# * NNS     Noun, plural
#
# pos_type = 'V'
# * VB      Verb, base form
# * VBD     Verb, past tense
# * VBG     Verb, gerund or present participle
# * VBN     Verb, past participle
# * VBP     Verb, non-3rd person singular present
# * VBZ     Verb, 3rd person singular present
# * MD      Modal

('was', 'were')
('being',)
('been',)
('am', 'are')
('is',)
('teeth',)
('media', 'mediums')
('better',)
('best',)


---
### Sentence Level Analysis

To understand how words are connected within a sentence, spaCy uses [Dependency grammar](https://en.wikipedia.org/wiki/Dependency_grammar) as its framework to process the text. In this framework, each word depends on a “head” word their dependency relation can be categorized into pre-defined types. It helps spaCy understand sentence structure so it can figure out things like subjects, objects, and how different parts of the sentence relate.

In [221]:
doc_demo = nlp("The quick brown fox jumps over the lazy dog.")

def flatten_subtree(st):
    return ''.join([w.text_with_ws for w in st]).strip()

for token in doc_demo:
    print()
    print("Word:", token.text)
    print("Tag:", token.tag_)
    print("Head:", token.head.text)
    print("Dependency relation:", token.dep_)
    print("Subtree:", flatten_subtree(token.subtree))



Word: The
Tag: DT
Head: fox
Dependency relation: det
Subtree: The

Word: quick
Tag: JJ
Head: fox
Dependency relation: amod
Subtree: quick

Word: brown
Tag: JJ
Head: fox
Dependency relation: amod
Subtree: brown

Word: fox
Tag: NN
Head: jumps
Dependency relation: nsubj
Subtree: The quick brown fox

Word: jumps
Tag: VBZ
Head: jumps
Dependency relation: ROOT
Subtree: The quick brown fox jumps over the lazy dog.

Word: over
Tag: IN
Head: jumps
Dependency relation: prep
Subtree: over the lazy dog

Word: the
Tag: DT
Head: dog
Dependency relation: det
Subtree: the

Word: lazy
Tag: JJ
Head: dog
Dependency relation: amod
Subtree: lazy

Word: dog
Tag: NN
Head: over
Dependency relation: pobj
Subtree: the lazy dog

Word: .
Tag: .
Head: jumps
Dependency relation: punct
Subtree: .


Spacy has its own visualizer and we can use it to help us understand what's happening here in an easier way. In the visualisation we can see that "The quick brown fox" are grouped together under the head "fox" because it is the center of this sequence. Same happens with "the lazy dog". It is in this way how spaCy picks up the noun chunks.

In [257]:
from spacy import displacy
displacy.render(doc_demo, style="dep")

In [212]:
[chunk for chunk in doc_demo.noun_chunks]

[The quick brown fox, the lazy dog]

Let's see some concrete examples how this can be useful.  For example, we can get a whole phrase by its head using `Token.subtree`.

In [234]:
prep_phrases = [] # prepositional phrase
for token in full_doc:
    if token.dep_ == 'prep':
        prep_phrases.append(flatten_subtree(token.subtree).replace("\n", " ")) # replace line break with space

random.sample(prep_phrases, 7)

['of London',
 'of it',
 'about music',
 'At last',
 'of being stared at which',
 'As for what I said to you to-night',
 'to supper']

In [256]:
# compose list of phrases with exact text matching

def phrases_with_word(text):
    return [flatten_subtree(token.subtree).replace("\n", " ") for token in full_doc if token.text == text]

# random.sample(phrases_with('with'),10)
random.sample(phrases_with_word('wish'),10)


['if you wish it',
 'I wish to see it',
 '"  "I wish she were ill',
 'if you wish to know the exact time',
 'I wish you would tell me your secret.',
 '"I wish you had seen him.',
 'though I wish you chaps would not squabble over the picture',
 'I wish I had now.',
 'I wish that I had ever had such an experience.',
 '"  "Ah, Alan," murmured Dorian, with a sigh, "I wish you had a thousandth part of the pity for me that I have for you."']

We are covering only the basic usages of spaCy today. For more tutorials or documentations, you can check out [the official website](https://spacy.io/usage/spacy-101). We will learn more about it in the later sessions of the class.

#### Exercise 3
- Extract groups of words from your text. For example: lists of nouns, verbs, adjectives...
- Try to do that also with longer phrases. Look closely at your text, what phrases might be interesting to extract?

---
### Assignment 2
Option 1
1. Find a text corpus that interests you
2. Use spaCy to process and harvest groups of words/phrases
3. Then use `tracery`to build a text generator based on the harvested material

Option 2
1. Find a text corpus that interests you
2. Use spaCy to tag the text
3. Use `Markovify`to generate new sentences based on tags
4. Replace the tag with other words (either from same text or other resources)

Or feel free to come up with your own approach, as long as you use spaCy to help you create a text generator.
