<a href="https://colab.research.google.com/github/d-tomas/transform4europe/blob/main/notebooks/text_mining_basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text mining basics

In this notebook we will work on different text mining tasks: part-of-speech tagging, parsing and semantic analysis.

## Initial setup

In [None]:
# Import the required libraries

import spacy  # NLP library
import pandas as pd  # Table manipulation
import matplotlib.pyplot as plt  # Visualisation
import seaborn as sns  # Visualisation
import nltk  # NLP library
from nltk.corpus import wordnet  # WordNet

# Install the SpaCy model for English texts
spacy.cli.download('en_core_web_sm')

# Download WordNet
nltk.download('wordnet')

# Load the model
nlp = spacy.load('en_core_web_sm')

# Download example text file ('news.txt')
!wget https://raw.githubusercontent.com/d-tomas/transform4europe/main/datasets/news.txt

## Part-of-speech tagging

The goal of *part-of-speech* (POS) *tagging* is to assign a part of speech to each word in a text, i.e., to identify whether it is a noun, verb, adjective, adverb, etc.
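
To make the idea concrete before running a model, here is a minimal sketch of what tagged text looks like. The tags in `tagged_sentence` were assigned by hand for illustration (using Universal POS labels) and are not the output of a tagger; the variable names are hypothetical.

```python
from collections import defaultdict

# Hand-annotated example: these tags were assigned by hand for illustration,
# they are not the output of a tagger
tagged_sentence = [
    ('The', 'DET'),
    ('teacher', 'NOUN'),
    ('is', 'AUX'),
    ('David', 'PROPN'),
    ('.', 'PUNCT'),
]

# Group the words by their part of speech
words_by_pos = defaultdict(list)
for word, pos in tagged_sentence:
    words_by_pos[pos].append(word)

for pos, words in words_by_pos.items():
    print(pos + ': ' + ', '.join(words))
```

A tagger produces this kind of (word, tag) pairing automatically for every token in the text.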

In [None]:
# Process and annotate text with the SpaCy model

text = 'Today is Monday, May 23, 2022. It is 6:00 p.m. I am attending a Text Mining seminar at the University of Alicante, in Spain. The teacher is David. He tries to make it interesting but sometimes fails.'
document = nlp(text)

In [None]:
# Extract the list of sentences from text

list(document.sents)

In [None]:
# Extract morphological information (POS-tagging) for each word in text

for token in document:  # For each token (word) in the document
    print('Word: ' + token.text)
    print('Lemma: ' + token.lemma_)
    print('POS: ' + token.pos_)
    print('POS fine: ' + token.tag_)
    print('---')

In [None]:
# You can use 'explain' if you do not understand the meaning of a POS tag

spacy.explain('CD')

In [None]:
# Create a Pandas DataFrame based on the content for further analysis

data = pd.DataFrame(
    data=[[token.text, token.lemma_, token.pos_, token.tag_] for token in document],
    columns=['Word', 'Lemma', 'POS', 'POS fine'],
)
data

In [None]:
# Basic statistics of the columns

data.describe()

In [None]:
# What is the number of verbs in the text?

(data['POS'] == 'VERB').sum()  # Substitute 'VERB' with any other POS tag (e.g. 'PUNCT')
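
If you need the count of every tag rather than a single one, `value_counts` may be more convenient than repeating the comparison. The sketch below uses a small hand-made DataFrame (`sample`, with illustrative tags that are not model output) so that it runs on its own; in the notebook you would apply the same calls to `data`.

```python
import pandas as pd

# Hand-made stand-in for the 'data' DataFrame (illustrative tags, not model output)
sample = pd.DataFrame(
    [['He', 'PRON'], ['tries', 'VERB'], ['to', 'PART'],
     ['make', 'VERB'], ['it', 'PRON'], ['.', 'PUNCT']],
    columns=['Word', 'POS'],
)

# Count every POS tag at once, most frequent first
pos_counts = sample['POS'].value_counts()
print(pos_counts)

# Look up a single tag; 'get' returns the default (0) when the tag does not occur
n_verbs = pos_counts.get('VERB', 0)
n_nouns = pos_counts.get('NOUN', 0)
print('Verbs:', n_verbs, 'Nouns:', n_nouns)
```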

In [None]:
# We can do some interesting visualisations
# Bar plot with the count of each POS (fine) tag

plt.figure(figsize=(14,7))
sns.countplot(x='POS fine', data=data, order=data['POS fine'].value_counts().index)  # Sort by frequency
plt.xticks(rotation=-45)  # Rotate the labels to avoid overlapping
plt.show()

### Exercise

Run POS tagging on the content of the file 'news.txt'. How many adjectives are there in the text?

**Tip**: load the information into a Pandas DataFrame to manipulate it.

In [None]:
# First we have to store all the content of 'news.txt' in the variable 'text'
with open('news.txt', encoding='utf-8') as file:
    text = file.read()

In [None]:
# Your code goes here


## Shallow parsing

In [None]:
# Get all the noun phrases from text

for chunk in document.noun_chunks:
    print('Noun phrase: ' + chunk.text)

In [None]:
# 'displacy' shows the parse tree

spacy.displacy.render(document, style='dep', options={'compact': True}, jupyter=True)

In [None]:
# Navigate the dependency tree
# - 'head' and 'child' describe words connected in the dependency tree
# - 'dep' is the type of syntactic relation connecting 'child' and 'head'

for token in document:
    print(token.text, token.dep_, token.head.text, token.head.pos_, list(token.children))
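
The head/child relation can also be sketched without a parser. The example below is a hand-annotated tree (the `words`, `heads` and `deps` lists are illustrative, not parser output) in which the root is its own head, mirroring how SpaCy represents the root token. The helper `path_to_root` is a hypothetical name introduced here.

```python
# Hand-annotated dependency tree (illustrative, not parser output).
# heads[i] is the index of the head of words[i]; the root points to itself.
words = ['The', 'teacher', 'is', 'David']
heads = [1, 2, 2, 2]
deps = ['det', 'nsubj', 'ROOT', 'attr']

# Invert the head links to obtain the children of each word
children = {i: [] for i in range(len(words))}
for i, head in enumerate(heads):
    if head != i:  # Skip the root, which is its own head
        children[head].append(i)


def path_to_root(i):
    """Follow the head links from word i up to the root."""
    path = [words[i]]
    while heads[i] != i:
        i = heads[i]
        path.append(words[i])
    return path


# Same columns as the cell above: word, relation, head, children
for i, word in enumerate(words):
    print(word, deps[i], words[heads[i]], [words[c] for c in children[i]])

print(path_to_root(0))
```

Children are simply the inverse of the head links, which is why a token can have many children but only one head.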

In [None]:
# A very nice feature: Named Entity Recognition
# Full list of entity types recognised: https://spacy.io/api/annotation#named-entities

for ent in document.ents:
    print('Text: ' + ent.text)
    print('Start char: ' + str(ent.start_char))
    print('End char: ' + str(ent.end_char))
    print('Type: ' + ent.label_)
    print('---')
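
`start_char` and `end_char` are character offsets into the original text, with the end offset pointing just past the last character, so slicing the text with them recovers the entity. The plain-Python sketch below illustrates that convention; the `mentions` are chosen by hand and located with `str.find`, whereas a real pipeline would predict them.

```python
text = 'I am attending a Text Mining seminar at the University of Alicante, in Spain.'

# Hypothetical entity mentions chosen by hand; a real pipeline would predict them
mentions = ['University of Alicante', 'Spain']

spans = []
for mention in mentions:
    start_char = text.find(mention)       # Offset of the first character
    end_char = start_char + len(mention)  # Offset just past the last character
    spans.append((mention, start_char, end_char))
    # Slicing with the two offsets recovers the mention exactly
    print(mention, start_char, end_char, text[start_char:end_char] == mention)
```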

In [None]:
# Highlight named entities and their labels in a text

spacy.displacy.render(document, style='ent', options={'distance': 90}, jupyter=True)

### Exercise

Complete the code in the following cell to show the (shallow) parse tree of the content in the file 'news.txt'.

In [None]:
# Your code goes here


Highlight the named entities in the file 'news.txt' using 'displacy'.

In [None]:
# Your code goes here


## Semantic analysis with WordNet

In [None]:
# Get all the synsets of a word

word = 'dog'

list_synsets = wordnet.synsets(word)
for synset in list_synsets:
    print('Synset: ' + synset.name())
    print('Lemma: ' + synset.lemmas()[0].name())
    print('Meaning: ' + synset.definition())
    print('Examples: ' + str(synset.examples()))
    print('---')
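
The synset names printed above follow the pattern `lemma.pos.number`, for example `dog.n.01` for the first noun sense of "dog". The helper below (`parse_synset_name`, a hypothetical name introduced here) is a small sketch that splits such a name into its parts; it works on plain strings and does not query WordNet.

```python
# WordNet part-of-speech codes used in synset names
POS_NAMES = {
    'n': 'noun',
    'v': 'verb',
    'a': 'adjective',
    's': 'adjective satellite',
    'r': 'adverb',
}


def parse_synset_name(name):
    """Split a synset name such as 'dog.n.01' into (lemma, POS name, sense number)."""
    # Split from the right, in case the lemma itself contains a dot
    lemma, pos, sense = name.rsplit('.', 2)
    return lemma, POS_NAMES.get(pos, pos), int(sense)


for name in ['dog.n.01', 'chase.v.01']:
    print(name, '->', parse_synset_name(name))
```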

In [None]:
# Get synonyms and antonyms

word = 'tall'

list_synsets = wordnet.synsets(word)
list_synonyms = set()  # Use 'set' instead of 'list' to avoid duplicates
list_antonyms = set()
for synset in list_synsets:
    for lemma in synset.lemmas():
        list_synonyms.add(lemma.name())
        if lemma.antonyms():
            list_antonyms.add(lemma.antonyms()[0].name())

print('Synonyms: ' + str(list_synonyms))
print('Antonyms: ' + str(list_antonyms))

In [None]:
# Get all the hypernyms

word = 'terrier'

synset = wordnet.synsets(word)[0]  # First synset of the word
hypernyms = lambda s: s.hypernyms()  # Relation to follow: direct hypernyms of a synset

print(list(synset.closure(hypernyms)))
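
To see what `closure` is doing, the sketch below reimplements the idea on a toy graph: it repeatedly applies a relation, breadth-first, and skips nodes it has already visited. This assumes NLTK's traversal behaves the same way; the `toy_hypernyms` dictionary is written by hand as a simplification and is not WordNet's actual hierarchy.

```python
from collections import deque

# Toy hypernym graph written by hand for illustration (not WordNet data)
toy_hypernyms = {
    'terrier': ['hunting_dog'],
    'hunting_dog': ['dog'],
    'dog': ['canine', 'domestic_animal'],
    'canine': ['carnivore'],
    'domestic_animal': ['animal'],
    'carnivore': ['animal'],
    'animal': [],
}


def closure(start, relation):
    """Breadth-first transitive closure of 'relation' from 'start', without repeats."""
    seen = set()
    queue = deque(relation(start))
    result = []
    while queue:
        node = queue.popleft()
        if node in seen:  # Already reached through another path
            continue
        seen.add(node)
        result.append(node)
        queue.extend(relation(node))
    return result


ancestors = closure('terrier', lambda s: toy_hypernyms[s])
print(ancestors)
```

Note that 'animal' is reachable through two paths but appears only once in the result.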