## Why use classic NLP in 2023 when we have LLMs?

LLMs are indeed powerful, but we may still want to know more about the linguistic properties of our input data and of LLM outputs (or simply to analyse text data independently).

# Text analysis with Spacy

Spacy is a great library for processing and analysing text data in order to extract valuable insights from it. It has a lot of features and algorithms that may be useful when working with NLP. Spacy is used in both industry and research. In industry it serves as a state-of-the-art library for production use cases, and in research it is used to deliver analytical insights and to support testing of a particular research hypothesis.

While the points above hold for any text analysis, the instruments that Spacy offers may be useful for analysing and interpreting model behaviour as well. Extracting insights from texts using model predictions may also help you find ways to improve your model (e.g. debias it).

During this lesson we will learn how to perform common operations with Spacy and will analyse a sentiment-labelled dataset.

Please read the docs to dive deeper into what is possible with Spacy: https://spacy.io/

This lesson uses code tests, meaning that you will be asked to implement functions, test them on the given test samples and make sure they match the expected output.

In [None]:
!pip install datasets==2.14.4

In [None]:
import spacy  # Spacy is preinstalled on Google Colab, so there is no need to install it manually (unless you need a specific version)
import itertools
from datasets import load_dataset
import pandas as pd
import typing as tp

In [None]:
nlp = spacy.load("en_core_web_sm")

1) You are given a test pandas dataframe with 2 columns.
Column "text" holds the input text and column "label" holds the label predicted by a sentiment analysis model. The label can be either 0 (negative) or 1 (positive).

Write a function to create 2 groups of samples from the given dataframe. One group has only samples labelled as positive, the other has only samples labelled as negative.

In [None]:
data = {
    "text": [
        "I love this product. It is amazing!",
        "This movie is terrible. I didn't like it.",
        "It is such a great day today.",
        "The food at that fancy restaurant was awful.",
    ],
    "label": [0, 1, 0, 1],
}

test_df = pd.DataFrame(data)

In [None]:
test_df

                                           text  label
0           I love this product. It is amazing!      0
1     This movie is terrible. I didn't like it.      1
2                 It is such a great day today.      0
3  The food at that fancy restaurant was awful.      1


In [None]:
condition = test_df["label"] == 1  # return rows where label == 1

test_df[condition]

                                           text  label
1     This movie is terrible. I didn't like it.      1
3  The food at that fancy restaurant was awful.      1


In [None]:
condition

0    False
1     True
2    False
3     True
Name: label, dtype: bool

In [None]:
# this may help: https://stackoverflow.com/questions/17071871/how-do-i-select-rows-from-a-dataframe-based-on-column-values


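def get_texts_by_label(df: pd.DataFrame, label: int) -> tp.List[str]:
    """
    Return the texts of all rows whose "label" column equals the given label.
    """
    condition = df["label"] == label  # boolean mask over rows
    return df.loc[condition, "text"].tolist()  # plain list, matching the annotation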

In [None]:
expected_output_label_1 = [
    "This movie is terrible. I didn't like it.",
    "The food at that fancy restaurant was awful.",
]
expected_output_label_0 = [
    "I love this product. It is amazing!",
    "It is such a great day today.",
]

In [None]:
assert set(get_texts_by_label(test_df, 0)) == set(expected_output_label_0)

In [None]:
assert set(get_texts_by_label(test_df, 1)) == set(expected_output_label_1)
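
As a possible alternative to calling the function once per label, pandas `groupby` can build both groups in a single pass. The sketch below is self-contained and uses made-up texts; the `group_texts_by_label` helper name is hypothetical and is not part of the expected solution for this task.

```python
import pandas as pd


def group_texts_by_label(df: pd.DataFrame) -> dict:
    """Map each label value to the list of texts that carry it."""
    # groupby yields (label, sub-dataframe) pairs, one per distinct label
    return {label: group["text"].tolist() for label, group in df.groupby("label")}


sample_df = pd.DataFrame(
    {
        "text": ["first text", "second text", "third text", "fourth text"],
        "label": [0, 1, 0, 1],
    }
)

groups = group_texts_by_label(sample_df)
print(groups[0])
print(groups[1])
```

This keeps the original row order inside each group, which may matter if you later want to map texts back to their rows.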

In [None]:
text = "I love this product. It is amazing!"

In [None]:
type(text)

str

In [None]:
doc = nlp(text)

In [None]:
expected_output_label_1

["This movie is terrible. I didn't like it.",
 'The food at that fancy restaurant was awful.']

In [None]:
[len(text) for text in expected_output_label_1]

[41, 44]

2) Convert texts to Spacy docs. Given an input list of text samples, convert each sample from the list into a Spacy Doc object. Read more about the semantics of the Doc object in the Spacy documentation: https://spacy.io/

In [None]:
def convert_texts_to_spacy_docs(texts: tp.List[str]) -> tp.List[spacy.tokens.doc.Doc]:
    """
    Wrap each text into a Spacy Doc object.
    """
    return [nlp(text) for text in texts]

In [None]:
docs = convert_texts_to_spacy_docs(expected_output_label_0)

assert all(isinstance(doc, spacy.tokens.doc.Doc) for doc in docs)

### 3) Extract tokens from spacy Doc Object.
Write a function that takes a Spacy Doc as input and extracts tokens from it. Add a boolean option to lemmatize tokens: https://spacy.io/api/lemmatizer

In [None]:
# demonstration

for token in doc:
    print(token.text)

I
love
this
product
.
It
is
amazing
!


In [None]:
def get_tokens_from_single_doc(
    doc: spacy.tokens.doc.Doc, lemmatize: bool = False
) -> tp.List[str]:
    """
    Get the string representation of tokens from the doc. When the lemmatize
    param is True, return token lemmas instead of the raw token texts.
    """
    if lemmatize:
        return [token.lemma_ for token in doc]
    return [token.text for token in doc]


sample_doc = docs[0]  # this will work if you correctly solved the previous task

tokens = get_tokens_from_single_doc(sample_doc, lemmatize=False)
assert all(isinstance(token, str) for token in tokens)

In [None]:
# demonstration of the lemmatizer

get_tokens_from_single_doc(nlp("these birds are beautiful"), lemmatize=True)

['these', 'bird', 'be', 'beautiful']

### 4) Extract tokens from a list of docs.

Given a list of docs (e.g. from task 2), extract tokens from each doc and return a list where each element is the list of tokens for the corresponding doc.

The expected output should be in the following format: ```tokens_from_docs = [["it", "is", "good"], ["this", "movie", "is", "the", "best"]]```. This is a reference example and the tokens here are only illustrative. Use it as a guide to better understand the type of the expected output.

In [None]:
def extract_tokens_from_docs(
    docs: tp.List[spacy.tokens.doc.Doc],
) -> tp.List[tp.List[str]]:
    """
    get tokens from every doc from input docs.
    """
    return [get_tokens_from_single_doc(doc) for doc in docs]


tokens_from_docs = extract_tokens_from_docs(docs)
assert len(tokens_from_docs) == len(docs)

In [None]:
tokens_from_docs

[['I', 'love', 'this', 'product', '.', 'It', 'is', 'amazing', '!'],
 ['It', 'is', 'such', 'a', 'great', 'day', 'today', '.']]
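
The token lists above still include punctuation and very common function words. One possible refinement, sketched here under assumptions, filters them out with the `token.is_punct` and `token.is_stop` flags. The sketch uses `spacy.blank("en")`, a tokenizer-only pipeline, so that it runs without downloading a model; the `get_content_tokens` name is hypothetical.

```python
import spacy

# tokenizer-only English pipeline; lexical flags such as is_punct and is_stop
# do not need a trained model
blank_nlp = spacy.blank("en")


def get_content_tokens(doc):
    """Keep token texts that are neither punctuation nor stop words."""
    return [token.text for token in doc if not token.is_punct and not token.is_stop]


content_tokens = get_content_tokens(blank_nlp("I love this product. It is amazing!"))
print(content_tokens)
```

The same flags are available on docs produced by `en_core_web_sm`, so the filter could be applied to the `docs` list used in this lesson as well.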

### 5) Flatten tokens

From the previous example you can see that we have a list of lists of tokens: each element of the outer list represents a doc and is itself a list that contains all tokens belonging to that doc. Let's now flatten these tokens into a single list.

You are given a test input ```tokens_from_docs = [["it", "is", "good"], ["this", "movie", "is", "the", "best"]]```.

Flatten the tokens into a format matching the ```expected_output```.

In [None]:
def flatten_tokens(texts: tp.List[tp.List[tp.Any]]) -> tp.List[tp.Any]:
    """
    flatten tokens into single list.
    """
    return list(itertools.chain(*texts))


tokens_from_docs = [["it", "is", "good"], ["this", "movie", "is", "the", "best"]]

flattened_tokens = flatten_tokens(tokens_from_docs)

expected_output = ["it", "is", "good", "this", "movie", "is", "the", "best"]

assert set(flattened_tokens) == set(expected_output)
assert len(flattened_tokens) == len(expected_output)
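
For comparison, here is a small self-contained sketch of two other common ways to flatten a list of lists: `itertools.chain.from_iterable`, which avoids unpacking the outer list into arguments, and a nested list comprehension. Both keep the original token order and duplicates. The variable names are illustrative only.

```python
from itertools import chain

nested = [["it", "is", "good"], ["this", "movie", "is", "the", "best"]]

# from_iterable consumes the outer iterable lazily instead of unpacking it
flat = list(chain.from_iterable(nested))

# a nested comprehension produces the same result
flat_comprehension = [token for tokens in nested for token in tokens]

print(flat)
```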

### 6) Word counts

Compute the frequency of each word among all tokens. This statistic can be useful for various analytics (e.g. it can help to understand which words occur more or less often with positive/negative model predictions).

In [None]:
# tip: a simple way is to use a collections.Counter object


def get_word_counts(tokens: tp.List[tp.Any]) -> tp.Mapping[str, int]:
    """
    Compute word frequencies.
    """
    counts = {}
    for token in tokens:
        # start from 0 for unseen tokens, then increment
        counts[token] = counts.get(token, 0) + 1
    return counts


flattened_tokens = ["it", "is", "good", "this", "movie", "is", "the", "best"]

counts = get_word_counts(flattened_tokens)

In [None]:
expected_counts = {
    "it": 1,
    "is": 2,
    "good": 1,
    "this": 1,
    "movie": 1,
    "the": 1,
    "best": 1,
}

assert counts == expected_counts
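
The tip in the cell above mentions `collections.Counter`. A minimal sketch of that variant, using the same test tokens, could look like this:

```python
from collections import Counter

tokens = ["it", "is", "good", "this", "movie", "is", "the", "best"]

# Counter is a dict subclass that counts hashable items
counter = Counter(tokens)

print(counter["is"])
print(counter["missing"])  # absent keys are reported as 0 instead of raising KeyError
```

Because `Counter` is a `dict` subclass, it compares equal to a plain dict with the same keys and counts, so it should satisfy the same assertion as the manual loop.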

### 7) Get top N frequent tokens

Extract the top n most frequent tokens from the counts dict.

In [None]:
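def get_top_n_frequent_tokens(
    tokens_counts: tp.Mapping[str, int], top_n: int
) -> tp.Dict[str, int]:
    """
    Get the top_n most frequent tokens with their counts, most frequent first.
    """
    # use the function argument, not the global `counts` variable
    by_count = sorted(tokens_counts.items(), key=lambda x: x[1], reverse=True)
    return dict(by_count[:top_n])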


top_3_freq = get_top_n_frequent_tokens(counts, 3)

expected_top_3_freq = {"is": 2, "it": 1, "good": 1}
assert top_3_freq == expected_top_3_freq
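
If the counts are held in a `collections.Counter`, its `most_common` method gives the same kind of result without an explicit sort. This is a sketch of an alternative, not the expected solution; tokens with equal counts are returned in the order they were first encountered.

```python
from collections import Counter

token_counts = Counter(["it", "is", "good", "this", "movie", "is", "the", "best"])

# most_common(n) returns a list of (token, count) pairs, highest count first
top_3 = dict(token_counts.most_common(3))

print(top_3)
```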

### 8) Extract Nouns from text with Spacy

You are given an input ```text = "this sofa is so comfortable but delivery service was not good"```. Create a Spacy doc from the text and use it to extract the nouns.

In [None]:
def extract_nouns_from_text(text: str) -> tp.List[str]:
    """
    convert text to Doc object and filter it by Nouns.
    """
    doc = nlp(text)
    return [token.text for token in doc if token.pos_ == "NOUN"]


text = "this sofa is so comfortable but delivery service was not good"

extracted_nouns = extract_nouns_from_text(text)

assert set(extracted_nouns) == {"sofa", "delivery", "service"}

### 9) [Optional/Advanced] Extract all adjective + noun pairs from a text.

Read about Spacy linguistic features https://spacy.io/usage/linguistic-features and use its functions to find nouns and their dependent adjectives in the text.

In [None]:
def extract_adj_noun_pairs_from_text(text: str) -> tp.List[str]:
    """
    Convert text to Doc object and use linguistic dependency parser
    to extract dependent adjectives for existing nouns.
    Return adjective + noun pairs
    """
    doc = nlp(text)
    adj_noun_pairs = []

    for token in doc:
        adj = None
        if token.pos_ == "NOUN":
            noun = token.text
            for child in token.children:
                if child.dep_ == "amod":
                    adj = child.text
                    break

            if adj:
                adj_noun_pairs.append(adj + " " + noun)
    return adj_noun_pairs

In [None]:
text = "this beautiful sofa impressed me very much! Its bright color was exactly what I wanted"

extracted_adj_noun_pairs = extract_adj_noun_pairs_from_text(text)

assert set(extracted_adj_noun_pairs) == {"beautiful sofa", "bright color"}

In [None]:
imdb = load_dataset("imdb", split="test")

In [None]:
imdb = imdb.to_pandas()

In [None]:
imdb = imdb.sample(500, random_state=42).reset_index(drop=True)

In [None]:
imdb.label.value_counts()

0    266
1    234
Name: label, dtype: int64

## Example analysis on an imdb data sample. Let's analyse the ground truth and then a model

In [None]:
def make_analysis(df: pd.DataFrame, label: int) -> None:
    """
    Combine all together and print overall analytical report for given label
    """
    print(f"starting analysis for the label {label}")
    texts = get_texts_by_label(df, label)  # use the dataframe passed in, not the global imdb

    docs = convert_texts_to_spacy_docs(texts)
    tokens = extract_tokens_from_docs(docs)
    bag_of_words = flatten_tokens(tokens)
    print(f"total number of extracted words is {len(bag_of_words)} \n")
    word_counts = get_word_counts(bag_of_words)
    top_20_freq_tokens = get_top_n_frequent_tokens(word_counts, 20)
    print("top 20 frequent tokens: \n")
    print(top_20_freq_tokens)
    print("\n")

    nouns = [extract_nouns_from_text(text) for text in texts]
    nouns = flatten_tokens(nouns)
    nouns_counts = get_word_counts(nouns)
    top_20_frequent_nouns = get_top_n_frequent_tokens(nouns_counts, 20)
    print("top 20 frequent nouns: \n")
    print(top_20_frequent_nouns)
    print("\n")

    adj_noun_pairs = [extract_adj_noun_pairs_from_text(text) for text in texts]
    adj_noun_pairs = flatten_tokens(adj_noun_pairs)
    adj_noun_counts = get_word_counts(adj_noun_pairs)
    top_20_frequent_adj_nouns = get_top_n_frequent_tokens(adj_noun_counts, 20)
    print("top 20 frequent adj + nouns pairs: \n")
    print(top_20_frequent_adj_nouns)
    print("\n")

In [None]:
make_analysis(imdb, 0)

starting analysis for the label 0
total number of extracted words is 76372 

top 20 frequent tokens: 

{',': 3101, 'the': 3064, '.': 2634, 'a': 1729, 'and': 1616, 'of': 1572, 'to': 1513, 'is': 1192, 'I': 972, 'in': 928, 'it': 837, '"': 820, 'this': 797, 'that': 788, '-': 674, "'s": 660, '/><br': 589, 'was': 588, 'movie': 574, 'with': 485}


top 20 frequent nouns: 

{'movie': 574, 'film': 402, 'time': 126, 'story': 120, '/><br': 115, 'people': 101, 'plot': 91, 'movies': 84, 'character': 80, 'way': 77, 'scenes': 73, 'characters': 72, 'life': 71, 'films': 70, 'thing': 69, 'scene': 69, 'one': 58, 'director': 58, 'guy': 55, 'acting': 54}


top 20 frequent adj + nouns pairs: 

{'low budget': 20, 'main character': 15, 'only thing': 12, 'bad movie': 10, 'high school': 8, 'only reason': 8, 'worst movie': 8, 'real life': 7, 'worst film': 7, 'special effects': 7, 'entire movie': 6, 'other hand': 6, 'many things': 6, 'good movie': 6, 'whole movie': 6, 'good idea': 5, 'entire film': 5, 'few scenes

In [None]:
make_analysis(imdb, 1)

starting analysis for the label 1
total number of extracted words is 61887 

top 20 frequent tokens: 

{',': 2735, 'the': 2720, '.': 2099, 'and': 1508, 'a': 1460, 'of': 1366, 'to': 1190, 'is': 1062, 'in': 880, 'it': 677, 'that': 615, 'I': 586, '"': 586, "'s": 563, 'this': 522, 'as': 442, '-': 440, 'with': 424, 'for': 414, '/><br': 401}


top 20 frequent nouns: 

{'film': 395, 'movie': 303, 'story': 123, 'time': 110, 'films': 87, '/><br': 87, 'people': 79, 'life': 70, 'character': 68, 'man': 62, 'one': 59, 'love': 58, 'way': 57, 'years': 53, 'scenes': 52, 'series': 50, 'show': 50, 'movies': 49, 'characters': 48, 'scene': 44}


top 20 frequent adj + nouns pairs: 

{'first time': 14, 'first film': 9, 'bothersome man': 9, 'other films': 7, 'first movie': 7, 'low budget': 7, 'real life': 7, 'same time': 7, 'main character': 6, 'best films': 6, 'best film': 6, 'main characters': 5, 'great film': 5, 'human beings': 5, 'great movie': 5, 'great deal': 4, 'more sense': 4, 'good show': 4, 'last 

In [None]:
texts = get_texts_by_label(imdb, 1)

Interestingly, "bothersome man" is among the most frequent adj+noun pairs, but if we look up the reviews containing it, we see that it occurs in only one review, just many times :)


In [None]:
[x for x in texts if "bothersome man" in x]

['Where the heck is Andreas(Trond Fausa Aurvaag), exactly? Heaven? Hell? A parallel universe? When the bothersome man steps off the subway platform and meets an onrushing train, his next conscious moment occurs on a bus; riding solo, the newest arrival, in a dead netherworld where all the suicides go. Dressed as he was at the time of his sudden departure from the corporeal biosphere, Andreas is greeted by an official man, who processes and transports the bothersome man from the barren flatlands to a city, if the eyeballs work, is a dead ringer for the sort of urban landscapes that he once inhabited, if memory serves him right. Andreas retains the look of a sleepwalker in a trance, a man estranged from people and objects, struggling to find his bearings; at home, or rather, his assigned apartment; or at work, where the bothersome man is randomly designated as an accountant for an independent contractor. Havard(Johannes Joner), his boss, tells him, "You\'ll get used to it," which covers 

Follow-ups:
* The analysis is still not perfect. Suggest up to 3 ideas for how to improve it.
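
As one possible starting point, and only a sketch under assumptions: the `'/><br'` token in the reports above suggests that the reviews contain HTML line-break tags, which pollute the token and noun counts. Stripping them before passing the text to Spacy is one idea. The `strip_html_breaks` helper and the sample review below are hypothetical.

```python
import re

# matches <br>, <br/> and <br /> in any letter case
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)


def strip_html_breaks(text: str) -> str:
    """Replace HTML line-break tags with spaces and collapse repeated whitespace."""
    without_tags = BR_TAG.sub(" ", text)
    return re.sub(r"\s+", " ", without_tags).strip()


cleaned = strip_html_breaks("Great film.<br /><br />Would watch again.")
print(cleaned)
```

Other directions to consider include lowercasing or lemmatizing tokens before counting and counting each adj+noun pair at most once per review, so that a single review cannot dominate the ranking.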

### Summary

We learned how to use Spacy to analyse text data.
To learn more about Spacy, you can look into its free online courses:
https://spacy.io/universe/category/courses