# Week 10: Sentiment Analysis

Our task this week is as follows:
* Get to know Python dictionaries
* Learn about sentiment analysis, and learn how to use the sentiment analysis package in TextBlob
* Discuss the limitations of the lexicon-based approach and look at how we can overcome some of them
* Run a small "who wore it better" competition between TextBlob and VADER (an algorithm audit)
* Load a novel into a dataframe, sentence by sentence
* Record the sentiment values for each sentence in that dataframe
* Extract the sentences identified as the "happiest" and the "saddest" by the sentiment analysis system

https://github.com/cjhutto/vaderSentiment/tree/master

## Python Dictionaries

Before we get to sentiment analysis, we need to introduce another Python data type, one that may well become a favourite among English majors: dictionaries.

As [Melanie Walsh explains](https://melaniewalsh.github.io/Intro-Cultural-Analytics/02-Python/11-Dictionaries.html), dictionaries are mainly differentiated from `list`s by their use of **key-value pairs**. Whereas we access items in a list by their index position, we access the **values** of items in a dictionary by their **key**.

Python dictionaries are always surrounded by curly brackets `{ }`. You can make a dictionary in this manner:

```
variable_name = {
   'key1': value1,
   'key2': value2,
   'key3': value3,
}
```
Note:
- Keys are usually `string`s, as in all of our examples (strictly, any hashable type, such as a number, also works); values can be of any data type.
- A `,` comes between each key-value pair you define.
- You don't need to arrange things like this typographically, with each key-value pair on its own line, but it does make things look prettier.

Some examples:

In [None]:
writers = {
    "William Shakespeare": 1564,
    "Jane Austen": 1775,
    "Leo Tolstoy": 1828,
    "Gabriel Garcia Marquez": 1927,
    "Margaret Atwood": 1939,
    "Virginia Woolf": 1882
}

writers["William Shakespeare"]

In [1]:
writers = {
    "William Shakespeare": [1564, 1616],
    "Jane Austen": [1775, 1817],
    "Leo Tolstoy": [1828, 1910],
    "Gabriel Garcia Marquez": [1927, 2014],
    "Margaret Atwood": [1939, None],
    "Virginia Woolf": [1882, 1941]
}
writers["Margaret Atwood"]

[1939, None]

In [3]:
writers_in_20th_century = []

for writer in writers: #we go over the KEYS (writer names) this way
    birth_year = writers[writer][0]
    death_year = writers[writer][1]
    #birth_year, death_year = writers[writer] #alternative way
    still_alive =  death_year is None
    if (birth_year <= 2000 and (still_alive or death_year >= 1901)):
        writers_in_20th_century.append(writer)

writers_in_20th_century

['Leo Tolstoy', 'Gabriel Garcia Marquez', 'Margaret Atwood', 'Virginia Woolf']

In [4]:
writers = {
    "William Shakespeare": {
        "country": "England",
        "birth_year": 1564,
        "death_year": 1616
    },
    "Jane Austen": {
        "country": "England",
        "birth_year": 1775,
        "death_year": 1817
    },
    "Leo Tolstoy": {
        "country": "Russia",
        "birth_year": 1828,
        "death_year": 1910
    },
    "Gabriel Garcia Marquez": {
        "country": "Colombia",
        "birth_year": 1927,
        "death_year": 2014
    },
    "Margaret Atwood": {
        "country": "Canada",
        "birth_year": 1939,
        "death_year": None  # Still living
    },
    "Virginia Woolf": {
        "country": "England",
        "birth_year": 1882,
        "death_year": 1941
    }
}

writers["Gabriel Garcia Marquez"]["country"]

'Colombia'
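
Nested dictionaries can be looped over in the same way as flat ones. The following is a minimal sketch that groups writers by country; `writers_sample` is a shortened copy of the dictionary above (so the block runs on its own), and `writers_by_country` is a name made up for this illustration.

```python
# A shortened copy of the nested `writers` dictionary (three entries only)
writers_sample = {
    "Jane Austen": {"country": "England", "birth_year": 1775, "death_year": 1817},
    "Leo Tolstoy": {"country": "Russia", "birth_year": 1828, "death_year": 1910},
    "Virginia Woolf": {"country": "England", "birth_year": 1882, "death_year": 1941},
}

# Hypothetical result dictionary: country -> list of writer names
writers_by_country = {}

for name in writers_sample:  # looping over the KEYS (writer names)
    country = writers_sample[name]["country"]  # outer key first, then inner key
    if country not in writers_by_country:
        writers_by_country[country] = []  # first time we meet this country
    writers_by_country[country].append(name)

print(writers_by_country)
```

Each lookup reads left to right: the first pair of square brackets picks the writer, the second picks a field inside that writer's own dictionary.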

## Iterating Through Dictionaries

You can iterate through dictionaries, but first you need to specify, by calling the appropriate method, whether you want to iterate over keys, values, or key-value pairs.

In [14]:
carnivores = {
    "python": "A large heavy-bodied nonvenomous snake that kills poor prey by constriction and asphyxiation",
    "panda": "A large bearlike mammal that, while technically a carnivore, is in practice a vegetarian, eating only bamboo",
    "blob": "A third-party Python library that slowly kills you by sucking up all of your time, because the textual analysis it facilitates is so fascinating",
    "kitten": "A delightful, fuzzy creature whose natural prey is cat food (dry or wet) and, especially, treats"
}

We can loop through KEYS (two ways)

In [5]:
for key in carnivores:
    print(f"I am so afraid of {key.upper()}S!!!!")

I am so afraid of PYTHONS!!!!
I am so afraid of PANDAS!!!!
I am so afraid of BLOBS!!!!
I am so afraid of KITTENS!!!!


In [4]:
for key in carnivores.keys():
    print(f"I am so afraid of {key.upper()}S!!!!")

I am so afraid of PYTHONS!!!!
I am so afraid of PANDAS!!!!
I am so afraid of BLOBS!!!!
I am so afraid of KITTENS!!!!


That, of course, allows us to do something with values as well

In [7]:
for creature in carnivores:
    print(f"Why am I so afraid of {creature}s?")
    print(f"Because {creature} is {carnivores[creature].lower()}")

Why am I so afraid of pythons?
Because python is a large heavy-bodied nonvenomous snake that kills prey by constriction and asphyxiation
Why am I so afraid of pandas?
Because panda is a large bearlike mammal that, while technically a carnivore, is in practice a vegetarian, eating only bamboo
Why am I so afraid of blobs?
Because blob is a third-party python library that slowly kills you by sucking up all of your time, because the textual analysis it facilitates is so fascinating
Why am I so afraid of kittens?
Because kitten is a delightful, fuzzy creature whose natural prey is cat food (dry or wet) and, especially, treats


Or we can loop through values directly

In [None]:
for value in carnivores.values():
    print(f"Did you know there is a kind of carnivore that is {value}???")

Did you know there is a kind of carnivore that is a large heavy-bodied nonvenomous snake that kills prey by constriction and asphyxiation???
Did you know there is a kind of carnivore that is a large bearlike mammal that, while technically a carnivore, is in practice a vegetarian, eating only bamboo???
Did you know there is a kind of carnivore that is a third-party Python library that slowly kills you by sucking up all of your time, because the textual analysis it facilitates is so fascinating???
Did you know there is a kind of carnivore that is a delightful, fuzzy creature whose natural prey is cat food (dry or wet) and, especially, treats???


It can be difficult to remember at first, but a useful bit of Python syntax called unpacking lets us loop through both keys and values at once:

In [None]:
for key, value in carnivores.items():
    print(f"A {key} is {value}")

A python is a large heavy-bodied nonvenomous snake that kills prey by constriction and asphyxiation
A panda is a large bearlike mammal that, while technically a carnivore, is in practice a vegetarian, eating only bamboo
A blob is a third-party Python library that slowly kills you by sucking up all of your time, because the textual analysis it facilitates is so fascinating
A kitten is a delightful, fuzzy creature whose natural prey is cat food (dry or wet) and, especially, treats
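
The `.items()` pattern also works when you want to build a new dictionary or sort by value. Here is a small sketch under simple assumptions: `creatures` is a shortened, made-up stand-in for `carnivores`, and `description_lengths` and `longest_first` are illustrative names only.

```python
# A shortened stand-in for the `carnivores` dictionary
creatures = {
    "python": "a large snake",
    "panda": "a bearlike mammal that eats bamboo",
    "kitten": "a fuzzy creature",
}

# Dictionary comprehension: same keys, new values (word count of each description)
description_lengths = {name: len(text.split()) for name, text in creatures.items()}

# Sort the (key, value) pairs by value, largest first
longest_first = sorted(description_lengths.items(), key=lambda pair: pair[1], reverse=True)

for name, length in longest_first:
    print(f"{name}: {length} words")
```

The `key=lambda pair: pair[1]` part tells `sorted()` to compare the second element of each pair (the value) rather than the first (the key).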


# Sentiment Analysis. Part I

We will
- show (and use) the simplest approach to sentiment analysis: the bag-of-words, dictionary-based approach
- briefly discuss its main disadvantages
- discuss the main heuristics we can apply to critically analyze algorithms
- look at how we can improve on the simplest version, and how to assess whether the improvement works

We also employ part of an approach called "algorithmic audit": critically evaluating what the algorithm does, what it is (and is not) good for, and what its biases are. Think of the questions we ask about datasets.

In [15]:
from textblob import TextBlob
import nltk

blob = TextBlob(carnivores["panda"])
print(blob)
print(f"Polarity {blob.sentiment.polarity}")
print(f"Subjectivity {blob.sentiment.subjectivity}")

A large bearlike mammal that, while technically a carnivore, is in practice a vegetarian, eating only bamboo
Polarity 0.07142857142857142
Subjectivity 0.5095238095238095


In [None]:
blob = TextBlob(carnivores["python"])
print(blob)
print(f"Polarity {blob.sentiment.polarity}")
print(f"Subjectivity {blob.sentiment.subjectivity}")

blob = TextBlob(carnivores["kitten"])
print(blob)
print(f"Polarity {blob.sentiment.polarity}")
print(f"Subjectivity {blob.sentiment.subjectivity}")

A large heavy-bodied nonvenomous snake that kills poor prey by constriction and asphyxiation
Polarity -0.09285714285714287
Subjectivity 0.5142857142857142
A delightful, fuzzy creature whose natural prey is cat food (dry or wet) and, especially, treats
Polarity 0.1866666666666667
Subjectivity 0.6799999999999999

The word-based approach to sentiment analysis assigns a numeric score (positive or negative) to each word it recognizes, then sums or averages those word scores over the text.
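
To make the idea concrete, here is a toy sketch of a word-averaging scorer in plain Python. It is not TextBlob's implementation: `toy_lexicon` and its scores are invented for illustration, and `toy_polarity` is a hypothetical helper.

```python
# Hypothetical mini-lexicon: word -> polarity score between -1 and 1 (made-up values)
toy_lexicon = {"awful": -1.0, "great": 0.8, "tragic": -0.75, "love": 0.5}


def toy_polarity(text):
    """Average the scores of the words found in the lexicon (0.0 if none are found)."""
    words = text.lower().split()
    scores = [toy_lexicon[word] for word in words if word in toy_lexicon]
    if not scores:
        return 0.0
    return sum(scores) / len(scores)


print(toy_polarity("great"))
print(toy_polarity("a window"))
print(toy_polarity("not great"))  # same score as "great": negation is ignored
```

Because the scorer only looks at which words are present, "not great" and "great" come out identical, which is exactly the kind of weakness an audit should probe.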

Let's audit how this would work on the examples for which we know/can think of the answers!

In [21]:
print(TextBlob("awful").polarity)
print(TextBlob("great").polarity)
print(TextBlob("window").polarity)
print(TextBlob("not great").polarity)
print(TextBlob("not so great").polarity)

-1.0
0.8
0.0
-0.4
0.8


- What other cases can you think of where this approach could fail?
- When might it be good enough?

Alternative algorithm:

> VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media.
>
> [VADER readme](https://github.com/cjhutto/vaderSentiment/tree/master)

> [VADER aims to properly handle] sentences with:
>
> - typical negations (e.g., "*not* good")
> - use of contractions as negations (e.g., "*wasn't* very good")
> - conventional use of **punctuation** to signal increased sentiment intensity (e.g., "Good!!!")
> - conventional use of **word-shape** to signal emphasis (e.g., using ALL CAPS for words/phrases)
> - using **degree modifiers** to alter sentiment intensity (e.g., intensity *boosters* such as "very" and intensity *dampeners* such as "kind of")
> - understanding many **sentiment-laden slang** words (e.g., 'sux')
> - understanding many sentiment-laden **slang words as modifiers** such as 'uber' or 'friggin' or 'kinda'
> - understanding many sentiment-laden **emoticons** such as :) and :D
> - translating **utf-8 encoded emojis** such as 💘 and 💋 and 😁
> - understanding sentiment-laden **initialisms and acronyms** (for example: 'lol')

Take a look at the [VADER paper](https://ojs.aaai.org/index.php/ICWSM/article/view/14550/14399).
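
To get a feel for what "rule-based" adds on top of a plain lexicon, here is a deliberately crude sketch with two rules: negation flips a word's score and a booster scales it. The word lists, weights, and two-word lookback window are all assumptions made for this illustration; they are not VADER's actual rules or values.

```python
# All values below are invented for illustration; they are NOT taken from VADER
toy_lexicon = {"good": 0.7, "bad": -0.7, "great": 0.8}
negations = {"not", "never", "wasn't", "isn't"}
boosters = {"very": 1.5, "really": 1.5}


def toy_rule_polarity(text):
    """Average lexicon scores after applying simple negation and booster rules."""
    words = text.lower().split()
    scores = []
    for position, word in enumerate(words):
        if word not in toy_lexicon:
            continue
        score = toy_lexicon[word]
        # Look back at up to two preceding words for modifiers
        for previous in words[max(0, position - 2):position]:
            if previous in boosters:
                score *= boosters[previous]
            if previous in negations:
                score = -score
        scores.append(score)
    if not scores:
        return 0.0
    average = sum(scores) / len(scores)
    return max(-1.0, min(1.0, average))  # keep the result between -1 and 1


print(toy_rule_polarity("good"))
print(toy_rule_polarity("not good"))
print(toy_rule_polarity("very good"))
```

Even this tiny version raises audit questions: how far back should a negation reach, and what happens with curly apostrophes (`wasn’t`), which this sketch would not recognize?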

## So, let's set up our small investigation, comparing VADER and TextBlob approaches

1. Define (tricky) examples

In [39]:
examples_bow = [
    "It was the best of times, it was the worst of times.",
    "I love how tragic her story is; it makes me feel alive.",
    "That poem wasn’t bad at all!",
    "The character’s demise was inevitable; simply tragic.",
    "Oh, fantastic... yet another twist ending.",
    "I couldn’t put the book down; it was haunting, to say the least.",
    "She was calm, almost too calm, like the eye of a storm.",
    "The protagonist had a truly unforgettable experience.",
    "Thank goodness it’s over. That was exhausting.",
    "This plot twist is simply too much... breathtaking!",
    "Wow, thanks for ruining my day with that spoiler! 😒",
    "Absolutely loved the movie... except for that ending, ugh!",
    "Can’t believe I waited hours for this. What a waste!",
    "OMG, this is the best thing I've seen all week!!!",
    "This product is seriously underrated, honestly amazing!",
    "LOL, yeah right, as if this would actually work... 🙄",
    "Finally finished it... mixed feelings, to be honest.",
    "You have to read this book—it’s like nothing else!",
    "Just what I needed... another delay. Fantastic. 🤦‍♂️",
    "I'm impressed! Didn’t expect it to be this good!"
]

print(len(examples_bow))


20


2. Decide what the results should look like

In [None]:
audit_results = {
    "Sentence": [],
    "VADER Sentiment": [],
    "VADER Score": [],
    "TextBlob Sentiment": [],
    "TextBlob Polarity": [],
    "Difference Detected": []
}

3. Set up VADER and TextBlob

In [27]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk
nltk.download('vader_lexicon')

# Initialize VADER sentiment analyzer
vader_analyzer = SentimentIntensityAnalyzer()
print(vader_analyzer.polarity_scores(carnivores["panda"]))
print(vader_analyzer.polarity_scores(carnivores["python"]))
print(vader_analyzer.polarity_scores(carnivores["kitten"]))

{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
{'neg': 0.398, 'neu': 0.602, 'pos': 0.0, 'compound': -0.765}
{'neg': 0.0, 'neu': 0.674, 'pos': 0.326, 'compound': 0.743}


[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [28]:
print(vader_analyzer.polarity_scores(carnivores["kitten"])['compound'])

0.743


4. Analyze sentiment and write down the results

In [30]:
for sentence in examples_bow:
    # VADER sentiment
    vader_scores = vader_analyzer.polarity_scores(sentence)
    vader_sentiment = "Positive" if vader_scores['compound'] >= 0.05 else "Negative" if vader_scores['compound'] <= -0.05 else "Neutral"


    # TextBlob sentiment
    blob = TextBlob(sentence)
    blob_polarity = blob.sentiment.polarity
    blob_subjectivity = blob.sentiment.subjectivity
    blob_sentiment = "Positive" if blob_polarity > 0 else "Negative" if blob_polarity < 0 else "Neutral"

    # Detect if there is a difference in sentiment
    difference_detected = vader_sentiment != blob_sentiment

    # Append results to the data dictionary
    audit_results["Sentence"].append(sentence)
    audit_results["VADER Sentiment"].append(vader_sentiment)
    audit_results["VADER Score"].append(vader_scores['compound'])
    audit_results["TextBlob Sentiment"].append(blob_sentiment)
    audit_results["TextBlob Polarity"].append(blob_polarity)
    audit_results["Difference Detected"].append(difference_detected)
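
One thing to watch with this loop: it appends to a dictionary defined in an earlier cell, so running the cell a second time records every sentence twice. A possible way around that, sketched below, is to wrap the logic in functions that build a fresh dictionary on each call. `label_from_score` and `build_audit` are hypothetical helpers, and the scores are copied from the results table in this notebook rather than recomputed, so the block runs without TextBlob or VADER installed.

```python
def label_from_score(score, threshold=0.0):
    """Map a numeric score to 'Positive', 'Negative', or 'Neutral'."""
    if score > 0 and score >= threshold:
        return "Positive"
    if score < 0 and score <= -threshold:
        return "Negative"
    return "Neutral"


def build_audit(scored_sentences):
    """Build a fresh results dictionary from (sentence, vader_score, textblob_polarity) tuples."""
    results = {
        "Sentence": [],
        "VADER Sentiment": [],
        "TextBlob Sentiment": [],
        "Difference Detected": [],
    }
    for sentence, vader_score, blob_polarity in scored_sentences:
        vader_label = label_from_score(vader_score, threshold=0.05)  # same cut-off as above
        blob_label = label_from_score(blob_polarity)
        results["Sentence"].append(sentence)
        results["VADER Sentiment"].append(vader_label)
        results["TextBlob Sentiment"].append(blob_label)
        results["Difference Detected"].append(vader_label != blob_label)
    return results


# Scores copied from the notebook's results table (not recomputed here)
scored = [
    ("Oh, fantastic... yet another twist ending.", 0.0, 0.4),
    ("Thank goodness it’s over. That was exhausting.", 0.4588, -0.4),
    ("It was the best of times, it was the worst of times.", 0.0258, 0.0),
]

first_run = build_audit(scored)
second_run = build_audit(scored)  # calling again does not duplicate rows
print(len(first_run["Sentence"]), len(second_run["Sentence"]))
print(first_run["Difference Detected"])
```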

5. Transform results into an analysis-friendly form

In [None]:
# Create a DataFrame from a dictionary: keys become column names, lists of values become column data
import pandas as pd
df = pd.DataFrame(audit_results)
df

| | Sentence | VADER Sentiment | VADER Score | TextBlob Sentiment | TextBlob Polarity | Difference Detected |
|---|---|---|---|---|---|---|
| 0 | It was the best of times, it was the worst of ... | Neutral | 0.0258 | Neutral | 0.0 | False |
| 1 | I love how tragic her story is; it makes me fe... | Positive | 0.5859 | Negative | -0.05 | True |
| 2 | That poem wasn’t bad at all! | Negative | -0.5848 | Negative | -0.875 | False |
| 3 | The character’s demise was inevitable; simply ... | Negative | -0.4588 | Negative | -0.375 | False |
| 4 | Oh, fantastic... yet another twist ending. | Neutral | 0.0 | Positive | 0.4 | True |
| 5 | I couldn’t put the book down; it was haunting,... | Negative | -0.2732 | Negative | -0.227778 | False |
| 6 | She was calm, almost too calm, like the eye of... | Positive | 0.7037 | Positive | 0.3 | False |
| 7 | The protagonist had a truly unforgettable expe... | Positive | 0.4404 | Positive | 0.8 | False |
| 8 | Thank goodness it’s over. That was exhausting. | Positive | 0.4588 | Negative | -0.4 | True |
| 9 | This plot twist is simply too much... breathta... | Positive | 0.5093 | Positive | 0.4 | False |

6. Look at the differences

In [33]:
df[df["Difference Detected"] == True]
# or just:
# df[df["Difference Detected"]]
# (why does this work?)

| | Sentence | VADER Sentiment | VADER Score | TextBlob Sentiment | TextBlob Polarity | Difference Detected |
|---|---|---|---|---|---|---|
| 1 | I love how tragic her story is; it makes me fe... | Positive | 0.5859 | Negative | -0.05 | True |
| 4 | Oh, fantastic... yet another twist ending. | Neutral | 0.0 | Positive | 0.4 | True |
| 8 | Thank goodness it’s over. That was exhausting. | Positive | 0.4588 | Negative | -0.4 | True |
| 17 | You have to read this book—it’s like nothing e... | Positive | 0.4199 | Neutral | 0.0 | True |
| 21 | I love how tragic her story is; it makes me fe... | Positive | 0.5859 | Negative | -0.05 | True |
| 24 | Oh, fantastic... yet another twist ending. | Neutral | 0.0 | Positive | 0.4 | True |
| 28 | Thank goodness it’s over. That was exhausting. | Positive | 0.4588 | Negative | -0.4 | True |
| 37 | You have to read this book—it’s like nothing e... | Positive | 0.4199 | Neutral | 0.0 | True |

Note: rows 21 to 37 repeat rows 1 to 17. The loop in step 4 was evidently run twice without resetting `audit_results`, so each of the 20 sentences appears twice in `df` (40 rows in total). The proportions are unaffected.


In [37]:
len(df[df["Difference Detected"] == True]) / len(df)

0.2

In [38]:
pd.crosstab(df["VADER Sentiment"], df["TextBlob Sentiment"])

| VADER Sentiment \ TextBlob Sentiment | Negative | Neutral | Positive |
|---|---|---|---|
| Negative | 8 | 0 | 0 |
| Neutral | 0 | 2 | 2 |
| Positive | 4 | 2 | 22 |
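
If it helps to see what a cross-tabulation is actually counting, here is a plain-Python sketch of the same idea using `collections.Counter`. The two label lists are short, made-up examples, not the audit results above.

```python
from collections import Counter

# Made-up labels for five imaginary sentences (illustration only)
vader_labels = ["Neutral", "Positive", "Negative", "Neutral", "Positive"]
textblob_labels = ["Neutral", "Negative", "Negative", "Positive", "Positive"]

# Count how often each (VADER label, TextBlob label) combination occurs
pair_counts = Counter(zip(vader_labels, textblob_labels))

# Share of sentences on which the two systems disagree
disagreements = sum(1 for v, t in zip(vader_labels, textblob_labels) if v != t)
disagreement_rate = disagreements / len(vader_labels)

for (vader_label, textblob_label), count in pair_counts.items():
    print(f"VADER={vader_label}, TextBlob={textblob_label}: {count}")
print(f"Disagreement rate: {disagreement_rate}")
```

Each cell of the crosstab corresponds to one key of `pair_counts`; the off-diagonal cells are the disagreements.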


- What do we think about the results?
- Key question: how typical are these examples of the tasks at hand? What other troubles might we encounter?
- What extra questions should we ask ourselves?

# Getting back to TextBlob



The [documentation for TextBlob](https://textblob.readthedocs.io/en/dev/) isn't the best, but the default sentiment system is based on a tool called [pattern](https://github.com/clips/pattern), which employs a sentiment lexicon: a list of words with values, many of them hand-coded.
- You can see the source code [here](https://github.com/sloria/TextBlob/blob/6396e24e85af7462cbed648fee21db5082a1f3fb/textblob/en/__init__.py#L8) (around line 80): it basically averages the sentiment scores for all the words in the span, and applies some rule-based heuristics to identify negations.
- You can see the full lexicon [here](https://github.com/sloria/TextBlob/blob/6396e24e85af7462cbed648fee21db5082a1f3fb/textblob/en/__init__.py#L8); it's mostly adjective-based.

In [55]:
from textblob import TextBlob
nltk.download('punkt_tab')
nltk.download('punkt')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [56]:
TextBlob("Neil Young is the greatest artist to come out of this country").polarity

0.55

In [57]:
TextBlob("I hate Neil Young and his stupid, whiny voice").polarity

-0.5

In [58]:
TextBlob("Sometimes I feel like Neil Young is the greatest singer of his generation").polarity

0.55

In [59]:
TextBlob("Neil Young isn’t the worst Canadian musician").polarity

-0.45

In [60]:
TextBlob("Oh yeah, Neil Young’s voice is as lovely as Josh Groban’s").polarity

0.3

In [61]:
TextBlob("Hating on amazing music isn’t something I’m known for").polarity

0.6000000000000001

In [62]:
TextBlob("Neil Young").polarity

0.1

The way we work with TextBlob is first by "blobbing" a string of text (aka, turning it from a string into a TextBlob object). This is done by passing the string as an argument to the `TextBlob` constructor.

In [63]:
text = "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife."

In [64]:
pride_blob = TextBlob(text)

In [65]:
type(pride_blob)

## Using TextBlob to Tokenize Strings and Split Them Into Sentences

Once a text is blobbed, we can start using the special TextBlob attributes on it. Note that the ones we use here are properties rather than methods: they don't take arguments, and indeed aren't even followed by the usual `()`, which I personally find a bit ugly.

Let's look at two to start with:
- `blob.words`: This tokenizes the string, turning it into words. We've been accomplishing this with Python's built-in `string.split()` for many weeks now, then doing some extra stuff like removing punctuation with regular expressions. TextBlob does it all in one fell swoop, and does a good job with it, although we get less control over the process, and I personally prefer our previous method (can you see why??). The object it returns behaves like a `list`.
- `blob.sentences`: This returns all the sentences in a string. We've been accomplishing this with `string.split(".")`. TextBlob instead relies on NLTK's Punkt sentence tokenizer (hence the `punkt` downloads above), so unlike our approach it also splits on `?` and `!` and copes with many common abbreviations, although unusual ones like `per cent.` can still confuse it. The object it returns again behaves like a `list`.

In [66]:
pride_blob.words

WordList(['It', 'is', 'a', 'truth', 'universally', 'acknowledged', 'that', 'a', 'single', 'man', 'in', 'possession', 'of', 'a', 'good', 'fortune', 'must', 'be', 'in', 'want', 'of', 'a', 'wife'])

In [67]:
type(pride_blob.words)

In [68]:
pride_blob.words[0]

'It'

In [69]:
for word in pride_blob.words:
    print(word.upper())

IT
IS
A
TRUTH
UNIVERSALLY
ACKNOWLEDGED
THAT
A
SINGLE
MAN
IN
POSSESSION
OF
A
GOOD
FORTUNE
MUST
BE
IN
WANT
OF
A
WIFE


In [70]:
# "sign-of-four.txt" must be in the working directory; otherwise open() raises FileNotFoundError
with open("sign-of-four.txt", encoding="utf-8") as f:
    sot4 = f.read()

In [None]:
sot4_blob = TextBlob(sot4)

In [None]:
sot4_blob.words[255:269]

In [None]:
sot4_blob.sentences[9:20]

### TextBlob Word Counts... and Python Dictionaries

TextBlob has another useful property, `blob.word_counts`, which returns the words used in a document, along with a count for each of those words.

In [None]:
pride_blob.word_counts

In [None]:
sot4_blob.word_counts

The **Python data type** returned by `blob.word_counts`, though, is not a `list` at all, but rather a **dictionary (`dict`)**.

## Changing Values and Adding Key-Value Pairs

This is accomplished as follows:

In [None]:
carnivores['blob'] = "a third-party Python library that slowly kills you by sucking up all of your time, because the textual analysis it facilitates is so fascinating"

In [None]:
carnivores['blob']

In [None]:
carnivores['kitten'] = "a delightful, fuzzy creature whose natural prey is cat food (dry or wet) and, especially, treats"

In [None]:
carnivores['kitten']

In [None]:
carnivores.values()
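
The same square-bracket assignment both adds new pairs and changes existing ones, which is all you need to count words by hand. The sketch below is a plain-Python illustration of that idea (it is not how TextBlob itself is implemented); `counts` and the sample sentence are made up.

```python
counts = {}

for word in "the cat saw the other cat".split():
    # .get(word, 0) returns the current count, or 0 if the key does not exist yet
    counts[word] = counts.get(word, 0) + 1

print(counts)
print(counts["cat"])
```

The first time a word is seen, the assignment creates a new key-value pair; every later time, it overwrites the existing value with a larger count.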

## Back to `blob.word_counts`!

So... as I said, TextBlob's `word_counts` method produces a dictionary-like object, in which each key is a unique word in the string, and each value is a count of how many times that word occurs in the string.

In [None]:
sot4_counts = sot4_blob.word_counts

In [None]:
type(sot4_counts)

In [None]:
sot4_counts['cocaine']

By the way, since `blob.word_counts` produces a dictionary-like object in which each key is a unique word... can you tell me the one-line command we could use to calculate the TTR (type-token ratio) of any TextBlob object?

In [None]:
# We'll figure this one out together...

# Sentiment Analysis in TextBlob

Okay, it's finally time to get back to the thing we really want to do in TextBlob: use its sentiment analysis package!

This is accessible through the `blob.sentiment`, `blob.polarity`, and `blob.subjectivity` properties.

In [None]:
pride_blob.sentiment

In [None]:
pride_blob.polarity

In [None]:
pride_blob.subjectivity

Today we are going to focus on sentiment polarity (how positive or negative, happy or sad, a particular span of text is).

In [None]:
TextBlob("My life is ruined and I am miserable").polarity

In [None]:
TextBlob("My life is amazing and I am overjoyed").polarity

In [None]:
TextBlob("My life is not ruined and I am not miserable").polarity

In [None]:
TextBlob("My life is not amazing and I am not overjoyed").polarity

In [None]:
TextBlob("It's kind of like a potato").polarity

## Creating a DataFrame of Polarity Values for *The Sign of the Four*

We now have pretty much all the pieces in place to accomplish our task: creating a DataFrame in which each row contains a sentence from *The Sign of the Four* and the TextBlob polarity and subjectivity score for that sentence. Let's go!

We will create three parallel lists:
- one containing the text of every sentence, in the form of a `string`
- one containing a polarity value for each sentence, in the form of a `float`
- one containing a subjectivity value for each sentence, also in the form of a `float`

How would we do this, using skills we learned back in the first half of the course?

### Using `blob.sentences`

Let's start by examining the output of TextBlob's `blob.sentences` method more closely, so we get a better sense of how we'll produce our three desired lists.

In [None]:
sot4_sentences_blob = sot4_blob.sentences

In [None]:
type(sot4_sentences_blob)

In [None]:
sot4_sentences_blob[22]

In [None]:
type(sot4_sentences_blob[22])

In [None]:
sot4_sentences_blob[22].polarity

In [None]:
sot4_polarities = []

for sentence in sot4_sentences_blob:
    sot4_polarities.append(sentence.polarity)

In [None]:
sot4_polarities[:10]

In [None]:
sot4_subjectivities = []

for sentence in sot4_sentences_blob:
    sot4_subjectivities.append(sentence.subjectivity)

In [None]:
sot4_subjectivities[:10]

In [None]:
sot4_sentences_blob[22]

In [None]:
sot4_sentences_blob[22].raw

In [None]:
type(sot4_sentences_blob[22].raw)

In [None]:
sot4_sentences_blob[0]

In [None]:
sot4_sentences_blob[0].raw

Since that output is a bit ugly, with all those `\n\n\n`s, let's create our `string` of each sentence in a slightly different way: by using Python's `string.join()` method, which we met wayyyyy back in Week 3 (go look if you don't believe me!).

Here, we'll use `string.join()` to join together all the `blob.word`s with spaces, which gives us a pretty string to work with.

In [None]:
sot4_sentences_blob[0].words

In [None]:
" ".join(sot4_sentences_blob[0].words)

In [None]:
type(" ".join(sot4_sentences_blob[0].words))

In [None]:
sot4_sentences = []

for sentence in sot4_sentences_blob:
    sot4_sentences.append(" ".join(sentence.words))

In [None]:
sot4_sentences[:10]

### Creating a DataFrame from Three Parallel Lists

Okay, we have all the contents of our desired DataFrame.

- A list containing all the sentences of *The Sign of the Four*, in order
- A list containing the polarity values for each of those sentences, in order
- A list containing the subjectivity values for each of those sentences, in order

Our friend Pandas allows us to quite easily make a new DataFrame out of this kind of data, with its `pd.DataFrame()` constructor.

`pd.DataFrame()` can take as its argument... **a dictionary**! (See why we had to finally learn about dictionaries??). It expects this argument to be formatted as follows:

```
new_df = pd.DataFrame(
    {
        'column1': list1,
        'column2': list2,
        'column3': list3
    }
)
```

Of course, you could also write this same command without all the tabs and newlines as follows:

`new_df = pd.DataFrame({'column1': list1, 'column2': list2, 'column3': list3})`
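
Before handing such a dictionary to Pandas, it can help to check that the lists really are parallel, since Pandas will refuse to build the DataFrame if the columns differ in length. Here is a plain-Python sketch of that check, plus a look at how position `i` in each list lines up into one row. The sentences and scores are invented placeholders, not TextBlob output.

```python
# Made-up placeholder sentences and scores, for illustration only
sentences = ["What a lovely day", "This is dreadful", "The door is open"]
polarities = [0.5, -1.0, 0.0]
subjectivities = [0.75, 1.0, 0.5]

table = {
    "sentence": sentences,
    "polarity": polarities,
    "subjectivity": subjectivities,
}

# Every column should have the same number of entries
column_lengths = {name: len(values) for name, values in table.items()}
print(column_lengths)

# zip() walks the parallel lists in step, yielding one row at a time
rows = list(zip(table["sentence"], table["polarity"], table["subjectivity"]))
for row in rows:
    print(row)
```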


In [None]:
import pandas as pd

In [None]:
sot4_sentence_sentiment_df = pd.DataFrame({
    'sentence': sot4_sentences,
    'polarity': sot4_polarities,
    'subjectivity': sot4_subjectivities
})

In [None]:
sot4_sentence_sentiment_df

Let's now have a look at the sentences that TextBlob considers the most positive, as well as the most negative ones...

In [None]:
sot4_sentence_sentiment_df.sort_values(by='polarity', ascending=False)[:15]

Pretty hard to read what's in the `sentence` column! We could export it to a CSV and explore it in Excel or Google Sheets... or we can set this Pandas option so that there is no maximum column width, and it will just show us everything!

In [None]:
pd.set_option('display.max_colwidth', None)  # None means no limit on column width

In [None]:
sot4_sentence_sentiment_df.sort_values(by='polarity', ascending=False)[:15]

In [None]:
sot4_sentence_sentiment_df.sort_values(by='polarity', ascending=True)[:15]

# For the most curious: how can we deal with subjectivity using VADER scores?

- There is no "subjectivity" metric in VADER, but we can define our own by looking at the dictionary of negative, positive, and neutral scores VADER gives us
- We need to define a new set of examples to compare the algorithms
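
One possible home-made definition, offered as an assumption rather than anything VADER provides: treat the share of a text that VADER scores as non-neutral (`pos` + `neg`) as a rough subjectivity proxy. The helper name `vader_subjectivity_proxy` is hypothetical, and the score dictionaries are copied from the VADER output for the `carnivores` entries earlier in this notebook.

```python
def vader_subjectivity_proxy(scores):
    """Rough proxy (our own definition): share of the text scored as positive or negative."""
    return scores["pos"] + scores["neg"]


# VADER outputs copied from the notebook (panda, python, kitten)
panda_scores = {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0}
python_scores = {"neg": 0.398, "neu": 0.602, "pos": 0.0, "compound": -0.765}
kitten_scores = {"neg": 0.0, "neu": 0.674, "pos": 0.326, "compound": 0.743}

for name, scores in [("panda", panda_scores), ("python", python_scores), ("kitten", kitten_scores)]:
    print(name, vader_subjectivity_proxy(scores))
```

Whether this proxy measures "subjectivity" in the same sense as TextBlob's score is itself an audit question worth testing on the examples.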

My + LLM boring set of examples (you can do much better!):

In [None]:
examples_objectivity = [
    "The sun rises in the east and sets in the west.",
    "The book details the origins of the solar system.",
    "The experiment yielded unexpected results, showing potential errors in previous models.",
    "Wow, the sunset tonight is absolutely breathtaking!",
    "This study proves nothing new and is a complete waste of resources.",
    "The data strongly suggests a correlation between the two variables.",
    "That movie was unbelievably bad; I can't believe I wasted two hours on it!",
    "I'm thrilled to finally visit Paris next week; it’s been my dream for years!",
    "The procedure is fairly straightforward, requiring only basic understanding of calculus.",
    "I guess it’s okay, but it’s not as great as everyone says.",
    "After reviewing all the reports, it’s evident that improvements are needed.",
    "She’s just amazing; every time I see her perform, it’s magical!",
    "The solution is elegant and minimizes computational overhead.",
    "No, just no. I cannot understand why anyone would enjoy this.",
    "To be honest, the outcome was predictable and lacked excitement.",
    "Their customer service is decent, but nothing extraordinary.",
    "The latest model offers a marginal improvement over previous versions.",
    "Reading this article was such a joy! It’s insightful and well-researched.",
    "I would not recommend this product; it’s just not worth the price.",
]