<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://lwfiles.mycourse.app/651ebde6c37afd427e55d85f-public/70e7481517e9de41c30d0c1b1a315182.png" style="width: 600px"></div>


<div style="color:white; background-color:#00318F; padding: 10px; border-radius: 15px; font-size: 150%; font-family: Verdana; text-align:center; -webkit-text-stroke-width: 1px; -webkit-text-stroke-color: black; text-shadow: 2px 2px 4px rgba(0, 0, 0, 0.7);">
   Textual Data Analysis with Python using TextBlob
</div>

<div style="color:white; background-color:#758FC2; padding: 10px; border-radius: 15px; font-size: 100%; font-family: Verdana; text-align:center; width: 50%; margin: 0 auto; -webkit-text-stroke-width: 1px; -webkit-text-stroke-color: black; text-shadow: 2px 2px 4px rgba(0, 0, 0, 0.7);">
   Last Update: Feb 22, 2026
<br>
   Notebook created by: Yashar Monfared
</div>


### Introduction 

Textual data analysis has become increasingly important because, by commonly cited estimates, about 80% of the world's data is unstructured. This includes emails, blog posts, social media posts, customer reviews, and voice transcripts. With advanced large language models (LLMs), organizations can turn this text into useful insights that drive automation and improve decision-making.

Most textual data analysis frameworks are based on language models. Language models are the cornerstone of natural language processing (NLP), offering a range of techniques for understanding and generating human language. They are statistical models designed to predict the likelihood of a sequence of words in a given text. They work by estimating the probability distribution of word sequences, which allows them to generate or evaluate sentences based on how frequently certain word combinations appear in a corpus.


Some applications of language models:

- Sentiment analysis: Monitoring social media or analyzing customer product reviews.
- Named Entity Recognition: Automatically indexing news or extracting data from legal contracts.
- Question Answering: Virtual assistants (Siri, Alexa) and automated customer support bots.
- Spam Detection: Assigning a label (such as "spam" or "not spam") to a piece of text.
- Grammar and Spell Checking: Using Part-of-Speech (POS) tags and context to flag errors.
- Speech recognition: Transcribing spoken language into text.
- Machine translation: Translating text from one language to another.
- Information retrieval: Ranking documents based on their relevance to a query.
- Text summarization: Condensing long pieces of text into shorter summaries.


There are several Python libraries for analyzing textual data with traditional language models, including **NLTK, TextBlob,** and **spaCy**. In this lesson, we will learn about TextBlob, a Python library for processing textual data. It provides a simple API for common NLP tasks such as sentiment analysis, part-of-speech tagging, text tokenization, spelling correction, and word frequency counting. With TextBlob, you can also create n-grams and lemmatize words. This lesson covers some of the basic features of the library.

TextBlob is a lightweight Python library for NLP that simplifies tasks by acting as a wrapper around NLTK (Natural Language Toolkit) and the pattern library. It operates primarily on rule-based methods and pre-trained lexicons rather than complex, real-time machine learning models or large language models (LLMs), making it efficient for prototyping and smaller applications. TextBlob is well suited to applications that need high speed, low cost, and low resource usage, such as social media monitoring or reviewing the sentiment of simple textual data. It is less suited to analyzing long texts that require deep contextual understanding (via deep neural networks or large language models).

TextBlob relies on pre-trained statistical models and lexicons for most of its language processing tasks, in the tradition of classical (pre-neural) NLP. Two main traditional approaches are Bag-of-Words (BoW), which is strictly a way of representing text rather than a predictive model, and N-grams (discussed in Section 6).

The BoW model represents text as a collection of words, without considering the order in which they appear. It breaks down a text into individual words, and each word is assigned a count or weight, treating each word as an independent entity.

TextBlob, for most use cases, uses a BoW model combined with its dictionary (the lexicon) to perform various NLP tasks on the textual data. 
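
To make the idea concrete, here is a minimal sketch of a bag-of-words count in plain Python. It is only an illustration and not TextBlob's internal code; the helper name `bag_of_words` and the simple regular-expression tokenizer are assumptions made for this example.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return word counts for a text, ignoring case, punctuation, and word order."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

bow_a = bag_of_words("The dog chased the cat.")
bow_b = bag_of_words("The cat chased the dog.")

print(bow_a)
# Word order is discarded, so both sentences produce the same bag
print(bow_a == bow_b)
```

The two sentences mean different things, yet their bags are identical. This is exactly the information a BoW representation throws away.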

### 1) Importing TextBlob and Required Packages

We need to import TextBlob and NLTK (the library that TextBlob wraps), and download the datasets (dictionaries) and models that TextBlob needs to perform various NLP tasks on textual data.

In [None]:
# TextBlob setup
# On Kaggle these libraries are preinstalled. If you use JupyterLab
# on your local PC, install them first:
# pip install -U textblob
# python -m textblob.download_corpora

# Import the required libraries and download the datasets that TextBlob and NLTK need

from textblob import TextBlob
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')



You may wonder what these nltk.download commands do. In simple terms, they fetch the "data fuel" that TextBlob needs to run. TextBlob is the engine, but it doesn't come pre-packaged with the massive dictionaries and statistical rules required to understand English. These nltk.download commands pull specific "corpora" (data sets) or "models" (trained rules) from the Natural Language Toolkit (NLTK) servers.

- punkt and punkt_tab are pre-trained tokenizer models (used to split text into sentences, a step that word tokenization builds on).
- averaged_perceptron_tagger_eng is the part-of-speech tagger model.
- wordnet is a massive English lexical database.
- omw-1.4 (Open Multilingual Wordnet) is an expansion of wordnet.

If you are working in Kaggle or Google Colab, the virtual machine is "wiped" every time you start a new session. Since these files are large, they aren't kept on the base image to save space, so you have to download them again in each new session.

We can use the TextBlob() constructor to create a TextBlob object in Python:

In [None]:
# Let’s create our first TextBlob. We need to feed TextBlob something like plain text.
sentence1 = TextBlob("Python is a high-level programming language.")
print(sentence1)
print(type(sentence1))

### 2) Tokenization

In TextBlob, tokenization is the process of breaking a large body of text into smaller, meaningful pieces called "tokens." TextBlob makes this incredibly easy by providing two main properties: ***.words*** and ***.sentences***. 

When you use ***.words***, TextBlob breaks the text down into individual words. Unlike a simple Python *.split()*, TextBlob's word tokenizer understands that punctuation (like periods and commas) should be stripped away from the word itself. It also splits certain single words that function as two distinct grammatical parts (we will see an example of this in Exercise 1).

When you use ***.sentences***, this breaks a large paragraph into a list of individual sentences. It doesn't just look for periods. It uses a pre-trained model (punkt) to distinguish between a period that ends a sentence and a period used in an abbreviation (like "Mr." or "Dr.").
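
To see why a trained sentence tokenizer is useful, consider what a naive approach does. The sketch below uses plain Python and a made-up example sentence: it splits on a period followed by a space and wrongly treats the abbreviation "Dr." as the end of a sentence.

```python
text = "Dr. Smith arrived late. The talk went well."

# Naive rule: every ". " marks a sentence boundary
naive_sentences = text.split(". ")

print(naive_sentences)
print("Pieces found:", len(naive_sentences))
```

The text has two sentences, but the naive rule produces three pieces because it cannot tell an abbreviation from a sentence ending.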

In [None]:
print(sentence1.words)
print(sentence1.sentences)


In [None]:
text = """Never forget what you are, for surely the world will not. Make it 
your strength. Then it can never be your weakness. Armor yourself in it,
and it will never be used to hurt you."""
sentences_example = TextBlob(text)
print(sentences_example.sentences)

#### Exercise 1

Take this famous quote from Martin Luther King Jr.:


Quote: "Darkness cannot drive out darkness; only light can do that. Hate cannot drive out hate; only love can do that."

**A)** Break it into tokens using the default ***.words*** property, and count how many tokens are in that quote.

**B)** Break it into words using ***.split()*** method and count how many words are in that quote.

**C)** Compare the number of tokens and the number of words in the quote, and analyze the result.

Hint: You can use len() function to find the number of tokens or words.




In [None]:
from textblob import TextBlob

# 1. Define the text
quote = "Darkness cannot drive out darkness; only light can do that. Hate cannot drive out hate; only love can do that."

# 2. Create the Blob
sentence_blob = TextBlob(quote)

# 3. Tokenize into words
tokens = sentence_blob.words

# 4. Show results
print("Tokens found:")
print(tokens)

print("\nTotal token count:", len(tokens))

# For comparison, count whitespace-separated words with str.split() instead of .words
simple_words = quote.split()
print("Total word count:", len(simple_words))  # This count differs from the token count

As you may notice, the word "cannot" is tokenized as two words by TextBlob's default tokenizer: "can" and "not". This is one of the key differences between simple splitting and NLP tokenization. Tokenization, even when using the ***.words*** property, is not the same as splitting the text into words! The reason "cannot" is split into "can" and "not" is that TextBlob’s default tokenizer follows standard linguistic conventions used in English grammar analysis. In English, "cannot" is a single word that functions as two distinct grammatical parts:

- Can: The modal verb (the action/possibility).

- Not: The negation (the denial).

Many NLP tools (like NLTK, which TextBlob uses under the hood) split "cannot" so that, if you were performing Sentiment Analysis or Part-of-Speech Tagging, the model can clearly see the "not" and realize the sentence is being negated. Without splitting it, a simple model might see "can" as positive, but it wouldn't "see" the negation tucked inside "cannot" as easily.
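
The difference between the two counts in Exercise 1 can be imitated with a small sketch in plain Python. This is not the tokenizer TextBlob uses; it only mimics two of its effects on this particular quote (dropping punctuation and splitting "cannot") using regular expressions chosen for this example.

```python
import re

quote = ("Darkness cannot drive out darkness; only light can do that. "
         "Hate cannot drive out hate; only love can do that.")

# Plain whitespace splitting keeps "cannot" whole and leaves punctuation attached
plain_words = quote.split()

# Toy imitation: remove punctuation, then split "cannot" into "can" + "not"
no_punctuation = re.sub(r"[^\w\s]", "", quote)
split_cannot = re.sub(r"\b(can)(not)\b", r"\1 \2", no_punctuation, flags=re.IGNORECASE)
toy_tokens = split_cannot.split()

print("Whitespace words:", len(plain_words))
print("Toy tokens:", len(toy_tokens))
```

Each "cannot" contributes one extra token, which accounts for the gap between the two counts.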

The ***.words*** and ***.sentences*** properties use the default tokenizers of TextBlob (*textblob.tokenizers.WordTokenizer and textblob.tokenizers.SentenceTokenizer*).

You can use other tokenizers, such as those provided by NLTK, by passing them into the TextBlob constructor then accessing the tokens property. When you pass a custom tokenizer into the TextBlob constructor, you are overriding the default WordTokenizer. For example, the **TabTokenizer** will only split the text where it finds a *\t* (tab) character. 


In [None]:
from textblob import TextBlob
from nltk.tokenize import TabTokenizer
tokenizer = TabTokenizer()
blob = TextBlob("This is\ta rather tabby\tblob.", tokenizer=tokenizer)
blob.tokens

In this example, TabTokenizer will treat "This is" as a single token because there is no tab between the two words. TextBlob is flexible: while it provides smart defaults, it allows you to "plug in" specialized tools from the NLTK library to handle non-standard text formats, which you can then access via the ***.tokens*** property.
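
The effect of splitting only on tabs can be mimicked with Python's built-in `str.split`. This is a rough stand-in for what a tab-based tokenizer does with this input, not the NLTK implementation.

```python
text = "This is\ta rather tabby\tblob."

# Split only where a tab character occurs; spaces are left alone
tab_tokens = text.split("\t")

print(tab_tokens)
```

Spaces do not count as boundaries here, so "This is" and "a rather tabby" each stay together as one token.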

### 3) Part-of-speech Tagging

Part-of-Speech (POS) tagging is the process of labeling each word in a sentence with its corresponding grammatical category—such as noun, verb, adjective, or adverb—based on both its definition and its context.

In NLP, this is a foundational step because the same word can mean very different things depending on its part of speech (e.g., "The book (noun) is on the table" vs. "I need to book (verb) a flight").

In [None]:
# Part-of-speech tags can be accessed through the tags attribute.
print(sentence1.tags)

The output is a list of tuples, where each tuple represents a word and its corresponding POS tag. These tags are assigned using a specific tagging scheme. These are tags in this example: 

- NNP stands for "Proper Noun, Singular". This indicates that "Python" is a specific, named entity.
- VBZ stands for "Verb, Present Tense, 3rd Person Singular". This means "is" is a verb in the present tense, used with a third-person singular subject.
- DT stands for "Determiner". This indicates that "a" is an article, used to specify a noun.
- JJ stands for "Adjective". This means "high-level" is a descriptive word modifying a noun.
- NN stands for "Noun, Singular". This means "programming" and "language" are both singular nouns.

When you run blob.tags, the tagger roughly follows these three steps:

- Tokenization: It breaks your sentence into individual words (tokens).

- Look-up: It checks an internal dictionary of words whose tag is almost always the same.

- Contextual Guessing: If the word can be multiple parts of speech (like "run"), it uses the surrounding words and the tags already assigned to earlier words to decide. If the word is unknown, features such as the suffix help (e.g., words ending in "-ing" are likely verbs, "-ly" are likely adverbs).
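
The look-up and suffix-guessing steps can be sketched with a toy tagger in plain Python. Everything here (the tiny `TOY_LEXICON`, the `SUFFIX_RULES`, and the `toy_tag` function) is hypothetical and far simpler than the trained tagger TextBlob relies on; in particular, it ignores context entirely.

```python
# Hypothetical mini-dictionary of words with a fixed tag
TOY_LEXICON = {"the": "DT", "a": "DT", "is": "VBZ", "dog": "NN"}

# Hypothetical suffix rules used only when the word is unknown
SUFFIX_RULES = [("ing", "VBG"), ("ly", "RB"), ("ed", "VBD")]

def toy_tag(word):
    """Tag a single word: dictionary look-up first, then a suffix guess."""
    lower = word.lower()
    if lower in TOY_LEXICON:
        return TOY_LEXICON[lower]
    for suffix, tag in SUFFIX_RULES:
        if lower.endswith(suffix):
            return tag
    return "NN"  # Fall back to "noun" when nothing else applies

sentence = "The dog is barking loudly"
print([(word, toy_tag(word)) for word in sentence.split()])
```

A rule like this will mislabel words such as "king" or "family", which is why real taggers combine many features and learn their weights from annotated text.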

### 4) Spelling Correction

One of TextBlob's most user-friendly features is automatic spelling correction. When you call ***.correct()***, TextBlob makes a statistical guess for every word in your string. It combines two things:

- It compares your words against a large list of correctly spelled words in a known corpus (dataset).

- It calculates how many "edits" (insertions, deletions, replacements, or swaps of letters) it takes to turn your misspelled word into a real word.

In [None]:
from textblob import TextBlob
sentence3 = TextBlob("I havv goood speling!")
print(sentence3.correct())

Word objects have a ***Word.spellcheck()*** method that returns a list of (word, confidence) tuples with spelling suggestions.


In [None]:
from textblob import Word
word_3 = Word("abbility")
print(word_3.spellcheck())


Here 1.0 means the model is 100% confident. The model may be less confident when there are several alternative words:

In [None]:
from textblob import Word

word_4 = Word("wormd")
print(word_4.spellcheck())

Remember that TextBlob's default corrector is not perfect; it is just a statistical model that processes the text word by word. It doesn't always understand the context of the whole sentence. For example, if you type "I went to the see" instead of "sea," it might not correct it because "see" is already a valid word.

Also remember that calling .correct() on a massive book (millions of words) can be slow because it has to check every single word against its dictionary.

Finally, if you have medical jargon or very technical slang, TextBlob might try to "correct" those into common English words, which might not be what you want!

#### Exercise 2

Use the ***.correct()*** method to return a polished version of the following sentences:

Sentence 1: "I am realy hapy that the weatherr is sunni today!"
<br>
Sentence 2: "I am realy hapy that the wedder is suny today!"

In [None]:
from textblob import TextBlob

# A sentence with deliberate typos
messy_text1 = "I am realy hapy that the weatherr is sunni today!"
messy_text2 = "I am realy hapy that the wedder is suny today!"
# Create the TextBlob
sentence_blob1 = TextBlob(messy_text1)
sentence_blob2 = TextBlob(messy_text2)

# Apply the correction
clean_text1 = sentence_blob1.correct()
clean_text2 = sentence_blob2.correct()

print("Original:", messy_text1)
print("Corrected:",clean_text1)
print()
print("Original 2:", messy_text2)
print("Corrected 2:",clean_text2)

The second sentence is not corrected as expected! Why is that? This example reveals the limitations of statistical spellcheckers. TextBlob’s ***.correct()*** method doesn't "read" the sentence like a human. It looks at each word individually and asks: "What is the most statistically likely word that looks like this?". TextBlob sees "wedder" and looks for the closest matches.

- Weather: requires changing "dd" to "ath" (3 changes).

- Redder: only requires changing "w" to "r" (1 change).

Because "redder" is "closer" in terms of typing distance (Edit Distance), the model picks it, even though it makes no sense in the context of the sky!
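
The edit counts quoted above can be checked with a short sketch of the classic Levenshtein distance in plain Python. The function name `edit_distance` is just a label for this example, and this is not the code TextBlob runs; it only counts insertions, deletions, and substitutions (a swap of two adjacent letters counts as two edits here).

```python
def edit_distance(source, target):
    """Minimum number of single-letter insertions, deletions, or substitutions."""
    # previous[j] holds the distance between the processed prefix of source and target[:j]
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete a letter from source
                current[j - 1] + 1,      # insert a letter into source
                previous[j - 1] + cost,  # substitute (or keep a matching letter)
            ))
        previous = current
    return previous[-1]

print("wedder -> weather:", edit_distance("wedder", "weather"))
print("wedder -> redder:", edit_distance("wedder", "redder"))
```

Running it confirms the reasoning: "weather" is three edits away from "wedder", while "redder" is only one edit away.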

Furthermore:

- Sunny: requires adding a letter ("n").

- Sun: requires removing a letter ("y").

In the specific dataset TextBlob uses, the word "sun" is much more common than the adjective "sunny." It assumes you accidentally added a "y" to the end of "sun."
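
When several real words are equally close, word frequency breaks the tie. The sketch below shows that idea in plain Python: it generates every string one edit away from the input and keeps the known word with the highest count. The `TOY_COUNTS` table is invented for this illustration (real frequencies come from a large corpus), and `edits1` and `toy_correct` are hypothetical helper names.

```python
import string

# Invented word frequencies, for illustration only
TOY_COUNTS = {"sun": 120, "sunny": 15, "weather": 80, "redder": 2}

def edits1(word):
    """All strings that are one delete, swap, replace, or insert away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    swaps = [left + right[1] + right[0] + right[2:] for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + swaps + replaces + inserts)

def toy_correct(word):
    """Return the most frequent known word within one edit, or the word itself."""
    if word in TOY_COUNTS:
        return word
    candidates = [w for w in edits1(word) if w in TOY_COUNTS]
    if not candidates:
        return word
    return max(candidates, key=TOY_COUNTS.get)

print(toy_correct("suny"))
print(toy_correct("wedder"))
```

With these made-up counts, "suny" becomes "sun" because "sun" is more frequent than "sunny", and "wedder" becomes "redder" because "weather" is more than one edit away and is never considered.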

This is one of the major limitations of TextBlob and traditional language models in general: TextBlob is a "dictionary" checker. It looks at words in isolation and generally does not take the context of words into account!

### 5) Lemmatization

Lemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item (dictionary form). It’s like looking up the "root" of a word in a professional dictionary. Words can be lemmatized by calling the ***.lemmatize()*** method. Here we are using the *Word* object directly. This object uses **WordNet**, a massive lexical database of English, to find the root.

In [None]:
from textblob import Word
word_1 = Word("octopi")
print(word_1.lemmatize())
word_2 = Word("went")
print(word_2.lemmatize("v")) 


TextBlob recognizes that "octopi" is the irregular plural of "octopus." A simple stemmer, which only strips endings by rule, might leave a non-word such as "octop," whereas the lemmatizer returns the actual base noun. Also, by default, the .lemmatize() method assumes the word is a noun. If you want to lemmatize a verb, you have to tell it by passing the part of speech ("v"). Without the "v", TextBlob would look for a noun named "went" and find nothing, so it would just return "went." By specifying the verb, it correctly identifies "go" as the base form.

Lemmatization is vital when you want to count word frequencies accurately. Without it, your computer thinks "run," "ran," and "running" are three completely different topics. After lemmatization, it knows they are all the same action: "run."
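
Here is a small sketch of that effect on word counts, in plain Python. The `TOY_LEMMAS` mapping is hand-written for this example and stands in for a real lemmatizer such as the WordNet-based one described above.

```python
from collections import Counter

# Hand-written stand-in for a lemmatizer (hypothetical, covers only these words)
TOY_LEMMAS = {"ran": "run", "running": "run", "runs": "run"}

tokens = ["run", "ran", "running", "walk", "runs"]

raw_counts = Counter(tokens)
lemma_counts = Counter(TOY_LEMMAS.get(token, token) for token in tokens)

print("Without lemmatization:", raw_counts)
print("With lemmatization:", lemma_counts)
```

Before lemmatization every form is counted once; afterwards the four forms of "run" are counted together.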

### 6) N-grams

An n-gram model is one of the simplest traditional language models. The n-grams functionality in TextBlob is a simple but powerful tool for breaking down text into contiguous sequences of n items (words). This is a standard technique in NLP used to understand the context of words by looking at their neighbors. The idea is that the probability of a word depends on the last few words, with n representing the number of words considered. In summary, the TextBlob.ngrams() method returns a list of WordList objects, each containing n consecutive words from your text; for example, calling *.ngrams(n=3)* gives you every run of three successive words.


- Unigram (n=1): Only the current word is considered (no context).
- Bigram (n=2): The model considers one previous word.
- Trigram (n=3): The model considers two previous words, and so on.
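
The sliding-window idea behind n-grams fits in a few lines of plain Python. The helper `ngrams` below is written for this illustration and returns tuples rather than TextBlob's WordList objects.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "Now is better than never".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A text with five words yields five unigrams, four bigrams, and three trigrams, because the window has fewer places to start as it gets wider.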





In [None]:
blob = TextBlob("Now is better than never.")
print(blob.ngrams(n=3))

In NLP, n-grams are essential for tasks where single words don't provide enough meaning:

- Sentiment Context: "Not happy" is a bigram. If you only look at unigrams ("not" and "happy"), a model might think the text is positive because of the word "happy." The bigram captures the negation.

- Auto-complete/Next-word Prediction: By analyzing which words frequently follow others (e.g., "New York" vs. "New Table"), models can predict the next word in a sequence.

- Entity Recognition: Identifying phrases like "Social Media" or "Artificial Intelligence" as single concepts rather than separate words.
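
The next-word prediction idea can be sketched by counting which word follows which. The tiny corpus below is invented for this example, and `predict_next` is a hypothetical helper; real systems use far larger corpora and smoothing.

```python
from collections import Counter, defaultdict

# Invented mini-corpus, already tokenized by whitespace
corpus = "new york is big . new york is busy . new table is small .".split()

# For each word, count the words that immediately follow it (bigram counts)
following = defaultdict(Counter)
for previous_word, next_word in zip(corpus, corpus[1:]):
    following[previous_word][next_word] += 1

def predict_next(word):
    """Return the most frequent follower of word, or None if the word is unseen."""
    counts = following.get(word)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

print(predict_next("new"))
print(following["new"])
```

In this corpus "york" follows "new" twice and "table" only once, so "york" is the prediction.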


To count the frequency of n-grams, we can use the Counter class from Python's collections module.


In [None]:
# counting the top 3 bigrams in the text

from textblob import TextBlob
from collections import Counter

text= """
Social science is the study of people: as individuals, communities and societies; their behaviours and interactions with each other and with their built, technological and natural environments. Social science seeks to understand the evolving human systems across our increasingly complex world and how our planet can be more sustainably managed. It’s vital to our shared future.

Social science includes many different areas of study, such as how people organise and govern themselves, and broker power and international relations; how wealth is generated, economies develop, and economic futures are modelled; how business works and what a sustainable future means; the ways in which populations are changing, and issues of unemployment, deprivation and inequality; and how these social, cultural and economic dynamics vary in different places, with different outcomes.
"""

# 1. Initialize TextBlob and create bigrams
blob = TextBlob(text)
bigrams = blob.ngrams(n=2)

# 2. Create an empty list to store the "string" versions
# The Counter tool is designed to count individual items in a list.
bigram_list = []

# 3. Loop through and add them to our list
for bigram in bigrams:
    # bigram is something like ['Social', 'science']
    # We turn it into "Social science" and add it to the list
    bigram_as_string = bigram[0] + " " + bigram[1]
    bigram_list.append(bigram_as_string)

# 4. Use Counter to find the most frequent pairs
bigram_counts = Counter(bigram_list)

# 5. Extract the top 3
top_3_bigrams = bigram_counts.most_common(3)

print("Top 3 Bigrams:")
for bigram, count in top_3_bigrams:
    print(bigram,":", count)

#### Exercise 3

Find the most frequent unigram in the following body of text:

"Never forget what you are, for surely the world will not. Make it your strength. Then it can never be your weakness. Armor yourself in it, and it will never be used to hurt you."

In [None]:
from textblob import TextBlob
from collections import Counter

text = """Never forget what you are, for surely the world will not. Make it 
your strength. Then it can never be your weakness. Armor yourself in it,
and it will never be used to hurt you."""

# 1. Initialize TextBlob and create unigrams
blob = TextBlob(text)
unigrams = blob.ngrams(n=1)

# 2. Create an empty list to store the "string" versions
# The Counter tool is designed to count individual items in a list.
unigram_list = []

# 3. Loop through and add them to our list
for unigram in unigrams:
    # unigram is a one-word list such as ['Never']
    # We take that single word as a string and add it to the list
    unigram_as_string = str(unigram[0])
    unigram_list.append(unigram_as_string)

# 4. Use Counter to find the most frequent word
unigram_counts = Counter(unigram_list)

# 5. Extract the top unigram
top_unigram = unigram_counts.most_common(1)

print("Top unigram:",top_unigram)

In most real-world applications, we need to exclude words like "the" and "it" from our n-gram analysis and counts, as they usually carry little meaning for NLP tasks. We also need to lowercase n-grams to avoid counting "communication" and "Communication" as two different words! To filter out these "stop words", you check each word against a list of common words to ignore (like "the", "is", "at"), using a loop or a list comprehension. This is essential for finding the actual meaning in a text, rather than just seeing "of the" or "in a" repeatedly. Here is one straightforward way to do it using NLTK’s stop word list:

In [None]:
from textblob import TextBlob
from collections import Counter
import nltk
from nltk.corpus import stopwords

# Ensure the stop words are downloaded
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

text = """Never forget what you are, for surely the world will not. Make it 
your strength. Then it can never be your weakness. Armor yourself in it,
and it will never be used to hurt you."""

# 1. Initialize TextBlob and create unigrams (n=1)
blob = TextBlob(text)
unigrams = blob.ngrams(n=1)

# 2. Create an empty list for the filtered words
unigram_list = []

# 3. Loop through and add only if it's NOT a stop word
for word in unigrams:
    unigram_as_string = str(word[0]).lower() # Convert to lower to match stop_words list
    
    # NEW: Check if the word is in the stop words list
    if unigram_as_string not in stop_words:
        unigram_list.append(unigram_as_string)

# 4. Use Counter to find the most frequent meaningful word
unigram_counts = Counter(unigram_list)

# 5. Extract the top unigram
top_unigram = unigram_counts.most_common(1)

print("Top unigram (filtered):", top_unigram)
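
The same filtering can be written more compactly with a list comprehension. The sketch below is self-contained and uses plain Python only: the small `TOY_STOP_WORDS` set is a hand-picked stand-in for NLTK's much longer English list, and the regular expression is a simplified tokenizer chosen for this example.

```python
import re
from collections import Counter

# Hand-picked stand-in for a real stop word list (assumption for this sketch)
TOY_STOP_WORDS = {"what", "you", "are", "for", "the", "will", "not", "it",
                  "your", "then", "can", "be", "yourself", "in", "and", "to"}

text = """Never forget what you are, for surely the world will not. Make it
your strength. Then it can never be your weakness. Armor yourself in it,
and it will never be used to hurt you."""

# Lowercase, keep alphabetic runs only, and drop stop words in one expression
filtered_words = [word for word in re.findall(r"[a-z]+", text.lower())
                  if word not in TOY_STOP_WORDS]

top_word = Counter(filtered_words).most_common(1)
print("Top unigram (filtered):", top_word)
```

Lowercasing before counting is what lets "Never" and "never" be counted as the same word.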

### 7) Sentiment analysis

The lexicon is a pre-defined dictionary. It contains 3,546 words and phrases, each annotated with its sentiment and other language characteristics. The lexicon is the foundation of TextBlob's default, rule-based sentiment analyzer. Specifically, the lexicon is stored in a file named en-sentiment.xml, which maps thousands of English words to their respective sentiment polarity (positive/negative) and subjectivity (objective/subjective) scores. These are the steps that TextBlob takes before showing you the sentiment scores for a body of text:

- Tokenization: TextBlob splits a sentence into individual words.
- Lookup: It searches for each word in the en-sentiment.xml lexicon.
- Aggregation: It computes the overall sentiment by averaging the polarity and subjectivity scores of all recognized words.

One type of sentiment analysis is to calculate the overall positivity or negativity of a text corpus. There are a variety of algorithms and scales. TextBlob’s default ***.sentiment*** property rates an input text as negative or positive on a scale of -1 to 1 (the lexicon approach). It returns two numbers in the form of a named tuple:

**Sentiment(polarity, subjectivity)**

The polarity score is a float within the range [-1.0, 1.0]. Subjectivity is a float within the range [0.0, 1.0], where 0.0 is very objective (i.e., factual rather than opinion-based) and 1.0 is very subjective.


In [None]:
testimonial = TextBlob("Textblob is amazingly simple to use. What a fun library!")

print(testimonial.sentiment)
print(testimonial.sentiment.polarity)

When you call ***TextBlob(text).sentiment***, the library:

- Tokenizes the text into words.
- Identifies words present in its sentiment lexicon.
- Calculates an average polarity and subjectivity score for the entire sentence based on individual word scores.
- Adjusts for negations or modifiers.
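
The steps above can be sketched with a toy lexicon-based scorer in plain Python. This is an illustration of the general idea only: the scores in `TOY_LEXICON` are invented, the negation rule (flip the sign of the next word) is a simplification, and the real analyzer uses its own lexicon and weighting rules.

```python
import re

# Invented polarity scores, for illustration only
TOY_LEXICON = {"great": 0.8, "slow": -0.3, "fun": 0.3}
NEGATIONS = {"not", "never"}

def toy_polarity(text):
    """Average the scores of known words; a negation flips the very next word."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = []
    negate_next = False
    for word in words:
        if word in NEGATIONS:
            negate_next = True
            continue
        if word in TOY_LEXICON:
            score = TOY_LEXICON[word]
            scores.append(-score if negate_next else score)
        negate_next = False  # A negation only reaches the word right after it
    return sum(scores) / len(scores) if scores else 0.0

print(toy_polarity("This product is not great, it is very slow!"))
print(toy_polarity("What a fun library!"))
```

Words missing from the lexicon contribute nothing, so a text made only of unknown words scores 0.0. The same behavior explains many "neutral" results from lexicon-based tools.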

In [None]:
sentence2 = TextBlob("This product is not great, it is very slow!")
print(sentence2.sentiment)

Here, polarity represents the sentiment of the text, ranging from -1 to 1:

- -1: Very negative sentiment
- 0: Neutral sentiment
- 1: Very positive sentiment

The closer the polarity is to 1, the more positive the sentiment is. The closer the polarity is to -1, the more negative the sentiment is. If it's close to 0, it suggests neutral sentiment. Subjectivity measures how subjective or objective the text is, with a range from 0 to 1:

- 0: Very objective (factual)
- 1: Very subjective (opinion-based)

Higher subjectivity values indicate that the text is more opinion-based or emotional, while lower values indicate that the text is more factual or neutral.
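
When turning a polarity score into a label, it often helps to reserve a small band around 0 for "neutral". The sketch below shows one way to do that; the function name `label_polarity` and the `neutral_band` threshold of 0.05 are arbitrary choices for this example, not TextBlob settings.

```python
def label_polarity(polarity, neutral_band=0.05):
    """Map a polarity score in [-1, 1] to 'positive', 'negative', or 'neutral'."""
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"

for score in (0.8, -0.4, 0.0, 0.03):
    print(score, "->", label_polarity(score))
```

The right width for the neutral band depends on your data, so treat the threshold as something to tune rather than a fixed rule.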


#### Exercise 4 

Using TextBlob, determine whether the sentiment of each of the following four Amazon reviews is positive or negative (analyze each review separately). Print "positive" or "negative" for each one.

- Review 1:  The computer speed is great, I love it!
- Review 2:  The computer is extremely slow and unreliable.
- Review 3:  It worth the money but it's just average.
- Review 4:  It is so fast that I can see bar load loading!

Did you see any limitations when using TextBlob for sentiment analysis?


In [None]:
reviews = [
    "The computer speed is great, I love it!",
    "The computer is extremely slow and unreliable.",
    "It worth the money but it's just average.",
    "It is so fast that I can see bar load loading!"
]

for i, review_text in enumerate(reviews, 1):
    blob = TextBlob(review_text)
    
    print(f"Review {i}:")
    for sentence in blob.sentences:
        # Determine sentiment based on polarity
        if sentence.sentiment.polarity > 0:
            sentiment_label = "positive"
        else:
            sentiment_label = "negative"
            
        print(f"  Sentence: '{sentence}' -> {sentiment_label}")

Note: TextBlob is built on top of the pattern library, allowing for efficient sentiment lookup, but it is less capable of understanding context-dependent, nuanced, or sarcastic text compared to modern large language models.

Why it lacks deep "Conceptual Understanding":

- Lexicon-Based Approach: By default, TextBlob's sentiment analyzer is lexicon-based, meaning it calculates sentiment by averaging scores of predefined words and phrases, not by truly understanding the concept behind them.
- No Vectorization/Semantic Depth: TextBlob does not provide advanced features like word vectors or deep neural networks that allow for semantic understanding of relationships between words.
- Limited Context: While n-grams are better than unigrams (single words), they only capture local, immediate context (the 2 or 3 words next to each other), not the broader conceptual meaning of a sentence or paragraph.

The textblob.sentiments module contains two sentiment analysis implementations, **PatternAnalyzer** (based on the pattern library) and **NaiveBayesAnalyzer** (an NLTK classifier trained on a movie reviews corpus).

The default implementation is PatternAnalyzer, but you can override the analyzer by passing another implementation into a TextBlob’s constructor.

For instance, the NaiveBayesAnalyzer returns its result as a namedtuple of the form: Sentiment(classification, p_pos, p_neg).


In [None]:
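import nltk
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer

# NaiveBayesAnalyzer trains on NLTK's movie_reviews corpus (training happens on first use)
nltk.download('movie_reviews')
review_example = TextBlob("I love this library", analyzer=NaiveBayesAnalyzer())
print(review_example.sentiment)
print(review_example.sentiment.classification)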


As mentioned before, by default, TextBlob uses a "PatternAnalyzer" (based on a simple word-dictionary), but here you are swapping it out for a more advanced machine learning model. Instead of just looking for "good" or "bad" words, it calculates the probability that a sentence belongs to the "positive" class versus the "negative" class based on patterns it learned during training. It is often more "opinionated" than the default analyzer. It doesn't just give you a polarity score between -1 and 1; it gives you a specific classification.

- classification: The model's final "vote" (either 'pos' for positive or 'neg' for negative).

- p_pos: The calculated probability that the text is positive.

- p_neg: The calculated probability that the text is negative.

In this example, "I love this library", the model will see the word "love" (which appears frequently in positive movie reviews) and will likely assign a very high p_pos (positive probability) and classify the text as 'pos'.
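
To show where numbers like p_pos and p_neg come from, here is a minimal Naive Bayes sketch in plain Python. It is trained on four invented sentences rather than the movie reviews corpus, uses simple add-one smoothing, and is not the NLTK classifier that TextBlob wraps; the names `TRAINING_DATA` and `toy_classify` are hypothetical.

```python
import math
from collections import Counter

# Invented training data, for illustration only
TRAINING_DATA = [
    ("i love this film", "pos"),
    ("a great and fun movie", "pos"),
    ("i hate this film", "neg"),
    ("a boring and slow movie", "neg"),
]

word_counts = {"pos": Counter(), "neg": Counter()}
doc_counts = Counter()
for sentence, label in TRAINING_DATA:
    word_counts[label].update(sentence.split())
    doc_counts[label] += 1

vocabulary = set(word_counts["pos"]) | set(word_counts["neg"])

def toy_classify(text):
    """Return (classification, p_pos, p_neg) using Naive Bayes with add-one smoothing."""
    log_scores = {}
    for label in ("pos", "neg"):
        total_words = sum(word_counts[label].values())
        # Start from the prior: the share of training sentences with this label
        score = math.log(doc_counts[label] / sum(doc_counts.values()))
        for word in text.lower().split():
            if word in vocabulary:  # Words never seen in training are skipped
                score += math.log((word_counts[label][word] + 1)
                                  / (total_words + len(vocabulary)))
        log_scores[label] = score
    # Convert log scores into probabilities that sum to 1
    highest = max(log_scores.values())
    weights = {label: math.exp(s - highest) for label, s in log_scores.items()}
    normalizer = sum(weights.values())
    p_pos = weights["pos"] / normalizer
    p_neg = weights["neg"] / normalizer
    return ("pos" if p_pos >= p_neg else "neg"), p_pos, p_neg

print(toy_classify("I love this library"))
```

In this toy setup, "love" appears only in positive training sentences, so it is the word that tips the probability toward 'pos'. Words that are equally common in both classes, such as "i" and "this", do not change the balance.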

### 8) TextBlob and Python Strings

TextBlob objects are like Python strings, meaning they support slicing and indexing. TextBlob was designed to feel like a natural extension of Python’s built-in str type, so you don't have to learn new syntax to grab specific parts of your text. Just like a string, you can use square brackets [] to grab a character at a specific position. Remember that Python starts counting at 0. You can also use the [start:stop] syntax to "slice" out a portion of the text. Finally, you can use string methods like ***.upper()***.

In [None]:
sentences_example2 = TextBlob(
    "Beautiful is better than ugly. "
    "Explicit is better than implicit. "
    "Simple is better than complex."
)
print(sentences_example2[0])
print(sentences_example2[0:19])
print(sentences_example2.upper())


Remember when you slice a TextBlob or use a method like .upper(), the result isn't just a plain string—it's a new TextBlob object. This means the "child" slice inherits all the "DNA" of the parent, allowing you to run NLP tasks (like sentiment or tagging) on that specific fragment immediately.

#### Exercise 5

In NLP, we often want to isolate specific parts of a document to see how the tone shifts. In the text provided below, slice the string to skip the "hateful" part and analyze only the "loving" part.

Using TextBlob, calculate the sentiment polarity of the second sentence in this text ("I love coding!"):

I hate chores. I love coding!



In [None]:
from textblob import TextBlob
sentence_5= TextBlob("I hate chores. I love coding!")

# Slice the second sentence and check its mood immediately
print(sentence_5[15:].sentiment.polarity)

### 9) How to Import Textual Data from Web

In real-world applications, we usually need to import textual data from a file, often a raw text file hosted on the web. Let’s first create a text file using a text editor (nano, or even the JupyterLab text editor). Open a blank text file and type in a sentence like:

*“This is a simple sentence that I typed for illustration purposes. I do not intend to use it for anything else.”* 

Once you finish typing, save the file as mytext.txt and return to your notebook. To read a file from the web, it must be hosted somewhere reachable by URL. The code below fetches a copy hosted on GitHub (note the "raw" URL) and uses TextBlob to compute the subjectivity of the text:


In [None]:
import requests
from textblob import TextBlob

# 1. Use the "RAW" version of the GitHub URL
url = "https://raw.githubusercontent.com/YasharMonf/Textual_data_analysis/main/mytext.txt"

# 2. Fetch the document data from the web
response = requests.get(url, timeout=10)
response.raise_for_status()  # Stop early if the download failed

# 3. Get the text content from the response
document = response.text

print("Data type:", type(document))

# 4. Process the document text with TextBlob
document_blob = TextBlob(document)
print(document_blob.sentiment.subjectivity)
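
If the file sits on your own machine rather than on the web, you can read it with Python's standard library instead of `requests`. The sketch below writes the sample sentence to a temporary mytext.txt first so that it runs anywhere; in practice you would skip that step and point `Path` at your own file. The temporary folder is only an assumption made to keep the example self-contained.

```python
import tempfile
from pathlib import Path

sample = ("This is a simple sentence that I typed for illustration purposes. "
          "I do not intend to use it for anything else.")

with tempfile.TemporaryDirectory() as folder:
    # Stand-in for the file you saved from your text editor
    file_path = Path(folder) / "mytext.txt"
    file_path.write_text(sample, encoding="utf-8")

    # Read the whole file into a single Python string
    document = file_path.read_text(encoding="utf-8")

print("Data type:", type(document))
print(document)
```

Either way, the result is an ordinary Python string, which can be passed to TextBlob(document) exactly as in the cell above.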

To wrap up this session, it’s important to understand where TextBlob shines and where it hits a wall, especially as we prepare to transition into the world of Large Language Models (LLMs). 

Advantages of TextBlob:

- Extremely Beginner-Friendly: If you know how to use a Python string, you practically already know how to use TextBlob.

- Built-in properties and methods: It bundles complex tasks (tagging, sentiment, spelling correction) into simple properties. 

- Efficient and Fast: Because it uses rule-based logic and lightweight statistical models, it can process thousands of short sentences in seconds on a standard laptop.

- Logic Transparency: Unlike "black box" AI models, TextBlob’s default sentiment is based on a dictionary. You can actually look up why a word like "excellent" gave a score of 1.0.

Limitations of TextBlob:

- Context Blindness: TextBlob often struggles with sarcasm or double negatives. For example, it might struggle to understand that "not bad" is actually "good."

- Limited Deep Understanding: It treats language more like a bag of words than a coherent thought. It doesn't understand the relationship between ideas across long paragraphs.

- Dependency on NLTK Corpora: As you saw today, you have to manually download data packages (punkt, wordnet) to make it work, which can be a hurdle for deployment.

- Language Limitations: The core features are heavily optimized for English and may perform poorly on other languages.

## References and further information

TextBlob GitHub page: https://github.com/sloria/textblob


TextBlob official documentation: https://textblob.readthedocs.io/en/dev/