
# Section 8: Python example - basic text analysis

Text analysis is a fundamental aspect of Natural Language Processing (NLP) that involves processing text to understand its structure, meaning, and intent. In this section, we will use Python to demonstrate basic text analysis techniques, including tokenization, part-of-speech tagging, and named entity recognition. We'll use the Natural Language Toolkit (NLTK), a widely used library for NLP in Python.

1. Setting Up the Environment:

To start with text analysis in Python, you'll need the NLTK library, which provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. Install NLTK using pip if it's not already installed:

In [None]:
%pip install nltk

2. Importing Required Libraries:

After installation, import NLTK and download the necessary datasets and models:

In [None]:
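import nltk

# Resource names used by older NLTK releases
nltk.download('punkt')                       # sentence and word tokenizer models
nltk.download('averaged_perceptron_tagger')  # part-of-speech tagger
nltk.download('maxent_ne_chunker')           # named entity chunker
nltk.download('words')                       # word list used by the chunker

# Resource names used by newer NLTK releases (around 3.9 and later)
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('maxent_ne_chunker_tab')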

3. Tokenization:

Tokenization breaks text into smaller units called tokens, typically words or sentences. It is often the first step in text analysis.

In [None]:
from nltk.tokenize import word_tokenize, sent_tokenize
text = "Hello there! Welcome to Principles of Data Science. Hope you find it helpful."
print("Sentence Tokenization:")
print(sent_tokenize(text))
print("\nWord Tokenization:")
print(word_tokenize(text))
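
To make the idea of tokenization concrete, here is a minimal sketch that splits the same text using only regular expressions from the standard library. It is a simplified stand-in and not the algorithm NLTK uses; the function names `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical, and the sentence rule assumes every `.`, `!`, or `?` followed by whitespace ends a sentence, so it would mis-split abbreviations such as "Dr. Smith".

```python
import re

def simple_sent_tokenize(text):
    # Split at whitespace that follows ., ! or ? (naive rule, an assumption)
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def simple_word_tokenize(text):
    # Match runs of word characters (allowing internal apostrophes),
    # or any single character that is neither a word character nor whitespace
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

text = "Hello there! Welcome to Principles of Data Science. Hope you find it helpful."
sentences = simple_sent_tokenize(text)
tokens = simple_word_tokenize(text)

print(sentences)
print(tokens)
```

Comparing this output with that of `sent_tokenize` and `word_tokenize` on harder inputs (abbreviations, contractions, URLs) is a quick way to see why a trained tokenizer is usually preferable to hand-written rules.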

4. Part-of-Speech Tagging:

Part-of-Speech (POS) tagging labels each word in a sentence with its grammatical category, such as noun, verb, or adjective. This is useful for many language processing tasks because it reveals the grammatical role each word plays in the sentence.

In [None]:
words = word_tokenize(text)
print("\nPart-of-Speech Tagging:")
print(nltk.pos_tag(words))
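
`nltk.pos_tag` returns a list of `(word, tag)` tuples, by default using Penn Treebank tags. The sketch below shows one way such a list might be filtered and grouped afterwards. The `tagged` list is hand-written for illustration and is not actual tagger output, and the helper names `words_with_prefix` and `group_by_tag` are hypothetical.

```python
from collections import defaultdict

# Hand-tagged example using Penn Treebank tags (illustrative, not NLTK output)
tagged = [("The", "DT"), ("cat", "NN"), ("sat", "VBD"),
          ("on", "IN"), ("the", "DT"), ("mat", "NN")]

def words_with_prefix(tagged_tokens, prefix):
    # Noun tags start with "NN" (NN, NNS, NNP, ...), verb tags with "VB"
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]

def group_by_tag(tagged_tokens):
    # Collect the words seen under each tag, preserving their order
    groups = defaultdict(list)
    for word, tag in tagged_tokens:
        groups[tag].append(word)
    return dict(groups)

nouns = words_with_prefix(tagged, "NN")
verbs = words_with_prefix(tagged, "VB")
by_tag = group_by_tag(tagged)

print(nouns)
print(verbs)
print(by_tag)
```

The same helpers should work unchanged on the result of `nltk.pos_tag(words)`, since it has the same list-of-tuples shape.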

5. Named Entity Recognition (NER):

NER identifies named entities mentioned in text and classifies them into predefined categories such as persons, organizations, and locations.

In [None]:
from nltk import ne_chunk
print("\nNamed Entity Recognition:")
print(ne_chunk(nltk.pos_tag(word_tokenize(text))))
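
`ne_chunk` returns an `nltk.Tree` in which each recognized entity is a subtree labeled with its category, while ordinary tokens remain plain `(word, tag)` tuples. The sketch below shows one way to pull the entities out as `(label, text)` pairs. The helper name `extract_entities` is hypothetical, and the `example` tree is built by hand so the snippet runs without downloading any models; it is not real chunker output.

```python
from nltk import Tree

def extract_entities(chunked):
    """Return (label, text) pairs for each entity subtree in a chunked sentence."""
    entities = []
    for node in chunked:
        # Entity chunks are Tree objects; ordinary tokens are (word, tag) tuples
        if isinstance(node, Tree):
            words = [word for word, _tag in node.leaves()]
            entities.append((node.label(), " ".join(words)))
    return entities

# Hand-built tree for illustration only (an assumption, not chunker output)
example = Tree("S", [
    Tree("PERSON", [("Alice", "NNP")]),
    ("visited", "VBD"),
    Tree("GPE", [("New", "NNP"), ("York", "NNP")]),
    (".", "."),
])

entities = extract_entities(example)
print(entities)
```

On real data, the same function could be called as `extract_entities(ne_chunk(nltk.pos_tag(word_tokenize(text))))`.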

6. Conclusion:

This example demonstrates the first steps in text analysis using NLTK in Python: tokenization, part-of-speech tagging, and named entity recognition. These techniques underpin more complex NLP tasks such as sentiment analysis, machine translation, and information extraction. Knowing how to perform basic text processing lets you explore and analyze text data in business, social media, academic, and many other settings.

As you advance, you can explore more sophisticated tools, such as the spaCy library or transformer-based models like BERT and GPT, which offer pre-trained models capable of understanding and generating human-like text. These let you handle more diverse and complex NLP tasks than the rule-based and classical statistical methods shown here.