# Introduction to NLP: Tokenization and Sentiment Analysis

Welcome to this introductory tutorial on Natural Language Processing (NLP)! In this notebook, we'll learn about two fundamental NLP tasks:
1. Tokenization - breaking text into smaller units (words, sentences)
2. Sentiment Analysis - determining the emotional tone of text

Let's start by setting up our environment and importing the necessary libraries.

In [None]:
# Import required libraries
import nltk
import pandas as pd
import numpy as np
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.sentiment import SentimentIntensityAnalyzer

# Download required NLTK data
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases (3.9+) load the tokenizer data from this package
nltk.download('stopwords')
nltk.download('vader_lexicon')

print("Setup complete! Let's begin our NLP journey.")

## Part 1: Tokenization

Tokenization is the process of breaking down text into smaller units called tokens. These tokens can be words, sentences, or even characters. Let's explore different types of tokenization.
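
Before turning to NLTK, it can help to see what "different units" means in plain Python. The following is a minimal sketch using only the standard library; the `sample` string and the regular expression are illustrative choices, not part of NLTK's tokenizers.

```python
import re

sample = "We're learning NLP today."

# Character-level tokens: every character, including spaces and punctuation
char_tokens = list(sample)

# Naive word-level tokens: split on whitespace only, so punctuation stays attached
whitespace_tokens = sample.split()

# Regex word-level tokens: runs of word characters/apostrophes, or a single punctuation mark
regex_tokens = re.findall(r"[\w']+|[^\w\s]", sample)

print(char_tokens[:5])     # ['W', 'e', "'", 'r', 'e']
print(whitespace_tokens)   # ["We're", 'learning', 'NLP', 'today.']
print(regex_tokens)        # ["We're", 'learning', 'NLP', 'today', '.']
```

Note how the whitespace split leaves `today.` as one token, while the regex version separates the period. Dedicated tokenizers handle many more such cases.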

### Word Tokenization

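Word tokenization splits text into individual words and punctuation marks. NLTK's `word_tokenize` also splits contractions, so "We're" becomes the two tokens `We` and `'re`. Let's see how it works: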

In [None]:
# Example text
text = "Hello! This is a sample text. We're learning NLP today."

# Word tokenization using NLTK
words = word_tokenize(text)
print("Original text:", text)
print("\nTokenized words:", words)

### Sentence Tokenization

Sentence tokenization splits text into individual sentences. This is useful when you want to analyze text at the sentence level:

In [None]:
# Sentence tokenization using NLTK
sentences = sent_tokenize(text)
print("Original text:", text)
print("\nTokenized sentences:", sentences)
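
To see why a dedicated sentence tokenizer is worth using, compare it with a naive rule. The sketch below is not how `sent_tokenize` works internally. It is a plain-regex baseline (the `naive_sent_split` helper and the `tricky` string are made up for illustration) that splits after every `.`, `!`, or `?` followed by whitespace.

```python
import re

def naive_sent_split(text):
    """Split after ., !, or ? when followed by whitespace (illustrative baseline only)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

tricky = "Dr. Smith arrived. He was late!"
naive_sentences = naive_sent_split(tricky)
print(naive_sentences)  # ['Dr.', 'Smith arrived.', 'He was late!']
```

The naive rule treats the abbreviation "Dr." as the end of a sentence. Try passing the same string to `sent_tokenize` and compare the results.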

### Practice Exercise: Tokenization

Now it's your turn! Try tokenizing the following text into words and sentences:

In [None]:
# Practice text
practice_text = "NLP is fascinating! We can analyze text in many ways. What do you think?"

# TODO: Tokenize the text into words
words = word_tokenize(practice_text)
print("Tokenized words:", words)

# TODO: Tokenize the text into sentences
sentences = sent_tokenize(practice_text)
print("\nTokenized sentences:", sentences)

## Part 2: Sentiment Analysis

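Sentiment analysis is the process of determining the emotional tone of text. We'll use NLTK's VADER (Valence Aware Dictionary and sEntiment Reasoner) for this task. VADER is a lexicon- and rule-based analyzer, so it needs no training data: it looks up each word's valence in a dictionary and adjusts the result for cues such as negation, punctuation, and capitalization.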

### Basic Sentiment Analysis

Let's analyze the sentiment of some example texts:

In [None]:
# Initialize the sentiment analyzer
sia = SentimentIntensityAnalyzer()

# Example texts
texts = [
    "I love this product! It's amazing.",
    "This is the worst experience ever.",
    "The movie was okay, nothing special."
]

# Analyze sentiment for each text
for text in texts:
    scores = sia.polarity_scores(text)
    print(f"\nText: {text}")
    print(f"Sentiment scores: {scores}")
    print(f"Overall sentiment: {'Positive' if scores['compound'] > 0 else 'Negative' if scores['compound'] < 0 else 'Neutral'}")

### Understanding Sentiment Scores

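The VADER sentiment analyzer returns a dictionary with four scores:
1. `neg`: Proportion of the text rated negative (0 to 1)
2. `neu`: Proportion of the text rated neutral (0 to 1)
3. `pos`: Proportion of the text rated positive (0 to 1)
4. `compound`: Sum of the word-level valence scores, normalized to the range -1 (most negative) to 1 (most positive)

The `neg`, `neu`, and `pos` values add up to roughly 1. The `compound` score is the one usually used for an overall label. A common convention treats `compound >= 0.05` as positive, `compound <= -0.05` as negative, and anything in between as neutral; the cells in this notebook use the simpler cutoff of 0.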
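
One way to turn the `compound` score into a label is a small helper with explicit cutoffs. The sketch below assumes the ±0.05 thresholds commonly used with VADER. The function name `label_from_compound` and the example score dictionary are hypothetical; the values are made up and are not output from the analyzer.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER-style compound score in [-1, 1] to a coarse label."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"

# Hypothetical scores in the shape returned by polarity_scores (made-up values)
example_scores = {"neg": 0.0, "neu": 0.4, "pos": 0.6, "compound": 0.8}

print(label_from_compound(example_scores["compound"]))  # Positive
print(label_from_compound(0.01))                        # Neutral
print(label_from_compound(-0.3))                        # Negative
```

With a real analyzer you would call it as `label_from_compound(sia.polarity_scores(text)["compound"])`. A non-zero threshold keeps texts with a barely positive or barely negative score from being labeled as strongly opinionated.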

### Practice Exercise: Sentiment Analysis

Now it's your turn! Analyze the sentiment of these texts:

In [None]:
# Practice texts
practice_texts = [
    "The weather is beautiful today!",
    "I'm feeling a bit tired.",
    "This restaurant has terrible service."
]

# TODO: Analyze sentiment for each text
for text in practice_texts:
    scores = sia.polarity_scores(text)
    print(f"\nText: {text}")
    print(f"Sentiment scores: {scores}")
    print(f"Overall sentiment: {'Positive' if scores['compound'] > 0 else 'Negative' if scores['compound'] < 0 else 'Neutral'}")

## Solutions

Here are the solutions to the practice exercises:

### Tokenization Solution
```python
# Word tokenization
words = word_tokenize(practice_text)
# Result: ['NLP', 'is', 'fascinating', '!', 'We', 'can', 'analyze', 'text', 'in', 'many', 'ways', '.', 'What', 'do', 'you', 'think', '?']

# Sentence tokenization
sentences = sent_tokenize(practice_text)
# Result: ['NLP is fascinating!', 'We can analyze text in many ways.', 'What do you think?']
```

### Sentiment Analysis Solution
```python
for text in practice_texts:
    scores = sia.polarity_scores(text)
    print(f"\nText: {text}")
    print(f"Sentiment scores: {scores}")
    print(f"Overall sentiment: {'Positive' if scores['compound'] > 0 else 'Negative' if scores['compound'] < 0 else 'Neutral'}")
```

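Expected results (exact scores depend on the lexicon version, so run the cell to confirm):
1. "The weather is beautiful today!" - Positive sentiment
2. "I'm feeling a bit tired." - Neutral to mildly negative sentiment. If the lexicon assigns "tired" a negative valence, the `compound < 0` rule above labels this text Negative.
3. "This restaurant has terrible service." - Negative sentiment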