# Section 1: Introduction to NLP

Welcome to the first lesson of our NLP course! In this section, we will cover the fundamental concepts of Natural Language Processing and perform some basic text preprocessing tasks.

**Our goals for this section are:**
1. Understand what NLP is.
2. Learn how to tokenize text.
3. Understand the importance of removing stop words.

## 1. A Quick Refresher

As we discussed in the `getting_started_with_nlp.md` guide, NLP is about getting computers to process and make sense of human language. The first step is to break raw text down into smaller units that a program can work with. Let's start with a simple sentence.

In [1]:
sample_text = "NLP is fascinating! It's a field of AI that has seen rapid growth in recent years."

## 2. Tokenization

Tokenization is the process of splitting text into smaller units called **tokens**, typically words and punctuation marks, or whole sentences. This is a crucial first step for many NLP tasks.
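
To make the idea concrete before turning to a library, here is a minimal sketch that uses only Python's standard `re` module. The function names `simple_word_tokenize` and `simple_sentence_split` are made up for this illustration, and the regular expressions are deliberately naive; this is not how NLTK or spaCy tokenize internally.

```python
import re

def simple_word_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sentence_split(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

example = "NLP is fascinating! It's a field of AI."
print(simple_word_tokenize(example))
# ['NLP', 'is', 'fascinating', '!', 'It', "'", 's', 'a', 'field', 'of', 'AI', '.']
print(simple_sentence_split(example))
# ['NLP is fascinating!', "It's a field of AI."]
```

Note that this naive version breaks `It's` into three tokens and would wrongly split after an abbreviation such as `Dr.`, which is one reason to prefer a library tokenizer in practice.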

### Tokenization with NLTK

In [3]:
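import nltk

# word_tokenize relies on NLTK's Punkt tokenizer data. If it is missing, download
# it once with nltk.download('punkt') (or 'punkt_tab', depending on the NLTK version).

# Word tokenization
tokens = nltk.word_tokenize(sample_text)
print("Word Tokens:", tokens)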

Word Tokens: ['NLP', 'is', 'fascinating', '!', 'It', "'s", 'a', 'field', 'of', 'AI', 'that', 'has', 'seen', 'rapid', 'growth', 'in', 'recent', 'years', '.']


### Tokenization with spaCy

spaCy is another widely used NLP library, known for its speed. Its tokenizer runs as part of a language pipeline, so we first load a model and then call it on the text. Let's see how to tokenize with spaCy.

In [5]:
import spacy

# Load the small English pipeline.
# Install it first if needed: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

# Process the text with spaCy
doc = nlp(sample_text)

# Get tokens
spacy_tokens = [token.text for token in doc]
print("spaCy Tokens:", spacy_tokens)

spaCy Tokens: ['NLP', 'is', 'fascinating', '!', 'It', "'s", 'a', 'field', 'of', 'AI', 'that', 'has', 'seen', 'rapid', 'growth', 'in', 'recent', 'years', '.']


## 3. Stop Words Removal

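Stop words are very common words (like 'is', 'a', 'the') that carry little meaning on their own. Removing them shrinks the token list and lets us focus on the content-bearing words, which helps in tasks such as keyword extraction. Each library ships its own stop word list, so results can differ slightly between NLTK and spaCy.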
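
Before using a library's list, the mechanics can be sketched with a tiny hand-written stop list. The names `toy_stop_words` and `example_tokens` are illustrative only, and the list is far shorter than any real one.

```python
# A tiny hand-written stop list, for illustration only; real lists are much longer.
toy_stop_words = {"is", "a", "the", "of", "that", "has", "in", "it"}

example_tokens = ["NLP", "is", "fascinating", "!", "It", "'s", "a", "field", "of", "AI"]

# Lowercase each token before the lookup so that "It" matches "it".
kept = [tok for tok in example_tokens if tok.lower() not in toy_stop_words]
print(kept)  # ['NLP', 'fascinating', '!', "'s", 'field', 'AI']

# Without lowercasing, the capitalized "It" slips through the filter.
kept_case_sensitive = [tok for tok in example_tokens if tok not in toy_stop_words]
print(kept_case_sensitive)  # ['NLP', 'fascinating', '!', 'It', "'s", 'field', 'AI']
```

Storing the stop words in a `set` keeps each membership check fast, and comparing on the lowercased token avoids missing capitalized words at the start of a sentence.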

### Stop Words with NLTK

In [6]:
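from nltk.corpus import stopwords

# The stop word corpus must be available locally; if it is missing,
# download it once with nltk.download('stopwords').
stop_words = set(stopwords.words('english'))

# Compare on the lowercased token so capitalized words like 'It' are also removed.
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print("Tokens after stop word removal:", filtered_tokens)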

Tokens after stop word removal: ['NLP', 'fascinating', '!', "'s", 'field', 'AI', 'seen', 'rapid', 'growth', 'recent', 'years', '.']


### Stop Words with spaCy

In [7]:
spacy_filtered_tokens = [token.text for token in doc if not token.is_stop]
print("spaCy tokens after stop word removal:", spacy_filtered_tokens)

spaCy tokens after stop word removal: ['NLP', 'fascinating', '!', 'field', 'AI', 'seen', 'rapid', 'growth', 'recent', 'years', '.']
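
The NLTK and spaCy results are not identical. As a quick check, the sketch below compares the two printed outputs; the lists are copied from the cells above, and the variable names are made up for this illustration. The most likely explanation for the difference is that spaCy's stop word list includes the clitic `'s` while NLTK's English list does not, but treat that as an assumption to verify against your installed versions.

```python
# Outputs copied from the NLTK and spaCy stop word cells.
nltk_result = ['NLP', 'fascinating', '!', "'s", 'field', 'AI', 'seen',
               'rapid', 'growth', 'recent', 'years', '.']
spacy_result = ['NLP', 'fascinating', '!', 'field', 'AI', 'seen',
                'rapid', 'growth', 'recent', 'years', '.']

# Tokens that survive in one result but not in the other.
only_in_nltk = [tok for tok in nltk_result if tok not in spacy_result]
only_in_spacy = [tok for tok in spacy_result if tok not in nltk_result]

print(only_in_nltk)   # ["'s"]
print(only_in_spacy)  # []
```

Note that both results still contain punctuation tokens such as `!` and `.`, because stop word removal alone does not filter them out.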


## Exercise

Now it's your turn!
1. Create a new text variable with a sentence of your choice.
2. Tokenize the text using either NLTK or spaCy.
3. Remove the stop words from your tokenized text.

In [8]:
# 1. Create a new text variable with a sentence of your choice.
new_sentence = "The quick brown fox jumps over the lazy dog. This is a classic sentence used for typography."

# Reuse the 'nlp' pipeline loaded in the spaCy tokenization cell.
# A separate variable name keeps the earlier 'doc' intact if cells are re-run.
exercise_doc = nlp(new_sentence)

# 2. Tokenize the text and 3. remove stop words.
# Iterating over the Doc yields tokens; we keep those that are not stop words.
# Dropping punctuation with is_punct is an extra step beyond the exercise.
filtered_words = [token.text for token in exercise_doc if not token.is_stop and not token.is_punct]

# Print the final list of important words
print("Original Sentence: ", new_sentence)
print("Filtered Words: ", filtered_words)


Original Sentence:  The quick brown fox jumps over the lazy dog. This is a classic sentence used for typography.
Filtered Words:  ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog', 'classic', 'sentence', 'typography']
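
The solution above relies on spaCy's `token.is_punct`. NLTK tokens are plain strings, so if you chose NLTK for the exercise you need your own punctuation check. One possible sketch uses `string.punctuation` from the standard library; the helper name `is_punctuation` is hypothetical, and it only recognizes ASCII punctuation.

```python
import string

def is_punctuation(token):
    """Return True if every character of the token is ASCII punctuation."""
    return len(token) > 0 and all(ch in string.punctuation for ch in token)

# Output of the NLTK stop word cell earlier in this section.
nltk_filtered = ['NLP', 'fascinating', '!', "'s", 'field', 'AI', 'seen',
                 'rapid', 'growth', 'recent', 'years', '.']

words_only = [tok for tok in nltk_filtered if not is_punctuation(tok)]
print(words_only)
# ['NLP', 'fascinating', "'s", 'field', 'AI', 'seen', 'rapid', 'growth', 'recent', 'years']
```

The token `'s` survives because it contains a letter; whether to drop it as well depends on the task.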


## End of Section 1

Congratulations on completing the first section! You've learned how to perform two fundamental NLP tasks: tokenization and stop word removal.

**Next Steps:**
1. Save this notebook.
2. Commit your changes to Git with the message 'Complete Section 1'.
3. When you're ready, ask me to proceed to **Section 2: Text Preprocessing**.