# Lab Session: Basics of Natural Language Processing with Python
## 1. Lab Objectives
**By the end of this session, students will be able to:**
+ Work with strings (text) in Python.
+ Acquire and handle text in Python.
+ Clean text using string operations.
+ Remove stop words.
+ Perform word frequency analysis with a practical example.

## 2. Background
Before applying advanced NLP models, raw text must be preprocessed. Preprocessing provides:

+ **Consistency** (e.g., converting everything to lowercase).
+ **Noise reduction** (removing punctuation, numbers, and stop words).
+ **Useful insights** (e.g., the most frequent words become meaningful once the noise is removed).

## 3. Strings
In Python, a string is an immutable sequence of characters (letters, digits, symbols, whitespace).
Python provides many built-in string methods; each one returns a new value and leaves the original string unchanged.

**Example:**

text = "Natural Language Processing"
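
Because a string is a sequence, it can be looped over and sliced like a list, but it cannot be modified in place. The following is a minimal sketch of that behavior; the variable names `first_five` and `lowered` are illustrative only.

```python
text = "Natural Language Processing"

# Strings are sequences, so we can iterate over their characters
first_five = [ch for ch in text[:5]]
print(first_five)

# Strings are immutable: item assignment raises TypeError
try:
    text[0] = "n"
except TypeError as err:
    print("Cannot modify in place:", err)

# String methods return a new string; the original is unchanged
lowered = text.lower()
print(text)
print(lowered)
```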


### 3.1. Creating Strings

In [41]:
s1 = "Hello"
s2 = 'World'
s3 = """This is
a multi-line string."""
print(s1, s2, s3)

Hello World This is 
a multi-line string.


### 3.2. Basic Operations

In [43]:
text = "Python NLP"

print(len(text))       # Length of string
print(text[0])         # First character
print(text[-1])        # Last character
print(text[0:6])       # Slice (characters 0 to 5)
print(text + " Lab")   # Concatenation
print(text * 3)        # Repetition

10
P
P
Python
Python NLP Lab
Python NLPPython NLPPython NLP
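
Slicing also accepts negative indices and a step, and the `in` operator tests for a substring. The short sketch below extends the cell above with a few such operations; the variable names are illustrative.

```python
text = "Python NLP"

last_three = text[-3:]       # last three characters
every_other = text[::2]      # every second character, starting at index 0
reversed_text = text[::-1]   # reversed copy of the string
has_nlp = "NLP" in text      # substring membership test

print(last_three)
print(every_other)
print(reversed_text)
print(has_nlp)
```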


### 3.3. Changing Case

In [44]:
s = "Natural Language Processing"
print(s.lower())   # lowercase
print(s.upper())   # UPPERCASE
print(s.title())   # Title Case
print(s.capitalize()) # First letter capitalized

natural language processing
NATURAL LANGUAGE PROCESSING
Natural Language Processing
Natural language processing
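
For case-insensitive comparison, `lower()` is usually enough for English text, but `casefold()` is more aggressive and handles some non-English characters. A small sketch, using the German word "Straße" as an assumed example:

```python
a = "Straße"
b = "STRASSE"

# lower() keeps the German sharp s, so the strings still differ
same_with_lower = a.lower() == b.lower()

# casefold() maps the sharp s to "ss", so the strings match
same_with_casefold = a.casefold() == b.casefold()

print(same_with_lower)      # False
print(same_with_casefold)   # True
```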


### 3.4. Removing Unwanted Characters

In [45]:
s = "   NLP is fun!!!   "
print(s.strip())     # remove leading/trailing whitespace
print(s.rstrip("!")) # strip trailing "!" only; no effect here because s ends with spaces
print(s.lstrip())    # remove leading whitespace

NLP is fun!!!
   NLP is fun!!!   
NLP is fun!!!   
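
In the output above, `rstrip("!")` leaves the string unchanged because the string ends with spaces, not with `!`. The argument to `strip`, `lstrip`, and `rstrip` is treated as a set of characters to remove, so passing both characters works. A minimal sketch (`repr` is used only to make the spaces visible):

```python
s = "   NLP is fun!!!   "

unchanged = s.rstrip("!")      # trailing characters are spaces, so nothing is removed
right_clean = s.rstrip("! ")   # strip any mix of "!" and spaces from the right
both_clean = s.strip(" !")     # strip the same set from both ends

print(repr(unchanged))
print(repr(right_clean))
print(repr(both_clean))
```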


### 3.5. Searching and Replacing

In [46]:
s = "I love Python. Python is powerful."

print(s.find("Python"))     # first occurrence
print(s.rfind("Python"))    # last occurrence
print(s.count("Python"))    # count occurrences
print(s.replace("Python", "NLP"))  # replace

7
15
2
I love NLP. NLP is powerful.
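
A few details of these methods are easy to miss: `find` returns `-1` when the substring is absent, `index` raises an error instead, and `replace` accepts an optional count. The sketch below illustrates them on the same sentence; the search term "Java" is just an example of a missing word.

```python
s = "I love Python. Python is powerful."

missing = s.find("Java")                     # -1 means "not found"
has_python = "Python" in s                   # simple membership test
first_only = s.replace("Python", "NLP", 1)   # replace only the first occurrence

# index() behaves like find() but raises ValueError when nothing is found
try:
    s.index("Java")
    index_failed = False
except ValueError:
    index_failed = True

print(missing)
print(has_python)
print(first_only)
print(index_failed)
```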


### 3.6. Checking String Content

In [47]:
s = "Python3"

print(s.isalpha())   # only letters? False (because of 3)
print(s.isdigit())   # only digits? False
print("3".isdigit()) # only digits? True
print(s.isalnum())   # only letters and/or digits? True
print("hello".islower())  # all cased characters lowercase? True
print("HELLO".isupper())  # all cased characters uppercase? True
print(" ".isspace())      # only whitespace? True

False
False
True
True
True
True
True
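
These checks have a few edge cases worth knowing when cleaning real text. The sketch below shows three of them; the sample strings are chosen only for illustration.

```python
# isdigit() accepts digit-like characters such as superscripts; isdecimal() does not
superscript_is_digit = "²".isdigit()
superscript_is_decimal = "²".isdecimal()

# An empty string fails every is*() check
empty_is_alpha = "".isalpha()

# A single space is enough to make isalpha() fail
phrase_is_alpha = "hello world".isalpha()

print(superscript_is_digit, superscript_is_decimal)
print(empty_is_alpha, phrase_is_alpha)
```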


### 3.7. Splitting and Joining

In [48]:
s = "NLP makes language processing easy"

words = s.split()   # split into words
print(words)

joined = "-".join(words)  # join words with hyphen
print(joined)
joined = " ".join(words)  # join words with space
print(joined)

['NLP', 'makes', 'language', 'processing', 'easy']
NLP-makes-language-processing-easy
NLP makes language processing easy
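
`split()` also accepts an explicit separator and a maximum number of splits, and `splitlines()` breaks text into lines. A short sketch with made-up sample strings:

```python
# Split on an explicit separator
fields = "token,count,source".split(",")

# Limit the number of splits with maxsplit
key, value = "lang=en=us".split("=", 1)

# Split multi-line text into lines
lines = "first line\nsecond line".splitlines()

# split() with no argument collapses runs of whitespace; split(" ") does not
collapsed = "a  b".split()
not_collapsed = "a  b".split(" ")

print(fields)
print(key, value)
print(lines)
print(collapsed, not_collapsed)
```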


### 3.8. String Formatting

In [49]:
name = "Alice"
age = 25
print("My name is {} and I am {} years old.".format(name, age))
print(f"My name is {name} and I am {age} years old.")  # f-string (modern)

My name is Alice and I am 25 years old.
My name is Alice and I am 25 years old.
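
f-strings also support format specifiers for alignment, width, and number formatting, which is handy when printing frequency tables later in the lab. A minimal sketch; the values of `word`, `count`, and `total` are illustrative.

```python
word = "language"
count = 2
total = 14
share = count / total

two_decimals = f"{3.14159:.2f}"   # fixed number of decimal places
percent = f"{share:.1%}"          # format a fraction as a percentage

# Left-align the word in 12 columns, right-align the numbers
row = f"{word:<12}{count:>3}{share:>8.1%}"

print(two_decimals)
print(percent)
print(row)
```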


### 3.9. Practical Example: Cleaning a Sentence

In [50]:
raw = "   $$$ Welcome to NLP Lab!!! 2025 ###   "

# Cleaning step by step
clean = raw.strip()                          # remove leading/trailing whitespace
clean = clean.strip("$#")                    # remove leading/trailing $ and #
clean = clean.replace("!!!", "")             # remove the exact substring "!!!"
clean = ''.join(ch for ch in clean if ch.isalpha() or ch.isspace()) # keep only letters and whitespace
clean = clean.lower()                        # lowercase

print("Before:", raw)
print("After:", clean)

Before:    $$$ Welcome to NLP Lab!!! 2025 ###   
After:  welcome to nlp lab  
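
Note that the result above still contains stray spaces where symbols and digits were removed. One possible alternative, sketched here with the `re` module, replaces every non-letter with a space and then collapses whitespace. It assumes ASCII-only English text; accented letters would be dropped by this pattern.

```python
import re

raw = "   $$$ Welcome to NLP Lab!!! 2025 ###   "

# Replace anything that is not an ASCII letter or whitespace with a space
letters_only = re.sub(r"[^A-Za-z\s]", " ", raw)

# split() + join() collapses repeated whitespace and trims both ends
clean = " ".join(letters_only.split()).lower()

print(repr(clean))
```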


## 4. Lab Tasks

### Task 1: Acquiring/Downloading Text
We’ll use text from an online source (Project Gutenberg) or simply define text in Python.

In [52]:
# Option A: Simple text
text = """
Natural Language Processing (NLP) is a field of Artificial Intelligence
that enables machines to understand, interpret, and 2/3 generate human language.
"""
print(text)


Natural Language Processing (NLP) is a field of Artificial Intelligence
that enables machines to understand, interpret, and 2/3 generate human language.



In [53]:
# Option B: Downloading text (requires requests library)
import requests

url = "https://www.gutenberg.org/cache/epub/1342/pg1342.txt"  # Pride and Prejudice
response = requests.get(url, timeout=30)
response.raise_for_status()        # stop early on HTTP errors (e.g., 404)
novel_text = response.text[:1000]  # keep only the first 1000 characters
# text = response.text             # uncomment to use the full novel instead
print(novel_text)

﻿The Project Gutenberg eBook of Pride and Prejudice
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.

Title: Pride and Prejudice

Author: Jane Austen

Release date: June 1, 1998 [eBook #1342]
                Most recently updated: October 29, 2024

Language: English

Credits: Chuck Greif and the Online Distributed Proofreading Team at http://www.pgdp.net (This file was produced from images available at The Internet Archive)


*** START OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***
                            [Illustration:

                             GEORGE A
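
The downloaded file starts with a license header, so the novel itself begins only after the `*** START OF THE PROJECT GUTENBERG EBOOK ...` line seen in the output. The helper below is a rough sketch for trimming that boilerplate. The function name `strip_gutenberg_boilerplate` is hypothetical, the matching `*** END OF ...` marker is an assumption about the file's footer, and `sample` is a made-up stand-in so the example runs without a network connection.

```python
def strip_gutenberg_boilerplate(full_text):
    """Return the text between the START and END marker lines, if present."""
    start_marker = "*** START OF THE PROJECT GUTENBERG EBOOK"
    end_marker = "*** END OF THE PROJECT GUTENBERG EBOOK"  # assumed footer marker

    start = full_text.find(start_marker)
    if start != -1:
        # Skip to the end of the marker line
        newline = full_text.find("\n", start)
        start = newline + 1 if newline != -1 else len(full_text)
    else:
        start = 0  # no header marker found: keep the beginning

    end = full_text.find(end_marker, start)
    if end == -1:
        end = len(full_text)  # no footer marker found: keep the rest

    return full_text[start:end].strip()


# Made-up stand-in for response.text
sample = (
    "The Project Gutenberg eBook of Example\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Body of the book goes here.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License text\n"
)
body = strip_gutenberg_boilerplate(sample)
print(body)
```

With the real download, this could be called as `strip_gutenberg_boilerplate(response.text)` before any cleaning.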

### Task 2: Text Cleaning and String Functions

In [54]:
# Lowercasing
print(text)
clean_text = text.lower()
print(clean_text)


Natural Language Processing (NLP) is a field of Artificial Intelligence
that enables machines to understand, interpret, and 2/3 generate human language.


natural language processing (nlp) is a field of artificial intelligence
that enables machines to understand, interpret, and 2/3 generate human language.



In [55]:
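# Removing punctuation
import string, re
# Delete every ASCII punctuation character listed in string.punctuation
clean_text = clean_text.translate(str.maketrans("", "", string.punctuation))
# Also delete curly quotes and apostrophes, which string.punctuation does not cover
clean_text = re.sub(r'[“”’]','',clean_text)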
print("Cleaned Text:", clean_text)

Cleaned Text: 
natural language processing nlp is a field of artificial intelligence
that enables machines to understand interpret and 23 generate human language



In [56]:
# Removing numbers
clean_text = ''.join([ch for ch in clean_text if not ch.isdigit()])
print("Cleaned Text:", clean_text)


Cleaned Text: 
natural language processing nlp is a field of artificial intelligence
that enables machines to understand interpret and  generate human language
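
Deleting punctuation outright fuses "2/3" into "23" and, after the digits are removed, leaves a double space. One possible variation, sketched below, maps punctuation to spaces instead and then collapses whitespace. The trade-off is that hyphenated words such as "well-known" would be split into two tokens.

```python
import string

text = "understand, interpret, and 2/3 generate human language."

# Approach used above: delete punctuation (fuses "2/3" into "23")
delete_table = str.maketrans("", "", string.punctuation)
fused = text.translate(delete_table)

# Variation: map each punctuation character to a space
space_table = str.maketrans(string.punctuation, " " * len(string.punctuation))
spaced = text.translate(space_table)

# Remove digits, then collapse repeated whitespace
no_digits = "".join(ch for ch in spaced if not ch.isdigit())
normalized = " ".join(no_digits.split())

print(fused)
print(normalized)
```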



### Task 3: Removing Stop Words
We’ll use NLTK's English stop word list (common words such as "the", "is", "and") and drop every token that appears in it.

In [57]:
import nltk
nltk.download("stopwords")   # stop word lists
nltk.download("punkt_tab")   # tokenizer data required by word_tokenize


from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))

tokens = word_tokenize(clean_text)
filtered_words = [w for w in tokens if w not in stop_words]

print("Filtered Words:", filtered_words)
clean_text = " ".join(filtered_words)
print(clean_text)


Filtered Words: ['natural', 'language', 'processing', 'nlp', 'field', 'artificial', 'intelligence', 'enables', 'machines', 'understand', 'interpret', 'generate', 'human', 'language']
natural language processing nlp field artificial intelligence enables machines understand interpret generate human language


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
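
Stop word lists are just sets, so they can be adjusted for the task. For example, negations such as "not" are commonly treated as stop words, yet they matter for sentiment analysis. The sketch below uses a tiny hand-written stop word set (an assumption, so that it runs without NLTK downloads) to show how removing entries from the set changes the result.

```python
# Hypothetical hand-written stop word list, used only so the example runs offline
stop_words = {"the", "is", "not", "a", "this", "and", "no"}

# Keep negations by removing them from the stop word set
negations = {"not", "no"}
custom_stop_words = stop_words - negations

tokens = "this movie is not good".split()

default_filtered = [w for w in tokens if w not in stop_words]
custom_filtered = [w for w in tokens if w not in custom_stop_words]

print(default_filtered)
print(custom_filtered)
```

The same set difference could be applied to `set(stopwords.words("english"))` from the cell above.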


In [58]:
type(tokens)

list

### Task 4: Word Frequency Analysis (Practical Example)
We’ll count how often each remaining word occurs and list the most common ones.

**Practical Example:** If you run this analysis on the full text of *Pride and Prejudice* (Jane Austen), you’ll see that names like Elizabeth and Darcy appear very frequently. This shows how frequency analysis can highlight the main characters or themes of a book.

In [59]:
from collections import Counter

# Count word frequencies
word_freq = Counter(filtered_words)

# Display 10 most common words
print("Most Common Words:", word_freq.most_common(10))


Most Common Words: [('language', 2), ('natural', 1), ('processing', 1), ('nlp', 1), ('field', 1), ('artificial', 1), ('intelligence', 1), ('enables', 1), ('machines', 1), ('understand', 1)]
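
Raw counts are easier to compare across texts when expressed as relative frequencies. The sketch below reuses the filtered word list from Task 3 (copied in so the block runs on its own) and prints a small text table; the `#` bar column is purely illustrative.

```python
from collections import Counter

# Filtered words from Task 3, copied here so the example is self-contained
filtered_words = [
    "natural", "language", "processing", "nlp", "field", "artificial",
    "intelligence", "enables", "machines", "understand", "interpret",
    "generate", "human", "language",
]

word_freq = Counter(filtered_words)
total = sum(word_freq.values())   # total number of tokens

# Print word, count, share of all tokens, and a simple text bar
for word, count in word_freq.most_common(3):
    share = count / total
    print(f"{word:<12}{count:>3}{share:>8.1%}  {'#' * count}")
```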
