# Experiment 1: Text Processing Using NLTK


## Objective

To preprocess raw text by applying tokenization, stopword removal, and stemming using Python and the Natural Language Toolkit (NLTK).

## Tools Used

- Python 3
- NLTK (Natural Language Toolkit)
- Jupyter Notebook

## Theory

Text preprocessing is a fundamental step in Natural Language Processing (NLP) that converts raw, unstructured text into a cleaner, more uniform sequence of tokens that downstream models can work with.
It reduces noise and vocabulary size (and therefore feature dimensionality) before higher-level NLP tasks such as text classification, sentiment analysis, and information retrieval.

The main preprocessing steps involved are:

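- **Tokenization**: Splitting text into sentences and into word-level tokens (words, numbers, and punctuation marks).
- **Stopword Removal**: Eliminating very frequent words (such as "is", "a", and "to") that carry little semantic meaning on their own.
- **Stemming**: Reducing words to a common stem by stripping suffixes with rule-based methods. The stem need not be a dictionary word; for example, the Porter stemmer maps "language" to "languag".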
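
The three steps can be sketched end to end in plain Python to make the data flow concrete. This is only an illustration under simplifying assumptions: `TOY_STOPWORDS`, `TOY_SUFFIXES`, and every `toy_*` function are hypothetical stand-ins invented for this example, not NLTK's tokenizer, stopword list, or Porter stemmer.

```python
import re

# Hypothetical, hand-picked subset of English stopwords (not NLTK's list).
TOY_STOPWORDS = {"a", "an", "and", "is", "it", "of", "the", "to", "with"}

# Naive suffix rules for illustration only; a real stemmer such as Porter's
# applies ordered rules with extra conditions on the remaining stem.
TOY_SUFFIXES = ("ing", "es", "s")


def toy_tokenize(text):
    """Lowercase the text and keep runs of letters/digits, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def toy_remove_stopwords(tokens):
    """Keep only tokens that are not in the toy stopword set."""
    return [t for t in tokens if t not in TOY_STOPWORDS]


def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in TOY_SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def toy_preprocess(text):
    """Tokenize, remove stopwords, then stem: the order used in this experiment."""
    tokens = toy_tokenize(text)
    content_tokens = toy_remove_stopwords(tokens)
    return [toy_stem(t) for t in content_tokens]


print(toy_preprocess("It is building programs to work with the libraries."))
```

The order matters: stopwords are removed before stemming so that the stopword lookup sees the original word forms rather than truncated stems.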

## Code

### Step 1: Import Libraries and Download Required Resources

In [1]:
import nltk
nltk.download('punkt_tab')
nltk.download('stopwords')

from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

[nltk_data] Downloading package punkt_tab to /home/div/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /home/div/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Step 2: Input Text and Convert to Lowercase

In [2]:
og_text = "NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum"

# Normalise to lowercase: NLTK's English stopword list is all lowercase, so "It" would otherwise not match "it"
lower_text = og_text.lower()
print(f"original:{og_text} \nlowered:{lower_text}")

original:NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum 
lowered:nltk is a leading platform for building python programs to work with human language data. it provides easy-to-use interfaces to over 50 corpora and lexical resources such as wordnet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength nlp libraries, and an active discussion forum


### Step 3: Sentence and Word Tokenization

In [3]:
words = word_tokenize(lower_text)
sentences = sent_tokenize(lower_text)
print(f"tokenized words:{words}")
print(f"tokenized sentences:{sentences}")

tokenized words:['nltk', 'is', 'a', 'leading', 'platform', 'for', 'building', 'python', 'programs', 'to', 'work', 'with', 'human', 'language', 'data', '.', 'it', 'provides', 'easy-to-use', 'interfaces', 'to', 'over', '50', 'corpora', 'and', 'lexical', 'resources', 'such', 'as', 'wordnet', ',', 'along', 'with', 'a', 'suite', 'of', 'text', 'processing', 'libraries', 'for', 'classification', ',', 'tokenization', ',', 'stemming', ',', 'tagging', ',', 'parsing', ',', 'and', 'semantic', 'reasoning', ',', 'wrappers', 'for', 'industrial-strength', 'nlp', 'libraries', ',', 'and', 'an', 'active', 'discussion', 'forum']
tokenized sentences:['nltk is a leading platform for building python programs to work with human language data.', 'it provides easy-to-use interfaces to over 50 corpora and lexical resources such as wordnet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength nlp libr
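
In the word-level output above, hyphenated words such as 'easy-to-use' stay in one piece while punctuation marks become separate tokens. A rough regular-expression approximation of that behaviour is sketched below. `TOKEN_PATTERN` and `simple_word_tokenize` are hypothetical names for this example, and the pattern is an assumption for illustration; it is not how `word_tokenize` is implemented and will differ from it on contractions, abbreviations, and other edge cases.

```python
import re

# Assumed pattern: a word (optionally with internal hyphens) OR a single
# non-space, non-word character such as '.' or ','.
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")


def simple_word_tokenize(text):
    """Return word and punctuation tokens in their order of appearance."""
    return TOKEN_PATTERN.findall(text)


sample = "it provides easy-to-use interfaces to over 50 corpora."
print(simple_word_tokenize(sample))
```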

### Step 4: Stopword Removal

In [4]:
stp_words = set(stopwords.words('english'))
clean_words = [word for word in words if word not in stp_words]
print(f"with stop-words:{words}")
print(f"without stop-words:{clean_words}")

with stop-words:['nltk', 'is', 'a', 'leading', 'platform', 'for', 'building', 'python', 'programs', 'to', 'work', 'with', 'human', 'language', 'data', '.', 'it', 'provides', 'easy-to-use', 'interfaces', 'to', 'over', '50', 'corpora', 'and', 'lexical', 'resources', 'such', 'as', 'wordnet', ',', 'along', 'with', 'a', 'suite', 'of', 'text', 'processing', 'libraries', 'for', 'classification', ',', 'tokenization', ',', 'stemming', ',', 'tagging', ',', 'parsing', ',', 'and', 'semantic', 'reasoning', ',', 'wrappers', 'for', 'industrial-strength', 'nlp', 'libraries', ',', 'and', 'an', 'active', 'discussion', 'forum']
without stop-words:['nltk', 'leading', 'platform', 'building', 'python', 'programs', 'work', 'human', 'language', 'data', '.', 'provides', 'easy-to-use', 'interfaces', '50', 'corpora', 'lexical', 'resources', 'wordnet', ',', 'along', 'suite', 'text', 'processing', 'libraries', 'classification', ',', 'tokenization', ',', 'stemming', ',', 'tagging', ',', 'parsing', ',', 'semantic', 
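
Note that punctuation tokens such as '.' and ',' survive the filter above, because they are not in the stopword list. If they are not wanted, one option is to drop them in the same pass. The sketch below assumes a small stand-in set, `SAMPLE_STOPWORDS`, in place of `set(stopwords.words('english'))` so that it runs without any downloads; `drop_stopwords_and_punctuation` is a hypothetical helper name.

```python
import string

# Assumption: a tiny stand-in for set(stopwords.words('english')).
SAMPLE_STOPWORDS = {"is", "a", "for", "to", "with", "it", "and"}


def drop_stopwords_and_punctuation(tokens, stop_words):
    """Remove stopwords and single-character punctuation tokens."""
    punctuation = set(string.punctuation)
    return [t for t in tokens if t not in stop_words and t not in punctuation]


tokens = ['nltk', 'is', 'a', 'leading', 'platform', '.', 'it', 'provides',
          'easy-to-use', 'interfaces', ',', 'and', 'wrappers']
print(drop_stopwords_and_punctuation(tokens, SAMPLE_STOPWORDS))
```

Because the punctuation check compares whole tokens against single characters, a hyphenated token like 'easy-to-use' is kept.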

### Step 5: Stemming

In [5]:
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in clean_words]
print(f"without stemming:{clean_words}")
print(f"with stemming:{stemmed_words}")

without stemming:['nltk', 'leading', 'platform', 'building', 'python', 'programs', 'work', 'human', 'language', 'data', '.', 'provides', 'easy-to-use', 'interfaces', '50', 'corpora', 'lexical', 'resources', 'wordnet', ',', 'along', 'suite', 'text', 'processing', 'libraries', 'classification', ',', 'tokenization', ',', 'stemming', ',', 'tagging', ',', 'parsing', ',', 'semantic', 'reasoning', ',', 'wrappers', 'industrial-strength', 'nlp', 'libraries', ',', 'active', 'discussion', 'forum']
with stemming:['nltk', 'lead', 'platform', 'build', 'python', 'program', 'work', 'human', 'languag', 'data', '.', 'provid', 'easy-to-us', 'interfac', '50', 'corpora', 'lexic', 'resourc', 'wordnet', ',', 'along', 'suit', 'text', 'process', 'librari', 'classif', ',', 'token', ',', 'stem', ',', 'tag', ',', 'pars', ',', 'semant', 'reason', ',', 'wrapper', 'industrial-strength', 'nlp', 'librari', ',', 'activ', 'discuss', 'forum']
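
One practical use of the stemmed tokens is frequency counting, since different surface forms that share a stem are counted together. The sketch below takes the tail of the stemmed output above (from 'suit' to 'forum') as a literal list, so it runs without NLTK; `tail_stems` and `stem_counts` are names chosen for this example.

```python
from collections import Counter
import string

# Tail of the stemmed output printed above, copied as a literal list.
tail_stems = ['suit', 'text', 'process', 'librari', 'classif', ',', 'token', ',',
              'stem', ',', 'tag', ',', 'pars', ',', 'semant', 'reason', ',',
              'wrapper', 'industrial-strength', 'nlp', 'librari', ',', 'activ',
              'discuss', 'forum']

# Ignore punctuation tokens so that ',' does not dominate the counts.
punctuation = set(string.punctuation)
stem_counts = Counter(s for s in tail_stems if s not in punctuation)

print(stem_counts.most_common(1))
```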
