# NLP Basic Concepts
## Lexical Analysis
### Tokenization
Breaks raw text into smaller, meaningful units called tokens (words, subwords, or characters).

In [None]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /home/adam/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [23]:
text = "NLTK tokenizing is a crucial step in NLP. It is widely used."

sentences = sent_tokenize(text)
words = word_tokenize(text)

print("Sentences:", sentences)
print("Words:", words)

Sentences: ['NLTK tokenizing is a crucial step in NLP.', 'It is widely used.']
Words: ['NLTK', 'tokenizing', 'is', 'a', 'crucial', 'step', 'in', 'NLP', '.', 'It', 'is', 'widely', 'used', '.']
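
The description above also mentions character-level tokens, which `word_tokenize` does not produce. The following is a minimal standard-library sketch of word-level and character-level splitting. The helper names `regex_word_tokenize` and `char_tokenize` and the regular expression are illustrative assumptions, not NLTK's algorithm.

```python
import re

def regex_word_tokenize(text):
    # Runs of word characters, or any single character that is neither word nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokenize(text):
    # Every non-whitespace character becomes its own token
    return [ch for ch in text if not ch.isspace()]

print(regex_word_tokenize("It is widely used."))  # ['It', 'is', 'widely', 'used', '.']
print(char_tokenize("NLP."))                      # ['N', 'L', 'P', '.']
```

Subword tokenization (for example byte-pair encoding) needs a vocabulary learned from a corpus, so it is not sketched here.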


### Case folding
Converts all characters in a text to a single case (usually lowercase), so that variants such as "NLP" and "nlp" are treated as the same token.

In [25]:
print(text.lower())

nltk tokenizing is a crucial step in nlp. it is widely used.
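
For text that is not plain ASCII, Python's built-in `str.casefold()` is a more aggressive alternative to `str.lower()`. A small sketch of the difference, using German spelling variants chosen purely for illustration:

```python
variants = ["Straße", "STRASSE", "strasse"]

# lower() keeps the German sharp s, so the first variant still differs
lowered = [w.lower() for w in variants]
# casefold() maps "ß" to "ss", so all three variants compare equal
folded = [w.casefold() for w in variants]

print(lowered)  # ['straße', 'strasse', 'strasse']
print(folded)   # ['strasse', 'strasse', 'strasse']
```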


### Punctuation Removal
Removes punctuation characters so that only the words remain.

In [26]:
import string
teks = "Hello!!! Are you there??? :)"
print(''.join([char for char in teks if char not in string.punctuation]))

Hello Are you there 
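
The output above keeps a trailing space where the punctuation used to be. One possible variant, sketched here with a hypothetical helper `strip_punctuation`, uses `str.translate` to delete the characters in a single pass and then collapses the leftover whitespace. Note that `string.punctuation` only covers ASCII punctuation.

```python
import re
import string

def strip_punctuation(text):
    # Translation table that deletes every ASCII punctuation character
    table = str.maketrans("", "", string.punctuation)
    no_punct = text.translate(table)
    # Collapse runs of whitespace and trim both ends
    return re.sub(r"\s+", " ", no_punct).strip()

print(strip_punctuation("Hello!!! Are you there??? :)"))  # Hello Are you there
```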


### Stop word removal
Filters out common, low-information words (such as "the", "is", "a") to reduce noise.

In [27]:
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/adam/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
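words = word_tokenize("This is an example of stop word removal.")
# Build the stop word set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))
print([word for word in words if word.lower() not in stop_words])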

['example', 'stop', 'word', 'removal', '.']
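
The same filtering idea can be sketched without NLTK. The stop word set `STOP_WORDS` below is a tiny hand-written assumption for illustration, not NLTK's English list, and `remove_stop_words` is a hypothetical helper.

```python
# Illustrative stop word set (an assumption, far smaller than a real list)
STOP_WORDS = {"this", "is", "an", "of", "the", "a"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Compare in lowercase so that "This" matches "this"
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["This", "is", "an", "example", "of", "stop", "word", "removal", "."]
print(remove_stop_words(tokens))  # ['example', 'stop', 'word', 'removal', '.']
```

The period survives because punctuation is not a stop word. Removing it is a separate step.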


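### Abbreviations Handling
Removes the periods from abbreviations such as "U.S." so that later steps do not mistake them for sentence boundaries.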

In [29]:
import re

In [30]:
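text = "Dr Smith is an M.D from U.S."
# Strip the periods from each listed abbreviation when it ends with a period ("U.S." -> "US").
# "M.D" is left unchanged because it is not followed by a period in this input.
abbrev_cleaned = re.sub(r'\b(Dr|Mr|Ms|M\.D|U\.S)\.', lambda x: x.group(0).replace('.', ''), text)
print(abbrev_cleaned)
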
Dr Smith is an M.D from US


In [31]:
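sent = "Prof John lives in the U.K. and works at M.I.T."
# Remove the period after every single capital letter.
# This also drops the sentence-final period that follows "M.I.T".
fixed_sent = re.sub(r'\b([A-Z])\.', r'\1', sent)
print(fixed_sent)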

Prof John lives in the UK and works at MIT
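
The single-letter pattern above would also rewrite a name initial such as "J." in "J. Smith". A possible refinement, sketched with a hypothetical helper `collapse_acronyms`, only touches runs of two or more "capital letter plus period" pairs:

```python
import re

def collapse_acronyms(sentence):
    # Match two or more consecutive "capital letter + period" pairs, e.g. "U.S." or "M.I.T."
    pattern = r"\b(?:[A-Z]\.){2,}"
    # Remove the periods inside each match; single initials like "J." do not match
    return re.sub(pattern, lambda m: m.group(0).replace(".", ""), sentence)

print(collapse_acronyms("J. Smith studied at M.I.T. in the U.S. for years"))
# J. Smith studied at MIT in the US for years
```

Like the original pattern, this sketch still removes a period that doubles as the end of the sentence.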


### Stemming
Strips affixes (suffixes only, in the case of the Porter stemmer) to reduce words to a common "stem", which is not always a dictionary word.

In [32]:
from nltk.stem import PorterStemmer

In [33]:
stemmer = PorterStemmer()
words = ['running', 'runs', 'runner']
print([stemmer.stem(word) for word in words])

['run', 'run', 'runner']
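
To illustrate the idea of rule-based suffix stripping, here is a deliberately naive sketch. The helper `naive_stem` and its three-suffix rule list are assumptions made for this example. It is not the Porter algorithm, which applies many more rules and conditions.

```python
def naive_stem(word):
    # Hypothetical helper: tries a few English suffixes in order (NOT the Porter algorithm)
    for suffix in ("ing", "es", "s"):
        stem = word[: -len(suffix)]
        # Only strip when at least three characters remain
        if word.endswith(suffix) and len(stem) >= 3:
            # Undouble a trailing consonant left behind by "-ing" (runn -> run)
            if suffix == "ing" and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word

print([naive_stem(w) for w in ["running", "runs", "runner"]])  # ['run', 'run', 'runner']
```

A rule set this small over-stems easily: `naive_stem("this")` returns `"thi"`, which is why practical stemmers add conditions on the remaining stem.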


### Part-of-speech tagging
Assigns a grammatical category (such as noun, verb, or adjective) to each word in a text.

In [34]:
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /home/adam/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

In [35]:
tokens = word_tokenize("The quick brown fox jumps over the lazy dog.")
print(nltk.pos_tag(tokens))

[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]
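
Tagged output is typically consumed by filtering on the tag. The sketch below copies the tagger output shown above into a literal list so that it runs without NLTK, and assumes Penn Treebank conventions, where every noun tag starts with `NN`.

```python
from collections import Counter

# Tagger output copied from the cell above
tagged = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
          ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
          ('dog', 'NN'), ('.', '.')]

# Keep words whose tag starts with "NN" (NN, NNS, NNP, NNPS)
nouns = [word for word, tag in tagged if tag.startswith('NN')]
# Count how often each tag occurs
tag_counts = Counter(tag for _, tag in tagged)

print(nouns)             # ['brown', 'fox', 'dog']
print(tag_counts['NN'])  # 3
```

Here "brown" is tagged as a noun although it acts as an adjective in this sentence, a reminder that statistical taggers make mistakes.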
