# Introduction to Natural Language Processing (NLP)

## What is NLP?
Natural Language Processing (NLP) is a subfield of Artificial Intelligence (AI) that focuses on the interaction between computers and humans through natural language. The goal of NLP is to enable machines to understand, interpret, and respond to text and spoken language in a way that is both meaningful and useful.

---

## Applications of NLP
NLP has a wide range of applications across various industries, including:

- **Sentiment Analysis**: Determining the sentiment (positive, negative, neutral) in text.
- **Machine Translation**: Translating text from one language to another (e.g., Google Translate).
- **Chatbots and Virtual Assistants**: Developing conversational agents like Siri, Alexa, or ChatGPT.
- **Text Summarization**: Automatically generating summaries of large documents or articles.
- **Speech Recognition**: Converting spoken language into text (e.g., voice-to-text services).
- **Named Entity Recognition (NER)**: Identifying entities like names, dates, locations, etc., in a text.

---

## Key Concepts in NLP
- **Tokenization**: Breaking text into smaller units such as words or sentences.
- **Part-of-Speech Tagging**: Identifying the grammatical role of each word in a sentence.
- **Dependency Parsing**: Analyzing grammatical structure and relationships between words.
- **Word Embeddings**: Representing words as vectors in a continuous vector space (e.g., Word2Vec, GloVe).
- **Transformer Models**: State-of-the-art models like BERT, GPT, and T5 for advanced NLP tasks.

---

## Popular NLP Libraries and Frameworks
- **NLTK**: Natural Language Toolkit for basic NLP tasks.
- **spaCy**: Industrial-strength NLP library with efficient pipelines.
- **Hugging Face Transformers**: A library for state-of-the-art transformer models.
- **Gensim**: For topic modeling and similarity analysis.
- **Stanza** (formerly StanfordNLP): Suite of NLP tools from Stanford University.

---

## Challenges in NLP
- **Ambiguity**: Understanding context when words or phrases have multiple meanings.
- **Sarcasm and Humor**: Interpreting non-literal or complex language patterns.
- **Multilingual Processing**: Supporting a wide variety of languages with different structures.
- **Data Availability**: Ensuring diverse and unbiased datasets for training.

---

## Getting Started with NLP
To begin exploring NLP:
1. Learn basic Python programming.
2. Install libraries like `nltk`, `spacy`, or `transformers`.
3. Work on simple tasks like text preprocessing, tokenization, and sentiment analysis.
4. Experiment with pre-trained models from Hugging Face for advanced projects.

---

### Example: Tokenization with NLTK
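```python
import nltk
from nltk.tokenize import word_tokenize

# word_tokenize needs the Punkt tokenizer data; download it once
# (newer NLTK releases may ask for "punkt_tab" instead of "punkt").
nltk.download("punkt")

# Example text
text = "Natural Language Processing is fascinating!"
# Tokenize text into words and punctuation marks
tokens = word_tokenize(text)
print(tokens)
```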


In [1]:
%config Completer.use_jedi = False


# NLP Pipeline

**Data Collection** -> **Text Cleaning** -> **Pre-processing** -> **Feature Engineering** -> **Modeling** -> **Evaluation** -> **Deployment** -> **Monitoring and Model Update**


In [2]:
x = "Natural language processing (NLP) is a subfield of computer science and especially artificial intelligence. It is primarily concerned with providing computers with the ability to process data encoded in natural language and is thus closely related to information retrieval, knowledge representation and computational linguistics, a subfield of linguistics."

In [4]:
from nltk.tokenize import word_tokenize

In [5]:
w = word_tokenize(x)

In [6]:
print(w)

['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'subfield', 'of', 'computer', 'science', 'and', 'especially', 'artificial', 'intelligence', '.', 'It', 'is', 'primarily', 'concerned', 'with', 'providing', 'computers', 'with', 'the', 'ability', 'to', 'process', 'data', 'encoded', 'in', 'natural', 'language', 'and', 'is', 'thus', 'closely', 'related', 'to', 'information', 'retrieval', ',', 'knowledge', 'representation', 'and', 'computational', 'linguistics', ',', 'a', 'subfield', 'of', 'linguistics', '.']


In [8]:
from nltk import pos_tag 

In [9]:
p = pos_tag(w)

In [10]:
print(p)

[('Natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('(', '('), ('NLP', 'NNP'), (')', ')'), ('is', 'VBZ'), ('a', 'DT'), ('subfield', 'NN'), ('of', 'IN'), ('computer', 'NN'), ('science', 'NN'), ('and', 'CC'), ('especially', 'RB'), ('artificial', 'JJ'), ('intelligence', 'NN'), ('.', '.'), ('It', 'PRP'), ('is', 'VBZ'), ('primarily', 'RB'), ('concerned', 'VBN'), ('with', 'IN'), ('providing', 'VBG'), ('computers', 'NNS'), ('with', 'IN'), ('the', 'DT'), ('ability', 'NN'), ('to', 'TO'), ('process', 'VB'), ('data', 'NNS'), ('encoded', 'VBN'), ('in', 'IN'), ('natural', 'JJ'), ('language', 'NN'), ('and', 'CC'), ('is', 'VBZ'), ('thus', 'RB'), ('closely', 'RB'), ('related', 'JJ'), ('to', 'TO'), ('information', 'NN'), ('retrieval', 'NN'), (',', ','), ('knowledge', 'NN'), ('representation', 'NN'), ('and', 'CC'), ('computational', 'JJ'), ('linguistics', 'NNS'), (',', ','), ('a', 'DT'), ('subfield', 'NN'), ('of', 'IN'), ('linguistics', 'NNS'), ('.', '.')]


# Part of Speech
- `JJ` means adjective.
- `NN` means noun, singular.
- The remaining tags in the output above follow the same Penn Treebank tag set.
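
A small lookup table can make the tag abbreviations in the output above easier to read. The sketch below covers only the Penn Treebank tags that appear in that output; the names `TAG_MEANINGS` and `describe` are illustrative and are not part of NLTK.

```python
# Penn Treebank tags that appear in the pos_tag output above.
TAG_MEANINGS = {
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "VB": "verb, base form",
    "VBZ": "verb, 3rd person singular present",
    "VBG": "verb, gerund or present participle",
    "VBN": "verb, past participle",
    "RB": "adverb",
    "DT": "determiner",
    "IN": "preposition or subordinating conjunction",
    "CC": "coordinating conjunction",
    "PRP": "personal pronoun",
    "TO": "to",
}


def describe(tagged_tokens):
    """Pair each word with a readable description of its tag."""
    return [
        (word, TAG_MEANINGS.get(tag, "other or punctuation"))
        for word, tag in tagged_tokens
    ]


# A few (word, tag) pairs copied from the output above.
sample = [("Natural", "JJ"), ("language", "NN"), ("is", "VBZ"), ("(", "(")]
for word, meaning in describe(sample):
    print(f"{word}: {meaning}")
```

Tags missing from the table, such as the punctuation tags, fall back to a generic label rather than raising an error.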

# Text-Processing Techniques

- Tokenization
- Stop word removal
- N-grams
- Stemming
- Word sense disambiguation
- Count vectorization
- Lemmatization
- TF-IDF vectorization
- Hashing vectorization

# Tokenization

Tokenization is used in natural language processing to split paragraphs and sentences into smaller units that can be more easily assigned meaning.

Text is mainly tokenized at three levels: word, sentence, and character.
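
To make the three levels concrete, here is a minimal sketch that uses only the standard-library `re` module. The regular expressions are naive assumptions chosen for illustration; they are not what NLTK's `sent_tokenize` or `word_tokenize` do internally.

```python
import re

text = "NLP is fun. It powers chatbots!"

# Sentence level: split on whitespace that follows ., ! or ? (naive rule).
sentences = re.split(r"(?<=[.!?])\s+", text)

# Word level: runs of word characters, or single punctuation marks.
words = re.findall(r"\w+|[^\w\s]", text)

# Character level: every character of the first word becomes a token.
characters = list(words[0])

print(sentences)   # ['NLP is fun.', 'It powers chatbots!']
print(words)       # ['NLP', 'is', 'fun', '.', 'It', 'powers', 'chatbots', '!']
print(characters)  # ['N', 'L', 'P']
```

The sentence rule breaks on abbreviations such as "e.g.", which is one reason to prefer a trained tokenizer for real text.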

In [11]:
from nltk.tokenize import sent_tokenize, word_tokenize

In [12]:
sentence = sent_tokenize(x)

In [13]:
sentence

['Natural language processing (NLP) is a subfield of computer science and especially artificial intelligence.',
 'It is primarily concerned with providing computers with the ability to process data encoded in natural language and is thus closely related to information retrieval, knowledge representation and computational linguistics, a subfield of linguistics.']

In [14]:
# Print each sentence on its own line for readability
for i in sentence:
    print(i)
    print()

Natural language processing (NLP) is a subfield of computer science and especially artificial intelligence.

It is primarily concerned with providing computers with the ability to process data encoded in natural language and is thus closely related to information retrieval, knowledge representation and computational linguistics, a subfield of linguistics.



# Stop Word Removal

In [26]:
from nltk.corpus import stopwords
from string import punctuation

In [30]:
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [23]:
stop = stopwords.words("english")
stop

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [27]:
stop_word = list(punctuation) + stop

In [31]:
# w is the word-tokenized text from the earlier cell
for i in w:
    if i not in stop_word:
        print(i)

Natural
language
processing
NLP
subfield
computer
science
especially
artificial
intelligence
It
primarily
concerned
providing
computers
ability
process
data
encoded
natural
language
thus
closely
related
information
retrieval
knowledge
representation
computational
linguistics
subfield
linguistics
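
Note that "It" survives in the output above: the stop word list is lowercase and the membership test is case-sensitive. The sketch below shows one way to compare in lowercase. `STOP_WORDS` is a tiny hand-written stand-in for `stopwords.words("english")`, and `remove_stop_words` is a hypothetical helper name.

```python
from string import punctuation

# Tiny stand-in for stopwords.words("english"); the real list is much longer.
STOP_WORDS = {"it", "is", "a", "of", "and", "with", "the", "to", "in"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop punctuation and stop words, comparing in lowercase."""
    blocked = set(stop_words) | set(punctuation)
    return [tok for tok in tokens if tok.lower() not in blocked]


tokens = ["It", "is", "a", "subfield", "of", "linguistics", "."]
print(remove_stop_words(tokens))  # ['subfield', 'linguistics']
```

Using a set for the blocked items also keeps each membership test constant-time, which matters once the token list grows.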


# Stemming and Lemmatization

example:
1. changing -> chang
2. changed  -> chang
3. change   -> change

In [32]:
from nltk.stem import LancasterStemmer, RegexpStemmer, PorterStemmer, SnowballStemmer

In [33]:
l = LancasterStemmer()
r = RegexpStemmer('ing$')  # anchor the pattern so only a trailing "ing" is removed
p = PorterStemmer()
s = SnowballStemmer('english')

In [38]:
l.stem("changed")

'chang'

In [35]:
r.stem("changed")

'changed'

In [37]:
p.stem("changed")

'chang'

In [39]:
s.stem("changed")

'chang'
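
The three-line example at the top of this section can be reproduced with a toy suffix stripper. This is only a sketch of the idea behind rule-based stemming, written with the standard-library `re` module; it is not the Porter, Lancaster, or Snowball algorithm, and `strip_suffix` is a hypothetical helper name.

```python
import re


def strip_suffix(word, suffixes=("ing", "ed")):
    """Remove one trailing suffix from the word, if present."""
    pattern = r"(?:%s)$" % "|".join(map(re.escape, suffixes))
    return re.sub(pattern, "", word)


for word in ["changing", "changed", "change"]:
    print(word, "->", strip_suffix(word))
# changing -> chang
# changed -> chang
# change -> change
```

Real stemmers apply many more rules, so their output can differ from this toy version on other words.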

# Lemmatization

example:
1. Studying -> study
2. Studies  -> study
3. Study    -> study

In [40]:
from nltk.stem import WordNetLemmatizer

In [41]:
wl = WordNetLemmatizer()

In [43]:
wl.lemmatize("mice")

'mouse'

In [44]:
wl.lemmatize("heylo")

'heylo'
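
The two results above suggest the basic behaviour: a known word maps to its dictionary form, and an unknown word comes back unchanged. A minimal sketch of that lookup-with-fallback idea follows. `LEMMA_TABLE` is a hand-written table invented for this example, whereas `WordNetLemmatizer` consults the WordNet lexicon.

```python
# Hand-written lookup table; a real lemmatizer consults a lexicon such as WordNet.
LEMMA_TABLE = {
    "mice": "mouse",
    "studies": "study",
    "studying": "study",
}


def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMA_TABLE.get(word.lower(), word)


print(toy_lemmatize("mice"))      # mouse
print(toy_lemmatize("Studying"))  # study
print(toy_lemmatize("heylo"))     # heylo
```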

# N-Grams

An N-gram is a contiguous sequence of words, symbols, or tokens in a document. N-grams are used, for example, in auto-suggestions.

# Steps for Generating N-Grams

## What is an N-Gram?
An **N-Gram** is a contiguous sequence of `N` items (words, characters, etc.) from a given text or speech. N-Grams are commonly used in Natural Language Processing (NLP) for tasks such as text prediction, machine translation, and more.

---

## Steps to Generate N-Grams
1. **Input Text**: Start with a string of text.
2. **Tokenization**: Break the text into individual units (words or characters).
3. **Sliding Window**: Use a sliding window of size `N` over the tokens to extract groups of `N` consecutive elements.
4. **Output N-Grams**: Return the generated n-grams as a list.
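
The four steps above can be sketched in plain Python. This is a minimal illustration that assumes whitespace tokenization; `make_ngrams` is a hypothetical helper name, not an NLTK function.

```python
def make_ngrams(tokens, n):
    """Slide a window of size n over tokens and return the n-grams as tuples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # zip stops at the shortest slice, so only full windows are produced.
    return list(zip(*(tokens[i:] for i in range(n))))


text = "I love natural language processing"  # step 1: input text
tokens = text.split()                        # step 2: naive whitespace tokenization
print(make_ngrams(tokens, 2))                # steps 3 and 4 with N = 2
# [('I', 'love'), ('love', 'natural'), ('natural', 'language'), ('language', 'processing')]
print(make_ngrams(tokens, 3))                # the same steps with N = 3
```

If `n` is larger than the number of tokens, no full window exists and the function returns an empty list.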

---

## Example: Generating N-Grams
### Input
```text
"I love natural language processing."
```


In [45]:
example = "i am dhruv, i am software engineer"

In [46]:
word_tokens = word_tokenize(example)
word_tokens

['i', 'am', 'dhruv', ',', 'i', 'am', 'software', 'engineer']

In [47]:
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder
from nltk.util import ngrams

In [61]:
b = BigramCollocationFinder.from_words(word_tokens)
t = TrigramCollocationFinder.from_words(word_tokens)
n = ngrams(word_tokens, 2)

In [62]:
# b.ngram_fd
# t.ngram_fd
# n is a lazy iterator rather than a list, so loop over it to see the bigrams
for i in n:
    print(i)

('i', 'am')
('am', 'dhruv')
('dhruv', ',')
(',', 'i')
('i', 'am')
('am', 'software')
('software', 'engineer')


In [57]:
# b.ngram_fd.keys()
t.ngram_fd.keys()

dict_keys([('i', 'am', 'dhruv'), ('am', 'dhruv', ','), ('dhruv', ',', 'i'), (',', 'i', 'am'), ('i', 'am', 'software'), ('am', 'software', 'engineer')])
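
Bigram counts like these are the basis of simple auto-suggestion: given a word, propose the word that most often followed it. Below is a minimal sketch using only `collections` from the standard library and the token list shown earlier. `suggest` is a hypothetical helper, not an NLTK API.

```python
from collections import Counter, defaultdict

word_tokens = ["i", "am", "dhruv", ",", "i", "am", "software", "engineer"]

# Count how often each adjacent pair occurs.
bigram_counts = Counter(zip(word_tokens, word_tokens[1:]))
print(bigram_counts.most_common(1))  # [(('i', 'am'), 2)]

# Group follower counts by the first word of each bigram.
followers = defaultdict(Counter)
for (first, second), count in bigram_counts.items():
    followers[first][second] += count


def suggest(word):
    """Return the most frequent follower of word, or None if unseen."""
    if word not in followers:
        return None
    return followers[word].most_common(1)[0][0]


print(suggest("i"))         # am
print(suggest("engineer"))  # None
```

With a corpus this small most followers are tied at one occurrence, so the suggestion is only meaningful for "i".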

# Count Vectorization

In [74]:
lst = ["hi dhruv", "hi rni", "rni teach dhruv data science"]

In [65]:
import pandas as pd

In [75]:
df = pd.DataFrame({"name": lst})
df

|   | name |
|---|------|
| 0 | hi dhruv |
| 1 | hi rni |
| 2 | rni teach dhruv data science |


In [76]:
from sklearn.feature_extraction.text import CountVectorizer

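ModuleNotFoundError: No module named 'sklearn'

scikit-learn was not installed in this environment, so the import fails and the three cells that follow raise `NameError` in turn. Installing the package with `pip install scikit-learn` and rerunning the cells resolves this.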

In [68]:
cv = CountVectorizer()

NameError: name 'CountVectorizer' is not defined

In [72]:
new_data = cv.fit_transform(df["name"]).toarray()

NameError: name 'cv' is not defined

In [73]:
cv.vocabulary_

NameError: name 'cv' is not defined
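
Because the cells above did not run, here is a sketch of what count vectorization computes, using only the standard library. It assumes lowercasing, tokens of two or more word characters, and an alphabetically sorted vocabulary, which mirrors `CountVectorizer`'s documented defaults. It is not scikit-learn's implementation, and it returns plain lists rather than a sparse matrix.

```python
import re
from collections import Counter

docs = ["hi dhruv", "hi rni", "rni teach dhruv data science"]


def tokenize(doc):
    """Lowercase and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", doc.lower())


# Vocabulary: every distinct token, sorted, mapped to a column index.
all_terms = sorted({tok for doc in docs for tok in tokenize(doc)})
vocabulary = {term: idx for idx, term in enumerate(all_terms)}

# One row per document, one column per vocabulary term.
matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
# {'data': 0, 'dhruv': 1, 'hi': 2, 'rni': 3, 'science': 4, 'teach': 5}
for row in matrix:
    print(row)
# [0, 1, 1, 0, 0, 0]
# [0, 0, 1, 1, 0, 0]
# [1, 1, 0, 1, 1, 1]
```

Each row is a bag-of-words vector: word order is discarded and only per-document counts remain.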

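# Word Sense Disambiguation

Some words have more than one meaning. For example, "mouse" can mean a computer mouse or the animal (plural: mice). Word sense disambiguation picks the meaning that fits the context.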

In [77]:
from nltk.wsd import lesk

In [78]:
l = lesk(word_tokens, "NLP")

In [79]:
l

Synset('natural_language_processing.n.01')

In [81]:
l.definition()

'the branch of information science that deals with natural language information'
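
The idea behind Lesk can be sketched as a word-overlap count between the context and each candidate definition. The version below is a simplified illustration and not NLTK's `lesk`. The `SENSES` inventory and its glosses are invented for this example, whereas NLTK draws its definitions from WordNet.

```python
# Hypothetical two-sense inventory for "mouse"; real Lesk uses WordNet glosses.
SENSES = {
    "mouse.animal": "small rodent with a long tail that lives in fields and houses",
    "mouse.device": "hand operated device that moves the cursor on a computer screen",
}


def simple_lesk(context_tokens, senses=SENSES):
    """Pick the sense whose gloss shares the most words with the context."""
    context = {tok.lower() for tok in context_tokens}

    def overlap(sense):
        return len(context & set(senses[sense].split()))

    return max(senses, key=overlap)


print(simple_lesk("I clicked the mouse to move the cursor on my screen".split()))
# mouse.device
print(simple_lesk("The cat chased a mouse with a long tail".split()))
# mouse.animal
```

Counting raw overlap lets common words such as "the" and "a" influence the result, so practical variants usually remove stop words first.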