<h1>Natural Language Processing - NLP</h1>
<p>Natural language processing is a branch of AI that focuses on the interaction between computers and human languages.</p>
<p>It covers a computer's ability to understand, interpret, and generate human language in a useful way.</p>

<h3>Common applications:</h3>
<ul>
    <li>Text Classification: Spam detection, sentiment analysis</li>
    <li>Machine Translation</li>
    <li>Named Entity Recognition</li>
    <li>Speech Recognition</li>
    <li>Chatbots</li>
</ul>

<h2>Installing Packages</h2>
<p>This notebook uses NLTK and spaCy.</p>

In [1]:
! pip install nltk
! pip install spacy



<h2>Basic text processing techniques</h2>
<h3>Tokenization</h3>
<p>Tokenization is the process of breaking down text into individual units, such as words or sentences.</p>
<p>It is usually the first step of an NLP pipeline: later steps such as stop-word removal, stemming, and lemmatization operate on the resulting tokens.</p>

In [2]:
# Importing spacy and a method 'download' from spacy.cli
import spacy
from spacy.cli import download

In [3]:
# en_core_web_sm = English core from web small 
try:
    nlp = spacy.load('en_core_web_sm')
except OSError:
    print("Model not found")
    download('en_core_web_sm')
    nlp = spacy.load('en_core_web_sm')

In [4]:
text = "Natural language processing with python is fun. let's tokenize this sentence"

doc = nlp(text)

In [5]:
sentences = [sent.text for sent in doc.sents]
print(sentences)

['Natural language processing with python is fun.', "let's tokenize this sentence"]


In [6]:
# Collect the text of every token (words and punctuation)
words = [token.text for token in doc]
print(words)

['Natural', 'language', 'processing', 'with', 'python', 'is', 'fun', '.', 'let', "'s", 'tokenize', 'this', 'sentence']


In [7]:
print(text.split(" "))

['Natural', 'language', 'processing', 'with', 'python', 'is', 'fun.', "let's", 'tokenize', 'this', 'sentence']


In [8]:
print(text.split('.'))

['Natural language processing with python is fun', " let's tokenize this sentence"]
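<p>The two <code>split</code> calls above either leave punctuation attached to a word ('fun.') or discard it, which is why a dedicated tokenizer is preferable. As a rough illustration only, the sketch below separates words from punctuation with a regular expression. The helper name <code>simple_tokenize</code> and the pattern are assumptions made for this example. It is much cruder than spaCy's tokenizer: it keeps "let's" as one token instead of splitting it into "let" and "'s".</p>

```python
import re

# Hypothetical helper, not part of spaCy or NLTK.
# \w+(?:'\w+)? matches a word, optionally with an internal apostrophe ("let's");
# [^\w\s] matches any single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def simple_tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

text = "Natural language processing with python is fun. let's tokenize this sentence"
tokens = simple_tokenize(text)
print(tokens)
```

<p>Unlike <code>text.split(" ")</code>, this keeps the full stop as its own token.</p>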


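<h2>Stemming</h2>
<p>Stemming reduces a word to a root form by stripping suffixes with heuristic rules. The result is not always a dictionary word.</p>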

In [9]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()

words = ['running','run','runner','easily','fairly']

stems = [ps.stem(word) for word in words]
print('stems:',stems)

stems: ['run', 'run', 'runner', 'easili', 'fairli']
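<p>The output above shows typical stemmer behaviour: 'easili' and 'fairli' are not dictionary words, and 'runner' is left unchanged. To make the idea of suffix stripping concrete, here is a deliberately simplified sketch. The function <code>toy_stem</code> and its suffix list are hypothetical and are <b>not</b> the Porter algorithm, which applies several ordered rule phases with extra conditions.</p>

```python
# Toy suffix-stripping stemmer: an illustration of the idea only,
# NOT the Porter algorithm used by nltk.stem.PorterStemmer.
SUFFIXES = ("ing", "ly", "ed", "s")  # hypothetical rule list, checked in order

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["running", "run", "runner", "easily", "fairly", "jumped", "cats"]
toy_stems = [toy_stem(w) for w in words]
print(toy_stems)
```

<p>The toy version turns 'running' into 'runn', because it has no rule for doubled consonants. Handling cases like this is part of what the extra rules in a real stemmer are for.</p>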


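<h2>Lemmatization</h2>
<p>Lemmatization maps each word to its dictionary form (lemma), using vocabulary and part-of-speech information instead of suffix stripping alone.</p>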

In [10]:
import spacy
nlp = spacy.load('en_core_web_sm')

# Other inputs to try:
# text = "Natural Language Processing with Python is fun. Let's tokenize this sentence!"
# text = "He's crazy."
text = "they are crazy"
doc = nlp(text)
# token.lemma_ holds the dictionary (base) form of each token
lemmas = [token.lemma_ for token in doc]
print('Lemmas:', lemmas)


Lemmas: ['they', 'be', 'crazy']
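<p>The mapping of 'are' to 'be' cannot be produced by stripping a suffix. It needs knowledge of the vocabulary. A minimal sketch of that idea, assuming a tiny hand-written lookup table (<code>LEMMA_TABLE</code> and <code>toy_lemmatize</code> are hypothetical names, and this is not how spaCy is implemented internally):</p>

```python
# Hypothetical, hand-written lookup table for illustration.
# Real lemmatizers typically combine much larger tables with rules
# and part-of-speech information.
LEMMA_TABLE = {
    "are": "be",
    "is": "be",
    "was": "be",
    "ran": "run",
    "better": "good",
}

def toy_lemmatize(tokens):
    """Map each token to its dictionary form, leaving unknown tokens unchanged."""
    return [LEMMA_TABLE.get(token.lower(), token) for token in tokens]

toy_lemmas = toy_lemmatize("they are crazy".split())
print(toy_lemmas)
```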


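<h2>Text Classification</h2>
<p>This example classifies posts from the 20 Newsgroups dataset by topic, using bag-of-words counts and logistic regression. The same pipeline applies to sentiment analysis when the labels are sentiments instead of topics.</p>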

In [21]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics import accuracy_score

categories = [
    'rec.autos',
    'sci.electronics',
    'comp.graphics',
    'rec.sport.hockey',
    'talk.politics.guns',
    'talk.politics.mideast',
    'comp.os.ms-windows.misc',
    'comp.sys.ibm.pc.hardware',
    'misc.forsale',
    'sci.med',
]

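# Load the training subset for the selected categories
newsgroups = fetch_20newsgroups(subset='train', categories=categories)
x, y = newsgroups.data, newsgroups.target
target_names = newsgroups.target_names

# Convert each post to word counts, dropping English stop words.
# Note: the vectorizer is fitted before the split, so its vocabulary also
# includes words from the test rows; fit on the training rows only for a
# stricter evaluation.
vectorizer = CountVectorizer(stop_words='english')
x_vect = vectorizer.fit_transform(x)

# Hold out 20% of the posts for testing
x_train, x_test, y_train, y_test = train_test_split(x_vect, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')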

Accuracy: 0.91
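<p>The classifier above never sees raw text, only word counts. To show what a bag-of-words representation looks like, here is a minimal standard-library sketch. The two example documents and the helper <code>bag_of_words</code> are made up for illustration. <code>CountVectorizer</code> additionally lowercases the text, applies its own tokenization, removes the stop words it is given, and returns a sparse matrix.</p>

```python
from collections import Counter

docs = ["the car engine is loud", "the hockey game is tonight"]  # made-up examples

# Build a sorted vocabulary over all documents.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc):
    """Return the count of each vocabulary word in doc, in vocabulary order."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [bag_of_words(doc) for doc in docs]
print(vocabulary)
print(vectors)
```

<p>Each document becomes a vector with one position per vocabulary word, so word order is lost and only frequencies remain.</p>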


In [22]:
new_sentences = [input('Enter a sentence:')]
new_x_vect = vectorizer.transform(new_sentences)
predictions = model.predict(new_x_vect)

for sentence, prediction in zip(new_sentences,predictions):
    print(f'\nsentences:{sentence}\nPrediction Category: {target_names[prediction]}')


sentences:trump shot dead in his election ralley
Prediction Category: rec.sport.hockey


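<h2>Word embeddings</h2>
<p>Word embeddings represent words as dense numeric vectors, so that words with similar meanings get similar vectors. The <code>en_core_web_md</code> model ships with such vectors. The small model does not.</p>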

In [29]:
import spacy
from spacy.cli import download

# en_core_web_md = medium English model, which includes word vectors
try:
    nlp = spacy.load('en_core_web_md')
except OSError:
    print('Model not found. Downloading...')
    download('en_core_web_md')
    nlp = spacy.load('en_core_web_md')

doc1 = nlp('Spiderman is Peter Parker.')
doc2 = nlp('Peter Parker is Spiderman.')

# Doc.similarity compares the averaged token vectors, so word order is
# ignored and two sentences with the same tokens score 1.00.
similarity = doc1.similarity(doc2)
print(f"Similarity between the 2 sentences: {similarity:.2f}")

Similarity between the 2 sentences: 1.00
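<p>A score of 1.00 for two sentences with different word order follows from how the default document vector is built: spaCy averages the token vectors and compares the averages with cosine similarity. The sketch below reproduces that arithmetic with made-up 2-dimensional vectors. <code>toy_vectors</code>, <code>mean_vector</code> and <code>cosine</code> are hypothetical names for this example, and punctuation is left out for brevity.</p>

```python
import math

# Made-up 2-dimensional word vectors, for illustration only;
# en_core_web_md uses 300-dimensional vectors.
toy_vectors = {
    "spiderman": [1.0, 0.0],
    "is": [0.0, 1.0],
    "peter": [1.0, 1.0],
    "parker": [2.0, 1.0],
}

def mean_vector(tokens):
    """Average the vectors of the tokens, component by component."""
    vectors = [toy_vectors[t] for t in tokens]
    return [sum(component) / len(vectors) for component in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v1 = mean_vector(["spiderman", "is", "peter", "parker"])
v2 = mean_vector(["peter", "parker", "is", "spiderman"])
print(round(cosine(v1, v2), 2))
```

<p>Because the average does not depend on token order, both sentences map to the same vector and the cosine similarity is 1.0.</p>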
