**Text Processing Methods**  ->  methods for transforming raw text data into a structured format that is usable for analysis, machine learning, and natural language processing tasks.

-> Involves a series of techniques to clean and manipulate text data and to extract meaningful insights and patterns from it.


**Methods**

* Word Frequency

* Collocation

* TF-IDF

* Text Summarization

* Text classification

* Keyword extraction

* Lemmatization and stemming

**Installations**


In [None]:
pip install nltk




In [1]:
pip install spacy




In [2]:
pip install gensim



**Tokenization**   ->
 splits text into individual words or sentences, allowing easier analysis.



In [32]:
import nltk
from nltk.tokenize import word_tokenize,  sent_tokenize

nltk.download('punkt')
text = 'Hello, how are you?'
tokens = word_tokenize(text)
print(tokens)
sentence = 'This is a simple sentence.'
tokens = word_tokenize(sentence)
print(tokens)

#using Spacy
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('Hello, how are you?')
tokens = [token.text for token in doc]
print(tokens)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['Hello', ',', 'how', 'are', 'you', '?']
['This', 'is', 'a', 'simple', 'sentence', '.']
['Hello', ',', 'how', 'are', 'you', '?']
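
The cell above imports `sent_tokenize` but only shows word-level splitting. As a rough illustration of what both kinds of splitting do, here is a regex-only sketch. It is not how NLTK or spaCy tokenize (they handle abbreviations, contractions, and many other cases), and the helper names `simple_sent_tokenize` and `simple_word_tokenize` are made up for this example.

```python
import re

def simple_sent_tokenize(text):
    # Split after ., ! or ? when followed by whitespace.
    # Naive: it would also split after an abbreviation such as "Dr."
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def simple_word_tokenize(text):
    # A token is a run of word characters, or a single punctuation character
    return re.findall(r"\w+|[^\w\s]", text)

text = "Hello, how are you? This is a simple sentence."
sentences = simple_sent_tokenize(text)
words = simple_word_tokenize("Hello, how are you?")
print(sentences)
print(words)
```

On these inputs the word-level result matches the token lists printed above; on messier text the library tokenizers will generally do better.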


**Word Frequency**  -> counts the occurrences of each word, which helps find the most common words in a text.

In [7]:
import nltk
from collections import Counter

# Download the 'punkt_tab' resource
nltk.download('punkt_tab')

# Tokenize the text into words
text = "Natural Language Processing (NLP) is a fascinating field that enables machines to understand human language. Text processing is an essential part of NLP tasks."
words = nltk.word_tokenize(text.lower())
word_counts = Counter(words)

print("Word Frequency:", word_counts)

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Word Frequency: Counter({'language': 2, 'processing': 2, 'nlp': 2, 'is': 2, '.': 2, 'natural': 1, '(': 1, ')': 1, 'a': 1, 'fascinating': 1, 'field': 1, 'that': 1, 'enables': 1, 'machines': 1, 'to': 1, 'understand': 1, 'human': 1, 'text': 1, 'an': 1, 'essential': 1, 'part': 1, 'of': 1, 'tasks': 1})
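
The counts above include punctuation tokens such as `(` and `.`. One possible refinement, sketched here with the standard library only, is to keep alphabetic tokens and ask `Counter` for just the top entries. The regex tokenization is an assumption made to keep the example self-contained; it is not what `nltk.word_tokenize` does.

```python
import re
from collections import Counter

text = ("Natural Language Processing (NLP) is a fascinating field that enables "
        "machines to understand human language. Text processing is an essential "
        "part of NLP tasks.")

# Keep alphabetic tokens only, so '(', ')' and '.' are never counted
words = re.findall(r"[a-z]+", text.lower())
word_counts = Counter(words)

# most_common(n) returns the n entries with the highest counts
top_words = word_counts.most_common(4)
print(top_words)
```

Function words such as "is" still rank highly, which is why frequency counts are often combined with stop word removal.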


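**Collocation**  ->
Collocations are word pairs that appear together more frequently than expected by chance, like "natural language" in NLP text. The example ranks bigrams by raw frequency; in a text this short every bigram occurs only once, so the ranking is a tie and punctuation pairs show up in the result.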

In [11]:
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words)
collocations = finder.nbest(bigram_measures.raw_freq, 10)
print("Collocations:", collocations)

Collocations: [('(', 'nlp'), (')', 'is'), ('.', 'text'), ('a', 'fascinating'), ('an', 'essential'), ('enables', 'machines'), ('essential', 'part'), ('fascinating', 'field'), ('field', 'that'), ('human', 'language')]
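
Raw frequency does not measure "more often than chance". A common alternative is pointwise mutual information (PMI), which compares how often a pair was observed with how often it would be expected if the two words were independent. The sketch below computes it by hand on a small made-up token list; the token list and the minimum count of 2 are assumptions for illustration.

```python
import math
from collections import Counter

# Toy token list, already lowercased and free of punctuation
tokens = ("natural language processing is fun and natural language "
          "models learn language from text").split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
n_tokens = len(tokens)
n_bigrams = len(tokens) - 1

def pmi(pair):
    # log2 of the observed pair probability divided by the probability
    # expected if the two words occurred independently of each other
    w1, w2 = pair
    p_pair = bigrams[pair] / n_bigrams
    p_w1 = unigrams[w1] / n_tokens
    p_w2 = unigrams[w2] / n_tokens
    return math.log2(p_pair / (p_w1 * p_w2))

# Only score pairs seen at least twice: PMI is unreliable for rare pairs
frequent = [pair for pair, count in bigrams.items() if count >= 2]
scored = sorted(frequent, key=pmi, reverse=True)
for pair in scored:
    print(pair, round(pmi(pair), 2))
```

NLTK offers the same ideas through `finder.apply_freq_filter(...)` and `BigramAssocMeasures.pmi`, which could replace `raw_freq` in the cell above.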


**Concordance**  ->
 displays each occurrence of a word in context, showing the surrounding text for quick analysis.

In [17]:
import nltk

# Download the Gutenberg corpus
nltk.download('gutenberg')

# Import the Gutenberg corpus
from nltk.corpus import gutenberg

# Access the desired file from the Gutenberg corpus.
text = gutenberg.raw('austen-emma.txt')

# Tokenize the text into words
words = nltk.word_tokenize(text)

# Create a Text object
text = nltk.Text(words)

# Use the concordance method
text.concordance('Happy', lines=10)

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


Displaying 10 of 123 matches:
d rich , with a comfortable home and happy disposition , seemed to unite some o
 well in Brunswick Square . It was a happy circumstance , and animated Mr. Wood
r about the wedding ; and I shall be happy to tell you , for we all behaved cha
 far as possible . And yet she was a happy woman , and a woman whom no one name
ery frequently able to collect ; and happy was she , for her father 's sake , i
icular pleasure in sending them away happy . The happiness of Miss Smith was qu
a good deal ; she had spent two very happy months with them , and now loved to 
as can well be ; and while she is so happy at Hartfield , I can not wish her to
Smith 's . '' Mr. Elton was only too happy . Harriet listened , and Emma drew i
. `` By all means . We shall be most happy to consider you as one of the party 
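
To make the idea concrete, here is a minimal keyword-in-context sketch in plain Python. It is not NLTK's implementation: the function name `kwic` and the fixed window measured in tokens are assumptions for illustration. Like the output above, matching ignores case.

```python
def kwic(tokens, keyword, width=4):
    """Return (left, match, right) context tuples for each occurrence of keyword."""
    keyword = keyword.lower()
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    return rows

tokens = "She was a happy woman and happy was she for her father".split()
rows = kwic(tokens, "Happy", width=3)
for left, match, right in rows:
    # Right-align the left context so the keyword lines up in one column
    print(f"{left:>20} [{match}] {right}")
```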


**Text Summarization** ->
 generates a shorter version of a text that preserves its important information. Here, we use the `summa` library's TextRank-based summarizer (Gensim's own summarization module was removed in Gensim 4.0).

In [22]:
#install summa package

!pip install summa


Collecting summa
  Downloading summa-1.2.0.tar.gz (54 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: summa
  Building wheel for summa (setup.py) ... done
  Created wheel for summa: filename=summa-1.2.0-py3-none-any.whl size=54387 sha256=7d8d614e47f90c5a7a846b0e65b8d3bd547d66bb24a9b04050548a608b145898
  Stored in directory: /root/.cache/pip/wheels/4a/ca/c5/4958614cfba88ed6ceb7cb5a849f9f89f9ac49971616bc919f
Successfully built summa
Installing collected packages: summa
Successfully installed summa-1.2.0


In [26]:
from summa import summarizer

text = "Natural Language Processing (NLP) is a fascinating field that enables machines to understand human language. Text processing is an essential part of NLP tasks."

summary = summarizer.summarize(text, ratio=0.9)
print("Summary:", summary)

Summary: Natural Language Processing (NLP) is a fascinating field that enables machines to understand human language.
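
For intuition about extractive summarization, the sketch below uses a much simpler technique than TextRank: it scores each sentence by the summed frequency of its words and keeps the top scorers. This is an illustrative stand-in, not what `summa` does internally, and the function name `summarize_by_frequency` is made up.

```python
import re
from collections import Counter

def summarize_by_frequency(text, n_sentences=1):
    # Naive sentence split on ., ! or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(sentence):
        # Sum of the document-wide frequencies of the sentence's words
        return sum(freq[w] for w in re.findall(r"[a-z]+", sentence.lower()))

    # Pick the top-scoring sentences, then restore their original order
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:n_sentences])
    return " ".join(sentences[i] for i in keep)

text = ("Natural Language Processing (NLP) is a fascinating field that enables "
        "machines to understand human language. Text processing is an essential "
        "part of NLP tasks.")
summary = summarize_by_frequency(text, n_sentences=1)
print("Summary:", summary)
```

Summing frequencies favors long sentences; dividing the score by sentence length is one common adjustment.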


**Text Classification**   ->
 involves labeling text data with predefined categories. Here, we train a simple Naive Bayes model using the scikit-learn library.

In [30]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Sample data
documents = ["This is a positive document.", "This is a negative document.", "Another positive document."]
labels = ["positive", "negative", "positive"]

#convert text to feature vectors
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)

#train classifier
classifier = MultinomialNB()
classifier.fit(X, labels)

#prediction
new_documents = ["This is a positive document.", "This is a neutral document."]
new_X = vectorizer.transform(new_documents)
predictions = classifier.predict(new_X)

print("Predictions:", predictions)




Predictions: ['positive' 'positive']
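
The "neutral" document is labeled positive because the model has never seen the word "neutral" and two of the three training documents are positive. The sketch below re-implements multinomial Naive Bayes with add-one smoothing in plain Python to show where that comes from. It is a simplified illustration rather than scikit-learn's code; the `tokenize`, `train`, and `predict` helpers are hypothetical, and the tokenizer mirrors `CountVectorizer`'s default of lowercased words with two or more characters.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(doc):
    # Lowercased words of 2+ characters
    return re.findall(r"\b\w\w+\b", doc.lower())

def train(documents, labels):
    class_docs = Counter(labels)        # documents per class (for the prior)
    word_counts = defaultdict(Counter)  # word counts per class
    for doc, label in zip(documents, labels):
        word_counts[label].update(tokenize(doc))
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_docs, word_counts, vocab

def predict(doc, model, alpha=1.0):
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        total_words = sum(word_counts[label].values())
        # Start from the log prior, then add one log likelihood per known word
        score = math.log(class_docs[label] / total_docs)
        for word in tokenize(doc):
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + alpha)
                                  / (total_words + alpha * len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

documents = ["This is a positive document.", "This is a negative document.",
             "Another positive document."]
labels = ["positive", "negative", "positive"]
model = train(documents, labels)

for doc in ["This is a positive document.", "This is a neutral document."]:
    print(doc, "->", predict(doc, model))
```

With only three training documents the class prior dominates, so results like this say little about real accuracy; a held-out test set is needed for that.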


**Stop Words Removal**  ->
Stop words are common words that often don't add significant meaning to text analysis.

In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
text = 'This is an example sentence with some stop words.'
tokens = word_tokenize(text)
filtered_tokens = [token for token in tokens if token.lower() not in stop_words]
print(filtered_tokens)

#using spacy
doc = nlp(text)
filtered_tokens = [token.text for token in doc if not token.is_stop]
print(filtered_tokens)


['example', 'sentence', 'stop', 'words', '.']
['example', 'sentence', 'stop', 'words', '.']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
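
Stop word lists are often extended with domain-specific terms. The sketch below shows the idea with a deliberately tiny, hypothetical stop list and a regex tokenizer (which also drops punctuation); in practice the NLTK or spaCy lists shown above would be the starting point.

```python
import re

# Hypothetical, deliberately tiny stop list for illustration only
BASE_STOP_WORDS = {"this", "is", "an", "a", "the", "with", "some", "of", "and"}

def remove_stop_words(text, extra_stop_words=()):
    # Merge the base list with any caller-supplied, domain-specific words
    stop_words = BASE_STOP_WORDS | set(extra_stop_words)
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stop_words]

text = "This is an example sentence with some stop words."
default_result = remove_stop_words(text)
custom_result = remove_stop_words(text, extra_stop_words={"example", "sentence"})
print(default_result)
print(custom_result)
```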


**Stemming**   ->
 reduces words to a base or root form by stripping suffixes; the result is not always a dictionary word.

In [33]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
word = 'running'
stemmed_word = stemmer.stem(word)
print(stemmed_word)


run
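
To show what "stripping suffixes" means mechanically, here is a deliberately naive suffix stripper. It is not the Porter algorithm, which applies a much longer sequence of context-sensitive rules; the suffix list and the `naive_stem` name are assumptions for illustration.

```python
SUFFIXES = ("ing", "ed", "ly", "es", "s")

def naive_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip when at least three characters would remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[:-len(suffix)]
            # Collapse a doubled final consonant: "runn" -> "run"
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

stems = {w: naive_stem(w) for w in ["running", "jumped", "cats", "quickly", "falling"]}
print(stems)
```

The last word shows the weakness of blind rules: "falling" becomes "fal" because the doubled-consonant rule fires where it should not. Errors of this kind are why stems are best treated as index keys rather than as readable words.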


spaCy doesn’t provide stemming directly, but it has lemmatization, which is generally more accurate.

**Lemmatization** ->
 converts words to their dictionary form or lemma.

In [None]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
word = 'running'
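# lemmatize() treats the word as a noun by default, so 'running' comes back unchanged;
# passing pos='v' would return 'run'
lemmatized_word = lemmatizer.lemmatize(word)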
print(lemmatized_word)

#using spacy
doc = nlp(word)
lemmatized_word = doc[0].lemma_
print(lemmatized_word)


[nltk_data] Downloading package wordnet to /root/nltk_data...


running
run
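
The two results differ because lemmatization depends on part of speech: NLTK's `lemmatize` assumes a noun unless told otherwise, while spaCy tags the word first. The toy lookup below illustrates that dependency. The table and the `toy_lemmatize` function are hypothetical; real lemmatizers use a full lexicon such as WordNet plus morphological rules.

```python
# Hypothetical toy lemma table keyed by (word, part of speech)
LEMMAS = {
    ("running", "v"): "run",
    ("ran", "v"): "run",
    ("better", "a"): "good",
    ("mice", "n"): "mouse",
}

def toy_lemmatize(word, pos="n"):
    # Fall back to the word itself when the (word, pos) pair is unknown
    return LEMMAS.get((word.lower(), pos), word)

as_noun = toy_lemmatize("running")           # treated as a noun: unchanged
as_verb = toy_lemmatize("running", pos="v")  # treated as a verb
print(as_noun, as_verb)
```

With NLTK, the equivalent of the second call is `lemmatizer.lemmatize('running', pos='v')`.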


**Text Normalization**  ->  includes tasks like lowercasing, removing punctuation, and expanding contractions.

In [40]:
#text normalization

import re

#lowercase text
text = 'Hello, How are you?'
lowercase_text = text.lower()
print(lowercase_text)

#remove punctuation
text = 'Hello, How are you?'
punctuation_removed_text = re.sub(r'[^\w\s]', '', text)
print(punctuation_removed_text)


hello, how are you?
Hello How are you
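
The cell above covers lowercasing and punctuation removal but not contraction expansion. A minimal sketch of all three steps together follows; the `CONTRACTIONS` table is a small hypothetical sample, and real text needs a far larger mapping or a dedicated library.

```python
import re

# Small, hypothetical mapping; real text needs a much larger table
CONTRACTIONS = {
    "don't": "do not",
    "can't": "cannot",
    "i'm": "i am",
    "it's": "it is",
    "you're": "you are",
}
CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b"
)

def normalize(text):
    text = text.lower().replace("\u2019", "'")  # unify curly apostrophes
    # Expand contractions first: stripping punctuation would delete the
    # apostrophes that the lookup relies on
    text = CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(1)], text)
    text = re.sub(r"[^\w\s]", "", text)         # remove punctuation
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace

normalized = normalize("Hello, I'm sure you're right; it's fine. Don't worry!")
print(normalized)
```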


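**Word Embeddings / Vectorization**  ->
 convert text into numerical vectors. TF-IDF weights each word by how frequent it is in a document and how rare it is across the corpus, while word embeddings aim to represent semantic meaning.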
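
To see where TF-IDF numbers come from, the sketch below computes them by hand using the formula that scikit-learn's `TfidfVectorizer` applies with its default settings: raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The regex tokenizer is an approximation of the vectorizer's default token pattern.

```python
import math
import re
from collections import Counter

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.']

# Lowercased words of 2+ characters
docs = [re.findall(r"\b\w\w+\b", d.lower()) for d in corpus]
vocab = sorted({w for doc in docs for w in doc})
n_docs = len(docs)

# Document frequency: in how many documents each word appears
df = Counter(w for doc in docs for w in set(doc))
# Smoothed inverse document frequency
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_row(doc):
    tf = Counter(doc)
    raw = [tf[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in raw))  # L2 norm of the row
    return [x / norm for x in raw]

matrix = [tfidf_row(doc) for doc in docs]
print(vocab)
for row in matrix:
    print([round(x, 4) for x in row])
```

Words that occur in every document ("is", "the", "this") get the minimum IDF of 1, so rarer words such as "first" or "second" carry the most weight in their rows.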

In [19]:
#using TF-IDF with scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ['This is the first document.', 'This document is the second document.', 'And this is the third one.']
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)
print(tfidf_matrix.toarray())

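#word embeddings (spaCy)
#note: en_core_web_sm ships without static word vectors, so token.vector falls back
#to context-sensitive tensors; en_core_web_md and en_core_web_lg include word vectors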
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('This is an example sentence.')
word_embeddings = [token.vector for token in doc]
print(word_embeddings)



[[0.         0.46941728 0.61722732 0.3645444  0.         0.
  0.3645444  0.         0.3645444 ]
 [0.         0.7284449  0.         0.28285122 0.         0.47890875
  0.28285122 0.         0.28285122]
 [0.49711994 0.         0.         0.29360705 0.49711994 0.
  0.29360705 0.49711994 0.29360705]]
[array([-0.33933002, -0.49662387,  0.12057114,  1.6358336 ,  0.60533595,
        0.29953307,  1.3407537 ,  0.19745567, -1.3807919 , -0.1795856 ,
        2.000516  ,  1.3099126 , -0.84563535, -0.30392313, -1.1996465 ,
       -1.0611061 , -0.11863522,  1.4365962 , -0.98706937,  1.056712  ,
       -1.088603  ,  0.00695902, -0.20875025, -1.1293292 , -0.3502592 ,
        0.09407187,  0.2085126 ,  0.9926027 , -0.25046033,  0.257219  ,
        0.08152232, -0.8692955 , -0.12662315, -1.1390253 , -0.4092511 ,
        0.725322  ,  0.14983839, -0.38918096,  0.2666754 ,  3.5760899 ,
       -0.7669859 , -0.06527622, -0.733409  ,  2.247148  , -1.3842859 ,
        1.1196162 , -0.4907301 ,  1.2082839 ,  1.14846