__Feature Extraction__

In natural language processing (NLP), feature extraction is the process of converting raw text data into a set of numerical features that can be used as input for machine learning models. These features capture different aspects of the text such as its vocabulary, syntax, semantics, and context. Feature extraction is a crucial step in NLP tasks such as text classification, sentiment analysis, and information retrieval.


__Feature extraction via frequency__

Feature extraction via frequency is a common technique in Natural Language Processing (NLP) for selecting features from a corpus of text. The frequency of each word in the corpus is computed, and the most frequent words are kept as features for the analysis. Because raw counts tend to be dominated by function words such as "the" and by punctuation, tokens are usually normalized and filtered before counting.

In [None]:
import nltk
from nltk.corpus import brown

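# Download the Brown Corpus (skipped if it is already present)
nltk.download('brown')

# Lowercase the tokens and keep only purely alphabetic ones,
# so that punctuation and numbers do not crowd out actual words
corpus = [word.lower() for word in brown.words() if word.isalpha()]

# Create a frequency distribution of words in the corpus
freq_dist = nltk.FreqDist(corpus)

# Select the 100 most frequent words as features
num_features = 100
most_common_words = [word for word, _ in freq_dist.most_common(num_features)]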

print(most_common_words)
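
Once a vocabulary of frequent words has been selected, each document can be represented as a vector of counts over that vocabulary. The following is a minimal sketch of that step using only the standard library and a small made-up corpus. The names `toy_docs`, `build_vocabulary`, and `vectorize` are illustrative assumptions and are not part of NLTK.

```python
from collections import Counter

# Hypothetical toy corpus, already tokenized (stands in for real documents)
toy_docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["the", "cat", "chased", "the", "dog"],
]

def build_vocabulary(docs, num_features):
    """Return the num_features most frequent words across all documents."""
    counts = Counter(word for doc in docs for word in doc)
    return [word for word, _ in counts.most_common(num_features)]

def vectorize(doc, vocabulary):
    """Map a tokenized document to a list of counts, one per vocabulary word."""
    counts = Counter(doc)
    return [counts[word] for word in vocabulary]

vocabulary = build_vocabulary(toy_docs, num_features=4)
vectors = [vectorize(doc, vocabulary) for doc in toy_docs]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Words outside the vocabulary are simply ignored by `vectorize`, which is the usual trade-off of a fixed-size frequency-based feature set.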


__Feature extraction via document frequency__

Feature extraction via document frequency is a technique used in natural language processing (NLP) to identify the most significant words in a corpus of text. The basic idea is to count the number of documents in which each word appears and use this count to rank the words by their importance. Words that appear in many documents are likely to be common and less informative, while words that appear in fewer documents are likely to be more specific and informative.

In [None]:
from nltk.corpus import brown
from nltk.probability import FreqDist
from nltk.util import ngrams

# count, for each word and bigram, the number of Brown corpus files (documents) that contain it
word_doc_freq = FreqDist()
bigram_doc_freq = FreqDist()
for fileid in brown.fileids():
    tokens = [word.lower() for word in brown.words(fileid)]
    # sets ensure each item is counted at most once per document
    word_doc_freq.update(set(tokens))
    bigram_doc_freq.update(set(ngrams(tokens, 2)))

# extract the 10 words that appear in the most documents
most_common_words = word_doc_freq.most_common(10)
print('Highest document frequency (words):', most_common_words)

# extract the 10 bigrams that appear in the most documents
most_common_bigrams = bigram_doc_freq.most_common(10)
print('Highest document frequency (bigrams):', most_common_bigrams)
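
To make the difference between document frequency and raw term frequency concrete, here is a minimal sketch on a small made-up corpus, using only the standard library. The names `toy_docs`, `document_frequency`, and `term_frequency` are illustrative assumptions rather than NLTK functions.

```python
from collections import Counter

# Hypothetical toy corpus: each inner list is one tokenized document
toy_docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["the", "cat", "chased", "the", "dog"],
]

def document_frequency(docs):
    """Count, for each word, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        # set() ensures a word is counted once per document,
        # no matter how often it repeats inside that document
        df.update(set(doc))
    return df

def term_frequency(docs):
    """Count every occurrence of each word across all documents."""
    return Counter(word for doc in docs for word in doc)

df = document_frequency(toy_docs)
tf = term_frequency(toy_docs)

# List words from highest to lowest document frequency
for word in sorted(df, key=lambda w: (-df[w], w)):
    print(f"{word}: in {df[word]} of {len(toy_docs)} documents, {tf[word]} occurrences")
```

A word's document frequency can never exceed the number of documents, whereas its term frequency grows with every repetition.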


__Inverse document frequency__

Inverse Document Frequency (IDF) is one of the most commonly used weighting schemes for feature extraction in Natural Language Processing (NLP).

IDF measures how informative a word is across the whole corpus. It is computed from the number of documents that contain the word: idf(t) = log(N / df(t)), where N is the number of documents in the corpus and df(t) is the number of documents containing the word t. The more documents a word appears in, the lower its IDF score. Multiplying a word's frequency in a document by its IDF gives the TF-IDF weight, which measures how relevant the word is to that specific document.
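
The relationship between document frequency and IDF can be worked through by hand. The sketch below uses only the standard library and a small made-up corpus; `toy_docs` and `inverse_document_frequency` are illustrative names, not library functions. The optional smoothed variant follows the formula that scikit-learn documents as the default for `TfidfVectorizer` (`smooth_idf=True`); other libraries may use different variants.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each inner list is one tokenized document
toy_docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["the", "cat", "chased", "the", "dog"],
]

def inverse_document_frequency(docs, smooth=False):
    """Return a dict mapping each word to its IDF score.

    smooth=False: idf(t) = ln(N / df(t))
    smooth=True:  idf(t) = ln((1 + N) / (1 + df(t))) + 1
    """
    num_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each word once per document
    if smooth:
        return {w: math.log((1 + num_docs) / (1 + n)) + 1 for w, n in df.items()}
    return {w: math.log(num_docs / n) for w, n in df.items()}

idf = inverse_document_frequency(toy_docs)

# Print words from least to most informative
for word, score in sorted(idf.items(), key=lambda item: (item[1], item[0])):
    print(f"{word}: {score:.3f}")
```

With the unsmoothed formula, a word that occurs in every document gets an IDF of zero, so it contributes nothing to a TF-IDF weight. The smoothed variant keeps such words at a small positive weight instead.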

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# sample corpus
documents = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# initialize TfidfVectorizer
tfidf = TfidfVectorizer()

# fit and transform documents
tfidf_matrix = tfidf.fit_transform(documents)

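# get feature names (get_feature_names() was removed in scikit-learn 1.2)
feature_names = tfidf.get_feature_names_out()

# print each feature with its IDF score, rarest terms first
# (tfidf_matrix holds the per-document TF-IDF weights; the IDF scores live in tfidf.idf_)
idf_scores = pd.Series(tfidf.idf_, index=feature_names).sort_values(ascending=False)
print(idf_scores)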
