
# Converting Text to Features Using TF-IDF with N-grams (Bigrams)

**TF-IDF:** TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic used in text mining and natural language processing (NLP) to measure how important a term is to a document within a collection (corpus) of documents. It is commonly used to turn text into features for machine learning. It improves on raw word counts by down-weighting terms that appear in many documents, so common but uninformative words contribute less than terms that are distinctive to a few documents.
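To make the weighting concrete, here is a minimal sketch of the calculation in plain Python. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` and L2 row normalization, which are the `TfidfVectorizer` defaults (`smooth_idf=True`, `norm='l2'`). The helper names `smoothed_idf` and `tfidf_weights`, and the toy counts, are hypothetical and only for illustration.

```python
import math


def smoothed_idf(n_docs, doc_freq):
    """IDF with add-one smoothing: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0


def tfidf_weights(term_counts, doc_freqs, n_docs):
    """Multiply each raw count by its IDF, then L2-normalize the document."""
    raw = {t: c * smoothed_idf(n_docs, doc_freqs[t]) for t, c in term_counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}


# Hypothetical document in a 4-document corpus: "the" occurs in all
# 4 documents, "cat" in only 1. Both occur once in this document.
weights = tfidf_weights({"the": 1, "cat": 1}, {"the": 4, "cat": 1}, n_docs=4)
print(weights)
```

Both terms have the same count here, so the rarer term "cat" ends up with the larger weight purely because of its higher IDF.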

**N-grams:** N-grams are contiguous sequences of n items (usually words or characters) from a given sample of text or speech. In NLP, n-grams capture local word order, which a plain bag-of-words representation discards, and so preserve some of the context in which each word appears.
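As an illustration, word n-grams can be produced with a sliding window over the tokens. The sketch below is a simplified stand-in for what a vectorizer does internally: the function name `word_ngrams` is hypothetical, and the tokenization (lowercasing, then keeping runs of two or more word characters) is an assumption chosen to mirror the default `token_pattern` of `TfidfVectorizer`.

```python
import re


def word_ngrams(text, n):
    """Return the contiguous word n-grams of `text`, in order of appearance."""
    # Lowercase, then keep tokens of at least two word characters
    # (punctuation and single-character tokens are dropped).
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    # Slide a window of width n across the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = word_ngrams("This is the first document.", 2)
print(bigrams)  # ['this is', 'is the', 'the first', 'first document']
```

A text with fewer than n tokens yields an empty list, since no full window fits.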

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample text data
corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?"
]

# Create a TF-IDF vectorizer with bigrams
vectorizer = TfidfVectorizer(ngram_range=(2, 2))

# Fit and transform the corpus
X = vectorizer.fit_transform(corpus)

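# Get the feature names (bigrams), ordered by column index
feature_names = vectorizer.get_feature_names_out()

# Print the dense TF-IDF matrix: one row per document, one column per bigram
print(X.toarray())

# Print the mapping from each bigram to its column index
print(vectorizer.vocabulary_)
# Print the learned IDF weight of each column
print(vectorizer.idf_)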

[[0.         0.         0.52303503 0.42344193 0.         0.
  0.52303503 0.         0.         0.         0.         0.52303503
  0.        ]
 [0.         0.47633035 0.         0.30403549 0.         0.47633035
  0.         0.47633035 0.         0.         0.47633035 0.
  0.        ]
 [0.49819711 0.         0.         0.31799276 0.         0.
  0.         0.         0.49819711 0.49819711 0.         0.39278432
  0.        ]
 [0.         0.         0.43779123 0.         0.55528266 0.
  0.43779123 0.         0.         0.         0.         0.
  0.55528266]]
{'this is': 11, 'is the': 3, 'the first': 6, 'first document': 2, 'this document': 10, 'document is': 1, 'the second': 7, 'second document': 5, 'and this': 0, 'the third': 8, 'third one': 9, 'is this': 4, 'this the': 12}
[1.91629073 1.91629073 1.51082562 1.22314355 1.91629073 1.91629073
 1.51082562 1.91629073 1.91629073 1.91629073 1.91629073 1.51082562
 1.91629073]
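The matrix is easier to read once each column is matched to its bigram. The sketch below does this for the first document using only values copied from the printed output above; the variable names `row0` and `nonzero` are introduced here for illustration. In the notebook itself, `feature_names` already holds the bigrams in column order, so the sorting step would not be needed.

```python
# Vocabulary and first matrix row, copied from the printed output above.
vocabulary = {
    'this is': 11, 'is the': 3, 'the first': 6, 'first document': 2,
    'this document': 10, 'document is': 1, 'the second': 7,
    'second document': 5, 'and this': 0, 'the third': 8, 'third one': 9,
    'is this': 4, 'this the': 12,
}
row0 = [0.0, 0.0, 0.52303503, 0.42344193, 0.0, 0.0,
        0.52303503, 0.0, 0.0, 0.0, 0.0, 0.52303503, 0.0]

# Order the bigrams by column index so position i names column i.
names_by_column = sorted(vocabulary, key=vocabulary.get)

# Keep only the bigrams that actually occur in the first document.
nonzero = {names_by_column[i]: w for i, w in enumerate(row0) if w > 0}
print(nonzero)
```

The four non-zero entries are exactly the bigrams of "This is the first document." The bigram "is the" gets the lowest weight because it appears in three of the four documents, which matches its lower value in `idf_`.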
