## <font color='blue'>**Text Feature Extraction**</font>

**Term Frequency-Inverse Document Frequency (TF-IDF)** - It consists of two main components:

1. **Term Frequency (TF)**: This measures how often a term (word) occurs within a document. It is calculated as the ratio of the number of times the term appears in the document to the total number of terms in that document. Dividing by the document length normalizes the count, which prevents bias towards longer documents.

   Mathematically, TF is represented as:

   $$
   \text{TF}(t, d) = \frac{\text{number of times term } t \text{ appears in document } d}{\text{total number of terms in document } d}
   $$

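2. **Inverse Document Frequency (IDF)**: This measures how informative a term is across a collection of documents: terms that occur in few documents score higher. It is calculated as the logarithm of the ratio of the total number of documents to the number of documents containing the term. The formula below is the unsmoothed form; implementations often add smoothing (for example, adding 1 to both counts) to avoid division by zero when a term appears in no document.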

   Mathematically, IDF is represented as:

   $$
   \text{IDF}(t, D) = \log\left(\frac{\text{total number of documents in the collection } D}{\text{number of documents containing term } t}\right)
   $$

The TF-IDF score for a term $t$ in a document $d$ within a collection of documents $D$ is obtained by multiplying the TF and IDF scores:

$$
\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)
$$

This formula gives higher weights to terms that are frequent within the document but rare across the entire collection, thus helping to identify the most important terms in a document with respect to the entire collection.
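
To make the three formulas above concrete, here is a minimal sketch that applies them literally, using only the standard library. The toy documents and the helper names `tf`, `idf`, and `tf_idf` are assumptions made for illustration; tokenization is a plain whitespace split, and no smoothing is applied.

```python
import math

# Toy documents (illustrative only, not part of the original notebook)
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cats and the dogs",
]
tokenized = [doc.split() for doc in docs]


def tf(term, tokens):
    # Occurrences of the term divided by the total number of terms in the document
    return tokens.count(term) / len(tokens)


def idf(term, all_tokens):
    # log(total documents / documents containing the term), unsmoothed:
    # a term that appears in no document raises ZeroDivisionError
    n_containing = sum(1 for tokens in all_tokens if term in tokens)
    return math.log(len(all_tokens) / n_containing)


def tf_idf(term, tokens, all_tokens):
    return tf(term, tokens) * idf(term, all_tokens)


# "the" occurs in every document, so its IDF (and TF-IDF) is 0
print(tf_idf("the", tokenized[0], tokenized))
# "cat" occurs in only one document, so it gets the largest weight of the three
print(tf_idf("cat", tokenized[0], tokenized))
# "sat" occurs in two of the three documents, so it falls in between
print(tf_idf("sat", tokenized[0], tokenized))
```

The unsmoothed form fails for a term that appears in no document, which is one reason libraries usually apply a smoothed variant.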


In [28]:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

vectorizer = TfidfVectorizer(norm='l1')

X = vectorizer.fit_transform(corpus)

feature_names = vectorizer.get_feature_names_out()

print(feature_names)

print(X.shape)

['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
(4, 9)


In [30]:
import pandas as pd

# Convert the sparse TF-IDF matrix to a dense DataFrame, one column per term
tfidf_df = pd.DataFrame(X.toarray(), columns=feature_names)

tfidf_df.head()

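|   | and | document | first | is | one | second | the | third | this |
|---|-----|----------|-------|----|-----|--------|-----|-------|------|
| 0 | 0.0 | 0.213315 | 0.263487 | 0.174399 | 0.0 | 0.0 | 0.174399 | 0.0 | 0.174399 |
| 1 | 0.0 | 0.33226 | 0.0 | 0.135822 | 0.0 | 0.260274 | 0.135822 | 0.0 | 0.135822 |
| 2 | 0.219033 | 0.0 | 0.0 | 0.1143 | 0.219033 | 0.0 | 0.1143 | 0.219033 | 0.1143 |
| 3 | 0.0 | 0.213315 | 0.263487 | 0.174399 | 0.0 | 0.0 | 0.174399 | 0.0 | 0.174399 |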
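
The numbers in this table do not follow the plain TF × IDF formula directly. As a rough check, the sketch below recomputes them with the standard library only, assuming scikit-learn's default behaviour: raw term counts for TF, the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and then, because of `norm='l1'`, each row divided by the sum of its entries. The regular expression is an approximation of the vectorizer's default tokenizer (lowercased tokens of two or more word characters); the variable names are illustrative.

```python
import math
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Assumed tokenizer: lowercase, keep tokens of 2+ word characters
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]
vocab = sorted({token for tokens in tokenized for token in tokens})
n_docs = len(tokenized)

# Document frequency and smoothed IDF for every vocabulary term
df = {term: sum(term in tokens for tokens in tokenized) for term in vocab}
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

rows = []
for tokens in tokenized:
    counts = Counter(tokens)
    weights = [counts[term] * idf[term] for term in vocab]
    total = sum(weights)  # L1 norm, since every weight is non-negative
    rows.append([weight / total for weight in weights])

print(vocab)
for row in rows:
    print([round(value, 6) for value in row])
```

Under these assumptions each row sums to 1, and the rounded values should agree with the DataFrame above.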


In [1]:
text = ['hello! all is well', 'youtube link is https://www.youtube.com/']

In [2]:
import pandas as pd
data = pd.DataFrame()
data['text'] = text
data['target'] = [0, 1]

In [3]:
data.head()

|   | text | target |
|---|------|--------|
| 0 | hello! all is well | 0 |
| 1 | youtube link is https://www.youtube.com/ | 1 |


In [4]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [5]:
stop = set(stopwords.words('english'))

In [6]:
def remove_stop(text):
    """Lowercase the text and drop English stopwords (tokens are split on whitespace)."""
    words = [word.lower() for word in text.split() if word.lower() not in stop]
    return " ".join(words)

In [7]:
# sentence
sentence = "this is a natural language processing session"

# Apply stopword removal on sentence
sentence = remove_stop(sentence)
print(sentence)

natural language processing session


In [8]:
data['text'] = data['text'].apply(remove_stop)

In [9]:
data.head()

|   | text | target |
|---|------|--------|
| 0 | hello! well | 0 |
| 1 | youtube link https://www.youtube.com/ | 1 |


In [10]:
# re.compile(r'https?://\S+|www\.\S+'):
# re.compile() compiles the regular expression pattern provided as an argument into a regular expression object.
# r'https?://\S+|www\.\S+' is the regular expression pattern:
# https?:// matches 'http://' or 'https://', where the '?' makes the 's' character optional.
# \S+ matches one or more non-whitespace characters after 'http://' or 'https://'.
# | is the alternation operator, allowing either 'https?://\S+' or 'www\.\S+' to match.
# www\.\S+ matches 'www.' followed by one or more non-whitespace characters.
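
As a quick illustration of the pattern described above, the sketch below runs `findall` on a few sample strings. The sample strings and the name `url_pattern` are assumptions chosen for this example.

```python
import re

url_pattern = re.compile(r'https?://\S+|www\.\S+')

samples = [
    "see http://example.com now",        # matched: 's' is optional
    "see https://example.com/a?q=1 now", # matched up to the next whitespace
    "see www.example.com now",           # matched by the second alternative
    "see example.com now",               # not matched: no scheme, no 'www.'
    "visit https://example.com.",        # \S+ also swallows the trailing period
]

for sample in samples:
    print(repr(sample), "->", url_pattern.findall(sample))
```

Because `\S+` consumes every non-whitespace character, punctuation that directly follows a URL is treated as part of it, and bare domains without a scheme or `www.` prefix are left alone.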

In [11]:
# url.sub(r'', text):
# url.sub() is a method used to substitute occurrences of the pattern matched by the compiled regular expression (url) with the replacement string r''.
# r'' is an empty string, indicating that matched URLs will be replaced with nothing, effectively removing them from the input text.

In [12]:
import re

def remove_url(text):
    """Remove http://, https:// and www. URLs from the text."""
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'', text)

In [13]:
data['text'] = data['text'].apply(remove_url)

In [14]:
# The bare domain has no scheme or 'www.' prefix, so the pattern leaves it in place
sentence1 = "google.com link is https://www.google.com/"
sentence1 = remove_url(sentence1)
print(sentence1)

google.com link is 


In [15]:
data.head()

|   | text | target |
|---|------|--------|
| 0 | hello! well | 0 |
| 1 | youtube link | 1 |


In [16]:
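# str.maketrans("", "", string.punctuation):
# With three arguments, str.maketrans() maps each character in the first argument to the character at the same position in the second argument, and maps every character in the third argument to None.
# Here the first two arguments are empty, so the table only deletes: str.translate() removes every ASCII punctuation character listed in string.punctuation.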
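
To show what the translation table contains, the sketch below builds three small tables and applies them. The table names and sample strings are assumptions for illustration; only `str.maketrans`, `str.translate`, and `string.punctuation` come from the standard library.

```python
import string

# Third argument only: every listed character is mapped to None (deleted)
delete_table = str.maketrans("", "", "!?")
print(delete_table[ord("!")])            # None
print("abc!?".translate(delete_table))   # abc

# First two arguments only: character-for-character replacement
swap_table = str.maketrans("ab", "xy")
print("abc!?".translate(swap_table))     # xyc!?

# All three arguments: replace and delete in one pass
both_table = str.maketrans("ab", "xy", "!?")
print("abc!?".translate(both_table))     # xyc

# string.punctuation covers ASCII punctuation only, so curly quotes survive
punct_table = str.maketrans("", "", string.punctuation)
print("“quoted” text!".translate(punct_table))
```

The last line is worth noting for real-world text: typographic quotes and other non-ASCII punctuation need separate handling.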

In [17]:
import string

def remove_punct(text):
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table)

In [18]:
sentence2 = "This is a sentence with punctuations! Isn't it amazing?"
sentence2 = remove_punct(sentence2)
print(sentence2)

This is a sentence with punctuations Isnt it amazing


In [19]:
data['text'] = data['text'].apply(remove_punct)

In [20]:
data.head()

|   | text | target |
|---|------|--------|
| 0 | hello well | 0 |
| 1 | youtube link | 1 |
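
The three cleaning steps can also be combined into one function. The sketch below is one possible arrangement, not the notebook's own code: `DEMO_STOPWORDS` is a small hypothetical stopword set standing in for NLTK's English list so that the example runs without downloads, and the names `clean_text` and `strip_punct_then_urls` are illustrative. It also shows why URLs need to be removed before punctuation.

```python
import re
import string

# Hypothetical stand-in for NLTK's English stopword list (assumption)
DEMO_STOPWORDS = {"is", "the", "a", "all", "this"}
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text):
    # 1. Remove URLs while '://' and '.' are still present to match on
    text = URL_PATTERN.sub("", text)
    # 2. Delete ASCII punctuation
    text = text.translate(PUNCT_TABLE)
    # 3. Lowercase and drop stopwords; split/join also collapses extra spaces
    words = [w.lower() for w in text.split() if w.lower() not in DEMO_STOPWORDS]
    return " ".join(words)


def strip_punct_then_urls(text):
    # Wrong order, for comparison: with ':', '/' and '.' already deleted,
    # the URL pattern no longer matches anything
    text = text.translate(PUNCT_TABLE)
    return URL_PATTERN.sub("", text)


print(clean_text("hello! all is well"))
print(clean_text("youtube link is https://www.youtube.com/"))
print(strip_punct_then_urls("youtube link is https://www.youtube.com/"))
```

Stripping punctuation before stopword removal is a design choice in this sketch: it lets tokens such as `is.` be recognized as stopwords, which a whitespace split alone would miss.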
