## Feature Extraction from Text

![image.png](attachment:3829b526-7676-4f8a-89f0-befda595a82a.png)

In [1]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

### Bag of Words or CountVectorizer

In [2]:
# Lists of text documents (each list holds one document)

text = ["hello, my name is alamin and I am a data scientist."]

text2 = ["You are watching codewithalamin data science, alamin alamin"]

In [3]:
# Create the transform

vectorizer = CountVectorizer()

In [4]:
# Tokenize and build vocab

vectorizer.fit(raw_documents=text)

In [5]:
# Summarize: show the learned vocabulary (token -> column index)

print(vectorizer.vocabulary_)

{'hello': 4, 'my': 6, 'name': 7, 'is': 5, 'alamin': 0, 'and': 2, 'am': 1, 'data': 3, 'scientist': 8}


In [6]:
result = vectorizer.vocabulary_

result.get('alamin')

0

In [7]:
newvector = vectorizer.transform(raw_documents=text2)

In [8]:
newvector.toarray()

array([[2, 0, 0, 1, 0, 0, 0, 0, 0]], dtype=int64)

### TF-IDF

The purpose of TF-IDF is to highlight words that are frequent in a given document but rare across the other documents in the corpus.

In [10]:
# list of text documents
text3 = ["Aman is a data scientist in India","This is unfold data science","Data Science is a promising career"]

In [11]:
# Create the Transform

vectorizer2 = TfidfVectorizer()

In [12]:
# Tokenize and build vocab

vectorizer2.fit(text3)

In [13]:
print(vectorizer2.vocabulary_)

{'aman': 0, 'is': 5, 'data': 2, 'scientist': 8, 'in': 3, 'india': 4, 'this': 9, 'unfold': 10, 'science': 7, 'promising': 6, 'career': 1}


In [14]:
print(vectorizer2.idf_)

[1.69314718 1.69314718 1.         1.69314718 1.69314718 1.
 1.69314718 1.28768207 1.69314718 1.69314718 1.69314718]
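
These values are consistent with scikit-learn's default smoothed formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. The sketch below recomputes them by hand under that assumption (default `smooth_idf=True`); `manual_idf` and `sk_idf` are illustrative names:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

text3 = ["Aman is a data scientist in India",
         "This is unfold data science",
         "Data Science is a promising career"]

# Raw term counts, one row per document
counts = CountVectorizer().fit_transform(text3).toarray()

n_docs = counts.shape[0]
# Document frequency: in how many documents each term appears
df = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency
manual_idf = np.log((1 + n_docs) / (1 + df)) + 1

sk_idf = TfidfVectorizer().fit(text3).idf_
print(np.allclose(manual_idf, sk_idf))  # True
```

Terms that appear in all three documents (`data`, `is`) get the minimum value of 1, while terms found in a single document get about 1.69.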


In [15]:
text_as_input = text3[2]

text_as_input

'Data Science is a promising career'

In [16]:
newvector2 = vectorizer2.transform([text_as_input])

In [17]:
newvector2.toarray()

array([[0.        , 0.55249005, 0.32630952, 0.        , 0.        ,
        0.32630952, 0.55249005, 0.42018292, 0.        , 0.        ,
        0.        ]])
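
Each entry of this vector can be reproduced as term count times IDF, followed by scaling the row to unit Euclidean length (the default `norm='l2'`). A minimal sketch under those default settings, with `manual` and `sk` as illustrative names:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

text3 = ["Aman is a data scientist in India",
         "This is unfold data science",
         "Data Science is a promising career"]

tfidf = TfidfVectorizer().fit(text3)
count = CountVectorizer().fit(text3)  # same default tokenizer, same vocabulary order

# Term frequencies of the third document
tf = count.transform([text3[2]]).toarray()[0]

# Weight by IDF, then L2-normalize the row
raw = tf * tfidf.idf_
manual = raw / np.linalg.norm(raw)

sk = tfidf.transform([text3[2]]).toarray()[0]
print(np.allclose(manual, sk))  # True
```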

## Demo

### Bag of Words

In [18]:
corpus = ['Text processing is necessary.', 'Text processing is necessary and important.', 'Text processing is easy.']

corpus

['Text processing is necessary.',
 'Text processing is necessary and important.',
 'Text processing is easy.']

In [19]:
vec = CountVectorizer()
X = vec.fit_transform(corpus)

In [21]:
print(vec.get_feature_names_out())

['and' 'easy' 'important' 'is' 'necessary' 'processing' 'text']


In [22]:
print(X.toarray())

[[0 0 0 1 1 1 1]
 [1 0 1 1 1 1 1]
 [0 1 0 1 0 1 1]]
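
To see what `CountVectorizer` is doing here, the same matrix can be built with the standard library alone. This is a rough sketch, not scikit-learn's implementation: it assumes the default behaviour of lowercasing and keeping tokens of two or more word characters, and the names `tokenized`, `vocab`, and `matrix` are illustrative.

```python
import re
from collections import Counter

corpus = ['Text processing is necessary.',
          'Text processing is necessary and important.',
          'Text processing is easy.']

# Lowercase and keep tokens with at least two word characters
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]

# Vocabulary: all distinct tokens in alphabetical order
vocab = sorted({tok for doc in tokenized for tok in doc})

# One row per document, one column per vocabulary word
matrix = [[Counter(doc)[word] for word in vocab] for doc in tokenized]

print(vocab)
for row in matrix:
    print(row)
```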


### TF-IDF

In [23]:
vec2 = TfidfVectorizer()

X2 = vec2.fit_transform(corpus)

print(X2.toarray())

[[0.         0.         0.         0.46333427 0.59662724 0.46333427
  0.46333427]
 [0.52523431 0.         0.52523431 0.31021184 0.39945423 0.31021184
  0.31021184]
 [0.         0.69903033 0.         0.41285857 0.         0.41285857
  0.41285857]]


![image.png](attachment:4481cdb0-3509-4c3c-8c89-35f4a3a77352.png)