## Vector representation


### Count Vectorizer

- Definition: Converts a collection of text documents to a matrix of token counts.
- Purpose: To represent text data as numerical data for machine learning algorithms.

How it works:
- Each unique word in the corpus is assigned a unique integer index.
- The output is a sparse matrix in which each row is a document, each column is a word from the vocabulary, and each entry is the number of times that word occurs in that document.
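
The two steps above can be sketched by hand in plain Python. This is only an illustration under simplifying assumptions: the toy corpus and the names `vocab` and `count_vector` are made up for this example, tokens are split on whitespace, and the result is a dense list of lists rather than the sparse matrix a real vectorizer returns.

```python
from collections import Counter

# Hypothetical toy corpus, already lowercased and whitespace-separated.
docs = ["system human system", "user system"]

# Assign every unique word a column index (sorted so the order is deterministic).
unique_words = sorted({word for doc in docs for word in doc.split()})
vocab = {word: idx for idx, word in enumerate(unique_words)}

def count_vector(doc):
    """Return the dense row of word counts for one document."""
    counts = Counter(doc.split())
    # A Counter returns 0 for words the document does not contain.
    return [counts[word] for word in vocab]

matrix = [count_vector(doc) for doc in docs]
print(vocab)   # {'human': 0, 'system': 1, 'user': 2}
print(matrix)  # [[1, 2, 0], [0, 1, 1]]
```

Each row has one entry per vocabulary word, so the first document, which contains "system" twice, gets a 2 in the "system" column.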

### TF-IDF (Term Frequency-Inverse Document Frequency)

- Definition: Converts a collection of raw documents to a matrix of TF-IDF features.
- Purpose: To reflect the importance of a word in a document relative to the entire corpus.
- Components:
  - Term Frequency (TF): The number of times a word appears in a document, divided by the total number of words in that document.
  - Inverse Document Frequency (IDF): The logarithm of the ratio between the total number of documents and the number of documents containing the word. This reduces the weight of words that appear in many documents.
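
Written as formulas, the two components and their product look roughly as follows. The symbols are assumptions introduced here for illustration: $f_{t,d}$ is the count of term $t$ in document $d$, $N$ is the number of documents, and $n_t$ is the number of documents containing $t$.

$$
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log\frac{N}{n_t}, \qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d)\cdot\mathrm{idf}(t)
$$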

How it works:
- Words that are frequent in a document but rare in the corpus get higher scores.
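- The output is a sparse matrix similar to Count Vectorizer but with TF-IDF scores instead of counts. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its values differ from the plain TF × IDF product defined above.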
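
To make the scoring concrete, here is a minimal sketch that follows the textbook TF and IDF definitions from the component list (not the exact variant any particular library implements). The toy corpus and the helper names `tf`, `idf`, and `tfidf` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized.
docs = [
    ["user", "interface", "interface"],
    ["user", "system"],
    ["user", "time"],
]

def tf(term, doc):
    """Occurrences of the term divided by the document length."""
    return Counter(doc)[term] / len(doc)

def idf(term, corpus):
    """log(N / number of documents containing the term).

    Assumes the term occurs in at least one document; otherwise this
    unsmoothed form divides by zero.
    """
    doc_freq = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / doc_freq)

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

for term in ["user", "interface"]:
    print(term, tfidf(term, docs[0], docs))
```

In this corpus "user" appears in every document, so its IDF is log(1) = 0 and its score vanishes, while "interface" is frequent in the first document and absent elsewhere, so it scores highest there.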

In [5]:
import re
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import nltk

nltk.download('punkt')
nltk.download('stopwords')

# Sample documents
documents = [
    "Human machine interface for lab abc computer applications.",
    "A survey of user opinion of computer system response time.",
    "The EPS user interface management system.",
    "System and human system engineering testing of EPS.",
    "Relation of user perceived response time to error measurement."
]

[nltk_data] Downloading package punkt to /home/frangs/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/frangs/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [6]:
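# Build the stop-word set and the stemmer once rather than on every call
STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

# Preprocessing function
def preprocess(text):
    """Lowercase, strip punctuation, drop stop words, and stem one document."""
    # Lowercase
    text = text.lower()
    # Replace every non-word character (punctuation, symbols) with a space
    text = re.sub(r'\W', ' ', text)
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stop words
    tokens = [word for word in tokens if word not in STOP_WORDS]
    # Stemming
    tokens = [STEMMER.stem(word) for word in tokens]
    return ' '.join(tokens)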

# Preprocess all documents
preprocessed_documents = [preprocess(doc) for doc in documents]

In [7]:
preprocessed_documents

['human machin interfac lab abc comput applic',
 'survey user opinion comput system respons time',
 'ep user interfac manag system',
 'system human system engin test ep',
 'relat user perceiv respons time error measur']

In [None]:
# Count Vectorizer Example
count_vectorizer = CountVectorizer()
X_counts = count_vectorizer.fit_transform(preprocessed_documents)

# TF-IDF Example
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(preprocessed_documents)

In [None]:
# Display the vectorized data
print("Count Vectorizer Matrix:\n", X_counts.toarray())
print("TF-IDF Matrix:\n", X_tfidf.toarray())

# Print the feature names (vocabulary)
print("Count Vectorizer Feature Names:\n", count_vectorizer.get_feature_names_out())
print("TF-IDF Feature Names:\n", tfidf_vectorizer.get_feature_names_out())

Count Vectorizer Matrix:
 [[1 1 1 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 1 1 0 1 1]
 [0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 1]
 [0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 2 1 0 0]
 [0 0 0 0 0 1 0 0 0 0 0 1 0 1 1 1 0 0 0 1 1]]
TF-IDF Matrix:
 [[0.40986539 0.40986539 0.33067681 0.         0.         0.
  0.33067681 0.33067681 0.40986539 0.40986539 0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.        ]
 [0.         0.         0.36635462 0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.45408711 0.         0.         0.36635462 0.45408711 0.30410743
  0.         0.36635462 0.30410743]
 [0.         0.         0.         0.         0.45109178 0.
  0.         0.45109178 0.         0.         0.55911663 0.
  0.         0.         0.         0.         0.         0.37444693
  0.         0.         0.37444693]
 [0.         0.         0.         0.44298611 0.3573984  0.
  0.3573