# More on Text Analytics
## Text to Numbers Vectorizers
## Text Classification

# Text to Numbers

1. **One-Hot Encoding**:
   - Encodes a categorical variable as binary vectors whose length equals the number of categories; each value becomes a vector with a 1 in the position of its category and 0 everywhere else.



2. **Dummy Coding**:
   - Represents categorical variables as a series of binary variables (0 or 1) for each category, with one less variable than the number of categories.

3. **Label Encoding**:
   - Assigns a unique integer label to each category in a categorical variable, converting it into numerical format.

4. **Binary Vectorization**:
   - Represents the presence or absence of words in documents as binary values (1 for presence, 0 for absence), ignoring word frequency.

5. **CountVectorizer**:
   - Converts a collection of text documents into a matrix of token occurrences, where each row represents a document, and each column represents a unique word, with cell values indicating word frequencies.

6. **TF-IDF Vectorizer**:
   - Transforms text documents into numerical vectors based on the Term Frequency-Inverse Document Frequency (TF-IDF) metric, which reflects the importance of a word in a document relative to its frequency across the entire corpus.

7. **HashingVectorizer**:
   - Utilizes hashing functions to convert text documents into fixed-sized numerical vectors, enabling memory-efficient representation without storing the vocabulary.

8. **Word Embeddings**:
   - Dense vector representations of words in a continuous vector space, learned from large text corpora using techniques like Word2Vec, GloVe, or FastText.

9. **Doc2Vec**:
   - Extends Word2Vec to generate fixed-length vector representations (embeddings) for entire documents, considering both words and document IDs during training.
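
Item 7 is the only vectorizer in the list that works without a vocabulary, so a small illustration of the hashing trick may help. This is a minimal pure-Python sketch, not scikit-learn's `HashingVectorizer` (which uses MurmurHash3 and signed values by default). The function name `hash_vectorize`, the MD5-based bucket choice, and `n_features=8` are assumptions made only for this example.

```python
import hashlib
import re


def hash_vectorize(text, n_features=8):
    """Map a document to a fixed-length count vector without storing a vocabulary."""
    vector = [0] * n_features
    for token in re.findall(r"\w+", text.lower()):
        # Use a stable hash; the built-in hash() is salted per process for strings.
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % n_features  # column index for this token
        vector[bucket] += 1  # different tokens may collide in the same bucket
    return vector


docs = ["The sky is blue.", "The sun is shining brightly."]
vectors = [hash_vectorize(doc) for doc in docs]
for doc, vec in zip(docs, vectors):
    print(vec, "<-", doc)
```

Every document maps to a vector of the same length regardless of how many distinct words the corpus contains. The price is that columns can no longer be traced back to words, and collisions become more likely as `n_features` shrinks.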


In [6]:
import pandas as pd

# Create a DataFrame with categorical variables
data = {'Color': ['Red', 'Blue', 'Green', 'Red', 'Green'],
        'Shape': ['Square', 'Circle', 'Triangle', 'Square', 'Triangle']}
df = pd.DataFrame(data)

# Display the original DataFrame
print("Original DataFrame:")
df

Original DataFrame:


Unnamed: 0,Color,Shape
0,Red,Square
1,Blue,Circle
2,Green,Triangle
3,Red,Square
4,Green,Triangle


In [2]:
# Apply One-Hot Encoding
ohe_df = pd.get_dummies(df).astype(int)

# Display the DataFrame after One-Hot Encoding
print("\nDataFrame after One-Hot Encoding:")
ohe_df


DataFrame after One-Hot Encoding:


Unnamed: 0,Color_Blue,Color_Green,Color_Red,Shape_Circle,Shape_Square,Shape_Triangle
0,0,0,1,0,1,0
1,1,0,0,1,0,0
2,0,1,0,0,0,1
3,0,0,1,0,1,0
4,0,1,0,0,0,1


In [3]:
# Dummy Coded
dummy_df = pd.get_dummies(df, drop_first=True).astype(int)  # drop_first=True drops the first category (in sorted order) of each variable
dummy_df

Unnamed: 0,Color_Green,Color_Red,Shape_Square,Shape_Triangle
0,0,1,1,0
1,0,0,0,0
2,1,0,0,1
3,0,1,1,0
4,1,0,0,1


## Differences Between One-Hot Encoding and Dummy Coding

### One-Hot Encoding (OHE):
- **Encoding Technique:** Creates \( n \) binary columns for a categorical variable with \( n \) categories.
- **Binary Representation:** Each category is represented by a binary column, where 1 indicates the presence of the category and 0 indicates absence.
- **Use Cases:**
  - Suitable when there is no inherent order or hierarchy among categories.
  - Commonly used in classification tasks where each category is treated equally.
  - Applicable when the number of unique categories is relatively small and manageable.

### Dummy Coding:
- **Encoding Technique:** Creates \( n-1 \) binary columns for a categorical variable with \( n \) categories, treating one category as the reference or baseline.
- **Binary Representation:** Each category, except the reference category, is represented by a binary column.
- **Use Cases:**
  - Suitable for nominal categories (no inherent order is required) when one category can serve as the reference or baseline.
  - Ideal for regression analysis, particularly linear regression models with an intercept, where each coefficient is interpreted relative to the reference category.
  - Avoids the perfect multicollinearity that a full set of \( n \) indicator columns has with the intercept (the "dummy variable trap"), and uses one column fewer per variable than OHE.

### Key Differences:
- OHE creates \( n \) binary columns for \( n \) categories, while Dummy Coding creates \( n-1 \) binary columns, treating one category as the reference.
- In OHE, all categories are equally represented by binary columns, while in Dummy Coding, one category serves as the reference for comparison.
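- Neither encoding assumes an order among categories. The choice depends on the model: OHE is common for classifiers and tree-based or regularized models, while Dummy Coding is preferred for linear models with an intercept, where a redundant column would make the coefficients unidentifiable.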

### Example:
- For a categorical variable like "Color" with categories (Red, Blue, Green):
  - OHE would create three binary columns (Red, Blue, Green), representing each color separately.
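  - Dummy Coding would create two binary columns and treat the remaining category as the reference. Which category is the reference is a modeling choice: `pd.get_dummies(df, drop_first=True)` drops the first category in sorted order, so in the code above "Blue" is the reference and the columns are `Color_Green` and `Color_Red`.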
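
To make the difference concrete, here is a minimal pure-Python sketch of both encodings for the "Color" variable. The helper `one_hot` and its `reference` parameter are hypothetical names introduced for illustration; choosing "Red" as the reference is likewise just an assumption for this example.

```python
def one_hot(values, reference=None):
    """Return (column_names, rows); drop the `reference` category if one is given."""
    categories = sorted(set(values))
    if reference is not None:
        categories.remove(reference)  # dummy coding: the baseline gets no column
    rows = [[int(value == category) for category in categories] for value in values]
    return categories, rows


colors = ["Red", "Blue", "Green", "Red", "Green"]

ohe_cols, ohe_rows = one_hot(colors)                       # n columns
dummy_cols, dummy_rows = one_hot(colors, reference="Red")  # n - 1 columns

print(ohe_cols, ohe_rows)
print(dummy_cols, dummy_rows)

# Every one-hot row sums to 1, so the columns are collinear with an intercept term.
print([sum(row) for row in ohe_rows])
```

In the dummy-coded rows, the reference category shows up as an all-zero row, which is why the remaining coefficients in a linear model are read as differences from that baseline.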



In [7]:
df

Unnamed: 0,Color,Shape
0,Red,Square
1,Blue,Circle
2,Green,Triangle
3,Red,Square
4,Green,Triangle


In [4]:
from sklearn.preprocessing import LabelEncoder

# Initialize LabelEncoder
label_encoder = LabelEncoder()

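# Apply label encoding to each column of a copy, so the original df stays unchanged
label_df = df.copy()
for col in label_df.columns:
    # fit_transform refits the encoder for every column; classes are numbered in sorted order
    label_df[col] = label_encoder.fit_transform(label_df[col])
# Show the encoded DataFrame
label_df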

Unnamed: 0,Color,Shape
0,2,1
1,0,0
2,1,2
3,2,1
4,1,2
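
The integers above come from numbering the distinct values in sorted order. The following is a minimal sketch of that mapping in plain Python, assuming string categories; `label_encode` is a hypothetical helper, not part of scikit-learn.

```python
def label_encode(values):
    """Number the distinct values in sorted order and return (codes, mapping)."""
    classes = sorted(set(values))
    mapping = {label: index for index, label in enumerate(classes)}
    return [mapping[value] for value in values], mapping


colors = ["Red", "Blue", "Green", "Red", "Green"]
shapes = ["Square", "Circle", "Triangle", "Square", "Triangle"]

color_codes, color_map = label_encode(colors)
shape_codes, shape_map = label_encode(shapes)

print(color_map, color_codes)
print(shape_map, shape_codes)
```

Because the codes only reflect alphabetical order, a model that treats them as numbers would read Red (2) as "larger" than Blue (0). That is one reason one-hot or dummy coding is usually preferred for nominal input features.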


# MultiLabelBinarizer

- **Description**: `MultiLabelBinarizer` is a utility class in scikit-learn for samples that carry several labels at once. It converts a collection of label sets (one set per sample) into a binary indicator matrix with one column per distinct label.

- **Usage**:
  - Import the `MultiLabelBinarizer` class from `sklearn.preprocessing`.
  - Create an instance of `MultiLabelBinarizer`.
  - Fit the binarizer on the list of label sets using the `fit()` method.
  - Transform the label sets into binary format using the `transform()` method, or do both steps at once with `fit_transform()`.


In [8]:
from sklearn.preprocessing import MultiLabelBinarizer

# Sample multi-label data as a DataFrame
data = {'Labels': [('apple', 'banana'), ('banana', 'orange'), ('apple', 'orange', 'pear'), ('pear',)]}
df = pd.DataFrame(data)

# Instantiate MultiLabelBinarizer
mlb = MultiLabelBinarizer()

# Fit and transform the labels
binary_labels = mlb.fit_transform(df['Labels'])

# Convert the binary labels into a DataFrame
binary_df = pd.DataFrame(binary_labels, columns=mlb.classes_)

# Concatenate the original DataFrame with the binary labels DataFrame
result_df = pd.concat([df, binary_df], axis=1)
result_df

Unnamed: 0,Labels,apple,banana,orange,pear
0,"(apple, banana)",1,1,0,0
1,"(banana, orange)",0,1,1,0
2,"(apple, orange, pear)",1,0,1,1
3,"(pear,)",0,0,0,1


# CountVectorizer

- `CountVectorizer` is a feature extraction technique used to convert text data into numerical feature vectors.
- It tokenizes text documents and counts the occurrences of each word (or token) in the document.

### Example
- Suppose we have a DataFrame with sentences in each cell of a column named "Text".
- We want to use `CountVectorizer` to tokenize the sentences and count the occurrences of each word.

### How it Works
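- When you initialize `CountVectorizer` and call `fit_transform` on your text data, it first lowercases the text and tokenizes it. With the default settings a token is a run of two or more word characters, so punctuation and single-character words are dropped.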
- Then, it builds a vocabulary of all unique words or tokens found in the text data.
- Finally, it counts the occurrences of each word or token in each document and constructs a feature matrix where rows represent documents and columns represent words, with counts indicating the occurrences of each word in each document.
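
The three steps above can be sketched in plain Python. This is an approximation written under the assumption that the defaults are lowercasing plus tokens of two or more word characters; it is not scikit-learn's actual implementation. The names `tokenize` and `vectorize` are illustrative, and the `binary` flag shows the presence/absence variant from item 4 of the overview.

```python
import re
from collections import Counter

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "The sky is blue.",
    "The sun is shining brightly.",
]

TOKEN_PATTERN = re.compile(r"\b\w\w+\b")  # runs of two or more word characters


def tokenize(text):
    """Step 1: lowercase and split into tokens, dropping punctuation."""
    return TOKEN_PATTERN.findall(text.lower())


tokenized = [tokenize(doc) for doc in docs]

# Step 2: the vocabulary is the sorted set of all tokens in the corpus.
vocabulary = sorted({token for tokens in tokenized for token in tokens})


def vectorize(tokens, binary=False):
    """Step 3: one count (or 0/1 flag) per vocabulary term."""
    counts = Counter(tokens)
    if binary:
        return [int(term in counts) for term in vocabulary]
    return [counts[term] for term in vocabulary]


count_matrix = [vectorize(tokens) for tokens in tokenized]
binary_matrix = [vectorize(tokens, binary=True) for tokens in tokenized]

print(vocabulary)
for row in count_matrix:
    print(row)
```

The only difference between the two matrices is the last column of the first row: "the" occurs twice in the first sentence, so its count is 2 while its binary flag is 1.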

In [9]:
# Sample DataFrame with sentences
data = {'Text': ['The quick brown fox jumps over the lazy dog.', 'The sky is blue.', 'The sun is shining brightly.']}
df = pd.DataFrame(data)
df


Unnamed: 0,Text
0,The quick brown fox jumps over the lazy dog.
1,The sky is blue.
2,The sun is shining brightly.


In [10]:

all_text = ' '.join(df['Text'])
all_text

'The quick brown fox jumps over the lazy dog. The sky is blue. The sun is shining brightly.'

In [11]:
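import nltk
# word_tokenize needs NLTK's tokenizer data; if it is missing, run nltk.download('punkt')
# (newer NLTK versions may ask for 'punkt_tab' instead)
tokens = nltk.word_tokenize(all_text)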
print(tokens)

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.', 'The', 'sky', 'is', 'blue', '.', 'The', 'sun', 'is', 'shining', 'brightly', '.']


In [12]:
# Lowercase, drop punctuation-only tokens, and keep each distinct word once
filtered_tokens = sorted({token.lower() for token in tokens if token.isalnum()})
print(filtered_tokens)

['blue', 'brightly', 'brown', 'dog', 'fox', 'is', 'jumps', 'lazy', 'over', 'quick', 'shining', 'sky', 'sun', 'the']


In [13]:
df

Unnamed: 0,Text
0,The quick brown fox jumps over the lazy dog.
1,The sky is blue.
2,The sun is shining brightly.


In [14]:
from sklearn.feature_extraction.text import CountVectorizer
# Initialize CountVectorizer
count_vectorizer = CountVectorizer()

# Fit and transform the text data
text_counts = count_vectorizer.fit_transform(df['Text'])

# Convert the transformed text data into a DataFrame
count_df = pd.DataFrame(text_counts.toarray(), columns=count_vectorizer.get_feature_names_out())

# Concatenate the original DataFrame with the count DataFrame
result_df = pd.concat([df, count_df], axis=1)
result_df

Unnamed: 0,Text,blue,brightly,brown,dog,fox,is,jumps,lazy,over,quick,shining,sky,sun,the
0,The quick brown fox jumps over the lazy dog.,0,0,1,1,1,0,1,1,1,1,0,0,0,2
1,The sky is blue.,1,0,0,0,0,1,0,0,0,0,0,1,0,1
2,The sun is shining brightly.,0,1,0,0,0,1,0,0,0,0,1,0,1,1



# TF-IDF Vectorizer

TF-IDF (Term Frequency-Inverse Document Frequency) is a commonly used technique in natural language processing for converting text documents into numerical vectors. It aims to reflect the importance of a term within a document relative to its importance across a collection of documents.

## Term Frequency (TF)

TF measures the frequency of a term within a document. It is calculated as:

$$
\text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}
$$

## Inverse Document Frequency (IDF)

IDF measures how rare, and therefore how informative, a term is across a collection of documents. It is calculated as the logarithm of the ratio of the total number of documents to the number of documents containing the term $ t $:

$$
\text{IDF}(t, D) = \log\left(\frac{N}{\text{df}(t, D)}\right)
$$

Where:
- $ N $ is the total number of documents in the collection
- $ \text{df}(t, D) $ is the number of documents containing term $ t $ (document frequency)

## TF-IDF

TF-IDF combines TF and IDF to assign a weight to each term in a document relative to its importance in the entire document collection. The TF-IDF score for a term $ t $ in a document $ d $ is calculated as:

$$
\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)
$$

The TF-IDF Vectorizer converts each document into a vector where each component represents the TF-IDF score of a term in the document. These vectors can then be used as input for various machine learning algorithms.

## CountVectorizer vs TfidfVectorizer

| Feature                  | CountVectorizer                                                   | TfidfVectorizer                                                                     |
|--------------------------|-------------------------------------------------------------------|-------------------------------------------------------------------------------------|
| Representation           | Raw count of each term in each document                           | Term count rescaled by how rare the term is across the corpus                       |
| Model Suitability        | Suitable for basic models and tasks where word frequency suffices | Often improves feature quality, particularly for tasks like text classification     |
| Term Weighting           | Weight equals the raw count, so frequent words dominate           | Assigns higher weights to discriminative terms, reduces sensitivity to common words |
| Importance Consideration | Does not consider importance across the document corpus           | Considers term importance across the entire document corpus                         |


In [17]:
# Sample DataFrame
data = {'Text': ['The quick brown fox jumps over the lazy dog.', 'The sky is blue.', 'The sun is shining brightly.']}
df = pd.DataFrame(data)
df

Unnamed: 0,Text
0,The quick brown fox jumps over the lazy dog.
1,The sky is blue.
2,The sun is shining brightly.


In [20]:
1/9  # TF of a term that occurs once in the 9-token first sentence

0.1111111111111111

In [21]:

# Step 1: Calculate Term Frequency (TF)
# Tokenize by lowercasing and splitting on whitespace (punctuation stays attached, e.g. 'dog.'),
# then divide each term's count by the number of tokens in its document
tokenized_text = [text.lower().split() for text in df['Text']]
tf = [{term: tokens.count(term) / len(tokens) for term in tokens} for tokens in tokenized_text]
tf

[{'the': 0.2222222222222222,
  'quick': 0.1111111111111111,
  'brown': 0.1111111111111111,
  'fox': 0.1111111111111111,
  'jumps': 0.1111111111111111,
  'over': 0.1111111111111111,
  'lazy': 0.1111111111111111,
  'dog.': 0.1111111111111111},
 {'the': 0.25, 'sky': 0.25, 'is': 0.25, 'blue.': 0.25},
 {'the': 0.2, 'sun': 0.2, 'is': 0.2, 'shining': 0.2, 'brightly.': 0.2}]

In [22]:
import numpy as np
# Step 2: Calculate Inverse Document Frequency (IDF)
# Count the number of documents containing each term
# (set() ensures a term is counted at most once per document)
document_frequency = {}
for doc_tokens in tokenized_text:
    for term in set(doc_tokens):
        document_frequency[term] = document_frequency.get(term, 0) + 1

# Calculate IDF for each term
N = len(df)
idf = {term: np.log(N / document_frequency[term]) for term in document_frequency}
idf

{'dog.': 1.0986122886681098,
 'the': 0.0,
 'fox': 1.0986122886681098,
 'quick': 1.0986122886681098,
 'over': 1.0986122886681098,
 'brown': 1.0986122886681098,
 'lazy': 1.0986122886681098,
 'jumps': 1.0986122886681098,
 'is': 0.4054651081081644,
 'blue.': 1.0986122886681098,
 'sky': 1.0986122886681098,
 'sun': 1.0986122886681098,
 'shining': 1.0986122886681098,
 'brightly.': 1.0986122886681098}

In [23]:

# Step 3: Calculate TF-IDF
tfidf = [{term: tf_doc.get(term, 0) * idf[term] for term in tf_doc} for tf_doc in tf]

# Print TF and TF-IDF for each document
for i, (tf_doc, tfidf_doc) in enumerate(zip(tf, tfidf), 1):
    print(f"Document {i}:")
    print("TF:", tf_doc)
    print("TF-IDF:", tfidf_doc)
    print()

Document 1:
TF: {'the': 0.2222222222222222, 'quick': 0.1111111111111111, 'brown': 0.1111111111111111, 'fox': 0.1111111111111111, 'jumps': 0.1111111111111111, 'over': 0.1111111111111111, 'lazy': 0.1111111111111111, 'dog.': 0.1111111111111111}
TF-IDF: {'the': 0.0, 'quick': 0.12206803207423442, 'brown': 0.12206803207423442, 'fox': 0.12206803207423442, 'jumps': 0.12206803207423442, 'over': 0.12206803207423442, 'lazy': 0.12206803207423442, 'dog.': 0.12206803207423442}

Document 2:
TF: {'the': 0.25, 'sky': 0.25, 'is': 0.25, 'blue.': 0.25}
TF-IDF: {'the': 0.0, 'sky': 0.27465307216702745, 'is': 0.1013662770270411, 'blue.': 0.27465307216702745}

Document 3:
TF: {'the': 0.2, 'sun': 0.2, 'is': 0.2, 'shining': 0.2, 'brightly.': 0.2}
TF-IDF: {'the': 0.0, 'sun': 0.21972245773362198, 'is': 0.08109302162163289, 'shining': 0.21972245773362198, 'brightly.': 0.21972245773362198}



In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Initialize TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the text data
tfidf_matrix = tfidf_vectorizer.fit_transform(df['Text'])

# Convert the matrix to DataFrame for better visualization
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df

Unnamed: 0,blue,brightly,brown,dog,fox,is,jumps,lazy,over,quick,shining,sky,sun,the
0,0.0,0.0,0.345129,0.345129,0.345129,0.0,0.345129,0.345129,0.345129,0.345129,0.0,0.0,0.0,0.407678
1,0.584483,0.0,0.0,0.0,0.0,0.444514,0.0,0.0,0.0,0.0,0.0,0.584483,0.0,0.345205
2,0.0,0.504611,0.0,0.0,0.0,0.38377,0.0,0.0,0.0,0.0,0.504611,0.0,0.504611,0.298032
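
These values differ from the manual TF-IDF computed earlier because `TfidfVectorizer`, with its default settings, uses the raw count as TF, a smoothed IDF of $\ln\frac{1+N}{1+\text{df}(t, D)} + 1$, and then scales each row to unit L2 norm; its tokenizer also strips punctuation. Below is a minimal pure-Python sketch under those assumptions (the helper name `tfidf_row` is illustrative, and this is not the library's code):

```python
import math
import re
from collections import Counter

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "The sky is blue.",
    "The sun is shining brightly.",
]

# Lowercase and keep tokens of two or more word characters
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed IDF: ln((1 + N) / (1 + df)) + 1, so terms in every document keep weight 1
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}


def tfidf_row(tokens):
    """Raw count times smoothed IDF, then L2-normalize the document vector."""
    counts = Counter(tokens)
    weights = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(weight ** 2 for weight in weights.values()))
    return {term: weight / norm for term, weight in weights.items()}


rows = [tfidf_row(tokens) for tokens in tokenized]
for row in rows:
    print({term: round(weight, 6) for term, weight in row.items()})
```

Two consequences are visible in the table: "the" no longer gets a weight of 0 even though it appears in every document, and each row has unit length, so longer documents are not favoured.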




# Text Classification

In [26]:
spam = pd.read_csv("spamtext.csv")
spam.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [27]:
spam.shape

(5572, 2)

In [28]:
# Initialize CountVectorizer
vectorizer = CountVectorizer()

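# Fit and transform the text data
# (the vocabulary is learned from all messages here, before the train/test split)
X = vectorizer.fit_transform(spam['Message'])

# Get the feature names
feature_names = vectorizer.get_feature_names_out()

# Create DataFrame with features (toarray() turns the sparse matrix into a dense one,
# which is convenient for display but memory-heavy for large vocabularies)
features_df = pd.DataFrame(X.toarray(), columns=feature_names)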

# Concatenate features with the original DataFrame
result_df = pd.concat([spam, features_df], axis=1)
result_df.head(3)

Unnamed: 0,Category,Message,00,000,000pes,008704050406,0089,0121,01223585236,01223585334,...,zhong,zindgi,zoe,zogtorius,zoom,zouk,zyada,èn,ú1,〨ud
0,ham,"Go until jurong point, crazy.. Available only ...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,Ok lar... Joking wif u oni...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [30]:
y = result_df.Category
x = result_df.drop(columns=['Category', 'Message'])

In [31]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

# Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Initialize the logistic regression model
logistic_model = LogisticRegression(max_iter=1000)

# Perform cross-validation on the training set
cv_scores = cross_val_score(logistic_model, x_train, y_train, cv=5)

# Fit the model on the training data
logistic_model.fit(x_train, y_train)

# Calculate the training accuracy
train_accuracy = logistic_model.score(x_train, y_train)

# Calculate the testing accuracy
test_accuracy = logistic_model.score(x_test, y_test)

# Print the results
print("Cross-Validation Scores:", cv_scores)
print("Mean Cross-Validation Score:", cv_scores.mean())
print("Training Accuracy:", train_accuracy)
print("Testing Accuracy:", test_accuracy)

Cross-Validation Scores: [0.98654709 0.9809417  0.97643098 0.97979798 0.97643098]
Mean Cross-Validation Score: 0.9800297443795201
Training Accuracy: 0.99798070450976
Testing Accuracy: 0.9856502242152466
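
The cells above fit the vectorizer on every message before splitting, so the vocabulary is learned from the test messages as well. One common alternative is to wrap the vectorizer and the classifier in a scikit-learn pipeline, which refits the vectorizer inside each training fold and keeps the features sparse. The sketch below uses a small hypothetical list of messages in place of `spam['Message']` and `spam['Category']`, so its scores say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the spam DataFrame columns
messages = [
    "Win a free prize now", "Free entry to win cash", "Claim your free reward today",
    "Urgent call now to win", "Cash prize waiting claim now", "Free tickets win today",
    "Are we meeting for lunch", "See you at home tonight", "Can you call me later",
    "I will be late today", "Thanks for the help yesterday", "Let us meet at the office",
]
labels = ["spam"] * 6 + ["ham"] * 6

# The vectorizer is a pipeline step, so it is fitted only on the training part of each fold
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))

scores = cross_val_score(model, messages, labels, cv=3)
print("Cross-validation scores:", scores)

# Fit on all toy messages and classify a new one
model.fit(messages, labels)
prediction = model.predict(["free cash prize"])
print("Prediction:", prediction[0])
```

With the real data, the same pipeline could be passed to `train_test_split` output directly (raw message strings as `x_train`), without building the wide dense `result_df` first.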


In [32]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TfidfVectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the text data
X = vectorizer.fit_transform(spam['Message'])

# Get the feature names
feature_names = vectorizer.get_feature_names_out()

# Create DataFrame with features
features_df = pd.DataFrame(X.toarray(), columns=feature_names)

# Concatenate features with the original DataFrame
result_df = pd.concat([spam, features_df], axis=1)
result_df.head(3)


Unnamed: 0,Category,Message,00,000,000pes,008704050406,0089,0121,01223585236,01223585334,...,zhong,zindgi,zoe,zogtorius,zoom,zouk,zyada,èn,ú1,〨ud
0,ham,"Go until jurong point, crazy.. Available only ...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,ham,Ok lar... Joking wif u oni...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [35]:
y = result_df.Category
x_tfidf = result_df.drop(columns=['Category', 'Message'])

In [36]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier

# Assuming x_tfidf and y are already defined

# Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x_tfidf, y, test_size=0.2, random_state=42)

# Initialize the Random Forest model
# (no random_state is set, so the scores vary slightly from run to run)
random_forest_model = RandomForestClassifier()

# Perform cross-validation on the training set
cv_scores = cross_val_score(random_forest_model, x_train, y_train, cv=5)

# Fit the model on the training data
random_forest_model.fit(x_train, y_train)

# Calculate the training accuracy
train_accuracy = random_forest_model.score(x_train, y_train)

# Calculate the testing accuracy
test_accuracy = random_forest_model.score(x_test, y_test)

# Print the results
print("Cross-Validation Scores:", cv_scores)
print("Mean Cross-Validation Score:", cv_scores.mean())
print("Training Accuracy:", train_accuracy)
print("Testing Accuracy:", test_accuracy)


Cross-Validation Scores: [0.97982063 0.97421525 0.96969697 0.97418631 0.9708193 ]
Mean Cross-Validation Score: 0.9737476911617421
Training Accuracy: 1.0
Testing Accuracy: 0.9811659192825112
