# Text Data

## Types of Datasets
   - **Numerical Dataset:** Contains measurable quantities and can be analyzed mathematically. Examples include temperature, humidity, and test scores.
   - **Categorical Dataset:** Comprises a set of categories or groups. Examples are colors, product categories, and yes/no responses.
   - **Time Series Dataset:** Captures data points at successive time intervals. Useful for analyzing trends over time.
   - **Image Dataset:** Consists of image files, often used in computer vision tasks to identify patterns or objects.
   - **Text Dataset:** Includes collections of words, sentences, or documents, typically analyzed for linguistic patterns or content.

   Many other dataset types exist as well.

## Text Data

1. **Understanding Text Data:**
   - Text data is composed of sequences of characters, forming words, sentences, or paragraphs.
   - It varies in length and complexity, often containing nuanced linguistic features.

2. **Text vs. Categorical Data:**
   - A string of characters can represent either categorical data or free text, so the two are easy to confuse.
   - Categorical data is derived from a predefined set of options, such as 'red' or 'blue', 'yes' or 'no'.
   - Text data, however, is more fluid, often forming meaningful phrases or sentences that convey complex ideas.

3. **Analyzing Text: Corpus and Documents:**
   - Text analysis typically involves examining a large body of text, known as a [corpus](https://en.wikipedia.org/wiki/Text_corpus).
   - Within this corpus, each individual text entry, whether an article, social media post, or review, is termed a **document**.
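
To make the terminology concrete, here is a minimal sketch that stores a corpus as a plain Python list of strings and contrasts it with a categorical column. The review sentences and the `colors` values are invented for illustration only; they are not part of this notebook's data.

```python
# Hypothetical corpus: a plain Python list where each string is one document
corpus = [
    "The food was great and the service was friendly.",
    "Terrible experience, I will not come back.",
    "Great location, average food.",
]

# Hypothetical categorical column: every value comes from a small fixed set
colors = ["red", "blue", "red", "blue", "red"]

print(f"Documents in corpus: {len(corpus)}")
print(f"Distinct categorical values: {sorted(set(colors))}")
print(f"Distinct documents: {len(set(corpus))}")
```

The categorical column collapses to a handful of repeated values, while every document in the corpus is typically unique, which is why text needs a separate encoding step.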

## What is bag-of-words?

Bag-of-words is a technique in natural language processing where we ignore the structure of input text and focus solely on word occurrences. It's like mentally holding a bag of words and counting how many times each word appears in a document.

## Steps for bag-of-words representation

1. **Tokenization**:
   - **Tokenization** is the process of splitting each document (text) into individual words or tokens.
   - We achieve this by breaking the text at whitespace, punctuation marks, or other delimiters.

2. **Vocabulary Building**:
   - Next, we create a **vocabulary** containing all unique words (tokens) that appear in any of the documents.
   - Each word is assigned a unique **index** (usually in alphabetical order).

3. **Encoding**:
   - For each document, we count how often each word from the vocabulary appears in that document.
   - The resulting vector represents the word frequencies (counts) for that document.
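
The three steps above can be sketched by hand in plain Python. This is only an illustration, not how scikit-learn is implemented: the helper names `tokenize` and `encode` are hypothetical, and the regular expression is an assumption chosen to approximate a common default (lowercased tokens of two or more word characters).

```python
import re
from collections import Counter

sample_docs = [
    "Columbia, Columbia, a city so vibrant, so vibrant.",
    "The Missouri River, the Missouri River, so scenic, so scenic.",
]

def tokenize(text):
    # Step 1: lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in sample_docs]

# Step 2: collect the unique tokens and index them in alphabetical order
unique_tokens = sorted({token for tokens in tokenized for token in tokens})
vocabulary = {word: index for index, word in enumerate(unique_tokens)}

def encode(tokens):
    # Step 3: count each vocabulary word; words absent from the document get 0
    counts = Counter(tokens)
    return [counts[word] for word in unique_tokens]

vectors = [encode(tokens) for tokens in tokenized]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each document becomes a vector whose length equals the vocabulary size, so documents of different lengths end up with the same shape.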

## Implementing Bag-of-Words

In the scikit-learn library, the [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) is utilized to transform text data into a bag-of-words representation.

By default, the `CountVectorizer` class converts all text to lowercase, so words that differ only in capitalization (such as "The" and "the") are treated as the same token.
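
The effect of case handling can be checked with a small sketch that fits two vectorizers on the same input, one with the default settings and one with `lowercase=False`. The two sample sentences are made up for this comparison.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences that differ only in capitalization
case_sample = ["Columbia is vibrant.", "COLUMBIA is Vibrant."]

default_vect = CountVectorizer()               # lowercases text by default
cased_vect = CountVectorizer(lowercase=False)  # keeps the original casing

default_vocab = sorted(default_vect.fit(case_sample).vocabulary_)
cased_vocab = sorted(cased_vect.fit(case_sample).vocabulary_)

print(f"Default vocabulary ({len(default_vocab)} tokens): {default_vocab}")
print(f"Case-sensitive vocabulary ({len(cased_vocab)} tokens): {cased_vocab}")
```

With casing preserved, each capitalization variant gets its own column, which inflates the vocabulary without adding much information for most tasks.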

<font color='Blue'><b>Example:</b></font> Let's create a simple example of the Bag-of-Words (BoW) model using text related to Columbia, Missouri. We'll follow the steps mentioned earlier.

Tokenization of each document can be done as follows:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Step 0: Collect Data
# Define the documents
docs = ["Columbia, Missouri is known for its vibrant college town atmosphere.",
        "The University of Missouri in Columbia is a major research institution.",
        "Columbia's weather can be unpredictable, especially in spring."
        ]

# Initialize the CountVectorizer
vectorizer = CountVectorizer()

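# build_analyzer() returns a callable that applies preprocessing (lowercasing)
# and tokenization to a single document
tokenizer = vectorizer.build_analyzer()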
for (i, doc) in enumerate(docs, 1):
    print(f'\n\033[1m\033[34mDoc {i}:\033[0m')  # Bold and blue
    print("Original document:")
    print(doc)
    print("Tokenized document:")
    print(tokenizer(doc))

In [None]:
# Import the necessary libraries
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Step 0: Collect Data
# Define the documents
docs = ["Columbia, Missouri is known for its vibrant college town atmosphere.",
        "The University of Missouri in Columbia is a major research institution.",
        "Columbia's weather can be unpredictable, especially in spring."
        ]
# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit the vectorizer to the documents
vectorizer.fit(docs)

# Transform the documents into a bag-of-words matrix
bag_of_words = vectorizer.transform(docs)

# Get the feature names (vocabulary)
feature_names = vectorizer.get_feature_names_out()
print('Feature Names (Vocabulary): ' + ', '.join(feature_names))

# Display the vocabulary size and content
print(f"Vocabulary size: {len(vectorizer.vocabulary_)}")
print(f"Vocabulary content: {vectorizer.vocabulary_}")

# Display the dense representation of the bag_of_words
# print("Dense representation of bag_of_words:\n{}".format(bag_of_words.toarray()))

# Create a DataFrame from the BoW matrix
df_bow = pd.DataFrame(bag_of_words.toarray(), columns=feature_names)

# Display the DataFrame
display(df_bow)

## Bag-of-Words Example with Repeating Words

- **Sentences used:**
  - "Columbia, Columbia, a city so vibrant, so vibrant."
  - "The Missouri River, the Missouri River, so scenic, so scenic."

- **How it works:**
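  - The CountVectorizer tokenizes the text and counts how many times each word appears in each sentence. Its default token pattern keeps only tokens of two or more characters, which is why "a" does not appear in the vocabulary.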
  - The result is a matrix (shown below as a table) where each row is a sentence and each column is a word from the combined vocabulary.

### Vocabulary Learned

| city | columbia | missouri | river | scenic | so | the | vibrant |
|------|----------|----------|-------|--------|----|-----|---------|
|  0   |    1     |    2     |   3   |   4    | 5  |  6  |    7    |

### Bag-of-Words Matrix

|        | city | columbia | missouri | river | scenic | so | the | vibrant |
|--------|------|----------|----------|-------|--------|----|-----|---------|
| **Doc 1** |  1   |    2     |    0     |   0   |   0    | 2  |  0  |    2    |
| **Doc 2** |  0   |    0     |    2     |   2   |   2    | 2  |  2  |    0    |

- For example, in Doc 1, "columbia" appears 2 times, "city" once, "so" twice, and "vibrant" twice.
- In Doc 2, "missouri" and "river" each appear 2 times, as do "so", "scenic", and "the".

<font color='Blue'><b>Example:</b></font> Bag-of-Words (BoW) model using text related to Columbia, Missouri with **repeating words**

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Define a list of sentences with repeating words related to Columbia, Missouri
repeating_words = ["Columbia, Columbia, a city so vibrant, so vibrant.",
                   "The Missouri River, the Missouri River, so scenic, so scenic."]

# Initialize a CountVectorizer object to convert the text data into a matrix of token counts
vect = CountVectorizer()

# Fit the vectorizer to the list of sentences to build the vocabulary
vect.fit(repeating_words)

# Display the vocabulary that has been learned from the input documents
vocab = vect.vocabulary_
print(f"Vocabulary learned from the documents: {vocab}")

# Transform the list of sentences into a bag-of-words matrix
bag_of_words = vect.transform(repeating_words)

# Convert the bag-of-words matrix into a pandas DataFrame for better visualization
df_bow = pd.DataFrame(bag_of_words.toarray(),
                      columns=vect.get_feature_names_out())

# Display the DataFrame that shows the frequency of each word in the given sentences
print("DataFrame showing the Bag-of-Words matrix:")
display(df_bow)

## Enhancing Bag-of-Words: Stopword Removal

In the Bag-of-Words (BoW) model, certain words are so common that they carry minimal useful information about the actual content of the document. These words, known as 'stopwords', can be removed to improve the analysis. There are two primary methods to eliminate stopwords:

1. Utilizing a predefined list of stopwords specific to a language.
2. Excluding words that appear too frequently across the documents.

The scikit-learn library provides a built-in English stopword list in the `sklearn.feature_extraction.text` module. Passing `stop_words="english"` to `CountVectorizer` filters these words out during vectorization, resulting in a more meaningful BoW representation.


In [None]:
# Import the set of English stop words from scikit-learn's feature_extraction.text module
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Display the total number of stop words provided in the scikit-learn's list
print(f"Number of stop words: {len(ENGLISH_STOP_WORDS)}")

# To provide a sample of this list, print every 20th stop word
# The list is a frozenset with no fixed order, so sort it first for a stable sample
print("Every 20th stopword:")
print(sorted(ENGLISH_STOP_WORDS)[::20])

In [None]:
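# Combine the earlier documents with the repeating-word sentences
docs_ext = docs + repeating_words
print(f"Total number of documents: {len(docs_ext)}")
print(docs_ext)

# Remove scikit-learn's built-in English stop words during vectorization
vect = CountVectorizer(stop_words="english")
vect.fit(docs_ext)
bag_of_words = vect.transform(docs_ext)

# Show the resulting bag-of-words matrix as a DataFrame
pd.DataFrame(bag_of_words.toarray(), columns=vect.get_feature_names_out())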