<center><font size=8>Analyzing Text Data - Learners</font></center>

# **Installing and Importing the Necessary Libraries**

In [1]:
!pip install pandas==2.2.2 numpy==2.0.2 nltk==3.9.1 scikit-learn==1.6.1   -q

In [2]:
# to read and manipulate the data
import pandas as pd
import numpy as np

# displaying the full contents of each column instead of truncating long text
pd.set_option('display.max_colwidth', None)

### **Removing special characters from the text**

<h2>Why Remove Special Characters in Text Preprocessing?</h2>

- Special characters can introduce unnecessary elements, complicating text analysis.

- Clean text is simpler to work with and understand.

<h2>How to Implement Special Character Removal in Text Preprocessing?</h2>

- We can start by manually replacing unwanted characters like @, #, or punctuation marks with spaces.

- String functions like `replace()` help with simple replacements across the text.

In [4]:
text = "office.aiml.utaustin@mygreatlearning.com"

In [5]:
upd_text1 = text.replace("@", " ")
upd_text1

'office.aiml.utaustin mygreatlearning.com'

- We can observe that the '@' character has been replaced with a whitespace.
- Additionally, there are some punctuation marks in the email address. Let's replace those as well.

In [6]:
upd_text2 = upd_text1.replace(".", " ")
upd_text2

'office aiml utaustin mygreatlearning com'

- The punctuation marks ('.') are also replaced.


- However, as the text becomes larger and contains various patterns of special characters, handling them manually becomes tedious.

- In such cases, we need more flexible methods to efficiently identify and remove multiple patterns at once - this is where advanced tools like regular expressions come in handy.
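
As a rough sketch of the point above, the two chained `replace()` calls can be collapsed into a single regular-expression substitution. The character set `[@.]` is only an illustrative choice for this email address, and `re.sub` is explained in detail in the next section.

```python
import re

text = "office.aiml.utaustin@mygreatlearning.com"

# Manual approach: one replace() call per unwanted character
manual = text
for ch in ["@", "."]:
    manual = manual.replace(ch, " ")

# Regex approach: a single pattern covers both characters at once
with_regex = re.sub(r"[@.]", " ", text)

print(manual)
print(with_regex)
```

Both approaches give the same result here; the regex version simply scales better as the list of unwanted characters grows.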

<h2>Regular Expressions</h2>

- A regular expression, also called regex for short, is a sequence of symbols and characters that defines a pattern to be searched for within a longer piece of text.

- Regular expressions enable complex text processing tasks, such as finding, replacing, or removing patterns, making data cleaning more efficient.

- Example:
    - **Pattern:** `a`
    - **Description:** This regex pattern matches the letter "a" in any text.
    - **Usage:** In the text "apple pie", the pattern `a` will match the occurrence of "a," allowing for identification or manipulation of that letter.  

- Regular expressions are implemented in Python using the `re` library.

- It is imported using the statement `import re`

In [7]:
import re

- Suppose you have a string, and you are interested in adding a character between each of its characters.

In [8]:
string = "Regular expressions"

- This is where the `join()` method comes to the rescue.

  - The `join()` method takes all the items of an iterable (here, the characters of a string) and joins them into one string with a separator.

  - The separator is the string on which `join()` is called.

  - **Syntax :** separator.join(iterable)

In [9]:
'space'.join(string)

'Rspaceespacegspaceuspacelspaceaspacerspace spaceespacexspacepspacerspaceespacesspacesspaceispaceospacenspaces'

- As we can observe, the string 'space' has been added between each character in the `string`

In [10]:
' '.join(string)

'R e g u l a r   e x p r e s s i o n s'

- As we can observe, a whitespace has been added between each character in the `string`.

<h2>1. Pattern: [a-z]</h2>
<p><strong>Use Case:</strong> Extracting lowercase letters from usernames.</p>
<p><strong>Description:</strong> This pattern matches any single lowercase letter from 'a' to 'z'. It can be used to ensure usernames contain only lowercase letters.</p>
<pre>
pattern = r'[a-z]'
# Finding the specified pattern and replacing each lowercase character with a whitespace
new_text = ''.join(re.sub(pattern, ' ', text))
</pre>


- The `r` prefix creates a raw string, preventing Python from interpreting backslashes as escape characters, which is essential when working with regular expressions that often use backslashes.

- [] define a character set, which matches any one character from a specified set of characters inside the brackets.

- For example, [abc] will match either a, b, or c in the input text.

- If you put a hyphen between characters like [a-z], it matches any character in that range, so [a-z] matches any lowercase letter from a to z.

- The `re` library in Python includes many built-in functions, one of which is `re.sub`, used for substituting occurrences of a specified regex pattern in a string.

**```re.sub(pattern, replacement, string)```**

This function is used to replace occurrences of a pattern in a string.

- **`pattern`:** The regular expression that defines the sequence of characters you want to search for in the string.

- **`replacement`:** The string that will replace each occurrence of the pattern found in the original string.

- **`string`:** The input string where the search and replacement will occur
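
The following is a small sketch of how character sets behave in practice. It uses `re.findall` and the optional `count` argument of `re.sub`, neither of which appears elsewhere in this notebook; both are part of Python's standard `re` module. The sample words are arbitrary.

```python
import re

# [abc] matches any ONE of the listed characters
matches = re.findall(r'[abc]', 'cabbage')
print(matches)

# [a-z] matches any ONE lowercase letter; re.sub replaces every match
masked = re.sub(r'[a-z]', '*', 'Python Modules')
print(masked)

# The optional count argument limits how many matches are replaced
first_two = re.sub(r'[a-z]', '*', 'Python Modules', count=2)
print(first_two)
```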

In [11]:
pattern = r'[a-z]'
replacement = ' '
string = 'Python Modules'

output = re.sub(pattern,replacement,string)

output

'P      M      '

- As expected, the characters [a-z] in the string are replaced with a whitespace.

- Let's check a simple example of how the above pattern can be used.

In [12]:
string = "Artificial Intelligence and Machine Learning"
pattern = r'[a-z]'

cleaned_string = ''.join(re.sub(pattern,' ',string))

print(string)
print(cleaned_string)

Artificial Intelligence and Machine Learning
A          I                M       L       


- As we can observe, the lower case characters in the string contained in the variable `string` have been replaced with a whitespace (and not a blank).

- Let's verify the same.

In [13]:
print(len(string))
print(len(cleaned_string))

44
44


- Since the lower case characters are replaced with a whitespace and whitespace is also a character, the length of the `cleaned_string` remains unchanged

- We can also replace the lower case characters in the string with a blank, i.e., remove those characters.

In [14]:
string = "Artificial Intelligence and Machine Learning"
pattern = r'[a-z]'

cleaned_string = ''.join(re.sub(pattern,'',string))

print(string)
print(cleaned_string)

Artificial Intelligence and Machine Learning
A I  M L


In [15]:
print(len(string))
print(len(cleaned_string))

44
8


- Since the lowercase characters are replaced with an empty string (`''`), which has zero length, they are effectively removed, and the length of `cleaned_string` drops from 44 to 8.

<h2>2. Pattern: [A-Z]</h2>
<p><strong>Use Case:</strong> Validating employee IDs that start with uppercase letters.</p>
<p><strong>Description:</strong> This pattern matches any single uppercase letter from 'A' to 'Z'. It can be used to check that employee IDs start with a capital letter.</p>
<pre>
pattern = r'[A-Z]'
# Finding the specified pattern and replacing each uppercase character with a whitespace
new_text = ''.join(re.sub(pattern, ' ', text))
</pre>

In [17]:
string = "ID1230"
pattern = r'[A-Z]'

cleaned_string = ''.join(re.sub(pattern,' ',string))

print(string)
print(cleaned_string)

ID1230
  1230


- As we can observe, the uppercase letters 'I' and 'D' have each been replaced with a whitespace, while the digits are left untouched.
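
For the validation use case mentioned above, a check is usually more natural than a substitution. The sketch below assumes a hypothetical helper named `starts_with_uppercase` and relies on `re.match`, which only tries to match at the beginning of the string; the sample IDs are made up.

```python
import re

def starts_with_uppercase(employee_id):
    # re.match only looks at the beginning of the string,
    # so this is True only when the first character is in [A-Z]
    return re.match(r'[A-Z]', employee_id) is not None

for employee_id in ["ID1230", "id1230", "9X123"]:
    print(employee_id, starts_with_uppercase(employee_id))
```

Only the first ID passes; the other two begin with a lowercase letter and a digit respectively.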

<h2>3. Pattern: [0-9]</h2>
<p><strong>Use Case:</strong> Extracting the template from a string.</p>
<p><strong>Description:</strong> This pattern matches any single digit from '0' to '9'.</p>
<pre>
pattern = r'[0-9]'
# Finding the specified pattern and replacing digit characters with a white space
new_text = ''.join(re.sub(pattern, ' ', text))
</pre>

In [19]:
import re

string = "OrderNumber: 12345, TotalAmount: $67.89"
pattern = r'[0-9]'

cleaned_string = ''.join(re.sub(pattern,' ',string))

print(string)
print(cleaned_string)

OrderNumber: 12345, TotalAmount: $67.89
OrderNumber:      , TotalAmount: $  .  


- There is a particular template in the string:
  - It contains an Order Number and Total Amount with some values.
  - If we replace the values with whitespace, we can retrieve the template.

In [20]:
import re

string = "The temperature today is 75 degrees Fahrenheit."
pattern = r'[0-9]'

cleaned_string = ''.join(re.sub(pattern,' ',string))
other_cleaned_string = re.sub(pattern,' ',string)

print(string)
print(cleaned_string)
print(other_cleaned_string)

The temperature today is 75 degrees Fahrenheit.
The temperature today is    degrees Fahrenheit.
The temperature today is    degrees Fahrenheit.


- Imagine there is a simple website that displays the daily temperature in a specific format.
  - In the above string, only the temperature value changes while the format remains the same.
  - If we replace that value with a whitespace or any other value, we can update the temperature.
  - For now, we will retrieve the template alone.


- As expected, each digit of '75' has been replaced with a whitespace.
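
The idea of updating the temperature can be sketched as follows. The new reading `82` and the `{}` placeholder are made-up values for illustration, and the pattern assumes the temperature always has exactly two digits.

```python
import re

string = "The temperature today is 75 degrees Fahrenheit."

# Two [0-9] sets in a row match a two-digit number as one unit
pattern = r'[0-9][0-9]'

# Replacing the value with a placeholder gives a reusable template
template = re.sub(pattern, '{}', string)

# Replacing the value with a new reading updates the sentence
updated = re.sub(pattern, '82', string)

print(template)
print(updated)
```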

<h2>4. Pattern: [^]</h2>

- When the ^ character is the first character inside square brackets, it acts as a negation operator.
- The set then matches any single character that is not listed after the ^ character.

<strong>Use Case:</strong> Removing all non-numeric characters from a string of digits.

<strong>Description:</strong> The pattern `[^0-9]` matches any character that is not a digit. It can be used to clean up strings that should only contain numbers.

```
pattern = r'[^0-9]'
# Finding the specified pattern and replacing each non-digit character with a whitespace
new_text = ''.join(re.sub(pattern, ' ', text))
```
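
A short sketch of the use case above: removing every non-digit character from a string. The phone number is made up. The last two lines illustrate that ^ only negates when it sits inside the brackets; outside them it anchors the match to the start of the string.

```python
import re

raw_phone = "+1 (512) 555-0143"  # made-up number, for illustration only

# Inside brackets, ^ negates the set: every non-digit is removed
digits_only = re.sub(r'[^0-9]', '', raw_phone)
print(digits_only)

# Outside brackets, ^ means "start of the string" instead
leading_digit = re.findall(r'^[0-9]', '42 apples')
print(leading_digit)

# A string made only of digits contains no match for the negated set
non_digits = re.findall(r'[^0-9]', '42')
print(non_digits)
```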

In [21]:
import re

string = "OrderNumber: 12345, TotalAmount: $67.89"
pattern = r'[^0-9]'

cleaned_string = ''.join(re.sub(pattern,' ',string))

print(string)
print(cleaned_string)

OrderNumber: 12345, TotalAmount: $67.89
             12345                67 89


- In one of the previous examples, we have retrieved the template.
  - To retrieve anything apart from the template, we can use the ^ character.

In [22]:
import re

string = "The temperature today is 75 degrees Fahrenheit."
pattern = r'[^0-9]'

cleaned_string = ''.join(re.sub(pattern,' ',string))

print(string)
print(cleaned_string)

The temperature today is 75 degrees Fahrenheit.
                         75                    


- As in the previous example, negating the set keeps only the variable part of the string.

- The temperature value '75' is retained, and every other character has been replaced with a whitespace.

<h2>5. Pattern: [^A-Za-z]</h2>
<p><strong>Use Case:</strong> Sanitizing input fields to allow only letters.</p>
<p><strong>Description:</strong> This pattern matches any character that is not a letter (either uppercase or lowercase). It can be used to sanitize user input in forms to ensure it only contains alphabetic characters.</p>
<pre>
pattern = r'[^A-Za-z]'
# Finding the specified pattern and replacing each non-letter character with a whitespace
new_text = ''.join(re.sub(pattern, ' ', text))
</pre>

In [23]:
import re

string = "johan123@gmail.com"
pattern = r'[^A-Za-z]'

cleaned_string = ''.join(re.sub(pattern,' ',string))

print(string)
print(cleaned_string)

johan123@gmail.com
johan    gmail com


- As expected, all characters outside of [A-Za-z] have been replaced with whitespace.

In [24]:
import re

string = "OrderNumber: 12345, TotalAmount: $67.89"
pattern = r'[^A-Za-z]'

cleaned_string = ''.join(re.sub(pattern,' ',string))

print(string)
print(cleaned_string)

OrderNumber: 12345, TotalAmount: $67.89
OrderNumber         TotalAmount        


- As expected, all characters outside of [A-Za-z] have been replaced with whitespace.
- Additionally, this is another way to retrieve the template, which was discussed in one of the previous examples.


<h2>6. Pattern: []+</h2>

- The + character is used to indicate that the preceding element must occur one or more times.
- Any pattern mentioned before the + character will be matched as long as it appears at least once.

**Use Case**: Validating hexadecimal color codes.

**Description**: The pattern `[a-fA-F0-9]+` matches one or more characters in the range 'a' to 'f' (lowercase) or 'A' to 'F' (uppercase) or 0-9. It can be used to extract or validate parts of hexadecimal color codes.

```
pattern = r'[a-fA-F0-9]+'
# Finding the specified pattern and replacing each run of hexadecimal characters with a whitespace
new_text = ''.join(re.sub(pattern, ' ', text))
```


In [25]:
import re

string = "#ff0000 #6a5acd"
pattern = r'[a-fA-F0-9]+'

cleaned_string = ''.join(re.sub(pattern,' ',string))

print(string)
print(cleaned_string)

#ff0000 #6a5acd
#  # 


- #ff0000 and #6a5acd represent colors in hexadecimal format.

- The entire string 'ff0000' is matched as a single pattern since + matches one or more characters specified.

- Also, the string '6a5acd' is matched as a single pattern.

- Hence, the string 'ff0000' and '6a5acd' will be replaced with a whitespace.

In [26]:
print(len(cleaned_string))

5


- The character '#' is retained and a whitespace between '#ff0000' and '#6a5acd' is retained since it doesn't match with the given pattern

- Hence, the length of `cleaned_string` is 5.

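
To make the effect of + concrete, the sketch below compares the same character set with and without it. The helper `looks_like_hex_color` is hypothetical and uses `re.fullmatch` from the standard `re` module; note that it does not check the number of hexadecimal characters, so it is a loose check rather than a strict validator.

```python
import re

string = "#ff0000 #6a5acd"

# Without +, every hexadecimal character is a separate match
single = re.findall(r'[a-fA-F0-9]', string)

# With +, each run of hexadecimal characters is one match
runs = re.findall(r'[a-fA-F0-9]+', string)

print(len(single))
print(runs)

def looks_like_hex_color(value):
    # The whole value must be '#' followed by one or more hexadecimal characters
    return re.fullmatch(r'#[a-fA-F0-9]+', value) is not None

print(looks_like_hex_color("#ff0000"))
print(looks_like_hex_color("#ggg"))
```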

- Until now, we have been cleaning individual strings one by one. While this is helpful, it can be time-consuming for a large number of reviews stored in a DataFrame.

- To make our process faster and easier, we want to clean all reviews in the DataFrame at once instead of handling them individually.

- By using the ***apply*** function in Pandas, we can quickly apply our cleaning function, ***remove_special_characters***, to every review in the DataFrame in one go.

- We will define the ***remove_special_characters*** function to remove special characters from the text. Then, we will apply this function to the review column, ensuring all reviews are clean and ready for analysis.

In [43]:
# defining a function to remove special characters
def remove_special_characters(text):
    # Defining the regex pattern to match runs of non-alphanumeric characters
    pattern = r'[^A-Za-z0-9]+'

    # Replacing each run of non-alphanumeric characters with a single whitespace
    new_text = re.sub(pattern, ' ', text)

    return new_text
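
A minimal sketch of how the function might be applied with `apply`. The DataFrame `reviews`, its `review` column and the sample texts are all hypothetical, since no dataset is loaded in this notebook; the function is repeated so that the cell runs on its own.

```python
import re
import pandas as pd

def remove_special_characters(text):
    # Replace each run of non-alphanumeric characters with a single whitespace
    return re.sub(r'[^A-Za-z0-9]+', ' ', text)

# Hypothetical data, for illustration only
reviews = pd.DataFrame({
    'review': ["Great product!!! 10/10", "Not worth the $$$...", "Works as expected :)"]
})

# apply() calls the function once for every value in the column
reviews['cleaned_review'] = reviews['review'].apply(remove_special_characters)

print(reviews['cleaned_review'].tolist())
```

Some cleaned values end with a trailing whitespace, which is one reason the whitespace-removal step later in the notebook is useful.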

### **Lowercasing**

<h2>Why is Lowercasing Important in Text Preprocessing?</h2>

- It ensures that all words are in the same format, which helps maintain consistency across the dataset.
  - This means "Dog" and "dog" are treated as the same word.

- Converting text to lowercase simplifies the data, making it easier to process and analyze.

<h2>How to Implement Lowercasing?</h2>

- The `lower()` method is a built-in string method in Python that converts all uppercase characters in a string to lowercase.
- This is useful for standardizing text data.

<p><strong>Example:</strong></p>

<pre>
input_string = "Hello, World!"

lowercased_string = input_string.lower()

print(lowercased_string)
</pre>

>**Output** : hello, world!


In [27]:
string = "The Quick Brown Fox"

lowercased_string = string.lower()

print(lowercased_string)

the quick brown fox


- As we can see, the characters in the `string` have been converted to lowercase.

- Until now, we have been converting individual strings to lowercase one by one. While this method works, it can be inefficient when dealing with a large number of reviews.

- To streamline our process, we want to convert all reviews in the DataFrame to lowercase at once instead of handling each one separately.

- By using the `str.lower()` method in Pandas, we can efficiently apply the lowercase transformation to every review in the DataFrame in a single operation.

- We will apply the `str.lower()` method to the review column, ensuring that all reviews are standardized and ready for analysis.
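
As a small sketch of the step described above, assuming a hypothetical DataFrame `reviews` with a `review` column (no dataset is loaded in this notebook):

```python
import pandas as pd

# Hypothetical data, for illustration only
reviews = pd.DataFrame({
    'review': ["The Quick Brown Fox", "I LOVE Text Preprocessing"]
})

# .str.lower() lowercases every value in the column in one vectorized call
reviews['review_lower'] = reviews['review'].str.lower()

print(reviews['review_lower'].tolist())
```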

### **Removing extra whitespace**

<h2>Why is Removing Extra Spaces Important in Text Preprocessing?</h2>

- Removing extra spaces ensures uniformity in the text, making it easier to analyze and process.

- Extra spaces can unnecessarily increase the size of the text data. By eliminating them, the overall storage requirements are reduced, leading to more efficient data handling and processing.

<h2>How to Remove Extra Whitespace?</h2>

- The `strip()` method is a built-in string method in Python that removes leading and trailing whitespace from a string.
- This is useful for cleaning up user input and ensuring consistent formatting in text data.

**Example:**
<pre>
input_string = "Hello, World! "

stripped_string = input_string.strip()

print(stripped_string)
</pre>

> **Output :** Hello, World!


In [28]:
string = " johan@gmail.com "

stripped_string = string.strip()

print(stripped_string)

johan@gmail.com


- As expected, the leading and trailing whitespace has been removed.

- Until now, we have been removing leading and trailing whitespace from individual reviews one by one. While this method is effective, it can be time-consuming when working with a large number of reviews.

- To make our process more efficient, we want to remove extra spaces from all reviews in the DataFrame simultaneously instead of handling each one separately.

- By utilizing the ***str.strip()*** method in Pandas, we can easily eliminate leading and trailing whitespace from every review in the DataFrame in a single operation.

- We will apply the ***str.strip()*** method to the review column, ensuring that all reviews are cleaned and formatted consistently for analysis.
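
A minimal sketch of this step on a hypothetical `reviews` DataFrame. Because `str.strip()` only touches the two ends of each string, the sketch also shows one possible way to collapse repeated whitespace inside the text, using `str.replace` with a regex; that second step is an addition for illustration, not something the notebook itself performs.

```python
import pandas as pd

# Hypothetical data, for illustration only
reviews = pd.DataFrame({
    'review': ["  great   product ", "works as  expected\n"]
})

# Remove leading and trailing whitespace from every value
stripped = reviews['review'].str.strip()

# Optionally collapse each run of inner whitespace into a single space
collapsed = stripped.str.replace(r'\s+', ' ', regex=True)

print(stripped.tolist())
print(collapsed.tolist())
```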

### **Removing stopwords**

<h2>Why is Removing Stop Words Important?</h2>

- Stop words are common words (like "and," "the," and "is") that are often excluded from text analysis because they add little meaning to the content.

- Excluding frequently occurring words helps emphasize the more significant terms in the text, improving analysis quality.

- Removing pronouns and articles, commonly categorized as stop words, minimizes irrelevant information, allowing algorithms to better identify patterns.

<h2>How to implement stop word removal?</h2>

- Start by using a pre-defined list of stop words to identify common words that can be excluded from our analysis.

- Downloading a stop words dataset ensures we have an updated collection of these words for accurate filtering.

- As the text data grows, manually removing stop words can become tedious and inefficient.

- Implementing an automated method to filter out stop words saves time and avoids manual mistakes.

- NLTK (Natural Language Toolkit) is a Python library designed for working with human language data (text) and provides tools for various natural language processing tasks.

- NLTK is widely used in academic and research settings, supported by an active community that contributes to its development and documentation, making it a valuable resource for learning and experimentation in NLP.

- It is imported using the statement `import nltk`

In [30]:
import nltk

- NLTK has a module called stopwords that provides a list of common stop words based on the English language.

- This list includes frequently used words that typically do not add significant meaning to text analysis.

- We can download this stop words list using the statement `nltk.download('stopwords')`

In [31]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/vishalkhapre/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

- Once we have downloaded the list, we can use the `corpus` module from `nltk` to load the stopwords

In [32]:
from nltk.corpus import stopwords

- To access the stopwords in the English language, we can call the `words()` method of `stopwords` and pass 'english' as the argument.

- We will list the first 10 stopwords.

In [33]:
stopwords.words('english')[:10]

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an']

Using the NLTK stop words list, we can filter reviews by keeping only the words not in this list, allowing us to focus on meaningful content and enhance the quality of our analysis.

- Suppose you have a list of words, and you are interested in converting it into a sentence as a string.

In [34]:
list_of_words = ["I","love","Text","Preprocessing"]

- This is where the `join()` method comes to the rescue.

  - The `join()` method takes all items in a list and joins them into one string with a separator.

  - The separator is the string on which `join()` is called.

  - **Syntax :** separator.join(list)

In [35]:
'space'.join(list_of_words)

'IspacelovespaceTextspacePreprocessing'

- As we can observe, the string 'space' has been added between each element in the list, thus forming a sentence.

In [36]:
' '.join(list_of_words)

'I love Text Preprocessing'

- As we can observe, a whitespace has been added between each element in the list, thus forming a sentence.


<h2>1. Splitting the Text</h2>
<p>
    The first step involves splitting the input <code>text</code> into individual words using the <code>split()</code> method. When called without arguments, this method separates the text at every run of whitespace and creates a list of words.
</p>
<pre><code>words = text.split()</code></pre>

<h2>2. Removing Stop Words</h2>
<p>
    Next, we remove the English stop words from the list of words. We use a list comprehension to iterate through each word in the <code>words</code> list and check if it is not present in the stop words list obtained from <code>stopwords.words('english')</code>.
</p>
<pre><code>[word for word in words if word not in stopwords.words('english')]</code></pre>

<h2>3. Creating the New Text</h2>
<p>
    Finally, the remaining words (those not in the stop words list) are joined back together into a single string using the <code>join()</code> method. This creates a new text string that contains only the meaningful words.
</p>
<pre><code>new_text = ' '.join([word for word in words if word not in stopwords.words('english')])</code></pre>

<h2>4. Result</h2>
<p>
    The resulting <code>new_text</code> variable now holds the filtered text, free of common English stop words, which allows for a more relevant analysis of the content.
</p>


In [37]:
string = "The quick brown fox jumps over the lazy dog."

words = string.split(' ')

new_text = ' '.join([word for word in words if word not in stopwords.words('english')])

print(string)

print(new_text)

The quick brown fox jumps over the lazy dog.
The quick brown fox jumps lazy dog.


- As we can observe, the stopwords 'over' and 'the' have been removed, as they are not present in `new_text`.
- The leading 'The' is retained because the NLTK list is lowercase and the comparison is case-sensitive.

In [38]:
string = "In order to succeed, you must first believe that you can."

words = string.split(' ')

new_text = ' '.join([word for word in words if word not in stopwords.words('english')])

print(string)

print(new_text)

In order to succeed, you must first believe that you can.
In order succeed, must first believe can.


- As we can observe, the stopwords 'to', 'you' and 'that' have been removed, as they are not present in `new_text`.
- 'In' and 'can.' are retained because of the capital letter and the attached full stop, even though 'in' and 'can' are in the stopwords list.
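
The two examples above suggest that matching works better on a normalized copy of each word. The sketch below is one possible way to do that; the function name `remove_stopwords_normalized` and the tiny hand-picked `stop_words` set are assumptions made so the cell runs without any download. In the notebook, the set would come from `stopwords.words('english')`.

```python
import string as string_module

# Tiny hand-picked stop word set, for illustration only
stop_words = {"the", "over", "in", "to", "you", "that", "can"}

def remove_stopwords_normalized(text):
    kept = []
    for word in text.split():
        # Compare a lowercased copy with surrounding punctuation stripped
        key = word.lower().strip(string_module.punctuation)
        if key not in stop_words:
            kept.append(word)
    return ' '.join(kept)

first = remove_stopwords_normalized("The quick brown fox jumps over the lazy dog.")
second = remove_stopwords_normalized("In order to succeed, you must first believe that you can.")

print(first)
print(second)
```

In practice the same effect is usually obtained by lowercasing the text and removing special characters before the stop word step, which is the order this notebook follows.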

- Until now, we have been removing stop words from individual reviews one by one. While this method is effective, it can be inefficient when processing a larger number of reviews.

- To streamline our process, we want to remove stop words from all reviews in the DataFrame at once instead of handling each one separately.

- By defining a function called ***remove_stopwords*** using the NLTK library, we can efficiently apply the stop word removal to every review in the DataFrame in a single operation.

  - This function splits the text into separate words, removes English language stop words, and then joins the remaining words back into a single string, ensuring that all reviews are cleaned and ready for analysis.

- To implement this, we will apply the ***remove_stopwords*** function to the review column in our DataFrame using the Pandas ***apply()*** method, allowing us to filter out common stop words from each review and enhance the quality of our text analysis.

In [39]:
# defining a function to remove stop words using the NLTK library
def remove_stopwords(text):
    # Loading the English stop words once, as a set for fast membership checks
    stop_words = set(stopwords.words('english'))

    # Split text into separate words
    words = text.split()

    # Keeping only the words that are not English language stopwords
    new_text = ' '.join([word for word in words if word not in stop_words])

    return new_text

### **Stemming**

<h2>Why is Stemming Important?</h2>

- Stemming reduces different forms of a word (e.g., "running," "runs") to a single root (e.g., "run"), making data more uniform. Irregular forms such as "ran" are generally not handled by rule-based stemmers.

- Fewer unique words mean simpler data, which helps in processing and analyzing text more efficiently.

- It helps in identifying key topics and sentiments by focusing on the core meaning of words rather than their variations.

<h2>How to implement?</h2>

The Porter Stemmer is one of the most widely used algorithms for stemming, and it shortens words to their root form by removing suffixes.
 - This is particularly useful in NLP tasks where you want to analyze the underlying meaning of text without being misled by different grammatical forms of the same word.

The NLTK library supports the Porter Stemmer algorithm.

Next, we download the `wordnet` database. The Porter Stemmer is rule-based and does not need it, but other NLTK tools, such as the WordNet lemmatizer, do.
 - WordNet is a large lexical database of English words that groups words into sets of synonyms called synsets, providing definitions and semantic relationships between them.
 - By using nltk.download('wordnet'), we can download the WordNet dataset directly into our NLTK environment, enabling easy access to its rich linguistic resources.

In [40]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/vishalkhapre/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

Next, we can import the `PorterStemmer` class from the `stem.porter` module available in the `nltk` library.

In [41]:
from nltk.stem.porter import PorterStemmer

Creating an instance of the Porter Stemmer class, which is used to stem words by reducing them to their root form.

In [42]:
ps = PorterStemmer()

Specifying the input word that needs to be stemmed.

In [43]:
word = 'Running'

Applying the stemming process to the input word (in this case, 'Running') and assigning the resulting stemmed form to the variable stemmed_word.

In [44]:
stemmed_word = ps.stem(word)

Displaying the input word and its stemmed form.

In [45]:
print(word)
print(stemmed_word)

Running
run


In [46]:
word = 'Analyzing'

stemmed_word = ps.stem(word)

print(stemmed_word)

analyz


- Here, we can note that the stemmed word is 'analyz', which is not a dictionary word. The expected root word is 'analyze'.

- The reason is the Porter Stemmer uses a set of rules to strip suffixes from words.
  - Its algorithm does not look words up in a dictionary, so it may not always produce linguistically correct stems.

- The specific rules for removing suffixes may lead to unusual results.
  - For example, removing "-ing" from "analyzing" leaves "analyz", and no rule restores the final "e".

- Until now, we have been applying stemming to individual words one by one. While this method is effective, it can be inefficient when processing larger texts or multiple reviews.

- To streamline our process, we want to apply stemming to all words in a given text at once instead of handling each word separately.

- By defining a function called ***apply_porter_stemmer*** using the NLTK library, we can efficiently apply the Porter stemming algorithm to every word in the text in a single operation.

  - This function splits the input text into separate words, applies the Porter Stemmer to each word, and then joins the stemmed words back into a single string, ensuring that the text is uniformly stemmed and ready for further analysis.
- To implement this, we will call the ***apply_porter_stemmer*** function for our text data, allowing us to reduce words to their root forms and enhance the quality of our text analysis.

In [47]:
# Loading the Porter Stemmer
ps = PorterStemmer()

# defining a function to perform stemming
def apply_porter_stemmer(text):
    # Split text into separate words
    words = text.split()

    # Applying the Porter Stemmer on every word of a message and joining the stemmed words back into a single string
    new_text = ' '.join([ps.stem(word) for word in words])

    return new_text
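
A short usage sketch for the function, repeated here so the cell runs on its own. The sample sentence is arbitrary; it sticks to regular verb forms, where the Porter Stemmer behaves predictably.

```python
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

def apply_porter_stemmer(text):
    # Split text into words, stem each one, and join them back together
    words = text.split()
    return ' '.join([ps.stem(word) for word in words])

stemmed_text = apply_porter_stemmer("Running runs jumped jumping")
print(stemmed_text)
```

Note that the stemmer also lowercases its input by default, which is why 'Running' becomes 'run'.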

# **Text Vectorization**

<h2>Why is Text Vectorization Important?</h2>

- It transforms words and sentences into numerical formats that mathematical algorithms can understand, making it essential for processing text data.

- By representing text as vectors, it helps capture important patterns and relationships in the data, leading to better insights and analysis in text-related tasks.

<p><h3>Definition:</h3> A document-term matrix is a mathematical representation of a collection of documents, where each document is represented as a row, and each unique word (or term) across all documents is represented as a column. The cells in the matrix contain values that indicate the frequency or presence of the corresponding word in the corresponding document.</p>

<p><h3>Usage in Text Preprocessing:</h3> The document-term matrix is essential in text preprocessing as it converts unstructured text data into a structured numerical format. This transformation allows for efficient analysis and processing of textual information. By representing documents as matrices, it facilitates the extraction of important features and patterns, making it easier to prepare text data for further analysis and understanding.</p>

<p><h3>Example:</h3></p>
<p>Consider three short documents:</p>
<ul>
<li>Document 1: "I love programming"</li>
<li>Document 2: "Programming is fun"</li>
<li>Document 3: "I love fun programming"</li>
</ul>

<p><h3>Steps to Create the Document-Term Matrix:</h3></p>

<li>Convert all text to lowercase:
<ul>
<li>Document 1: "i love programming"</li>
<li>Document 2: "programming is fun"</li>
<li>Document 3: "i love fun programming"</li>
</ul>
</li>

<li>Identify all unique words across the documents and sort them (the single-character word "i" is dropped, as <code>CountVectorizer</code> does by default):
<ul>
<li>fun , is , love , programming</li>
</ul>
</li>
    
<li>Count the occurrences of each word in each document:
<ul>
<li>Document 1:
<ul>
<li>"fun": 0</li>
<li>"is": 0</li>
<li>"love": 1</li>
<li>"programming": 1</li>
</ul>
</li>
<li>Document 2:
<ul>
<li>"fun": 1</li>
<li>"is": 1</li>
<li>"love": 0</li>
<li>"programming": 1</li>
</ul>
</li>
<li>Document 3:
<ul>
<li>"fun": 1</li>
<li>"is": 0</li>
<li>"love": 1</li>
<li>"programming": 1</li>
</ul>
</li>
</ul>
</li>

<li>Form a matrix using the counts, where each row represents a document and each column represents a unique word:</li><br>

<table border="1" cellspacing="0" cellpadding="12" align="center" rules="all" frame="box" width="80%" style="font-size: 20px; text-align: center;">
<thead>
<tr bgcolor="#ADD8E6">
<th style="color: white;">Document</th>
<th style="color: white;">fun</th>
<th style="color: white;">is</th>
<th style="color: white;">love</th>
<th style="color: white;">programming</th>
</tr>
</thead>
<tbody>
<tr bgcolor="#f2f2f2">
<td>Document 1</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>1</td>

</tr>
<tr>
<td>Document 2</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
</tr>
<tr bgcolor="#f2f2f2">
<td>Document 3</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>1</td>
</tr>
</tbody>
</table>
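
The steps above can be sketched in plain Python before turning to a library. This is only an illustration of the idea: the helper `tokenize` is hypothetical, and its pattern keeps tokens of two or more word characters so that "i" is dropped, mirroring the table above.

```python
import re
from collections import Counter

documents = ["I love programming", "Programming is fun", "I love fun programming"]

def tokenize(text):
    # Lowercase the text and keep tokens of two or more word characters
    return re.findall(r'\b\w\w+\b', text.lower())

tokenized = [tokenize(doc) for doc in documents]

# Sorted list of the unique words across all documents
vocabulary = sorted({token for doc in tokenized for token in doc})

# One row per document, one column per vocabulary word
matrix = []
for doc in tokenized:
    counts = Counter(doc)
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```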

Creating the documents

In [48]:
Document_1 = "I love programming"
Document_2 = "programming is fun"
Document_3 = "i love fun programming"

Appending all the documents to a list

In [49]:
list_of_docs = [Document_1,Document_2,Document_3]

- To build this document-term matrix, also known as a Bag of Words (BoW) representation, `sklearn` offers a class called `CountVectorizer`.

- It is available in the `feature_extraction.text` module

- It is imported using the statement `from sklearn.feature_extraction.text import CountVectorizer`

In [50]:
from sklearn.feature_extraction.text import CountVectorizer

**`CountVectorizer`**:

- This is a tool from `sklearn.feature_extraction.text` used to convert text documents into a matrix of token counts.
- It creates a Bag of Words (BoW) representation of the text, where:
 - Each row represents a document.
 - Each column corresponds to a unique word (token) in the entire corpus.
 - The values in the matrix are the counts of each word's occurrence in the respective document

**`fit_transform()`**:

- The `fit_transform()` method performs both `fit()` and `transform()` in a single operation.
- The `fit()` method learns the vocabulary from the provided documents (i.e., identifies the unique words).
- The `transform()` method converts the documents into a sparse matrix, where:
    - The matrix rows correspond to documents.
    - The columns represent the words from the learned vocabulary.
    - The values in the matrix are the word frequencies.

In [51]:
bow_vec = CountVectorizer()

doc_matrix = bow_vec.fit_transform(list_of_docs)

Compressed Sparse Row format

- It’s a way to store large matrices efficiently when most of the elements are zero.
- Instead of storing every element, including the zeros, CSR compresses the data by only storing non-zero values and their positions.
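
To illustrate the idea, the sketch below builds the three arrays of a CSR representation by hand for the small matrix from the earlier example. The names `data`, `indices` and `indptr` follow the attribute names used by SciPy's CSR matrices; the loop itself is a plain-Python illustration, not how the library is implemented.

```python
dense = [
    [0, 0, 1, 1],
    [1, 1, 0, 1],
    [1, 0, 1, 1],
]

data = []      # the non-zero values, row by row
indices = []   # the column index of each non-zero value
indptr = [0]   # where each row starts and ends inside data

for row in dense:
    for col, value in enumerate(row):
        if value != 0:
            data.append(value)
            indices.append(col)
    # After finishing a row, record how many values have been stored so far
    indptr.append(len(data))

print(data)
print(indices)
print(indptr)
```

Row `i` of the matrix is recovered from `data[indptr[i]:indptr[i + 1]]` and the matching slice of `indices`; only 8 values are stored instead of all 12 cells.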

In [52]:
doc_matrix

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 8 stored elements and shape (3, 4)>

In [53]:
# Converting CSR to a numpy array
doc_matrix.toarray()

array([[0, 0, 1, 1],
       [1, 1, 0, 1],
       [1, 0, 1, 1]])

- As expected, the BoW representation matches with the previous example.

In [54]:
# returns the list of unique words (tokens) extracted from the documents.
bow_vec.get_feature_names_out()

array(['fun', 'is', 'love', 'programming'], dtype=object)

In [55]:
# Converting to a pandas dataframe
pd.DataFrame(doc_matrix.toarray(),columns=bow_vec.get_feature_names_out())

   fun  is  love  programming
0    0   0     1            1
1    1   1     0            1
2    1   0     1            1


- The *`max_features`* parameter in CountVectorizer limits the vocabulary to the given number of words (or tokens).
- Only the words with the highest frequency across the corpus are kept, which helps when dealing with large datasets.

In [None]:
# Initializing CountVectorizer with the top 2000 words
bow_vec = CountVectorizer(max_features = 2000)


# **Conclusion**

- We used different text processing techniques to clean the raw text data.

- We then vectorized the cleaned text data using Bag of Words.

<font size=6 color='blue'>Power Ahead</font>
___