## Feature Selection – Language data
### Introduction

When working with text classification, selecting the most relevant words or phrases (features) is a crucial first step. Instead of treating all words equally, we can apply various methods to assess feature importance, ensuring that our model focuses on the most meaningful terms.

Raw text data contains a large number of words, many of which may be irrelevant, redundant, or even misleading for classification.

If we do not apply feature selection to our text data, models may become slower and less efficient, as they have to process too many features. In addition, frequent but uninformative words (e.g., "the", "is", "and") can dilute the signal needed for classification. This means that certain rare but highly indicative words (e.g., "refund" in a complaint review) might get lost in a sea of common terms.

This can also lead to overfitting if the model learns from noise rather than meaningful patterns. By filtering and selecting only the most informative features, we can improve model performance while maintaining interpretability.

We will cover the most common feature selection techniques employed for language data, starting from the simplest approaches, before building up to more sophisticated methods.

### Installing Python libraries

In [None]:
!pip install --upgrade pip

!pip install nltk pandas matplotlib wordcloud seaborn scikit-learn transformers torch gensim spacy

### Downloading the data
We download the data and unzip it to a folder ready for use:

In [None]:
import urllib.request
import tarfile
import os

# Dataset URL. Uncomment the larger IMDb version if your hardware allows
# url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz" # Size 80.2MB

# Use an earlier, smaller dataset for demonstration
url = "http://www.cs.cornell.edu/people/pabo/movie-review-data/mix20_rand700_tokens_0211.tar.gz" # Size 2.2MB

# Download the dataset to the current directory
archive_path = "movie_reviews.tar.gz"
urllib.request.urlretrieve(url, archive_path)

# Unpack (extract) the dataset
with tarfile.open(archive_path, "r:gz") as tar:
    tar.extractall()


### Loading the data
We will use the [Movie Review Dataset (Cornell)](http://www.cs.cornell.edu/people/pabo/movie-review-data/mix20_rand700_tokens_0211.tar.gz)

We will assume our dataset is structured as follows, with our reviews split into two folders representing positive and negative sentiment:  

```
tokens/
  ├── neg/
  │    ├── cv000_tok-29416.txt
  │    ├── cv001_tok-19502.txt
  │    ...
  └── pos/
       ├── cv000_tok-29590.txt
       ├── cv001_tok-18431.txt
       ...
```

Here, the dataset consists of two labelled categories:  
- *`pos/` (positive)* – Documents expressing positive sentiment.  
- *`neg/` (negative)* – Documents expressing negative sentiment.  

Before we start, we will prepare the features for a text classification task, specifically with two categories: *'pos'* (positive) and *'neg'* (negative). 

We start by creating two empty lists, *X* and *Y*, which will hold our text data and corresponding labels, respectively. We then loop through each category, assigning each category a numeric label (0 for positive, 1 for negative). 

Within each category, we open every text file in the category's folder, read its contents, and append this content directly to our list `X`. At the same time, we record the category's numeric label in the list `Y`. 
After the loop completes, we have a dataset of texts in `X` along with their respective labels in `Y`, making the data ready for training a machine learning model, such as one for sentiment analysis or document classification:

In [None]:
# Folder created when the archive was extracted (see the structure above)
root_dir = "tokens"

# Define the two categories (folders): 'pos' (positive), 'neg' (negative)
categories = ['pos', 'neg']

# Initialise empty lists to store data (X) and labels (Y)
X, Y = [], []

# Loop through each category and its index
for idx, cat in enumerate(categories):

    # Loop through all files within the current category's folder
    for filename in os.listdir(os.path.join(root_dir, cat)):
        
        # Open and read the content of each text file
        with open(os.path.join(root_dir, cat, filename), 'r') as f:

            # Add the text content to list X
            X.append(f.read())

            # Add the numeric category label (0 or 1) to list Y
            Y.append(idx)


We will take a sample of the data to speed things up:

In [None]:
import numpy as np
import random

n = 500  # Number of samples to take

# Set the random state, so that we can reproduce the same sample each time we run the code
seed = 7
random.seed(seed)

# Take n samples
sample_indices = random.sample(range(len(X)), n)

# Take a sample of the full dataset and overwrite the lists
X = [X[i] for i in sample_indices]
Y = [Y[i] for i in sample_indices]

# Convert Y to integer type
Y = np.array(Y, dtype=int)

# Inspect 5 of the samples
top_n = 5
for i in range(top_n):
    print(f"Review {i+1}: {X[i][:100]}...")  # Print first 100 chars for brevity
    print(f"Sentiment: {Y[i]}\n")

## Domain-specific and manual selection
These methods rely on linguistic knowledge and human judgement rather than statistical scores. They include many techniques we have covered before:
- *Stopword removal*: eliminating common words.
- *N-gram filtering*: selecting meaningful unigrams, bigrams, or trigrams.  
- *POS-based selection*: focusing on key content words like nouns and verbs.
- *Named Entity Recognition (NER)*: extracting entities like names, locations, and organisations to enhance interpretability.

Let's recap on how we can use them for feature selection.

#### Stop word removal
Many words appear frequently in all texts but carry little meaning on their own. For example, words like "the", "is", "and", and "it" occur in almost every sentence but do not help distinguish categories.

Stopwords can dilute important patterns in text data and increase computational complexity. We can reduce noise and focus on more meaningful words that contribute to classification by removing them.

Before stopword removal:
```
"The movie was really amazing, and it had a fantastic plot!"
```
After stopword removal:
```
"Movie amazing fantastic plot!"
```
For our dataset, this helps models focus on important words like "amazing", "fantastic", and "plot", which indicate sentiment.

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download stopwords if not already available
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')

# Build the stopword set once, rather than reloading the list for every word
stop_words = set(stopwords.words('english'))

# Function to remove stopwords from reviews
def remove_stopwords(text):
    words = word_tokenize(text.lower())  # Convert to lowercase and tokenize
    filtered_words = [word for word in words if word not in stop_words]
    return " ".join(filtered_words)  # Reconstruct sentence

# Apply stopword removal to all reviews in X
X_clean = [remove_stopwords(review) for review in X]

# Display cleaned reviews
top_n = 10  # Show the first ten
for original, cleaned in zip(X[:top_n], X_clean[:top_n]):
    print(f"Original: {original}\nCleaned: {cleaned}\n")


#### Part-of-Speech (POS)-based selection
Part-of-Speech (POS) tagging allows us to filter words based on grammatical categories. Not all words contribute equally to classification, so we can focus on those with more semantic meaning:

- Nouns (e.g., "election", "policy", "economy") are important for tasks such as topic classification.
- Verbs (e.g., "win", "increase", "fail") are useful for sentiment analysis or event detection.
- Adjectives & Adverbs (e.g., "terrible", "wonderfully") are often strong indicators in sentiment analysis.

As an example, consider the following review:

     "The [food] was absolutely [terrible], and the [service] was [slow]."

POS-based selection keeps words like "food", "terrible", "service", and "slow", as they are strong indicators of sentiment. Any model we train will learn more efficiently if we filter out the less meaningful words (e.g., determiners, conjunctions).

Part-of-speech taggers may use more than one tag for the same part of speech. For nouns we have the following breakdown:

- NN: Singular noun (e.g., dog, table, car)
- NNS: Plural noun (e.g., dogs, tables, cars)
- NNP: Proper noun, singular (e.g., John, London, Apple)
- NNPS: Proper noun, plural (e.g., Americans, Beatles)

Therefore, to capture all instances of a particular part of speech, we can filter on the first two characters of the POS tag:


In [None]:
import nltk

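# Download necessary NLTK models
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')  # Resource name used by newer NLTK releases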

# Example review - pick the first from our training data
example_review = X[0]

# Tokenize the words
words = nltk.word_tokenize(example_review)

# Perform POS tagging
pos_tags = nltk.pos_tag(words)

# Extract NOUNS (NN*), VERBS (VB*), and ADJECTIVES (JJ*)
nouns = [word for word, pos in pos_tags if pos.startswith('NN')]
verbs = [word for word, pos in pos_tags if pos.startswith('VB')]
adjectives = [word for word, pos in pos_tags if pos.startswith('JJ')]

# Display results
print("Original Sentence:", example_review)
print("\nNOUNS:", nouns)
print("VERBS:", verbs)
print("ADJECTIVES:", adjectives)
print("---")
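
The filtering step can also be wrapped in a small reusable function. The sketch below is a minimal illustration rather than part of the pipeline above: `filter_by_pos` is a hypothetical helper name, and the review is tagged by hand with Penn Treebank tags so that the example runs without downloading any models.

```python
# Hand-tagged (word, tag) pairs standing in for the output of a POS tagger.
# The tagging itself is an assumption made for illustration.
tagged_review = [
    ("The", "DT"), ("food", "NN"), ("was", "VBD"), ("absolutely", "RB"),
    ("terrible", "JJ"), (",", ","), ("and", "CC"), ("the", "DT"),
    ("service", "NN"), ("was", "VBD"), ("slow", "JJ"), (".", "."),
]

def filter_by_pos(tagged_tokens, prefixes=("NN", "JJ")):
    """Keep only the words whose tag starts with one of the given prefixes."""
    return [word for word, tag in tagged_tokens if tag.startswith(tuple(prefixes))]

selected = filter_by_pos(tagged_review)
print(selected)  # ['food', 'terrible', 'service', 'slow']
```

Passing `prefixes=("VB",)` instead would keep only the verbs, so the same helper covers each of the tag families listed above.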


#### N-gram filtering

Words do not always hold meaning in isolation. Unigrams (single words) can be useful but often lack context, whereas bigrams (two-word phrases) and trigrams (three-word phrases) can capture more meaningful relationships:

- Unigram: "bank", "account", "fraud"
- Bigram: "bank fraud", "fraud detection"
- Trigram: "bank account fraud", "credit card scam"

In spam detection, for instance, bigrams like "free money" or "limited offer" are stronger indicators of spam than just "free" or "money" alone. However, using too many n-grams can introduce redundancy, so n-gram filtering helps retain only the most useful ones.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

ngram_size = 3

# Extract unigrams (1) bigrams (2), and trigrams (3)
vectoriser = CountVectorizer(ngram_range=(1, ngram_size))

X_count = vectoriser.fit_transform(X)

print("N-grams:", vectoriser.get_feature_names_out())
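
The cell above extracts every n-gram but does not yet filter them. One simple filtering rule, sketched below in plain Python, is to keep only the n-grams that occur in a minimum number of documents. The function names, the toy documents, and the threshold are all assumptions chosen for illustration.

```python
from collections import Counter

def extract_ngrams(tokens, n):
    """Return all contiguous n-grams in a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def frequent_ngrams(documents, n=2, min_doc_count=2):
    """Keep the n-grams that occur in at least `min_doc_count` documents."""
    doc_counts = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        # Use a set so each document counts an n-gram at most once
        doc_counts.update(set(extract_ngrams(tokens, n)))
    return {gram: count for gram, count in doc_counts.items() if count >= min_doc_count}

toy_docs = [
    "claim your free money today",
    "free money for a limited offer",
    "limited offer ends today",
]
kept = frequent_ngrams(toy_docs, n=2, min_doc_count=2)
print(sorted(kept))  # ['free money', 'limited offer']
```

`CountVectorizer` offers comparable controls through its `min_df` and `max_features` parameters, which is usually the more convenient route on a real dataset.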

#### Named Entity Recognition (NER)
Named Entity Recognition (NER) identifies real-world entities in text, such as names, locations, and organisations. Some words carry specific meaning beyond their literal definition, and extracting them as entities can improve both the interpretability and the relevance of our features.

For example:

- People (e.g., "Elon Musk", "Shakespeare")
- Locations (e.g., "London", "New York")
- Organisations (e.g., "Google", "United Nations")
- Dates & Events (e.g., "Brexit", "World War II")

In news classification, recognising entities helps distinguish between topics.
Articles mentioning "NASA" and "Mars Rover" likely belong to the "science" category. Articles mentioning "vote" and "Democrats" are probably about "politics".


NER enhances interpretability, making models not just more accurate but also more understandable. Just like POS-tagging, we can label each word, or phrase, representing these entities:

In [None]:
import spacy
!python -m spacy download en_core_web_sm

# Load spaCy's small English model
nlp = spacy.load("en_core_web_sm")

# Function to extract named entities
def extract_named_entities(text):
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]  # Extract entity text and label

# Apply NER to a sample of our reviews
top_n = 10
X_named_entities = [extract_named_entities(review) for review in X[:top_n]]

# Display results
for i, (review, entities) in enumerate(zip(X, X_named_entities)):
    print(f"Review {i+1}: {review}")
    print(f"Named Entities: {entities}\n")


## Statistical methods
In text classification tasks, we often want to understand how different words contribute to distinguishing between categories. One way to do this is by measuring the correlation between words and labels.  

For example, consider a dataset of customer reviews, where each review is labelled as "positive" or "negative". Some words, like "excellent" and "amazing", are likely to appear more frequently in positive reviews, while words like "terrible" and "disappointing" may be more common in negative reviews.  

If we analyse how often certain words appear in each category compared to others, we can identify words that strongly differentiate one category from another.  We will cover different ways we can measure the relationship between words and labels to figure out which words act as clues for different categories.

#### Term Frequency-Inverse Document Frequency (TF-IDF):
TF-IDF measures how important a word is in a document relative to its frequency across all documents. It assigns higher scores to words that appear frequently in a document but not across the entire dataset, reducing the impact of common words.

This approach helps filter out very frequent but uninformative words (e.g., "the", "is") while keeping meaningful words. It also highlights domain-specific words that appear in fewer documents but are strongly indicative of the topic (e.g., "refund" in complaint reviews) and should be retained.

This method does not capture the relationship between words, nor the context or meaning of words. In addition, rare words may get overemphasised, even if they are not important for classification.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectoriser = TfidfVectorizer()

X_tfidf = vectoriser.fit_transform(X)

print(X_tfidf)

The output `(0, 3080)	0.15187519523325435` represents (row, column, TF-IDF score), which is not human-readable. The statement `TfidfVectorizer.fit_transform(X)` returns a sparse matrix (`scipy.sparse.csr_matrix`). This sparse matrix stores only non-zero values to save memory, since most entries in TF-IDF matrices are zero (i.e., words that do not appear in a document get a 0 score).

If the dataset is small, converting to a dense matrix makes it easier to work with and visualise. We use `.toarray()` to convert the sparse matrix into a NumPy array so we can view all values at once. If your dataset is large, however, keeping it in sparse format is more memory-efficient.

In [None]:
import pandas as pd

# Convert sparse matrix to dense
X_tfidf_dense = X_tfidf.toarray()

# Get feature names
feature_names = vectoriser.get_feature_names_out()

# Create a DataFrame for better visualisation
df_tfidf = pd.DataFrame(X_tfidf_dense, columns=feature_names)

# Select some words and display the results
df_tfidf[['movie', 'film', 'great', 'bad']].head(10)
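
To see where these scores come from, the sketch below computes TF-IDF by hand for three toy sentences (invented for illustration). It assumes the smoothed formula that scikit-learn documents as its default, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalisation of each document vector. Tokenisation here is a plain whitespace split, which is simpler than what `TfidfVectorizer` does.

```python
import math
from collections import Counter

toy_docs = [
    "the film was great",
    "the film was bad",
    "the refund was slow",
]
tokenised = [doc.split() for doc in toy_docs]
n_docs = len(tokenised)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for tokens in tokenised for word in set(tokens))

def tfidf(tokens):
    """Return L2-normalised TF-IDF weights for one tokenised document."""
    counts = Counter(tokens)
    weights = {
        word: count * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
        for word, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {word: w / norm for word, w in weights.items()}

scores = tfidf(tokenised[2])
for word, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{word}: {score:.3f}")
```

In the third document, "refund" and "slow" appear nowhere else and so receive higher weights than "the" and "was", which occur in every document.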

#### Chi-Square test
The chi-square test measures how strongly a word’s occurrence is associated with a particular category compared to what would be expected by chance. It’s a statistical test often used for feature selection in text classification.

We can use it to help select words that have a strong association with specific labels, improving classification accuracy. It works well for categorical data where words need to be linked to discrete categories. And it also provides an interpretable way to rank features based on their relevance to classification.

However, it assumes independence between words, which does not always hold true in natural language, since words are related to one another through the syntax (or grammar) of the language. It can also be less effective for rare words, as low-frequency terms may not show a significant statistical association.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_selection import chi2
from sklearn.feature_extraction.text import TfidfVectorizer

# Vectorise text
vectoriser = TfidfVectorizer()
X_tfidf = vectoriser.fit_transform(X)  # Fit the vectoriser and transform the reviews

# Compute Chi-Square scores
chi2_scores, _ = chi2(X_tfidf, Y)

chi2_scores = np.array(chi2_scores).flatten()  # Ensure correct shape

feature_names = vectoriser.get_feature_names_out()

# Select top 10 features
top_n = 10
top_features = np.argsort(chi2_scores)[-top_n:]


In [None]:
# Print top features
print("Top features by Chi-Square score:")
for i in reversed(top_features):
    print(f"{feature_names[i]}: {chi2_scores[i]:.4f}")

In [None]:
# Plot top features
plt.figure(figsize=(10, 4))

plt.barh([feature_names[i] for i in top_features], [chi2_scores[i] for i in top_features], color='blue')

plt.xlabel("Chi-Square Score")
plt.ylabel("Feature")

plt.title("Top features by Chi-Square Test")
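
The idea of comparing observed counts with the counts expected by chance can be made concrete with a 2x2 table of word presence against sentiment. The sketch below uses the classic contingency-table statistic with hypothetical counts, and assumes no row or column total is zero. As far as we are aware, scikit-learn's `chi2` works from per-class feature totals rather than this full table, so its scores will generally differ from the ones computed here.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for a 2x2 table.

    a: positive docs containing the word    b: negative docs containing it
    c: positive docs without the word       d: negative docs without it
    """
    n = a + b + c + d
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Count we would expect in this cell if word and label were independent
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat

# Hypothetical counts for 100 positive and 100 negative reviews
concentrated = chi_square_2x2(40, 5, 60, 95)   # word mostly in positive reviews
even_spread = chi_square_2x2(50, 50, 50, 50)   # word spread evenly across classes

print(f"Concentrated word: {concentrated:.2f}")
print(f"Evenly spread word: {even_spread:.2f}")
```

The evenly spread word scores zero because every observed count matches its expected count, while the concentrated word receives a large score.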

#### Mutual Information
Mutual Information (MI) measures how much knowing the presence (or absence) of a word helps predict a category. It calculates how much uncertainty is reduced when we observe a particular word in a document.

This approach allows us to identify words that are most informative for classification. It works well with imbalanced datasets, as it doesn’t rely on absolute word counts. In addition, it can handle both presence/absence and frequency-based word representations.

However, like chi-square, it can overemphasise rare words, which might not always be meaningful. And it doesn’t consider relationships between words:

In [None]:
import matplotlib.pyplot as plt
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# Vectorise text data
vectoriser = TfidfVectorizer()
X_tfidf = vectoriser.fit_transform(X)

# Convert sparse matrix to dense array
X_dense = X_tfidf.toarray()

# Compute Mutual Information scores (fixing the random state for reproducibility)
mi_scores = mutual_info_classif(X_dense, Y, discrete_features=False, random_state=seed)

# Get feature names
feature_names = vectoriser.get_feature_names_out()

# Sort and get top 10 features by MI score
top_mi_features = sorted(zip(feature_names, mi_scores), key=lambda x: x[1], reverse=True)[:10]

# Extract words and scores
top_words, top_scores = zip(*top_mi_features)


In [None]:
# Plot top features
plt.figure(figsize=(10, 4))

plt.barh(top_words, top_scores, color='blue')

plt.xlabel("MI Score")
plt.ylabel("Feature")

plt.title("Top features by MI Score")

plt.gca().invert_yaxis()  # Invert Y-axis for better visualisation
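
For the simplest case, where a word is either present or absent, mutual information can be computed directly from counts. The sketch below does this in bits, using invented presence vectors and labels. It is not what `mutual_info_classif` does for continuous TF-IDF features, where scikit-learn documents a nearest-neighbour based estimator, so the numbers are not comparable with the plot above.

```python
import math

def mutual_information(presence, labels):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(labels)
    mi = 0.0
    for x in set(presence):
        for y in set(labels):
            p_xy = sum(1 for p, l in zip(presence, labels) if p == x and l == y) / n
            if p_xy == 0:
                continue  # Empty cells contribute nothing
            p_x = presence.count(x) / n
            p_y = labels.count(y) / n
            mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi

labels        = [0, 0, 0, 0, 1, 1, 1, 1]   # Hypothetical sentiment labels
has_fantastic = [1, 1, 1, 1, 0, 0, 0, 0]   # Perfectly tracks the label
has_movie     = [1, 0, 1, 0, 1, 0, 1, 0]   # Unrelated to the label

mi_fantastic = mutual_information(has_fantastic, labels)
mi_movie = mutual_information(has_movie, labels)
print(mi_fantastic)  # 1.0
print(mi_movie)      # 0.0
```

A word that perfectly tracks a balanced binary label removes the full one bit of uncertainty, while a word that is equally likely in both classes removes none.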

## Dimensionality reduction techniques

When working with text data, we often represent words and documents as numerical features in a very high-dimensional space. For example, if we have 10,000 unique words in a dataset, each document is represented as a vector with 10,000 dimensions. This creates several challenges, including:

- Computational inefficiency – More features mean longer processing times and higher memory usage.
- Redundancy – Many words convey similar meanings, leading to overlapping information.
- Overfitting risk – Too many features can make models learn noise instead of meaningful patterns.

To solve these issues, dimensionality reduction techniques transform high-dimensional word vectors into a smaller, more manageable space while preserving the most important information. This helps models run faster and generalise better, although the new dimensions can be harder to interpret than individual words.

### Latent Semantic Analysis (LSA)
Latent Semantic Analysis (LSA) is a dimensionality reduction technique used to find hidden relationships between words and documents. It helps transform high-dimensional text data into a lower-dimensional representation, making it easier to analyse while preserving key patterns in word usage.

LSA is based on a mathematical technique called Singular Value Decomposition (SVD), which breaks down large document-term matrices into smaller, more manageable ones while retaining the most important information about word meanings.

In text data, there are several challenges:
- *Synonyms and variations*: Different words can have the same meaning (e.g., "car" vs "automobile"). Traditional methods treat these as unrelated features, whereas LSA tends to place words used in similar contexts close together.
- *High dimensionality*: Large vocabularies create thousands of features (words), making models slow and complex. LSA reduces the number of dimensions while keeping key information.
- *Noise reduction*: Text data has redundant or irrelevant words. LSA down-weights minor variation in word usage while keeping the meaningful structure.

The process begins by converting text into a document-term matrix, where each document is represented as a vector of word frequencies or TF-IDF values. This matrix is then decomposed using Singular Value Decomposition (SVD), breaking it into three smaller matrices: *U*, which represents documents in a reduced space; *S*, which contains singular values indicating the importance of each concept; and *V*, which represents words in the same reduced space.

To enhance efficiency and remove noise, LSA retains only the most important concepts, discarding smaller singular values that contribute less to meaning. This results in a transformed representation where both documents and words are mapped to a lower-dimensional space that captures their latent relationships.
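
The decomposition and truncation steps can be sketched with NumPy on a tiny matrix. The counts and vocabulary below are hypothetical, and `np.linalg.svd` computes the full decomposition, whereas `TruncatedSVD` is designed to compute only the leading components of large sparse matrices.

```python
import numpy as np

# Toy document-term matrix: 4 documents (rows) x 5 terms (columns), hypothetical counts
terms = ["car", "automobile", "engine", "actor", "plot"]
A = np.array([
    [2, 0, 1, 0, 0],
    [0, 2, 1, 0, 0],
    [0, 0, 0, 2, 1],
    [0, 0, 0, 1, 2],
], dtype=float)

# Full decomposition: A = U @ diag(S) @ Vt, singular values sorted largest first
U, S, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # Number of latent concepts to keep
docs_reduced = U[:, :k] * S[:k]       # Documents in the k-dimensional concept space
terms_reduced = Vt[:k, :].T           # Terms in the same space
A_approx = docs_reduced @ Vt[:k, :]   # Rank-k approximation of A

print("Singular values:", np.round(S, 3))
print("Document coordinates:\n", np.round(docs_reduced, 3))
print("Reconstruction error:", round(float(np.linalg.norm(A - A_approx)), 3))
```

Documents 1 and 2 never share the words "car" and "automobile", but both mention "engine". After truncation they land at the same point in the two-dimensional space, which illustrates how LSA can relate documents that use different but related vocabulary.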

There are also disadvantages. Interpretability suffers, because the reduced dimensions are abstract and it is hard to explain what each new feature represents. LSA can be computationally expensive, since computing SVD on large datasets requires significant processing power. Like several of the other methods covered here, it also ignores word order: it relies only on word co-occurrence and does not consider how words are arranged in a sentence.

In [None]:
import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Vectorise text using TF-IDF
vectoriser = TfidfVectorizer(stop_words="english", max_features=5000)  # Limit features to 5000 for efficiency
X_tfidf = vectoriser.fit_transform(X)

seed = 7

# Apply LSA (Truncated SVD for text data)
lsa = TruncatedSVD(n_components=2, random_state=seed)
X_lsa = lsa.fit_transform(X_tfidf)

# Scatter plot of the two LSA components, coloured by sentiment
plt.figure(figsize=(8, 8))

# Labels follow the loading step: 0 = positive ('pos'), 1 = negative ('neg')
for sentiment, color, label in zip([0, 1], ["blue", "red"], ["Positive Reviews", "Negative Reviews"]):
    indices = [i for i, lbl in enumerate(Y) if lbl == sentiment]

    plt.scatter(X_lsa[indices, 0], X_lsa[indices, 1], c=color, alpha=0.7, label=label)

plt.xlabel("LSA Component 1")
plt.ylabel("LSA Component 2")

plt.title("LSA Projection of movie reviews")

plt.legend(loc="best")

plt.grid(True)

The plot represents movie reviews transformed into a 2D space using Latent Semantic Analysis (LSA). Each point corresponds to a review, with:

- Blue points representing positive reviews
- Red points representing negative reviews

The x-axis (LSA Component 1) and y-axis (LSA Component 2) capture the most important semantic patterns in the dataset.

Originally, each review was represented by up to 5,000 TF-IDF features (the `max_features` limit set above).
LSA compressed this into just two components while preserving key relationships.

Many positive reviews (blue) and negative reviews (red) appear in different regions, suggesting that LSA captures sentiment-related differences. However, there is some overlap, indicating that some reviews might be harder to distinguish based on LSA alone.

If LSA can separate positive and negative reviews effectively, we can use these components as features for our machine learning models (e.g., logistic regression, SVM) to classify new movie reviews based on sentiment.

### Non-Negative Matrix Factorisation (NMF)
Non-Negative Matrix Factorisation (NMF) is a dimensionality reduction technique that helps break down complex datasets into meaningful components while ensuring that all values remain non-negative (i.e., no negative numbers).

Think of it like finding hidden topics in a collection of documents by identifying important word groupings. Unlike some other techniques (like PCA or LSA), NMF components tend to be easier to interpret because they never mix negative and positive numbers.

Many dimensionality reduction techniques create components where some words have positive weights and others have negative weights. This can make interpretation difficult because negative values do not have a clear meaning in human language. With NMF, all values are non-negative, so each word can only add to a topic, never subtract from it.

NMF begins by converting text into a *document-term* matrix, where each document is represented as a vector of word frequencies or *TF-IDF* values.

The matrix is then factorised into two smaller matrices:

- *W* (Document-Topic matrix), which indicates how strongly each document relates to different topics, and
- *H* (Topic-Word matrix), which captures how much each word contributes to different topics.

The final step is to interpret the topics by identifying patterns of words that frequently appear together, revealing the hidden structure within the dataset.

This process helps uncover meaningful word groupings, making it useful for topic modelling, sentiment analysis, and document classification:

In [None]:
from sklearn.decomposition import NMF

# Apply NMF for dimensionality reduction
seed = 7
nmf = NMF(n_components=2, init="nndsvd", random_state=seed)

X_nmf = nmf.fit_transform(X_tfidf)

print(X_nmf)

The bar chart below displays the top words for each NMF component. Each bar represents a word that contributes significantly to a particular NMF component (topic). Longer bars indicate words that are more strongly associated with the component. Distinct word groupings can help us interpret what each component represents.

If a component contains words like "great", "amazing", "fantastic", it likely represents positive sentiment. If a component includes words like "terrible", "awful", "disappointing", it likely represents negative sentiment.

In [None]:
# Extract top words for each NMF component (topic)
num_top_words = 10

feature_names = vectoriser.get_feature_names_out()

# Get the most important words (and their weights) for each component
top_words_per_component = {}
top_weights_per_component = {}
for topic_idx, topic in enumerate(nmf.components_):
    # Indices of the highest-weighted words for this component, largest first
    top_indices = topic.argsort()[:-num_top_words - 1:-1]
    name = f"Component {topic_idx + 1}"
    top_words_per_component[name] = [feature_names[i] for i in top_indices]
    top_weights_per_component[name] = topic[top_indices]

# Convert to DataFrame for easier visualisation
df_top_words = pd.DataFrame(top_words_per_component)

# Plot top words per NMF component, using each component's own weights
plt.figure(figsize=(10, 6))

for column in df_top_words.columns:
    plt.barh(df_top_words[column], top_weights_per_component[column], label=column)

plt.xlabel("Weight")
plt.ylabel("Words")

plt.title("Top words in each NMF component")

plt.legend()

# Invert the y-axis so that the first words plotted appear at the top
plt.gca().invert_yaxis()
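
To show how a factorisation can stay non-negative, the sketch below implements the classic multiplicative update rules in NumPy on a hypothetical 4x5 matrix. This is one well-known way to fit NMF and is shown for illustration only: scikit-learn's `NMF` uses a coordinate descent solver by default, so this is not a description of what the cell above runs internally.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy non-negative document-term matrix (4 documents x 5 terms), hypothetical counts
V = np.array([
    [3, 2, 0, 0, 1],
    [2, 3, 0, 0, 1],
    [0, 0, 3, 2, 1],
    [0, 0, 2, 3, 1],
], dtype=float)

n_topics = 2
W = rng.random((V.shape[0], n_topics))   # Document-topic weights
H = rng.random((n_topics, V.shape[1]))   # Topic-word weights
eps = 1e-10                              # Avoids division by zero

initial_error = np.linalg.norm(V - W @ H)
for _ in range(200):
    # Each update multiplies by a non-negative ratio, so entries never become negative
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)
final_error = np.linalg.norm(V - W @ H)

print("Document-topic matrix W:\n", np.round(W, 2))
print("Topic-word matrix H:\n", np.round(H, 2))
print(f"Reconstruction error: {initial_error:.3f} -> {final_error:.3f}")
```

Because the updates only ever multiply by non-negative ratios, *W* and *H* stay non-negative throughout, which is what makes the resulting topics additive.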

## Embedded Methods (Model-based selection)
When working with text classification, we often deal with high-dimensional data, where each word becomes a feature. However, not all words are equally important—some contribute significantly to classification, while others add noise. Embedded methods help by automatically selecting the most relevant words while training the model.

We will provide a few examples of model-based selection, starting with Lasso (L1 Regression), before moving to Random Forest-based approaches:



#### Lasso (L1 Regression)
Lasso Regression is a machine learning technique that automatically selects the most important features (in this case, words) while removing irrelevant ones.

It does this by adding a penalty (L1 regularisation) to the model, which forces it to shrink the impact of less useful words to zero. This means that only the most important words stay in the model, while unimportant words are removed (i.e., coefficient = 0). Regularisation prevents the model from becoming too complex by penalising large coefficients. This helps reduce overfitting, as it stops the model from memorising noise.

The parameter `C` (see below) is the inverse of the regularisation strength. A high `C` (e.g., C=10) provides weak regularisation, so the model keeps more words (including less useful ones). A low `C` (e.g., C=0.1) provides strong regularisation and therefore more aggressive feature selection (fewer but stronger words remain). In some instances, this can lead to underfitting.

We start with all words. Each word in the dataset is given a number (weight) that tells us how important it is. During training, if a word isn’t very useful for classification, Lasso sets its weight to zero, effectively removing it. In the code below, the L1 penalty is applied to a logistic regression classifier, since our task is classification.

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel
from sklearn.feature_extraction.text import TfidfVectorizer

# Vectorise text using TF-IDF
vectoriser = TfidfVectorizer(stop_words="english", max_features=10000)  # Limit features for efficiency
X_tfidf = vectoriser.fit_transform(X)

# Train a logistic regression model with L1 regularisation (Lasso)
seed = 7
lasso_model = LogisticRegression(penalty="l1", solver="liblinear", C=2.0, random_state=seed)
lasso_model.fit(X_tfidf, Y)

# Get feature importance (absolute value of coefficients)
word_importance = np.abs(lasso_model.coef_).ravel()

# Get feature names
feature_names = vectoriser.get_feature_names_out()

# Get indices of non-zero coefficients (important words)
important_word_indices = np.where(word_importance > 0)[0]

# Get indices of zero coefficients (unimportant words removed by Lasso)
unimportant_word_indices = np.where(word_importance == 0)[0]

# Extract important and unimportant words
important_words = [feature_names[i] for i in important_word_indices]
unimportant_words = [feature_names[i] for i in unimportant_word_indices]

Let's visualise the results to easily identify the most important words and their importance:

In [None]:
import matplotlib.pyplot as plt

# Plot Important Words (Top 10 by absolute Coefficient value)
top_n = 10  # Limit the number of words for display
important_word_values = word_importance[important_word_indices]
top_indices = important_word_values.argsort()[-top_n:][::-1]  # Get top N important words

plt.figure(figsize=(10, 6))

plt.barh([important_words[i] for i in top_indices], [important_word_values[i] for i in top_indices], color="blue")

plt.xlabel("Importance Score (Absolute Coefficient)")
plt.ylabel("Words")

plt.title("Top 10 important words selected by Lasso")

plt.gca().invert_yaxis()  # Invert y-axis for better readability
plt.show()

# Extract the unimportant words (random sample)
num_unimportant = min(10, len(unimportant_words))  # Show up to 10 words
sample_unimportant_words = np.random.choice(unimportant_words, num_unimportant, replace=False)

print("Unimportant words - those removed by Lasso")
print(sample_unimportant_words)

As we can see from the plot, Lasso automatically picks the most important words and we can see how they may relate to sentiment.

An issue with this approach is that if the regularisation is too strong (i.e., `C` is too small), it may remove too many words and miss useful information, so finding an appropriate value for `C` requires some experimentation.
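
One way to run that experiment is to sweep over several values of `C` and count how many coefficients survive. The sketch below does this on synthetic data standing in for a TF-IDF matrix. The data, the variable names, and the grid of `C` values are all assumptions for illustration, and in practice we would also track validation accuracy at each setting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)

# Synthetic stand-in for a TF-IDF matrix: 200 documents x 50 features,
# where only the first three features carry information about the label
X_toy = rng.random((200, 50))
y_toy = (X_toy[:, 0] + X_toy[:, 1] - X_toy[:, 2] > 0.5).astype(int)

kept_per_C = {}
for C in [0.01, 0.1, 1.0, 10.0, 100.0]:
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C, random_state=7)
    model.fit(X_toy, y_toy)
    # Features with a non-zero coefficient are the ones the model kept
    kept_per_C[C] = int(np.count_nonzero(model.coef_))
    print(f"C={C:<6} non-zero coefficients: {kept_per_C[C]}")
```

Small values of `C` should leave few or no features, while large values keep many more, including some of the uninformative ones.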

#### Random Forest-based feature selection
Imagine you’re trying to decide whether a movie review is positive or negative, but instead of making the decision alone, you ask a group of friends for their opinions. Some might focus on keywords like “fantastic” or “terrible,” while others consider different phrases. After gathering everyone’s input, you go with the majority vote. This is how a Random Forest works.

Random Forest is a machine learning algorithm that makes predictions by combining the results of multiple decision trees. Instead of relying on just one tree, it creates a "forest" of many trees, each trained on a different part of the data. Each tree gives its own prediction, and the final result is determined by a majority vote (for classification) or an average (for regression). This approach helps improve accuracy and makes the model more resistant to errors or noise in the data.

One of the key benefits of Random Forest is its ability to rank feature importance, meaning it can tell us which words are the most useful for classification tasks like sentiment analysis.

In [None]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
import matplotlib.pyplot as plt

# Vectorise text using TF-IDF
vectoriser = TfidfVectorizer(stop_words="english", max_features=10000)  # Limit features for efficiency
X_tfidf = vectoriser.fit_transform(X)

seed = 7

# Train Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=seed)
rf_model.fit(X_tfidf, Y)

# Extract feature importance
feature_importance = rf_model.feature_importances_
feature_names = vectoriser.get_feature_names_out()

# Select top N important features
top_n = 10
top_indices = np.argsort(feature_importance)[-top_n:][::-1]  # Get top N important words

# Plot feature importance
plt.figure(figsize=(10, 6))

plt.barh([feature_names[i] for i in top_indices], [feature_importance[i] for i in top_indices], color="blue")

plt.xlabel("Feature importance score")
plt.ylabel("Words")

plt.title("Top words in Random Forest Model")

plt.gca().invert_yaxis()  # Invert y-axis for better readability
plt.show()

### Heuristic and Neural Network-based approaches

Word embeddings, attention mechanisms, and autoencoders can be used for feature selection and dimensionality reduction in text classification. Each method captures semantic meaning beyond simple word frequency. These are more advanced techniques that leverage deep learning to identify the semantics of a collection of texts (in our case, reviews):

- *Word Embeddings (Word2Vec, GloVe, FastText)*: Convert words into dense vectors, allowing for semantic feature selection.
- *Attention Mechanisms*: In transformer-based models (e.g., BERT, GPT), attention scores indicate important words for prediction.
- *Autoencoders*: Neural networks that learn compressed representations of text data, filtering out noise.
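To make the autoencoder idea concrete, here is a minimal sketch that trains a small network to reconstruct its own input through a narrow hidden layer. It uses scikit-learn's `MLPRegressor` as a lightweight stand-in for a dedicated deep learning library, and the toy reviews and the bottleneck size of 3 are assumptions chosen purely for illustration:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPRegressor

# Hypothetical toy reviews
toy_reviews = [
    "fantastic film with a wonderful cast",
    "wonderful story and fantastic acting",
    "a great and enjoyable movie",
    "terrible plot and awful acting",
    "awful film with a boring story",
    "boring and terrible movie",
]
toy_tfidf = TfidfVectorizer().fit_transform(toy_reviews).toarray()

# A single hidden layer with 3 units acts as the bottleneck
autoencoder = MLPRegressor(
    hidden_layer_sizes=(3,), activation="relu", max_iter=2000, random_state=7
)
autoencoder.fit(toy_tfidf, toy_tfidf)  # The target is the input itself

# Encoder half only: input layer -> hidden layer (ReLU activation)
compressed = np.maximum(0, toy_tfidf @ autoencoder.coefs_[0] + autoencoder.intercepts_[0])
print(compressed.shape)  # (number of reviews, bottleneck size)
```

Each review is now described by 3 learned values instead of one value per word, and these compressed values could be used as features for a classifier.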

#### Word Embeddings
Word embeddings convert words into dense vectors, capturing semantic relationships (e.g., "king" and "queen" are considered similar in meaning).
Unlike TF-IDF, which treats words independently, embeddings group similar words together in a lower-dimensional space.

We can use them for feature selection to identify important words based on meaning, not just frequency. They can also be used to reduce the dimensions of our feature space, as words with similar meanings have similar vectors.

We train a model to learn word vectors, which can be used as features in our classification models. Words whose vectors are very similar could then be grouped or pruned by applying a minimum similarity threshold; here we only train the model and inspect the resulting vectors:

In [None]:
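from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# Word2Vec expects each review as a list of tokens, not a raw string
# (assuming X holds raw review strings, as the TF-IDF code implies)
tokenised_reviews = [simple_preprocess(review) for review in X]

model = Word2Vec(sentences=tokenised_reviews, vector_size=100, window=5, min_count=1, workers=4)

word_embeddings = {word: model.wv[word] for word in model.wv.index_to_key}

# Print the first few values for a handful of words rather than the whole vocabulary
for word in model.wv.index_to_key[:5]:
    print(word, word_embeddings[word][:5])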

We can visualise the results using t-SNE (t-Distributed Stochastic Neighbor Embedding), which is a machine learning algorithm used for visualising high-dimensional data in a lower-dimensional space, typically 2D or 3D. It helps identify patterns, relationships, and clusters in complex datasets, especially in text, images, and word embeddings.

When using t-SNE (`TSNE` in scikit-learn), we adjust parameters such as `n_components` and `perplexity` to control how word embeddings are visualised. Word embeddings exist in hundreds of dimensions, which makes them hard to visualise, so t-SNE reduces high-dimensional representations (such as TF-IDF or Word2Vec vectors) to a small number of coordinates that we can plot. Setting `n_components=2` reduces them to 2D, which makes relationships between words easier to interpret.

t-SNE balances local and global word relationships using the `perplexity` parameter. High perplexity (30-50) captures more global structure (broad relationships), whereas low perplexity (3-5) focuses on local clusters (similar words grouped together).
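As a rough illustration of the `perplexity` parameter, the sketch below runs t-SNE twice on the same vectors, once with a low and once with a higher value. The random vectors are a hypothetical stand-in for real word vectors, so the layouts themselves are not meaningful; the point is only how the parameter is set:

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical stand-in for word vectors: 60 "words" in 20 dimensions
rng = np.random.default_rng(7)
toy_vectors = rng.normal(size=(60, 20))

toy_embeddings = {}
for perplexity in (5, 30):
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=7)
    toy_embeddings[perplexity] = tsne.fit_transform(toy_vectors)
    print(f"perplexity={perplexity}: output shape {toy_embeddings[perplexity].shape}")
```

Note that `perplexity` has to be smaller than the number of points being embedded, which is why small vocabularies need a low value. Plotting the two results side by side is a practical way to choose between them.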

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA

# Vectorise text using TF-IDF
vectoriser = TfidfVectorizer(stop_words="english", max_features=50)  # Limit features for efficiency

X_tfidf = vectoriser.fit_transform(X)

feature_names = vectoriser.get_feature_names_out()

# Transpose TF-IDF matrix so each word becomes a row, described by its weights across reviews
X_tfidf_transposed = X_tfidf.T.toarray()  # Each row corresponds to a word

seed = 7

# Reduce dimensionality with PCA before applying t-SNE
pca = PCA(n_components=5, random_state=seed)  # Reduce to 5 dimensions max
X_pca = pca.fit_transform(X_tfidf_transposed)

# Apply t-SNE on PCA-reduced word embeddings
tsne = TSNE(n_components=2, random_state=seed, perplexity=2)  # Lower perplexity for small datasets, we can increase this if using all the data
X_embedded = tsne.fit_transform(X_pca)

# Create scatter plot for word embeddings
plt.figure(figsize=(8, 8))
plt.scatter(X_embedded[:, 0], X_embedded[:, 1], alpha=0.7, color="purple")

# Annotate each point with its corresponding word
for i, word in enumerate(feature_names):
    plt.annotate(word, (X_embedded[i, 0], X_embedded[i, 1]), fontsize=9, alpha=0.75)

plt.xlabel("t-SNE Component 1")
plt.ylabel("t-SNE Component 2")

plt.title("t-SNE visualisation of Word Embeddings")
plt.grid(True)
plt.show()

From the plot we can see that similar words cluster together. This helps us find semantic relationships in text data (e.g., words like "king" and "queen" appearing close to each other).

PCA keeps the global structure but doesn't capture local similarities well. t-SNE preserves local similarities better, so words that were close in high dimensions tend to stay close in 2D space. This makes it useful for understanding word embeddings and text data.
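One way to check how well local neighbourhoods survive a projection is scikit-learn's `trustworthiness` score, which ranges from 0 to 1 (higher means neighbours in 2D were also neighbours in the original space). The sketch below compares PCA and t-SNE on synthetic clustered data from `make_blobs`, used here as a hypothetical stand-in for word vectors; the scores on real text data may look quite different:

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, trustworthiness

# Hypothetical stand-in for word vectors: 90 points in 30 dimensions, 3 clusters
toy_vectors, _ = make_blobs(n_samples=90, n_features=30, centers=3, random_state=7)

# Project the same data to 2D with both methods
pca_2d = PCA(n_components=2, random_state=7).fit_transform(toy_vectors)
tsne_2d = TSNE(n_components=2, perplexity=10, random_state=7).fit_transform(toy_vectors)

# Score how well each 2D layout preserves the 5 nearest neighbours of every point
pca_score = trustworthiness(toy_vectors, pca_2d, n_neighbors=5)
tsne_score = trustworthiness(toy_vectors, tsne_2d, n_neighbors=5)

print(f"PCA trustworthiness:   {pca_score:.3f}")
print(f"t-SNE trustworthiness: {tsne_score:.3f}")
```

The same comparison can be run on the PCA-reduced TF-IDF word vectors from the cell above by swapping them in for `toy_vectors`.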

Whether you use TF-IDF or Word2Vec, t-SNE helps visualise how words are grouped. We can therefore use it to analyse sentiment-related words, topic clusters, or semantic relationships. In sentiment analysis, for instance, you might find clusters of positive and negative words.

## What have we learnt?

Through this exploration of feature selection techniques for language data, we have seen how different methods help improve text classification models by reducing noise, enhancing interpretability, and improving computational efficiency.

We discussed manual and domain-specific methods, such as stopword removal, part-of-speech filtering, and named entity recognition, which help refine feature selection by focusing on meaningful words.

We covered statistical approaches like TF-IDF to highlight important but uncommon words, as well as chi-square measures for word-category association. We looked at mutual information, which helps determine words that provide the most predictive value. We also saw that dimensionality reduction methods like Latent Semantic Analysis (LSA) and Non-Negative Matrix Factorisation (NMF) reduce the feature space while preserving key relationships, making models more efficient and interpretable.

Embedded methods like Lasso regression and Random Forest-based selection are useful as they automatically identify important words while eliminating less useful ones. Neural network-based approaches like word embeddings, in turn, capture semantic meaning beyond frequency counts.

We demonstrated how t-SNE visualisation helps us explore high-dimensional text data and understand how words cluster based on their relationships.

In summary, we can enhance the accuracy and efficiency of our text classification models by applying these methods, ensuring that they focus on the most meaningful features. The choice of method depends on the dataset, model complexity, and computational resources available. Combining multiple techniques often leads to the best results in real-world applications.