# You are part of a team developing a text classification system for a news aggregator platform. The platform aims to categorize news articles into different topics automatically. The dataset contains news articles along with their corresponding topics. Perform only the Feature extraction techniques.

Dataset Link: https://www.kaggle.com/datasets/therohk/million-headlines

Data Exploration: Begin by exploring the dataset. What are the different topics/categories present in the dataset? What is the distribution of articles across these topics?

Bag-of-Words (BoW): Implement a Bag-of-Words (BoW) model using CountVectorizer or TF-IDF to transform the text data into numerical features. Discuss the advantages and limitations of BoW in this context. Apply both unigram and bigram techniques and compare their effects on classification accuracy.

N-grams: Explore the use of N-grams (bi-grams, tri-grams) in feature engineering. How do different N-gram ranges impact the performance of the classification model?

TF-IDF: Apply TF-IDF (Term Frequency-Inverse Document Frequency) to the text data. Describe how TF-IDF works and its significance in capturing the importance of words across documents. Compare the results of TF-IDF with the BoW approach.

One-Hot Encoding: Investigate the application of One-Hot Encoding to encode categorical variables or labels. Can One-Hot Encoding be used directly for text classification? Why or why not?

Deliverables:

Present insights gathered from data exploration and discuss the impact of different feature engineering techniques (BoW, N-grams, TF-IDF, One-Hot Encoding). Provide recommendations for the best feature engineering strategy. Code in Python.

In [2]:
# Import necessary libraries
import pandas as pd

# Load the dataset
df = pd.read_csv("C://Users//vaish//abcnews-date-text.csv")

# Display the first few rows of the dataset
print("Sample of the Dataset:")
print(df.head())

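# Explore different topics/categories
# Note: the CSV has only 'publish_date' and 'headline_text' (no topic column),
# so the unique publish dates (YYYYMMDD integers) are inspected as a stand-in
categories = df['publish_date'].unique()
print("\nCategories present in the dataset:")
print(categories)

# Number of headlines per publish date (stand-in for a per-topic distribution)
print("\nDistribution of articles across topics:")
print(df['publish_date'].value_counts())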

Sample of the Dataset:
   publish_date                                      headline_text
0      20030219  aba decides against community broadcasting lic...
1      20030219     act fire witnesses must be aware of defamation
2      20030219     a g calls for infrastructure protection summit
3      20030219           air nz staff in aust strike for pay rise
4      20030219      air nz strike to affect australian travellers

Categories present in the dataset:
[20030219 20030220 20030221 ... 20211229 20211230 20211231]

Distribution of articles across topics:
publish_date
20120824    384
20130412    383
20110222    380
20120814    379
20130514    378
           ... 
20210605      6
20211023      5
20210515      5
20210806      1
20170209      1
Name: count, Length: 6882, dtype: int64
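
The output above shows that `publish_date` holds dates in YYYYMMDD form, not topics, so a per-date count says little about categories. A minimal sketch of a more informative exploration follows, assuming a toy frame with the same two columns; the first three headlines come from the sample above and the last row is made up for illustration.

```python
import pandas as pd

# Toy frame mirroring the two columns of abcnews-date-text.csv.
# The last row is hypothetical and only there to give a second year.
toy = pd.DataFrame({
    "publish_date": [20030219, 20030219, 20030219, 20040101],
    "headline_text": [
        "act fire witnesses must be aware of defamation",
        "a g calls for infrastructure protection summit",
        "air nz staff in aust strike for pay rise",
        "hypothetical headline for a later year",
    ],
})

# publish_date is an integer in YYYYMMDD form, so parse it before grouping.
toy["date"] = pd.to_datetime(toy["publish_date"].astype(str), format="%Y%m%d")
headlines_per_year = toy["date"].dt.year.value_counts().sort_index()
print(headlines_per_year)

# Headline length in words is a cheap summary of the text column.
toy["n_words"] = toy["headline_text"].str.split().str.len()
print(toy["n_words"].describe())
```

On the full CSV the same two steps would apply to `df` directly; grouping by year rather than by day keeps the table readable.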


# Bag-of-Words (BoW):

In [6]:
!pip install --upgrade pandas


Collecting pandas
  Obtaining dependency information for pandas from https://files.pythonhosted.org/packages/97/d8/dc2f6bff06a799a5603c414afc6de39c6351fe34892d50b6a077df3be6ac/pandas-2.1.3-cp311-cp311-win_amd64.whl.metadata
  Downloading pandas-2.1.3-cp311-cp311-win_amd64.whl.metadata (18 kB)
Downloading pandas-2.1.3-cp311-cp311-win_amd64.whl (10.6 MB)

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
d2l 1.0.3 requires numpy==1.23.5, but you have numpy 1.26.2 which is incompatible.
d2l 1.0.3 requires pandas==2.0.3, but you have pandas 2.1.3 which is incompatible.
d2l 1.0.3 requires scipy==1.10.1, but you have scipy 1.11.4 which is incompatible.


In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Load the dataset
file_path = "C://Users//vaish//abcnews-date-text.csv"  # Replace with the actual path to your downloaded CSV file
df = pd.read_csv(file_path)

# Check the column names in the DataFrame
print(df.columns)

# The CSV has two columns, 'publish_date' and 'headline_text'; there is no 'category' column,
# so 'publish_date' is used below as a stand-in label
# For simplicity, let's focus on a smaller subset for faster execution
df = df.sample(frac=0.1, random_state=42)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['headline_text'], df['publish_date'], test_size=0.2, random_state=42)

# Implement Bag-of-Words with unigrams using CountVectorizer
count_vectorizer_unigram = CountVectorizer()
X_train_counts_unigram = count_vectorizer_unigram.fit_transform(X_train)
X_test_counts_unigram = count_vectorizer_unigram.transform(X_test)

# Implement Bag-of-Words with bigrams using CountVectorizer
count_vectorizer_bigram = CountVectorizer(ngram_range=(2, 2))
X_train_counts_bigram = count_vectorizer_bigram.fit_transform(X_train)
X_test_counts_bigram = count_vectorizer_bigram.transform(X_test)

# Implement Bag-of-Words with unigrams using TF-IDF
tfidf_vectorizer_unigram = TfidfVectorizer()
X_train_tfidf_unigram = tfidf_vectorizer_unigram.fit_transform(X_train)
X_test_tfidf_unigram = tfidf_vectorizer_unigram.transform(X_test)

# Implement Bag-of-Words with bigrams using TF-IDF
tfidf_vectorizer_bigram = TfidfVectorizer(ngram_range=(2, 2))
X_train_tfidf_bigram = tfidf_vectorizer_bigram.fit_transform(X_train)
X_test_tfidf_bigram = tfidf_vectorizer_bigram.transform(X_test)

# The feature matrices above can now be passed to a simple classifier
# (e.g., Multinomial Naive Bayes) and scored with accuracy_score;
# the N-grams and TF-IDF sections do this


Index(['publish_date', 'headline_text'], dtype='object')


In [18]:
# Use 'publish_date' as the label; passing 'headline_text' as both X and y
# would make every headline its own class
X_train, X_test, y_train, y_test = train_test_split(df['headline_text'], df['publish_date'], test_size=0.2, random_state=42)


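Advantages of BoW:

Simple and computationally efficient: each headline becomes a sparse vector of token counts.
Captures word frequency information.

Limitations of BoW:

Ignores word order and sentence structure, so headlines that contain the same words in a different order get identical unigram vectors.
Doesn't capture semantic meaning: synonyms are treated as unrelated features.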

# N-grams:

In [13]:
from sklearn.feature_extraction.text import CountVectorizer


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Load the dataset
file_path = "C://Users//vaish//abcnews-date-text.csv"  # Replace with the actual path to your downloaded CSV file
df = pd.read_csv(file_path)

# For simplicity, let's focus on a smaller subset for faster execution
df = df.sample(frac=0.1, random_state=42)

# Split the data into training and testing sets

# Use 'publish_date' as the label; the headline text must not serve as its own target
X_train, X_test, y_train, y_test = train_test_split(df['headline_text'], df['publish_date'], test_size=0.2, random_state=42)

# Function to train and evaluate the model
def train_and_evaluate_model(X_train_features, X_test_features, y_train, y_test):
    # Train a Naive Bayes classifier
    clf = MultinomialNB()
    clf.fit(X_train_features, y_train)

    # Make predictions on the test set
    y_pred = clf.predict(X_test_features)

    # Evaluate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    print(f'Accuracy: {accuracy:.2f}')

# Implement N-grams (bi-grams and tri-grams) with CountVectorizer
for n in range(2, 4):  # n-grams range from 2 to 3 (bi-grams to tri-grams)
    ngram_vectorizer = CountVectorizer(ngram_range=(n, n))
    X_train_ngram = ngram_vectorizer.fit_transform(X_train)
    X_test_ngram = ngram_vectorizer.transform(X_test)

    print(f'\nResults for {n}-grams:')
    train_and_evaluate_model(X_train_ngram, X_test_ngram, y_train, y_test)



Results for 2-grams:


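Impact of N-grams:

Higher-order N-grams capture more local context (for example, "pay rise" as a single feature), but they enlarge the vocabulary and make the feature matrix sparser, because longer word sequences repeat less often across headlines.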
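
To make the dimensionality point concrete, the sketch below counts the vocabulary size for several `ngram_range` settings. The four headlines are hypothetical and chosen so that words repeat in different contexts; on the real dataset the numbers would be far larger.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical headlines, invented for illustration only.
headlines = [
    "police investigate fire in sydney",
    "fire crews investigate sydney blaze",
    "sydney police praise fire crews",
    "police praise crews in sydney blaze",
]

vocab_sizes = {}
for ngram_range in [(1, 1), (2, 2), (3, 3), (1, 3)]:
    vectorizer = CountVectorizer(ngram_range=ngram_range)
    vectorizer.fit(headlines)
    # vocabulary_ maps each n-gram to its column index, so its length is the feature count.
    vocab_sizes[ngram_range] = len(vectorizer.vocabulary_)
    print(ngram_range, vocab_sizes[ngram_range])
```

A combined range such as `(1, 3)` keeps the unigrams and adds the longer sequences on top, so its feature count is the sum of the individual ranges.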

# TF-IDF:

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Load the dataset
dataset_url = "C://Users//vaish//abcnews-date-text.csv"
df = pd.read_csv(dataset_url)

# For simplicity, let's focus on a smaller subset for faster execution
df = df.sample(frac=0.1, random_state=42)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['headline_text'], df['publish_date'], test_size=0.2, random_state=42)

# TF-IDF with unigrams
tfidf_vectorizer_unigram = TfidfVectorizer()
X_train_tfidf_unigram = tfidf_vectorizer_unigram.fit_transform(X_train)
X_test_tfidf_unigram = tfidf_vectorizer_unigram.transform(X_test)

# Bag-of-Words (BoW) with unigrams using CountVectorizer
count_vectorizer_unigram = CountVectorizer()
X_train_bow_unigram = count_vectorizer_unigram.fit_transform(X_train)
X_test_bow_unigram = count_vectorizer_unigram.transform(X_test)

# Train a simple classifier (e.g., Multinomial Naive Bayes) and evaluate accuracy

# TF-IDF with unigrams
model_tfidf_unigram = MultinomialNB()
model_tfidf_unigram.fit(X_train_tfidf_unigram, y_train)
y_pred_tfidf_unigram = model_tfidf_unigram.predict(X_test_tfidf_unigram)
accuracy_tfidf_unigram = accuracy_score(y_test, y_pred_tfidf_unigram)
print(f'Accuracy with TF-IDF (unigram): {accuracy_tfidf_unigram}')

# BoW with unigrams
model_bow_unigram = MultinomialNB()
model_bow_unigram.fit(X_train_bow_unigram, y_train)
y_pred_bow_unigram = model_bow_unigram.predict(X_test_bow_unigram)
accuracy_bow_unigram = accuracy_score(y_test, y_pred_bow_unigram)
print(f'Accuracy with BoW (unigram): {accuracy_bow_unigram}')

Accuracy with TF-IDF (unigram): 0.0019289503295290146
Accuracy with BoW (unigram): 0.003777527728660987


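Significance of TF-IDF:

Weights each term by its frequency within a document, scaled down by how many documents in the corpus contain it.
Highlights words that are distinctive for a document, while down-weighting words that appear in most documents.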
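
How the weighting works can be checked by hand. The sketch below recomputes TF-IDF from raw counts using scikit-learn's default settings (smoothed IDF, then L2 normalisation of each row) and compares the result with `TfidfVectorizer`. The three short documents are illustrative, loosely based on the sample headlines.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Illustrative documents, loosely based on the sample headlines.
docs = [
    "air nz strike",
    "air nz staff strike for pay rise",
    "fire witnesses must be aware",
]

count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(docs).toarray().astype(float)
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each term

# scikit-learn default (smooth_idf=True): idf = ln((1 + n) / (1 + df)) + 1
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Term frequency times IDF, then L2-normalise each document row.
manual = counts * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)

sklearn_tfidf = TfidfVectorizer().fit_transform(docs).toarray()
matches = bool(np.allclose(manual, sklearn_tfidf))
print("Manual TF-IDF matches TfidfVectorizer:", matches)

# A term found in two documents gets a lower IDF than one found in a single document.
vocab = count_vectorizer.vocabulary_
print("idf(strike):", idf[vocab["strike"]], "idf(fire):", idf[vocab["fire"]])
```

Here "strike" appears in two of the three documents and "fire" in one, so "fire" receives the larger IDF weight.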
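
One-Hot Encoding:

One-Hot Encoding is not directly applicable to raw text. It maps each distinct value of a categorical variable to its own binary column, so applied to whole headlines it would create one column per unique headline and share no information between headlines that use the same words. It suits categorical variables or class labels; for the text itself, use BoW, N-grams, or TF-IDF.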
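
A small sketch of what happens when whole headlines are one-hot encoded, using three headlines from the sample output. The `handle_unknown="ignore"` setting is an illustrative choice so that an unseen headline is encoded instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Each headline is treated as one categorical value (a single column of strings).
train_headlines = np.array([
    "air nz strike to affect australian travellers",
    "act fire witnesses must be aware of defamation",
]).reshape(-1, 1)
test_headlines = np.array([
    "air nz staff in aust strike for pay rise",
]).reshape(-1, 1)

# handle_unknown="ignore" encodes unseen values as an all-zero row.
encoder = OneHotEncoder(handle_unknown="ignore")
train_encoded = encoder.fit_transform(train_headlines).toarray()
test_encoded = encoder.transform(test_headlines).toarray()

print(train_encoded)  # one column per distinct training headline
print(test_encoded)   # the unseen headline maps to all zeros
```

The test headline shares "air", "nz" and "strike" with a training headline, yet its encoding is all zeros, so a classifier has nothing to generalise from.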

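Recommendations:

TF-IDF often outperforms raw BoW counts because it down-weights words that are common across the corpus, but this is not guaranteed. In the run above, BoW scored slightly higher (about 0.0038 versus 0.0019), and both values are near zero because the label was publish_date, which has thousands of distinct values and is not a topic.
Obtain or construct real topic labels before comparing feature sets; the dataset provides only publish_date and headline_text.
Experiment with different N-gram ranges based on the size of your dataset and the complexity of language patterns.
Evaluate multiple models and feature sets to find the best combination for your specific task.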
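
One way to compare feature sets side by side is to wrap each vectorizer in a pipeline and cross-validate it. The sketch below assumes topic labels are available; the headlines and the two topics ("sport", "weather") are invented for illustration and do not come from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical headlines and topic labels, invented for illustration only.
headlines = [
    "team wins grand final",
    "coach praises team after win",
    "striker scores twice in final",
    "club signs new striker",
    "captain injured before final",
    "team loses away match",
    "storm warning for coast",
    "heavy rain floods town",
    "heatwave forecast for weekend",
    "cyclone nears northern coast",
    "rain eases after storm",
    "flood warning issued for town",
]
topics = ["sport"] * 6 + ["weather"] * 6

candidates = {
    "BoW unigram": CountVectorizer(ngram_range=(1, 1)),
    "BoW uni+bigram": CountVectorizer(ngram_range=(1, 2)),
    "TF-IDF unigram": TfidfVectorizer(ngram_range=(1, 1)),
    "TF-IDF uni+bigram": TfidfVectorizer(ngram_range=(1, 2)),
}

results = {}
for name, vectorizer in candidates.items():
    # The pipeline refits the vectorizer inside each fold, so test folds stay unseen.
    pipeline = make_pipeline(vectorizer, MultinomialNB())
    scores = cross_val_score(pipeline, headlines, topics, cv=3)
    results[name] = scores.mean()
    print(f"{name}: mean accuracy {results[name]:.2f}")
```

With only twelve toy headlines the scores are not meaningful in themselves; the point is the structure, which carries over once real topic labels exist.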

# One-Hot Encoder

In [12]:
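from sklearn.preprocessing import OneHotEncoder

# One-hot encode the labels as a dense array
# (scikit-learn >= 1.2 names this flag sparse_output; it was formerly sparse)
encoder = OneHotEncoder(sparse_output=False)
y_train_encoded = encoder.fit_transform(y_train.values.reshape(-1, 1))
y_test_encoded = encoder.transform(y_test.values.reshape(-1, 1))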



In [20]:
import nltk
nltk.download('punkt')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\vaish\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [21]:
pip install nltk


Note: you may need to restart the kernel to use updated packages.


In [22]:
from nltk.tokenize import word_tokenize

sentence = "This is a sample sentence."
tokens = word_tokenize(sentence)

print(tokens)


['This', 'is', 'a', 'sample', 'sentence', '.']


In [24]:
import nltk
from nltk.util import ngrams

# Example sentence
sentence = "this is a sample sentence"

# Convert sentence to words
words = nltk.word_tokenize(sentence)

# Create unigrams, bigrams, and trigrams
unigrams = list(ngrams(words, 1))
bigrams = list(ngrams(words, 2))
trigrams = list(ngrams(words, 3))

print("Unigrams:",unigrams)
print("Bigrams:",bigrams)
print("Trigrams:",trigrams)

Unigrams: [('this',), ('is',), ('a',), ('sample',), ('sentence',)]
Bigrams: [('this', 'is'), ('is', 'a'), ('a', 'sample'), ('sample', 'sentence')]
Trigrams: [('this', 'is', 'a'), ('is', 'a', 'sample'), ('a', 'sample', 'sentence')]
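
For reference, the same n-grams can be produced without NLTK by sliding a window over the token list. `make_ngrams` below is a hypothetical helper written for this note, not an NLTK function.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip stops at the shortest slice, which yields exactly len(tokens) - n + 1 windows.
    return list(zip(*(tokens[i:] for i in range(n))))

words = "this is a sample sentence".split()

bigrams = make_ngrams(words, 2)
trigrams = make_ngrams(words, 3)

print("Bigrams:", bigrams)
print("Trigrams:", trigrams)
```

This reproduces the bigram and trigram lists printed above for the same sentence.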


# One-Hot Encoding

In [26]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Sample data with a categorical variable
data = {'Category': ['A', 'B', 'A', 'C', 'B']}
df = pd.DataFrame(data)

# One-Hot Encoding
encoder = OneHotEncoder(sparse_output=False)  # Dense array output; the flag was named sparse before scikit-learn 1.2
encoded_data = encoder.fit_transform(df[['Category']])

# Create a DataFrame with the encoded data
encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(['Category']))

# Concatenate the original DataFrame with the encoded DataFrame
df_encoded = pd.concat([df, encoded_df], axis=1)

print(df_encoded)

  Category  Category_A  Category_B  Category_C
0        A         1.0         0.0         0.0
1        B         0.0         1.0         0.0
2        A         1.0         0.0         0.0
3        C         0.0         0.0         1.0
4        B         0.0         1.0         0.0
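
For a quick look at a single column, `pandas.get_dummies` gives the same table without a separate encoder object. This is an alternative sketch on the same toy data; unlike `OneHotEncoder`, it does not remember the categories for later use on new data.

```python
import pandas as pd

# Same toy categorical column as in the cell above.
df = pd.DataFrame({"Category": ["A", "B", "A", "C", "B"]})

# dtype=float matches the 1.0 / 0.0 values produced by OneHotEncoder.
dummies = pd.get_dummies(df["Category"], prefix="Category", dtype=float)
df_encoded = pd.concat([df, dummies], axis=1)

print(df_encoded)
```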




In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Sample text data
text_data = ['This is a positive example', 'This is a negative example', 'Another positive one', 'Negative example here']

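# Sample labels
# Note: every sample has a unique label, so the held-out label is never seen
# during training and the accuracy below is necessarily 0
labels = [10, 43, 66, 193]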

# Split the data
X_train, X_test, y_train, y_test = train_test_split(text_data, labels, test_size=0.2, random_state=42)

# TF-IDF vectorization
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Train a classifier (e.g., Naive Bayes)
model = MultinomialNB()
model.fit(X_train_tfidf, y_train)

# Make predictions
y_pred = model.predict(X_test_tfidf)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

Accuracy: 0.00


# Bag of Words

In [4]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Load the dataset
dataset_path = "C://Users//vaish//abcnews-date-text.csv"  # Replace with the actual path
df = pd.read_csv(dataset_path)

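# Create a temporary 'category' column with a placeholder value for demonstration
# Note: every row gets the same label, so the accuracy of 1.0 reported below is
# trivial and says nothing about the quality of the features
df['category'] = 'placeholder_category'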

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(df['headline_text'], df['category'], test_size=0.2, random_state=42)

# TF-IDF
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Bag-of-Words (BoW)
vectorizer = CountVectorizer()
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

# Classification using Naive Bayes
clf = MultinomialNB()

# Function to train and evaluate the classifier
def train_and_evaluate(X_train, X_test, y_train, y_test, vectorizer_type):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"\nAccuracy for {vectorizer_type}: {acc:.4f}")

# Evaluate TF-IDF
train_and_evaluate(X_train_tfidf, X_test_tfidf, y_train, y_test, "TF-IDF")

# Evaluate Bag-of-Words
train_and_evaluate(X_train_bow, X_test_bow, y_train, y_test,"Bag-of-Words")


Accuracy for TF-IDF: 1.0000

Accuracy for Bag-of-Words: 1.0000
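
Both scores are 1.0 because the placeholder column contains a single class. As a sanity check, the sketch below shows that a baseline which ignores the features entirely reaches the same score under that condition; the feature matrix here is a dummy array of zeros, used only because the baseline never looks at it.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Single-class labels, as produced by the placeholder 'category' column.
labels = ["placeholder_category"] * 4

# DummyClassifier ignores the features, so an all-zero matrix is enough here.
features = np.zeros((len(labels), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(features, labels)
baseline_accuracy = accuracy_score(labels, baseline.predict(features))

print(f"Baseline accuracy with one class: {baseline_accuracy:.4f}")
```

A feature set is only worth comparing when it beats such a baseline on labels with more than one class.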
