In [74]:
pip install transformers

Note: you may need to restart the kernel to use updated packages.


In [75]:
import pandas as pd
import re
from gensim.parsing.preprocessing import STOPWORDS
from gensim.parsing.preprocessing import strip_tags
from gensim.parsing.preprocessing import strip_numeric
from gensim.parsing.preprocessing import strip_punctuation
from gensim.parsing.preprocessing import strip_multiple_whitespaces
from gensim.parsing.preprocessing import remove_stopwords, strip_short, stem_text
import pickle
import en_core_web_sm
from transformers import BertTokenizer 
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Vic\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Explanation of the imports above

pandas: Used for data manipulation and analysis.

re: Provides support for regular expressions for string manipulation.

gensim: Popular library for text processing and analysis.

pickle: Used for serializing and deserializing Python objects.

en_core_web_sm: Small English pipeline for spaCy, used for English text processing.

transformers.BertTokenizer: Part of the Hugging Face Transformers library, used for tokenization with BERT models.

nltk: Natural Language Toolkit, used for text processing tasks.

nltk.corpus.stopwords: Stopwords corpus from NLTK, used for removing common words in text.

In [76]:
url = "https://raw.githubusercontent.com/selva86/datasets/master/newsgroups.json"
df = pd.read_json(url)

In [77]:
df.head()

Unnamed: 0,content,target,target_names
0,From: lerxst@wam.umd.edu (where's my thing)\nS...,7,rec.autos
1,From: guykuo@carson.u.washington.edu (Guy Kuo)...,4,comp.sys.mac.hardware
2,From: twillis@ec.ecn.purdue.edu (Thomas E Will...,4,comp.sys.mac.hardware
3,From: jgreen@amber (Joe Green)\nSubject: Re: W...,1,comp.graphics
4,From: jcm@head-cfa.harvard.edu (Jonathan McDow...,14,sci.space


What kind of data does it contain?

The dataset contains text data from newsgroup posts. Each row holds the text of one post ('content'), its numeric target category ('target'), and the corresponding category name ('target_names').

How many entries does it contain? --> 11314 entries

In [78]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 11314 entries, 0 to 11313
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   content       11314 non-null  object
 1   target        11314 non-null  int64 
 2   target_names  11314 non-null  object
dtypes: int64(1), object(2)
memory usage: 353.6+ KB


All target names and their distribution are listed below.

In [79]:
print(df['target_names'].value_counts())

target_names
rec.sport.hockey            600
soc.religion.christian      599
rec.motorcycles             598
rec.sport.baseball          597
sci.crypt                   595
rec.autos                   594
sci.med                     594
comp.windows.x              593
sci.space                   593
comp.os.ms-windows.misc     591
sci.electronics             591
comp.sys.ibm.pc.hardware    590
misc.forsale                585
comp.graphics               584
comp.sys.mac.hardware       578
talk.politics.mideast       564
talk.politics.guns          546
alt.atheism                 480
talk.politics.misc          465
talk.religion.misc          377
Name: count, dtype: int64


In [80]:
print(df['content'].iloc[0])

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----







The first 'content' value matches its target name, rec.autos, because the post is about a car.

Which business question can this dataset address?

It can help us understand the discussions, trends, and opinions in the topics covered by the newsgroups, which can then be used for market research.

In [81]:
def preprocess_text(text):  # Takes a single argument: the raw text of one post
    # Regex pattern that matches unwanted header lines and separator lines.
    # It matches lines starting with specific strings (e.g. 'From:', 'Subject:') and lines that
    # start with whitespace followed by one or more hyphens. With re.MULTILINE, ^ and $ anchor
    # at the start and end of every line, so each matching line is removed as a whole.
    # Note: 'XNewsreader:' has no hyphen, so 'X-Newsreader:' header lines are not matched.
    pattern = r'^(From:|Article-I.D.:|Organization:|Lines:|NNTP-Posting-Host:|Distribution:|Reply-To:|XNewsreader:|Expires:|\s+-{1,}|Subject:|Summary:|Keywords:).*$'
    # re.sub() replaces every match with an empty string ''. re.IGNORECASE makes the
    # matching case-insensitive (so 'Nntp-Posting-Host:' is matched as well).
    processed_text = re.sub(pattern, '', text, flags=re.IGNORECASE | re.MULTILINE)
    # Return the processed text with leading and trailing whitespace stripped
    return processed_text.strip()

In [82]:
# Here we apply the "preprocess_text" function to each entry in the 'content' column and store the result in a new column called "data"
df['data'] = df['content'].apply(preprocess_text)

# Display the first entry after preprocessing
print(df['data'].iloc[0])

I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL


Why are we doing this? --> Part B

We are performing this preprocessing to clean the text data by removing metadata lines (starting with specific patterns) and irrelevant words. This step helps focus the analysis on the actual content of the newsgroup posts rather than including irrelevant or repetitive information.
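
To make the header stripping concrete, here is a minimal sketch on a made-up post. The sample text and the shortened header list are assumptions for illustration only, not the notebook's full pattern. It shows that the removal relies on `re.MULTILINE`: without that flag, `^` and `$` anchor only at the start and end of the whole string, and nothing is removed.

```python
import re

# Hypothetical miniature post (not taken from the dataset).
sample = (
    "From: someone@example.com (Some One)\n"
    "Subject: test post\n"
    "Lines: 2\n"
    "\n"
    "Body line one.\n"
    "Body line two.\n"
)

# Shortened header list, assumed for illustration.
header_pattern = r'^(From:|Subject:|Lines:).*$'

# With re.MULTILINE, ^ and $ match at the start and end of every line,
# so each header line is blanked individually.
cleaned = re.sub(header_pattern, '', sample,
                 flags=re.IGNORECASE | re.MULTILINE).strip()

# Without re.MULTILINE the pattern cannot match here, so the text is unchanged.
unchanged = re.sub(header_pattern, '', sample, flags=re.IGNORECASE)

print(cleaned)
print(unchanged == sample)
```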

In [83]:
# Apply preprocessing functions to the 'data' column
df['data'] = df['data'].apply(strip_numeric) # Remove numbers
df['data'] = df['data'].apply(strip_punctuation) # Remove punctuation
df['data'] = df['data'].apply(strip_multiple_whitespaces) # Remove multiple whitespaces

# Display the first entry after preprocessing
print(df['data'].iloc[0])

I was wondering if anyone out there could enlighten me on this car I saw the other day It was a door sports car looked to be from the late s early s It was called a Bricklin The doors were really small In addition the front bumper was separate from the rest of the body This is all I know If anyone can tellme a model name engine specs years of production where this car is made history or whatever info you have on this funky looking car please e mail Thanks IL


Part C

We removed numeric digits, punctuation, and repeated whitespace to put the text in a cleaner format that is more suitable for natural language processing or machine learning tasks. A side effect is visible in the output: "2-door" becomes "door" and "60s" becomes "s".
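
The three cleaning steps can be approximated with the standard library alone. The sketch below is an assumption about how such functions could be written with `re`, not gensim's actual code; the `*_sketch` names are made up for illustration.

```python
import re
import string

def strip_numeric_sketch(s):
    # Delete every run of digits.
    return re.sub(r'[0-9]+', '', s)

def strip_punctuation_sketch(s):
    # Replace every run of ASCII punctuation with a single space.
    return re.sub('[%s]+' % re.escape(string.punctuation), ' ', s)

def strip_multiple_whitespaces_sketch(s):
    # Collapse any run of whitespace (including newlines) into one space.
    return re.sub(r'\s+', ' ', s)

text = "It was a 2-door sports car,\nfrom the late 60s/ early 70s."
step1 = strip_numeric_sketch(text)
step2 = strip_punctuation_sketch(step1)
step3 = strip_multiple_whitespaces_sketch(step2)
print(repr(step3))
```

The order matters: punctuation is replaced by spaces, so collapsing whitespace has to come last.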

In [84]:
# Transform all letters in 'data' to lowercase
df['data'] = df['data'].apply(lambda x: x.lower())

# Display the first entry after transforming to lowercase
print(df['data'].iloc[0])

i was wondering if anyone out there could enlighten me on this car i saw the other day it was a door sports car looked to be from the late s early s it was called a bricklin the doors were really small in addition the front bumper was separate from the rest of the body this is all i know if anyone can tellme a model name engine specs years of production where this car is made history or whatever info you have on this funky looking car please e mail thanks il


In [85]:
# Print and compare the stopwords
print("Gensim Stopwords (sorted):")
print(sorted(STOPWORDS))

nltk_stopwords = set(stopwords.words('english'))
print("\nNLTK Stopwords (sorted):")
print(sorted(nltk_stopwords))

Gensim Stopwords (sorted):
['a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'amoungst', 'amount', 'an', 'and', 'another', 'any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere', 'are', 'around', 'as', 'at', 'back', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'behind', 'being', 'below', 'beside', 'besides', 'between', 'beyond', 'bill', 'both', 'bottom', 'but', 'by', 'call', 'can', 'cannot', 'cant', 'co', 'computer', 'con', 'could', 'couldnt', 'cry', 'de', 'describe', 'detail', 'did', 'didn', 'do', 'does', 'doesn', 'doing', 'don', 'done', 'down', 'due', 'during', 'each', 'eg', 'eight', 'either', 'eleven', 'else', 'elsewhere', 'empty', 'enough', 'etc', 'even', 'ever', 'every', 'everyone', 'everything', 'everywhere', 'except', 'few', 'fifteen', 'fifty', 'fill', 'find', 'fire', 'first', 'five', 'for', 'former',

In [86]:
# Check if negations are included in the nltk stopwords
negations = ['not', 'no', 'nor', 'neither', 'never', 'none', 'cannot', 'could not', 'would not', 'should not']
for neg in negations:
    if neg in nltk_stopwords:
        print(f"\n'{neg}' is included in NLTK stopwords.")
    else:
        print(f"\n'{neg}' is not included in NLTK stopwords.")


'not' is included in NLTK stopwords.

'no' is included in NLTK stopwords.

'nor' is included in NLTK stopwords.

'neither' is not included in NLTK stopwords.

'never' is not included in NLTK stopwords.

'none' is not included in NLTK stopwords.

'cannot' is not included in NLTK stopwords.

'could not' is not included in NLTK stopwords.

'would not' is not included in NLTK stopwords.

'should not' is not included in NLTK stopwords.


Part D -- Is it reasonable to include negations in stopwords?

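Including negations in stopwords can be reasonable for topic-oriented tasks, where they add little topical meaning. It is risky for tasks where negation changes the meaning, such as sentiment analysis ("not good" vs. "good"). For that reason, the next cell removes the negations from the stopword list so that they stay in the text.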
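
A minimal sketch of that trade-off, using tiny made-up stopword sets and a hypothetical helper (`remove_stopwords_sketch`) rather than the NLTK list:

```python
def remove_stopwords_sketch(text, stopword_set):
    # Keep only the tokens that are not in the stopword set.
    return " ".join(w for w in text.split() if w not in stopword_set)

# Tiny hypothetical stopword sets, for illustration only.
with_negations = {"the", "is", "was", "not", "no"}
without_negations = with_negations - {"not", "no"}

review = "the car was not good"
negation_dropped = remove_stopwords_sketch(review, with_negations)
negation_kept = remove_stopwords_sketch(review, without_negations)

print(negation_dropped)
print(negation_kept)
```

With negations treated as stopwords, the sentence collapses to the opposite of what was said.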

In [87]:
# Stopwords to be removed
stopwords_to_remove = [
    "aren't", "isn", "isn't", "mightn", "mightn't", "mustn", "mustn't",
    "needn", "needn't", "no", "nor", "not", "shan't", "shouldn", "shouldn't",
    "wasn", "wasn't", "weren't", "wouldn", "wouldn't"
]

# Adjust nltk_stopwords list by removing specified stopwords
for word in stopwords_to_remove:          # Iterate through this list and use the remove() method to remove each word from the nltk_stopwords set
    if word in nltk_stopwords:
        nltk_stopwords.remove(word)

print("\nAdjusted NLTK Stopwords (sorted):")
print(sorted(nltk_stopwords))


Adjusted NLTK Stopwords (sorted):
['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in', 'into', 'is', 'it', "it's", 'its', 'itself', 'just', 'll', 'm', 'ma', 'me', 'more', 'most', 'my', 'myself', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', 'she', "she's", 'should', "should've", 'so', 'some', 'such', 't', 'than', 'that', "that'll", 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'there', '

In [88]:
# Remove stopwords using gensim's remove_stopwords with nltk_stopwords
df['data'] = df['data'].apply(lambda x: remove_stopwords(x, stopwords=nltk_stopwords))

# Display the first entry after removing stopwords
print(df['data'][0])

wondering anyone could enlighten car saw day door sports car looked late early called bricklin doors really small addition front bumper separate rest body know anyone tellme model name engine specs years production car made history whatever info funky looking car please e mail thanks il


In [89]:
# Apply strip_short to remove words with length less than 3
df['data'] = df['data'].apply(lambda x: strip_short(x, minsize = 3))

# Display the first entry after applying strip_short
print(df['data'][0])

wondering anyone could enlighten car saw day door sports car looked late early called bricklin doors really small addition front bumper separate rest body know anyone tellme model name engine specs years production car made history whatever info funky looking car please mail thanks


Part D 10 --> What does the strip_short function do and why?

It removes tokens (words or parts of words) that are shorter than a specified length (here 3 characters), set with the parameter 'minsize'.

Purpose: Noise reduction, because short tokens like "a", "an", "is", "it", etc., often do not carry significant meaning in many natural language processing tasks. Removing them helps reduce noise in the text data. In the first post, for example, the leftover tokens "e" and "il" were removed.

Also, by removing short tokens, the focus shifts to longer tokens that may carry more semantic value in the context of your task.

Furthermore, removing short tokens can lead to better model performance in many cases, although short but informative tokens (such as abbreviations) are lost as well.
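
The length filter itself is simple. The sketch below is an assumed stand-in for `strip_short` (the name `strip_short_sketch` is made up) that keeps only tokens with at least `minsize` characters:

```python
def strip_short_sketch(text, minsize=3):
    # Keep tokens whose length is at least `minsize`.
    return " ".join(tok for tok in text.split() if len(tok) >= minsize)

before = "car please e mail thanks il"
after = strip_short_sketch(before, minsize=3)
print(after)
```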


In [90]:
######
df['data']

0        wondering anyone could enlighten car saw day d...
1        fair number brave souls upgraded clock oscilla...
2        well folks mac plus finally gave ghost weekend...
3        newsreader tin version robert kyanko rob rjck ...
4        article cowcb world std com tombaker world std...
                               ...                        
11309    nyeda cnsvax uwec edu david nye neurology cons...
11310    old mac mac plus problem screens blank sometim...
11311    newsreader tin version installed cpu clone mot...
11312    article qkgbuinnsn shelley washington edu bols...
11313    stolen pasadena blue white honda cbrrr califor...
Name: data, Length: 11314, dtype: object

In [91]:
from nltk.stem import PorterStemmer
# Initialize PorterStemmer
stemmer = PorterStemmer()

# Define a function to apply stemming to a text
# (this redefines the name stem_text that was imported from gensim above)
def stem_text(text):
    words = text.split()
    stemmed_words = [stemmer.stem(word) for word in words]
    return ' '.join(stemmed_words)

# Apply stemming to the 'data' column and store the result as 'data_stem'
df['data_stem'] = df['data'].apply(lambda x: stem_text(x))

print(df['data_stem'])

0        wonder anyon could enlighten car saw day door ...
1        fair number brave soul upgrad clock oscil shar...
2        well folk mac plu final gave ghost weekend sta...
3        newsread tin version robert kyanko rob rjck uu...
4        articl cowcb world std com tombak world std co...
                               ...                        
11309    nyeda cnsvax uwec edu david nye neurolog consu...
11310    old mac mac plu problem screen blank sometim m...
11311    newsread tin version instal cpu clone motherbo...
11312    articl qkgbuinnsn shelley washington edu bolso...
11313    stolen pasadena blue white honda cbrrr califor...
Name: data_stem, Length: 11314, dtype: object


In [92]:
data_stem = df['data_stem']

In [94]:
#data_stem = stem_text(df['data'])

#print(data_stem)

Part E 11 --> Here we applied stem_text, which stems each word in the text, reducing it to a base or root form (e.g. 'running' becomes 'run' and 'fishing' becomes 'fish'). In other words, stemming normalizes words by converting different inflected or derived forms of a word to a common base form. This reduces vocabulary size and improves the efficiency of text processing algorithms. The stems are not always dictionary words (e.g. 'anyon' and 'plu' in the output above).

Other benefits: Improved model generalization, because different forms of a word are treated as the same token, which can capture the shared meaning more effectively.
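
To illustrate the vocabulary reduction, here is a deliberately naive suffix stripper. It is an assumption made for illustration and is not the Porter algorithm used above; `toy_stem` and its suffix list are hypothetical.

```python
def toy_stem(word):
    # Strip one of a few common suffixes if at least 3 characters remain.
    # Naive stand-in for illustration only, NOT the Porter stemmer.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

tokens = ["door", "doors", "looked", "looking", "look", "car", "cars"]
stems = [toy_stem(t) for t in tokens]

vocab_before = set(tokens)
vocab_after = set(stems)
print(len(vocab_before), "->", len(vocab_after))
```

Seven distinct surface forms collapse to three stems, which is the effect that shrinks the feature space later on.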

In [95]:
import spacy
# Initialize Spacy 'en' model
nlp = spacy.load('en_core_web_sm')

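# Data to be lemmatized
# Note: df['data'][0] is a single string, so the loop below iterates over its
# characters rather than over documents (hence the one-letter "documents" in the output).
data = df['data'][0]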

# Initialize the list to store lemmatized documents
data_lem = []

# Apply lemmatization using Spacy
for doc in data:
    # Process the document using Spacy
    doc_spacy = nlp(doc)
    
    # Extract lemmas for each token and join them back
    lemmas = [token.lemma_ for token in doc_spacy]
    doc_lem = " ".join(lemmas)
    
    # Append the lemmatized document to data_lem list
    data_lem.append(doc_lem)

# Print the lemmatized documents
for idx, doc in enumerate(data_lem):
    print(f"Lemmatized Document {idx + 1}: {doc}")

Lemmatized Document 1: w
Lemmatized Document 2: o
Lemmatized Document 3: n
Lemmatized Document 4: d
Lemmatized Document 5: e
Lemmatized Document 6: r
Lemmatized Document 7: I
Lemmatized Document 8: n
Lemmatized Document 9: g
Lemmatized Document 10:  
Lemmatized Document 11: a
Lemmatized Document 12: n
Lemmatized Document 13: y
Lemmatized Document 14: o
Lemmatized Document 15: n
Lemmatized Document 16: e
Lemmatized Document 17:  
Lemmatized Document 18: c
Lemmatized Document 19: o
Lemmatized Document 20: u
Lemmatized Document 21: l
Lemmatized Document 22: d
Lemmatized Document 23:  
Lemmatized Document 24: e
Lemmatized Document 25: n
Lemmatized Document 26: l
Lemmatized Document 27: I
Lemmatized Document 28: g
Lemmatized Document 29: h
Lemmatized Document 30: t
Lemmatized Document 31: e
Lemmatized Document 32: n
Lemmatized Document 33:  
Lemmatized Document 34: c
Lemmatized Document 35: a
Lemmatized Document 36: r
Lemmatized Document 37:  
Lemmatized Document 38: s
Lemmatized Document 3
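
The one-letter "documents" in the output above are what a `for` loop produces when it runs over a single string: iterating over a `str` yields its characters. The sketch below shows the difference with a hypothetical stand-in function (`process`) in place of the spaCy call; wrapping the string in a list, or passing a list of strings, yields whole documents.

```python
def process(doc):
    # Hypothetical stand-in for a per-document step such as nlp(doc).
    return doc.upper()

single_doc = "wondering anyone could"

# Iterating over a str yields characters, so each "document" is one character.
per_char = [process(d) for d in single_doc]

# Iterating over a list of strings yields whole documents.
per_doc = [process(d) for d in [single_doc]]

print(per_char[:5])
print(per_doc)
```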

In [96]:
# Load the BERT tokenizer
bert_uncased = BertTokenizer.from_pretrained('bert-base-uncased')

# Print the vocabulary size
print("Vocabulary size:", len(bert_uncased.vocab))

# Print some tokens from the vocabulary
print(list(bert_uncased.vocab.keys())[1997:2100])

Vocabulary size: 30522
['of', 'and', 'in', 'to', 'was', 'he', 'is', 'as', 'for', 'on', 'with', 'that', 'it', 'his', 'by', 'at', 'from', 'her', '##s', 'she', 'you', 'had', 'an', 'were', 'but', 'be', 'this', 'are', 'not', 'my', 'they', 'one', 'which', 'or', 'have', 'him', 'me', 'first', 'all', 'also', 'their', 'has', 'up', 'who', 'out', 'been', 'when', 'after', 'there', 'into', 'new', 'two', 'its', '##a', 'time', 'would', 'no', 'what', 'about', 'said', 'we', 'over', 'then', 'other', 'so', 'more', '##e', 'can', 'if', 'like', 'back', 'them', 'only', 'some', 'could', '##i', 'where', 'just', '##ing', 'during', 'before', '##n', 'do', '##o', 'made', 'school', 'through', 'than', 'now', 'years', 'most', 'world', 'may', 'between', 'down', 'well', 'three', '##d', 'year', 'while', 'will', '##ed', '##r']


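Part E 13.a.: What is the vocabulary size? It is 30,522 tokens. These are WordPiece tokens, which include whole words as well as subword pieces such as '##s' and '##ing'.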
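
The '##' entries in the vocabulary are continuation pieces. A minimal sketch of greedy longest-match-first splitting, the idea behind WordPiece tokenization, is shown below. The toy vocabulary and the function name are assumptions for illustration; this is not the Hugging Face implementation.

```python
def wordpiece_sketch(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary, assumed for illustration.
toy_vocab = {"wonder", "##ing", "car", "##s", "door"}

print(wordpiece_sketch("wondering", toy_vocab))
print(wordpiece_sketch("cars", toy_vocab))
print(wordpiece_sketch("bricklin", toy_vocab))
```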

In [97]:
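# This is our data
# Note: this is a single string, so the loop below tokenizes it character by
# character, which is why data_BERT[0] is just 'w'.
data = df['data'][0]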

# Initialize the list to store BERT-tokenized data
data_BERT = []

# Apply BERT-tokenization to each text
for doc in data:
    # Tokenize the text using BERT tokenizer
    tokens = bert_uncased.tokenize(doc)
    
    # Join tokens back to form a string
    tokenized_text = " ".join(tokens)
    
    # Append the tokenized text to data_BERT list
    data_BERT.append(tokenized_text)

# Print the first tokenized document
print("Tokenized Document 0:", data_BERT[0])

Tokenized Document 0: w


In [98]:
data_BERT[0]  

'w'

In [99]:
data

'wondering anyone could enlighten car saw day door sports car looked late early called bricklin doors really small addition front bumper separate rest body know anyone tellme model name engine specs years production car made history whatever info funky looking car please mail thanks'

In [100]:
print(data_lem)

['w', 'o', 'n', 'd', 'e', 'r', 'I', 'n', 'g', ' ', 'a', 'n', 'y', 'o', 'n', 'e', ' ', 'c', 'o', 'u', 'l', 'd', ' ', 'e', 'n', 'l', 'I', 'g', 'h', 't', 'e', 'n', ' ', 'c', 'a', 'r', ' ', 's', 'a', 'w', ' ', 'd', 'a', 'y', ' ', 'd', 'o', 'o', 'r', ' ', 's', 'p', 'o', 'r', 't', 's', ' ', 'c', 'a', 'r', ' ', 'l', 'o', 'o', 'k', 'e', 'd', ' ', 'l', 'a', 't', 'e', ' ', 'e', 'a', 'r', 'l', 'y', ' ', 'c', 'a', 'l', 'l', 'e', 'd', ' ', 'b', 'r', 'I', 'c', 'k', 'l', 'I', 'n', ' ', 'd', 'o', 'o', 'r', 's', ' ', 'r', 'e', 'a', 'l', 'l', 'y', ' ', 's', 'm', 'a', 'l', 'l', ' ', 'a', 'd', 'd', 'I', 't', 'I', 'o', 'n', ' ', 'f', 'r', 'o', 'n', 't', ' ', 'b', 'u', 'm', 'p', 'e', 'r', ' ', 's', 'e', 'p', 'a', 'r', 'a', 't', 'e', ' ', 'r', 'e', 's', 't', ' ', 'b', 'o', 'd', 'y', ' ', 'k', 'n', 'o', 'w', ' ', 'a', 'n', 'y', 'o', 'n', 'e', ' ', 't', 'e', 'l', 'l', 'm', 'e', ' ', 'm', 'o', 'd', 'e', 'l', ' ', 'n', 'a', 'm', 'e', ' ', 'e', 'n', 'g', 'I', 'n', 'e', ' ', 's', 'p', 'e', 'c', 's', ' ', 'y', 'e',

In [101]:
print(data_stem)

0        wonder anyon could enlighten car saw day door ...
1        fair number brave soul upgrad clock oscil shar...
2        well folk mac plu final gave ghost weekend sta...
3        newsread tin version robert kyanko rob rjck uu...
4        articl cowcb world std com tombak world std co...
                               ...                        
11309    nyeda cnsvax uwec edu david nye neurolog consu...
11310    old mac mac plu problem screen blank sometim m...
11311    newsread tin version instal cpu clone motherbo...
11312    articl qkgbuinnsn shelley washington edu bolso...
11313    stolen pasadena blue white honda cbrrr califor...
Name: data_stem, Length: 11314, dtype: object


In [102]:
print(data)

wondering anyone could enlighten car saw day door sports car looked late early called bricklin doors really small addition front bumper separate rest body know anyone tellme model name engine specs years production car made history whatever info funky looking car please mail thanks


In [103]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import Binarizer
from gensim.corpora import Dictionary
from gensim.models import TfidfModel

3. Bag of Words (sklearn)

In [106]:
# Initialize CountVectorizer
count_vectorizer = CountVectorizer(max_df=0.95, min_df=0.05)

# Fit and transform the documents using CountVectorizer
count_matrix = count_vectorizer.fit_transform(data_stem)

# Get feature names
count_feature_names = count_vectorizer.get_feature_names_out()

# Print number of features and first entry
print("Number of features in CountVectorizer:", len(count_feature_names))
print("First entry in CountVectorizer features:", count_feature_names)

Number of features in CountVectorizer: 300
First entry in CountVectorizer features: ['abl' 'accept' 'actual' 'address' 'advanc' 'ago' 'agre' 'allow' 'alreadi'
 'also' 'alway' 'american' 'anoth' 'answer' 'anyon' 'anyth' 'anyway'
 'appreci' 'apr' 'area' 'around' 'articl' 'ask' 'assum' 'avail' 'away'
 'back' 'bad' 'base' 'becom' 'believ' 'best' 'better' 'big' 'bit' 'book'
 'buy' 'call' 'cannot' 'car' 'card' 'care' 'case' 'caus' 'chang' 'check'
 'christian' 'claim' 'close' 'com' 'come' 'complet' 'comput' 'consid'
 'control' 'correct' 'cost' 'could' 'cours' 'creat' 'current' 'data'
 'david' 'day' 'design' 'differ' 'discuss' 'done' 'drive' 'edu' 'effect'
 'either' 'els' 'email' 'end' 'engin' 'enough' 'etc' 'even' 'ever' 'everi'
 'exampl' 'except' 'exist' 'expect' 'experi' 'fact' 'far' 'fax' 'feel'
 'file' 'find' 'first' 'follow' 'forc' 'found' 'free' 'full' 'game'
 'gener' 'get' 'give' 'given' 'go' 'god' 'good' 'got' 'govern' 'great'
 'group' 'guess' 'hand' 'happen' 'hard' 'heard' 'help' 'hi

 CountVectorizer converts a collection of text documents into a matrix of token counts. Here, max_df and min_df are used to filter out terms based on their document frequency in the corpus. Terms that occur in more than 95% of the documents (max_df=0.95) or less than 5% of the documents (min_df=0.05) are ignored.

Pro: Simple to understand and implement. It records exactly how often each term occurs in each document.
Con: Doesn't consider how important or distinctive a term is across the corpus.
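
The effect of max_df and min_df can be reproduced on a toy corpus. The sketch below is a plain-Python approximation of document-frequency filtering with proportional thresholds; the function name, the toy documents, and the threshold values are assumptions for illustration.

```python
from collections import Counter

def filter_by_document_frequency(docs, min_df=0.05, max_df=0.95):
    """Keep terms whose document-frequency share lies in [min_df, max_df]."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))  # count each term once per document
    return sorted(
        term for term, df in doc_freq.items()
        if min_df * n_docs <= df <= max_df * n_docs
    )

# Toy corpus, assumed for illustration.
toy_docs = ["car engin", "car engin door", "car door", "car game"]
kept = filter_by_document_frequency(toy_docs, min_df=0.5, max_df=0.9)
print(kept)
```

Here 'car' is dropped because it appears in every document, and 'game' is dropped because it appears in only one of four.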

In [107]:
# Initialize TfidfVectorizer without IDF and with L1 normalization
tfidf_vectorizer_l1 = TfidfVectorizer(max_df=0.95, min_df=0.05, use_idf=False, norm='l1')

# Fit and transform the documents using TfidfVectorizer with L1 normalization
tfidf_matrix_l1 = tfidf_vectorizer_l1.fit_transform(data_stem)

# Get feature names
tfidf_feature_names_l1 = tfidf_vectorizer_l1.get_feature_names_out()

# Print number of features and first entry
print("Number of features in TfidfVectorizer (L1):", len(tfidf_feature_names_l1))
print("First entry in TfidfVectorizer (L1) features:", tfidf_feature_names_l1)

Number of features in TfidfVectorizer (L1): 300
First entry in TfidfVectorizer (L1) features: ['abl' 'accept' 'actual' 'address' 'advanc' 'ago' 'agre' 'allow' 'alreadi'
 'also' 'alway' 'american' 'anoth' 'answer' 'anyon' 'anyth' 'anyway'
 'appreci' 'apr' 'area' 'around' 'articl' 'ask' 'assum' 'avail' 'away'
 'back' 'bad' 'base' 'becom' 'believ' 'best' 'better' 'big' 'bit' 'book'
 'buy' 'call' 'cannot' 'car' 'card' 'care' 'case' 'caus' 'chang' 'check'
 'christian' 'claim' 'close' 'com' 'come' 'complet' 'comput' 'consid'
 'control' 'correct' 'cost' 'could' 'cours' 'creat' 'current' 'data'
 'david' 'day' 'design' 'differ' 'discuss' 'done' 'drive' 'edu' 'effect'
 'either' 'els' 'email' 'end' 'engin' 'enough' 'etc' 'even' 'ever' 'everi'
 'exampl' 'except' 'exist' 'expect' 'experi' 'fact' 'far' 'fax' 'feel'
 'file' 'find' 'first' 'follow' 'forc' 'found' 'free' 'full' 'game'
 'gener' 'get' 'give' 'given' 'go' 'god' 'good' 'got' 'govern' 'great'
 'group' 'guess' 'hand' 'happen' 'hard' 'heard' 

TfidfVectorizer transforms text into a matrix of TF-IDF features. Here, use_idf=False means we are not using Inverse Document Frequency (IDF) weighting, and norm='l1' normalizes each output row to have unit L1 norm.

Pro: Captures term frequency while accounting for document length differences (through L1 normalization, each row sums to 1).

Con: Doesn't consider IDF, so rare, distinctive terms get no extra weight compared with common ones.
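
A minimal sketch of L1 normalization on raw counts, assuming two made-up documents with the same term proportions but different lengths:

```python
def l1_normalize(counts):
    # Divide each count by the sum of absolute values so the row sums to 1.
    total = sum(abs(c) for c in counts)
    if total == 0:
        return list(counts)
    return [c / total for c in counts]

# Hypothetical count rows: the long document is the short one repeated ten times.
short_doc = [2, 1, 1]
long_doc = [20, 10, 10]

short_norm = l1_normalize(short_doc)
long_norm = l1_normalize(long_doc)
print(short_norm, long_norm)
```

After normalization both rows are identical, which is what "accounting for document length" means here.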

In [108]:
# Initialize TfidfVectorizer without IDF smoothing
tfidf_vectorizer_no_smooth = TfidfVectorizer(max_df=0.95, min_df=0.05, smooth_idf=False)

# Fit and transform the documents using TfidfVectorizer without IDF smoothing
tfidf_matrix_no_smooth = tfidf_vectorizer_no_smooth.fit_transform(data_stem)

# Get feature names
tfidf_feature_names_no_smooth = tfidf_vectorizer_no_smooth.get_feature_names_out()

# Print number of features and first entry
print("Number of features in TfidfVectorizer (no smoothing):", len(tfidf_feature_names_no_smooth))
print("First entry in TfidfVectorizer (no smoothing) features:", tfidf_feature_names_no_smooth)

Number of features in TfidfVectorizer (no smoothing): 300
First entry in TfidfVectorizer (no smoothing) features: ['abl' 'accept' 'actual' 'address' 'advanc' 'ago' 'agre' 'allow' 'alreadi'
 'also' 'alway' 'american' 'anoth' 'answer' 'anyon' 'anyth' 'anyway'
 'appreci' 'apr' 'area' 'around' 'articl' 'ask' 'assum' 'avail' 'away'
 'back' 'bad' 'base' 'becom' 'believ' 'best' 'better' 'big' 'bit' 'book'
 'buy' 'call' 'cannot' 'car' 'card' 'care' 'case' 'caus' 'chang' 'check'
 'christian' 'claim' 'close' 'com' 'come' 'complet' 'comput' 'consid'
 'control' 'correct' 'cost' 'could' 'cours' 'creat' 'current' 'data'
 'david' 'day' 'design' 'differ' 'discuss' 'done' 'drive' 'edu' 'effect'
 'either' 'els' 'email' 'end' 'engin' 'enough' 'etc' 'even' 'ever' 'everi'
 'exampl' 'except' 'exist' 'expect' 'experi' 'fact' 'far' 'fax' 'feel'
 'file' 'find' 'first' 'follow' 'forc' 'found' 'free' 'full' 'game'
 'gener' 'get' 'give' 'given' 'go' 'god' 'good' 'got' 'govern' 'great'
 'group' 'guess' 'hand' 'hap

Unlike the previous TfidfVectorizer, this one uses IDF weighting (use_idf defaults to True) and the default L2 normalization. smooth_idf=False turns off IDF smoothing, that is, the extra 1 added to the document counts in the IDF formula.

Pro: It provides TF-IDF weights based on the actual document frequencies, without the smoothing adjustment.

Con: Without IDF smoothing, rare terms get somewhat higher weights and might be overly emphasized.
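
The difference can be checked numerically. The sketch below uses the IDF formulas given in the scikit-learn documentation: ln((1 + n) / (1 + df)) + 1 with smoothing and ln(n / df) + 1 without. The function name and the example document frequencies are assumptions for illustration.

```python
import math

def idf(n_docs, doc_freq, smooth=True):
    # Smoothed: acts as if one extra document contained every term once.
    if smooth:
        return math.log((1 + n_docs) / (1 + doc_freq)) + 1
    return math.log(n_docs / doc_freq) + 1

n_docs = 11314  # number of posts in the dataset

# Hypothetical document frequencies: a rare term and a common term.
for doc_freq in (1, 600, 11000):
    print(doc_freq, round(idf(n_docs, doc_freq, True), 3),
          round(idf(n_docs, doc_freq, False), 3))
```

The gap between the two variants is largest for very rare terms and nearly vanishes for common ones.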

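The feature names are the same for all three vectorizers. This is expected: the vocabulary depends only on the tokenization and on max_df and min_df, which are identical in each case, so only the weights in the matrices differ. Changing max_df and min_df changes which terms are kept during vectorization. Lowering max_df or raising min_df excludes more terms, while raising max_df or lowering min_df includes more terms in the feature set.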

In [109]:
from sklearn.preprocessing import Binarizer

# Initialize Binarizer
binarizer = Binarizer()

# Apply Binarizer to the count matrix
binary_matrix = binarizer.fit_transform(count_matrix)

# Get feature names from CountVectorizer
feature_names = count_vectorizer.get_feature_names_out()

# Print the binary matrix and feature names
print("Binary Matrix:")
print(binary_matrix.toarray())
print("\nFeature Names:")
print(feature_names)


Binary Matrix:
[[0 0 0 ... 0 1 0]
 [0 0 0 ... 0 0 0]
 [0 0 1 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 1]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

Feature Names:
['abl' 'accept' 'actual' 'address' 'advanc' 'ago' 'agre' 'allow' 'alreadi'
 'also' 'alway' 'american' 'anoth' 'answer' 'anyon' 'anyth' 'anyway'
 'appreci' 'apr' 'area' 'around' 'articl' 'ask' 'assum' 'avail' 'away'
 'back' 'bad' 'base' 'becom' 'believ' 'best' 'better' 'big' 'bit' 'book'
 'buy' 'call' 'cannot' 'car' 'card' 'care' 'case' 'caus' 'chang' 'check'
 'christian' 'claim' 'close' 'com' 'come' 'complet' 'comput' 'consid'
 'control' 'correct' 'cost' 'could' 'cours' 'creat' 'current' 'data'
 'david' 'day' 'design' 'differ' 'discuss' 'done' 'drive' 'edu' 'effect'
 'either' 'els' 'email' 'end' 'engin' 'enough' 'etc' 'even' 'ever' 'everi'
 'exampl' 'except' 'exist' 'expect' 'experi' 'fact' 'far' 'fax' 'feel'
 'file' 'find' 'first' 'follow' 'forc' 'found' 'free' 'full' 'game'
 'gener' 'get' 'give' 'given' 'go' 'god' 'good' 'got' 'govern'

When you apply Binarizer() to the result from a CountVectorizer, it converts the count matrix into a binary matrix where non-zero counts are transformed to 1, and zero counts remain as 0.

Pros:

Simplicity: The binary representation simplifies the data by removing the frequency information, which can sometimes be noisy or less relevant.

Ease of Use: Binary data is often easier to work with for certain machine learning algorithms, especially those that require binary input (e.g., association rule mining, certain types of clustering).

Cons:

Loss of information: Binary representation loses the information about word frequency, which can be valuable in some natural language processing (NLP) tasks.

Lack of context: It doesn't capture the degree of word importance or relevance within documents, which can be important for tasks like sentiment analysis, document classification, etc.
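The loss of information can be seen on two made-up documents that use the same words with different frequencies (the documents are an assumption for illustration). This sketch uses the binary=True option of CountVectorizer, which is an alternative to applying Binarizer afterwards and is not what the notebook itself uses.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical documents: same vocabulary, different emphasis
toy_docs = ["car car car engine", "car engine engine engine"]

# Plain counts keep the difference between the documents
count_rows = CountVectorizer().fit_transform(toy_docs).toarray()
print(count_rows)
# [[3 1]
#  [1 3]]

# Binary indicators make the two documents indistinguishable
binary_rows = CountVectorizer(binary=True).fit_transform(toy_docs).toarray()
print(binary_rows)
# [[1 1]
#  [1 1]]
```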

In [115]:
# Convert the sparse matrices to dense arrays
count_array = count_matrix.toarray()
binary_array = binary_matrix.toarray()

# Extract the counts and binary indicators for the first document
first_doc_count = count_array[0]
first_doc_binary = binary_array[0]

# Create a DataFrame with feature names, counts, and binary values for the first document
# (named first_doc_df so the original newsgroups df is not overwritten)
first_doc_df = pd.DataFrame({
    'keys': feature_names,
    'Count_Frequency': first_doc_count,
    'Binary_Value': first_doc_binary,
})

# Keep only rows for words existing in the first document
first_doc_df = first_doc_df[first_doc_df['Count_Frequency'] > 0]

# Sort by binary value (descending); every remaining value is 1, so this does not rank terms
sorted_by_binary = first_doc_df.sort_values(by='Binary_Value', ascending=False)

# Sort by absolute frequency (descending)
sorted_by_count = first_doc_df.sort_values(by='Count_Frequency', ascending=False)

print("Sorted by Binary Values:")
print(sorted_by_binary)

print("\nSorted by Absolute Frequencies:")
print(sorted_by_count)

Sorted by Binary Values:
       keys  Count_Frequency  Binary_Value
14    anyon                2             1
153    mail                1             1
289  wonder                1             1
261   thank                1             1
241   small                1             1
216  realli                1             1
198   pleas                1             1
170    name                1             1
152    made                1             1
37     call                1             1
149    look                2             1
134    know                1             1
75    engin                1             1
63      day                1             1
57    could                1             1
39      car                4             1
298    year                1             1

Sorted by Absolute Frequencies:
       keys  Count_Frequency  Binary_Value
39      car                4                

4. Bag of Words (gensim)

In [112]:
corpus_gen = [doc.split() for doc in data_stem]

This command builds a tokenized corpus for Gensim using a list comprehension. Here's what each part does:

for doc in data_stem: Iterates through each document in the data_stem variable.

doc.split(): Splits each document into a list of words on whitespace (assuming data_stem contains text strings).

[...]: Collects the results into a list of lists, where each inner list holds one document's words.

The corpus_gen variable now contains one list of tokens per document, which is the input format the Gensim Dictionary expects.
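As a quick illustration of the resulting shape, here is the same list comprehension applied to two made-up stemmed strings (the toy_ names and strings are assumptions, standing in for data_stem). Note that split() with no arguments also collapses repeated whitespace.

```python
# Hypothetical stand-in for data_stem: already cleaned and stemmed strings
toy_stemmed = ["car engin fast", "anyon know  car"]

# One list of tokens per document
toy_tokens = [doc.split() for doc in toy_stemmed]
print(toy_tokens)
# [['car', 'engin', 'fast'], ['anyon', 'know', 'car']]
```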

In [113]:
from gensim.corpora import Dictionary

# Create Dictionary from corpus_gen
id2word = Dictionary(corpus_gen)

# Filter extremes
id2word.filter_extremes(no_below=566, no_above=0.95)

Create a Gensim Dictionary from the corpus:

id2word = Dictionary(corpus_gen): This line creates a Gensim Dictionary object (id2word) based on the corpus_gen we created earlier. The Dictionary object maps each unique word in the corpus to a unique integer ID.

Apply the filter_extremes method:

id2word.filter_extremes(no_below=566, no_above=0.95): This method call filters out tokens (words) that appear in too few or too many documents of the corpus.

no_below=566: Specifies that tokens appearing in fewer than 566 documents will be removed. This drops words that occur too rarely to contribute much to the model.

no_above=0.95: Specifies that tokens appearing in more than 95% of the documents will be removed. This helps remove very common words (like stopwords) that may not be informative.

After applying filter_extremes, the id2word Dictionary object will contain only tokens that are neither too rare nor too common based on the specified thresholds. This cleaned dictionary is then used for further processing or modeling in Gensim.