# Text Processing Basics for Fake News Detection

This notebook covers the fundamental text processing concepts required for our fake news detection project:

1. **Tokenization**: Breaking text into individual words or tokens
2. **Stopword Removal**: Removing common words that don't carry much meaning
3. **Vectorization**: Converting text into a numerical format for machine learning

Let's start by importing the necessary libraries and loading our sample data.

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import nltk
import re
from collections import Counter

# NLTK specific imports
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet

# Scikit-learn imports for vectorization
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

# Download required NLTK data
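nltk.download('punkt')
nltk.download('punkt_tab')  # Tokenizer data used by newer NLTK releases; older releases simply report it as unavailable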
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('omw-1.4')

print("Libraries imported successfully!")

In [None]:
# Load our sample data
df = pd.read_csv('../data/sample_data.csv')
print("Dataset shape:", df.shape)
print("\nFirst few rows:")
df.head()

In [None]:
# Check class distribution
print("Class distribution:")
print(df['label'].value_counts())
print("\n0: Real News")
print("1: Fake News")

# Visualize class distribution
plt.figure(figsize=(8, 6))
sns.countplot(data=df, x='label')
plt.title('Distribution of Real vs Fake News')
plt.xlabel('Label (0=Real, 1=Fake)')
plt.ylabel('Count')
plt.show()

## 1. Tokenization

**Tokenization** is the process of breaking down text into individual words, phrases, symbols, or other meaningful elements called tokens.

### Types of Tokenization:
- **Word Tokenization**: Splitting text into individual words
- **Sentence Tokenization**: Splitting text into sentences
- **Subword Tokenization**: Breaking words into smaller units
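
Subword tokenization is not demonstrated elsewhere in this notebook, so here is a minimal sketch using character n-grams, one simple way to break a word into smaller units. The function name `char_ngrams` and the `<`/`>` boundary markers are illustrative assumptions; trained subword tokenizers such as BPE or WordPiece work differently.

```python
def char_ngrams(word, n=3):
    """Split a word into overlapping character n-grams (illustrative helper)."""
    # Boundary markers let prefixes and suffixes be told apart from word-internal n-grams
    padded = f"<{word}>"
    # Words shorter than n (after padding) yield an empty list
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("news"))  # ['<ne', 'new', 'ews', 'ws>']
```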

### Why is Tokenization Important?
- Machine learning models operate on numbers, so raw text must first be split into units that can be counted or encoded
- It's the first step in text preprocessing
- Helps in feature extraction and analysis

In [None]:
# Example text for demonstration
sample_text = df.iloc[0]['text']
print("Original text:")
print(sample_text)
print("\n" + "="*50)

In [None]:
# 1.1 Simple tokenization using split()
simple_tokens = sample_text.split()
print("Simple tokenization using split():")
print(simple_tokens)
print(f"Number of tokens: {len(simple_tokens)}")
print("\n" + "="*50)

In [None]:
# 1.2 NLTK word tokenization (more sophisticated)
nltk_tokens = word_tokenize(sample_text)
print("NLTK word tokenization:")
print(nltk_tokens)
print(f"Number of tokens: {len(nltk_tokens)}")
print("\n" + "="*50)

In [None]:
# 1.3 Regular expression tokenization
# This pattern matches sequences of word characters (letters, digits, underscore)
regex_tokens = re.findall(r'\b\w+\b', sample_text.lower())
print("Regular expression tokenization:")
print(regex_tokens)
print(f"Number of tokens: {len(regex_tokens)}")
print("\n" + "="*50)

In [None]:
# 1.4 Sentence tokenization
sentences = sent_tokenize(sample_text)
print("Sentence tokenization:")
for i, sentence in enumerate(sentences, 1):
    print(f"Sentence {i}: {sentence}")
print(f"\nNumber of sentences: {len(sentences)}")

### Tokenization Function for Our Project

In [None]:
def tokenize_text(text, method='nltk'):
    """
    Tokenize text using one of several methods.
    Note: 'nltk' and 'regex' lowercase the text first; 'simple' preserves case.
    
    Args:
        text (str): Input text to tokenize
        method (str): Tokenization method ('simple', 'nltk', 'regex')
    
    Returns:
        list: List of tokens
    """
    if method == 'simple':
        return text.split()
    elif method == 'nltk':
        return word_tokenize(text.lower())
    elif method == 'regex':
        return re.findall(r'\b\w+\b', text.lower())
    else:
        raise ValueError("Method must be 'simple', 'nltk', or 'regex'")

# Test the function
test_text = "This is a test! Can you tokenize this?"
print("Original text:", test_text)
print("Simple:", tokenize_text(test_text, 'simple'))
print("NLTK:", tokenize_text(test_text, 'nltk'))
print("Regex:", tokenize_text(test_text, 'regex'))

## 2. Stopword Removal

**Stopwords** are common words that typically don't carry much meaning and are often filtered out during text processing.

### Examples of stopwords:
- Articles: a, an, the
- Prepositions: in, on, at, by
- Pronouns: I, you, he, she, it
- Common verbs: is, are, was, were

### Why remove stopwords?
- Reduces noise in the data
- Decreases computational complexity
- Focuses on meaningful words
- Can improve model performance, although the effect is task-dependent
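
Removing stopwords is not always harmless. As a hedged illustration, the sketch below uses a small hand-written stopword set (an assumption for this example, not NLTK's list) to show how dropping negations can change what a sentence appears to say, and how a hypothetical `keep` argument can protect them.

```python
# Small hand-written stopword set for illustration only (not NLTK's list)
BASE_STOPWORDS = {"this", "is", "a", "the", "of", "not", "no"}
NEGATIONS = {"not", "no", "never"}

def filter_tokens(tokens, stop_words=BASE_STOPWORDS, keep=frozenset()):
    """Drop stopwords, except for any words listed in `keep`."""
    effective = set(stop_words) - set(keep)
    return [t for t in tokens if t.lower() not in effective]

tokens = "this is not a real story".split()
print(filter_tokens(tokens))                  # ['real', 'story']
print(filter_tokens(tokens, keep=NEGATIONS))  # ['not', 'real', 'story']
```

With the negation removed, "not a real story" and "a real story" collapse to the same tokens, which may matter for a fake news classifier.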

In [None]:
# Get English stopwords from NLTK
stop_words = set(stopwords.words('english'))
print(f"Number of stopwords in NLTK: {len(stop_words)}")
print("\nFirst 20 stopwords (alphabetical):")
print(sorted(stop_words)[:20])  # Sets are unordered, so sort for a reproducible listing

In [None]:
# Example: Remove stopwords from our sample text
tokens = word_tokenize(sample_text.lower())
print("Original tokens:")
print(tokens)
print(f"Number of tokens: {len(tokens)}")
print("\n" + "="*50)

# Remove stopwords
filtered_tokens = [word for word in tokens if word not in stop_words and word.isalpha()]
print("Tokens after stopword removal:")
print(filtered_tokens)
print(f"Number of tokens: {len(filtered_tokens)}")
print(f"Reduction: {len(tokens) - len(filtered_tokens)} tokens removed")

In [None]:
# Custom stopword removal function
def remove_stopwords(tokens, custom_stopwords=None):
    """
    Remove stopwords from a list of tokens
    
    Args:
        tokens (list): List of tokens
        custom_stopwords (set): Custom set of stopwords (optional)
    
    Returns:
        list: Filtered tokens without stopwords
    """
    if custom_stopwords is None:
        stop_words = set(stopwords.words('english'))
    else:
        stop_words = custom_stopwords
    
    # Filter out stopwords and non-alphabetic tokens
    filtered_tokens = [token for token in tokens 
                      if token.lower() not in stop_words and token.isalpha()]
    
    return filtered_tokens

# Test the function
test_tokens = ['this', 'is', 'a', 'great', 'example', 'of', 'text', 'processing']
print("Original tokens:", test_tokens)
print("After stopword removal:", remove_stopwords(test_tokens))

In [None]:
# Analyze the impact of stopword removal on our dataset
def analyze_stopword_impact(text):
    """
    Analyze the impact of stopword removal on text
    """
    tokens = word_tokenize(text.lower())
    original_count = len(tokens)
    
    # Remove stopwords and non-alphabetic tokens
    filtered_tokens = remove_stopwords(tokens)
    filtered_count = len(filtered_tokens)
    
    # Guard against empty input to avoid division by zero
    reduction_percentage = (
        (original_count - filtered_count) / original_count * 100 if original_count else 0.0
    )
    
    return {
        'original_count': original_count,
        'filtered_count': filtered_count,
        'reduction_percentage': reduction_percentage
    }

# Analyze impact on a few sample texts
for i in range(3):
    text = df.iloc[i]['text']
    label = 'Real' if df.iloc[i]['label'] == 0 else 'Fake'
    impact = analyze_stopword_impact(text)
    
    print(f"Text {i+1} ({label} News):")
    print(f"  Original tokens: {impact['original_count']}")
    print(f"  After filtering: {impact['filtered_count']}")
    print(f"  Reduction: {impact['reduction_percentage']:.1f}%")
    print()

## 3. Text Preprocessing Pipeline

Before vectorization, let's create a comprehensive text preprocessing pipeline that includes:
- Text cleaning
- Tokenization
- Stopword removal
- Stemming/Lemmatization
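
Stemming and lemmatization both reduce words to a base form, but in different ways: stemming strips suffixes by rule, while lemmatization looks up a dictionary form. The toy functions below are assumptions made purely for illustration; `naive_stem` is not the Porter algorithm, and `LEMMA_TABLE` is a three-entry stand-in for WordNet.

```python
SUFFIXES = ("ing", "ies", "ed", "s")  # Checked in this order

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Tiny hand-written lookup table standing in for a real lemma dictionary
LEMMA_TABLE = {"studies": "study", "running": "run", "claimed": "claim"}

def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMA_TABLE.get(word, word)

for w in ["studies", "running", "claimed"]:
    print(w, "->", naive_stem(w), "|", toy_lemmatize(w))
```

Rule-based stems such as `stud` or `runn` need not be real words, which is the usual trade-off against a slower, dictionary-backed lemmatizer.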

In [None]:
# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    """
    Clean and preprocess text
    """
    # Convert to lowercase
    text = text.lower()
    
    # Remove special characters and digits
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

def preprocess_text(text, remove_stopwords_flag=True, use_stemming=False, use_lemmatization=True):
    """
    Complete text preprocessing pipeline
    
    Args:
        text (str): Input text
        remove_stopwords_flag (bool): Whether to remove stopwords
        use_stemming (bool): Whether to apply stemming
        use_lemmatization (bool): Whether to apply lemmatization
    
    Returns:
        str: Preprocessed text
    """
    # Clean text
    text = clean_text(text)
    
    # Tokenize
    tokens = word_tokenize(text)
    
    # Remove stopwords
    if remove_stopwords_flag:
        tokens = remove_stopwords(tokens)
    
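    # Apply stemming or lemmatization (stemming takes precedence if both flags are True).
    # Note: lemmatize() without a POS tag treats every token as a noun.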
    if use_stemming:
        tokens = [stemmer.stem(token) for token in tokens]
    elif use_lemmatization:
        tokens = [lemmatizer.lemmatize(token) for token in tokens]
    
    return ' '.join(tokens)

# Test the preprocessing pipeline
sample_text = df.iloc[1]['text']  # A fake news example
print("Original text:")
print(sample_text)
print("\n" + "="*50)

print("After preprocessing:")
processed_text = preprocess_text(sample_text)
print(processed_text)

## 4. Vectorization

**Vectorization** is the process of converting text into a numerical format that machine learning algorithms can work with.

### Common Vectorization Techniques:

1. **Bag of Words (BoW)**: Represents text as a collection of words, ignoring grammar and word order
2. **Term Frequency-Inverse Document Frequency (TF-IDF)**: Reflects how important a word is to a document in a collection of documents
3. **N-grams**: Sequences of n consecutive words
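
To make the bag-of-words and n-gram ideas concrete before using scikit-learn, here is a minimal pure-Python sketch. The helper name `word_ngrams` is an assumption for this example; it only illustrates the concept and is not how `CountVectorizer` is implemented.

```python
from collections import Counter

def word_ngrams(tokens, n):
    """Return all sequences of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "fake article spreads false claims".split()
bigrams = word_ngrams(tokens, 2)
print(bigrams)  # ['fake article', 'article spreads', 'spreads false', 'false claims']

# A bag of words is just a count per term; word order is discarded
bag = Counter(tokens + bigrams)
print(bag["fake"], bag["false claims"], bag["science"])  # 1 1 0
```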

### 4.1 Count Vectorization (Bag of Words)

In [None]:
# Prepare a small sample for demonstration
sample_texts = [
    "This is a real news article about science",
    "This fake news spreads misinformation",
    "Science article provides accurate information",
    "Fake article spreads false claims"
]

# Initialize CountVectorizer
count_vectorizer = CountVectorizer(
    lowercase=True,           # Convert to lowercase
    stop_words='english',     # Remove English stopwords
    max_features=1000,        # Limit vocabulary size
    ngram_range=(1, 1)        # Use unigrams only
)

# Fit and transform the sample texts
count_matrix = count_vectorizer.fit_transform(sample_texts)

# Get feature names (vocabulary)
feature_names = count_vectorizer.get_feature_names_out()
print("Vocabulary (feature names):")
print(feature_names)
print(f"\nVocabulary size: {len(feature_names)}")

# Convert to dense array for better visualization
count_dense = count_matrix.toarray()
print("\nCount matrix shape:", count_dense.shape)
print("\nCount matrix:")
count_df = pd.DataFrame(count_dense, columns=feature_names)
print(count_df)

In [None]:
# Visualize the count matrix
plt.figure(figsize=(12, 6))
sns.heatmap(count_df, annot=True, cmap='Blues', fmt='d')
plt.title('Count Vectorization Matrix')
plt.xlabel('Words (Features)')
plt.ylabel('Documents')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

### 4.2 TF-IDF Vectorization

**TF-IDF** combines two metrics:
- **TF (Term Frequency)**: How frequently a term appears in a document
- **IDF (Inverse Document Frequency)**: How rare or common a term is across all documents

**Formula**: TF-IDF = TF × IDF

- TF = (Number of times term appears in document) / (Total number of terms in document)
- IDF = log(Total number of documents / Number of documents containing the term)
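
The formulas above can be worked through by hand on a toy corpus. The sketch below is a direct translation of the textbook definitions; the three documents and function names are made up for illustration. Note that scikit-learn's `TfidfVectorizer` will not reproduce these numbers: by default it uses raw counts for TF, a smoothed IDF of the form ln((1 + n) / (1 + df)) + 1, and L2-normalizes each row. `smoothed_idf` mirrors only that IDF variant.

```python
import math

# Toy corpus of already-tokenized documents (made up for this example)
docs = [
    ["science", "article", "science"],
    ["fake", "article"],
    ["fake", "claims"],
]

def tf(term, doc):
    """Term count divided by document length."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Textbook IDF; assumes the term occurs in at least one document."""
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

def smoothed_idf(term, docs):
    """Smoothed IDF variant: ln((1 + n) / (1 + df)) + 1."""
    containing = sum(1 for doc in docs if term in doc)
    return math.log((1 + len(docs)) / (1 + containing)) + 1

print(round(tfidf("science", docs[0], docs), 4))  # 0.7324 (frequent here, rare elsewhere)
print(round(tfidf("article", docs[0], docs), 4))  # 0.1352 (appears in two of three documents)
```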

In [None]:
# Initialize TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(
    lowercase=True,           # Convert to lowercase
    stop_words='english',     # Remove English stopwords
    max_features=1000,        # Limit vocabulary size
    ngram_range=(1, 2),       # Use unigrams and bigrams
    min_df=1,                 # Minimum document frequency
    max_df=0.95               # Maximum document frequency
)

# Fit and transform the sample texts
tfidf_matrix = tfidf_vectorizer.fit_transform(sample_texts)

# Get feature names
tfidf_features = tfidf_vectorizer.get_feature_names_out()
print("TF-IDF Features:")
print(tfidf_features)
print(f"\nFeature count: {len(tfidf_features)}")

# Convert to dense array
tfidf_dense = tfidf_matrix.toarray()
print("\nTF-IDF matrix shape:", tfidf_dense.shape)

# Create DataFrame for better visualization
tfidf_df = pd.DataFrame(tfidf_dense, columns=tfidf_features)
print("\nTF-IDF matrix (showing non-zero values):")
# Display only columns with non-zero values
non_zero_cols = tfidf_df.columns[tfidf_df.sum() > 0]
print(tfidf_df[non_zero_cols].round(3))

In [None]:
# Visualize TF-IDF scores
plt.figure(figsize=(14, 6))
sns.heatmap(tfidf_df[non_zero_cols], annot=True, cmap='viridis', fmt='.3f')
plt.title('TF-IDF Vectorization Matrix')
plt.xlabel('Features (Words/N-grams)')
plt.ylabel('Documents')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

### 4.3 Comparing Count vs TF-IDF

In [None]:
def compare_vectorization_methods(texts):
    """
    Compare Count and TF-IDF vectorization methods
    """
    # Count Vectorization
    count_vec = CountVectorizer(stop_words='english', max_features=50)
    count_matrix = count_vec.fit_transform(texts)
    
    # TF-IDF Vectorization
    tfidf_vec = TfidfVectorizer(stop_words='english', max_features=50)
    tfidf_matrix = tfidf_vec.fit_transform(texts)
    
    print("Count Vectorization:")
    print(f"  Matrix shape: {count_matrix.shape}")
    print(f"  Non-zero elements: {count_matrix.nnz}")
    print(f"  Sparsity: {(1 - count_matrix.nnz / (count_matrix.shape[0] * count_matrix.shape[1])) * 100:.2f}%")
    
    print("\nTF-IDF Vectorization:")
    print(f"  Matrix shape: {tfidf_matrix.shape}")
    print(f"  Non-zero elements: {tfidf_matrix.nnz}")
    print(f"  Sparsity: {(1 - tfidf_matrix.nnz / (tfidf_matrix.shape[0] * tfidf_matrix.shape[1])) * 100:.2f}%")
    
    return count_matrix, tfidf_matrix, count_vec, tfidf_vec

# Apply preprocessing to our dataset
df['processed_text'] = df['text'].apply(preprocess_text)

# Compare methods on our dataset
count_mat, tfidf_mat, count_vec, tfidf_vec = compare_vectorization_methods(df['processed_text'])

print("\nFirst few processed texts:")
for i in range(3):
    print(f"Text {i+1}: {df['processed_text'].iloc[i][:100]}...")

### 4.4 N-gram Analysis

In [None]:
# Analyze different n-gram combinations
def analyze_ngrams(texts, ngram_range=(1, 1), max_features=20):
    """
    Analyze n-grams in the text corpus
    """
    vectorizer = TfidfVectorizer(
        ngram_range=ngram_range,
        max_features=max_features,
        stop_words='english'
    )
    
    matrix = vectorizer.fit_transform(texts)
    feature_names = vectorizer.get_feature_names_out()
    
    # Calculate mean TF-IDF scores
    mean_scores = matrix.mean(axis=0).A1
    feature_scores = list(zip(feature_names, mean_scores))
    feature_scores.sort(key=lambda x: x[1], reverse=True)
    
    return feature_scores

# Analyze unigrams
print("Top 15 Unigrams (1-grams):")
unigrams = analyze_ngrams(df['processed_text'], ngram_range=(1, 1), max_features=15)
for feature, score in unigrams:
    print(f"  {feature}: {score:.4f}")

print("\nTop 15 Bigrams (2-grams):")
bigrams = analyze_ngrams(df['processed_text'], ngram_range=(2, 2), max_features=15)
for feature, score in bigrams:
    print(f"  {feature}: {score:.4f}")

print("\nTop 15 Trigrams (3-grams):")
trigrams = analyze_ngrams(df['processed_text'], ngram_range=(3, 3), max_features=15)
for feature, score in trigrams:
    print(f"  {feature}: {score:.4f}")

### 4.5 Feature Engineering for Fake News Detection

In [None]:
# Create a comprehensive vectorization pipeline for fake news detection
def create_features(texts, vectorizer_type='tfidf', ngram_range=(1, 2), max_features=5000):
    """
    Create feature matrix for fake news detection
    
    Args:
        texts (list): List of preprocessed texts
        vectorizer_type (str): 'count' or 'tfidf'
        ngram_range (tuple): N-gram range
        max_features (int): Maximum number of features
    
    Returns:
        sparse matrix: Feature matrix
        vectorizer: Fitted vectorizer
    """
    if vectorizer_type == 'count':
        vectorizer = CountVectorizer(
            ngram_range=ngram_range,
            max_features=max_features,
            stop_words='english',
            min_df=2,  # Ignore terms that appear in fewer than 2 documents
            max_df=0.8  # Ignore terms that appear in more than 80% of documents
        )
    else:  # tfidf
        vectorizer = TfidfVectorizer(
            ngram_range=ngram_range,
            max_features=max_features,
            stop_words='english',
            min_df=2,
            max_df=0.8,
            sublinear_tf=True  # Apply sublinear tf scaling
        )
    
    feature_matrix = vectorizer.fit_transform(texts)
    return feature_matrix, vectorizer

# Create features for our dataset
X, vectorizer = create_features(df['processed_text'], vectorizer_type='tfidf')
y = df['label']

print(f"Feature matrix shape: {X.shape}")
print(f"Number of samples: {X.shape[0]}")
print(f"Number of features: {X.shape[1]}")
print(f"Sparsity: {(1 - X.nnz / (X.shape[0] * X.shape[1])) * 100:.2f}%")

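# Show the first 20 features. get_feature_names_out() is ordered by column index
# (alphabetical), not by importance.
feature_names = vectorizer.get_feature_names_out()
print(f"\nFirst 20 features (alphabetical): {feature_names[:20]}")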

## 5. Putting It All Together: Complete Text Processing Pipeline

In [None]:
class TextProcessor:
    """
    Complete text processing pipeline for fake news detection
    """
    
    def __init__(self, vectorizer_type='tfidf', ngram_range=(1, 2), max_features=5000):
        self.vectorizer_type = vectorizer_type
        self.ngram_range = ngram_range
        self.max_features = max_features
        self.vectorizer = None
        self.stemmer = PorterStemmer()
        self.lemmatizer = WordNetLemmatizer()
    
    def clean_text(self, text):
        """Clean and normalize text"""
        text = text.lower()
        text = re.sub(r'[^a-zA-Z\s]', '', text)
        text = re.sub(r'\s+', ' ', text).strip()
        return text
    
    def preprocess_text(self, text, remove_stopwords=True, use_lemmatization=True):
        """Preprocess individual text"""
        # Clean text
        text = self.clean_text(text)
        
        # Tokenize
        tokens = word_tokenize(text)
        
        # Remove stopwords
        if remove_stopwords:
            stop_words = set(stopwords.words('english'))
            tokens = [token for token in tokens if token not in stop_words and token.isalpha()]
        
        # Lemmatization
        if use_lemmatization:
            tokens = [self.lemmatizer.lemmatize(token) for token in tokens]
        
        return ' '.join(tokens)
    
    def fit_transform(self, texts):
        """Fit vectorizer and transform texts"""
        # Preprocess all texts
        processed_texts = [self.preprocess_text(text) for text in texts]
        
        # Initialize vectorizer
        if self.vectorizer_type == 'count':
            self.vectorizer = CountVectorizer(
                ngram_range=self.ngram_range,
                max_features=self.max_features,
                stop_words='english',
                min_df=2,
                max_df=0.8
            )
        else:
            self.vectorizer = TfidfVectorizer(
                ngram_range=self.ngram_range,
                max_features=self.max_features,
                stop_words='english',
                min_df=2,
                max_df=0.8,
                sublinear_tf=True
            )
        
        # Fit and transform
        feature_matrix = self.vectorizer.fit_transform(processed_texts)
        return feature_matrix
    
    def transform(self, texts):
        """Transform new texts using fitted vectorizer"""
        if self.vectorizer is None:
            raise ValueError("Vectorizer not fitted. Call fit_transform first.")
        
        processed_texts = [self.preprocess_text(text) for text in texts]
        return self.vectorizer.transform(processed_texts)
    
    def get_feature_names(self):
        """Get feature names from vectorizer"""
        if self.vectorizer is None:
            raise ValueError("Vectorizer not fitted. Call fit_transform first.")
        return self.vectorizer.get_feature_names_out()

# Test the complete pipeline
processor = TextProcessor(vectorizer_type='tfidf', ngram_range=(1, 2), max_features=100)
X_processed = processor.fit_transform(df['text'])

print(f"Processed feature matrix shape: {X_processed.shape}")
print(f"Feature names (first 10): {processor.get_feature_names()[:10]}")

# Test on new text
new_text = ["This is breaking news about a scientific discovery that will change everything!"]
new_features = processor.transform(new_text)
print(f"\nNew text feature shape: {new_features.shape}")

## 6. Visualization and Analysis

In [None]:
# Analyze vocabulary differences between real and fake news
def analyze_class_vocabulary(df, processor):
    """
    Analyze vocabulary differences between real and fake news
    """
    real_news = df[df['label'] == 0]['text'].tolist()
    fake_news = df[df['label'] == 1]['text'].tolist()
    
    # Process separately
    real_processed = [processor.preprocess_text(text) for text in real_news]
    fake_processed = [processor.preprocess_text(text) for text in fake_news]
    
    # Get word frequencies
    real_words = ' '.join(real_processed).split()
    fake_words = ' '.join(fake_processed).split()
    
    real_freq = Counter(real_words)
    fake_freq = Counter(fake_words)
    
    print("Top 10 words in REAL news:")
    for word, count in real_freq.most_common(10):
        print(f"  {word}: {count}")
    
    print("\nTop 10 words in FAKE news:")
    for word, count in fake_freq.most_common(10):
        print(f"  {word}: {count}")
    
    return real_freq, fake_freq

real_freq, fake_freq = analyze_class_vocabulary(df, processor)

In [None]:
# Create word clouds for real vs fake news
def create_wordclouds(real_freq, fake_freq):
    """
    Create word clouds for real and fake news
    """
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 8))
    
    # Real news word cloud
    wordcloud_real = WordCloud(width=400, height=400, background_color='white').generate_from_frequencies(real_freq)
    ax1.imshow(wordcloud_real, interpolation='bilinear')
    ax1.set_title('Real News Word Cloud', fontsize=16)
    ax1.axis('off')
    
    # Fake news word cloud
    wordcloud_fake = WordCloud(width=400, height=400, background_color='white').generate_from_frequencies(fake_freq)
    ax2.imshow(wordcloud_fake, interpolation='bilinear')
    ax2.set_title('Fake News Word Cloud', fontsize=16)
    ax2.axis('off')
    
    plt.tight_layout()
    plt.show()

create_wordclouds(real_freq, fake_freq)

## 7. Summary and Key Takeaways

### Text Processing Pipeline Summary:

1. **Text Cleaning**: Remove special characters, normalize case
2. **Tokenization**: Break text into individual words
3. **Stopword Removal**: Filter out common, non-informative words
4. **Normalization**: Apply stemming or lemmatization
5. **Vectorization**: Convert text to numerical features

### Key Concepts Learned:

- **Tokenization**: Essential first step in text processing
- **Stopword Removal**: Reduces noise and improves model focus
- **Count Vectorization**: Simple bag-of-words approach
- **TF-IDF**: Weights words by importance across the corpus
- **N-grams**: Capture word sequences and context

### Best Practices for Fake News Detection:

1. **Prefer TF-IDF over Raw Counts as a Starting Point**: It down-weights terms that appear in most documents
2. **Include Bigrams**: Capture phrases like "breaking news" or "doctors hate"
3. **Set Appropriate min_df and max_df**: Filter very rare and very common terms
4. **Preprocess Consistently**: Same pipeline for training and inference
5. **Consider Domain-Specific Stopwords**: Add news-specific common words
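
Practices 3 and 4 can be sketched together in plain Python: the vocabulary is learned from the training documents only, filtered by document frequency, and then reused unchanged for new text. The functions `fit_vocabulary` and `transform` and the four training sentences are assumptions for this example; they imitate the `min_df`/`max_df` idea, not scikit-learn's actual implementation.

```python
from collections import Counter

def fit_vocabulary(train_docs, min_df=2, max_df=0.8):
    """Learn a term -> column index mapping from training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.split()))  # Count each term once per document
    kept = sorted(
        term for term, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df
    )
    return {term: idx for idx, term in enumerate(kept)}

def transform(docs, vocabulary):
    """Count known terms; terms unseen during fitting are ignored."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for token in doc.split():
            if token in vocabulary:
                row[vocabulary[token]] += 1
        rows.append(row)
    return rows

train_docs = [
    "breaking news shocking claim",
    "breaking news science report",
    "science report new study",
    "shocking claim about news",
]
vocab = fit_vocabulary(train_docs)
print(sorted(vocab))  # ['breaking', 'claim', 'news', 'report', 'science', 'shocking']
print(transform(["breaking news unseen science"], vocab))  # [[1, 0, 1, 0, 1, 0]]
```

Refitting the vocabulary on inference data would change the column layout, so a model trained on the original columns would receive misaligned features.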

### Next Steps:

1. **Model Training**: Use these features with ML algorithms (Naive Bayes, SVM, Random Forest)
2. **Feature Engineering**: Add metadata features (text length, punctuation, etc.)
3. **Advanced Techniques**: Explore word embeddings (Word2Vec, GloVe) or transformer models
4. **Evaluation**: Test model performance and analyze feature importance
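
As a starting point for the metadata features mentioned in step 2, here is a minimal sketch. The function name `metadata_features` and the particular features chosen are assumptions for illustration; which features actually help would need to be checked against the dataset.

```python
import string

def metadata_features(text):
    """Compute simple surface statistics for one text (illustrative feature set)."""
    letters = [c for c in text if c.isalpha()]
    uppercase = sum(1 for c in letters if c.isupper())
    return {
        "char_count": len(text),
        "word_count": len(text.split()),
        "exclamation_count": text.count("!"),
        "punctuation_count": sum(1 for c in text if c in string.punctuation),
        # Share of letters that are uppercase; 0.0 for texts without letters
        "uppercase_ratio": uppercase / len(letters) if letters else 0.0,
    }

print(metadata_features("SHOCKING!!! You won't believe this."))
```

These values are computed on the raw text, before `clean_text` lowercases it and strips punctuation.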