# Amazon Reviews Sentiment Analysis - Pipeline

This notebook implements the data-preparation stages of an Amazon reviews sentiment analysis pipeline:
- Data loading from Kaggle
- Text preprocessing and cleaning
- Text analysis and statistics
- TF-IDF vectorization
- Word cloud generation

**Note for Google Colab users:**
- Install required packages: `!pip install kagglehub wordcloud`
- Upload your Kaggle API key if needed

**Note for NotebookLM users:**
- Some interactive features may not work
- Focus on the analysis results and code structure

## 1. Install Required Libraries (Google Colab)

In [None]:
# Uncomment and run this cell if using Google Colab
# !pip install kagglehub wordcloud scikit-learn nltk
# !pip install matplotlib seaborn pandas numpy

## 2. Import Required Libraries

In [None]:
import os
import sys
import pandas as pd
import numpy as np
import re
from collections import Counter
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer
import kagglehub

# Download NLTK data (newer NLTK releases need 'punkt_tab' for word_tokenize)
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

print("All libraries imported successfully!")

## 3. Configuration Settings

In [None]:
# Pipeline Configuration
CONFIG = {
    "train_size": 100000,
    "test_size": 10000,
    "tfidf_max_features": 5000,
    "tfidf_min_df": 2,
    "tfidf_max_df": 0.8,
    "ngram_range": (1, 2),
}

print("=== AMAZON REVIEWS DATA PROCESSING PIPELINE ===")
print(f"Configuration: {CONFIG}")
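
The `tfidf_min_df` and `tfidf_max_df` settings prune the vocabulary by document frequency. The sketch below illustrates that rule in plain Python on a toy corpus, assuming whitespace tokenization; the helper `filter_by_document_frequency` is hypothetical and not part of the pipeline. An integer `min_df` is treated as an absolute document count and a float `max_df` as a proportion of documents.

```python
from collections import Counter


def filter_by_document_frequency(docs, min_df=2, max_df=0.8):
    """Keep terms found in at least `min_df` documents and in at most
    a `max_df` fraction of all documents (hypothetical helper)."""
    n_docs = len(docs)
    # Count each term once per document (document frequency, not raw count)
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    max_count = max_df * n_docs
    return sorted(
        term for term, df in doc_freq.items() if min_df <= df <= max_count
    )


toy_docs = [
    "great product great price",
    "great sound poor battery",
    "great battery",
    "great price",
    "great value",
]
kept_terms = filter_by_document_frequency(toy_docs, min_df=2, max_df=0.8)
print(kept_terms)
```

Here `great` is dropped because it occurs in every document, and the terms that occur in only one document are dropped by `min_df`.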

## 4. KaggleDataLoader Class

In [None]:
class KaggleDataLoader:
    """
    Class for loading and preparing Kaggle Amazon reviews dataset
    
    Features:
    - Downloads dataset from Kaggle
    - Loads CSV data with error handling
    - Validates data structure and labels
    - Applies size limits from configuration
    - Combines title and text columns into unified input
    - Leaves null handling, deduplication and text cleaning to PreProcessor
    """
    
    def __init__(self, config):
        self.config = config
        self.train_df = None
        self.test_df = None
        self.dataset_path = None
    
    def download_dataset(self):
        """Download Amazon reviews dataset from Kaggle using kritanjalijain dataset"""
        print("Downloading Kaggle Amazon reviews dataset...")
        
        try:
            if self.dataset_path is None:
                self.dataset_path = kagglehub.dataset_download(
                    "kritanjalijain/amazon-reviews"
                )
            print(f"KaggleHub download path: {self.dataset_path}")
            return self.dataset_path
        except Exception as e:
            print(f"Error downloading dataset: {e}")
            sys.exit(1)
    
    def load_csv_data(self):
        """Load CSV data from the downloaded dataset"""
        if self.dataset_path is None:
            self.download_dataset()
        
        train_csv_path = os.path.join(self.dataset_path, "train.csv")
        test_csv_path = os.path.join(self.dataset_path, "test.csv")
        
        print("\n=== LOADING DATA ===")
        try:
            # The CSV files appear to ship without a header row, so header=None
            # keeps the first review as data; columns are named in prepare_dataframes()
            self.train_df = pd.read_csv(train_csv_path, header=None)
            self.test_df = pd.read_csv(test_csv_path, header=None)
            print("Successfully loaded data:")
            print(f"   - Train: {self.train_df.shape}")
            print(f"   - Test: {self.test_df.shape}")
        except FileNotFoundError:
            print("Error: Dataset files not found!")
            print(f"   Expected paths: {train_csv_path}, {test_csv_path}")
            sys.exit(1)
        except Exception as e:
            print(f"Error loading data: {e}")
            sys.exit(1)
        
        return self.train_df, self.test_df
    
    def prepare_dataframes(self):
        """Name the columns, validate labels, apply size limits, and combine title/text"""
        if self.train_df is None or self.test_df is None:
            self.load_csv_data()
        
        self.train_df.columns = ["label", "title", "text"]
        self.test_df.columns = ["label", "title", "text"]
        
        self.validate_data()
        self.apply_size_limits()
        self.clean_and_combine_data()
        
        return self.train_df, self.test_df
    
    def validate_data(self):
        """Validate loaded data and perform initial quality checks"""
        print("\n=== DATA VALIDATION ===")
        
        print(f"Initial Train data info:")
        print(f"   - Shape: {self.train_df.shape}")
        print(f"Initial Test data info:")
        print(f"   - Shape: {self.test_df.shape}")
        
        print(f"\nInitial label distribution:")
        print(f"   Training: {self.train_df['label'].value_counts().to_dict()}")
        print(f"   Test: {self.test_df['label'].value_counts().to_dict()}")
        
        train_labels = set(self.train_df["label"].unique())
        test_labels = set(self.test_df["label"].unique())
        expected_labels = {1, 2}  # Binary sentiment: 1=negative, 2=positive
        
        train_invalid = train_labels - expected_labels
        test_invalid = test_labels - expected_labels
        
        if train_invalid or test_invalid:
            print(f"Warning: Unexpected labels found")
            if train_invalid:
                print(f"   Training unexpected labels: {train_invalid}")
            if test_invalid:
                print(f"   Test unexpected labels: {test_invalid}")
        else:
            print("All labels are within expected range [1, 2]")
        
        print("Initial data validation completed")
    
    def clean_and_combine_data(self):
        """Combine title/text columns and perform basic data preparation"""
        print("\n=== DATA COMBINATION ===")
        
        self.train_df["title"] = self.train_df["title"].fillna("")
        self.train_df["text"] = self.train_df["text"].fillna("")
        self.test_df["title"] = self.test_df["title"].fillna("")
        self.test_df["text"] = self.test_df["text"].fillna("")
        
        def smart_combine(title, text):
            title_clean = str(title).strip()
            text_clean = str(text).strip()
            
            if title_clean and text_clean:
                return f"{title_clean} {text_clean}"
            elif title_clean:
                return title_clean
            elif text_clean:
                return text_clean
            else:
                return ""
        
        print("Combining title and text columns...")
        self.train_df["input"] = self.train_df.apply(
            lambda row: smart_combine(row["title"], row["text"]), axis=1
        )
        self.test_df["input"] = self.test_df.apply(
            lambda row: smart_combine(row["title"], row["text"]), axis=1
        )
        
        self.train_df = self.train_df.drop(["title", "text"], axis=1).reset_index(
            drop=True
        )
        self.test_df = self.test_df.drop(["title", "text"], axis=1).reset_index(
            drop=True
        )
        
        print(f"Data combination completed:")
        print(f"   Training: {self.train_df.shape}")
        print(f"   Test: {self.test_df.shape}")
        print(
            f"   Average input length - Train: {self.train_df['input'].str.len().mean():.1f}, Test: {self.test_df['input'].str.len().mean():.1f}"
        )
    
    def apply_size_limits(self):
        """Apply size limits from configuration"""
        print(f"\n=== APPLYING SIZE LIMITS ===")
        original_train_size = len(self.train_df)
        original_test_size = len(self.test_df)
        
        self.train_df = self.train_df.iloc[: self.config["train_size"]].copy()
        self.test_df = self.test_df.iloc[: self.config["test_size"]].copy()
        
        print(f"Size limits applied:")
        print(f"   Training: {original_train_size} -> {len(self.train_df)} samples")
        print(f"   Test: {original_test_size} -> {len(self.test_df)} samples")

print("KaggleDataLoader class defined successfully!")
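
The combination rule in `clean_and_combine_data` joins whichever of the title and body are non-empty. The sketch below is a compact standalone version of that rule, assuming missing values arrive as `None` or `""` (in the loader they are already replaced by `""` via `fillna`). The name `combine_title_text` is hypothetical and used only for illustration.

```python
def combine_title_text(title, text):
    """Join the non-empty parts of a review with a single space."""
    parts = [str(part).strip() for part in (title, text) if part is not None]
    return " ".join(part for part in parts if part)


examples = [
    ("Great read", "Could not put it down."),
    ("", "  Arrived broken.  "),
    ("Five stars", None),
    (None, None),
]
combined = [combine_title_text(title, text) for title, text in examples]
print(combined)
```

Rows where both fields are empty produce an empty string, which the later validation step counts as an empty document.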

## 5. PreProcessor Class

In [None]:
class PreProcessor:
    # Advanced regex patterns for comprehensive text cleaning
    CLEANING_PATTERNS = [
        # Web content removal
        (r"http[s]?://\S+|www\.\S+", ""),  # URLs
        (r"\S+@\S+\.\S+", ""),  # Email addresses
        (r"<[^>]+>", ""),  # HTML tags
        (r"&[a-zA-Z0-9]+;", ""),  # HTML entities
        # Social media content
        (r"@\w+|#\w+", ""),  # Mentions and hashtags
        # Numbers and digits
        (r"\d+", ""),  # Remove all numbers
        # Character filtering
        (r"[^a-zA-ZÀ-ÿĀ-žА-я\u00C0-\u017F\u0100-\u024F\s]", ""),  # Keep only letters
        (r"(.)\1{2,}", r"\1\1"),  # Repeated characters
        (r"\s+", " "),  # Normalize whitespace
        (r"\b[b-hj-z]\b", ""),  # Single chars except a,i
    ]

    # Meaningful short words to preserve
    MEANINGFUL_SHORT_WORDS = {
        "a",
        "i",
        "is",
        "it",
        "to",
        "go",
        "no",
        "so",
        "me",
        "we",
        "he",
        "my",
        "be",
        "or",
        "in",
        "on",
        "at",
    }

    def __init__(self):
        pass

    def clean_data(self, df):
        """Check and handle null values, and examine data types in DataFrame."""
        print("Number of null values before processing:")
        print(df.isnull().sum())

        # Fill NaN values in whichever text columns are present
        for column in ("input", "text", "title"):
            if column in df.columns:
                df[column] = df[column].fillna("")

        print("\nNumber of null values after processing:")
        print(df.isnull().sum())

        print("\nData types of columns:")
        print(df.dtypes)

        return df

    def remove_duplicates(self, df):
        """Check and remove duplicate records in DataFrame."""
        print(f"Number of records before removing duplicates: {len(df)}")

        # Check and remove duplicate records
        df_cleaned = df.drop_duplicates()

        print(f"Number of records after removing duplicates: {len(df_cleaned)}")

        return df_cleaned

    def clean_text(self, text):
        """Lowercase the text, apply CLEANING_PATTERNS in order, and drop stray single-character words."""
        if not isinstance(text, str):
            return ""

        # Convert to lowercase first
        text = text.lower()

        # Apply all cleaning patterns in sequence
        for pattern, replacement in self.CLEANING_PATTERNS:
            text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)

        # Strip whitespace after all pattern applications
        text = text.strip()

        # Drop single-character words unless they are in MEANINGFUL_SHORT_WORDS
        # (words of two or more characters are always kept)
        words = text.split()
        words = [
            word
            for word in words
            if len(word) >= 2 or word in self.MEANINGFUL_SHORT_WORDS
        ]
        text = " ".join(words)

        return text

    def preprocess_text_pipeline(self, text):
        """Clean, tokenize, remove stopwords from, and stem one text; returns a list of tokens."""
        if not isinstance(text, str):
            return []

        # Step 1: Clean text
        cleaned_text = self.clean_text(text)

        # Step 2: Tokenize
        tokens = self.tokenize_text(cleaned_text)

        # Step 3: Remove stopwords
        tokens_no_stopwords = self.remove_stopwords(tokens)

        # Step 4: Normalize tokens
        normalized_tokens = self.normalize_token(tokens_no_stopwords)

        return normalized_tokens

    def tokenize_text(self, text):
        """Split text into tokens (words)."""
        if isinstance(text, str):
            return word_tokenize(text)
        else:
            return text

    def remove_stopwords(self, tokens):
        """Remove English stopwords from the list of tokens."""
        if not isinstance(tokens, list):
            return tokens
        # Load the NLTK stopword set once and reuse it for every document
        if not hasattr(self, "_stop_words"):
            self._stop_words = set(stopwords.words("english"))
        return [word for word in tokens if word not in self._stop_words]

    def normalize_token(self, tokens):
        """Normalize token list by applying English Snowball Stemmer to each token."""
        if not isinstance(tokens, list):
            return tokens
        # Create the stemmer once and reuse it for every document
        if not hasattr(self, "_stemmer"):
            self._stemmer = SnowballStemmer("english")
        return [self._stemmer.stem(word) for word in tokens]

print("PreProcessor class defined successfully!")
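
The cleaning patterns are applied strictly in list order, so earlier rules shape what later rules see. The sketch below applies a simplified subset of `CLEANING_PATTERNS` to one sample string. It assumes ASCII-only text, and the names `DEMO_PATTERNS` and `demo_clean` are hypothetical and not part of the class.

```python
import re

# A simplified subset of CLEANING_PATTERNS (ASCII letters only, for illustration)
DEMO_PATTERNS = [
    (r"http[s]?://\S+|www\.\S+", ""),  # URLs
    (r"<[^>]+>", ""),                  # HTML tags
    (r"\d+", ""),                      # digits
    (r"[^a-zA-Z\s]", ""),              # anything that is not a letter or whitespace
    (r"(.)\1{2,}", r"\1\1"),           # runs of 3+ identical characters -> 2
    (r"\s+", " "),                     # collapse whitespace
]


def demo_clean(text):
    """Lowercase the text and apply each pattern in order."""
    text = text.lower()
    for pattern, replacement in DEMO_PATTERNS:
        text = re.sub(pattern, replacement, text)
    return text.strip()


raw = "LOVED it!!! Sooooo good <br> see http://example.com 10/10"
cleaned = demo_clean(raw)
print(cleaned)
```

The URL rule has to run before the letters-only rule. Otherwise the punctuation would be stripped first and a fragment such as `httpexamplecom` would survive as a token.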

## 6. TextAnalyzer Class

In [None]:
class TextAnalyzer:
    """
    Class for analyzing text data before TF-IDF vectorization

    Features:
    - Word frequency analysis
    - Top word identification
    - Word cloud generation
    - Average word length calculation
    - Text statistics and insights
    - Dataset comparison
    """

    def __init__(self):
        self.word_count = {}
        self.total_words = 0
        self.total_sentences = 0
        self.analysis_results = {}

    def analyze_word_count(self, sentence):
        """
        Analyze word count for a single sentence

        Args:
            sentence (str): Input sentence to analyze

        Returns:
            dict: Word count dictionary for the sentence
        """
        # Counter tallies each whitespace-separated word in one pass
        return dict(Counter(sentence.split()))

    def build_corpus_word_count(self, dataset, text_column="input"):
        """
        Build word count dictionary for entire corpus

        Args:
            dataset (pd.DataFrame): Dataset containing text data
            text_column (str): Column name containing text data

        Returns:
            dict: Complete word count dictionary
        """
        print("Building corpus word count...")
        counts = Counter()

        for sentence in dataset[text_column]:
            if isinstance(sentence, str):
                counts.update(sentence.split())

        self.word_count = dict(counts)

        self.total_words = sum(self.word_count.values())
        self.total_sentences = len(dataset)

        print(f"   Total unique words: {len(self.word_count):,}")
        print(f"   Total word occurrences: {self.total_words:,}")
        print(f"   Total sentences: {self.total_sentences:,}")

        return self.word_count

    def get_top_words(self, top_n=10):
        """
        Get top N most frequent words

        Args:
            top_n (int): Number of top words to return

        Returns:
            dict: Dictionary of top words with their counts
        """
        if not self.word_count:
            print(
                "Warning: Word count not built yet. Call build_corpus_word_count() first."
            )
            return {}

        top_words = dict(
            sorted(self.word_count.items(), key=lambda x: x[1], reverse=True)[:top_n]
        )

        print(f"\nTop {top_n} most frequent words:")
        for i, (word, count) in enumerate(top_words.items(), 1):
            percentage = (count / self.total_words) * 100
            print(f"   {i:2d}. {word:15s} -> {count:6,} times ({percentage:.2f}%)")

        return top_words

    def generate_wordcloud(
        self,
        dataset,
        text_column="input",
        figsize=(12, 6),
        remove_numbers=True,
        save_path=None,
    ):
        """
        Generate and display word cloud from text data

        Args:
            dataset (pd.DataFrame): Dataset containing text data
            text_column (str): Column name containing text data
            figsize (tuple): Figure size for the plot
            remove_numbers (bool): Whether to remove numbers from text
            save_path (str): Path to save the word cloud image
        """
        print("Generating word cloud...")

        # Collect sentences in a list and join once to avoid repeated string copies
        sentences = []
        for sentence in dataset[text_column]:
            if isinstance(sentence, str):
                if remove_numbers:
                    sentence = re.sub(r"\d+", "", sentence)
                sentences.append(sentence)
        joined_sentences = " ".join(sentences)

        if not joined_sentences.strip():
            print("   Warning: No text data available for word cloud generation")
            return None

        try:
            wordcloud = WordCloud(
                width=800,
                height=400,
                background_color="white",
                max_words=100,
                colormap="viridis",
            ).generate(joined_sentences)

            plt.figure(figsize=figsize)
            plt.imshow(wordcloud, interpolation="bilinear")
            plt.axis("off")
            plt.title(
                "Word Cloud - Most Frequent Words", fontsize=16, fontweight="bold"
            )

            if save_path:
                plt.savefig(save_path, bbox_inches="tight", dpi=300)
                print(f"   Word cloud saved to: {save_path}")

            plt.show()
            print("   Word cloud generated successfully")

            return wordcloud

        except Exception as e:
            print(f"   Error generating word cloud: {e}")
            return None

    def calculate_average_word_length(self, dataset, text_column="input"):
        """
        Calculate average word length across the dataset

        Args:
            dataset (pd.DataFrame): Dataset containing text data
            text_column (str): Column name containing text data

        Returns:
            float: Average word length
        """
        print("Calculating average word length...")

        total_length = 0
        total_word_count = 0

        for sentence in dataset[text_column]:
            if isinstance(sentence, str):
                words = sentence.split()
                total_length += sum(len(word) for word in words)
                total_word_count += len(words)

        if total_word_count == 0:
            print("   Warning: No words found in dataset")
            return 0.0

        avg_word_length = round(total_length / total_word_count, 2)

        print(f"   Total characters: {total_length:,}")
        print(f"   Total words: {total_word_count:,}")
        print(f"   Average word length: {avg_word_length} characters")

        return avg_word_length

    def analyze_text_statistics(self, dataset, text_column="input"):
        """
        Comprehensive text analysis including all statistics

        Args:
            dataset (pd.DataFrame): Dataset containing text data
            text_column (str): Column name containing text data

        Returns:
            dict: Complete analysis results
        """
        print(f"\n=== COMPREHENSIVE TEXT ANALYSIS ===")
        print(f"Analyzing {len(dataset):,} text samples...")

        word_count = self.build_corpus_word_count(dataset, text_column)
        top_10_words = self.get_top_words(10)
        avg_word_length = self.calculate_average_word_length(dataset, text_column)

        sentence_lengths = dataset[text_column].str.len()
        word_counts_per_sentence = dataset[text_column].str.split().str.len()

        self.analysis_results = {
            "corpus_statistics": {
                "total_sentences": len(dataset),
                "total_unique_words": len(word_count),
                "total_word_occurrences": self.total_words,
                "vocabulary_size": len(word_count),
            },
            "word_analysis": {
                "top_10_words": top_10_words,
                "average_word_length": avg_word_length,
                "most_frequent_word": (
                    max(word_count.items(), key=lambda x: x[1])
                    if word_count
                    else ("", 0)
                ),
            },
            "sentence_statistics": {
                "average_sentence_length": round(sentence_lengths.mean(), 2),
                "median_sentence_length": sentence_lengths.median(),
                "max_sentence_length": sentence_lengths.max(),
                "min_sentence_length": sentence_lengths.min(),
                "average_words_per_sentence": round(word_counts_per_sentence.mean(), 2),
            },
            "distribution_analysis": {
                "words_appearing_once": sum(
                    1 for count in word_count.values() if count == 1
                ),
                "words_appearing_more_than_10": sum(
                    1 for count in word_count.values() if count > 10
                ),
                "words_appearing_more_than_100": sum(
                    1 for count in word_count.values() if count > 100
                ),
            },
        }

        print(f"\n=== ANALYSIS SUMMARY ===")
        print(f"Corpus Statistics:")
        for key, value in self.analysis_results["corpus_statistics"].items():
            print(f"   {key.replace('_', ' ').title()}: {value:,}")

        print(f"\nWord Analysis:")
        print(f"   Average word length: {avg_word_length} characters")
        print(
            f"   Most frequent word: '{self.analysis_results['word_analysis']['most_frequent_word'][0]}' ({self.analysis_results['word_analysis']['most_frequent_word'][1]:,} times)"
        )

        print(f"\nSentence Statistics:")
        for key, value in self.analysis_results["sentence_statistics"].items():
            print(f"   {key.replace('_', ' ').title()}: {value}")

        print(f"\nDistribution Analysis:")
        for key, value in self.analysis_results["distribution_analysis"].items():
            print(f"   {key.replace('_', ' ').title()}: {value:,}")

        return self.analysis_results

    def get_word_frequency_report(self, min_frequency=1):
        """
        Generate detailed word frequency report

        Args:
            min_frequency (int): Minimum frequency threshold for words to include

        Returns:
            pd.DataFrame: Word frequency report as DataFrame
        """
        if not self.word_count:
            print(
                "Warning: Word count not built yet. Call build_corpus_word_count() first."
            )
            return pd.DataFrame()

        filtered_words = {
            word: count
            for word, count in self.word_count.items()
            if count >= min_frequency
        }

        word_freq_df = pd.DataFrame(
            [
                {
                    "word": word,
                    "frequency": count,
                    "percentage": (count / self.total_words) * 100,
                }
                for word, count in sorted(
                    filtered_words.items(), key=lambda x: x[1], reverse=True
                )
            ]
        )

        print(f"\nWord Frequency Report (min frequency: {min_frequency}):")
        print(f"   Words included: {len(word_freq_df):,}")
        if not word_freq_df.empty:
            print(f"   Coverage: {word_freq_df['percentage'].sum():.2f}% of total words")

        return word_freq_df

    def compare_datasets(self, train_dataset, test_dataset, text_column="input"):
        """
        Compare text statistics between training and test datasets

        Args:
            train_dataset (pd.DataFrame): Training dataset
            test_dataset (pd.DataFrame): Test dataset
            text_column (str): Column name containing text data

        Returns:
            dict: Comparison results
        """
        print(f"\n=== DATASET COMPARISON ===")

        # Analyze training dataset (use current analyzer)
        print("Analyzing training dataset...")
        train_results = self.analyze_text_statistics(train_dataset, text_column)

        # Create new analyzer for test dataset to avoid conflicts
        print("\nAnalyzing test dataset...")
        test_analyzer = TextAnalyzer()
        test_results = test_analyzer.analyze_text_statistics(test_dataset, text_column)

        # Calculate comparison metrics
        comparison = {
            "train_stats": train_results,
            "test_stats": test_results,
            "comparison": {
                "vocabulary_size_ratio": test_results["corpus_statistics"][
                    "vocabulary_size"
                ]
                / train_results["corpus_statistics"]["vocabulary_size"],
                "avg_word_length_diff": test_results["word_analysis"][
                    "average_word_length"
                ]
                - train_results["word_analysis"]["average_word_length"],
                "avg_sentence_length_diff": test_results["sentence_statistics"][
                    "average_sentence_length"
                ]
                - train_results["sentence_statistics"]["average_sentence_length"],
            },
        }

        # Print comparison summary
        print(f"\n=== DATASET COMPARISON SUMMARY ===")
        print(
            f"Vocabulary size ratio (test/train): {comparison['comparison']['vocabulary_size_ratio']:.3f}"
        )
        print(
            f"Average word length difference: {comparison['comparison']['avg_word_length_diff']:.2f} characters"
        )
        print(
            f"Average sentence length difference: {comparison['comparison']['avg_sentence_length_diff']:.2f} characters"
        )

        return comparison

print("TextAnalyzer class defined successfully!")
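
The corpus statistics that `TextAnalyzer` reports (total occurrences, top words, words appearing once) all derive from a single frequency table. The sketch below shows those quantities on a toy corpus with `collections.Counter`, assuming whitespace tokenization as in the class. The variable names are illustrative only.

```python
from collections import Counter

reviews = ["good book good price", "bad book", "good value"]

# One pass over every whitespace-separated word in the corpus
word_counts = Counter(word for review in reviews for word in review.split())

total_words = sum(word_counts.values())
top_words = word_counts.most_common(2)
words_appearing_once = sorted(w for w, c in word_counts.items() if c == 1)
top_share = 100 * top_words[0][1] / total_words

print(f"total={total_words}, top={top_words}, once={words_appearing_once}")
print(f"'{top_words[0][0]}' accounts for {top_share:.1f}% of all words")
```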

## 7. TFIDFVectorizer Class

In [None]:
class TFIDFVectorizer:
    def __init__(self, max_features=10000, min_df=2, max_df=0.8, ngram_range=(1, 2)):
        """
        Initialize TF-IDF Vectorizer.

        Args:
            max_features (int): Maximum vocabulary size (top terms by corpus frequency)
            min_df (int): Minimum number of documents a term must appear in
            max_df (float): Maximum proportion of documents a term may appear in
            ngram_range (tuple): N-gram range: (1, 1) for unigrams, (1, 2) for unigrams + bigrams
        """
        self.vectorizer = TfidfVectorizer(
            max_features=max_features,
            min_df=min_df,
            max_df=max_df,
            ngram_range=ngram_range,
            stop_words="english",
        )
        self.is_fitted = False

    def preprocess_tokens_to_text(self, tokens):
        """Convert token list to text string."""
        if isinstance(tokens, list):
            return " ".join(tokens)
        else:
            return str(tokens)

    def fit(self, text_data):
        """Train TF-IDF vectorizer on text data."""
        if isinstance(text_data, pd.Series):
            processed_text = text_data.apply(self.preprocess_tokens_to_text)
        else:
            processed_text = [
                self.preprocess_tokens_to_text(text) for text in text_data
            ]

        print("Training TF-IDF Vectorizer...")
        self.vectorizer.fit(processed_text)
        self.is_fitted = True
        print(
            f"Completed! Number of features: {len(self.vectorizer.get_feature_names_out())}"
        )
        return self

    def transform(self, text_data):
        """Transform text data into TF-IDF matrix."""
        if not self.is_fitted:
            raise ValueError(
                "Vectorizer has not been trained. Please call fit() method first."
            )

        if isinstance(text_data, pd.Series):
            processed_text = text_data.apply(self.preprocess_tokens_to_text)
        else:
            processed_text = [
                self.preprocess_tokens_to_text(text) for text in text_data
            ]

        print("Vectorizing data...")
        tfidf_matrix = self.vectorizer.transform(processed_text)
        print(f"Completed! Matrix shape: {tfidf_matrix.shape}")
        return tfidf_matrix

    def fit_transform(self, text_data):
        """Train and transform data in one step."""
        return self.fit(text_data).transform(text_data)

    def get_feature_names(self):
        """Get list of feature names."""
        if not self.is_fitted:
            raise ValueError(
                "Vectorizer has not been trained. Please call fit() method first."
            )

        return self.vectorizer.get_feature_names_out().tolist()

    def get_top_features(self, tfidf_matrix, top_n=20):
        """Get the top N features ranked by mean TF-IDF score across all documents."""
        if not self.is_fitted:
            raise ValueError(
                "Vectorizer has not been trained. Please call fit() method first."
            )

        mean_scores = np.array(tfidf_matrix.mean(axis=0)).flatten()
        feature_names = self.get_feature_names()

        feature_scores = list(zip(feature_names, mean_scores))
        feature_scores.sort(key=lambda x: x[1], reverse=True)

        return feature_scores[:top_n]

print("TFIDFVectorizer class defined successfully!")
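
To make the weights produced by the wrapped `TfidfVectorizer` easier to interpret, the sketch below computes TF-IDF by hand in plain Python. It assumes scikit-learn's default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization per document) and ignores n-grams, stop words and vocabulary pruning. The function `tfidf_rows` is a hypothetical helper, not part of the class.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return one {term: weight} dict per tokenized document,
    using smoothed IDF and L2 normalization."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {
        term: math.log((1 + n_docs) / (1 + df)) + 1
        for term, df in doc_freq.items()
    }
    rows = []
    for doc in docs:
        weights = {term: count * idf[term] for term, count in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows


toy_docs = [["good", "book"], ["good", "price"], ["good", "good", "value"]]
rows = tfidf_rows(toy_docs)
for row in rows:
    print({term: round(weight, 3) for term, weight in row.items()})
```

In the first document `book` outweighs `good`, because `good` appears in every document and therefore receives the smallest IDF.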

## 8. Data Loading and Preparation

In [None]:
print("\n=== INITIALIZING DATA LOADER ===")
data_loader = KaggleDataLoader(CONFIG)
train_df, test_df = data_loader.prepare_dataframes()

print(f"\nDataset loaded successfully!")
print(f"Training data shape: {train_df.shape}")
print(f"Test data shape: {test_df.shape}")
print(f"Columns: {list(train_df.columns)}")

## 9. Text Preprocessing

In [None]:
preprocessor = PreProcessor()

# word_tokenize in newer NLTK releases needs the 'punkt_tab' resource
nltk.download('punkt_tab')

print("\n=== TEXT PREPROCESSING ===")
print("Processing training data...")
train_df = preprocessor.clean_data(train_df.copy())
train_df = preprocessor.remove_duplicates(train_df)
# Use efficient pipeline method that combines cleaning, tokenization, stopword removal and normalization
train_df = train_df.assign(
    normalized_input=train_df["input"].apply(preprocessor.preprocess_text_pipeline)
)

print("\nProcessing test data...")
test_df = preprocessor.clean_data(test_df.copy())
test_df = preprocessor.remove_duplicates(test_df)
# Use efficient pipeline method that combines cleaning, tokenization, stopword removal and normalization
test_df = test_df.assign(
    normalized_input=test_df["input"].apply(preprocessor.preprocess_text_pipeline)
)

## 10. Post-Preprocessing Validation

In [None]:
print("\n=== POST-PREPROCESSING VALIDATION ===")
train_empty = (
    train_df["normalized_input"]
    .apply(lambda x: len(x) if isinstance(x, list) else 0)
    .eq(0)
    .sum()
)
test_empty = (
    test_df["normalized_input"]
    .apply(lambda x: len(x) if isinstance(x, list) else 0)
    .eq(0)
    .sum()
)

print(f"Training data quality:")
print(f"   - Final shape: {train_df.shape}")
print(f"   - Empty normalized_input: {train_empty}")
print(
    f"   - Average tokens per document: {train_df['normalized_input'].apply(len).mean():.2f}"
)

print(f"Test data quality:")
print(f"   - Final shape: {test_df.shape}")
print(f"   - Empty normalized_input: {test_empty}")
print(
    f"   - Average tokens per document: {test_df['normalized_input'].apply(len).mean():.2f}"
)

print(f"\nSample processed data:")
print(train_df.head(3))

## 11. Text Analysis Before TF-IDF Vectorization

In [None]:
print("\n=== TEXT ANALYSIS BEFORE TF-IDF VECTORIZATION ===")
text_analyzer = TextAnalyzer()

print("\n1. TRAINING DATA ANALYSIS")
train_analysis = text_analyzer.analyze_text_statistics(train_df, "input")

## 12. Word Cloud Generation

In [None]:
print("\n2. WORD CLOUD GENERATION")
try:
    text_analyzer.generate_wordcloud(
        train_df, "input", figsize=(12, 6), save_path="wordcloud_train.png"
    )
except Exception as e:
    print(f"   Could not generate word cloud: {e}")

## 13. Dataset Comparison

In [None]:
print("\n3. DATASET COMPARISON")
comparison_results = text_analyzer.compare_datasets(train_df, test_df, "input")

## 14. Word Frequency Analysis

In [None]:
print("\n4. WORD FREQUENCY ANALYSIS")
word_freq_report = text_analyzer.get_word_frequency_report(min_frequency=5)
if not word_freq_report.empty:
    print("\nTop 15 words with frequency >= 5:")
    print(word_freq_report.head(15).to_string(index=False))
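
A frequency report of this kind can be built with `collections.Counter`. The sketch below is only an illustration of the idea, not the `TextAnalyzer` implementation; `count_frequent_words` is a hypothetical name and it returns a plain list instead of a dataframe.

```python
from collections import Counter


def count_frequent_words(token_lists, min_frequency=5):
    """Count tokens across documents and keep those seen at least min_frequency times."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    # most_common() returns (word, count) pairs sorted by descending count
    return [(word, n) for word, n in counts.most_common() if n >= min_frequency]


# Toy corpus; in the notebook this could be train_df["normalized_input"]
toy_docs = [["good", "book"], ["good", "price"], ["bad", "book", "good"]]
report = count_frequent_words(toy_docs, min_frequency=2)
print(report)
```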

## 15. TF-IDF Vectorization

In [None]:
print("\n=== TF-IDF VECTORIZATION ===")
tfidf_vectorizer = TFIDFVectorizer(
    max_features=CONFIG["tfidf_max_features"],
    min_df=CONFIG["tfidf_min_df"],
    max_df=CONFIG["tfidf_max_df"],
    ngram_range=CONFIG["ngram_range"],
)
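# Show only the TF-IDF-related settings rather than the whole CONFIG dict
tfidf_config = {
    key: CONFIG[key]
    for key in ("tfidf_max_features", "tfidf_min_df", "tfidf_max_df", "ngram_range")
}
print(f"TF-IDF Configuration: {tfidf_config}")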

print("\nTraining TF-IDF Vectorizer...")
X_train_tfidf = tfidf_vectorizer.fit_transform(train_df["normalized_input"])

print("Transforming test data...")
X_test_tfidf = tfidf_vectorizer.transform(test_df["normalized_input"])
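
The vectorizer is fitted on the training split only and then reused for the test split, so no test-set vocabulary or document frequencies leak into the features. The sketch below shows the same pattern with scikit-learn's `TfidfVectorizer` directly. It assumes the documents are token lists that are joined back into strings; the toy data and the relaxed `min_df`/`max_df` values are illustrative, not the notebook's configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy token lists standing in for the normalized_input columns
train_tokens = [["great", "product"], ["bad", "product"], ["great", "value"]]
test_tokens = [["great", "unseen", "word"]]

# Join tokens back into strings so the default analyzer can build unigrams and bigrams
train_texts = [" ".join(tokens) for tokens in train_tokens]
test_texts = [" ".join(tokens) for tokens in test_tokens]

# min_df=1 and max_df=1.0 because the toy corpus is tiny
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=1, max_df=1.0)

# Vocabulary and IDF weights are learned from the training split only
X_train = vectorizer.fit_transform(train_texts)
# The test split reuses them; terms not seen during fitting are ignored
X_test = vectorizer.transform(test_texts)

print(X_train.shape, X_test.shape)
print(sorted(vectorizer.vocabulary_))
```

Both matrices share the same column layout, which is what allows a model trained on one to score the other.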

## 16. TF-IDF Matrix Analysis

In [None]:
print("\n=== TF-IDF MATRIX ANALYSIS ===")
print("Matrix Information:")
print(f"   Train shape: {X_train_tfidf.shape}")
print(f"   Test shape: {X_test_tfidf.shape}")
print(
    f"   Sparsity: {(1 - X_train_tfidf.nnz / (X_train_tfidf.shape[0] * X_train_tfidf.shape[1])):.4f}"
)
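# .data holds only the stored non-zero values; index arrays are not counted here
print(f"   Memory usage (non-zero values only): ~{X_train_tfidf.data.nbytes / (1024**2):.2f} MB")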

print(f"\nTop 10 Most Important TF-IDF Features:")
try:
    top_features = tfidf_vectorizer.get_top_features(X_train_tfidf, top_n=10)
    for i, (feature, score) in enumerate(top_features, 1):
        print(f"   {i:2d}. {feature:20s} -> {score:.4f}")
except Exception as e:
    print(f"   Could not extract top features: {e}")

## 17. Pipeline Completion Summary

In [None]:
print("\n" + "=" * 60)
print("PIPELINE COMPLETION SUMMARY")
print("=" * 60)
print("Dataset Information:")
print(f"   - Train samples: {len(train_df):,}")
print(f"   - Test samples: {len(test_df):,}")
print(f"   - Features: {X_train_tfidf.shape[1]:,}")

print(f"\nText Analysis Summary:")
if text_analyzer.analysis_results:
    corpus_stats = text_analyzer.analysis_results["corpus_statistics"]
    word_stats = text_analyzer.analysis_results["word_analysis"]
    print(f"   - Vocabulary size: {corpus_stats['vocabulary_size']:,}")
    print(f"   - Total words: {corpus_stats['total_word_occurrences']:,}")
    print(f"   - Average word length: {word_stats['average_word_length']} characters")
    print(
        f"   - Most frequent word: '{word_stats['most_frequent_word'][0]}' ({word_stats['most_frequent_word'][1]:,} times)"
    )

print(f"\nLabel Distribution:")
train_labels = train_df["label"].value_counts()
test_labels = test_df["label"].value_counts()
print(f"   Train: {dict(train_labels)}")
print(f"   Test:  {dict(test_labels)}")

print("\nData ready for machine learning models:")
print(f"   - X_train_tfidf: {X_train_tfidf.shape}")
print(f"   - X_test_tfidf: {X_test_tfidf.shape}")
print(f"   - y_train: {train_df['label'].shape}")
print(f"   - y_test: {test_df['label'].shape}")
print(f"=" * 60)

# Prepare final variables for model training
y_train = train_df['label']
y_test = test_df['label']

print("\nAll variables are ready for machine learning model training!")
print("Variables available:")
print("   - X_train_tfidf: Training features (TF-IDF matrix)")
print("   - X_test_tfidf: Test features (TF-IDF matrix)")
print("   - y_train: Training labels")
print("   - y_test: Test labels")
print("   - train_df: Preprocessed training dataframe")
print("   - test_df: Preprocessed test dataframe")

## 18. Next Steps for Model Training

With the data preprocessed and vectorized, the TF-IDF matrices and labels can be passed directly to a scikit-learn classifier. Three common baselines follow:

### Option 1: Logistic Regression
```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

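# Train model (max_iter raised from the default of 100 so the solver
# has room to converge on high-dimensional TF-IDF features)
lr_model = LogisticRegression(max_iter=1000, random_state=42)
lr_model.fit(X_train_tfidf, y_train)

# Make predictions
y_pred = lr_model.predict(X_test_tfidf)

# Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(classification_report(y_test, y_pred))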
```

### Option 2: Random Forest
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Train model (n_jobs=-1 builds trees on all available CPU cores)
rf_model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf_model.fit(X_train_tfidf, y_train)

# Make predictions and evaluate
y_pred_rf = rf_model.predict(X_test_tfidf)
print(f"RF Accuracy: {accuracy_score(y_test, y_pred_rf):.4f}")
```

### Option 3: SVM
```python
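from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC

# Train model. LinearSVC is used instead of SVC(kernel='linear') because
# SVC training time grows at least quadratically with the number of samples,
# which is impractical at the configured training size.
svm_model = LinearSVC(random_state=42)
svm_model.fit(X_train_tfidf, y_train)

# Make predictions and evaluate
y_pred_svm = svm_model.predict(X_test_tfidf)
print(f"SVM Accuracy: {accuracy_score(y_test, y_pred_svm):.4f}")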
```
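
To compare candidate models on equal footing, each one can be fitted and scored in a loop over the same matrices. The sketch below runs on a toy corpus, so the texts, labels and resulting scores are illustrative only. In the notebook, `X_train_tfidf`, `X_test_tfidf`, `y_train` and `y_test` would take the place of the toy variables.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC

# Toy data standing in for the notebook's matrices and labels
train_texts = [
    "great product", "love it", "great value",
    "terrible quality", "bad product", "hate it",
]
train_labels = [2, 2, 2, 1, 1, 1]
test_texts = ["great quality", "bad value"]
test_labels = [2, 1]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000, random_state=42),
    "linear_svm": LinearSVC(random_state=42),
}

# Fit every model on the same features and record its test accuracy
results = {}
for name, model in models.items():
    model.fit(X_train, train_labels)
    results[name] = accuracy_score(test_labels, model.predict(X_test))

for name, acc in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:20s} accuracy: {acc:.4f}")
```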