# Comprehensive Sentiment Analysis and Machine Learning Framework - NLTK


## Executive Summary
This advanced Python-based sentiment analysis framework offers a robust, end-to-end solution for processing text data, extracting meaningful insights, and leveraging machine learning techniques. Designed to handle complex text analysis challenges, the system integrates natural language processing (NLP) techniques with machine learning to provide comprehensive sentiment evaluation and predictive modeling.

## Core Capabilities
The framework combines sophisticated text preprocessing, sentiment analysis, and machine learning to deliver a powerful analytical tool. By leveraging libraries such as NLTK, pandas, and scikit-learn, the system can process multiple CSV files, clean and transform text data, perform nuanced sentiment analysis, and train predictive models.

## Technical Architecture

### Data Processing Pipeline
- **Automated CSV file discovery and processing**
- **Intelligent data combination and deduplication**
- **Advanced null value handling**
- **Comprehensive text cleaning and preprocessing**

### Sentiment Analysis Methodology
The sentiment analysis component employs a multi-layered approach:

#### 1. Text Preprocessing
- Removes URLs, email addresses, and special characters
- Converts text to lowercase
- Eliminates stopwords and tokenizes text

#### 2. Sentiment Classification
Utilizes NLTK's **SentimentIntensityAnalyzer** and provides detailed sentiment categorization:
- **Very Positive**
- **Positive**
- **Slightly Positive**
- **Neutral**
- **Slightly Negative**
- **Negative**
- **Very Negative**

### Machine Learning Integration
- **Random Forest Classifier** for predictive modeling
- Comprehensive **model evaluation metrics**
- **Feature engineering** and **categorical encoding**
- Robust **train-test split methodology**

## Key Technical Components

### Libraries and Dependencies
- **Natural Language Processing:** NLTK
- **Data Manipulation:** pandas, numpy
- **Machine Learning:** scikit-learn
- **System Interaction:** os, glob, datetime

### Primary Functional Modules
1. **Text Cleaning and Preprocessing**
2. **Sentiment Score Calculation**
3. **Machine Learning Model Training**
4. **Performance Metrics Generation**

## Output and Reporting
The framework generates comprehensive outputs:
- Processed data CSV
- Sentiment distribution analysis
- Detailed model performance metrics
- Timestamped output directories
- Evaluation reports with **accuracy**, **precision**, and **recall** scores

## Practical Applications
The versatile framework is ideal for:
- Social media sentiment analysis
- Customer feedback processing
- Brand perception monitoring
- Text classification projects
- Customer experience insights

## Limitations and Considerations
While powerful, the framework has some constraints:
- Primarily designed for **English-language text**
- **Performance** dependent on input data quality
- Requires structured **CSV input**
- Sentiment analysis based on pre-trained lexicons

## Future Development Roadmap
Potential enhancements include:
- **Multi-language support**
- Advanced **deep learning** integration
- More sophisticated **feature engineering**
- Customizable sentiment thresholds
- Enhanced **visualization** capabilities

## Implementation Guidelines

### Prerequisites
- **Python 3.7+**
- Required libraries installed
- Input **CSV files** prepared

### Execution
1. Place CSV files in the script's directory
2. Run the script
3. Review generated output and metrics

## Conclusion
This sentiment analysis framework represents a sophisticated approach to understanding textual data. By combining advanced NLP techniques with machine learning, it provides organizations with powerful tools to extract meaningful insights from text-based sources.


# NLTK

In [211]:
import pandas as pd
import numpy as np
import glob
import os
from nltk.tokenize import word_tokenize
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
import re
from datetime import datetime

# Download required NLTK data (tokenizer, sentiment lexicon, stopwords)
try:
    nltk.download('punkt')
    nltk.download('vader_lexicon')
    nltk.download('stopwords')
    print("Successfully downloaded NLTK resources")
except Exception as e:
    print(f"Error downloading NLTK resources: {str(e)}")

# Function to create an output directory with a timestamp for each run
def create_output_directory():
    """
    Create output directory with timestamp
    """
    output_dir = os.path.join(os.path.abspath('.'), 'nltk_sentiment_analysis_outputs')
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    
    # Create subdirectory with timestamp for current analysis run
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    run_dir = os.path.join(output_dir, f'analysis_{timestamp}')
    os.makedirs(run_dir)
    
    return run_dir

# Function to clean the text by removing URLs, emails, special characters, and extra whitespace
def clean_text(text):
    """
    Enhanced text cleaning function
    """
    if pd.isna(text):
        return ""
    
    text = str(text).lower()
    # Remove URLs, emails, special characters, numbers, and extra whitespace
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    text = re.sub(r'\S+@\S+', '', text)
    text = re.sub(r'[^\w\s]', ' ', text)
    text = re.sub(r'\d+', '', text)
    text = ' '.join(text.split())
    return text

# Function to preprocess text (clean it, tokenize, and remove stopwords)
def preprocess_text(text):
    """
    Enhanced text preprocessing with stopword removal and better tokenization
    """
    if pd.isna(text):
        return ""
    
    text = clean_text(text)
    tokens = word_tokenize(text)  # Tokenize the text
    stop_words = set(stopwords.words('english'))
    
    # Remove stopwords from tokenized text
    tokens = [token for token in tokens if token not in stop_words]
    
    return ' '.join(tokens)

# Function to classify sentiment into detailed categories based on compound score and individual scores
def get_detailed_sentiment(compound_score, positive_score, negative_score, threshold=0.1):
    """
    Enhanced sentiment classification with more nuanced categories
    """
    # Determine sentiment category based on compound score and positive/negative scores
    if compound_score >= threshold:
        if positive_score >= 0.5:
            return 'very positive'
        return 'positive'
    elif compound_score <= -threshold:
        if negative_score >= 0.5:
            return 'very negative'
        return 'negative'
    elif abs(compound_score) < 0.05:
        return 'neutral'
    else:
        if positive_score > negative_score:
            return 'slightly positive'
        elif negative_score > positive_score:
            return 'slightly negative'
        return 'neutral'

# Main function to process CSV files and perform sentiment analysis
def process_csv_and_analyze():
    """
    Enhanced main function with better sentiment analysis and organized output
    """
    # Create output directory and generate subdirectory for the current run
    output_dir = create_output_directory()
    print(f"\nCreated output directory: {output_dir}")
    
    # Step 1: Locate and read CSV files in the current directory
    directory_path = os.path.abspath('.')
    path_pattern = os.path.join(directory_path, '*.csv')
    all_files = glob.glob(path_pattern)
    
    # Write list of processed files to a text file
    with open(os.path.join(output_dir, 'processed_files.txt'), 'w') as f:
        f.write("Processed Files:\n")
        f.write("================\n\n")
        for file in all_files:
            f.write(f"- {os.path.basename(file)}\n")
    
    print(f"\nFound {len(all_files)} CSV files:")
    for file in all_files:
        print(f"- {os.path.basename(file)}")
    
    # Step 2: Read and combine CSV files while removing empty and duplicate rows
    combined_df = None
    for filename in all_files:
        try:
            df = pd.read_csv(filename)
            
            # Drop empty rows and duplicate rows
            df = df.dropna(how='all').drop_duplicates()
            df['source_file'] = os.path.basename(filename)
            
            if combined_df is None:
                combined_df = df
            else:
                # Concatenate dataframes while avoiding content duplicates
                combined_df = pd.concat([combined_df, df], ignore_index=True)
                content_columns = [col for col in combined_df.columns 
                                 if col != 'source_file' and not col.startswith('sentiment_')]
                combined_df = combined_df.drop_duplicates(subset=content_columns, keep='first')
            
            print(f"\nProcessed: {os.path.basename(filename)}")
            print(f"Current shape: {combined_df.shape}")
        except Exception as e:
            print(f"Error reading {os.path.basename(filename)}: {str(e)}")
    
    if combined_df is None:
        raise ValueError("No CSV files were successfully read")
    
    # Step 3: Identify possible text columns
    possible_text_columns = ['text', 'description', 'comment', 'review', 'content', 'message']
    text_columns = [col for col in combined_df.columns if any(text_name in col.lower() 
                                                            for text_name in possible_text_columns)]
    
    # If no obvious text column is found, prompt the user to specify one
    if not text_columns:
        print("\nNo obvious text columns found. Available columns are:")
        print(combined_df.columns.tolist())
        text_column = input("\nPlease enter the name of the text column to analyze: ")
    else:
        print("\nFound potential text columns:", text_columns)
        if len(text_columns) == 1:
            text_column = text_columns[0]
        else:
            text_column = input("\nPlease enter the name of the text column to analyze: ")
    
    # Step 4: Preprocess text by cleaning, tokenizing, and removing stopwords
    print("\nPreprocessing text...")
    combined_df['processed_text'] = combined_df[text_column].apply(preprocess_text)
    combined_df = combined_df[combined_df['processed_text'].str.len() > 0]  # Remove empty rows
    
    # Step 5: Perform sentiment analysis using NLTK's SentimentIntensityAnalyzer
    print("\nPerforming sentiment analysis...")
    sid = SentimentIntensityAnalyzer()
    
    # Get sentiment scores for each processed text
    sentiment_scores = combined_df['processed_text'].apply(lambda x: sid.polarity_scores(str(x)))
    combined_df['negative'] = sentiment_scores.apply(lambda x: x['neg'])
    combined_df['neutral'] = sentiment_scores.apply(lambda x: x['neu'])
    combined_df['positive'] = sentiment_scores.apply(lambda x: x['pos'])
    combined_df['compound'] = sentiment_scores.apply(lambda x: x['compound'])
    
    # Classify sentiment based on enhanced rules
    combined_df['sentiment'] = combined_df.apply(
        lambda row: get_detailed_sentiment(
            row['compound'],
            row['positive'],
            row['negative']
        ), axis=1
    )
    
    # Add a confidence score based on the compound score and sentiment strengths
    combined_df['sentiment_confidence'] = combined_df.apply(
        lambda row: max(abs(row['compound']), max(row['positive'], row['negative'])), 
        axis=1
    )
    
    # Drop the intermediate 'processed_text' column
    final_df = combined_df.drop(['processed_text'], axis=1)
    
    # Step 6: Generate analysis summary
    sentiment_dist = final_df['sentiment'].value_counts()
    avg_by_sentiment = final_df.groupby('sentiment')[['negative', 'neutral', 'positive', 'compound']].mean()
    
    # Step 7: Save the results and summary to output directory
    try:
        # Verify no duplicates in the final dataset
        assert final_df.shape[0] == final_df.drop_duplicates().shape[0], "Duplicates found in final dataset"
        
        # Save results, summary, distribution, and metadata
        results_path = os.path.join(output_dir, 'sentiment_analysis_results.csv')
        final_df.to_csv(results_path, index=False)
        print(f"\nResults saved to: {results_path}")
        
        summary_path = os.path.join(output_dir, 'sentiment_analysis_summary.txt')
        with open(summary_path, 'w') as f:
            f.write("Enhanced Sentiment Analysis Summary\n")
            f.write("================================\n\n")
            f.write("Sentiment Distribution:\n")
            f.write(str(sentiment_dist))
            f.write("\n\nAverage Scores by Sentiment Category:\n")
            f.write(str(avg_by_sentiment))
            f.write("\n\nConfidence Score Statistics:\n")
            f.write(str(final_df['sentiment_confidence'].describe()))
        print(f"Summary saved to: {summary_path}")
        
        dist_path = os.path.join(output_dir, 'sentiment_distribution.csv')
        sentiment_dist.to_frame().to_csv(dist_path)
        
        meta_path = os.path.join(output_dir, 'analysis_metadata.txt')
        with open(meta_path, 'w') as f:
            f.write(f"Analysis Timestamp: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
            f.write(f"Total Files Processed: {len(all_files)}\n")
            f.write(f"Total Records Analyzed: {len(final_df)}\n")
            f.write(f"Text Column Analyzed: {text_column}\n")
        
        print(f"\nAll analysis outputs saved in: {output_dir}")
        
    except Exception as e:
        print(f"Error saving results: {str(e)}")
    
    return final_df, output_dir

# Execute the analysis if the script is run as the main program
if __name__ == "__main__":
    try:
        print("Starting enhanced sentiment analysis process...")
        results_df, output_path = process_csv_and_analyze()
        print(f"\nAnalysis completed successfully!")
        print(f"Results are saved in: {output_path}")
    except Exception as e:
        print(f"\nAn error occurred during analysis: {str(e)}")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\1520a\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\1520a\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\1520a\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Successfully downloaded NLTK resources
Starting enhanced sentiment analysis process...

Created output directory: C:\Users\1520a\--- MSU MSDS ---\CSE 482\Project\nltk_sentiment_analysis_outputs\analysis_20241214_165118

Found 5 CSV files:
- apex_ad2600_dvd_player_updated.csv
- canon_g3_updated.csv
- nikon_coolpix_4300_updated.csv
- nokia_6610_updated.csv
- nomad_jukebox_zen_xtra_updated.csv

Processed: apex_ad2600_dvd_player_updated.csv
Current shape: (740, 16)

Processed: canon_g3_updated.csv
Current shape: (1337, 16)

Processed: nikon_coolpix_4300_updated.csv
Current shape: (1683, 16)

Processed: nokia_6610_updated.csv
Current shape: (2227, 16)

Processed: nomad_jukebox_zen_xtra_updated.csv
Current shape: (3943, 16)

No obvious text columns found. Available columns are:
['Unnamed: 0', 'title', 'sentence', 'sentiment_dict', 'sentiment_total', '[u]', '[p]', '[s]', '[cc]', '[cs]', 'annotations', 'title_input_ids', 'title_attention_mask', 'sentence_input_ids', 'sentence_attention_mask', 


Please enter the name of the text column to analyze:  sentiment_dict



Preprocessing text...

Performing sentiment analysis...

Results saved to: C:\Users\1520a\--- MSU MSDS ---\CSE 482\Project\nltk_sentiment_analysis_outputs\analysis_20241214_165118\sentiment_analysis_results.csv
Summary saved to: C:\Users\1520a\--- MSU MSDS ---\CSE 482\Project\nltk_sentiment_analysis_outputs\analysis_20241214_165118\sentiment_analysis_summary.txt

All analysis outputs saved in: C:\Users\1520a\--- MSU MSDS ---\CSE 482\Project\nltk_sentiment_analysis_outputs\analysis_20241214_165118

Analysis completed successfully!
Results are saved in: C:\Users\1520a\--- MSU MSDS ---\CSE 482\Project\nltk_sentiment_analysis_outputs\analysis_20241214_165118


# NLTK - with ML model

In [213]:
import pandas as pd
import numpy as np
import glob
import os
from nltk.tokenize import word_tokenize
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
import re
from datetime import datetime
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.metrics import f1_score, precision_score, recall_score

# Download required NLTK data
try:
    # Download necessary NLTK data files
    nltk.download('punkt')
    nltk.download('vader_lexicon')
    nltk.download('stopwords')
    print("Successfully downloaded NLTK resources")
except Exception as e:
    print(f"Error downloading NLTK resources: {str(e)}")

def create_output_directory():
    """
    Create an output directory with a timestamp for storing results.
    """
    # Define main output directory path
    output_dir = os.path.join(os.path.abspath('.'), 'nltk_sentiment_analysis_outputs')
    
    # If the output directory doesn't exist, create it
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    
    # Generate a timestamped subdirectory for this run
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    run_dir = os.path.join(output_dir, f'analysis_{timestamp}')
    
    # Create the subdirectory
    os.makedirs(run_dir)
    
    return run_dir

def clean_text(text):
    """
    Clean and preprocess the raw text data by removing unwanted characters, URLs, and emails.
    """
    if pd.isna(text):
        return ""
    
    text = str(text).lower()
    
    # Remove URLs
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    
    # Remove email addresses
    text = re.sub(r'\S+@\S+', '', text)
    
    # Remove special characters and punctuation
    text = re.sub(r'[^\w\s]', ' ', text)
    
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    
    # Remove extra whitespaces
    text = ' '.join(text.split())
    
    return text

def preprocess_text(text):
    """
    Tokenize the text and remove stopwords.
    """
    if pd.isna(text):
        return ""
    
    # Clean the raw text first
    text = clean_text(text)
    
    # Tokenize the text
    tokens = word_tokenize(text)
    
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]
    
    # Join tokens back into a string
    return ' '.join(tokens)

def handle_null_values(df):
    """
    Replace null values in the dataframe:
    - For numeric columns, replace with 0
    - For text columns, replace with an empty string
    """
    for col in df.columns:
        # Check if the column is numeric
        if df[col].dtype in [np.float64, np.int64]:
            df[col].fillna(0, inplace=True)
        else:
            # For text columns, replace NaN with an empty string
            df[col].fillna("", inplace=True)
    return df

def train_ml_model(df):
    """
    Train a machine learning model (Random Forest) using the sentiment labels and evaluate its performance.
    """
    # Specify the target column for sentiment analysis
    target_column = 'sentiment'
    
    # Check if the target column exists in the dataframe
    if target_column not in df.columns:
        raise ValueError(f"'{target_column}' column not found in the dataset.")

    # Convert categorical sentiment labels to numeric values for model training
    df[target_column] = df[target_column].astype('category').cat.codes

    # Select feature columns and the target variable
    X = df.drop(columns=[target_column, 'source_file'], errors='ignore')  # Drop non-feature columns
    y = df[target_column]  # Target: sentiment labels

    # One-hot encode categorical features
    X = pd.get_dummies(X, drop_first=True)

    # Split the data into training and testing sets (80% train, 20% test)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Train a Random Forest model
    model = RandomForestClassifier(random_state=42)
    model.fit(X_train, y_train)

    # Make predictions on the test set
    y_pred = model.predict(X_test)

    # Print evaluation metrics
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("Classification Report:\n", classification_report(y_test, y_pred))
    print("Accuracy:", accuracy_score(y_test, y_pred))

    # Print additional performance metrics (F1 Score, Precision, Recall)
    print("F1 Score (Macro):", f1_score(y_test, y_pred, average='macro'))
    print("Precision (Macro):", precision_score(y_test, y_pred, average='macro'))
    print("Recall (Macro):", recall_score(y_test, y_pred, average='macro'))

    return model

def process_csv_and_analyze():
    """
    Process CSV files, handle missing values, perform sentiment analysis and train a machine learning model.
    """
    # Create output directory for storing results
    output_dir = create_output_directory()
    print(f"\nCreated output directory: {output_dir}")

    # Step 1: Read all CSV files from the current directory
    directory_path = os.path.abspath('.')
    path_pattern = os.path.join(directory_path, '*.csv')
    all_files = glob.glob(path_pattern)

    # Combine all CSV files into one DataFrame
    combined_df = pd.concat([pd.read_csv(file).assign(source_file=os.path.basename(file)) for file in all_files], ignore_index=True)

    # Step 2: Handle missing values in the DataFrame
    combined_df = handle_null_values(combined_df)

    # Step 3: Identify possible text columns for sentiment analysis
    possible_text_columns = ['text', 'description', 'comment', 'review', 'content', 'message']
    text_columns = [col for col in combined_df.columns if any(name in col.lower() for name in possible_text_columns)]
    
    # Choose the first identified text column (or prompt user if none are found)
    text_column = text_columns[0] if text_columns else input("Please enter the text column: ")

    # Step 4: Preprocess the text data
    combined_df['processed_text'] = combined_df[text_column].apply(preprocess_text)

    # Step 5: Perform sentiment analysis using VADER sentiment analyzer
    sid = SentimentIntensityAnalyzer()
    
    # Get compound sentiment score for each text
    combined_df['compound'] = combined_df['processed_text'].apply(lambda x: sid.polarity_scores(x)['compound'])
    
    # Classify sentiment based on compound score
    combined_df['sentiment'] = combined_df['compound'].apply(lambda x: 'positive' if x > 0.05 else 'negative' if x < -0.05 else 'neutral')

    # Step 6: Save processed data to CSV
    results_path = os.path.join(output_dir, 'processed_data.csv')
    combined_df.to_csv(results_path, index=False)
    print(f"Processed data saved to: {results_path}")

    # Step 7: Train and evaluate machine learning model
    print("\nTraining ML model...")
    train_ml_model(combined_df)

if __name__ == "__main__":
    # Main execution flow
    print("Starting enhanced sentiment analysis and ML process...")
    process_csv_and_analyze()

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\1520a\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\1520a\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\1520a\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df[col].fillna(0, inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate obj

Successfully downloaded NLTK resources
Starting enhanced sentiment analysis and ML process...

Created output directory: C:\Users\1520a\--- MSU MSDS ---\CSE 482\Project\nltk_sentiment_analysis_outputs\analysis_20241214_165208


Please enter the text column:  sentiment_dict


Processed data saved to: C:\Users\1520a\--- MSU MSDS ---\CSE 482\Project\nltk_sentiment_analysis_outputs\analysis_20241214_165208\processed_data.csv

Training ML model...
Confusion Matrix:
 [[  0   5   0]
 [  0 771   0]
 [  0   9   4]]
Classification Report:
               precision    recall  f1-score   support

           0       0.00      0.00      0.00         5
           1       0.98      1.00      0.99       771
           2       1.00      0.31      0.47        13

    accuracy                           0.98       789
   macro avg       0.66      0.44      0.49       789
weighted avg       0.98      0.98      0.98       789

Accuracy: 0.982256020278834
F1 Score (Macro): 0.48719693532940167
Precision (Macro): 0.6607218683651804
Recall (Macro): 0.4358974358974359


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


# Sentiment Analysis and Text Processing Code Overview - BOW Analysis

## Introduction
The provided Python code performs a comprehensive text analysis on CSV files by applying the Bag of Words (BoW) methodology. It includes steps for preprocessing text data, extracting word and bigram frequencies, calculating TF-IDF (Term Frequency-Inverse Document Frequency) scores, generating visualizations (e.g., word clouds), and producing a summary report. The code integrates several libraries and processes text data to generate insightful metrics and visual outputs.

## Key Features

### 1. **Package Installation**
The script checks for required packages and installs them if necessary. This includes:
- **pandas** for data manipulation
- **numpy** for numerical operations
- **scikit-learn** for machine learning and text vectorization
- **nltk** for natural language processing
- **wordcloud** for visualizing text data
- **matplotlib** and **seaborn** for creating visualizations

The `install_requirements` function ensures that the required libraries are installed by checking if they are already available and installing them if they are missing.

### 2. **Text Preprocessing**
The `clean_text` function is responsible for cleaning and preparing text for analysis:
- Converts text to lowercase
- Removes URLs, email addresses, special characters, and numbers
- Eliminates extra whitespaces

This function is applied to the specified text column in the dataset to prepare the text for further analysis.

### 3. **Directory Management**
The `create_output_directory` function creates a new directory with a timestamp where all analysis results will be saved. This helps organize output files for each run.

### 4. **Bag of Words (BoW) and TF-IDF Analysis**
The `perform_bow_analysis` function conducts the core of the analysis:
- **BoW**: This method calculates the frequency of words that appear in the dataset.
- **TF-IDF**: This method calculates the importance of words based on their frequency relative to all documents in the dataset.
- **Bigrams**: It also calculates the frequency of two-word combinations (bigrams) using the CountVectorizer.

Results from these analyses are saved in CSV files for further use. The function also generates visualizations such as a word cloud to represent the most frequent words.

### 5. **Visualization**
- **Word Cloud**: The script uses the `wordcloud` library to generate a word cloud visualizing the frequency of terms found in the text. This is saved as an image file.
- **Bigram Analysis**: Common bigrams (two-word phrases) are calculated, and the results are saved in a CSV file.

### 6. **Summary Report**
A text file summary (`text_analysis_summary.txt`) is created, which includes:
- Document statistics (e.g., total documents, average document length)
- Top 20 most frequent words
- Top 20 most important words based on TF-IDF scores
- Top 20 most common bigrams

### 7. **File Management**
The script reads all CSV files in the current directory, combines them into a single DataFrame, and processes the text column. The user is prompted to select the correct text column if it is not automatically detected.

### 8. **Error Handling**
Error handling is implemented throughout the code:
- If required packages cannot be installed, an error is raised.
- If there are issues reading the CSV files, the script will notify the user.
- Text analysis errors are captured, and detailed error messages are provided.

## Output Files
Upon successful execution, the script generates the following output files:
1. **word_analysis.csv**: Contains word frequencies and corresponding TF-IDF scores.
2. **bigram_analysis.csv**: Contains the frequency of bigrams (two-word combinations).
3. **wordcloud.png**: A visual word cloud generated from the most frequent words.
4. **text_analysis_summary.txt**: A detailed text analysis report.
5. **processed_data.csv**: A cleaned dataset with the text data preprocessed.

## Workflow
1. **Install required packages**: Ensures necessary libraries are installed.
2. **Read CSV files**: All CSV files in the current directory are read and combined into a single DataFrame.
3. **Text preprocessing**: The text data is cleaned (removing URLs, emails, special characters, etc.).
4. **BoW and TF-IDF Analysis**: Word frequencies and TF-IDF scores are calculated.
5. **Bigram Analysis**: Common two-word combinations are identified.
6. **Generate output**: Results are saved in various files, and visualizations are created.

## Conclusion
This Python code provides a comprehensive framework for text analysis, specifically using the Bag of Words and TF-IDF methods. It offers a powerful tool for processing text data, generating insights into word frequency, importance, and common bigrams. Additionally, the script produces visual and textual reports that summarize the findings, making it suitable for various text analysis applications.

By following the outlined structure, users can perform detailed text analysis on their datasets and extract meaningful insights for further decision-making or reporting purposes.


# BOW

In [223]:
import subprocess
import sys

def install_requirements():
    """
    Install required packages if they're not already installed
    """
    required_packages = [
        'pandas',
        'numpy',
        'scikit-learn',
        'nltk',
        'wordcloud',
        'matplotlib',
        'seaborn'
    ]
    
    print("Checking and installing required packages...")
    for package in required_packages:
        try:
            __import__(package)
            print(f"✓ {package} already installed")
        except ImportError:
            print(f"Installing {package}...")
            subprocess.check_call([sys.executable, "-m", "pip", "install", package])
            print(f"✓ {package} installed successfully")

def main():
    """
    Main function to run the analysis
    """
    # First install requirements
    install_requirements()
    
    # Now import required packages
    import pandas as pd
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from collections import Counter
    import os
    from datetime import datetime
    import matplotlib.pyplot as plt
    from wordcloud import WordCloud
    import seaborn as sns
    from nltk.corpus import stopwords
    import nltk
    import re
    import glob
    
    # Download required NLTK data
    try:
        nltk.download('stopwords', quiet=True)
        nltk.download('punkt', quiet=True)
        print("✓ Successfully downloaded NLTK resources")
    except Exception as e:
        print(f"Error downloading NLTK resources: {str(e)}")
        return

    def create_output_directory():
        """
        Create output directory with timestamp
        """
        output_dir = os.path.join(os.path.abspath('.'), 'bow_text_analysis_outputs')
        if not os.path.exists(output_dir):
            os.makedirs(output_dir)
        
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        run_dir = os.path.join(output_dir, f'bow_analysis_{timestamp}')
        os.makedirs(run_dir)
        
        return run_dir

    def clean_text(text):
        """
        Clean text for analysis
        """
        if pd.isna(text):
            return ""
        
        # Convert to string and lowercase
        text = str(text).lower()
        
        # Remove URLs
        text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
        
        # Remove email addresses
        text = re.sub(r'\S+@\S+', '', text)
        
        # Remove special characters and numbers
        text = re.sub(r'[^\w\s]', ' ', text)
        text = re.sub(r'\d+', '', text)
        
        # Remove extra whitespace
        text = ' '.join(text.split())
        
        return text

    def perform_bow_analysis(df, text_column, output_dir):
        """
        Perform Bag of Words analysis
        """
        print("\nStarting Bag of Words analysis...")
        
        # Verify text column exists
        if text_column not in df.columns:
            raise ValueError(f"Column '{text_column}' not found in dataset. Available columns: {', '.join(df.columns)}")
        
        # Clean the text
        print("Cleaning text data...")
        df['cleaned_text'] = df[text_column].apply(clean_text)
        
        # Remove empty texts
        df = df[df['cleaned_text'].str.len() > 0].reset_index(drop=True)
        
        if len(df) == 0:
            raise ValueError("No valid text data remaining after cleaning")
        
        # Initialize vectorizers
        print("Performing text vectorization...")
        # Convert stop words to list instead of set
        stop_words = list(stopwords.words('english'))
        
        # CountVectorizer for basic word frequency
        count_vec = CountVectorizer(max_features=1000, 
                                  stop_words=stop_words,  # Now using list instead of set
                                  min_df=2)
        
        # TF-IDF Vectorizer for word importance
        tfidf_vec = TfidfVectorizer(max_features=1000,
                                   stop_words=stop_words,  # Now using list instead of set
                                   min_df=2)
        
        try:
            # Fit and transform the text
            bow_matrix = count_vec.fit_transform(df['cleaned_text'])
            tfidf_matrix = tfidf_vec.fit_transform(df['cleaned_text'])
            
            # Get feature names
            feature_names = count_vec.get_feature_names_out()
            
            # Calculate word frequencies
            word_freq = pd.DataFrame(bow_matrix.sum(axis=0).T,
                                   index=feature_names,
                                   columns=['frequency']).sort_values('frequency', ascending=False)
            
            # Calculate TF-IDF scores
            tfidf_scores = pd.DataFrame(tfidf_matrix.mean(axis=0).T,
                                      index=feature_names,
                                      columns=['tfidf_score']).sort_values('tfidf_score', ascending=False)
            
            # Combine frequencies and TF-IDF scores
            word_analysis = pd.merge(word_freq, tfidf_scores,
                                   left_index=True, right_index=True,
                                   how='outer').fillna(0)
            
            # Save results
            print("Saving analysis results...")
            word_analysis.to_csv(os.path.join(output_dir, 'word_analysis.csv'))
            
            # Generate and save word clouds
            print("Generating word cloud...")
            plt.figure(figsize=(20,10))
            try:
                wordcloud = WordCloud(width=1600, height=800,
                                    background_color='white',
                                    max_words=100).generate_from_frequencies(
                                        dict(zip(word_freq.index, word_freq['frequency']))
                                    )
                
                plt.imshow(wordcloud, interpolation='bilinear')
                plt.axis('off')
                plt.title('Word Cloud of Most Frequent Terms')
                plt.savefig(os.path.join(output_dir, 'wordcloud.png'), bbox_inches='tight', dpi=300)
            except Exception as e:
                print(f"Warning: Could not generate word cloud: {str(e)}")
            finally:
                plt.close()
            
            # Calculate bigrams
            print("Analyzing bigrams...")
            bigram_vectorizer = CountVectorizer(ngram_range=(2,2),
                                              max_features=100,
                                              stop_words=stop_words)  # Now using list instead of set
            bigram_matrix = bigram_vectorizer.fit_transform(df['cleaned_text'])
            bigram_freq = pd.DataFrame(bigram_matrix.sum(axis=0).T,
                                      index=bigram_vectorizer.get_feature_names_out(),
                                      columns=['frequency']).sort_values('frequency', ascending=False)
            
            bigram_freq.to_csv(os.path.join(output_dir, 'bigram_analysis.csv'))
            
            # Generate summary report
            print("Generating summary report...")
            with open(os.path.join(output_dir, 'text_analysis_summary.txt'), 'w') as f:
                f.write("Text Analysis Summary\n")
                f.write("===================\n\n")
                
                f.write("Document Statistics:\n")
                f.write(f"Total documents analyzed: {len(df)}\n")
                f.write(f"Average document length: {df['cleaned_text'].str.len().mean():.1f} characters\n")
                f.write(f"Unique words analyzed: {len(feature_names)}\n\n")
                
                f.write("Top 20 Most Frequent Words:\n")
                f.write(str(word_freq.head(20)))
                f.write("\n\nTop 20 Most Important Words (TF-IDF):\n")
                f.write(str(tfidf_scores.head(20)))
                f.write("\n\nTop 20 Most Common Bigrams:\n")
                f.write(str(bigram_freq.head(20)))
            
            # Save processed dataset
            df.to_csv(os.path.join(output_dir, 'processed_data.csv'), index=False)
            
            return word_analysis, bigram_freq, df
            
        except Exception as e:
            print(f"Error during text analysis: {str(e)}")
            raise

    # Create output directory
    output_dir = create_output_directory()
    print(f"\nCreated output directory: {output_dir}")
    
    # Read all CSV files in directory
    csv_files = glob.glob('*.csv')
    
    if not csv_files:
        print("No CSV files found in current directory!")
        return
    
    # Read and combine all CSV files
    dfs = []
    for file in csv_files:
        try:
            df = pd.read_csv(file)
            print(f"Read file: {file}")
            dfs.append(df)
        except Exception as e:
            print(f"Error reading {file}: {str(e)}")
    
    if not dfs:
        print("No valid CSV files could be read!")
        return
        
    combined_df = pd.concat(dfs, ignore_index=True)
    
    # Identify text column
    possible_text_columns = ['text', 'description', 'comment', 'review', 'content', 'message']
    text_columns = [col for col in combined_df.columns 
                   if any(text_name in col.lower() for text_name in possible_text_columns)]
    
    if not text_columns:
        print("\nNo obvious text columns found. Available columns are:")
        print(combined_df.columns.tolist())
        text_column = input("\nPlease enter the name of the text column to analyze: ")
    else:
        print("\nFound potential text columns:", text_columns)
        if len(text_columns) == 1:
            text_column = text_columns[0]
        else:
            text_column = input("\nPlease enter the name of the text column to analyze: ")
    
    try:
        # Perform analysis
        word_analysis, bigram_freq, processed_df = perform_bow_analysis(combined_df, text_column, output_dir)
        
        print(f"\nAnalysis completed! Results saved in: {output_dir}")
        print("\nFiles generated:")
        print("1. word_analysis.csv - Word frequencies and TF-IDF scores")
        print("2. bigram_analysis.csv - Common word pairs analysis")
        print("3. wordcloud.png - Visual representation of word frequencies")
        print("4. text_analysis_summary.txt - Detailed analysis report")
        print("5. processed_data.csv - Processed dataset with cleaned text")
        
        return word_analysis, bigram_freq, processed_df, output_dir
    
    except Exception as e:
        print(f"\nAn error occurred during analysis: {str(e)}")
        return None, None, None, output_dir

if __name__ == "__main__":
    try:
        results = main()
        if results[0] is not None:
            print(f"\nAnalysis completed successfully!")
        else:
            print("\nAnalysis completed with errors. Please check the output directory for partial results.")
    except Exception as e:
        print(f"\nAn error occurred: {str(e)}")

Checking and installing required packages...
✓ pandas already installed
✓ numpy already installed
Installing scikit-learn...
✓ scikit-learn installed successfully
✓ nltk already installed
✓ wordcloud already installed
✓ matplotlib already installed
✓ seaborn already installed
✓ Successfully downloaded NLTK resources

Created output directory: C:\Users\1520a\--- MSU MSDS ---\CSE 482\Project\bow_text_analysis_outputs\bow_analysis_20241214_165513
Read file: apex_ad2600_dvd_player_updated.csv
Read file: canon_g3_updated.csv
Read file: nikon_coolpix_4300_updated.csv
Read file: nokia_6610_updated.csv
Read file: nomad_jukebox_zen_xtra_updated.csv

No obvious text columns found. Available columns are:
['Unnamed: 0', 'title', 'sentence', 'sentiment_dict', 'sentiment_total', '[u]', '[p]', '[s]', '[cc]', '[cs]', 'annotations', 'title_input_ids', 'title_attention_mask', 'sentence_input_ids', 'sentence_attention_mask']



Please enter the name of the text column to analyze:  sentence



Starting Bag of Words analysis...
Cleaning text data...
Performing text vectorization...
Saving analysis results...
Generating word cloud...
Analyzing bigrams...
Generating summary report...

Analysis completed! Results saved in: C:\Users\1520a\--- MSU MSDS ---\CSE 482\Project\bow_text_analysis_outputs\bow_analysis_20241214_165513

Files generated:
1. word_analysis.csv - Word frequencies and TF-IDF scores
2. bigram_analysis.csv - Common word pairs analysis
3. wordcloud.png - Visual representation of word frequencies
4. text_analysis_summary.txt - Detailed analysis report
5. processed_data.csv - Processed dataset with cleaned text

Analysis completed successfully!


# Sentiment Analysis and Text Processing Code Overview - TFIDF Analysis

## Overview
This script is designed to perform **TF-IDF (Term Frequency-Inverse Document Frequency)** analysis on text data contained within CSV files. It handles the installation of necessary libraries, prepares the data, performs the TF-IDF analysis, and generates various output files and visualizations to summarize the results.

## Key Steps in the Process

### 1. **Install Required Libraries**
The script first ensures that the required Python libraries are installed, including:
- `pandas`: For data manipulation and analysis.
- `numpy`: For numerical computations.
- `scikit-learn`: For machine learning and vectorization tools (TF-IDF).
- `nltk`: For natural language processing tasks, including stopwords and tokenization.
- `matplotlib` and `seaborn`: For visualizations.

If any required library is not installed, it attempts to install it using `pip`.

### 2. **Data Preprocessing**
The script processes the text data in the following steps:
- **Cleaning Text**: It removes URLs, email addresses, special characters, and numbers. It also converts the text to lowercase and eliminates extra whitespaces.
- **Stopwords Removal**: It uses the list of English stopwords from the `nltk` library to filter out common words like "the", "and", etc.
  
### 3. **TF-IDF Analysis**
The TF-IDF analysis is performed using the `TfidfVectorizer` from the `scikit-learn` library. The key steps in this analysis include:
- **Vectorization**: The text data is converted into a numerical matrix using both **unigrams** and **bigrams** (n-grams of length 1 and 2). The vectorizer is configured to ignore terms that appear in fewer than 2 documents or more than 95% of documents.
- **TF-IDF Calculation**: The script calculates the importance of each term in the context of the documents using the TF-IDF formula:
    - **Term Frequency (TF)**: Measures how often a term appears in a document.
    - **Inverse Document Frequency (IDF)**: Measures how important the term is across all documents. Rare terms have higher IDF scores.
    
### 4. **Output Directory and Files**
The script creates an output directory with a timestamp to store the results, including:
- **`average_tfidf_scores.csv`**: Contains the average TF-IDF score for each term.
- **`document_term_matrix.csv`**: A matrix of terms versus documents showing the TF-IDF scores for each term in each document.
- **Visualizations**:
    - **Top 20 Terms by TF-IDF**: A bar plot showing the terms with the highest average TF-IDF scores.
    - **TF-IDF Distribution**: A histogram showing the distribution of TF-IDF scores across all terms.
    - **Document Similarity Matrix**: A heatmap showing the similarity between documents based on their TF-IDF vectors.
- **`tfidf_analysis_summary.txt`**: A text summary of the analysis, including statistics about the documents and terms, as well as the top terms and most unique terms based on their TF-IDF scores.
  
### 5. **Error Handling**
The script includes error handling for common issues, such as missing text columns, invalid CSV files, and errors during the installation of required packages or during the TF-IDF analysis itself.

## Main Functions

### `install_requirements()`
This function checks if the required libraries (`pandas`, `numpy`, `scikit-learn`, `nltk`, `matplotlib`, `seaborn`) are installed. If any package is missing, it attempts to install it.

### `create_output_directory()`
Creates a directory for storing the analysis results with a timestamp to distinguish between different runs.

### `clean_text(text)`
Cleans the input text by:
- Converting it to lowercase.
- Removing URLs, email addresses, special characters, and digits.
- Removing extra whitespace.

### `perform_tfidf_analysis(df, text_column, output_dir)`
This function performs the actual TF-IDF analysis:
- **Cleaning the text**: Cleans the text data in the specified column of the DataFrame.
- **Vectorization**: Applies the `TfidfVectorizer` to convert the cleaned text into a matrix of TF-IDF scores.
- **Matrix Creation**: Creates a document-term matrix with the TF-IDF scores for each term in each document.
- **Visualizations**: Generates and saves plots for the top terms, the distribution of TF-IDF scores, and document similarity.
- **Summary Report**: Saves a text file with a summary of the analysis, including statistics about the documents, terms, and most unique terms based on TF-IDF scores.

### `main()`
The main function of the script:
1. Installs required libraries.
2. Reads and processes CSV files.
3. Identifies a text column for analysis (based on common column names like 'text', 'description', etc.).
4. Calls the `perform_tfidf_analysis()` function to run the TF-IDF analysis.
5. Saves the results in the output directory.

### Error Handling in `main()`
- The script checks for the existence of text columns and requests user input if no suitable column is automatically found.
- Errors during the analysis or CSV reading are caught and displayed to the user.

## Conclusion
The script provides a comprehensive TF-IDF analysis of textual data, with multiple output files and visualizations to summarize the results. It is designed to handle various types of text columns and ensures that required libraries are installed and available.

## Output Files
Upon successful execution, the following files are generated in the output directory:
1. **`average_tfidf_scores.csv`** - Contains the average TF-IDF scores for each term.
2. **`document_term_matrix.csv`** - Full document-term matrix.
3. **`top_terms_tfidf.png`** - Bar plot of the top 20 terms by average TF-IDF score.
4. **`tfidf_distribution.png`** - Histogram of TF-IDF score distribution.
5. **`document_similarity.png`** - Heatmap of document similarity.
6. **`tfidf_analysis_summary.txt`** - Text summary of the analysis.

# TF IDF

In [144]:
import subprocess
import sys

def install_requirements():
    """
    Install required packages if they're not already installed
    """
    required_packages = [
        'pandas',
        'numpy',
        'scikit-learn',
        'nltk',
        'matplotlib',
        'seaborn'
    ]
    
    print("Checking and installing required packages...")
    for package in required_packages:
        try:
            __import__(package)
            print(f"✓ {package} already installed")
        except ImportError:
            print(f"Installing {package}...")
            subprocess.check_call([sys.executable, "-m", "pip", "install", package])
            print(f"✓ {package} installed successfully")

def main():
    """
    Main function to run the TF-IDF analysis
    """
    # First install requirements
    install_requirements()
    
    # Now import required packages
    import pandas as pd
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    import os
    from datetime import datetime
    import matplotlib.pyplot as plt
    import seaborn as sns
    from nltk.corpus import stopwords
    import nltk
    import re
    import glob
    
    # Download required NLTK data
    try:
        nltk.download('stopwords', quiet=True)
        nltk.download('punkt', quiet=True)
        print("✓ Successfully downloaded NLTK resources")
    except Exception as e:
        print(f"Error downloading NLTK resources: {str(e)}")
        return

    def create_output_directory():
        """Create output directory with timestamp"""
        output_dir = os.path.join(os.path.abspath('.'), 'tfidf_analysis_outputs')
        if not os.path.exists(output_dir):
            os.makedirs(output_dir)
        
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        run_dir = os.path.join(output_dir, f'tfidf_analysis_{timestamp}')
        os.makedirs(run_dir)
        
        return run_dir

    def clean_text(text):
        """Clean text for analysis"""
        if pd.isna(text):
            return ""
        
        # Convert to string and lowercase
        text = str(text).lower()
        
        # Remove URLs
        text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
        
        # Remove email addresses
        text = re.sub(r'\S+@\S+', '', text)
        
        # Remove special characters and numbers
        text = re.sub(r'[^\w\s]', ' ', text)
        text = re.sub(r'\d+', '', text)
        
        # Remove extra whitespace
        text = ' '.join(text.split())
        
        return text

    def perform_tfidf_analysis(df, text_column, output_dir):
        """Perform TF-IDF analysis"""
        print("\nStarting TF-IDF analysis...")
        
        # Verify text column exists
        if text_column not in df.columns:
            raise ValueError(f"Column '{text_column}' not found in dataset. Available columns: {', '.join(df.columns)}")
        
        # Clean the text
        print("Cleaning text data...")
        df['cleaned_text'] = df[text_column].apply(clean_text)
        
        # Remove empty texts
        df = df[df['cleaned_text'].str.len() > 0].reset_index(drop=True)
        
        if len(df) == 0:
            raise ValueError("No valid text data remaining after cleaning")
        
        # Initialize vectorizer
        print("Performing TF-IDF vectorization...")
        stop_words = list(stopwords.words('english'))
        
        tfidf_vectorizer = TfidfVectorizer(
            max_features=1000,
            stop_words=stop_words,
            min_df=2,         # Ignore terms that appear in less than 2 documents
            max_df=0.95,      # Ignore terms that appear in more than 95% of documents
            ngram_range=(1,2) # Include both unigrams and bigrams
        )
        
        try:
            # Fit and transform the text
            tfidf_matrix = tfidf_vectorizer.fit_transform(df['cleaned_text'])
            feature_names = tfidf_vectorizer.get_feature_names_out()
            
            # Convert to dense array for analysis
            dense_tfidf = tfidf_matrix.todense()
            
            # Calculate document-term importance
            doc_term_matrix = pd.DataFrame(
                dense_tfidf,
                columns=feature_names
            )
            
            # Calculate average TF-IDF scores across all documents
            avg_tfidf = pd.DataFrame({
                'term': feature_names,
                'avg_tfidf': doc_term_matrix.mean().values,
                'docs_present': (doc_term_matrix > 0).sum().values
            }).sort_values('avg_tfidf', ascending=False)
            
            # Save results
            print("Saving analysis results...")
            
            # Save average TF-IDF scores
            avg_tfidf.to_csv(os.path.join(output_dir, 'average_tfidf_scores.csv'), index=False)
            
            # Save document-term matrix
            doc_term_matrix.to_csv(os.path.join(output_dir, 'document_term_matrix.csv'))
            
            # Generate visualizations
            print("Generating visualizations...")
            
            # Plot top terms by average TF-IDF score
            plt.figure(figsize=(15, 8))
            sns.barplot(data=avg_tfidf.head(20), x='avg_tfidf', y='term')
            plt.title('Top 20 Terms by Average TF-IDF Score')
            plt.tight_layout()
            plt.savefig(os.path.join(output_dir, 'top_terms_tfidf.png'), dpi=300, bbox_inches='tight')
            plt.close()
            
            # Plot term frequency distribution
            plt.figure(figsize=(15, 8))
            sns.histplot(data=avg_tfidf, x='avg_tfidf', bins=50)
            plt.title('Distribution of TF-IDF Scores')
            plt.xlabel('TF-IDF Score')
            plt.ylabel('Count')
            plt.savefig(os.path.join(output_dir, 'tfidf_distribution.png'), dpi=300, bbox_inches='tight')
            plt.close()
            
            # Generate document similarity matrix
            print("Calculating document similarity...")
            similarity_matrix = (tfidf_matrix * tfidf_matrix.T).toarray()
            
            # Plot document similarity heatmap
            plt.figure(figsize=(12, 8))
            sns.heatmap(similarity_matrix[:50, :50], cmap='YlOrRd')  # Limited to first 50 documents for visibility
            plt.title('Document Similarity Matrix (First 50 Documents)')
            plt.savefig(os.path.join(output_dir, 'document_similarity.png'), dpi=300, bbox_inches='tight')
            plt.close()
            
            # Generate summary report
            print("Generating summary report...")
            with open(os.path.join(output_dir, 'tfidf_analysis_summary.txt'), 'w') as f:
                f.write("TF-IDF Analysis Summary\n")
                f.write("=====================\n\n")
                
                f.write("Document Statistics:\n")
                f.write(f"Total documents analyzed: {len(df)}\n")
                f.write(f"Average document length: {df['cleaned_text'].str.len().mean():.1f} characters\n")
                f.write(f"Unique terms analyzed: {len(feature_names)}\n\n")
                
                f.write("Top 20 Most Important Terms (by average TF-IDF):\n")
                f.write(avg_tfidf.head(20).to_string())
                f.write("\n\nTF-IDF Score Statistics:\n")
                f.write(f"Mean TF-IDF score: {avg_tfidf['avg_tfidf'].mean():.4f}\n")
                f.write(f"Median TF-IDF score: {avg_tfidf['avg_tfidf'].median():.4f}\n")
                f.write(f"Max TF-IDF score: {avg_tfidf['avg_tfidf'].max():.4f}\n")
                
                # Find most unique terms (high TF-IDF, low document frequency)
                unique_terms = avg_tfidf[avg_tfidf['docs_present'] <= len(df) * 0.1].head(20)
                f.write("\n\nMost Unique Terms (high TF-IDF, present in <10% of documents):\n")
                f.write(unique_terms.to_string())
            
            return doc_term_matrix, avg_tfidf, similarity_matrix
            
        except Exception as e:
            print(f"Error during TF-IDF analysis: {str(e)}")
            raise

    # Create output directory
    output_dir = create_output_directory()
    print(f"\nCreated output directory: {output_dir}")
    
    # Read all CSV files in directory
    csv_files = glob.glob('*.csv')
    
    if not csv_files:
        print("No CSV files found in current directory!")
        return
    
    # Read and combine all CSV files
    dfs = []
    for file in csv_files:
        try:
            df = pd.read_csv(file)
            print(f"Read file: {file}")
            dfs.append(df)
        except Exception as e:
            print(f"Error reading {file}: {str(e)}")
    
    if not dfs:
        print("No valid CSV files could be read!")
        return
        
    combined_df = pd.concat(dfs, ignore_index=True)
    
    # Identify text column
    possible_text_columns = ['text', 'description', 'comment', 'review', 'content', 'message']
    text_columns = [col for col in combined_df.columns 
                   if any(text_name in col.lower() for text_name in possible_text_columns)]
    
    if not text_columns:
        print("\nNo obvious text columns found. Available columns are:")
        print(combined_df.columns.tolist())
        text_column = input("\nPlease enter the name of the text column to analyze: ")
    else:
        print("\nFound potential text columns:", text_columns)
        if len(text_columns) == 1:
            text_column = text_columns[0]
        else:
            text_column = input("\nPlease enter the name of the text column to analyze: ")
    
    try:
        # Perform analysis
        doc_term_matrix, avg_tfidf, similarity_matrix = perform_tfidf_analysis(combined_df, text_column, output_dir)
        
        print(f"\nAnalysis completed! Results saved in: {output_dir}")
        print("\nFiles generated:")
        print("1. average_tfidf_scores.csv - Average TF-IDF scores for each term")
        print("2. document_term_matrix.csv - Full document-term matrix")
        print("3. top_terms_tfidf.png - Bar plot of top terms by TF-IDF score")
        print("4. tfidf_distribution.png - Distribution of TF-IDF scores")
        print("5. document_similarity.png - Heatmap of document similarity")
        print("6. tfidf_analysis_summary.txt - Detailed analysis report")
        
        return doc_term_matrix, avg_tfidf, similarity_matrix, output_dir
    
    except Exception as e:
        print(f"\nAn error occurred during analysis: {str(e)}")
        return None, None, None, output_dir

if __name__ == "__main__":
    try:
        results = main()
        if results[0] is not None:
            print(f"\nAnalysis completed successfully!")
        else:
            print("\nAnalysis completed with errors. Please check the output directory for partial results.")
    except Exception as e:
        print(f"\nAn error occurred: {str(e)}")

Checking and installing required packages...
✓ pandas already installed
✓ numpy already installed
Installing scikit-learn...
✓ scikit-learn installed successfully
✓ nltk already installed
✓ matplotlib already installed
✓ seaborn already installed
✓ Successfully downloaded NLTK resources

Created output directory: C:\Users\1520a\--- MSU MSDS ---\CSE 482\Project\tfidf_analysis_outputs\tfidf_analysis_20241125_173018
Read file: apex_ad2600_dvd_player_updated.csv
Read file: canon_g3_updated.csv
Read file: nikon_coolpix_4300_updated.csv
Read file: nokia_6610_updated.csv
Read file: nomad_jukebox_zen_xtra_updated.csv

No obvious text columns found. Available columns are:
['Unnamed: 0', 'title', 'sentence', 'sentiment_dict', 'sentiment_total', '[u]', '[p]', '[s]', '[cc]', '[cs]', 'annotations', 'title_input_ids', 'title_attention_mask', 'sentence_input_ids', 'sentence_attention_mask']



Please enter the name of the text column to analyze:  sentence



Starting TF-IDF analysis...
Cleaning text data...
Performing TF-IDF vectorization...
Saving analysis results...
Generating visualizations...
Calculating document similarity...
Generating summary report...

Analysis completed! Results saved in: C:\Users\1520a\--- MSU MSDS ---\CSE 482\Project\tfidf_analysis_outputs\tfidf_analysis_20241125_173018

Files generated:
1. average_tfidf_scores.csv - Average TF-IDF scores for each term
2. document_term_matrix.csv - Full document-term matrix
3. top_terms_tfidf.png - Bar plot of top terms by TF-IDF score
4. tfidf_distribution.png - Distribution of TF-IDF scores
5. document_similarity.png - Heatmap of document similarity
6. tfidf_analysis_summary.txt - Detailed analysis report

Analysis completed successfully!
