# TikTok Sentiment Analysis - Data Preprocessing

This notebook contains the complete preprocessing pipeline for TikTok comment sentiment analysis.
It covers data loading, text cleaning, initial sentiment labeling, exploratory data analysis, and feature engineering, and ends by exporting the data for modeling.

## Import Required Libraries

Import necessary libraries for data manipulation, text processing, and visualization.

In [None]:
# Data manipulation and analysis
import pandas as pd
import numpy as np

# Text processing
import re
import string
from textblob import TextBlob
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from wordcloud import WordCloud

# Machine Learning
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Warnings
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8')

## Data Loading and Initial Exploration

Load the TikTok comments data and perform initial exploration.

In [None]:
# Load the comments data
df = pd.read_csv('../data/comments.csv')

print("Dataset Shape:", df.shape)
print("\nFirst 5 rows:")
df.head()

In [None]:
# Basic information about the dataset
print("Dataset Info:")
print(df.info())

print("\nMissing values:")
print(df.isnull().sum())

print("\nBasic statistics:")
print(df.describe())

## Text Preprocessing Functions

Define functions for cleaning and preprocessing text data.

In [None]:
# Download required NLTK data
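nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # tokenizer data used by word_tokenize in newer NLTK releases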
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('omw-1.4', quiet=True)

# Initialize lemmatizer and stopwords
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def clean_text(text):
    """
    Clean and preprocess text data
    """
    if pd.isna(text):
        return ""
    
    # Convert to lowercase (str() guards against non-string values such as numbers)
    text = str(text).lower()
    
    # Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)  # 'http\S+' already matches https URLs
    
    # Remove user mentions and hashtags
    text = re.sub(r'@\w+|#\w+', '', text)
    
    # Remove punctuation and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    
    # Remove extra whitespaces
    text = re.sub(r'\s+', ' ', text).strip()
    
    # Tokenize
    tokens = word_tokenize(text)
    
    # Remove stopwords and lemmatize
    tokens = [lemmatizer.lemmatize(token) for token in tokens if token not in stop_words and len(token) > 2]
    
    return ' '.join(tokens)
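
To see what the regex stages of `clean_text` do on their own, the sketch below reproduces only those steps with the standard library, leaving out tokenization, stopword removal, and lemmatization. The helper name `strip_noise` and the sample comment are hypothetical and chosen for illustration.

```python
import re

def strip_noise(text: str) -> str:
    """Regex-only stages of the cleaning step (no tokenizing or lemmatizing)."""
    text = text.lower()
    text = re.sub(r'http\S+|www\S+', '', text)   # URLs
    text = re.sub(r'@\w+|#\w+', '', text)        # user mentions and hashtags
    text = re.sub(r'[^a-z\s]', '', text)         # punctuation, digits, emoji
    return re.sub(r'\s+', ' ', text).strip()     # collapse repeated whitespace

# Hypothetical comment, not taken from the dataset
sample = "OMG @user123 this is SO good!!! 😍 https://example.com #fyp 10/10"
print(strip_noise(sample))  # omg this is so good
```

Note that emoji, ratings such as `10/10`, and hashtag text are discarded at this stage, even though they can carry sentiment.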

## Apply Text Preprocessing

Clean the comment text and create processed versions.

In [None]:
# Apply text cleaning
df['cleaned_comment'] = df['comment_text'].apply(clean_text)

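# Remove empty comments after cleaning; copy so later column assignments
# operate on an independent DataFrame rather than a view of the original
df = df[df['cleaned_comment'].str.len() > 0].copy()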

print(f"Dataset shape after cleaning: {df.shape}")
print("\nSample cleaned comments:")
for i in range(min(3, len(df))):  # avoid IndexError when fewer than 3 rows remain
    print(f"Original: {df.iloc[i]['comment_text']}")
    print(f"Cleaned: {df.iloc[i]['cleaned_comment']}")
    print("-" * 50)

## Sentiment Analysis with TextBlob

Perform initial sentiment analysis using TextBlob to create labels.

In [None]:
def get_sentiment(text):
    """
    Get sentiment polarity and subjectivity using TextBlob
    """
    blob = TextBlob(text)
    return blob.sentiment.polarity, blob.sentiment.subjectivity

def classify_sentiment(polarity):
    """
    Classify sentiment based on polarity score
    """
    if polarity > 0.1:
        return 'positive'
    elif polarity < -0.1:
        return 'negative'
    else:
        return 'neutral'

# Apply sentiment analysis
sentiment_data = df['comment_text'].apply(get_sentiment)
df['polarity'] = [x[0] for x in sentiment_data]
df['subjectivity'] = [x[1] for x in sentiment_data]
df['sentiment'] = df['polarity'].apply(classify_sentiment)

print("Sentiment distribution:")
print(df['sentiment'].value_counts())
print("\nPercentage distribution:")
print(df['sentiment'].value_counts(normalize=True) * 100)
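
The comparisons in `classify_sentiment` are strict, so a polarity of exactly 0.1 or -0.1 is labeled neutral. The standalone sketch below shows that boundary behavior; the `threshold` parameter is an assumption added for illustration and is not part of the function above.

```python
def classify_sentiment(polarity: float, threshold: float = 0.1) -> str:
    """Map a polarity score in [-1, 1] to a label using a symmetric neutral band."""
    if polarity > threshold:
        return 'positive'
    if polarity < -threshold:
        return 'negative'
    return 'neutral'

for score in (-0.5, -0.1, 0.0, 0.1, 0.35):
    print(f"{score:+.2f} -> {classify_sentiment(score)}")
# -0.50 -> negative
# -0.10 -> neutral
# +0.00 -> neutral
# +0.10 -> neutral
# +0.35 -> positive
```

Widening the band moves more comments into the neutral class, so the threshold is worth revisiting once the class distribution is known.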

## Exploratory Data Analysis

Visualize the data and sentiment distributions.

In [None]:
# Set up the plotting environment
plt.figure(figsize=(15, 10))

# 1. Sentiment Distribution
plt.subplot(2, 3, 1)
sentiment_counts = df['sentiment'].value_counts()
plt.pie(sentiment_counts.values, labels=sentiment_counts.index, autopct='%1.1f%%')
plt.title('Sentiment Distribution')

# 2. Polarity Distribution
plt.subplot(2, 3, 2)
plt.hist(df['polarity'], bins=20, alpha=0.7, color='skyblue')
plt.title('Polarity Score Distribution')
plt.xlabel('Polarity')
plt.ylabel('Frequency')

# 3. Subjectivity Distribution
plt.subplot(2, 3, 3)
plt.hist(df['subjectivity'], bins=20, alpha=0.7, color='lightcoral')
plt.title('Subjectivity Score Distribution')
plt.xlabel('Subjectivity')
plt.ylabel('Frequency')

# 4. Comment Length Distribution
plt.subplot(2, 3, 4)
df['comment_length'] = df['comment_text'].str.len()
plt.hist(df['comment_length'], bins=20, alpha=0.7, color='lightgreen')
plt.title('Comment Length Distribution')
plt.xlabel('Characters')
plt.ylabel('Frequency')

# 5. Sentiment vs Length
plt.subplot(2, 3, 5)
sns.boxplot(data=df, x='sentiment', y='comment_length')
plt.title('Comment Length by Sentiment')

# 6. Polarity vs Subjectivity
plt.subplot(2, 3, 6)
colors = {'positive': 'green', 'negative': 'red', 'neutral': 'gray'}
for sentiment in df['sentiment'].unique():
    subset = df[df['sentiment'] == sentiment]
    plt.scatter(subset['polarity'], subset['subjectivity'], 
               c=colors[sentiment], label=sentiment, alpha=0.6)
plt.xlabel('Polarity')
plt.ylabel('Subjectivity')
plt.title('Polarity vs Subjectivity')
plt.legend()

plt.tight_layout()
plt.show()

## Word Cloud Analysis

Generate word clouds for different sentiment categories.

In [None]:
# Generate word clouds for each sentiment
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

sentiments = ['positive', 'negative', 'neutral']
# Matplotlib colormap name for each sentiment panel
colormaps = ['Greens', 'Reds', 'Blues']

for i, sentiment in enumerate(sentiments):
    text = ' '.join(df[df['sentiment'] == sentiment]['cleaned_comment'])
    
    if text.strip():  # Only create wordcloud if there's text
        wordcloud = WordCloud(width=400, height=300, 
                            background_color='white',
                            colormap=colormaps[i]).generate(text)
        
        axes[i].imshow(wordcloud, interpolation='bilinear')
        axes[i].set_title(f'{sentiment.title()} Comments', fontsize=16)
        axes[i].axis('off')
    else:
        axes[i].text(0.5, 0.5, f'No {sentiment} comments', 
                    horizontalalignment='center', verticalalignment='center')
        axes[i].set_title(f'{sentiment.title()} Comments', fontsize=16)
        axes[i].axis('off')  # hide ticks so the empty panel matches the others

plt.tight_layout()
plt.show()
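
Word clouds are hard to compare numerically, so a plain frequency count can complement them. The sketch below uses `collections.Counter` on already-cleaned strings; the helper `top_words` and the sample comments are hypothetical.

```python
from collections import Counter

def top_words(comments, n=3):
    """Return the n most frequent whitespace-separated words across comments."""
    counts = Counter(word for comment in comments for word in comment.split())
    return counts.most_common(n)

# Hypothetical cleaned comments, standing in for df['cleaned_comment']
cleaned = ["love video", "love song much", "video great"]
print(top_words(cleaned, n=2))  # [('love', 2), ('video', 2)]
```

In the notebook, the same function could be called on the `cleaned_comment` values of each sentiment group in turn.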

## Feature Engineering

Create additional features that might be useful for sentiment prediction.

In [None]:
# Create additional features
df['word_count'] = df['cleaned_comment'].str.split().str.len()
df['exclamation_count'] = df['comment_text'].str.count('!')
df['question_count'] = df['comment_text'].str.count(r'\?')  # raw string: '\?' is an invalid escape in a normal string
df['caps_ratio'] = df['comment_text'].apply(lambda x: sum(1 for c in x if c.isupper()) / len(x) if len(x) > 0 else 0)

# Show feature correlations with sentiment
feature_cols = ['word_count', 'exclamation_count', 'question_count', 'caps_ratio', 'polarity', 'subjectivity']
correlation_data = df[feature_cols + ['sentiment']].copy()

# Convert sentiment to numeric for correlation
sentiment_map = {'negative': -1, 'neutral': 0, 'positive': 1}
correlation_data['sentiment_numeric'] = correlation_data['sentiment'].map(sentiment_map)

plt.figure(figsize=(10, 8))
correlation_matrix = correlation_data.drop('sentiment', axis=1).corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()

print("Feature statistics by sentiment:")
print(df.groupby('sentiment')[feature_cols].mean())
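
As a sanity check on the surface features, the sketch below computes the same three quantities for a single string in plain Python. The function name `surface_features` and the sample text are hypothetical.

```python
def surface_features(text: str) -> dict:
    """Punctuation counts and uppercase ratio for one raw comment."""
    n_chars = len(text)
    return {
        'exclamation_count': text.count('!'),
        'question_count': text.count('?'),
        # Ratio is taken over all characters, including spaces and punctuation
        'caps_ratio': sum(c.isupper() for c in text) / n_chars if n_chars else 0.0,
    }

features = surface_features("WOW!! Really?")
print(features)
```

Here the string has 13 characters, 4 of them uppercase, so `caps_ratio` is 4/13. Because spaces and punctuation count toward the denominator, short comments with heavy punctuation get a lower ratio than their letters alone would suggest.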

## Save Preprocessed Data

Save the cleaned and processed dataset for use in modeling.

In [None]:
# Save the preprocessed dataset
output_path = '../data/preprocessed_comments.csv'
df.to_csv(output_path, index=False)

print(f"Preprocessed data saved to: {output_path}")
print(f"Final dataset shape: {df.shape}")
print(f"Columns: {list(df.columns)}")

# Display sample of final dataset
print("\nFinal dataset sample:")
df.head()

## Next Steps

This preprocessing notebook has completed the following steps:

1. ✅ Data loading and exploration
2. ✅ Text cleaning and preprocessing
3. ✅ Initial sentiment analysis with TextBlob
4. ✅ Exploratory data analysis and visualization
5. ✅ Feature engineering
6. ✅ Data export for modeling

**Next steps for the project:**
- Create a machine learning model for sentiment classification
- Implement the TikTok comment fetcher
- Build the Streamlit web application
- Deploy the application

The preprocessed data is now ready for machine learning model training!