# Notebook 1: Data Preprocessing

This notebook performs data cleaning and preprocessing on the raw Twitter dataset.

## Steps:
1. Load raw dataset
2. Remove empty tweets
3. Remove duplicate tweets
4. Filter out @grok queries
5. Filter English language tweets
6. Add engagement and text features
7. Save cleaned dataset
8. Optionally apply the text cleaning pipeline and save a second copy
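
Steps 2 and 3 can be pictured with plain pandas. The sketch below runs on a toy frame and is not the code inside the `preprocessing` module; the function name `drop_empty_and_duplicates` and the choice to deduplicate on the tweet text alone are assumptions made for illustration.

```python
import pandas as pd

# Toy frame standing in for the raw CSV; only the column name comes from the notebook.
toy = pd.DataFrame({
    "Tweet Content": ["Great signing", None, "   ", "Great signing", "@grok\n who is he?"],
})


def drop_empty_and_duplicates(df: pd.DataFrame, text_col: str = "Tweet Content") -> pd.DataFrame:
    """Hypothetical stand-in for steps 2 and 3 (not the project's implementation)."""
    # Step 2: drop rows whose text is missing or whitespace only.
    text = df[text_col].fillna("").astype(str).str.strip()
    out = df[text != ""]
    # Step 3: keep the first occurrence of each distinct tweet text.
    # ASSUMPTION: duplicates are judged on the text column only.
    return out.drop_duplicates(subset=text_col, keep="first")


result = drop_empty_and_duplicates(toy)
print(result["Tweet Content"].tolist())
```

The original row index is left untouched here, which matches the non-contiguous index (0, 5, 10, ...) visible in the cleaned samples later in the notebook.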

In [4]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import sys

# Add project root to Python path for proper module imports
project_root = os.path.dirname(os.getcwd())  # Go up one level from notebooks/
if project_root not in sys.path:
    sys.path.insert(0, project_root)

# Import custom modules
from src import utils, preprocessing, feature_engineering, models

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 150)
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

## 1. Load Raw Dataset

In [5]:
# Load raw data
data_path = utils.get_data_path('tweets.csv')
df = pd.read_csv(data_path)

print(f"\n{'='*60}")
print("DATASET OVERVIEW")
print(f"{'='*60}")
print(f"Total tweets: {len(df)}")
print(f"\nColumns: {list(df.columns)}")
print(f"\nDataset shape: {df.shape}")
print(f"\nData types:\n{df.dtypes}")
print(f"\n{'='*60}\n")


DATASET OVERVIEW
Total tweets: 408

Columns: ['Tweet Link', 'Author Handle', 'Tweet Content', 'Views', 'Likes', 'Retweets', 'Replies', 'Tweet Creation Date', 'Scraped Date']

Dataset shape: (408, 9)

Data types:
Tweet Link             object
Author Handle          object
Tweet Content          object
Views                   int64
Likes                   int64
Retweets                int64
Replies                 int64
Tweet Creation Date    object
Scraped Date           object
dtype: object




## 2. Explore Raw Data

In [6]:
# Display sample tweets
print("Sample raw tweets:")
print(df[['Tweet Content', 'Likes', 'Retweets']].head(10))

# Check for missing values
print("\n\nMissing values per column:")
print(df.isnull().sum())

Sample raw tweets:
                                                                                                    Tweet Content  \
0                                                            We tried to stop it from overthinking.\n\nWe failed.   
1                                                                                                             Waw   
2                                                                                              @grok\n who is he?   
3                                                                @grok\n make him bald and resemble Enzo Maresca.   
4                                                         @grok\n remove Liam rosenior name and put Arnold Masoha   
5  I predicted this\nThey need some one they can control without him battling and eye.\nThey need a 'yes sir' man   
6                                                                                @grok\n what is the likes count?   
7                                            

## 3. Data Cleaning Pipeline

In [7]:
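# Run the preprocessing pipeline (filtering + feature columns).
# clean_text_content=False keeps the raw tweet text; text cleaning is applied separately in section 7.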
df_clean = preprocessing.preprocess_pipeline(df, clean_text_content=False)

# Display cleaned dataset info
print(f"\n{'='*60}")
print("CLEANED DATASET OVERVIEW")
print(f"{'='*60}")
print(f"Total tweets after cleaning: {len(df_clean)}")
print(f"Removed: {len(df) - len(df_clean)} tweets ({((len(df) - len(df_clean))/len(df)*100):.1f}%)")
print(f"\nDataset shape: {df_clean.shape}")
print(f"\n{'='*60}\n")


=== Starting Data Preprocessing Pipeline ===

Initial dataset: 408 tweets

✓ Removed 15 empty tweets
✓ Removed 4 duplicate tweets
✓ Removed 19 @grok query tweets
Detecting tweet languages...
✓ Removed 41 non-English tweets
✓ Remaining English tweets: 329
✓ Added engagement and text features

=== Preprocessing Complete ===
Final dataset: 329 tweets


CLEANED DATASET OVERVIEW
Total tweets after cleaning: 329
Removed: 79 tweets (19.4%)

Dataset shape: (329, 14)
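
The shape grows from 9 to 14 columns because the pipeline appends five feature columns, whose names appear in the saved column list: `engagement_score`, `tweet_length`, `word_count`, `hashtag_count` and `mention_count`. The notebook does not show how they are computed, so the following is only a sketch under stated assumptions: `add_features_sketch` is a hypothetical name, and the engagement score is assumed to be an unweighted sum of likes, retweets and replies.

```python
import pandas as pd


def add_features_sketch(df: pd.DataFrame, text_col: str = "Tweet Content") -> pd.DataFrame:
    """Illustrative feature engineering; the real formulas may differ."""
    out = df.copy()
    # ASSUMPTION: unweighted sum of interactions. The project may weight or normalise by Views.
    out["engagement_score"] = out["Likes"] + out["Retweets"] + out["Replies"]

    text = out[text_col].astype(str)
    out["tweet_length"] = text.str.len()              # characters, including whitespace
    out["word_count"] = text.str.split().str.len()    # whitespace-separated tokens
    out["hashtag_count"] = text.str.count(r"#\w+")    # tokens such as #Chelsea
    out["mention_count"] = text.str.count(r"@\w+")    # tokens such as @user
    return out


toy = pd.DataFrame({
    "Tweet Content": ["Welcome to #Chelsea @coach", "Big job ahead"],
    "Likes": [10, 3],
    "Retweets": [2, 0],
    "Replies": [1, 1],
})
features = add_features_sketch(toy)
print(features[["engagement_score", "tweet_length", "word_count",
                "hashtag_count", "mention_count"]])
```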




## 4. Explore Cleaned Data

In [8]:
# Display sample cleaned tweets
print("Sample cleaned tweets:")
print(df_clean[['Tweet Content', 'engagement_score', 'tweet_length', 'word_count']].head(10))

# Show statistics
print("\n\nEngagement Statistics:")
print(df_clean[['Views', 'Likes', 'Retweets', 'Replies', 'engagement_score']].describe())

print("\n\nText Statistics:")
print(df_clean[['tweet_length', 'word_count', 'hashtag_count', 'mention_count']].describe())

Sample cleaned tweets:
                                                                                                                                            Tweet Content  \
0                                                                                                    We tried to stop it from overthinking.\n\nWe failed.   
5                                          I predicted this\nThey need some one they can control without him battling and eye.\nThey need a 'yes sir' man   
10                            Well, hope this new coach meets the expectation of the board and fans. Big coaches tend to turn down the chelsea job offer.   
11  Chelsea always sign coaches for 6 and half years before they'll fire them a year later after disastrous campaign. I pray he managed to see till De...   
12                                                                                                I give him till the next international break, march max   
14                                 

## 5. Check @grok Query Filtering

In [9]:
# Show examples of @grok queries that the pipeline removed
print("Examples of @grok queries (removed from dataset):")
grok_queries = df[df['Tweet Content'].apply(preprocessing.is_grok_query)]['Tweet Content'].head(5)
for i, tweet in enumerate(grok_queries, 1):
    print(f"{i}. {tweet}\n")

Examples of @grok queries (removed from dataset):
1. @grok
 who is he?

2. @grok
 make him bald and resemble Enzo Maresca.

3. @grok
 remove Liam rosenior name and put Arnold Masoha

4. @grok
 what is the likes count?

5. @grok
 just lol  it will end tears
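
All five examples begin with the `@grok` handle, so a prefix check would be enough to reproduce them. The sketch below is a guess at that behaviour, not the body of `preprocessing.is_grok_query`; the name `looks_like_grok_query` and the regular expression are assumptions.

```python
import re

# ASSUMPTION: a tweet counts as a Grok query when it starts with the @grok handle.
_GROK_PREFIX = re.compile(r"^\s*@grok\b", re.IGNORECASE)


def looks_like_grok_query(text) -> bool:
    """Return True when the tweet text opens with '@grok' (case-insensitive)."""
    if not isinstance(text, str):
        return False  # NaN or other non-string cells are never queries
    return bool(_GROK_PREFIX.match(text))


for sample in ["@grok\n who is he?", "I predicted this", "Ask @grok later"]:
    print(repr(sample), looks_like_grok_query(sample))
```

A prefix rule like this also catches replies that merely address `@grok` without asking anything, such as example 5, which is consistent with that tweet having been removed.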



## 6. Save Cleaned Dataset

In [10]:
# Ensure output directory exists
utils.ensure_directories()

# Save cleaned dataset
output_path = utils.get_processed_data_path('tweets_cleaned.csv')
df_clean.to_csv(output_path, index=False)

print(f"\n{'='*60}")
print(f"✓ Cleaned dataset saved to: {output_path}")
print(f"  - Total tweets: {len(df_clean)}")
print(f"  - Columns: {list(df_clean.columns)}")
print(f"{'='*60}\n")

✓ Directories ensured:
  - ./data/processed
  - ./outputs/figures
  - ./outputs/tables
  - ./outputs/models
  - ./outputs/metrics

✓ Cleaned dataset saved to: /home/emmanuelabayor/projects/analisis-sentiment-pelatih-baru-chelsea-liam-rosenior/data/processed/tweets_cleaned.csv
  - Total tweets: 329
  - Columns: ['Tweet Link', 'Author Handle', 'Tweet Content', 'Views', 'Likes', 'Retweets', 'Replies', 'Tweet Creation Date', 'Scraped Date', 'engagement_score', 'tweet_length', 'word_count', 'hashtag_count', 'mention_count']



## 7. Optional: Apply Text Cleaning (for ML Modeling)

In [11]:
# Apply text cleaning (remove URLs, mentions, special chars, etc.)
print("Applying text cleaning...")
df_clean['Tweet Content Cleaned'] = df_clean['Tweet Content'].apply(preprocessing.clean_text)

print("\nSample tweets before and after cleaning:")
for i in range(min(5, len(df_clean))):  # guard against frames with fewer than 5 rows
    print(f"\n{i+1}. BEFORE: {df_clean['Tweet Content'].iloc[i][:100]}...")
    print(f"   AFTER:  {df_clean['Tweet Content Cleaned'].iloc[i][:100]}...")

Applying text cleaning...

Sample tweets before and after cleaning:

1. BEFORE: We tried to stop it from overthinking.

We failed....
   AFTER:  we tried to stop it from overthinking we failed...

2. BEFORE: I predicted this
They need some one they can control without him battling and eye.
They need a 'yes ...
   AFTER:  i predicted this they need some one they can control without him battling and eye they need a yes si...

3. BEFORE: Well, hope this new coach meets the expectation of the board and fans. Big coaches tend to turn down...
   AFTER:  well hope this new coach meets the expectation of the board and fans big coaches tend to turn down t...

4. BEFORE: Chelsea always sign coaches for 6 and half years before they'll fire them a year later after disastr...
   AFTER:  chelsea always sign coaches for 6 and half years before theyll fire them a year later after disastro...

5. BEFORE: I give him till the next international break, march max...
   AFTER:  i give him till the next inte
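
The before/after pairs suggest that cleaning lowercases the text, drops punctuation (so "they'll" becomes "theyll") and collapses line breaks into single spaces, while the cell comment also mentions URLs and mentions. A minimal sketch consistent with those observations follows; `clean_text_sketch` is a hypothetical name and the exact rules in `preprocessing.clean_text` may differ.

```python
import re


def clean_text_sketch(text: str) -> str:
    """Approximate the cleaning seen in the samples; not the project's function."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # strip URLs
    text = re.sub(r"@\w+", " ", text)                    # strip @mentions
    # ASSUMPTION: everything except ASCII letters, digits and whitespace is dropped,
    # which also removes the '#' from hashtags but keeps the tag word.
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()             # collapse whitespace and newlines


print(clean_text_sketch("We tried to stop it from overthinking.\n\nWe failed."))
```

Because the character filter is ASCII-only, accented letters and emoji would be removed as well; that is acceptable for an English-only dataset but worth revisiting if other languages are kept.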

## 8. Save Dataset with Cleaned Text

In [12]:
# Save dataset with cleaned text
output_path = utils.get_processed_data_path('tweets_cleaned_with_text.csv')
df_clean.to_csv(output_path, index=False)

print(f"\n{'='*60}")
print(f"✓ Dataset with cleaned text saved to: {output_path}")
print(f"  - Total tweets: {len(df_clean)}")
print(f"  - Columns: {list(df_clean.columns)}")
print(f"{'='*60}\n")


✓ Dataset with cleaned text saved to: /home/emmanuelabayor/projects/analisis-sentiment-pelatih-baru-chelsea-liam-rosenior/data/processed/tweets_cleaned_with_text.csv
  - Total tweets: 329
  - Columns: ['Tweet Link', 'Author Handle', 'Tweet Content', 'Views', 'Likes', 'Retweets', 'Replies', 'Tweet Creation Date', 'Scraped Date', 'engagement_score', 'tweet_length', 'word_count', 'hashtag_count', 'mention_count', 'Tweet Content Cleaned']



## ✅ Data Preprocessing Complete!

**Summary:**
- Original dataset: 408 tweets
- Cleaned dataset: 329 tweets (79 removed, 19.4%)
- Removed: 15 empty tweets, 4 duplicates, 19 @grok queries, 41 non-English tweets
- Added: engagement features, text features
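
The counts logged by the pipeline in section 3 can be reconciled with a few lines of arithmetic. The numbers below are copied from that log; the dictionary keys are just labels chosen for this check.

```python
# Removal counts as printed by the preprocessing pipeline log.
removed = {"empty": 15, "duplicate": 4, "grok_query": 19, "non_english": 41}
initial = 408

total_removed = sum(removed.values())
remaining = initial - total_removed
pct_removed = total_removed / initial * 100

print(f"{total_removed} removed ({pct_removed:.1f}%), {remaining} remaining")
```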

**Next Steps:**

→ **`2_exploratory_analysis.ipynb`** - Perform EDA and visualizations

**Saved Files:**
- `data/processed/tweets_cleaned.csv`
- `data/processed/tweets_cleaned_with_text.csv`