# Data Cleaning - News and Sentiment Analysis

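This notebook cleans a dataset of news articles with sentiment scores: it selects and renames the relevant columns, drops rows with missing values, normalizes dates, removes duplicate records, and saves the result to CSV.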

## 1. Import Required Packages

In [90]:
import pandas as pd
import os
from datetime import datetime



## 2. Load CSV Data

**Note:** Update the file path below to match your data location.

In [91]:
file_path = "../data/processed/articles_with_sentiment_score.csv"
df = pd.read_csv(file_path)
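
Because the later cells index the DataFrame by fixed column names, it can help to fail early when a column is absent. The following is a minimal sketch, not part of the original notebook: `load_articles` and `REQUIRED_COLUMNS` are hypothetical names, and the in-memory sample rows are made up to stand in for the real file.

```python
import io

import pandas as pd

# Assumption: these are the columns the rest of the notebook relies on.
REQUIRED_COLUMNS = ["publishedAt", "title", "full_text", "sentiment"]


def load_articles(source, required=REQUIRED_COLUMNS):
    """Read a CSV and raise a clear error if a required column is missing."""
    df = pd.read_csv(source)
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    return df


# Made-up sample data; replace `sample_csv` with the real file path in practice.
sample_csv = io.StringIO(
    "publishedAt,title,full_text,sentiment,source\n"
    "2025-05-08T10:00:00Z,Example title,Example text,0.5,example\n"
)
sample_df = load_articles(sample_csv)
print(sample_df.shape)
```

Passing `file_path` instead of `sample_csv` would apply the same check to the actual dataset.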

## 3. Select Required Columns and Rename

In [92]:
columns_needed = ["publishedAt", "title", "full_text", "sentiment"]
df_cleaned = df[columns_needed].copy()

df_cleaned = df_cleaned.rename(columns={
    "publishedAt": "Date",
    "title": "Title",
    "full_text": "Content",
    "sentiment": "Sentiment Score"
})

print("Columns selected and renamed:")
print(df_cleaned.columns.tolist())

Columns selected and renamed:
['Date', 'Title', 'Content', 'Sentiment Score']


## 4. Check and Handle Missing Values

In [93]:
print("Missing values count by column:")
missing_counts = df_cleaned.isnull().sum()
print(missing_counts)

print("\nMissing values percentage:")
missing_percentage = (df_cleaned.isnull().sum() / len(df_cleaned)) * 100
print(missing_percentage.round(2))

Missing values count by column:
Date               0
Title              0
Content            0
Sentiment Score    0
dtype: int64

Missing values percentage:
Date               0.0
Title              0.0
Content            0.0
Sentiment Score    0.0
dtype: float64


In [94]:
rows_before = len(df_cleaned)
df_cleaned.dropna(inplace=True)
rows_after = len(df_cleaned)

print(f"Rows before cleaning: {rows_before}")
print(f"Rows after cleaning: {rows_after}")
print(f"Rows removed: {rows_before - rows_after}")

Rows before cleaning: 86
Rows after cleaning: 86
Rows removed: 0
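
Note that `isnull()` and `dropna()` only detect true missing values (`NaN`/`None`); an empty or whitespace-only string passes through unnoticed. The sketch below shows one possible way to treat such strings as missing as well. The helper `drop_blank_rows` and the sample rows are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def drop_blank_rows(frame, text_columns):
    """Drop rows where any listed column is NaN, empty, or whitespace-only."""
    blank = pd.Series(False, index=frame.index)
    for col in text_columns:
        as_text = frame[col].astype(str).str.strip()
        # isna() catches real missing values; the comparison catches blank strings.
        blank |= frame[col].isna() | (as_text == "")
    return frame[~blank].copy()


# Made-up sample: only the first row has usable text in both columns.
sample = pd.DataFrame({
    "Title": ["A headline", "   ", None, "Another headline"],
    "Content": ["Body text", "Body text", "Body text", ""],
})
cleaned = drop_blank_rows(sample, ["Title", "Content"])
print(len(sample), "->", len(cleaned))
```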


## 5. Process and Format Dates

In [95]:
df_cleaned["Date"] = pd.to_datetime(df_cleaned["Date"]).dt.date
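
The conversion above raises an exception if any value cannot be parsed. A more defensive variant, sketched here under the assumption that `publishedAt` holds ISO 8601 timestamp strings (the sample values are made up), uses `errors="coerce"` to turn unparseable values into `NaT` so they can be counted and handled explicitly, and `utc=True` to normalize time zones before the time component is dropped.

```python
import pandas as pd

# Made-up sample values; the last one is deliberately invalid.
raw = pd.Series([
    "2025-05-08T14:30:00Z",
    "2025-05-01T23:45:00Z",
    "not a date",
])

# Invalid entries become NaT instead of raising an exception.
parsed = pd.to_datetime(raw, utc=True, errors="coerce")
print("Unparseable values:", int(parsed.isna().sum()))

# Keep only the calendar date for the rows that parsed successfully.
dates = parsed.dropna().dt.date
print(dates.tolist())
```

Whether to drop or inspect the `NaT` rows depends on the dataset; the count makes that decision visible rather than silent.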

## 6. Remove Duplicate Records Based on Date and Title

In [96]:
print("Checking for duplicate records based on Date and Title...")
print(f"Total rows before duplicate removal: {len(df_cleaned)}")

# Find the duplicated rows
duplicates = df_cleaned[df_cleaned.duplicated(subset=['Date', 'Title'], keep=False)]
duplicate_count = len(duplicates)

print(f"Number of duplicate rows found: {duplicate_count}")
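# Count distinct (Date, Title) combinations; halving the row count is wrong when a record appears more than twice
print(f"Unique duplicate pairs: {len(duplicates.drop_duplicates(subset=['Date', 'Title']))}")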

if duplicate_count > 0:
    print("\nDuplicate records (showing first 10):")
    print("=" * 60)
    display(duplicates[['Date', 'Title', 'Sentiment Score']].head(10))

    # Show the dates with the most duplicated rows
    duplicate_dates = duplicates['Date'].value_counts().head(5)
    print("\nTop 5 dates with most duplicates:")
    print(duplicate_dates)

    # Actually remove the duplicated rows
    df_cleaned = df_cleaned.drop_duplicates(subset=['Date', 'Title'])
    print(f"\nDuplicates removed. Total rows after removal: {len(df_cleaned)}")
else:
    print("No duplicates found.")

Checking for duplicate records based on Date and Title...
Total rows before duplicate removal: 86
Number of duplicate rows found: 0
Unique duplicate pairs: 0
No duplicates found.


In [97]:
# Remove duplicates - keep the first occurrence
rows_before_dedup = len(df_cleaned)
df_cleaned = df_cleaned.drop_duplicates(subset=['Date', 'Title'], keep='first')
rows_after_dedup = len(df_cleaned)
duplicates_removed = rows_before_dedup - rows_after_dedup

print(f"Rows before duplicate removal: {rows_before_dedup}")
print(f"Rows after duplicate removal: {rows_after_dedup}")
print(f"Duplicate rows removed: {duplicates_removed}")
print(f"Duplicate removal rate: {(duplicates_removed / rows_before_dedup * 100 if rows_before_dedup else 0.0):.1f}%")

# Reset index after removing duplicates
df_cleaned = df_cleaned.reset_index(drop=True)
print("\nDataFrame index has been reset.")
print(f"Final dataset size: {len(df_cleaned)} rows")

# Verify no duplicates remain
remaining_duplicates = df_cleaned.duplicated(subset=['Date', 'Title']).sum()
print(f"Verification: {remaining_duplicates} duplicates remaining (should be 0)")

Rows before duplicate removal: 86
Rows after duplicate removal: 86
Duplicate rows removed: 0
Duplicate removal rate: 0.0%

DataFrame index has been reset.
Final dataset size: 86 rows
Verification: 0 duplicates remaining (should be 0)
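
When duplicates do exist, it can be useful to see how many distinct `(Date, Title)` combinations are repeated and how many rows `drop_duplicates` would remove. The sketch below is one way to do this with `groupby`; the helper name `duplicate_groups` and the sample rows are hypothetical.

```python
import pandas as pd


def duplicate_groups(frame, keys):
    """Return the size of every key combination that occurs more than once."""
    sizes = frame.groupby(keys, dropna=False).size()
    return sizes[sizes > 1]


# Made-up sample: "A" appears four times, "B" twice, "C" once.
sample = pd.DataFrame({
    "Date": ["2025-05-01"] * 4 + ["2025-05-02"] * 2 + ["2025-05-03"],
    "Title": ["A"] * 4 + ["B"] * 2 + ["C"],
})
groups = duplicate_groups(sample, ["Date", "Title"])
print("Duplicate groups:", len(groups))
# Each group keeps one row, so (size - 1) rows per group would be removed.
print("Rows that drop_duplicates would remove:", int((groups - 1).sum()))
```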


## 7. Save Cleaned Data

In [98]:
output_dir = "../data/cleaned/"
os.makedirs(output_dir, exist_ok=True)

# Save cleaned data to CSV
cleaned_file_path = os.path.join(output_dir, "cleaned_news_data.csv")
df_cleaned.to_csv(cleaned_file_path, index=False)

print(f"Cleaned data saved to: {cleaned_file_path}")

Cleaned data saved to: ../data/cleaned/cleaned_news_data.csv
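
As an optional sanity check, the saved file can be read back and compared with the DataFrame that was written. This is a minimal sketch using a temporary directory and made-up sample rows rather than the notebook's actual output path.

```python
import os
import tempfile

import pandas as pd

# Made-up sample standing in for df_cleaned.
sample = pd.DataFrame({
    "Date": ["2025-05-08", "2025-05-01"],
    "Title": ["First headline", "Second headline"],
    "Sentiment Score": [0.765, 0.9761],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_news_data.csv")
    sample.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Compare the structure of what was written with what was read back.
same_shape = reloaded.shape == sample.shape
same_columns = list(reloaded.columns) == list(sample.columns)
print("Round trip OK:", same_shape and same_columns)
```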


## 8. Display Results and Summary

In [99]:
print("First 5 rows of cleaned data:")
print("=" * 50)
display(df_cleaned.head())

First 5 rows of cleaned data:


| | Date | Title | Content | Sentiment Score |
|---|---|---|---|---|
| 0 | 2025-05-08 | Apple has a new ‘Viral’ playlist on Apple Musi... | Apple is launching a new global Viral Chart pl... | 0.765 |
| 1 | 2025-05-01 | Spotify already has an app ready to test Apple... | Spotify says it has submitted an update to its... | 0.9761 |
| 2 | 2025-05-13 | How to Use Apple Maps on the Web | The boundaries of Apple’s walled garden aren’t... | 0.9715 |
| 3 | 2025-05-06 | Trump’s Tariffs Are Threatening America’s Appl... | Few foods are more American than apple pie, bu... | 0.9901 |
| 4 | 2025-05-02 | How Apple lost control of the App Store | “Cook chose poorly” is one of those phrases yo... | 0.9613 |
