# Web Articles Data Cleaning

This notebook cleans, preprocesses, and engineers features for a dataset of web articles, preparing it for downstream analysis such as topic modeling or text classification.

## Objectives
- Remove duplicate articles
- Handle missing values
- Clean text data (content and titles)
- Calculate similarity scores between original and cleaned text
- Save the cleaned data to a CSV file

## Workflow

The notebook is structured as follows:

1.  **Data Loading and Initial Exploration**: Load the dataset and examine its structure and basic statistics.
2.  **Data Cleaning**:
    *   Handle missing values in the dataset.
    *   Remove duplicate entries based on article ID, title, content, and `h1` heading.
3.  **Text Cleaning and Summarization**:
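    *   Clean the article content by normalizing whitespace and summarizing each article to four sentences with an LSA summarizer.
    *   Clean the article titles by removing parenthetical text and anything that follows a separator (a pipe, a dash, or a colon).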
4.  **Similarity Analysis**:
    *   Calculate the cosine similarity between the original and cleaned content and titles using Sentence Transformers.
5.  **Final Data Preparation and Export**:
    *   Select relevant columns and rename them for clarity.
    *   Save the cleaned and preprocessed data to a new CSV file.

## Import Libraries and Load Model

In [1]:
import pandas as pd
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
import re
from sentence_transformers import SentenceTransformer, util


# ===================== load the sentence-embedding model (downloaded on first use) =====================
model = SentenceTransformer('all-MiniLM-L6-v2')

## Simple Data Exploration and Cleaning

### Load Data and Explore It

In [2]:
df = pd.read_csv(r'../data/raw/data1.csv')

In [3]:
df.info()
df.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1477 entries, 0 to 1476
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           1477 non-null   int64  
 1   source       1477 non-null   object 
 2   url          1477 non-null   object 
 3   title        1477 non-null   object 
 4   fetched_at   1477 non-null   object 
 5   t_total_sec  1477 non-null   float64
 6   content      1475 non-null   object 
 7   h1           1475 non-null   object 
 8   h2           1095 non-null   object 
dtypes: float64(1), int64(1), object(7)
memory usage: 104.0+ KB


(1477, 9)

In [4]:
df.columns

Index(['id', 'source', 'url', 'title', 'fetched_at', 't_total_sec', 'content',
       'h1', 'h2'],
      dtype='object')

In [5]:
df['id'] = pd.to_numeric(df['id'], errors='coerce')

In [6]:
df.drop_duplicates(subset=['id'], keep='first', inplace=True)

In [7]:
df['id'].value_counts()

id
1136    1
1135    1
1134    1
1133    1
1132    1
       ..
5       1
4       1
3       1
2       1
1       1
Name: count, Length: 1152, dtype: int64

In [8]:

df = df.dropna(subset=['id'])
df['id'] = df['id'].astype(int)

In [9]:
df.head(100)

| | id | source | url | title | fetched_at | t_total_sec | content | h1 | h2 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | cnn | https://edition.cnn.com/2025/09/09/europe/pola... | NATO shoots down Russian drones in Polish airs... | 2025-09-10T15:13:31.532053Z | 1.136 | NATO fighter jets shot down multiple Russian d... | NATO shoots down Russian drones in Polish airs... | ‘Unprecedented violation’ Allies react and a s... |
| 1 | 2 | cnn | https://edition.cnn.com/2025/09/10/europe/fran... | France hit by protests and disruption as new p... | 2025-09-10T15:13:32.937497Z | 0.903 | Nationwide unrest broke out across France on W... | France hit by protests and disruption as new p... | ‘No one listens to us’ |
| 2 | 3 | cnn | https://edition.cnn.com/2025/09/10/middleeast/... | Israel strikes Hamas leadership in Qatar in un... | 2025-09-10T15:13:34.092127Z | 0.652 | Israel launched a series of strikes targeting ... | Israel strikes Hamas leadership in Qatar in un... | What happened? Why was the attack controversia... |
| 3 | 4 | cnn | https://edition.cnn.com/2025/09/09/europe/russ... | Russian aerial bomb kills at least 25 civilian... | 2025-09-10T15:13:35.474307Z | 0.880 | A Russian aerial bomb attack on Tuesday killed... | Russian aerial bomb kills at least 25 civilian... | |
| 4 | 5 | cnn | https://edition.cnn.com/2025/09/10/americas/am... | She was deported nine times. Then, like others... | 2025-09-10T15:13:36.747389Z | 0.771 | Esther Morales lived in the United States for ... | She was deported nine times. Then, like others... | ‘On this side there are also dreams’ Adapting ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 96 | bbc | https://www.bbc.com/news/videos/c1edl9qppelo?a... | More than a dozen robbers storm jewellery shop... | 2025-09-10T15:16:21.728079Z | 0.450 | An 88-year-old jewellery store owner in San Jo... | More than a dozen robbers storm jewellery shop... | Watch: BBC interviews man who helped rescue ch... |
| 96 | 97 | bbc | https://www.bbc.com/news/videos/crl5ngr2069o?a... | VMAs: Ariana Grande, Lady Gaga and Sabrina Car... | 2025-09-10T15:16:23.026879Z | 0.798 | The MTV Video Music Awards (VMAs) have taken p... | Watch: Glitz, glamour and emotional speeches a... | 'He doesn't follow trends': How celebrities de... |
| 97 | 98 | nytimes | https://www.nytimes.com/2025/09/10/world/europ... | NATO Says It Scrambled Fighter Jets to Shoot D... | 2025-09-10T15:16:26.723700Z | 2.013 | More than a dozen Russian drones entered Polan... | NATO Says It Scrambled Fighter Jets to Shoot D... | Related Content Site Index Site Information Na... |
| 98 | 99 | nytimes | https://www.nytimes.com/2025/09/10/world/middl... | Israel’s Attack on Qatari Soil Leads Gulf Powe... | 2025-09-10T15:16:30.468179Z | 3.242 | Qatar hosts the largest American military base... | Gulf Powers Question U.S. Protection After Isr... | Related Content Site Index Site Information Na... |


In [10]:
df.tail(100)

| | id | source | url | title | fetched_at | t_total_sec | content | h1 | h2 |
|---|---|---|---|---|---|---|---|---|---|
| 1377 | 1053 | meduza | https://meduza.io/en/feature/2025/03/19/new-me... | New Meduza merch hits the shelves — Meduza | 2025-09-11T10:50:21.118934Z | 0.264 | In late 2024, to mark Meduza’s 10th anniversar... | New Meduza merch hits the shelves | Site Index Newsletters |
| 1378 | 1054 | meduza | https://meduza.io/en/feature/2024/11/26/invest... | Investigations, long reads, and open-data anal... | 2025-09-11T10:50:21.918743Z | 0.299 | Meduza has published tens of thousands of feat... | Investigations, long reads, and open-data anal... | Site Index Newsletters |
| 1379 | 1055 | meduza | https://meduza.io/en/pages/codex | Meduza’s code of conduct — Meduza | 2025-09-11T10:50:23.347228Z | 0.927 | Meduza is an international Russian- and Englis... | Meduza’s code of conduct | Site Index Newsletters |
| 1380 | 1056 | meduza | https://meduza.io/en/feature/2018/05/25/how-we... | How we process your personal data stored on Me... | 2025-09-11T10:50:24.627672Z | 0.779 | In some cases, “Meduza” (more precisely, the L... | How we process your personal data stored on Me... | Site Index Newsletters |
| 1381 | 1057 | meduza | https://meduza.io/en/feature/2018/05/25/how-me... | How Meduza uses cookies Keeping it as simple a... | 2025-09-11T10:50:25.810291Z | 0.681 | Just like most websites, Meduza (or Medusa Pro... | How Meduza uses cookies Keeping it as simple a... | Site Index Newsletters |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1472 | 1148 | echo | https://echo.msk.ru/news/exchange-cny-11-senty... | Курс юаня к рублю на сегодня 11 сентября 2025 ... | 2025-09-11T10:53:31.215272Z | 1.030 | Сегодня Центральный банк России установил новы... | Курс юаня к рублю на сегодня 11 сентября 2025 ... | Основные причины изменения курса юаня Практиче... |
| 1473 | 1149 | echo | https://echo.msk.ru/news/exchange-kzt-11-senty... | Курс тенге к рублю на сегодня 11 сентября 2025... | 2025-09-11T10:53:32.701359Z | 0.985 | Сегодня Центральный банк России установил новы... | Курс тенге к рублю на сегодня 11 сентября 2025... | Анализ динамики курса тенге Советы для путешес... |
| 1474 | 1150 | echo | https://echo.msk.ru/news/peresechenie-sploshno... | Пересечение сплошной линии: 5 ситуаций, когда ... | 2025-09-11T10:53:34.190308Z | 0.988 | Для большинства автомобилистов сплошная размет... | Пересечение сплошной линии: 5 ситуаций, когда ... | |
| 1475 | 1151 | echo | https://echo.msk.ru/news/uchenye-raskryli-sekr... | Ученые раскрыли секрет: какое молоко идеально ... | 2025-09-11T10:53:35.649328Z | 0.952 | Неожиданные выводы исследователей: какое молок... | Ученые раскрыли секрет: какое молоко идеально ... | |


In [11]:
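# Drop missing values before casting: astype(str) can turn NaN into the
# literal string 'nan', which dropna would no longer detect.
df = df.dropna(subset=['content', 'title'])

df[['content', 'title']] = df[['content', 'title']].astype(str)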

In [12]:
duplicated_values = df[df.title.duplicated(keep=False)].title
print(duplicated_values)

Series([], Name: title, dtype: object)


In [13]:
df.drop_duplicates(subset=['title'], keep='first', inplace=True)

In [14]:
duplicated_values = df[df.content.duplicated(keep=False)].content
duplicated_values

103     Israel attempted to kill senior members of Ham...
124     President Trump announced on Tuesday that the ...
260     Connections is the one of the most popular New...
261     Oh hey there! If you're here, it must be time ...
263     If you're reading this, you're looking for a l...
298     If you like playing daily word games like Word...
315     Save now on essential digital access to qualit...
682     Israel attempted to kill senior members of Ham...
699     President Trump announced on Tuesday that the ...
773     If you like playing daily word games like Word...
775     Connections is the one of the most popular New...
776     Oh hey there! If you're here, it must be time ...
778     If you're reading this, you're looking for a l...
924     Then $75 per month. Complete digital access to...
926     Save now on essential digital access to qualit...
927     Then $75 per month. Complete digital access to...
928     Save now on essential digital access to qualit...
929     Save n

In [15]:
df.drop_duplicates(subset=['content'], keep='first', inplace=True)

In [16]:
duplicated_values = df[df.h1.duplicated(keep=False)].h1
print(duplicated_values)

98     Gulf Powers Question U.S. Protection After Isr...
680    Gulf Powers Question U.S. Protection After Isr...
Name: h1, dtype: object


In [17]:
df.drop_duplicates(subset=['h1'], keep='first', inplace=True)

In [18]:
df = df.drop(columns=["h2"])

In [19]:
df.dropna(inplace=True)
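
The cleaning steps used in this section (coerce `id` to numeric, drop rows with missing values, keep the first row of each duplicate group) can be tried on a small frame before running them on the real data. The following is a minimal sketch; the `toy` DataFrame and its values are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Hypothetical toy data: one duplicated id, one non-numeric id, one missing content.
toy = pd.DataFrame({
    "id": ["1", "2", "2", "x"],
    "title": ["A", "B", "B", "C"],
    "content": ["alpha", "beta", "beta", None],
})

# Non-numeric ids become NaN instead of raising an error.
toy["id"] = pd.to_numeric(toy["id"], errors="coerce")

# Drop rows where the id or the content is missing.
toy = toy.dropna(subset=["id", "content"])

# Keep only the first row for each id.
toy = toy.drop_duplicates(subset=["id"], keep="first")

# With no NaN left, the id column can be cast to int safely.
toy["id"] = toy["id"].astype(int)

print(toy)
```

Only the rows with ids 1 and 2 should remain, which mirrors what the notebook cells do to `df` column by column.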

## Advanced Content and Title Cleaning

### Clean Content

In [20]:
def clean_and_summarize(text, sentence_count=4):
    """
    Cleans text by removing newlines and other artifacts, then summarizes it.

    Args:
        text (str): The raw text content to be processed.
        sentence_count (int): The desired number of sentences in the summary.

    Returns:
        str: A cleaned and summarized version of the text.
    """
    if not isinstance(text, str):
        # Handle potential non-string values in the column
        return ""

    # 1. Clean the text
    cleaned_text = text.replace('\n', ' ')
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()

    # If the text is too short, return it as is to avoid errors
    if len(cleaned_text.split()) < 50:
        return cleaned_text

    # 2. Summarize the cleaned text
    try:
        parser = PlaintextParser.from_string(cleaned_text, Tokenizer("english"))
        summarizer = LsaSummarizer()
        summary_sentences = summarizer(parser.document, sentence_count)
        summary = ' '.join([str(sentence) for sentence in summary_sentences])
        return summary
    except Exception as e:
        # In case of any summarization error, return the cleaned text
        print(f"Summarization error: {e}")
        return cleaned_text
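
The whitespace cleanup and the short-text guard inside `clean_and_summarize` can be exercised on their own, without `sumy`. The sketch below uses two hypothetical helpers, `normalize_whitespace` and `is_long_enough`, that are not defined in the notebook. Note that `\s+` already matches newlines, so the separate `replace('\n', ' ')` call in the function above is redundant but harmless.

```python
import re


def normalize_whitespace(text):
    """Collapse newlines, tabs, and repeated spaces into single spaces."""
    if not isinstance(text, str):
        return ""
    return re.sub(r"\s+", " ", text).strip()


def is_long_enough(text, min_words=50):
    """Mirror the notebook's guard: only texts with at least min_words words are summarized."""
    return len(text.split()) >= min_words


raw = "NATO fighter jets\n\nshot down   multiple\tdrones. "
cleaned = normalize_whitespace(raw)

print(repr(cleaned))
print(is_long_enough(cleaned))
```

Texts that fail the length check are returned unsummarized, which is why some rows end up with identical original and cleaned content.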

In [21]:
df["cleaned_content"] = df["content"].apply(lambda x: clean_and_summarize(x, sentence_count=4))

In [22]:
df.dropna(subset=['cleaned_content'], inplace=True)

In [23]:
df["cleaned_content"]

0       Poland’s interior ministry said seven drones a...
1       Appointed by President Emmanuel Macron on Tues...
2       The attack could have a chilling effect in the...
3       Ukraine’s President Volodymyr Zelensky describ...
4       “There are opportunities everywhere,” he said,...
                              ...                        
1472    Сегодня Центральный банк России установил новы...
1473    Сегодня Центральный банк России установил новы...
1474    Однако настоящая дорога не всегда черно-белая,...
1475    Неожиданные выводы исследователей: какое молок...
1476    Сегодня Центральным банком России установлен н...
Name: cleaned_content, Length: 1119, dtype: object

### Clean Titles

In [24]:
def clean_title(title):
    """
    Cleans a single title string by removing specific patterns.

    Args:
        title (str): The raw title string to be cleaned.

    Returns:
        str: The cleaned title.
    """
    if not isinstance(title, str):
        return ""

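    # Remove parenthetical text (greedy: from the first '(' to the last ')')
    title = re.sub(r'\s*\(.*\)', '', title)

    # Drop everything from the first separator (pipe, em dash, en dash, or colon) onward
    title = re.sub(r'\s*(?:\||—|–|:).*', '', title)

    title = title.strip()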

    return title
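
To see what the two regular expressions in `clean_title` do in practice, the sketch below applies the same patterns to a few sample titles. The function name `strip_title_suffix` is hypothetical, and the third sample title is invented; the first two are shortened from rows shown earlier.

```python
import re


def strip_title_suffix(title):
    """Apply the same two substitutions as clean_title (illustrative copy)."""
    if not isinstance(title, str):
        return ""
    title = re.sub(r"\s*\(.*\)", "", title)          # parenthetical text
    title = re.sub(r"\s*(?:\||—|–|:).*", "", title)  # first separator onward
    return title.strip()


samples = [
    "New Meduza merch hits the shelves — Meduza",
    "VMAs: Ariana Grande, Lady Gaga",
    "Report (updated) on drones",  # invented example
]
results = [strip_title_suffix(s) for s in samples]

for before, after in zip(samples, results):
    print(f"{before!r} -> {after!r}")
```

The second sample shows a side effect worth keeping in mind: because a colon counts as a separator, a title such as "VMAs: Ariana Grande, Lady Gaga" is reduced to "VMAs", which discards most of its meaning.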

In [25]:
df.dropna(subset=['title'], inplace=True)

In [26]:
df['cleaned_title'] = df['title'].apply(clean_title)
df['cleaned_title']

0       NATO shoots down Russian drones in Polish airs...
1       France hit by protests and disruption as new p...
2       Israel strikes Hamas leadership in Qatar in un...
3       Russian aerial bomb kills at least 25 civilian...
4       She was deported nine times. Then, like others...
                              ...                        
1472    Курс юаня к рублю на сегодня 11 сентября 2025 ...
1473    Курс тенге к рублю на сегодня 11 сентября 2025...
1474                           Пересечение сплошной линии
1475                               Ученые раскрыли секрет
1476    Курс евро к рублю на сегодня 11 сентября 2025 ...
Name: cleaned_title, Length: 1119, dtype: object

## Compare Original and Cleaned Text: Length and Semantic Similarity

### Length Comparison

In [27]:
df_comparison = df[['content', 'cleaned_content']].copy()
df_comparison['original_word_count'] = df_comparison['content'].apply(lambda x: len(x.split()))
df_comparison['cleaned_word_count'] = df_comparison['cleaned_content'].apply(lambda x: len(x.split()))


print("Comparison of Original vs. Cleaned Content:")
print("\n" + "="*50 + "\n")
print("Word Count Comparison:")
print(df_comparison[['original_word_count', 'cleaned_word_count']].head())

Comparison of Original vs. Cleaned Content:

==================================================

Word Count Comparison:
   original_word_count  cleaned_word_count
0                 1461                  79
1                  524                  90
2                 1209                 172
3                  486                  83
4                 1560                  85
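
One way to read these counts is as a relative reduction in length. The sketch below computes that ratio for the five rows printed above; the helper name `reduction` is an assumption introduced here, not something the notebook defines.

```python
def reduction(original_words, cleaned_words):
    """Fraction of words removed by cleaning and summarization."""
    if original_words == 0:
        return 0.0  # avoid division by zero for empty originals
    return 1.0 - cleaned_words / original_words


# (original_word_count, cleaned_word_count) pairs from the output above
pairs = [(1461, 79), (524, 90), (1209, 172), (486, 83), (1560, 85)]
reductions = [reduction(o, c) for o, c in pairs]

for (o, c), r in zip(pairs, reductions):
    print(f"{o:>5} -> {c:>4} words ({r:.1%} shorter)")
```

For these five articles the summaries are roughly 83% to 95% shorter than the originals.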


In [31]:
df_comparison.describe()

| | original_word_count | cleaned_word_count |
|---|---|---|
| count | 1119.0 | 1119.0 |
| mean | 657.320822 | 97.319929 |
| std | 679.124522 | 37.042516 |
| min | 10.0 | 0.0 |
| 25% | 247.0 | 77.0 |
| 50% | 438.0 | 91.0 |
| 75% | 862.0 | 110.0 |
| max | 6630.0 | 421.0 |


### Semantic Similarity Scores

In [28]:

df['similarity_score_between_content_&_cleanedContent'] = df.apply(
    lambda row: util.cos_sim(
        model.encode(row['content'], convert_to_tensor=True),
        model.encode(row['cleaned_content'], convert_to_tensor=True)
    ).item(),
    axis=1
)

df['similarity_score_between_Title_&_Title'] = df.apply(
    lambda row: util.cos_sim(
        model.encode(row['title'], convert_to_tensor=True),
        model.encode(row['cleaned_title'], convert_to_tensor=True)
    ).item(),
    axis=1
)
df[['similarity_score_between_content_&_cleanedContent', 'similarity_score_between_Title_&_Title']]

| | similarity_score_between_content_&_cleanedContent | similarity_score_between_Title_&_Title |
|---|---|---|
| 0 | 0.656067 | 0.977465 |
| 1 | 0.808313 | 0.924281 |
| 2 | 0.751935 | 0.969533 |
| 3 | 0.762566 | 0.989257 |
| 4 | 0.251561 | 0.968969 |
| ... | ... | ... |
| 1472 | 0.871424 | 1.000000 |
| 1473 | 0.885209 | 1.000000 |
| 1474 | 0.850105 | 0.693393 |
| 1475 | 1.000000 | 0.663835 |
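
Each score above is the cosine similarity between two embedding vectors. As a reminder of what that measure computes, here is a minimal pure-Python sketch. The function `cosine_similarity` and the short example vectors are illustrative assumptions; the notebook itself relies on `util.cos_sim` applied to the model's embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])

print(same_direction, orthogonal)
```

A score close to 1 means the two texts point in nearly the same direction in embedding space, so a title that the cleaning step left unchanged scores 1.0.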


In [29]:
df[['similarity_score_between_content_&_cleanedContent', 'similarity_score_between_Title_&_Title']].describe()

| | similarity_score_between_content_&_cleanedContent | similarity_score_between_Title_&_Title |
|---|---|---|
| count | 1119.0 | 1119.0 |
| mean | 0.763858 | 0.926825 |
| std | 0.176433 | 0.143987 |
| min | 0.014341 | 0.106713 |
| 25% | 0.673685 | 0.927115 |
| 50% | 0.799851 | 1.0 |
| 75% | 0.88529 | 1.0 |
| max | 1.0 | 1.0 |


## Save the DataFrame

In [30]:
dfc = df[['id', 'source', 'url', 'cleaned_title', 'fetched_at',
       't_total_sec', 'cleaned_content']]

# rename columns
dfc = dfc.rename(columns={
    'cleaned_title': 'title',
    'cleaned_content': 'content'
})

dfc.to_csv(r'../data/cleaned_data/clean_data.csv', index=False)
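
The select-and-rename step can be checked without writing to disk by round-tripping through an in-memory buffer. This is a minimal sketch with an invented single-row frame and a shortened column list, not the notebook's actual data.

```python
import io

import pandas as pd

# Hypothetical one-row frame with both raw and cleaned columns.
toy = pd.DataFrame({
    "id": [1],
    "source": ["cnn"],
    "title": ["Raw title | CNN"],
    "cleaned_title": ["Raw title"],
    "content": ["raw   text"],
    "cleaned_content": ["raw text"],
})

# Keep only the cleaned columns, then give them the plain names.
export = toy[["id", "source", "cleaned_title", "cleaned_content"]].rename(
    columns={"cleaned_title": "title", "cleaned_content": "content"}
)

# Write to an in-memory buffer instead of a file and read it back.
buffer = io.StringIO()
export.to_csv(buffer, index=False)
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))

print(round_trip)
```

Selecting the cleaned columns before renaming avoids a name clash with the original `title` and `content` columns, which is the same order of operations the cell above follows.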