# DataCleaner: A Python Class for Cleaning CSV Data  

## Overview  
The `DataCleaner` class provides a structured way to clean and process data stored in CSV files. It handles missing values, standardizes text, extracts and removes emojis and YouTube links, and formats the data for compatibility with PostgreSQL.

## Features  
- **Loading CSV files**  
- **Handling missing values**  
- **Extracting and removing emojis**  
- **Extracting and removing YouTube links**  
- **Standardizing text formatting**  
- **Removing duplicates**  
- **Ensuring PostgreSQL compatibility**  
- **Saving cleaned data back to CSV**  
- **Logging all actions for traceability**  

## Code Breakdown  


In [1]:
import os
import sys

In [2]:
sys.path.append(os.path.abspath('..'))

In [3]:
from scripts.cleaning import DataCleaner

In [4]:
data_path = '/home/nahomnadew/Desktop/10x/week7/Kara_Solutions/data/telegram_data2.csv'
cleaned_data_path = '/home/nahomnadew/Desktop/10x/week7/Kara_Solutions/data/cleaned_telegram_data.csv'



### 1. **Class Initialization**  
- Sets up logging to store messages in a file (`../logs/data_cleaning.log`) and display them in the console.
- Ensures that the `logs` directory exists.
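
The setup described above could look roughly like the following. This is a minimal sketch, not the class's actual constructor: the helper name `build_logger` and the logger name `data_cleaner` are hypothetical, and the format string is inferred from the log lines shown in this notebook.

```python
import logging
import os


def build_logger(log_path="../logs/data_cleaning.log"):
    """Return a logger that writes to a file and to the console (hypothetical helper)."""
    # Create the logs directory if it does not exist yet.
    log_dir = os.path.dirname(log_path)
    if log_dir:
        os.makedirs(log_dir, exist_ok=True)

    logger = logging.getLogger("data_cleaner")  # assumed logger name
    logger.setLevel(logging.INFO)

    # Attach handlers only once so repeated calls do not duplicate messages.
    if not logger.handlers:
        formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
        handlers = (
            logging.FileHandler(log_path, encoding="utf-8"),
            logging.StreamHandler(),
        )
        for handler in handlers:
            handler.setFormatter(formatter)
            logger.addHandler(handler)
    return logger
```

Guarding on `logger.handlers` matters in a notebook, where re-running the cell would otherwise print every message twice.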

In [5]:
clean = DataCleaner()

### 2. **Loading CSV Data (`load_csv`)**  
- Reads a CSV file into a Pandas DataFrame.
- Logs success or failure.
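
A loader with this behavior might be sketched as below. The error handling shown (catching a few specific exceptions and returning `None`) is an assumption; the source only states that success or failure is logged.

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def load_csv(file_path):
    """Read a CSV file into a DataFrame, logging the outcome."""
    try:
        df = pd.read_csv(file_path)
    except (FileNotFoundError, pd.errors.EmptyDataError, pd.errors.ParserError) as exc:
        # Assumed behavior: log the failure and signal it with None.
        logger.error("Failed to load CSV file '%s': %s", file_path, exc)
        return None
    logger.info("CSV file '%s' loaded successfully.", file_path)
    return df
```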

In [6]:
df = clean.load_csv(data_path)

2025-01-31 11:14:19,549 - INFO -  CSV file '/home/nahomnadew/Desktop/10x/week7/Kara_Solutions/data/telegram_data2.csv' loaded successfully.


In [7]:
print(df.columns)
print(df.head())

Index(['Channel Title', 'Channel Username', 'ID', 'Message', 'Date',
       'Media Path'],
      dtype='object')
                                       Channel Title Channel Username    ID  \
0  ETHIO-AMERICAN MEDICAL TRAININGS( CPD ) & HEAL...           @EAHCI  2603   
1  ETHIO-AMERICAN MEDICAL TRAININGS( CPD ) & HEAL...           @EAHCI  2602   
2  ETHIO-AMERICAN MEDICAL TRAININGS( CPD ) & HEAL...           @EAHCI  2601   
3  ETHIO-AMERICAN MEDICAL TRAININGS( CPD ) & HEAL...           @EAHCI  2600   
4  ETHIO-AMERICAN MEDICAL TRAININGS( CPD ) & HEAL...           @EAHCI  2598   

                                             Message  \
0  #የግርዛት_ስልጠና_ወላይታ_ሶዶ\n#Circumcision_Skill_Train...   
1  #ENGLISH_LANGUAGE_TRAINING\n👉Grammar\n👉Vocabul...   
2  Congratulations to our beloved trainees on com...   
3  #የግርዛት_ስልጠና_Addis_Ababa \n#Circumcision_Skill_...   
4   #💥CPD_አሁን_ይመዝገቡ #የሞያ_ፈቃድ_ለማሳደስ_CPD_ይመዝገቡ\n#Ti...   

                        Date  Media Path  
0  2025-01-30 12:42:18+00:00    


### 3. **Emoji Handling**  
- `extract_emojis(text)`: Extracts emojis from text, storing them in a new column.  
- `remove_emojis(text)`: Removes emojis from text.  
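
One way to implement the two helpers is a regular expression over common emoji code-point ranges. The sketch below is an assumption about the approach, not the class's actual code: the ranges are approximate and not exhaustive, and returning the emojis as a single joined string is a choice made here for illustration.

```python
import re

# Approximate emoji ranges (assumption; not a complete list of emoji code points).
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F1E6-\U0001F1FF"  # regional indicators (flags)
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector that often follows an emoji
    "]+"
)


def extract_emojis(text):
    """Return all emojis found in text, joined into one string."""
    return "".join(EMOJI_PATTERN.findall(text))


def remove_emojis(text):
    """Return text with every matched emoji removed."""
    return EMOJI_PATTERN.sub("", text)
```

Amharic (Ethiopic) characters fall outside these ranges, so hashtags such as `#የግርዛት_ስልጠና` are left intact.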

### 4. **YouTube Link Handling**  
- `extract_youtube_links(text)`: Extracts YouTube links from text and stores them in a new column.  
- `remove_youtube_links(text)`: Removes YouTube links from the message text.  
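
A regex-based sketch of these helpers follows. The pattern is an assumption that covers `youtube.com` and `youtu.be` URLs only, and collapsing the leftover whitespace after removal is an illustrative choice rather than documented behavior.

```python
import re

# Assumed pattern: http(s) links on youtube.com or youtu.be, up to the next whitespace.
YOUTUBE_PATTERN = re.compile(
    r"https?://(?:www\.|m\.)?(?:youtube\.com|youtu\.be)/\S+",
    re.IGNORECASE,
)


def extract_youtube_links(text):
    """Return a list of YouTube URLs found in text."""
    return YOUTUBE_PATTERN.findall(text)


def remove_youtube_links(text):
    """Return text without YouTube URLs, with leftover whitespace collapsed."""
    without_links = YOUTUBE_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", without_links).strip()
```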

### 5. **Text Cleaning (`clean_text`)**  
- Replaces newline characters with spaces.
- Ensures proper text formatting.
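
As a rough illustration, newline replacement can be paired with whitespace normalization. Reading "proper text formatting" as collapsing repeated spaces and trimming the ends is an assumption made for this sketch.

```python
import re


def clean_text(text):
    """Replace newlines with spaces and tidy the remaining whitespace."""
    if not isinstance(text, str):
        # Leave missing or non-text values untouched.
        return text
    text = re.sub(r"[\r\n]+", " ", text)  # newlines become single spaces
    text = re.sub(r" {2,}", " ", text)    # collapse runs of spaces (assumed step)
    return text.strip()
```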

### 6. **Data Cleaning (`clean_dataframe`)**  
Performs multiple cleaning operations:
- **Remove duplicates** based on the `ID` column.
- **Convert date columns** to datetime format.
- **Convert IDs** to integers (PostgreSQL `BIGINT` compatibility).
- **Handle missing values** by filling with placeholders (`"No Message"`, `"No Media"`).
- **Standardize text fields** (removing unnecessary spaces).
- **Extract and remove emojis** from the `Message` column.
- **Extract and remove YouTube links** from the `Message` column.
- **Rename columns** to match PostgreSQL schema.
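
The core steps in this list can be sketched with pandas as follows. This is a simplified illustration under stated assumptions, not the class's implementation: it omits the emoji and YouTube steps, assumes every `ID` is present, and uses hypothetical snake_case target names because the actual PostgreSQL schema is not shown in this notebook.

```python
import pandas as pd


def clean_dataframe(df):
    """Apply the basic cleaning steps to a raw Telegram DataFrame (sketch)."""
    # Work on a copy so the caller's DataFrame is not modified.
    df = df.drop_duplicates(subset="ID").copy()

    # Parse dates; values that cannot be parsed become NaT.
    df["Date"] = pd.to_datetime(df["Date"], errors="coerce", utc=True)

    # 64-bit integers map onto PostgreSQL BIGINT (assumes no missing IDs).
    df["ID"] = df["ID"].astype("int64")

    # Fill missing values with the placeholders named above.
    df["Message"] = df["Message"].astype("object").fillna("No Message")
    df["Media Path"] = df["Media Path"].astype("object").fillna("No Media")

    # Strip surrounding whitespace from the text columns.
    for col in ("Channel Title", "Channel Username", "Message", "Media Path"):
        df[col] = df[col].astype(str).str.strip()

    # Hypothetical naming rule: lowercase with underscores.
    return df.rename(columns=lambda name: name.lower().replace(" ", "_"))
```

Casting to `object` before `fillna` keeps pandas from complaining when a column that contains only missing values was inferred as numeric.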


In [8]:
df2 = clean.clean_dataframe(df)

2025-01-31 11:14:19,598 - INFO -  Duplicates removed from dataset.
2025-01-31 11:14:19,637 - INFO -  Date column formatted to datetime.


  df.loc[:, 'Media Path'] = df['Media Path'].fillna("No Media")
2025-01-31 11:14:19,645 - INFO -  Missing values filled.
2025-01-31 11:14:19,685 - INFO -  Text columns standardized.
2025-01-31 11:14:19,762 - INFO -  Emojis extracted and stored in 'emoji_used' column.
2025-01-31 11:14:19,864 - INFO - YouTube links extracted and stored in 'youtube_links' column.
2025-01-31 11:14:19,871 - INFO -  Data cleaning completed successfully.


### 7. **Saving Cleaned Data (`save_cleaned_data`)**  
- Saves the cleaned DataFrame to a new CSV file.  
- Logs success or failure.

In [10]:
clean.save_cleaned_data(df2, cleaned_data_path)

2025-01-31 11:14:30,587 - INFO -  Cleaned data saved successfully to '/home/nahomnadew/Desktop/10x/week7/Kara_Solutions/data/cleaned_telegram_data.csv'.


