
# **📰 News Category Dataset**

This notebook walks through the **data cleaning process** for the *News Category Dataset*.
It covers the following steps:

- Introducing and handling missing values
- Checking and fixing data types
- Standardizing text data
- Exporting the cleaned dataset for further analysis

## Project Overview

The **News Category Dataset** (by Misra, Kaggle) contains 209,527 news headlines and short descriptions from the Huffington Post, published between 2012-01-28 and 2022-09-23, along with links, authors, publication dates, and one of 42 assigned categories.

Our task in this phase is to **perform thorough data cleaning** so the dataset is ready for analysis and machine learning models.

# News Category Dataset: Data Cleaning & Preprocessing
**Description:** Cleaning and preprocessing the “News Category Dataset” from Kaggle for downstream tasks (e.g., classification, analysis).


## Objectives

- Load and explore the dataset  
- Handle missing, duplicate, and inconsistent data  
- Clean and normalize text fields  
- Verify category distribution  
- Save the cleaned dataset for further use

In [68]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

In [69]:
df = pd.read_json('/content/News_Category_Dataset_v3.json', lines=True)

In [70]:
df.head()
df.info()
df.describe(include='all')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209527 entries, 0 to 209526
Data columns (total 6 columns):
 #   Column             Non-Null Count   Dtype         
---  ------             --------------   -----         
 0   link               209527 non-null  object        
 1   headline           209527 non-null  object        
 2   category           209527 non-null  object        
 3   short_description  209527 non-null  object        
 4   authors            209527 non-null  object        
 5   date               209527 non-null  datetime64[ns]
dtypes: datetime64[ns](1), object(5)
memory usage: 9.6+ MB


| | link | headline | category | short_description | authors | date |
|---|---|---|---|---|---|---|
| count | 209527 | 209527 | 209527 | 209527 | 209527 | 209527 |
| unique | 209486 | 207996 | 42 | 187022 | 29169 | |
| top | https://www.huffingtonpost.comhttp://stylelike... | Sunday Roundup | POLITICS | | | |
| freq | 2 | 90 | 35602 | 19712 | 37418 | |
| mean | | | | | | 2015-04-30 00:44:14.344308736 |
| min | | | | | | 2012-01-28 00:00:00 |
| 25% | | | | | | 2013-08-10 00:00:00 |
| 50% | | | | | | 2015-03-16 00:00:00 |
| 75% | | | | | | 2016-11-01 00:00:00 |
| max | | | | | | 2022-09-23 00:00:00 |


## Initial Data Exploration (EDA)

In [71]:
df.isna().sum()
df['category'].value_counts().head(10)
df['headline'].str.len().describe()
df['short_description'].str.len().describe()  # only this last expression is displayed

| | short_description |
|---|---|
| count | 209527.0 |
| mean | 114.20867 |
| std | 80.840575 |
| min | 0.0 |
| 25% | 59.0 |
| 50% | 120.0 |
| 75% | 134.0 |
| max | 1472.0 |
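
To check the category distribution named in the objectives, relative frequencies are often easier to read than raw counts. The following is a minimal sketch, assuming a small hypothetical `sample` frame in place of the full dataset.

```python
import pandas as pd

# Hypothetical stand-in for df; the real data has 42 categories.
sample = pd.DataFrame({"category": ["POLITICS", "POLITICS", "COMEDY", "TECH"]})

# Share of rows per category, largest first.
shares = sample["category"].value_counts(normalize=True)
print(shares.round(2))
```

In the full dataset, the summary statistics give POLITICS 35,602 of 209,527 rows, or roughly 17%, so the classes are noticeably imbalanced.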


In [72]:
df.isna().sum()

| Column | Missing values |
|---|---|
| link | 0 |
| headline | 0 |
| category | 0 |
| short_description | 0 |
| authors | 0 |
| date | 0 |
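
`isna()` reports no missing values, yet the summary statistics show a blank most-frequent value for `short_description` (19,712 rows) and `authors` (37,418 rows), and a minimum description length of 0. That points to empty strings rather than nulls. One way to surface them is to treat blank strings as missing. The following is a minimal sketch on a hypothetical `sample` frame, not a step the notebook performs.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with empty and whitespace-only entries.
sample = pd.DataFrame({
    "short_description": ["A short summary.", "", "   "],
    "authors": ["Jane Doe", "", "John Roe"],
})

# Count values that are empty or whitespace-only in each text column.
blank_counts = sample.apply(lambda col: col.str.strip().eq("").sum())
print(blank_counts)

# Replace blank strings with NaN so isna() and dropna() can see them.
cleaned = sample.replace(r"^\s*$", np.nan, regex=True)
print(cleaned.isna().sum())
```

Whether such rows should then be dropped or kept is a judgment call; dropping every row with a blank author would remove a sizable share of the data.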


### Introducing Missing Values for Demonstration

To simulate a more realistic dataset and practice handling missing data, we randomly introduce some `NaN` (null) values in the following columns:

- **authors** → 5 randomly selected rows will have missing author names.
- **short_description** → 8 randomly selected rows will have missing descriptions.
- **category** → 3 randomly selected rows will have missing category labels.

This lets us demonstrate data cleaning techniques such as handling null values later in the notebook.


In [73]:
import numpy as np
np.random.seed(42)

null_indices = np.random.choice(df.index, 5, replace=False)
df.loc[null_indices, 'authors'] = np.nan  #authors missing

null_indices = np.random.choice(df.index, 8, replace=False)
df.loc[null_indices, 'short_description'] = np.nan  #descriptions missing

null_indices = np.random.choice(df.index, 3, replace=False)
df.loc[null_indices, 'category'] = np.nan  #categories missing
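
The same injection can be sketched as a small helper that takes the per-column counts as a mapping, which keeps the counts in one place. This is an illustrative variant rather than the cell above: `inject_nans` is a hypothetical name, the `sample` frame is made up, and the helper uses `np.random.default_rng` instead of the global seed.

```python
import numpy as np
import pandas as pd

def inject_nans(frame, counts, seed=42):
    """Return a copy of frame with counts[col] random rows set to NaN per column."""
    rng = np.random.default_rng(seed)
    out = frame.copy()
    for col, n in counts.items():
        # Pick n distinct row labels for this column.
        idx = rng.choice(out.index, size=n, replace=False)
        out.loc[idx, col] = np.nan
    return out

# Hypothetical sample data standing in for df.
sample = pd.DataFrame({
    "authors": [f"author {i}" for i in range(20)],
    "category": ["TECH"] * 20,
})
with_nans = inject_nans(sample, {"authors": 5, "category": 3})
print(with_nans.isna().sum())
```

Working on a copy leaves the original frame untouched, which makes it easy to compare before and after.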


In [74]:
df.columns

Index(['link', 'headline', 'category', 'short_description', 'authors', 'date'], dtype='object')

In [75]:
df.isnull().sum()

| Column | Missing values |
|---|---|
| link | 0 |
| headline | 0 |
| category | 3 |
| short_description | 8 |
| authors | 5 |
| date | 0 |


In [76]:
df = df.dropna(subset=['authors', 'short_description', 'category'])

In [77]:
df.isnull().sum()

| Column | Missing values |
|---|---|
| link | 0 |
| headline | 0 |
| category | 0 |
| short_description | 0 |
| authors | 0 |
| date | 0 |
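
The objectives also list duplicate handling, and the summary statistics report 209,486 unique links across 209,527 rows, so some links repeat. The following is a minimal sketch of one way to inspect and drop them, assuming `link` is a reasonable key for identifying an article (an assumption, not something the notebook establishes). The `sample` frame and its URLs are made up.

```python
import pandas as pd

# Hypothetical sample in which one link appears twice.
sample = pd.DataFrame({
    "link": ["https://example.com/a", "https://example.com/b", "https://example.com/a"],
    "headline": ["First", "Second", "First"],
})

# Flag every row whose link has already appeared earlier in the frame.
dupe_mask = sample.duplicated(subset=["link"], keep="first")
print("Duplicate rows:", int(dupe_mask.sum()))

# Keep the first occurrence of each link and renumber the index.
deduped = sample.drop_duplicates(subset=["link"], keep="first").reset_index(drop=True)
print(deduped)
```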


## Fixing Incorrect Data Types

In [78]:
print("Before fixing types:\n", df.dtypes)
df['date'] = pd.to_datetime(df['date'], errors='coerce')  # ensure the date column is datetime; it already is, so the dtype is unchanged
print("\nAfter fixing types:\n", df.dtypes)

Before fixing types:
 link                         object
headline                     object
category                     object
short_description            object
authors                      object
date                 datetime64[ns]
dtype: object

After fixing types:
 link                         object
headline                     object
category                     object
short_description            object
authors                      object
date                 datetime64[ns]
dtype: object
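
Because `date` was already parsed as `datetime64[ns]` when the JSON was loaded, the conversion above leaves the dtypes unchanged. The `errors='coerce'` argument matters when a column holds strings: unparseable values become `NaT` instead of raising an error. The following is a minimal sketch with made-up values.

```python
import pandas as pd

# Hypothetical raw strings, one of which is not a valid date.
raw_dates = pd.Series(["2022-09-23", "2012-01-28", "not a date"])

# Invalid entries are converted to NaT rather than raising an error.
parsed = pd.to_datetime(raw_dates, errors="coerce")
print(parsed)
print("Unparseable values:", int(parsed.isna().sum()))
```

Counting `NaT` values after the conversion is a quick way to see how many entries failed to parse.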


## Standardizing Text Data

Text data can have inconsistent capitalization or extra spaces.  
We'll clean and standardize the text columns to make them uniform.

In [79]:
text_columns = ['headline', 'category', 'authors']
for col in text_columns:
    df[col] = df[col].astype(str).str.strip().str.lower()

print("Unique categories after cleaning:")
print(df['category'].unique()[:10])

Unique categories after cleaning:
['u.s. news' 'comedy' 'parenting' 'world news' 'culture & arts' 'tech'
 'sports' 'entertainment' 'politics' 'weird news']
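
`str.strip()` only removes leading and trailing whitespace. If repeated spaces or tabs inside a string also need collapsing, a regular-expression replace can be chained on. The following is a minimal sketch on made-up headlines; the `normalize_text` helper is hypothetical and not part of the notebook.

```python
import pandas as pd

def normalize_text(series):
    """Lowercase, trim, and collapse internal runs of whitespace to one space."""
    return (
        series.astype(str)
        .str.strip()
        .str.lower()
        .str.replace(r"\s+", " ", regex=True)
    )

# Hypothetical headlines with stray spaces and a tab.
headlines = pd.Series(["  Sunday   Roundup ", "WORLD\tNews"])
print(normalize_text(headlines).tolist())
```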


In [80]:
df.to_csv("News_Category_Dataset_v3_cleaned.csv", index=False)
print("✅ Cleaned dataset successfully saved as 'News_Category_Dataset_v3_cleaned.csv'")

✅ Cleaned dataset successfully saved as 'News_Category_Dataset_v3_cleaned.csv'
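
One caveat when reloading the CSV: `read_csv` treats empty fields as missing by default, so any empty strings in `short_description` or `authors` would come back as `NaN`, and `date` would come back as text unless it is parsed. The following is a minimal sketch of the round trip, using an in-memory buffer and a made-up `sample` frame in place of the real file.

```python
import io
import pandas as pd

# Hypothetical sample with one empty description.
sample = pd.DataFrame({
    "headline": ["sunday roundup", "tech news"],
    "short_description": ["", "a summary"],
    "date": pd.to_datetime(["2022-09-23", "2012-01-28"]),
})

buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Default read: the empty description becomes NaN.
buffer.seek(0)
default_read = pd.read_csv(buffer, parse_dates=["date"])
print(default_read["short_description"].isna().sum())

# keep_default_na=False preserves empty strings as ''.
buffer.seek(0)
kept = pd.read_csv(buffer, parse_dates=["date"], keep_default_na=False)
print(kept["short_description"].tolist(), kept["date"].dtype)
```

Which behavior is preferable depends on whether downstream code expects blanks or nulls; the point is to choose one deliberately.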


We have now cleaned our dataset by:

- Handling missing values
- Checking data types (the `date` column was already `datetime64[ns]`)
- Standardizing text data (headline, category, authors)
- Exporting the result to `News_Category_Dataset_v3_cleaned.csv`

The dataset is now ready for further exploration or analysis.