
# **📊 Automated Content Tagging Provision 📑**
## **Objective:**
### **Build a system to analyze rich text input and return 10-30 relevant 1-4 word tags with weightage scores.**
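
The objective can be read as a small output contract. The sketch below is one possible way to check it, not part of the notebook's pipeline: the `(tag, weight)` tuple shape and the helper name `is_valid_tag_list` are assumptions made for illustration, and the tags are made-up toy data.

```python
from typing import List, Tuple

# Hypothetical shape for one tag: (tag text, weight). Not defined in the notebook.
Tag = Tuple[str, float]


def is_valid_tag_list(tags: List[Tag], min_tags: int = 10, max_tags: int = 30,
                      max_words: int = 4) -> bool:
    """Return True if the list meets the stated limits: 10-30 tags, 1-4 words each."""
    if not (min_tags <= len(tags) <= max_tags):
        return False
    for text, weight in tags:
        n_words = len(text.split())
        if not (1 <= n_words <= max_words):
            return False
        if not isinstance(weight, (int, float)):
            return False
    return True


# Toy data: 12 made-up two-word tags with descending weights.
example = [(f"topic {i}", round(1.0 - i * 0.05, 2)) for i in range(12)]
print(is_valid_tag_list(example))
```

A check like this could sit at the end of whatever extraction step is eventually chosen, so that too few tags or overly long phrases are caught early.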

In [91]:
!pip install nltk spacy yake rake-nltk transformers scikit-learn fastapi uvicorn --quiet

In [92]:
import pandas as pd  # needed here: this cell runs before the main import cell

pd.set_option('display.max_colwidth', None)

# **0. Importing necessary libraries**

In [93]:
import pandas as pd
import numpy as np
import nltk
import spacy
import yake
from rake_nltk import Rake
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from transformers import pipeline
import re
import string

# **1. Data Collection**

In [94]:
from google.colab import files
files.upload()

!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

!kaggle datasets download -d rmisra/news-category-dataset
!unzip news-category-dataset.zip

Saving kaggle.json to kaggle (3).json
Dataset URL: https://www.kaggle.com/datasets/rmisra/news-category-dataset
License(s): Attribution 4.0 International (CC BY 4.0)
news-category-dataset.zip: Skipping, found more recently modified local copy (use --force to force download)
Archive:  news-category-dataset.zip
replace News_Category_Dataset_v3.json? [y]es, [n]o, [A]ll, [N]one, [r]ename: n


In [95]:
df = pd.read_json("News_Category_Dataset_v3.json", lines=True)
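
`lines=True` tells pandas that the file is in JSON Lines form, with one JSON object per line rather than a single JSON array. A minimal standard-library sketch of the same idea, using two made-up records instead of the real dataset:

```python
import io
import json

# Two made-up records in JSON Lines form: one JSON object per line.
raw = io.StringIO(
    '{"headline": "Example headline A", "category": "TECH"}\n'
    '{"headline": "Example headline B", "category": "SPORTS"}\n'
)

# Parse each non-empty line on its own; each line becomes one row.
records = [json.loads(line) for line in raw if line.strip()]
print(len(records), records[0]["category"])
```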

# **2. Data Pre-Processing**

In [96]:
df.shape

(209527, 6)

In [97]:
df.head()

,link,headline,category,short_description,authors,date
0,https://www.huffpost.com/entry/covid-boosters-uptake-us_n_632d719ee4b087fae6feaac9,Over 4 Million Americans Roll Up Sleeves For Omicron-Targeted COVID Boosters,U.S. NEWS,Health experts said it is too early to predict whether demand would match up with the 171 million doses of the new boosters the U.S. ordered for the fall.,"Carla K. Johnson, AP",2022-09-23
1,https://www.huffpost.com/entry/american-airlines-passenger-banned-flight-attendant-punch-justice-department_n_632e25d3e4b0e247890329fe,"American Airlines Flyer Charged, Banned For Life After Punching Flight Attendant On Video",U.S. NEWS,"He was subdued by passengers and crew when he fled to the back of the aircraft after the confrontation, according to the U.S. attorney's office in Los Angeles.",Mary Papenfuss,2022-09-23
2,https://www.huffpost.com/entry/funniest-tweets-cats-dogs-september-17-23_n_632de332e4b0695c1d81dc02,23 Of The Funniest Tweets About Cats And Dogs This Week (Sept. 17-23),COMEDY,"""Until you have a dog you don't understand what could be eaten.""",Elyse Wanshel,2022-09-23
3,https://www.huffpost.com/entry/funniest-parenting-tweets_l_632d7d15e4b0d12b5403e479,The Funniest Tweets From Parents This Week (Sept. 17-23),PARENTING,"""Accidentally put grown-up toothpaste on my toddler’s toothbrush and he screamed like I was cleaning his teeth with a Carolina Reaper dipped in Tabasco sauce.""",Caroline Bologna,2022-09-23
4,https://www.huffpost.com/entry/amy-cooper-loses-discrimination-lawsuit-franklin-templeton_n_632c6463e4b09d8701bd227e,Woman Who Called Cops On Black Bird-Watcher Loses Lawsuit Against Ex-Employer,U.S. NEWS,Amy Cooper accused investment firm Franklin Templeton of unfairly firing her and branding her a racist after video of the Central Park encounter went viral.,Nina Golgowski,2022-09-22


In [98]:
df.columns

Index(['link', 'headline', 'category', 'short_description', 'authors', 'date'], dtype='object')

### **We will keep only those columns that are necessary for our Problem Statement**

In [99]:
# .copy() gives an independent frame, so the column assignment in the next cell
# does not trigger pandas' SettingWithCopyWarning
df = df[['headline', 'short_description', 'category']].copy()

In [100]:
df['text'] = df['headline'] + ' ' + df['short_description']


In [101]:
df = df[['text', 'category']]

In [102]:
df['category'].unique()

array(['U.S. NEWS', 'COMEDY', 'PARENTING', 'WORLD NEWS', 'CULTURE & ARTS',
       'TECH', 'SPORTS', 'ENTERTAINMENT', 'POLITICS', 'WEIRD NEWS',
       'ENVIRONMENT', 'EDUCATION', 'CRIME', 'SCIENCE', 'WELLNESS',
       'BUSINESS', 'STYLE & BEAUTY', 'FOOD & DRINK', 'MEDIA',
       'QUEER VOICES', 'HOME & LIVING', 'WOMEN', 'BLACK VOICES', 'TRAVEL',
       'MONEY', 'RELIGION', 'LATINO VOICES', 'IMPACT', 'WEDDINGS',
       'COLLEGE', 'PARENTS', 'ARTS & CULTURE', 'STYLE', 'GREEN', 'TASTE',
       'HEALTHY LIVING', 'THE WORLDPOST', 'GOOD NEWS', 'WORLDPOST',
       'FIFTY', 'ARTS', 'DIVORCE'], dtype=object)

### **Several categories are duplicates or near-duplicates of one another, and others can be grouped under a broader name. Merging them is a pre-processing step that also reduces class imbalance**

In [103]:
category_merge_map = {
    # Duplicate or near-duplicate names for the same section
    'CULTURE & ARTS': 'ARTS & CULTURE',
    'ARTS': 'ARTS & CULTURE',
    'STYLE': 'STYLE & BEAUTY',
    'GREEN': 'ENVIRONMENT',
    'TASTE': 'FOOD & DRINK',
    'WORLDPOST': 'THE WORLDPOST',
    'PARENTS': 'PARENTING',
    'HEALTHY LIVING': 'WELLNESS',
    # Broader groupings (coarser merges chosen to reduce the number of classes)
    'BLACK VOICES': 'QUEER VOICES',
    'LATINO VOICES': 'QUEER VOICES',
    'FIFTY': 'WOMEN',
    'WORLD NEWS': 'U.S. NEWS',
    'DIVORCE': 'PARENTING',
    'COMEDY': 'ENTERTAINMENT',
    'MONEY': 'BUSINESS',
    'IMPACT': 'POLITICS',
    'COLLEGE': 'EDUCATION'
}

In [104]:
df['category'] = df['category'].replace(category_merge_map)

In [105]:
df.category.unique()

array(['U.S. NEWS', 'ENTERTAINMENT', 'PARENTING', 'ARTS & CULTURE',
       'TECH', 'SPORTS', 'POLITICS', 'WEIRD NEWS', 'ENVIRONMENT',
       'EDUCATION', 'CRIME', 'SCIENCE', 'WELLNESS', 'BUSINESS',
       'STYLE & BEAUTY', 'FOOD & DRINK', 'MEDIA', 'QUEER VOICES',
       'HOME & LIVING', 'WOMEN', 'TRAVEL', 'RELIGION', 'WEDDINGS',
       'THE WORLDPOST', 'GOOD NEWS'], dtype=object)
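
A quick sanity check on a merge map like this one is worth a few lines. The standard-library sketch below copies the category names and the map from the cells above and checks three things: every key is a real category, no merge target is itself a key (a one-pass lookup would leave such a category at an intermediate name), and 42 categories collapse to 25. The variable names `chained`, `unknown_keys` and `merged` are illustrative.

```python
original_categories = [
    'U.S. NEWS', 'COMEDY', 'PARENTING', 'WORLD NEWS', 'CULTURE & ARTS',
    'TECH', 'SPORTS', 'ENTERTAINMENT', 'POLITICS', 'WEIRD NEWS',
    'ENVIRONMENT', 'EDUCATION', 'CRIME', 'SCIENCE', 'WELLNESS',
    'BUSINESS', 'STYLE & BEAUTY', 'FOOD & DRINK', 'MEDIA',
    'QUEER VOICES', 'HOME & LIVING', 'WOMEN', 'BLACK VOICES', 'TRAVEL',
    'MONEY', 'RELIGION', 'LATINO VOICES', 'IMPACT', 'WEDDINGS',
    'COLLEGE', 'PARENTS', 'ARTS & CULTURE', 'STYLE', 'GREEN', 'TASTE',
    'HEALTHY LIVING', 'THE WORLDPOST', 'GOOD NEWS', 'WORLDPOST',
    'FIFTY', 'ARTS', 'DIVORCE',
]

category_merge_map = {
    'CULTURE & ARTS': 'ARTS & CULTURE', 'ARTS': 'ARTS & CULTURE',
    'STYLE': 'STYLE & BEAUTY', 'GREEN': 'ENVIRONMENT',
    'TASTE': 'FOOD & DRINK', 'WORLDPOST': 'THE WORLDPOST',
    'PARENTS': 'PARENTING', 'BLACK VOICES': 'QUEER VOICES',
    'LATINO VOICES': 'QUEER VOICES', 'FIFTY': 'WOMEN',
    'WORLD NEWS': 'U.S. NEWS', 'DIVORCE': 'PARENTING',
    'HEALTHY LIVING': 'WELLNESS', 'COMEDY': 'ENTERTAINMENT',
    'MONEY': 'BUSINESS', 'IMPACT': 'POLITICS', 'COLLEGE': 'EDUCATION',
}

# Keys that do not occur in the data would be silent no-ops (often typos).
unknown_keys = sorted(set(category_merge_map) - set(original_categories))

# A target that is also a key would not be fully merged by a single lookup.
chained = sorted(set(category_merge_map.values()) & set(category_merge_map))

# Apply the map once to every category name and count what is left.
merged = sorted({category_merge_map.get(c, c) for c in original_categories})

print(unknown_keys, chained, len(original_categories), len(merged))
```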

### **Imbalance still exists, but it is manageable**

In [106]:
df['category'].value_counts()

category,count
POLITICS,39086
WELLNESS,24639
ENTERTAINMENT,22762
PARENTING,16172
STYLE & BEAUTY,12068
QUEER VOICES,12060
TRAVEL,9900
FOOD & DRINK,8436
BUSINESS,7748
THE WORLDPOST,6243
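
One common way to manage a remaining imbalance, should it matter for a later classifier, is to weight classes inversely to their frequency. The sketch below is an illustration only, since the notebook does not say how the imbalance is handled. It uses the formula `n_samples / (n_classes * class_count)`, the same one scikit-learn documents for `class_weight='balanced'`, on made-up toy labels. The helper name `balanced_class_weights` is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Toy labels, not the real counts: one frequent, one medium, one rare class.
toy_labels = ["POLITICS"] * 6 + ["WELLNESS"] * 3 + ["TRAVEL"] * 1
weights = balanced_class_weights(toy_labels)
print(weights)
```

Rare classes receive larger weights, so their errors count for more during training.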


### **Encoding Category column using Label Encoder**

In [107]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['label'] = le.fit_transform(df['category'])

print("Encoded classes:", le.classes_)
print(df[['category', 'label']].head())

import pickle
with open('label_encoder.pkl', 'wb') as f:
    pickle.dump(le, f)

Encoded classes: ['ARTS & CULTURE' 'BUSINESS' 'CRIME' 'EDUCATION' 'ENTERTAINMENT'
 'ENVIRONMENT' 'FOOD & DRINK' 'GOOD NEWS' 'HOME & LIVING' 'MEDIA'
 'PARENTING' 'POLITICS' 'QUEER VOICES' 'RELIGION' 'SCIENCE' 'SPORTS'
 'STYLE & BEAUTY' 'TECH' 'THE WORLDPOST' 'TRAVEL' 'U.S. NEWS' 'WEDDINGS'
 'WEIRD NEWS' 'WELLNESS' 'WOMEN']
        category  label
0      U.S. NEWS     20
1      U.S. NEWS     20
2  ENTERTAINMENT      4
3      PARENTING     10
4      U.S. NEWS     20
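
`LabelEncoder` assigns integer codes in sorted order of the class names, which is why `U.S. NEWS` becomes 20 above. The plain-Python sketch below reproduces that mapping for the 25 merged categories without scikit-learn. It is only meant to show where the numbers come from. With the pickled encoder itself, decoding would go through `le.inverse_transform`.

```python
# The 25 merged category names, in the order they first appear in the data.
merged_categories = [
    'U.S. NEWS', 'ENTERTAINMENT', 'PARENTING', 'ARTS & CULTURE',
    'TECH', 'SPORTS', 'POLITICS', 'WEIRD NEWS', 'ENVIRONMENT',
    'EDUCATION', 'CRIME', 'SCIENCE', 'WELLNESS', 'BUSINESS',
    'STYLE & BEAUTY', 'FOOD & DRINK', 'MEDIA', 'QUEER VOICES',
    'HOME & LIVING', 'WOMEN', 'TRAVEL', 'RELIGION', 'WEDDINGS',
    'THE WORLDPOST', 'GOOD NEWS',
]

# Sorted unique names play the role of le.classes_.
classes = sorted(set(merged_categories))
to_label = {name: i for i, name in enumerate(classes)}

# Categories of the first five rows, as printed above.
first_rows = ['U.S. NEWS', 'U.S. NEWS', 'ENTERTAINMENT', 'PARENTING', 'U.S. NEWS']
labels = [to_label[c] for c in first_rows]

# Decoding is an index lookup into the sorted class list.
decoded = [classes[i] for i in labels]
print(labels, decoded == first_rows)
```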


### **Split the dataset into training and test sets before pre-processing the text column, so that nothing fitted later (such as a tokenizer or vectorizer) sees the test data and causes leakage**

In [108]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['label'], test_size=0.2, random_state=42, stratify=df['label']
)

print(f"Train set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")

Train set size: 167621
Test set size: 41906
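
The two sizes follow directly from `test_size=0.2` on 209,527 rows. The arithmetic below assumes that scikit-learn rounds the test share up and gives the remainder to the training set, which matches the printed output.

```python
import math

n_samples = 209527   # rows in the dataset (from df.shape)
test_size = 0.2

n_test = math.ceil(test_size * n_samples)   # 41905.4 rounded up
n_train = n_samples - n_test                # the remainder goes to training

print(n_train, n_test)
```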


In [109]:
df.head()

,text,category,label
0,Over 4 Million Americans Roll Up Sleeves For Omicron-Targeted COVID Boosters Health experts said it is too early to predict whether demand would match up with the 171 million doses of the new boosters the U.S. ordered for the fall.,U.S. NEWS,20
1,"American Airlines Flyer Charged, Banned For Life After Punching Flight Attendant On Video He was subdued by passengers and crew when he fled to the back of the aircraft after the confrontation, according to the U.S. attorney's office in Los Angeles.",U.S. NEWS,20
2,"23 Of The Funniest Tweets About Cats And Dogs This Week (Sept. 17-23) ""Until you have a dog you don't understand what could be eaten.""",ENTERTAINMENT,4
3,"The Funniest Tweets From Parents This Week (Sept. 17-23) ""Accidentally put grown-up toothpaste on my toddler’s toothbrush and he screamed like I was cleaning his teeth with a Carolina Reaper dipped in Tabasco sauce.""",PARENTING,10
4,Woman Who Called Cops On Black Bird-Watcher Loses Lawsuit Against Ex-Employer Amy Cooper accused investment firm Franklin Templeton of unfairly firing her and branding her a racist after video of the Central Park encounter went viral.,U.S. NEWS,20


### **Cleaning the text Column**

In [110]:
# 1. Check for missing values
print("Missing values in 'text':", df['text'].isnull().sum())

Missing values in 'text': 0


In [111]:
# 2. Strip leading/trailing whitespace
df['text'] = df['text'].str.strip()

In [112]:
# 3. Remove empty strings after stripping (if any)
df = df[df['text'] != ''].copy()  # copy so the next assignment does not act on a slice

In [113]:
# 4. Replace multiple spaces/newlines with single space
df['text'] = df['text'].str.replace(r'\s+', ' ', regex=True)

In [114]:
df.head()

,text,category,label
0,Over 4 Million Americans Roll Up Sleeves For Omicron-Targeted COVID Boosters Health experts said it is too early to predict whether demand would match up with the 171 million doses of the new boosters the U.S. ordered for the fall.,U.S. NEWS,20
1,"American Airlines Flyer Charged, Banned For Life After Punching Flight Attendant On Video He was subdued by passengers and crew when he fled to the back of the aircraft after the confrontation, according to the U.S. attorney's office in Los Angeles.",U.S. NEWS,20
2,"23 Of The Funniest Tweets About Cats And Dogs This Week (Sept. 17-23) ""Until you have a dog you don't understand what could be eaten.""",ENTERTAINMENT,4
3,"The Funniest Tweets From Parents This Week (Sept. 17-23) ""Accidentally put grown-up toothpaste on my toddler’s toothbrush and he screamed like I was cleaning his teeth with a Carolina Reaper dipped in Tabasco sauce.""",PARENTING,10
4,Woman Who Called Cops On Black Bird-Watcher Loses Lawsuit Against Ex-Employer Amy Cooper accused investment firm Franklin Templeton of unfairly firing her and branding her a racist after video of the Central Park encounter went viral.,U.S. NEWS,20
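
Note that the cleaning cells above modify `df`, while `X_train` and `X_test` were taken from `df['text']` before the cleaning ran, so the split series still hold the uncleaned strings. One way to keep them in step is to put the cleaning in a single function and apply it to both splits. The sketch below uses plain lists and made-up strings, and the name `clean_text` is hypothetical.

```python
import re

_WHITESPACE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Collapse runs of whitespace (spaces, tabs, newlines) to one space and trim."""
    return _WHITESPACE.sub(" ", text).strip()


# Toy stand-ins for X_train and X_test.
train_texts = ["  Headline one \n short   description ", "   "]
test_texts = ["Headline two\tshort description"]

# Apply the same function to both splits, then drop strings that end up empty.
clean_train = [t for t in map(clean_text, train_texts) if t]
clean_test = [t for t in map(clean_text, test_texts) if t]

print(clean_train, clean_test)
```

With pandas, the same function can be passed to `X_train.map(clean_text)` and `X_test.map(clean_text)`. If empty rows are then dropped, the same mask has to be applied to `y_train` and `y_test` so that texts and labels stay aligned.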


-------------