# Zero-Shot-Classification
We use a zero-shot model to reclassify every news item in the dataset, with the goal of obtaining a more evenly distributed classification.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from transformers import pipeline

from forecasting_bluesky_code import preprocessing as pre
from forecasting_bluesky_code import eda_plots as ep




### Dataset
We reuse the per-day news dataset and reclassify every news item into a new `subject` column.

In [3]:
df = pd.read_csv('news_data_by_subject.csv')

In [4]:
df = pre.preprocessing_news_df(df)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1814 entries, 0 to 1813
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   date      1814 non-null   datetime64[ns]
 1   year      1814 non-null   int64         
 2   month     1814 non-null   int64         
 3   day       1814 non-null   int64         
 4   headline  1814 non-null   object        
 5   subject   1814 non-null   object        
dtypes: datetime64[ns](1), int64(3), object(2)
memory usage: 85.2+ KB
None


In [5]:
df = df.drop(columns = ['year','month','day','subject'])

In [6]:
df = df.set_index('date')

In [7]:
df['headline'] = df['headline'].apply(pre.text_cleaning)

In [8]:
df

date,headline
2023-05-08,the tradition golf international winner steve ...
2023-05-09,cyclone mocha forms in the indian ocean killin...
2023-05-10,italian open tennis international winner men d...
2023-05-10,karnataka legislative assembly election
2023-05-11,the discovery of new moons of saturn is report...
...,...
2025-04-06,montecarlo masters tennis international winner...
2025-04-06,japanese grand prix f formula racing internati...
2025-04-06,in ice hockey washington capitals star alexand...
2025-04-07,colossal biosciences announces romulus remus a...


### Zero-Shot-Classification
We apply the zero-shot function in batches to keep CPU usage under better control.

In [9]:
batch = 30
# Fractional value; the number of batches actually processed is its ceiling (61).
print(f"Total batches: {len(df['headline']) / batch}")

Total batches: 60.46666666666667


In [19]:
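# `isc` is not imported in the cells shown above; it must be imported first
# (presumably a module of forecasting_bluesky_code that provides this helper).
df_classified = isc.zero_shot_classification_batched(df, 'headline', batch_size=batch)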

Device set to use cpu


Processing batch 0–30...
Processing batch 30–60...
Processing batch 60–90...
Processing batch 90–120...
Processing batch 120–150...
Processing batch 150–180...
Processing batch 180–210...
Processing batch 210–240...
Processing batch 240–270...
Processing batch 270–300...
Processing batch 300–330...
Processing batch 330–360...
Processing batch 360–390...
Processing batch 390–420...
Processing batch 420–450...
Processing batch 450–480...
Processing batch 480–510...
Processing batch 510–540...
Processing batch 540–570...
Processing batch 570–600...
Processing batch 600–630...
Processing batch 630–660...
Processing batch 660–690...
Processing batch 690–720...
Processing batch 720–750...
Processing batch 750–780...
Processing batch 780–810...
Processing batch 810–840...
Processing batch 840–870...
Processing batch 870–900...
Processing batch 900–930...
Processing batch 930–960...
Processing batch 960–990...
Processing batch 990–1020...
Processing batch 1020–1050...
Processing batch 1050–108
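
The helper `isc.zero_shot_classification_batched` is not shown in this notebook, so the following is only a minimal sketch of how batched zero-shot classification could work, not the implementation used above. The function name `classify_in_batches`, its parameters, and the example label list are assumptions made for illustration. The classifier is passed in as a callable so that a Hugging Face `pipeline("zero-shot-classification")` object, which accepts a list of texts plus `candidate_labels` and returns the labels sorted by score, can be plugged in.

```python
import pandas as pd


def classify_in_batches(df, text_col, labels, classifier, batch_size=30):
    """Hypothetical helper: label each row of `text_col` in fixed-size batches.

    `classifier` is any callable taking (list_of_texts, candidate_labels=...)
    and returning one dict per text with a "labels" list sorted by score,
    which is the output shape of the transformers zero-shot pipeline.
    """
    out = df.copy()
    texts = out[text_col].tolist()
    subjects = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        print(f"Processing batch {start}-{start + len(chunk)}...")
        results = classifier(chunk, candidate_labels=labels)
        # Some pipeline versions return a bare dict for a single input.
        if isinstance(results, dict):
            results = [results]
        # Keep only the top-ranked label for each headline.
        subjects.extend(r["labels"][0] for r in results)
    out["subject"] = subjects
    return out


# Possible usage with the real pipeline (assumed, downloads a default model):
# from transformers import pipeline
# clf = pipeline("zero-shot-classification")
# labels = ["sports", "elections", "politics", "science"]  # illustrative subset
# df_classified = classify_in_batches(df, "headline", labels, clf, batch_size=30)
```

Processing a slice at a time keeps only one batch of model inputs in memory and gives a progress line per batch, which is the behaviour visible in the log above.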

In [66]:
df_classified.to_csv('news_data_classified_by_subject.csv')

In [48]:
df_classified['subject'].value_counts()

subject
sports                     468
elections                  358
politics                   165
crime                      148
technology                 148
science                    126
environment                105
international relations     58
war                         46
videogames                  45
protests                    42
entertainment               32
health                      28
artificial intelligence     15
finance                     13
immigration                 10
pandemics                    5
education                    2
Name: count, dtype: int64
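
Since the stated goal is a more evenly distributed classification, it can help to look at shares rather than raw counts. The sketch below rebuilds the counts printed above as a `pandas` Series and normalizes them; in the notebook itself the same result would come from `df_classified['subject'].value_counts(normalize=True)`. The variable names are illustrative.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series(
    {
        "sports": 468, "elections": 358, "politics": 165, "crime": 148,
        "technology": 148, "science": 126, "environment": 105,
        "international relations": 58, "war": 46, "videogames": 45,
        "protests": 42, "entertainment": 32, "health": 28,
        "artificial intelligence": 15, "finance": 13, "immigration": 10,
        "pandemics": 5, "education": 2,
    },
    name="count",
)

total = counts.sum()       # should equal the number of headlines (1814)
shares = counts / total    # fraction of headlines per subject
top_two_share = shares.iloc[:2].sum()  # combined weight of the two largest classes

print(shares.round(3))
print(f"Top two subjects cover {top_two_share:.1%} of the headlines")
```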

In [52]:
# Pivot the dataset so that each subject becomes a column with its associated headline(s)

# Group by 'date' and 'subject', concatenate headlines per subject per day
grouped = df_classified.groupby(['date', 'subject'])['headline'].apply(lambda x: ' '.join(x))

# Unstack 'subject' so that each subject becomes a column
df_news_daily = grouped.unstack(level='subject', fill_value='')

# Reset the index so that 'date' becomes a regular column again
df_news_daily = df_news_daily.reset_index()
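
To make the effect of the pivot concrete, here is a small self-contained sketch on toy data that applies the same group, join, and unstack steps. The headlines and the names `toy` and `wide` are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame shaped like df_classified: 'date' index, 'headline' and 'subject' columns.
toy = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2025-04-06", "2025-04-06", "2025-04-06", "2025-04-07"]
        ),
        "headline": [
            "tennis final",
            "grand prix race",
            "new moons found",
            "biotech announcement",
        ],
        "subject": ["sports", "sports", "science", "science"],
    }
).set_index("date")

# One string per (date, subject): headlines of the same subject are joined.
grouped = toy.groupby(["date", "subject"])["headline"].agg(" ".join)

# Subjects become columns; days without a given subject get an empty string.
wide = grouped.unstack(level="subject", fill_value="").reset_index()
print(wide)
```

Each row of `wide` is one day, and each subject column holds that day's concatenated headlines, or an empty string when the subject did not appear.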

In [61]:
df_news_daily.to_csv('news_data_classified_by_day.csv', index = False)