In [1]:
import pandas as pd
import re
from datetime import date
from datetime import datetime
import locale
locale.setlocale(locale.LC_ALL, ('es_ES', 'UTF-8'))

'es_ES.UTF-8'

In [2]:
df = pd.read_csv("../../data/nubank_news.csv")

# GENERAL

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 290 entries, 0 to 289
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   link       290 non-null    object
 1   title      290 non-null    object
 2   date       290 non-null    object
 3   summary    290 non-null    object
 4   paragraph  290 non-null    object
dtypes: object(5)
memory usage: 11.5+ KB


In [4]:
df.describe(include='object')

,link,title,date,summary,paragraph
count,290,290,290,290,290
unique,290,290,242,290,290
top,https://international.nubank.com.br/es/compani...,Otávio Ribeiro Damaso comienza en Nubank como ...,"Nov 26 , 2024",Damaso se desempeñó como Director de Regulació...,"São Paulo, 4 de julio de 2025 – Nu, una de las..."
freq,1,1,3,1,1


Count the missing (null/NaN) values in each column of `df`. The result is a Series indexed by column name, so it shows at a glance which columns have gaps and how many.

In [5]:
df.isnull().sum()

link         0
title        0
date         0
summary      0
paragraph    0
dtype: int64
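
Note that `isnull()` does not flag strings that are empty or contain only whitespace. A minimal sketch of a complementary check, using a small made-up frame (the values below are illustrative, not rows from the dataset):

```python
import pandas as pd

# Illustrative frame: one whitespace-only title, one real null summary.
frame = pd.DataFrame({
    "title": ["Nu lanza producto", "  "],
    "summary": ["-", None],
})

# Count strings that are empty after stripping whitespace.
# Real nulls stay NaN under .str.strip() and are therefore not counted here.
blank_counts = frame.apply(lambda col: col.str.strip().eq("").sum())
print(blank_counts)
```

Running the same `apply` on `df` would complement the null counts above.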

Count the rows of `df` that are exact duplicates of an earlier row. The result is a single integer; anything above zero means the dataset has repeated entries.

In [6]:
df.duplicated().sum()

np.int64(0)

# LINKS

Check invalid links

In [7]:
df['link'].apply(lambda x: not x.startswith('http')).sum()

np.int64(0)
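
The `startswith('http')` test accepts any string with that prefix. A stricter sketch, assuming that "valid" means an http(s) scheme plus a host, could rely on `urllib.parse.urlparse`. The helper name `is_valid_url` and the sample values are hypothetical:

```python
from urllib.parse import urlparse

import pandas as pd

def is_valid_url(value):
    """Return True when value parses as an http(s) URL with a host."""
    parsed = urlparse(str(value))
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

# Illustrative values, not rows from the dataset.
links = pd.Series([
    "https://example.com/news/1",
    "httpfoo",                 # passes startswith('http') but is not a URL
    "ftp://example.com/file",  # valid URL, wrong scheme
])

invalid_count = (~links.map(is_valid_url)).sum()
print(invalid_count)
```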

Check duplicated links

In [8]:
df['link'].duplicated().sum()

np.int64(0)

Remove duplicates

In [9]:
df = df.drop_duplicates(subset='link')

# DATE

Verify that all dates have the following format: Month Day , Year, with a space before the comma (for example: Abr 2 , 2024).

In [10]:
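def has_bad_date_format(value):
    # True when the value does NOT start with "Month Day , Year" (e.g. "Abr 2 , 2024").
    return not bool(re.match(r'[a-zA-Z]+\s\d+\s.\s\d+', value))

df_bad_format = df[df['date'].apply(has_bad_date_format)]

# Print every offending value (apply itself returns a Series of None).
df_bad_format["date"].apply(lambda x: print(x))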

Nubank anuncia SHEIN en el Shopping de Nu
Abr 2 , 2024
La tasa para comprar y vender criptoactivos se reducirá para los clientes que más negocien
Mar 21 , 2024
Nu es galardonado como Banco Digital del Año por LatinFinance
Nov 6 , 2023
Nu es premiado en el Top of Mind de Folha de São Paulo
Nov 1 , 2023
Nu amplía la oferta de préstamos de nómina para jubilados y pensionados del INSS
Oct 24 , 2023


146    None
151    None
194    None
195    None
198    None
Name: date, dtype: object
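
The same check can be sketched without `apply`, using the vectorized `Series.str.fullmatch`. This version requires the whole string to be a date and uses a literal comma instead of `.`. The sample assumes the scraped title and the date are separated by a newline, which is a guess based on the printed output:

```python
import pandas as pd

# Illustrative values modelled on the output above.
dates = pd.Series([
    "Abr 2 , 2024",
    "Nubank anuncia SHEIN en el Shopping de Nu\nAbr 2 , 2024",
])

# The whole string must be "Month Day , Year".
pattern = r"[a-zA-Z]+\s\d+\s,\s\d+"
is_clean = dates.str.fullmatch(pattern)

bad = dates[~is_clean]
print(bad.index.tolist())
```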

Some dates were extracted together with the title text. The idea here is to keep only the part in the format Month Day , Year (for example: Abr 2 , 2024) and convert it to a datetime.

In [11]:
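# Clean dates
temp_dates = []

for raw_date in df['date']:
    # Keep only the "Month Day , Year" part, dropping any title text scraped with it.
    new_date = re.findall(r'[a-zA-Z]+\s[0-9]+\s.\s[0-9]+', raw_date)[0]
    # Append a period to the month ("Abr" -> "Abr."), presumably the form es_ES expects for '%b'.
    new_date = re.sub(r'([a-zA-Z]+)', r'\1.', new_date)
    new_date = datetime.strptime(new_date, '%b %d , %Y')
    temp_dates.append(new_date)

df['new_dates'] = temp_dates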
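
The `strptime` call above depends on the `es_ES` locale being installed and on how that platform abbreviates months. A locale-independent sketch is shown below. It assumes the data use the usual Spanish three-letter abbreviations; the `MONTHS` table and the helper `parse_spanish_date` are hypothetical and should be checked against the actual values in `df['date']`:

```python
import re
from datetime import datetime

# Assumed Spanish month abbreviations; verify against the data before relying on this.
MONTHS = {
    "ene": 1, "feb": 2, "mar": 3, "abr": 4, "may": 5, "jun": 6,
    "jul": 7, "ago": 8, "sep": 9, "oct": 10, "nov": 11, "dic": 12,
}

DATE_RE = re.compile(r"([a-zA-Z]+)\s(\d+)\s,\s(\d+)")

def parse_spanish_date(raw):
    """Return a datetime for 'Month Day , Year', or None if it cannot be parsed."""
    match = DATE_RE.search(raw)
    if match is None:
        return None
    month, day, year = match.groups()
    # Only the first three letters are compared, case-insensitively.
    month_number = MONTHS.get(month.lower()[:3])
    if month_number is None:
        return None
    return datetime(int(year), month_number, int(day))

print(parse_spanish_date("Abr 2 , 2024"))
print(parse_spanish_date("Nubank anuncia SHEIN en el Shopping de Nu\nAbr 2 , 2024"))
```

Returning `None` instead of raising makes it easy to count unparsed rows afterwards.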

Just to verify

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 290 entries, 0 to 289
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   link       290 non-null    object        
 1   title      290 non-null    object        
 2   date       290 non-null    object        
 3   summary    290 non-null    object        
 4   paragraph  290 non-null    object        
 5   new_dates  290 non-null    datetime64[ns]
dtypes: datetime64[ns](1), object(5)
memory usage: 13.7+ KB


# TEXTS

In [13]:
paragraphs_na = df[df["paragraph"].isna()]["link"]
print(paragraphs_na)

Series([], Name: link, dtype: object)


In [14]:
df.describe()

,new_dates
count,290
mean,2024-03-09 08:01:39.310344704
min,2022-05-16 00:00:00
25%,2023-09-06 00:00:00
50%,2024-04-06 00:00:00
75%,2024-10-01 18:00:00
max,2025-07-04 00:00:00


In [15]:
df['title_len'] = df['title'].str.len()
df['summary_len'] = df['summary'].str.len()
df['paragraph_len'] = df['paragraph'].str.len()

df[['title_len', 'summary_len', 'paragraph_len']].describe()

,title_len,summary_len,paragraph_len
count,290.0,290.0,290.0
mean,85.603448,155.175862,508.851724
std,25.331668,49.938371,419.586708
min,28.0,1.0,19.0
25%,67.25,119.0,363.75
50%,84.0,154.0,463.5
75%,100.0,192.75,560.0
max,159.0,253.0,5890.0


TODO: add new code here to strip the dates at the start of the paragraphs; in the end they are not needed.

## new_summary

Notice that one of the summaries has a length of 1. In this case, it is a single character left over from the scraping process.

In [16]:
filtered_df = df[df['summary_len'] == 1]
print(filtered_df['summary'])

210    -
Name: summary, dtype: object


A new column called 'new_summary' holds the cleaned summaries.

In [17]:
df['new_summary'] = df['summary']
df.loc[df['summary_len'] == 1, 'new_summary'] = None
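
Filtering on a length of exactly 1 only catches this one placeholder. A more general sketch, assuming that any summary without a single letter or digit is a scraping leftover, could look like this (the sample values are illustrative):

```python
import pandas as pd

# Illustrative values, not rows from the dataset.
summaries = pd.Series(["-", "Damaso se desempeñó como Director", "  ", "..."])

# True when the text contains at least one word character (letter, digit or underscore).
has_text = summaries.str.contains(r"\w")

# Keep real summaries; everything else becomes NaN.
cleaned = summaries.where(has_text)
print(cleaned)
```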

## new_paragraph

One of the paragraphs consists of just a date, with no other information.

In [18]:
filtered_df = df[df['paragraph_len'] == 19]
print(filtered_df['paragraph'])

254    20 de abril de 2023
Name: paragraph, dtype: object


Most of the paragraphs start with the city (or cities) and the date of the article. For example: "Ciudad de México, 2 de julio de 2025".

In [19]:
df['paragraph'].head()

0    São Paulo, 4 de julio de 2025 – Nu, una de las...
1    Ciudad de México, 2 de julio de 2025.- Con un ...
2    São Paulo, julio de 2025 – Nu, una de las mayo...
3    São Paulo, 13 de junio de 2025 – Nubank anunci...
4    São Paulo, 5 de junio de 2025 – Nubank acaba d...
Name: paragraph, dtype: object

Here I look for ways to strip these city + date combinations from the paragraphs. The regex has to cover several variants.

In [20]:
# Cases: 287, 276

def clean_paragraph_format(text):
    # Strip a leading "City, date" dateline (e.g. "São Paulo, 4 de julio de 2025 –") if present.
    match = re.search(r'^.*,\s*(\w*\sde\s\w*\s|\w*\s|\w*\s*\w*|\w*\s\d*\s)(de|del|,)\s\d*(\.|\s*[-:–—])', text)
    if match:
        text = text[match.end():].strip()

    return text
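
For a dateline ending in `.-` (as in "Ciudad de México, 2 de julio de 2025.- ..."), the pattern appears to stop at the period, so a stray `-` can be left at the start of the text. The sketch below reuses the same pattern and then drops leftover separators. The helper name `strip_dateline` and the shortened sample sentences are hypothetical:

```python
import re

# Pattern copied from clean_paragraph_format.
DATELINE = re.compile(
    r'^.*,\s*(\w*\sde\s\w*\s|\w*\s|\w*\s*\w*|\w*\s\d*\s)(de|del|,)\s\d*(\.|\s*[-:–—])'
)

def strip_dateline(text):
    """Drop the leading dateline, then any separator characters left behind."""
    match = DATELINE.search(text)
    if match:
        text = text[match.end():]
        # Remove leftovers such as the "-" that remains after a ".-" ending.
        text = re.sub(r'^[\s.:\-–—]+', '', text)
    return text.strip()

# Shortened, illustrative sentences modelled on the first rows of df['paragraph'].
samples = [
    "São Paulo, 4 de julio de 2025 – Nubank anunció hoy un producto.",
    "Ciudad de México, 2 de julio de 2025.- Con un enfoque local.",
    "São Paulo, julio de 2025 – Nu presentó resultados.",
]
cleaned = [strip_dateline(s) for s in samples]
for line in cleaned:
    print(line)
```

Because `^.*,` is greedy, the match starts from the last comma that can be followed by a date, so long paragraphs with many commas are worth spot-checking.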

In [21]:
# Apply the cleaner to every paragraph.
df['new_paragraph'] = df['paragraph'].apply(clean_paragraph_format)

Check whether the cleaning step removed the original 19-character paragraph.

In [22]:
filtered_df = df[df['new_paragraph'].str.len() == 19]
print(filtered_df['new_paragraph'])

254    20 de abril de 2023
Name: new_paragraph, dtype: object


It was not removed, so I remove it here.

In [23]:
df.loc[df['new_paragraph'].str.len() == 19, 'new_paragraph'] = None
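
Matching on a length of 19 works for this one row but would miss a date-only paragraph of any other length. A pattern-based sketch, assuming such paragraphs look like "20 de abril de 2023" (sample values are illustrative):

```python
import pandas as pd

# Illustrative values; the third one is a date-only paragraph of a different length.
paragraphs = pd.Series([
    "20 de abril de 2023",
    "Nu México, empresa de finanzas digitales",
    "5 de mayo de 2024",
])

# True when the whole paragraph is just "<day> de <month> de <year>".
date_only = paragraphs.str.fullmatch(r"\d{1,2} de \w+ de \d{4}")

# Replace date-only paragraphs with NaN and keep the rest.
cleaned = paragraphs.mask(date_only)
print(cleaned)
```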

In [24]:
df['new_summary_len'] = df['new_summary'].str.len()
df['new_paragraph_len'] = df['new_paragraph'].str.len()

df[['new_summary_len', 'new_paragraph_len']].describe()

,new_summary_len,new_paragraph_len
count,289.0,289.0
mean,155.709343,478.878893
std,49.190234,417.740602
min,41.0,41.0
25%,119.0,338.0
50%,154.0,431.0
75%,193.0,532.0
max,253.0,5849.0


In [25]:
df.loc[df['new_paragraph'].str.len() == 5849, 'new_paragraph']

201    Nu México, empresa de finanzas digitales trans...
Name: new_paragraph, dtype: object

Some paragraphs are very long because the entire article was extracted instead of just the first paragraph. Since most of the articles were extracted properly, I am leaving these long paragraphs as they are for now. To fix this, I can either replace such a paragraph with its summary or refine the code to prevent these cases.
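
A minimal sketch of the first option, replacing over-long paragraphs with their summary. The cutoff `MAX_LEN` is a hypothetical value chosen only for illustration, and the frame below is made up:

```python
import pandas as pd

MAX_LEN = 2000  # hypothetical cutoff; tune it after inspecting the length distribution

# Illustrative frame: the second paragraph stands in for a fully scraped article.
frame = pd.DataFrame({
    "new_summary": ["Short summary A", "Short summary B"],
    "new_paragraph": ["A normal first paragraph.", "x" * 5849],
})

# Flag paragraphs above the cutoff and fall back to the summary for those rows.
too_long = frame["new_paragraph"].str.len() > MAX_LEN
frame.loc[too_long, "new_paragraph"] = frame.loc[too_long, "new_summary"]
print(frame["new_paragraph"].tolist())
```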

In [31]:
df['new_summary'] = df['new_summary'].fillna('')
df['new_paragraph'] = df['new_paragraph'].fillna('')

In [32]:
(df['title'].str.lower() == df['new_summary'].str.lower()).mean()

np.float64(0.0034482758620689655)

In [33]:
df['new_summary_paragraph'] = df.apply(
    lambda row: row['new_summary'].lower() in row['new_paragraph'].lower(), axis=1)

df['new_summary_paragraph'].mean()

np.float64(0.04827586206896552)