# Dataset 3 – Resume Dataset SNEHAAN BHAWAL

This notebook cleans **Dataset 3** for Model 1 (CV → tech role).

## Analysis summary
- The dataset has 2,481 rows after cleaning (2,484 raw) and 24 categories
- Each category has about 100 CVs on average (103.5 in the raw data)
- The roles are very broad, not specific
- It would be useful for training a model that predicts a profile in broad strokes

Example roles:

- public-relations
- hr
- designer
- arts
- teacher

## Data source

- Kaggle: https://www.kaggle.com/datasets/snehaanbhawal/resume-dataset

> **Important:** the data is **NOT** committed to the repository (it is listed in `.gitignore`).  
> Everyone who uses this project must download the data locally.

## Instructions for downloading and placing the dataset

1. Go to the Kaggle link:  
   https://www.kaggle.com/datasets/snehaanbhawal/resume-dataset
2. Download the dataset's CSV file (for example `Resume.csv`).
3. Copy it into the folder:

   `data/model1_cv_role/1.raw/`

4. Rename it to:

   `resumes_dataset3.csv`

   (or adjust the name in the code if you prefer a different one).

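## Expected path in this notebook

This notebook assumes the file is at:

```text
data/model1_cv_role/1.raw/resumes_dataset3.csv
```

If you use a different name or location, change the path passed to `pd.read_csv` in the read cell. Note that the read cell currently points to `../../1.raw/archive-3/Resume/resumes_dataset3.csv`.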
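Because the data is not in the repository, it can help to fail early with a clear message when the CSV is missing. The following is a minimal sketch, not part of the original notebook: `require_file` and `RAW_CSV` are hypothetical names, and the path uses the folder and file name from the download steps.

```python
from pathlib import Path

# Hypothetical constant: folder and file name taken from the download steps.
# Adjust it if your copy lives somewhere else.
RAW_CSV = Path("data/model1_cv_role/1.raw/resumes_dataset3.csv")


def require_file(path: Path) -> Path:
    """Return `path` if it is an existing file, otherwise raise a descriptive error."""
    if not path.is_file():
        raise FileNotFoundError(
            f"Dataset not found at {path}. Download it from Kaggle and place it there."
        )
    return path
```

A call such as `pd.read_csv(require_file(RAW_CSV))` would then stop with an explicit message instead of a bare pandas error.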


In [3]:
import sys
sys.executable, sys.version


('/opt/local/bin/python3.13',
 '3.13.9 (main, Oct 18 2025, 13:09:20) [Clang 17.0.0 (clang-1700.3.19.1)]')

In [1]:
import pandas as pd

In [4]:
df = pd.read_csv('../../1.raw/archive-3/Resume/resumes_dataset3.csv')
df.head()

Unnamed: 0,ID,Resume_str,Resume_html,Category
0,16852973,HR ADMINISTRATOR/MARKETING ASSOCIATE\...,"<div class=""fontsize fontface vmargins hmargin...",HR
1,22323967,"HR SPECIALIST, US HR OPERATIONS ...","<div class=""fontsize fontface vmargins hmargin...",HR
2,33176873,HR DIRECTOR Summary Over 2...,"<div class=""fontsize fontface vmargins hmargin...",HR
3,27018550,HR SPECIALIST Summary Dedica...,"<div class=""fontsize fontface vmargins hmargin...",HR
4,17812897,HR MANAGER Skill Highlights ...,"<div class=""fontsize fontface vmargins hmargin...",HR


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2484 entries, 0 to 2483
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   ID           2484 non-null   int64 
 1   Resume_str   2484 non-null   object
 2   Resume_html  2484 non-null   object
 3   Category     2484 non-null   object
dtypes: int64(1), object(3)
memory usage: 77.8+ KB


In [6]:
df.describe()

Unnamed: 0,ID
count,2484.0
mean,31826160.0
std,21457350.0
min,3547447.0
25%,17544300.0
50%,25210310.0
75%,36114440.0
max,99806120.0


In [7]:
df['Category'].nunique()

24

In [32]:
df['Category'].value_counts()

Category
INFORMATION-TECHNOLOGY    120
BUSINESS-DEVELOPMENT      120
ADVOCATE                  118
CHEF                      118
ENGINEERING               118
ACCOUNTANT                118
FINANCE                   118
FITNESS                   117
AVIATION                  117
SALES                     116
BANKING                   115
HEALTHCARE                115
CONSULTANT                115
CONSTRUCTION              112
PUBLIC-RELATIONS          111
HR                        110
DESIGNER                  107
ARTS                      103
TEACHER                   102
APPAREL                    97
DIGITAL-MEDIA              96
AGRICULTURE                63
AUTOMOBILE                 36
BPO                        22
Name: count, dtype: int64

In [35]:
val_cat = df['Category'].value_counts()
media_cat = val_cat.mean()
print(media_cat)

103.5


In [13]:
df['Category'].value_counts(normalize=True) * 100

Category
INFORMATION-TECHNOLOGY    4.830918
BUSINESS-DEVELOPMENT      4.830918
ADVOCATE                  4.750403
CHEF                      4.750403
ENGINEERING               4.750403
ACCOUNTANT                4.750403
FINANCE                   4.750403
FITNESS                   4.710145
AVIATION                  4.710145
SALES                     4.669887
BANKING                   4.629630
HEALTHCARE                4.629630
CONSULTANT                4.629630
CONSTRUCTION              4.508857
PUBLIC-RELATIONS          4.468599
HR                        4.428341
DESIGNER                  4.307568
ARTS                      4.146538
TEACHER                   4.106280
APPAREL                   3.904992
DIGITAL-MEDIA             3.864734
AGRICULTURE               2.536232
AUTOMOBILE                1.449275
BPO                       0.885668
Name: proportion, dtype: float64
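
The counts and percentages above show that a few categories are much smaller than the rest. One simple way to quantify this is the ratio between the largest and the smallest class. The sketch below hard-codes a handful of counts copied from the `value_counts()` output instead of reading the CSV; `imbalance_ratio` is a hypothetical helper name.

```python
import pandas as pd

# A few counts copied from the value_counts() output above (not the full table).
counts = pd.Series(
    {
        "INFORMATION-TECHNOLOGY": 120,
        "TEACHER": 102,
        "AGRICULTURE": 63,
        "AUTOMOBILE": 36,
        "BPO": 22,
    },
    name="count",
)


def imbalance_ratio(class_counts: pd.Series) -> float:
    """Largest class size divided by smallest class size."""
    return float(class_counts.max() / class_counts.min())


ratio = imbalance_ratio(counts)
print(f"largest/smallest = {ratio:.2f}")
```

The full table has the same extremes (120 and 22), so the ratio is the same when computed on all 24 categories.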

In [14]:
df.isna().sum()

ID             0
Resume_str     0
Resume_html    0
Category       0
dtype: int64

In [16]:
df['resume_len'] = df['Resume_str'].str.len()
df['resume_len'].describe()

count     2484.000000
mean      6295.308776
std       2769.251458
min         21.000000
25%       5160.000000
50%       5886.500000
75%       7227.250000
max      38842.000000
Name: resume_len, dtype: float64

In [18]:
df.nsmallest(5, 'resume_len')[['Resume_str', 'Category', 'resume_len']]

Unnamed: 0,Resume_str,Category,resume_len
656,,BUSINESS-DEVELOPMENT,21
1931,CONSTRUCTION WORKER Highlig...,CONSTRUCTION,881
130,GRAPHIC DESIGNER Expe...,DESIGNER,985
1102,SALES ASSOCIATE Summary My g...,SALES,995
1049,SALES ASSOCIATE Summary Moti...,SALES,1033
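
Row 656 displays as blank yet has a length of 21, which suggests the text may consist only of whitespace. That is an assumption, since the notebook does not show the raw characters. The sketch below uses toy data to show how comparing the raw length with the stripped length would reveal such rows.

```python
import pandas as pd

# Toy data: the second entry is whitespace only. This is an assumption about
# what a blank-looking resume might contain, not the actual content of row 656.
toy = pd.DataFrame({"Resume_str": ["SALES ASSOCIATE Summary ...", " " * 21]})

# Raw length counts every character, including spaces.
toy["resume_len"] = toy["Resume_str"].str.len()
# Stripped length ignores leading and trailing whitespace.
toy["stripped_len"] = toy["Resume_str"].str.strip().str.len()

print(toy[["resume_len", "stripped_len"]])
```

A row whose `stripped_len` is 0 has no usable text regardless of its raw length.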


In [20]:
for cat in df['Category'].unique():
    ejemplo = df[df['Category'] == cat]['Resume_str'].iloc[0]
    print("="*80)
    print(cat)
    print(ejemplo[:300])


HR
         HR ADMINISTRATOR/MARKETING ASSOCIATE

HR ADMINISTRATOR       Summary     Dedicated Customer Service Manager with 15+ years of experience in Hospitality and Customer Service Management.   Respected builder and leader of customer-focused teams; strives to instill a shared, enthusiastic commit
DESIGNER
         DESIGNER       Summary     Designer with more than 15 years in product design, manufacturing, exhibit design and visual merchandising,  with comprehensive management and logistics experience who thrives in dynamically changing environments.             Highlights          Design processing 
INFORMATION-TECHNOLOGY
         INFORMATION TECHNOLOGY         Summary     Dedicated  Information Assurance Professional  well-versed in analyzing and mitigating risk and finding cost-effective solutions. Excels at boosting performance and productivity by establishing realistic goals and enforcing deadlines.  Versatile IT
TEACHER
         TEACHER         Professional Summary     Mast

In [27]:
import pandas as pd

# Start from the df that is already loaded
df_clean = df.copy()

# 1) CV length in characters
df_clean['resume_len'] = df_clean['Resume_str'].str.len()

# 2) Drop exact duplicates of the CV text
df_clean = df_clean.drop_duplicates(subset=['Resume_str'])

# 3) Filter out CVs that are too short (< 500 characters)
MIN_LEN = 500
df_clean = df_clean[df_clean['resume_len'] >= MIN_LEN].copy()

print("Rows after cleaning:", df_clean.shape[0])

# 4) Rename columns to the standard names and add metadata
df_clean = df_clean.rename(columns={
    'Resume_str': 'cv_text',
    'Category': 'role_raw'
})

# Lowercase role_raw
df_clean['role_raw'] = df_clean['role_raw'].str.lower()

df_clean['role_label_final'] = df_clean['role_raw']  # 1:1 for now
# NOTE: the label reads 'dataset1_avishek' although this notebook processes
# Dataset 3 (Snehaan Bhawal); it may be a leftover from another notebook.
df_clean['source_dataset'] = 'dataset1_avishek'

# Reorder columns
df_clean = df_clean[[
    'cv_text',
    'role_raw',
    'role_label_final',
    'source_dataset',
    'resume_len'
]]

df_clean.head()


Rows after cleaning: 2481


Unnamed: 0,cv_text,role_raw,role_label_final,source_dataset,resume_len
0,HR ADMINISTRATOR/MARKETING ASSOCIATE\...,hr,hr,dataset1_avishek,5442
1,"HR SPECIALIST, US HR OPERATIONS ...",hr,hr,dataset1_avishek,5572
2,HR DIRECTOR Summary Over 2...,hr,hr,dataset1_avishek,7720
3,HR SPECIALIST Summary Dedica...,hr,hr,dataset1_avishek,2855
4,HR MANAGER Skill Highlights ...,hr,hr,dataset1_avishek,9172
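
The deduplication and minimum-length filter can also be wrapped in a small function, which makes the two rules easy to try on toy data. This is a sketch under assumptions: `clean_resumes` is a hypothetical name, and unlike the cell above it resets the index of the result.

```python
import pandas as pd


def clean_resumes(df: pd.DataFrame, min_len: int = 500) -> pd.DataFrame:
    """Drop exact duplicate texts, then texts shorter than `min_len` characters."""
    out = df.drop_duplicates(subset=["Resume_str"]).copy()
    out["resume_len"] = out["Resume_str"].str.len()
    # Keep only texts that reach the minimum length; reset the index for convenience.
    return out[out["resume_len"] >= min_len].reset_index(drop=True)


# Toy data: one exact duplicate and one text that is too short.
toy = pd.DataFrame(
    {
        "Resume_str": ["A" * 600, "A" * 600, "B" * 20, "C" * 700],
        "Category": ["HR", "HR", "SALES", "CHEF"],
    }
)

cleaned = clean_resumes(toy)
print(len(toy), "->", len(cleaned))
```

On the toy frame, the duplicate `HR` text and the 20-character `SALES` text are removed, leaving two rows.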


In [28]:
output_path = "resumes_dataset1_clean.csv"  # saved in the 2.cleaning folder
df_clean.to_csv(output_path, index=False)
print("Saved:", output_path)


Saved: resumes_dataset1_clean.csv


In [29]:
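# Only the last expression in a cell is displayed, so the value_counts()
# result on the next line is computed but not shown.
df_clean['role_label_final'].value_counts()
df_clean['resume_len'].describe()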


count     2481.000000
mean      6298.708585
std       2767.887922
min        881.000000
25%       5160.000000
50%       5888.000000
75%       7228.000000
max      38842.000000
Name: resume_len, dtype: float64

In [30]:
print("Original:", df.shape[0], "rows")
print("Clean:", df_clean.shape[0], "rows")
print("Removed:", df.shape[0] - df_clean.shape[0])

print("\nDistribution by role (clean dataset):")
print(df_clean['role_label_final'].value_counts())


Original: 2484 rows
Clean: 2481 rows
Removed: 3

Distribution by role (clean dataset):
role_label_final
information-technology    120
business-development      119
advocate                  118
chef                      118
engineering               118
accountant                118
finance                   117
fitness                   117
sales                     116
aviation                  116
banking                   115
healthcare                115
consultant                115
construction              112
public-relations          111
hr                        110
designer                  107
arts                      103
teacher                   102
apparel                    97
digital-media              96
agriculture                63
automobile                 36
bpo                        22
Name: count, dtype: int64
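
To see which categories lost rows during cleaning, the raw and clean counts can be subtracted. The sketch below hard-codes four pairs of counts copied from the two `value_counts()` outputs in this notebook, with the raw names lowercased to match; it does not read the CSV.

```python
import pandas as pd

# Counts copied from the raw and clean value_counts() outputs (subset only).
# Raw category names are lowercased here so both indexes align.
raw = pd.Series(
    {"business-development": 120, "finance": 118, "aviation": 117, "hr": 110}
)
clean = pd.Series(
    {"business-development": 119, "finance": 117, "aviation": 116, "hr": 110}
)

# Subtraction aligns on the index; keep only categories that lost rows.
removed = raw - clean
removed = removed[removed > 0]
print(removed)
```

These three categories lose one row each, which accounts for the 3 removed rows reported above.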
