# Dataset 2 – Resume Dataset (Jithin Jagadeesh)

This notebook cleans **Dataset 2** for Model 1 (CV → tech role).

## Data source

- Kaggle: https://www.kaggle.com/datasets/jithinjagadeesh/resume-dataset

> **Important:** the data is **NOT** committed to the repository (it is listed in `.gitignore`).  
> Everyone who uses this project must download the data locally.

## How to download and place the dataset

1. Go to the Kaggle page:  
   https://www.kaggle.com/datasets/jithinjagadeesh/resume-dataset
2. Download the CSV file (for example `resume_dataset.csv` or similar).
3. Copy it into the folder:

   `data/model1_cv_role/1.raw/`

4. Rename it to:

   `resumes_dataset2.csv`

   (or change the file name in the code if you prefer a different one).

## Path expected by this notebook

This notebook assumes the file is located at:

```text
data/model1_cv_role/1.raw/resumes_dataset2.csv
```

If you use a different name or location, change `file_path` in the cell that reads the file.
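Because the CSV is not in the repository, a missing or misnamed file is the most likely failure when running the notebook for the first time. The following is a minimal sketch of a check that could run before `pd.read_csv`; the helper name `resolve_dataset_path` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def resolve_dataset_path(file_path: str) -> Path:
    """Return the absolute path to the CSV, or raise with a download hint.

    Hypothetical helper: the notebook itself reads `file_path` directly.
    """
    path = Path(file_path).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(
            f"Dataset not found at {path}. Download it from Kaggle and place it "
            "in data/model1_cv_role/1.raw/ as resumes_dataset2.csv."
        )
    return path


if __name__ == "__main__":
    try:
        # Relative path used by the notebook's reading cell
        print(resolve_dataset_path("../1.raw/resumes_dataset2.csv"))
    except FileNotFoundError as err:
        print(err)
```

The returned `Path` can be passed straight to `pd.read_csv`, so the reading cell stays unchanged apart from the check.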

In [1]:
import pandas as pd

In [3]:
file_path = "../1.raw/resumes_dataset2.csv"  # adjust to the actual file name
df2 = pd.read_csv(file_path)

In [4]:
print("Shape:", df2.shape)
df2.head()

Shape: (400, 2)


Unnamed: 0,Category,Resume
0,Frontend Developer,"As a seasoned Frontend Developer, I have a pro..."
1,Backend Developer,With a solid background in Backend Development...
2,Python Developer,"As a Python Developer, I leverage my expertise..."
3,Data Scientist,"With a background in Data Science, I possess a..."
4,Frontend Developer,Experienced Frontend Developer with a passion ...


In [5]:
df2.columns.tolist()

['Category', 'Resume']

In [6]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  400 non-null    object
 1   Resume    400 non-null    object
dtypes: object(2)
memory usage: 6.4+ KB


In [7]:
# number of distinct categories
df2['Category'].nunique()

8

In [8]:
# absolute distribution (row counts per category)
df2['Category'].value_counts()

Category
Backend Developer                     57
Cloud Engineer                        56
Frontend Developer                    54
Data Scientist                        53
Full Stack Developer                  47
Python Developer                      45
Mobile App Developer (iOS/Android)    45
Machine Learning Engineer             43
Name: count, dtype: int64

In [None]:
df2['Category'].value_counts(normalize=True) * 100

Category
Backend Developer                     14.25
Cloud Engineer                        14.00
Frontend Developer                    13.50
Data Scientist                        13.25
Full Stack Developer                  11.75
Python Developer                      11.25
Mobile App Developer (iOS/Android)    11.25
Machine Learning Engineer             10.75
Name: proportion, dtype: float64

In [10]:
df2['resume_len'] = df2['Resume'].str.len()
df2['resume_len'].describe()

count     400.000000
mean      823.520000
std       107.043427
min       671.000000
25%       754.000000
50%       801.000000
75%       871.000000
max      1806.000000
Name: resume_len, dtype: float64

In [11]:
# The 5 shortest resumes
df2.nsmallest(5, 'resume_len')[['Category', 'resume_len', 'Resume']]

Unnamed: 0,Category,resume_len,Resume
339,Mobile App Developer (iOS/Android),671,Experienced Mobile App Developer specializing ...
347,Mobile App Developer (iOS/Android),671,Experienced Mobile App Developer specializing ...
342,Mobile App Developer (iOS/Android),679,Skilled Mobile App Developer specializing in b...
304,Frontend Developer,686,Creative Frontend Developer with a focus on cr...
308,Frontend Developer,686,Creative Frontend Developer with a focus on cr...


In [12]:
# The 5 longest resumes
df2.nlargest(5, 'resume_len')[['Category', 'resume_len', 'Resume']]

Unnamed: 0,Category,resume_len,Resume
27,Full Stack Developer,1806,Dynamic Full Stack Developer with a strong bac...
28,Full Stack Developer,1670,Motivated Full Stack Developer with a strong f...
3,Data Scientist,1108,"With a background in Data Science, I possess a..."
10,Data Scientist,1080,Experienced Data Scientist with a strong backg...
37,Mobile App Developer (iOS/Android),1045,Creative Mobile App Developer with a focus on ...


In [13]:
(df2['resume_len'] < 200).sum(), (df2['resume_len'] < 500).sum()

(0, 0)

In [14]:
df2.duplicated(subset=['Resume']).sum()


212
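With 212 of the 400 rows repeating an earlier resume text, it may be worth checking whether any repeated text appears under more than one `Category`, since `drop_duplicates` keeps only the first occurrence and would silently discard the other label. The sketch below illustrates one way to do that on a small made-up frame; the function name `conflicting_duplicates` and the toy data are assumptions, and nothing here states what the real dataset contains.

```python
import pandas as pd


def conflicting_duplicates(df, text_col="Resume", label_col="Category"):
    """Return rows whose text appears under more than one label."""
    # Number of distinct labels attached to each distinct text
    labels_per_text = df.groupby(text_col)[label_col].nunique()
    conflicting_texts = labels_per_text[labels_per_text > 1].index
    return df[df[text_col].isin(conflicting_texts)].sort_values(text_col)


# Toy data (hypothetical), not taken from the Kaggle file
toy = pd.DataFrame({
    "Category": ["Backend Developer", "Python Developer",
                 "Data Scientist", "Data Scientist"],
    "Resume": ["builds APIs", "builds APIs",
               "trains models", "trains models"],
})

# Only the "builds APIs" rows are returned: same text, two different labels
print(conflicting_duplicates(toy))
```

On the notebook's data the call would be `conflicting_duplicates(df2)`; an empty result would mean that dropping duplicates by `Resume` alone loses no label information.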

In [15]:
for cat in df2['Category'].unique():
    ejemplo = df2[df2['Category'] == cat]['Resume'].iloc[0]
    print("="*80)
    print(cat)
    print(ejemplo[:300])


Frontend Developer
As a seasoned Frontend Developer, I have a proven track record of crafting stunning and responsive user interfaces for web applications. With over 5 years of experience, I am proficient in HTML, CSS, and JavaScript, and I have a deep understanding of modern frontend frameworks such as React, Vue.js,
Backend Developer
With a solid background in Backend Development, I bring over 7 years of experience in building robust and scalable server-side applications. My expertise lies in designing and implementing RESTful APIs using frameworks such as Node.js and Express.js, coupled with databases like MongoDB, PostgreSQL, 
Python Developer
As a Python Developer, I leverage my expertise in the Python programming language to build scalable and efficient backend solutions for diverse applications. With over 4 years of experience, I am proficient in frameworks such as Django and Flask, using them to develop RESTful APIs, web services, and
Data Scientist
With a background in Data Sc

In [16]:
# Start from the already-loaded df2
df2_clean = df2.copy()

df2_clean['resume_len'] = df2_clean['Resume'].str.len()

print("Duplicates before cleaning:", df2_clean.duplicated(subset=['Resume']).sum())
df2_clean = df2_clean.drop_duplicates(subset=['Resume'])
print("Rows after dropping duplicates:", df2_clean.shape[0])

Duplicates before cleaning: 212
Rows after dropping duplicates: 188


In [17]:
df2_clean = df2_clean.rename(columns={
    'Resume': 'cv_text',
    'Category': 'role_raw'
})

# For now, use the original category as the final label
df2_clean['role_label_final'] = df2_clean['role_raw']

# Source identifier
df2_clean['source_dataset'] = 'dataset2_jithin'

# Reorder columns
df2_clean = df2_clean[[
    'cv_text',
    'role_raw',
    'role_label_final',
    'source_dataset',
    'resume_len'
]]

df2_clean.head()


Unnamed: 0,cv_text,role_raw,role_label_final,source_dataset,resume_len
0,"As a seasoned Frontend Developer, I have a pro...",Frontend Developer,Frontend Developer,dataset2_jithin,994
1,With a solid background in Backend Development...,Backend Developer,Backend Developer,dataset2_jithin,916
2,"As a Python Developer, I leverage my expertise...",Python Developer,Python Developer,dataset2_jithin,927
3,"With a background in Data Science, I possess a...",Data Scientist,Data Scientist,dataset2_jithin,1108
4,Experienced Frontend Developer with a passion ...,Frontend Developer,Frontend Developer,dataset2_jithin,965


In [18]:
output_path = "resumes_dataset2_clean.csv"
df2_clean.to_csv(output_path, index=False)
print("Saved:", output_path)

Saved: resumes_dataset2_clean.csv
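Before the clean file is combined with other sources, a quick sanity check on its schema can catch accidental changes. This is a minimal sketch, assuming the five-column layout built in the previous cell; `check_clean_frame` and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Column order produced by the cleaning cell of this notebook
EXPECTED_COLUMNS = [
    "cv_text", "role_raw", "role_label_final", "source_dataset", "resume_len",
]


def check_clean_frame(df: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means all checks passed."""
    if list(df.columns) != EXPECTED_COLUMNS:
        return [f"unexpected columns: {list(df.columns)}"]
    problems = []
    if df["cv_text"].duplicated().any():
        problems.append("duplicate cv_text rows")
    if df[["cv_text", "role_label_final"]].isna().any().any():
        problems.append("missing text or label")
    return problems


# Toy frame (hypothetical) with the expected layout
toy = pd.DataFrame({
    "cv_text": ["builds APIs", "trains models"],
    "role_raw": ["Backend Developer", "Data Scientist"],
    "role_label_final": ["Backend Developer", "Data Scientist"],
    "source_dataset": ["dataset2_jithin", "dataset2_jithin"],
    "resume_len": [11, 13],
})
print(check_clean_frame(toy))
```

In the notebook this could be called as `check_clean_frame(pd.read_csv(output_path))` to verify the file as it was actually written to disk.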


In [19]:
print("Original:", df2.shape[0], "rows")
print("Clean:", df2_clean.shape[0], "rows")

print("\nDistribution by role (clean dataset2):")
print(df2_clean['role_label_final'].value_counts())

print("\nLength statistics:")
print(df2_clean['resume_len'].describe())


Original: 400 rows
Clean: 188 rows

Distribution by role (clean dataset2):
role_label_final
Full Stack Developer                  30
Data Scientist                        29
Frontend Developer                    24
Mobile App Developer (iOS/Android)    23
Cloud Engineer                        23
Python Developer                      22
Backend Developer                     19
Machine Learning Engineer             18
Name: count, dtype: int64

Length statistics:
count     188.000000
mean      834.159574
std       130.570063
min       671.000000
25%       755.000000
50%       801.500000
75%       892.750000
max      1806.000000
Name: resume_len, dtype: float64
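Deduplication does not affect every role equally: the per-role counts before and after cleaning differ in both size and ranking. A side-by-side retention table makes that shift explicit. The sketch below is one possible way to build it; `retention_by_label` and the toy series are assumptions introduced for illustration.

```python
import pandas as pd


def retention_by_label(before: pd.Series, after: pd.Series) -> pd.DataFrame:
    """Count each label before and after cleaning and compute the kept share."""
    counts = pd.DataFrame({
        "before": before.value_counts(),
        "after": after.value_counts(),
    }).fillna(0).astype(int)  # labels absent after cleaning count as 0
    counts["kept_pct"] = (counts["after"] / counts["before"] * 100).round(1)
    return counts.sort_values("kept_pct")


# Toy labels (hypothetical): "A" loses two of three rows, "B" keeps both
before = pd.Series(["A", "A", "A", "B", "B"])
after = pd.Series(["A", "B", "B"])
print(retention_by_label(before, after))
```

With the notebook's frames the call would be `retention_by_label(df2['Category'], df2_clean['role_label_final'])`, which lists first the roles that lost the largest share of rows.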
