# Dirty Data Wrangling Challenge

### Enshitificator 1.0

The purpose of this notebook is to dirty the *"Hollywood 2025 Media Hype & Sentiment"* dataset.

Source: https://www.kaggle.com/datasets/kanchana1990/hollywood-2025-media-hype-and-sentiment

The *enshitting* process consists of the following 10 parts:

1. Missing data: Remove or leave blank some values randomly.

2. Duplicated rows: Add exact or partial duplicates.

3. Outliers: Insert extreme values in numeric columns.

4. Format inconsistencies: Change date formats, numeric formats, or units.

5. Typographical errors: Introduce spelling mistakes in categorical columns.

6. Extra categories: Add unusual values in categorical columns to simulate errors.

7. Incorrect data types: Store numbers as strings or vice versa.

8. Non-default file encoding: Save the file in an encoding other than UTF-8.

9. Incorrect headers

10. Extra punctuation or currency symbols (e.g. `1000€`)

Each of these parts introduces a new kind of error for the dataset's analyst to deal with.

In [152]:
import pandas as pd
import numpy as np
import random

dirty_df = pd.read_csv("source_clean_dataset.csv")
display(dirty_df.head())

Unnamed: 0,Movie_Tag,Title,Source,Publish_Date,Director,Key_Cast,Genre,Studio,Link,Description_Snippet
0,Avatar_Fire_and_Ash,All hail Avatar! How event movies are trying t...,The Guardian,2025-12-09 16:17:00,James Cameron,"Sam Worthington, Zoe Saldaña, Sigourney Weaver...",Sci-Fi/Fantasy,20th Century Studios,https://news.google.com/rss/articles/CBMiyAFBV...,All hail Avatar! How event movies are trying t...
1,Avatar_Fire_and_Ash,"THE WEEKEND WARRIOR December 12, 2025 (Video E...",Substack,2025-12-09 14:31:35,James Cameron,"Sam Worthington, Zoe Saldaña, Sigourney Weaver...",Sci-Fi/Fantasy,20th Century Studios,https://news.google.com/rss/articles/CBMifkFVX...,"THE WEEKEND WARRIOR December 12, 2025 (Video E..."
2,Avatar_Fire_and_Ash,Avatar: Fire And Ash Is Among The Most Expensi...,SlashFilm,2025-12-09 18:20:00,James Cameron,"Sam Worthington, Zoe Saldaña, Sigourney Weaver...",Sci-Fi/Fantasy,20th Century Studios,https://news.google.com/rss/articles/CBMikgFBV...,Avatar: Fire And Ash Is Among The Most Expensi...
3,Avatar_Fire_and_Ash,6-Week Box Office Tracking & Forecasts: AVATAR...,Box Office Theory,2025-12-05 17:25:52,James Cameron,"Sam Worthington, Zoe Saldaña, Sigourney Weaver...",Sci-Fi/Fantasy,20th Century Studios,https://news.google.com/rss/articles/CBMijgJBV...,6-Week Box Office Tracking & Forecasts: AVATAR...
4,Avatar_Fire_and_Ash,Why ‘Avatar: Fire and Ash’ Received Box Office...,Us Weekly,2025-12-08 20:54:04,James Cameron,"Sam Worthington, Zoe Saldaña, Sigourney Weaver...",Sci-Fi/Fantasy,20th Century Studios,https://news.google.com/rss/articles/CBMixAFBV...,Why ‘Avatar: Fire and Ash’ Received Box Office...


### 1. Missing data

We randomly select 5% of the rows and set their `Studio` value to `None`.

In [153]:
dirty_df.loc[
    dirty_df.sample(frac=0.05, random_state=42).index,
    "Studio"
] = None

# info() prints its report directly and returns None, so it is not wrapped in display()
dirty_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Movie_Tag            500 non-null    object
 1   Title                500 non-null    object
 2   Source               500 non-null    object
 3   Publish_Date         500 non-null    object
 4   Director             500 non-null    object
 5   Key_Cast             500 non-null    object
 6   Genre                500 non-null    object
 7   Studio               475 non-null    object
 8   Link                 500 non-null    object
 9   Description_Snippet  500 non-null    object
dtypes: object(10)
memory usage: 39.2+ KB

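
The cell above blanks out `Studio` in 5% of the rows. To scatter missing values across individual cells in every column instead, a boolean mask is one option. A minimal sketch on a toy frame; the `toy` data and the 5% rate are illustrative assumptions, not part of the notebook's pipeline:

```python
import numpy as np
import pandas as pd

# Toy stand-in for dirty_df (hypothetical values)
toy = pd.DataFrame({
    "Studio": ["A", "B", "C", "D"] * 25,
    "Genre": ["Horror", "Drama", "Sci-Fi", "Action"] * 25,
})

rng = np.random.default_rng(42)
# True marks a cell to blank out; each cell is hit with probability 0.05
mask = rng.random(toy.shape) < 0.05

# DataFrame.mask replaces the cells where the condition is True with NaN
toy_missing = toy.mask(mask)
print(toy_missing.isna().sum())
```

Because each cell is drawn independently, the share of missing cells is only approximately 5%.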

### 3. Duplicated rows

We duplicate some complete rows.

In [154]:
duplicates = dirty_df.sample(10)
dirty_df = pd.concat([dirty_df, duplicates], ignore_index=True)

display(dirty_df[dirty_df.duplicated()].head())

Unnamed: 0,Movie_Tag,Title,Source,Publish_Date,Director,Key_Cast,Genre,Studio,Link,Description_Snippet
500,Nosferatu,‘Nosferatu’ review: Dir. Robert Eggers (2024) ...,The Hollywood News,2024-12-17 08:00:00,Robert Eggers,"Bill Skarsgård, Lily-Rose Depp, Nicholas Hoult",Gothic Horror,,https://news.google.com/rss/articles/CBMihwFBV...,‘Nosferatu’ review: Dir. Robert Eggers (2024)&...
501,Avatar_Fire_and_Ash,AVATAR: FIRE AND ASH First Reactions Describe ...,ComicBookMovie.com,2025-12-02 13:12:00,James Cameron,"Sam Worthington, Zoe Saldaña, Sigourney Weaver...",Sci-Fi/Fantasy,,https://news.google.com/rss/articles/CBMi5AFBV...,AVATAR: FIRE AND ASH First Reactions Describe ...
502,Nosferatu,Nosferatu: Old-World Devils and New-World Vict...,Crisis Magazine,2025-01-13 08:00:00,Robert Eggers,"Bill Skarsgård, Lily-Rose Depp, Nicholas Hoult",Gothic Horror,,https://news.google.com/rss/articles/CBMiiwFBV...,Nosferatu: Old-World Devils and New-World Vict...
503,FNAF_2,'Weapons' ending breakdown: Unpacking those tw...,Entertainment Weekly,2025-08-08 07:00:00,Emma Tammi,"Josh Hutcherson, Matthew Lillard, Elizabeth La...",Horror,,https://news.google.com/rss/articles/CBMiiAFBV...,'Weapons' ending breakdown: Unpacking those tw...
504,FNAF_2,Movie Review: 'Five Nights At Freddy’s 2' - Dr...,DrydenWire.com,2025-12-08 16:24:14,Emma Tammi,"Josh Hutcherson, Matthew Lillard, Elizabeth La...",Horror,,https://news.google.com/rss/articles/CBMidEFVX...,Movie Review: 'Five Nights At Freddy’s 2'&nbsp...
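
By default `duplicated()` flags only the second and later occurrences, which is why the table above lists just the appended rows. A small sketch, on hypothetical toy data, of how `keep=False` shows the originals together with their copies:

```python
import pandas as pd

# Toy stand-in for dirty_df (hypothetical values)
toy = pd.DataFrame({
    "Movie_Tag": ["Nosferatu", "FNAF_2", "Mickey_17"],
    "Source": ["JoBlo", "Gizmodo", "Substack"],
})

# Append an exact copy of the first row, as the cell above does with a sample
toy = pd.concat([toy, toy.iloc[[0]]], ignore_index=True)

later_copies = toy[toy.duplicated()]            # only the appended row
all_involved = toy[toy.duplicated(keep=False)]  # original plus its copy
print(len(later_copies), len(all_involved))
```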


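### 4. Partial duplicates

We copy a few rows and overwrite their first column, so the copies match an existing row in everything except that one value.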


In [155]:
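# Copy 5 random rows and overwrite their first column with the first row's value
partial_duplicates = dirty_df.sample(5).copy()
if len(dirty_df.columns) > 1:
    partial_duplicates.iloc[:, 0] = dirty_df.iloc[0, 0]
dirty_df = pd.concat([dirty_df, partial_duplicates], ignore_index=True)

# duplicated() flags only rows that match in every column
display(dirty_df[dirty_df.duplicated()].head())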

Unnamed: 0,Movie_Tag,Title,Source,Publish_Date,Director,Key_Cast,Genre,Studio,Link,Description_Snippet
500,Nosferatu,‘Nosferatu’ review: Dir. Robert Eggers (2024) ...,The Hollywood News,2024-12-17 08:00:00,Robert Eggers,"Bill Skarsgård, Lily-Rose Depp, Nicholas Hoult",Gothic Horror,,https://news.google.com/rss/articles/CBMihwFBV...,‘Nosferatu’ review: Dir. Robert Eggers (2024)&...
501,Avatar_Fire_and_Ash,AVATAR: FIRE AND ASH First Reactions Describe ...,ComicBookMovie.com,2025-12-02 13:12:00,James Cameron,"Sam Worthington, Zoe Saldaña, Sigourney Weaver...",Sci-Fi/Fantasy,,https://news.google.com/rss/articles/CBMi5AFBV...,AVATAR: FIRE AND ASH First Reactions Describe ...
502,Nosferatu,Nosferatu: Old-World Devils and New-World Vict...,Crisis Magazine,2025-01-13 08:00:00,Robert Eggers,"Bill Skarsgård, Lily-Rose Depp, Nicholas Hoult",Gothic Horror,,https://news.google.com/rss/articles/CBMiiwFBV...,Nosferatu: Old-World Devils and New-World Vict...
503,FNAF_2,'Weapons' ending breakdown: Unpacking those tw...,Entertainment Weekly,2025-08-08 07:00:00,Emma Tammi,"Josh Hutcherson, Matthew Lillard, Elizabeth La...",Horror,,https://news.google.com/rss/articles/CBMiiAFBV...,'Weapons' ending breakdown: Unpacking those tw...
504,FNAF_2,Movie Review: 'Five Nights At Freddy’s 2' - Dr...,DrydenWire.com,2025-12-08 16:24:14,Emma Tammi,"Josh Hutcherson, Matthew Lillard, Elizabeth La...",Horror,,https://news.google.com/rss/articles/CBMidEFVX...,Movie Review: 'Five Nights At Freddy’s 2'&nbsp...
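
A plain `duplicated()` call compares every column, so a row that differs in a single value is not flagged. A hedged sketch of how restricting the comparison with `subset` makes partial duplicates visible; the toy frame and the choice of `Title` and `Link` as comparison columns are assumptions for illustration:

```python
import pandas as pd

# Toy stand-in for dirty_df (hypothetical values)
toy = pd.DataFrame({
    "Movie_Tag": ["Avatar_Fire_and_Ash", "Nosferatu", "Mickey_17"],
    "Title": ["Title A", "Title B", "Title C"],
    "Link": ["link-a", "link-b", "link-c"],
})

# Build a partial duplicate: copy row 1 but overwrite its first column
partial = toy.iloc[[1]].copy()
partial.iloc[:, 0] = toy.iloc[0, 0]
toy = pd.concat([toy, partial], ignore_index=True)

exact = toy.duplicated().sum()
# Compare on every column except the one that was overwritten
partial_hits = toy.duplicated(subset=["Title", "Link"], keep=False)
print(exact, partial_hits.tolist())
```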


### 5. Outliers

We pick random but reproducible indices, then insert a couple of dates far outside the plausible range.

In [156]:
np.random.seed(42)

outlier_indices = np.random.choice(
    dirty_df.index,
    size=2,
    replace=False
)
 
dirty_df.loc[outlier_indices[0], "Publish_Date"] = pd.Timestamp("1800-01-01")
dirty_df.loc[outlier_indices[1], "Publish_Date"] = pd.Timestamp("2200-12-31")
show = dirty_df.loc[outlier_indices[1]], dirty_df.loc[outlier_indices[0]]
display(show)


(Movie_Tag                                                      Mickey_17
 Title                  Mickey 17 Review: Uneven but undeniably entert...
 Source                                                             JoBlo
 Publish_Date                                         2200-12-31 00:00:00
 Director                                                    Bong Joon-ho
 Key_Cast                     Robert Pattinson, Steven Yeun, Mark Ruffalo
 Genre                                                       Sci-Fi/Drama
 Studio                                                      Warner Bros.
 Link                   https://news.google.com/rss/articles/CBMiUEFVX...
 Description_Snippet    Mickey 17 Review: Uneven but undeniably entert...
 Name: 499, dtype: object,
 Movie_Tag                                                      Nosferatu
 Title                  ‘Nosferatu’ Review: Excellent Cinematography A...
 Source                                              Bounding Into Comics
 Publish_Da
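
To confirm that the injected dates stand out, one option is to parse the column and compare it against a plausibility window. A minimal sketch on a toy series; the window bounds are a hypothetical choice, not something the notebook defines:

```python
import pandas as pd

# Toy stand-in for the Publish_Date column, including the two injected extremes
dates = pd.Series([
    "2025-12-09 16:17:00",
    "1800-01-01 00:00:00",
    "2200-12-31 00:00:00",
    "2025-03-07 08:00:00",
])

parsed = pd.to_datetime(dates, errors="coerce")

# Hypothetical plausibility window for publish dates
lower, upper = pd.Timestamp("2020-01-01"), pd.Timestamp("2030-01-01")
out_of_range = parsed[(parsed < lower) | (parsed > upper)]
print(out_of_range)
```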

### 6. Format inconsistencies

If there is a date column, we rewrite a few of its values in different date formats.

In [157]:
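# Publish_Date is stored as text, so look for it among the object columns
text_cols = dirty_df.select_dtypes(include=["object"]).columns

for col in text_cols:
    if "publish_date" in col.lower():
        # Overwrite 5 random rows, one per alternative date format
        dirty_df.loc[dirty_df.sample(5).index, col] = [
            "01-02-2025", "2025/02/01", "Feb 1, 2025", "2025.02.01", "01/02/25"
        ]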

display(dirty_df[dirty_df['Publish_Date'] == "Feb 1, 2025"].head())


Unnamed: 0,Movie_Tag,Title,Source,Publish_Date,Director,Key_Cast,Genre,Studio,Link,Description_Snippet
441,Mickey_17,Review: Bong Joon-ho's Mickey 17 is doing a lo...,Lainey Gossip,"Feb 1, 2025",Bong Joon-ho,"Robert Pattinson, Steven Yeun, Mark Ruffalo",Sci-Fi/Drama,Warner Bros.,https://news.google.com/rss/articles/CBMimwFBV...,Review: Bong Joon-ho's Mickey 17 is doing a lo...
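
Two of the five replacement strings, `01-02-2025` and `01/02/25`, are ambiguous between day-first and month-first readings, which is part of what makes this step hard to undo. The sketch below tries a list of candidate formats in turn; reading the ambiguous strings as day-first is an assumption, and `parse_any` is a hypothetical helper:

```python
from datetime import datetime

# The five replacement strings from the cell above
samples = ["01-02-2025", "2025/02/01", "Feb 1, 2025", "2025.02.01", "01/02/25"]

# Candidate formats; treating the ambiguous ones as day-first is an assumption
formats = ["%d-%m-%Y", "%Y/%m/%d", "%b %d, %Y", "%Y.%m.%d", "%d/%m/%y"]

def parse_any(text):
    """Return the first successful parse, or None if no format matches."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None

parsed = [parse_any(s) for s in samples]
print(parsed)
```

Note that `%b` depends on the active locale, so `Feb` is only recognised under an English or C locale.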


### 7. Typographical errors

We introduce spelling mistakes in the categorical columns.

In [158]:
for col in dirty_df.select_dtypes(include=["object"]).columns:
    dirty_df.loc[dirty_df.sample(3).index, col] = "unknwon"

display(dirty_df[dirty_df['Link'] == "unknwon"].head())

Unnamed: 0,Movie_Tag,Title,Source,Publish_Date,Director,Key_Cast,Genre,Studio,Link,Description_Snippet
94,Avatar_Fire_and_Ash,FYI: New ‘Avatar’ Movie Long AF - Gizmodo,Gizmodo,2025-11-12 08:00:00,James Cameron,"Sam Worthington, Zoe Saldaña, Sigourney Weaver...",Sci-Fi/Fantasy,20th Century Studios,unknwon,FYI: New ‘Avatar’ Movie Long AF&nbsp;&nbsp;Giz...
451,Mickey_17,Mickey 17 Review: So Relatable That It Is Terr...,Mama's Geeky,2025-03-05 08:00:00,Bong Joon-ho,"Robert Pattinson, Steven Yeun, Mark Ruffalo",Sci-Fi/Drama,Warner Bros.,unknwon,Mickey 17 Review: So Relatable That It Is Terr...
475,Mickey_17,"'Mickey 17' Review: Quirky, Refreshing Sci-FI ...",FilmSpeak,2025-03-07 08:00:00,Bong Joon-ho,"Robert Pattinson, Steven Yeun, Mark Ruffalo",Sci-Fi/Drama,Warner Bros.,unknwon,"'Mickey 17' Review: Quirky, Refreshing Sci-FI&..."
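
The cell above writes the same misspelled token, `unknwon`, into every text column. An alternative that keeps values recognisable is to corrupt the existing string in place. A minimal sketch; `swap_adjacent` is a hypothetical helper, not part of the notebook:

```python
import random

def swap_adjacent(text, rng):
    """Swap two neighbouring characters to mimic a typing slip."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

rng = random.Random(42)  # seeded so the typo repeats across runs
typo = swap_adjacent("Warner Bros.", rng)
print(typo)
```

Applied with `Series.apply` to a sample of rows, this would produce near-miss category values rather than one shared placeholder.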


### 8. Extra categories

In [159]:
for col in dirty_df.select_dtypes(include=["object"]).columns:
    dirty_df.loc[dirty_df.sample(3).index, col] = "###ERROR###"
    
display(dirty_df[dirty_df['Movie_Tag'] == "###ERROR###"].head())

Unnamed: 0,Movie_Tag,Title,Source,Publish_Date,Director,Key_Cast,Genre,Studio,Link,Description_Snippet
70,###ERROR###,James Cameron Could End 'Avatar' With 'Fire an...,MovieWeb,2025-11-26 08:00:00,James Cameron,"Sam Worthington, Zoe Saldaña, Sigourney Weaver...",Sci-Fi/Fantasy,20th Century Studios,https://news.google.com/rss/articles/CBMihgFBV...,James Cameron Could End 'Avatar' With 'Fire an...
240,###ERROR###,Sonic The Hedgehog 3 Lands Another Impressive ...,Screen Rant,2025-01-28 08:00:00,Jeff Fowler,"Ben Schwartz, Jim Carrey, Keanu Reeves (Shadow...",Action/Adventure,Paramount,https://news.google.com/rss/articles/CBMitgFBV...,Sonic The Hedgehog 3 Lands Another Impressive ...
512,###ERROR###,AVATAR: FIRE AND ASH First Reactions Describe ...,ComicBookMovie.com,2025-12-02 13:12:00,James Cameron,"Sam Worthington, Zoe Saldaña, Sigourney Weaver...",Sci-Fi/Fantasy,,https://news.google.com/rss/articles/CBMi5AFBV...,AVATAR: FIRE AND ASH First Reactions Describe ...
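
One way to check that the unexpected category landed is to compare a column against a set of known values. A hedged sketch on a toy series; the `expected` set is hypothetical and would normally be taken from the clean dataset:

```python
import pandas as pd

# Toy stand-in for the Genre column after steps 7 and 8
genre = pd.Series(["Horror", "Sci-Fi/Drama", "###ERROR###", "Horror", "unknwon"])

# Hypothetical whitelist of legitimate categories
expected = {"Horror", "Sci-Fi/Drama", "Gothic Horror"}

unexpected = genre[~genre.isin(expected)]
print(unexpected.value_counts())
```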


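### 9. Incorrect data types

We take the `Source` column and turn a few records into dict-like strings. Note that `str()` on a Python dict produces single-quoted keys and values, so the result looks like JSON but is not valid JSON.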

In [160]:
# Fix the seed so the sample is reproducible
np.random.seed(42)
indices = dirty_df.sample(3, random_state=42).index

# Wrap each value in a dict and store its string representation
dirty_df.loc[indices, 'Source'] = (
    dirty_df.loc[indices, 'Source']
    .apply(lambda x: str({"Source": x}))
)

display(dirty_df.loc[indices])

Unnamed: 0,Movie_Tag,Title,Source,Publish_Date,Director,Key_Cast,Genre,Studio,Link,Description_Snippet
304,Nosferatu,‘Nosferatu’ Review: Excellent Cinematography A...,{'Source': 'Bounding Into Comics'},1800-01-01 00:00:00,Robert Eggers,"Bill Skarsgård, Lily-Rose Depp, Nicholas Hoult",Gothic Horror,Focus Features,https://news.google.com/rss/articles/CBMi9AFBV...,‘Nosferatu’ Review: Excellent Cinematography A...
499,Mickey_17,Mickey 17 Review: Uneven but undeniably entert...,{'Source': 'JoBlo'},2200-12-31 00:00:00,Bong Joon-ho,"Robert Pattinson, Steven Yeun, Mark Ruffalo",Sci-Fi/Drama,Warner Bros.,https://news.google.com/rss/articles/CBMiUEFVX...,Mickey 17 Review: Uneven but undeniably entert...
441,Mickey_17,Review: Bong Joon-ho's Mickey 17 is doing a lo...,{'Source': 'Lainey Gossip'},"Feb 1, 2025",Bong Joon-ho,"Robert Pattinson, Steven Yeun, Mark Ruffalo",Sci-Fi/Drama,Warner Bros.,https://news.google.com/rss/articles/CBMimwFBV...,Review: Bong Joon-ho's Mickey 17 is doing a lo...
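
The values shown above, such as `{'Source': 'JoBlo'}`, are Python dict representations rather than JSON. A short sketch of the difference, using only the standard library:

```python
import ast
import json

source = "JoBlo"
as_repr = str({"Source": source})         # Python repr, single quotes
as_json = json.dumps({"Source": source})  # valid JSON, double quotes

# json.loads rejects the single-quoted repr form
try:
    json.loads(as_repr)
    repr_is_json = True
except json.JSONDecodeError:
    repr_is_json = False

# ast.literal_eval can read the repr form back into a dict
recovered = ast.literal_eval(as_repr)
print(as_repr, as_json, repr_is_json, recovered)
```

An analyst who reaches for `json.loads` on these cells will therefore hit a parse error, which adds a second layer of dirt to this step.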


### 10. Incorrect headers

In [161]:
dirty_df.columns = [
    col + random.choice(["&", "##", "!!!"]) for col in dirty_df.columns
]

display(dirty_df.head())

Unnamed: 0,Movie_Tag##,Title##,Source##,Publish_Date!!!,Director##,Key_Cast&,Genre&,Studio&,Link!!!,Description_Snippet##
0,Avatar_Fire_and_Ash,All hail Avatar! How event movies are trying t...,The Guardian,2025-12-09 16:17:00,James Cameron,"Sam Worthington, Zoe Saldaña, Sigourney Weaver...",Sci-Fi/Fantasy,20th Century Studios,https://news.google.com/rss/articles/CBMiyAFBV...,All hail Avatar! How event movies are trying t...
1,Avatar_Fire_and_Ash,"THE WEEKEND WARRIOR December 12, 2025 (Video E...",Substack,2025-12-09 14:31:35,James Cameron,"Sam Worthington, Zoe Saldaña, Sigourney Weaver...",Sci-Fi/Fantasy,20th Century Studios,https://news.google.com/rss/articles/CBMifkFVX...,"THE WEEKEND WARRIOR December 12, 2025 (Video E..."
2,Avatar_Fire_and_Ash,Avatar: Fire And Ash Is Among The Most Expensi...,SlashFilm,2025-12-09 18:20:00,James Cameron,"Sam Worthington, Zoe Saldaña, Sigourney Weaver...",Sci-Fi/Fantasy,20th Century Studios,https://news.google.com/rss/articles/CBMikgFBV...,unknwon
3,Avatar_Fire_and_Ash,###ERROR###,Box Office Theory,2025-12-05 17:25:52,James Cameron,"Sam Worthington, Zoe Saldaña, Sigourney Weaver...",Sci-Fi/Fantasy,20th Century Studios,https://news.google.com/rss/articles/CBMijgJBV...,6-Week Box Office Tracking & Forecasts: AVATAR...
4,Avatar_Fire_and_Ash,Why ‘Avatar: Fire and Ash’ Received Box Office...,Us Weekly,2025-12-08 20:54:04,James Cameron,"Sam Worthington, Zoe Saldaña, Sigourney Weaver...",Sci-Fi/Fantasy,20th Century Studios,https://news.google.com/rss/articles/CBMixAFBV...,Why ‘Avatar: Fire and Ash’ Received Box Office...
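
The cell above draws suffixes with the module-level `random.choice`, so the dirtied headers change on every run. If a repeatable answer key is wanted, a seeded generator and an explicit mapping are one option. A minimal sketch with a hypothetical subset of the columns:

```python
import random

columns = ["Movie_Tag", "Title", "Source"]
suffixes = ["&", "##", "!!!"]

rng = random.Random(42)  # seeded so the suffixes repeat across runs

# Map each original header to its dirtied form, usable as an answer key
renamed = {col: col + rng.choice(suffixes) for col in columns}
print(renamed)
```

The mapping could then be applied with `dirty_df.rename(columns=renamed)`.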


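### 12. Encoding issue

We save the dirty dataset to CSV. Note that the call below passes `encoding="UTF-8"`, which is also the pandas default, so as written the file is plain UTF-8 and this step does not by itself introduce an encoding problem.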

In [162]:
dirty_df.to_csv("dirty_dataset.csv", index=False, encoding="UTF-8")
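
To actually simulate an encoding problem, the file could be written in a non-UTF-8 encoding. A hedged sketch on a toy frame; `latin-1` and the temporary file name are illustrative choices, not the notebook's:

```python
import os
import tempfile

import pandas as pd

# Toy stand-in with accented names like those in Key_Cast
toy = pd.DataFrame({"Key_Cast": ["Zoe Saldaña", "Bill Skarsgård"]})
path = os.path.join(tempfile.mkdtemp(), "dirty_latin1.csv")

# Write with a non-UTF-8 encoding
toy.to_csv(path, index=False, encoding="latin-1")

# Reading the file back as UTF-8 fails on the single-byte ñ and å
try:
    pd.read_csv(path, encoding="utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Reading with the matching encoding restores the original text
roundtrip = pd.read_csv(path, encoding="latin-1")
print(utf8_ok, roundtrip["Key_Cast"].tolist())
```

The real dataset also contains characters outside latin-1, such as the curly quotes in some titles, so writing it this way would need an `errors=` policy in `to_csv` or a wider encoding such as `cp1252`.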