<div style="text-align: center;">
  <img src="https://github.com/Hack-io-Data/Imagenes/blob/main/01-LogosHackio/logo_naranja@4x.png?raw=true" alt="esquema" />
</div>

# Data Cleaning Lab

In this lab we will use the complete Netflix DataFrame created in the first Pandas labs.

**Instructions:**

1. Read the statement of each exercise carefully.

2. Implement the solution in the code cell provided.

3. Document every function you create during the exercise.

4. After each plot, include its interpretation in a markdown cell.

In [4]:
# Data processing
# -----------------------------------------------------------------------
import pandas as pd
import numpy as np

# Configuration
# -----------------------------------------------------------------------
pd.set_option('display.max_columns', None) # show every column when displaying DataFrames

# Date handling
# -----------------------------------------------------------------------
from datetime import datetime

import warnings
warnings.filterwarnings("ignore")

# Import our own functions
# -----------------------------------------------------------------------
from src import soporte_apply as sa




ModuleNotFoundError: No module named 'src'

## Part 1: Data Cleaning and Preparation

In [5]:
df_netflix_agrupado = pd.read_csv("datos/Netflix_full.csv", index_col=0)
df_netflix_agrupado.head(5)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,Genre,Premiere,Runtime,IMDB Score,Language,Original_Netflix
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,2021-09-25,2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",Documentary,2020-10-02,90.0,7.5,English,Sí
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",,,,,,No
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,2021-09-24,2021,TV-MA,,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,,,,,,No
3,s4,TV Show,Jailbirds New Orleans,,,,2021-09-24,2021,TV-MA,,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo...",,,,,,No
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...,,,,,,No


#### Exercise 1: Standardizing and cleaning columns

In this exercise you must clean and standardize some key columns so that they are easier to handle and consistent across your analyses. Specifically, you will convert the `date_added` and `duration` columns to a uniform, structured format.

Instructions:

1. **Convert the `date_added` column**: The `date_added` column contains dates stored as text. Convert it to a `datetime` format that pandas can understand and handle easily.

2. **Clean the `duration` column**: The `duration` column holds values in different formats such as "1 Season", "2 Seasons", "90 min", etc. Your task is to extract the number (either the number of seasons or the number of minutes) and create a new column called `duration_cleaned` with those standardized values.


**Expected Result:**
You should obtain something like this:

| duration   | duration_cleaned |
|------------|-----------------|
| 1 Season   | 1               |
| 90 min     | 90              |
| 2 Seasons  | 2               |
| 45 min     | 45              |
| 3 Seasons  | 3               |
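
One possible way to approach both steps is sketched below on a small toy DataFrame rather than the real file. The variable `ejemplo`, the column name `duration_unit` and the regular expression are assumptions made for illustration; they are not part of the exercise statement.

```python
import pandas as pd

# Toy data standing in for the real columns (values are illustrative)
ejemplo = pd.DataFrame({
    "date_added": ["2021-09-25", "2021-09-24", None],
    "duration": ["90 min", "2 Seasons", None],
})

# errors="coerce" turns missing or unparseable dates into NaT instead of raising
ejemplo["date_added"] = pd.to_datetime(ejemplo["date_added"], errors="coerce")

# Capture the leading digits and the unit word in two named groups
partes = ejemplo["duration"].str.extract(r"(?P<numero>\d+)\s*(?P<unidad>\w+)")

# to_numeric converts the digit strings to numbers and keeps missing values as NaN
ejemplo["duration_cleaned"] = pd.to_numeric(partes["numero"])
ejemplo["duration_unit"] = partes["unidad"]
```

Keeping the unit in its own column avoids mixing minutes and seasons when the numeric column is later aggregated.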

In [6]:
# Convert the date_added column to datetime
df_netflix_agrupado.date_added = pd.to_datetime(df_netflix_agrupado.date_added)
df_netflix_agrupado.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8807 entries, 0 to 8806
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   show_id           8807 non-null   object        
 1   type              8807 non-null   object        
 2   title             8807 non-null   object        
 3   director          6173 non-null   object        
 4   cast              7982 non-null   object        
 5   country           7976 non-null   object        
 6   date_added        8797 non-null   datetime64[ns]
 7   release_year      8807 non-null   int64         
 8   rating            8800 non-null   object        
 9   duration          3994 non-null   object        
 10  listed_in         8807 non-null   object        
 11  description       8807 non-null   object        
 12  Genre             513 non-null    object        
 13  Premiere          513 non-null    object        
 14  Runtime           513 non-nul

In [7]:
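# Split "90 min" / "2 Seasons" into the number and its unit, then store the number as numeric (NaN where duration is missing)
df_netflix_agrupado[["duration_cleaned", "Unidad_de_medida"]] = df_netflix_agrupado.duration.str.split(" ", expand=True)
df_netflix_agrupado["duration_cleaned"] = pd.to_numeric(df_netflix_agrupado["duration_cleaned"], errors="coerce")
df_netflix_agrupado.head(5)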

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,Genre,Premiere,Runtime,IMDB Score,Language,Original_Netflix,duration_cleaned,Unidad_de_medida
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,2021-09-25,2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",Documentary,2020-10-02,90.0,7.5,English,Sí,90.0,min
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",,,,,,No,2.0,Seasons
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,2021-09-24,2021,TV-MA,,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,,,,,,No,,
3,s4,TV Show,Jailbirds New Orleans,,,,2021-09-24,2021,TV-MA,,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo...",,,,,,No,,
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...,,,,,,No,2.0,Seasons


#### Exercise 2: Normalizing the `rating` column

The `rating` column contains ratings such as `PG`, `PG-13`, `R`, among others. Categorize these ratings into three groups:

- **'General Audience'** for ratings such as `G`, `PG`.

- **'Teens'** for ratings such as `PG-13`, `TV-14`.

- **'Adults'** for ratings such as `R`, `TV-MA`.
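
As a minimal sketch of the idea, the grouping can be expressed as a dictionary and applied with `Series.map`. The dictionary `GRUPOS` only covers the six ratings named above; how the remaining codes should be grouped is a decision left to the solution, and `"XX"` is a made-up code used to show what happens with unknown values.

```python
import pandas as pd

# Only the six ratings named in the exercise text; other codes are left unmapped here
GRUPOS = {
    "G": "General Audience", "PG": "General Audience",
    "PG-13": "Teens", "TV-14": "Teens",
    "R": "Adults", "TV-MA": "Adults",
}

# "XX" is a hypothetical unknown code, None stands for a missing rating
ratings = pd.Series(["PG-13", "TV-MA", "G", None, "XX"])

# map returns NaN for nulls and for keys that are not in the dictionary
grupos = ratings.map(GRUPOS)
```

A dictionary keeps the grouping in one place, so adding a rating later means editing data rather than control flow.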


In [8]:
df_netflix_agrupado.rating.unique()

array(['PG-13', 'TV-MA', 'PG', 'TV-14', 'TV-PG', 'TV-Y', 'TV-Y7', 'R',
       'TV-G', 'G', 'NC-17', nan, 'NR', 'TV-Y7-FV', 'UR'], dtype=object)

In [9]:

def categorizar_rating(rating):
    """Categorize the rating of a movie or series into three broad groups:
        General Audience: 'G', 'PG', 'TV-PG', 'TV-G', 'TV-Y', 'TV-Y7', 'TV-Y7-FV'
        Teens: 'PG-13', 'TV-14'
        Adults: 'R', 'TV-MA', 'NR', 'UR', 'NC-17'

    Args:
        rating (str): rating of the movie or series, taken from the rating column of the DataFrame df_netflix_agrupado

    Returns:
        str: the group the rating belongs to (NaN if the rating is not recognized)
    """
    if rating in ['G', 'PG', 'TV-PG', 'TV-G', 'TV-Y', 'TV-Y7', 'TV-Y7-FV']:
        return 'General Audience'
    elif rating in ['PG-13', 'TV-14']:
        return 'Teens'
    elif rating in ['R', 'TV-MA', 'NR', 'UR', 'NC-17']:
        return 'Adults'
    else:
        return np.nan


df_netflix_agrupado['calificacion_rating'] = df_netflix_agrupado.rating.apply(categorizar_rating)
df_netflix_agrupado.head(5)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,Genre,Premiere,Runtime,IMDB Score,Language,Original_Netflix,duration_cleaned,Unidad_de_medida,calificacion_rating
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,2021-09-25,2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",Documentary,2020-10-02,90.0,7.5,English,Sí,90.0,min,Teens
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",,,,,,No,2.0,Seasons,Adults
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,2021-09-24,2021,TV-MA,,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,,,,,,No,,,Adults
3,s4,TV Show,Jailbirds New Orleans,,,,2021-09-24,2021,TV-MA,,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo...",,,,,,No,,,Adults
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...,,,,,,No,2.0,Seasons,Adults


#### Exercise 3: Creating a custom column based on the cast

Let's identify whether a key actor such as `Leonardo DiCaprio`, `Tom Hanks`, or `Morgan Freeman` appears in the cast.

Use `apply` and a lambda function to create a new column called `has_famous_actor` that contains `True` if any of these actors is in the `cast` list and `False` otherwise.
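
A hedged sketch of the lambda on a toy Series follows. The variable names `famosos` and `elenco` are illustrative, and the check is a plain substring test, so it assumes that no other actor's name contains one of these names.

```python
import pandas as pd

famosos = ["Leonardo DiCaprio", "Tom Hanks", "Morgan Freeman"]

# Toy cast values; the last one is missing
elenco = pd.Series([
    "Leonardo DiCaprio, Tom Hanks, Christopher Walken",
    "Mia Farrow, John Cassavetes",
    None,
])

# isinstance guards against nulls, which are not strings and cannot be searched
has_famous_actor = elenco.apply(
    lambda cast: isinstance(cast, str) and any(nombre in cast for nombre in famosos)
)
```

Guarding inside the lambda means the nulls in `cast` do not have to be filled before the column is created.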

In [10]:
# Case where the cast is a single actor who matches one of the names in the list
famous_actor = ['Leonardo DiCaprio', 'Tom Hanks',  'Morgan Freeman' ]

df_netflix_agrupado['has_famous_actor_unico'] = df_netflix_agrupado.apply(lambda x: True if x['cast'] in famous_actor else False, axis=1)

df_netflix_agrupado[df_netflix_agrupado.has_famous_actor_unico == True]

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,Genre,Premiere,Runtime,IMDB Score,Language,Original_Netflix,duration_cleaned,Unidad_de_medida,calificacion_rating,has_famous_actor_unico
5560,s5561,Movie,The C Word,Meghan O'Hara,Morgan Freeman,United States,2017-03-24,2016,TV-14,,Documentaries,"To solve the mystery of cancer, challenging th...",,,,,,No,,,Teens,True
6272,s6273,Movie,Before the Flood,Fisher Stevens,Leonardo DiCaprio,United States,2018-02-01,2016,PG,,Documentaries,Leonardo DiCaprio crisscrosses the globe to in...,,,,,,No,,,General Audience,True
8518,s8519,TV Show,The Story of God with Morgan Freeman,,Morgan Freeman,United States,2019-02-01,2017,TV-14,2 Seasons,Docuseries,Host Morgan Freeman explores religion's role i...,,,,,,No,2.0,Seasons,Teens,True
8519,s8520,TV Show,The Story of Us with Morgan Freeman,,Morgan Freeman,United States,2019-02-01,2017,TV-14,,Docuseries,"Explore the forces of love, belief, power, war...",,,,,,No,,,Teens,True


In [12]:
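# Case where the cast contains any of the famous actors
famous_actor = ['Leonardo DiCaprio', 'Tom Hanks', 'Morgan Freeman']

def actores_famosos(actor):
    """Return True if any name in famous_actor appears in the cast string.

    Args:
        actor (str): contents of the cast column for one title

    Returns:
        bool: True if a famous actor is in the cast, False otherwise
    """
    return any(nombre in actor for nombre in famous_actor)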

In [13]:
# Replace null values with "sin información"
df_netflix_agrupado.cast = df_netflix_agrupado['cast'].fillna("sin información")

In [14]:
df_netflix_agrupado['has_famous_actor_1'] = df_netflix_agrupado.cast.apply(actores_famosos)

In [15]:
df_netflix_agrupado[df_netflix_agrupado.has_famous_actor_1 == True]

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,Genre,Premiere,Runtime,IMDB Score,Language,Original_Netflix,duration_cleaned,Unidad_de_medida,calificacion_rating,has_famous_actor_unico,has_famous_actor_1
129,s130,Movie,An Unfinished Life,Lasse Hallström,"Robert Redford, Jennifer Lopez, Morgan Freeman...","Germany, United States",2021-09-01,2005,PG-13,,Dramas,A grieving widow and her daughter move in with...,,,,,,No,,,Teens,False,True
329,s330,Movie,Catch Me If You Can,Steven Spielberg,"Leonardo DiCaprio, Tom Hanks, Christopher Walk...","United States, Canada",2021-08-01,2002,PG-13,142 min,Dramas,An FBI agent makes it his mission to put cunni...,,,,,,No,142.0,min,Teens,False,True
340,s341,Movie,Inception,Christopher Nolan,"Leonardo DiCaprio, Joseph Gordon-Levitt, Ellio...","United States, United Kingdom",2021-08-01,2010,PG-13,148 min,"Action & Adventure, Sci-Fi & Fantasy, Thrillers",A troubled thief who extracts secrets from peo...,,,,,,No,148.0,min,Teens,False,True
392,s393,Movie,Django Unchained,Quentin Tarantino,"Jamie Foxx, Christoph Waltz, Leonardo DiCaprio...",United States,2021-07-24,2012,R,165 min,"Action & Adventure, Dramas","Accompanied by a German bounty hunter, a freed...",,,,,,No,165.0,min,Adults,False,True
608,s609,Movie,The Sum of All Fears,Phil Alden Robinson,"Ben Affleck, Morgan Freeman, Bridget Moynahan,...","United States, Germany, Canada",2021-07-01,2002,PG-13,124 min,Action & Adventure,CIA agent Jack Ryan tries to discover why thre...,,,,,,No,124.0,min,Teens,False,True
799,s800,Movie,Million Dollar Baby,Clint Eastwood,"Clint Eastwood, Hilary Swank, Morgan Freeman, ...",United States,2021-06-02,2004,PG-13,133 min,"Dramas, Sports Movies",When a cantankerous trainer mentors a persiste...,,,,,,No,133.0,min,Teens,False,True
1358,s1359,Movie,Shutter Island,Martin Scorsese,"Leonardo DiCaprio, Mark Ruffalo, Ben Kingsley,...",United States,2021-02-01,2010,R,139 min,Thrillers,A U.S. marshal's troubling visions compromise ...,,,,,,No,139.0,min,Adults,False,True
1469,s1470,Movie,What's Eating Gilbert Grape,Lasse Hallström,"Johnny Depp, Leonardo DiCaprio, Juliette Lewis...",United States,2021-01-01,1993,PG-13,118 min,"Classic Movies, Dramas, Independent Movies","In a backwater Iowa town, young Gilbert is tor...",,,,,,No,118.0,min,Teens,False,True
1610,s1611,Movie,Angels & Demons,Ron Howard,"Tom Hanks, Ewan McGregor, Ayelet Zurer, Stella...","United States, Italy",2020-12-01,2009,PG-13,139 min,Thrillers,A Harvard symbologist races to uncover clues t...,,,,,,No,139.0,min,Teens,False,True
1624,s1625,Movie,The Da Vinci Code,Ron Howard,"Tom Hanks, Audrey Tautou, Ian McKellen, Jean R...","United States, Malta, France, United Kingdom",2020-12-01,2006,PG-13,149 min,Thrillers,"When the curator of the Louvre is killed, a Ha...",,,,,,No,149.0,min,Teens,False,True


#### Exercise 4: Creating a custom column using conditional logic

Let's create a column called `is_recent` that identifies whether a title was released in the last 5 years.

Write a function that returns `True` if the title is recent (released in the last 5 years) and `False` if it is not.
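
The sketch below shows one way to write such a function. The name `es_reciente` and its parameters are hypothetical, and the reference year is fixed at 2024 only to keep the example reproducible; in practice it would normally be the current year.

```python
import pandas as pd

def es_reciente(release_year, ano_referencia, ventana=5):
    """Return True if release_year is at most `ventana` years before ano_referencia."""
    return ano_referencia - release_year <= ventana

anos = pd.Series([2020, 2018, 2007])

# Extra keyword arguments given to apply are forwarded to the function
is_recent = anos.apply(es_reciente, ano_referencia=2024)
```

Passing the reference year as a parameter makes the result independent of the day the notebook is run, which helps when comparing outputs later.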

In [16]:
ano_actual = pd.to_datetime("now").year
df_netflix_agrupado['is_recent'] = df_netflix_agrupado.apply(lambda x: True if ano_actual - x['release_year'] <=5 else False, axis=1)

df_netflix_agrupado

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,Genre,Premiere,Runtime,IMDB Score,Language,Original_Netflix,duration_cleaned,Unidad_de_medida,calificacion_rating,has_famous_actor_unico,has_famous_actor_1,is_recent
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,sin información,United States,2021-09-25,2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",Documentary,2020-10-02,90.0,7.5,English,Sí,90,min,Teens,False,False,True
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",,,,,,No,2,Seasons,Adults,False,False,True
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,2021-09-24,2021,TV-MA,,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,,,,,,No,,,Adults,False,False,True
3,s4,TV Show,Jailbirds New Orleans,,sin información,,2021-09-24,2021,TV-MA,,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo...",,,,,,No,,,Adults,False,False,True
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...,,,,,,No,2,Seasons,Adults,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8802,s8803,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,2019-11-20,2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a...",,,,,,No,158,min,Adults,False,False,False
8803,s8804,TV Show,Zombie Dumb,,sin información,,2019-07-01,2018,TV-Y7,2 Seasons,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g...",,,,,,No,2,Seasons,General Audience,False,False,False
8804,s8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,2019-11-01,2009,R,,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...,,,,,,No,,,Adults,False,False,False
8805,s8806,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,2020-01-11,2006,PG,,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero...",,,,,,No,,,General Audience,False,False,False


#### Exercise 5: Classifying titles by decade

In this exercise, your goal is to categorize the release years of the movies or series into decades. The `release_year` column contains the release year, and you must create a new column called `decade` that indicates the corresponding decade, such as "1990s", "2000s", etc.
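
One vectorized way to compute the decade, shown here on a toy Series, is integer arithmetic on the year. This is a sketch of an alternative technique, not necessarily the approach the exercise expects.

```python
import pandas as pd

anos = pd.Series([1997, 2018, 2020])

# Integer-divide by 10 to drop the last digit, multiply back, then append "s"
decade = (anos // 10 * 10).astype(str) + "s"
```

Working on the whole column at once avoids calling a Python function per row and does not depend on the year having exactly four digits.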


In [17]:
str(df_netflix_agrupado.release_year[0])[0:3]

def decadas(ano):
    """Given the release year of a movie or series, return the decade it belongs to.

    Args:
        ano (int): release year

    Returns:
        str: decade in which the title was released, e.g. "1990s"
    """
    ano = str(ano)
    decada = ano[0:3] + '0s'
    return decada

df_netflix_agrupado['decade'] = df_netflix_agrupado.release_year.apply(decadas)
df_netflix_agrupado.sample(5)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,Genre,Premiere,Runtime,IMDB Score,Language,Original_Netflix,duration_cleaned,Unidad_de_medida,calificacion_rating,has_famous_actor_unico,has_famous_actor_1,is_recent,decade
7213,s7214,Movie,Kiss the Girls,Gary Fleder,"Morgan Freeman, Ashley Judd, Cary Elwes, Alex ...",United States,2020-01-01,1997,R,116 min,Thrillers,A forensic psychologist teams up with an escap...,,,,,,No,116.0,min,Adults,False,True,False,1990s
4974,s4975,Movie,Roxanne Roxanne,Michael Larnell,"Chanté Adams, Mahershala Ali, Nia Long, Elvis ...",United States,2018-03-23,2018,TV-MA,,"Dramas, Music & Musicals",She grew up in a tough New York neighborhood i...,Biopic,2018-03-23,98.0,6.2,English,Sí,,,Adults,False,False,False,2010s
4888,s4889,Movie,Darc,Julius R. Nasso,"Tony Schiena, Armand Assante, Kippei Shiina, S...",United States,2018-05-01,2018,TV-MA,98 min,"Action & Adventure, Independent Movies","To bring down a global human trafficking ring,...",,,,,,No,98.0,min,Adults,False,False,False,2010s
1667,s1668,TV Show,Flavorful Origins,,Yang Chen,China,2020-11-20,2020,TV-G,,"Docuseries, International TV Shows",Delve into the delectable world of Chaoshan cu...,,,,,,No,,,General Audience,False,False,True,2020s
8274,s8275,Movie,The Detained,Blair Hayes,"Alex Frnka, Sarah Davenport, Henry Zaga, Coy S...",United States,2019-02-15,2017,TV-14,85 min,"Horror Movies, Thrillers",Five high school students serving Saturday det...,,,,,,No,85.0,min,Teens,False,False,False,2010s


#### Exercise 6: Extracting information

To practice extracting information:

1. **Extract the first actor** from the list in the `cast` column and create a new column called `first_actor`.

2. **Extract the director's first name** and store it in a column called `first_name_director`.
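
The second step can be read in two ways: the first director listed, or that director's given name. The sketch below computes both on toy rows so the difference is visible. The names `ejemplo`, `primer_director` and `nombre_de_pila` are illustrative, and the pairing of people in the toy data is invented.

```python
import pandas as pd

# Toy rows; the combinations of names are illustrative only
ejemplo = pd.DataFrame({
    "cast": ["Mia Farrow, John Cassavetes", "Yang Chen", None],
    "director": ["Roman Polanski", "Ana Lily Amirpour, Jakob Lass", None],
})

# .str[0] picks the first element of each split list; nulls stay null
ejemplo["first_actor"] = ejemplo["cast"].str.split(",").str[0].str.strip()

# Reading 1: the first director listed
primer_director = ejemplo["director"].str.split(",").str[0].str.strip()

# Reading 2: the first whitespace-separated token of that director's name
nombre_de_pila = primer_director.str.split().str[0]
```

Using `.str[0]` on the split lists avoids building the wide intermediate DataFrame that `expand=True` creates.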


In [18]:
# Extract the first actor
df_netflix_agrupado["first_actor"] = df_netflix_agrupado.cast.str.split(",", expand=True)[0]


# Extract the first director listed (titles can credit several, separated by commas)
df_netflix_agrupado["first_name_director"] = df_netflix_agrupado.director.str.split(",", expand=True)[0]

df_netflix_agrupado.sample(5)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,Genre,Premiere,Runtime,IMDB Score,Language,Original_Netflix,duration_cleaned,Unidad_de_medida,calificacion_rating,has_famous_actor_unico,has_famous_actor_1,is_recent,decade,first_actor,first_name_director
7894,s7895,Movie,Rosemary's Baby,Roman Polanski,"Mia Farrow, John Cassavetes, Ruth Gordon, Sidn...",United States,2019-11-01,1968,R,137 min,"Classic Movies, Horror Movies, Thrillers",A woman is thrilled to find out she's pregnant...,,,,,,No,137.0,min,Adults,False,False,False,1960s,Mia Farrow,Roman Polanski
3217,s3218,Movie,The Island,Toka McBaror,"Sambasa Nzeribe, Segun Arinze, Tokunbo Idowu, ...",Nigeria,2019-11-29,2018,TV-14,,"Dramas, International Movies, Thrillers",When a colonel uncovers controversial intel ab...,,,,,,No,,,Teens,False,False,False,2010s,Sambasa Nzeribe,Toka McBaror
2120,s2121,Movie,Les Misérables,Tom Hooper,"Hugh Jackman, Russell Crowe, Anne Hathaway, Am...","United Kingdom, United States",2020-08-16,2012,PG-13,158 min,"Dramas, International Movies, Music & Musicals",The musical version of Victor Hugo's epic tale...,,,,,,No,158.0,min,Teens,False,False,False,2010s,Hugh Jackman,Tom Hooper
5267,s5268,Movie,The Bad Batch,Ana Lily Amirpour,"Suki Waterhouse, Jason Momoa, Keanu Reeves, Ji...",United States,2017-09-22,2016,R,119 min,"Dramas, Independent Movies, Thrillers","Banished to a wasteland of undesirables, a you...",,,,,,No,119.0,min,Adults,False,False,False,2010s,Suki Waterhouse,Ana Lily Amirpour
3735,s3736,Movie,Right Here Right Now,Jakob Lass,"Niklas Bruhn, David Schütter, Tinka Furst, Cor...",Germany,2019-06-16,2018,TV-MA,,"Comedies, Dramas, Independent Movies",The owner of a failing music club throws one l...,,,,,,No,,,Adults,False,False,False,2010s,Niklas Bruhn,Jakob Lass


#### Exercise 7: Cleaning the `cast` column

The `cast` column contains a list of actors separated by commas. Your goal is to carry out the following tasks:

1. **Replace the null values** in the `cast` column with "sin información".

2. **Count the number of actors** in each entry and create a new column called `num_cast`.

3. **Normalize the names**: Make sure the actors' names are in a consistent format (for example, by removing extra spaces).
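
The three steps above can be sketched on a toy Series as follows. The helper `limpiar_elenco` is a hypothetical name, and the messy spacing in the first value is invented to show the normalization.

```python
import pandas as pd

# Toy cast values with inconsistent spacing and one missing entry
elenco = pd.Series(["  Tim Allen ,Courteney Cox,  Chevy Chase", None, "Yang Chen"])

# Step 1: replace nulls with the placeholder
elenco = elenco.fillna("sin información")

def limpiar_elenco(texto):
    """Strip each name and rejoin the names with a single ', ' separator."""
    return ", ".join(nombre.strip() for nombre in texto.split(","))

# Step 3: normalize every name, not just the ends of the whole string
elenco_limpio = elenco.apply(limpiar_elenco)

# Step 2: count names, treating the placeholder as zero actors
num_cast = elenco_limpio.apply(
    lambda t: 0 if t == "sin información" else len(t.split(","))
)
```

Note that calling `.str.strip()` on the column alone only trims the start and end of the full string, so spaces around the individual names would remain.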


In [19]:
# Fill the null values with "sin información"
df_netflix_agrupado.cast = df_netflix_agrupado['cast'].fillna("sin información")

# Count the number of actors
def longitud(elemento):
    """Count the actors in a comma-separated cast string.

    Args:
        elemento (str): contents of the cast column for one title

    Returns:
        int: number of actors, or 0 when the cast is "sin información"
    """
    if elemento == 'sin información':
        return 0
    return len(elemento.split(","))

df_netflix_agrupado['num_cast'] = df_netflix_agrupado.cast.apply(longitud)

# Normalize the names: trim the string and leave exactly one space after each comma
df_netflix_agrupado.cast = df_netflix_agrupado.cast.str.strip().str.replace(r"\s*,\s*", ", ", regex=True)

df_netflix_agrupado.head(5)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,Genre,Premiere,Runtime,IMDB Score,Language,Original_Netflix,duration_cleaned,Unidad_de_medida,calificacion_rating,has_famous_actor_unico,has_famous_actor_1,is_recent,decade,first_actor,first_name_director,num_cast
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,sin información,United States,2021-09-25,2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",Documentary,2020-10-02,90.0,7.5,English,Sí,90.0,min,Teens,False,False,True,2020s,sin información,Kirsten Johnson,0
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",,,,,,No,2.0,Seasons,Adults,False,False,True,2020s,Ama Qamata,,19
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,2021-09-24,2021,TV-MA,,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,,,,,,No,,,Adults,False,False,True,2020s,Sami Bouajila,Julien Leclercq,9
3,s4,TV Show,Jailbirds New Orleans,,sin información,,2021-09-24,2021,TV-MA,,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo...",,,,,,No,,,Adults,False,False,True,2020s,sin información,,0
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...,,,,,,No,2.0,Seasons,Adults,False,False,True,2020s,Mayur More,,8



#### Exercise 8: Identifying Recurrent Directors

In this exercise you must identify the directors who appear more than once in the dataset. Follow these steps:

1. **Replace the null values** in the `director` column with "sin información".

2. **Count how many times each director appears** in the column created in Exercise 6.

3. **Filter the directors who appear more than once** and create a new column called `recurrent_director` that contains "Yes" if the director appears several times and "No" otherwise.
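
A minimal sketch of these steps on a toy Series is shown below. Excluding the placeholder from the "Yes" group is an assumption of this sketch (the exercise does not say how missing directors should be labelled), and the variable names are illustrative.

```python
import pandas as pd

# Toy director values with one missing entry
directores = pd.Series(["Ron Howard", "Ron Howard", "Tom Hooper", None])
directores = directores.fillna("sin información")

# Count appearances, then look up each row's count again with map
conteo = directores.value_counts()
veces = directores.map(conteo)

# Assumption: the placeholder is never marked as a recurrent director
es_recurrente = (veces > 1) & (directores != "sin información")
recurrent_director = es_recurrente.map({True: "Yes", False: "No"})
```

Mapping the counts back onto the column keeps everything vectorized and avoids a per-row lookup that would fail on names missing from the counts.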

In [20]:
# Fill the null values with "sin información"
df_netflix_agrupado.director = df_netflix_agrupado['director'].fillna("sin información")
df_netflix_agrupado.director 



0       Kirsten Johnson
1       sin información
2       Julien Leclercq
3       sin información
4       sin información
             ...       
8802      David Fincher
8803    sin información
8804    Ruben Fleischer
8805       Peter Hewitt
8806        Mozez Singh
Name: director, Length: 8807, dtype: object

In [21]:
# Count how many times each director appears
display(df_netflix_agrupado.first_name_director.value_counts())

# Boolean filter: directors that appear more than once
filtro_director_recurrente = df_netflix_agrupado.first_name_director.value_counts()>1
filtro_director_recurrente['David Guy Levy']



first_name_director
Rajiv Chilaka       22
Raúl Campos         18
Marcus Raboy        16
Suhas Kadav         16
Jay Karas           15
                    ..
David Guy Levy       1
Storm Theunissen     1
Bart Freundlich      1
Adrian Murray        1
Theo Davies          1
Name: count, Length: 4405, dtype: int64

np.False_

In [22]:
df_netflix_agrupado.first_name_director.info()

<class 'pandas.core.series.Series'>
Index: 8807 entries, 0 to 8806
Series name: first_name_director
Non-Null Count  Dtype 
--------------  ----- 
6173 non-null   object
dtypes: object(1)
memory usage: 395.6+ KB


In [24]:
contador_director_recurrente = df_netflix_agrupado.first_name_director.value_counts()
directores_recurrentes =contador_director_recurrente[contador_director_recurrente>1].index
directores_recurrentes

# Create the recurrent-director column with a lambda
df_netflix_agrupado['recurrent_director'] = df_netflix_agrupado.first_name_director.apply(lambda x: 'Yes' if x in directores_recurrentes else 'No')

df_netflix_agrupado



Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,Genre,Premiere,Runtime,IMDB Score,Language,Original_Netflix,duration_cleaned,Unidad_de_medida,calificacion_rating,has_famous_actor_unico,has_famous_actor_1,is_recent,decade,first_actor,first_name_director,num_cast,recurrent_director
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,sin información,United States,2021-09-25,2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",Documentary,2020-10-02,90.0,7.5,English,Sí,90,min,Teens,False,False,True,2020s,sin información,Kirsten Johnson,0,No
1,s2,TV Show,Blood & Water,sin información,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",,,,,,No,2,Seasons,Adults,False,False,True,2020s,Ama Qamata,,19,No
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,2021-09-24,2021,TV-MA,,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,,,,,,No,,,Adults,False,False,True,2020s,Sami Bouajila,Julien Leclercq,9,Yes
3,s4,TV Show,Jailbirds New Orleans,sin información,sin información,,2021-09-24,2021,TV-MA,,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo...",,,,,,No,,,Adults,False,False,True,2020s,sin información,,0,No
4,s5,TV Show,Kota Factory,sin información,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...,,,,,,No,2,Seasons,Adults,False,False,True,2020s,Mayur More,,8,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8802,s8803,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,2019-11-20,2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a...",,,,,,No,158,min,Adults,False,False,False,2000s,Mark Ruffalo,David Fincher,10,Yes
8803,s8804,TV Show,Zombie Dumb,sin información,sin información,,2019-07-01,2018,TV-Y7,2 Seasons,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g...",,,,,,No,2,Seasons,General Audience,False,False,False,2010s,sin información,,0,No
8804,s8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,2019-11-01,2009,R,,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...,,,,,,No,,,Adults,False,False,False,2000s,Jesse Eisenberg,Ruben Fleischer,7,Yes
8805,s8806,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,2020-01-11,2006,PG,,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero...",,,,,,No,,,General Audience,False,False,False,2000s,Tim Allen,Peter Hewitt,9,No


In [25]:
# Create the recurrent-director column with a function. value_counts() skips nulls by default, so looking up
# a null director with .loc raises a KeyError; the nulls are therefore replaced with "sin información" first.
df_netflix_agrupado.first_name_director = df_netflix_agrupado['first_name_director'].fillna("sin información")
contador_director_recurrente = df_netflix_agrupado.first_name_director.value_counts()


def director_recurrente_1(director):
    """Indicate whether a director appears more than once in the dataset.

    Args:
        director (str): name taken from the first_name_director column

    Returns:
        str: 'Yes' if the director appears more than once, 'No' if only once,
             'sin información' when the director is unknown
    """
    if director == 'sin información':
        return 'sin información'
    if contador_director_recurrente.loc[director] > 1:
        return 'Yes'
    return 'No'



In [26]:
df_netflix_agrupado.first_name_director[1]

'sin información'

In [27]:
df_netflix_agrupado['recurrent_director_1'] = df_netflix_agrupado.first_name_director.apply(director_recurrente_1)
df_netflix_agrupado

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,Genre,Premiere,Runtime,IMDB Score,Language,Original_Netflix,duration_cleaned,Unidad_de_medida,calificacion_rating,has_famous_actor_unico,has_famous_actor_1,is_recent,decade,first_actor,first_name_director,num_cast,recurrent_director,recurrent_director_1
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,sin información,United States,2021-09-25,2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",Documentary,2020-10-02,90.0,7.5,English,Sí,90,min,Teens,False,False,True,2020s,sin información,Kirsten Johnson,0,No,No
1,s2,TV Show,Blood & Water,sin información,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",,,,,,No,2,Seasons,Adults,False,False,True,2020s,Ama Qamata,sin información,19,No,sin información
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,2021-09-24,2021,TV-MA,,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,,,,,,No,,,Adults,False,False,True,2020s,Sami Bouajila,Julien Leclercq,9,Yes,Yes
3,s4,TV Show,Jailbirds New Orleans,sin información,sin información,,2021-09-24,2021,TV-MA,,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo...",,,,,,No,,,Adults,False,False,True,2020s,sin información,sin información,0,No,sin información
4,s5,TV Show,Kota Factory,sin información,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...,,,,,,No,2,Seasons,Adults,False,False,True,2020s,Mayur More,sin información,8,No,sin información
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8802,s8803,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,2019-11-20,2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a...",,,,,,No,158,min,Adults,False,False,False,2000s,Mark Ruffalo,David Fincher,10,Yes,Yes
8803,s8804,TV Show,Zombie Dumb,sin información,sin información,,2019-07-01,2018,TV-Y7,2 Seasons,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g...",,,,,,No,2,Seasons,General Audience,False,False,False,2010s,sin información,sin información,0,No,sin información
8804,s8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,2019-11-01,2009,R,,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...,,,,,,No,,,Adults,False,False,False,2000s,Jesse Eisenberg,Ruben Fleischer,7,Yes,Yes
8805,s8806,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,2020-01-11,2006,PG,,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero...",,,,,,No,,,General Audience,False,False,False,2000s,Tim Allen,Peter Hewitt,9,No,No


In [28]:
# Save the processed file
df_netflix_agrupado.to_pickle("datos/Netflix.full.newcol.pkl")