***Emma Arenas Villaverde***
***

# Netflix Dataset Analysis
The `netflix-titles` dataset lists the movies and TV shows available on Netflix, with one row per title and fields such as type, director, cast, country, release year, rating, duration, and category (`listed_in`). It is a useful resource for media analysts, content creators, and researchers studying content strategy and trends on a major streaming platform. This notebook loads the data, drops the columns that are not needed, and then handles duplicates, missing values, and column types.

## Data Preparation and Cleaning

### Importing Libraries

In [None]:
install.packages("readr") # to read CSV files
install.packages("dplyr") # for data manipulation
install.packages("lubridate") # for date parsing (mdy() is used later)
library(lubridate)
library(readr)
library(dplyr)

### Loading Dataset

In [26]:
netflix_titles <- read.csv2("../data/netflix-titles.csv")
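
`read.csv2()` expects semicolon-separated fields (and a comma as the decimal mark), while `read.csv()` expects comma-separated fields. The following is a minimal sketch of that difference on a small temporary file; the file contents are made up for illustration and are not taken from `netflix-titles.csv`.

```r
# Write a tiny semicolon-separated file (hypothetical sample data)
tmp <- tempfile(fileext = ".csv")
writeLines(c("title;type;release_year",
             "Example Movie;Movie;2020",
             "Example Show;TV Show;2021"), tmp)

# read.csv2() splits on ";" and yields three columns
semi <- read.csv2(tmp)
dim(semi)

# read.csv() splits on "," and leaves each line in a single column
comma <- read.csv(tmp)
dim(comma)
```

If `dim()` reports a single column after loading, the separator is the first thing worth checking.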

### Data Overview

In [27]:
dim(netflix_titles) # to obtain its dimensions

In [29]:
str(netflix_titles) # to see its internal structure

'data.frame':	8811 obs. of  12 variables:
 $ show_id     : chr  "s1" "s2" "s3" "s4" ...
 $ type        : chr  "Movie" "TV Show" "TV Show" "TV Show" ...
 $ title       : chr  "Dick Johnson Is Dead" "Blood & Water" "Ganglands" "Jailbirds New Orleans" ...
 $ director    : chr  "Kirsten Johnson" "" "Julien Leclercq" "" ...
 $ cast        : chr  "" "Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang Molaba, Dillon Windvogel, Natasha Thahane, Arno Greeff, Xolile "| __truncated__ "Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabiha Akkari, Sofia Lesaffre, Salim Kechiouche, Noureddine Farihi, G"| __truncated__ "" ...
 $ country     : chr  "United States" "South Africa" "" "" ...
 $ date_added  : chr  "September 25, 2021" "September 24, 2021" "September 24, 2021" "September 24, 2021" ...
 $ release_year: chr  "2020" "2021" "2021" "2021" ...
 $ rating      : chr  "PG-13" "TV-MA" "TV-MA" "TV-MA" ...
 $ duration    : chr  "90 min" "2 Seasons" "1 Season" "1 Season" ...
 $ listed_in   : chr  "Documentarie

In [30]:
head(netflix_titles) # to view its first six rows

,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical ways to help them both face the inevitable."
2,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang Molaba, Dillon Windvogel, Natasha Thahane, Arno Greeff, Xolile Tshabalala, Getmore Sithole, Cindy Mahlangu, Ryle De Morny, Greteli Fincham, Sello Maake Ka-Ncube, Odwa Gwanya, Mekaila Mathys, Sandi Schultz, Duane Williams, Shamilla Miller, Patrick Mofokeng",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town teen sets out to prove whether a private-school swimming star is her sister who was abducted at birth."
3,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabiha Akkari, Sofia Lesaffre, Salim Kechiouche, Noureddine Farihi, Geert Van Rampelberg, Bakary Diombera",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Action & Adventure","To protect his family from a powerful drug lord, skilled thief Mehdi and his expert team of robbers are pulled into a violent and deadly turf war."
4,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down among the incarcerated women at the Orleans Justice Center in New Orleans on this gritty reality series."
5,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam Khan, Ahsaas Channa, Revathi Pillai, Urvi Singh, Arun Kumar",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV Comedies","In a city of coaching centers known to train India’s finest collegiate minds, an earnest but unexceptional student and his friends navigate campus life."
6,s6,TV Show,Midnight Mass,Mike Flanagan,"Kate Siegel, Zach Gilford, Hamish Linklater, Henry Thomas, Kristin Lehman, Samantha Sloyan, Igby Rigney, Rahul Kohli, Annarah Cymone, Annabeth Gish, Alex Essoe, Rahul Abburi, Matt Biedel, Michael Trucco, Crystal Balint, Louis Oliver",,"September 24, 2021",2021,TV-MA,1 Season,"TV Dramas, TV Horror, TV Mysteries","The arrival of a charismatic young priest brings glorious miracles, ominous mysteries and renewed religious fervor to a dying town desperate to believe."


### Removing Unwanted Columns

In [33]:
netflix_titles$show_id <- NULL
netflix_titles$description <- NULL
netflix_titles$date_added <- NULL

In [34]:
names(netflix_titles)

#### Rearranging Columns

In [35]:
netflix_titles <- netflix_titles[,c("title","type","release_year","director","cast","rating","listed_in")]

In [36]:
names(netflix_titles)
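
Dropping and reordering columns can also be expressed as one step with dplyr's `select()`, which keeps only the named columns in the order given. This is a sketch on a hypothetical three-row data frame (`toy`), not a replacement for the cells above.

```r
library(dplyr)

# Hypothetical stand-in for the real data frame
toy <- data.frame(
  show_id     = c("s1", "s2", "s3"),
  type        = c("Movie", "TV Show", "TV Show"),
  title       = c("A", "B", "C"),
  description = c("x", "y", "z")
)

# Keep only the wanted columns, in the wanted order
toy_small <- toy %>% select(title, type)
names(toy_small)
```

Any column not named in `select()` is dropped, so a column that is needed in a later step has to be listed here.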

### Detecting Duplicate Records

In [40]:
duplicated_rows <- duplicated(netflix_titles)
netflix_titles[duplicated_rows,]

,title,type,release_year,director,cast,rating,listed_in
,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
5965,feb-09,TV Show,2018,,"Shahd El Yaseen, Shaila Sabt, Hala, Hanadi Al-Kandari, Salma Salem, Ibrahim Al-Harbi, Mahmoud Boushahri, Yousef Al Balushi, Ghorour, Abdullah Al-bloshi",TV-14,"International TV Shows, TV Dramas"
5966,22-jul,Movie,2018,Paul Greengrass,"Anders Danielsen Lie, Jon Øigarden, Jonas Strand Gravli, Ola G. Furuseth, Maria Bock, Thorbjørn Harr, Jaden Smith",R,"Dramas, Thrillers"
5967,15-Aug,Movie,2019,Swapnaneel Jayakar,"Rahul Pethe, Mrunmayee Deshpande, Adinath Kothare, Vaibhav Mangale, Jaywant Wadkar, Satish Pulekar, Naina Apte, Uday Tikekar",TV-14,"Comedies, Dramas, Independent Movies"


In [None]:
netflix_titles <- netflix_titles %>% distinct() # dplyr function that removes duplicate rows

In [44]:
sum(duplicated(netflix_titles))
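
As a small illustration of the two functions used here: `duplicated()` flags only the second and later copies of a row, and `distinct()` keeps the first copy of each. The data frame `toy` below is hypothetical.

```r
library(dplyr)

toy <- data.frame(
  title = c("A", "B", "A", "C"),
  type  = c("Movie", "TV Show", "Movie", "Movie")
)

# TRUE marks a row that repeats an earlier row
dup_flags <- duplicated(toy)
dup_flags

# distinct() keeps one copy of each row
deduped <- distinct(toy)
nrow(deduped)
sum(duplicated(deduped))
```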

### Checking Missing Values

In [None]:
# Convert every empty string in the data frame to NA
netflix_titles <- netflix_titles %>% 
  mutate(across(everything(), ~ na_if(.x, "")))
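
A minimal sketch of what this conversion does, using a hypothetical two-column data frame whose values are copied from the `str()` output shown earlier:

```r
library(dplyr)

toy <- data.frame(
  director = c("Kirsten Johnson", "", "Julien Leclercq"),
  country  = c("United States", "South Africa", ""),
  stringsAsFactors = FALSE
)

# Replace "" with NA in every column
toy_na <- toy %>% mutate(across(everything(), ~ na_if(.x, "")))
is.na(toy_na$director)
is.na(toy_na$country)
```

Empty strings are not counted by `is.na()`, which is why they need to be converted before missing values are tallied.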

In [60]:
missing_values <- any(is.na(netflix_titles))
missing_values

In [61]:
# Count the missing values in each column
netflix_titles %>%
  summarise(across(everything(), ~sum(is.na(.))))

title,type,release_year,director,cast,rating,listed_in
<int>,<int>,<int>,<int>,<int>,<int>,<int>
2,1,2,2635,826,6,3
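
The same per-column count can be obtained in base R with `colSums(is.na(...))`. The sketch below uses a hypothetical data frame and is only meant as a cross-check for the dplyr version.

```r
toy <- data.frame(
  director = c("Kirsten Johnson", NA, "Julien Leclercq", NA),
  rating   = c("PG-13", "TV-MA", NA, "TV-MA")
)

# Named vector with one NA count per column
na_counts <- colSums(is.na(toy))
na_counts
```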


#### Fixing Missing Values

In [63]:
netflix_titles$title[is.na(netflix_titles$title)] <- "Unknown title"
netflix_titles$type[is.na(netflix_titles$type)] <- "Unknown type"
netflix_titles$release_year[is.na(netflix_titles$release_year)] <- "Unknown year"
netflix_titles$director[is.na(netflix_titles$director)] <- "Unknown director"
netflix_titles$cast[is.na(netflix_titles$cast)] <- "Unknown cast"
netflix_titles$rating[is.na(netflix_titles$rating)] <- "Unknown rating"
netflix_titles$listed_in[is.na(netflix_titles$listed_in)] <- "Unknown list"

In [65]:
netflix_titles %>%
  summarise(across(everything(), ~sum(is.na(.))))

title,type,release_year,director,cast,rating,listed_in
<int>,<int>,<int>,<int>,<int>,<int>,<int>
0,0,0,0,0,0,0
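
The seven assignments above follow one pattern, so they could be written more compactly. The following is one possible sketch, assuming that a label of the form `"Unknown <column name>"` is acceptable; it would produce `"Unknown listed_in"` rather than the `"Unknown list"` used above. The data frame `toy` is hypothetical.

```r
library(dplyr)

toy <- data.frame(
  director = c("Kirsten Johnson", NA),
  cast     = c(NA, "Ama Qamata"),
  stringsAsFactors = FALSE
)

# For each character column, replace NA with "Unknown <column name>"
toy_filled <- toy %>%
  mutate(across(where(is.character),
                ~ coalesce(.x, paste("Unknown", cur_column()))))
toy_filled
```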


### Changing Data Type of Columns

In [68]:
# Rows filled with "Unknown year" cannot be parsed and become NA (R warns about the coercion)
netflix_titles$release_year <- as.numeric(netflix_titles$release_year)
str(netflix_titles$release_year)

 num [1:8808] 2020 2021 2021 2021 2021 ...
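
Because the missing years were filled with a text placeholder earlier, the numeric conversion turns those entries back into `NA`. A short sketch on a hypothetical vector:

```r
years <- c("2020", "2021", "Unknown year")

# as.numeric() cannot parse the placeholder, so it returns NA and warns;
# the warning is suppressed here only to keep the example output clean
converted <- suppressWarnings(as.numeric(years))
converted
is.na(converted)
```

For a column that will be converted to a number, leaving the missing entries as `NA` avoids this round trip.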


In [71]:
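# This cell fails as written: str_extract() comes from stringr, which is not loaded above,
# and `duration` was dropped when the columns were selected earlier
netflix_titles <- netflix_titles %>%
  mutate(seasons = ifelse(type == "TV Show", as.numeric(str_extract(duration, "\\d+")), NA))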

ERROR: Error in `mutate()`:
ℹ In argument: `seasons = ifelse(...)`.
Caused by error in `str_extract()`:
! could not find function "str_extract"

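The season extraction can be sketched on its own once stringr is loaded. The example assumes the stringr package is installed, and it uses plain vectors holding the `type` and `duration` values visible in the `str()` output rather than the full data frame.

```r
library(stringr)

# Sample values copied from the str() output shown earlier
type     <- c("Movie", "TV Show", "TV Show")
duration <- c("90 min", "2 Seasons", "1 Season")

# Extract the first run of digits, and keep it only for TV shows
seasons <- ifelse(type == "TV Show",
                  as.numeric(str_extract(duration, "\\d+")),
                  NA)
seasons
```

Applying this to `netflix_titles` would additionally require `duration` to still be present in the data frame.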

In [70]:
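# This cell fails as written: `date_added` was set to NULL in "Removing Unwanted Columns",
# so mdy() receives NULL and returns a zero-length Date
netflix_titles$date_added <- mdy(netflix_titles$date_added)
str(netflix_titles$date_added)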

ERROR: Error in `$<-.data.frame`(`*tmp*`, date_added, value = structure(numeric(0), class = "Date")): replacement has 0 rows, data has 8808
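
The following sketch shows both halves of this situation on a hypothetical two-row data frame: `mdy()` parsing strings in the format seen in `date_added`, and `$` returning `NULL` once the column has been removed. It assumes an English date locale, since the month names are written in English.

```r
library(lubridate)

toy <- data.frame(
  title      = c("Dick Johnson Is Dead", "Blood & Water"),
  date_added = c("September 25, 2021", "September 24, 2021")
)

# mdy() parses month-day-year strings into Date values
parsed <- mdy(toy$date_added)
parsed

# After a column is removed, `$` returns NULL, so there is nothing left to parse
toy$date_added <- NULL
is.null(toy$date_added)
```

Converting the column before it is dropped, or not dropping it at all, avoids the zero-row replacement error.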
