# Netflix Movies analysis project

## Project goals
- Practice using libraries such as pandas and matplotlib
- Practice framing and explaining specific analysis questions
- Practice data visualization

## Data
The data comes from Kaggle, an online data science community that hosts datasets and other resources.
The file I used is available at: https://www.kaggle.com/datasets/rahulvyasm/netflix-movies-and-tv-shows

## Data description
- show_id: A unique identifier for each title.
- type: The category of the title, which is either 'Movie' or 'TV Show'.
- title: The name of the movie or TV show.
- director: The director(s) of the movie or TV show. (Contains null values for some entries, especially TV shows where this information might not be applicable.)
- cast: The list of main actors/actresses in the title. (Some entries might not have this information.)
- country: The country or countries where the movie or TV show was produced.
- date_added: The date the title was added to Netflix.
- release_year: The year the movie or TV show was originally released.
- rating: The age rating of the title.
- duration: The duration of the title, in minutes for movies and seasons for TV shows.
- listed_in: The genres the title falls under.
- description: A brief summary of the title.
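
Because `duration` mixes two units (minutes for movies, seasons for TV shows), it usually needs to be split before any numeric analysis. The block below is a minimal sketch of one way to do that. It assumes values look like `90 min` or `2 Seasons`; the sample rows and the column names `duration_value` and `duration_unit` are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Hypothetical sample rows; the exact value formats are an assumption about the dataset.
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "TV Show"],
    "duration": ["90 min", "2 Seasons", "1 Season"],
})

# Split each value into a numeric part and a unit part.
parts = sample["duration"].str.extract(r"^(?P<value>\d+)\s+(?P<unit>\w+)$")

# Store the number as an integer (this sample has no missing values).
sample["duration_value"] = parts["value"].astype(int)

# Normalise "Season"/"Seasons" to a single lowercase label; "min" is unchanged.
sample["duration_unit"] = parts["unit"].str.lower().str.rstrip("s")

print(sample)
```

On the real data, rows with a missing `duration` would need to be handled before the integer conversion.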


### Importing libraries

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Loading data

In [4]:
netflix_movies = pd.read_csv('data/netflix_titles.csv', delimiter=',', decimal='.', index_col=0, encoding='latin1')

### Data exploration

In [5]:
netflix_movies.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8809 entries, s1 to s8809
Data columns (total 25 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   type          8809 non-null   object 
 1   title         8809 non-null   object 
 2   director      6175 non-null   object 
 3   cast          7984 non-null   object 
 4   country       7978 non-null   object 
 5   date_added    8799 non-null   object 
 6   release_year  8809 non-null   int64  
 7   rating        8805 non-null   object 
 8   duration      8806 non-null   object 
 9   listed_in     8809 non-null   object 
 10  description   8809 non-null   object 
 11  Unnamed: 12   0 non-null      float64
 12  Unnamed: 13   0 non-null      float64
 13  Unnamed: 14   0 non-null      float64
 14  Unnamed: 15   0 non-null      float64
 15  Unnamed: 16   0 non-null      float64
 16  Unnamed: 17   0 non-null      float64
 17  Unnamed: 18   0 non-null      float64
 18  Unnamed: 19   0 non-null      f

### Data cleaning

In [6]:
# drop unnamed columns
netflix_movies.drop(netflix_movies.columns[netflix_movies.columns.str.contains('unnamed', case=False)], axis=1, inplace=True)
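
The same cleanup can also be written without `inplace=True`, by selecting the columns to keep with a boolean mask. This is only a sketch on a small made-up frame (`toy` and `cleaned` are hypothetical names), assuming the empty columns all start with `Unnamed`, as in the `info()` output:

```python
import numpy as np
import pandas as pd

# Toy frame that mimics the empty "Unnamed: N" columns created by trailing commas in the CSV.
toy = pd.DataFrame({
    "title": ["Example A", "Example B"],
    "Unnamed: 12": [np.nan, np.nan],
    "Unnamed: 13": [np.nan, np.nan],
})

# True for every column whose name starts with "Unnamed".
unnamed_mask = toy.columns.str.startswith("Unnamed")

# Keep only the columns that are not flagged; the original frame is left untouched.
cleaned = toy.loc[:, ~unnamed_mask]

print(list(cleaned.columns))
```

Returning a new frame instead of mutating in place makes the cell safe to re-run.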

In [7]:
# null counts
null_counts = netflix_movies.isnull().sum()
print("Number of null values in each column:\n{}".format(null_counts))

Number of null values in each column:
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64
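
Raw counts are easier to judge as a share of all rows. The sketch below recomputes the percentages from the counts printed above and the 8809 rows reported by `info()`; `null_share` is a name introduced here for illustration.

```python
import pandas as pd

# Null counts copied from the output above (columns with zero nulls omitted).
null_counts = pd.Series({
    "director": 2634,
    "cast": 825,
    "country": 831,
    "date_added": 10,
    "rating": 4,
    "duration": 3,
})

# Row count reported by netflix_movies.info().
total_rows = 8809

# Percentage of missing values per column, largest first.
null_share = (null_counts / total_rows * 100).round(1).sort_values(ascending=False)

print(null_share)
```

On the DataFrame itself, `netflix_movies.isnull().mean() * 100` should give the same percentages directly.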


In [8]:
netflix_movies.describe()

       release_year
count  8809.0
mean   2014.181292
std    8.818932
min    1925.0
25%    2013.0
50%    2017.0
75%    2019.0
max    2024.0


In [9]:
text_columns = netflix_movies.select_dtypes(include=['object']).columns

In [10]:
text_columns

Index(['type', 'title', 'director', 'cast', 'country', 'date_added', 'rating',
       'duration', 'listed_in', 'description'],
      dtype='object')
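
One possible use for a list of text columns is to trim stray whitespace from all of them in one pass. The notebook itself only strips `date_added`, so treat the following as an optional sketch on made-up data (`toy` is a hypothetical frame), not as a step of the original analysis:

```python
import pandas as pd

# Toy frame with object-typed text columns and one numeric column.
toy = pd.DataFrame({
    "title": pd.Series(["  Example A ", "Example B"], dtype="object"),
    "country": pd.Series([" Poland", None], dtype="object"),
    "release_year": [2020, 2021],
})

# Same selection as in the notebook: columns stored with the object dtype.
text_columns = toy.select_dtypes(include=["object"]).columns

# Strip leading and trailing whitespace; missing values stay missing.
for column in text_columns:
    toy[column] = toy[column].str.strip()

print(toy)
```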

In [17]:
# convert dates to datetime
netflix_movies['date_added'] = netflix_movies['date_added'].str.strip()
# infer_datetime_format is deprecated in pandas 2.x; format inference is now the default
netflix_movies['date_added'] = pd.to_datetime(netflix_movies['date_added'])
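
After a conversion like this, it is worth checking how many values failed to parse and then deriving the parts of the date needed for analysis. The sketch below uses made-up sample values and assumes dates are written like `September 25, 2021`; the explicit `format` string and `errors="coerce"` are choices made for this example, not something the notebook uses.

```python
import pandas as pd

# Hypothetical sample values, including stray whitespace and a missing entry.
raw_dates = pd.Series(["September 25, 2021", " August 4, 2017", None])

# Strip whitespace, then parse with an explicit format; unparseable values become NaT.
parsed = pd.to_datetime(raw_dates.str.strip(), format="%B %d, %Y", errors="coerce")

# Count the values that could not be converted.
print("Unparsed values:", parsed.isna().sum())

# Derive the year each title was added (NaN where the date is missing).
year_added = parsed.dt.year
print(year_added)
```

An explicit format makes the parsing rule visible and avoids relying on inference from the first value.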
