# Investigating Netflix Movies Project in DataCamp



## The data
### **netflix_data.csv**
| Column | Description |
|--------|-------------|
| `show_id` | The ID of the show |
| `type` | Type of show |
| `title` | Title of the show |
| `director` | Director of the show |
| `cast` | Cast of the show |
| `country` | Country of origin |
| `date_added` | Date added to Netflix |
| `release_year` | Year the title was originally released |
| `duration` | Duration of the show in minutes |
| `description` | Description of the show |
| `genre` | Show genre |

In [1]:
# Importing pandas and matplotlib
import pandas as pd
import matplotlib.pyplot as plt

# Read in the Netflix CSV as a DataFrame
netflix_df = pd.read_csv("netflix_data.csv")

## Pandas head() Method
In pandas, the `head()` method returns the first `n` rows of a DataFrame or Series (five by default).

In [2]:
# Display the first five rows
netflix_df.head(5)

| | show_id | type | title | director | cast | country | date_added | release_year | duration | description | genre |
|---|---------|------|-------|----------|------|---------|------------|--------------|----------|-------------|-------|
| 0 | s2 | Movie | 7:19 | Jorge Michel Grau | Demián Bichir, Héctor Bonilla, Oscar Serrano, ... | Mexico | December 23, 2016 | 2016 | 93 | After a devastating earthquake hits Mexico Cit... | Dramas |
| 1 | s3 | Movie | 23:59 | Gilbert Chan | Tedd Chan, Stella Chung, Henley Hii, Lawrence ... | Singapore | December 20, 2018 | 2011 | 78 | When an army recruit is found dead, his fellow... | Horror Movies |
| 2 | s4 | Movie | 9 | Shane Acker | Elijah Wood, John C. Reilly, Jennifer Connelly... | United States | November 16, 2017 | 2009 | 80 | In a postapocalyptic world, rag-doll robots hi... | Action |
| 3 | s5 | Movie | 21 | Robert Luketic | Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar... | United States | January 1, 2020 | 2008 | 123 | A brilliant group of students become card-coun... | Dramas |
| 4 | s6 | TV Show | 46 | Serdar Akar | Erdal Beşikçioğlu, Yasemin Allen, Melis Birkan... | Turkey | July 1, 2017 | 2016 | 1 | A genetics professor experiments with a treatm... | International TV |
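The number of rows returned by `head()` can be adjusted through its `n` argument. The following is a minimal sketch on a made-up DataFrame; `toy_df` and its values are illustrative assumptions, not rows from `netflix_data.csv`.

```python
import pandas as pd

# Toy DataFrame standing in for netflix_df (values are invented for illustration)
toy_df = pd.DataFrame({
    "title": list("ABCDEFG"),
    "duration": [90, 85, 120, 60, 101, 77, 95],
})

first_five = toy_df.head()          # no argument: n defaults to 5
first_two = toy_df.head(2)          # only the first two rows
all_but_last_two = toy_df.head(-2)  # negative n: every row except the last two

print(first_two)
```

A negative `n` is handy when the trailing rows of a file are known to be footers or totals.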


## Pandas DataFrame shape Property
The shape property returns a tuple containing the number of rows and columns in the DataFrame.

### Python Tuple
A tuple stores multiple items in a single variable.
It is one of Python's four built-in collection types, alongside lists, sets, and dictionaries.
Tuples are ordered, immutable, and allow duplicate values.

In [3]:
# Dimensions of the dataset as (rows, columns)
netflix_df.shape

(4812, 11)
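The `(4812, 11)` result is an ordinary Python tuple, so the usual tuple operations apply to it. A small sketch of the properties listed above, using a hard-coded copy of that value rather than the real DataFrame (the variable names are illustrative):

```python
# A tuple in the same (rows, columns) form that DataFrame.shape returns
shape = (4812, 11)

# Ordered: items can be read by position or unpacked into names
n_rows, n_cols = shape
print(shape[0], n_cols)

# Immutable: assigning to an element raises TypeError
try:
    shape[0] = 5000
    mutated = True
except TypeError:
    mutated = False

# Duplicate values are allowed
with_duplicates = (1, 1, 2)
```

Unpacking into `n_rows, n_cols` tends to read more clearly than indexing with `shape[0]` and `shape[1]` later in a notebook.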

## Pandas DataFrame dtypes Property
The `DataFrame.dtypes` attribute returns a Series with the data type of each column.
Columns with mixed types, and the text columns in the output below, are stored with the `object` dtype.


In [4]:
# Show the data type of each column
netflix_df.dtypes

show_id         object
type            object
title           object
director        object
cast            object
country         object
date_added      object
release_year     int64
duration         int64
description     object
genre           object
dtype: object
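To see how a column ends up as `object`, and how a text column such as `date_added` could be converted to a proper datetime type, here is a hedged sketch on a toy DataFrame. The `mixed` column is hypothetical, and the date format string assumes every date looks like the three samples shown in the `head()` output.

```python
import pandas as pd

# Toy data: one integer column, one deliberately mixed column, one date-as-text column
toy_df = pd.DataFrame({
    "release_year": [2016, 2011, 2009],
    "mixed": [1, "two", 3.0],  # int, str and float together fall back to object
    "date_added": ["December 23, 2016", "December 20, 2018", "November 16, 2017"],
})
print(toy_df.dtypes)

# Parse the text dates; format assumes "Month day, year" as in the samples
parsed_dates = pd.to_datetime(toy_df["date_added"], format="%B %d, %Y")
print(parsed_dates.dtype)
```

Once parsed, components such as `parsed_dates.dt.year` become available, which plain text columns do not offer.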

In [5]:
# Quick overview of the DataFrame: column dtypes, row count, and non-null counts per column.
# info() prints its report directly and returns None, so it is not wrapped in print().
netflix_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4812 entries, 0 to 4811
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       4812 non-null   object
 1   type          4812 non-null   object
 2   title         4812 non-null   object
 3   director      4812 non-null   object
 4   cast          4812 non-null   object
 5   country       4812 non-null   object
 6   date_added    4812 non-null   object
 7   release_year  4812 non-null   int64 
 8   duration      4812 non-null   int64 
 9   description   4812 non-null   object
 10  genre         4812 non-null   object
dtypes: int64(2), object(9)
memory usage: 413.7+ KB


In [6]:
# Count fully duplicated rows
duplicate_rows_netflix_df = netflix_df[netflix_df.duplicated()]
print("Number of duplicate rows:", len(duplicate_rows_netflix_df))

Number of duplicate rows: 0
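`duplicated()` flags a row only when every column matches an earlier row. Its `subset` argument narrows the comparison to chosen columns. A minimal sketch on invented data (the titles and years below are placeholders):

```python
import pandas as pd

# Toy data: row 2 repeats row 0 exactly; row 3 repeats only the title
toy_df = pd.DataFrame({
    "title": ["A", "B", "A", "A"],
    "release_year": [1995, 1999, 1995, 2001],
})

# Rows identical to an earlier row across all columns
n_full_duplicates = int(toy_df.duplicated().sum())

# Rows whose title alone has appeared before
n_title_duplicates = int(toy_df.duplicated(subset=["title"]).sum())

# Drop the fully duplicated rows, keeping the first occurrence
deduplicated = toy_df.drop_duplicates()

print(n_full_duplicates, n_title_duplicates, len(deduplicated))
```

Checking a subset such as the title can surface near-duplicates that a whole-row comparison would miss.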


In [7]:
# Count null values per column
print(netflix_df.isnull().sum())

show_id         0
type            0
title           0
director        0
cast            0
country         0
date_added      0
release_year    0
duration        0
description     0
genre           0
dtype: int64
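This dataset has no missing values, so the counts are all zero. For comparison, here is a sketch of what the same check reports when values are missing, using a made-up two-column DataFrame:

```python
import pandas as pd

# Toy data with one missing director and one missing duration (illustrative only)
toy_df = pd.DataFrame({
    "director": ["Jorge Michel Grau", None, "Shane Acker"],
    "duration": [93, None, 80],
})

# Number of missing values in each column
missing_per_column = toy_df.isnull().sum()
print(missing_per_column)

# One possible follow-up: keep only rows with no missing values
complete_rows = toy_df.dropna()
```

Whether to drop or fill missing rows depends on the analysis; `dropna()` is shown here only as one option.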


In [8]:
# Summary statistics for the numeric columns
netflix_df.describe()

| | release_year | duration |
|-------|--------------|----------|
| count | 4812.0 | 4812.0 |
| mean | 2012.711554 | 99.566708 |
| std | 9.517978 | 30.889305 |
| min | 1942.0 | 1.0 |
| 25% | 2011.0 | 88.0 |
| 50% | 2016.0 | 99.0 |
| 75% | 2018.0 | 116.0 |
| max | 2021.0 | 253.0 |


In [9]:
# Count rows per type
print(netflix_df.value_counts('type'))

type
Movie      4677
TV Show     135
dtype: int64
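Raw counts can also be expressed as proportions by passing `normalize=True` to `value_counts()`. A short sketch on a toy Series (the four values are placeholders, not the real column):

```python
import pandas as pd

# Toy stand-in for the "type" column
types = pd.Series(["Movie", "Movie", "TV Show", "Movie"])

counts = types.value_counts()                # absolute counts per category
shares = types.value_counts(normalize=True)  # fraction of rows per category

print(counts)
print(shares)
```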


In [10]:
# Keep only movies and confirm the count
netflix_df_subset=netflix_df[netflix_df["type"]=="Movie"]
print(netflix_df_subset.value_counts('type'))

type
Movie    4677
dtype: int64


In [11]:
# Select movies released from 1990 through 1999
is_90s = (netflix_df['release_year'] >= 1990) & (netflix_df['release_year'] < 2000)
new_netflix_df = netflix_df[is_90s & (netflix_df['type'] == 'Movie')]
print(new_netflix_df.value_counts('type'))

type
Movie    183
dtype: int64
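The year range in the filter can also be written with `Series.between()`, which is inclusive on both ends by default. The sketch below checks on invented rows that both forms select the same records; `toy_df` is an assumption for illustration.

```python
import pandas as pd

# Toy data covering the edge cases: 1989 (out), 1990 and 1999 (in), a TV show (out)
toy_df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "release_year": [1989, 1995, 1990, 1999],
})

# Explicit comparisons combined with & (each condition needs its own parentheses)
mask = (
    (toy_df["release_year"] >= 1990)
    & (toy_df["release_year"] < 2000)
    & (toy_df["type"] == "Movie")
)
with_mask = toy_df[mask]

# Same selection using between(), inclusive of 1990 and 1999
is_movie = toy_df["type"] == "Movie"
with_between = toy_df[toy_df["release_year"].between(1990, 1999) & is_movie]

print(with_between)
```

The parentheses matter because `&` binds more tightly than comparison operators such as `>=` in Python.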


In [12]:
# Keep only the columns needed for the analysis
keep = ['title', 'country', 'genre', 'release_year', 'duration']
netflix_movies = new_netflix_df[keep]
print(netflix_movies)

                                title        country  ... release_year  duration
6                                 187  United States  ...         1997       119
118                 A Dangerous Woman  United States  ...         1993       101
145            A Night at the Roxbury  United States  ...         1998        82
167   A Thin Line Between Love & Hate  United States  ...         1996       108
194                      Aashik Awara          India  ...         1993       154
...                               ...            ...  ...          ...       ...
4672                      West Beirut         France  ...         1999       106
4689      What's Eating Gilbert Grape  United States  ...         1993       118
4718                   Wild Wild West  United States  ...         1999       106
4746                       Wyatt Earp  United States  ...         1994       191
4756                      Yaar Gaddar          India  ...         1994       148

[183 rows x 5 columns]
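If the column subset will be modified later, taking an explicit copy keeps the change from touching the original DataFrame and avoids pandas' chained-assignment warning. A hedged sketch on toy data; the `duration_hours` column is a hypothetical addition, not part of the project:

```python
import pandas as pd

# Toy stand-in for new_netflix_df (values are illustrative)
toy_df = pd.DataFrame({
    "title": ["A", "B"],
    "country": ["India", "France"],
    "duration": [154, 106],
    "description": ["x", "y"],
})

keep = ["title", "country", "duration"]
subset = toy_df[keep].copy()  # independent copy of the selected columns

# Adding a column to the copy leaves toy_df unchanged
subset["duration_hours"] = subset["duration"] / 60

print(subset.columns.tolist())
```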


In [None]:
# Plot a histogram of movie durations (x-axis: duration in minutes, y-axis: number of movies)
plt.hist(netflix_movies['duration'])
plt.xlabel('Duration (minutes)')
plt.ylabel('Number of movies')
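Since this cell has not been run, here is a self-contained sketch of the same kind of plot that can be executed without the CSV. It uses only the ten durations visible in the printed subset above, so the result is not the real distribution of all 183 movies. The bin count, labels, and title are assumptions chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# The ten durations shown in the truncated printout of netflix_movies
durations = [119, 101, 82, 108, 154, 106, 118, 106, 191, 148]

fig, ax = plt.subplots()
# hist() returns the count per bin, the bin edges, and the drawn patches
counts, bin_edges, _ = ax.hist(durations, bins=10)
ax.set_xlabel("Duration (minutes)")
ax.set_ylabel("Number of movies")
ax.set_title("Movie durations, 1990-1999 (sample of ten)")
```

In a notebook the `matplotlib.use("Agg")` line can be dropped and the figure will render inline.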