# **Investigating my Netflix Dataset**

### Dain Russell, 2020



## **Table of Contents**

* [Introduction](#introduction)
* [Data Wrangling](#data wrangling)
    * [General Properties](#general properties)
    * [Dataset Observations](#dataset observations)
    * [Data Cleaning](#data cleaning)
* [Exploratory Data Analysis](#exploratory data analysis)   
    * [Research Question 1](#research question 1)
    * [Research Question 2](#research question 2)
    * [Research Question 3](#research question 3)
    * [Research Question 4](#research question 4)
* [Conclusion](#conclusion)
* [Citations](#citations)





## **Introduction** <a class="anchor" id="introduction"></a>
> In this project, we analyzed a dataset and then communicated our findings about it. We used the Python libraries NumPy, pandas, Matplotlib, and seaborn to make the analysis easier.

> This dataset contains information about my Netflix viewing activity, collected from Netflix. It includes the profile name, start time, duration, title, and country.









#### Let's get started!
We set up the import statements for all of the packages we plan to use.



In [1]:
# import statements for all of the packages 
import pandas as pd  
import numpy as np  
import csv 
import seaborn as sns
import matplotlib.pyplot as plt

# Jupyter magic command so that plots are rendered inline in the notebook
%matplotlib inline




## **Data Wrangling**<a class="anchor" id="data wrangling"></a>
In this section of the report, the data is loaded, checked for cleanliness, and then trimmed and cleaned for analysis.







### **General Properties**<a class="anchor" id="general properties"></a>


Here we load and read the data into the pandas dataframe we are calling `netflix_df`.

**Now let's preview the first 5 rows of our data.**

In [7]:
# Load the CSV file into a DataFrame using the pandas read_csv function
netflix_df = pd.read_csv(r'C:\DataAnalytics\ViewingActivity.csv')

# Display the first five rows of the Netflix viewing activity data
netflix_df.head()


| Unnamed: 0 | Profile Name | Start Time | Duration | Attributes | Title | Supplemental Video Type | Device Type | Bookmark | Latest Bookmark | Country |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Dain | 2020-11-24 03:26:17 | 00:02:00 | | Bates Motel: Season 4: 'Til Death Do You Part ... | | Vizio MG186 MT5597DV CAST/HYBRID INX Smart TV | 00:02:46 | 00:02:46 | US (United States) |
| 1 | Dain | 2020-11-24 02:43:01 | 00:42:06 | | Bates Motel: Season 4: Goodnight Mother (Episo... | | Vizio MG186 MT5597DV CAST/HYBRID INX Smart TV | 00:42:56 | 00:42:56 | US (United States) |
| 2 | Dain | 2020-11-24 01:52:48 | 00:42:31 | Autoplayed: user action: Unspecified; | Bates Motel: Season 4: A Danger to Himself and... | | Vizio MG186 MT5597DV CAST/HYBRID INX Smart TV | 00:42:31 | 00:42:31 | US (United States) |
| 3 | Dain | 2020-11-24 01:52:41 | 00:00:05 | Autoplayed: user action: None; | Bates Motel: Season 5_hook_primary_16x9 | HOOK | Vizio MG186 MT5597DV CAST/HYBRID INX Smart TV | 00:00:05 | 00:00:05 | US (United States) |
| 4 | Dain | 2020-11-24 01:52:26 | 00:00:05 | Autoplayed: user action: None; | Survivor: Season 28_hook_primary_16x9 | HOOK | Vizio MG186 MT5597DV CAST/HYBRID INX Smart TV | 00:00:05 | 00:00:05 | US (United States) |
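
The `Unnamed: 0` column in the preview is most likely a row index that was saved into the CSV by an earlier export. As a minimal sketch, assuming the file keeps the column layout shown above, the index can be absorbed with `index_col=0` and `Start Time` can be parsed as a datetime at load time. The inline sample (`sample_csv`, `sample_df`) is a hypothetical stand-in for the real file, with abbreviated titles.

```python
import io

import pandas as pd

# Hypothetical inline sample standing in for ViewingActivity.csv
# (a subset of the columns from the preview, titles abbreviated)
sample_csv = io.StringIO(
    ",Profile Name,Start Time,Duration,Title\n"
    "0,Dain,2020-11-24 03:26:17,00:02:00,Bates Motel: Season 4\n"
    "1,Dain,2020-11-24 02:43:01,00:42:06,Bates Motel: Season 4\n"
)

# index_col=0 uses the first (unnamed) column as the row index instead of
# keeping it as a data column; parse_dates converts Start Time to datetime64
sample_df = pd.read_csv(sample_csv, index_col=0, parse_dates=["Start Time"])

print(sample_df.dtypes)
```

The same two arguments could be passed to the `pd.read_csv` call used for `netflix_df`, which would make a later drop of `Unnamed: 0` unnecessary.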


**Generating the shape of our original dataframe in terms of Rows and Columns.**

In [None]:
# dataframe.shape returns a (rows, columns) tuple
# Get the number of rows and columns
print("There are {} rows and {} columns in the dataset.".format(netflix_df.shape[0], netflix_df.shape[1]))


In [None]:
# Raw dataset summary that displays the non-null count and dtype of each column
netflix_df.info()

The summary above shows that the columns "Attributes" and "Supplemental Video Type" contain missing values; both are empty in several of the previewed rows.
Below we use a heatmap to visualize where the missing values occur.

In [None]:
#plot a heatmap to visualize the location of missing values
sns.heatmap(netflix_df.isnull())
plt.show()

for i in netflix_df.columns:
    null_rate = netflix_df[i].isna().sum() / len(netflix_df) * 100 
    if null_rate > 0 :
        print("{}'s null rate :{}%".format(i,round(null_rate,2)))
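
The per-column loop above can also be written without an explicit loop. The following is a minimal sketch on a small hypothetical frame (`demo_df` stands in for `netflix_df`, and its values are illustrative only): `isna().mean()` gives the fraction of missing values per column, which is then scaled to a percentage and filtered to columns with at least one missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for netflix_df; values are illustrative only
demo_df = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Attributes": [np.nan, "Autoplayed: user action: None;", np.nan, np.nan],
    "Supplemental Video Type": [np.nan, "HOOK", np.nan, np.nan],
})

# The mean of a boolean mask is the fraction of True values,
# i.e. the share of missing entries in each column
null_rates = demo_df.isna().mean().mul(100).round(2)

# Keep only the columns that actually have missing values
null_rates = null_rates[null_rates > 0]

print(null_rates)
```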


**Checking for Null Values**

In [None]:
netflix_df.isnull().sum()

**Descriptive Summary Statistics on Raw Data.**

In [None]:
# describe() generates a descriptive statistics summary
netflix_df.describe()

### **Dataset Observations**<a class="anchor" id="dataset observations"></a>

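* Each row records one viewing session: the profile name, start time, duration, title, device type, and country.

* `Duration`, `Bookmark`, and `Latest Bookmark` are stored as `HH:MM:SS` text, so they need to be converted before any numeric analysis.

* `Attributes` and `Supplemental Video Type` are empty in several of the previewed rows. The previewed rows marked `HOOK` are 5-second autoplayed clips (titles ending in `_hook_primary_16x9`), not full episodes.

* The `Unnamed: 0` column appears to be a leftover row index rather than viewing data.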



### **Data Cleaning**<a class="anchor" id="data cleaning"></a>
Let's clean up the data to make it easier to handle, since we are focusing on only a few columns.



**Dropping the columns that are not needed, with inplace=True so that the changes are kept.**
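
As a minimal sketch of this step: the list of columns to drop below is an assumption, chosen so that only the five columns named in the Introduction remain (profile name, start time, duration, title, and country). The one-row `demo_df` is a hypothetical stand-in for `netflix_df`.

```python
import pandas as pd

# Hypothetical one-row stand-in with the same column names as the preview
demo_df = pd.DataFrame([{
    "Unnamed: 0": 0,
    "Profile Name": "Dain",
    "Start Time": "2020-11-24 03:26:17",
    "Duration": "00:02:00",
    "Attributes": None,
    "Title": "Bates Motel: Season 4",
    "Supplemental Video Type": None,
    "Device Type": "Smart TV",
    "Bookmark": "00:02:46",
    "Latest Bookmark": "00:02:46",
    "Country": "US (United States)",
}])

# Assumed list of unneeded columns (not confirmed by the original notebook)
columns_to_drop = [
    "Unnamed: 0",
    "Attributes",
    "Supplemental Video Type",
    "Device Type",
    "Bookmark",
    "Latest Bookmark",
]

# inplace=True modifies demo_df directly instead of returning a new DataFrame
demo_df.drop(columns=columns_to_drop, inplace=True)

print(demo_df.columns.tolist())
```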

**Dataset after dropping unwanted columns.**

## **Exploratory Data Analysis**<a class="anchor" id="exploratory data analysis"></a>
With the data trimmed and cleaned, we move on to exploration. We compute statistics and create visualizations to address the research questions, looking at one variable at a time and then at the relationships between variables.

We start by using a histogram to visualize what the data looks like for each column.
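
A minimal sketch for one column, assuming `Duration` is still stored as `HH:MM:SS` text as in the preview: the values are converted to minutes with `pd.to_timedelta` and then plotted. The five durations in `demo_df` are taken from the previewed rows; the bin count and axis labels are illustrative choices.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Durations copied from the five previewed rows (stand-in for netflix_df)
demo_df = pd.DataFrame({
    "Duration": ["00:02:00", "00:42:06", "00:42:31", "00:00:05", "00:00:05"]
})

# Convert HH:MM:SS text to a numeric number of minutes
duration_min = pd.to_timedelta(demo_df["Duration"]).dt.total_seconds() / 60

# Plot the distribution of viewing durations
fig, ax = plt.subplots()
ax.hist(duration_min, bins=10)
ax.set_xlabel("Duration (minutes)")
ax.set_ylabel("Number of viewing sessions")
ax.set_title("Distribution of viewing duration")
```

In a notebook with `%matplotlib inline`, the figure is rendered below the cell. Other time-like columns would need the same conversion before they can be plotted this way.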