# Assignment: Streaming Service Data Analysis
Course: IMDD-A-EXP-25 (2025/2026)

Student:

### Overview
In this assignment, you will choose one of the popular streaming services—Netflix, Amazon Prime, Disney+, or Hulu—and perform a comprehensive data analysis on the content available on that platform. The goal is to apply the data analysis techniques you've learned, including data loading, inspection, cleaning, selection, aggregation, and type conversion, to gain insights into the content offered by your chosen streaming service.


### Objective
By the end of this assignment, you will:

- Load and inspect a dataset for your chosen streaming service.
- Clean the data by handling missing values and converting data types as necessary.
- Perform basic data selection and filtering to focus your analysis on specific aspects of the dataset.
- Calculate and analyze summary statistics to derive insights into the content available on the platform.

### Instructions
Choose Your Dataset: Select one dataset representing content from Netflix, Amazon Prime, Disney+, or Hulu. The datasets should be similar in structure, containing information such as titles, type (Movie or TV Show), director, cast, country of origin, date added, release year, rating, duration, and genre. From there do the following activities:
1. Loading Data: Load the dataset using Pandas. Display the first few rows of the dataset to get an initial overview.
2. Data Inspection: Use `info()` to summarize the dataset's structure, checking for missing data and understanding the data types. Use `describe()` to generate summary statistics for numerical columns.
3. Handling Missing Data: Identify any missing values in the dataset and choose an appropriate method to handle them (e.g., dropping rows, filling with default values). Document your approach and reasoning.
4. Data Selection and Filtering: Select specific rows and columns to focus on a particular aspect of the data (e.g., filtering movies released in a specific year, selecting titles from a particular country). Perform conditional filtering to narrow down your dataset based on specific criteria. Be creative.
5. Basic Aggregations: Calculate key statistics such as the total number of entries, the number of movies and TV shows, the distribution of content ratings, and the average duration of movies. Generate counts or averages based on specific categories, such as the number of titles by genre or by country.
6. Data Type Conversion: Convert any columns that are not in the appropriate format (e.g., convert date columns to datetime objects, convert string-based numerical columns to integers). Verify the conversions by inspecting the data types again.


## 1. Loading Data with Pandas

Easy and clear to understand

In [3]:
# Importing Pandas Library
import pandas as pd

# Load the dataset from the CSV file

df_netflix = pd.read_csv('netflix.csv')

# Display the first few rows from the dataset

df_netflix.head(3)





Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
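
The loading step can be tried without the real file. The sketch below is only an illustration: it reads a hypothetical three-row sample from an in-memory string (`sample_csv` and `df_sample` are made-up names, and the sample keeps only five of the twelve columns) and checks the shape before looking at the rows.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for netflix.csv (assumption:
# only five of the real columns are kept to keep the example short).
sample_csv = io.StringIO(
    "show_id,type,title,release_year,duration\n"
    "s1,Movie,Dick Johnson Is Dead,2020,90 min\n"
    "s2,TV Show,Blood & Water,2021,2 Seasons\n"
    "s3,TV Show,Ganglands,2021,1 Season\n"
)

# read_csv accepts a file path or any file-like object
df_sample = pd.read_csv(sample_csv)

print(df_sample.shape)    # (number of rows, number of columns)
print(df_sample.head(2))  # first two rows
```

Checking `shape` right after loading is a quick way to confirm that the whole file was read before moving on to inspection.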


## 2. DataFrame Inspection

In [4]:
# Summary of Dataframe
df_netflix.info()

# Generate descriptive statistics for numerical columns
# (not displayed here: a notebook cell shows only the result of its last expression)

df_netflix.describe()

# Generate descriptive statistics for all columns, including categorical ones
df_netflix.describe(include='all')


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
count,8807,8807,8807,6173,7982,7976,8797,8807.0,8803,8804,8807,8807
unique,8807,2,8807,4528,7692,748,1767,,17,220,514,8775
top,s8807,Movie,Zubaan,Rajiv Chilaka,David Attenborough,United States,"January 1, 2020",,TV-MA,1 Season,"Dramas, International Movies","Paranormal activity at a lush, abandoned prope..."
freq,1,6131,1,19,19,2818,109,,3207,1793,362,4
mean,,,,,,,,2014.180198,,,,
std,,,,,,,,8.819312,,,,
min,,,,,,,,1925.0,,,,
25%,,,,,,,,2013.0,,,,
50%,,,,,,,,2017.0,,,,
75%,,,,,,,,2019.0,,,,
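
The `info()` output reports non-null counts, so the number of missing values has to be worked out by subtraction (for example, `director` has 6173 non-null values out of 8807, which leaves 2634 missing). A minimal sketch of computing that directly, using a small made-up frame (`df_sample` and `missing_summary` are illustrative names, not part of the assignment):

```python
import pandas as pd

# Hypothetical four-row frame with some gaps (assumption: toy data only)
df_sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", None, None, "Y"],
    "country": ["India", "United States", None, "India"],
})

# isna() marks missing cells; sum() counts them, mean() gives the share
missing_count = df_sample.isna().sum()
missing_percent = (df_sample.isna().mean() * 100).round(1)

missing_summary = pd.DataFrame(
    {"missing": missing_count, "percent": missing_percent}
).sort_values("missing", ascending=False)

print(missing_summary)
```

Sorting by the missing count puts the most incomplete columns first, which helps when deciding how to handle each one.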


## 3. Handling Missing Data

I did not fully understand all of the functions; however, the DataCamp videos are helping.

In [23]:
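# Count the missing values in each column
missing_data = df_netflix.isnull().sum()

missing_data

# Option 1: Drop rows with missing values
# (stored in a new variable; a dot in the name would set an attribute on the DataFrame instead)
df_netflix_cleaned = df_netflix.dropna()

# Original DataFrame, for comparison
df_netflix.info()

# Option 2: Drop columns with missing values

df_netflix_cleaned_columns = df_netflix.dropna(axis=1)

# Rows remaining after Option 1
df_netflix_cleaned.info()

# Option 3: Fill missing values with a specific value
df_filled = df_netflix.fillna('Unknown')
print(df_filled)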







<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   show_id       8807 non-null   object 
 1   type          8807 non-null   object 
 2   title         8807 non-null   object 
 3   director      6173 non-null   object 
 4   cast          7982 non-null   object 
 5   country       7976 non-null   object 
 6   date_added    8797 non-null   object 
 7   release_year  8807 non-null   int64  
 8   rating        8803 non-null   object 
 9   duration      6128 non-null   float64
 10  listed_in     8807 non-null   object 
 11  description   8807 non-null   object 
dtypes: float64(1), int64(1), object(10)
memory usage: 825.8+ KB
<class 'pandas.core.frame.DataFrame'>
Index: 5185 entries, 7 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   show_id       5185 non-null   obj

In [26]:
# Option 4: Fill missing numerical data with the mean of the column
df_filled_mean = df_netflix.fillna(df_netflix.mean(numeric_only=True))
print(df_filled_mean)

     show_id     type                  title         director  \
0         s1    Movie   Dick Johnson Is Dead  Kirsten Johnson   
1         s2  TV Show          Blood & Water              NaN   
2         s3  TV Show              Ganglands  Julien Leclercq   
3         s4  TV Show  Jailbirds New Orleans              NaN   
4         s5  TV Show           Kota Factory              NaN   
...      ...      ...                    ...              ...   
8802   s8803    Movie                 Zodiac    David Fincher   
8803   s8804  TV Show            Zombie Dumb              NaN   
8804   s8805    Movie             Zombieland  Ruben Fleischer   
8805   s8806    Movie                   Zoom     Peter Hewitt   
8806   s8807    Movie                 Zubaan      Mozez Singh   

                                                   cast        country  \
0                                                   NaN  United States   
1     Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...   South Africa 
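
The options above treat every column the same way. One possible alternative, shown here only as a sketch on made-up data, is to choose a strategy per column: fill descriptive text columns with a placeholder and drop rows only when a value the later analysis depends on is missing. The choice of which columns to fill and which to drop on is an assumption for illustration, not a recommendation from the course.

```python
import pandas as pd

# Hypothetical four-row frame (assumption: toy data, same column names as the dataset)
df_sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", None, "Z", None],
    "country": ["India", None, "Ghana", "India"],
    "date_added": ["September 25, 2021", "September 24, 2021", None, "January 1, 2020"],
})

# Fill descriptive text columns with a placeholder; a dict maps column -> fill value
df_clean = df_sample.fillna({"director": "Unknown", "country": "Unknown"})

# Drop only the rows that lack a date, instead of every row with any gap
df_clean = df_clean.dropna(subset=["date_added"])

print(df_clean.isna().sum())
print(len(df_sample), "rows before,", len(df_clean), "rows after")
```

Compared with a plain `dropna()`, this keeps far more rows, because a missing director no longer removes the whole title.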

## 4. Selecting Data


In [27]:
# Select a single column
titles = df_netflix['title']
titles


# Select multiple columns
titles_and_cast = df_netflix[['title','cast']]
titles_and_cast

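# Select a single row by index label (label 5 is the sixth row, because labels start at 0)
row_label_5 = df_netflix.loc[5]
row_label_5

# Select multiple rows by index label (loc slices include both endpoints, so this returns labels 1 and 2)
rows_1_to_2 = df_netflix.loc[1:2]
rows_1_to_2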





Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...


Relatively easier

In [7]:
# Select rows based on a condition: all titles (movies and TV shows) whose country is exactly 'India'
titles_india = df_netflix[df_netflix['country'] == 'India']
titles_india

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
24,s25,Movie,Jeans,S. Shankar,"Prashanth, Aishwarya Rai Bachchan, Sri Lakshmi...",India,"September 21, 2021",1998,TV-14,166 min,"Comedies, International Movies, Romantic Movies",When the father of the man she loves insists t...
39,s40,TV Show,Chhota Bheem,,"Vatsal Dubey, Julie Tejwani, Rupa Bhimani, Jig...",India,"September 16, 2021",2021,TV-Y7,3 Seasons,Kids' TV,"A brave, energetic little boy with superhuman ..."
50,s51,TV Show,Dharmakshetra,,"Kashmira Irani, Chandan Anand, Dinesh Mehta, A...",India,"September 15, 2021",2014,TV-PG,1 Season,"International TV Shows, TV Dramas, TV Sci-Fi &...","After the ancient Great War, the god Chitragup..."
66,s67,TV Show,Raja Rasoi Aur Anya Kahaniyan,,,India,"September 15, 2021",2014,TV-G,1 Season,"Docuseries, International TV Shows",Explore the history and flavors of regional In...
...,...,...,...,...,...,...,...,...,...,...,...,...
8773,s8774,Movie,Yanda Kartavya Aahe,Kedar Shinde,"Ankush Choudhary, Smita Shewale, Mohan Joshi, ...",India,"January 1, 2018",2006,TV-PG,151 min,"Comedies, Dramas, International Movies",Thanks to an arranged marriage that was design...
8775,s8776,TV Show,Yeh Meri Family,,"Vishesh Bansal, Mona Singh, Akarsh Khurana, Ah...",India,"August 31, 2018",2018,TV-PG,1 Season,"International TV Shows, TV Comedies","In the summer of 1998, middle child Harshu bal..."
8798,s8799,Movie,Zed Plus,Chandra Prakash Dwivedi,"Adil Hussain, Mona Singh, K.K. Raina, Sanjay M...",India,"December 31, 2019",2014,TV-MA,131 min,"Comedies, Dramas, International Movies",A philandering small-town mechanic's political...
8799,s8800,Movie,Zenda,Avadhoot Gupte,"Santosh Juvekar, Siddharth Chandekar, Sachit P...",India,"February 15, 2018",2009,TV-14,120 min,"Dramas, International Movies",A change in the leadership of a political part...


In [8]:
# Select specific rows and columns
specific_data = df_netflix.loc[1:10, ['title', 'country', 'rating']]
specific_data

Unnamed: 0,title,country,rating
1,Blood & Water,South Africa,TV-MA
2,Ganglands,,TV-MA
3,Jailbirds New Orleans,,TV-MA
4,Kota Factory,India,TV-MA
5,Midnight Mass,,TV-MA
6,My Little Pony: A New Generation,,PG
7,Sankofa,"United States, Ghana, Burkina Faso, United Kin...",TV-MA
8,The Great British Baking Show,United Kingdom,TV-14
9,The Starling,United States,PG-13
10,"Vendetta: Truth, Lies and The Mafia",,TV-MA


In [9]:
# Select rows using integer-location based indexing
first_three_rows_iloc = df_netflix.iloc[0:3]
first_three_rows_iloc


Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
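
An exact comparison such as `df_netflix['country'] == 'India'` matches only rows whose country field is exactly that string, yet some rows list several countries in one field (row 7 shows "United States, Ghana, Burkina Faso, ..."). The sketch below, on made-up data, shows one way to match a country inside such a field and to combine several conditions; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical four-row frame (assumption: toy data)
df_sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "country": ["India", "India", "United States, India", None],
    "release_year": [1998, 2021, 2015, 2009],
})

# Exact match: misses the co-production in row "C"
exact_india = df_sample[df_sample["country"] == "India"]

# Substring match; na=False treats a missing country as "no match"
is_india = df_sample["country"].str.contains("India", na=False)
is_movie = df_sample["type"] == "Movie"
after_2000 = df_sample["release_year"] > 2000

# Combine boolean masks with & (and), | (or), ~ (not), then pick columns with loc
indian_movies_after_2000 = df_sample.loc[
    is_india & is_movie & after_2000, ["title", "country", "release_year"]
]

print(exact_india["title"].tolist())
print(indian_movies_after_2000)
```

Naming each mask separately keeps the final filter readable and makes it easy to test the conditions one at a time.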


## 5. Basic Aggregations

I found basic aggregations a bit harder to code mostly because the functions become more complex.

In [10]:
# Find the total number of entries
total_entries = df_netflix.shape[0]
print(f"Total number of entries: {total_entries}")


Total number of entries: 8807


In [11]:
# Count how often each duration value occurs
duration_counts = df_netflix['duration'].value_counts()
duration_counts


duration,count
1 Season,1793
2 Seasons,425
3 Seasons,199
90 min,152
97 min,146
...,...
228 min,1
18 min,1
205 min,1
201 min,1
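
The assignment also asks for the number of titles by genre. Because `listed_in` stores several genres in one comma-separated string, `value_counts()` on that column counts genre combinations rather than individual genres. A minimal sketch of one way around this, on made-up data (`genre_counts` is an illustrative name):

```python
import pandas as pd

# Hypothetical three-row frame (assumption: toy data in the same format as listed_in)
df_sample = pd.DataFrame({
    "title": ["A", "B", "C"],
    "listed_in": [
        "Dramas, International Movies",
        "Comedies, Dramas",
        "Documentaries",
    ],
})

genre_counts = (
    df_sample["listed_in"]
    .str.split(", ")   # each cell becomes a list of genres
    .explode()         # one row per (title, genre) pair
    .value_counts()    # count titles per individual genre
)

print(genre_counts)
```

The same split-and-explode pattern would apply to the `country` column, which also holds comma-separated values.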


In [29]:
# Filter for movies and drop rows where 'duration' is missing
df_movies_duration = df_netflix[df_netflix['type'] == 'Movie'].dropna(subset=['duration'])
df_movies_duration

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90.0,Documentaries,"As her father nears the end of his life, filmm..."
6,s7,Movie,My Little Pony: A New Generation,"Robert Cullen, José Luis Ucha","Vanessa Hudgens, Kimiko Glenn, James Marsden, ...",,"September 24, 2021",2021,PG,91.0,Children & Family Movies,Equestria's divided. But a bright-eyed hero be...
7,s8,Movie,Sankofa,Haile Gerima,"Kofi Ghanaba, Oyafunmike Ogunlano, Alexandra D...","United States, Ghana, Burkina Faso, United Kin...","September 24, 2021",1993,TV-MA,125.0,"Dramas, Independent Movies, International Movies","On a photo shoot in Ghana, an American model s..."
9,s10,Movie,The Starling,Theodore Melfi,"Melissa McCarthy, Chris O'Dowd, Kevin Kline, T...",United States,"September 24, 2021",2021,PG-13,104.0,"Comedies, Dramas",A woman adjusting to life after a loss contend...
12,s13,Movie,Je Suis Karl,Christian Schwochow,"Luna Wedler, Jannis Niewöhner, Milan Peschel, ...","Germany, Czech Republic","September 23, 2021",2021,TV-MA,127.0,"Dramas, International Movies",After most of her family is murdered in a terr...
...,...,...,...,...,...,...,...,...,...,...,...,...
8801,s8802,Movie,Zinzana,Majid Al Ansari,"Ali Suliman, Saleh Bakri, Yasa, Ali Al-Jabri, ...","United Arab Emirates, Jordan","March 9, 2016",2015,TV-MA,96.0,"Dramas, International Movies, Thrillers",Recovering alcoholic Talal wakes up inside a s...
8802,s8803,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,"November 20, 2019",2007,R,158.0,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a..."
8804,s8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,"November 1, 2019",2009,R,88.0,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...
8805,s8806,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,"January 11, 2020",2006,PG,88.0,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero..."


In [66]:
# Check what date columns I have
print("Available columns:")
print(df_netflix.columns.tolist())

# Convert the date_added column
if 'date_added' in df_netflix.columns:
    print("Before conversion:")
    print(df_netflix['date_added'].head())
    print(f"Data type: {df_netflix['date_added'].dtype}")

    # Convert to datetime
    df_netflix['date_added'] = pd.to_datetime(df_netflix['date_added'], errors='coerce')

    print("\nAfter conversion:")
    print(df_netflix['date_added'].head())
    print(f"Data type: {df_netflix['date_added'].dtype}")

Available columns:
['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added', 'release_year', 'rating', 'duration', 'listed_in', 'description']
Before conversion:
0   2021-09-25
1   2021-09-24
2   2021-09-24
3   2021-09-24
4   2021-09-24
Name: date_added, dtype: datetime64[ns]
Data type: datetime64[ns]

After conversion:
0   2021-09-25
1   2021-09-24
2   2021-09-24
3   2021-09-24
4   2021-09-24
Name: date_added, dtype: datetime64[ns]
Data type: datetime64[ns]


In [64]:
# Convert with error handling - invalid dates become NaT (Not a Time)
df_netflix['date_added'] = pd.to_datetime(df_netflix['date_added'], errors='coerce')
print(df_netflix['date_added'])

0      2021-09-25
1      2021-09-24
2      2021-09-24
3      2021-09-24
4      2021-09-24
          ...    
8802   2019-11-20
8803   2019-07-01
8804   2019-11-01
8805   2020-01-11
8806   2019-03-02
Name: date_added, Length: 8807, dtype: datetime64[ns]
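
With `errors='coerce'`, any value that cannot be parsed silently becomes `NaT`, so it is worth counting how many values were lost. The sketch below uses a made-up series; the entry with a leading space and the unparseable entry are hypothetical examples, not values confirmed to be in the dataset. It passes an explicit format matching strings like "September 25, 2021" and then derives the year.

```python
import pandas as pd

# Hypothetical raw values (assumption: stray whitespace and a bad entry for illustration)
dates = pd.Series(["September 25, 2021", " August 4, 2017", None, "not a date"])

# Strip whitespace first, then parse with an explicit format:
# %B = full month name, %d = day of month, %Y = four-digit year
parsed = pd.to_datetime(dates.str.strip(), format="%B %d, %Y", errors="coerce")

print(parsed)
print("Unparsed or missing:", parsed.isna().sum())

# The .dt accessor exposes date parts once the column is datetime
print(parsed.dt.year)
```

Comparing the count of `NaT` values with the number of values that were missing before conversion shows whether the conversion itself dropped any dates.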


In [31]:
# Calculate the average movie duration
# Restrict to movies explicitly; mean() skips missing values
average_movieduration_netflix = df_netflix.loc[df_netflix['type'] == 'Movie', 'duration'].mean()
print(f"Average movie duration: {average_movieduration_netflix:.2f} minutes")

Average movie duration: 99.58 minutes


In [32]:
# Calculate the total number of titles (movies and TV shows) released the year I was born (1988)
titles_1988_count = df_netflix[df_netflix['release_year'] == 1988].shape[0]
print(f"Total number of titles released in 1988: {titles_1988_count}")

Total number of titles released in 1988: 18
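
The assignment also lists the number of movies and TV shows and the distribution of content ratings among the key statistics. A minimal sketch of how these could be computed, on made-up data; `duration_min` is a hypothetical numeric column name used only for this example.

```python
import pandas as pd

# Hypothetical four-row frame (assumption: toy data; duration_min is an invented column)
df_sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "rating": ["PG-13", "TV-MA", "TV-MA", None],
    "duration_min": [90.0, None, 125.0, 104.0],
})

# Number of movies and TV shows
type_counts = df_sample["type"].value_counts()

# Share of each rating; dropna=False keeps missing ratings visible as their own group
rating_share = df_sample["rating"].value_counts(normalize=True, dropna=False)

# Average duration for movies only
avg_movie_minutes = df_sample.loc[df_sample["type"] == "Movie", "duration_min"].mean()

print(type_counts)
print(rating_share)
print(f"Average movie duration: {avg_movie_minutes:.2f} minutes")
```

Using `normalize=True` turns the counts into proportions, which makes ratings easier to compare than raw counts.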


## 6. Data Type Conversion

In [33]:
# Convert the 'release_year' column to integer (it is already int64, so this leaves the values unchanged)
df_netflix['release_year'] = df_netflix['release_year'].astype(int)
df_netflix['release_year']

Unnamed: 0,release_year
0,2020
1,2021
2,2021
3,2021
4,2021
...,...
8802,2007
8803,2018
8804,2009
8805,2006


I have not completely understood this. I used AI to help me solve it, but I am not satisfied with my learning outcome here. I still do not understand how to convert columns.

In [44]:
# Convert the duration to integer (remove ' min' and then convert) - Movies only
if df_netflix['duration'].dtype == 'object' or df_netflix['duration'].dtype.name == 'string':
    movies_mask = df_netflix['type'] == 'Movie'
    df_netflix.loc[movies_mask, 'duration'] = pd.to_numeric(
        df_netflix.loc[movies_mask, 'duration'].str.replace(' min', ''),
        errors='coerce'
    )
else:
    print("Duration column is not string type - conversion may have already been done")


Duration column is not string type - conversion may have already been done


In [49]:
# Verify the changes
print(df_netflix.dtypes)


show_id          object
type             object
title            object
director         object
cast             object
country          object
date_added       object
release_year      int64
rating           object
duration        float64
listed_in        object
description      object
dtype: object


In [52]:
print(df_netflix['duration'].head(10))

0     90.0
1      NaN
2      NaN
3      NaN
4      NaN
5      NaN
6     91.0
7    125.0
8      NaN
9    104.0
Name: duration, dtype: float64


This is my first coding assignment as a non-coder. I find coding very logical; however, I still don't know many functions or how they work. I am going through the DataCamp course and it is getting better, but it is taking me more time than I expected.

In [56]:
print(df_netflix['duration'].head(10))
print(f"Data type: {df_netflix['duration'].dtype}")
print(f"Number of NaN values: {df_netflix['duration'].isnull().sum()}")

0     90.0
1      NaN
2      NaN
3      NaN
4      NaN
5      NaN
6     91.0
7    125.0
8      NaN
9    104.0
Name: duration, dtype: float64
Data type: float64
Number of NaN values: 2679
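
The `duration` column mixes two units ("90 min" for movies, "2 Seasons" for TV shows), and overwriting it in place means the conversion cell cannot be re-run, as the "conversion may have already been done" message shows. One possible alternative, sketched here on made-up data, is to leave the original column untouched and write the numbers into new columns. The names `duration_min` and `seasons` are hypothetical.

```python
import pandas as pd

# Hypothetical four-row frame (assumption: toy data in the same format as the dataset)
df_sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "TV Show", "Movie"],
    "duration": ["90 min", "2 Seasons", "1 Season", None],
})

# Pull the leading number and the unit word out of each string
parts = df_sample["duration"].str.extract(r"(?P<value>\d+)\s+(?P<unit>\w+)")

# Int64 (capital I) is the nullable integer dtype: it can hold missing values
value = pd.to_numeric(parts["value"], errors="coerce").astype("Int64")

# Keep the number only where it applies; everything else becomes <NA>
is_movie = df_sample["type"] == "Movie"
df_sample["duration_min"] = value.where(is_movie)
df_sample["seasons"] = value.where(~is_movie)

print(df_sample)
print(df_sample.dtypes)
```

Because the original `duration` strings are still there, the cell can be run again at any time and gives the same result.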
