# NETFLIX EDA PROJECT - SQL

Netflix, a leading global streaming platform, possesses a dataset containing information 
about its shows. The dataset requires cleaning and analysis before it can support 
business decision-making. This project explores it with SQL to help the company gain 
insights into its content offerings.

STEP-1 Importing all the necessary libraries

In [7]:
import numpy as np 
import pandas as pd 
import sqlite3
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

STEP-2 Loading the data into the Jupyter notebook and opening a SQLite connection to the NETFLIX database


In [8]:
df = pd.read_csv("C:\\Users\\DELL\\OneDrive\\Desktop\\netflix_01.csv")
conn = sqlite3.connect('NETFLIX.db')

STEP-3 Naming the table "netflix_data3" and defining the column schema

In [14]:
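# Table name used by every query below (SQLite table names are case-insensitive)
Table_name = 'netflix_data3'
# show_id values look like 's1' and date_added holds date strings, so both are declared TEXT
Column_schema = 'show_id TEXT, type TEXT, title TEXT, director TEXT, country TEXT, date_added TEXT, release_year INTEGER, rating TEXT, duration TEXT, listed_in TEXT'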


STEP-4 Creating the table query

In [15]:
Create_Table_Query = f"CREATE TABLE {Table_name} ({Column_schema})"
conn.execute(Create_Table_Query)

<sqlite3.Cursor at 0x2098b3725c0>

STEP-5 Writing the pandas DataFrame (df) to the SQL database table

In [16]:
df.to_sql(Table_name, conn, if_exists='append', index=False)

8790
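Steps 3 to 5 can be tried out without touching the real database. The following is a minimal sketch, assuming an in-memory SQLite database and a hypothetical three-row sample in place of the CSV; the names `sample`, `conn_demo` and `netflix_demo` are illustrative and not part of the original notebook.

```python
import sqlite3
import pandas as pd

# Hypothetical three-row sample standing in for the CSV file.
sample = pd.DataFrame({
    "show_id": ["s1", "s3", "s6"],
    "type": ["Movie", "TV Show", "TV Show"],
    "title": ["Dick Johnson Is Dead", "Ganglands", "Midnight Mass"],
    "release_year": [2020, 2021, 2021],
})

# In-memory database: nothing is written to disk.
conn_demo = sqlite3.connect(":memory:")

# if_exists="replace" keeps the cell safe to re-run;
# "append" would insert the same rows again on every run.
sample.to_sql("netflix_demo", conn_demo, if_exists="replace", index=False)

row_count = conn_demo.execute("SELECT COUNT(*) FROM netflix_demo").fetchone()[0]
print(row_count)
```

With `if_exists='append'`, as used in the notebook, re-running the cell adds the rows a second time, so checking the row count after writing is a cheap safeguard.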

STEP-6 Checking the data types

In [12]:
df.dtypes

show_id         object
type            object
title           object
director        object
country         object
date_added      object
release_year     int64
rating          object
duration        object
listed_in       object
dtype: object

STEP-7 The date_added column has dtype object, so it needs to be converted to datetime format

In [13]:
# Values look like DD/MM/YYYY (e.g. 25/09/2021), so parse with the day first
df['date_added'] = pd.to_datetime(df['date_added'], dayfirst=True)

In [17]:
df.dtypes

show_id                 object
type                    object
title                   object
director                object
country                 object
date_added      datetime64[ns]
release_year             int64
rating                  object
duration                object
listed_in               object
dtype: object
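SQLite's `strftime` function, used later in Segment 4, expects ISO-8601 date text such as `2021-09-25`. The sketch below assumes the dates arrive as DD/MM/YYYY strings, as in the table preview, and shows one way to parse them explicitly and store them in a form SQLite can read. The names `dates`, `conn_demo` and `dates_demo` are hypothetical.

```python
import sqlite3
import pandas as pd

# Hypothetical sample using the DD/MM/YYYY layout seen in the data.
dates = pd.DataFrame({"date_added": ["25/09/2021", "17/01/2017", "13/09/2018"]})

# An explicit format avoids day/month ambiguity for values such as 01/02/2021.
dates["date_added"] = pd.to_datetime(dates["date_added"], format="%d/%m/%Y")

# Store as ISO-8601 text (YYYY-MM-DD), which SQLite's date functions understand.
dates["date_added"] = dates["date_added"].dt.strftime("%Y-%m-%d")

conn_demo = sqlite3.connect(":memory:")
dates.to_sql("dates_demo", conn_demo, index=False)

months = [
    row[0]
    for row in conn_demo.execute(
        "SELECT strftime('%m', date_added) FROM dates_demo ORDER BY date_added"
    )
]
print(months)
```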

**Segment 1: Database - Tables, Columns, Relationships**

**The table has 8790 rows and 10 columns**

In [16]:
seg1_a = pd.read_sql("""SELECT *
                        FROM netflix_data3;""",conn)
display(seg1_a)

Unnamed: 0,show_id,type,title,director,country,date_added,release_year,rating,duration,listed_in
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,United States,25/09/2021,2020,PG-13,90 min,Documentaries
1,s3,TV Show,Ganglands,Julien Leclercq,France,24/09/2021,2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act..."
2,s6,TV Show,Midnight Mass,Mike Flanagan,United States,24/09/2021,2021,TV-MA,1 Season,"TV Dramas, TV Horror, TV Mysteries"
3,s14,Movie,Confessions of an Invisible Girl,Bruno Garotti,Brazil,22/09/2021,2021,TV-PG,91 min,"Children & Family Movies, Comedies"
4,s8,Movie,Sankofa,Haile Gerima,United States,24/09/2021,1993,TV-MA,125 min,"Dramas, Independent Movies, International Movies"
...,...,...,...,...,...,...,...,...,...,...
8785,s8797,TV Show,Yunus Emre,Not Given,Turkey,17/01/2017,2016,TV-PG,2 Seasons,"International TV Shows, TV Dramas"
8786,s8798,TV Show,Zak Storm,Not Given,United States,13/09/2018,2016,TV-Y7,3 Seasons,Kids' TV
8787,s8801,TV Show,Zindagi Gulzar Hai,Not Given,Pakistan,15/12/2016,2012,TV-PG,1 Season,"International TV Shows, Romantic TV Shows, TV ..."
8788,s8784,TV Show,Yoko,Not Given,Pakistan,23/06/2018,2016,TV-Y,1 Season,Kids' TV


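**Identify and handle any missing values in the dataset** ==> The query finds no NULL values in any of the checked columns. Missing entries are instead recorded with the placeholder 'Not Given' (for example in director and country), which a NULL check does not detect.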

In [17]:
seg1_c = pd.read_sql("""select count(*) 
                        from netflix_data3
                        where type IS NULL 
                        OR title IS NULL
                        OR director IS NULL
                        OR country IS NULL 
                        OR date_added IS NULL
                        OR release_year IS NULL
                        OR rating IS NULL
                        OR duration IS NULL 
                        OR listed_in IS NULL;""",conn)
display(seg1_c)

Unnamed: 0,count(*)
0,0
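Because missing entries show up as the text 'Not Given' rather than as NULL, it can help to count both. The following is a minimal sketch on a hypothetical four-row table (`shows_demo`), not on the real data; it relies on SQLite evaluating a comparison to 1 or 0 so that `SUM` counts the matching rows.

```python
import sqlite3

conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE shows_demo (title TEXT, director TEXT, country TEXT)")
conn_demo.executemany(
    "INSERT INTO shows_demo VALUES (?, ?, ?)",
    [
        ("Sankofa", "Haile Gerima", "United States"),
        ("Yunus Emre", "Not Given", "Turkey"),
        ("Zak Storm", "Not Given", "United States"),
        ("Hypothetical Title", None, "Not Given"),  # invented row with a real NULL
    ],
)

placeholder_counts = {}
for column in ("director", "country"):
    # Column names come from the fixed tuple above, never from user input.
    query = (
        f"SELECT SUM({column} IS NULL), SUM({column} = 'Not Given') "
        "FROM shows_demo"
    )
    nulls, placeholders = conn_demo.execute(query).fetchone()
    placeholder_counts[column] = (nulls, placeholders)

print(placeholder_counts)
```

Each tuple holds the NULL count followed by the placeholder count for that column.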


**Segment 2: Content Analysis**

**Analysing the distribution of content types (movies vs. TV shows) in the dataset.** ==>
The dataset contains more movies (6126) than TV shows (2664).


In [18]:
seg2_a = pd.read_sql("""select type, count(*) as count 
                           from netflix_data3 
                           group by type;""",conn)
display(seg2_a)

Unnamed: 0,type,count
0,Movie,6126
1,TV Show,2664


**Determining the top 10 countries with the highest number of productions on Netflix** ==> The United States has the highest number of productions on Netflix (3240). Note that the 'Not Given' placeholder ranks fifth with 287 titles.

In [41]:
seg2_b = pd.read_sql("""
SELECT country, country_count
FROM (
    SELECT country, COUNT(*) AS country_count,
           RANK() OVER (ORDER BY COUNT(*) DESC) AS country_rank
    FROM netflix_data3
    GROUP BY country
) AS subquery
WHERE country_rank <= 10;
""", conn)
display(seg2_b)

Unnamed: 0,country,country_count
0,United States,3240
1,India,1057
2,United Kingdom,638
3,Pakistan,421
4,Not Given,287
5,Canada,271
6,Japan,259
7,South Korea,214
8,France,213
9,Spain,182
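Since 'Not Given' occupies one of the ten slots, a variant that filters the placeholder out before ranking may be useful. This is a sketch on hypothetical rows in a demo table (`country_demo`); it uses `ORDER BY ... LIMIT` instead of `RANK()`, which cuts off at exactly N rows, whereas `RANK()` keeps every country tied at the boundary.

```python
import sqlite3

conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE country_demo (country TEXT)")
# Hypothetical rows; the counts are illustrative only.
conn_demo.executemany(
    "INSERT INTO country_demo VALUES (?)",
    [("United States",)] * 3 + [("India",)] * 2 + [("Not Given",)] * 2 + [("France",)],
)

top_countries = conn_demo.execute(
    """
    SELECT country, COUNT(*) AS country_count
    FROM country_demo
    WHERE country <> 'Not Given'      -- drop the placeholder before ranking
    GROUP BY country
    ORDER BY country_count DESC, country
    LIMIT 2;
    """
).fetchall()
print(top_countries)
```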


**Investigating the trend of content over the years.** ==> The query groups by release_year, so it shows when titles were originally released rather than when they were added to Netflix. 2018 is the release year with the most titles (1146).

In [33]:
seg2_c = pd.read_sql("""select release_year,count(*) as count
                               from netflix_data3
                               group by release_year
                               order by count desc ;""",conn)
display(seg2_c)

Unnamed: 0,release_year,count
0,2018,1146
1,2019,1030
2,2017,1030
3,2020,953
4,2016,901
...,...,...
69,1966,1
70,1961,1
71,1959,1
72,1947,1


**Analyse the relationship between content duration and release year**

In this query, movie durations are given in minutes and TV show durations in seasons, so both must be converted to a common unit before averaging.
The query uses a CASE expression with one branch per content type. For movies, SUBSTR(duration, 1, INSTR(duration, ' min') - 1) extracts the substring of the duration column from the first character up to ' min', which gives the number of minutes.
For TV shows, SUBSTR(duration, 1, INSTR(duration, ' Season') - 1) extracts the substring from the first character up to ' Season', which gives the number of seasons.
CAST(... AS INT) * 5 * 45 then converts the extracted substring to an integer and multiplies it by an assumed 5 episodes per season of 45 minutes each, i.e. 225 minutes per season.
Note that the query groups only by release_year and tests type outside the AVG. Each year therefore gets a single figure: type is taken from an arbitrary row of the group, and the average inside each branch runs over every row in the group, movies and TV shows alike. The results below should be read with that caveat.

In [35]:
seg2_d = pd.read_sql("""
SELECT release_year,
       CASE
          WHEN type = 'Movie' THEN AVG(CAST(SUBSTR(duration, 1, INSTR(duration, ' min') - 1) AS INT))
          WHEN type = 'TV Show' THEN AVG(CAST(SUBSTR(duration, 1, INSTR(duration, ' Season') - 1) AS INT)*5 * 45)
          ELSE 0
       END AS average_duration
FROM netflix_data3
WHERE type = 'Movie' OR (type = 'TV Show' AND INSTR(duration, ' Season') > 0)
GROUP BY release_year
ORDER BY release_year;""",conn)

display(seg2_d)

Unnamed: 0,release_year,average_duration
0,1925,225.000000
1,1942,35.000000
2,1943,62.666667
3,1944,52.000000
4,1945,38.500000
...,...,...
69,2017,103.543689
70,2018,124.672775
71,2019,57.440777
72,2020,49.986359
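One way to keep movies and TV shows apart is to move the CASE inside the aggregate and group by both release_year and type, as the later segments do. The sketch below runs that variant on four hypothetical rows in a demo table (`duration_demo`); the 5 episodes × 45 minutes per season figure is the same assumption used in the notebook.

```python
import sqlite3

conn_demo = sqlite3.connect(":memory:")
conn_demo.execute(
    "CREATE TABLE duration_demo (release_year INTEGER, type TEXT, duration TEXT)"
)
# Hypothetical rows for illustration.
conn_demo.executemany(
    "INSERT INTO duration_demo VALUES (?, ?, ?)",
    [
        (2020, "Movie", "90 min"),
        (2020, "Movie", "110 min"),
        (2020, "TV Show", "2 Seasons"),
        (2021, "TV Show", "1 Season"),
    ],
)

avg_by_year_type = conn_demo.execute(
    """
    SELECT release_year,
           type,
           AVG(CASE
                   WHEN type = 'Movie'
                       THEN CAST(SUBSTR(duration, 1, INSTR(duration, ' min') - 1) AS INT)
                   WHEN type = 'TV Show'
                       -- assumed 5 episodes x 45 minutes per season
                       THEN CAST(SUBSTR(duration, 1, INSTR(duration, ' Season') - 1) AS INT) * 5 * 45
               END) AS average_minutes
    FROM duration_demo
    GROUP BY release_year, type
    ORDER BY release_year, type;
    """
).fetchall()
print(avg_by_year_type)
```

Grouping by type as well yields one row per year and content type, so a year with a few long-running series no longer distorts the movie average.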


**Identifying the directors with the most content on Netflix** ==> Excluding the 'Not Given' placeholder (2588 rows), Rajiv Chilaka has directed the highest number of movies or TV shows (20).

In [36]:
seg2_e = pd.read_sql("""select director,count(*) as count
                                from netflix_data3
                                group by director
                                order by count desc;""",conn)
display(seg2_e)

Unnamed: 0,director,count
0,Not Given,2588
1,Rajiv Chilaka,20
2,"Raúl Campos, Jan Suter",18
3,Alastair Fothergill,18
4,Suhas Kadav,16
...,...,...
4523,Aamir Khan,1
4524,Aamir Bashir,1
4525,Aadish Keluskar,1
4526,A. Salaam,1


**Segment 3: Genre and Category Analysis**

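**Determining the unique genres and categories present in the dataset** ==> There are 513 unique values of listed_in. Each value is a comma-separated combination of genres, so this counts distinct combinations rather than individual genres.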

In [35]:
seg3_a = pd.read_sql("""select distinct(listed_in) 
                        from netflix_data3;""",conn)
display(seg3_a)

Unnamed: 0,listed_in
0,Documentaries
1,"Crime TV Shows, International TV Shows, TV Act..."
2,"TV Dramas, TV Horror, TV Mysteries"
3,"Children & Family Movies, Comedies"
4,"Dramas, Independent Movies, International Movies"
...,...
508,"Classic & Cult TV, TV Horror, TV Mysteries"
509,"Crime TV Shows, TV Comedies"
510,"Classic & Cult TV, Kids' TV, TV Comedies"
511,"Classic & Cult TV, TV Sci-Fi & Fantasy"


**Calculate the percentage of movies and TV shows in each genre** ==> "Dramas, International Movies" is the most frequent listed_in value, with 362 titles or 4.12% of the dataset.

In [37]:
seg3_b = pd.read_sql("""select listed_in AS genre,count(*) AS total_count,
         Round(count(*) * 100.0 / (select count(*) from netflix_data3), 2) AS percentage
         from netflix_data3
         group by genre
         order by total_count desc;""",conn)
display(seg3_b)

Unnamed: 0,genre,total_count,percentage
0,"Dramas, International Movies",362,4.12
1,Documentaries,359,4.08
2,Stand-Up Comedy,334,3.80
3,"Comedies, Dramas, International Movies",274,3.12
4,"Dramas, Independent Movies, International Movies",252,2.87
...,...,...,...
508,"Action & Adventure, Classic Movies, Internatio...",1,0.01
509,"Action & Adventure, Children & Family Movies, ...",1,0.01
510,"Action & Adventure, Children & Family Movies, ...",1,0.01
511,"Action & Adventure, Anime Features, Horror Movies",1,0.01


**Identifying the most popular genres/categories based on the number of productions.** ==> "Dramas, International Movies" is the most popular, with the highest count of 362.

In [37]:
seg3_c = pd.read_sql("""select genre_category, count
From (
    Select listed_in AS genre_category,
           Count(*) AS count,
           Rank() over (Order by count(*) DESC) AS genre_rank
    From netflix_data3
    Group by genre_category
) AS subquery
Where genre_rank <= 10;""",conn)
display(seg3_c)

Unnamed: 0,genre_category,count
0,"Dramas, International Movies",362
1,Documentaries,359
2,Stand-Up Comedy,334
3,"Comedies, Dramas, International Movies",274
4,"Dramas, Independent Movies, International Movies",252
5,Kids' TV,219
6,Children & Family Movies,215
7,"Children & Family Movies, Comedies",201
8,"Documentaries, International Movies",186
9,"Dramas, International Movies, Romantic Movies",180


**Calculating the cumulative sum of content duration within each genre** ==> As before, SUBSTR is used to extract the durations, which are then summed per genre. The output is not ordered by duration and appears in alphabetical order of genre, so its first row ("Action & Adventure", 13426) is not necessarily the largest total.

In [34]:
seg3_d = pd.read_sql("""
    Select listed_in AS genre,
           Sum(case
                   when type = 'Movie' then CAST(SUBSTR(duration, 1, INSTR(duration, ' min') - 1) AS INT)
                   when type = 'TV Show' then CAST(SUBSTR(duration, 1, INSTR(duration, ' Season') - 1) AS INT) * 5 * 45
                   ELSE 0
               END) AS cumulative_duration
    From netflix_data3
    Group by genre;
""", conn)

display(seg3_d)

Unnamed: 0,genre,cumulative_duration
0,Action & Adventure,13426
1,"Action & Adventure, Anime Features",84
2,"Action & Adventure, Anime Features, Children &...",367
3,"Action & Adventure, Anime Features, Classic Mo...",239
4,"Action & Adventure, Anime Features, Horror Movies",96
...,...,...
508,"TV Horror, TV Mysteries, Teen TV Shows",225
509,"TV Horror, Teen TV Shows",675
510,"TV Sci-Fi & Fantasy, TV Thrillers",675
511,TV Shows,3600


**Segment 4: Release Date Analysis**

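**Determining the distribution of content releases by month and year** ==> strftime is used to extract the month and year from the date_added column. In this output, December 2019 has the highest release count (122). Note that the alias release_year shares its name with an existing table column; SQLite resolves names in GROUP BY against table columns before result aliases, so the grouping here likely uses the title's release year rather than the year it was added. That may explain why these counts differ from the month-year ranking later in this segment.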

In [38]:
seg4_a= pd.read_sql("""select strftime('%Y', date_added) AS release_year,
                             strftime('%m', date_added) AS release_month,
                             Count(*) AS release_count
                      From netflix_data3
                      Group by release_year, release_month
                      Order by release_count desc 
                      ;""", conn)

display(seg4_a)

Unnamed: 0,release_year,release_month,release_count
0,2019,12,122
1,2019,11,109
2,2018,12,106
3,2020,10,106
4,2021,06,105
...,...,...,...
518,2021,07,1
519,2021,05,1
520,2019,02,1
521,2020,11,1


**Analyzing the seasonal patterns in content releases** ==> Months are grouped into four seasons (Winter, Summer, Rainy and Autumn). The Rainy season (June to August) has the highest release count, 2309.

In [40]:
seg4_b = pd.read_sql("""Select case
                              WHEN strftime('%m', date_added) IN ('12', '01', '02') THEN 'Winter'
                              WHEN strftime('%m', date_added) IN ('03', '04', '05') THEN 'Summer'
                              WHEN strftime('%m', date_added) IN ('06', '07', '08') THEN 'Rainy'
                              WHEN strftime('%m', date_added) IN ('09', '10', '11') THEN 'Autumn'
                              ELSE 'Unknown'
                           END AS season,
                           COUNT(*) AS release_count
                        From netflix_data3
                        Group by season
                        Order by release_count desc;""",conn)
display(seg4_b)

Unnamed: 0,season,release_count
0,Rainy,2309
1,Autumn,2234
2,Summer,2136
3,Winter,2111


**Identify the months and years with the highest number of releases** ==> July 2021 has a release count of 257, which is higher than any other month and year.

In [127]:
seg4_c = pd.read_sql("""Select release_month_year, release_count
                        From (
                               Select strftime('%Y-%m', date_added) AS release_month_year,
                               Count(*) AS release_count,
                               Rank() Over (Order by count(*) desc) AS release_rank
                               From netflix_data3
                               Group by release_month_year
                         ) AS subquery
                         Where release_rank <= 5;""",conn)
display(seg4_c)

Unnamed: 0,release_month_year,release_count
0,2021-07,257
1,2019-11,255
2,2019-12,215
3,2021-06,207
4,2020-01,205


**Segment 5: Rating Analysis**

**Investigate the distribution of ratings across different genres.** ==> The query counts titles for each combination of genre and rating, sorted from most to least frequent. Stand-Up Comedy rated TV-MA is the largest group (284).

In [43]:
seg5_a = pd.read_sql("""Select listed_in, rating, COUNT(*) AS count
                        From netflix_data3
                        Group by listed_in, rating
                        Order by count desc;""",conn)
display(seg5_a)

Unnamed: 0,listed_in,rating,count
0,Stand-Up Comedy,TV-MA,284
1,"Dramas, International Movies",TV-MA,154
2,"Dramas, Independent Movies, International Movies",TV-MA,142
3,"Comedies, Dramas, International Movies",TV-14,139
4,"Dramas, International Movies",TV-14,139
...,...,...,...
1227,"TV Dramas, TV Sci-Fi & Fantasy, Teen TV Shows",TV-14,1
1228,"TV Dramas, TV Thrillers",TV-14,1
1229,"TV Horror, TV Mysteries, Teen TV Shows",TV-MA,1
1230,"TV Sci-Fi & Fantasy, TV Thrillers",TV-14,1


**Relationship between Ratings and Content Duration**


Here we calculate the average content duration for each rating category, considering both movies and TV shows. For movies, we extract the duration in minutes as before. For TV shows, we assume each season has 5 episodes of 45 minutes and multiply that by the number of seasons. The results are grouped by rating and sorted in ascending order of rating.

In [132]:
seg5_b = pd.read_sql("""Select rating, AVG(CASE
                                         WHEN type = 'Movie' THEN CAST(SUBSTR(duration, 1, INSTR(duration, ' min') - 1) AS INT)
                                         WHEN type = 'TV Show' THEN CAST(SUBSTR(duration, 1, INSTR(duration, ' Season') - 1) AS INT)*5 * 45
                                         ELSE 0
                                       END) AS average_duration
                        From netflix_data3
                        Group by rating
                        Order by rating;""",conn)
display(seg5_b)

Unnamed: 0,rating,average_duration
0,G,90.268293
1,NC-17,125.0
2,NR,106.835443
3,PG,98.28223
4,PG-13,108.330612
5,R,107.01627
6,TV-14,210.656004
7,TV-G,223.581818
8,TV-MA,196.762871
9,TV-PG,196.16144


**Segment 6: Co-occurrence Analysis**

**Relationship between Genres/Categories and Content Duration** ==> The average duration is calculated for each genre; "Classic & Cult TV, TV Action & Adventure, TV Horror" has the highest average duration (2025 minutes under the 5 × 45 assumption).


In [44]:
seg6_b = pd.read_sql("""Select listed_in AS genre, AVG(CASE
                                WHEN type = 'Movie' THEN CAST(SUBSTR(duration, 1, INSTR(duration, ' min') - 1) AS INT)
                                WHEN type = 'TV Show' THEN CAST(SUBSTR(duration, 1, INSTR(duration, ' Season') - 1) AS INT)*5 * 45
                                ELSE 0
                             END) AS average_duration
                        From netflix_data3
                        Group by genre
                        Order by average_duration DESC;""",conn)
display(seg6_b)

Unnamed: 0,genre,average_duration
0,"Classic & Cult TV, TV Action & Adventure, TV H...",2025.000000
1,"Classic & Cult TV, TV Comedies",1800.000000
2,"Crime TV Shows, TV Action & Adventure, TV Sci-...",1575.000000
3,"Classic & Cult TV, TV Action & Adventure, TV S...",1575.000000
4,"British TV Shows, Classic & Cult TV, TV Comedies",1462.500000
...,...,...
508,"Children & Family Movies, Comedies, LGBTQ Movies",46.000000
509,Movies,45.641509
510,"Action & Adventure, Documentaries, Sports Movies",40.000000
511,"Anime Features, Documentaries",36.000000


**Segment 7: International Expansion Analysis**

**Countries with Netflix Content Offerings** ==> There are 86 unique values in the country column, including the 'Not Given' placeholder.

In [141]:
seg7_a = pd.read_sql("""Select DISTINCT country
                        From netflix_data3;""",conn)
display(seg7_a)

Unnamed: 0,country
0,United States
1,France
2,Brazil
3,United Kingdom
4,India
...,...
81,Senegal
82,Belarus
83,Puerto Rico
84,Cyprus


**Distribution of Content Types in Different Countries** ==> Counting titles per country and content type shows that the United States has the highest count of both movies (2395) and TV shows (845). India ranks second for movies (976) and Pakistan second for TV shows (350).

In [39]:
seg7_b = pd.read_sql("""Select country, type, COUNT(*) AS count
                        From netflix_data3
                        Group by country, type
                        Order by count desc;""",conn)
display(seg7_b)

Unnamed: 0,country,type,count
0,United States,Movie,2395
1,India,Movie,976
2,United States,TV Show,845
3,United Kingdom,Movie,387
4,Pakistan,TV Show,350
...,...,...,...
133,Switzerland,TV Show,1
134,United Arab Emirates,TV Show,1
135,Uruguay,TV Show,1
136,West Germany,Movie,1
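To read off the leading country for each content type directly, the grouped counts can be reduced with `idxmax`. The sketch below uses only the five top rows shown in the output above, so it illustrates the technique rather than re-running the full query.

```python
import pandas as pd

# The five largest rows from the output above.
counts = pd.DataFrame({
    "country": ["United States", "India", "United States", "United Kingdom", "Pakistan"],
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show"],
    "count": [2395, 976, 845, 387, 350],
})

# Index label of the largest count within each content type.
top_rows = counts.loc[counts.groupby("type")["count"].idxmax()]

top_per_type = top_rows.set_index("type")["country"].to_dict()
print(top_per_type)
```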


**Relationship between Content Duration and Country of Production** ==> The average duration is calculated for each country and content type. The United Arab Emirates has the highest average duration for TV shows (675 minutes under the 5 × 45 assumption), although this is based on a single title.

In [45]:
seg7_c = pd.read_sql("""Select country, type, AVG(CASE
                                   WHEN type = 'Movie' THEN CAST(SUBSTR(duration, 1, INSTR(duration, ' min') - 1) AS INT)
                                   WHEN type = 'TV Show' THEN CAST(SUBSTR(duration, 1, INSTR(duration, ' Season') - 1) AS INT)*5*45
                                   ELSE 0
                               END) AS average_duration
                        From netflix_data3
                        Group by country, type
                        Order by average_duration desc;""",conn)
display(seg7_c)

Unnamed: 0,country,type,average_duration
0,United Arab Emirates,TV Show,675.000000
1,Ireland,TV Show,637.500000
2,Denmark,TV Show,613.636364
3,Canada,TV Show,586.607143
4,United States,TV Show,516.568047
...,...,...,...
133,Pakistan,Movie,76.633803
134,Georgia,Movie,71.500000
135,Guatemala,Movie,69.000000
136,Syria,Movie,51.500000


**Segment 8: Recommendations for Content Strategy**

Based on the analysis, provide recommendations for the types of content Netflix 
should focus on producing, and identify potential areas for expansion and growth.

Based on the analysis conducted in the previous segments, here are some suggestions:

1) Content Types: Analyze the distribution of content types (movies and TV shows) in the dataset; movies currently outnumber TV shows (6126 vs 2664). Identify which type is in higher demand among viewers and prioritize its production.

2) Genres and Categories: Analyze the distribution of genres and categories in the dataset. Identify the most popular genres and categories based on the number of productions, and focus on producing content in them to cater to viewer preferences.

3) Content Duration: Analyze the relationship between content duration and viewer engagement. The dataset contains no viewing data, so this requires engagement metrics in addition to the durations analysed here. Identify the duration that keeps viewers engaged, and produce content that aligns with viewer preferences and attention spans.

4) Ratings and Viewer Feedback: Consider the distribution of ratings across different genres and the relationship between ratings and content duration. The rating column holds maturity ratings (e.g. TV-MA, PG-13) rather than viewer scores, so combine it with viewer feedback to guide the production of high-quality content that resonates with the intended audience.

5) Co-occurrence Analysis: This analysis can provide insights into potential cross-genre content combinations that have a higher chance of success.

6) International Expansion: Analyze the countries with the highest number of productions on Netflix. Consider expanding content offerings in these countries, or explore opportunities to collaborate with local production industries to tap into new markets.

7) Seasonal Patterns: Investigate the seasonal patterns in content releases. Identify periods with higher content additions and plan content strategies to align with viewers' preferences during specific seasons.

8) Emerging Trends: Stay updated with emerging trends in the entertainment industry. Monitor viewership patterns, popular genres, and content formats to adapt and produce content that caters to evolving viewer preferences.

**These recommendations and insights can help shape Netflix's content strategy, prioritize production efforts, and identify potential areas for expansion and growth.**