# Exploratory Data Analysis

---
## Metadata Description

**Project Title:** BrightLearn Viewership Analysis  
**Purpose:** To explore and understand viewership data from the BrightLearn platform.  
**Goal:** Identify data quality issues, understand patterns, and prepare data for deeper analysis or modeling.  

**Dataset Source:**  
Loaded from a Spark table called `workspace.brightlearn.viewership_analysis`.  

**Libraries Used:**  
- **pandas:** For working with data in a table format, cleaning, and analysis.  
- **numpy:** For handling numbers, calculations, and missing values.   

**Main Tasks:**  
1. Load the data from Spark.  
2. Convert it into a pandas DataFrame.  
3. Perform Exploratory Data Analysis (EDA):  
   - Check data types and structure.  
   - Investigate missing values and duplicates.  
   - Explore distributions and summary statistics.  
   - Identify potential data quality issues.  


---


## Load Libraries and Data

In [0]:
# I will use pandas because it helps me work with data in a table-like format, similar to Excel.
# It allows me to clean, filter, and explore my dataset easily using simple functions.

import pandas as pd

# I will use numpy because it supports numerical operations and calculations.
# It helps with things like averages, medians, standard deviations, and handling missing values (np.nan).

import numpy as np

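# Load the table from Spark (this assumes the notebook environment already provides a `spark` session)
df_spark = spark.table("workspace.brightlearn.viewership_analysis")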

# Convert the Spark DataFrame into a pandas DataFrame for easier analysis
df = df_spark.toPandas()

# Preview the first few rows
df.head(6)


|  | DateID | CustomerID | TotalTimeWatched | Platform | PlayEventType | VideoTitle |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 20201101 | EW1DENH0EC1J3M9WAOZF9LSV004O | 300 | Leanback | LiveTV | F1 '20: Emilia Romagna GP |
| 1 | 20201101 | EW1DENH0EC1J3M9WAOZF9LSV004O | 300 | Leanback | LiveTV | F1 '20: Emilia Romagna GP |
| 2 | 20201101 | 6TS2LLY0L3G66FVY86Q0JEZE000K | 360 | Leanback | Other | Chasing The Sun |
| 3 | 20201101 | 6TS2LLY0L3G66FVY86Q0JEZE000K | 360 | Leanback | Other | Chasing The Sun |
| 4 | 20201101 | 6PMV67PLJ2S47S68J0Y30XFK003C | 120 | Leanback | LiveTV | Sonic The Hedgehog |
| 5 | 20201101 | 6PMV67PLJ2S47S68J0Y30XFK003C | 120 | Leanback | LiveTV | Frozen II |


In [0]:
# I want to check the structure of my data: row and column counts, data types, and non-null counts
df.info()

# Preview top records
df.head(5)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 118534 entries, 0 to 118533
Data columns (total 6 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   DateID            118534 non-null  object
 1   CustomerID        118534 non-null  object
 2   TotalTimeWatched  118534 non-null  int64 
 3   Platform          118534 non-null  object
 4   PlayEventType     118534 non-null  object
 5   VideoTitle        118534 non-null  object
dtypes: int64(1), object(5)
memory usage: 5.4+ MB


|  | DateID | CustomerID | TotalTimeWatched | Platform | PlayEventType | VideoTitle |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 20201101 | EW1DENH0EC1J3M9WAOZF9LSV004O | 300 | Leanback | LiveTV | F1 '20: Emilia Romagna GP |
| 1 | 20201101 | EW1DENH0EC1J3M9WAOZF9LSV004O | 300 | Leanback | LiveTV | F1 '20: Emilia Romagna GP |
| 2 | 20201101 | 6TS2LLY0L3G66FVY86Q0JEZE000K | 360 | Leanback | Other | Chasing The Sun |
| 3 | 20201101 | 6TS2LLY0L3G66FVY86Q0JEZE000K | 360 | Leanback | Other | Chasing The Sun |
| 4 | 20201101 | 6PMV67PLJ2S47S68J0Y30XFK003C | 120 | Leanback | LiveTV | Sonic The Hedgehog |
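
The `df.info()` output lists `DateID` with dtype `object`, although the previewed values look like dates. The snippet below is a minimal sketch of one way to parse such a column, assuming the values are strings in `YYYYMMDD` form; that format is inferred from the preview and has not been confirmed for the full table. The `sample` frame and the new `Date` column are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical sample that mirrors the DateID strings shown in the preview,
# plus one deliberately malformed value.
sample = pd.DataFrame({"DateID": ["20201101", "20201101", "20201102", "not_a_date"]})

# Assumption: DateID follows the YYYYMMDD pattern. errors="coerce" turns any
# value that does not match into NaT instead of raising an error.
sample["Date"] = pd.to_datetime(sample["DateID"], format="%Y%m%d", errors="coerce")

# Count the values that failed to parse so they can be reviewed before analysis.
n_unparsed = int(sample["Date"].isna().sum())
print(sample.dtypes)
print(f"Unparseable DateID values: {n_unparsed}")
```

Keeping the original `DateID` column next to the parsed one makes it easy to inspect any rows that did not convert.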


In [0]:
# I am also interested in generating summary statistics to get a sense of the statistical distribution of the data.
df.describe()


|  | TotalTimeWatched |
| --- | --- |
| count | 118534.0 |
| mean | 2046.86986 |
| std | 3735.512071 |
| min | 1.0 |
| 25% | 240.0 |
| 50% | 1020.0 |
| 75% | 2400.0 |
| max | 88500.0 |
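
In the summary above, the mean (about 2047) is roughly twice the median (1020) and the maximum is 88500, which suggests a right-skewed distribution with a long upper tail. One common way to flag unusually large values is the 1.5 × IQR rule; with the reported quartiles (240 and 2400) the upper fence would be 2400 + 1.5 × 2160 = 5640. The sketch below shows the calculation on a small hypothetical series. The values are illustrative and not taken from the dataset, and the IQR rule is a suggestion rather than a method the notebook itself uses.

```python
import pandas as pd

# Hypothetical values chosen for illustration; they are not taken from the dataset.
watched = pd.Series(
    [120, 240, 300, 360, 1020, 2400, 3000, 88500], name="TotalTimeWatched"
)

# Interquartile range and the conventional upper fence (Q3 + 1.5 * IQR).
q1 = watched.quantile(0.25)
q3 = watched.quantile(0.75)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# Values above the fence are candidates for review, not automatic removal.
outliers = watched[watched > upper_fence]
print(f"Upper fence: {upper_fence}")
print(outliers)
```

Whether flagged values are errors or genuine long sessions depends on the unit of `TotalTimeWatched`, which the source does not state.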


In [0]:
# I am also interested in looking at the missing values per column, so that I can handle them if and where necessary.

df.isnull().sum()

DateID              0
CustomerID          0
TotalTimeWatched    0
Platform            0
PlayEventType       0
VideoTitle          0
dtype: int64
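
A zero count from `isnull()` only rules out true nulls. Text columns can still hold empty strings or placeholder labels that behave like missing values. The following is a minimal sketch of such a check. The `placeholders` set and the `sample` frame are assumptions made for illustration, and the actual placeholder values in the dataset, if any, are not known from the source.

```python
import pandas as pd

# Hypothetical sample with blank and placeholder entries in text columns.
sample = pd.DataFrame({
    "Platform": ["Leanback", "", "  ", "Leanback"],
    "VideoTitle": ["Frozen II", "Chasing The Sun", "Unknown", "N/A"],
})

# Assumed placeholder labels, compared after trimming and lower-casing.
placeholders = {"", "unknown", "n/a", "none", "null"}

# Count, per text column, the values that look like disguised missing data.
text_cols = ["Platform", "VideoTitle"]
hidden_missing = sample[text_cols].apply(
    lambda col: col.str.strip().str.lower().isin(placeholders).sum()
)
print(hidden_missing)
```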

In [0]:
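# Next, I want to see if we have any duplicates.
# I print the count because a notebook cell only displays its last expression.
n_duplicates = df.duplicated().sum()
print(f"Duplicate rows: {n_duplicates}")

# Optionally drop them, keeping the first occurrence of each duplicated row
df = df.drop_duplicates()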
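
Before dropping duplicates, it can help to look at them. In the first preview, rows 0 and 1 are identical in every column shown, but whether such rows are accidental repeats or separate viewing events cannot be determined from the source. The sketch below uses a small hypothetical frame to show how to list every member of a duplicate group and compare row counts before and after dropping.

```python
import pandas as pd

# Hypothetical sample: the first two rows are exact duplicates, while the last
# two share a customer and duration but differ in title.
sample = pd.DataFrame({
    "CustomerID": ["A", "A", "B", "C", "C"],
    "TotalTimeWatched": [300, 300, 360, 120, 120],
    "VideoTitle": ["F1", "F1", "Chasing The Sun", "Sonic The Hedgehog", "Frozen II"],
})

# keep=False marks every member of a duplicate group, not only the repeats.
dup_groups = sample[sample.duplicated(keep=False)]
print(dup_groups)

# Drop exact duplicates and report how many rows were removed.
deduped = sample.drop_duplicates().reset_index(drop=True)
print(f"Rows before: {len(sample)}, after: {len(deduped)}")
```

Passing `subset=[...]` to `duplicated()` would restrict the comparison to chosen columns, which is useful when only some fields define a repeat.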