
<h1 align="center">Online Courses Analysis</h1>

### **Description**

*This dataset captures user engagement metrics from an online course platform, facilitating analyses on factors influencing course completion. It includes user demographics, course-specific data, and engagement metrics.*

| **Feature**               | **Description**                                                                                     |
|---------------------------|-----------------------------------------------------------------------------------------------------|
| **UserID**                | Unique identifier for each user                                                                      |
| **CourseCategory**        | Category of the course taken by the user (e.g., Programming, Business, Arts)                         |
| **TimeSpentOnCourse**     | Total time spent by the user on the course in hours                                                  |
| **NumberOfVideosWatched** | Total number of videos watched by the user                                                           |
| **NumberOfQuizzesTaken**  | Total number of quizzes taken by the user                                                            |
| **QuizScores**            | Average scores achieved by the user in quizzes (percentage)                                         |
| **CompletionRate**        | Percentage of course content completed by the user                                                   |
| **DeviceType**            | Type of device used by the user (0: Desktop, 1: Mobile)                                              |
| **CourseCompletion**      | Course completion status (0: Not Completed, 1: Completed)                                            |

**Completion Rates**:
- **4,642** courses have not been completed.
- **3,568** courses have been completed.

The classes are relatively balanced: about **56.5%** of courses are not completed, while **43.5%** reach completion. With more than half of courses left unfinished, it is worth analyzing which factors influence completion and how course engagement can be improved.

---

## **1. Importing Necessary Libraries**

*We begin by importing the libraries required for data manipulation and visualization.*

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

## **2. Data Reading & Understanding**

*We load the dataset, inspect its structure, and perform initial exploratory steps to understand the data.*

---

In [2]:
# Load the dataset
df = pd.read_csv(r"C:\Users\khes7001\OneDrive - NIQ\Desktop\Test Sample Dashboard\Git\EpsilionAI_Eslam_Khaled\Sourse\online_course_engagement_data.csv")

# Display the shape of the dataset
print(f"Dataset Shape: {df.shape}")

# Display the first 10 rows of the dataset
df.head(10)


Dataset Shape: (9180, 9)


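|   | UserID | CourseCategory | TimeSpentOnCourse | NumberOfVideosWatched | NumberOfQuizzesTaken | QuizScores  | CompletionRate | DeviceType | CourseCompletion |
|---|--------|----------------|-------------------|-----------------------|----------------------|-------------|----------------|------------|------------------|
| 0 | 5618   | Health         | 29.97971935       | 17                    | 3                    | 50.36565595 | 20.860773      | 1          | 0                |
| 1 | 4326   | Arts           | 27.80263951       | 1                     | 5                    | 62.61596979 | 65.632415      | 1          | 0                |
| 2 | 5849   | Arts           | 86.8204847        | 14                    | 2                    | 78.4589624  | 63.812007      | 1          | 1                |
| 3 | 4992   | Science        | 35.03842663       | 17                    | 10                   | 59.19885273 | 95.433162      | 0          | 1                |
| 4 | 3866   | Programming    | 92.49064696       | 16                    | 0                    | 98.428285   | 18.102478      | 0          | 0                |
| 5 | 8650   | Health         | 79.46612884       | 12                    | 7                    | 70.23332895 | 76.484023      | 0          | 1                |
| 6 | 4321   | Health         | 78.90872424       | 10                    | 2                    | 86.83653261 | 22.588896      | 1          | 0                |
| 7 | 4589   | Business       | 12.06823675       | 16                    | 3                    | 61.55364646 | 27.410991      | 1          | 0                |
| 8 | 4215   | Business       | 81.93570918       | 8                     | 4                    | 90.26456415 | 33.308437      | 0          | 1                |
| 9 | 8089   | Programming    | 83.39402571       | 15                    | 10                   | 63.95635296 | 33.2613        | 1          | 0                |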


In [3]:
# Get information about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9180 entries, 0 to 9179
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   UserID                 9180 non-null   int64  
 1   CourseCategory         9180 non-null   object 
 2   TimeSpentOnCourse      9083 non-null   object 
 3   NumberOfVideosWatched  9092 non-null   object 
 4   NumberOfQuizzesTaken   9180 non-null   int64  
 5   QuizScores             9084 non-null   object 
 6   CompletionRate         9180 non-null   float64
 7   DeviceType             9180 non-null   int64  
 8   CourseCompletion       9180 non-null   int64  
dtypes: float64(1), int64(4), object(4)
memory usage: 645.6+ KB


In [4]:
# Describe the dataset
df.describe()


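|       | UserID      | NumberOfQuizzesTaken | CompletionRate | DeviceType | CourseCompletion |
|-------|-------------|----------------------|----------------|------------|------------------|
| count | 9180.0      | 9180.0               | 9180.0         | 9180.0     | 9180.0           |
| mean  | 4502.338671 | 5.09488              | 50.305617      | 0.501416   | 0.396078         |
| std   | 2596.313009 | 3.155766             | 28.940782      | 0.500025   | 0.489108         |
| min   | 1.0         | 0.0                  | 0.009327       | 0.0        | 0.0              |
| 25%   | 2255.75     | 2.0                  | 25.609713      | 0.0        | 0.0              |
| 50%   | 4493.5      | 5.0                  | 50.151207      | 1.0        | 0.0              |
| 75%   | 6754.25     | 8.0                  | 75.514245      | 1.0        | 1.0              |
| max   | 9000.0      | 10.0                 | 99.979711      | 1.0        | 1.0              |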


In [5]:
# Get number of unique values in each column and their unique values
for col in df.columns:
    print(f"Column: {col}")
    print(f"Number of Unique Values: {df[col].nunique()}")
    print("Unique Values:")
    print(df[col].unique())
    print('-' * 40)


Column: UserID
Number of Unique Values: 8123
Unique Values:
[5618 4326 5849 ... 6323 3652 5595]
----------------------------------------
Column: CourseCategory
Number of Unique Values: 5
Unique Values:
['Health' 'Arts' 'Science' 'Programming' 'Business']
----------------------------------------
Column: TimeSpentOnCourse
Number of Unique Values: 7969
Unique Values:
['29.97971935' '27.80263951' '86.8204847' ... '38.21251152' '70.04866546'
 '93.58978113']
----------------------------------------
Column: NumberOfVideosWatched
Number of Unique Values: 22
Unique Values:
['17' '1' '14' '16' '12' '10' '8' '15' '3' '13' nan '7' '20' '6' '11' '0'
 '5' '18' '19' '9' '2' '4' '?']
----------------------------------------
Column: NumberOfQuizzesTaken
Number of Unique Values: 11
Unique Values:
[ 3  5  2 10  0  7  4  9  1  8  6]
----------------------------------------
Column: QuizScores
Number of Unique Values: 7981
Unique Values:
['50.36565595' '62.61596979' '78.4589624' ... '69.50829722' '79.655182

## **3. Data Cleaning & Preprocessing**

*We clean the dataset by handling missing values, removing duplicates, and correcting data types to ensure data quality.*

In [6]:
# Replace '?' with NaN in the entire DataFrame
df.replace('?', np.nan, inplace=True)

# Check for missing values in each column
missing_values = df.isna().sum()
print("Missing Values in Each Column:")
print(missing_values)


Missing Values in Each Column:
UserID                     0
CourseCategory             0
TimeSpentOnCourse        181
NumberOfVideosWatched    183
NumberOfQuizzesTaken       0
QuizScores               183
CompletionRate             0
DeviceType                 0
CourseCompletion           0
dtype: int64
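
As an alternative to replacing `'?'` after loading, the placeholder can be declared as missing at read time through the `na_values` parameter of `pd.read_csv`. The sketch below uses a small made-up sample (the rows and the name `sample` are illustrative, not taken from the dataset) to show the effect on dtypes and missing-value counts.

```python
import io

import pandas as pd

# Tiny made-up sample that mimics the '?' placeholders seen above
raw = io.StringIO(
    "UserID,NumberOfVideosWatched,QuizScores\n"
    "1,17,50.4\n"
    "2,?,62.6\n"
    "3,14,?\n"
)

# na_values tells read_csv to treat '?' as missing while parsing,
# so the affected columns are read as numeric instead of object
sample = pd.read_csv(raw, na_values=["?"])

print(sample.dtypes)
print(sample.isna().sum())
```

With this approach the columns arrive as numeric, so a later type conversion step becomes a safety check rather than a requirement.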


In [4]:
# Check for duplicates
duplicate_count = df.duplicated().sum()
print(f"Number of Duplicate Rows: {duplicate_count}")

# Drop duplicates and reset the index
df.drop_duplicates(inplace=True)
df.reset_index(drop=True, inplace=True)

# Verify that duplicates are removed
duplicate_count_after = df.duplicated().sum()
print(f"Number of Duplicate Rows after Dropping: {duplicate_count_after}")


Number of Duplicate Rows: 0
Number of Duplicate Rows after Dropping: 0
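
Note that `df.duplicated()` only flags rows that match in every column. Since `UserID` has fewer unique values than the dataset has rows, it can be useful to also count repeated keys. The following is a minimal sketch on a made-up frame (`toy` is hypothetical) that contrasts the two checks.

```python
import pandas as pd

# Made-up rows: UserID 2 appears twice, but with different course data
toy = pd.DataFrame({
    "UserID": [1, 2, 2, 3],
    "CourseCategory": ["Arts", "Health", "Science", "Arts"],
})

# Rows identical in every column
full_row_dupes = int(toy.duplicated().sum())

# Rows whose UserID has already appeared earlier in the frame
repeated_ids = int(toy.duplicated(subset=["UserID"]).sum())

print(full_row_dupes, repeated_ids)
```

Whether repeated IDs are a problem depends on the data: a user enrolled in several courses would legitimately appear more than once.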


In [5]:
# Edit columns data type
df['TimeSpentOnCourse'] = pd.to_numeric(df['TimeSpentOnCourse'], errors='coerce').astype('float')
df['NumberOfVideosWatched'] = pd.to_numeric(df['NumberOfVideosWatched'], errors='coerce').round().astype('Int64')
df['QuizScores'] = pd.to_numeric(df['QuizScores'], errors='coerce').astype('float')
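
To illustrate what the conversion above does, here is a small sketch on made-up values (the series `raw` is illustrative only). It shows how `errors='coerce'` handles unparseable entries and why the nullable `Int64` dtype is used for a count column that still contains missing values.

```python
import pandas as pd

raw = pd.Series(["17", "1", "?", None, "20"])  # illustrative values only

# errors='coerce' turns anything unparseable (such as '?') into NaN
as_float = pd.to_numeric(raw, errors="coerce")

# The nullable 'Int64' dtype keeps whole numbers while still allowing <NA>
as_int = as_float.round().astype("Int64")

print(as_float.tolist())
print(as_int.tolist())
```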


In [6]:
# Identify columns with missing values
missing_cols = df.columns[df.isna().any()].tolist()
print(f"Columns with Missing Values: {missing_cols}")


Columns with Missing Values: ['TimeSpentOnCourse', 'NumberOfVideosWatched', 'QuizScores']


In [7]:
# Deal with missing values in numerical columns
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns           
df[numerical_cols] = df[numerical_cols].fillna(df[numerical_cols].median())

# Verify that there are no more missing values
print("Missing Values after Imputation:")
print(df.isna().sum())


Missing Values after Imputation:
UserID                   0
CourseCategory           0
TimeSpentOnCourse        0
NumberOfVideosWatched    0
NumberOfQuizzesTaken     0
QuizScores               0
CompletionRate           0
DeviceType               0
CourseCompletion         0
dtype: int64
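
The effect of median imputation can be seen on a small made-up frame. This sketch (with the hypothetical names `toy`, `cols_with_na`, and `filled`) restricts the fill to columns that actually contain missing values and shows that every imputed row receives the same number.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "TimeSpentOnCourse": [10.0, 20.0, np.nan, 40.0, np.nan],
    "NumberOfQuizzesTaken": [3, 5, 2, 10, 0],
})

# Only touch columns that actually contain missing values
cols_with_na = toy.columns[toy.isna().any()].tolist()

# median() skips NaN by default; fillna() matches medians to columns by name
medians = toy[cols_with_na].median()
filled = toy.fillna(medians)

print(cols_with_na)
print(filled["TimeSpentOnCourse"].tolist())
```

Because all imputed rows land on the median, histograms of an imputed column can show an artificial spike at that value.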


In [8]:
# Check if there are outliers in the numerical columns
cols = ['TimeSpentOnCourse', 'NumberOfVideosWatched', 'CompletionRate', 'NumberOfQuizzesTaken', 'QuizScores']
for col in cols:
    fig = px.box(df, x=col, title=f'Box Plot of {col}')
    fig.show()

*We did not remove outliers as the dataset does not provide sufficient context to determine whether these extreme values are errors or represent genuine variations in user behavior.*
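
If a numeric count of extreme values is wanted alongside the box plots, one option is the common 1.5 × IQR rule. The helper below, `iqr_outlier_mask`, is a hypothetical function written for this sketch and is demonstrated on made-up values; it flags points without removing them.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


toy = pd.Series([10, 12, 11, 13, 12, 11, 95])  # made-up values
mask = iqr_outlier_mask(toy)

print(int(mask.sum()))
print(toy[mask].tolist())
```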

---

In [9]:
# Save cleaned data to CSV file
df.to_csv('cleaned_data.csv', index=False)  # to_csv returns None when given a path, so nothing is assigned
print("Cleaned data has been saved to 'cleaned_data.csv'.")


Cleaned data has been saved to 'cleaned_data.csv'.


## **4. Exploratory Data Analysis (EDA)**

*We perform exploratory data analysis to uncover patterns, correlations, and insights within the data.*

In [10]:
# Count of Course Completion Status
count_course_completion = df['CourseCompletion'].value_counts()
print("Course Completion Counts:")
print(count_course_completion)


Course Completion Counts:
CourseCompletion
0    4642
1    3568
Name: count, dtype: int64


In [11]:
# Pie chart of Course Completion Status
fig = px.pie(df, names='CourseCompletion', title='Course Completion Status Distribution', template='plotly_dark')
fig.show()


**Completion Rates**:
- **4,642** courses have not been completed.
- **3,568** courses have been completed.

The classes are relatively balanced: about **56.5%** of courses are not completed, while **43.5%** reach completion. With more than half of courses left unfinished, it is worth analyzing which factors influence completion and how course engagement can be improved.
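
The percentages follow directly from the counts printed by `value_counts()` above. A short worked calculation:

```python
# Class counts as printed by value_counts() above
counts = {0: 4642, 1: 3568}  # 0: not completed, 1: completed

total = sum(counts.values())
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

print(total)   # 8210
print(shares)  # {0: 56.5, 1: 43.5}
```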


In [12]:
# Count of Course Categories
course_category_counts = df['CourseCategory'].value_counts()
print("Course Category Counts:")
print(course_category_counts)


Course Category Counts:
CourseCategory
Business       1672
Programming    1655
Science        1654
Health         1648
Arts           1581
Name: count, dtype: int64


In [13]:
# Pie chart of Course Categories
fig = px.pie(df, names='CourseCategory', title='Course Category Distribution', hole=0.5, template='plotly_dark')
fig.show()


**Business Courses**:

There are **1,672** courses in the Business category, making it the largest category in the dataset.

**Programming Courses**:

With **1,655** courses, the Programming category is slightly behind Business but still holds a significant portion of the dataset.

**Science Courses**:

The Science category contains **1,654** courses, closely following Programming and Business in terms of quantity.

**Health Courses**:

There are **1,648** courses in the Health category, slightly fewer than those in Science.

**Arts Courses**:

The Arts category has the fewest courses with **1,581**, though it remains a notable category within the dataset.

Overall, the five categories are represented almost equally, each accounting for roughly 19% to 20% of the dataset. Business, Programming, and Science have the highest counts, but only by a small margin.
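
To put the category counts on a common scale, they can be converted to percentage shares. This sketch reuses the counts printed above; the variable names are chosen for illustration.

```python
import pandas as pd

# Counts as printed by value_counts() above
category_counts = pd.Series({
    "Business": 1672,
    "Programming": 1655,
    "Science": 1654,
    "Health": 1648,
    "Arts": 1581,
})

# Share of each category in percent, rounded to one decimal place
category_shares = (category_counts / category_counts.sum() * 100).round(1)

# Gap between the largest and smallest category, in percentage points
spread = round(category_shares.max() - category_shares.min(), 1)

print(category_shares)
print(spread)
```

On the full DataFrame, `df['CourseCategory'].value_counts(normalize=True)` gives the same shares as fractions.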

---

## **5. Visualization and Insights**

*We create various visualizations to explore the distribution and relationships of key features.*

In [14]:
# Define the columns for histograms
cols = ['TimeSpentOnCourse', 'NumberOfVideosWatched', 'CompletionRate', 'NumberOfQuizzesTaken', 'QuizScores']

# Number of columns to plot
num_columns = len(cols)

# Determine the number of rows and columns needed for the subplots grid
num_cols = 3  # Number of columns in the grid
num_rows = (num_columns + num_cols - 1) // num_cols  # Calculate number of rows required

# Create a subplot figure
fig = make_subplots(rows=num_rows, cols=num_cols, subplot_titles=cols)

# Loop through each column and add a histogram to the subplot grid
for i, col in enumerate(cols):
    row = i // num_cols + 1  # Calculate the row index
    col_pos = i % num_cols + 1  # Calculate the column index
    
    # Add histogram for each column
    fig.add_trace(go.Histogram(x=df[col], name=col), row=row, col=col_pos)

# Update layout for the entire figure
fig.update_layout(height=800, width=1000, showlegend=True, title_text='Histograms of DataFrame Columns')

# Show tick labels on every subplot axis
fig.update_xaxes(showticklabels=True)
fig.update_yaxes(showticklabels=True)

# Display the plot
fig.show()
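
The grid arithmetic used in the loop above (ceiling division for the row count, integer division and modulo for each position) can be checked in isolation. `grid_positions` is a hypothetical helper written only for this sketch.

```python
def grid_positions(n_items: int, n_cols: int) -> list[tuple[int, int]]:
    """Return 1-based (row, col) positions for n_items laid out row by row."""
    return [(i // n_cols + 1, i % n_cols + 1) for i in range(n_items)]


n_items, n_cols = 5, 3

# Ceiling division: smallest number of rows that fits all items
n_rows = (n_items + n_cols - 1) // n_cols

positions = grid_positions(n_items, n_cols)
print(n_rows)     # 2
print(positions)  # [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2)]
```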


1. **Course Category Distribution**:
   - The most frequent course categories, ranked from highest to lowest, are:
     1. **Business**
     2. **Programming**
     3. **Science**
     4. **Health**
     5. **Arts**

2. **Time Spent on Courses**:
   - The most populated bin is **49 to 51 hours**. Median imputation assigns every missing value the same number, which may inflate this bin.
   - The least populated bin is **99 to 101 hours**.

3. **Number of Videos Watched**:
   - The most common value is **10 videos watched**.

4. **Number of Quizzes Taken**:
   - Counts are spread roughly evenly across the range of **0 to 10** quizzes.

5. **Quiz Scores**:
   - The most frequent quiz-score bin is around **75**. This is not the maximum score: the first rows of the dataset already include a score of about 98.4.


In [15]:
import plotly.express as px
from IPython.display import display, Markdown

# Define the columns for grouped histograms
cols = [
    'TimeSpentOnCourse',
    'NumberOfVideosWatched',
    'NumberOfQuizzesTaken',
    'QuizScores',
    'CompletionRate',
    'CourseCategory'
]

# Define the insights for each column
insights = {
    'TimeSpentOnCourse': """
**Time Spent on Courses**:

The most populated bin is **49 to 51 hours**. Median imputation assigns every missing value the same number, which may inflate this bin. The least populated bin is **99 to 101 hours**.
""",
    'NumberOfVideosWatched': """
**Number of Videos Watched**:

The most common value is **10 videos watched**.
""",
    'NumberOfQuizzesTaken': """
**Number of Quizzes Taken**:

Counts are spread roughly evenly across the range of **0 to 10** quizzes.
""",
    'QuizScores': """
**Quiz Scores**:

The most frequent quiz-score bin is around **75**. This is not the maximum score: the first rows of the dataset already include a score of about 98.4.
""",
    'CompletionRate': """
**Course Completion by Time Spent**:

Among courses with less than **20 hours** of time spent, fewer than **25%** are completed. Above **70 hours**, completed and non-completed courses occur in roughly equal numbers.
""",
    'CourseCategory': """
**Course Completion by Number of Videos Watched**:

Courses where fewer than **5 videos** are watched are more often not completed. With more than **6 videos** watched, the share of non-completion drops, suggesting that watching more videos is associated with higher completion.

**Impact of Number of Quizzes on Completion Rate**:

As the number of quizzes taken increases, the share of completed courses rises and the share of non-completed courses falls. This indicates a positive association between the number of quizzes and course completion.
"""
}

# Plot histograms with grouping by CourseCompletion and display insights below
for col in cols:
    # Create the histogram
    fig = px.histogram(
        df,
        x=col,
        barmode='group',
        color='CourseCompletion',
        marginal='box',
        title=f'Histogram of {col} by Course Completion',
        template='plotly_dark'
    )
    
    # Display the figure
    fig.show()
    
    # Retrieve and display the corresponding insight
    insight = insights.get(col, "No insight available for this chart.")
    display(Markdown(insight))



**Time Spent on Courses**:

The most populated bin is **49 to 51 hours**. Median imputation assigns every missing value the same number, which may inflate this bin. The least populated bin is **99 to 101 hours**.



**Number of Videos Watched**:

The most common value is **10 videos watched**.



**Number of Quizzes Taken**:

Counts are spread roughly evenly across the range of **0 to 10** quizzes.



**Quiz Scores**:

The most frequent quiz-score bin is around **75**. This is not the maximum score: the first rows of the dataset already include a score of about 98.4.



**Course Completion by Time Spent**:

Among courses with less than **20 hours** of time spent, fewer than **25%** are completed. Above **70 hours**, completed and non-completed courses occur in roughly equal numbers.



**Course Completion by Number of Videos Watched**:

Courses where fewer than **5 videos** are watched are more often not completed. With more than **6 videos** watched, the share of non-completion drops, suggesting that watching more videos is associated with higher completion.

**Impact of Number of Quizzes on Completion Rate**:

As the number of quizzes taken increases, the share of completed courses rises and the share of non-completed courses falls. This indicates a positive association between the number of quizzes and course completion.
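
Statements such as "fewer than 25% of courses under 20 hours are completed" can be checked numerically rather than read off a chart. The sketch below bins a made-up frame with `pd.cut` and takes the mean of the 0/1 completion flag per bin. The data, bin edges, and labels are assumptions for illustration and do not reproduce the figures quoted above.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "TimeSpentOnCourse": [5, 12, 18, 35, 50, 65, 75, 90],
    "CourseCompletion":  [0, 0, 1, 0, 1, 1, 1, 0],
})

# Right-closed bins: (0, 20], (20, 70], (70, 100]
bins = [0, 20, 70, 100]
labels = ["0-20h", "20-70h", "70-100h"]
toy["TimeBin"] = pd.cut(toy["TimeSpentOnCourse"], bins=bins, labels=labels)

# The mean of a 0/1 column is the share of completed records in each bin
share_completed = toy.groupby("TimeBin", observed=True)["CourseCompletion"].mean()

print(share_completed)
```

The same pattern applies to `NumberOfVideosWatched` or `NumberOfQuizzesTaken` by swapping the column and bin edges.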


## **6. Correlation Analysis**

*We analyze the correlations between numerical features to identify any multicollinearity and understand feature relationships.*

---

In [21]:
# Drop unnecessary columns
df = df.drop(columns=['UserID', 'DeviceType'])

# Calculate the correlation matrix
correlation_matrix = df.corr(numeric_only=True)

# Create the heatmap using Plotly Express
fig = px.imshow(correlation_matrix,
                text_auto=True,  # Automatically show values in the heatmap cells
                color_continuous_scale='Viridis',  # Use 'Viridis' for a perceptually uniform color scale
                labels={'color': 'Correlation'},  # Label for the color bar
                title='Correlation Heatmap')  # Title for the heatmap

# Update layout to improve appearance
fig.update_layout(
    height=600,  # Set the height of the figure
    width=800,  # Set the width of the figure
    xaxis_title='Variables',  # Label for the x-axis
    yaxis_title='Variables',  # Label for the y-axis
    title_x=0.5  # Center the title
)

# Display the heatmap
fig.show()
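
Besides the full heatmap, it can help to rank features by their correlation with the target alone. This is a minimal sketch on made-up data (the frame `toy` and its values are illustrative), using the same `corr(numeric_only=True)` call as above.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "NumberOfQuizzesTaken": [0, 2, 4, 6, 8, 10],
    "QuizScores": [55, 90, 60, 85, 58, 88],
    "CourseCompletion": [0, 0, 0, 1, 1, 1],
})

# Take the target's column of the correlation matrix, drop its
# self-correlation, and sort from strongest positive downwards
corr_with_target = (
    toy.corr(numeric_only=True)["CourseCompletion"]
    .drop("CourseCompletion")
    .sort_values(ascending=False)
)

print(corr_with_target)
```

Pearson correlation against a binary target only captures linear association, so a low value does not rule out a non-linear relationship.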


## **7. Conclusion**

*In this analysis, we successfully explored and cleaned the dataset, performed exploratory data analysis, and visualized key metrics related to online course engagement and completion. Key findings include:*

- *A relatively balanced distribution between completed and non-completed courses, indicating the importance of understanding factors that influence course completion.*
- *Business, Programming, and Science are the most prevalent course categories, although all five categories are represented almost equally (roughly 19% to 20% each).*
- *Time spent on courses and the number of videos watched positively correlate with course completion rates.*
- *An increased number of quizzes is associated with higher completion rates, highlighting the role of assessments in maintaining user engagement.*

*By leveraging these insights, the platform can develop targeted strategies to enhance course completion rates, such as optimizing course length, increasing interactive content like quizzes, and focusing on the most popular course categories.*

*Let's move on to the machine learning phase!*

---