# Introduction

DataCamp's [Data Model](https://enterprise-docs.datacamp.com/data-connector/explore-data-model/data-model) offers a large amount of enterprise user data that can be used for reliable and accurate data analytics. As it stands, several metrics, such as user XP, fail to measure user performance accurately because of external factors like exploits. In addition, a few metrics, such as course difficulty, are missing from the data model, which makes it difficult to track a user's skill level. Despite these shortcomings, a lot of useful data remains to be analyzed for insights on user engagement. Through feature engineering grounded in mathematical references, data analytics, and machine learning models, meaningful insights can be derived from the existing dataset. This document addresses potential improvements in scholar management, such as planning, monitoring, and resource allocation, by examining several aspects that support the goals of **Project DREAMS**.

---

# Research Questions

<a  name="learning-analytics"></a>
## 📊 Learning Analytics

### **I. How can the median time spent per course and its chapters be used to determine course difficulty?**

  

**Relevance:** Currently, a `difficulty` column does not exist in the `Data Model`, and the `difficulty` shown on the interface for each course is limited to three categories: `Beginner`, `Intermediate`, and `Advanced`. Additionally, these `difficulty` values are set by the instructor and do not reflect user experience, which makes them arbitrary. Without a reliable difficulty metric, the group cannot properly measure a course's true difficulty or make accurate course recommendations for different skill levels. A reliable difficulty metric allows the group to make appropriate recommendations to beginners and experienced scholars alike. With data transformation, such a metric can be derived from the statistics of time spent on a particular course.

  

**Tables and Columns Required:**

- `course_fact`: `user_id`, `course_id`, `completed_at`, `time_spent`

- `course_dim`: `course_id`, `title`, `technology`, `topic`

- `chapter_fact`: `user_id`, `chapter_id`, `time_spent`

- `chapter_dim`: `chapter_id`, `title`, `course_title`

- `user_dim`: `user_id`, `first_name`, `last_name`, `email`

  

**Methodology:**

1. Filter for completed courses using:

>  `course_fact.completed_at IS NOT NULL`

2. Perform a `LEFT JOIN` with `user_dim` and `course_fact` on `user_id`

3. Perform another `LEFT JOIN` with `course_dim` on `course_id`

4. Perform another `LEFT JOIN` with `chapter_fact` on `user_id`

5. Perform another `LEFT JOIN` with `chapter_dim` on `chapter_id`

6. Group by `course_title`

7. Remove `time_spent` outliers using the **Interquartile Range (IQR)**, e.g. values below $Q_1 - 1.5 \cdot IQR$ or above $Q_3 + 1.5 \cdot IQR$

8. Create a `chapter_difficulty` column by performing a [Min-Max scaling](https://machinelearninggeek.com/feature-scaling-minmax-standard-and-robust-scaler/) across all chapters of all courses:

> $chapter\_difficulty=\frac{chapter\_time\_spent - min(time\_spent)}{max(time\_spent) - min(time\_spent)}$

9. Create `course_difficulty` column by getting the mean of related `chapter_difficulty` rows

10. Finalize by getting the median `chapter_difficulty` and `course_difficulty`
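
The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the sample rows, the helper names `iqr_filter` and `min_max_scale`, and the 1.5 × IQR fence are all hypothetical choices, and the per-chapter time is taken to be the median `time_spent` after outlier removal.

```python
from statistics import mean, median, quantiles

def iqr_filter(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is an assumed convention."""
    if len(values) < 4:
        return list(values)  # too few points to estimate quartiles meaningfully
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

def min_max_scale(by_key):
    """Scale the values of a {key: value} mapping to the [0, 1] range."""
    lo, hi = min(by_key.values()), max(by_key.values())
    if hi == lo:
        return {k: 0.0 for k in by_key}
    return {k: (v - lo) / (hi - lo) for k, v in by_key.items()}

# Hypothetical joined rows: (course_title, chapter_id, time_spent)
rows = [
    ("Intro to SQL", "c1", 20), ("Intro to SQL", "c1", 22),
    ("Intro to SQL", "c1", 24), ("Intro to SQL", "c1", 26),
    ("Intro to SQL", "c1", 500),  # outlier, e.g. a session left open
    ("Intro to SQL", "c2", 40), ("Intro to SQL", "c2", 44),
    ("Deep Learning", "c3", 90), ("Deep Learning", "c3", 110),
]

# Collect time_spent per chapter and remember each chapter's course
times, course_of = {}, {}
for course, chapter, t in rows:
    times.setdefault(chapter, []).append(t)
    course_of[chapter] = course

# Median time per chapter after outlier removal, scaled across all chapters
chapter_time = {ch: median(iqr_filter(ts)) for ch, ts in times.items()}
chapter_difficulty = min_max_scale(chapter_time)

# Course difficulty is the mean difficulty of the course's chapters
per_course = {}
for ch, d in chapter_difficulty.items():
    per_course.setdefault(course_of[ch], []).append(d)
course_difficulty = {c: mean(ds) for c, ds in per_course.items()}
```

Because the scaling runs across all chapters of all courses, difficulty values are comparable between courses, which is what makes them usable as weights later on.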

  

**Potential Insights and Actions:**

- Having access to a proper `course_difficulty` metric will allow the team to make appropriate course recommendations to beginners and advanced scholars alike.

- `chapter_difficulty` and `course_difficulty` may serve as a weight for further data transformation.

- Scholars may refer to `course_difficulty` as a reference to see which course caters to their skill level.

- The same process may also be applied to `Projects` allowing for better project recommendations.

- This process may be added to DataCamp's course recommendation algorithm, allowing for a smoother transition between skill levels.

- This metric allows for a dynamic difficulty setting: a course's difficulty changes and moves closer to its true value as more scholars finish the course.

- Because this metric is objective and user-focused, it allows the team to find patterns in which courses the scholars find difficult.

### **II. How productive are scholars across different topics? (E.g. Data Visualization, Machine Learning, etc.)**

  

**Relevance:** Scholar performance may not be reliably measured through XP alone, as this metric is susceptible to confounding variables. For example, XP can easily be exploited through the `Practice` system, meaning a scholar's XP might not reflect their performance. A more reliable metric is required to accurately measure scholar performance: `productivity`. Such a metric will reveal which topics the scholars are performing well on and which topics might require intervention.

  

**Tables and Columns Required:**

- `chapter_fact`: `user_id`, `chapter_id`, `completed_at`, `time_spent`

- `chapter_dim`: `chapter_id`, `title`, `technology`, `topic`, `xp`

- `user_dim`: `user_id`, `first_name`, `last_name`, `email`

  

**Methodology:**

1. Perform a `LEFT JOIN` on `chapter_fact` and `user_dim` on `user_id`

2. Perform another `LEFT JOIN` with `chapter_dim` on `chapter_id`

3. Filter for observations where the chapter is accomplished using:

>`chapter_fact.completed_at IS NOT NULL`

4. Remove `time_spent` outliers using the **Interquartile Range (IQR)**, e.g. values below $Q_1 - 1.5 \cdot IQR$ or above $Q_3 + 1.5 \cdot IQR$

5. Following the [Productivity Ratio](https://www.indeed.com/career-advice/career-development/productivity-ratio), calculate the `productivity` metric for each scholar:

>$productivity=\frac{total\_xp}{time\_spent}$

  

- For a more reliable approach, utilize the median `chapter_difficulty` metric from the previous research question as weights:

> $productivity=\frac{(chapter\_difficulty)(chapter\_xp)}{time\_spent}$

6. Group by `chapter_dim.topic` and filter for topics with more than 30 observations <br  />

(A minimum of 30 observations was chosen as a common rule of thumb for the **Central Limit Theorem** to give a reasonable approximation.)

7. Calculate the summary statistics for `productivity`
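
As a rough sketch of the `productivity` calculation and the per-topic filter, assuming hypothetical input rows of the form `(topic, chapter_xp, time_spent, chapter_difficulty)` and using the median as the summary statistic:

```python
from statistics import median

MIN_OBSERVATIONS = 30  # threshold from the methodology above

def productivity(chapter_xp, time_spent, chapter_difficulty=1.0):
    """XP earned per unit of time, optionally weighted by chapter difficulty."""
    if time_spent <= 0:
        raise ValueError("time_spent must be positive")
    return chapter_difficulty * chapter_xp / time_spent

def median_productivity_by_topic(rows, min_obs=MIN_OBSERVATIONS):
    """rows: iterable of (topic, chapter_xp, time_spent, chapter_difficulty)."""
    by_topic = {}
    for topic, xp, t, diff in rows:
        by_topic.setdefault(topic, []).append(productivity(xp, t, diff))
    # Keep only topics with more than `min_obs` observations
    return {
        topic: median(vals)
        for topic, vals in by_topic.items()
        if len(vals) > min_obs
    }

# Hypothetical completed-chapter rows
rows = (
    [("Data Visualization", 1000, 50, 0.5)] * 31
    + [("Machine Learning", 1200, 60, 0.8)] * 10  # too few observations
)
result = median_productivity_by_topic(rows)
```

Leaving `chapter_difficulty` at its default of `1.0` reproduces the unweighted ratio, so the same helper covers both formulas.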

  

**Potential Insights and Actions:**

- If the `productivity` median for a certain topic is lower than most topics, the team could host a workshop for that topic or provide additional materials and support to boost understanding.

- If the `productivity` median for a certain topic is higher than most topics, this could indicate effective learning material, allowing for better recommendations.

### **III. How can course statistics be used to segregate scholar skill-level?**

  

**Relevance:**

Skill-level segregation is useful for gaining insight into the composition of skill levels in the scholar population. By splitting the scholars into three categories (`beginner`, `middle`, and `advanced`), resource allocation may be optimized by tailoring events and interventions to the most dominant skill level among the population.

  

**Tables and Columns Required:**

- `course_fact`: `user_id`, `course_id`, `completed_at`

- `course_dim`: `course_id`, `title`

- `user_dim`: `user_id`, `first_name`, `last_name`, `email`

  

**Methodology:**

1. Filter completed courses using:

>  `course_fact.completed_at IS NOT NULL`

2. Perform a `LEFT JOIN` with `course_fact` and `course_dim` on `course_id`

3. Perform another `LEFT JOIN` with `user_dim` on `user_id`

4. Utilize the `course_difficulty` metric from Research Question I by appending its values to the respective course rows

5. Group by `user_id`

6. Prepare a dataset containing three columns: sum of `course_difficulty`, average value of `course_difficulty`, and number of courses completed

7. Standardize the dataset

8. Train a clustering model and use it to group the dataset into 3 clusters
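
The feature preparation and standardization steps might look like the sketch below. The input mapping `completions` and both helper names are hypothetical, and the clustering step itself is left to a library of choice (for example, a k-means implementation with 3 clusters), since the source does not name a specific model.

```python
from statistics import mean, pstdev

def user_features(completions):
    """completions: {user_id: [course_difficulty, ...]} for completed courses.
    Returns {user_id: (sum of difficulty, mean difficulty, courses completed)}."""
    return {
        user: (sum(ds), mean(ds), len(ds))
        for user, ds in completions.items()
        if ds
    }

def standardize(features):
    """Z-score each column: (x - column mean) / column standard deviation."""
    users = list(features)
    columns = list(zip(*features.values()))
    scaled_cols = []
    for col in columns:
        mu, sigma = mean(col), pstdev(col)
        # A constant column carries no information, so map it to zeros
        scaled_cols.append([(x - mu) / sigma if sigma else 0.0 for x in col])
    return {u: tuple(row) for u, row in zip(users, zip(*scaled_cols))}

# Hypothetical per-user course_difficulty values
completions = {"u1": [0.2, 0.4], "u2": [0.6, 0.8, 1.0], "u3": [0.5]}
scaled = standardize(user_features(completions))
```

Standardizing matters here because the three columns live on different scales; without it, the course count would dominate any distance-based clustering.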

  

**Potential Insights and Actions:**

- If the majority of the population belongs to the `beginner` group, the team may plan interventions to help beginners catch up.

- If the majority of the population belongs to the `middle` group, the team may plan interventions to help middle-level scholars upskill.

- If the majority of the population belongs to the `advanced` group, the team may plan interventions to help advanced scholars progress through their journey (e.g., leadership opportunities, research ideas).

- Historical skill-level group data may also show scholar skill progression throughout the term.

<a  name="engagement-trends"></a>
## 📅 Engagement Trends

### **IV. What day of the week and time of the day are the scholars most active?**

  

**Relevance:**

Finding the study patterns of scholars allows the team to plan workshops, interventions, and events more effectively. For instance, the ideal day and time to host a workshop is when the scholars are most active. This gives most scholars the opportunity to join, as the data suggests that most of them will be available at that time. Understanding the scholars' study patterns allows the group to maximize workshop and intervention efficiency.

  

**Tables and Columns Required:**

- `xp_fact`: `user_id`, `created_date`, `xp`

- `user_dim`: `user_id`, `first_name`, `last_name`, `email`

  

**Methodology:**

1. Perform a `LEFT JOIN` with `xp_fact` and `user_dim` on `user_id`

2. Extract `day_of_week` and `hour` from `created_date`

3. Group by `day_of_week` and `hour`

4. Count unique `user_id`

5. Visualize the days and hours the scholars are most active by plotting a heatmap or a lineplot
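
A minimal sketch of the extraction and counting steps, assuming `created_date` arrives as an ISO 8601 timestamp string in a single time zone (the sample events and the function name are hypothetical):

```python
from collections import defaultdict
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

def active_users_by_slot(events):
    """events: iterable of (user_id, created_date as an ISO 8601 string).
    Returns {(day_of_week, hour): number of unique users}."""
    users = defaultdict(set)
    for user_id, created in events:
        ts = datetime.fromisoformat(created)
        users[(DAYS[ts.weekday()], ts.hour)].add(user_id)
    return {slot: len(ids) for slot, ids in users.items()}

# Hypothetical xp_fact rows
events = [
    ("u1", "2024-01-01T20:15:00"),  # a Monday
    ("u1", "2024-01-01T20:45:00"),  # same user and slot, counted once
    ("u2", "2024-01-01T20:05:00"),
    ("u3", "2024-01-03T09:30:00"),  # a Wednesday
]
counts = active_users_by_slot(events)
busiest = max(counts, key=counts.get)
```

The resulting `(day_of_week, hour)` counts map directly onto the cells of a heatmap.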

  

**Potential Insights and Actions:**

- The team may maximize workshop/intervention efficiency by picking the day and time at which the scholars are most active.

- This also supports long-term planning, as the group will be able to track trends over time.

### **V. What is the weekly growth rate of scholar engagement?**

  

**Relevance:**

Analyzing the growth of scholar engagement allows the team to take the appropriate course of action to maximize scholar productivity. It shows how active the scholars were over the past week, which informs planning for the following week's activities. For example, it may serve as a basis for determining whether the scholars are up for an activity or event the following week.

  

**Tables and Columns Required:**

- `xp_fact`: `user_id`, `event`, `created_date`, `xp`

- `user_dim`: `user_id`, `first_name`, `last_name`, `email`

  

**Methodology:**

1. Perform a `LEFT JOIN` with `xp_fact` and `user_dim` on `user_id`

2. Filter for the last 14 days from `created_date` and split into two groups: `current_week` and `last_week`

3. Take the 7-day `xp` sums of each week and calculate the [Percentage Increase](https://www.cuemath.com/percentage-increase-formula/) using:

> $percentage\_increase=\left(\frac{current\_week - last\_week}{last\_week}\right) \times 100$

4. Interpret results and make data-driven decisions
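
The week-over-week calculation can be sketched as follows, assuming XP has already been summed per calendar day into a hypothetical `daily_xp` mapping. Returning `None` for an empty baseline week is an illustrative choice, since the formula is undefined when `last_week` is zero.

```python
from datetime import date, timedelta

def weekly_growth(daily_xp, today):
    """daily_xp: {date: total xp for that day}.
    Compares the most recent 7 days with the 7 days before them."""
    def window_sum(start_offset):
        return sum(
            daily_xp.get(today - timedelta(days=d), 0)
            for d in range(start_offset, start_offset + 7)
        )

    current_week, last_week = window_sum(0), window_sum(7)
    if last_week == 0:
        return None  # growth is undefined without a baseline
    return (current_week - last_week) / last_week * 100

# Hypothetical data: 150 XP per day this week, 100 XP per day last week
today = date(2024, 1, 14)
daily_xp = {
    today - timedelta(days=d): (150 if d < 7 else 100)
    for d in range(14)
}
growth = weekly_growth(daily_xp, today)
```

With these sample values, the 7-day sums are 1050 and 700, which gives a growth of 50%.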

  

**Potential Insights and Actions:**

- If the growth rate is positive, the team may plan events and activities to make the most of scholar engagement.

- If the growth rate is negative, the team may plan interventions to increase scholar engagement.

<a  name="retention-prediction"></a>
## ⚠️ Retention Prediction

### **VI. What are the telltale signs that put scholars at risk of dropping?**

  

**Relevance:**

Understanding the study patterns and other behaviors of scholars who have forfeited their scholarship will allow the group to predict the drop rate of current scholars. Identifying at-risk scholars allows the team to reach out and provide support to those in need. Additionally, further analysis of study patterns may reveal their reasons for dropping (e.g., they got stuck on a course).

  

**Tables and Columns Required:**

- `xp_fact`: `user_id`, `event`, `created_date`, `xp`

- `user_team_bridge`: `user_id`, `team_id`, `joined_team_date`, `left_team_date`

- `user_dim`: `user_id`, `first_name`, `last_name`, `email`

  

**Methodology:**

1. Filter for the `team_id` of interest in `user_team_bridge`

2. Perform a `LEFT JOIN` with `user_team_bridge` and `user_dim` on `user_id`

3. Perform another `LEFT JOIN` with `xp_fact` on `user_id`

4. Create a column `is_dropped` and set it to true when the scholar's `left_team_date` is not `NULL`

5. Split and scale the data appropriately

6. Using the data, train and tune a time-series regression model to predict scholar drop rate

7. A clustering model may also be used to group dropped scholars' behavior and study patterns for further analysis
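
The labeling step can be illustrated with a short sketch. The row layout, the sample `team_id` values, and the function name are hypothetical; model training is omitted because the source does not specify the model beyond a time-series regression.

```python
from datetime import date

def label_dropped(bridge_rows, team_id):
    """bridge_rows: iterable of (user_id, team_id, left_team_date or None).
    Returns {user_id: is_dropped} for the team of interest."""
    return {
        user: left is not None
        for user, team, left in bridge_rows
        if team == team_id
    }

# Hypothetical user_team_bridge rows
bridge_rows = [
    ("u1", 7, None),
    ("u2", 7, date(2024, 2, 1)),
    ("u3", 7, None),
    ("u4", 9, date(2024, 1, 5)),  # different team, ignored
]
is_dropped = label_dropped(bridge_rows, team_id=7)

# Observed drop rate for the team, useful as a baseline for any model
drop_rate = sum(is_dropped.values()) / len(is_dropped)
```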

  

**Potential Insights and Actions:**

- The team may provide support to the scholars who are at risk of dropping through emails, reminders, or workshops.

- Uncovering the study patterns and behaviors which determine dropping may help the team plan proactive measures.

<a  name="honest-fair-play"></a>
## 🏆 Honesty & Fair Play

### **VII. How can XP exploitation be discouraged to maintain learning productivity and fair competition?**

  

**Relevance:**

From a newcomer's perspective, seeing absurd amounts of XP could intimidate or discourage them from competing on the leaderboards, which kills their motivation. The leaderboards not only show a scholar's standing but also serve as motivation for some scholars. Apart from intimidating other scholars, those who exploit XP may not realize that they are wasting time, as they aren't learning anything useful from the exploit. The time spent on XP exploitation could be put to better use by taking courses and applying knowledge in projects. Successfully discouraging XP exploits will foster a more productive and competitive learning environment for the scholars.

  

**Tables and Columns Required:**

- `xp_fact`: `user_id`, `event`, `created_date`, `xp`

- `user_dim`: `user_id`, `first_name`, `last_name`, `email`

  

**Methodology:**

1. Perform a `LEFT JOIN` with `xp_fact` and `user_dim` on `user_id`

2. Group by `event`

3. Filter for data with `created_date` not more than 30 days old

4. Create `practice_xp_proportion` by calculating the proportion of 30-day total `xp` gained from practice `event`s to their 30-day total `xp`:

> $practice\_xp\_proportion=\frac{practice\_xp}{total\_xp}$

5. Flag suspiciously high `practice_xp_proportion` values

6. Train a `LogisticRegression` model to flag scholars that have been exploiting `xp`
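
A sketch of the proportion and a simple rule-based flag is shown below. The event label `"practice"`, the 0.8 cutoff, and the sample rows are assumptions made for illustration. A trained classifier such as the `LogisticRegression` model mentioned above would replace the fixed threshold once labeled examples are available.

```python
PRACTICE_EVENT = "practice"  # assumed label for Practice events in xp_fact
THRESHOLD = 0.8              # illustrative cutoff, not a validated value

def practice_xp_proportion(events):
    """events: iterable of (user_id, event, xp), already limited to 30 days.
    Returns {user_id: practice xp / total xp}."""
    practice, total = {}, {}
    for user, event, xp in events:
        total[user] = total.get(user, 0) + xp
        if event == PRACTICE_EVENT:
            practice[user] = practice.get(user, 0) + xp
    # Skip users with no XP to avoid dividing by zero
    return {u: practice.get(u, 0) / t for u, t in total.items() if t > 0}

# Hypothetical 30-day xp_fact rows
events = [
    ("u1", "practice", 900), ("u1", "course", 100),
    ("u2", "practice", 100), ("u2", "course", 400),
]
proportions = practice_xp_proportion(events)
flagged = sorted(u for u, p in proportions.items() if p > THRESHOLD)
```

Note that the aggregation is per scholar, so grouping by `event` alone is not enough; the totals need to be keyed by `user_id` as well.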

  

**Potential Insights and Actions:**

- Those who are flagged may first be reviewed and, if the exploitation is confirmed, excluded from the leaderboards. This maintains fair competition and motivates other scholars to compete on the leaderboards.

- Removing exploiters from the leaderboards removes their main incentive for exploiting, so they must compete fairly if they still want a spot on the leaderboards.

- Productive XP grinding is promoted if exploitation is stopped.

---

# Conclusion

In summary, the proposals included in this document contribute to the objectives of **Project DREAMS**. Proposals under the **Learning Analytics** category make use of feature engineering backed by sound mathematical concepts to provide a more grounded, secure, and accurate metric for monitoring scholar performance. **Engagement Trends** aims to provide insights into scholar study patterns and engagement rate to serve as a basis for workshop and intervention planning. The **Retention Prediction** category uses a machine learning model to study the behaviors and statistics of dropped scholars in order to identify scholars who are at risk of dropping and help prevent them from doing so. Lastly, the **Honesty & Fair Play** category addresses the XP exploitation issue: how it affects scholars, how a predictive model could help solve it, and how preventing XP exploitation would benefit not only the community but also those who exploit XP themselves. All categories and proposals contribute to the ultimate goal of **Project DREAMS**: improving the scholars' learning experience by improving scholar management.
