# Modeling and Predicting Student Dropout in Yiya Air Science Courses

In this project, our first aim is to identify the points in Yiya Air Science courses 2 and 3 where students drop out. There are different ways to define and measure dropout, and they vary in extent. For example, dropout can be defined as:
1. **Usage Gap**, where the time since the student's last response exceeds a certain number of days (e.g., more than 7 days since the last response). This can be seen as students being "on track" or "behind".
2. **Total Response Stop**, where a student has sent their last response to the system before completing all tasks.
3. **Total Question Stop**, where a student stops submitting graded problems in the course before completing all tasks [(Borrella, et al., 2019)](https://mitili.mit.edu/sites/default/files/project-documents/a24-Borrella_Caballero_Ponce_2019.pdf).
4. **Total Completion Stop**, where a student stops completing tasks (lesson modules and tests).
5. **Opt Out**, where a student voluntarily unenrolls from the course without achieving a certificate [(Tinto, 1975)](https://www-tandfonline-com.cmu.idm.oclc.org/doi/full/10.1080/01587919.2017.1369006#).
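
As an illustration of the Usage Gap definition, here is a minimal sketch that labels each student "on track" or "behind". It assumes a response-like dataframe with `user_id` and `created` columns; the function name `flag_usage_gap`, the `as_of` reference date, and the toy data are hypothetical and only serve the example.

```python
import pandas as pd

def flag_usage_gap(responses, as_of, gap_days=7):
    """Label each user "behind" if their last response is more than gap_days before as_of."""
    # Most recent response timestamp per user
    last_seen = responses.groupby("user_id")["created"].max()
    # Whole days elapsed between the last response and the reference date
    days_since = (as_of - last_seen).dt.days
    return days_since.gt(gap_days).map({True: "behind", False: "on track"})

# Toy data (hypothetical): user 1 responded 5 days before as_of, user 2 responded 20 days before
toy = pd.DataFrame({
    "user_id": [1, 1, 2],
    "created": pd.to_datetime(["2022-10-01", "2022-10-20", "2022-10-05"]),
})
status = flag_usage_gap(toy, as_of=pd.Timestamp("2022-10-25"))
print(status)
```

The 7-day threshold is only the example value from the definition above; the right cutoff would need to be chosen with the course schedule in mind.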

Our second aim is to measure and visualize the rate of each of these dropout metrics among students over time and at each course step. We can do so by dividing the total occurrences by the number of students (count of students meeting the dropout criteria / total number of students).
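
The ratio above can be sketched per course step as follows. This assumes we already have, for each student, the index of the last step they completed; the series `last_step`, the parameter `final_step`, and the function name are hypothetical and not part of the Yiya data.

```python
import pandas as pd

def dropout_rate_by_step(last_step, final_step):
    """Share of all students whose last completed step is each step before the final one."""
    # Count students stopping at each step, including steps where nobody stopped
    counts = last_step.value_counts().reindex(range(final_step + 1), fill_value=0)
    rates = counts / len(last_step)
    # Students who reached the final step completed the course, so exclude that step
    return rates.iloc[:-1]

# Toy data (hypothetical): five students, steps numbered 0 to 3
last_step = pd.Series([0, 0, 1, 3, 3])
rates = dropout_rate_by_step(last_step, final_step=3)
print(rates)
```

The same pattern applies to any of the dropout definitions, as long as each student can be assigned the step at which the criterion was first met.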

Our third aim is to identify features (variables) that are predictive of dropout using a classification model. Prior research suggests a need to refine success and dropout metrics [(Kohler, 2019)](https://www-tandfonline-com.cmu.idm.oclc.org/doi/full/10.1080/01587919.2017.1369006). It is uncertain whether all of the following features are available in our data. However, we can try features grounded in previous theory, empirical results, and our best judgment about what makes sense for the learning environment and the student population. Features that we may be able to use in our model and evaluate for predictive power include:
- Student's date of first submission of a problem [(Liyanagunawardena, 2014)](https://centaur.reading.ac.uk/36002/)
- Student Intentions [(Liyanagunawardena, 2014)](https://centaur.reading.ac.uk/36002/)
    - Unsure
    - Browse Course
    - Audit
    - Complete
- Participation type [(Kohler, 2013)](https://er.educause.edu/articles/2013/6/retention-and-intention-in-massive-open-online-courses-in-depth)
    - Browser
    - Passive Participant 
    - Active Participant
    - Community Contributor
We can try to align participant types with [Yiya User Definitions](https://docs.google.com/document/d/1KNgPFdRSBniQouaKIQQPuI-DlAL-2xs70t4rpps5Qq4/edit)
- Completion Behavior
    - Number of lesson modules (tasks) completed
    - Number of lesson questions completed
    - Number of questions answered correctly
    - Number of tests they've completed 
    - Whether they are a returning learner - There should be a record for each registration, and there is a "previously registered" key in the response table as well. The data team will have to get back to us on how best to query the data to identify returning users accurately.
- Time on Task metric
    - Minutes spent on a task / Average time of session - We can use the telecom session table and its `created` and `duration` fields. Telecom issues can kick users out before they complete a script, so we'll have to account for this. We may have to compare with the task table.
- Review Behaviors 
    - Number of Times Revising Previous Lesson
- Effect of payment
    - Payment Status (Scholarship or Self Paid with mobile money) 
    - Payment survey question - Required question in the response table. Shariffa will share the key.
    
Note about "effect of payment": The date at which non-paid users lost access to course 3 was [*To be confirmed by the Yiya team; the date was potentially Nov 4th*].

Open question: do some required questions have "required" in their key?
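
A few of the completion-behavior features above could be derived from the response table along these lines. This is a sketch under assumptions: it uses the `user_id`, `created`, `key`, and `correct` columns that appear in the response table, and it assumes that each distinct `key` is one question and that `correct == 1` marks a correct answer. The function name and toy data are hypothetical.

```python
import pandas as pd

def completion_features(responses):
    """Per-user counts of distinct questions answered and answered correctly, plus first activity."""
    # Distinct keys per user, so repeated attempts at one question count once
    answered = responses.groupby("user_id")["key"].nunique()
    # Distinct keys with at least one correct attempt; users with none get 0
    correct = (
        responses[responses["correct"] == 1]
        .groupby("user_id")["key"].nunique()
        .reindex(answered.index, fill_value=0)
    )
    first_response = responses.groupby("user_id")["created"].min()
    return pd.DataFrame({
        "questions_answered": answered,
        "questions_correct": correct,
        "first_response": first_response,
    })

# Toy data (hypothetical): user 1 retries q1 and gets it right, user 2 answers once incorrectly
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "created": pd.to_datetime(["2022-10-01", "2022-10-01", "2022-10-02", "2022-10-03"]),
    "key": ["lesson#q1", "lesson#q1", "lesson#q2", "lesson#q1"],
    "correct": [0, 1, 0, 0],
})
features = completion_features(toy)
print(features)
```

Survey and menu responses also live in the response table, so in practice the rows would need to be filtered to lesson and test questions first.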

# Exploring the Database

Yiya has a large database containing course content, user information, and interactions. The data spans multiple courses. The Entity Relationship Diagram below shows the relationships between the tables and the fields each table contains. We have the following tables:
- `content` - which describes the text or audio content users see or hear on their phones
- `outbound` - which tracks cellular data being sent from the system to users
- `inbound` - which tracks cellular data being sent to the system from users
- `response` - which describes the values users select from menus, surveys, lessons, tests, and other activities
- `user` - which describes the status of each user
- `task` - which records completions of tasks, i.e., completion of surveys, lessons, tests, etc.
- `registration` - which records each user registration for a course
- `course` - which records each instance of a course run
- `channel`- **TECH TEAM INPUT NEEDED:** *the purpose of this table is unclear to the author of this notebook at time of writing*
- `telecommession` - which describes each session started on USSD by users.

❗️ Additional notes are needed on conditions under which each record is generated. 

**Recommendations for Future Tables and Fields**
- Survey table - Adding a table for surveys would help with the duplicate response problem in the response table. I recommend logging survey responses after the last survey question has been completed to ensure accurate survey data.
- Task Table Fields
    - time_taken - record the time between a user starting a task and completing it, logged once the task has been completed
    - score - record the score for lessons or tests after they have been completed. Non-scored tasks can be left empty.

Now that we have had an overview of the tables, we can quickly look at each table to identify data that can be used to model student dropout behaviors.

![Image of ERD][1]

[1]: /Users/ddbutler/repos_new/yiya_data_analysis/Documentation/ERD_2022-10-25.png

## Content Table
First, we'll examine the content table.

In [29]:
import pandas as pd

pd.set_option('display.max_columns', 1000) #show columns in scrollable table
pd.set_option('display.max_rows', 500)
pd.set_option("max_colwidth", None) #don't truncate data in columns.

#read file from data folder, return dataframe
def read_data(file_name, folder="/Users/ddbutler/Desktop/Repos/Yiya-Solutions-Analysis/yiya-completion-analysis/course3_data_v2_pickle/"):
    #combine folder and file name to get the full path
    df = pd.read_pickle(folder + file_name)
    return df

content_df = read_data(file_name="content.pkl")
content_df.sample(3, random_state=5) #See sample of data

id,created,updated,script,section,version,kind,content,correct_value
22962,2022-09-29 18:42:11,2022-09-29 18:42:11,content/overview-yiya-airscience-course-2021-overview,.,zE2zM35idcz-xF38YMsAWQ,text,"Intro: What is STEM?\n1: Identify the problem\n2: Investigate\n3: Brainstorm\n4: Plan\n5: Create\n6: Test\n7: Improve\n8: Launch\n0. moto sandra, Continue...",
38423,2022-10-18 10:38:10,2022-10-18 10:38:10,content/airscience-2022a/course/step-3-brainstorm/step-3-lesson-1/go-to-questions,q1-which-technology-are-we-going-to-create-in-this-course,p1i5dVVWfi1ZAPsvqpZDYg,text,That's not it OLIK JOSEPH... Try again.\n1/4: Which technology are we going to create in this course?\n1. Pedal powered washing machine\n2. An electric washing machine,Pedal powered washing machine
30187,2022-10-07 10:27:15,2022-10-07 10:27:15,content/airscience-2022a/course/step-1-identify/step-1-lesson-4/go-to-questions,what-is-6-out-of-10-as-a-percentage,cBR0we9JPtr2JmCRlp2ELw,text,Thank you P for answering lesson questions!,


Based on this sample of the data we'll assume the following descriptions of the most relevant fields: 
- `created`: The created field is most likely the time content was served to the user
- `script`: The script field seems to be the activity being presented to a user. *TODO❗️* There's a pattern that needs to be described, e.g.,
    - `content/airscience-2022a/course/step-3-brainstorm/step-3-lesson-1/go-to-questions`
    - `content/airscience-2022a/course/step-1-identify/step-1-lesson-4/go-to-questions`
    - `content/overview-yiya-airscience-course-2021-overview`
- `section`: The section field seems to describe the specific question or action within a script activity. 
- `version`: **Tech Team Input Needed**
- `kind`: The kind field suggests there are different types of content. *TODO❗️* We will look at the types of content later.
- `content`: The content field seems to be what the user actually sees or hears. 
- `correct_value`: The `correct_value` field seems to mark the correct answer for lesson and test questions.
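
Until the `script` pattern is confirmed, one way to explore it is to split the path into labelled parts. The sketch below is a guess based only on the three sample paths above: it assumes course scripts look like `content/<course>/course/<step>/<lesson>/<activity>` and treats everything else as a single activity. The function name and the part labels are hypothetical.

```python
def parse_script(script):
    """Split a script path into labelled parts; the layout is guessed from sample paths."""
    parts = script.split("/")
    # Assumed layout: content/<course>/course/<step>/<lesson>/<activity>
    if len(parts) == 6 and parts[2] == "course":
        return {
            "course": parts[1],
            "step": parts[3],
            "lesson": parts[4],
            "activity": parts[5],
        }
    # Anything else (menus, overviews, schedules) keeps only its last segment
    return {"course": None, "step": None, "lesson": None, "activity": parts[-1]}

lesson = parse_script(
    "content/airscience-2022a/course/step-3-brainstorm/step-3-lesson-1/go-to-questions"
)
overview = parse_script("content/overview-yiya-airscience-course-2021-overview")
print(lesson)
print(overview)
```

Applying this to the unique values of `script` would show how many paths fall outside the assumed layout, which is a quick way to refine the description of the pattern.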

Now that we have an understanding of the field descriptions, we can look at the contents of the content data table in a bit more detail.


In [38]:
#content_df.describe(datetime_is_numeric=True, include="all")
#Input: a dataframe
#Output: dataframe describing data
#Description: generates summary of information about dataframe
def explore_data(df):
    #Set display options for easier viewing
    pd.set_option('display.max_columns', 1000) #show columns in scrollable table
    pd.set_option('display.max_rows', 500)
    pd.set_option("max_colwidth", None) #don't truncate data in columns.

    #See information on fields
    print("Information on Fields")
    print("----------------------")
    df.info()

    #See info on missing data
    print("\nPercentage of Missing Values")
    print("----------------------")
    print((df.isna().sum() / df.shape[0]) * 100)

    return df.describe(datetime_is_numeric=True, include="all")

#Check kinds of content
print("Kinds of Content: ",content_df["kind"].unique())
print("---")
#Print data summary
explore_data(content_df)

Kinds of Content:  ['text' 'audio']
---
Information on Fields
----------------------
<class 'pandas.core.frame.DataFrame'>
Int64Index: 78532 entries, 1 to 78562
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   created        78532 non-null  datetime64[ns]
 1   updated        78532 non-null  datetime64[ns]
 2   script         78531 non-null  object        
 3   section        78531 non-null  object        
 4   version        78461 non-null  object        
 5   kind           78532 non-null  object        
 6   content        78532 non-null  object        
 7   correct_value  73517 non-null  object        
dtypes: datetime64[ns](2), object(6)
memory usage: 5.4+ MB

Percentage of Missing Values
----------------------
created          0.000000
updated          0.000000
script           0.001273
section          0.001273
version          0.090409
kind             0.000000
content          0.000000
cor

statistic,created,updated,script,section,version,kind,content,correct_value
count,78532,78532,78531,78531,78461,78532,78532,73517.0
unique,,,326,474,1067,2,62808,104.0
top,,,content/course-menu,.,cBR0we9JPtr2JmCRlp2ELw,text,Welcome to Yiya AirScience!\n1. On the Radio today\n2. Course Overview\n3. Revise previous content\n4. What to do next\n5. On the Radio today\n6. Course Overview\n7. What to do next Previous Lessons\n8. On the Radio today\n9. Course Overview\n10. Revise previous c,
freq,,,6091,31981,6007,78398,632,52070.0
mean,2022-10-08 08:24:13.047101696,2022-10-08 08:24:13.047114496,,,,,,
min,2021-06-17 19:10:00,2021-06-17 19:10:00,,,,,,
25%,2022-09-26 14:11:26.750000128,2022-09-26 14:11:26.750000128,,,,,,
50%,2022-10-19 10:09:05.500000,2022-10-19 10:09:05.500000,,,,,,
75%,2022-11-12 21:29:16.249999872,2022-11-12 21:29:16.249999872,,,,,,
max,2022-12-28 13:20:07,2022-12-28 13:20:07,,,,,,


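Looking at the description of the data above, there are 78,532 instances of users receiving content. The data is fairly complete. The only column with a notable share of missing values is `correct_value`: about 6% of its rows are null (73,517 of 78,532 are non-null), which suggests that at least that share of content served is not a lesson or test question. The most frequent `correct_value` in the summary also appears to be blank (52,070 rows), so empty strings may mark non-question content too and the true share may be higher. The dates range from `2021-06-17` to `2022-12-28`, with a median of `2022-10-19`.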
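
Because the summary above shows a blank `top` value for `correct_value`, it may be worth counting empty strings alongside nulls before interpreting the missing-value percentage. The helper below is a minimal sketch; the function name and the toy series are hypothetical.

```python
import pandas as pd

def missing_or_blank_share(column):
    """Fraction of values that are null or contain only whitespace."""
    # Treat nulls as empty strings, then strip whitespace and compare with ""
    as_text = column.fillna("").astype(str).str.strip()
    return as_text.eq("").mean()

# Toy data (hypothetical): one real value, one empty string, one null, one whitespace-only string
toy = pd.Series(["Solids", "", None, " "])
share = missing_or_blank_share(toy)
print(share)
```

Running `missing_or_blank_share(content_df["correct_value"])` would then give a combined estimate of how much content has no correct answer attached.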

There are 326 unique scripts (activities) across the first through the most recent course. We have 2 kinds of content: `text` and `audio`. The most frequent script is `content/course-menu` (6,091 rows), the course menu, which serves as the entry point for users.

### Questions
Understanding the content table allows us to ask some questions about dropout.
- Do students who persist listen to audio content at a higher rate?
- What content is communicated via audio?

In [46]:
audio_content = content_df.query("kind == 'audio'")
print("\nAudio Content Summary")
print("Unique Scripts", audio_content["script"].unique())
print("\nUnique Sections", audio_content["section"].unique())
audio_content.sample(3, random_state=5)



Audio Content Summary
Unique Scripts ['content/for-testers' 'content/for-registered' 'content/active_users'
 'content/airscience-2022a/schedules/for-testers'
 'content/airscience-2022b/schedules/for-testers'
 'content/airscience-2022a/schedules/for-registered'
 'content/airscience-2022b/schedules/for-registered']

Unique Sections ['Friday' 'Friday, June 25' 'Monday, June 28' 'Wednesday, June 30'
 'July 1' 'Wednesday July 07, 2021' 'Thursday July 08, 2021'
 'Friday July 09, 2021' 'Sunday July 11, 2021' 'Monday July 12, 2021'
 'Monday July 13, 2021' 'Sunday July 18, 2021' 'Sunday July 18'
 'Saturday July 24, 2021' 'Sunday July 25, 2021' 'Sunday July 25'
 'Thursday July 29' 'Sunday August 1, 2021' 'Monday August 09, 2021'
 'Tuesday August 10, 2021' 'Tuesday August 10' 'Sunday August 15,2021'
 'Friday  August 20, 2021' 'Friday August 27, 2021'
 'Friday September 17, 2021' 'Sunday, September 19'
 'Friday October 1, 2021' 'Sunday, October 3' 'Monday, October 4'
 'Tuesday, October 5' 'Thursd

id,created,updated,script,section,version,kind,content,correct_value
1141,2021-07-24 07:00:01,2021-07-24 07:00:01,content/for-testers,"Saturday July 24, 2021",o-YlVlmqLRwtXmEmthIjIg,audio,../.gitbook/assets/step_2_robocall.mp3,
1483,2021-08-10 13:00:07,2021-08-10 13:00:07,content/for-registered,Tuesday August 10,,audio,../.gitbook/assets/step3_brainstorm_robocall.mp3,
47071,2022-10-30 13:00:04,2022-10-30 13:00:04,content/airscience-2022a/schedules/for-registered,"Sunday, October 30, 2022",cO-o8BWsu1Q-5xoPrDfeFQ,audio,../../../.gitbook/assets/Course_3_Step_5_Robocall.mp3,


Looking at the sample, the `created` dates tell us that audio content is used for multiple courses, and the sections are just labels for the days the content was sent out.

**Do students who persist listen to audio content at a higher rate?** - It looks like the content table can't answer this question because there's no easy link to `user_id`. However, the channel table does have `user_id` and `kind`, so we'll mine that data from there. The content field for audio rows does, however, contain file names that suggest which part of the course the audio is for.

**What content is communicated via audio?** - These seem to be mostly robocalls. Based on the curriculum documentation, robocalls are previews of what is coming up in the course. TODO❗️: Questions for the Yiya team: What is the purpose of robocalls? When are they sent out relative to broadcasts? Can users request robocalls at any time?

In [51]:
response_df = read_data("response.pkl") #this cell loads the response table
response_df.sample(5)

id,created,updated,user_id,source_id,key,value,correct
692552,2021-09-03 20:02:44,2021-09-03 20:02:44,39163,1461514,the-create-step-survey#did-you-create-your-first-prototype-of-solar-cells,Yes,0
1284153,2022-09-30 11:02:23,2022-09-30 11:02:23,62626,2552778,airscience-2022a/course/intro-step/intro-step-lesson-3/go-to-questions#what-is-a-prototype,The final technology that someone makes,0
910968,2021-10-17 11:15:40,2021-10-17 11:15:40,48274,1897309,course-spring-2021/intro-step/intro-step-lesson-2/take-quiz#q2,Solids,0
1021520,2021-11-03 17:01:40,2021-11-03 17:01:40,47997,2073461,yiya-airscience-tests-2021/take-step-2-test#q12,Hibiscus juice because it has anthocyanin,0
971524,2021-10-28 05:57:21,2021-10-28 05:57:21,43940,1993392,course-spring-2021/assessment-questions/go-to-exam-questions#q7,"For measuring electrical current, voltage etc",1
