Creating an English proficiency test using a decision tree

Overview

This project attempts to build a functional English proficiency test from a sparse dataset of user interactions with multiple-choice questions. It began by exploring and refining the available data, moved on to showing that a machine learning model can predict proficiency from that data, and culminated (for the purposes of this repository) in the deployment of a Web application.

This repository is a capstone project deliverable for the Udacity Data Science Nanodegree. All work was done by Evan Quinlan during enrollment in the nanodegree program, including data analysis, data pipeline construction, and Web application development and deployment. Work was done while employed by Transparent Language, the proprietor of the data, and is shared publicly and submitted to Udacity with Transparent Language's permission. Please see the License section for more information.

Problem

While I was enrolled in the nanodegree program, a unique data science problem fell into my lap that fit the requirements for a capstone project.

The CEO of Transparent Language, where I work as Director of Product, wanted to know whether an English proficiency test could be created from data generated by the company's English language game, Which Is English?. The game challenges players to answer binary questions about the English language and then gives them a score, which may be posted on a leaderboard. A good score, however, doesn't necessarily reflect true English knowledge, since players may have memorized answers or sped through the game to increase their points. The CEO did not want to devote "official" company resources to this endeavor, so he allowed me to pursue it personally and in an "unofficial" capacity, agreeing to let me share the work as long as I respected proprietary company information.

The question is, can data generated by the game form the basis of an English proficiency test? And moreover, can this be done more efficiently than traditional computerized adaptive testing techniques would allow?

What "proficiency test" means to some may differ wildly from what it means to others, but in its broadest sense, it is a series of questions (preferably adaptive) which, after being answered (or skipped, as allowed), indicate the degree of some relative, latent ability in the test-taker. The test need only measure some kind of English language ability, so the answer does not depend on testing all four major language skills, namely reading, writing, listening, and speaking.

Solution

To solve this problem, I did a little innovating and built a decision tree that can be dissected and treated as an adaptive English proficiency test. The steps I took were:

  1. Explore available data from a MySQL database
  2. Extract and explore relevant data
  3. Fill in the sparse data with Funk SVD, creating a matrix of predictions of how well each user would answer each item (a minimal sketch of this step follows the list)
  4. Use predictive data to create target variables representing user English proficiency scores
  5. Train and test a decision tree
  6. Use the decision tree structure to build an adaptive test
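
To make step 3 concrete, here is a minimal sketch of Funk SVD with user and item biases, trained by stochastic gradient descent. All names and hyperparameters here are illustrative assumptions rather than the values used in the actual pipeline; the Methodology notebook is the authoritative reference.

```python
import numpy as np

def funk_svd(ratings, k=10, lr=0.005, reg=0.02, epochs=50):
    """Fill in a sparse user-item matrix via Funk SVD with biases.

    ratings: 2-D array with np.nan wherever a user never saw an item.
    Returns (full prediction matrix, learned item biases).
    """
    n_users, n_items = ratings.shape
    mu = np.nanmean(ratings)                       # global mean
    bu = np.zeros(n_users)                         # user biases
    bi = np.zeros(n_items)                         # item biases
    rng = np.random.default_rng(0)
    P = rng.normal(scale=0.1, size=(n_users, k))   # user latent factors
    Q = rng.normal(scale=0.1, size=(n_items, k))   # item latent factors
    observed = np.argwhere(~np.isnan(ratings))     # (user, item) pairs seen

    for _ in range(epochs):
        rng.shuffle(observed)
        for u, i in observed:
            err = ratings[u, i] - (mu + bu[u] + bi[i] + P[u] @ Q[i])
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            P[u], Q[i] = (P[u] + lr * (err * Q[i] - reg * P[u]),
                          Q[i] + lr * (err * P[u] - reg * Q[i]))
    return mu + bu[:, None] + bi[None, :] + P @ Q.T, bi
```

The returned item biases matter later: the Reflection section discusses using them as a stand-in for item difficulty.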

The idea is this: I present each node of the decision tree as a test question. Upon receiving an answer, I follow the tree's internal logic to choose the next question, as sketched below. Just as in a traditional adaptive test, questions with the most information (i.e., the best ability to discriminate users by their latent ability) are presented first.
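
As a sketch of that traversal, assuming the tree is a fitted scikit-learn DecisionTreeRegressor and using a hypothetical ask_user callback in place of the real front end:

```python
from sklearn.tree import DecisionTreeRegressor

def administer_test(tree: DecisionTreeRegressor, ask_user) -> float:
    """Walk a fitted tree, asking one question per internal node.

    ask_user(feature_id) is a hypothetical callback: it presents the
    question (item bucket) identified by feature_id and returns the
    test-taker's performance on it, on the scale the tree was trained on.
    """
    t = tree.tree_
    node = 0
    while t.children_left[node] != t.children_right[node]:  # not a leaf
        answer = ask_user(t.feature[node])       # the node's split feature
        if answer <= t.threshold[node]:          # follow the tree's logic
            node = t.children_left[node]
        else:
            node = t.children_right[node]
    return float(t.value[node][0][0])            # leaf prediction = score
```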

Test generation

Libraries

Although it went unused in this project, if I were to fall back to traditional computerized adaptive testing techniques, I'd use the excellent catsim library.

Code is written in Python 3.7.

Metrics

I use a decision tree regressor to solve my problem, evaluated with these metrics:

  1. The coefficient of determination (R²) as a general accuracy indicator for the regressor
  2. Mean error: I want to generate a test that is not just statistically accurate in aggregate, but also likely to be accurate for each individual test-taker, so a low mean error is critical.
  3. Max error: I need to know the worst-case scenario for a test-taker.
  4. Tree depth: I want to generate a test of reasonable length, which correlates directly with tree depth.

More information on why these metrics were chosen can be found in the Methodology notebook.
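
As an illustration of how these four numbers can be read off a fitted model with scikit-learn; the toy data below is only a stand-in for the real features and targets:

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, max_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Toy stand-in for the real pipeline: predicted per-item performance as
# features, a proficiency score on the [20, 120] scale as the target.
rng = np.random.default_rng(0)
X = rng.random((500, 40))
y = 20 + 100 * X.mean(axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
tree = DecisionTreeRegressor(max_depth=8, random_state=0).fit(X_train, y_train)
y_pred = tree.predict(X_test)

print("R^2:       ", r2_score(y_test, y_pred))              # overall fit
print("Mean error:", mean_absolute_error(y_test, y_pred))   # typical test-taker
print("Max error: ", max_error(y_test, y_pred))             # worst case
print("Tree depth:", tree.get_depth())                      # proxy for test length
```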

Results

Reflection

The approach I took was a little unorthodox. There's a lot of expertise on the subject of computerized adaptive testing, and using a decision tree as an adaptive test is an "out-there" approach that might offend the sensibilities of some. However, what I've created seems to have at least some face validity. I won't know what it really measures until it's compared with some other, more established English proficiency testing scale. In the meantime, however, the scale of [20, 120] seems to be useful. If the approach turns out not to be viable, I can always fall back to implementing more traditional approaches using item response theory.

Solving this problem has been a series of ups and downs. Machine learning tools are abstract enough that using something like Funk SVD to create an adaptive English proficiency test is possible, but there are no handbooks for applying those tools to that purpose. The most difficult part of this project was therefore imagining, from scratch, how the various methods and techniques with which I was familiar could be applied to a domain I had never seen them used in (although, just because I haven't seen it...). For instance, the method of using the item bias generated by SVD as a stand-in for item difficulty, which would normally be determined scientifically or by panels of experts, needed a lot of validation before I was convinced it wasn't an insane approach to take (a small illustration follows).
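
For illustration, the difficulty ranking described above amounts to nothing more than sorting items by their learned bias; the bias values below are hypothetical, and treating lower bias as higher difficulty is exactly the assumption that needed validating.

```python
import numpy as np

# bi: the item-bias vector learned by the funk_svd sketch above, one entry
# per item. A lower bias means users tended to do worse on the item.
bi = np.array([0.31, -0.42, 0.05, -0.88, 0.12])    # toy biases
difficulty_order = np.argsort(bi)                   # hardest items first
print(difficulty_order)                             # -> [3 1 2 4 0]
```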

Improvement

Decision tree regressors have the drawback that they cannot be truly adaptive: once you're on a branch, you're stuck between the minimum and maximum values offered by that branch's leaves, as illustrated below. However, if the inputs to the tree are valid enough, that might not be a show-stopping issue.
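
A small sketch makes the constraint concrete: for any node of a fitted scikit-learn DecisionTreeRegressor, the final score is pinned inside the range of leaf values reachable from that node.

```python
from sklearn.tree import DecisionTreeRegressor

def score_range(tree: DecisionTreeRegressor, node: int = 0):
    """Return the (min, max) leaf prediction reachable from `node`.

    Once a test-taker has reached `node`, no sequence of answers can move
    their final score outside this range.
    """
    t = tree.tree_
    if t.children_left[node] == t.children_right[node]:     # leaf
        v = float(t.value[node][0][0])
        return v, v
    lo_l, hi_l = score_range(tree, t.children_left[node])
    lo_r, hi_r = score_range(tree, t.children_right[node])
    return min(lo_l, lo_r), max(hi_l, hi_r)
```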

Some things I'd like to try going forward:

  • Train separate models for native and non-native English test-takers, since they may exhibit very different properties.
  • Choose an optimal bucket size for delivering clusters of items as test challenges.
  • Train the model on actual, possible answers based on bucket size (for instance, for a bucket size of seven, round all training inputs to the nearest seventh).
  • Try creating buckets using truncated SVD (see the sketch after this list). Bucketing items based on their difficulty is logically sound, but some items have specific properties that cause certain test-takers to respond to them differently than others do. What if we should really be grouping items by the grammatical principles they test? Those are the kinds of latent factors that truncated SVD might be able to discover.
  • Include first encounters between Which Is English? users and items that ended in timeouts. I left them out of my data, but they have value as wrong answers, and without them my training data is essentially incomplete.
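
A rough sketch of the truncated SVD bucketing idea, under toy assumptions: the real input would be the items-by-users prediction matrix, and every hyperparameter here is a guess.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD

# Toy stand-in: an (items x users) matrix of predicted performance,
# e.g. the Funk SVD output transposed.
responses = np.random.default_rng(0).random((300, 1000))

item_factors = TruncatedSVD(n_components=8, random_state=0).fit_transform(responses)
buckets = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(item_factors)
# buckets[i] groups item i with items sharing latent properties, which
# might correspond to the grammatical principles they test.
```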

Web application

Overview

A Web application implements the "state-of-the-art" version of the adaptive test arrived at by the end of the Methodology notebook.

The application administers a test generated by the decision tree and then presents the user with their score. The test was built from the nodes of the decision tree, each of which represents a bucket of seven Which Is English? items.

Libraries

Implementation Notes

Timeouts are considered incorrect.

Test results are logged in a back-end database, but no user data is collected. Users are authenticated anonymously using Firebase.

Link

My Language Number (Prototype), hosted by Firebase

Files

Exploration and Methodology

There are two primary notebooks: Exploration and Methodology.

The data folder contains queries, along with data and samples of data queried from a MySQL database.

NOTE: first_encounters.csv must be extracted from first_encounters.zip and placed in the /data directory to be properly referenced by the notebooks.

Web Application

The mlna-web directory contains the Angular application that deploys the test.

The reformatted tree nodes and question data can be found in mlna-web/src/assets as node_data.json and item_data.json, respectively.

Acknowledgements

Data for this project was provided by Transparent Language. Which Is English? items were authored asynchronously by multiple authors employed by Transparent Language. Special thanks to Michael Quinlan, President and CEO, for giving me the opportunity to pretend to be a real-life data scientist for a bit.

License

Default copyright laws apply, and no one may reproduce, distribute, or create derivative works without permission.
