# IELTS Essay Regression

This notebook is our submission for the Satria Data BDC 2025 Internal Selection.

## Project Overview

### Problem Statement
This project focuses on developing a regression model to predict IELTS essay scores based on various textual features and linguistic characteristics. The goal is to create an automated scoring system that can accurately evaluate essay quality across the four key IELTS assessment criteria.

### Objectives
- Build robust regression models to predict IELTS essay scores for each assessment criterion
- Analyze key features that contribute to scoring in each criterion
- Implement feature engineering techniques for text data
- Evaluate model performance using appropriate metrics
- Provide insights into what makes a high-scoring IELTS essay for each criterion

### Dataset Information
- **Source**: IELTS essay dataset with scored essays
- **Features**: `QnA` (the question and the answer/essay text only)
- **Targets**: Four IELTS scoring criteria:
  - `task_achievement`: How well the essay addresses the task requirements
  - `coherence_and_cohesion`: Logical organization and flow of ideas
  - `lexical_resource`: Vocabulary range, accuracy, and appropriateness
  - `grammatical_range`: Grammar accuracy, complexity, and variety
- **Score Range**: Typically 0-10 for each criterion

### Methodology
1. **Data Preprocessing**
   - Text cleaning and normalization of QnA data

2. **Feature Engineering from QnA Text**
   - Text-based features (word count, sentence length, vocabulary diversity)
   - Task-specific features (topic relevance, argument structure)
   - Coherence features (transition words, paragraph structure)
   - Lexical features (vocabulary complexity, word frequency analysis)
   - Grammar features (sentence complexity, error patterns)
   - Qwen3-8B embeddings to capture semantic meaning and context

3. **Model Development**
   - Multi-target regression approach for the four criteria (CatBoost with a MultiRMSE-type loss)
     - CatBoost's `MultiRMSEWithMissingValues` loss function fits all four targets in a single model and accommodates missing target values
   - Model selection and hyperparameter tuning

4. **Evaluation**
   - Cross-validation for each target
     - Later in the competition, we relied on anchors from our previous submission, because cross-validation scores were too far from the actual leaderboard metric
   - MSE as the main performance metric
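
The text-based, lexical, and coherence features listed under Feature Engineering could be computed roughly as follows. This is a minimal sketch, not the notebook's actual feature code: the function name `extract_basic_features`, the `TRANSITION_WORDS` list, and the naive regex-based sentence splitting are all assumptions made for illustration.

```python
import re
from statistics import mean

# Illustrative transition-word list (an assumption, not taken from the notebook).
TRANSITION_WORDS = {
    "however", "therefore", "moreover", "furthermore",
    "consequently", "nevertheless",
}


def extract_basic_features(essay: str) -> dict:
    """Compute a few surface features from one essay (hypothetical helper)."""
    # Paragraphs are blocks of text separated by at least one blank line.
    paragraphs = [p for p in re.split(r"\n\s*\n", essay) if p.strip()]
    # Naive sentence split on '.', '!' or '?'; abbreviations will be over-split.
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    # Lowercased word tokens made of letters and apostrophes.
    words = re.findall(r"[a-z']+", essay.lower())
    sentence_lengths = [len(re.findall(r"[a-z']+", s.lower())) for s in sentences]

    return {
        "word_count": len(words),
        "sentence_count": len(sentences),
        "paragraph_count": len(paragraphs),
        "avg_sentence_length": mean(sentence_lengths) if sentence_lengths else 0.0,
        # Type-token ratio: unique words / total words, a simple diversity measure.
        "type_token_ratio": len(set(words)) / len(words) if words else 0.0,
        "transition_word_count": sum(w in TRANSITION_WORDS for w in words),
    }
```

Applying this function to each `QnA` entry would yield one feature row per essay, which could then be concatenated with the embedding vector before modelling.
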

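The model development and evaluation steps above might look roughly like the sketch below. It assumes missing scores are encoded as NaN in a four-column target matrix and that per-criterion MSE skips those missing entries. The helper names (`build_target_matrix`, `fit_multi_target`, `per_target_mse`) and every hyperparameter value except the loss function are placeholders, not the settings used in the notebook.

```python
import math

TARGETS = [
    "task_achievement",
    "coherence_and_cohesion",
    "lexical_resource",
    "grammatical_range",
]

# Only the loss function comes from the methodology; other values are placeholders.
CATBOOST_PARAMS = {
    "loss_function": "MultiRMSEWithMissingValues",
    "iterations": 500,
    "learning_rate": 0.05,
    "verbose": False,
}


def build_target_matrix(rows):
    """Turn a list of score dicts into rows of floats, using NaN for missing scores."""
    return [
        [float("nan") if row.get(t) is None else float(row[t]) for t in TARGETS]
        for row in rows
    ]


def fit_multi_target(features, target_matrix):
    """Fit one CatBoost model on all four criteria (hypothetical helper)."""
    # Imported here so the other helpers stay usable without catboost installed.
    from catboost import CatBoostRegressor

    model = CatBoostRegressor(**CATBOOST_PARAMS)
    model.fit(features, target_matrix)  # target_matrix: one column per criterion
    return model


def per_target_mse(y_true, y_pred):
    """MSE per criterion, ignoring rows whose true score is missing (NaN)."""
    scores = {}
    for j, name in enumerate(TARGETS):
        squared_errors = [
            (t[j] - p[j]) ** 2
            for t, p in zip(y_true, y_pred)
            if not math.isnan(t[j])
        ]
        scores[name] = (
            sum(squared_errors) / len(squared_errors) if squared_errors else float("nan")
        )
    return scores
```

Inside a cross-validation loop, `fit_multi_target` would be called on the training fold and `per_target_mse` on the held-out fold's predictions, giving one MSE per criterion per fold.
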
### Expected Outcomes
- Accurate prediction models for each IELTS scoring criterion
- Identification of key factors affecting each assessment area
- Insights for targeted essay improvement recommendations
- Comprehensive analysis of inter-criterion relationships
- Visualization of results and feature importance per criterion

### Team Information
**Satria Data BDC 2025 Internal Selection Submission**

---

*This overview will be updated as the project progresses with specific results and findings.*