# Vietnamese Students' Feedback Corpus Download

This notebook downloads the Vietnamese Students' Feedback Corpus (UIT-VSFC) and saves it as local CSV files, one per split.

The dataset consists of over 16,000 sentences, each human-annotated for two tasks:
- Sentiment-based classification (negative, neutral, positive)
- Topic-based classification (lecturer, training_program, facility, others)

In [10]:
import os
import time
from pathlib import Path

import pandas as pd
import requests

print("Required libraries imported successfully!")

Required libraries imported successfully!


## Define Dataset URLs

The dataset is split into train, validation, and test sets. Each split contains three line-aligned files, so line *i* of each file describes the same feedback sentence:
- sentences: The actual feedback text
- sentiments: Sentiment labels (0=negative, 1=neutral, 2=positive)
- topics: Topic labels (0=lecturer, 1=training_program, 2=facility, 3=others)
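
As a minimal sketch of how these integer codes map to label names, the lookup below decodes a code pair. The names `SENTIMENT_NAMES`, `TOPIC_NAMES`, and `decode` are illustrative and not part of the dataset or this notebook.

```python
# Illustrative lookup tables built from the label codes listed above.
SENTIMENT_NAMES = {0: "negative", 1: "neutral", 2: "positive"}
TOPIC_NAMES = {0: "lecturer", 1: "training_program", 2: "facility", 3: "others"}


def decode(sentiment: int, topic: int) -> tuple[str, str]:
    """Translate a (sentiment, topic) code pair into readable label names."""
    return SENTIMENT_NAMES[sentiment], TOPIC_NAMES[topic]


print(decode(2, 0))  # ('positive', 'lecturer')
```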

In [11]:
# Define the URLs for downloading the dataset
URLS = {
    "train": {
        "sentences": "https://drive.google.com/uc?id=1nzak5OkrheRV1ltOGCXkT671bmjODLhP&export=download",
        "sentiments": "https://drive.google.com/uc?id=1ye-gOZIBqXdKOoi_YxvpT6FeRNmViPPv&export=download",
        "topics": "https://drive.google.com/uc?id=14MuDtwMnNOcr4z_8KdpxprjbwaQ7lJ_C&export=download",
    },
    "validation": {
        "sentences": "https://drive.google.com/uc?id=1sMJSR3oRfPc3fe1gK-V3W5F24tov_517&export=download",
        "sentiments": "https://drive.google.com/uc?id=1GiY1AOp41dLXIIkgES4422AuDwmbUseL&export=download",
        "topics": "https://drive.google.com/uc?id=1DwLgDEaFWQe8mOd7EpF-xqMEbDLfdT-W&export=download",
    },
    "test": {
        "sentences": "https://drive.google.com/uc?id=1aNMOeZZbNwSRkjyCWAGtNCMa3YrshR-n&export=download",
        "sentiments": "https://drive.google.com/uc?id=1vkQS5gI0is4ACU58-AbWusnemw7KZNfO&export=download",
        "topics": "https://drive.google.com/uc?id=1_ArMpDguVsbUGl-xSMkTF_p5KpZrmpSB&export=download",
    },
}

print("Dataset URLs defined successfully!")
print(f"Total splits: {len(URLS)}")
print(f"Files per split: {len(URLS['train'])}")

Dataset URLs defined successfully!
Total splits: 3
Files per split: 3
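
Each URL above carries its Google Drive file ID in the `id` query parameter. The sketch below shows one way to pull that ID out with the standard library, which can be handy for logging or for tools that accept a bare ID. The helper name `drive_file_id` is hypothetical.

```python
from urllib.parse import parse_qs, urlparse


def drive_file_id(url: str) -> str:
    """Return the value of the `id` query parameter from a Drive download URL."""
    query = parse_qs(urlparse(url).query)
    return query["id"][0]


url = "https://drive.google.com/uc?id=1nzak5OkrheRV1ltOGCXkT671bmjODLhP&export=download"
print(drive_file_id(url))  # 1nzak5OkrheRV1ltOGCXkT671bmjODLhP
```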


## Create Download Directory

Create a local directory to store the downloaded files.

In [12]:
# Create download directory
download_dir = Path("vietnamese_feedback_csv")
download_dir.mkdir(exist_ok=True)

print(f"Download directory created: {download_dir.absolute()}")

Download directory created: /Users/ducqhle/Documents/AI_Thinking_DoAn/DoAn/vietnamese_feedback_csv


## Download Function

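Define a function that downloads a file from a Google Drive URL, streaming it to disk in 8 KB chunks and retrying up to three times with a 5-second pause between attempts.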

In [13]:
def download_file(url, filename, max_retries=3):
    """Download a file from URL with retry mechanism."""
    for attempt in range(max_retries):
        try:
            print(f"Downloading {filename} (attempt {attempt + 1}/{max_retries})...")
            
            # Set headers to mimic a browser
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
            }

            # Use the session as a context manager so its connections are released
            with requests.Session() as session:
                response = session.get(url, headers=headers, stream=True, timeout=30)
                response.raise_for_status()

                # Stream the response body to disk in 8 KB chunks
                with open(filename, 'wb') as f:
                    for chunk in response.iter_content(chunk_size=8192):
                        if chunk:
                            f.write(chunk)
            
            file_size = os.path.getsize(filename)
            print(f"✓ Successfully downloaded {filename} ({file_size} bytes)")
            return True
            
        except Exception as e:
            print(f"✗ Error downloading {filename}: {str(e)}")
            if attempt < max_retries - 1:
                print("Retrying in 5 seconds...")
                time.sleep(5)
            else:
                print(f"Failed to download {filename} after {max_retries} attempts")
                return False
    
    return False

print("Download function defined successfully!")

Download function defined successfully!
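
One case the retry loop cannot detect is a request that returns HTTP 200 but delivers an HTML page instead of the expected plain-text file. Whether that happens for these particular links is an assumption, not something observed in this notebook's run. The sketch below is a heuristic check that could be run on a file after download. The helper `looks_like_html` and its `probe_bytes` parameter are hypothetical.

```python
import tempfile
from pathlib import Path


def looks_like_html(path: Path, probe_bytes: int = 512) -> bool:
    """Heuristically check whether a file starts like an HTML document."""
    with open(path, "rb") as f:
        head = f.read(probe_bytes).lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")


# Demonstration on two throwaway files
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "labels.txt"
    good.write_text("2\n0\n1\n", encoding="utf-8")
    bad = Path(tmp) / "page.txt"
    bad.write_text("<!DOCTYPE html><html><body>Sign in</body></html>", encoding="utf-8")
    print(looks_like_html(good), looks_like_html(bad))  # False True
```

A check like this only inspects the first bytes, so it is cheap, but it is a heuristic rather than a guarantee that the content is valid.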


## Download All Files

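Download the sentence, sentiment, and topic files for each split, merge them into a single `<split>_data.csv`, and delete the temporary files. A split whose CSV already exists is skipped.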

In [14]:
# Download all files and create CSV datasets
download_results = {}
total_files = sum(len(files) for files in URLS.values())
downloaded_count = 0

print(f"Starting download of {total_files} files...\n")

# Define label mappings
sentiment_labels = ["negative", "neutral", "positive"]
topic_labels = ["lecturer", "training_program", "facility", "others"]

for split_name, files in URLS.items():
    print(f"=== Processing {split_name.upper()} split ===")
    download_results[split_name] = {}
    
    # Check if CSV already exists
    csv_filename = download_dir / f"{split_name}_data.csv"
    if csv_filename.exists():
        print(f"✓ {csv_filename.name} already exists")
        downloaded_count += len(files)
        continue
    
    # Download temporary files
    temp_files = {}
    all_success = True
    
    for file_type, url in files.items():
        temp_filename = download_dir / f"temp_{split_name}_{file_type}.txt"
        
        # Download the file
        success = download_file(url, temp_filename)
        download_results[split_name][file_type] = success
        
        if success:
            temp_files[file_type] = temp_filename
            downloaded_count += 1
        else:
            all_success = False
            break
        
        # Small delay between downloads
        time.sleep(1)
    
    # If all files downloaded successfully, create CSV
    if all_success and len(temp_files) == len(files):
        try:
            print(f"Creating CSV file for {split_name} split...")
            
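            # Read all data (the three files are expected to be line-aligned)
            with open(temp_files['sentences'], 'r', encoding='utf-8') as f:
                sentences = [line.strip() for line in f]

            with open(temp_files['sentiments'], 'r', encoding='utf-8') as f:
                sentiments = [int(line.strip()) for line in f]

            with open(temp_files['topics'], 'r', encoding='utf-8') as f:
                topics = [int(line.strip()) for line in f]

            # zip() silently truncates to the shortest input, so fail loudly on a mismatch
            if not (len(sentences) == len(sentiments) == len(topics)):
                raise ValueError(
                    f"Line count mismatch: {len(sentences)} sentences, "
                    f"{len(sentiments)} sentiments, {len(topics)} topics"
                )

            # Build one record per sentence
            data = []
            for sentence, sentiment, topic in zip(sentences, sentiments, topics):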
                data.append({
                    'sentence': sentence,
                    'sentiment': sentiment,
                    'sentiment_label': sentiment_labels[sentiment],
                    'topic': topic,
                    'topic_label': topic_labels[topic]
                })
            
            df = pd.DataFrame(data)
            
            # Save as CSV
            df.to_csv(csv_filename, index=False, encoding='utf-8')
            print(f"✓ Created {csv_filename.name} with {len(df)} records")
            
            # Clean up temporary files
            for temp_file in temp_files.values():
                temp_file.unlink()
            
        except Exception as e:
            print(f"✗ Error creating CSV for {split_name}: {e}")
            all_success = False
    
    print()

print(f"Download completed: {downloaded_count}/{total_files} files processed")

Starting download of 9 files...

=== Processing TRAIN split ===
Downloading vietnamese_feedback_csv/temp_train_sentences.txt (attempt 1/3)...
✓ Successfully downloaded vietnamese_feedback_csv/temp_train_sentences.txt (898090 bytes)
Downloading vietnamese_feedback_csv/temp_train_sentiments.txt (attempt 1/3)...
✓ Successfully downloaded vietnamese_feedback_csv/temp_train_sentiments.txt (22852 bytes)
Downloading vietnamese_feedback_csv/temp_train_topics.txt (attempt 1/3)...
✓ Successfully downloaded vietnamese_feedback_csv/temp_train_topics.txt (22852 bytes)
Creating CSV file for train split...
✓ Created train_data.csv with 11426 records

=== Processing VALIDATION split ===
Downloading vietnamese_feedback_csv/temp_validation_sentences.txt (attempt 1/3)...
✓ Successfully downloaded vietnamese_feedback_csv/temp_validation_sentences.txt (118628 bytes)
Downloading vietnamese_feedback_csv/temp_validation_sentiments.txt (attempt 1/3)...
✓ Successfully downloaded vietnamese_feedback_csv/temp_val
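
The merge step in the cell above can be isolated into a small pure function, which makes it easier to test without any network access. This is a minimal sketch under the assumption that the three inputs are line-aligned. The names `merge_aligned`, `SENTIMENT_LABELS`, and `TOPIC_LABELS` are illustrative.

```python
SENTIMENT_LABELS = ["negative", "neutral", "positive"]
TOPIC_LABELS = ["lecturer", "training_program", "facility", "others"]


def merge_aligned(sentence_lines, sentiment_lines, topic_lines):
    """Combine three line-aligned sequences into one list of records."""
    # Refuse mismatched inputs instead of letting zip() truncate silently.
    if not (len(sentence_lines) == len(sentiment_lines) == len(topic_lines)):
        raise ValueError("inputs are not line-aligned")
    records = []
    for text, s, t in zip(sentence_lines, sentiment_lines, topic_lines):
        s, t = int(s), int(t)  # int() tolerates surrounding whitespace
        records.append({
            "sentence": text.strip(),
            "sentiment": s,
            "sentiment_label": SENTIMENT_LABELS[s],
            "topic": t,
            "topic_label": TOPIC_LABELS[t],
        })
    return records


rows = merge_aligned(["slide giáo trình đầy đủ .\n"], ["2\n"], ["1\n"])
print(rows[0]["sentiment_label"], rows[0]["topic_label"])  # positive training_program
```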

## Verify Downloaded Files

Check the downloaded files and display basic statistics.

In [15]:
# Verify downloaded CSV files
print("=== File Verification ===")
print(f"Download directory: {download_dir.absolute()}\n")

total_size = 0
file_stats = []

for split_name in URLS.keys():
    csv_filename = download_dir / f"{split_name}_data.csv"
    
    if csv_filename.exists():
        file_size = csv_filename.stat().st_size
        total_size += file_size
        
        try:
            # Read CSV to get row count and column info
            df = pd.read_csv(csv_filename)
            row_count = len(df)
            columns = list(df.columns)
            
            print(f"✓ {csv_filename.name}:")
            print(f"  Size: {file_size:,} bytes")
            print(f"  Rows: {row_count:,}")
            print(f"  Columns: {columns}")
            
            file_stats.append({
                'split': split_name,
                'size': file_size,
                'rows': row_count,
                'columns': len(columns)
            })
            
        except Exception as e:
            print(f"✗ {csv_filename.name}: Error reading file - {e}")
    else:
        print(f"✗ {split_name}_data.csv: File not found")
    
    print()

print(f"Total downloaded size: {total_size:,} bytes ({total_size/1024/1024:.2f} MB)")

=== File Verification ===
Download directory: /Users/ducqhle/Documents/AI_Thinking_DoAn/DoAn/vietnamese_feedback_csv

✓ train_data.csv:
  Size: 1,175,110 bytes
  Rows: 11,426
  Columns: ['sentence', 'sentiment', 'sentiment_label', 'topic', 'topic_label']

✓ validation_data.csv:
  Size: 156,743 bytes
  Rows: 1,583
  Columns: ['sentence', 'sentiment', 'sentiment_label', 'topic', 'topic_label']

✓ test_data.csv:
  Size: 324,249 bytes
  Rows: 3,166
  Columns: ['sentence', 'sentiment', 'sentiment_label', 'topic', 'topic_label']

Total downloaded size: 1,656,102 bytes (1.58 MB)
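
Beyond file size and row count, the content of each CSV can be sanity-checked as well. The sketch below verifies the column names and the label code ranges described earlier in this notebook. The helper `check_split` and the constant `EXPECTED_COLUMNS` are hypothetical, and the demo frame is built by hand rather than read from disk.

```python
import pandas as pd

EXPECTED_COLUMNS = ["sentence", "sentiment", "sentiment_label", "topic", "topic_label"]


def check_split(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    if list(df.columns) != EXPECTED_COLUMNS:
        problems.append(f"unexpected columns: {list(df.columns)}")
        return problems  # later checks rely on these columns existing
    if df["sentence"].isna().any():
        problems.append("missing sentences")
    if not df["sentiment"].isin([0, 1, 2]).all():
        problems.append("sentiment code outside 0-2")
    if not df["topic"].isin([0, 1, 2, 3]).all():
        problems.append("topic code outside 0-3")
    return problems


demo = pd.DataFrame([{
    "sentence": "slide giáo trình đầy đủ .",
    "sentiment": 2, "sentiment_label": "positive",
    "topic": 1, "topic_label": "training_program",
}])
print(check_split(demo))  # []
```

In the notebook this could be applied as `check_split(pd.read_csv(csv_filename))` for each split.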


## Sample Data Preview

Display a few sample records from the training set to verify the data format.

In [16]:
# Preview sample data
print("=== Sample Data Preview (Training Set) ===")

try:
    # Read CSV file
    csv_file = download_dir / "train_data.csv"
    
    if csv_file.exists():
        df = pd.read_csv(csv_file)
        
        print("Sample records:")
        print("-" * 80)
        
        # Display first 5 records
        for i, (_, row) in enumerate(df.head().iterrows(), 1):
            print(f"Record {i}:")
            print(f"  Sentence: {row['sentence']}")
            print(f"  Sentiment: {row['sentiment']} ({row['sentiment_label']})")
            print(f"  Topic: {row['topic']} ({row['topic_label']})")
            print()
        
        print(f"Dataset shape: {df.shape[0]} rows, {df.shape[1]} columns")
        print(f"Columns: {list(df.columns)}")
    else:
        print("Training CSV file not found")

except Exception as e:
    print(f"Error reading sample data: {e}")

=== Sample Data Preview (Training Set) ===
Sample records:
--------------------------------------------------------------------------------
Record 1:
  Sentence: slide giáo trình đầy đủ .
  Sentiment: 2 (positive)
  Topic: 1 (training_program)

Record 2:
  Sentence: nhiệt tình giảng dạy , gần gũi với sinh viên .
  Sentiment: 2 (positive)
  Topic: 0 (lecturer)

Record 3:
  Sentence: đi học đầy đủ full điểm chuyên cần .
  Sentiment: 0 (negative)
  Topic: 1 (training_program)

Record 4:
  Sentence: chưa áp dụng công nghệ thông tin và các thiết bị hỗ trợ cho việc giảng dạy .
  Sentiment: 0 (negative)
  Topic: 0 (lecturer)

Record 5:
  Sentence: thầy giảng bài hay , có nhiều bài tập ví dụ ngay trên lớp .
  Sentiment: 2 (positive)
  Topic: 0 (lecturer)

Dataset shape: 11426 rows, 5 columns
Columns: ['sentence', 'sentiment', 'sentiment_label', 'topic', 'topic_label']


## Dataset Statistics

Display per-split sentence counts, sentiment and topic label distributions, and each split's share of the whole corpus.

In [17]:
# Dataset statistics
print("=== Dataset Statistics ===")

total_sentences = 0
split_stats = {}

for split_name in ['train', 'validation', 'test']:
    csv_file = download_dir / f"{split_name}_data.csv"
    
    if csv_file.exists():
        try:
            df = pd.read_csv(csv_file)
            count = len(df)
            
            split_stats[split_name] = count
            total_sentences += count
            print(f"{split_name.capitalize()} set: {count:,} sentences")
            
            # Display sentiment distribution
            sentiment_dist = df['sentiment_label'].value_counts()
            print(f"  Sentiment distribution: {dict(sentiment_dist)}")
            
            # Display topic distribution
            topic_dist = df['topic_label'].value_counts()
            print(f"  Topic distribution: {dict(topic_dist)}")
            print()
            
        except Exception as e:
            print(f"{split_name.capitalize()} set: Error reading file - {e}")
    else:
        print(f"{split_name.capitalize()} set: File not found")

print(f"Total sentences: {total_sentences:,}")

# Calculate percentages
if total_sentences > 0:
    print("\nSplit distribution:")
    for split_name, count in split_stats.items():
        percentage = (count / total_sentences) * 100
        print(f"  {split_name.capitalize()}: {percentage:.1f}%")

=== Dataset Statistics ===
Train set: 11,426 sentences
  Sentiment distribution: {'positive': np.int64(5643), 'negative': np.int64(5325), 'neutral': np.int64(458)}
  Topic distribution: {'lecturer': np.int64(8166), 'training_program': np.int64(2201), 'others': np.int64(562), 'facility': np.int64(497)}

Validation set: 1,583 sentences
  Sentiment distribution: {'positive': np.int64(805), 'negative': np.int64(705), 'neutral': np.int64(73)}
  Topic distribution: {'lecturer': np.int64(1151), 'training_program': np.int64(267), 'others': np.int64(95), 'facility': np.int64(70)}

Test set: 3,166 sentences
  Sentiment distribution: {'positive': np.int64(1590), 'negative': np.int64(1409), 'neutral': np.int64(167)}
  Topic distribution: {'lecturer': np.int64(2290), 'training_program': np.int64(572), 'others': np.int64(159), 'facility': np.int64(145)}

Total sentences: 16,175

Split distribution:
  Train: 70.6%
  Validation: 9.8%
  Test: 19.6%
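
The label distributions above are heavily skewed. As a worked example, the sketch below turns the training-set sentiment counts from the output into class shares. The variable names are illustrative.

```python
# Training-set sentiment counts copied from the statistics output above.
train_sentiment_counts = {"positive": 5643, "negative": 5325, "neutral": 458}

total = sum(train_sentiment_counts.values())  # 11426
shares = {label: count / total for label, count in train_sentiment_counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
# positive: 49.4%
# negative: 46.6%
# neutral: 4.0%
```

With the neutral class at roughly 4% of the training data, plain accuracy may hide poor performance on that class, so per-class metrics are worth considering.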


## Data Loading Function

Create a helper function to load the dataset for further analysis or machine learning tasks.

In [18]:
def load_dataset(split='train'):
    """Load a specific split of the Vietnamese Students' Feedback dataset from CSV.
    
    Args:
        split (str): Dataset split to load ('train', 'validation', or 'test')
    
    Returns:
        pandas.DataFrame: DataFrame containing sentence, sentiment, topic, and labels
    """
    csv_file = download_dir / f"{split}_data.csv"
    
    if not csv_file.exists():
        raise FileNotFoundError(f"CSV file for {split} split not found: {csv_file}")
    
    df = pd.read_csv(csv_file)
    return df

def load_all_data():
    """Load all splits and combine them into a single DataFrame with split column."""
    all_data = []
    
    for split in ['train', 'validation', 'test']:
        try:
            df = load_dataset(split)
            df['split'] = split
            all_data.append(df)
        except FileNotFoundError:
            print(f"Warning: {split} split not found, skipping...")
    
    if all_data:
        combined_df = pd.concat(all_data, ignore_index=True)
        return combined_df
    else:
        raise FileNotFoundError("No dataset files found")

# Test the functions
try:
    sample_df = load_dataset('train').head(3)
    print("✓ CSV data loading function created successfully!")
    print("Sample loaded data structure:")
    print(f"  Columns: {list(sample_df.columns)}")
    print(f"  Shape: {sample_df.shape}")
    print("\nFirst record:")
    print(sample_df.iloc[0].to_dict())
except Exception as e:
    print(f"Error testing data loading function: {e}")

✓ CSV data loading function created successfully!
Sample loaded data structure:
  Columns: ['sentence', 'sentiment', 'sentiment_label', 'topic', 'topic_label']
  Shape: (3, 5)

First record:
{'sentence': 'slide giáo trình đầy đủ .', 'sentiment': 2, 'sentiment_label': 'positive', 'topic': 1, 'topic_label': 'training_program'}


## Summary

The Vietnamese Students' Feedback Corpus has been successfully downloaded and saved to CSV files. The dataset is now ready for sentiment analysis and topic classification tasks.

### Dataset Information:
- **Source**: UIT-VSFC (Vietnamese Students' Feedback Corpus)
- **Size**: 16,175 annotated sentences (11,426 train, 1,583 validation, 3,166 test)
- **Tasks**: Sentiment analysis and topic classification
- **Splits**: Training, validation, and test sets
- **Labels**: 
  - Sentiment: negative (0), neutral (1), positive (2)
  - Topic: lecturer (0), training_program (1), facility (2), others (3)

### CSV Files Created:
- `vietnamese_feedback_csv/` directory containing CSV files
- `train_data.csv` - Training set (11,426 rows)
- `validation_data.csv` - Validation set (1,583 rows)
- `test_data.csv` - Test set (3,166 rows)

### CSV Structure:
Each CSV file contains the following columns:
- `sentence`: The feedback text
- `sentiment`: Numeric sentiment label (0, 1, 2)
- `sentiment_label`: Text sentiment label (negative, neutral, positive)
- `topic`: Numeric topic label (0, 1, 2, 3)
- `topic_label`: Text topic label (lecturer, training_program, facility, others)

### Usage:
```python
# Load a specific split
train_df = load_dataset('train')

# Load all data with split information
all_df = load_all_data()

# Access data using pandas operations
positive_feedback = train_df[train_df['sentiment_label'] == 'positive']
lecturer_feedback = train_df[train_df['topic_label'] == 'lecturer']
```
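
Because every row carries both a sentiment and a topic label, the two can be cross-tabulated. The sketch below uses `pd.crosstab` on a tiny hand-built frame whose three rows are records 1, 2, and 4 from the sample preview. On the real data, `demo` would be replaced by a loaded split such as `train_df`.

```python
import pandas as pd

# Three rows copied from the sample preview earlier in this notebook.
demo = pd.DataFrame({
    "sentiment_label": ["positive", "positive", "negative"],
    "topic_label": ["training_program", "lecturer", "lecturer"],
})

# Rows are topics, columns are sentiments, cells are sentence counts.
table = pd.crosstab(demo["topic_label"], demo["sentiment_label"])
print(table)
```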

The data can now be used for Vietnamese NLP tasks such as sentiment analysis and topic classification, for example with pandas and scikit-learn.