# Module 1: Data Ingestion & Resume Handling

**Project:** CVSense - Intelligent Resume Classifier  
**Module Owner:** Ammaar Ahmed  
**Date:** January 2026

---

## Overview

This module is responsible for:
1. **Data Collection**: Downloading a resume dataset from Kaggle and creating sample job descriptions
2. **PDF Extraction**: Converting PDF resumes to text format
3. **Data Organization**: Storing data in a structured format for downstream modules
4. **Data Validation**: Ensuring quality and consistency of extracted data

---

## Dependencies

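Required third-party packages:
- `pandas`: Data manipulation
- `pdfplumber` (preferred) or `PyPDF2` (fallback): PDF text extraction
- `kaggle`: Kaggle dataset download through the official API client
- `opendatasets`: Imported in the setup cell; the download cell uses the `kaggle` client instead
- `python-dotenv`: Loading Kaggle credentials from `.env`

Standard-library modules (no installation needed):
- `pathlib`: File path handling
- `json`: Data serialization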
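
A quick way to see which of the packages listed above are importable before running the notebook is sketched below. The helper name `missing_modules` is hypothetical, and the names in `required` are the import names used in this notebook's import statements (`dotenv` is the import name of the `python-dotenv` package).

```python
from importlib.util import find_spec

def missing_modules(names):
    """Return the top-level module names in `names` that cannot be located."""
    return [name for name in names if find_spec(name) is None]

# Import names used by this notebook; `dotenv` is provided by python-dotenv
required = ["pandas", "pdfplumber", "PyPDF2", "opendatasets", "kaggle", "dotenv"]
missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available")
```

`find_spec` only locates a module without importing it, so the check has no side effects such as triggering Kaggle authentication.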

## 1. Environment Setup & Imports

In [1]:
# Install required packages (python-dotenv is needed to read credentials from .env)
!pip install -q pandas PyPDF2 pdfplumber opendatasets kaggle python-dotenv

In [2]:
import pandas as pd
import os
import json
import re
from pathlib import Path
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Load environment variables if .env file exists
try:
    from dotenv import load_dotenv
    
    # Load .env file from project root
    env_path = Path('/home/ammaar/CODE/CVSense/.env')
    if env_path.exists():
        load_dotenv(env_path, override=True)
        
        # Set Kaggle credentials from environment variables
        kaggle_user = os.getenv('KAGGLE_USERNAME')
        kaggle_key = os.getenv('KAGGLE_KEY')
        
        if kaggle_user and kaggle_key:
            os.environ['KAGGLE_USERNAME'] = kaggle_user
            os.environ['KAGGLE_KEY'] = kaggle_key
            print("✓ Loaded credentials from .env file")
            print(f"✓ Kaggle user: {kaggle_user}")
        else:
            print("⚠ Warning: KAGGLE_USERNAME or KAGGLE_KEY not found in .env")
            print("  → Edit .env and add your credentials (see README.md)")
    else:
        print("⚠ No .env file found. Checking for ~/.kaggle/kaggle.json...")
        kaggle_json = Path.home() / '.kaggle' / 'kaggle.json'
        if kaggle_json.exists():
            print("✓ Using system-wide Kaggle credentials")
        else:
            print("  → Create .env file with credentials (see README.md)")
except ImportError:
    print("⚠ python-dotenv not installed. Install with: pip install python-dotenv")

# PDF extraction libraries
try:
    import pdfplumber
    PDF_LIBRARY = 'pdfplumber'
    print("✓ Using pdfplumber for PDF extraction")
except ImportError:
    import PyPDF2
    PDF_LIBRARY = 'PyPDF2'
    print("✓ Using PyPDF2 for PDF extraction")

# Kaggle dataset download
import opendatasets as od

print("\n✓ All imports successful")

✓ Loaded credentials from .env file
✓ Kaggle user: ammaarx
✓ Using pdfplumber for PDF extraction

✓ All imports successful


## 2. Configuration & Directory Setup

### 🔐 Kaggle API Setup

Make sure your Kaggle credentials are set in the `.env` file:

```bash
# In project root
cp .env.example .env
# Edit .env with your Kaggle username & API key from kaggle.com/account
```

The `.env` file is listed in `.gitignore`, so your credentials stay out of version control.

In [3]:
# Project paths
PROJECT_ROOT = Path('/home/ammaar/CODE/CVSense')
DATA_DIR = PROJECT_ROOT / 'data'
RESUME_DIR = DATA_DIR / 'resumes'
JOB_DESC_DIR = DATA_DIR / 'job_descriptions'
MODULE_DIR = PROJECT_ROOT / 'module_1_data_ingestion'

# Dataset configuration
MAX_RESUMES = 100
MAX_JOB_DESCRIPTIONS = 50

# Create directories if they don't exist
RESUME_DIR.mkdir(parents=True, exist_ok=True)
JOB_DESC_DIR.mkdir(parents=True, exist_ok=True)

print("Directory Structure:")
print(f"  └── Project Root: {PROJECT_ROOT}")
print(f"      └── Data: {DATA_DIR}")
print(f"          ├── Resumes: {RESUME_DIR}")
print(f"          └── Job Descriptions: {JOB_DESC_DIR}")
print(f"\n✓ Directories configured successfully")

Directory Structure:
  └── Project Root: /home/ammaar/CODE/CVSense
      └── Data: /home/ammaar/CODE/CVSense/data
          ├── Resumes: /home/ammaar/CODE/CVSense/data/resumes
          └── Job Descriptions: /home/ammaar/CODE/CVSense/data/job_descriptions

✓ Directories configured successfully


## 3. Dataset Download from Kaggle

We'll download a publicly available resume dataset from Kaggle. You'll need your Kaggle API credentials.

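**Note:** Credentials are read from `.env` first. If they are not set there, the cell falls back to an existing `kaggle.json` in the `~/.kaggle/` directory. When `.env` credentials are present, the cell writes them to `~/.kaggle/kaggle.json`, replacing any file already there.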
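
The credential lookup used by the download cell can be sketched as a small pure function. This is an illustration rather than the notebook's actual code path: the name `resolve_kaggle_credentials` is hypothetical, and the precedence shown (environment variables first, then `kaggle.json`) follows the order of checks in the download cell.

```python
import json
import os
from pathlib import Path

# Placeholder value that the download cell treats as "not configured"
PLACEHOLDER_USER = "your_kaggle_username_here"

def resolve_kaggle_credentials(env, kaggle_json_path):
    """Return (username, key, source), or None if no usable credentials exist.

    `env` is any mapping (for example os.environ); `kaggle_json_path` points
    at a kaggle.json file holding {"username": ..., "key": ...}.
    """
    user = env.get("KAGGLE_USERNAME")
    key = env.get("KAGGLE_KEY")
    if user and key and user != PLACEHOLDER_USER:
        return user, key, "env"

    path = Path(kaggle_json_path)
    if path.exists():
        data = json.loads(path.read_text(encoding="utf-8"))
        if data.get("username") and data.get("key"):
            return data["username"], data["key"], "kaggle.json"
    return None

creds = resolve_kaggle_credentials(os.environ, Path.home() / ".kaggle" / "kaggle.json")
# Report only where the credentials came from, never the key itself
print("Credential source:", creds[2] if creds else "none found")
```

Keeping the lookup free of side effects makes it easy to test with a plain dictionary in place of `os.environ`.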

In [4]:
# Download Resume Dataset from Kaggle
RESUME_DATASET_URL = 'https://www.kaggle.com/datasets/snehaanbhawal/resume-dataset'
DATASET_SLUG = 'snehaanbhawal/resume-dataset'

try:
    print("Downloading resume dataset from Kaggle...")
    print("(Using credentials from .env or ~/.kaggle/kaggle.json)\n")
    
    # Check if credentials are available in environment
    kaggle_user = os.getenv('KAGGLE_USERNAME')
    kaggle_key = os.getenv('KAGGLE_KEY')
    kaggle_dir = Path.home() / '.kaggle'
    kaggle_json = kaggle_dir / 'kaggle.json'
    
    # If .env credentials exist, create/update kaggle.json
    if kaggle_user and kaggle_key and kaggle_user != 'your_kaggle_username_here':
        print(f"✓ Using credentials from .env (user: {kaggle_user})")
        
        # Create .kaggle directory if it doesn't exist
        kaggle_dir.mkdir(exist_ok=True)
        
        # Write credentials to kaggle.json (overwrites any existing file)
        kaggle_creds = {
            "username": kaggle_user,
            "key": kaggle_key
        }
        with open(kaggle_json, 'w') as f:
            json.dump(kaggle_creds, f)
        
        # Restrict the file to the current user (Unix-like systems)
        try:
            kaggle_json.chmod(0o600)
        except OSError:
            pass
        
        print("✓ Created ~/.kaggle/kaggle.json from .env credentials")
    
    elif kaggle_json.exists():
        print("✓ Using existing ~/.kaggle/kaggle.json")
    else:
        raise ValueError(
            "Kaggle credentials not found!\n"
            "Setup: cp .env.example .env and add your credentials\n"
            "Get API key from: https://www.kaggle.com/account"
        )
    
    # Download dataset using Kaggle API (more reliable than opendatasets)
    from kaggle.api.kaggle_api_extended import KaggleApi
    
    api = KaggleApi()
    api.authenticate()
    
    temp_download_dir = MODULE_DIR / 'temp_kaggle_data'
    temp_download_dir.mkdir(parents=True, exist_ok=True)  # also creates MODULE_DIR if missing
    
    print("Downloading from Kaggle (this may take a minute)...")
    api.dataset_download_files(DATASET_SLUG, path=str(temp_download_dir), unzip=True)
    
    print("\n✓ Dataset downloaded successfully")
    
except Exception as e:
    print(f"⚠ Error downloading dataset: {e}")
    print("\nThe next cell falls back to a small sample dataset for demonstration.")
    print("For production: add Kaggle credentials to .env file")

Downloading resume dataset from Kaggle...
(Using credentials from .env or ~/.kaggle/kaggle.json)

✓ Using credentials from .env (user: ammaarx)
✓ Created ~/.kaggle/kaggle.json from .env credentials
Downloading from Kaggle (this may take a minute)...
Dataset URL: https://www.kaggle.com/datasets/snehaanbhawal/resume-dataset

✓ Dataset downloaded successfully


## 4. Load and Process Resume Dataset

In [5]:
# Find the downloaded CSV file
temp_download_dir = MODULE_DIR / 'temp_kaggle_data'

# Look for CSV files in the downloaded directory (sorted so the pick is deterministic)
csv_files = sorted(temp_download_dir.rglob('*.csv'))

if csv_files:
    resume_csv_path = csv_files[0]
    print(f"Found resume dataset: {resume_csv_path.name}")
    
    # Load the dataset
    df_resumes = pd.read_csv(resume_csv_path)
    
    print(f"\nDataset shape: {df_resumes.shape}")
    print(f"Columns: {list(df_resumes.columns)}")
    print(f"\nFirst few rows:")
    display(df_resumes.head())
else:
    print("No CSV files found. Creating sample dataset...")
    # Create a small sample dataset for demonstration
    df_resumes = pd.DataFrame({
        'Category': ['Data Science', 'Software Engineering', 'Web Development'],
        'Resume': [
            'Experienced Data Scientist with Python, ML, and statistical analysis skills...',
            'Software Engineer proficient in Java, C++, and system design...',
            'Full-stack web developer with React, Node.js, and database experience...'
        ]
    })
    print("Sample dataset created for demonstration")

Found resume dataset: Resume.csv

Dataset shape: (2484, 4)
Columns: ['ID', 'Resume_str', 'Resume_html', 'Category']

First few rows:


| | ID | Resume_str | Resume_html | Category |
|---|---|---|---|---|
| 0 | 16852973 | HR ADMINISTRATOR/MARKETING ASSOCIATE\... | `<div class="fontsize fontface vmargins hmargin...` | HR |
| 1 | 22323967 | HR SPECIALIST, US HR OPERATIONS ... | `<div class="fontsize fontface vmargins hmargin...` | HR |
| 2 | 33176873 | HR DIRECTOR Summary Over 2... | `<div class="fontsize fontface vmargins hmargin...` | HR |
| 3 | 27018550 | HR SPECIALIST Summary Dedica... | `<div class="fontsize fontface vmargins hmargin...` | HR |
| 4 | 17812897 | HR MANAGER Skill Highlights ... | `<div class="fontsize fontface vmargins hmargin...` | HR |


In [6]:
# Limit to MAX_RESUMES
if len(df_resumes) > MAX_RESUMES:
    print(f"Dataset has {len(df_resumes)} resumes. Limiting to {MAX_RESUMES}...")
    df_resumes = df_resumes.sample(n=MAX_RESUMES, random_state=42).reset_index(drop=True)
else:
    print(f"Using all {len(df_resumes)} resumes from dataset")

print(f"\n✓ Working with {len(df_resumes)} resumes")

Dataset has 2484 resumes. Limiting to 100...

✓ Working with 100 resumes


## 5. Data Cleaning & Standardization

In [7]:
def clean_text(text):
    """
    Basic text cleaning for resumes and job descriptions.
    
    Args:
        text (str): Raw text to clean
        
    Returns:
        str: Cleaned text
    """
    if pd.isna(text):
        return ""
    
    # Convert to string
    text = str(text)
    
    # Remove excessive whitespace
    text = re.sub(r'\s+', ' ', text)
    
    # Remove special characters that might cause issues
    text = text.replace('\x00', '')
    
    # Strip leading/trailing whitespace
    text = text.strip()
    
    return text

# Identify the resume text column (it might have different names)
resume_text_col = None
for col in df_resumes.columns:
    if 'resume' in col.lower() or 'text' in col.lower():
        resume_text_col = col
        break

if resume_text_col:
    print(f"Resume text column identified: '{resume_text_col}'")
    df_resumes['cleaned_resume'] = df_resumes[resume_text_col].apply(clean_text)
else:
    print("Could not identify resume text column automatically")
    print(f"Available columns: {list(df_resumes.columns)}")
    # Use the second column by default if exists
    if len(df_resumes.columns) > 1:
        resume_text_col = df_resumes.columns[1]
        print(f"Using column: '{resume_text_col}'")
        df_resumes['cleaned_resume'] = df_resumes[resume_text_col].apply(clean_text)

Resume text column identified: 'Resume_str'
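
The first-match loop above works for this dataset because `Resume_str` precedes `Resume_html` in the column order. A slightly more defensive selection, sketched below under the assumption that plain-text columns should be preferred over HTML ones, makes that preference explicit. The function name `pick_text_column` and the `html` exclusion rule are illustrative choices, not part of the notebook.

```python
def pick_text_column(columns, keywords=("resume", "text"), avoid=("html",)):
    """Return the first column whose lowercase name contains a keyword and
    none of the `avoid` substrings; fall back to any keyword match; else None."""
    matches = [c for c in columns if any(k in c.lower() for k in keywords)]
    preferred = [c for c in matches if not any(a in c.lower() for a in avoid)]
    if preferred:
        return preferred[0]
    return matches[0] if matches else None

print(pick_text_column(["ID", "Resume_html", "Resume_str", "Category"]))  # Resume_str
print(pick_text_column(["Category", "Resume"]))                           # Resume
print(pick_text_column(["ID", "Category"]))                               # None
```

Returning `None` instead of silently defaulting to the second column leaves the fallback decision to the caller.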


## 6. Data Validation Functions

In [8]:
def validate_resume_text(text, min_length=50):
    """
    Validate resume text quality.
    
    Args:
        text (str): Resume text to validate
        min_length (int): Minimum acceptable length
        
    Returns:
        dict: Validation results with 'valid' flag and 'issues' list
    """
    issues = []
    
    if not text or pd.isna(text):
        issues.append("Empty text")
        # Return the same keys as the normal path so callers can rely on them
        return {'valid': False, 'issues': issues, 'length': 0, 'alpha_ratio': 0.0}
    
    text = str(text)
    
    # Check minimum length
    if len(text) < min_length:
        issues.append(f"Text too short ({len(text)} chars)")
    
    # Check the share of letters and whitespace (a low share suggests garbled text)
    alpha_ratio = sum(c.isalpha() or c.isspace() for c in text) / len(text)
    if alpha_ratio < 0.5:
        issues.append(f"Low alphabetic content ({alpha_ratio:.2%})")
    
    # Check for common extraction errors (U+FFFD is the Unicode replacement character)
    if text.count('\ufffd') > 5:
        issues.append("Contains encoding errors")
    
    return {
        'valid': len(issues) == 0,
        'issues': issues,
        'length': len(text),
        'alpha_ratio': alpha_ratio
    }

# Validate all resumes
print("Validating resume data quality...\n")

if 'cleaned_resume' in df_resumes.columns:
    df_resumes['validation'] = df_resumes['cleaned_resume'].apply(validate_resume_text)
    df_resumes['is_valid'] = df_resumes['validation'].apply(lambda x: x['valid'])
    
    valid_count = df_resumes['is_valid'].sum()
    invalid_count = len(df_resumes) - valid_count
    
    print(f"Validation Results:")
    print(f"  ✓ Valid resumes: {valid_count}")
    print(f"  ✗ Invalid resumes: {invalid_count}")
    
    if invalid_count > 0:
        print(f"\nSample issues:")
        invalid_samples = df_resumes[~df_resumes['is_valid']].head(3)
        for idx, row in invalid_samples.iterrows():
            print(f"  Resume {idx}: {row['validation']['issues']}")

Validating resume data quality...

Validation Results:
  ✓ Valid resumes: 100
  ✗ Invalid resumes: 0
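
To make the three heuristics concrete, the sketch below re-implements them without pandas and runs them on a few made-up strings. The thresholds (50 characters, 50% letters or whitespace, more than 5 replacement characters) are the ones used in `validate_resume_text`; the function name `check_text` and the sample inputs are illustrative only.

```python
def check_text(text, min_length=50):
    """Return the list of issues found in `text` (an empty list means valid)."""
    if not text:
        return ["Empty text"]
    issues = []
    if len(text) < min_length:
        issues.append("Text too short")
    # Letters and whitespace both count as "alphabetic" content, as in the notebook
    alpha_ratio = sum(c.isalpha() or c.isspace() for c in text) / len(text)
    if alpha_ratio < 0.5:
        issues.append("Low alphabetic content")
    # U+FFFD is the Unicode replacement character left behind by failed decoding
    if text.count("\ufffd") > 5:
        issues.append("Contains encoding errors")
    return issues

good = "Experienced data scientist with Python, SQL, and statistics skills."
numeric = "0123456789" * 6
garbled = "Resume " + "\ufffd" * 6 + " text that is otherwise long enough to pass the length check"

print(check_text(good))     # []
print(check_text("short"))  # ['Text too short']
print(check_text(numeric))  # ['Low alphabetic content']
print(check_text(garbled))  # ['Contains encoding errors']
```

Each sample trips exactly one rule, which shows that the checks are independent and that a single resume can accumulate several issues.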


## 7. PDF Extraction Utility (for future PDF resumes)

In [9]:
def extract_text_from_pdf(pdf_path):
    """
    Extract text from a PDF file.
    
    Args:
        pdf_path (str or Path): Path to PDF file
        
    Returns:
        str: Extracted text from PDF
    """
    pdf_path = Path(pdf_path)
    
    if not pdf_path.exists():
        raise FileNotFoundError(f"PDF file not found: {pdf_path}")
    
    text = ""
    
    if PDF_LIBRARY == 'pdfplumber':
        try:
            with pdfplumber.open(pdf_path) as pdf:
                for page in pdf.pages:
                    page_text = page.extract_text()
                    if page_text:
                        text += page_text + "\n"
        except Exception as e:
            print(f"Error extracting with pdfplumber: {e}")
    
    else:  # PyPDF2
        try:
            with open(pdf_path, 'rb') as file:
                pdf_reader = PyPDF2.PdfReader(file)
                for page in pdf_reader.pages:
                    page_text = page.extract_text()
                    if page_text:
                        text += page_text + "\n"
        except Exception as e:
            print(f"Error extracting with PyPDF2: {e}")
    
    return clean_text(text)

print("✓ PDF extraction function defined")
print(f"  Using: {PDF_LIBRARY}")

✓ PDF extraction function defined
  Using: pdfplumber
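
A typical use of this utility is converting a folder of PDF resumes into records for the pipeline. The sketch below shows one way to do that. `extract_directory` is a hypothetical helper, and the extractor is passed in as a parameter so the example runs without a PDF library installed. In the notebook, `extract_text_from_pdf` would be passed as `extractor`.

```python
from pathlib import Path

def extract_directory(pdf_dir, extractor):
    """Apply `extractor` to every *.pdf directly under `pdf_dir`.

    Returns a list of {'filename', 'text', 'error'} records, one per file,
    in sorted filename order.
    """
    records = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        try:
            text = extractor(pdf_path)
            records.append({"filename": pdf_path.name, "text": text, "error": None})
        except Exception as exc:
            # Record the failure and continue so one bad file does not stop the batch
            records.append({"filename": pdf_path.name, "text": "", "error": str(exc)})
    return records

# Assumed usage inside the notebook (names defined in earlier cells):
# records = extract_directory(RESUME_DIR, extract_text_from_pdf)
```

Keeping the error message alongside each record makes it possible to report failed files in the data quality report instead of silently dropping them.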


## 8. Create Job Descriptions Dataset

This cell creates ten sample job descriptions, written to align with common resume categories.

In [10]:
# Sample job descriptions for common tech roles
sample_job_descriptions = [
    {
        'job_id': 'JD001',
        'title': 'Senior Data Scientist',
        'category': 'Data Science',
        'description': '''We are seeking a Senior Data Scientist to join our AI team. 
        The ideal candidate will have strong experience in machine learning, statistical analysis, 
        and Python programming. Responsibilities include developing predictive models, conducting 
        A/B testing, and presenting insights to stakeholders. Required skills: Python, SQL, 
        TensorFlow/PyTorch, scikit-learn, pandas, statistics, and data visualization. 
        Experience with cloud platforms (AWS/GCP) is a plus.'''
    },
    {
        'job_id': 'JD002',
        'title': 'Full Stack Software Engineer',
        'category': 'Software Engineering',
        'description': '''Looking for a Full Stack Software Engineer to build scalable web applications. 
        You will work on both frontend and backend development using modern technologies. 
        Required skills: JavaScript, React, Node.js, RESTful APIs, databases (PostgreSQL/MongoDB), 
        Git, and Agile methodologies. Experience with cloud deployment and CI/CD pipelines preferred. 
        Strong problem-solving and communication skills required.'''
    },
    {
        'job_id': 'JD003',
        'title': 'Machine Learning Engineer',
        'category': 'Data Science',
        'description': '''Seeking Machine Learning Engineer to develop and deploy ML models at scale. 
        Responsibilities include model training, optimization, and productionization. Required skills: 
        Python, deep learning frameworks (TensorFlow/PyTorch), MLOps, Docker, Kubernetes, and 
        experience with large-scale datasets. Knowledge of NLP and computer vision is a plus. 
        PhD or Masters in Computer Science or related field preferred.'''
    },
    {
        'job_id': 'JD004',
        'title': 'Frontend Developer',
        'category': 'Web Development',
        'description': '''We need a creative Frontend Developer to build beautiful user interfaces. 
        You will work with designers to implement responsive web applications. Required skills: 
        HTML5, CSS3, JavaScript, React or Vue.js, responsive design, cross-browser compatibility, 
        and version control (Git). Experience with TypeScript, testing frameworks, and UI/UX 
        principles is highly valued.'''
    },
    {
        'job_id': 'JD005',
        'title': 'DevOps Engineer',
        'category': 'DevOps',
        'description': '''Looking for DevOps Engineer to manage our cloud infrastructure and CI/CD pipelines. 
        Responsibilities include automation, monitoring, and ensuring system reliability. Required skills: 
        Linux, Docker, Kubernetes, Jenkins/GitLab CI, AWS/Azure, Terraform, scripting (Python/Bash), 
        and networking fundamentals. Experience with monitoring tools (Prometheus, Grafana) preferred.'''
    },
    {
        'job_id': 'JD006',
        'title': 'Data Analyst',
        'category': 'Data Science',
        'description': '''Seeking Data Analyst to transform data into actionable insights. You will create 
        dashboards, perform statistical analysis, and support business decision-making. Required skills: 
        SQL, Excel, Python/R, data visualization (Tableau/Power BI), statistical analysis, and 
        business intelligence. Strong analytical thinking and communication skills essential.'''
    },
    {
        'job_id': 'JD007',
        'title': 'Backend Developer',
        'category': 'Software Engineering',
        'description': '''We are hiring a Backend Developer to build robust server-side applications. 
        You will design APIs, optimize databases, and ensure system scalability. Required skills: 
        Java/Python/Node.js, RESTful API design, databases (SQL and NoSQL), microservices architecture, 
        caching (Redis), and message queues. Experience with distributed systems is a plus.'''
    },
    {
        'job_id': 'JD008',
        'title': 'Mobile App Developer',
        'category': 'Mobile Development',
        'description': '''Looking for Mobile App Developer to create native mobile applications. 
        You will develop features for iOS and Android platforms. Required skills: Swift/Kotlin, 
        mobile UI/UX patterns, RESTful APIs, local databases, push notifications, and app store 
        deployment. Experience with React Native or Flutter is beneficial.'''
    },
    {
        'job_id': 'JD009',
        'title': 'Cloud Architect',
        'category': 'Cloud Computing',
        'description': '''Seeking experienced Cloud Architect to design scalable cloud solutions. 
        You will define architecture patterns, security standards, and migration strategies. 
        Required skills: AWS/Azure/GCP, cloud architecture patterns, security best practices, 
        infrastructure as code, networking, and cost optimization. Relevant certifications preferred.'''
    },
    {
        'job_id': 'JD010',
        'title': 'QA Automation Engineer',
        'category': 'Quality Assurance',
        'description': '''We need QA Automation Engineer to build and maintain test automation frameworks. 
        You will design test strategies and ensure product quality. Required skills: Test automation 
        (Selenium/Cypress), programming (Python/Java), API testing, CI/CD integration, test frameworks, 
        and bug tracking tools. Experience with performance and security testing is a plus.'''
    }
]

# Create DataFrame
df_jobs = pd.DataFrame(sample_job_descriptions)
df_jobs['cleaned_description'] = df_jobs['description'].apply(clean_text)

print(f"Created {len(df_jobs)} job descriptions")
print(f"\nCategories: {df_jobs['category'].unique()}")
print(f"\n✓ Job descriptions dataset ready")

Created 10 job descriptions

Categories: ['Data Science' 'Software Engineering' 'Web Development' 'DevOps'
 'Mobile Development' 'Cloud Computing' 'Quality Assurance']

✓ Job descriptions dataset ready


## 9. Save Processed Data for Next Modules

In [11]:
# Save resumes to CSV
resume_output_path = DATA_DIR / 'processed_resumes.csv'
if 'cleaned_resume' in df_resumes.columns:
    # Save only essential columns
    columns_to_save = []
    if 'Category' in df_resumes.columns:
        columns_to_save.append('Category')
    columns_to_save.extend(['cleaned_resume', 'is_valid'])
    
    df_resumes[columns_to_save].to_csv(resume_output_path, index=False)
    print(f"✓ Saved {len(df_resumes)} resumes to: {resume_output_path}")

# Save job descriptions to CSV
job_output_path = DATA_DIR / 'processed_job_descriptions.csv'
df_jobs.to_csv(job_output_path, index=False)
print(f"✓ Saved {len(df_jobs)} job descriptions to: {job_output_path}")

# Save also as individual text files for easy access
for idx, row in df_jobs.iterrows():
    job_file = JOB_DESC_DIR / f"{row['job_id']}_{row['title'].replace(' ', '_')}.txt"
    with open(job_file, 'w', encoding='utf-8') as f:
        f.write(f"Title: {row['title']}\n")
        f.write(f"Category: {row['category']}\n")
        f.write(f"\nDescription:\n{row['description']}")

print(f"✓ Saved individual job description files to: {JOB_DESC_DIR}")

✓ Saved 100 resumes to: /home/ammaar/CODE/CVSense/data/processed_resumes.csv
✓ Saved 10 job descriptions to: /home/ammaar/CODE/CVSense/data/processed_job_descriptions.csv
✓ Saved individual job description files to: /home/ammaar/CODE/CVSense/data/job_descriptions


## 10. Generate Data Quality Report

In [12]:
def generate_data_quality_report():
    """
    Generate a comprehensive data quality report for ingested data.
    """
    report = {
        'timestamp': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
        'resumes': {
            'total_count': len(df_resumes),
            'valid_count': int(df_resumes['is_valid'].sum()) if 'is_valid' in df_resumes.columns else 0,
            'invalid_count': int((~df_resumes['is_valid']).sum()) if 'is_valid' in df_resumes.columns else 0,
        },
        'job_descriptions': {
            'total_count': len(df_jobs),
            'categories': list(df_jobs['category'].unique()),
        },
        'data_paths': {
            'resumes_csv': str(resume_output_path),
            'jobs_csv': str(job_output_path),
            'job_desc_dir': str(JOB_DESC_DIR),
        }
    }
    
    # Add statistics
    if 'cleaned_resume' in df_resumes.columns:
        valid_resumes = df_resumes[df_resumes['is_valid']]
        if len(valid_resumes) > 0:
            lengths = valid_resumes['cleaned_resume'].str.len()
            report['resumes']['avg_length'] = int(lengths.mean())
            report['resumes']['min_length'] = int(lengths.min())
            report['resumes']['max_length'] = int(lengths.max())
    
    return report

# Generate and save report
quality_report = generate_data_quality_report()
report_path = MODULE_DIR / 'data_quality_report.json'

with open(report_path, 'w') as f:
    json.dump(quality_report, f, indent=2)

print("\n" + "="*60)
print("DATA QUALITY REPORT")
print("="*60)
print(f"\nTimestamp: {quality_report['timestamp']}")
print(f"\nResumes:")
print(f"  Total: {quality_report['resumes']['total_count']}")
print(f"  Valid: {quality_report['resumes']['valid_count']}")
print(f"  Invalid: {quality_report['resumes']['invalid_count']}")
if 'avg_length' in quality_report['resumes']:
    print(f"  Avg Length: {quality_report['resumes']['avg_length']} characters")
    print(f"  Length Range: {quality_report['resumes']['min_length']} - {quality_report['resumes']['max_length']}")

print(f"\nJob Descriptions:")
print(f"  Total: {quality_report['job_descriptions']['total_count']}")
print(f"  Categories: {', '.join(quality_report['job_descriptions']['categories'])}")

print(f"\n✓ Report saved to: {report_path}")
print("="*60)


DATA QUALITY REPORT

Timestamp: 2026-01-17 22:27:49

Resumes:
  Total: 100
  Valid: 100
  Invalid: 0
  Avg Length: 6345 characters
  Length Range: 1319 - 35217

Job Descriptions:
  Total: 10
  Categories: Data Science, Software Engineering, Web Development, DevOps, Mobile Development, Cloud Computing, Quality Assurance

✓ Report saved to: /home/ammaar/CODE/CVSense/module_1_data_ingestion/data_quality_report.json


## 11. Create Data Format Specification Document

In [13]:
data_spec = """
# Data Format Specification for CVSense Pipeline

## Module 1 Output Format

### Processed Resumes (`data/processed_resumes.csv`)

**Columns:**
- `Category` (optional): The category/field of the resume (e.g., 'Data Science', 'Software Engineering')
- `cleaned_resume`: Cleaned and validated resume text ready for preprocessing
- `is_valid`: Boolean flag indicating if the resume passed quality validation

**Data Quality Standards:**
- Minimum text length: 50 characters
- Minimum alphabetic content ratio: 50%
- Null bytes removed; text with more than 5 replacement characters (U+FFFD) is flagged invalid
- Runs of whitespace collapsed to a single space

### Processed Job Descriptions (`data/processed_job_descriptions.csv`)

**Columns:**
- `job_id`: Unique identifier for the job posting (e.g., 'JD001')
- `title`: Job title
- `category`: Job category/field
- `description`: Original job description text
- `cleaned_description`: Cleaned job description ready for preprocessing

**Individual Files:** Each job description is also saved as a separate text file in `data/job_descriptions/`

## Expected Input for Module 2 (Text Preprocessing)

Module 2 should:
1. Load `data/processed_resumes.csv` and `data/processed_job_descriptions.csv`
2. Use only rows where `is_valid == True` for resumes
3. Apply text preprocessing to `cleaned_resume` and `cleaned_description` columns
4. Output format should maintain the same structure with additional preprocessed columns

## Data Validation Guidelines

### Why Data Quality Matters:
- **Poor PDF Extraction:** Corrupted characters, formatting issues can reduce matching accuracy
- **Text Quality:** Low-quality text leads to poor feature extraction and inaccurate similarity scores
- **Consistency:** Standardized format ensures all modules work correctly

### Common PDF Extraction Challenges:
1. **Encoding Issues:** Special characters may not extract correctly
2. **Layout Problems:** Multi-column resumes can have scrambled text
3. **Images as Text:** Text in images cannot be extracted without OCR
4. **Tables:** Table formatting often gets lost in extraction

## File Locations

```
CVSense/
├── data/
│   ├── processed_resumes.csv          # Main resume dataset
│   ├── processed_job_descriptions.csv # Main job descriptions dataset
│   ├── resumes/                       # Individual resume files (if any)
│   └── job_descriptions/              # Individual job description files
└── module_1_data_ingestion/
    ├── data_ingestion.ipynb           # Main implementation notebook
    └── data_quality_report.json       # Quality metrics and statistics
```

## Contact

For questions about data format or quality issues, contact the Module 1 owner.
"""

# Save specification document (explicit UTF-8, since the spec contains non-ASCII characters)
spec_path = MODULE_DIR / 'DATA_FORMAT_SPECIFICATION.md'
with open(spec_path, 'w', encoding='utf-8') as f:
    f.write(data_spec)

print(f"✓ Data format specification saved to: {spec_path}")

✓ Data format specification saved to: /home/ammaar/CODE/CVSense/module_1_data_ingestion/DATA_FORMAT_SPECIFICATION.md
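
The hand-off to Module 2 described in the specification (load the processed CSV, keep only rows with `is_valid == True`) can be sketched as follows. Module 2 would more likely use `pd.read_csv`; the standard-library `csv` module is used here only to keep the sketch dependency-free. The function name `load_valid_resumes` and the demo rows are made up for illustration.

```python
import csv
import tempfile
from pathlib import Path

def load_valid_resumes(csv_path):
    """Read a processed_resumes.csv-style file and keep rows flagged as valid.

    pandas writes booleans as the strings 'True' / 'False', so the flag is
    compared as text here.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [row for row in csv.DictReader(f)
                if row.get("is_valid", "").strip() == "True"]

# Demonstration on a tiny stand-in file (made-up rows, not real pipeline output)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "processed_resumes.csv"
    with open(demo_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Category", "cleaned_resume", "is_valid"])
        writer.writerow(["HR", "HR manager with ten years of experience", "True"])
        writer.writerow(["HR", "???", "False"])
    valid_rows = load_valid_resumes(demo_path)

print(len(valid_rows))            # 1
print(valid_rows[0]["Category"])  # HR
```

Filtering at load time means downstream preprocessing never sees rows that failed Module 1 validation.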


## 12. Summary & Next Steps

In [14]:
print("\n" + "="*70)
print("MODULE 1: DATA INGESTION - COMPLETE")
print("="*70)

print("\n📊 DELIVERABLES:")
print(f"\n1. Resume Dataset:")
print(f"   └─ {resume_output_path}")
print(f"   └─ {len(df_resumes)} resumes ({df_resumes['is_valid'].sum() if 'is_valid' in df_resumes.columns else 0} valid)")

print(f"\n2. Job Descriptions Dataset:")
print(f"   └─ {job_output_path}")
print(f"   └─ {len(df_jobs)} job descriptions across {len(df_jobs['category'].unique())} categories")

print(f"\n3. Documentation:")
print(f"   └─ {report_path}")
print(f"   └─ {spec_path}")

print(f"\n4. Utilities:")
print(f"   └─ PDF extraction function (extract_text_from_pdf)")
print(f"   └─ Text cleaning function (clean_text)")
print(f"   └─ Validation function (validate_resume_text)")

print("\n🔄 READY FOR MODULE 2:")
print("   ✓ Data cleaned and validated")
print("   ✓ Consistent format established")
print("   ✓ Quality metrics documented")
print("   ✓ Specification provided for downstream modules")

print("\n📝 KEY POINTS TO EXPLAIN:")
print("   • Data quality directly impacts ML model performance")
print("   • PDF extraction challenges: encoding, layout, images")
print("   • Validation ensures only quality data proceeds to next modules")
print("   • Standardized format enables smooth pipeline integration")

print("\n" + "="*70)
print("✅ Module 1 implementation successful!")
print("="*70)


MODULE 1: DATA INGESTION - COMPLETE

📊 DELIVERABLES:

1. Resume Dataset:
   └─ /home/ammaar/CODE/CVSense/data/processed_resumes.csv
   └─ 100 resumes (100 valid)

2. Job Descriptions Dataset:
   └─ /home/ammaar/CODE/CVSense/data/processed_job_descriptions.csv
   └─ 10 job descriptions across 7 categories

3. Documentation:
   └─ /home/ammaar/CODE/CVSense/module_1_data_ingestion/data_quality_report.json
   └─ /home/ammaar/CODE/CVSense/module_1_data_ingestion/DATA_FORMAT_SPECIFICATION.md

4. Utilities:
   └─ PDF extraction function (extract_text_from_pdf)
   └─ Text cleaning function (clean_text)
   └─ Validation function (validate_resume_text)

🔄 READY FOR MODULE 2:
   ✓ Data cleaned and validated
   ✓ Consistent format established
   ✓ Quality metrics documented
   ✓ Specification provided for downstream modules

📝 KEY POINTS TO EXPLAIN:
   • Data quality directly impacts ML model performance
   ‚Ä¢ PDF extraction challenges: enco