# Task 2.2.1: Data Cleaning for Reference Matching

## Problem: Match BibTeX Entries to References.json
Identify which extracted BibTeX entry corresponds to which arXiv ID in references.json.

**Data Cleaning Strategy:**
- Create hierarchical text versions: original → cleaned → no_stopwords
- Normalize authors: Parse names, extract last names, create searchable strings
- Extract temporal data: Year from BibTeX and references
- Structure metadata: ArXiv IDs, entry types, cleaned title variants

**Output:** one `cleaned_data.json` per paper, containing every hierarchy level

### Import Libraries and Setup

In [None]:
import os
import re
import json
from pathlib import Path
from typing import Dict, List, Any, Set, Tuple, Optional
from collections import defaultdict
import logging
import unicodedata

# For text processing
import string
from difflib import SequenceMatcher

# Setup logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

### Configure Paths

In [None]:
# Project paths
current_dir = Path.cwd()

# If running from src/ directory, go up one level
if current_dir.name == 'src':
    BASE_DIR = current_dir.parent
else:
    # Otherwise assume we're already in the project root
    BASE_DIR = current_dir
PAPERS_DIR = BASE_DIR / "papers"
OUTPUT_DIR = BASE_DIR / "bibtex"
STUDENT_ID = "23127088"
PROCESSED_DIR = OUTPUT_DIR / STUDENT_ID

### Define DataLoader Class

In [None]:
class DataLoader:
    """Loads BibTeX entries and references.json data"""
    
    def __init__(self, papers_dir: Path, processed_dir: Path):
        self.papers_dir = papers_dir
        self.processed_dir = processed_dir
    
    def load_bibtex_from_file(self, bib_file: Path) -> Dict[str, Dict[str, str]]:
        """Parse a .bib file and extract entries"""
        entries = {}
        
        try:
            content = bib_file.read_text(encoding='utf-8', errors='ignore')
            
            # Pattern to match BibTeX entries
            entry_pattern = re.compile(
                r'@(\w+)\{([^,]+),([^@]+)\}',
                re.DOTALL
            )
            
            for match in entry_pattern.finditer(content):
                entry_type = match.group(1)
                entry_key = match.group(2).strip()
                entry_fields = match.group(3)
                
                # Parse fields
                fields = {'entry_type': entry_type}
                field_pattern = re.compile(r'(\w+)\s*=\s*\{([^}]+)\}')
                
                for field_match in field_pattern.finditer(entry_fields):
                    field_name = field_match.group(1)
                    field_value = field_match.group(2).strip()
                    fields[field_name] = field_value
                
                entries[entry_key] = fields
        
        except Exception as e:
            logger.error(f"Error loading BibTeX file {bib_file}: {e}")
        
        return entries
    
    def load_references_json(self, json_file: Path) -> List[Dict[str, Any]]:
        """Load references.json file"""
        try:
            with open(json_file, 'r', encoding='utf-8') as f:
                data = json.load(f)
                # references.json is a dict with arxiv_id as keys
                # Convert to list and add arxiv_id to each entry
                references = []
                for arxiv_id, ref_data in data.items():
                    ref_entry = ref_data.copy()
                    ref_entry['arxiv_id'] = arxiv_id
                    references.append(ref_entry)
                return references
        except Exception as e:
            logger.error(f"Error loading references JSON {json_file}: {e}")
            return []
    
    def load_paper_data(self, paper_id: str) -> Tuple[Dict, List]:
        """Load both BibTeX and references.json for a paper"""
        # Load processed BibTeX
        bib_file = self.processed_dir / paper_id / 'refs.bib'
        bibtex_entries = {}
        if bib_file.exists():
            bibtex_entries = self.load_bibtex_from_file(bib_file)
        
        # Load references.json
        ref_file = self.papers_dir / paper_id / 'references.json'
        references = []
        if ref_file.exists():
            references = self.load_references_json(ref_file)
        
        return bibtex_entries, references
    
    def get_all_papers(self) -> List[str]:
        """Get list of all processed papers"""
        papers = []
        if self.processed_dir.exists():
            papers = [d.name for d in self.processed_dir.iterdir() if d.is_dir()]
        return sorted(papers)
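The field pattern in `load_bibtex_from_file` stops at the first `}`, so a value with nested braces such as `{{GPU}-Accelerated Matching}` is cut short. This appears consistent with the stray leading `{` in the sample title printed later in this notebook. The following is a minimal sketch of a brace-depth scan that keeps nested braces intact. `parse_fields` is a hypothetical helper, not part of the notebook, and it assumes every field value is wrapped in braces rather than quotes.

```python
import re
from typing import Dict

# Matches the start of a field: name, '=', opening brace
FIELD_START = re.compile(r'(\w+)\s*=\s*\{')


def parse_fields(entry_body: str) -> Dict[str, str]:
    """Extract `name = {value}` fields while tracking brace depth (hypothetical helper)."""
    fields: Dict[str, str] = {}
    pos = 0
    while True:
        match = FIELD_START.search(entry_body, pos)
        if match is None:
            break
        start = match.end()          # first character after the opening brace
        depth, end = 1, start
        while end < len(entry_body) and depth > 0:
            if entry_body[end] == '{':
                depth += 1
            elif entry_body[end] == '}':
                depth -= 1
            end += 1
        if depth != 0:
            break                    # unbalanced braces: stop rather than guess
        fields[match.group(1)] = entry_body[start:end - 1].strip()
        pos = end                    # resume scanning after the closing brace
    return fields


# Made-up entry body used only for illustration
sample_body = """
  title = {{GPU}-Accelerated Matching},
  author = {Smith, J. and Jones, K.},
  year = {2020}
"""
sample_fields = parse_fields(sample_body)
```

Because the scan resumes after each closing brace, text inside a value is never mistaken for the start of a new field.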

### Define TextCleaner Class

In [None]:
# Common English stop words (subset - focusing on words that don't help matching)
STOP_WORDS = set([
    'a', 'an', 'and', 'are', 'as', 'at', 'be', 'by', 'for', 'from', 'has', 'he',
    'in', 'is', 'it', 'its', 'of', 'on', 'that', 'the', 'to', 'was', 'will', 'with'
])

class TextCleaner:
    """Cleans and normalizes text for matching"""
    
    @staticmethod
    def normalize_unicode(text: str) -> str:
        """Convert Unicode to ASCII (handle accented characters)"""
        # Normalize to NFKD form and encode to ASCII
        normalized = unicodedata.normalize('NFKD', text)
        ascii_text = normalized.encode('ascii', 'ignore').decode('ascii')
        return ascii_text
    
    @staticmethod
    def remove_latex_commands(text: str) -> str:
        """Remove remaining LaTeX commands from text"""
        # Remove \command{content} but keep content
        text = re.sub(r'\\\w+\{([^}]*)\}', r'\1', text)
        # Remove standalone commands
        text = re.sub(r'\\\w+', '', text)
        # Remove curly braces
        text = re.sub(r'[{}]', '', text)
        return text
    
    @staticmethod
    def clean_title(title: str, remove_stopwords: bool = True) -> str:
        """Clean and normalize title text"""
        if not title:
            return ""
        
        # Remove LaTeX commands
        title = TextCleaner.remove_latex_commands(title)
        
        # Normalize Unicode
        title = TextCleaner.normalize_unicode(title)
        
        # Lowercase
        title = title.lower()
        
        # Remove punctuation except hyphens (important for compound words)
        title = re.sub(r'[^\w\s-]', ' ', title)
        
        # Normalize whitespace
        title = re.sub(r'\s+', ' ', title).strip()
        
        # Remove stop words if requested
        if remove_stopwords:
            words = title.split()
            words = [w for w in words if w not in STOP_WORDS]
            title = ' '.join(words)
        
        return title
    
    @staticmethod
    def normalize_author_name(author: str) -> str:
        """Normalize author name to standard format"""
        if not author:
            return ""
        
        # Normalize Unicode
        author = TextCleaner.normalize_unicode(author)
        
        # Lowercase
        author = author.lower()

        # Remove the BibTeX author separator "and"
        author = re.sub(r'\band\b', ' ', author)
        
        # Remove periods and commas
        author = re.sub(r'[.,]', ' ', author)
        
        # Normalize whitespace
        author = re.sub(r'\s+', ' ', author).strip()
        
        return author
    
    @staticmethod
    def extract_author_last_names(author_string: str) -> List[str]:
        """Extract last names from a BibTeX-style author string ("A and B and C")"""
        if not author_string:
            return []
        
        # Split on the BibTeX separator 'and' BEFORE normalizing:
        # normalize_author_name() strips both 'and' and commas, which would
        # otherwise collapse every author into a single name.
        authors = re.split(r'\s+and\s+', author_string.strip(), flags=re.IGNORECASE)
        
        last_names = []
        for author in authors:
            author = author.strip()
            if not author:
                continue
            
            # "Last, First": the last name is the part before the comma.
            # "First Last": there is no comma, so the whole name is kept.
            name_part = author.split(',', 1)[0]
            words = TextCleaner.normalize_author_name(name_part).split()
            if words:
                # Use the final word so that "van der Berg, Jan" and
                # "Jan van der Berg" both yield "berg"
                last_names.append(words[-1])
        
        return last_names
    
    @staticmethod
    def extract_year(text: str) -> Optional[int]:
        """Extract year from text"""
        if not text:
            return None
        
        # Look for 4-digit year
        year_match = re.search(r'\b(19|20)\d{2}\b', str(text))
        if year_match:
            return int(year_match.group(0))
        return None
    
    @staticmethod
    def clean_arxiv_id(arxiv_id: str) -> str:
        """Normalize arXiv ID format"""
        if not arxiv_id:
            return ""
        
        # Remove 'arXiv:' prefix if present
        arxiv_id = re.sub(r'^arxiv:\s*', '', arxiv_id, flags=re.IGNORECASE)
        
        # Normalize whitespace
        arxiv_id = arxiv_id.strip()
        
        return arxiv_id

### Define ReferenceDataCleaner Class

Hierarchical text cleaning pipeline:
1. **Original**: Raw text as-is
2. **Cleaned**: Strip LaTeX commands, normalize Unicode to ASCII, lowercase, remove punctuation (hyphens are kept)
3. **No stopwords**: Remove common words (the, a, an, etc.)

Also handles author normalization and year extraction.
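As a rough illustration, the sketch below applies the three levels to a single title. `title_levels` is a hypothetical helper and its stop-word set is an abbreviated stand-in for `STOP_WORDS`; the notebook's own implementation is `TextCleaner.clean_title`, which additionally strips LaTeX commands.

```python
import re
import unicodedata
from typing import Dict

# Abbreviated stop-word list, for illustration only
DEMO_STOP_WORDS = {"a", "an", "and", "of", "the", "with"}


def title_levels(title: str) -> Dict[str, str]:
    """Return the original, cleaned, and no-stopword variants of a title."""
    # Level 2: fold accents to ASCII, lowercase, drop punctuation except hyphens
    ascii_title = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    cleaned = re.sub(r"[^\w\s-]", " ", ascii_title.lower())
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    # Level 3: remove stop words from the cleaned variant
    no_stopwords = " ".join(w for w in cleaned.split() if w not in DEMO_STOP_WORDS)
    return {"original": title, "cleaned": cleaned, "no_stopwords": no_stopwords}


levels = title_levels("The Art of Programming")
# levels["cleaned"]      -> "the art of programming"
# levels["no_stopwords"] -> "art programming"
```

Keeping all three variants lets a later matching step fall back from a strict comparison to a looser one without re-running the cleaning.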

In [None]:
class ReferenceDataCleaner:
    """Main processor for cleaning reference data"""
    
    def __init__(self):
        self.cleaner = TextCleaner()
    
    def clean_bibtex_entry(self, key: str, entry: Dict[str, str]) -> Dict[str, Any]:
        """Clean a single BibTeX entry"""
        cleaned = {
            'original_key': key,
            'entry_type': entry.get('entry_type', 'unknown'),
        }
        
        # Clean title
        if 'title' in entry:
            cleaned['title_original'] = entry['title']
            cleaned['title_cleaned'] = self.cleaner.clean_title(entry['title'], remove_stopwords=False)
            cleaned['title_no_stopwords'] = self.cleaner.clean_title(entry['title'], remove_stopwords=True)
        
        if 'author' in entry:
            cleaned['author_original'] = entry['author'] 
            cleaned['author_normalized'] = self.cleaner.normalize_author_name(entry['author'])
            cleaned['author_last_names'] = self.cleaner.extract_author_last_names(entry['author'])
        
        # Extract year
        year = None
        if 'year' in entry:
            year = self.cleaner.extract_year(entry['year'])
        if not year and 'note' in entry:
            year = self.cleaner.extract_year(entry['note'])
        cleaned['year'] = year
        
        # Keep other useful fields for matching
        for field in ['journal', 'booktitle', 'volume', 'pages']:
            if field in entry:
                cleaned[field] = entry[field]
        
        return cleaned
    
    def clean_reference_entry(self, ref: Dict[str, Any]) -> Dict[str, Any]:
        """Clean a reference from references.json"""
        cleaned = {}
        
        # ArXiv ID (primary key)
        if 'arxiv_id' in ref:
            cleaned['arxiv_id'] = ref['arxiv_id']  # Already extracted from dict key
        
        # Clean title (field is 'paper_title' in references.json)
        if 'paper_title' in ref:
            cleaned['title_original'] = ref['paper_title']
            cleaned['title_cleaned'] = self.cleaner.clean_title(ref['paper_title'], remove_stopwords=False)
            cleaned['title_no_stopwords'] = self.cleaner.clean_title(ref['paper_title'], remove_stopwords=True)
        
        # Clean authors
        if 'authors' in ref and isinstance(ref['authors'], list):
            cleaned['authors_original'] = ref['authors']
            # Join authors with the BibTeX separator ' and ' so both sources share one format
            author_string = ' and '.join(ref['authors'])
            cleaned['author_normalized'] = self.cleaner.normalize_author_name(author_string)
            cleaned['author_last_names'] = self.cleaner.extract_author_last_names(author_string)
        
        # Extract year from submission_date
        year = None
        if 'submission_date' in ref:
            year = self.cleaner.extract_year(ref['submission_date'])
        cleaned['year'] = year
        
        # Keep semantic scholar ID if present
        if 'semantic_scholar_id' in ref:
            cleaned['semantic_scholar_id'] = ref['semantic_scholar_id']
        
        return cleaned
    
    def process_paper(self, paper_id: str, bibtex_entries: Dict, references: List) -> Dict:
        """Process all data for a single paper"""
        result = {
            'paper_id': paper_id,
            'bibtex_cleaned': [],
            'references_cleaned': [],
            'statistics': {}
        }
        
        # Clean BibTeX entries
        for key, entry in bibtex_entries.items():
            cleaned = self.clean_bibtex_entry(key, entry)
            result['bibtex_cleaned'].append(cleaned)
        
        # Clean references
        for ref in references:
            cleaned = self.clean_reference_entry(ref)
            result['references_cleaned'].append(cleaned)
        
        # Statistics
        result['statistics'] = {
            'num_bibtex': len(result['bibtex_cleaned']),
            'num_references': len(result['references_cleaned']),
            'bibtex_with_title': sum(1 for b in result['bibtex_cleaned'] if b.get('title_cleaned')),
            'bibtex_with_author': sum(1 for b in result['bibtex_cleaned'] if b.get('author_normalized')),
            'references_with_title': sum(1 for r in result['references_cleaned'] if r.get('title_cleaned')),
            'references_with_author': sum(1 for r in result['references_cleaned'] if r.get('author_normalized')),
        }
        
        return result

In [None]:
# Initialize
loader = DataLoader(PAPERS_DIR, PROCESSED_DIR)
cleaner = ReferenceDataCleaner()

# Get available papers
all_papers = loader.get_all_papers()
print(f"Total papers available: {len(all_papers)}")
print(f"First 10 papers: {all_papers[:10]}")

Total papers available: 3180
First 10 papers: ['2312-15844', '2312-15845', '2312-15846', '2312-15847', '2312-15848', '2312-15851', '2312-15853', '2312-15855', '2312-15856', '2312-15857']


### Test Processing on Sample Paper

In [None]:
# Process first paper as example
if all_papers:
    test_paper_id = all_papers[0]
    print(f"\nProcessing test paper: {test_paper_id}")
    
    # Load data
    bibtex_entries, references = loader.load_paper_data(test_paper_id)
    
    print(f"\nLoaded:")
    print(f"  - BibTeX entries: {len(bibtex_entries)}")
    print(f"  - References: {len(references)}")
    
    # Clean data
    cleaned_data = cleaner.process_paper(test_paper_id, bibtex_entries, references)
    
    print(f"\nCleaning Statistics:")
    for key, value in cleaned_data['statistics'].items():
        print(f"  - {key}: {value}")


Processing test paper: 2312-15844

Loaded:
  - BibTeX entries: 49
  - References: 32

Cleaning Statistics:
  - num_bibtex: 49
  - num_references: 32
  - bibtex_with_title: 49
  - bibtex_with_author: 48
  - references_with_title: 32
  - references_with_author: 32


### Examine Sample BibTeX Entry Hierarchy

In [None]:
# Show example BibTeX cleaning
if cleaned_data['bibtex_cleaned']:
    print("=" * 80)
    print("EXAMPLE: BibTeX Entry Cleaning")
    print("=" * 80)
    
    example = cleaned_data['bibtex_cleaned'][0]
    
    print(f"\nOriginal Key: {example.get('original_key')}")
    print(f"Entry Type: {example.get('entry_type')}")
    
    if 'title_original' in example:
        print(f"\nTitle (Original):")
        print(f"  {example['title_original']}")
        print(f"\nTitle (Cleaned):")
        print(f"  {example['title_cleaned']}")
        print(f"\nTitle (No Stopwords):")
        print(f"  {example['title_no_stopwords']}")
    
    if 'author_original' in example:
        print(f"\nAuthor (Original):")
        print(f"  {example['author_original']}")
        print(f"\nAuthor (Normalized):")
        print(f"  {example['author_normalized']}")
        print(f"\nAuthor Last Names:")
        print(f"  {example['author_last_names']}")
    
    if example.get('year'):
        print(f"\nYear Extracted: {example['year']}")
    
    if example.get('arxiv_id'):
        print(f"ArXiv ID Found: {example['arxiv_id']}")

EXAMPLE: BibTeX Entry Cleaning

Original Key: ishiguro2021realisation
Entry Type: article

Title (Original):
  {The Realisation of an Avatar-Symbiotic Society where Everyone can Perform Active Roles without Constraint

Title (Cleaned):
  the realisation of an avatar-symbiotic society where everyone can perform active roles without constraint

Title (No Stopwords):
  realisation avatar-symbiotic society where everyone can perform active roles without constraint

Author (Original):
  Ishiguro, Hiroshi

Author (Normalized):
  ishiguro hiroshi

Author Last Names:
  ['ishiguro']

Year Extracted: 2021


### Examine Sample Reference Hierarchy

In [None]:
# Show example reference cleaning
if cleaned_data['references_cleaned']:
    print("\n" + "=" * 80)
    print("EXAMPLE: Reference Entry Cleaning")
    print("=" * 80)
    
    example = cleaned_data['references_cleaned'][0]
    
    print(f"\nArXiv ID: {example.get('arxiv_id')}")
    
    if 'title_original' in example:
        print(f"\nTitle (Original):")
        print(f"  {example['title_original']}")
        print(f"\nTitle (Cleaned):")
        print(f"  {example['title_cleaned']}")
        print(f"\nTitle (No Stopwords):")
        print(f"  {example['title_no_stopwords']}")
    
    if 'authors_original' in example:
        print(f"\nAuthors (Original):")
        print(f"  {example['authors_original']}")
        print(f"\nAuthor (Normalized):")
        print(f"  {example['author_normalized']}")
        print(f"\nAuthor Last Names:")
        print(f"  {example['author_last_names']}")
    
    if example.get('year'):
        print(f"\nYear Extracted: {example['year']}")


EXAMPLE: Reference Entry Cleaning

ArXiv ID: 2309-05551

Title (Original):
  OpenFashionCLIP: Vision-and-Language Contrastive Learning with Open-Source Fashion Data

Title (Cleaned):
  openfashionclip vision-and-language contrastive learning with open-source fashion data

Title (No Stopwords):
  openfashionclip vision-and-language contrastive learning open-source fashion data

Authors (Original):
  ['Giuseppe Cartella', 'Alberto Baldrati', 'Davide Morelli', 'Marcella Cornia', 'Marco Bertini', 'R. Cucchiara']

Author (Normalized):
  giuseppe cartella alberto baldrati davide morelli marcella cornia marco bertini r cucchiara

Author Last Names:
  ['giuseppe']

Year Extracted: 2023


In [None]:
def process_paper(paper_id: str) -> Optional[Dict[str, Any]]:
    """Process a single paper, save its cleaned data, and return it (None on failure)"""
    try:
        # Load data
        bibtex_entries, references = loader.load_paper_data(paper_id)
        
        # Clean data
        cleaned_data = cleaner.process_paper(paper_id, bibtex_entries, references)
        
        # Save individual paper data in the bibtex/{STUDENT_ID}/{paper_id}/ folder
        output_folder = PROCESSED_DIR / paper_id
        output_folder.mkdir(parents=True, exist_ok=True)
        output_file = output_folder / "cleaned_data.json"
        
        with open(output_file, 'w', encoding='utf-8') as f:
            json.dump(cleaned_data, f, indent=2, ensure_ascii=False)
        
        print(f"✓ ({cleaned_data['statistics']['num_bibtex']} BibTeX, {cleaned_data['statistics']['num_references']} refs)")
        return cleaned_data
    
    except Exception as e:
        print(f"✗ Error: {e}")
        logger.error(f"Error processing {paper_id}: {e}")
        return None


def process_all_papers(limit: Optional[int] = None):
    """Process all papers and save cleaned data"""
    # Slicing with limit=None keeps every paper; limit=0 yields an empty run
    papers_to_process = all_papers[:limit]
    
    results = []
    total = len(papers_to_process)
    
    print(f"Processing {total} papers...\n")
    
    for i, paper_id in enumerate(papers_to_process, 1):
        print(f"[{i}/{total}] Processing {paper_id}...", end=" ")
        
        try:
            # Load data
            bibtex_entries, references = loader.load_paper_data(paper_id)
            
            # Clean data
            cleaned_data = cleaner.process_paper(paper_id, bibtex_entries, references)
            results.append(cleaned_data)
            
            # Save individual paper data in the bibtex/{STUDENT_ID}/{paper_id}/ folder
            output_folder = PROCESSED_DIR / paper_id
            output_folder.mkdir(parents=True, exist_ok=True)
            output_file = output_folder / "cleaned_data.json"
            
            with open(output_file, 'w', encoding='utf-8') as f:
                json.dump(cleaned_data, f, indent=2, ensure_ascii=False)
            
            print(f"✓ ({cleaned_data['statistics']['num_bibtex']} BibTeX, {cleaned_data['statistics']['num_references']} refs)")
        
        except Exception as e:
            print(f"✗ Error: {e}")
            logger.error(f"Error processing {paper_id}: {e}")
    
    summary = {
        'total_papers': len(results),
        'total_bibtex': sum(r['statistics']['num_bibtex'] for r in results),
        'total_references': sum(r['statistics']['num_references'] for r in results),
        'papers': [r['paper_id'] for r in results]
    }
    
    print(f"\n{'='*80}")
    print(f"Processing Complete!")
    print(f"  - Papers processed: {summary['total_papers']}")
    print(f"  - Total BibTeX entries: {summary['total_bibtex']}")
    print(f"  - Total references: {summary['total_references']}")
    print(f"  - Output directory: {PROCESSED_DIR}")
    
    return results

In [None]:
# Process individual paper
# results = process_paper('2312-15844')

### Process All Papers

In [None]:
# Re-process papers with the fixed cleaning (limited to the first 500 here)
all_results = process_all_papers(limit=500)

Processing 500 papers...

[1/500] Processing 2312-15844... ✓ (49 BibTeX, 32 refs)
[2/500] Processing 2312-15845... ✓ (29 BibTeX, 18 refs)
[3/500] Processing 2312-15846... ✓ (72 BibTeX, 59 refs)
[4/500] Processing 2312-15847... ✓ (303 BibTeX, 25 refs)
[5/500] Processing 2312-15848... ✓ (190 BibTeX, 47 refs)
[6/500] Processing 2312-15851... ✓ (154 BibTeX, 29 refs)
[7/500] Processing 2312-15853... ✓ (228 BibTeX, 20 refs)
[8/500] Processing 2312-15855... ✓ (298 BibTeX, 34 refs)
[9/500] Processing 2312-15856... ✓ (192 BibTeX, 68 refs)
[10/500] Processing 2312-15857... ✓ (23 BibTeX, 6 refs)
[11/500] Processing 2312-15858... ✓ (234 BibTeX, 22 refs)
[12/500] Processing 2312-15861... ✓ (17 BibTeX, 10 refs)
[13/500] Processing 2312-15863... ✓ (109 BibTeX, 30 refs)
[14/500] Processing 2312-15864... ✓ (73 BibTeX, 4 refs)
[15/500] Processing 2312-15867... ✓ (31 BibTeX, 20 refs)
[16/500] Processing 2312-15868... ✓ (60 BibTeX, 0 refs)
[17/500] Processing 2312-15869... ✓ (34 BibTeX, 19 refs)
[18/500] ...
... ✓ (84442 BibTeX, 27 refs)
[40/500] Processing 2312-15902... ✓ (119 BibTeX, 28 refs)
[41/500] Processing 2312-15903... ✓ (34 BibTeX, 20 refs)
[42/500] Processing 2312-15904... ✓ (1155 BibTeX, 13 refs)
[43/500] Processing 2312-15906... ✓ (54 BibTeX, 30 refs)
[44/500] Processing 2312-15907... ✓ (78023 BibTeX, 42 refs)
[45/500] Processing 2312-15908... ✓ (100 BibTeX, 18 refs)
[46/500] Processing 2312-15909... ✓ (36 BibTeX, 21 refs)
[47/500] Processing 2312-15910... ✓ (73 BibTeX, 26 refs)
[48/500] Processing 2312-15911... ✓ (132 BibTeX, 28 refs)
[49/500] Processing 2312-15912... ✓ (6 BibTeX, 0 refs)
[50/500] Processing 2312-15914... ✓ (12 BibTeX, 2 refs)
[51/500] Processing 2312-15915... ✓ (91 BibTeX, 76 refs)
[52/500] Processing 2312-15916... ✓ (37 BibTeX, 23 refs)
[53/500] Processing 2312-15918... ✓ (218 BibTeX, 56 refs)
[54/500] Processing 2312-15921... ✓ (159 BibTeX, 39 refs)
[55/500] Processing 2312-15922... ✓ (29 BibTeX, 20 refs)
[56/500] Processing 2312-15923... ✓ (119 BibTeX, 22 refs)
...

## Summary

**Task 2.2.1 - Data Cleaning Pipeline Complete**:
1. Loaded BibTeX entries and references.json data
2. Applied comprehensive text normalization
3. Standardized titles, authors, and metadata
4. Extracted key matching features (year, arXiv ID, last names)
5. Saved cleaned data in `bibtex/23127088/{paper_id}/cleaned_data.json` for each paper

### Cleaning Justifications:

| Cleaning Step | Reason | Example |
|---------------|--------|----------|
| Lowercasing | Case-insensitive matching | "Deep Learning" → "deep learning" |
| Unicode normalization | Handle accents/special chars | "Müller" → "Muller" |
| Punctuation removal | Standardize format | "Smith, J." → "smith j" |
| Stop word removal | Focus on meaningful words | "The Art of Programming" → "art programming" |
| Author name extraction | Match by last name | "Smith, J. and Jones, K." → ["smith", "jones"] |
| Year extraction | Temporal matching | "Published 2020" → 2020 |
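
The last two rows can be sketched in isolation. `last_names` and `first_year` are hypothetical helpers, not the notebook's `TextCleaner` methods, and the sketch assumes authors are separated by ` and `, as in a BibTeX `author` field.

```python
import re
from typing import List, Optional


def last_names(author_field: str) -> List[str]:
    """Return one lowercase last name per author in a BibTeX-style author field."""
    names: List[str] = []
    for author in re.split(r"\s+and\s+", author_field.strip()):
        # "Last, First" keeps the part before the comma;
        # "First Last" has no comma, so the whole name is kept
        name_part = author.split(",", 1)[0]
        words = name_part.replace(".", " ").lower().split()
        if words:
            names.append(words[-1])  # final word is treated as the last name
    return names


def first_year(text: str) -> Optional[int]:
    """Return the first four-digit year in 1900-2099 found in the text, if any."""
    match = re.search(r"\b(?:19|20)\d{2}\b", text)
    return int(match.group(0)) if match else None


authors = last_names("Smith, J. and Jones, K.")
year = first_year("Published 2020")
```

Both name orders reduce to the same key, so "Jones, K." in a `.bib` file and "K. Jones" in `references.json` can be compared directly.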

### Output Structure:

Each paper's cleaned data is saved as:
```
bibtex/
  23127088/
    2312-15857/
      refs.bib
      cleaned_data.json  ← New cleaned data file
    2312-15858/
      refs.bib
      cleaned_data.json
    ...
```
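
A later notebook could read one of these files back roughly as follows. `load_cleaned` and `summarize` are hypothetical helpers; the top-level keys (`paper_id`, `bibtex_cleaned`, `references_cleaned`) are the ones built by `ReferenceDataCleaner.process_paper`.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_cleaned(processed_dir: Path, paper_id: str) -> Dict[str, Any]:
    """Load <processed_dir>/<paper_id>/cleaned_data.json (hypothetical helper)."""
    path = processed_dir / paper_id / "cleaned_data.json"
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def summarize(cleaned: Dict[str, Any]) -> Dict[str, Any]:
    """Report how many cleaned entries each side of the match contains."""
    return {
        "paper_id": cleaned["paper_id"],
        "num_bibtex": len(cleaned["bibtex_cleaned"]),
        "num_references": len(cleaned["references_cleaned"]),
    }


# Example call, assuming the directory layout shown above:
# summary = summarize(load_cleaned(Path("bibtex") / "23127088", "2312-15857"))
```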

### Next Steps:

Continue to **Task 2.2.2 - Data Labelling** in the next notebook ([2_2_2_Data_Labelling.ipynb](2_2_2_Data_Labelling.ipynb))