# Data Collection and Preprocessing for Graduation Project

## 📚 Learning Objectives

By completing this notebook, you will:
- Collect and acquire datasets for your graduation project
- Perform comprehensive data cleaning and preprocessing
- Implement feature engineering techniques
- Validate data quality and create train/validation/test splits
- Document data collection and preprocessing procedures

## 🔗 Prerequisites

- ✅ Unit 1: Project proposal completed
- ✅ Understanding of data science pipelines
- ✅ Python, Pandas, NumPy knowledge

---

## Official Structure Reference

This notebook covers practical activities from **Course 12, Unit 2**:
- Collecting and acquiring datasets
- Performing data cleaning and preprocessing using Python libraries
- Implementing feature engineering techniques
- Validating data quality and preparing train/validation/test splits
- Creating data exploration notebooks with visualizations
- **Source:** `DETAILED_UNIT_DESCRIPTIONS.md` - Unit 2 Practical Content

---

## Introduction

**Data Collection and Preparation** is critical for project success. Quality data leads to quality models.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

print("✅ Libraries imported!")
print("Ready for data collection and preprocessing!")


✅ Libraries imported!
Ready for data collection and preprocessing!


## Part 1: Data Collection Strategies

This part surveys common data sources and the trade-offs of each collection method.


In [2]:
print("=" * 60)
print("Data Collection Strategies")
print("=" * 60)

data_sources = {
    "Public Datasets": {
        "sources": [
            "Kaggle Datasets", "UCI Machine Learning Repository",
            "Google Dataset Search",
            "Papers with Code Datasets",
            "Hugging Face Datasets"
        ],
        "advantages": "Ready to use, often cleaned, documented",
        "considerations": "Check licensing, ensure relevance to your problem"
    },
    "APIs": {
        "sources": [
            "Twitter API (for social media analysis)",
            "Reddit API",
            "News APIs",
            "Government data APIs"
        ],
        "advantages": "Real-time data, structured format",
        "considerations": "API limits, authentication, rate limiting"
    },
    "Web Scraping": {
        "sources": [
            "BeautifulSoup + requests",
            "Scrapy framework",
            "Selenium for dynamic content"
        ],
        "advantages": "Access to large amounts of data",
        "considerations": "Legal and ethical considerations, robots.txt, ToS"
    },
    "Custom Collection": {
        "sources": [
            "Surveys",
            "Experiments",
            "Sensors/IoT devices"
        ],
        "advantages": "Tailored to your specific needs",
        "considerations": "Time-consuming, requires IRB approval for human subjects"
    }
}

for source_type, details in data_sources.items():
    print(f"\n{source_type}:")
    print(f"  Sources: {', '.join(details['sources'][:3])}...")
    print(f"  Advantages: {details['advantages']}")
    print(f"  Considerations: {details['considerations']}")

print("\n✅ Choose data sources aligned with your project goals and ethical guidelines!")


Data Collection Strategies

Public Datasets:
  Sources: Kaggle Datasets, UCI Machine Learning Repository, Google Dataset Search...
  Advantages: Ready to use, often cleaned, documented
  Considerations: Check licensing, ensure relevance to your problem

APIs:
  Sources: Twitter API (for social media analysis), Reddit API, News APIs...
  Advantages: Real-time data, structured format
  Considerations: API limits, authentication, rate limiting

Web Scraping:
  Sources: BeautifulSoup + requests, Scrapy framework, Selenium for dynamic content...
  Advantages: Access to large amounts of data
  Considerations: Legal and ethical considerations, robots.txt, ToS

Custom Collection:
  Sources: Surveys, Experiments, Sensors/IoT devices...
  Advantages: Tailored to your specific needs
  Considerations: Time-consuming, requires IRB approval for human subjects

✅ Choose data sources aligned with your project goals and ethical guidelines!
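
The web scraping considerations above mention `robots.txt`. As a minimal sketch, the standard-library `urllib.robotparser` module can check whether a URL may be fetched before any request is sent. The `robots.txt` content, the user agent name `my-research-bot`, and the `example.com` URLs below are hypothetical placeholders, and passing this check does not replace reading a site's terms of service.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one is served at <site>/robots.txt
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_text):
    """Parse robots.txt rules from a string (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
user_agent = "my-research-bot"  # hypothetical crawler name

allowed_public = parser.can_fetch(user_agent, "https://example.com/data/page1.html")
allowed_private = parser.can_fetch(user_agent, "https://example.com/private/users.html")
delay = parser.crawl_delay(user_agent)  # seconds to wait between requests, if declared

print(f"Public page allowed:  {allowed_public}")
print(f"Private page allowed: {allowed_private}")
print(f"Crawl delay: {delay}")
```

Against a live site, `parser.set_url(...)` followed by `parser.read()` would load the real rules instead of the inline string.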


## Part 2: Data Cleaning and Preprocessing Pipeline

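This part walks through a cleaning workflow on synthetic data: inspect the data, handle missing values, encode categorical variables, and scale numeric features.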


In [3]:
# Generate sample data for demonstration
np.random.seed(42)
sample_data = pd.DataFrame({
    'feature1': np.random.randn(1000),
    'feature2': np.random.choice(['A', 'B', 'C'], 1000),
    'feature3': np.random.randn(1000) + np.random.choice([0, np.nan], 1000, p=[0.95, 0.05]),
    'target': np.random.choice([0, 1], 1000)
})

print("=" * 60)
print("Data Cleaning Pipeline")
print("=" * 60)

print("\n1. Initial Data Inspection:")
print(f"   Shape: {sample_data.shape}")
print(f"   Missing values:\n{sample_data.isnull().sum()}")

# Handle missing values
print("\n2. Handling Missing Values:")
# Option 1: Drop rows with missing values
data_dropped = sample_data.dropna()
print(f"   After dropping: {data_dropped.shape}")

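# Option 2: Fill missing values (mean for numeric columns; use the mode for categorical ones)
data_filled = sample_data.copy()
# Assign the result back; a chained fillna(..., inplace=True) acts on a temporary copy
data_filled['feature3'] = data_filled['feature3'].fillna(data_filled['feature3'].mean())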
print(f"   After filling: Missing values = {data_filled.isnull().sum().sum()}")

# Handle categorical variables
print("\n3. Encoding Categorical Variables:")
le = LabelEncoder()
data_encoded = data_filled.copy()
data_encoded['feature2_encoded'] = le.fit_transform(data_encoded['feature2'])
print(f"   Original categories: {data_filled['feature2'].unique()}")
print(f"   Encoded values: {data_encoded['feature2_encoded'].unique()}")

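# Feature scaling
# NOTE: the scaler is fit on the full dataset here for demonstration only.
# In a real project, fit it on the training split to avoid data leakage.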
print("\n4. Feature Scaling:")
scaler = StandardScaler()
numeric_features = ['feature1', 'feature3']
data_scaled = data_encoded.copy()
data_scaled[numeric_features] = scaler.fit_transform(data_scaled[numeric_features])
print(f"   Features scaled: {numeric_features}")
print(f"   Mean after scaling: {data_scaled[numeric_features].mean().round(4).tolist()}")
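# pandas .std() defaults to ddof=1 while StandardScaler uses ddof=0,
# so the printed values are slightly above 1
print(f"   Std after scaling: {data_scaled[numeric_features].std().round(4).tolist()}")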

print("\n✅ Data cleaning pipeline complete!")


Data Cleaning Pipeline

1. Initial Data Inspection:
   Shape: (1000, 4)
   Missing values:
feature1     0
feature2     0
feature3    43
target       0
dtype: int64

2. Handling Missing Values:
   After dropping: (957, 4)
   After filling: Missing values = 0

3. Encoding Categorical Variables:
   Original categories: ['C' 'A' 'B']
   Encoded values: [2 0 1]

4. Feature Scaling:
   Features scaled: ['feature1', 'feature3']
   Mean after scaling: [0.0, 0.0]
   Std after scaling: [1.0005, 1.0005]

✅ Data cleaning pipeline complete!


Note: pandas emits a `FutureWarning` for the chained pattern `df[col].fillna(value, inplace=True)`. The intermediate object `df[col]` always behaves as a copy, so from pandas 3.0 this in-place call will no longer modify the original DataFrame. Use `df[col] = df[col].fillna(value)` or `df.fillna({col: value}, inplace=True)` instead.
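
The encoding step above maps the categories `A`, `B`, `C` to the integers 0, 1, 2. Some models read those integers as an ordering, and scikit-learn documents `LabelEncoder` as intended for target labels rather than input features. For a nominal feature, one-hot encoding is a common alternative. The sketch below uses `pd.get_dummies` on a small hypothetical frame whose `feature2` column mirrors the one in this notebook.

```python
import pandas as pd

# Hypothetical toy frame; 'feature2' mirrors the nominal column used above
df = pd.DataFrame({"feature2": ["C", "A", "B", "A"]})

# One indicator column per category, so no artificial ordering is introduced
one_hot = pd.get_dummies(df, columns=["feature2"], prefix="feature2", dtype=int)

print(one_hot)
```

Each row has exactly one indicator set to 1. For linear models, passing `drop_first=True` removes one redundant column.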


## Part 3: Train/Validation/Test Split

Split the data into training, validation, and test sets so that model fitting, hyperparameter tuning, and final evaluation each use separate samples.


In [4]:
# Prepare features and target
X = data_scaled[['feature1', 'feature3', 'feature2_encoded']]
y = data_scaled['target']

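# Split: 60% train, 20% validation, 20% test
# Step 1: hold out 20% of all samples as the test set
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# Step 2: take 25% of the remaining 80% as validation (0.25 * 0.80 = 0.20 of the total)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp)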

print("=" * 60)
print("Data Splitting Strategy")
print("=" * 60)
print(f"Total samples: {len(X)}")
print(f"Training set: {len(X_train)} ({len(X_train)/len(X)*100:.1f}%)")
print(f"Validation set: {len(X_val)} ({len(X_val)/len(X)*100:.1f}%)")
print(f"Test set: {len(X_test)} ({len(X_test)/len(X)*100:.1f}%)")

print("\n✅ Data split complete!")
print("📝 Use validation set for hyperparameter tuning")
print("📝 Use test set ONLY for final evaluation")


Data Splitting Strategy
Total samples: 1000
Training set: 600 (60.0%)
Validation set: 200 (20.0%)
Test set: 200 (20.0%)

✅ Data split complete!
📝 Use validation set for hyperparameter tuning
📝 Use test set ONLY for final evaluation
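
In the cells above, missing values are imputed and features are scaled on the full dataset before it is split, so statistics from the validation and test rows influence the training features. One way to avoid this leakage is to split first and fit the preprocessing steps on the training data only. The sketch below assumes a small synthetic dataset and uses a scikit-learn pipeline; the column names and sizes are illustrative, not taken from a real project.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "feature1": rng.normal(size=200),
    "feature3": rng.normal(size=200),
})
X.loc[X.index[::20], "feature3"] = np.nan  # inject a few missing values
y = pd.Series(rng.integers(0, 2, size=200), name="target")

# 1. Split BEFORE any statistics are computed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 2. Fit imputer and scaler on the training rows only
preprocess = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
X_train_prep = preprocess.fit_transform(X_train)

# 3. Reuse the training statistics for the held-out rows
X_test_prep = preprocess.transform(X_test)

print(X_train_prep.shape, X_test_prep.shape)
```

The same fitted `preprocess` object would also be applied to a validation split with `transform`, never with `fit_transform`.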


## Summary

### Key Steps in Data Collection & Preparation:
1. **Data Collection**: Identify and acquire relevant datasets
2. **Data Inspection**: Understand data structure, types, distributions
3. **Data Cleaning**: Handle missing values, outliers, inconsistencies
4. **Feature Engineering**: Create meaningful features from raw data
5. **Data Splitting**: Train/Validation/Test sets (60/20/20 or 70/15/15)
6. **Documentation**: Document all preprocessing steps for reproducibility
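
Feature engineering (step 4) is listed above but not demonstrated in the cells. A minimal sketch of three common transformations follows: extracting a date part, building a ratio, and applying a log transform to a skewed value. The table and its column names (`signup_date`, `total_spend`, `n_orders`) are hypothetical and stand in for whatever raw fields your own dataset provides.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data for illustration
raw = pd.DataFrame({
    "signup_date": pd.to_datetime(["2024-01-15", "2024-03-02", "2024-07-20"]),
    "total_spend": [120.0, 0.0, 560.0],
    "n_orders": [4, 0, 7],
})

features = raw.copy()

# Date part: month of signup as a numeric feature
features["signup_month"] = features["signup_date"].dt.month

# Ratio: average spend per order, guarding against division by zero
orders = features["n_orders"].replace(0, np.nan)
features["spend_per_order"] = (features["total_spend"] / orders).fillna(0.0)

# Log transform: compress a right-skewed, non-negative value
features["log_total_spend"] = np.log1p(features["total_spend"])

print(features[["signup_month", "spend_per_order", "log_total_spend"]])
```

Which features help depends on the problem, so each engineered column is best checked against validation performance rather than assumed useful.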

### Best Practices:
- Always check data quality before modeling
- Document all preprocessing steps
- Save cleaned datasets for reproducibility
- Use stratified splitting for imbalanced data
- Validate data representativeness
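
The quality checks in the list above can be gathered into a small report that runs before modeling. The helper below is a sketch: `quality_report` is a hypothetical name, and the checks it includes (row count, duplicate rows, missing values per column, class balance) are one reasonable starting set, not an exhaustive one.

```python
import pandas as pd


def quality_report(df, target):
    """Summarize basic data-quality indicators for a DataFrame."""
    return {
        "n_rows": len(df),
        "n_duplicates": int(df.duplicated().sum()),
        "missing_per_column": df.isna().sum().to_dict(),
        "class_balance": df[target].value_counts(normalize=True).round(3).to_dict(),
    }


# Hypothetical toy data: one duplicated row and one missing value
df = pd.DataFrame({
    "feature1": [0.1, 0.1, 0.5, None],
    "target": [0, 0, 1, 1],
})

report = quality_report(df, "target")
for key, value in report.items():
    print(f"{key}: {value}")
```

Calling the same function on the train, validation, and test splits gives a quick check that class proportions stay comparable across them.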

**Reference:** Course 12, Unit 2: "Data Collection and Preparation" - All practical activities covered
