# Project Setup and Data Processing

This notebook handles the initial setup of the project environment and data processing pipeline. Follow these steps in order.

## 1. Environment Setup

First, install the required dependencies:

In [1]:
!pip install -r requirements.txt

DEPRECATION: Loading egg at /Users/Angusf777/anaconda3/lib/python3.11/site-packages/modelindex-0.0.2-py3.11.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation.. Discussion can be found at https://github.com/pypa/pip/issues/12330
DEPRECATION: Loading egg at /Users/Angusf777/anaconda3/lib/python3.11/site-packages/ordered_set-4.1.0-py3.11.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation.. Discussion can be found at https://github.com/pypa/pip/issues/12330
DEPRECATION: Loading egg at /Users/Angusf777/anaconda3/lib/python3.11/site-packages/model_index-0.1.11-py3.11.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation.. Discussion can be found at https://github.com/pypa/pip/issues/12330
DEPRECATION: Loading egg a

## 2. Create Directory Structure

In [2]:
from pathlib import Path

# Create necessary directories
directories = ['Data', 'Tools', 'Checkpoints']
base_path = Path.cwd()

for dir_name in directories:
    dir_path = base_path / dir_name
    if not dir_path.exists():
        dir_path.mkdir(parents=True)
        print(f"Created {dir_name} directory")
    else:
        print(f"{dir_name} directory already exists")

Data directory already exists
Tools directory already exists
Checkpoints directory already exists
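
The existence check in the cell above can also be folded into `mkdir` itself through its `exist_ok` flag. A minimal sketch of that alternative; `ensure_directories` is a hypothetical helper name:

```python
from pathlib import Path

def ensure_directories(base_path, names):
    """Create each named directory under base_path; return the ones newly created."""
    created = []
    for name in names:
        dir_path = Path(base_path) / name
        existed = dir_path.is_dir()
        # exist_ok=True makes the call safe to repeat on later runs
        dir_path.mkdir(parents=True, exist_ok=True)
        if not existed:
            created.append(name)
    return created
```

Calling `ensure_directories(Path.cwd(), ['Data', 'Tools', 'Checkpoints'])` on a second run would return an empty list, since all three directories already exist.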


## 3. Download and Extract Kickstarter Dataset

Download the Kickstarter dataset from webrobots.io and extract it to the Data directory.

In [3]:
import gzip
import logging
import shutil
from pathlib import Path

import requests
from tqdm import tqdm

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# Dataset URL (listed at https://webrobots.io/kickstarter-datasets/)
dataset_url = 'https://s3.amazonaws.com/weruns/forfun/Kickstarter/Kickstarter_2024-12-12T03_20_04_455Z.json.gz'
output_file = Path('Data/Kickstarter_2024-12-12T03_20_04_455Z.json')

def download_and_extract():
    """Download and extract the Kickstarter dataset."""
    if output_file.exists():
        logger.info(f"File already exists at {output_file}")
        return

    logger.info("Downloading dataset...")
    response = requests.get(dataset_url, stream=True, timeout=60)
    # Fail early on HTTP errors instead of saving an error page to disk
    response.raise_for_status()
    total_size = int(response.headers.get('content-length', 0))

    # Download the compressed file in chunks
    compressed_file = output_file.with_suffix('.json.gz')
    with open(compressed_file, 'wb') as f:
        with tqdm(total=total_size, unit='iB', unit_scale=True) as pbar:
            for data in response.iter_content(chunk_size=8192):
                size = f.write(data)
                pbar.update(size)

    logger.info("Extracting gzipped file...")
    # Stream the decompressed data to disk rather than reading it all into memory
    with gzip.open(compressed_file, 'rb') as f_in:
        with open(output_file, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)

    # Clean up compressed file
    compressed_file.unlink()
    logger.info("Download and extraction complete!")

download_and_extract()

2025-01-12 14:35:29,603 - INFO - Downloading dataset...
100%|██████████| 269M/269M [02:13<00:00, 2.01MiB/s] 
2025-01-12 14:37:44,197 - INFO - Extracting gzipped file...
2025-01-12 14:37:52,524 - INFO - Download and extraction complete!
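
Before running the processing scripts, a quick look at the first few records can confirm that extraction produced readable data. The sketch below assumes the extracted file is newline-delimited JSON (one object per line), which this notebook does not confirm; `peek_records` is a hypothetical helper:

```python
import json
from pathlib import Path

def peek_records(path, n=3):
    """Parse and return the first n non-empty lines of a newline-delimited JSON file."""
    records = []
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            if line.strip():
                records.append(json.loads(line))
            if len(records) >= n:
                break
    return records

sample_path = Path('Data/Kickstarter_2024-12-12T03_20_04_455Z.json')
if sample_path.exists():
    for record in peek_records(sample_path):
        # Print only the top-level keys to keep the output short
        print(sorted(record.keys()))
```

Reading line by line avoids loading the whole file, which is about 1.4 GB once extracted.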


## 4. Process Data with filter_kickstarter.py and make_WebDatabase.py

Run both scripts on the raw dataset. filter_kickstarter.py writes the filtered dataset and make_WebDatabase.py writes the web database; each also writes a statistics file.

In [4]:
import subprocess
import sys
from pathlib import Path
from typing import Optional

def run_script_with_args(script_path: Path, input_path: Path, output_path: Path, stats_path: Optional[Path] = None):
    """Run a Python script with command line arguments."""
    try:
        logger.info(f"Running {script_path.name}...")

        # Build command with required arguments
        cmd = [
            sys.executable,
            str(script_path),
            '--input', str(input_path),
            '--output', str(output_path)
        ]

        # Add stats path if provided
        if stats_path:
            cmd.extend(['--stats', str(stats_path)])

        # Run the command, capturing its output as text
        result = subprocess.run(cmd, capture_output=True, text=True)

        # Print output
        if result.stdout:
            print(result.stdout)
        if result.stderr:
            print(result.stderr)

        # Treat a non-zero exit code as a failure, even if an older output file exists
        if result.returncode != 0:
            raise RuntimeError(f"{script_path.name} exited with code {result.returncode}")

        # Check if output file was created
        if output_path.exists():
            size_mb = output_path.stat().st_size / (1024 * 1024)
            logger.info(f"Successfully created {output_path.name} ({size_mb:.2f} MB)")
        else:
            raise FileNotFoundError(f"Expected output file {output_path} was not created")

    except Exception as e:
        logger.error(f"Error running {script_path.name}: {e}")
        raise

# Paths for filter_kickstarter.py
filter_script = Path('Tools/filter_kickstarter.py')
filter_input = Path('Data/Kickstarter_2024-12-12T03_20_04_455Z.json')
filter_output = Path('Data/Kickstarter_filtered.json')
filter_stats = Path('Data/filtering_stats.json')

# Run filter_kickstarter.py
if filter_script.exists():
    run_script_with_args(filter_script, filter_input, filter_output, filter_stats)
else:
    logger.error(f"Script not found: {filter_script}")

# Paths for make_WebDatabase.py
web_db_script = Path('Tools/make_WebDatabase.py')
web_db_input = Path('Data/Kickstarter_2024-12-12T03_20_04_455Z.json')
web_db_output = Path('Data/website_database.json')
web_db_stats = Path('Data/web_processing_stats.json')

# Run make_WebDatabase.py
if web_db_script.exists():
    run_script_with_args(web_db_script, web_db_input, web_db_output, web_db_stats)
else:
    logger.error(f"Script not found: {web_db_script}")

2025-01-12 14:37:52,543 - INFO - Running filter_kickstarter.py...
2025-01-12 14:38:36,603 - INFO - Successfully created Kickstarter_filtered.json (1401.27 MB)
2025-01-12 14:38:36,608 - INFO - Running make_WebDatabase.py...


2025-01-12 14:38:36,594 - INFO - 
Filtering completed!
2025-01-12 14:38:36,594 - INFO - Total processed: 231290
2025-01-12 14:38:36,594 - INFO - Included: 212377
2025-01-12 14:38:36,594 - INFO - Excluded: 18913
2025-01-12 14:38:36,594 - INFO - 
Canceled projects summary:
2025-01-12 14:38:36,594 - INFO - Total canceled: 8937
2025-01-12 14:38:36,594 - INFO - Converted to failed: 5738
2025-01-12 14:38:36,594 - INFO - Excluded (>60% remaining): 3199
2025-01-12 14:38:36,594 - INFO - Invalid timestamps: 0
2025-01-12 14:38:36,594 - INFO - 
Detailed statistics saved to: Data/filtering_stats.json
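
The counts in this log are internally consistent, and the "Excluded (>60% remaining)" line suggests how canceled projects are handled: a project canceled with more than 60% of its campaign still to run is dropped, and any other canceled project is relabeled as failed. The sketch below checks the arithmetic and illustrates that rule. The rule is inferred from the log wording only; `classify_canceled` and its timestamp parameters are hypothetical, and the actual logic lives in filter_kickstarter.py.

```python
# Counts copied from the log output above
total, included, excluded = 231290, 212377, 18913
canceled, to_failed, canceled_excluded = 8937, 5738, 3199

assert included + excluded == total
assert to_failed + canceled_excluded == canceled

def classify_canceled(launched_at, deadline, canceled_at, threshold=0.6):
    """Classify a canceled project by the fraction of its campaign that was left.

    All three arguments are Unix timestamps in seconds (an assumption).
    Returns 'excluded', 'failed', or 'invalid' for unusable timestamps.
    """
    duration = deadline - launched_at
    if duration <= 0 or not (launched_at <= canceled_at <= deadline):
        return 'invalid'
    remaining = (deadline - canceled_at) / duration
    return 'excluded' if remaining > threshold else 'failed'

day = 86400
print(classify_canceled(0, 30 * day, 5 * day))   # 25 of 30 days remained -> excluded
print(classify_canceled(0, 30 * day, 20 * day))  # 10 of 30 days remained -> failed
```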



2025-01-12 14:38:54,429 - INFO - Successfully created website_database.json (185.58 MB)


2025-01-12 14:38:36,699 - INFO - Starting to process Kickstarter data...
2025-01-12 14:38:36,752 - INFO - Processed 1000 records successfully...
2025-01-12 14:38:36,802 - INFO - Processed 2000 records successfully...
2025-01-12 14:38:36,862 - INFO - Processed 3000 records successfully...
2025-01-12 14:38:36,917 - INFO - Processed 4000 records successfully...
2025-01-12 14:38:36,967 - INFO - Processed 5000 records successfully...
2025-01-12 14:38:37,021 - INFO - Processed 6000 records successfully...
2025-01-12 14:38:37,075 - INFO - Processed 7000 records successfully...
2025-01-12 14:38:37,137 - INFO - Processed 8000 records successfully...
2025-01-12 14:38:37,193 - INFO - Processed 9000 records successfully...
2025-01-12 14:38:37,258 - INFO - Processed 10000 records successfully...
2025-01-12 14:38:37,319 - INFO - Processed 11000 records successfully...
2025-01-12 14:38:37,372 - INFO - Processed 12000 records successfully...
2025-01-12 14:38:37,427 - INFO - Processed 13000 records suc
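
The helper in this cell passes `--input`, `--output`, and optionally `--stats` to each script. For reference, a script accepting those flags could parse them as sketched below. This is an illustrative skeleton, assuming all three flags take file paths; it is not the actual code of filter_kickstarter.py or make_WebDatabase.py, and the example file names are placeholders.

```python
import argparse
from pathlib import Path

def build_parser():
    """Build a parser for the three flags that run_script_with_args passes."""
    parser = argparse.ArgumentParser(description="Process a Kickstarter dataset file")
    parser.add_argument('--input', type=Path, required=True, help="Path to the input JSON file")
    parser.add_argument('--output', type=Path, required=True, help="Path for the processed output")
    parser.add_argument('--stats', type=Path, default=None, help="Optional path for a statistics file")
    return parser

# Parse an explicit argument list here; a real script would call parse_args() with no arguments
args = build_parser().parse_args(['--input', 'Data/raw.json', '--output', 'Data/out.json'])
print(args.input, args.output, args.stats)
```

Making `--stats` optional mirrors the helper, which only appends that flag when a stats path is provided.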

## 5. Verify Setup

Check that all required files have been created successfully.

In [5]:
required_files = [
    'Kickstarter_2024-12-12T03_20_04_455Z.json',
    'Kickstarter_filtered.json',
    'filtering_stats.json',
    'website_database.json',
    'web_processing_stats.json'
]

logger.info("Checking for required files:")
for file in required_files:
    file_path = Path('Data') / file
    if file_path.exists():
        size_mb = file_path.stat().st_size / (1024 * 1024)
        logger.info(f"✓ {file} exists ({size_mb:.2f} MB)")
    else:
        logger.error(f"✗ {file} is missing")

2025-01-12 14:38:54,441 - INFO - Checking for required files:
2025-01-12 14:38:54,442 - INFO - ✓ Kickstarter_2024-12-12T03_20_04_455Z.json exists (1465.22 MB)
2025-01-12 14:38:54,443 - INFO - ✓ Kickstarter_filtered.json exists (1401.27 MB)
2025-01-12 14:38:54,444 - INFO - ✓ filtering_stats.json exists (0.02 MB)
2025-01-12 14:38:54,444 - INFO - ✓ website_database.json exists (185.58 MB)
2025-01-12 14:38:54,444 - INFO - ✓ web_processing_stats.json exists (0.00 MB)
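
The check above only logs missing files. If later steps should not run against an incomplete setup, the same check can return the problem files so the caller can stop. A minimal sketch; `find_incomplete` is a hypothetical helper, and treating zero-byte files as incomplete is an added assumption:

```python
from pathlib import Path

def find_incomplete(data_dir, filenames):
    """Return the filenames that are missing from data_dir or are zero bytes long."""
    incomplete = []
    for name in filenames:
        file_path = Path(data_dir) / name
        if not file_path.is_file() or file_path.stat().st_size == 0:
            incomplete.append(name)
    return incomplete
```

With the list from the cell above, `find_incomplete('Data', required_files)` would return an empty list on a successful run, and a non-empty result could be raised as an error before any downstream work starts.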
