# Project Setup and Data Processing

This notebook handles the initial setup of the project environment and data processing pipeline. Follow these steps in order.

## 1. Environment Setup

First, install the required dependencies:

In [1]:
!pip install -r requirements.txt

DEPRECATION: Loading egg at /Users/Angusf777/anaconda3/lib/python3.11/site-packages/modelindex-0.0.2-py3.11.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation.. Discussion can be found at https://github.com/pypa/pip/issues/12330
DEPRECATION: Loading egg at /Users/Angusf777/anaconda3/lib/python3.11/site-packages/ordered_set-4.1.0-py3.11.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation.. Discussion can be found at https://github.com/pypa/pip/issues/12330
DEPRECATION: Loading egg at /Users/Angusf777/anaconda3/lib/python3.11/site-packages/model_index-0.1.11-py3.11.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation.. Discussion can be found at https://github.com/pypa/pip/issues/12330
DEPRECATION: Loading egg a

## 2. Create Directory Structure

In [2]:
from pathlib import Path

# Create necessary directories
directories = ['Data', 'Tools', 'Checkpoints']
base_path = Path.cwd()

for dir_name in directories:
    dir_path = base_path / dir_name
    if not dir_path.exists():
        dir_path.mkdir(parents=True)
        print(f"Created {dir_name} directory")
    else:
        print(f"{dir_name} directory already exists")

Data directory already exists
Tools directory already exists
Checkpoints directory already exists
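
The explicit existence check in the cell above can also be folded into `mkdir` itself. The following is a minimal sketch, not the notebook's own code; the helper name `ensure_directories` is hypothetical. It relies on `exist_ok=True` so that repeated runs are harmless, and it returns the directories that were newly created so the caller can still report them.

```python
from pathlib import Path


def ensure_directories(base_path: Path, names: list[str]) -> list[Path]:
    """Create each named directory under base_path and return the newly created ones."""
    created = []
    for name in names:
        dir_path = base_path / name
        existed = dir_path.is_dir()
        # exist_ok=True turns the call into a no-op when the directory already exists
        dir_path.mkdir(parents=True, exist_ok=True)
        if not existed:
            created.append(dir_path)
    return created
```

Calling `ensure_directories(Path.cwd(), ['Data', 'Tools', 'Checkpoints'])` a second time should return an empty list.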


## 3. Download and Extract Kickstarter Dataset

Download the Kickstarter dataset from webrobots.io and extract it to the Data directory.

In [3]:
import gzip
import logging
import shutil
from pathlib import Path

import requests
from tqdm import tqdm

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# Dataset URL (listed at https://webrobots.io/kickstarter-datasets/)
dataset_url = 'https://s3.amazonaws.com/weruns/forfun/Kickstarter/Kickstarter_2024-12-12T03_20_04_455Z.json.gz'
output_file = Path('Data/Kickstarter_2024-12-12T03_20_04_455Z.json')

def download_and_extract():
    """Download and extract the Kickstarter dataset."""
    if output_file.exists():
        logger.info(f"File already exists at {output_file}")
        return

    logger.info("Downloading dataset...")
    response = requests.get(dataset_url, stream=True, timeout=60)
    response.raise_for_status()  # Fail early instead of saving an HTTP error page
    total_size = int(response.headers.get('content-length', 0))

    # Download compressed file
    compressed_file = output_file.with_suffix('.json.gz')
    with open(compressed_file, 'wb') as f:
        with tqdm(total=total_size, unit='iB', unit_scale=True) as pbar:
            for data in response.iter_content(chunk_size=8192):
                size = f.write(data)
                pbar.update(size)

    logger.info("Extracting gzipped file...")
    # Stream the decompression so the ~1.4 GB payload is never held in memory at once
    with gzip.open(compressed_file, 'rb') as f_in:
        with open(output_file, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)
    
    # Clean up compressed file
    compressed_file.unlink()
    logger.info("Download and extraction complete!")

download_and_extract()

2025-04-17 11:27:32,107 - INFO - Downloading dataset...
100%|██████████| 269M/269M [00:21<00:00, 12.3MiB/s] 
2025-04-17 11:27:54,776 - INFO - Extracting gzipped file...
2025-04-17 11:28:02,248 - INFO - Download and extraction complete!
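
Because download_and_extract() returns early whenever output_file exists, an extraction that is interrupted midway could leave a truncated file that later runs silently accept. One way to guard against this is to decompress into a temporary file and rename it only on success. The sketch below is an assumption about how that could be done, not part of the original pipeline; the function name and the `.part` suffix are hypothetical.

```python
import gzip
import shutil
from pathlib import Path


def extract_gzip_atomically(compressed: Path, target: Path) -> Path:
    """Decompress `compressed` into `target` without ever exposing a partial target."""
    partial = target.with_name(target.name + '.part')  # hypothetical temp-file naming
    try:
        with gzip.open(compressed, 'rb') as f_in, open(partial, 'wb') as f_out:
            # Copy in chunks so the whole payload is never loaded into memory
            shutil.copyfileobj(f_in, f_out)
        # The rename happens only after decompression finished without error
        partial.replace(target)
    finally:
        # Remove the leftover temp file if decompression failed
        partial.unlink(missing_ok=True)
    return target
```

If decompression raises, `target` is never created, so the early-return check keeps meaning "a complete file is present".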


## 4. Process Data with check_duplicates.py, filter_kickstarter.py and make_WebDatabase.py

Run the three scripts in order to deduplicate and filter the raw dataset and to build the web database used for feature engineering in a later step.
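
The helper in the next cell passes each script the flags `--input`, `--output` and, optionally, `--stats`. The scripts themselves are not shown in this notebook, so the following is only a sketch of a script that honors that interface. It assumes one JSON object per line and a hypothetical `id` key for spotting duplicates; the real check_duplicates.py may read a different layout and use a different key.

```python
import argparse
import json
from pathlib import Path


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the three flags the notebook helper passes to each script."""
    parser = argparse.ArgumentParser(description='Hypothetical preprocessing step')
    parser.add_argument('--input', type=Path, required=True)
    parser.add_argument('--output', type=Path, required=True)
    parser.add_argument('--stats', type=Path, default=None)
    return parser.parse_args(argv)


def remove_duplicates(records, key='id'):
    """Keep the first record seen for each key value (the key name is an assumption)."""
    seen = set()
    unique = []
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            unique.append(record)
    return unique


def main(argv=None) -> None:
    args = parse_args(argv)
    # Assumption: the input holds one JSON object per line
    with open(args.input, encoding='utf-8') as f:
        records = [json.loads(line) for line in f if line.strip()]
    unique = remove_duplicates(records)
    with open(args.output, 'w', encoding='utf-8') as f:
        for record in unique:
            f.write(json.dumps(record) + '\n')
    if args.stats is not None:
        stats = {
            'total': len(records),
            'duplicates_removed': len(records) - len(unique),
            'unique': len(unique),
        }
        args.stats.write_text(json.dumps(stats, indent=2), encoding='utf-8')
```

Calling `main()` with no arguments reads the flags from the command line, which is how the helper would invoke such a script.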

In [4]:
import subprocess
import sys
from pathlib import Path

def run_script_with_args(script_path: Path, input_path: Path, output_path: Path, stats_path: Path | None = None):
    """Run a Python script with command line arguments."""
    try:
        logger.info(f"Running {script_path.name}...")
        
        # Build command with required arguments
        cmd = [
            sys.executable,
            str(script_path),
            '--input', str(input_path),
            '--output', str(output_path)
        ]
        
        # Add stats path if provided
        if stats_path:
            cmd.extend(['--stats', str(stats_path)])
        
        # Run the command
        result = subprocess.run(cmd, capture_output=True, text=True)

        # Print output
        if result.stdout:
            print(result.stdout)
        if result.stderr:
            print(result.stderr)

        # Fail on a non-zero exit code, after the script's output has been shown
        result.check_returncode()
            
        # Check if output file was created
        if output_path.exists():
            size_mb = output_path.stat().st_size / (1024 * 1024)
            logger.info(f"Successfully created {output_path.name} ({size_mb:.2f} MB)")
        else:
            raise FileNotFoundError(f"Expected output file {output_path} was not created")
            
    except Exception as e:
        logger.error(f"Error running {script_path.name}: {e}")
        raise

# Step 1: Remove duplicates
logger.info("Step 1: Removing duplicates...")
duplicate_script = Path('Tools/check_duplicates.py')
duplicate_input = Path('Data/Kickstarter_2024-12-12T03_20_04_455Z.json')
duplicate_output = Path('Data/Kickstarter_removed_duplicates.json')
duplicate_stats = Path('Data/duplicate_stats.json')

if duplicate_script.exists():
    run_script_with_args(duplicate_script, duplicate_input, duplicate_output, duplicate_stats)
else:
    logger.error(f"Script not found: {duplicate_script}")

# Step 2: Run filter on deduplicated data
logger.info("Step 2: Filtering deduplicated data...")
filter_script = Path('Tools/filter_kickstarter.py')
filter_input = Path('Data/Kickstarter_removed_duplicates.json')
filter_output = Path('Data/Kickstarter_filtered.json')
filter_stats = Path('Data/filtering_stats.json')

if filter_script.exists():
    run_script_with_args(filter_script, filter_input, filter_output, filter_stats)
else:
    logger.error(f"Script not found: {filter_script}")

# Step 3: Create web database from deduplicated data
logger.info("Step 3: Creating web database from deduplicated data...")
web_db_script = Path('Tools/make_WebDatabase.py')
web_db_input = Path('Data/Kickstarter_removed_duplicates.json')
web_db_output = Path('Data/website_database.json')
web_db_stats = Path('Data/web_processing_stats.json')

if web_db_script.exists():
    run_script_with_args(web_db_script, web_db_input, web_db_output, web_db_stats)
else:
    logger.error(f"Script not found: {web_db_script}")

# Verify all files were created
required_files = [
    duplicate_output,
    duplicate_stats,
    filter_output,
    filter_stats,
    web_db_output,
    web_db_stats
]

logger.info("\nVerifying created files:")
for file in required_files:
    if file.exists():
        size_mb = file.stat().st_size / (1024 * 1024)
        logger.info(f"✓ {file.name} exists ({size_mb:.2f} MB)")
    else:
        logger.error(f"✗ {file.name} is missing")

2025-04-17 11:28:02,265 - INFO - Step 1: Removing duplicates...
2025-04-17 11:28:02,266 - INFO - Running check_duplicates.py...
2025-04-17 11:28:48,416 - INFO - Successfully created Kickstarter_removed_duplicates.json (1326.20 MB)
2025-04-17 11:28:48,418 - INFO - Step 2: Filtering deduplicated data...
2025-04-17 11:28:48,419 - INFO - Running filter_kickstarter.py...



Duplicate Removal Summary:
-------------------------
Total projects processed: 231,290
Duplicates removed: 29,380
Unique projects remaining: 201,910

Deduplicated data saved to: Data/Kickstarter_removed_duplicates.json
Statistics saved to: Data/duplicate_stats.json

2025-04-17 11:28:02,393 - INFO - Starting duplicate removal process...
2025-04-17 11:28:44,576 - INFO - Duplicate removal completed!



2025-04-17 11:29:30,534 - INFO - Successfully created Kickstarter_filtered.json (1250.65 MB)
2025-04-17 11:29:30,535 - INFO - Step 3: Creating web database from deduplicated data...
2025-04-17 11:29:30,536 - INFO - Running make_WebDatabase.py...


2025-04-17 11:29:30,525 - INFO - 
Filtering completed!
2025-04-17 11:29:30,525 - INFO - Total processed: 201910
2025-04-17 11:29:30,525 - INFO - Included: 189796
2025-04-17 11:29:30,525 - INFO - Excluded: 12114
2025-04-17 11:29:30,525 - INFO - 
Canceled projects summary:
2025-04-17 11:29:30,525 - INFO - Total canceled: 8937
2025-04-17 11:29:30,526 - INFO - Converted to failed: 5738
2025-04-17 11:29:30,526 - INFO - Excluded (>60% remaining): 3199
2025-04-17 11:29:30,526 - INFO - Invalid timestamps: 0
2025-04-17 11:29:30,526 - INFO - 
Detailed statistics saved to: Data/filtering_stats.json



2025-04-17 11:29:45,019 - INFO - Successfully created website_database.json (171.44 MB)
2025-04-17 11:29:45,019 - INFO - 
Verifying created files:
2025-04-17 11:29:45,020 - INFO - ✓ Kickstarter_removed_duplicates.json exists (1326.20 MB)
2025-04-17 11:29:45,020 - INFO - ✓ duplicate_stats.json exists (0.00 MB)
2025-04-17 11:29:45,021 - INFO - ✓ Kickstarter_filtered.json exists (1250.65 MB)
2025-04-17 11:29:45,021 - INFO - ✓ filtering_stats.json exists (0.02 MB)
2025-04-17 11:29:45,021 - INFO - ✓ website_database.json exists (171.44 MB)
2025-04-17 11:29:45,021 - INFO - ✓ web_processing_stats.json exists (0.00 MB)


2025-04-17 11:29:30,614 - INFO - Starting to process Kickstarter data...
2025-04-17 11:29:30,662 - INFO - Processed 1000 records successfully...
2025-04-17 11:29:30,714 - INFO - Processed 2000 records successfully...
2025-04-17 11:29:30,766 - INFO - Processed 3000 records successfully...
2025-04-17 11:29:30,816 - INFO - Processed 4000 records successfully...
2025-04-17 11:29:30,863 - INFO - Processed 5000 records successfully...
2025-04-17 11:29:30,914 - INFO - Processed 6000 records successfully...
2025-04-17 11:29:30,964 - INFO - Processed 7000 records successfully...
2025-04-17 11:29:31,017 - INFO - Processed 8000 records successfully...
2025-04-17 11:29:31,063 - INFO - Processed 9000 records successfully...
2025-04-17 11:29:31,113 - INFO - Processed 10000 records successfully...
2025-04-17 11:29:31,167 - INFO - Processed 11000 records successfully...
2025-04-17 11:29:31,218 - INFO - Processed 12000 records successfully...
2025-04-17 11:29:31,278 - INFO - Processed 13000 records suc
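
The counts reported in the logs above can be cross-checked with a few lines of arithmetic. The figures below are copied from this run's output; the variable names are only illustrative.

```python
# Counts copied from the duplicate-removal summary of this run
total_processed = 231_290
duplicates_removed = 29_380
unique_remaining = 201_910

# Counts copied from the filtering log of this run
included = 189_796
excluded = 12_114
total_canceled = 8_937
converted_to_failed = 5_738
excluded_over_60_remaining = 3_199  # "Excluded (>60% remaining)" in the log
invalid_timestamps = 0

# Deduplication: input minus duplicates should equal the unique projects
assert total_processed - duplicates_removed == unique_remaining
# Filtering starts from the deduplicated projects
assert included + excluded == unique_remaining
# Every canceled project falls into one of the three reported buckets
assert converted_to_failed + excluded_over_60_remaining + invalid_timestamps == total_canceled

duplicate_rate = duplicates_removed / total_processed
print(f"Duplicate rate: {duplicate_rate:.1%}")
```

All three assertions hold for the numbers in this run, so the two scripts agree on how many projects were passed from one stage to the next.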

## 5. Verify Setup

Check that all required files have been created successfully.

In [5]:
required_files = [
    'Kickstarter_2024-12-12T03_20_04_455Z.json',  # Original dataset
    'Kickstarter_removed_duplicates.json',         # Deduplicated dataset
    'duplicate_stats.json',                        # Duplicate removal statistics
    'Kickstarter_filtered.json',                   # Filtered dataset
    'filtering_stats.json',                        # Filtering statistics
    'website_database.json',                       # Web database
    'web_processing_stats.json'                    # Web processing statistics
]

logger.info("Checking for required files:")
for file in required_files:
    file_path = Path('Data') / file
    if file_path.exists():
        size_mb = file_path.stat().st_size / (1024 * 1024)
        logger.info(f"✓ {file} exists ({size_mb:.2f} MB)")
    else:
        logger.error(f"✗ {file} is missing")

2025-04-17 11:29:45,031 - INFO - Checking for required files:
2025-04-17 11:29:45,032 - INFO - ✓ Kickstarter_2024-12-12T03_20_04_455Z.json exists (1465.22 MB)
2025-04-17 11:29:45,033 - INFO - ✓ Kickstarter_removed_duplicates.json exists (1326.20 MB)
2025-04-17 11:29:45,034 - INFO - ✓ duplicate_stats.json exists (0.00 MB)
2025-04-17 11:29:45,034 - INFO - ✓ Kickstarter_filtered.json exists (1250.65 MB)
2025-04-17 11:29:45,035 - INFO - ✓ filtering_stats.json exists (0.02 MB)
2025-04-17 11:29:45,036 - INFO - ✓ website_database.json exists (171.44 MB)
2025-04-17 11:29:45,037 - INFO - ✓ web_processing_stats.json exists (0.00 MB)
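
The log above reports two stats files as 0.00 MB, because rounding to two decimals hides files of only a few kilobytes. A check on the byte count distinguishes a small file from an empty one. The helper below is a sketch with a hypothetical name; it returns the offending file names so the notebook can stop before later steps run on incomplete data.

```python
from pathlib import Path


def find_missing_or_empty(base_dir: Path, file_names: list[str]) -> list[str]:
    """Return the names that are absent from base_dir or have a size of zero bytes."""
    problems = []
    for name in file_names:
        path = base_dir / name
        if not path.is_file() or path.stat().st_size == 0:
            problems.append(name)
    return problems
```

A possible use is `problems = find_missing_or_empty(Path('Data'), required_files)` followed by raising `FileNotFoundError` when the list is not empty.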


## 6. Preprocess Scraped Web Data and Cross-Check It with Kickstarter Data to Create Initial Training Data

This step requires RawData/project_descriptions_20000.json, the scraping result for roughly 20,000 projects. The file was obtained by scraping the Kickstarter website with the code at https://github.com/lkh2/Kickstarter-Scraper. The initial training data created here still needs feature engineering, which is done in step 7.

In [8]:
import subprocess

# Run create_pre_input.py
subprocess.run(["python", "create_pre_input.py"], check=True)

# Run test.py, which summarizes the distributions of data length and funding goals
subprocess.run(["python", "test.py"], check=True)

Error: RawData/project_descriptions_20000.json not found.
Error: Data/pre_inputdata.json not found!


CompletedProcess(args=['python', 'test.py'], returncode=0)
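
In the run above both scripts printed an error message, yet test.py still returned exit code 0, so `check=True` did not stop the notebook. The sketch below shows one way to catch this case. Scanning the output for lines that start with "Error:" is a heuristic assumed from the messages shown here, and the function name `run_checked` is hypothetical.

```python
import subprocess
import sys


def run_checked(cmd: list[str]) -> subprocess.CompletedProcess:
    """Run a command; fail on a non-zero exit code or an 'Error:' line in its output."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    output = result.stdout + result.stderr
    print(output, end='')
    # Raises subprocess.CalledProcessError when the exit code is non-zero
    result.check_returncode()
    # Heuristic (assumption): any line starting with "Error:" marks a failed step
    errors = [line for line in output.splitlines() if line.startswith('Error:')]
    if errors:
        raise RuntimeError(f"{' '.join(cmd)} reported: {errors[0]}")
    return result
```

For example, `run_checked([sys.executable, 'create_pre_input.py'])` would raise `RuntimeError` if the script printed the missing-file message shown above.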

## 7. Feature Engineering the Initial Training Data to Create Model Input Data

In [None]:
# Run Processor.py
subprocess.run(["python", "Processor.py"], check=True)   