# Kaggle Submission Lab

Kaggle is a platform for data science competitions. In this lab, you'll learn how to participate in a Kaggle competition from start to finish.

**What you'll learn:**
- Navigating Kaggle and finding competitions
- Downloading and understanding competition data
- Creating a submission file
- Submitting predictions and viewing your score

**Prerequisites:**
- A Kaggle account ([kaggle.com](https://www.kaggle.com))
- Completed the Basic ML Model lab (or equivalent experience)

**Time:** ~30 minutes

---

## Part 1: Finding a Competition

### Where to Start

1. Go to [kaggle.com/competitions](https://www.kaggle.com/competitions)
2. Click on the **"Getting Started"** tab — these are beginner-friendly competitions

### Recommended Beginner Competitions

| Competition | Type | Description |
|-------------|------|-------------|
| [Titanic](https://www.kaggle.com/c/titanic) | Classification | Predict survival on the Titanic |
| [House Prices](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) | Regression | Predict house sale prices |
| [Digit Recognizer](https://www.kaggle.com/c/digit-recognizer) | Classification | Identify handwritten digits |

For this lab, we'll walk through the **Titanic** competition — it's the classic beginner competition.

### Join the Competition

1. Go to [kaggle.com/c/titanic](https://www.kaggle.com/c/titanic)
2. Click **"Join Competition"**
3. Accept the rules

---

## Part 2: Understanding the Competition

Most Kaggle competition pages share the following tabs:

| Tab | What it contains |
|-----|------------------|
| **Overview** | Competition description, prizes, timeline |
| **Data** | Download data, description of features |
| **Code** | Public notebooks from other participants |
| **Discussion** | Q&A and tips from the community |
| **Leaderboard** | Rankings based on submissions |
| **Rules** | Submission limits, team rules, etc. |

### The Titanic Challenge

**Goal:** Predict which passengers survived the Titanic disaster.

**Data files:**
- `train.csv` — Training data with survival labels (what you train on)
- `test.csv` — Test data without labels (what you predict)
- `gender_submission.csv` — Example submission file showing the format

---

## Part 3: Getting the Data

You have two options:

### Option A: Download from Kaggle Website

1. Go to the Data tab
2. Click "Download All"
3. Extract the zip file to your working directory

### Option B: Use Kaggle API (Recommended)

First, set up the Kaggle API:

1. Go to [kaggle.com/account](https://www.kaggle.com/account)
2. Scroll to the "API" section
3. Click "Create New API Token"
4. This downloads `kaggle.json`
5. Place it in `~/.kaggle/` (Linux/Mac) or `C:\Users\<username>\.kaggle\` (Windows)

Then install and use the API:

```bash
pip install kaggle
kaggle competitions download -c titanic
unzip titanic.zip
```
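
A misplaced or world-readable token is a common reason the `kaggle` command fails right after setup. The sketch below checks a token file for the usual problems. It assumes the file is JSON with `username` and `key` fields, which is how `kaggle.json` files have typically looked; `check_kaggle_token` is a hypothetical helper, not part of the Kaggle package.

```python
import json
import os
import stat
from pathlib import Path


def check_kaggle_token(token_path):
    """Return a list of problems found with a kaggle.json token file."""
    path = Path(token_path)
    if not path.is_file():
        return [f"{path} does not exist"]

    try:
        token = json.loads(path.read_text())
    except json.JSONDecodeError:
        return [f"{path} is not valid JSON"]

    problems = []
    # Assumed layout: {"username": "...", "key": "..."}
    for field in ("username", "key"):
        if not isinstance(token, dict) or field not in token:
            problems.append(f"missing '{field}' field")

    # On Linux/Mac the file should be readable only by its owner
    if os.name == "posix":
        mode = stat.S_IMODE(path.stat().st_mode)
        if mode & 0o077:
            problems.append(f"file is accessible to other users; run: chmod 600 {path}")
    return problems
```

Calling `check_kaggle_token(Path.home() / ".kaggle" / "kaggle.json")` returns an empty list when nothing looks wrong with the default location described above.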

---

## Part 4: Load and Explore the Data

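Let's load the Titanic data. If the files aren't in your working directory, the cell below falls back to synthetic data with the same core columns (it omits `Name`, `Ticket`, and `Cabin`). The synthetic data is only for following along: predictions made from it have no relation to the real passengers, so submit only predictions built from the real files.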

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score

print("Libraries imported!")

In [None]:
# Try to load the actual Titanic data
# If files don't exist, we'll create synthetic data with the same structure

try:
    train = pd.read_csv('train.csv')
    test = pd.read_csv('test.csv')
    print("Loaded Kaggle Titanic data!")
except FileNotFoundError:
    print("Titanic files not found. Creating synthetic data for demonstration...")
    print("(Download the real data from kaggle.com/c/titanic for the actual competition)")
    
    # Create synthetic Titanic-like data
    np.random.seed(42)
    n_train, n_test = 891, 418
    
    def create_titanic_data(n, include_survived=True):
        data = {
            'PassengerId': range(1, n + 1),
            'Pclass': np.random.choice([1, 2, 3], n, p=[0.24, 0.21, 0.55]),
            'Sex': np.random.choice(['male', 'female'], n, p=[0.65, 0.35]),
            'Age': np.random.normal(30, 14, n).clip(0.5, 80),
            'SibSp': np.random.choice([0, 1, 2, 3, 4], n, p=[0.68, 0.23, 0.05, 0.02, 0.02]),
            'Parch': np.random.choice([0, 1, 2, 3], n, p=[0.76, 0.13, 0.09, 0.02]),
            'Fare': np.random.exponential(30, n).clip(0, 500),
            'Embarked': np.random.choice(['S', 'C', 'Q'], n, p=[0.72, 0.19, 0.09])
        }
        df = pd.DataFrame(data)
        
        if include_survived:
            survival_prob = 0.38
            survival_prob += np.where(df['Sex'] == 'female', 0.35, -0.15)
            survival_prob += np.where(df['Pclass'] == 1, 0.15, np.where(df['Pclass'] == 3, -0.15, 0))
            survival_prob += np.where(df['Age'] < 16, 0.1, 0)
            survival_prob = np.clip(survival_prob, 0.05, 0.95)
            df['Survived'] = (np.random.random(n) < survival_prob).astype(int)
        
        df.loc[np.random.choice(n, int(n * 0.2), replace=False), 'Age'] = np.nan
        df.loc[np.random.choice(n, int(n * 0.002), replace=False), 'Embarked'] = np.nan
        
        return df
    
    train = create_titanic_data(n_train, include_survived=True)
    test = create_titanic_data(n_test, include_survived=False)
    test['PassengerId'] = range(892, 892 + n_test)
    
print(f"\nTraining data: {train.shape}")
print(f"Test data: {test.shape}")

In [None]:
# Look at the training data
train.head()

In [None]:
# Check data types and missing values
train.info()
print("\nMissing Values:")
print(train.isna().sum())

In [None]:
# Survival rate
print(f"Overall survival rate: {train['Survived'].mean():.2%}")
print("\nSurvival by Sex:")
print(train.groupby('Sex')['Survived'].mean())
print("\nSurvival by Pclass:")
print(train.groupby('Pclass')['Survived'].mean())
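
Survival rates on their own can mislead when a group is small, so it can help to show the group size next to each rate. A minimal sketch follows; `survival_summary` is a hypothetical helper and the four-row sample is made up purely to show the output shape.

```python
import pandas as pd


def survival_summary(df, column):
    """Survival rate and passenger count for each value of `column`."""
    return (
        df.groupby(column)["Survived"]
        .agg(rate="mean", passengers="count")
        .sort_values("rate", ascending=False)
    )


# Tiny made-up sample, only to show what the table looks like
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Survived": [0, 1, 1, 1],
})
print(survival_summary(sample, "Sex"))
```

With the lab's data loaded, `survival_summary(train, "Pclass")` would give the same kind of table for passenger class.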

---

## Part 5: Prepare the Data

We need to handle missing values and convert categorical variables to numbers.

In [None]:
def preprocess_titanic(df):
    """
    Preprocess Titanic data.
    Returns a copy with features ready for modeling.
    """
    data = df.copy()
    
    # Fill missing Age with median
    data['Age'] = data['Age'].fillna(data['Age'].median())
    
    # Fill missing Embarked with mode
    data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])
    
    # Fill missing Fare with median
    data['Fare'] = data['Fare'].fillna(data['Fare'].median())
    
    # Convert Sex to numeric
    data['Sex'] = data['Sex'].map({'male': 0, 'female': 1})
    
    # Convert Embarked to numeric
    data['Embarked'] = data['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
    
    # Select features
    features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
    
    return data[features]

# Preprocess both train and test
X_train_full = preprocess_titanic(train)
X_test_final = preprocess_titanic(test)
y_train_full = train['Survived']

print("Preprocessing complete!")
print(f"Training features shape: {X_train_full.shape}")
print(f"Test features shape: {X_test_final.shape}")
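
Note that `preprocess_titanic` computes medians and the mode separately for each DataFrame it receives, so the test set is filled with its own statistics. A common alternative is to learn the fill values from the training data and reuse them on the test data. The sketch below shows that variant; `fit_fill_values` and `apply_fill_values` are hypothetical helpers, and the sample frames are made up.

```python
import pandas as pd


def fit_fill_values(train_df):
    """Learn imputation values from the training data only."""
    return {
        "Age": train_df["Age"].median(),
        "Fare": train_df["Fare"].median(),
        "Embarked": train_df["Embarked"].mode()[0],
    }


def apply_fill_values(df, fill_values):
    """Return a copy of `df` with missing values replaced by the learned ones."""
    return df.fillna(value=fill_values)


# Made-up frames standing in for train and test
train_sample = pd.DataFrame({
    "Age": [20.0, 30.0, None],
    "Fare": [10.0, 20.0, 30.0],
    "Embarked": ["S", "S", "C"],
})
test_sample = pd.DataFrame({
    "Age": [None, 50.0],
    "Fare": [None, 8.0],
    "Embarked": [None, "Q"],
})

fill_values = fit_fill_values(train_sample)
print(apply_fill_values(test_sample, fill_values))
```

On a dataset this small the difference in score is likely minor, but keeping all learned statistics tied to the training data makes the local validation a more honest preview of test-time behavior.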

---

## Part 6: Train and Validate Locally

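Before submitting to Kaggle, validate your model locally. Cross-validation estimates how the model performs on data it has not seen, without spending one of your limited submissions.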

In [None]:
# Cross-validation
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)

cv_scores = cross_val_score(model, X_train_full, y_train_full, cv=5, scoring='accuracy')

print(f"Cross-validation scores: {cv_scores}")
print(f"Mean CV accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
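
To make the mechanics concrete, here is a rough sketch of what 5-fold cross-validation does for a classifier: split the rows into five stratified folds, train a fresh copy of the model on four of them, and score it on the fifth. `manual_cv_accuracy` is a hypothetical helper, and the data comes from `make_classification` as a stand-in for the Titanic features.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold


def manual_cv_accuracy(model, X, y, n_splits=5):
    """Fit a fresh copy of `model` on each fold and return per-fold accuracy."""
    X, y = np.asarray(X), np.asarray(y)
    scores = []
    for train_idx, valid_idx in StratifiedKFold(n_splits=n_splits).split(X, y):
        fold_model = clone(model)  # unfitted copy with the same parameters
        fold_model.fit(X[train_idx], y[train_idx])
        scores.append(fold_model.score(X[valid_idx], y[valid_idx]))
    return np.array(scores)


# Stand-in data with 7 features, like the preprocessed Titanic table
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=0)
demo_model = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=42)

fold_scores = manual_cv_accuracy(demo_model, X_demo, y_demo)
print(f"Per-fold accuracy: {fold_scores}")
print(f"Mean: {fold_scores.mean():.4f}")
```

Each row is used for validation exactly once, which is why the mean of the fold scores is a steadier estimate than a single train/validation split.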

---

## Part 7: Train Final Model and Create Submission

Train on ALL training data and create predictions for the test set.

In [None]:
# Train on full training data
final_model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
final_model.fit(X_train_full, y_train_full)

# Make predictions on test set
test_predictions = final_model.predict(X_test_final)

print(f"Predictions generated: {len(test_predictions)} rows")
print(f"Predicted survival: {test_predictions.sum()} / {len(test_predictions)}")

In [None]:
# Create submission DataFrame
submission = pd.DataFrame({
    'PassengerId': test['PassengerId'],
    'Survived': test_predictions
})

print("Submission preview:")
print(submission.head())

In [None]:
# Save to CSV
submission.to_csv('submission.csv', index=False)
print("\nSubmission saved to 'submission.csv'!")

# Verify
print(f"File shape: {pd.read_csv('submission.csv').shape}")
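
Checking only the shape can miss problems such as wrong column names or non-binary predictions. A slightly fuller check might look like the sketch below. `check_submission` is a hypothetical helper, and the rules it enforces are the Titanic format used in this lab (`PassengerId` and `Survived` columns, one row per test passenger, values 0 or 1).

```python
import pandas as pd


def check_submission(submission, test_ids):
    """Return a list of format problems; an empty list means none were found."""
    if list(submission.columns) != ["PassengerId", "Survived"]:
        return [f"unexpected columns: {list(submission.columns)}"]

    problems = []
    if len(submission) != len(test_ids):
        problems.append(f"expected {len(test_ids)} rows, found {len(submission)}")
    if submission["PassengerId"].duplicated().any():
        problems.append("duplicate PassengerId values")
    if set(submission["PassengerId"]) != set(test_ids):
        problems.append("PassengerId values do not match the test set")
    if not submission["Survived"].isin([0, 1]).all():
        problems.append("Survived must contain only 0 or 1")
    return problems


# Made-up three-row example
demo_submission = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})
print(check_submission(demo_submission, [892, 893, 894]))
```

In the notebook, the equivalent call would be `check_submission(pd.read_csv('submission.csv'), test['PassengerId'])`.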

---

## Part 8: Submit to Kaggle

### Option A: Website Upload

1. Go to [kaggle.com/c/titanic](https://www.kaggle.com/c/titanic)
2. Click **"Submit Predictions"**
3. Upload your `submission.csv` file
4. Add a description (e.g., "Random Forest, basic features")
5. Click **"Make Submission"**

### Option B: Kaggle API

```bash
kaggle competitions submit -c titanic -f submission.csv -m "Random Forest baseline"
```

### After Submitting

- You'll see your score on the **public leaderboard**
- The public leaderboard is scored on only a portion of the test data
- Final rankings use the **private leaderboard** (revealed when the competition ends)
- You can submit multiple times (check the rules for daily limits)
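
Because the public score comes from only part of the test set, it is a noisy estimate of the final result. The simulation below illustrates this with entirely made-up labels and predictions. The 50/50 split and the 80% accuracy are arbitrary assumptions for illustration, not Kaggle's actual settings.

```python
import numpy as np

rng = np.random.default_rng(0)
n_test = 418  # size of the Titanic test set used in this lab

# Made-up labels and predictions (the real test labels are hidden)
true_labels = rng.integers(0, 2, size=n_test)
predictions = np.where(rng.random(n_test) < 0.8, true_labels, 1 - true_labels)

# Hypothetical split: roughly half the rows count publicly, the rest privately
is_public = rng.random(n_test) < 0.5

correct = predictions == true_labels
public_score = correct[is_public].mean()
private_score = correct[~is_public].mean()
print(f"Public: {public_score:.4f}  Private: {private_score:.4f}")
```

The two numbers usually differ a little even though the predictions are the same, which is one reason to trust cross-validation more than small movements on the public leaderboard.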

---

## Part 9: Improving Your Score

### Feature Engineering Ideas

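```python
import pandas as pd

# Work on a copy of the raw training data. This needs the real train.csv:
# the synthetic fallback from Part 4 has no 'Name' column.
data = train.copy()

# Family size: siblings/spouses + parents/children + the passenger
data['FamilySize'] = data['SibSp'] + data['Parch'] + 1
data['IsAlone'] = (data['FamilySize'] == 1).astype(int)

# Extract title (e.g. "Mr", "Mrs") from the name
data['Title'] = data['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

# Age bins; rows with a missing Age get a missing bin, so fill Age first
data['AgeBin'] = pd.cut(data['Age'], bins=[0, 12, 20, 40, 60, 100], labels=[0, 1, 2, 3, 4])

# Fare per person
data['FarePerPerson'] = data['Fare'] / data['FamilySize']
```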
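
Extracted titles tend to include a few values that occur only a handful of times, which gives a model little to learn from. One possible follow-up, sketched here, groups infrequent titles under a single label. `extract_title` and its `min_count` threshold are hypothetical, and the names are made up in the "Surname, Title. Given names" layout that the regular expression above expects.

```python
import pandas as pd


def extract_title(names, min_count=2):
    """Extract the title from each name and label infrequent ones as 'Rare'."""
    titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)
    counts = titles.value_counts()
    rare = counts[counts < min_count].index
    # Keep frequent titles, replace the rest with 'Rare'
    return titles.where(~titles.isin(rare), "Rare")


# Made-up names, for illustration only
names = pd.Series([
    "Smith, Mr. John",
    "Jones, Mr. Adam",
    "Brown, Mrs. Ann",
    "White, Mrs. Eve",
    "Grey, Dr. Sam",
])
print(extract_title(names).tolist())
```

The resulting column is still text, so it would need to be encoded as numbers (for example with `pd.get_dummies`) before being passed to a scikit-learn model.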

### Try Different Models

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# pip install xgboost
from xgboost import XGBClassifier
```
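
One way to compare these models is to run each through the same cross-validation and rank them by mean accuracy. The sketch below does that on stand-in data from `make_classification`; in the notebook you would pass `X_train_full` and `y_train_full` instead. It leaves out XGBoost to avoid the extra install, and it scales the inputs for logistic regression and the SVM, which is a common choice rather than something this lab prescribes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data with 7 features, like the preprocessed Titanic table
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=42)

candidates = {
    "Random Forest": RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "SVC": make_pipeline(StandardScaler(), SVC()),
}

# Same folds and metric for every model so the means are comparable
results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    results[name] = scores.mean()

for name, mean_score in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {mean_score:.4f}")
```

Small gaps between models are often within fold-to-fold noise, so it is worth looking at the spread of the fold scores before declaring a winner.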

### Hyperparameter Tuning

```python
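from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# 3 x 4 x 3 = 36 parameter combinations, each evaluated with 5-fold CV
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, scoring='accuracy')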
grid_search.fit(X_train_full, y_train_full)

print(f"Best params: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.4f}")
```

### Learn from Others

- Check the **Code** tab on Kaggle for public notebooks
- Read top solutions after competitions end
- Join the **Discussion** forum

---

## Summary

### Kaggle Workflow

1. **Find and join** a competition
2. **Download** and explore the data
3. **Preprocess** the data
4. **Train and validate** locally (use cross-validation!)
5. **Train final model** on all training data
6. **Create submission** file in the required format
7. **Submit** and check your score
8. **Iterate** — try new features, models, parameters

### Submission File Format

Always check the competition's required format! For Titanic:

```
PassengerId,Survived
892,0
893,1
894,0
...
```

### Tips

- **Start simple** — Get a baseline submission first
- **Validate locally** — Don't waste submissions on obvious bugs
- **Read the rules** — Understand evaluation metrics and limits
- **Learn from others** — Public notebooks are a goldmine

---

## Exercises

1. **Submit to Titanic:** Use this notebook to make your first submission

2. **Try House Prices:** Apply similar steps to the House Prices regression competition

3. **Feature engineering:** Add 2-3 new features and see if your score improves

4. **Model comparison:** Try at least 3 different models and compare their cross-validation scores

---

## Next Section

Continue to: **[Presenting Projects and Pitching](../04-presenting-projects/)**