[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gdsaxton/GDAN5400/blob/main/Week%208%20Notebooks/GDAN%205400%20-%20Week%208%20Notebooks%20%28I%29%20-%20Task%201%20-%20Join%20the%20Competition.ipynb)

This notebook provides a mini-tutorial on joining Kaggle competitions.

In [None]:
%%time
import datetime
print("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')

# Load Packages and Set Working Directory
First, import the Python packages we need. We will use the <a href="https://pandas.pydata.org/">Python Data Analysis Library</a>, or <i>PANDAS</i>, extensively for data manipulation in this and future tutorials.

In [None]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series

<br>
PANDAS allows you to set various options for, among other things, inspecting the data. I like to be able to see all of the columns, so I typically include the following lines at the top of all my notebooks.

In [None]:
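# Display options: https://pandas.pydata.org/docs/user_guide/options.html
pd.set_option('display.max_columns', None)    # show every column when printing a DataFrame
pd.set_option('display.max_colwidth', 250)    # show up to 250 characters per cell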
pd.set_option('display.max_info_columns', 500)
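
If you want to confirm that an option took effect, or change one only temporarily, the following minimal sketch may help. It uses `pd.get_option` and `pd.option_context`; the value `5` is an arbitrary choice for illustration.

```python
import pandas as pd

# Set the option globally, as in the cell above
pd.set_option("display.max_columns", None)

# Read the current value back to confirm it was applied
print(pd.get_option("display.max_columns"))

# Temporarily override the option; the previous value is restored
# automatically when the with-block ends
with pd.option_context("display.max_columns", 5):
    inside = pd.get_option("display.max_columns")

after = pd.get_option("display.max_columns")
print(inside, after)
```

`pd.reset_option("display.max_columns")` returns an option to its pandas default if you no longer want the global change.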

# Introduction to Kaggle Competitions

In coding assignment #4, we are switching from the *roof insurance claim dataset* used in previous assignments to a dataset available on *Kaggle*, an online platform for data science and machine learning that provides datasets, competitions, collaborative notebooks, and learning resources.

In the assignment, you will:

1. **Join the Kaggle Competition** – Sign up for the Titanic competition on Kaggle and download the dataset.
2. **Load the Data** – Read in the `train.csv` and `test.csv` files using PANDAS.
3. **Explore the Data** – Use methods like `df.info()`, histograms, and plots to understand the dataset.
4. **Handle Missing Data** – Identify and fill missing values appropriately.
5. **Prepare Data for Modeling** – Convert categorical variables to numeric, select relevant features, and split the data into training and validation sets.
6. **Train a Logistic Regression Model** – Fit a logistic regression model to predict survival.
7. **Evaluate the Model** – Measure accuracy on the validation set.
8. **Make Predictions on the Test Set** – Apply the trained model to generate survival predictions for Kaggle's test data.
9. **Create a Submission File** – Format the predictions into a CSV file.
10. **Upload to Kaggle** – Submit your predictions and check your score.


These exercises will help strengthen your ability to explore, preprocess, and model real-world datasets using machine learning. You will gain hands-on experience with data cleaning, feature engineering, and predictive modeling, all while working with a classic dataset in a competitive Kaggle environment.
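
To make the data-handling steps more concrete, here is a rough sketch of steps 2, 4, 5, and 9 using PANDAS only. It is not the assignment solution: the rows are a hypothetical stand-in for `train.csv` and `test.csv`, the column names (`PassengerId`, `Survived`, `Pclass`, `Sex`, `Age`) are assumed to match the competition files, median imputation is just one possible choice, and the predictions are a constant placeholder rather than the output of a trained model.

```python
import io
import pandas as pd

# Hypothetical stand-in rows; in the assignment you would call
# pd.read_csv("train.csv") and pd.read_csv("test.csv") instead.
train_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Sex,Age\n"
    "1,0,3,male,22\n"
    "2,1,1,female,38\n"
    "3,1,3,female,\n"
    "4,0,2,male,35\n"
)
test_csv = io.StringIO(
    "PassengerId,Pclass,Sex,Age\n"
    "5,3,male,\n"
    "6,1,female,47\n"
)

# Step 2: load the data
train = pd.read_csv(train_csv)
test = pd.read_csv(test_csv)

# Step 4: count missing ages, then fill them with the training-set median
missing_before = train["Age"].isna().sum()
age_median = train["Age"].median()
train["Age"] = train["Age"].fillna(age_median)
test["Age"] = test["Age"].fillna(age_median)

# Step 5: convert a categorical column to numeric codes
sex_codes = {"male": 0, "female": 1}
train["Sex"] = train["Sex"].map(sex_codes)
test["Sex"] = test["Sex"].map(sex_codes)

# Step 9: build a submission table with one prediction per test passenger.
# The constant 0 is a placeholder, not a model prediction.
submission = test[["PassengerId"]].copy()
submission["Survived"] = 0
print(submission)
```

In the real workflow, the placeholder column would be replaced by your model's predictions, and `submission.to_csv("submission.csv", index=False)` would write the file to upload (the file name here is an assumption).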

---

# What is Kaggle?

Kaggle is an online platform for data science and machine learning where users can:
- Access **datasets** for analysis and modeling.
- Participate in **competitions** to solve real-world problems.
- Share and explore **notebooks** (Python, R) created by other users.
- Learn from **courses** on machine learning, deep learning, and AI.
- Collaborate with a global community of data scientists.

---

### Key Features
- **Kaggle Datasets** – A large collection of public datasets for various domains.  
- **Kaggle Competitions** – Data science challenges with prize money and rankings.  
- **Kaggle Notebooks** – Cloud-based Jupyter notebooks for coding and sharing work.  
- **Kaggle Discussions** – Forums to interact with the community and learn from experts.  
- **Kaggle Courses** – Free courses on Python, ML, deep learning, and more.

---

### Why Use Kaggle?
- **Learn & Practice** – Beginner-friendly platform for hands-on machine learning.  
- **Compete & Improve** – Solve complex problems and benchmark against top data scientists.  
- **Collaborate & Share** – Work with teams, explore notebooks, and exchange ideas.  
- **Access Free Compute** – Use Kaggle’s free GPUs/TPUs for deep learning projects.  

**Explore Kaggle:** [https://www.kaggle.com](https://www.kaggle.com)


---

### The Titanic Challenge

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the RMS Titanic, widely considered “unsinkable,” sank after colliding with an iceberg. There were not enough lifeboats for everyone on board, and 1502 of the 2224 passengers and crew died.
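
As a quick back-of-the-envelope check, the figures quoted above imply the overall death and survival rates computed below. This is only arithmetic on those two numbers; the rates in the competition's own data files may differ, since those files cover only part of the people on board.

```python
# Figures quoted in the text above
total_onboard = 2224
deaths = 1502

survivors = total_onboard - deaths
death_rate = deaths / total_onboard
survival_rate = survivors / total_onboard

print(f"Survivors: {survivors}")
print(f"Death rate: {death_rate:.1%}")
print(f"Survival rate: {survival_rate:.1%}")
```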

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, you are asked to build a predictive model that answers the question “What sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

`The competition is simple: we want you to use the Titanic passenger data (name, age, price of ticket, etc.) to try to predict who will survive and who will die.`

# First Step: Join the Challenge

The first thing to do is to join the competition!  Open a new window with **[the competition page](https://www.kaggle.com/c/titanic)**, and click on the **"Join Competition"** button, if you haven't already.  (_If you see a "Submit Predictions" button instead of a "Join Competition" button, you have already joined the competition, and don't need to do so again._)

![](https://i.imgur.com/07cskyU.png)

This takes you to the rules acceptance page. You must accept the competition rules in order to participate. These rules govern how many submissions you can make per day, the maximum team size, and other competition-specific details. Once you have read them, click on **"I Understand and Accept"** to indicate that you will abide by the competition rules.

In the first task of the assignment I ask you to:
- Enter your **Kaggle display name** in the input box below.  
- To help me find you on the leaderboard, please add **'GDAN5400'** to your display name (e.g., *"Gregory Saxton - GDAN5400"*).  
- **Important:** When running the code, you must enter a name and press **Enter** to proceed.  

In [None]:
kaggle_displayname = input("Enter your Kaggle Display Name: ")
print(f"Your Kaggle name is: {kaggle_displayname}")
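
Before moving on, you may want to check that the name you entered includes the course tag. The helper below is a hypothetical sketch (the function `has_course_tag` is not part of the assignment); it does a case-insensitive check on any string, such as the `kaggle_displayname` value captured above.

```python
COURSE_TAG = "GDAN5400"

def has_course_tag(display_name: str, tag: str = COURSE_TAG) -> bool:
    """Return True if the display name contains the course tag (case-insensitive)."""
    return tag.lower() in display_name.strip().lower()

# Example with a literal string; in the notebook you could pass kaggle_displayname
example_name = "Gregory Saxton - GDAN5400"
if has_course_tag(example_name):
    print("Display name includes the course tag.")
else:
    print(f"Please add '{COURSE_TAG}' to your Kaggle display name.")
```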