Logging into Kaggle for the first time can be daunting. Our competitions often have large cash prizes, public leaderboards, and involve complex data. Nevertheless, we really think all data scientists can rapidly learn from machine learning competitions and meaningfully contribute to our community. To give you a clear understanding of how our platform works and a mental model of the type of learning you could do on Kaggle, we've created a Getting Started tutorial for the Titanic competition. It walks you through the initial steps required to get your first decent submission on the leaderboard. By the end of the tutorial, you'll also have a solid understanding of how to use Kaggle's online coding environment, where you'll have trained your own machine learning model.

So if this is your first time entering a Kaggle competition, regardless of whether you:
- have experience with handling large datasets,
- haven't done much coding,
- are newer to data science, or
- are relatively experienced (but are just unfamiliar with Kaggle's platform),

you're in the right place! 

# Part 1: Get started

In this section, you'll learn more about the competition and make your first submission. 

## Join the competition!

The first thing to do is to join the competition!  Open a new window with **[the competition page](https://www.kaggle.com/c/titanic)**, and click on the **"Join Competition"** button, if you haven't already.  (_If you see a "Submit Predictions" button instead of a "Join Competition" button, you have already joined the competition, and don't need to do so again._)

![](https://i.imgur.com/07cskyU.png)

This takes you to the rules acceptance page.  You must accept the competition rules in order to participate.  These rules govern how many submissions you can make per day, the maximum team size, and other competition-specific details.   Then, click on **"I Understand and Accept"** to indicate that you will abide by the competition rules.

## The challenge

The competition is simple: we want you to use the Titanic passenger data (name, age, price of ticket, etc.) to try to predict who survived and who died.

## The data

To take a look at the competition data, click on the **<a href="https://www.kaggle.com/c/titanic/data" target="_blank" rel="noopener noreferrer"><b>Data tab</b></a>** at the top of the competition page.  Then, scroll down to find the list of files.  
There are three files in the data: (1) **train.csv**, (2) **test.csv**, and (3) **gender_submission.csv**.

### (1) train.csv

**train.csv** contains the details of a subset of the passengers on board (891 passengers, to be exact -- where each passenger gets a different row in the table).  To investigate this data, click on the name of the file on the left of the screen.  Once you've done this, you can view all of the data in the window.  

![](https://i.imgur.com/cYsdt0n.png)

The values in the second column (**"Survived"**) can be used to determine whether each passenger survived or not: 
- if it's a "1", the passenger survived.
- if it's a "0", the passenger died.

For instance, the first passenger listed in **train.csv** is Mr. Owen Harris Braund.  He was 22 years old when he died on the Titanic.

### (2) test.csv

Using the patterns you find in **train.csv**, you have to predict whether the other 418 passengers on board (in **test.csv**) survived.  

Click on **test.csv** (on the left of the screen) to examine its contents.  Note that **test.csv** does not have a **"Survived"** column - this information is hidden from you, and how well you do at predicting these hidden values will determine how highly you score in the competition! 

### (3) gender_submission.csv

The **gender_submission.csv** file is provided as an example that shows how you should structure your predictions.  It predicts that all female passengers survived, and all male passengers died.  Your hypotheses regarding survival will probably be different, which will lead to a different submission file.  But, just like this file, your submission should have:
- a **"PassengerId"** column containing the IDs of each passenger from **test.csv**.
- a **"Survived"** column (that you will create!) with a "1" for the rows where you think the passenger survived, and a "0" where you predict that the passenger died.
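
As a rough illustration of that two-column layout, the sketch below builds a submission table from a tiny stand-in for **test.csv**. The three passengers and the names `sample_test` and `sample_submission` are invented for this example and are not part of the competition data; the rule applied is the same one **gender_submission.csv** uses.

```python
import pandas as pd

# Hypothetical stand-in for test.csv: invented rows, for illustration only.
sample_test = pd.DataFrame({
    "PassengerId": [1001, 1002, 1003],
    "Sex": ["male", "female", "female"],
})

# Same rule as gender_submission.csv: predict 1 for women, 0 for men.
sample_submission = pd.DataFrame({
    "PassengerId": sample_test["PassengerId"],
    "Survived": (sample_test["Sex"] == "female").astype(int),
})

print(sample_submission)
```

Whatever rule or model produces the predictions, the resulting table should keep exactly these two columns, with one row per passenger in **test.csv**.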

# Part 2: Your coding environment

In this section, you'll train your own machine learning model to improve your predictions.  _If you've never written code before or don't have any experience with machine learning, don't worry!  We don't assume any prior experience in this tutorial._

## The Notebook

The first thing to do is to create a Kaggle Notebook where you'll store all of your code.  You can use Kaggle Notebooks to get up and running with writing code quickly, without having to install anything on your computer.  (_If you are interested in deep learning, we also offer free GPU access!_) 

Begin by clicking on the **<a href="https://www.kaggle.com/c/titanic/kernels" target="_blank">Code tab</a>** on the competition page.  Then, click on **"New Notebook"**.

![](https://i.imgur.com/v2i82Xd.png)

Your notebook will take a few seconds to load.  In the top left corner, you can see the name of your notebook -- something like **"kernel2daed3cd79"**.

![](https://i.imgur.com/64ZFT1L.png)

You can edit the name by clicking on it.  Change it to something more descriptive, like **"Getting Started with Titanic"**.  

![](https://i.imgur.com/uwyvzXq.png)

## Your first lines of code

When you start a new notebook, it has two gray boxes for storing code.  We refer to these gray boxes as "code cells".

![](https://i.imgur.com/q9mwkZM.png)

The first code cell already has some code in it.  To run this code, put your cursor in the code cell.  (_If your cursor is in the right place, you'll notice a blue vertical line to the left of the gray box._)  Then, either hit the play button (which appears to the left of the blue line), or hit **[Shift] + [Enter]** on your keyboard.

If the code runs successfully, three lines of output are returned.  Below, you can see the same code that you just ran, along with the output that you should see in your notebook.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

This shows us where the competition data is stored, so that we can load the files into the notebook.  We'll do that next.

## Load the data

The second code cell in your notebook now appears below the three lines of output with the file locations.

![](https://i.imgur.com/OQBax9n.png)

Type the two lines of code below into your second code cell.  Then, once you're done, either click on the blue play button, or hit **[Shift] + [Enter]**.  

In [2]:
train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
train_data.head()

Your code should display the first five rows of the table in **train.csv**.  It's very important that you see this output **in your notebook** before proceeding with the tutorial!
> _If your code does not produce this output_, double-check that your code is identical to the two lines above.  And, make sure your cursor is in the code cell before hitting **[Shift] + [Enter]**.

The code that you've just written is in the Python programming language. It uses a Python "module" called **pandas** (abbreviated as `pd`) to load the table from the **train.csv** file into the notebook. To do this, we needed to plug in the location of the file (which we saw was `/kaggle/input/titanic/train.csv`).  
> If you're not already familiar with Python (and pandas), the code may not make sense to you yet -- but don't worry!  The point of this tutorial is to (quickly!) make your first submission to the competition.  At the end of the tutorial, we suggest resources to continue your learning.
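
To get a feel for what `pd.read_csv` and `head()` do without needing the competition files, here is a minimal sketch that reads a few lines of made-up CSV text from memory. The rows and the names `csv_text` and `toy_data` are assumptions for illustration only; in your notebook you would pass the file location instead.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file on disk (not real competition data).
csv_text = """PassengerId,Survived,Sex
1,0,male
2,1,female
3,1,female
"""

# pd.read_csv accepts a file path or a file-like object; here we pass the latter.
toy_data = pd.read_csv(io.StringIO(csv_text))

# head() returns the first rows of the table (up to five by default).
print(toy_data.head())
```

The first line of the CSV text becomes the column names, and each following line becomes one row of the table.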

At this point, you should have at least three code cells in your notebook.  
![](https://i.imgur.com/ReLhYca.png)

Copy the code below into the third code cell of your notebook to load the contents of the **test.csv** file.  Don't forget to click on the play button (or hit **[Shift] + [Enter]**)!

In [3]:
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")
test_data.head()

As before, make sure that you see the output above in your notebook before continuing.  

Once all of the code runs successfully, all of the data (in **train.csv** and **test.csv**) is loaded in the notebook.  (_The code above shows only the first 5 rows of each table, but all of the data is there -- all 891 rows of **train.csv** and all 418 rows of **test.csv**!_)
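
One way to convince yourself that `head()` is only a preview is to compare it with the table's `shape`. The sketch below uses an invented 8-row table named `toy` as a stand-in; in your notebook, `train_data.shape` and `test_data.shape` should report the 891 and 418 rows mentioned above.

```python
import pandas as pd

# Toy table with 8 rows (invented values), standing in for train_data.
toy = pd.DataFrame({
    "PassengerId": range(1, 9),
    "Survived": [0, 1, 1, 1, 0, 0, 0, 1],
})

preview = toy.head()          # only the first 5 rows
n_rows, n_cols = toy.shape    # size of the full table

print(len(preview), "rows shown;", n_rows, "rows loaded")
```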

# Part 3: Your first submission

Remember our goal: we want to find patterns in **train.csv** that help us predict whether the passengers in **test.csv** survived.

It might initially feel overwhelming to look for patterns, when there's so much data to sort through.  So, we'll start simple.

## Explore a pattern

Remember that the sample submission file in **gender_submission.csv** assumes that all female passengers survived (and all male passengers died).  

Is this a reasonable first guess?  We'll check if this pattern holds true in the data (in **train.csv**).

Copy the code below into a new code cell.  Then, run the cell.

In [4]:
women = train_data.loc[train_data.Sex == 'female']["Survived"]
rate_women = sum(women)/len(women)

print("% of women who survived:", rate_women)

Before moving on, make sure that your code returns the output above.  The code above calculates the percentage of female passengers (in **train.csv**) who survived.

Then, run the code below in another code cell:

In [5]:
men = train_data.loc[train_data.Sex == 'male']["Survived"]
rate_men = sum(men)/len(men)

print("% of men who survived:", rate_men)

The code above calculates the percentage of male passengers (in **train.csv**) who survived.

From this you can see that almost 75% of the women on board survived, whereas only 19% of the men lived to tell about it. Since gender seems to be such a strong indicator of survival, the submission file in **gender_submission.csv** is not a bad first guess!
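
The two cells above can also be written as a single grouped calculation. The sketch below is one possible way to do it, run on an invented nine-passenger sample (the name `sample` and its values are made up, so the rates differ from the real ones). Note that `sum(...)/len(...)` yields a fraction between 0 and 1, so a printed value such as 0.74 corresponds to 74%.

```python
import pandas as pd

# Invented mini-sample with the same two columns used in the cells above.
sample = pd.DataFrame({
    "Sex": ["female", "female", "female", "female",
            "male", "male", "male", "male", "male"],
    "Survived": [1, 1, 1, 0, 0, 0, 0, 0, 1],
})

# The mean of a 0/1 column is the fraction of 1s, i.e. the survival rate.
rates = sample.groupby("Sex")["Survived"].mean()

for sex, rate in rates.items():
    print(f"{sex}: {rate:.0%} survived")
```

In your notebook, replacing `sample` with `train_data` should give the same rates as the two separate cells.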

But at the end of the day, this gender-based submission bases its predictions on only a single column.  As you can imagine, by considering multiple columns, we can discover more complex patterns that can potentially yield better-informed predictions.  Since it is quite difficult to consider several columns at once (or, it would take a long time to consider all possible patterns in many different columns simultaneously), we'll use machine learning to automate this for us.

## Your first machine learning model

We'll build what's known as a **random forest model**.  This model is constructed of several "trees" (there are three trees in the picture below, but we'll construct 100!) that will individually consider each passenger's data and vote on whether the individual survived.  Then, the random forest model makes a democratic decision: the outcome with the most votes wins!

![](https://i.imgur.com/AC9Bq63.png)
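
The voting idea can be sketched in a few lines of plain Python. The helper `majority_vote` and the three votes are hypothetical, chosen to mirror the three trees in the picture. This is only a picture of the idea: scikit-learn's `RandomForestClassifier` combines its trees by averaging their predicted class probabilities rather than counting hard votes.

```python
from collections import Counter

def majority_vote(votes):
    """Return the label that received the most votes."""
    return Counter(votes).most_common(1)[0][0]

# Hypothetical votes from three trees for one passenger (1 = survived, 0 = died).
tree_votes = [1, 0, 1]
print(majority_vote(tree_votes))
```

With two of the three trees voting "1", the combined prediction for this passenger is that they survived.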

The code cell below looks for patterns in four different columns (**"Pclass"**, **"Sex"**, **"SibSp"**, and **"Parch"**) of the data.  It constructs the trees in the random forest model based on patterns in the **train.csv** file, before generating predictions for the passengers in **test.csv**.  The code also saves these new predictions in a CSV file **submission.csv**.

Copy this code into your notebook, and run it in a new code cell.

In [6]:
from sklearn.ensemble import RandomForestClassifier

y = train_data["Survived"]

features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)

output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")

Make sure that your notebook outputs the message above (`Your submission was successfully saved!`) before moving on.
> Again, don't worry if this code doesn't make sense to you!  For now, we'll focus on how to generate and submit predictions.
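
If you are curious about one step in that cell, `pd.get_dummies` is what turns the text values in the **"Sex"** column into numeric indicator columns that the model can use. Here is a minimal sketch on three invented rows (the name `toy_features` and its values are assumptions for illustration):

```python
import pandas as pd

# Invented rows using two of the tutorial's feature columns.
toy_features = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Sex": ["male", "female", "female"],
})

# get_dummies leaves numeric columns alone and replaces each text column
# with one indicator column per category.
encoded = pd.get_dummies(toy_features)

print(encoded.columns.tolist())
```

The numeric **"Pclass"** column passes through unchanged, while **"Sex"** is split into `Sex_female` and `Sex_male`.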

Once you're ready, click on the **"Save Version"** button in the top right corner of your notebook.  This will generate a pop-up window.  
- Ensure that the **"Save and Run All"** option is selected, and then click on the **"Save"** button.
- This generates a window in the bottom left corner of the notebook.  After it has finished running, click on the number to the right of the **"Save Version"** button.  This pulls up a list of versions on the right of the screen.  Click on the ellipsis **(...)** to the right of the most recent version, and select **Open in Viewer**.  
- Click on the **Data** tab on the top of the screen.  Then, click on the **"Submit"** button to submit your results.

![](https://i.imgur.com/1ocaUl4.png)

Congratulations on making your first submission to a Kaggle competition!  Within ten minutes, you should receive a message providing your spot on the leaderboard.  Great work!

# Part 4: Learn more!

If you're interested in learning more, we strongly suggest our (3-hour) **[Intro to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-learning)** course, which will help you fully understand all of the code that we've presented here.  You'll also know enough to generate even better predictions!
