# Overview of what Kaggle really is

Kaggle has a YouTube channel; the playlists have the good stuff.  
https://www.youtube.com/user/kaggledotcom/playlists?nohtml5=False

Get up to speed on Kaggle by hearing about it from their CEO:  
https://www.youtube.com/watch?v=0PMHNc_2RrY#t=458.345272 

If you win a competition, there's serious prize money involved.
https://www.kaggle.com/wiki/FormingATeam 

### Prerequisites  

##### Preferred set up

Anaconda (for Python 2.7) for the code we'll do locally.  
https://www.continuum.io/downloads 

Kaggle's own Python version is covered below, using a script that runs on their end.

Kaggle Scripts: more on this later. The script mentioned above may also be useful on your own.  
https://www.kaggle.com/c/titanic/forums/t/13390/introducing-kaggle-scripts

Kaggle Forums: lots of Q&A, code review, etc.  
https://www.kaggle.com/forums 

From the Titanic tutorial, here are the relevant packages you'll use either on your own or in Kaggle Scripts:  
"Numpy, Scipy, Pandas, matplotlib and csv package. In order to check whether you have these, just go to your python command line and type  import numpy"

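The same check can be done for all of the listed packages at once. This is a minimal sketch; `check_packages` and `PACKAGES` are hypothetical names introduced here, and the code is written to run on both Python 2.7 and Python 3.

```python
import importlib

# Packages named in the Titanic tutorial quote
PACKAGES = ["numpy", "scipy", "pandas", "matplotlib", "csv"]

def check_packages(names):
    """Return a dict mapping each package name to True if it can be imported."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status
```

Calling `check_packages(PACKAGES)` returns a dictionary; any package mapped to `False` still needs to be installed.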
### Getting started with our Titanic Problem

Titanic problem  
https://www.kaggle.com/c/titanic  
+ check out the sidebar titled "Dashboard"
+ explore their visualizations https://www.kaggle.com/c/titanic/prospector#175 

How to start with python (standalone tutorial for submission; concepts shown first in excel)  
https://www.kaggle.com/c/titanic/details/getting-started-with-python

Tutorial on using pandas in python (not a standalone tutorial to create submission)  
https://www.kaggle.com/c/titanic/details/getting-started-with-python-ii 

Getting started with random forests (standalone tutorial for submission relying on pandas tutorial)
https://www.kaggle.com/c/titanic/details/getting-started-with-random-forests 

### Old DSI code may help you out
Our fall 2015 workshops have an example that's relevant. 

https://github.com/dsiufl/Python-Workshops 

# Submission ID

Check my test submission on the leaderboard here  
https://www.kaggle.com/c/titanic/leaderboard?submissionId=2818820  

Ctrl+F your username on the leaderboard to find yourself quickly; it's a long list.

# Let's start creating our submission
We are trying to create a CSV file that holds our predictions for whether or not each passenger survived. 
To stay organized, we want the IPython notebook to work with a few important directories:  
+ data (simply download from the competition page)
+ predictions (where we will write out predictions after training a model)
+ scripts (where you can store .py scripts if you want to have significantly different versions, examples, other users' scripts, etc)
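If the directories do not exist yet, they could be created from Python. This is only a sketch: `make_project_dirs` is a hypothetical helper, and it assumes the three directories sit next to the notebook.

```python
import os

# Directory names taken from the list above
PROJECT_DIRS = ["data", "predictions", "scripts"]

def make_project_dirs(root="."):
    """Create the project directories under root and return the ones created."""
    created = []
    for name in PROJECT_DIRS:
        path = os.path.join(root, name)
        if not os.path.isdir(path):
            os.makedirs(path)
            created.append(path)
    return created
```

Running `make_project_dirs(".")` from the notebook's folder leaves existing directories untouched, so it is safe to call more than once.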

In [1]:
#imports
import csv as csv
import numpy as np

In [15]:
#things that could be easier in pandas
csv_file_object = csv.reader(open('./data/train.csv', 'rb'))

# The next() command just skips the first line which is a header
header = csv_file_object.next()

#reading into a csv object
data=[]
for row in csv_file_object:
    data.append(row)
data = np.array(data) 

In [13]:
#don't forget how python indices work
print "Just print the data so I can see if it loaded"
print data

print "Just print the first row so I can see columns"
print data[0]

print "Just print the last row so I know I have everything"
print data[-1] 

Just print the data so I can see if it loaded
[['1' '0' '3' ..., '7.25' '' 'S']
 ['2' '1' '1' ..., '71.2833' 'C85' 'C']
 ['3' '1' '3' ..., '7.925' '' 'S']
 ..., 
 ['889' '0' '3' ..., '23.45' '' 'S']
 ['890' '1' '1' ..., '30' 'C148' 'C']
 ['891' '0' '3' ..., '7.75' '' 'Q']]
Just print the first row so I can see columns
['1' '0' '3' 'Braund, Mr. Owen Harris' 'male' '22' '1' '0' 'A/5 21171'
 '7.25' '' 'S']
Just print the last row so I know I have everything
['891' '0' '3' 'Dooley, Mr. Patrick' 'male' '32' '0' '0' '370376' '7.75' ''
 'Q']


In [16]:
# let's immediately make a model

survived = data[0::,1].astype(float)
number_passengers = np.size(survived)
number_survived = np.sum(survived)
proportion_survivors = number_survived / number_passengers

# we actually have a usable model now to make a submission
print proportion_survivors

0.383838383838
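
One way to read that single number as a model: fewer than half of the passengers survived, so the best constant guess is "did not survive" for everyone. The sketch below assumes a 0.5 cutoff, and `constant_prediction` is a hypothetical helper, not part of the tutorial.

```python
def constant_prediction(proportion_survivors, threshold=0.5):
    """Return 1 if the survival rate is at least the threshold, otherwise 0."""
    return 1 if proportion_survivors >= threshold else 0

# Survival proportion printed by the cell above
proportion_survivors = 0.383838383838

prediction = constant_prediction(proportion_survivors)

# Share of training passengers that the constant guess gets right
accuracy = proportion_survivors if prediction == 1 else 1 - proportion_survivors
```

On the training data this guess is right for about 62% of passengers, which is the bar any later model has to beat.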


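We'll use a NumPy boolean mask here (an array of True/False values, one per row) to quickly split the data into female and male passengers while creating a gender model.  

This is a different idea from "data masking" in the security sense, which is what these two links describe:  
http://searchsecurity.techtarget.com/definition/data-masking  
https://en.wikipedia.org/wiki/Data_masking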
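
As a minimal sketch of what a boolean mask does in NumPy, the example below uses a toy array. The four rows are made up for illustration and are not taken from train.csv; only the two columns needed here are included.

```python
import numpy as np

# Toy stand-in for two columns of the training array: [Survived, Sex]
toy = np.array([["1", "female"],
                ["0", "male"],
                ["1", "female"],
                ["0", "female"]])

# Comparing a column to a value gives one True/False entry per row
female_mask = toy[:, 1] == "female"

# Indexing with the mask keeps only the rows where it is True
female_survived = toy[female_mask, 0].astype(float)

# Survival rate among women: survivors divided by the number of women,
# not by the length of the mask (which is the total number of rows)
female_rate = female_survived.sum() / female_survived.size
```

Here the mask has four entries but selects only three rows, so the rate is 2 out of 3.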

In [17]:
#lets see if women or men were more likely to survive
female_data = data[0::,4] == "female"
male_data = data[0::,4] != "female"

In [7]:
# make it numerical
female_onboard = data[female_data,1].astype(float)
male_onboard = data[male_data,1].astype(float)

In [8]:
# compare likelihood of survival
# in doing so we create a SIMPLE gender model
# divide by the number of women (or men), not by the length of the mask,
# which is the total number of passengers
prop_female_survived = np.sum(female_onboard) / np.size(female_onboard)
prop_male_survived = np.sum(male_onboard) / np.size(male_onboard)

print prop_female_survived, prop_male_survived
# who is more likely to survive?

0.742038216561 0.188908145581


### Preparing for submission  
Now we get ready to apply that simple gender model to the test data.
We'll have to work with test.csv to do so.
The predictions have to go somewhere,
so we'll also open a file object for a new CSV; opening it in write mode creates the (empty) file.

In [18]:
# get test data and move past header
test_file = open('./data/test.csv', 'rb')
test_file_object = csv.reader(test_file)
header = test_file_object.next()

In [19]:
# be aware of what columns the data came with
# this becomes useful as you get into more advanced Feature Engineering
print header

['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']
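
Because test.csv has no Survived column, its columns sit one position earlier than in train.csv. One way to avoid hard-coding positions is to look them up by name. This is a sketch using the header printed above; `column_index` is a name introduced here for illustration.

```python
# Header row as printed by the cell above
header = ['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
          'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']

# Map each column name to its position so rows can be read by name
column_index = {name: position for position, name in enumerate(header)}

sex_column = column_index['Sex']
id_column = column_index['PassengerId']
```

With this header, `sex_column` is 3 and `id_column` is 0.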


In [20]:
#create predictions file pointer
prediction_file = open('./predictions/mygendermodel.csv', 'wb')
prediction_file_object = csv.writer(prediction_file)

We'll apply the model as we read the test data per row,
first seeing if the datum is male or female,
then writing our prediction for that datum to a new file.
Writing a 1 indicates survival; what does a 0 indicate?
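
The rule itself can be written as a small function and tried on a couple of rows before touching any files. This is a sketch: `predict_survival` and `predict_rows` are hypothetical helpers, and the two sample rows are made up, using only the first four test.csv columns.

```python
import csv

def predict_survival(sex):
    """Gender model: predict '1' (survived) for women, '0' otherwise."""
    return '1' if sex == 'female' else '0'

def predict_rows(rows, sex_column=3, id_column=0):
    """Return [PassengerId, prediction] pairs for rows laid out like test.csv."""
    return [[row[id_column], predict_survival(row[sex_column])] for row in rows]

# Made-up lines in the test.csv column order, for illustration only
sample_lines = ["PassengerId,Pclass,Name,Sex",
                "9001,3,Example One,male",
                "9002,1,Example Two,female"]

reader = csv.reader(sample_lines)
next(reader)  # skip the header row
predictions = predict_rows(reader)
```

Keeping the rule in its own function makes it easy to swap in a different model later without changing the file-handling code.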

### Writing our predictions to the submission file

In [12]:
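# header row that Kaggle expects in a submission
prediction_file_object.writerow(["PassengerId", "Survived"])
for row in test_file_object:
    # test.csv has no Survived column, so Sex is at index 3 (not 4 as in train.csv)
    if row[3] == 'female':
        prediction_file_object.writerow([row[0],'1'])
    else:
        prediction_file_object.writerow([row[0],'0'])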
test_file.close()
prediction_file.close()
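
Once both files are closed, the output can be sanity-checked before uploading. The sketch below is not part of the tutorial: `check_submission` is a hypothetical helper that only checks the header, the 0/1 values, and optionally the row count.

```python
import csv

def check_submission(path, expected_rows=None):
    """Return a list of problems found in a submission CSV (empty if none)."""
    problems = []
    with open(path) as handle:
        rows = list(csv.reader(handle))
    if not rows or rows[0] != ["PassengerId", "Survived"]:
        problems.append("header should be PassengerId,Survived")
    body = rows[1:]
    # Line numbers start at 2 because line 1 is the header
    for line_number, row in enumerate(body, start=2):
        if len(row) != 2 or row[1] not in ("0", "1"):
            problems.append("bad row at line %d: %r" % (line_number, row))
    if expected_rows is not None and len(body) != expected_rows:
        problems.append("expected %d rows, found %d" % (expected_rows, len(body)))
    return problems
```

For example, `check_submission('./predictions/mygendermodel.csv', expected_rows=418)` should return an empty list, assuming the standard Titanic test.csv with 418 passengers.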

That's it! You've populated a csv that you can submit to Kaggle.
Because this is a script, we can reuse it.
This facilitates teamwork, replication, using new training data, etc.
Let's see how we can do this with pandas in II_getstarted.py

# Where to next?  

### Working with other users' scripts (and using one to see our python version)  

You'll want to be comfortable running code directly on Kaggle to test your work before submitting, even though you likely won't be developing on their interface. This is why it is useful for a team leader to consider using a common repository. This script displays the packages available for our use on Kaggle: 

https://www.kaggle.com/burriswj/titanic/installed-python-packages-for-dsi/versions 
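
To compare a local setup with what Kaggle runs, the interpreter version can be printed the same way in both places. This is a minimal sketch, not the linked script; `describe_python` is a hypothetical helper.

```python
import platform
import sys

def describe_python():
    """Return the interpreter's major.minor version and implementation name."""
    version = "%d.%d" % (sys.version_info[0], sys.version_info[1])
    return version, platform.python_implementation()

version, implementation = describe_python()
print("Python %s (%s)" % (version, implementation))
```

If the two environments report different versions, code such as `print` statements or `csv_file_object.next()` may behave differently between them.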

### Why do we care about sharing scripts?

You might find something you like; teams often combine (ensemble) methods from several scripts. If you're familiar with forking from git/GitHub, you can do the same thing here; explore what's available when viewing code.

You can see where I've made changes here after I forked it:
https://www.kaggle.com/scripts/diff/1605/200056

### Kaggle InClass vs. large-scale competitions

https://www.kaggle.com/solutions/competitions 

https://inclass.kaggle.com/Competitions 
