# Baseline framework. Step 1. Quickstart for getting repository and data.

This notebook provides an example of building a baseline for an ML competition.

The Titanic competition on Kaggle is used as the example: https://www.kaggle.com/c/titanic

## Establish repository

First, let's create a folder structure that suits our needs. A good reference layout can be found at https://github.com/drivendata/cookiecutter-data-science

Based on that example, a simpler folder structure can be used for ML competitions, and for this tutorial in particular. For example:

```
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
└── submissions        <- Folder for submission files
```
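The notebook naming convention described in the tree above can be sketched in Python. This is a minimal illustration, not part of the original tutorial: the helper `notebook_name` and the pattern `NOTEBOOK_NAME` are hypothetical names, and the regular expression is only one possible reading of "a number, the creator's initials, and a short `-` delimited description".

```python
import re

# One possible pattern for names such as "1.0-jqp-initial-data-exploration":
# a version-like number, lowercase initials, then a hyphen-delimited description.
NOTEBOOK_NAME = re.compile(r"^\d+(\.\d+)?-[a-z]+-[a-z0-9]+(-[a-z0-9]+)*$")


def notebook_name(number, initials, description):
    """Build a notebook name (without the .ipynb extension) from its parts."""
    slug = "-".join(description.lower().split())
    name = f"{number}-{initials.lower()}-{slug}"
    if not NOTEBOOK_NAME.match(name):
        raise ValueError(f"name does not follow the convention: {name!r}")
    return name


print(notebook_name("1.0", "jqp", "initial data exploration"))
# prints: 1.0-jqp-initial-data-exploration
```

Keeping the number first means that a plain alphabetical listing of `notebooks/` also shows the order in which the notebooks are meant to be read.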

Let's create this structure in the cell below. If you want, you can go further and adopt the more extensive folder structure from the link above.

In [1]:
%cd ~/Git/dmia/DMIA_Sport_2019_Spring_dev/seminars/baselines/

/Users/aguschin/Git/dmia/DMIA_Sport_2019_Spring_dev/seminars/baselines


In [2]:
%%bash

export FOLDER='kaggle_titanic'
export LC_ALL=en_US.UTF-8

# create folder and content (note: this deletes any existing $FOLDER first)
rm -rf "$FOLDER"
mkdir "$FOLDER"
cd "$FOLDER"
mkdir -p {data/{external,processed,raw},models,notebooks,reports/figures,submissions}
touch {models,notebooks,reports/figures,submissions}/.gitkeep
touch README.md requirements.txt
wget https://raw.githubusercontent.com/drivendata/cookiecutter-data-science/master/%7B%7B%20cookiecutter.repo_name%20%7D%7D/.gitignore -nv
# pip freeze > requirements.txt
# conda env export > conda.yaml
# pipreqs .
# cp Quickstart.ipynb $FOLDER ???
tree .

# initialise repository
# git lfs usage is recommended https://git-lfs.github.com
git init
git lfs install
git lfs track "*.csv"
git lfs track "*.csv.gz"
git add .gitignore .gitattributes
git add *
git commit -m "create repository"
git status
# git remote add origin https://github.com/username/reponame.git
# git push -u origin master

.
├── README.md
├── data
│   ├── external
│   ├── processed
│   └── raw
├── models
├── notebooks
├── reports
│   └── figures
├── requirements.txt
└── submissions

9 directories, 2 files
Initialized empty Git repository in /Users/aguschin/Git/dmia/DMIA_Sport_2019_Spring_dev/seminars/baselines/kaggle_titanic/.git/
Updated git hooks.
Git LFS initialized.
Tracking "*.csv"
Tracking "*.csv.gz"
[master (root-commit) cc53cd9] create repository
 8 files changed, 91 insertions(+)
 create mode 100644 .gitattributes
 create mode 100644 .gitignore
 create mode 100644 README.md
 create mode 100644 models/.gitkeep
 create mode 100644 notebooks/.gitkeep
 create mode 100644 reports/figures/.gitkeep
 create mode 100644 requirements.txt
 create mode 100644 submissions/.gitkeep
On branch master
nothing to commit, working tree clean


2019-10-19 17:02:09 URL:https://raw.githubusercontent.com/drivendata/cookiecutter-data-science/master/%7B%7B%20cookiecutter.repo_name%20%7D%7D/.gitignore [1003/1003] -> ".gitignore" [1]
The following paths are ignored by one of your .gitignore files:
data
Use -f if you really want to add them.
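For readers who prefer to stay in Python, the folder-creation part of the cell above could be sketched roughly as follows. This is an illustrative equivalent rather than the tutorial's own method: the function name `create_structure` is hypothetical, it does not delete an existing folder, and it leaves out the `.gitignore` download and the git commands. As in the bash cell, no `.gitkeep` files are placed under `data`, because the output above shows that the downloaded `.gitignore` excludes that folder (which is also why `git add *` prints the "paths are ignored" message).

```python
import tempfile
from pathlib import Path

# Directories from the layout above. The data/* folders get no .gitkeep,
# matching the bash cell, since .gitignore excludes the data folder.
DATA_DIRS = ["data/external", "data/processed", "data/raw"]
TRACKED_DIRS = ["models", "notebooks", "reports/figures", "submissions"]


def create_structure(root):
    """Create the competition folder layout under `root` and return its path."""
    base = Path(root)
    for rel in DATA_DIRS + TRACKED_DIRS:
        (base / rel).mkdir(parents=True, exist_ok=True)
    for rel in TRACKED_DIRS:
        (base / rel / ".gitkeep").touch()
    for name in ("README.md", "requirements.txt"):
        (base / name).touch()
    return base


# Demonstrate in a temporary directory so nothing in the working tree is touched.
with tempfile.TemporaryDirectory() as tmp:
    project = create_structure(Path(tmp) / "kaggle_titanic")
    for path in sorted(project.rglob("*")):
        print(path.relative_to(project))
```

Unlike `rm -rf` followed by `mkdir`, this version is safe to run repeatedly on an existing project, because `exist_ok=True` and `touch()` leave existing content in place.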


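## Clone existing repo

If the repository already exists on a remote, clone it instead of creating it from scratch (replace `username/reponame` with your own repository):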

In [None]:
%%bash

git clone https://github.com/username/reponame.git

## Download data

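The data can be downloaded with the official Kaggle API command-line client: https://github.com/Kaggle/kaggle-api. The client has to be installed and set up with an API token before the download cell will work.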
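Before calling the `kaggle` command it can help to check that credentials are in place. The sketch below assumes the two common setups for the Kaggle client: a token file at `~/.kaggle/kaggle.json`, or the environment variables `KAGGLE_USERNAME` and `KAGGLE_KEY`. The function name `kaggle_credentials_available` is hypothetical, and the check only looks for the presence of credentials, not their validity.

```python
import os
from pathlib import Path


def kaggle_credentials_available(home=None, env=None):
    """Return True if Kaggle credentials appear to be configured.

    Assumption: the client accepts either the KAGGLE_USERNAME / KAGGLE_KEY
    environment variables or a token file at ~/.kaggle/kaggle.json.
    """
    home = Path(home) if home is not None else Path.home()
    env = os.environ if env is None else env
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return True
    return (home / ".kaggle" / "kaggle.json").is_file()


if not kaggle_credentials_available():
    print("No Kaggle credentials found; the download step will likely fail.")
```

The `home` and `env` parameters exist only to make the check easy to test without touching the real home directory.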

In [4]:
%cd kaggle_titanic/notebooks/

/Users/aguschin/Git/dmia/DMIA_Sport_2019_Spring_dev/seminars/baselines/kaggle_titanic/notebooks


In [5]:
%%bash

export COMPETITION='titanic'
# download the competition archive into data/raw and unpack it there
kaggle competitions download -c "$COMPETITION" -p ../data/raw
cd ../data/raw && unzip "$COMPETITION.zip"

Downloading titanic.zip to ../data/raw

Archive:  titanic.zip
  inflating: train.csv               
  inflating: test.csv                
  inflating: gender_submission.csv   


100%|██████████| 33.9k/33.9k [00:00<00:00, 1.28MB/s]
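A quick sanity check after unpacking is to confirm that the three files listed in the `unzip` output are present in `data/raw`. The following is a minimal sketch: the helper name `missing_files` is hypothetical, and the relative path `../data/raw` assumes the code runs from the `notebooks` folder, as in the cells above.

```python
from pathlib import Path

# File names taken from the unzip output above.
EXPECTED_FILES = ("train.csv", "test.csv", "gender_submission.csv")


def missing_files(raw_dir):
    """Return the expected competition files that are absent from `raw_dir`."""
    raw = Path(raw_dir)
    return [name for name in EXPECTED_FILES if not (raw / name).is_file()]


missing = missing_files("../data/raw")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All expected files are in place.")
```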
