# Data Competition Sites

## What we will accomplish

In this notebook we will:
- Provide a definition of data competition sites,
- Give examples of popular data competition sites and
- Demonstrate the process of obtaining a data set from such a site.

## Definition

A <i>data competition website</i> is a site that hosts competitions built around particular data sets.

For example, some entity may have a collection of images from MRI scans. This entity could then provide those images as a data set for a competition whose goal is to provide the "best" predictive algorithm for some disease of interest. The data competition site would:
- Host the competition,
- Publicly store the data,
- Specify the rules as outlined by the entity,
- Accept the competition entries and
- Help determine the winner or winners.
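
To make the last two bullets concrete, here is a minimal sketch of how a site might score entries and rank them. It assumes the competition is judged by plain accuracy against a set of hidden labels. That metric, the team names and the label values are all made up for illustration; real competitions choose their own metrics and rules.

```python
def accuracy(predictions, truth):
    """Fraction of predictions that match the hidden ground-truth labels."""
    if len(predictions) != len(truth):
        raise ValueError("submission has the wrong number of rows")
    correct = sum(p == t for p, t in zip(predictions, truth))
    return correct / len(truth)


def rank_submissions(submissions, truth):
    """Return (team, score) pairs sorted from best score to worst."""
    scores = {team: accuracy(preds, truth) for team, preds in submissions.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical hidden labels held by the site, never shown to competitors.
hidden_labels = [1, 0, 1, 1, 0]

# Hypothetical entries from two made-up teams.
submissions = {
    "team_a": [1, 0, 0, 1, 0],  # 4 of 5 correct
    "team_b": [1, 1, 0, 1, 1],  # 2 of 5 correct
}

leaderboard = rank_submissions(submissions, hidden_labels)
print(leaderboard)
```

Here `team_a` ranks first with a score of 0.8 and `team_b` second with 0.4.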

While the competitions may be the main purpose of the website, these sites often also serve as a source of data for personal projects, host tutorials and act as community hubs.

## Popular data competition websites

Here is a list of some of the most popular data competition sites:
- <a href="https://www.kaggle.com/">Kaggle.com</a>,
- <a href="https://idao.world/">The International Data Analysis Olympiad</a>,
- <a href="https://www.drivendata.org/">DrivenData</a>,
- <a href="https://competitions.codalab.org/">CodaLab</a> and
- <a href="https://datahack.analyticsvidhya.com/">DataHack</a>.

Some of these sites will require you to create a profile and others may only have data available for active competitions.

## Example: Extracting data from Kaggle.com

Let's now demonstrate how to extract data from Kaggle.com. Note that in order to work through this example you will need a Kaggle profile.

Kaggle has an entire section dedicated to public datasets, <a href="https://www.kaggle.com/datasets">https://www.kaggle.com/datasets</a>. In particular, we will download the famous iris data set, found at <a href="https://www.kaggle.com/uciml/iris">https://www.kaggle.com/uciml/iris</a>.
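
The download can also be scripted instead of done through the browser. The sketch below assumes that the official `kaggle` Python package is installed and that an API token from your Kaggle profile is saved at `~/.kaggle/kaggle.json`. Neither is required for this notebook, and the helper name `download_iris` is hypothetical.

```python
DATASET = "uciml/iris"  # the owner/dataset slug taken from the URL above


def download_iris(dest="."):
    """Download and unzip the iris data set into `dest` via the Kaggle API."""
    # Imported here rather than at the top because the kaggle package
    # may look for your credentials as soon as it is imported.
    from kaggle.api.kaggle_api_extended import KaggleApi

    api = KaggleApi()
    api.authenticate()  # reads the token from ~/.kaggle/kaggle.json
    api.dataset_download_files(DATASET, path=dest, unzip=True)


# Example call (requires valid credentials, so it is left commented out):
# download_iris(".")
```

If you prefer not to set up an API token, the manual steps work just as well.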

### Instructions

1. Go to this link, <a href="https://www.kaggle.com/uciml/iris">https://www.kaggle.com/uciml/iris</a>,
2. Click the download button,
3. Unzip the zip file and move the `Iris.csv` file to this folder,
4. Run the following code to load in the data using `pandas`.

In [1]:
import pandas as pd

In [2]:
# Load the downloaded file into a DataFrame (assumes Iris.csv is in this folder)
pd.read_csv("Iris.csv")

| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|---|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| ... | ... | ... | ... | ... | ... | ... |
| 145 | 146 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 146 | 147 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 147 | 148 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 148 | 149 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
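
Step 3 of the instructions can also be done from Python. The following is a minimal sketch using only the standard library: it pulls `Iris.csv` out of the downloaded zip file and checks that the header matches the columns shown above. The archive name `archive.zip` and the helper names are assumptions; use whatever name your browser gave the download.

```python
import csv
import zipfile
from pathlib import Path

# Column names as they appear in the output above.
EXPECTED_COLUMNS = [
    "Id",
    "SepalLengthCm",
    "SepalWidthCm",
    "PetalLengthCm",
    "PetalWidthCm",
    "Species",
]


def extract_member(zip_path, member="Iris.csv", dest_dir="."):
    """Extract one file from a zip archive and return the path it was written to."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        if member not in archive.namelist():
            raise FileNotFoundError(f"{member} not found in {zip_path}")
        return Path(archive.extract(member, path=dest_dir))


def has_expected_header(csv_path, expected=EXPECTED_COLUMNS):
    """Return True if the first row of the CSV matches the expected column names."""
    with open(csv_path, newline="") as handle:
        header = next(csv.reader(handle))
    return header == expected


# Example usage, assuming the download was saved as archive.zip in this folder:
# csv_path = extract_member("archive.zip")
# print(has_expected_header(csv_path))
```

Checking the header before loading is a cheap way to confirm that you extracted the file you intended.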


## Data use guidelines

As noted in the `Data Repositories` notebooks, be sure to follow any data use guidelines put forth by either the data competition website or the data set contributors. This includes using the data in accordance with their specified rules and citing the data source in any end products that result from your project.

--------------------------

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2022.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md).