# Data Repositories

## What we will accomplish

In this notebook we will:
- Define the notion of a data repository,
- Give several examples of data repositories and
- Demonstrate the process of obtaining data from a data repository.

## Definition

We will call a <i>data repository</i> any website where data sets are deposited. Repositories exist for many reasons, for example:
- Housing data associated with published academic research,
- Holding data that was used by a news organization or
- Holding benchmark data sets that are used to compare algorithmic performance.

Such repositories can be excellent sources of data for a project. Let's now review a couple of different kinds and give some examples.

## Academic repositories

These repositories house data affiliated with academic research papers. They exist both to support replication and to spur additional research. Here are some examples:
- The UC Irvine Machine Learning Repository, <a href="https://archive.ics.uci.edu/ml/index.php">https://archive.ics.uci.edu/ml/index.php</a>, (<i>a very popular repository</i>),
- A repository of COVID-19 Tweets, <a href="https://publichealth.jmir.org/2020/2/e19273/">https://publichealth.jmir.org/2020/2/e19273/</a>,
- The Mendeley Data repository site, <a href="https://data.mendeley.com/">https://data.mendeley.com/</a> and
- The Harvard Dataverse, <a href="https://dataverse.harvard.edu/">https://dataverse.harvard.edu/</a>.

## GitHub repositories

There are many GitHub repositories whose sole purpose is data storage. News organizations and data-based blogs/websites often have repositories that store the data sets accompanying their stories/posts. For example:
- <a href="https://fivethirtyeight.com/">FiveThirtyEight</a>, <a href="https://github.com/fivethirtyeight/">https://github.com/fivethirtyeight/</a>,
- <a href="https://www.nytimes.com/">The New York Times</a>, <a href="https://github.com/nytimes">https://github.com/nytimes</a> and
- <a href="https://pudding.cool/">Pudding.cool</a>, <a href="https://github.com/the-pudding/data">https://github.com/the-pudding/data</a>.

There are also repositories maintained by individual users not affiliated with any larger organization. These may be harder to find, but if you have a data set in mind it can be a good idea to do a web search for an existing GitHub repository. This could save you a lot of time and work.

## An example

Let's demonstrate how you can use a repository to access data.

We will use a data set from the FiveThirtyEight repository. Let's download the file `candy-data.csv` from the folder associated with this post, <a href="https://fivethirtyeight.com/videos/the-ultimate-halloween-candy-power-ranking/">https://fivethirtyeight.com/videos/the-ultimate-halloween-candy-power-ranking/</a>.

### Instructions

1. First go to the link associated with the data file, <a href="https://github.com/fivethirtyeight/data/blob/master/candy-power-ranking/candy-data.csv">https://github.com/fivethirtyeight/data/blob/master/candy-power-ranking/candy-data.csv</a>.
2. Then click on the `Raw` button above the data table displayed on the page.
3. Using your web browser, save the file as `candy-data.csv` within the `Data Collection` folder of this repository.
4. Run the code chunks below.

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv("candy-data.csv")

In [3]:
data.head()

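| | competitorname | chocolate | fruity | caramel | peanutyalmondy | nougat | crispedricewafer | hard | bar | pluribus | sugarpercent | pricepercent | winpercent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100 Grand | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0.732 | 0.860 | 66.971725 |
| 1 | 3 Musketeers | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0.604 | 0.511 | 67.602936 |
| 2 | One dime | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.011 | 0.116 | 32.261086 |
| 3 | One quarter | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.011 | 0.511 | 46.116505 |
| 4 | Air Heads | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.906 | 0.511 | 52.341465 |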
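
If `data.head()` shows something other than the output above, one possible cause is that the browser saved the GitHub web page rather than the raw file. The following is a minimal sketch of a check for that, using only the standard library. The function name `has_expected_header` and the constant `EXPECTED_COLUMNS` are hypothetical helpers introduced here for illustration; the column names are copied from the output above.

```python
import csv

# Column names as they appear in the data.head() output above.
EXPECTED_COLUMNS = [
    "competitorname", "chocolate", "fruity", "caramel", "peanutyalmondy",
    "nougat", "crispedricewafer", "hard", "bar", "pluribus",
    "sugarpercent", "pricepercent", "winpercent",
]


def has_expected_header(path, expected=EXPECTED_COLUMNS):
    """Return True if the first row of the file at `path` matches `expected`.

    A saved HTML page or an empty file will not match, so this gives a
    quick signal that the wrong thing was downloaded.
    """
    # "utf-8-sig" reads plain UTF-8 and also tolerates a leading byte order mark.
    with open(path, newline="", encoding="utf-8-sig") as f:
        header = next(csv.reader(f), [])  # [] if the file is empty
    return header == list(expected)
```

Calling `has_expected_header("candy-data.csv")` from the folder where the file was saved should return `True` when the raw CSV was downloaded as described in the instructions.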


Congratulations! You have now downloaded and used data stored on a repository.

Note that this is not the exact process you will follow every time you want to use data stored on a repository; different repositories may provide their files in different ways.
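
As one variation, steps 1 through 3 of the instructions can be done in code rather than in the browser. The sketch below is an illustration, not the method this notebook prescribes. It assumes that a GitHub `blob` page URL maps to a raw file URL on `raw.githubusercontent.com` by dropping the `/blob/` segment, which is the address the `Raw` button is expected to lead to. The helper names `github_blob_to_raw` and `download_csv` are hypothetical.

```python
from pathlib import Path
from urllib.request import urlretrieve

# The page linked in step 1 of the instructions.
BLOB_URL = (
    "https://github.com/fivethirtyeight/data/blob/master/"
    "candy-power-ranking/candy-data.csv"
)


def github_blob_to_raw(blob_url):
    """Convert a github.com 'blob' page URL to its raw-file URL.

    Assumption: raw files are served from raw.githubusercontent.com under
    the same owner/repo/branch/path, without the '/blob/' segment.
    """
    prefix = "https://github.com/"
    if not blob_url.startswith(prefix) or "/blob/" not in blob_url:
        raise ValueError(f"Not a GitHub blob URL: {blob_url}")
    owner_repo, branch_and_path = blob_url[len(prefix):].split("/blob/", 1)
    return f"https://raw.githubusercontent.com/{owner_repo}/{branch_and_path}"


def download_csv(blob_url, dest="candy-data.csv"):
    """Download the raw file behind `blob_url` and save it to `dest`."""
    urlretrieve(github_blob_to_raw(blob_url), dest)  # requires network access
    return Path(dest)


# Building the raw URL needs no network access; the download itself does.
print(github_blob_to_raw(BLOB_URL))
# download_csv(BLOB_URL)  # uncomment to save candy-data.csv in the current folder
```

`pd.read_csv` also accepts a URL, so `pd.read_csv(github_blob_to_raw(BLOB_URL))` would load the data without saving a local copy. Keeping a local file, as in the instructions, has the advantage that the notebook still runs if the repository later moves or changes.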

## Repository use guidelines

When using data that you did not collect or create yourself, it is important to follow whatever data use guidelines are associated with the data set. In particular, check that you are not violating any restrictions or legal guidelines outlined by the data provider. Many repositories publish guidelines on how you are allowed to use their data sets.

It is also important that you credit the original source of the data in your final project. Some repositories may also have guidelines for citation. For example, an academic repository likely has an associated publication that you should cite.

Please be responsible and courteous data citizens. :)

--------------------------

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2022.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md).