# What is data? 

So what *is* data? The term is used in so many ways that it's often hard to pin down what people mean. Here is what [Wikipedia says][data1]:

> Data is uninterpreted information.

This is somewhat helpful, but also a bit cryptic, since we aren't told what it means to interpret information. Indeed, [it is often suggested](http://www.diffen.com/difference/Data_vs_Information) that an act of interpretation is required to go from data to information:

> Data are the facts or details from which information is derived. Individual pieces of data are rarely useful alone. For data to become information, data needs to be put into context.


Here's a longer passage about 'raw data' from the Wikipedia [article on data](https://en.wikipedia.org/wiki/Data):

> Raw data, i.e. unprocessed data, is a collection of numbers, characters; data processing commonly occurs by stages, and the "processed data" from one stage may be considered the "raw data" of the next.

This is more useful, since it tells us that data can somehow be processed, or transformed into something else. It also points out that what counts as data is relative to the context.

Let's try to get a clearer picture by looking at some examples.


[data1]: https://en.wikipedia.org/wiki/Data_(disambiguation)

## Yesterday I ate tomatoes

Suppose I decide to keep a diary about the food I eat. This could be pretty informal, something a bit like this:

```
Monday
------
bfast: toast and jam
lunch: tomato soup and roll
supper: baked beans, sushi, treacle tart

Tuesday
-------
bfast: porridge with soya milk
lunch: tomato soup and roll
supper: peri-peri chicken, chips, coke

```

But it's still good enough to count as data about my diet. (Not enough greens?)

## I run

Here's a slightly different kind of diary, recording my running exploits in the first half of December:

```
5/12/15 4.5km
7/12/15 3.1km
12/12/15 8.6km

```

So here we have data that combines dates and distances.
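To make the "dates and distances" idea concrete, here is a minimal Python sketch that splits each diary line into a date and a number. It assumes the dates are written day/month/year, and the names `diary` and `runs` are only illustrative.

```python
from datetime import datetime

# The running diary from above, one entry per line.
diary = """\
5/12/15 4.5km
7/12/15 3.1km
12/12/15 8.6km
"""

runs = []
for line in diary.splitlines():
    date_text, distance_text = line.split()
    # Assumption: dates are day/month/two-digit year.
    date = datetime.strptime(date_text, "%d/%m/%y").date()
    # Drop the "km" unit so the distance becomes a number.
    distance_km = float(distance_text.replace("km", ""))
    runs.append((date, distance_km))

# Once the distances are numbers, we can do arithmetic with them.
total_km = sum(distance for _, distance in runs)
print(runs)
print(round(total_km, 1))
```

Once each entry is a proper date paired with a number, questions like "how far did I run in total?" become simple calculations.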


## Just numbers

What about this?

```
23.87
19.85
19.22
28.93
29.41
22.23
23.50
24.95
```

It's just a list of numbers, right? Apart from the fact that the numbers are in a narrow range, it's pretty much impossible to guess what this information is about.

Here are the same numbers, but with more information added:

```
Year   Days of rainfall
-----------------------
2004        23.87
2005        19.85
2006        19.22
2007        28.93
2008        29.41
2009        22.23
2010        23.50
2011        24.95
```

So now we see that we have a *time series*: a sequence of data points measured at different times (in this case, in successive years). The two columns have been given labels which tell us what the time points are, and what kind of quantity has been measured. We could also specify not just *when* but *where* the measurements were taken, namely in Edinburgh.

The bits of information which tell us things like dates, location and the kind of quantity are sometimes called *metadata*: data about data.
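One way to picture the difference between data and metadata is to keep them in separate Python objects. This is only a sketch: the dictionary keys (`quantity`, `location`, `first_year`) are made up for illustration and are not a standard.

```python
# The bare measurements: on their own, just a list of numbers.
values = [23.87, 19.85, 19.22, 28.93, 29.41, 22.23, 23.50, 24.95]

# Metadata: data about the data. The key names are illustrative only.
metadata = {
    "quantity": "Days of rainfall",
    "location": "Edinburgh",
    "first_year": 2004,
}

# Use the metadata to pair each value with the year it belongs to.
first_year = metadata["first_year"]
years = range(first_year, first_year + len(values))
series = dict(zip(years, values))

# With the years attached, we can ask which year had the highest value.
max_year = max(series, key=series.get)
print(metadata["quantity"], "in", metadata["location"])
print(max_year, series[max_year])
```

Without the `metadata` dictionary, the `values` list would be just as hard to interpret as the unlabelled numbers we started with.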

### Your turn

* Find another example of time series data. Find or make up some data points that are part of the series.

* Find another example of simple numerical data which is *not* time series data. What metadata would you need to add to make sure that someone else understands the data?

## Turning Tables

We often represent data in the form of rows and columns. That's what we mean when we talk about a data table (or tabular data). So the rainfall data above had two columns and eight rows, plus a header row.
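As a small sketch of what rows and columns look like in code, the snippet below loads the rainfall figures into a pandas table. Writing the data as comma-separated text and the variable names `csv_text` and `table` are assumptions made for this illustration.

```python
import io

import pandas as pd

# The rainfall data written as comma-separated text, header row first.
csv_text = """\
Year,Days of rainfall
2004,23.87
2005,19.85
2006,19.22
2007,28.93
2008,29.41
2009,22.23
2010,23.50
2011,24.95
"""

# read_csv treats the first line as the header row by default.
table = pd.read_csv(io.StringIO(csv_text))

# shape is (number of rows, number of columns); the header is not counted.
print(table.shape)
print(list(table.columns))
```

The header row supplies the column labels, so it is not counted among the eight data rows.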

### Your turn

* Write down the food diary example so that it looks like a table. 

Public bodies collect *lots* of data about all manner of things. More and more, they have been making this available as [open data](https://en.wikipedia.org/wiki/Open_data) to anyone who wants to use it. Most of the time, the data is provided as some kind of table that we can download over the internet. Here's an example of data about Scottish schools which we've already downloaded for you. We're doing a bit of extra magic to make it easy to display the data, but you can ignore this for the time being.

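In [2]:
# dds_lab is the course helper module; `schools` is expected to come from it
from dds_lab import *
# Import pandas explicitly so that `pd` is defined
import pandas as pd
# Load the schools data into a table and display it
schools_csv = pd.read_csv(schools)
schools_csv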

Let's just briefly look through this table. The first column is not in fact part of the dataset, but is just there to help us keep track of which row is which. The second column can be ignored for now, but is a standardised way of giving a unique identifier to each school, whose conventional name can be found in the third column. Two of the later columns contain the geographical coordinates of each school; as we'll see later, this is really helpful since it allows us to plot the locations of the schools on a map. Finally, a further column shows us the number of pupils.

## Survey Data

We are often asked to fill in questionnaires, asking our views and feelings about various things. Within Edinburgh, the Council [carries out an extensive survey of residents](http://www.edinburgh.gov.uk/info/20029/have_your_say/921/edinburgh_people_survey):

> The Edinburgh People Survey (EPS) is the Council's annual citizen survey, measuring satisfaction with the Council and its services, identifying areas for improvement and gathering information about residents which is not available through other sources or at neighbourhood level.

> The survey is undertaken through face-to-face interviews with around 5,000 residents each year, conducted in the street and door-to-door.

This data differs from what we've seen so far in being largely *subjective*. Below, we show a tiny extract from the 2013 survey.

In [None]:
eps_csv = pd.read_csv(eps_extract)
eps_csv

In [None]:
dir()