# What is data? 

So what *is* data? The term is used in so many ways that it's often hard to pin down what people mean. Here is what [Wikipedia says][data1]:

> Data is uninterpreted information.

This is somewhat helpful, but also a bit cryptic, since we aren't told what it means to interpret information. Indeed, [it is often suggested](http://www.diffen.com/difference/Data_vs_Information) that an act of interpretation is required to go from data to information:

> Data are the facts or details from which information is derived. Individual pieces of data are rarely useful alone. For data to become information, data needs to be put into context.

Here's a longer passage about 'raw data' from the Wikipedia [article on data](https://en.wikipedia.org/wiki/Data):

> Raw data, i.e. unprocessed data, is a collection of numbers, characters; data processing commonly occurs by stages, and the "processed data" from one stage may be considered the "raw data" of the next.

This is more useful, since it tells us that data can somehow be processed, or transformed into something else. It also points out that what counts as data is relative to the context.

Let's try to get a clearer picture by looking at some examples.


[data1]: https://en.wikipedia.org/wiki/Data_(disambiguation)

## Yesterday I ate tomatoes

Suppose I decide to keep a diary about the food I eat. This could be pretty informal, something a bit like this:

```
Monday
------
bfast: toast and jam
lunch: tomato soup and roll
supper: baked beans, sushi, treacle tart

Tuesday
-------
bfast: porridge with soya milk
lunch: tomato soup and roll
supper: peri-peri chicken, chips, coke

```

But it's still good enough to count as data about my diet. (Not enough greens?)

## I run

Here's a slightly different kind of diary, recording my running exploits in the first half of December:

```
5/12/15 4.5km
7/12/15 3.1km
12/12/15 8.6km

```

So here we have data that combines dates and distances.
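To make the "dates and distances" structure explicit, here is a minimal sketch that splits each diary line into a date and a distance. It assumes the dates are written day/month/year and that every distance ends in `km`; the function name `parse_run` is made up for this illustration.

```python
from datetime import datetime

diary = """5/12/15 4.5km
7/12/15 3.1km
12/12/15 8.6km"""

def parse_run(line):
    """Split one diary line into a (date, distance in km) pair."""
    date_text, distance_text = line.split()
    # Assumption: dates are day/month/two-digit year.
    date = datetime.strptime(date_text, "%d/%m/%y").date()
    # Assumption: every distance is a number followed by 'km'.
    distance_km = float(distance_text.removesuffix("km"))
    return date, distance_km

runs = [parse_run(line) for line in diary.splitlines()]
total_km = sum(distance for _, distance in runs)

print(runs[0])
print(round(total_km, 1))
```

Once the text has been split up like this, the dates can be sorted or compared and the distances can be added up, which is hard to do while everything is still one string per line.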


## Just numbers

What about this?

```
23.87
19.85
19.22
28.93
29.41
22.23
23.50
24.95
```

It's just a list of numbers, right? Apart from the fact that the numbers are in a narrow range, it's pretty much impossible to guess what this information is about.

Here are the same numbers, but with more information added:

```
Year   Days of rainfall
-----------------------
2004        23.87
2005        19.85
2006        19.22
2007        28.93
2008        29.41
2009        22.23
2010        23.50
2011        24.95
```

So now we see that we have a *time series*: a sequence of data points measured at different times (in this case, in successive years). The two columns have been given labels which tell us what the time points are, and what kind of quantity has been measured. We could also specify not just *when* but *where* the measurements were taken, namely in Edinburgh.

The bits of information which tell us things like dates, location and the kind of quantity are sometimes called *metadata*: data about data.
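One way to picture the difference is to keep the numbers and the metadata in separate Python objects. This is only a sketch: the dictionary keys (`quantity`, `location`, `first_year`) are illustrative names chosen here, not part of any standard.

```python
# The bare numbers from the first listing.
values = [23.87, 19.85, 19.22, 28.93, 29.41, 22.23, 23.50, 24.95]

# Metadata: data about the data. The key names are illustrative choices.
metadata = {
    "quantity": "Days of rainfall",
    "location": "Edinburgh",
    "first_year": 2004,
}

# Pair each value with the year it belongs to, giving a time series.
first_year = metadata["first_year"]
years = range(first_year, first_year + len(values))
time_series = dict(zip(years, values))

print(time_series[2008])
print(min(values), max(values))
```

Without `metadata`, the list `values` is just the unlabelled column of numbers we started with.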

### Your turn

* Find another example of time series data. Find or make up some data points that are part of the series.

* Find another example of simple numerical data which is *not* time series data. What metadata would you need to add to make sure that someone else understands the data?

## Turning Tables

We often represent data in the form of rows and columns. That's what we mean when we talk about a data table (or tabular data). So the rainfall data above had two columns and eight rows, plus a header row.
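As a rough sketch of the same idea in code, the rainfall table can be written as comma-separated text and read with Python's built-in `csv` module. The comma-separated layout is an assumption made for this example; the table above uses spaces.

```python
import csv
import io

table_text = """Year,Days of rainfall
2004,23.87
2005,19.85
2006,19.22
2007,28.93
2008,29.41
2009,22.23
2010,23.50
2011,24.95
"""

reader = csv.reader(io.StringIO(table_text))
header = next(reader)  # the header row holds the column labels
rows = list(reader)    # every remaining line is one row of data

print(header)
print(len(rows), "rows and", len(header), "columns")

# A column is the value at the same position in every row.
rainfall_column = [float(row[1]) for row in rows]
print(rainfall_column[:3])
```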

### Your turn

* Write down the food diary example so that it looks like a table. 

Public bodies collect *lots* of data about all manner of things. More and more, they have been making this available as [open data](https://en.wikipedia.org/wiki/Open_data) to anyone who wants to use it. Most of the time, the data is provided as some kind of table that can be downloaded over the internet. Here's an example of data about Scottish schools which we've already downloaded for you. We're doing a bit of extra magic to make it easy to display the data, but you can ignore this for the time being.

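```python
import pandas as pd
from dds_lab import *  # course helper module, assumed to define `schools`

# Load the downloaded CSV file and show its first 10 rows.
schools_csv = pd.read_csv(schools)
schools_csv.head(10)
```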

```
Unnamed: 0,school,school_label,latitude,longitude,pupils
0,http://data.opendatascotland.org/id/educationa...,Linlithgow Academy,55.9716,-3.61259,1231
1,http://data.opendatascotland.org/id/educationa...,St Kentigern's Academy,55.87101,-3.63367,1215
2,http://data.opendatascotland.org/id/educationa...,"James Young High,The",55.88093,-3.51523,1135
3,http://data.opendatascotland.org/id/educationa...,St Margaret's Academy,55.88937,-3.52213,1094
4,http://data.opendatascotland.org/id/educationa...,Inveralmond Community High,55.90146,-3.51932,1090
5,http://data.opendatascotland.org/id/educationa...,West Calder High,55.86291,-3.54044,950
6,http://data.opendatascotland.org/id/educationa...,Deans Community High,55.90581,-3.54977,941
7,http://data.opendatascotland.org/id/educationa...,Broxburn Academy,55.93694,-3.48778,903
8,http://data.opendatascotland.org/id/educationa...,Bathgate Academy,55.89838,-3.61313,899
9,http://data.opendatascotland.org/id/educationa...,Whitburn Academy,55.86804,-3.67964,822
```


Let's just briefly look through this table. The first column is not in fact part of the dataset, but is just there to help us keep track of which row is which. The second column can be ignored for now, but is a standardised way of giving a unique identifier to each school, whose conventional name can be found in the third column. The fourth and fifth columns contain the geographical coordinates (latitude and longitude) of each school; as we'll see later, this is really helpful since it allows us to plot the locations of the schools on a map. Finally, the sixth column shows us the number of pupils.
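The label `Unnamed: 0` is what pandas uses for a column that has no header of its own. Here is a minimal sketch, using three rows copied from the table and leaving out the truncated `school` identifier column, of how that column can be treated as row labels and how the other columns can be picked out by name. The in-memory CSV text stands in for the real file, whose exact layout is an assumption here.

```python
import io
import pandas as pd

# Three rows from the schools table; the first column has no header.
csv_text = """,school_label,latitude,longitude,pupils
0,Linlithgow Academy,55.9716,-3.61259,1231
1,St Kentigern's Academy,55.87101,-3.63367,1215
2,"James Young High,The",55.88093,-3.51523,1135
"""

# By default, the header-less first column is named 'Unnamed: 0'.
plain = pd.read_csv(io.StringIO(csv_text))
print(list(plain.columns))

# With index_col=0, that column becomes the row labels instead.
schools_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Columns are selected by name: here, the geographical coordinates.
coordinates = schools_df[["latitude", "longitude"]]
print(coordinates)

# Total number of pupils in these three schools.
print(schools_df["pupils"].sum())
```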

### Your turn

In the code cell above, the last line is:
```python
schools_csv.head(10)
```
This tells pandas to display only the first 10 rows of the table. If you want to see (say) 20 rows, replace the line with this:
```python
schools_csv.head(20)
```

Alternatively, if you want to see the whole table, write this:
```python
schools_csv
```
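If you want to experiment with `head` without the schools file, the following sketch builds a small stand-in table from the rainfall figures earlier in this section and checks how many rows come back in each case.

```python
import pandas as pd

# A small stand-in table with 8 rows, built from the rainfall data.
rainfall = pd.DataFrame({
    "Year": list(range(2004, 2012)),
    "Days of rainfall": [23.87, 19.85, 19.22, 28.93, 29.41, 22.23, 23.50, 24.95],
})

print(len(rainfall.head(3)))   # asks for the first 3 rows
print(len(rainfall.head()))    # with no argument, head shows 5 rows
print(len(rainfall.head(20)))  # asking for more rows than exist returns them all
```

Asking for more rows than the table contains is not an error; you simply get the whole table.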

## Survey Data

We are often asked to fill in questionnaires, asking our views and feelings about various things. Within Edinburgh, the Council [carries out an extensive survey of residents](http://www.edinburgh.gov.uk/info/20029/have_your_say/921/edinburgh_people_survey):

> The Edinburgh People Survey (EPS) is the Council's annual citizen survey, measuring satisfaction with the Council and its services, identifying areas for improvement and gathering information about residents which is not available through other sources or at neighbourhood level.

> The survey is undertaken through face-to-face interviews with around 5,000 residents each year, conducted in the street and door-to-door.

This data differs from what we've seen so far in being largely *subjective*. Below, we show a tiny extract from the [2013 survey](https://github.com/edinburghcouncil/datasets/tree/master/Edinburgh%20People%20Survey).

```python
# `pd` and `eps_extract` are assumed to be available from the earlier code cell.
eps_csv = pd.read_csv(eps_extract)
eps_csv
```

```
Unnamed: 0,HOU003,HOU004,HOU006,HOU007,NEI001,NEI002,NEI003,NEI032,NEI040,COU001,COU002
0,Meadows/Morningside,Male,45-54,Working - Full-time (30+ hours),Fairly dissatisfied,"Parking bays should be painted in, could do wi...",Yes,Fairly safe,Very satisfied,Fairly satisfied,Need bottle bank at Waitrose (Falcone Road).
1,Meadows/Morningside,Female,35-44,Working - Part-time (9-29 hours),Fairly dissatisfied,No comment.,No,Fairly safe,Neither satisfied nor dissatisfied,Fairly satisfied,No comment.
2,Meadows/Morningside,Male,16-24,Working - Full-time (30+ hours),Don't know,Don't know.,Not sure,Fairly safe,Very satisfied,Fairly satisfied,No problems.
3,Meadows/Morningside,Male,25-34,Self employed,Don't know,No comment.,Not sure,Fairly safe,Fairly satisfied,Fairly satisfied,No comment.
4,Meadows/Morningside,Male,16-24,Student,Neither satisfied nor dissatisfied,It's okay.,No,Fairly safe,Fairly satisfied,Fairly satisfied,Rubbish collection and waste food disposal poo...
5,Meadows/Morningside,Female,35-44,Working - Part-time (9-29 hours),Fairly dissatisfied,Recycling bins not being collected. Need empti...,Not sure,Very safe,Fairly satisfied,Fairly satisfied,Food waste bins should be cleaned. Quite disgu...
6,Meadows/Morningside,Female,60-64,Not working - retired,Fairly dissatisfied,Pretty satisfied.,Yes,Fairly safe,Very satisfied,Fairly satisfied,Romanians begging on streets. It's on the rise...
7,Meadows/Morningside,Male,16-24,Student,Neither satisfied nor dissatisfied,No comment.,No,Fairly safe,Very satisfied,Fairly satisfied,No comment.
8,Meadows/Morningside,Male,35-44,Working - Full-time (30+ hours),Fairly dissatisfied,No comment.,Not sure,Very safe,Very satisfied,Fairly satisfied,No issues.
9,Meadows/Morningside,Male,25-34,Working - Full-time (30+ hours),Fairly dissatisfied,No comment.,Not sure,Fairly safe,Fairly satisfied,Fairly satisfied,No comment.
```


Some of the answers shown here are impossible to interpret without knowing what questions were asked, so here are the relevant survey questions:

> **NEI001**:	Thinking of your neighbourhood area, by which I mean the area within a 15 minute walk of your home, how satisfied or dissatisfied are you with this area as a place to live?

> **NEI002**:	What should be the top priority for improving the quality of life in your neighbourhood?

> **NEI003**:	Do you feel that you are able to have a say on things happening or how Council services are run in your local area (neighbourhood or community)?

> **NEI032**:	How safe do you feel in your neighbourhood after dark?

> **NEI040**:	To what extent are you satisfied or dissatisfied with the way the Council is managing your neighbourhood?

> **COU001**:	To what extent are you satisfied or dissatisfied with the way the Council is managing the City?	

> **COU002**:	Why do you say this?

Since the Meadows/Morningside area is one of the more desirable areas of Edinburgh, and given that Edinburgh is sometimes rated as [one of the most livable cities in the UK](http://www.scotsman.com/news/edinburgh-named-second-best-place-to-live-in-uk-1-3149764), it's intriguing how lukewarm about their neighbourhood these respondents were!
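To see just how lukewarm, we can tally the answers to **NEI001**. The sketch below copies the ten NEI001 answers from the extract by hand rather than reading them from the file, and uses Python's built-in `Counter` to count how often each answer occurs.

```python
from collections import Counter

# NEI001 answers copied from the ten rows of the extract above.
nei001 = [
    "Fairly dissatisfied",
    "Fairly dissatisfied",
    "Don't know",
    "Don't know",
    "Neither satisfied nor dissatisfied",
    "Fairly dissatisfied",
    "Fairly dissatisfied",
    "Neither satisfied nor dissatisfied",
    "Fairly dissatisfied",
    "Fairly dissatisfied",
]

counts = Counter(nei001)

# List the answers from most to least common.
for answer, count in counts.most_common():
    print(f"{answer}: {count}")
```

Six of the ten respondents in this tiny extract were fairly dissatisfied, and none reported being satisfied. Ten rows are far too few to say anything about the area as a whole.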


## Social Media Data

In some contexts, we might want to treat information shared via social media as data. For example, we could sample Twitter to see what kinds of things people are currently saying about food. (Warning: you will only be able to re-run the following code if you have followed [these instructions about obtaining Twitter API keys](http://www.nltk.org/howto/twitter.html).)

```python
import nltk  # load up the NLTK library
from nltk.twitter import Twitter

tw = Twitter()  # start a new client that connects to Twitter
tw.tweets(keywords='food', limit=25)  # filter Tweets from the public stream
```

```
RT @RIFuture: This is what it looks like when Rhode Islanders stand up for a living wage: https://t.co/LnqwLtUukI #FightFor15
Mechanics Of Eating: Why You'll Miss Flavor If You Scarf Your Food https://t.co/udAODPJYyo
RT @INFINITE7SOUL: [PIC] 151111 Coffee Pong Instagram Update: Food Truck Support from #인피니트 Dongwoo's fansite to In the Heights Musical htt…
Today is one of those 'no spoons' days. But hopefully when food arrives I will be able to get up &amp; collect it. :/
RT @_CollegeHumor_: Remix to ignition, there’s no food in the kitchen, my life is a mess and I can’t afford my tuition
RT @_CollegeHumor_: Remix to ignition, there’s no food in the kitchen, my life is a mess and I can’t afford my tuition
RT @INFINITE7SOUL: [PIC] 151111 Coffee Pong Instagram Update: Food Truck Support from #인피니트 Dongwoo's fansite to In the Heights Musical htt…
RT @MobileFoodConf: Columbus Mobile Food Conference a must attend for anyone in the industry in the Midwest and Anyone who wants to be 

ht…
I've
```
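Once tweets have been collected, they can be handled like any other data. As a small sketch that needs no Twitter API keys, the code below copies four of the tweets shown above into a list of plain strings, then counts how many are retweets (taken here to mean that the text starts with `RT @`) and how many distinct texts there are.

```python
# Four of the tweets shown above, copied as plain strings.
tweets = [
    "RT @RIFuture: This is what it looks like when Rhode Islanders stand up for a living wage: https://t.co/LnqwLtUukI #FightFor15",
    "Mechanics Of Eating: Why You'll Miss Flavor If You Scarf Your Food https://t.co/udAODPJYyo",
    "RT @_CollegeHumor_: Remix to ignition, there’s no food in the kitchen, my life is a mess and I can’t afford my tuition",
    "RT @_CollegeHumor_: Remix to ignition, there’s no food in the kitchen, my life is a mess and I can’t afford my tuition",
]

# Assumption: a tweet is a retweet if its text starts with 'RT @'.
retweets = [tweet for tweet in tweets if tweet.startswith("RT @")]

# A set keeps only one copy of each repeated text.
distinct = set(tweets)

print(len(tweets), "tweets,", len(retweets), "retweets,", len(distinct), "distinct texts")
```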

## Thoughts as Data

As we mentioned at the outset, what gets categorised as 'raw data' depends very much on the context. Here's an example, from the novel *Thinks ...* by David Lodge (2001). In this extract, the narrator is describing an exercise in which he verbalises whatever comes into his head, and records it on a dictation machine for subsequent transcription and analysis:

> The object of the exercise being to try and describe the structure of, or rather to produce a specimen, that is to say raw data, on the basis of which one might begin to try to describe the structure of, or from which one might infer the structure of ... thought.

