# **What's the deal with data?**

As humans, our insatiable desire to `learn`, make `inferences` about things, and `predict future events` requires **knowledge** about the world. This knowledge is where data comes in.

**Data** is the information that enables you to do all of those things. For example:

> Say I want to learn about the **gender wage gap** in the US. I could collect the following data from a large number of people working at various companies:
- gender
- age
- salary
- years spent at the company
- education level

I could then plot this data to `look for trends`. I could also try to `predict` someone's salary based on this data.
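
As a minimal sketch of what working with such data might look like, the snippet below stores a handful of records as Python dictionaries and compares the average salary per gender. Every value is made up purely for illustration, and the names `records` and `mean_salary_by_gender` are hypothetical. A real analysis would need far more rows and would also have to account for the other features.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records, invented for illustration only (not real survey data).
records = [
    {"gender": "F", "age": 34, "salary": 72000, "years_at_company": 5, "education": "Bachelor's"},
    {"gender": "M", "age": 41, "salary": 78000, "years_at_company": 9, "education": "Master's"},
    {"gender": "F", "age": 29, "salary": 65000, "years_at_company": 2, "education": "Bachelor's"},
    {"gender": "M", "age": 36, "salary": 70000, "years_at_company": 4, "education": "Bachelor's"},
]


def mean_salary_by_gender(rows):
    """Group salaries by gender and return the average of each group."""
    salaries = defaultdict(list)
    for row in rows:
        salaries[row["gender"]].append(row["salary"])
    return {gender: mean(values) for gender, values in salaries.items()}


averages = mean_salary_by_gender(records)
print(averages)
```

A raw difference between the two averages is only a starting point: it says nothing yet about whether age, tenure, or education explain part of it.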

Collections of data like this are called `datasets`. Today, you'd be hard-pressed to find a single company or business that does _not_ use datasets in its decision-making process. The role of data is becoming increasingly important as technology improves.

Before selecting our datasets, first we need to go over a fundamental property of data: `features`!

## Features, attributes, variables, or ... columns?

When looking at datasets, the pieces of information recorded about the subject go by different names depending on who you ask:

- In `computer science`, they are called **features** or **attributes**.
- In `statistics`, they are called **variables**.
- In `linear algebra`, they are the **columns** of an array/matrix.

These are the pieces of information that let us teach machines to do what we want automatically. From this point on, I use the term **features**, as this is the term used in machine learning (ML).
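
To make the "columns" view concrete, here is a small sketch in which a dataset is a list of rows and each feature is one column. The feature names are borrowed from the wage-gap example, and the values are invented for illustration.

```python
# Each inner list is one subject (a row); each position is one feature (a column).
# The values are hypothetical.
feature_names = ["age", "salary", "years_at_company"]
rows = [
    [34, 72000, 5],
    [41, 78000, 9],
    [29, 65000, 2],
]

# zip(*rows) transposes the rows, so each tuple holds one feature's values.
columns = dict(zip(feature_names, zip(*rows)))

print(columns["salary"])
```

Whether you call `salary` a feature, an attribute, a variable, or a column, it is the same thing: one recorded property, with one value per subject.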

An example of features:

> Say I want to create a machine learning algorithm that `predicts company A's stock price` for the end of the week. To do this, I can use the stock prices from the first six days as features to predict the price on the last day (day 7).

![](assets/stock.png)

> Looking at the line plot above, the stock price is generally increasing. Thus, my ML system will likely predict a number greater than 64.
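
One very simple way to turn six days of prices into a day-7 prediction is to fit a straight line through them and extend it by one day. The sketch below does that with ordinary least squares. The prices are hypothetical placeholders (they are not read from the plot), `fit_line` is a helper written for this example, and a straight-line trend is only one of many possible models, not necessarily the one an ML system would learn.

```python
# Hypothetical closing prices for days 1-6 (NOT taken from the plot above).
days = [1, 2, 3, 4, 5, 6]
prices = [52.0, 55.0, 54.0, 59.0, 61.0, 64.0]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(days, prices)

# Extend the fitted trend one day past the observed data.
day_7_prediction = slope * 7 + intercept
print(round(day_7_prediction, 2))
```

Because the fitted slope is positive for these made-up prices, the day-7 estimate lands above the last observed price, which matches the intuition of reading an upward trend off a plot.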

## Flavors of features

Not all features are equal. Features exhibit certain properties and are divided into two groups: `quantitative` and `qualitative` features. 

### *Quantitative*

*`Quantitative features`* are ones recorded as numbers. This can be in the form of a ranking, a measurement, a count of occurrences of things, and more. These types of features are also referred to as **numerical**.

An example of `quantitative features` would be the following mini dataset containing measures from college students:

| Weight (lbs) | Height (ft'in) | Age | GPA (1-4) | # of enrolled classes |
|--------------|----------------|-----|-----------|-----------------------|
| 163          | 5'6            | 19  | 2.9       | 3                     |
| 208          | 6'4            | 24  | 3.3       | 4                     |
| 123          | 5'5            | 21  | 3.7       | 4                     |
| 90           | 5'3            | 25  | 2.9       | 2                     |
| 176          | 5'8            | 19  | 3.9       | 5                     |
| 198          | 6'0            | 20  | 3.0       | 3                     |
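
Because quantitative features are numbers, arithmetic on them is meaningful. The sketch below copies three columns from the table and computes their averages. The helper `to_inches` is written for this example and assumes heights are given as feet and inches separated by an apostrophe, as in the table.

```python
from statistics import mean

# Values copied from the table above.
weights_lbs = [163, 208, 123, 90, 176, 198]
heights = ["5'6", "6'4", "5'5", "5'3", "5'8", "6'0"]
gpas = [2.9, 3.3, 3.7, 2.9, 3.9, 3.0]


def to_inches(height):
    """Convert a feet'inches string such as "5'6" into total inches."""
    feet, inches = height.split("'")
    return int(feet) * 12 + int(inches)


# Heights must be converted to a single numeric unit before averaging.
heights_in = [to_inches(h) for h in heights]

print(round(mean(weights_lbs), 1), round(mean(heights_in), 1), round(mean(gpas), 2))
```

Note that the height column only becomes usable as a number after it is converted to a single unit; stored as text like `5'6`, it cannot be averaged directly.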


### *Qualitative*

*`Qualitative`* features are *not* recorded as numbers. These are typically labels, names, or other non-numerical descriptions. This type of data is also called **categorical**, since these non-numerical labels are often categories of things.

Continuing the theme of measuring data from college students, the following are examples of `qualitative features`:

| Name    | Year      | Gender | Major                  | Commuter | Likes school |
|---------|-----------|--------|------------------------|----------|--------------|
| Hans    | Freshman  | M      | Computer science       | Yes      | No           |
| Jack    | Senior    | M      | Biochemistry           | No       | Yes          |
| Hailey  | Sophomore | F      | Electrical engineering | Yes      | Yes          |
| Edward  | Senior    | M      | Anthropology           | Yes      | No           |
| Cynthia | Freshman  | F      | Health Science         | No       | No           |
| Sarah   | Senior    | F      | Journalism             | Yes      | Yes          |
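
Averaging labels makes no sense, but counting how often each label occurs does. As a small sketch, the snippet below copies two columns from the table and tallies their categories with `collections.Counter` from the Python standard library.

```python
from collections import Counter

# Values copied from the table above.
years = ["Freshman", "Senior", "Sophomore", "Senior", "Freshman", "Senior"]
commuter = ["Yes", "No", "Yes", "Yes", "No", "Yes"]

# Counter maps each distinct label to the number of times it appears.
year_counts = Counter(years)
commuter_counts = Counter(commuter)

# most_common(1) returns the single most frequent (label, count) pair.
most_common_year, most_common_count = year_counts.most_common(1)[0]

print(year_counts)
print(commuter_counts)
print(most_common_year, most_common_count)
```

Counts and most-frequent labels are the kind of summary that fits categorical data, in the same way that averages fit numerical data.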


<div class="alert alert-block alert-info"><b>Note:</b> Quantitative and qualitative features can further be broken down into sub-categories. Check out this neat <a href="https://statsandr.com/blog/variable-types-and-examples/">webpage</a> for more information. </div>

# **Choosing a dataset**