Copyright 2020 Andrew M. Olney and made available under [CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0) for text and [Apache-2.0](http://www.apache.org/licenses/LICENSE-2.0) for code.

# Data Science and the Nature of Data

<span style="color: #8B0000;">

## Seeing the Bigger Picture in Data Science

Data science is about discovering patterns and making sense of the vast amounts of information around us. From the numbers in a weather forecast to the recommendations on your favorite streaming service, data science influences how we understand and interact with the world. At its core, data science involves working with variables, which are pieces of information that change depending on the context. For example, a person's height, age, and weight are all variables that help describe them. However, these variables come in different forms, and understanding their nature is essential for meaningful analysis.
</span>

<span style="color: #8B0000;">
In this study, you will explore the different types of variables and how they shape the interpretation of data. You will then see how data is structured in tables and spreadsheets, which let us organize and analyze information efficiently. Finally, you will see how dataframes, powerful tools in data science, let us manipulate data in a program.
By the end of this study, you will see how these components of data science connect, so that you can understand and work with data intuitively. Instead of just memorizing definitions, you will gain an appreciation for how data is structured and used in real-world applications, helping you think a little more like a data scientist. The practical portion at the end gives you the chance to apply these ideas yourself.
</span>

## Types of variables

Structured data begins with **measurements** of some type of thing in the real world, which we call a **variable**.
Consider the example of height. 
I may measure 10 people and find that their heights in centimeters are:

| Height |
|--------|
| 165    |
| 188    |
| 153    |
| 164    |
| 150    |
| 190    |
| 169    |
| 163    |
| 165    |
| 190    |

Each of these values (e.g. 165) is a measurement of the variable *height*.
We call *height* a variable because its value isn't constant.
If everyone in the world were the same height, we wouldn't call height a variable, and we also wouldn't bother measuring it, because we'd know everyone is the same.

Variables have different **types** that can affect your analysis.

### Nominal

A nominal variable consists of unordered categories, like *male* or *female* for biological sex.
Notice that these categories are not numbers, and there is no order to the categories.
We do not say that male comes before female or is smaller than female.

### Ordinal

Ordinal variables consist of ordered categories.
You can think of them as nominal variables with an ordering from first to last or smallest to largest.
A common example of ordinal data is a Likert question like:

```
(1) Strongly disagree
(2) Disagree
(3) Neither agree nor disagree
(4) Agree
(5) Strongly agree
```

Even though these options are numbered 1 to 5, those numbers only indicate which comes before the others, not how "big" an option is.
For example, we wouldn't say that the difference between *Agree*  and *Disagree* is the same as the difference between *Neither agree nor disagree* and *Strongly agree*.
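
As a minimal sketch of this idea, the Likert options above can be stored as an *ordered categorical* in the `pandas` library, which is introduced later in this notebook. The `responses` values are invented purely for illustration, and this is only one possible way to represent ordinal data.

```python
import pandas as pd

# The five Likert options from the text, listed from lowest to highest.
levels = [
    "Strongly disagree",
    "Disagree",
    "Neither agree nor disagree",
    "Agree",
    "Strongly agree",
]

# Hypothetical responses, invented for illustration only.
responses = pd.Series(
    pd.Categorical(
        ["Agree", "Disagree", "Strongly agree"],
        categories=levels,
        ordered=True,
    )
)

# Ordering questions make sense for ordinal data.
print(responses.min())                    # Disagree
print((responses > "Disagree").tolist())  # [True, False, True]
```

The categories can be compared and sorted, but they are not numbers, so there is no meaningful "Agree minus Disagree" to compute.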

### Interval

Interval variables are ordered *and* their measurement scales are evenly spaced.
A classic example is temperature in Fahrenheit.
In degrees Fahrenheit, the difference between 70 and 71 is the same as the difference between 90 and 91: in either case it is one degree.
The other important characteristic of interval variables, and the most confusing one, is that they don't have a meaningful zero value.
Degrees Fahrenheit is an example of this because there's nothing special about 0 degrees.
0 degrees doesn't mean there's no temperature or no heat energy; it's just an arbitrary point on the scale.

### Ratio

Ratio variables are like interval variables but with meaningful zeros.
Age and height are good examples because 0 age means you have no age, and 0 height means you have no height.
The name *ratio* reflects that you can form a ratio with these variables, which means that you can say age 20 is twice as old as age 10.
Notice you can't say that about degrees Fahrenheit: 100 degrees is not really twice as hot as 50 degrees, because 0 degrees Fahrenheit doesn't mean "no temperature."
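
A small sketch can make the difference concrete. It converts Fahrenheit to the Kelvin scale, which is not mentioned above but is a temperature scale whose zero is absolute zero, so ratios on it are meaningful. The helper name `fahrenheit_to_kelvin` is made up for this illustration.

```python
def fahrenheit_to_kelvin(degrees_f):
    """Convert degrees Fahrenheit to kelvins (0 K is absolute zero)."""
    return (degrees_f - 32) * 5 / 9 + 273.15


# Ratio variable: age has a true zero, so dividing ages is meaningful.
print(20 / 10)  # 2.0

# Interval variable: 100 / 50 suggests "twice as hot", but on a scale
# with a true zero the ratio is only about 1.1.
ratio = fahrenheit_to_kelvin(100) / fahrenheit_to_kelvin(50)
print(round(ratio, 2))  # 1.1
```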

<span style="color: #8B0000;">

### Comprehensive Example Involving All Variable Types 

Consider a hypothetical survey that records four variables for each participant:

- **Country (Nominal):** Each participant's country of origin, such as USA, Canada, or Japan. This is a nominal variable because the categories (countries) have no inherent order.
- **Satisfaction Level (Ordinal):** Participants rate their overall life satisfaction on a scale from 1 (Very Unsatisfied) to 5 (Very Satisfied). This is an ordinal variable because the ratings have an order, but the intervals between them are not necessarily equal.
- **Temperature (Interval):** The average yearly temperature (in Celsius) of the participant's city. This is an interval variable because differences between temperatures are meaningful and the scale is evenly spaced, but zero does not mean an absence of temperature.
- **Annual Income (Ratio):** The yearly income of the participant in USD. This is a ratio variable because it has a meaningful zero (no income) and ratios between values are meaningful (e.g., an income of 40,000 USD is twice an income of 20,000 USD).
</span>


<span style="color: #8B0000;">
    
## Exploring the World of Data through Tabular Structures

**Tabular data**, commonly found in spreadsheets, is the most prevalent form of structured data. It organizes information into rows and columns. Each **row** typically represents an **observation** or **datapoint** (in statistics terminology) or an **item** (in machine learning terminology), such as a person, and holds that person's measurements for variables like height, age, and weight. A dataset with five rows, for example, describes five individuals. Each **column** holds one variable and is labeled with that variable's name, such as 'Height', at the top. These **column headers** are descriptive labels rather than data, which is why they are not counted among the rows.
 
<table width="100%">
  <tr>
    <td valign="top" style="width: 33.33%; padding-right: 10px; white-space: nowrap;">
      <strong>* Standard Tabular Data</strong>
      <pre>
Height  Age  Weight
161    50    53
161    17    53
155    33    84
180    51    84
186    18    88
      </pre>
    </td>
    <td valign="top" style="width: 33.33%; padding-right: 10px; white-space: nowrap;">
      <strong>** Comma-Separated Values (CSV)</strong>
      <pre>
CSV Representation
Height,Age,Weight
161,50,53
161,17,53
155,33,84
180,51,84
186,18,88
      </pre>
    </td>
    <td valign="top" style="width: 33.33%; white-space: nowrap;">
      <strong>*** Tab-Separated Values (TSV)</strong>
      <pre>
TSV Representation
Height  Age  Weight
161    50    53
161    17    53
155    33    84
180    51    84
186    18    88
      </pre>
    </td>
  </tr>
</table>



\* *This table represents a typical structured data format used in spreadsheets, with each row corresponding to an individual record and each column to a variable such as Height, Age, or Weight.*

\*\* *CSV is a widely used format in data science, in which values are separated by commas. This format simplifies exporting and importing data between programs and platforms.*

\*\*\* *TSV uses tabs to delimit values, which makes it convenient for data whose values contain commas, since those commas then need no quoting.*

</span>
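
To illustrate that CSV and TSV carry the same table, the sketch below parses both forms of the example above with the `pandas` library (introduced in the next section). The in-memory strings are stand-ins for real files, and `sep="\t"` tells `read_csv` to split on tabs rather than commas.

```python
import io
import pandas as pd

# The example table as CSV text, with a header line followed by five rows.
csv_text = (
    "Height,Age,Weight\n"
    "161,50,53\n"
    "161,17,53\n"
    "155,33,84\n"
    "180,51,84\n"
    "186,18,88\n"
)

# The same table with tabs instead of commas.
tsv_text = csv_text.replace(",", "\t")

from_csv = pd.read_csv(io.StringIO(csv_text))
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(from_csv.equals(from_tsv))  # True
print(from_csv.shape)             # (5, 3)
```

The shape is five rows by three columns: the header line supplies the column names and is not counted as a row.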

## Dataframes

Data scientists often load tabular data into a **dataframe** that they can manipulate in a program.
In other words, tabular data from a file is brought into the computational notebook in a variable that represents rows, columns, headers, etc., just as they are stored in the tabular data file.
Because dataframes match tabular data in files, they are very intuitive to work with, which may explain their popularity.

We're now at the practical portion of this notebook, so let's work with dataframes!

First, we need to import a dataframe library called `pandas`.

**Follow the steps in the video below**

Once the code is in the Jupyter cell below, you must **execute** or **run** it by either pressing the &#9658; button at the top of the window or by pressing Shift + Enter on your keyboard.
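
The video's steps are not reproduced in this text. Written out as code, the import would most likely look like the sketch below, which uses the conventional nickname `pd`. The version check is an optional extra and not part of the exercise.

```python
# Load the pandas library and give it the short nickname `pd`.
import pandas as pd

# Optional check that the import worked: show the installed version.
print(pd.__version__)
```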

We can now do things with `pd`, like load datasets!

Our file is called `height-age-weight.csv` and it is in the `datasets` folder.
That means the **path** from this notebook (the one you're reading) to the data is `datasets/height-age-weight.csv`.

To read this file into a dataframe, we will use `pd` and read the file into a variable called `dataframe`. 

**Follow the steps in the video below**

Run the cell with the &#9658; button or press Shift + Enter to run the code.

We've read the csv and stored the data in the `dataframe` variable, so we will use the `dataframe` variable whenever we want to work with the data.
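
As a sketch of this step, the code below reads CSV data into `dataframe`. So that it runs anywhere, it uses an in-memory stand-in for the file and assumes the file holds the five rows from the CSV example earlier. The file's actual contents may differ.

```python
import io
import pandas as pd

# Stand-in for datasets/height-age-weight.csv, assumed to hold the five
# rows from the CSV example earlier in this notebook.
csv_text = """Height,Age,Weight
161,50,53
161,17,53
155,33,84
180,51,84
186,18,88
"""

# In the notebook itself, the call would take the path instead:
#   dataframe = pd.read_csv("datasets/height-age-weight.csv")
dataframe = pd.read_csv(io.StringIO(csv_text))

print(dataframe)
```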

There are many things we can do with dataframes.
One thing we can do is get specific rows.

**Follow the steps in the video below**

*Then &#9658; or Shift + Enter*

As you can see, the output is only the first row of the dataframe.

Try it again in the cell below, but this time, change the `1` to a `2`

*Then &#9658; or Shift + Enter*

Now the output is the first two rows of the dataframe.
We could get arbitrary rows of the dataframe by starting at a different number and ending at a different number.
Sometimes people call this a **slice**.
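
The row selections above can be sketched as follows, assuming the exercise uses slice notation such as `dataframe[0:1]` and that the data matches the five sample rows from the CSV example earlier. Row positions start at 0, and the end position is not included.

```python
import pandas as pd

# Assumed sample data, copied from the CSV example earlier in this notebook.
dataframe = pd.DataFrame({
    "Height": [161, 161, 155, 180, 186],
    "Age": [50, 17, 33, 51, 18],
    "Weight": [53, 53, 84, 84, 88],
})

first_row = dataframe[0:1]  # row 0 only; the end position 1 is excluded
first_two = dataframe[0:2]  # rows 0 and 1
middle = dataframe[2:4]     # an arbitrary slice: rows 2 and 3

print(first_two)
```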

We can get a column of the dataframe by using the name of the variable for that column.
Before we go any further, let's step back for a second to talk about **lists**.

We can think of a dataframe in two ways:

- A list of rows
- A list of columns

We just saw the list-of-rows view.
So why are columns any different?
The difference is that our columns have variable names, and we often want to refer to columns using those names.
For example, we want to say something like "give me the Age column" instead of "give me column 2."

Let's make a list from scratch to illustrate this.

**Follow the steps in the video below**

Now execute the cell (scroll up if you need a reminder how).

This is a list with one thing inside it, `"Height"`.
Lists can have multiple things inside them, making lists a container for other variables.
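
In plain Python, lists like these could be written as in the sketch below. The variable names `one_item` and `several` are made up for illustration.

```python
# A list with one thing inside it.
one_item = ["Height"]
print(len(one_item))  # 1

# A list with several things inside it.
several = ["Height", "Age", "Weight"]
print(len(several))   # 3

# Python counts positions from 0, so position 1 is the second item.
print(several[1])     # Age
```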

Let's use a list to get a column from the dataframe.

**Follow the steps in the video below**

And run it.
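
One way this step might look in code is sketched below, again assuming the five sample rows from the CSV example earlier. Putting the list `["Height"]` inside the square brackets asks the dataframe for just that column.

```python
import pandas as pd

# Assumed sample data, copied from the CSV example earlier in this notebook.
dataframe = pd.DataFrame({
    "Height": [161, 161, 155, 180, 186],
    "Age": [50, 17, 33, 51, 18],
    "Weight": [53, 53, 84, 84, 88],
})

# A list of column names inside the brackets selects those columns.
height_only = dataframe[["Height"]]

print(height_only)
```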

We can get more than one column by adding another element to the list. 

**Follow the steps in the video below**

And run the cell (try Shift + Enter if you haven't tried it yet).
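
A sketch of selecting two columns, under the same assumption about the sample data. The choice of `"Age"` as the second column is illustrative, since the text does not say which column the video adds.

```python
import pandas as pd

# Assumed sample data, copied from the CSV example earlier in this notebook.
dataframe = pd.DataFrame({
    "Height": [161, 161, 155, 180, 186],
    "Age": [50, 17, 33, 51, 18],
    "Weight": [53, 53, 84, 84, 88],
})

# Two names in the list give two columns, in the order listed.
height_and_age = dataframe[["Height", "Age"]]

print(height_and_age)
```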

To recap, dataframes are both lists of rows and lists of columns, and lists are themselves containers for other variables.

There are many, many things we can do with dataframes, but let's just talk about one more for now.

We can select rows based on a value in a particular column:

**Follow the steps in the video below**

Don't forget to run the cell!

The resulting column is `True` or `False` depending on whether the value of `Height` was above 161 or not (notice a few were exactly equal to 161, so they weren't greater).
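
The comparison can be sketched like this, assuming the five sample rows from the CSV example earlier. The name `is_above_161` is made up for illustration.

```python
import pandas as pd

# Assumed sample data, copied from the CSV example earlier in this notebook.
dataframe = pd.DataFrame({
    "Height": [161, 161, 155, 180, 186],
    "Age": [50, 17, 33, 51, 18],
    "Weight": [53, 53, 84, 84, 88],
})

# Compare every value in the Height column to 161 at once.
is_above_161 = dataframe["Height"] > 161

print(is_above_161.tolist())  # [False, False, False, True, True]
```

The two rows with a height of exactly 161 come out `False`, because 161 is not greater than 161.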

What we're about to do next is magical.

**Follow the steps in the video below**

And run it!

The dataframe only kept the rows for which `Height` was > 161, i.e. those for which this was `True`.

Notice this time we didn't put `Height` in a list. It won't work if we do.
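
Here is a sketch of the filtering step under the same assumption about the sample data. The `True`/`False` column goes inside the square brackets, and only the rows marked `True` are kept. The name `taller_rows` is made up for illustration.

```python
import pandas as pd

# Assumed sample data, copied from the CSV example earlier in this notebook.
dataframe = pd.DataFrame({
    "Height": [161, 161, 155, 180, 186],
    "Age": [50, 17, 33, 51, 18],
    "Weight": [53, 53, 84, 84, 88],
})

# Keep only the rows where the comparison is True.
# Note that "Height" is written as a plain name here, not inside a list.
taller_rows = dataframe[dataframe["Height"] > 161]

print(taller_rows)
```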

------------------------

**Once you complete the above steps, the link for the completion keyword will appear below**

If you think it should appear and it hasn't, double check these things:

- Is there an answer in every code cell?
- Has every cell been run (does it have a number next to it, e.g. `[3]`)?
- Remember you can watch the training video in the other tab
