Copyright 2020 Andrew M. Olney and made available under [CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0) for text and [Apache-2.0](http://www.apache.org/licenses/LICENSE-2.0) for code.

  # 1- Data Science and the Nature of Data

<span style="color: #8B0000;">
Data science is the study of data: how we collect, organize, and analyze it to extract meaningful insights. This material is structured to help you understand the basics of data science in a logical, step-by-step manner.
</span>

<span style="color: #8B0000;">
1-1 Understanding Variables: We begin by defining variables, which are the fundamental building blocks of data. A variable represents a characteristic that can take different values, such as height, age, or weight.
</span>

<span style="color: #8B0000;">
1-2 Types of Variables: Next, we categorize variables into four types:

- **Nominal** (categorical labels, like gender or color)
- **Ordinal** (ordered categories, like rankings)
- **Interval** (numeric values with equal spacing, like temperature)
- **Ratio** (numeric values with a meaningful zero, like age)
</span>

<span style="color: #8B0000;">
1-3 Organizing Data in Tables: We then explore how data is structured in tables and spreadsheets. Each row represents an individual observation, and each column represents a variable.
</span>

<span style="color: #8B0000;">
1-4 Using Dataframes in Data Science: Finally, we introduce dataframes, a tool used in programming to manipulate and analyze tabular data effectively.
</span>

<span style="color: #8B0000;">
Throughout this material, you will work through clear examples and explanations, building your knowledge step by step. By the end, you will have a solid understanding of how data is classified, structured, and analyzed using dataframes. Each concept builds on the one before it, so it helps to work through the sections in order.
</span>

## 2- Types of variables

Structured data begins with **measurements** of some type of thing in the real world, which we call a **variable**.
Consider the example of height. 
I may measure 10 people and find that their heights in centimeters are:

<div style="display: flex; align-items: flex-start;">

<!-- Table -->
<div>
    <table>
        <thead>
            <tr>
                <th>Height</th>
            </tr>
        </thead>
        <tbody>
            <tr><td>165</td></tr>
            <tr><td>188</td></tr>
            <tr><td>153</td></tr>
            <tr><td>164</td></tr>
            <tr><td>150</td></tr>
            <tr><td>190</td></tr>
            <tr><td>169</td></tr>
            <tr><td>163</td></tr>
            <tr><td>165</td></tr>
            <tr><td>190</td></tr>
        </tbody>
    </table>
</div>


</div>

Each of these values (e.g. 165) is a measurement of the variable *height*.
We call *height* a variable because its value isn't constant.
If everyone in the world were the same height, we wouldn't call height a variable, and we also wouldn't bother measuring it, because we'd know everyone is the same.

Variables have different **types** that can affect your analysis.

### 2-1 Nominal

A nominal variable consists of unordered categories, like *male* or *female* for biological sex.
Notice that these categories are not numbers, and there is no order to the categories.
We do not say that male comes before female or is smaller than female.
<span style="color: #8B0000;">
Example: Pet Type - Consider a survey question asking respondents about the type of pet they own: Dog, Cat, Bird, or None. These categories serve to classify responses but do not imply any order or ranking.
</span>

### 2-2 Ordinal

Ordinal variables consist of ordered categories.
You can think of them as nominal variables with an ordering from first to last or smallest to largest.
A common example of ordinal data is a Likert question like:

```
(1) Strongly disagree
(2) Disagree
(3) Neither agree nor disagree
(4) Agree
(5) Strongly agree
```

Even though these options are numbered 1 to 5, those numbers only indicate which comes before the others, not how "big" an option is.
For example, we wouldn't say that the difference between *Agree* and *Disagree* is the same as the difference between *Neither agree nor disagree* and *Strongly agree*.
<span style="color: #8B0000;">
Example: Education Level - Education levels listed as Elementary, High School, College, and Postgraduate. While there is a clear order to these categories, the difference in educational attainment between each level isn't quantified.
</span>
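
The idea that ordinal values carry order but not size can be sketched in plain Python. This is only an illustration: the `responses` list is made up, and the `rank` dictionary is a hypothetical helper that maps each Likert option to its position.

```python
# The Likert options from above, listed from lowest to highest.
levels = [
    "Strongly disagree",
    "Disagree",
    "Neither agree nor disagree",
    "Agree",
    "Strongly agree",
]

# Hypothetical helper: the position (1-5) of each option in the ordering.
rank = {label: position for position, label in enumerate(levels, start=1)}

# Made-up survey responses, invented for illustration.
responses = ["Agree", "Strongly disagree", "Agree", "Neither agree nor disagree"]

# Order-based questions make sense for ordinal data.
lowest = min(responses, key=rank.get)
highest = max(responses, key=rank.get)
in_order = sorted(responses, key=rank.get)

print(lowest)   # Strongly disagree
print(highest)  # Agree
print(in_order)
```

The positions are used here only to compare and sort. Subtracting or averaging them would treat the gaps between options as equal, which ordinal data does not guarantee.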

### 2-3 Interval

Interval variables are ordered *and* their measurement scales are evenly spaced.
A classic example is temperature in Fahrenheit.
In degrees Fahrenheit, the difference between 70 and 71 is the same as the difference between 90 and 91 - either case is one degree.
The other key characteristic of interval variables, and the most confusing one, is that they don't have a meaningful zero value.
Degrees Fahrenheit is an example of this because there's nothing special about 0 degrees.
0 degrees doesn't mean there's no temperature or no heat energy; it's just an arbitrary point on the scale.
<span style="color: #8B0000;">
Example: Time of Day - Recorded in hours, such as 10:00, 11:00, 12:00. The difference between each hour is constant, but 0:00 does not imply the absence of time.
</span>

### 2-4 Ratio

Ratio variables are like interval variables but with meaningful zeros.
Age and height are good examples because 0 age means you have no age, and 0 height means you have no height.
The name *ratio* reflects that you can form a ratio with these variables, which means that you can say age 20 is twice as old as age 10.
Notice you can't say that about degrees Fahrenheit: 100 degrees is not really twice as hot as 50 degrees, because 0 degrees Fahrenheit doesn't mean "no temperature."
<span style="color: #8B0000;">
Example: Distance Traveled - Measured in kilometers on a trip. This variable has a true zero point (no travel), and differences and ratios are meaningful.
</span>
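
The difference between interval and ratio variables can be checked with a little arithmetic. The sketch below brings in the Kelvin scale, which is not used elsewhere in this notebook, purely as an illustration: it is a temperature scale whose zero does mean "no heat energy." The function name `fahrenheit_to_kelvin` is our own.

```python
def fahrenheit_to_kelvin(degrees_f):
    """Convert degrees Fahrenheit to kelvins (0 K is absolute zero)."""
    return (degrees_f - 32) * 5 / 9 + 273.15


# Ratio variable: age has a true zero, so this ratio is meaningful.
age_ratio = 20 / 10

# Interval variable: dividing Fahrenheit readings directly is misleading.
naive_ratio = 100 / 50

# On a scale with a true zero, the same two temperatures are much closer.
kelvin_ratio = fahrenheit_to_kelvin(100) / fahrenheit_to_kelvin(50)

print(age_ratio)               # 2.0
print(naive_ratio)             # 2.0, but not "twice as hot"
print(round(kelvin_ratio, 3))  # 1.098
```

Measured from a true zero, 100 degrees Fahrenheit is only about 10% hotter than 50 degrees Fahrenheit, not twice as hot.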

## 3- Tabular data

The most common type of structured data is **tabular data**, which is what you find in spreadsheets.
If you've ever used a spreadsheet, you know something about tabular data!

Here's an example of tabular data, with *height* in centimeters, *age* in years, and *weight* in kilograms:

<div style="display: flex; align-items: center;">

<!-- Table -->
<div>
    <table>
        <thead>
            <tr>
                <th>Height</th>
                <th>Age</th>
                <th>Weight</th>
            </tr>
        </thead>
        <tbody>
            <tr><td>161</td><td>50</td><td>53</td></tr>
            <tr><td>161</td><td>17</td><td>53</td></tr>
            <tr><td>155</td><td>33</td><td>84</td></tr>
            <tr><td>180</td><td>51</td><td>84</td></tr>
            <tr><td>186</td><td>18</td><td>88</td></tr>
        </tbody>
    </table>
</div>


</div>

In tabular data like this, each **row** is a person.
More generically, we would say each row is an **observation** or **datapoint** (in statistics terminology) or an **item** (in machine learning terminology).
In each row, we have measurements for each of our variables for that particular person.
Since we have five rows of measurements, we know that there are five people in this dataset.

We can also think about tabular data in terms of **columns**.
Each column represents a variable, with the name of that variable in the **column header**.
For example, *height* is at the top of the first column and is the name of the variable for that column.
Importantly, the header is not an observation but rather a description of our data.
This is why we don't count the header when we are counting the rows in our data.

### 3-1 Delimited tabular data - CSV and TSV

You are probably familiar with spreadsheet files, e.g. Microsoft Excel has files that end in `.xls` or `.xlsx`.
However, in data science, it is more common to have tabular data files that are **delimited**.
A delimited file is just a plain text file where column boundaries are represented by a specific character, usually a comma or a tab.

Here's what the data above looks like in **comma separated value (CSV)** form:

```
Height,Age,Weight
161,50,53
161,17,53
155,33,84
180,51,84
186,18,88
```

and here's what the data looks like in **tab separated value (TSV)** form:

```
Height	Age	Weight
161	50	53
161	17	53
155	33	84
180	51	84
186	18	88
```

The choice of delimiter (comma, tab, or something else) is largely arbitrary, but it's best to use a delimiter that doesn't appear in your data, so that no value is accidentally split into two columns.
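
To see that the delimiter is only a convention, here is a minimal sketch using Python's standard `csv` module. It is separate from the `pandas` steps later in this notebook, and the variable names are our own. The same table is parsed once with commas and once with tabs.

```python
import csv
import io

# The CSV text shown above.
csv_text = """Height,Age,Weight
161,50,53
161,17,53
155,33,84
180,51,84
186,18,88
"""

# Build the TSV version by swapping each comma for a tab character.
tsv_text = csv_text.replace(",", "\t")

# csv.reader splits each line on whichever delimiter we give it.
csv_rows = list(csv.reader(io.StringIO(csv_text), delimiter=","))
tsv_rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))

# The first row is the header; the rest are observations.
header, observations = csv_rows[0], csv_rows[1:]

print(header)                # ['Height', 'Age', 'Weight']
print(len(observations))     # 5
print(csv_rows == tsv_rows)  # True
```

Both files describe the same five observations. Only the character between the values differs.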

## 4- Dataframes

Data scientists often load tabular data into a **dataframe** that they can manipulate in a program.
In other words, tabular data from a file is loaded into a variable in the computational notebook that represents the rows, columns, header, and so on, just as they are stored in the tabular data file.
Because dataframes match tabular data in files, they are very intuitive to work with, which may explain their popularity.

We're now at the practical portion of this notebook, so let's work with dataframes!

## 5- Working Examples 

Let's now proceed with some working examples to explore how we can manipulate and analyze our data using a dataframe.

##### 5-1 First Example

First, we need to import a dataframe library called `pandas`.

**Follow the steps in the video below**

Once the code is in the Jupyter cell below, you must **execute** or **run** it by either pressing the &#9658; button at the top of the window or by pressing Shift + Enter on your keyboard.
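
For reference, the import step might look like the sketch below. The short name `pd` matches the name used in the rest of this notebook.

```python
# Load the pandas library and give it the short name `pd`.
import pandas as pd
```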


We can now do things with `pd`, like load datasets!

##### 5-2 Second Example

Our file is called `height-age-weight.csv` and it is in the `datasets` folder.
That means the **path** from this notebook (the one you're reading) to the data is `datasets/height-age-weight.csv`.

To read this file into a dataframe, we will use `pd` and read the file into a variable called `dataframe`. 

**Follow the steps in the video below**

Run the cell with the &#9658; button or press Shift + Enter to run the code.
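
A minimal sketch of this step is shown below. So that it can run anywhere, it reads the same CSV text from a string instead of from disk. This assumes that `datasets/height-age-weight.csv` contains the five rows shown earlier. In the notebook itself you would pass the path to `pd.read_csv` directly.

```python
import io

import pandas as pd

# In the notebook, with the file on disk, this would be:
# dataframe = pd.read_csv("datasets/height-age-weight.csv")

# Stand-in for the file (assumed to hold the rows shown earlier).
csv_text = """Height,Age,Weight
161,50,53
161,17,53
155,33,84
180,51,84
186,18,88
"""

# Read the CSV text into a variable called `dataframe`.
dataframe = pd.read_csv(io.StringIO(csv_text))

print(dataframe)
```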

##### 5-3 Third Example

We've read the csv and stored the data in the `dataframe` variable, so we will use the `dataframe` variable whenever we want to work with the data.

There are many things we can do with dataframes.
One thing we can do is get specific rows.

**Follow the steps in the video below**

*Then &#9658; or Shift + Enter*
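
One way to write this step is sketched below. The dataframe is rebuilt from inline text so that the example stands alone, on the assumption that the file holds the rows shown earlier. The name `first_row` is our own.

```python
import io

import pandas as pd

# Stand-in for datasets/height-age-weight.csv (assumed contents).
csv_text = """Height,Age,Weight
161,50,53
161,17,53
155,33,84
180,51,84
186,18,88
"""
dataframe = pd.read_csv(io.StringIO(csv_text))

# Rows are counted from 0; 0:1 starts at row 0 and stops before row 1.
first_row = dataframe[0:1]

print(first_row)
```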

As you can see, the output is only the first row of the dataframe.

Try it again in the cell below, but this time, change the `1` to a `2`.

*Then &#9658; or Shift + Enter*

##### 5-4 Fourth Example

Now the output is the first two rows of the dataframe.
We could get arbitrary rows of the dataframe by starting at a different number and ending at a different number.
Sometimes people call this a **slice**.
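
Slices with different start and end numbers might look like the sketch below. The data is again rebuilt inline on the assumption that it matches the file, and the names `first_two` and `middle_rows` are our own.

```python
import io

import pandas as pd

# Stand-in for datasets/height-age-weight.csv (assumed contents).
csv_text = """Height,Age,Weight
161,50,53
161,17,53
155,33,84
180,51,84
186,18,88
"""
dataframe = pd.read_csv(io.StringIO(csv_text))

# Start at row 0 and stop before row 2: the first two rows.
first_two = dataframe[0:2]

# Start at row 2 and stop before row 4: the third and fourth rows.
middle_rows = dataframe[2:4]

print(first_two)
print(middle_rows)
```

In each slice, the start number is included and the end number is not.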

We can get a column of the dataframe by using the name of the variable for that column.
Before we go any further, let's step back for a second to talk about **lists**.

We can think of a dataframe in two ways:

- A list of rows
- A list of columns

We just saw the list-of-rows view.
So why are columns any different?
The difference is that our columns have variable names, and we often want to refer to columns using those names.
For example, we want to say something like "give me the Age column" instead of "give me column 2."

Let's make a list from scratch to illustrate this.

**Follow the steps in the video below**

Now execute the cell (scroll up if you need a reminder how).
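
As a sketch, a list written from scratch looks like this. The variable names `columns` and `more_columns` are hypothetical, chosen only for this illustration.

```python
# Square brackets create a list; this one holds a single string.
columns = ["Height"]

# Lists can hold several things, separated by commas.
more_columns = ["Height", "Age"]

print(columns)            # ['Height']
print(len(more_columns))  # 2
```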

##### 5-5 Fifth Example

This is a list with one thing inside it, `"Height"`.
Lists can have multiple things inside them, making lists a container for other variables.

Let's use a list to get a column from the dataframe.

**Follow the steps in the video below**

And run it.
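
Selecting a column with a list might look like the sketch below. The inline data stands in for the file and is assumed to match it, and the name `height_column` is our own.

```python
import io

import pandas as pd

# Stand-in for datasets/height-age-weight.csv (assumed contents).
csv_text = """Height,Age,Weight
161,50,53
161,17,53
155,33,84
180,51,84
186,18,88
"""
dataframe = pd.read_csv(io.StringIO(csv_text))

# The outer brackets index the dataframe; the inner brackets are the list.
height_column = dataframe[["Height"]]

print(height_column)
```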

##### 5-6 Sixth Example

We can get more than one column by adding another element to the list. 

**Follow the steps in the video below**

And run the cell (try Shift + Enter if you haven't tried it yet).
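
With a second element in the list, the selection might look like the sketch below. As before, the inline data is an assumed copy of the file, and the name `two_columns` is our own.

```python
import io

import pandas as pd

# Stand-in for datasets/height-age-weight.csv (assumed contents).
csv_text = """Height,Age,Weight
161,50,53
161,17,53
155,33,84
180,51,84
186,18,88
"""
dataframe = pd.read_csv(io.StringIO(csv_text))

# A list with two names selects two columns, in the order listed.
two_columns = dataframe[["Height", "Age"]]

print(two_columns)
```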

##### 5-7 Seventh Example

To recap, dataframes are both lists of rows and lists of columns, and lists are themselves containers for other variables.

There are many, many things we can do with dataframes, but let's just talk about one more for now.

We can select rows based on a value in a particular column:

**Follow the steps in the video below**

Don't forget to run the cell!
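
A comparison on one column can be sketched as follows. The inline data is assumed to match the file, and the name `is_tall` is hypothetical.

```python
import io

import pandas as pd

# Stand-in for datasets/height-age-weight.csv (assumed contents).
csv_text = """Height,Age,Weight
161,50,53
161,17,53
155,33,84
180,51,84
186,18,88
"""
dataframe = pd.read_csv(io.StringIO(csv_text))

# Compare every value in the Height column to 161, one row at a time.
is_tall = dataframe["Height"] > 161

print(is_tall)
```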

##### 5-8 Final Example

The resulting column is `True` or `False` depending on whether the value of `Height` was above 161 or not (notice a few were exactly equal to 161, so they weren't greater).

What we're about to do next is magical.

**Follow the steps in the video below**

The dataframe only kept the rows for which `Height` was > 161, i.e. those for which this was `True`.

Notice this time we didn't put `Height` in a list. It won't work if we do.
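
Putting the comparison inside the dataframe's brackets might look like the sketch below. The inline data is again an assumed copy of the file, and the name `tall_rows` is our own.

```python
import io

import pandas as pd

# Stand-in for datasets/height-age-weight.csv (assumed contents).
csv_text = """Height,Age,Weight
161,50,53
161,17,53
155,33,84
180,51,84
186,18,88
"""
dataframe = pd.read_csv(io.StringIO(csv_text))

# The inner expression gives True/False per row; the outer brackets
# keep only the rows where it is True.
tall_rows = dataframe[dataframe["Height"] > 161]

print(tall_rows)
```

Here `"Height"` is written without list brackets, so `dataframe["Height"]` is a single column of values that can be compared row by row.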

------------------------

**Once you complete the above steps, the link for the completion keyword will appear below**

If you think it should appear and it hasn't, double check these things:

- Is there an answer in every code cell?
- Has every cell been run (does it have a number next to it, e.g. `[3]`)?
- Remember you can watch the training video in the other tab.
