# <b>1 <span style='color:#F76241'>|</span> Data? I hardly know-a!</b>

<font size="9">A</font>s humans, our insatiable desire to `learn`, make `inferences` about things, and `predict future events` requires **knowledge** about the world. This knowledge is where data comes in.

**Data** is the information that enables you to do all of those things mentioned above. For example:

> Say I want to learn about the **gender wage gap** in the US. I could collect the following data from **entry level** employees working at company X (this data was made up for educational purposes):

| Gender | Age | Salary (USD per hr) | Position type | Time spent at company X (months) | Education level |
|--------|-----|---------------------|----------------|----------------------------------|-----------------|
| M      | 26  | 49                  | Entry          | 8                                | Master's        |
| M      | 32  | 55                  | Entry          | 15                               | Bachelor's      |
| M      | 34  | 60                  | Entry          | 17                               | PhD             |
| F      | 29  | 44                  | Entry          | 9                                | PhD             |
| F      | 23  | 40                  | Entry          | 3                                | Bachelor's      |
| F      | 48  | 54                  | Entry          | 30                               | Master's        |

I could then plot this data to `look for trends` and also try to `predict` someone's salary (assuming I have a lot more than just six entries like above).
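
As a minimal sketch of what "looking for trends" could mean here, the made-up table above can be loaded into a pandas `DataFrame` and summarized. The snake_case column names are my own choice for illustration, and six rows are far too few to conclude anything real.

```python
import pandas as pd

# The six made-up entry-level records from the table above.
wages = pd.DataFrame({
    "gender": ["M", "M", "M", "F", "F", "F"],
    "age": [26, 32, 34, 29, 23, 48],
    "salary_usd_per_hr": [49, 55, 60, 44, 40, 54],
    "months_at_company": [8, 15, 17, 9, 3, 30],
    "education": ["Master's", "Bachelor's", "PhD", "PhD", "Bachelor's", "Master's"],
})

# Average hourly salary per gender: a first, very rough look for a trend.
mean_salary = wages.groupby("gender")["salary_usd_per_hr"].mean()
print(mean_salary)
```

Grouping by one column and averaging another is often the quickest first check before making any plots.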

Collections of data like this are called `datasets`. Today, you'd be hard-pressed to find a single company or business that does _not_ use datasets in its decision-making process. The role of data is becoming increasingly important as technology improves.

Before selecting our datasets, first we need to go over a fundamental property of data: `features`!

 
<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;text-align: center;"><b> 1.1 Features? Attributes? Variables? Or...columns?</b></p>
</div>

When looking at datasets, the pieces of information recorded about the subject go by different names depending on who you ask:

- In `computer science`, they are called **features** or **attributes**.
- In `statistics`, they are called **variables**.
- In `linear algebra`, they are the **columns** of an array/matrix.

Features are the information that lets us teach machines to automatically do what we want. From this point on, I will use the term **features**, since it is the standard term in ML.
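
To make the "columns of an array/matrix" view concrete, here is a small sketch using NumPy. It reuses three numerical columns (age, salary, months at company) from the made-up wage table in the previous section. The variable names are my own.

```python
import numpy as np

# Rows are employees; columns are features:
# age, salary (USD per hr), months at company X.
X = np.array([
    [26, 49, 8],
    [32, 55, 15],
    [34, 60, 17],
    [29, 44, 9],
    [23, 40, 3],
    [48, 54, 30],
])

n_samples, n_features = X.shape
salary_column = X[:, 1]  # every row, second column

print(n_samples, "samples with", n_features, "features each")
print(salary_column)
```

Each row is one observation and each column is one feature, which is the layout most ML libraries expect.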

An example of features:

> Say I want to create a machine learning algorithm that `predicts company A's stock price` for the end of the week. To do this, I can use the **first six days' stock prices** as features to predict the price on the last day (day 7).

<img src="assets/images/stock.png"  width="500" height="100">

> Looking at the line plot above, the stock price is generally increasing. Thus, my ML system will likely predict a price of 64 or greater.
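
Here is a rough sketch of the same idea in code. Note two assumptions: the six prices below are made up for illustration (they are not read from the plot), and instead of a trained ML model I simply fit a straight line through the six days with `np.polyfit` and extend it one day ahead.

```python
import numpy as np

# Hypothetical closing prices for days 1-6 (made up for illustration;
# these are NOT the values from the plot above).
prices = np.array([52.0, 55.0, 54.0, 58.0, 61.0, 63.0])
days = np.arange(1, 7)  # day numbers 1..6

# Fit a straight line (degree-1 polynomial) through the six points.
slope, intercept = np.polyfit(days, prices, deg=1)

# Extend the line one step ahead to estimate day 7.
predicted_day7 = slope * 7 + intercept
print(f"trend: {slope:.2f} per day, day 7 estimate: {predicted_day7:.2f}")
```

A positive slope means the fitted trend is increasing, so the day 7 estimate lands above the most recent prices.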

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px;">
    <p style="padding: 8px;color:white;text-align: center;"><b>1.2 Types of features</b></p>
</div>

Not all features are equal. Features exhibit certain properties and are divided into two groups: `quantitative` and `qualitative` features. 

Below is a sample from the `Car sales` dataset. I will reference this data when describing these two types of features.


|     | Make   | Model          |   Sales (thousands) | Vehicle_type   |   Price (thousands) |   Engine_size |   Horsepower |   Width |   Length |
|----:|:---------------|:---------------|---------------------:|:---------------|---------------------:|--------------:|-------------:|--------:|---------:|
|   0 | Acura          | Integra        |               16.919 | Passenger      |               21.5   |           1.8 |          140 |    67.3 |    172.4 |
|   1 | Acura          | TL             |               39.384 | Passenger      |               28.4   |           3.2 |          225 |    70.3 |    192.9 |
|   2 | Acura          | CL             |               14.114 | Passenger      |              nan     |           3.2 |          225 |    70.6 |    192   |
|   3 | Acura          | RL             |                8.588 | Passenger      |               42     |           3.5 |          210 |    71.4 |    196.6 |
|   4 | Audi           | A4             |               20.397 | Passenger      |               23.99  |           1.8 |          150 |    68.2 |    178   |
|   5 | Audi           | A6             |               18.78  | Passenger      |               33.95  |           2.8 |          200 |    76.1 |    192   |
|   6 | Audi           | A8             |                1.38  | Passenger      |               62     |           4.2 |          310 |    74   |    198.2 |
|   7 | BMW            | 323i           |               19.747 | Passenger      |               26.99  |           2.5 |          170 |    68.4 |    176   |
|   8 | BMW            | 328i           |                9.231 | Passenger      |               33.4   |           2.8 |          193 |    68.5 |    176   |
|   9 | BMW            | 528i           |               17.527 | Passenger      |               38.9   |           2.8 |          193 |    70.9 |    188   |
|  10 | Buick          | Century        |               91.561 | Passenger      |               21.975 |           3.1 |          175 |    72.7 |    194.6 |


<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px;">
    <p style="padding: 8px;color:white;text-align: center;">1.2.1<em> Quantitative</em></p>
</div>

*`Quantitative features`* are ones recorded as numbers. This can be in the form of a ranking, a measurement, a count of occurrences of things, and more. These types of features are also referred to as **numerical**.

Looking at the `car sales` data in the table above, most of the features are numerical:
- Sales (thousands)
- Price (thousands)
- Engine_size
- Horsepower
- Width
- Length

Using these `quantitative` features, I can look for certain relationships between the numbers:

> Perhaps I can use **Horsepower** to `predict` the **Sales** column? Or maybe given the **Width** and **Length** columns, `predict` the **Price**?
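
As a hedged sketch, pandas can pick out the numerical columns automatically and give a first hint about such relationships. The frame below holds only the first five rows and a subset of the columns from the table above, so the numbers it prints are illustrative rather than conclusive.

```python
import pandas as pd

# First five rows of the car sales sample (a subset of the columns).
cars = pd.DataFrame({
    "Make": ["Acura", "Acura", "Acura", "Acura", "Audi"],
    "Model": ["Integra", "TL", "CL", "RL", "A4"],
    "Sales (thousands)": [16.919, 39.384, 14.114, 8.588, 20.397],
    "Price (thousands)": [21.5, 28.4, float("nan"), 42.0, 23.99],
    "Horsepower": [140, 225, 225, 210, 150],
})

# Keep only the columns stored as numbers (the quantitative features).
numeric_cols = cars.select_dtypes(include="number").columns.tolist()
print(numeric_cols)

# Pearson correlation between two quantitative features.
# Rows with a missing price are skipped automatically.
hp_price_corr = cars["Horsepower"].corr(cars["Price (thousands)"])
print(hp_price_corr)
```

A correlation close to 1 or -1 suggests one feature may help predict the other, while a value near 0 suggests little linear relationship.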



<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px;">
    <p style="padding: 8px;color:white;text-align: center;">1.2.2<em> Qualitative</em></p>
</div>

*`Qualitative`* features are *not* recorded as numbers. These are typically labels, names, or other non-numerical descriptions. This type of data is also called **categorical**, since these non-numerical labels are often categories of things.

Looking at the `car sales` data in the table above, a few of the features are categorical:
- Make
- Model
- Vehicle_type

Using these `qualitative` features, I can make certain observations about the classes of data points:

> Perhaps I can see which **Make** sells the best? Or which **Model** is the most expensive? I can also see which **Vehicle_type** is the most popular.
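
These questions can be sketched with pandas by grouping on a qualitative column. The frame below contains only the eleven sample rows shown in the table above, so the answers apply to this sample and not to the full dataset.

```python
import pandas as pd

# The eleven sample rows from the car sales table (a subset of the columns).
cars = pd.DataFrame({
    "Make": ["Acura", "Acura", "Acura", "Acura", "Audi", "Audi", "Audi",
             "BMW", "BMW", "BMW", "Buick"],
    "Model": ["Integra", "TL", "CL", "RL", "A4", "A6", "A8",
              "323i", "328i", "528i", "Century"],
    "Sales (thousands)": [16.919, 39.384, 14.114, 8.588, 20.397, 18.78,
                          1.38, 19.747, 9.231, 17.527, 91.561],
    "Vehicle_type": ["Passenger"] * 11,
    "Price (thousands)": [21.5, 28.4, float("nan"), 42.0, 23.99, 33.95,
                          62.0, 26.99, 33.4, 38.9, 21.975],
})

# Which Make sells the best? Total the sales within each category.
sales_by_make = cars.groupby("Make")["Sales (thousands)"].sum()
best_make = sales_by_make.idxmax()

# Which Model is the most expensive? (The missing price is ignored.)
priciest_model = cars.loc[cars["Price (thousands)"].idxmax(), "Model"]

# Which Vehicle_type is the most popular? Count rows per category.
type_counts = cars["Vehicle_type"].value_counts()

print(best_make, priciest_model)
print(type_counts)
```

Notice that the qualitative columns act as group labels, while the numbers being compared still come from quantitative columns.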

Because there are so few qualitative features in the `Car sales` dataset, here's another example of qualitative data recorded from college students:

| Name    | Year      | Gender | Major                  | Commuter | Likes school |
|---------|-----------|--------|------------------------|----------|--------------|
| Hans    | Freshman  | M      | Computer science       | Yes      | No           |
| Jack    | Senior    | M      | Biochemistry           | No       | Yes          |
| Hailey  | Sophomore | F      | Electrical engineering | Yes      | Yes          |
| Edward  | Senior    | M      | Anthropology           | Yes      | No           |
| Cynthia | Freshman  | F      | Health Science         | No       | No           |
| Sarah   | Senior    | F      | Journalism             | Yes      | Yes          |
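
Since categorical values cannot be averaged, a common first step is to count them. Below is a small sketch using the student table above. The variable names are my own, and `pd.crosstab` is just one of several ways to compare two categorical columns.

```python
import pandas as pd

# The six student records from the table above.
students = pd.DataFrame({
    "Name": ["Hans", "Jack", "Hailey", "Edward", "Cynthia", "Sarah"],
    "Year": ["Freshman", "Senior", "Sophomore", "Senior", "Freshman", "Senior"],
    "Gender": ["M", "M", "F", "M", "F", "F"],
    "Commuter": ["Yes", "No", "Yes", "Yes", "No", "Yes"],
    "Likes school": ["No", "Yes", "Yes", "No", "No", "Yes"],
})

# How many students fall into each category of Year?
year_counts = students["Year"].value_counts()
print(year_counts)

# Cross-tabulate two categorical features:
# rows are Commuter values, columns are Likes school values.
commute_vs_likes = pd.crosstab(students["Commuter"], students["Likes school"])
print(commute_vs_likes)
```

Counts and cross-tabulations like these are the categorical counterpart of the averages and correlations used for numerical features.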

<div class="alert alert-block alert-info"><b>Note:</b> Quantitative and qualitative features can further be broken down into sub-categories. Check out this neat <a href="https://statsandr.com/blog/variable-types-and-examples/" target="_blank">webpage</a> for more information. </div>



At this point, you should understand what data is and how it can be used to inform decisions. Now, we can start exploring datasets on the web!

# <b>2 <span style='color:#F76241'>|</span> Data aggregation</b>

Finally, we can move on to a more exciting part: **`dataset selection`**.

In this age of data, there are **_a lot_** of datasets on the web. And this raises an important question: **how do you choose the right one?**

<img src="assets/images/datameme.jpg"  width="300" height="250">

<font size="1"> (image <a href="https://www.memecreator.org/meme/you-get-a-dataset-everyone-gets-a-dataset-and-you-get-a-dataset-and-you-get-a-da/">credits</a>) </font>

The answer to this question: it depends on what your `goal` is. 

The `goal` of _this_ project is to showcase the Python data science/machine learning toolkit, so I will choose a dataset that clearly showcases how you can mold `quantitative` and `qualitative` data to answer questions and solve problems.

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;text-align: center;"><b> Where to find datasets 🔎</b></p>
</div>


As I mentioned above, there are many datasets in the wild. Here are some websites you can use to find your own:

- [OpenML.org](https://www.openml.org/search?type=data&sort=runs&status=active)
- [Kaggle.com](https://www.kaggle.com/datasets)
- [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)
- [Datasets subreddit](https://www.reddit.com/r/datasets/)
- [NYC OpenData](https://opendata.cityofnewyork.us/)
- [Awesome Public Datasets repository](https://github.com/awesomedata/awesome-public-datasets)

In this project, I will use the [California housing dataset](https://www.kaggle.com/datasets/camnugent/california-housing-prices), which contains house prices for California districts from the 1990 census, to provide code samples and showcase APIs.

More importantly, at the end of each notebook, `you` will have a chance to practice what you've learned by analyzing the [Iris](https://archive.ics.uci.edu/ml/datasets/iris) dataset. This dataset contains five features, four of which are quantitative (**sepal_length, sepal_width, petal_length, petal_width**) and can be used to predict the qualitative feature (**species**). There will be questions focused on reinforcing the knowledge you attained from the code samples, as well as answers in hidden cells to check your work. 
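
If you want a quick peek at Iris before the exercises, here is one possible way to load it. This sketch assumes scikit-learn is installed and uses the copy bundled with it rather than the UCI link above. That copy names its columns differently (for example `sepal length (cm)`) and stores the species as integer codes in a `target` column, so I add a readable `species` column myself.

```python
from sklearn.datasets import load_iris

# Load the copy of Iris that ships with scikit-learn as a pandas DataFrame.
iris = load_iris(as_frame=True)
df = iris.frame.copy()  # four measurement columns plus an integer "target"

# Map the integer codes (0, 1, 2) to species names (the qualitative feature).
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))

print(df.shape)
print(df["species"].value_counts())
```

The four measurement columns are the quantitative features, and `species` is the qualitative feature you will try to predict.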

Remember, true mastery and understanding of things comes from personal experience, experimentation, and time. You're not expected to become a machine learning expert overnight by reading these notebooks. They are meant to introduce you to the tools and mindsets you need to apply your learned skills elsewhere. If you keep using what you learn here, it will become second nature. I promise.

Finally, the most important part of learning data science and machine learning: **`have fun with it!`**

# **Exercises**

The following questions are based on the text above, as well as your own intuition.

(Note: **double click** on the text that says "Your answer here:", and when you're done answering, run the cell via shift+enter)


### 1.  
- Your answer here: 

### 2.  
- Your answer here: 

### 3.  
- Your answer here: 

## Answers

<details>
    <summary><b>Question 1</b></summary>
    
</details>

<details>
    <summary><b>Question 2</b></summary>
    
</details>

<details>
    <summary><b>Question 3</b></summary>
    
</details>