# Tabular Data and Python Data Structures

### Data Types


There are lots of different types of data in the world, and Python groups that data into several categories.

**Boolean** (`bool`)**:** 
- Any data which can be expressed as either `True` or `False`.
- Used when comparing two values. For example, if you enter `10 > 9`, Python will return `True`.
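
As a small illustration of the points above, the sketch below stores the result of a comparison in a variable and combines Booleans with `not`. The variable names and the price threshold are made up for this example.

```python
# A comparison evaluates to a Boolean value.
is_greater = 10 > 9
print(is_greater)        # True
print(type(is_greater))  # <class 'bool'>

# Hypothetical example: is a house under a 100,000 USD budget?
price_usd = 115910.26
is_affordable = price_usd < 100_000
print(is_affordable)      # False
print(not is_affordable)  # True
```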

**String** (`str`)**:**

- Data that involves text — either letters, numbers, or special characters. 
- Strings are enclosed in either single- or double-quotation marks: `"string-1"` or `'string-2'`.
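
To make the quoting rule concrete, here is a minimal sketch showing that both quote styles produce the same type, and that digits inside quotation marks behave as text rather than numbers. All variable names are illustrative.

```python
name_double = "string-1"
name_single = 'string-2'
print(type(name_double) == type(name_single))  # True

# Digits inside quotation marks are text, so `+` joins them.
rooms_text = "4"
print(rooms_text + rooms_text)            # 44
print(int(rooms_text) + int(rooms_text))  # 8

# An f-string places a formatted value inside text.
price = 330899.98
print(f"Price: {price:,.2f} USD")  # Price: 330,899.98 USD
```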

**Numeric** (`int`, `float`, `complex`)**:**

- Data that can be expressed numerically.
- An integer, or `int`, is a whole number, positive or negative, without decimals, of unlimited length:  `123`.
- A floating-point number, or `float`, is a number, positive or negative, containing one or more decimals: `123.01`.
- A complex number, or `complex`, has a real part and an imaginary part. The imaginary part is designated by a `j`: `(3 + 6j)`.
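
The three numeric types can be inspected with `type()`. The following sketch uses the example values from the list above; the variable names are only for illustration.

```python
whole = 123
decimal = 123.01
z = 3 + 6j

print(type(whole))    # <class 'int'>
print(type(decimal))  # <class 'float'>
print(type(z))        # <class 'complex'>

# Integers are not limited to a fixed number of digits.
big = 2 ** 100
print(big)  # 1267650600228229401496703205376

# A complex number stores its real and imaginary parts as floats.
print(z.real, z.imag)  # 3.0 6.0

# Dividing with `/` returns a float, even for two integers.
print(128 / 4)  # 32.0
```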

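**Sequence and Set** (`list`, `tuple`, `set`)**:**

- Data that is a collection of discrete items. Lists and tuples are sequences, meaning their items have a position; sets are unordered.
- A `list` is a collection that is ordered and changeable. It's designated using square brackets `[]`, and items can be of different data types: `["red", 1, 1.03, 1]`.
- A `tuple` is a collection that is ordered and unchangeable. It's designated using parentheses `()`: `("red", 1, 1.03, 1)`.
- A `set` is a collection that is unordered and does not permit duplicate items. Items can be added to or removed from a set, but each item must itself be of an unchangeable (hashable) type. It's designated using curly brackets `{}`: `{"red", 1, 1.03}`. Note that empty curly brackets `{}` create a `dict`; use `set()` for an empty set.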
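
The differences between these three collections are easiest to see side by side. This is a minimal sketch using the example values above; note that a set can still gain or lose items, while a tuple cannot be modified at all.

```python
items_list = ["red", 1, 1.03, 1]
items_tuple = ("red", 1, 1.03, 1)
items_set = {"red", 1, 1.03, 1}  # the duplicate 1 is dropped

# Lists can be changed in place.
items_list[0] = "blue"
print(items_list)  # ['blue', 1, 1.03, 1]

# Tuples cannot: item assignment raises a TypeError.
try:
    items_tuple[0] = "blue"
except TypeError as error:
    print("tuple is unchangeable:", error)

# Sets keep only unique items and have no guaranteed order.
print(len(items_set))  # 3
items_set.add("green")
print("green" in items_set)  # True
```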

**Mapping** (`dict`)**:** 
- Dictionaries store data in *key-value* pairs. They're designated using curly brackets `{}`, like a `set`, but notice that keys and values are associated with each other using a colon `:`. Each pair is separated from the next using a comma `,`.

```python
dict1 = {
    "department": "quindio", 
    "property_type": "house", 
    "price_usd": 330899.98
}
```
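
A short sketch of common dictionary operations, using the same `dict1` as above. The `"rooms"` and `"area_m2"` keys are made up for this example.

```python
dict1 = {
    "department": "quindio",
    "property_type": "house",
    "price_usd": 330899.98,
}

# Look up a value by its key.
print(dict1["property_type"])  # house

# Add a new key-value pair ("rooms" is a hypothetical key).
dict1["rooms"] = 4

# Iterate over keys and values together.
for key, value in dict1.items():
    print(key, "->", value)

# .get() returns a default instead of raising a KeyError for a missing key.
print(dict1.get("area_m2", "unknown"))  # unknown
```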

**Binary** (`bytes`, `bytearray`, `memoryview`)**:**
- Used to manipulate and display binary data. That is, data that can be expressed with integers represented with base 2.
- Unlike the other data types described above, binary types are generally not human-readable.
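
As a rough illustration of the three binary types, the sketch below converts a piece of text to bytes and back. The sample word and variable names are arbitrary.

```python
text = "house"

# Encoding text produces an immutable bytes object.
raw = text.encode("utf-8")
print(raw)        # b'house'
print(list(raw))  # [104, 111, 117, 115, 101]

# bytearray is the changeable counterpart of bytes.
buffer = bytearray(raw)
buffer[0] = ord("m")
print(buffer.decode("utf-8"))  # mouse

# memoryview gives access to the same bytes without copying them.
view = memoryview(buffer)
print(view[1])  # 111
```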

<img src="images/Screenshot_67.png" alt="Example Image" width="300">

## Working with Lists

Python comes with several data structures that we can use to organize tabular data. Let's start by putting a single observation in a **list**.


In [1]:
house_0_list = [115910.26, 128, 4]
house_0_list

[115910.26, 128, 4]

**Task 1.1.1:** One metric that people in the real estate industry look at is price per square meter because it allows them to compare houses of different sizes. Can you use the information in this list to calculate the price per square meter for `house_0`?

In [2]:
house_0_price_m2 = house_0_list[0]/house_0_list[1]
house_0_price_m2

905.54890625

**Task 1.1.2:** Next, use the `append` method to add the price per square meter to the end of `house_0_list`.

In [3]:
house_0_list.append(house_0_price_m2)
house_0_list

[115910.26, 128, 4, 905.54890625]

Now that you can work with data for a single house, let's think about how to organize the whole dataset. One option would be to create a list for each observation and then put those together in another list. This is called a **nested list**.

In [4]:
houses_nested_list = [
    [115910.26, 128.0, 4.0],
    [48718.17, 210.0, 3.0],
    [28977.56, 58.0, 2.0],
    [36932.27, 79.0, 3.0],
    [83903.51, 111.0, 3.0],
]

houses_nested_list

[[115910.26, 128.0, 4.0],
 [48718.17, 210.0, 3.0],
 [28977.56, 58.0, 2.0],
 [36932.27, 79.0, 3.0],
 [83903.51, 111.0, 3.0]]

**Task 1.1.3:** Append the price per square meter to each observation in `houses_nested_list` using a `for` loop.

In [5]:
for house in houses_nested_list:
    # house[0] is the price in USD, house[1] is the area in square meters
    price_m2 = house[0] / house[1]
    house.append(price_m2)
houses_nested_list

[[115910.26, 128.0, 4.0, 905.54890625],
 [48718.17, 210.0, 3.0, 231.9912857142857],
 [28977.56, 58.0, 2.0, 499.61310344827587],
 [36932.27, 79.0, 3.0, 467.4970886075949],
 [83903.51, 111.0, 3.0, 755.8874774774774]]

## Working with Dictionaries

Lists are a good way to organize data, but one drawback is that they store only values, with no labels attached. Why is that a problem? Someone looking at `[115910.26, 128.0, 4]` wouldn't know which value corresponds to price, which to area, and so on. A better option might be a **dictionary**, where each value is associated with a key. Here's what `house_0` looks like as a dictionary instead of a list:

In [6]:
house_0_dict = {
    "price_aprox_usd": 115910.26,
    "surface_covered_in_m2": 128,
    "rooms": 4,
}

house_0_dict

{'price_aprox_usd': 115910.26, 'surface_covered_in_m2': 128, 'rooms': 4}

**Task 1.1.4:** Calculate the price per square meter for `house_0` and add it to the dictionary under the key `"price_per_m2"`.


In [7]:
house_0_dict["price_per_m2"] = house_0_dict["price_aprox_usd"]/house_0_dict["surface_covered_in_m2"]
house_0_dict

{'price_aprox_usd': 115910.26,
 'surface_covered_in_m2': 128,
 'rooms': 4,
 'price_per_m2': 905.54890625}

If we wanted to combine all our observations, a good option would be to create a **list of dictionaries**.

In [8]:
houses_rowwise = [
    {
        "price_aprox_usd": 115910.26,
        "surface_covered_in_m2": 128,
        "rooms": 4,
    },
    {
        "price_aprox_usd": 48718.17,
        "surface_covered_in_m2": 210,
        "rooms": 3,
    },
    {
        "price_aprox_usd": 28977.56,
        "surface_covered_in_m2": 58,
        "rooms": 2,
    },
    {
        "price_aprox_usd": 36932.27,
        "surface_covered_in_m2": 79,
        "rooms": 3,
    },
    {
        "price_aprox_usd": 83903.51,
        "surface_covered_in_m2": 111,
        "rooms": 3,
    },
]

houses_rowwise

[{'price_aprox_usd': 115910.26, 'surface_covered_in_m2': 128, 'rooms': 4},
 {'price_aprox_usd': 48718.17, 'surface_covered_in_m2': 210, 'rooms': 3},
 {'price_aprox_usd': 28977.56, 'surface_covered_in_m2': 58, 'rooms': 2},
 {'price_aprox_usd': 36932.27, 'surface_covered_in_m2': 79, 'rooms': 3},
 {'price_aprox_usd': 83903.51, 'surface_covered_in_m2': 111, 'rooms': 3}]

This way of storing data is so popular that it closely matches a widely used data format: **JSON**, which represents data as text built from lists and key-value pairs. We'll learn more about it later in the course. For now, let's build another `for` loop, but this time, we'll add the price per square meter to each dictionary.

**Task 1.1.5:** Using a `for` loop, calculate the price per square meter and store the result under a `"price_per_m2"` key for each observation in `houses_rowwise`.

In [9]:
for house in houses_rowwise:
    house["price_per_m2"] = house["price_aprox_usd"]/house["surface_covered_in_m2"]
houses_rowwise

[{'price_aprox_usd': 115910.26,
  'surface_covered_in_m2': 128,
  'rooms': 4,
  'price_per_m2': 905.54890625},
 {'price_aprox_usd': 48718.17,
  'surface_covered_in_m2': 210,
  'rooms': 3,
  'price_per_m2': 231.9912857142857},
 {'price_aprox_usd': 28977.56,
  'surface_covered_in_m2': 58,
  'rooms': 2,
  'price_per_m2': 499.61310344827587},
 {'price_aprox_usd': 36932.27,
  'surface_covered_in_m2': 79,
  'rooms': 3,
  'price_per_m2': 467.4970886075949},
 {'price_aprox_usd': 83903.51,
  'surface_covered_in_m2': 111,
  'rooms': 3,
  'price_per_m2': 755.8874774774774}]
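
Since JSON comes up here, the following sketch shows how a list of dictionaries can be turned into JSON text and back using Python's standard `json` module. It uses a shortened copy of the data (the `houses` name is only for this example), not the full `houses_rowwise` list.

```python
import json

# Two observations, stored row-wise as a list of dictionaries.
houses = [
    {"price_aprox_usd": 115910.26, "surface_covered_in_m2": 128, "rooms": 4},
    {"price_aprox_usd": 48718.17, "surface_covered_in_m2": 210, "rooms": 3},
]

# Serialize the Python objects to a JSON-formatted string.
json_text = json.dumps(houses, indent=2)
print(type(json_text))  # <class 'str'>
print(json_text)

# Parse the string back into Python lists and dictionaries.
restored = json.loads(json_text)
print(restored == houses)  # True
```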

A list of dictionaries, the structure that **JSON** mirrors, is a great way to organize data, but it does have some downsides. Note that each dictionary represents a single house or, if we think about it as tabular data, a row in our dataset. This means that it's pretty easy to do row-wise calculations (like we did with price per square meter), but column-wise calculations are more complicated. For instance, what if we wanted to know the mean house price for our dataset? First we'd need to collect the price for each house in a list, and then calculate the mean of that list.

**Task 1.1.6:** Calculate the mean price for `houses_rowwise` by completing the code below.

In [10]:
house_prices = []
for house in houses_rowwise:
    # Collect the price from each row
    house_prices.append(house["price_aprox_usd"])
mean_house_price = sum(house_prices) / len(house_prices)
mean_house_price

62888.35399999999

One way to make this sort of calculation easier is to organize our data by features instead of observations. We'll still use dictionaries and lists, but we'll implement them slightly differently.

In [11]:
houses_columnwise = {
    "price_aprox_usd": [115910.26, 48718.17, 28977.56, 36932.27, 83903.51],
    "surface_covered_in_m2": [128.0, 210.0, 58.0, 79.0, 111.0],
    "rooms": [4.0, 3.0, 2.0, 3.0, 3.0],
}

houses_columnwise

{'price_aprox_usd': [115910.26, 48718.17, 28977.56, 36932.27, 83903.51],
 'surface_covered_in_m2': [128.0, 210.0, 58.0, 79.0, 111.0],
 'rooms': [4.0, 3.0, 2.0, 3.0, 3.0]}

**Task 1.1.7:** Calculate the mean house price in `houses_columnwise`.

In [12]:
mean_house_price = sum(houses_columnwise["price_aprox_usd"])/len(houses_columnwise["price_aprox_usd"])

mean_house_price

62888.35399999999
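
As a cross-check on the manual calculation, Python's standard library also provides `statistics.mean`. The sketch below compares the two approaches on the same price column; the `prices` list is copied from the data above, and the two results may differ in the last few decimal places because of floating-point rounding.

```python
from statistics import mean

prices = [115910.26, 48718.17, 28977.56, 36932.27, 83903.51]

# Manual approach: sum of the values divided by how many there are.
manual_mean = sum(prices) / len(prices)

# Standard-library approach.
library_mean = mean(prices)

print(round(manual_mean, 2))
print(round(library_mean, 2))
```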

Of course, when we organize our data according to columns / features, row-wise operations become more difficult. 

**Task 1.1.8:** Create a `"price_per_m2"` column in `houses_columnwise`.

In [13]:
price = houses_columnwise["price_aprox_usd"]
area = houses_columnwise["surface_covered_in_m2"]
price_per_m2 = []
# Pair each price with its matching area, then divide
for p, a in zip(price, area):
    price_per_m2.append(p / a)
# Add the new column once, after the loop has finished
houses_columnwise["price_per_m2"] = price_per_m2
houses_columnwise

{'price_aprox_usd': [115910.26, 48718.17, 28977.56, 36932.27, 83903.51],
 'surface_covered_in_m2': [128.0, 210.0, 58.0, 79.0, 111.0],
 'rooms': [4.0, 3.0, 2.0, 3.0, 3.0],
 'price_per_m2': [905.54890625,
  231.9912857142857,
  499.61310344827587,
  467.4970886075949,
  755.8874774774774]}
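
Because each layout suits a different kind of calculation, it can be handy to convert between them. The two helper functions below, `rows_to_columns` and `columns_to_rows`, are hypothetical names written for this sketch, and they assume that every row has the same keys and every column has the same length.

```python
# A shortened copy of the row-wise data, used only for this example.
sample_rows = [
    {"price_aprox_usd": 115910.26, "surface_covered_in_m2": 128.0, "rooms": 4.0},
    {"price_aprox_usd": 48718.17, "surface_covered_in_m2": 210.0, "rooms": 3.0},
    {"price_aprox_usd": 28977.56, "surface_covered_in_m2": 58.0, "rooms": 2.0},
]


def rows_to_columns(rows):
    """Turn a list of dictionaries into a dictionary of lists."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            # Create the column's list on first use, then add the value.
            columns.setdefault(key, []).append(value)
    return columns


def columns_to_rows(columns):
    """Turn a dictionary of equal-length lists into a list of dictionaries."""
    keys = list(columns)
    # zip(*...) groups the first items of every column, then the second, etc.
    return [dict(zip(keys, values)) for values in zip(*columns.values())]


sample_columns = rows_to_columns(sample_rows)
print(sample_columns["rooms"])                        # [4.0, 3.0, 2.0]
print(columns_to_rows(sample_columns) == sample_rows)  # True
```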

# Tabular Data and pandas DataFrames

While you've shown that you can wrangle data using lists and dictionaries, it's not as intuitive as working with, say, a spreadsheet. Fortunately, there are lots of libraries for Python that make it an even better tool for tabular data, way better than spreadsheet applications like Microsoft Excel or Google Sheets! One of the best-known data science libraries is **pandas**, which allows you to organize data into **DataFrames**.

Let's import pandas and then create a DataFrame from a column-wise dictionary that holds the same data as `houses_columnwise`.

In [2]:
import pandas as pd

data = {
    "price_aprox_usd": [115910.26, 48718.17, 28977.56, 36932.27, 83903.51],
    "surface_covered_in_m2": [128.0, 210.0, 58.0, 79.0, 111.0],
    "rooms": [4.0, 3.0, 2.0, 3.0, 3.0],
}

df_houses = pd.DataFrame(data)

df_houses

|   | price_aprox_usd | surface_covered_in_m2 | rooms |
|---|---|---|---|
| 0 | 115910.26 | 128.0 | 4.0 |
| 1 | 48718.17 | 210.0 | 3.0 |
| 2 | 28977.56 | 58.0 | 2.0 |
| 3 | 36932.27 | 79.0 | 3.0 |
| 4 | 83903.51 | 111.0 | 3.0 |
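
To connect the DataFrame back to the earlier tasks, here is a minimal sketch of how the same row-wise and column-wise calculations might look in pandas. It assumes pandas is installed and rebuilds `df_houses` from the same data so that the block runs on its own.

```python
import pandas as pd

df_houses = pd.DataFrame(
    {
        "price_aprox_usd": [115910.26, 48718.17, 28977.56, 36932.27, 83903.51],
        "surface_covered_in_m2": [128.0, 210.0, 58.0, 79.0, 111.0],
        "rooms": [4.0, 3.0, 2.0, 3.0, 3.0],
    }
)

# Row-wise calculation: dividing two columns works element by element,
# so no explicit loop is needed.
df_houses["price_per_m2"] = (
    df_houses["price_aprox_usd"] / df_houses["surface_covered_in_m2"]
)

# Column-wise calculation: the mean of a single column.
mean_house_price = df_houses["price_aprox_usd"].mean()

print(df_houses)
print(round(mean_house_price, 2))
```

Both kinds of calculation take a single line each here, which is the main convenience a DataFrame offers over the nested lists and dictionaries used earlier.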
