# Lec - 1

## Organising Data

Information can come in many forms, and part of a data scientist's job is making sure that information is organized in a way that's conducive to analysis. 

One common way to organize information is in a **table**, which is a group of **cells** organized into **rows** and **columns**.

When working with this sort of **tabular data**, it's important to organize rows and columns following the principles of "**[tidy data](https://en.wikipedia.org/wiki/Tidy_data)**". What does that mean in the case of our housing dataset?

1. Each row corresponds to a single house in our dataset. We'll call each of these houses an **observation**.
2. Each column corresponds to a characteristic of each house. We'll call these **features**.
3. Each cell contains only one **value**. 
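
The three principles above can be sketched in plain Python. This is only an illustration: the `untidy_row` record and its `size_and_rooms` key are hypothetical, while the numbers in `tidy_rows` are the first two rows of the DataFrame example at the end of these notes.

```python
# Untidy (hypothetical): one cell packs two values together,
# which violates principle 3.
untidy_row = {"price_approx_usd": 115910.26, "size_and_rooms": "128.0 m2 / 4 rooms"}

# Tidy: one observation (house) per row, one feature per key,
# one value per cell.
tidy_rows = [
    {"price_approx_usd": 115910.26, "surface_covered_in_m2": 128.0, "rooms": 4.0},
    {"price_approx_usd": 48718.17, "surface_covered_in_m2": 210.0, "rooms": 3.0},
]

# Because every cell holds a single number, analysis is direct.
total_rooms = sum(row["rooms"] for row in tidy_rows)
print(total_rooms)  # prints 7.0
```

With the untidy record, the same sum would first require parsing the text in `size_and_rooms`.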

### Python Lists
---

* Ordered collection of items that can be of any data type, including strings, integers, floats, and other lists.
* Defined by square brackets `[]`.
* Indexing starts at 0.
* Can be sliced, concatenated, and modified.

In [1]:
# Example

my_list = [1, 2, 3, 4, 5]
print(my_list[0])  # prints 1
my_list.append(6)
print(my_list)  # prints [1, 2, 3, 4, 5, 6]

1
[1, 2, 3, 4, 5, 6]
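
The example above covers indexing and `append`. A minimal sketch of the other operations from the list (slicing, concatenation, in-place modification, and mixed data types) might look like this; the variable names are illustrative only.

```python
my_list = [1, 2, 3, 4, 5]

# Slicing: my_list[start:stop] returns a new list; stop is excluded.
first_three = my_list[0:3]
print(first_three)  # prints [1, 2, 3]

# Concatenation: + builds a new list and leaves both operands unchanged.
combined = my_list + [6, 7]
print(combined)  # prints [1, 2, 3, 4, 5, 6, 7]

# Modification: assigning to an index changes the list in place.
my_list[0] = 10
print(my_list)  # prints [10, 2, 3, 4, 5]

# Lists can mix data types, including other lists.
mixed = ["a", 1, 2.5, [3, 4]]
print(mixed[3][0])  # prints 3
```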


### Python Dictionaries
---

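* Collection of key-value pairs. Since Python 3.7, a dictionary preserves the order in which keys were inserted.
* Defined by curly brackets `{}`.
* Keys must be unique and hashable, which in practice means immutable types (e.g., strings, integers, tuples).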
* Values can be of any data type.
* Can be accessed, modified, and iterated over.

In [2]:
my_dict = {'name': 'John', 'age': 30}
print(my_dict['name'])  # prints John
my_dict['city'] = 'New York'
print(my_dict)  # prints {'name': 'John', 'age': 30, 'city': 'New York'}

John
{'name': 'John', 'age': 30, 'city': 'New York'}
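
The example above shows access and insertion. As a small sketch of the remaining points (iteration, safe lookup, and key uniqueness), reusing the same dictionary:

```python
my_dict = {'name': 'John', 'age': 30, 'city': 'New York'}

# Iterate over key-value pairs in insertion order.
for key, value in my_dict.items():
    print(key, value)

# .get() returns a default instead of raising KeyError for a missing key.
country = my_dict.get('country', 'unknown')
print(country)  # prints unknown

# Assigning to an existing key overwrites its value; keys stay unique.
my_dict['age'] = 31
print(my_dict['age'])  # prints 31
print(len(my_dict))  # prints 3
```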


### Pandas DataFrame
---
* A 2-dimensional labeled data structure with columns of potentially different types.
* Can be thought of as a spreadsheet or a table in a relational database.
* Can be created from a dictionary of lists, where each key becomes a column name and each list supplies that column's values, one per row.

In [4]:
import pandas as pd

In [6]:
# Example
data = {
    'price_approx_usd': [115910.26, 48718.17, 28977.56, 36932.27, 83903.51],
    'surface_covered_in_m2': [128.0, 210.0, 58.0, 79.0, 111.0],
    'rooms': [4.0, 3.0, 2.0, 3.0, 3.0],
    'price_per_m2': [905.55, 231.99, 499.61, 467.50, 755.89]
}

df = pd.DataFrame(data)
df

|   | price_approx_usd | surface_covered_in_m2 | rooms | price_per_m2 |
|---|---|---|---|---|
| 0 | 115910.26 | 128.0 | 4.0 | 905.55 |
| 1 | 48718.17 | 210.0 | 3.0 | 231.99 |
| 2 | 28977.56 | 58.0 | 2.0 | 499.61 |
| 3 | 36932.27 | 79.0 | 3.0 | 467.50 |
| 4 | 83903.51 | 111.0 | 3.0 | 755.89 |
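
A short sketch of how a DataFrame built this way can be inspected and extended. It uses the first three rows of the data above and assumes that `price_per_m2` is the price divided by the covered surface, rounded to two decimals. That assumption matches the values shown, but the notes do not state it.

```python
import pandas as pd

data = {
    'price_approx_usd': [115910.26, 48718.17, 28977.56],
    'surface_covered_in_m2': [128.0, 210.0, 58.0],
    'rooms': [4.0, 3.0, 2.0],
}
df = pd.DataFrame(data)

# shape is (rows, columns): 3 observations, 3 features.
print(df.shape)  # prints (3, 3)

# Each dictionary key became a column label.
print(list(df.columns))

# Derive a new feature from two existing columns (element-wise division).
# Assumption: price_per_m2 = price / covered surface.
df['price_per_m2'] = (df['price_approx_usd'] / df['surface_covered_in_m2']).round(2)
print(df['price_per_m2'].tolist())  # prints [905.55, 231.99, 499.61]
```

Column arithmetic like this works row by row without an explicit loop, which is one reason the dictionary-of-lists layout is convenient.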
