## An introduction to `pandas`

When working with data in `pandas`, it's important to realize that the DataFrame data type is optimized for "vectorized" operations. That is, it's intended to operate on whole groups of values at once, on either:
* Full columns of data
* A subset of columns
* Full rows of data
* A subset of rows of data

It's _not_ intended to be used to loop over rows in the DataFrame one at a time (although there are exceptional times when this might be the best approach).

So, when working with DataFrames, always be thinking about "column operations" or "row operations" instead of loops.
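
To make the contrast concrete, here is a minimal sketch comparing a row-by-row loop with a single column operation. The column names `price` and `quantity` and their values are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical example data (not from the lesson's dataset)
df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "quantity": [1, 2, 3]})

# Loop approach: works, but handles one row at a time and is slow on large data
totals_loop = []
for _, row in df.iterrows():
    totals_loop.append(row["price"] * row["quantity"])

# Column operation: one expression covers every row at once
df["total"] = df["price"] * df["quantity"]
```

Both approaches produce the same numbers, but the column operation is shorter and lets `pandas` do the work internally.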

To start working with `pandas`, some basics can be useful:

* Converting structured data into DataFrames (whether from files or from Python data in memory)
* Selecting data
* Applying operations
* Grouping data

### Making DataFrames

`pandas` comes with built-in functions and methods to make this easy.

#### Reading from files
* `pd.read_csv()`
* `pd.read_excel()`
* `pd.read_json()`

#### Reading from Python data in memory
* `pd.DataFrame.from_records()`: for when you have structured data by _rows_
* `pd.DataFrame.from_dict()`: for when you have structured data in a dict by _columns_ (e.g. as performed in Workbook 05)

Examples:

```python
import pandas as pd

df = pd.read_csv("zurich_dogs.csv")  # Assumes the first row is the header row, but you can change that
```
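
As a sketch of how the header assumption can be changed, the example below reads CSV text that has no header row. The CSV text and column names are hypothetical, and `io.StringIO` stands in for a file on disk so the example runs on its own.

```python
import io

import pandas as pd

# Hypothetical CSV text with no header row, standing in for a file on disk
csv_text = "12,94,399\n93,90,499\n21,94,200\n"

# header=None tells pandas the first line is data, not column names;
# names= supplies the column names instead
df_no_header = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=["Col1", "Col2", "Col3"],
)
```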

```python
df = pd.read_excel(
    "my_excel_file.xlsx",
    sheet_name="Sheet1",
    header=0,  # The first row is a "header row"
    usecols="B:G",  # Only read spreadsheet columns B through G
)
```

```python
row_data = [
    [12, 94, 399],
    [93, 90, 499],
    [21, 94, 200],
]
 
df = pd.DataFrame.from_records(row_data, columns=["Col1", "Col2", "Col3"])
```

```python
col_data = {
    "Col1": [12, 94, 399],
    "Col2": [93, 90, 499],
    "Col3": [21, 94, 200],
}
df = pd.DataFrame.from_dict(col_data)
```

### Selecting data

There are two primary methods of selecting data in a DataFrame:

1. `.loc` - To _locate_ rows or columns based on their _names_
2. `.iloc` - To _locate_ rows or columns based on their _integer positions_

Note: if the rows or columns have no explicit names, their labels are the default integers (0, 1, 2, ...), and you can use those integers in `.loc` as well. This is most common for rows, when no custom _index_ has been set.

Basic examples:

```python
df = pd.read_csv("zurich_dogs.csv")

df.loc[0:5] # Select rows 0 through 5, inclusive (indexes are still integers)
df.iloc[0:5] # Select rows 0 through 4, stopping before 5
df.loc[:, ["Age, Guardian", "Gender, Guardian"]] # Select all rows and these two column names
df.iloc[:, 1:3] # Select all rows and column positions 1 and 2 (stopping before 3)
```
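
The difference between names and integer positions is easier to see when the rows have non-integer labels. The sketch below uses a small hypothetical DataFrame (`df_named`) whose rows are labelled `"a"`, `"b"`, and `"c"`.

```python
import pandas as pd

# Hypothetical data with string row labels instead of the default 0, 1, 2, ...
df_named = pd.DataFrame(
    {"Col1": [12, 93, 21], "Col2": [94, 90, 94]},
    index=["a", "b", "c"],
)

by_label = df_named.loc["b", "Col2"]   # Row labelled "b", column named "Col2"
by_position = df_named.iloc[1, 1]      # Second row, second column (same cell)
label_slice = df_named.loc["a":"b"]    # Label slices include the end label: rows "a" and "b"
position_slice = df_named.iloc[0:1]    # Position slices stop before the end: row "a" only
```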

You can also select individual columns quickly by using the `dict[key]` syntax:

```python
df["Age, Guardian"]
df["Primary Breed"]
```
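
A related detail, sketched here on a small hypothetical DataFrame (`df_small`): a single column name returns a Series, while a list of column names returns a DataFrame.

```python
import pandas as pd

# Hypothetical example data
df_small = pd.DataFrame(
    {"Col1": [12, 93, 21], "Col2": [94, 90, 94], "Col3": [399, 499, 200]}
)

one_column = df_small["Col1"]             # A single name returns a Series
two_columns = df_small[["Col1", "Col3"]]  # A list of names returns a DataFrame
```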

As mentioned previously, we also have this concept of selecting data by "boolean masks".

A boolean mask is a sequence of `True`/`False` values (a `pandas` Series) generated from the DataFrame, with one value per row: `True` if the row meets a certain condition and `False` otherwise.

Example continuing to use the zurich_dogs.csv file:

```python
gender_mask = df["Gender, Guardian"] == "m" # Where values in this column equal "m"
age_mask = df["Age, Guardian"] == "51-60" # Where values in this column equal "51-60"
```

The masks created above are just `True`/`False` values that correspond to each row that matches the criteria.

Masks can then be used with `.loc` to select the rows that match that data:

```python
df.loc[gender_mask]
df.loc[age_mask]
```
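
To see what a mask actually contains, here is a self-contained sketch. The `dogs` DataFrame is a small in-memory stand-in for the dog data (an assumption made so the example runs without the CSV file).

```python
import pandas as pd

# Small in-memory stand-in for a few rows of the dog data
dogs = pd.DataFrame({
    "Gender, Guardian": ["m", "m", "w", "m", "w"],
    "Age, Guardian": ["51-60", "61-70", "61-70", "41-50", "61-70"],
})

mask = dogs["Gender, Guardian"] == "m"
print(mask.tolist())       # [True, True, False, True, False]

selected = dogs.loc[mask]  # Keeps only the rows where the mask is True
print(len(selected))       # 3
```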

You can also combine them using the bit-wise operators that we saw in Lesson 05, where we learned to combine sets.

```python
df.loc[gender_mask & age_mask] # Only rows that are True for both gender_mask AND age_mask
df.loc[gender_mask | age_mask] # Only rows that are True for gender_mask OR age_mask (or both)
df.loc[gender_mask ^ age_mask] # Only rows that are True for exactly one of gender_mask or age_mask, not both (symmetric difference)
df.loc[~gender_mask] # Only rows that are NOT true for gender_mask
```
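
The sketch below runs these operators on a small in-memory stand-in for the dog data (the `dogs` DataFrame is an assumption, used so the example runs without the CSV file). It also shows that conditions written inline need parentheses, because `&` and `|` bind more tightly than `==`.

```python
import pandas as pd

# Small in-memory stand-in for a few rows of the dog data
dogs = pd.DataFrame({
    "Gender, Guardian": ["m", "m", "w", "m", "w"],
    "Age, Guardian": ["51-60", "61-70", "61-70", "41-50", "61-70"],
})

is_m = dogs["Gender, Guardian"] == "m"
is_61_70 = dogs["Age, Guardian"] == "61-70"

both = dogs.loc[is_m & is_61_70]         # True in both masks
either = dogs.loc[is_m | is_61_70]       # True in at least one mask
exactly_one = dogs.loc[is_m ^ is_61_70]  # True in one mask but not the other
not_m = dogs.loc[~is_m]                  # Mask inverted

# Inline conditions need parentheses around each comparison
both_inline = dogs.loc[
    (dogs["Gender, Guardian"] == "m") & (dogs["Age, Guardian"] == "61-70")
]
```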

### Applying functions

You can apply functions to selections of a DataFrame. These functions are commonly available as methods but you can also pass your own custom functions.
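
As a minimal sketch of both styles, the example below uses a built-in method (`.sum()`) and then passes custom functions through `.apply()`. The function `double` and the DataFrame `df_apply` are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Hypothetical example data
df_apply = pd.DataFrame({"Col1": [12, 94, 399], "Col2": [93, 90, 499]})


def double(value):
    """Hypothetical custom function, called once per value in a column."""
    return value * 2


col_sums = df_apply.sum()                 # Built-in method: one sum per column
doubled = df_apply["Col1"].apply(double)  # Custom function applied to one column
row_totals = df_apply.apply(              # Custom function applied to each row
    lambda row: row["Col1"] + row["Col2"],
    axis=1,
)
```

Where a plain column operation can do the same job, it is usually the simpler and faster choice; `.apply()` is for logic that cannot be expressed that way.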

Using our column example data:
```python
col_data = {
    "Col1": [12, 94, 399],
    "Col2": [93, 90, 499],
    "Col3": [21, 94, 200],
}
df = pd.DataFrame.from_dict(col_data)

df["Col4"] = df["Col1"] + df["Col2"] # Create a new column "Col4" through arithmetic operations
```

Output of `df.head()` for the DataFrame loaded from `zurich_dogs.csv`:

| Unnamed: 0 | HALTER_ID | Age, Guardian | Gender, Guardian | City District | Primary Breed | Year of Birth, Dog | Sex, Dog | Dog Colour |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 126 | 51-60 | m | 9.0 | Welsh Terrier | 2011 | w | Black/Brown |
| 1 | 171 | 61-70 | m | 3.0 | Berner Sennenhund | 2009 | m | tricolor |
| 2 | 574 | 61-70 | w | 2.0 | Cairn Terrier | 2002 | w | brindle |
| 3 | 695 | 41-50 | m | 6.0 | Labrador Retriever | 2012 | w | Brown |
| 4 | 893 | 61-70 | w | 7.0 | Mittelschnauzer | 2010 | w | Black |
