![Banner](./img/AI_Special_Program_Banner.jpg)  

# Aggregation of data and pivot tables - Exercise - Solution
---

## Task 1
Work through the tutorial in the file [1.3.b_Pandas-Pivot-Table-Explained-NYCHouses.ipynb](1.3.b_Pandas-Pivot-Table-Explained-NYCHouses.ipynb) and then answer the following questions:

### 1.1 Defining an order
Suppose you had a column in a data frame in which the days of the week are entered with the abbreviations `MO, TU, WE, ..., SA, SO` instead of the numbers 1-7. What could you do to ensure that these values are output in the desired order when using this column in pivot tables?

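**Answer:** Convert the column to a categorical type and set the category order explicitly (the column is assumed to be named `Weekday`):
```python
# set_categories returns a new Series (its inplace argument was removed in pandas 2.0).
df["Weekday"] = df["Weekday"].astype("category")
df["Weekday"] = df["Weekday"].cat.set_categories(["MO", "TU", "WE", "TH", "FR", "SA", "SO"], ordered=True)
```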
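
To see the effect in a pivot table, here is a minimal sketch on made-up data. The `Sales` column and all values are assumptions for illustration, and `pd.Categorical` is used as an equivalent shortcut for the conversion.

```python
import pandas as pd

# Hypothetical toy data; "Sales" and all values are illustrative only.
df = pd.DataFrame({
    "Weekday": ["WE", "MO", "SO", "TU", "MO", "SA"],
    "Sales": [3, 5, 2, 4, 1, 6],
})

weekday_order = ["MO", "TU", "WE", "TH", "FR", "SA", "SO"]
df["Weekday"] = pd.Categorical(df["Weekday"], categories=weekday_order, ordered=True)

# observed=True keeps only the weekdays that actually occur in the data.
pivot = pd.pivot_table(df, index="Weekday", values="Sales", aggfunc="sum", observed=True)
print(pivot)
```

The rows follow the category order rather than the alphabetical order of the abbreviations.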

### 1.2 Columns and aggregate functions
What do you think of the pivot tables in cells 6-8? In your answer, please explain in particular the use of the 'Zip Code' column and the aggregate function used.

**Answer:** In the pivot tables, the default aggregate function `mean` is applied to every column with a numeric data type. `sum` would presumably be a more suitable aggregate function here. More importantly, aggregating the `Zip Code` column makes no sense at all: a zip code is an identifier, not a measure that can be used in calculations.
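
One way to avoid both problems is to name the measure and the aggregate function explicitly. The following is a sketch on made-up data, not the tutorial's actual cells:

```python
import pandas as pd

# Hypothetical miniature of the sales data; all values are made up.
df = pd.DataFrame({
    "Borough": ["Bronx", "Bronx", "Queens", "Queens"],
    "Zip Code": [10451, 10452, 11101, 11102],
    "Sale Price": [300000, 500000, 400000, 600000],
})

# Naming the measure in `values` keeps "Zip Code" out of the aggregation;
# `aggfunc` is chosen deliberately instead of relying on the default mean.
pivot = pd.pivot_table(df, index="Borough", values="Sale Price", aggfunc="sum")
print(pivot)
```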

### 1.3 `margins`
What is the purpose of the `margins` and `margins_name` parameters of the `pivot_table()` function?

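**Answer:** `margins=True` adds a totals row and a totals column to the pivot table, computed with the same aggregate function. `margins_name` sets the label of this row and column (the default is `All`).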
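
A minimal sketch on made-up data (the label `Total` and all values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data; all values are made up.
df = pd.DataFrame({
    "Borough": ["Bronx", "Bronx", "Queens", "Queens"],
    "Tax Class": ["1", "2", "1", "2"],
    "Sale Price": [100, 200, 300, 400],
})

pivot = pd.pivot_table(
    df,
    index="Borough",
    columns="Tax Class",
    values="Sale Price",
    aggfunc="sum",
    margins=True,           # add a totals row and a totals column
    margins_name="Total",   # label used for both
)
print(pivot)
```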

### 1.4 `fill_value`
What is the purpose of the `fill_value` parameter of the `pivot_table` function (e.g. in cell 17)?

**Answer:** This parameter replaces *missing values* (shown as `NaN`) in the resulting pivot table with something more meaningful, such as the value 0.
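
The difference can be seen in a small sketch on made-up data in which one combination of borough and tax class does not occur:

```python
import pandas as pd

# Hypothetical toy data: there is no sale for Bronx in tax class "2".
df = pd.DataFrame({
    "Borough": ["Bronx", "Queens", "Queens"],
    "Tax Class": ["1", "1", "2"],
    "Sale Price": [100, 300, 400],
})

# Without fill_value, the empty combination shows up as NaN.
with_nan = pd.pivot_table(df, index="Borough", columns="Tax Class",
                          values="Sale Price", aggfunc="sum")

# With fill_value=0, the same cell contains 0.
filled = pd.pivot_table(df, index="Borough", columns="Tax Class",
                        values="Sale Price", aggfunc="sum", fill_value=0)
print(with_nan)
print(filled)
```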

### 1.5 Aggregation by date
How would you proceed if you wanted to obtain an aggregation by (sales) year in cell 17 instead of the aggregation by tax class?

**Answer:** Unfortunately, `pivot_table()` does not automatically drill up the datetime values in `Sale Date` to the year level. A `Year` column must therefore be derived from the `Sale Date` column, for example with
```python
# Requires "Sale Date" to have a datetime dtype.
df["Year"] = df["Sale Date"].dt.year
```
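
The derived column can then be used like any other dimension. A sketch on made-up data, assuming the dates first have to be parsed with `pd.to_datetime`:

```python
import pandas as pd

# Hypothetical toy data; all values are made up.
df = pd.DataFrame({
    "Borough": ["Bronx", "Bronx", "Queens"],
    "Sale Date": ["2019-03-01", "2020-07-15", "2020-01-10"],
    "Sale Price": [100, 200, 300],
})

df["Sale Date"] = pd.to_datetime(df["Sale Date"])  # ensure a datetime dtype
df["Year"] = df["Sale Date"].dt.year               # derive the coarser level

pivot = pd.pivot_table(df, index="Borough", columns="Year",
                       values="Sale Price", aggfunc="sum", fill_value=0)
print(pivot)
```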

### 1.6 Aggregation of residential units
Does it make sense to use the aggregate function `len` for the attribute `Residential Units` (e.g. in cell 22)? What could possibly go wrong here?

**Answer:** `len` corresponds to `COUNT` in SQL: it counts the sales records, not the residential units. This is *not* sensible here, because a single sale can comprise more than one residential unit, so the values have to be added up. For clarification, repeat cell 22 with `np.sum` instead of `len`.
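
The difference is easy to reproduce on made-up data. This is a sketch, not the tutorial's cell 22, and it uses the string `"sum"` in place of `np.sum`:

```python
import pandas as pd

# Hypothetical toy data: one Bronx sale covers 12 residential units.
df = pd.DataFrame({
    "Borough": ["Bronx", "Bronx", "Queens"],
    "Residential Units": [1, 12, 3],
})

# len counts the rows (sales) per borough ...
counts = pd.pivot_table(df, index="Borough", values="Residential Units", aggfunc=len)

# ... whereas sum adds up the units actually sold.
totals = pd.pivot_table(df, index="Borough", values="Residential Units", aggfunc="sum")
print(counts)
print(totals)
```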

### 1.7 Multi-dimensional operations
The data used in the tutorial could be based on the following (not very high quality) *star schema*.

![ER model star schema](img/PropertySalesStar.png)

Looking at cells 9, 16, 20 and 24 in the [notebook providing the material](1.3.b_Agg_Pivot_1_Mat.ipynb), which multi-dimensional operations were performed in each of them?

**Answer:**

1. 9 $\rightarrow$ *Drill-down* in the dimension `DimPlace`
2. 16 $\rightarrow$ Combination of *Drill-down* (Borough) and *Split* (TaxClass)
3. 20 $\rightarrow$ *Split* of the values for the boroughs by TaxClass
4. 24 $\rightarrow$ *Slicing*, as the value for Borough is fixed
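
As a rough illustration of how these operations map onto `pivot_table()` calls, here is a sketch on made-up data. The hierarchy Borough $\rightarrow$ Neighborhood within `DimPlace` and all values are assumptions; these are not the notebook's actual cells.

```python
import pandas as pd

# Hypothetical toy data; the hierarchy Borough -> Neighborhood is assumed.
df = pd.DataFrame({
    "Borough": ["Bronx", "Bronx", "Queens", "Queens"],
    "Neighborhood": ["Fordham", "Riverdale", "Astoria", "Astoria"],
    "Tax Class": ["1", "2", "1", "2"],
    "Sale Price": [100, 200, 300, 400],
})

# Drill-down: refine the place dimension from Borough to Neighborhood.
drill_down = pd.pivot_table(df, index=["Borough", "Neighborhood"],
                            values="Sale Price", aggfunc="sum")

# Split: break the values per borough down by a second dimension.
split = pd.pivot_table(df, index="Borough", columns="Tax Class",
                       values="Sale Price", aggfunc="sum")

# Slice: fix one dimension to a single value before aggregating.
sliced = pd.pivot_table(df[df["Borough"] == "Queens"], index="Neighborhood",
                        columns="Tax Class", values="Sale Price", aggfunc="sum")
print(drill_down, split, sliced, sep="\n\n")
```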