## Concatenation
By the end of this lecture you will be able to:
- vertically concatenate `DataFrames`
- handle inconsistent dtypes in a vertical concat
- horizontally concatenate `DataFrames`
- diagonally concatenate `DataFrames`


In [19]:
import polars as pl

We create a first `DataFrame` with fake trade records from 2023.

In [20]:
df1 = pl.DataFrame(
    [
        {"year": 2023, "exporter": "India", "importer": "Russia", "quantity": 0},
        {"year": 2023, "exporter": "India", "importer": "Russia", "quantity": 1},
    ]
)
df1

year,exporter,importer,quantity
i64,str,str,i64
2023,"""India""","""Russia""",0
2023,"""India""","""Russia""",1


We now create a second `DataFrame` with trade records from 2024.

In [3]:
df2 = pl.DataFrame(
    [
        {"year": 2024, "exporter": "India", "importer": "Russia", "quantity": 2},
        {"year": 2024, "exporter": "India", "importer": "Russia", "quantity": 2},
    ]
)
df2

year,exporter,importer,quantity
i64,str,str,i64
2024,"""India""","""Russia""",2
2024,"""India""","""Russia""",2


## Combining `DataFrames` vertically
If we have data in two different `DataFrames` that we combine as a new `DataFrame`, we can manage the data in memory in three different ways:
- keeping the data in the original two locations in memory and pointing the new `DataFrame` to these original locations
- copying the data to a single location in memory and pointing the new `DataFrame` to this single location
- appending the data from the second `DataFrame` to the location of the first `DataFrame`

We cover three methods for vertically combining `DataFrames`: `df.vstack`, `df.extend` and `pl.concat`. The output of each method is the same from a user perspective but differs in terms of where the data sits in memory under the hood.

### `vstack`

We combine the `df1` and `df2` `DataFrames` into a single `DataFrame` with the `vstack` method.

In [21]:
(df1.vstack(df2))

year,exporter,importer,quantity
i64,str,str,i64
2023,"""India""","""Russia""",0
2023,"""India""","""Russia""",1
2024,"""India""","""Russia""",2
2024,"""India""","""Russia""",2


A `vstack`:
- keeps the data from both `DataFrames` in their original locations in memory and points the new `DataFrame` to those locations. In other words, only the references change and the two `DataFrames` are not physically merged

### Rechunk
A `vstack` is computationally very cheap (as no data is copied). However, subsequent operations (e.g. `group_by`) are slower than if the data had been *rechunked* (i.e. copied from the original two chunks to a single new location in memory).

We can manually cause two `DataFrames` linked by `vstack` to be copied to a single location in memory with `rechunk`.

In [22]:
(df1.vstack(df2).rechunk())  # the data is only physically combined at this point

year,exporter,importer,quantity
i64,str,str,i64
2023,"""India""","""Russia""",0
2023,"""India""","""Russia""",1
2024,"""India""","""Russia""",2
2024,"""India""","""Russia""",2


We see below that the `pl.concat` function is a way of applying `vstack` and `rechunk` to a list of `DataFrames`.

### Extend
We can append one `DataFrame` to another with `extend`.

In [23]:
(df1.extend(df2))

year,exporter,importer,quantity
i64,str,str,i64
2023,"""India""","""Russia""",0
2023,"""India""","""Russia""",1
2024,"""India""","""Russia""",2
2024,"""India""","""Russia""",2


An `extend`:
- copies the data from the second `DataFrame` (`df2`) and appends it to the data of the first `DataFrame` (`df1`)
- modifies the first `DataFrame` (`df1`)

We can see that `df1` has been modified in-place as it now has both years of data.

In [24]:
df1

year,exporter,importer,quantity
i64,str,str,i64
2023,"""India""","""Russia""",0
2023,"""India""","""Russia""",1
2024,"""India""","""Russia""",2
2024,"""India""","""Russia""",2


Before continuing we re-assign `df1` back to its original value to reduce confusion if cells are executed out-of-order!

In [25]:
df1 = pl.DataFrame(
    [
        {"year": 2023, "exporter": "India", "importer": "Russia", "quantity": 0},
        {"year": 2023, "exporter": "India", "importer": "Russia", "quantity": 1},
    ]
)
df1

year,exporter,importer,quantity
i64,str,str,i64
2023,"""India""","""Russia""",0
2023,"""India""","""Russia""",1


### Use case of `vstack`, `rechunk` and `extend`
- If you are combining `DataFrames` to do more transformations/groupbys/joins etc. it is normally best to use `vstack` and `rechunk` so that all the data is together in memory. In practice it is simpler to use `pl.concat` to do this as we see below
- If you want to combine two `DataFrames` but do not want to do more operations on them (e.g. you just want to check their length or perhaps write them to a file) you should use `vstack`
- If you want to add a small `DataFrame` to a large `DataFrame` use `extend` as it only copies the data in the small `DataFrame`

### Vertically concatenating `DataFrames`

Above we saw how to vertically combine two `DataFrames`. More generally, we can combine a `list` of `DataFrames` with `pl.concat`. For clarity, we set the `how="vertical"` argument explicitly this time although it is the default.

In [26]:
(pl.concat([df1, df2], how="vertical"))  # vertical is the default; we set it explicitly here

year,exporter,importer,quantity
i64,str,str,i64
2023,"""India""","""Russia""",0
2023,"""India""","""Russia""",1
2024,"""India""","""Russia""",2
2024,"""India""","""Russia""",2


When we do `pl.concat` Polars:
- does a series of `vstack` operations to combine the list of `DataFrames`
- then does a `rechunk` to gather all the data together in memory

We can stop Polars from doing the `rechunk` by passing the `rechunk=False` argument. The default value of `rechunk` has differed between Polars versions, so it is safest to pass it explicitly when the memory layout matters.

In [27]:
df_vertical = pl.concat([df1, df2], rechunk=False)
df_vertical

year,exporter,importer,quantity
i64,str,str,i64
2023,"""India""","""Russia""",0
2023,"""India""","""Russia""",1
2024,"""India""","""Russia""",2
2024,"""India""","""Russia""",2


### Handling different dtypes in vertical concatenation

Polars expects the column names and dtypes to match when doing vertical concatenation.

To illustrate some approaches for handling differences in dtypes we create an alternative `df2` where the `quantity` column is a 64-bit float instead of a 64-bit integer.

In [28]:
df2_float = df2.with_columns(pl.col("quantity").cast(pl.Float64))
df2_float

year,exporter,importer,quantity
i64,str,str,f64
2024,"""India""","""Russia""",2.0
2024,"""India""","""Russia""",2.0


When the dtypes do not match we may have to manage this by doing an explicit `cast` of the column types.

In this example we cast the `quantity` column back to `pl.Int64`.

In [29]:
(pl.concat([df1, df2_float.with_columns(pl.col("quantity").cast(pl.Int64))]))

year,exporter,importer,quantity
i64,str,str,i64
2023,"""India""","""Russia""",0
2023,"""India""","""Russia""",1
2024,"""India""","""Russia""",2
2024,"""India""","""Russia""",2


However, Polars also has a way of managing certain differences by casting to a "supertype". For example, the supertype of `pl.Float64` and `pl.Int64` is `pl.Float64`.

We can do a vertical concatenation using supertypes where necessary by setting the `how` argument to `vertical_relaxed` instead of `vertical`.

In [30]:
(pl.concat([df1, df2_float], how="vertical_relaxed"))

year,exporter,importer,quantity
i64,str,str,f64
2023,"""India""","""Russia""",0.0
2023,"""India""","""Russia""",1.0
2024,"""India""","""Russia""",2.0
2024,"""India""","""Russia""",2.0



## Horizontal concatenation
We can horizontally concatenate `DataFrames` that have:
- the same number of rows and
- different column names

For horizontal concatenation we create another `DataFrame` that has more details about each of the trades in 2023.

In [31]:
df1_details = pl.DataFrame(
    [{"item": "Clothes", "value": 10}, {"item": "Machinery", "value": 100}]
)
df1_details

item,value
str,i64
"""Clothes""",10
"""Machinery""",100


### `hstack`

We can combine two `DataFrames` horizontally with `hstack`.

In [32]:
(df1.hstack(df1_details))

year,exporter,importer,quantity,item,value
i64,str,str,i64,str,i64
2023,"""India""","""Russia""",0,"""Clothes""",10
2023,"""India""","""Russia""",1,"""Machinery""",100


This operation is *not* in-place unless we pass `in_place=True`.

We can also pass a `list` of `Series` to `hstack`.

In [33]:
(df1.hstack([df1_details["item"], df1_details["value"]]))

year,exporter,importer,quantity,item,value
i64,str,str,i64,str,i64
2023,"""India""","""Russia""",0,"""Clothes""",10
2023,"""India""","""Russia""",1,"""Machinery""",100



### Horizontal concatenation with `pl.concat`
We can also use `pl.concat` for horizontal concatenation.

In [34]:
(pl.concat([df1, df1_details], how="horizontal"))

year,exporter,importer,quantity,item,value
i64,str,str,i64,str,i64
2023,"""India""","""Russia""",0,"""Clothes""",10
2023,"""India""","""Russia""",1,"""Machinery""",100


If the `DataFrames` share some columns, and the values in those columns partly overlap, we can use an alternative concatenation strategy called `align`, where Polars identifies the common columns and aligns the rows on them.

In this modified example we have `item` as a column in both `DataFrames` but the second `DataFrame` only has a row for one `item`.

In [35]:
(
    pl.concat(
        [
            pl.DataFrame(
                [
                    {"year": 2023, "exporter": "India", "item": "Clothes"},
                    {"year": 2023, "exporter": "India", "item": "Machinery"},
                ]
            ),
            pl.DataFrame([{"item": "Machinery", "value": 100}]),
        ],
        how="align",
    )
)

year,exporter,item,value
i64,str,str,i64
2023,"""India""","""Clothes""",
2023,"""India""","""Machinery""",100


When we do an `align` concatenation Polars sees that we can horizontally concatenate the second `DataFrame` but that only the second row has a value to be concatenated. The first row gets a `null` in the `value` column.

## Diagonal concatenation

We are now looking at new trade records for 2023 and 2024 between India and the USA.

In 2023 the schema of the trade records is the same as we saw above with:
- `year`
- `exporter`
- `importer` and
- `quantity`

However, in 2024 the schema also includes:
- `item` and
- `value`

In [36]:
df1 = pl.DataFrame(
    [
        {"year": 2023, "exporter": "India", "importer": "USA", "quantity": 0},
        {"year": 2023, "exporter": "India", "importer": "USA", "quantity": 1},
    ]
)

df2 = pl.DataFrame(
    [
        {
            "year": 2024,
            "exporter": "India",
            "importer": "USA",
            "quantity": 2,
            "item": "Clothes",
            "value": 10,
        },
        {
            "year": 2024,
            "exporter": "India",
            "importer": "USA",
            "quantity": 3,
            "item": "Machinery",
            "value": 100,
        },
    ]
)

We want to combine these records into a single `DataFrame`. As the column names are not the same we cannot do a vertical concatenation.

Instead we can do a diagonal concatenation, which takes the union of the columns and fills the missing values with `null`.

In [37]:
(pl.concat([df1, df2], how="diagonal"))

year,exporter,importer,quantity,item,value
i64,str,str,i64,str,i64
2023,"""India""","""USA""",0,,
2023,"""India""","""USA""",1,,
2024,"""India""","""USA""",2,"""Clothes""",10
2024,"""India""","""USA""",3,"""Machinery""",100
