Skip to content

Commit

Permalink
docs: add FAQ and best practice docs (#1060)
Browse files Browse the repository at this point in the history
Co-authored-by: Yi Duan <duanyi.aster@bytedance.com>
  • Loading branch information
cqqqq777 and AsterDY committed May 19, 2024
1 parent 0cd536c commit c7da1fe
Show file tree
Hide file tree
Showing 28 changed files with 1,857 additions and 1 deletion.
9 changes: 9 additions & 0 deletions content/en/docs/kitex/Best Practice/self_check/_index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
---
title: "Self Troubleshooting Guide"
linkTitle: "Self Troubleshooting Guide"
weight: 6
date: 2024-02-18
description: "Provides instructions on troubleshooting common exceptional scenarios"

---

Original file line number Diff line number Diff line change
@@ -0,0 +1,269 @@
---
title: "FastWrite/FastRead Panic Error"
linkTitle: "FastWrite/FastRead Panic Error"
weight: 3
date: 2024-02-18
description: "Common errors and solutions related to FastCodec (including FastThrift, FastPB)"

---

## FastWrite Panic

The implementation of FastWrite consists of three steps: (1) scanning the object to calculate the required memory, (2) allocating memory, and (3) completing serialization. This approach eliminates the need for slice appending, reducing memory allocation and copying, resulting in significantly improved performance.

However, if there is a concurrency bug in the user code between step 1 (calculating the required memory) and step 3 (serialization), where another goroutine concurrently writes to the object being serialized (request parameters for the client or response values for the server), it can lead to Kitex reading an inconsistent object, resulting in errors or even panics.

**Note**:

1. Although the error occurs in the generated `FastWriteField` method of Kitex, the root cause of the error lies in the user code, which is a typical concurrency issue.
2. **These issues are concurrency bugs in the user code, not bugs in Kitex**. FastWrite simply exposes these issues more easily.
3. A typical scenario is when a field referenced in the request or response object is reused, as explained in the "Case Analysis" section below.
4. Importantly, the absence of errors or panics does not guarantee the absence of concurrency issues. In reality, **inconsistent data may have been constructed and passed to upstream or downstream components**.

### Typical Symptoms

The following error messages may occur:

- `runtime error: index out of range [3] with length 3`
- `runtime error: slice bounds out of range [86:59]`
- `runtime error: invalid memory address or nil pointer dereference`

### Case Analysis

#### Client Panic

A typical scenario involves a field referenced within the request being a global variable (or a cached object) that can be concurrently written to.

The simplified error message is as follows:

```go
KITEX: panic, ..., error=runtime error: invalid memory address or nil pointer dereference
panic(...):
github.com/cloudwego/kitex/pkg/protocol/bthrift.binaryProcotol.WriteBinaryNocopy(...)
git/to/project/kitex_gen/some_package.(*SomeType).fastWriteField2(...)
```

The panic stack includes `fastWriteField2`, which is a typical case of concurrent read and write in business logic. The simplified business code is as follows:

```go
key := "default" // reflect.StringHeader{Data=0xXXX, Len=7}
fallbackKey = "" // reflect.StringHeader{Data=nil, Len=0}
wg := sync.WaitGroup{}
for _, task := range taskList {
wg.Add(1)
go func() {
defer wg.Done()
if someCondition {
key = fallbackKey // `key` may be read by another goroutine
}
kitexClient.GetByKey(ctx, &Request{Key: key})
}()
}
wg.Wait()
```

**Analysis:**

1. The `key` variable of type `string` **may be read/written by different goroutines within the loop**.
2. Reading/writing a `string` is not thread-safe: In the Go Runtime implementation, a `reflect.StringHeader{Data uintptr, Len int}` is used with two fields that need to be assigned separately.
3. Kitex's FastWrite may read a "partially written string" (Data = nil, Len = 7) and pass it to the `fastWriteField2` method.
4. When `fastWriteField2` attempts to read 7 characters from `Data = nil`, it triggers a `nil pointer dereference` panic.

**Note**:

1. Similar issues can occur with slices, which correspond to `reflect.SliceHeader` and have `Data`, `Len`, and `Cap` fields.
2. Concurrent read and write of maps can result in a fatal error in Go.
3. The absence of errors or panics does not guarantee the absence of concurrency issues. In reality, **inconsistent data may have been constructed and passed to upstream or downstream components**: For example, if `fallbackKey = "123456789"` in the code above, the serialized data may be read as "1234567". If `fallbackKey = "123"`, it could result in out-of-bounds data access or even panics.

#### Server Panic

A typical scenario is when a field referenced in the response is a global variable (or a cached object), and that object may be concurrently written to.

The error message for this case is as follows (slightly simplified):

```go
KITEX: panic happened, ..., error=<Error: runtime error: index out of range [3] with length 1>
panic(...)
encoding/binary.bigEndian.PutUint32(...)
github.com/cloudwego/kitex/pkg/protocol/bthrift.binaryProtocol.WriteI32(...)
git/to/project/kitex_gen/some_package.(*SomeType).fastWriteField3(...)
```

From the panic stack, it can be seen that the `fastWriteField3` method generated by Kitex triggers the panic. It is a typical case of concurrent read and write in business logic. The simplified business code can be as follows:

```go
var localCache sync.Map{}

func (*Handler) GetByKey(ctx context.Context, req *xxx.Request) (*xxx.Response, error) {
resp := localCache.Get(req.Key)
resp.UserID = req.UserID
return resp, nil
}
```

**Analysis:**

1. `PutUint32(b []byte, v uint32)` checks if `b[3]` is readable before writing `v` to `b`. The panic occurs here, indicating an out-of-range access.
2. `FastWrite` has already traversed `resp` to determine the required length, indicating that between "(1) calculating required memory" and "(3) serialization," a field in `resp` has been modified, requiring more memory space.
3. Let's assume the server receives two requests simultaneously (both with `Key` as "X"):
1. `A.Request: {Key="X", UserID = "123" }`
2. `B.Request: {Key="X", UserID = "123456"} `
4. The process may execute in the following order:
1. `FastWrite` for request A calculates that it needs N+3 bytes (N represents the space required by other fields in `resp`) and allocates the memory.
2. Request B executes line 5 and replaces `UserID` with "123456" (note that the response is cached, which means A will also return this object).
3. Request A starts serializing `resp` into the allocated space. When it tries to write the last field, `UserID`, it finds that there is not enough space available (it now needs N+6 bytes): **`index out of range [3] with length 1`**

**Note:**

1. The absence of errors or panics does not imply the absence of concurrency issues. In reality, **abnormal data may have been constructed and passed to upstream and downstream components:** For example, if the execution order of A and B mentioned above is reversed, no error will occur, but the expected value to be written in the response message is "123456," while the actual value written is "123," **and the Kitex client can decode it correctly**.

### Troubleshooting Suggestions

#### Self-inspect the code based on the aforementioned case

If there is concurrent read and write on the `Request` or `Response` (including objects directly or indirectly referenced by them), fix this issue first.

#### Use the Go Race Detector

If possible, use the `-race` flag to identify and eliminate concurrency issues in the code. Refer to https://go.dev/blog/race-detector for more details.

Note: Be cautious when using it in a production environment as it can significantly impact performance.

#### Identify the problematic field based on the error message

Based on the panic stack, you can identify the specific field by looking at the field type and `fastWriteFieldN` method.

For example, if the innermost method in the panic stack is: `kitex_gen/some_package.(***Base**).**fastWriteField3**(...) `, it means that the error is caused by the field with index 3 in the `Base` type. In the following IDL, the field at index 3 is the `Addr` field of type `string`:

```go
struct Base {
1: string LogID = "",
2: string Caller = "",
3: string Addr = "",
}
```

If the panic error message contains `invalid memory address or nil pointer dereference`, it indicates that the `Addr` field may have been concurrently written with a null value by another goroutine.

If the panic error message is different (e.g., `index out of range [3] with length 3`), **it may indicate** that another field is being concurrently read and written (variable length) and is occupying more memory, causing insufficient memory allocation when writing the field. You can refer to the "Comparing Two Sampling Results" section below to locate the issue.

**Client Side**

The proposed solution for the client side is as follows:

1. First Sampling: Before sending the request, serialize the entire request body into JSON.

- Note: This may impact performance. Consider sampling the requests proportionally to reduce the impact on the production environment, but it may take longer to capture relevant cases.

2. Second Sampling (two approaches for reference):

a. Panic Sampling: Check if a panic has occurred and then serialize the request body into JSON.

- Check if the returned error from the client contains panic information.
- Check if a panic has occurred in the middleware. Refer to the "Kitex - Panic 处理" section in the [Kitex - Panic Handling](https://www.cloudwego.io/docs/kitex/tutorials/basic-feature/panic/) documentation.

b. Delayed Sampling: Create a goroutine that sleeps for a certain amount of time and then serialize the request body into JSON.

- The longer the sleep time, the higher the probability of finding differences.

3. Compare the results of the two samplings to identify different fields and investigate code sections where concurrent read/write operations may be happening.

Note:

1. If a certain performance loss (additional JSON marshaling, comparison) is acceptable, consider comparing all requests. Some requests may not cause panics but could still produce inconsistent data.
2. Injecting troubleshooting through middleware is simpler and does not require modifying the business code.

**Server Side**

On the server side, as the server has exited all middleware during encoding, it is not possible to capture panics or perform sampling comparisons using middleware.

The proposed solution for the server side is as follows:

1. Custom PayloadCodec:
- Perform sampling and panic comparison during the Marshal operation.
- Note: This may impact performance. Consider sampling the requests proportionally to reduce the impact on the production environment, but it may take longer to capture relevant cases.

```go
type codecForPanic struct {
payloadCodec remote.PayloadCodec
}

func (c *codecForPanic) Marshal(ctx context.Context, message remote.Message, out remote.ByteBuffer) error {
var before, after []byte
var err error
defer func() {
if err := recover(); err != nil {
klog.Errorf("panic: %v", err)
after, _ = json.Marshal(message.Data())
if bytes.Compare(before, after) != 0 {
klog.Errorf("before = %s, after = %s", before, after)
}
}
}()
before, err = json.Marshal(message.Data()) // Note the performance loss
if err != nil {
klog.Errorf("json encode before Marshal failed: err = %v", err)
}
return c.payloadCodec.Marshal(ctx, message, out)
}

func (c *codecForPanic) Unmarshal(ctx context.Context, message remote.Message, in remote.ByteBuffer) error {
// Recover and compare here
return c.payloadCodec.Unmarshal(ctx, message, in)
}

func (c *codecForPanic) Name() string {
return "codecForPanic"
}
```

1. Specify the custom PayloadCodec during server initialization:

```go
svr := test.NewServer(new(TestServiceImpl),
server.WithPayloadCodec(&codecForPanic{
payloadCodec: thrift.NewThriftCodecWithConfig(thrift.FastRead | thrift.FastWrite),
}),
// Other options
)
```

1. Compare the results of the two samplings to identify different fields and investigate code sections where concurrent read/write operations may be happening.

If panics occur infrequently, you can also consider starting a new goroutine, sleeping for a period of time, and then performing the comparison. This approach makes it easier to identify the modified sections.

## FastRead

### Typical Symptoms

`runtime error: slice bounds out of range [1600217702:1678338]`

### Cause

The received message (Client: response, Server: request) is encoded incorrectly.

### Troubleshooting Suggestions

Refer to the troubleshooting suggestions for FastWrite and investigate the corresponding endpoint (if the error occurs on the server side, check the client; vice versa).

## FAQ

### How to Disable FastWrite/Read

> **It is not recommended to disable it**: Race conditions in the business logic can lead to inconsistent business data and unpredictable consequences.
If there is a genuine need, you can choose one of the following options:

1. Disable the `FastCodec` in Thrift using options:

```go
// Server side
svr := xxxService.NewServer(handler, server.WithPayloadCodec(
thrift.NewThriftCodecDisableFastMode(true, false)))

// Client side
cli, err := xxxService.NewClient("", client.WithPayloadCodec(
thrift.NewThriftCodecDisableFastMode(true, false)))
```

1. Do not generate `FastCodec` encoding/decoding code: Use the `-no-fast-api` parameter of the Kitex command-line tool to regenerate the code.

0 comments on commit c7da1fe

Please sign in to comment.