Commit e348fbc

Revert "Merge: EDAC: add initial support for El Capitan"
This reverts commit 7315692, reversing changes made to 66f48ff.

Before we merge this up to 9.5, let's wait for [1] to be merged to avoid
needing to rebase that MR.

[1] https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3680

Signed-off-by: Scott Weaver <scweaver@redhat.com>
1 parent bafb12f commit e348fbc

16 files changed, +640 -657 lines changed

Documentation/driver-api/edac.rst

Lines changed: 0 additions & 120 deletions
@@ -106,16 +106,6 @@ will occupy those chip-select rows.
   This term is avoided because it is unclear when needing to distinguish
   between chip-select rows and socket sets.
 
-* High Bandwidth Memory (HBM)
-
-  HBM is a new memory type with low power consumption and ultra-wide
-  communication lanes. It uses vertically stacked memory chips (DRAM dies)
-  interconnected by microscopic wires called "through-silicon vias," or
-  TSVs.
-
-  Several stacks of HBM chips connect to the CPU or GPU through an ultra-fast
-  interconnect called the "interposer". Therefore, HBM's characteristics
-  are nearly indistinguishable from on-chip integrated RAM.
 
 Memory Controllers
 ------------------
@@ -186,113 +176,3 @@ nodes::
 	the L1 and L2 directories would be "edac_device_block's"
 
 .. kernel-doc:: drivers/edac/edac_device.h
-
-
-Heterogeneous system support
-----------------------------
-
-An AMD heterogeneous system is built by connecting the data fabrics of
-both CPUs and GPUs via custom xGMI links. Thus, the data fabric on the
-GPU nodes can be accessed the same way as the data fabric on CPU nodes.
-
-The MI200 accelerators are data center GPUs. They have 2 data fabrics,
-and each GPU data fabric contains four Unified Memory Controllers (UMC).
-Each UMC contains eight channels. Each UMC channel controls one 128-bit
-HBM2e (2GB) channel (equivalent to 8 X 2GB ranks). This creates a total
-of 4096-bits of DRAM data bus.
-
-While the UMC is interfacing a 16GB (8high X 2GB DRAM) HBM stack, each UMC
-channel is interfacing 2GB of DRAM (represented as rank).
-
-Memory controllers on AMD GPU nodes can be represented in EDAC thusly:
-
-	GPU DF / GPU Node -> EDAC MC
-	GPU UMC -> EDAC CSROW
-	GPU UMC channel -> EDAC CHANNEL
-
-For example: a heterogeneous system with 1 AMD CPU is connected to
-4 MI200 (Aldebaran) GPUs using xGMI.
-
-Some more heterogeneous hardware details:
-
-- The CPU UMC (Unified Memory Controller) is mostly the same as the GPU UMC.
-  They have chip selects (csrows) and channels. However, the layouts are different
-  for performance, physical layout, or other reasons.
-- CPU UMCs use 1 channel, In this case UMC = EDAC channel. This follows the
-  marketing speak. CPU has X memory channels, etc.
-- CPU UMCs use up to 4 chip selects, So UMC chip select = EDAC CSROW.
-- GPU UMCs use 1 chip select, So UMC = EDAC CSROW.
-- GPU UMCs use 8 channels, So UMC channel = EDAC channel.
-
-The EDAC subsystem provides a mechanism to handle AMD heterogeneous
-systems by calling system specific ops for both CPUs and GPUs.
-
-AMD GPU nodes are enumerated in sequential order based on the PCI
-hierarchy, and the first GPU node is assumed to have a Node ID value
-following those of the CPU nodes after latter are fully populated::
-
-	$ ls /sys/devices/system/edac/mc/
-	mc0 - CPU MC node 0
-	mc1 |
-	mc2 |- GPU card[0] => node 0(mc1), node 1(mc2)
-	mc3 |
-	mc4 |- GPU card[1] => node 0(mc3), node 1(mc4)
-	mc5 |
-	mc6 |- GPU card[2] => node 0(mc5), node 1(mc6)
-	mc7 |
-	mc8 |- GPU card[3] => node 0(mc7), node 1(mc8)
-
-For example, a heterogeneous system with one AMD CPU is connected to
-four MI200 (Aldebaran) GPUs using xGMI. This topology can be represented
-via the following sysfs entries::
-
-	/sys/devices/system/edac/mc/..
-
-	CPU # CPU node
-	├── mc 0
-
-	GPU Nodes are enumerated sequentially after CPU nodes have been populated
-	GPU card 1 # Each MI200 GPU has 2 nodes/mcs
-	├── mc 1 # GPU node 0 == mc1, Each MC node has 4 UMCs/CSROWs
-	│   ├── csrow 0 # UMC 0
-	│   │   ├── channel 0 # Each UMC has 8 channels
-	│   │   ├── channel 1 # size of each channel is 2 GB, so each UMC has 16 GB
-	│   │   ├── channel 2
-	│   │   ├── channel 3
-	│   │   ├── channel 4
-	│   │   ├── channel 5
-	│   │   ├── channel 6
-	│   │   ├── channel 7
-	│   ├── csrow 1 # UMC 1
-	│   │   ├── channel 0
-	│   │   ├── ..
-	│   │   ├── channel 7
-	│   ├── .. ..
-	│   ├── csrow 3 # UMC 3
-	│   │   ├── channel 0
-	│   │   ├── ..
-	│   │   ├── channel 7
-	│   ├── rank 0
-	│   ├── .. ..
-	│   ├── rank 31 # total 32 ranks/dimms from 4 UMCs
-
-	├── mc 2 # GPU node 1 == mc2
-	│   ├── .. # each GPU has total 64 GB
-
-	GPU card 2
-	├── mc 3
-	│   ├── ..
-	├── mc 4
-	│   ├── ..
-
-	GPU card 3
-	├── mc 5
-	│   ├── ..
-	├── mc 6
-	│   ├── ..
-
-	GPU card 4
-	├── mc 7
-	│   ├── ..
-	├── mc 8
-	│   ├── ..
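
The enumeration and sizing rules in the documentation removed above can be restated as a bit of arithmetic. What follows is only an illustrative sketch in plain C, not kernel code. The helper names (`gpu_mc_index`, `gb_per_umc`, `gb_per_mc`, `bus_bits_per_mc`, `ranks_per_mc`) are hypothetical, and the constants are copied from the removed text: two nodes per MI200 card, four UMCs per node, and eight 2 GB, 128-bit channels per UMC. It assumes GPU nodes are numbered immediately after the fully populated CPU nodes.

```c
/* Constants taken from the removed documentation text. */
enum {
	GPU_NODES_PER_CARD = 2,   /* each MI200 card has two data fabrics */
	UMCS_PER_NODE      = 4,   /* one EDAC csrow per GPU UMC */
	CHANNELS_PER_UMC   = 8,   /* one EDAC channel per UMC channel */
	GB_PER_CHANNEL     = 2,   /* each channel fronts 2 GB of HBM2e */
	BITS_PER_CHANNEL   = 128  /* data bus width of one channel */
};

/*
 * Hypothetical helper: EDAC mc index of a GPU node, assuming GPU nodes
 * follow directly after nr_cpu_nodes CPU nodes. card and node are 0-based.
 */
static unsigned int gpu_mc_index(unsigned int nr_cpu_nodes,
				 unsigned int card, unsigned int node)
{
	return nr_cpu_nodes + card * GPU_NODES_PER_CARD + node;
}

/* Capacity behind one UMC (one csrow). */
static unsigned int gb_per_umc(void)
{
	return CHANNELS_PER_UMC * GB_PER_CHANNEL;
}

/* Capacity behind one GPU node (one EDAC mc). */
static unsigned int gb_per_mc(void)
{
	return UMCS_PER_NODE * gb_per_umc();
}

/* Total DRAM data bus width behind one GPU node. */
static unsigned int bus_bits_per_mc(void)
{
	return UMCS_PER_NODE * CHANNELS_PER_UMC * BITS_PER_CHANNEL;
}

/* Number of ranks listed under one GPU node. */
static unsigned int ranks_per_mc(void)
{
	return UMCS_PER_NODE * CHANNELS_PER_UMC;
}
```

With one CPU node, `gpu_mc_index(1, 3, 1)` evaluates to 8, which matches the last entry (mc8) in the `ls` listing, and `ranks_per_mc()` gives the 32 ranks shown under each GPU mc.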

arch/x86/include/asm/mce.h

Lines changed: 1 addition & 3 deletions
@@ -307,7 +307,6 @@ enum smca_bank_types {
 	SMCA_PIE, /* Power, Interrupts, etc. */
 	SMCA_UMC, /* Unified Memory Controller */
 	SMCA_UMC_V2,
-	SMCA_MA_LLC, /* Memory Attached Last Level Cache */
 	SMCA_PB, /* Parameter Block */
 	SMCA_PSP, /* Platform Security Processor */
 	SMCA_PSP_V2,
@@ -323,15 +322,14 @@ enum smca_bank_types {
 	SMCA_SHUB, /* System HUB Unit */
 	SMCA_SATA, /* SATA Unit */
 	SMCA_USB, /* USB Unit */
-	SMCA_USR_DP, /* Ultra Short Reach Data Plane Controller */
-	SMCA_USR_CP, /* Ultra Short Reach Control Plane Controller */
 	SMCA_GMI_PCS, /* GMI PCS Unit */
 	SMCA_XGMI_PHY, /* xGMI PHY Unit */
 	SMCA_WAFL_PHY, /* WAFL PHY Unit */
 	SMCA_GMI_PHY, /* GMI PHY Unit */
 	N_SMCA_BANK_TYPES
 };
 
+extern const char *smca_get_long_name(enum smca_bank_types t);
 extern bool amd_mce_is_memory_error(struct mce *m);
 
 extern int mce_threshold_create_device(unsigned int cpu);

arch/x86/kernel/cpu/mce/amd.c

Lines changed: 46 additions & 40 deletions
@@ -87,50 +87,61 @@ struct smca_bank {
 static DEFINE_PER_CPU_READ_MOSTLY(struct smca_bank[MAX_NR_BANKS], smca_banks);
 static DEFINE_PER_CPU_READ_MOSTLY(u8[N_SMCA_BANK_TYPES], smca_bank_counts);
 
-static const char * const smca_names[] = {
-	[SMCA_LS ... SMCA_LS_V2] = "load_store",
-	[SMCA_IF] = "insn_fetch",
-	[SMCA_L2_CACHE] = "l2_cache",
-	[SMCA_DE] = "decode_unit",
-	[SMCA_RESERVED] = "reserved",
-	[SMCA_EX] = "execution_unit",
-	[SMCA_FP] = "floating_point",
-	[SMCA_L3_CACHE] = "l3_cache",
-	[SMCA_CS ... SMCA_CS_V2] = "coherent_slave",
-	[SMCA_PIE] = "pie",
+struct smca_bank_name {
+	const char *name; /* Short name for sysfs */
+	const char *long_name; /* Long name for pretty-printing */
+};
+
+static struct smca_bank_name smca_names[] = {
+	[SMCA_LS ... SMCA_LS_V2] = { "load_store", "Load Store Unit" },
+	[SMCA_IF] = { "insn_fetch", "Instruction Fetch Unit" },
+	[SMCA_L2_CACHE] = { "l2_cache", "L2 Cache" },
+	[SMCA_DE] = { "decode_unit", "Decode Unit" },
+	[SMCA_RESERVED] = { "reserved", "Reserved" },
+	[SMCA_EX] = { "execution_unit", "Execution Unit" },
+	[SMCA_FP] = { "floating_point", "Floating Point Unit" },
+	[SMCA_L3_CACHE] = { "l3_cache", "L3 Cache" },
+	[SMCA_CS ... SMCA_CS_V2] = { "coherent_slave", "Coherent Slave" },
+	[SMCA_PIE] = { "pie", "Power, Interrupts, etc." },
 
 	/* UMC v2 is separate because both of them can exist in a single system. */
-	[SMCA_UMC] = "umc",
-	[SMCA_UMC_V2] = "umc_v2",
-	[SMCA_MA_LLC] = "ma_llc",
-	[SMCA_PB] = "param_block",
-	[SMCA_PSP ... SMCA_PSP_V2] = "psp",
-	[SMCA_SMU ... SMCA_SMU_V2] = "smu",
-	[SMCA_MP5] = "mp5",
-	[SMCA_MPDMA] = "mpdma",
-	[SMCA_NBIO] = "nbio",
-	[SMCA_PCIE ... SMCA_PCIE_V2] = "pcie",
-	[SMCA_XGMI_PCS] = "xgmi_pcs",
-	[SMCA_NBIF] = "nbif",
-	[SMCA_SHUB] = "shub",
-	[SMCA_SATA] = "sata",
-	[SMCA_USB] = "usb",
-	[SMCA_USR_DP] = "usr_dp",
-	[SMCA_USR_CP] = "usr_cp",
-	[SMCA_GMI_PCS] = "gmi_pcs",
-	[SMCA_XGMI_PHY] = "xgmi_phy",
-	[SMCA_WAFL_PHY] = "wafl_phy",
-	[SMCA_GMI_PHY] = "gmi_phy",
+	[SMCA_UMC] = { "umc", "Unified Memory Controller" },
+	[SMCA_UMC_V2] = { "umc_v2", "Unified Memory Controller v2" },
+	[SMCA_PB] = { "param_block", "Parameter Block" },
+	[SMCA_PSP ... SMCA_PSP_V2] = { "psp", "Platform Security Processor" },
+	[SMCA_SMU ... SMCA_SMU_V2] = { "smu", "System Management Unit" },
+	[SMCA_MP5] = { "mp5", "Microprocessor 5 Unit" },
+	[SMCA_MPDMA] = { "mpdma", "MPDMA Unit" },
+	[SMCA_NBIO] = { "nbio", "Northbridge IO Unit" },
+	[SMCA_PCIE ... SMCA_PCIE_V2] = { "pcie", "PCI Express Unit" },
+	[SMCA_XGMI_PCS] = { "xgmi_pcs", "Ext Global Memory Interconnect PCS Unit" },
+	[SMCA_NBIF] = { "nbif", "NBIF Unit" },
+	[SMCA_SHUB] = { "shub", "System Hub Unit" },
+	[SMCA_SATA] = { "sata", "SATA Unit" },
+	[SMCA_USB] = { "usb", "USB Unit" },
+	[SMCA_GMI_PCS] = { "gmi_pcs", "Global Memory Interconnect PCS Unit" },
+	[SMCA_XGMI_PHY] = { "xgmi_phy", "Ext Global Memory Interconnect PHY Unit" },
+	[SMCA_WAFL_PHY] = { "wafl_phy", "WAFL PHY Unit" },
+	[SMCA_GMI_PHY] = { "gmi_phy", "Global Memory Interconnect PHY Unit" },
 };
 
 static const char *smca_get_name(enum smca_bank_types t)
 {
 	if (t >= N_SMCA_BANK_TYPES)
 		return NULL;
 
-	return smca_names[t];
+	return smca_names[t].name;
 }
 
+const char *smca_get_long_name(enum smca_bank_types t)
+{
+	if (t >= N_SMCA_BANK_TYPES)
+		return NULL;
+
+	return smca_names[t].long_name;
+}
+EXPORT_SYMBOL_GPL(smca_get_long_name);
+
 enum smca_bank_types smca_get_bank_type(unsigned int cpu, unsigned int bank)
 {
 	struct smca_bank *b;
@@ -167,7 +178,6 @@ static const struct smca_hwid smca_hwid_mcatypes[] = {
 	{ SMCA_CS, HWID_MCATYPE(0x2E, 0x0) },
 	{ SMCA_PIE, HWID_MCATYPE(0x2E, 0x1) },
 	{ SMCA_CS_V2, HWID_MCATYPE(0x2E, 0x2) },
-	{ SMCA_MA_LLC, HWID_MCATYPE(0x2E, 0x4) },
 
 	/* Unified Memory Controller MCA type */
 	{ SMCA_UMC, HWID_MCATYPE(0x96, 0x0) },
@@ -202,8 +212,6 @@ static const struct smca_hwid smca_hwid_mcatypes[] = {
 	{ SMCA_SHUB, HWID_MCATYPE(0x80, 0x0) },
 	{ SMCA_SATA, HWID_MCATYPE(0xA8, 0x0) },
 	{ SMCA_USB, HWID_MCATYPE(0xAA, 0x0) },
-	{ SMCA_USR_DP, HWID_MCATYPE(0x170, 0x0) },
-	{ SMCA_USR_CP, HWID_MCATYPE(0x180, 0x0) },
 	{ SMCA_GMI_PCS, HWID_MCATYPE(0x241, 0x0) },
 	{ SMCA_XGMI_PHY, HWID_MCATYPE(0x259, 0x0) },
 	{ SMCA_WAFL_PHY, HWID_MCATYPE(0x267, 0x0) },
@@ -707,13 +715,11 @@ void mce_amd_feature_init(struct cpuinfo_x86 *c)
 
 bool amd_mce_is_memory_error(struct mce *m)
 {
-	enum smca_bank_types bank_type;
 	/* ErrCodeExt[20:16] */
 	u8 xec = (m->status >> 16) & 0x1f;
 
-	bank_type = smca_get_bank_type(m->extcpu, m->bank);
 	if (mce_flags.smca)
-		return (bank_type == SMCA_UMC || bank_type == SMCA_UMC_V2) && xec == 0x0;
+		return smca_get_bank_type(m->extcpu, m->bank) == SMCA_UMC && xec == 0x0;
 
 	return m->bank == 4 && xec == 0x8;
 }
@@ -1043,7 +1049,7 @@ static const char *get_name(unsigned int cpu, unsigned int bank, struct threshol
 	if (bank_type >= N_SMCA_BANK_TYPES)
 		return NULL;
 
-	if (b && (bank_type == SMCA_UMC || bank_type == SMCA_UMC_V2)) {
+	if (b && bank_type == SMCA_UMC) {
 		if (b->block < ARRAY_SIZE(smca_umc_block_names))
 			return smca_umc_block_names[b->block];
 		return NULL;
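
Two effects of the amd.c hunks above may be easier to see in isolation: the name table goes back to carrying a short and a long name per bank type, and on the SMCA path the memory-error check goes back to matching only `SMCA_UMC`. The following is a stand-alone user-space sketch, not the kernel code. The trimmed enum, `bank_names`, `bank_long_name`, `xec_of` and the two `is_memory_error_*` helpers are hypothetical stand-ins, and the bank type is passed in directly instead of being looked up per CPU and bank.

```c
#include <stdbool.h>
#include <stddef.h>

/* Trimmed, hypothetical stand-in for enum smca_bank_types. */
enum bank_type { BANK_LS, BANK_UMC, BANK_UMC_V2, BANK_PB, N_BANK_TYPES };

struct bank_name {
	const char *name;      /* short name, as used for sysfs */
	const char *long_name; /* long name, for pretty-printing */
};

/* Designated initializers keep each entry tied to its enum value. */
static const struct bank_name bank_names[N_BANK_TYPES] = {
	[BANK_LS]     = { "load_store",  "Load Store Unit" },
	[BANK_UMC]    = { "umc",         "Unified Memory Controller" },
	[BANK_UMC_V2] = { "umc_v2",      "Unified Memory Controller v2" },
	[BANK_PB]     = { "param_block", "Parameter Block" },
};

/* Bounds-checked lookup, mirroring the shape of smca_get_long_name(). */
static const char *bank_long_name(unsigned int t)
{
	if (t >= N_BANK_TYPES)
		return NULL;

	return bank_names[t].long_name;
}

/* ErrCodeExt is bits [20:16] of the status value. */
static unsigned int xec_of(unsigned long long status)
{
	return (status >> 16) & 0x1f;
}

/* SMCA check as it reads after this revert: only BANK_UMC matches. */
static bool is_memory_error_reverted(enum bank_type t, unsigned long long status)
{
	return t == BANK_UMC && xec_of(status) == 0x0;
}

/* SMCA check as it read before this revert: BANK_UMC_V2 matches too. */
static bool is_memory_error_pre_revert(enum bank_type t, unsigned long long status)
{
	return (t == BANK_UMC || t == BANK_UMC_V2) && xec_of(status) == 0x0;
}
```

In this sketch, a status with ErrCodeExt equal to 0 on a `BANK_UMC_V2` bank is a memory error for `is_memory_error_pre_revert()` but not for `is_memory_error_reverted()`, which is the behavioral difference the last two hunks restore.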