Trixie Status

jhickeyNRC edited this page Jul 5, 2024 · 70 revisions

Current Trixie Operational Status

Upcoming Planned Downtime

None

Current Issues / Outages

None

Past Events / Incidents

  • [RESOLVED] - Friday June 28, 2024 - The Research Platform Support team is currently proceeding with operating system and storage appliance upgrades on the cluster. In order to do the storage appliance upgrade, the cluster will be offline from the 28th of June at 6AM EDT to the afternoon of July the 2nd for a final data synchronization from the old appliance to the new appliance. Please see the email sent out to users Friday, June 21 for important details concerning this change in the Trixie infrastructure. Thank you for your patience. Research Platform Support - Update - Due to ongoing RES VPN issues affecting the upgrade of the Trixie cluster, the return to service has been delayed to end-of-business July 3rd, 2024. - Update - Due to further RES VPN issues today affecting the upgrade of the Trixie cluster, the return to service has been delayed to July 4th, 2024. A notice will be sent when Trixie is back online.

  • [RESOLVED] - Thursday, June 6, 2024 - Trixie Bastion Host Shutdown Notice - For users connecting from the Internet or the Legacy network. The bastion host upgrades will require downtime that is scheduled to start at 7:00AM EDT on June 6th and will conclude at 5PM. Although the bastion hosts will be offline during this time, jobs submitted prior to this maintenance window by users connecting through the bastion host will continue to run normally. If you have any questions or concerns about this upgrade, please let us know. Thank you for your patience. Research Platform Support

  • [RESOLVED] - Tuesday, June 5, 2024 - There is currently a firewall issue resolving some addresses/URLs from the Digital Research Alliance of Canada (formerly Compute Canada) CVMFS mirrors which is affecting the loading of some modules. Please report any outstanding issues on either the issues page or the RPS mailbox.

  • [RESOLVED] - Friday, May 3, 2024 - This downtime is scheduled to start at 2:30PM EDT on Friday May 3rd and will conclude on the evening of Sunday the 5th. A notice will be sent out when the downtime is completed, and the cluster is back online. If you have any questions or concerns about this shutdown, please let us know. Thank you for your patience. Research Platform Support

  • [RESOLVED] - Friday, April 12, 2024 - This downtime is scheduled to start at 2:30PM EDT on Friday April 12th and will conclude on the evening of Sunday the 14th. A notice will be sent out when the downtime is completed, and the cluster is back online. If you have any questions or concerns about this shutdown, please let us know. Thank you for your patience. Research Platform Support

  • [RESOLVED] - Friday, February 2, 2024 - This downtime is scheduled to start at 2:00PM EST on February 2nd and will conclude on the evening of the 3rd. A notice will be sent out when the downtime is completed, and the cluster is back online. If you have any questions or concerns about this shutdown, please let us know. Thank you for your patience. Research Platform Support

  • [RESOLVED] - Friday, January 19, 2024 - This downtime is scheduled to start at 2:30 PM EST on Friday January 19th and will conclude on the evening of the 20th. A notice will be sent out when the downtime is completed, and the cluster is back online. If you have any questions or concerns about this shutdown, please let us know. Thank you for your patience. Research Platform Support

  • [CANCELLED] - Thursday, December 14th, 2023 - This downtime is scheduled to start at 2:30 PM EST on Thursday the 14th of December and will conclude on the morning of the 15th. A notice will be sent out when the downtime is completed, and the cluster is back online. If you have any questions or concerns about this shutdown, please let us know. Thank you for your patience. Research Platform Support

  • [RESOLVED] - Monday, August 28th, 2023 - KITS will be performing maintenance on the Trixie HPC cluster on Tuesday, August 29th at 6AM EDT. The cluster should be brought back online around 12PM. Jobs whose run time would overlap the start of the maintenance period will stay in the queue and run after the maintenance. A notice will be sent out when the downtime is completed and the cluster is back online. If you have any questions or concerns please let us know. Research Platform Support: rps-spr@nrc-cnrc.gc.ca

  • [RESOLVED] - Wednesday July 19, 2023 - Thursday July 20, 2023 - Please note that RPPM will be shutting down regular power to building M-55 for electrical emergency repairs on July 20th from 6 to 7AM EDT. KITS will therefore be shutting down the Trixie HPC from July 19th at 5 PM to soon after 7AM on the 20th. A notice will be sent out when the downtime is completed and the cluster is back online. If you have any questions or concerns please let us know. IT Operations: ITOperations-OperationsTI@nrc-cnrc.gc.ca

  • [RESOLVED] - Monday June 19, 2023 - Wednesday June 21, 2023 - Please note that in support of the new generator installation taking place at building M-55, RPPM will be shutting down power to building M-55 on two consecutive evenings. Trixie HPC will be unavailable during the scheduled period of Monday June 19th 1:00 pm EDT to the morning of Wednesday June 21st. A notice will be sent out when the downtime is completed, and the cluster is back online. If you have any questions or concerns about this maintenance or the maintenance schedule please let us know. Thank you for your patience. IT Operations: ITOperations-OperationsTI@nrc-cnrc.gc.ca

  • [RESOLVED] - Friday June 9, 2023 - Monday June 12, 2023 - A period of downtime is required for the Trixie HPC due to work on the building's electrical systems. We will also use this time to perform some routine maintenance and upgrades. This downtime is scheduled to start at 2:00 PM EDT on Friday June 9th and will conclude on the afternoon of Monday June 12th. A notice will be sent out when the downtime is completed, and the cluster is back online. If you have any questions or concerns about this maintenance or the maintenance schedule please let us know. Thank you for your patience. IT Operations: ITOperations-OperationsTI@nrc-cnrc.gc.ca

  • [RESOLVED] - Monday March 27, 2023 - A period of downtime is required for the Trixie HPC cluster to perform some routine maintenance and upgrades. This downtime is scheduled for the day of Monday March 27th. A notice will be sent out when the downtime is completed, and the cluster is back online. If you have any questions or concerns about this maintenance or the maintenance schedule please let us know. Please note that Slurm should stop accepting jobs that would run into the maintenance period. They will be held in the queue until the maintenance period has ended. Thank you for your patience. IT Operations: ITOperations-OperationsTI@nrc-cnrc.gc.ca

  • [RESOLVED] - Monday November 28, 2022 - A period of downtime is required for the Trixie HPC cluster to perform some routine maintenance and upgrades. This downtime is scheduled for the day of Monday November 28th. A notice will be sent out when the downtime is completed, and the cluster is back online. If you have any questions or concerns about this maintenance or the maintenance schedule please let us know. Thank you for your patience. IT Operations: ITOperations-OperationsTI@nrc-cnrc.gc.ca

  • [RESOLVED] - Monday May 30, 2022 - Please note that the Trixie server is currently offline - possibly due to a network issue.

  • [RESOLVED] - Monday April 11 - Wednesday April 13, 2022 (3 days) - Please note that Trixie (AI4D HPC cluster) will be unavailable from April 11-13th (3 days) due to a scheduled upgrade. The GPFS file system will be upgraded during this time. If you have any questions or concerns about this maintenance please send an email to ITOperations-OperationsTI@nrc-cnrc.gc.ca. - Update - Thursday April 14, 2022: Due to complications with the upgrade of the storage array, the scheduled downtime for the AI4D-Trixie cluster has been extended. There is currently no estimate for when the cluster will return to service, but an update will be sent out as soon as there is more information. - Update - Tuesday April 19, 2022: We are returning the AI4D-Trixie cluster to operational status. Unfortunately the storage array is in a degraded state and is only operating with 25% of its normal transfer capacity. Expect slowness from I/O-intensive operations. All of the Compute Nodes as well as the Head Node have been re-imaged. The default operating environment has changed, so expect many of the software versions loaded when you log in to have changed. This may cause issues with job scripts created for the previous environment. If you experience any issues please let us know (ITOperations-OperationsTI@nrc-cnrc.gc.ca).

  • [RESOLVED] - Tuesday January 25, 2022 - There appear to be several issues with the Trixie HPC that are impacting access through the Bastion Host and general performance on the head node. KITS-ITOps is investigating and hopes to resolve the issues as soon as possible. We will provide an update when we have more information. - Update - There is a technical issue with the SSC managed switch due to a recent power outage. Access from Legacy and RES should still function but external access through the Bastion Host is not working. An SSC technician is supposed to be on site tomorrow morning to investigate.

  • [RESOLVED] - Wednesday December 15, 2021 - External access to Trixie is not available. It appears that there is an error with the SSL cert for the external LoginTC URL. When trying to use the LoginTC app on your phone to accept a login request a certificate error appears and the request is never received. Hopefully the issue will be resolved quickly, but access could be offline for a day or two.

  • [RESOLVED] - Wednesday December 15, 2021 - Trixie is currently unavailable and the issue is being investigated.

  • [RESOLVED] - Thursday Dec 2, 2021 - SSH connections to Trixie via the external bastion host are being blocked. Internal NRC network connectivity and Trixie operations continue normally. Investigation of the root cause is underway.

  • [RESOLVED - downtime completed successfully] - Monday August 23, 2021 - There will be a maintenance period for the Trixie AI4D Cluster on Monday August 23rd starting at 8:00 am EDT. Access to the cluster will not be possible during the maintenance. The entire day will be reserved for the maintenance but current estimates suggest it will be returned to service by noon. Maintenance will involve the replacement of a power distribution unit in one of the racks as well as configuration changes on the primary head node. Every effort will be made to preserve the job queue during the maintenance.

  • [RESOLVED - downtime completed successfully] - We are planning a period of scheduled downtime for the Trixie-AI4D cluster on Monday June 28th from 8:00am to 4:00pm EDT. This will allow a few maintenance tasks to be performed that would interrupt service. These tasks include modifying the partition structure on the primary head node as well as some security patching. - Update - It has become necessary to add a firmware update for the Mellanox switches to this maintenance window. This will cause the GPFS file system to become unavailable, forcing us to shut down the cluster entirely. All jobs in the queue at the start of the maintenance period will likely be lost. - Update - Due to unforeseen complications the maintenance period must be extended.

  • [RESOLVED - downtime completed successfully] - A period of downtime for the Trixie (AI4D) cluster is being scheduled for Monday, May 17th from 8:00 am - 6:00 pm EDT. Due to a hardware issue on the storage array there will need to be a scheduled maintenance period as per the vendor's recommendation. Please note that the nature of the maintenance will require all jobs in the queue to be terminated at the start of the maintenance window.

  • [RESOLVED - nodes back in main queue] - Compute nodes cn110 and cn125 have been taken out of the main queue to troubleshoot GPU issues.

  • [RESOLVED - downtime completed successfully] - A period of downtime for the Trixie (AI4D) cluster is being scheduled for Monday, April 19th from 9:00am-3:00pm. Due to the nature of the maintenance all jobs in the queue will be terminated at the start of the maintenance window.

  • [RESOLVED] - December 17 - We are currently experiencing issues with Trixie head node performance as detailed in #35. Investigation pending.

Notes