From 3b626755588a461c8f195a48c4777fd533bf54ec Mon Sep 17 00:00:00 2001 From: Stephan Ewen Date: Fri, 20 Jan 2017 18:50:21 +0100 Subject: [PATCH] [FLINK-5459] [docs] Add troubleshooting guide for classloading issues --- docs/monitoring/debugging_classloading.md | 101 ++++++++++++++++++++-- 1 file changed, 92 insertions(+), 9 deletions(-) diff --git a/docs/monitoring/debugging_classloading.md b/docs/monitoring/debugging_classloading.md index e4e908ea8c74c..40822c03cdccb 100644 --- a/docs/monitoring/debugging_classloading.md +++ b/docs/monitoring/debugging_classloading.md @@ -27,19 +27,102 @@ under the License. ## Overview of Classloading in Flink - - What is in the Application Classloader for different deployment techs - - What is in the user code classloader +When running Flink applications, the JVM will load various classes over time. +These classes can be devided into two domains: - - Access to the user code classloader for applications + - The **Flink Framework** domain: This includes all code in the `/lib` directory in the Flink directory. + By default these are the classes of Apache Flink and its core dependencies. -## Classpath Setups + - The **User Code** domain: These are all classes that are included in the JAR file submitted via the CLI or web interface. + That includes the job's classes, and all libraries and connectors that are put into the uber JAR. + + +The class loading behaves slightly different for various Flink setups: + +**Standalone** + +When starting a the Flink cluster, the JobManagers and TaskManagers are started with the Flink framework classes in the +classpath. The classes from all jobs that are submitted against the cluster are loaded *dynamically*. + +**YARN** + +YARN classloading differs between single job deploymens and sessions: + + - When submitting a Flink job directly to YARN (via `bin/flink run -m yarn-cluster ...`), dedicated TaskManagers and + JobManagers are started for that job. Those JVMs have both Flink framework classes and user code classes in their classpath. + That means that there is *no dynamic classloading* involved in that case. + + - When starting a YARN session, the JobManagers and TaskManagers are started with the Flink framework classes in the + classpath. The classes from all jobs that are submitted against the session are loaded dynamically. + +**Mesos** + +Mesos setups following [this documentation](../setup/mesos.html) currently behave very much like the a +YARN session: The TaskManager and JobManager processes are started with the Flink framework classes in classpath, job +classes are loaded dynamically when the jobs are submitted. + + +## Avoiding Dynamic Classloading + +All components (JobManger, TaskManager, Client, ApplicationMaster, ...) log their classpath setting on startup. +They can be found as part of the environment information at the beginnign of the log. + +When running a setup where the Flink JobManager and TaskManagers are exclusive to one particular job, one can put JAR files +directly into the `/lib` folder to make sure they are part of the classpath and not loaded dynamically. + +It usually works to put the job's JAR file into the `/lib` directory. The JAR will be part of both the classpath +(the *AppClassLoader*) and the dynamic class loader (*FlinkUserCodeClassLoader*). +Because the AppClassLoader is the parent of the FlinkUserCodeClassLoader (and Java loads parent-first), this should +result in classes being loaded only once. + +For setups where the job's JAR file cannot be put to the `/lib` folder (for example because the setup is a session that is +used by multiple jobs), it may still be posible to put common libraries to the `/lib` folder, and avoid dynamic class loading +for those. + + +## Manual Classloading in the Job + +In some cases, a transformation function, source, or sink needs to manually load classes (dynamically via reflection). +To do that, it needs the classloader that has access to the job's classes. + +In that case, the functions (or sources or sinks) can be made a `RichFunction` (for example `RichMapFunction` or `RichWindowFunction`) +and access the user code class loader via `getRuntimeContext().getUserCodeClassLoader()`. + + +## X cannot be cast to X exceptions + +When you see an exception in the style `com.foo.X cannot be cast to com.foo.X`, it means that multiple versions of the class +`com.foo.X` have been loaded by different class loaders, and types of that class are attempted to be assigned to each other. + +The reason is in most cases that an object of the `com.foo.X` class loaded from a previous execution attempt is still cached somewhere, +and picked up by a restarted task/operator that reloaded the code. Note that this is again only possible in deployments that use +dynamic class loading. + +Common causes of cached object instances: + + - When using *Apache Avro*: The *SpecificDatumReader* caches instances of records. Avoid using `SpecificData.INSTANCE`. See also + [this discussion](http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/How-to-get-help-on-ClassCastException-when-re-submitting-a-job-tp10972p11133.html) + + - Using certain serialization frameworks for cloning objects (such as *Apache Avro*) + + - Interning objects (for example via Guava's Interners) - - Finding classpaths in logs - - Moving libraries and/or user code to the Application Classpath ## Unloading of Dynamically Loaded Classes - - Checkpoint statistics overview - - Interpret time until checkpoints - - Synchronous vs. asynchronous checkpoint time +All scenarios that involve dynamic class loading (i.e., standalone, sessions, mesos, ...) rely on classes being *unloaded* again. +Class unloading means that the Garbage Collector finds that no objects from a class exist and more, and thus removes the class +(the code, static variable, metadata, etc). + +Whenever a TaskManager starts (or restarts) a task, it will load that specific task's code. Unless classes can be unloaded, this will +become a memory leak, as new versions of classes are loaded and the total number of loaded classes accumulates over time. This +typically manifests itself though a **OutOfMemoryError: PermGen**. + +Common causes for class leaks and suggested fixes: + + - *Lingering Threads*: Make sure the application functions/sources/sinks shuts down all threads. Lingering threads cost resources themselves and + additionally typically hold references to (user code) objects, preventing garbage collection and unloading of the classes. + + - *Interners*: Avoid caching objects in special structures that live beyond the lifetime of the functions/sources/sinks. Examples are Guava's + interners, or Avro's class/object caches in the serializers.