From 3b626755588a461c8f195a48c4777fd533bf54ec Mon Sep 17 00:00:00 2001
From: Stephan Ewen <sewen@apache.org>
Date: Fri, 20 Jan 2017 18:50:21 +0100
Subject: [PATCH] [FLINK-5459] [docs] Add troubleshooting guide for
 classloading issues

---
 docs/monitoring/debugging_classloading.md | 101 ++++++++++++++++++++--
 1 file changed, 92 insertions(+), 9 deletions(-)

diff --git a/docs/monitoring/debugging_classloading.md b/docs/monitoring/debugging_classloading.md
index e4e908ea8c74c..40822c03cdccb 100644
--- a/docs/monitoring/debugging_classloading.md
+++ b/docs/monitoring/debugging_classloading.md
@@ -27,19 +27,102 @@ under the License.
 
 ## Overview of Classloading in Flink
 
-  - What is in the Application Classloader for different deployment techs
-  - What is in the user code classloader
+When running Flink applications, the JVM will load various classes over time.
+These classes can be devided into two domains:
 
-  - Access to the user code classloader for applications
+  - The **Flink Framework** domain: This includes all code in the `/lib` directory in the Flink directory.
+    By default these are the classes of Apache Flink and its core dependencies.
 
-## Classpath Setups
+  - The **User Code** domain: These are all classes that are included in the JAR file submitted via the CLI or web interface.
+    That includes the job's classes, and all libraries and connectors that are put into the uber JAR.
+
+
+The class loading behaves slightly different for various Flink setups:
+
+**Standalone**
+
+When starting a the Flink cluster, the JobManagers and TaskManagers are started with the Flink framework classes in the
+classpath. The classes from all jobs that are submitted against the cluster are loaded *dynamically*.
+
+**YARN**
+
+YARN classloading differs between single job deploymens and sessions:
+
+  - When submitting a Flink job directly to YARN (via `bin/flink run -m yarn-cluster ...`), dedicated TaskManagers and
+    JobManagers are started for that job. Those JVMs have both Flink framework classes and user code classes in their classpath.
+    That means that there is *no dynamic classloading* involved in that case.
+
+  - When starting a YARN session, the JobManagers and TaskManagers are started with the Flink framework classes in the
+    classpath. The classes from all jobs that are submitted against the session are loaded dynamically.
+
+**Mesos**
+
+Mesos setups following [this documentation](../setup/mesos.html) currently behave very much like the a 
+YARN session: The TaskManager and JobManager processes are started with the Flink framework classes in classpath, job
+classes are loaded dynamically when the jobs are submitted.
+
+
+## Avoiding Dynamic Classloading
+
+All components (JobManger, TaskManager, Client, ApplicationMaster, ...) log their classpath setting on startup.
+They can be found as part of the environment information at the beginnign of the log.
+
+When running a setup where the Flink JobManager and TaskManagers are exclusive to one particular job, one can put JAR files
+directly into the `/lib` folder to make sure they are part of the classpath and not loaded dynamically. 
+
+It usually works to put the job's JAR file into the `/lib` directory. The JAR will be part of both the classpath
+(the *AppClassLoader*) and the dynamic class loader (*FlinkUserCodeClassLoader*).
+Because the AppClassLoader is the parent of the FlinkUserCodeClassLoader (and Java loads parent-first), this should
+result in classes being loaded only once.
+
+For setups where the job's JAR file cannot be put to the `/lib` folder (for example because the setup is a session that is
+used by multiple jobs), it may still be posible to put common libraries to the `/lib` folder, and avoid dynamic class loading
+for those.
+
+
+## Manual Classloading in the Job
+
+In some cases, a transformation function, source, or sink needs to manually load classes (dynamically via reflection).
+To do that, it needs the classloader that has access to the job's classes.
+
+In that case, the functions (or sources or sinks) can be made a `RichFunction` (for example `RichMapFunction` or `RichWindowFunction`)
+and access the user code class loader via `getRuntimeContext().getUserCodeClassLoader()`.
+
+
+## X cannot be cast to X exceptions
+
+When you see an exception in the style `com.foo.X cannot be cast to com.foo.X`, it means that multiple versions of the class
+`com.foo.X` have been loaded by different class loaders, and types of that class are attempted to be assigned to each other.
+
+The reason is in most cases that an object of the `com.foo.X` class loaded from a previous execution attempt is still cached somewhere,
+and picked up by a restarted task/operator that reloaded the code. Note that this is again only possible in deployments that use
+dynamic class loading.
+
+Common causes of cached object instances:
+
+  - When using *Apache Avro*: The *SpecificDatumReader* caches instances of records. Avoid using `SpecificData.INSTANCE`. See also
+    [this discussion](http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/How-to-get-help-on-ClassCastException-when-re-submitting-a-job-tp10972p11133.html)
+
+  - Using certain serialization frameworks for cloning objects (such as *Apache Avro*)
+
+  - Interning objects (for example via Guava's Interners)
 
-  - Finding classpaths in logs
-  - Moving libraries and/or user code to the Application Classpath 
 
 ## Unloading of Dynamically Loaded Classes
 
-  - Checkpoint statistics overview
-  - Interpret time until checkpoints
-  - Synchronous vs. asynchronous checkpoint time
+All scenarios that involve dynamic class loading (i.e., standalone, sessions, mesos, ...) rely on classes being *unloaded* again.
+Class unloading means that the Garbage Collector finds that no objects from a class exist and more, and thus removes the class
+(the code, static variable, metadata, etc).
+
+Whenever a TaskManager starts (or restarts) a task, it will load that specific task's code. Unless classes can be unloaded, this will
+become a memory leak, as new versions of classes are loaded and the total number of loaded classes accumulates over time. This
+typically manifests itself though a **OutOfMemoryError: PermGen**.
+
+Common causes for class leaks and suggested fixes:
+
+  - *Lingering Threads*: Make sure the application functions/sources/sinks shuts down all threads. Lingering threads cost resources themselves and
+    additionally typically hold references to (user code) objects, preventing garbage collection and unloading of the classes.
+
+  - *Interners*: Avoid caching objects in special structures that live beyond the lifetime of the functions/sources/sinks. Examples are Guava's
+    interners, or Avro's class/object caches in the serializers.