toi-3.13.patch
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index b9e9bd8..a88912b 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3343,6 +3343,9 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
HIGHMEM regardless of setting
of CONFIG_HIGHPTE.
+ uuid_debug= (Boolean) whether to enable debugging of TuxOnIce's
+ uuid support.
+
vdso= [X86,SH]
vdso=2: enable compat VDSO (default with COMPAT_VDSO)
vdso=1: enable VDSO (default)
diff --git a/Documentation/power/tuxonice-internals.txt b/Documentation/power/tuxonice-internals.txt
new file mode 100644
index 0000000..7a96186
--- /dev/null
+++ b/Documentation/power/tuxonice-internals.txt
@@ -0,0 +1,477 @@
+ TuxOnIce 3.0 Internal Documentation.
+ Updated to 26 March 2009
+
+1. Introduction.
+
+ TuxOnIce 3.0 is an addition to the Linux Kernel, designed to
+ allow the user to quickly shut down and quickly boot a computer, without
+ needing to close documents or programs. It is equivalent to the
+ hibernate facility in some laptops. This implementation, however,
+ requires no special BIOS or hardware support.
+
+ The code in these files is based upon the original implementation
+ prepared by Gabor Kuti and additional work by Pavel Machek and a
+ host of others. This code has been substantially reworked by Nigel
+ Cunningham, again with the help and testing of many others, not the
+ least of whom is Michael Frank. At its heart, however, the operation is
+ essentially the same as Gabor's version.
+
+2. Overview of operation.
+
+ The basic sequence of operations is as follows:
+
+ a. Quiesce all other activity.
+ b. Ensure enough memory and storage space are available, and attempt
+ to free memory/storage if necessary.
+ c. Allocate the required memory and storage space.
+ d. Write the image.
+ e. Power down.
+
+ There are a number of complicating factors which mean that things are
+ not as simple as the above would imply, however...
+
+ o The activity of each process must be stopped at a point where it will
+ not be holding locks necessary for saving the image, or unexpectedly
+ restart operations due to something like a timeout and thereby make
+ our image inconsistent.
+
+ o It is desirable that we sync outstanding I/O to disk before calculating
+ image statistics. This reduces corruption if one should suspend but
+ then not resume, and also makes later parts of the operation safer (see
+ below).
+
+ o We need to get as close as we can to an atomic copy of the data.
+ Inconsistencies in the image will result in inconsistent memory contents at
+ resume time, and thus in instability of the system and/or file system
+ corruption. This would appear to imply a maximum image size of one half of
+ the amount of RAM, but we have a solution... (again, below).
+
+ o In 2.6, we choose to play nicely with the other suspend-to-disk
+ implementations.
+
+3. Detailed description of internals.
+
+ a. Quiescing activity.
+
+ Safely quiescing the system is achieved using three separate but related
+ aspects.
+
+ First, we note that the vast majority of processes don't need to run during
+ suspend. They can be 'frozen'. We therefore implement a refrigerator
+ routine, which processes enter and in which they remain until the cycle is
+ complete. Processes enter the refrigerator via try_to_freeze() invocations
+ at appropriate places. A process cannot be frozen in any old place. It
+ must not be holding locks that will be needed for writing the image or
+ freezing other processes. For this reason, userspace processes generally
+ enter the refrigerator via the signal handling code, and kernel threads at
+ the place in their event loops where they drop locks and yield to other
+ processes or sleep.
+
+ The task of freezing processes is complicated by the fact that there can be
+ interdependencies between processes. Freezing process A before process B may
+ mean that process B cannot be frozen, because it stops while waiting for
+ process A rather than in the refrigerator. This issue is seen where
+ userspace waits on freezable kernel threads or fuse filesystem threads. To
+ address this issue, we implement the following algorithm for quiescing
+ activity:
+
+ - Freeze filesystems (including fuse - userspace programs starting
+ new requests are immediately frozen; programs already running
+ requests complete their work before being frozen in the next
+ step)
+ - Freeze userspace
+ - Thaw filesystems (this is safe now that userspace is frozen and no
+ fuse requests are outstanding).
+ - Invoke sys_sync (noop on fuse).
+ - Freeze filesystems
+ - Freeze kernel threads
+
+ If we need to free memory, we thaw kernel threads and filesystems, but not
+ userspace. We can then free caches without worrying about deadlocks due to
+ swap files being on frozen filesystems or such like.
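
The ordering above can be sketched as a small user-space C model. This is not kernel code: the struct and function names are hypothetical, and each step is reduced to flipping a flag so that the sequence and the partial thaw are easy to follow.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of what is frozen at a given moment. */
struct quiesce_state {
	bool filesystems_frozen;
	bool userspace_frozen;
	bool kthreads_frozen;
	int syncs;		/* number of sync passes performed */
};

/* Walk the six steps listed above, in order. */
static void quiesce_all(struct quiesce_state *s)
{
	s->filesystems_frozen = true;	/* 1. freeze filesystems (incl. fuse) */
	s->userspace_frozen = true;	/* 2. freeze userspace */
	s->filesystems_frozen = false;	/* 3. thaw filesystems */
	assert(s->userspace_frozen);	/* sync only once userspace is quiet */
	s->syncs++;			/* 4. sync */
	s->filesystems_frozen = true;	/* 5. freeze filesystems again */
	s->kthreads_frozen = true;	/* 6. freeze kernel threads */
}

/* Partial thaw used when memory must be freed: userspace stays frozen. */
static void thaw_for_freeing_memory(struct quiesce_state *s)
{
	s->kthreads_frozen = false;
	s->filesystems_frozen = false;
}
```

After quiesce_all() every flag is set and exactly one sync has run; thaw_for_freeing_memory() clears only the filesystem and kernel-thread flags, which mirrors the statement that userspace is not thawed while caches are freed.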
+
+ b. Ensure enough memory & storage are available.
+
+ We have a number of constraints to meet in order to be able to successfully
+ suspend and resume.
+
+ First, the image will be written in two parts, described below. One of these
+ parts needs to have an atomic copy made, which of course implies a maximum
+ size of one half of the amount of system memory. The other part ('pageset')
+ is not atomically copied, and can therefore be as large or small as desired.
+
+ Second, we have constraints on the amount of storage available. In these
+ calculations, we may also consider any compression that will be done. The
+ cryptoapi module allows the user to configure an expected compression ratio.
+
+ Third, the user can specify an arbitrary limit on the image size, in
+ megabytes. This limit is treated as a soft limit, so that we don't fail the
+ attempt to suspend if we cannot meet this constraint.
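
As a rough illustration of the three constraints, the following C sketch checks them against a set of estimates. Everything here is an assumption made for the example: the struct, the function names, the use of pages as the unit, the reading of the expected compression ratio as "compressed size as a percentage of the original", and the choice to compare the soft limit against the uncompressed page count.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical inputs to the checks; all sizes are in pages. */
struct image_estimate {
	unsigned long ram_pages;	/* total system memory */
	unsigned long pageset1_pages;	/* pages needing an atomic copy */
	unsigned long pageset2_pages;	/* pages written without a copy */
	unsigned long storage_pages;	/* storage available for the image */
	unsigned int expected_percent;	/* assumed: compressed size, % of original */
	unsigned long soft_limit_pages;	/* user's image size limit, 0 = none */
};

/* First constraint: the atomically copied part fits in half of memory. */
static bool atomic_copy_fits(const struct image_estimate *e)
{
	return e->pageset1_pages <= e->ram_pages / 2;
}

/* Second constraint: storage needed after the expected compression. */
static unsigned long storage_needed(const struct image_estimate *e)
{
	unsigned long long total =
		(unsigned long long)e->pageset1_pages + e->pageset2_pages;

	assert(e->expected_percent <= 100);
	return (unsigned long)((total * e->expected_percent + 99) / 100);
}

static bool hard_constraints_met(const struct image_estimate *e)
{
	return atomic_copy_fits(e) && storage_needed(e) <= e->storage_pages;
}

/* Third constraint is soft: exceeding it does not fail the attempt. */
static bool over_soft_limit(const struct image_estimate *e)
{
	return e->soft_limit_pages != 0 &&
	       e->pageset1_pages + e->pageset2_pages > e->soft_limit_pages;
}
```

With 1000 pages of RAM, 400 pages in pageset1, 500 in pageset2 and an expected 50% compression, the sketch asks for 450 pages of storage. Being over the soft limit is reported separately so that a caller can try to shrink the image without treating the situation as an error.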
+
+ c. Allocate the required memory and storage space.
+
+ Having done the initial freeze, we determine whether the above constraints
+ are met, and seek to allocate the metadata for the image. If the constraints
+ are not met, or we fail to allocate the required space for the metadata, we
+ seek to free the amount of memory that we calculate is needed and try again.
+ We allow up to four iterations of this loop before aborting the cycle. If we
+ do fail, it should only be because of a bug in TuxOnIce's calculations.
+
+ These steps are merged together in the prepare_image function, found in
+ prepare_image.c. The functions are merged because of the cyclical nature
+ of the problem of calculating how much memory and storage is needed. Since
+ the data structures containing the information about the image must
+ themselves take memory and use storage, the amount of memory and storage
+ required changes as we prepare the image. Since the changes are not large,
+ only one or two iterations will be required to achieve a solution.
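
The retry loop described in this section might look roughly like the sketch below. It is a simplified user-space model, not the contents of prepare_image.c: the mem_model struct, the reclaim_step field and the function name are hypothetical, and "constraints met" is reduced to a single comparison.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PREPARE_ATTEMPTS 4	/* "up to four iterations" */

/* Hypothetical stand-in for the real memory and storage accounting. */
struct mem_model {
	unsigned long free_pages;	/* pages currently free */
	unsigned long needed_pages;	/* pages the image and metadata need */
	unsigned long reclaim_step;	/* pages one freeing pass recovers */
};

/*
 * Returns the attempt number (1..4) on which the constraints were met,
 * or -1 if they were still unmet after the last attempt.
 */
static int prepare_image_sketch(struct mem_model *m)
{
	int attempt;

	assert(m != NULL);
	for (attempt = 1; attempt <= MAX_PREPARE_ATTEMPTS; attempt++) {
		if (m->free_pages >= m->needed_pages)
			return attempt;
		/* Constraints not met: free memory, then try again. */
		m->free_pages += m->reclaim_step;
	}
	return -1;
}
```

Starting from 100 free pages with 250 needed and 100 pages recovered per pass, the sketch succeeds on the third attempt; with only 10 pages recovered per pass it gives up after the fourth.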
+
+ The recursive nature of the algorithm is minimised by keeping user space
+ frozen while preparing the image, and by the fact that our records of which
+ pages are to be saved and which pageset they are saved in use bitmaps (so
+ that changes in number or fragmentation of the pages to be saved don't
+ feed back via changes in the amount of memory needed for metadata). The
+ recursiveness is thus limited to any extra slab pages allocated to store the
+ extents that record storage used, and the effects of seeking to free memory.
+
+ d. Write the image.
+
+ We previously mentioned the need to create an atomic copy of the data, and
+ the half-of-memory limitation that is implied in this. This limitation is
+ circumvented by dividing the memory to be saved into two parts, called
+ pagesets.
+
+ Pageset2 contains most of the page cache - the pages on the active and
+ inactive LRU lists that aren't needed or modified while TuxOnIce is
+ running, so they can be safely written without an atomic copy. They are
+ therefore saved first and reloaded last. While saving these pages,
+ TuxOnIce carefully ensures that the work of writing the pages doesn't make
+ the image inconsistent. With the support for Kernel (Video) Mode Setting
+ going into the kernel at the time of writing, we need to check for pages
+ on the LRU that are used by KMS, and exclude them from pageset2. They are
+ atomically copied as part of pageset1.
+
+ Once pageset2 has been saved, we prepare to do the atomic copy of remaining
+ memory. As part of the preparation, we power down drivers, thereby providing
+ them with the opportunity to have their state recorded in the image. The
+ amount of memory allocated by drivers for this is usually negligible, but if
+ DRI is in use, video drivers may require significant amounts. Ideally we
+ would be able to query drivers while preparing the image as to the amount of
+ memory they will need. Unfortunately no such mechanism exists at the time of
+ writing. For this reason, TuxOnIce allows the user to set an
+ 'extra_pages_allowance', which is used to seek to ensure sufficient memory
+ is available for drivers at this point. TuxOnIce also lets the user set this
+ value to 0. In this case, a test driver suspend is done while preparing the
+ image, and the difference (plus a margin) used instead. TuxOnIce will also
+ automatically restart the hibernation process (twice at most) if it finds
+ that the extra pages allowance is not sufficient. It will then use what was
+ actually needed (plus a margin, again). Failure to hibernate should thus
+ be an extremely rare occurrence.
+
+ Having suspended the drivers, we save the CPU context before making an
+ atomic copy of pageset1, resuming the drivers and saving the atomic copy.
+ After saving the two pagesets, we just need to save our metadata before
+ powering down.
+
+ As we mentioned earlier, the contents of pageset2 pages aren't needed once
+ they've been saved. We therefore use them as the destination of our atomic
+ copy. In the unlikely event that pageset1 is larger, extra pages are
+ allocated while the image is being prepared. This is normally only a real
+ possibility when the system has just been booted and the page cache is
+ small.
+
+ This is where we need to be careful about syncing, however. Pageset2 will
+ probably contain filesystem metadata. If this is overwritten with pageset1
+ and then a sync occurs, the filesystem will be corrupted - at least until
+ resume time and another sync of the restored data. Since there is a
+ possibility that the user might not resume or (may it never be!) that
+ TuxOnIce might oops, we do our utmost to avoid syncing filesystems after
+ copying pageset1.
+
+ e. Power down.
+
+ Powering down uses standard kernel routines. TuxOnIce supports powering down
+ using the ACPI S3, S4 and S5 methods or the kernel's non-ACPI power-off.
+ Supporting suspend to ram (S3) as a power off option might sound strange,
+ but it allows the user to quickly get their system up and running again if
+ the battery doesn't run out (we just need to re-read the overwritten pages)
+ and if the battery does run out (or the user removes power), they can still
+ resume.
+
+4. Data Structures.
+
+ TuxOnIce uses four main structures to store its metadata and configuration
+ information:
+
+ a) Pageflags bitmaps.
+
+ TuxOnIce records which pages will be in pageset1, pageset2, the destination
+ of the atomic copy and the source of the atomically restored image using
+ bitmaps. The code used is that written for swsusp, with small improvements
+ to match TuxOnIce's requirements.
+
+ The pageset1 bitmap is thus easily stored in the image header for use at
+ resume time.
+
+ As mentioned above, using bitmaps also means that the amount of memory and
+ storage required for recording the above information is constant. This
+ greatly simplifies the work of preparing the image. In earlier versions of
+ TuxOnIce, extents were used to record which pages would be stored. In that
+ case, however, eating memory could result in greater fragmentation of the
+ lists of pages, which in turn required more memory to store the extents and
+ more storage in the image header. These could in turn require further
+ freeing of memory, and another iteration. All of this complexity is removed
+ by having bitmaps.
+
+ Bitmaps also make a lot of sense because TuxOnIce only ever iterates
+ through the lists. There is therefore no cost to not being able to find the
+ nth page in constant (O(1)) time. We only need to worry about the cost of finding
+ the n+1th page, given the location of the nth page. Bitwise optimisations
+ help here.
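
To illustrate the "next page after the nth" operation, here is a small self-contained C sketch of a next-set-bit search over a bitmap. It is not the swsusp or TuxOnIce bitmap code; the helper names are hypothetical, and the only bitwise optimisation shown is skipping a whole word once no bits remain in it.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Mark one page (bit) in the map. */
static void set_bit_in_map(unsigned long *map, size_t bit)
{
	map[bit / BITS_PER_WORD] |= 1UL << (bit % BITS_PER_WORD);
}

/*
 * Return the index of the first set bit at or after 'start',
 * or 'nbits' if there is none.
 */
static size_t next_set_bit(const unsigned long *map, size_t nbits, size_t start)
{
	size_t i = start;

	assert(map != NULL);
	while (i < nbits) {
		unsigned long word = map[i / BITS_PER_WORD] >> (i % BITS_PER_WORD);

		if (word == 0) {
			/* Nothing left in this word: jump to the next one. */
			i = (i / BITS_PER_WORD + 1) * BITS_PER_WORD;
			continue;
		}
		/* Walk forward to the lowest set bit that remains. */
		while (!(word & 1UL)) {
			word >>= 1;
			i++;
		}
		return i < nbits ? i : nbits;
	}
	return nbits;
}
```

Iterating over every marked page is then a loop of the form `for (i = next_set_bit(map, n, 0); i < n; i = next_set_bit(map, n, i + 1))`, which never needs random access to the nth entry.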
+
+ b) Extents for block data.
+
+ TuxOnIce supports writing the image to multiple block devices. In the case
+ of swap, multiple partitions and/or files may be in use, and we happily use
+ them all (with the exception of compcache pages, which we allocate but do
+ not use). This use of multiple block devices is accomplished as follows:
+
+ Whatever the actual source of the allocated storage, the destination of the
+ image can be viewed in terms of one or more block devices, and on each
+ device, a list of sectors. To simplify matters, we only use contiguous,
+ PAGE_SIZE aligned sectors, like the swap code does.
+
+ Since sector numbers on each bdev may well not start at 0, it makes much
+ more sense to use extents here. Contiguous ranges of pages can thus be
+ represented in the extents by contiguous values.
+
+ Variations in block size are taken account of in transforming this data
+ into the parameters for bio submission.
+
+ We can thus implement a layer of abstraction wherein the core of TuxOnIce
+ doesn't have to worry about which device we're currently writing to or
+ where in the device we are. It simply requests that the next page in the
+ pageset or header be written, leaving the details to this lower layer.
+ The lower layer remembers where in the sequence of devices and blocks each
+ pageset starts. The header always starts at the beginning of the allocated
+ storage.
+
+ So extents are:
+
+ struct extent {
+ unsigned long minimum, maximum;
+ struct extent *next;
+ };
+
+ These are combined into chains of extents for a device:
+
+ struct extent_chain {
+ int size; /* size of the extent ie sum (max-min+1) */
+ int allocs, frees;
+ char *name;
+ struct extent *first, *last_touched;
+ };
+
+ For each bdev, we need to store a little more info:
+
+ struct suspend_bdev_info {
+ struct block_device *bdev;
+ dev_t dev_t;
+ int bmap_shift;
+ int blocks_per_page;
+ };
+
+ The dev_t is used to identify the device in the stored image. As a result,
+ we expect devices at resume time to have the same major and minor numbers
+ as they had while suspending. This is primarily a concern where the user
+ utilises LVM for storage, as they will need to dmsetup their partitions in
+ such a way as to maintain this consistency at resume time.
+
+ bmap_shift and blocks_per_page apply the effects of variations in blocks
+ per page settings for the filesystem and underlying bdev. For most
+ filesystems, these are the same, but for xfs, they can have independent
+ values.
+
+ Combining these two structures together, we have everything we need to
+ record what devices and what blocks on each device are being used to
+ store the image, and to submit i/o using submit_bio.
+
+ The last elements in the picture are a means of recording how the storage
+ is being used.
+
+ We do this first and foremost by implementing a layer of abstraction on
+ top of the devices and extent chains which allows us to view however many
+ devices there might be as one long storage tape, with a single 'head' that
+ tracks a 'current position' on the tape:
+
+ struct extent_iterate_state {
+ struct extent_chain *chains;
+ int num_chains;
+ int current_chain;
+ struct extent *current_extent;
+ unsigned long current_offset;
+ };
+
+ That is, *chains points to an array of size num_chains of extent chains.
+ For the filewriter, this is always a single chain. For the swapwriter, the
+ array is of size MAX_SWAPFILES.
+
+ current_chain, current_extent and current_offset thus point to the current
+ index in the chains array (and into a matching array of struct
+ suspend_bdev_info), the current extent in that chain (to optimise access),
+ and the current value in the offset.
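
The 'tape' view can be sketched in user-space C as follows. The struct layouts are trimmed copies of the ones shown above; tape_rewind() and tape_advance() are hypothetical names for this illustration and are not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

struct extent {
	unsigned long minimum, maximum;
	struct extent *next;
};

struct extent_chain {		/* trimmed to the fields the sketch needs */
	int size;
	struct extent *first;
};

struct extent_iterate_state {
	struct extent_chain *chains;
	int num_chains;
	int current_chain;
	struct extent *current_extent;
	unsigned long current_offset;
};

/* Put the head on the first block of the first non-empty chain. */
static int tape_rewind(struct extent_iterate_state *s)
{
	assert(s != NULL);
	for (s->current_chain = 0; s->current_chain < s->num_chains;
	     s->current_chain++) {
		struct extent *e = s->chains[s->current_chain].first;

		if (e) {
			s->current_extent = e;
			s->current_offset = e->minimum;
			return 0;
		}
	}
	s->current_extent = NULL;
	return -1;
}

/* Move the head one block forward; returns -1 at the end of the tape. */
static int tape_advance(struct extent_iterate_state *s)
{
	if (!s->current_extent)
		return -1;
	if (s->current_offset < s->current_extent->maximum) {
		s->current_offset++;
		return 0;
	}
	/* Extent exhausted: next extent, or first extent of the next chain. */
	s->current_extent = s->current_extent->next;
	while (!s->current_extent) {
		if (++s->current_chain >= s->num_chains)
			return -1;
		s->current_extent = s->chains[s->current_chain].first;
	}
	s->current_offset = s->current_extent->minimum;
	return 0;
}
```

A caller that only ever asks for "the next block" sees blocks 5, 6, 20, 21 and then 100 for two chains holding the extents 5-6, 20-21 and 100-100, without knowing where one device ends and the next begins.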
+
+ The image is divided into three parts:
+ - The header
+ - Pageset 1
+ - Pageset 2
+
+ The header always starts at the first device and first block. We know its
+ size before we begin to save the image because we carefully account for
+ everything that will be stored in it.
+
+ The second pageset (LRU) is stored first. It begins on the next page after
+ the end of the header.
+
+ The first pageset is stored second. Its start location is only known once
+ pageset2 has been saved, since pageset2 may be compressed as it is written.
+ This location is thus recorded at the end of saving pageset2. It is page
+ aligned also.
+
+ Since this information is needed at resume time, and the location of extents
+ in memory will differ at resume time, this needs to be stored in a portable
+ way:
+
+ struct extent_iterate_saved_state {
+ int chain_num;
+ int extent_num;
+ unsigned long offset;
+ };
+
+ We can thus implement a layer of abstraction wherein the core of TuxOnIce
+ doesn't have to worry about which device we're currently writing to or
+ where in the device we are. It simply requests that the next page in the
+ pageset or header be written, leaving the details to this layer, and
+ invokes the routines to remember and restore the position, without having
+ to worry about the details of how the data is arranged on disk or such like.
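
A minimal sketch of remembering and restoring a position, assuming that extent_num is a zero-based index into the chain (the text above does not say how it is counted). The struct layouts are trimmed, and save_position() and restore_position() are hypothetical names. The key idea is that a pointer is replaced by an index when saving and turned back into a pointer when restoring.

```c
#include <assert.h>
#include <stddef.h>

struct extent {
	unsigned long minimum, maximum;
	struct extent *next;
};

struct extent_chain {		/* trimmed to the field the sketch needs */
	struct extent *first;
};

struct extent_iterate_state {
	struct extent_chain *chains;
	int num_chains;
	int current_chain;
	struct extent *current_extent;
	unsigned long current_offset;
};

struct extent_iterate_saved_state {
	int chain_num;
	int extent_num;		/* assumed: zero-based index within the chain */
	unsigned long offset;
};

/* Record the head position without storing any pointers. */
static void save_position(const struct extent_iterate_state *s,
			  struct extent_iterate_saved_state *out)
{
	const struct extent *e = s->chains[s->current_chain].first;
	int n = 0;

	while (e && e != s->current_extent) {
		e = e->next;
		n++;
	}
	assert(e != NULL);	/* the current extent must be in its chain */
	out->chain_num = s->current_chain;
	out->extent_num = n;
	out->offset = s->current_offset;
}

/* Rebuild the pointer from the saved indices; -1 if they do not fit. */
static int restore_position(struct extent_iterate_state *s,
			    const struct extent_iterate_saved_state *in)
{
	struct extent *e;
	int i;

	if (in->chain_num < 0 || in->chain_num >= s->num_chains)
		return -1;
	e = s->chains[in->chain_num].first;
	for (i = 0; e && i < in->extent_num; i++)
		e = e->next;
	if (!e || in->offset < e->minimum || in->offset > e->maximum)
		return -1;
	s->current_chain = in->chain_num;
	s->current_extent = e;
	s->current_offset = in->offset;
	return 0;
}
```

Because only integers are stored, the saved state stays valid when the extents are rebuilt at different addresses at resume time.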
+
+ c) Modules
+
+ One aim in designing TuxOnIce was to make it flexible. We wanted to allow
+ for the implementation of different methods of transforming a page to be
+ written to disk and different methods of getting the pages stored.
+
+ In early versions (the betas and perhaps Suspend1), compression support was
+ inlined in the image writing code, and the data structures and code for
+ managing swap were intertwined with the rest of the code. A number of people
+ had expressed interest in implementing image encryption, and alternative
+ methods of storing the image.
+
+ In order to achieve this, TuxOnIce was given a modular design.
+
+ A module is a single file which encapsulates the functionality needed
+ to transform a pageset of data (encryption or compression, for example),
+ or to write the pageset to a device. The former type of module is called
+ a 'page-transformer', the latter a 'writer'.
+
+ Modules are linked together in pipeline fashion. There may be zero or more
+ page transformers in a pipeline, and there is always exactly one writer.
+ The pipeline follows this pattern:
+
+ ---------------------------------
+ | TuxOnIce Core |
+ ---------------------------------
+ |
+ |
+ ---------------------------------
+ | Page transformer 1 |
+ ---------------------------------
+ |
+ |
+ ---------------------------------
+ | Page transformer 2 |
+ ---------------------------------
+ |
+ |
+ ---------------------------------
+ | Writer |
+ ---------------------------------
+
+ During the writing of an image, the core code feeds pages one at a time
+ to the first module. This module performs whatever transformations it
+ implements on the incoming data, completely consuming the incoming data and
+ feeding output in a similar manner to the next module.
+
+ All routines are SMP safe, and the final result of the transformations is
+ written with an index (provided by the core) and size of the output by the
+ writer. As a result, we can have multithreaded I/O without needing to
+ worry about the sequence in which pages are written (or read).
+
+ During reading, the pipeline works in the reverse direction. The core code
+ calls the first module with the address of a buffer which should be filled.
+ (Note that the buffer size is always PAGE_SIZE at this time). This module
+ will in turn request data from the next module and so on down until the
+ writer is made to read from the stored image.
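
The write and read directions of the pipeline can be sketched with function pointers, as below. This is a user-space illustration under several assumptions: the module struct, its 'next' pointer and every function name are hypothetical, an 8-byte chunk stands in for PAGE_SIZE, an XOR stands in for a real transformation such as compression, and the writer keeps its chunks in a small in-memory array rather than on a device.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK_SIZE 8	/* stands in for PAGE_SIZE */
#define MAX_CHUNKS 4

struct module {
	struct module *next;	/* NULL for the writer */
	int (*write_chunk)(struct module *self, const unsigned char *buf);
	int (*read_chunk)(struct module *self, unsigned char *buf);
};

/* Writer: the last stage, here backed by an in-memory "device". */
static unsigned char storage[MAX_CHUNKS][CHUNK_SIZE];
static int stored, fetched;

static int writer_write(struct module *self, const unsigned char *buf)
{
	(void)self;
	if (stored >= MAX_CHUNKS)
		return -1;
	memcpy(storage[stored++], buf, CHUNK_SIZE);
	return 0;
}

static int writer_read(struct module *self, unsigned char *buf)
{
	(void)self;
	if (fetched >= stored)
		return -1;
	memcpy(buf, storage[fetched++], CHUNK_SIZE);
	return 0;
}

/* Page transformer: consumes its input and feeds output to the next stage. */
static int xor_write(struct module *self, const unsigned char *buf)
{
	unsigned char out[CHUNK_SIZE];
	size_t i;

	assert(self->next != NULL);
	for (i = 0; i < CHUNK_SIZE; i++)
		out[i] = buf[i] ^ 0x5a;
	return self->next->write_chunk(self->next, out);
}

/* On read, first pull data from the next stage, then undo the transform. */
static int xor_read(struct module *self, unsigned char *buf)
{
	size_t i;
	int err = self->next->read_chunk(self->next, buf);

	if (err)
		return err;
	for (i = 0; i < CHUNK_SIZE; i++)
		buf[i] ^= 0x5a;
	return 0;
}

static struct module writer = { NULL, writer_write, writer_read };
static struct module xor_stage = { &writer, xor_write, xor_read };
```

The core would only ever call into the first stage (xor_stage here). On write the data flows down to the writer, and on read each stage asks the one below it for data before undoing its own transformation, so the caller gets back the bytes it originally supplied.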
+
+ Part of the definition of the structure of a module thus looks like this:
+
+ int (*rw_init) (int rw, int stream_number);
+ int (*rw_cleanup) (int rw);
+ int (*write_chunk) (struct page *buffer_page);
+ int (*read_chunk) (struct page *buffer_page, int sync);
+
+ It should be noted that the _cleanup routine may be called before the
+ full stream of data has been read or written. While writing the image,
+ the user may (depending upon settings) choose to abort suspending, and
+ if we are in the midst of writing the last portion of the image, a portion
+ of the second pageset may be reread. This may also happen if an error
+ occurs and we seek to abort the process of writing the image.
+
+ The modular design is also useful in a number of other ways. It provides
+ a means whereby we can add support for:
+
+ - providing overall initialisation and cleanup routines;
+ - serialising configuration information in the image header;
+ - providing debugging information to the user;
+ - determining memory and image storage requirements;
+ - dis/enabling components at run-time;
+ - configuring the module (see below);
+
+ ...and routines for writers specific to their work:
+ - Parsing a resume= location;
+ - Determining whether an image exists;
+ - Marking a resume as having been attempted;
+ - Invalidating an image;
+
+ Since some parts of the core - the user interface and storage manager
+ support - have use for some of these functions, they are registered as
+ 'miscellaneous' modules as well.
+
+ d) Sysfs data structures.
+
+ This brings us naturally to support for configuring TuxOnIce. We desired to
+ provide a way to make TuxOnIce as flexible and configurable as possible.
+ The user shouldn't have to reboot just because they want to now hibernate to
+ a file instead of a partition, for example.
+
+ To accomplish this, TuxOnIce implements a very generic means whereby the
+ core and modules can register new sysfs entries. All TuxOnIce entries use
+ a single _store and _show routine, both of which are found in
+ tuxonice_sysfs.c in the kernel/power directory. These routines handle the
+ most common operations - getting and setting the values of bits, integers,
+ longs, unsigned longs and strings in one place, and allow overrides for
+ customised get and set options as well as side-effect routines for all
+ reads and writes.
+
+ When combined with some simple macros, a new sysfs entry can then be defined
+ in just a couple of lines:
+
+ SYSFS_INT("progress_granularity", SYSFS_RW, &progress_granularity, 1,
+ 2048, 0, NULL),
+
+ This defines a sysfs entry named "progress_granularity" which is rw and
+ allows the user to access an integer stored at &progress_granularity, giving
+ it a value between 1 and 2048 inclusive.
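
To make the idea of a shared, range-aware store routine concrete, here is a user-space C sketch. It is not the code in tuxonice_sysfs.c: the int_entry struct and int_entry_store() are hypothetical, and rejecting an out-of-range value (rather than, say, clamping it) is an assumption of the sketch, since the text above only says that the value lies between the two bounds.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical description of one integer entry. */
struct int_entry {
	const char *name;
	int *value;		/* variable the entry reads and writes */
	int minimum, maximum;	/* inclusive bounds */
};

/*
 * Parse 'text' as a decimal integer and store it if it lies within
 * the entry's bounds. Returns 0 on success, -1 otherwise.
 */
static int int_entry_store(const struct int_entry *e, const char *text)
{
	char *end;
	long v;

	assert(e != NULL && e->value != NULL && text != NULL);
	errno = 0;
	v = strtol(text, &end, 10);
	if (end == text || errno == ERANGE)
		return -1;
	if (*end == '\n')	/* tolerate the newline that echo appends */
		end++;
	if (*end != '\0')
		return -1;
	if (v < e->minimum || v > e->maximum)
		return -1;
	*e->value = (int)v;
	return 0;
}
```

One routine of this shape can serve every integer entry, because the per-entry differences (name, target variable, bounds) live in the data rather than in the code.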
+
+ Sysfs entries are registered under /sys/power/tuxonice, and entries for
+ modules are located in a subdirectory named after the module.
+
diff --git a/Documentation/power/tuxonice.txt b/Documentation/power/tuxonice.txt
new file mode 100644
index 0000000..3bf0575
--- /dev/null
+++ b/Documentation/power/tuxonice.txt
@@ -0,0 +1,948 @@
+ --- TuxOnIce, version 3.0 ---
+
+1. What is it?
+2. Why would you want it?
+3. What do you need to use it?
+4. Why not just use the version already in the kernel?
+5. How do you use it?
+6. What do all those entries in /sys/power/tuxonice do?
+7. How do you get support?
+8. I think I've found a bug. What should I do?
+9. When will XXX be supported?
+10. How does it work?
+11. Who wrote TuxOnIce?
+
+1. What is it?
+
+ Imagine you're sitting at your computer, working away. For some reason, you
+ need to turn off your computer for a while - perhaps it's time to go home
+ for the day. When you come back to your computer next, you're going to want
+ to carry on where you left off. Now imagine that you could push a button and
+ have your computer store the contents of its memory to disk and power down.
+ Then, when you next start up your computer, it loads that image back into
+ memory and you can carry on from where you were, just as if you'd never
+ turned the computer off. Starting up takes far less time, with no reopening of
+ applications or finding what directory you put that file in yesterday.
+ That's what TuxOnIce does.
+
+ TuxOnIce has a long heritage. It began life as work by Gabor Kuti, who,
+ with some help from Pavel Machek, got an early version going in 1999. The
+ project was then taken over by Florent Chabaud while still in alpha version
+ numbers. Nigel Cunningham came on the scene when Florent was unable to
+ continue, moving the project into betas, then 1.0, 2.0 and so on up to
+ the present series. During the 2.0 series, the name was contracted to
+ Suspend2 and the website suspend2.net created. Beginning around July 2007,
+ a transition to calling the software TuxOnIce was made, to seek to help
+ make it clear that TuxOnIce is more concerned with hibernation than suspend
+ to ram.
+
+ Pavel Machek's swsusp code, which was merged around 2.5.17, retains the
+ original name, and was essentially a fork of the beta code until Rafael
+ Wysocki came on the scene in 2005 and began to improve it further.
+
+2. Why would you want it?
+
+ Why wouldn't you want it?
+
+ Being able to save the state of your system and quickly restore it improves
+ your productivity - you get a useful system in far less time than through
+ the normal boot process. You also get to be completely 'green', using zero
+ power, or as close to that as possible (the computer may still provide
+ minimal power to some devices, so they can initiate a power on, but that
+ will be the same amount of power as would be used if you told the computer
+ to shut down).
+
+3. What do you need to use it?
+
+ a. Kernel Support.
+
+ i) The TuxOnIce patch.
+
+ TuxOnIce is part of the Linux Kernel. This version is not part of Linus's
+ 2.6 tree at the moment, so you will need to download the kernel source and
+ apply the latest patch. Having done that, enable the appropriate options in
+ make [menu|x]config (under Power Management Options - look for "Enhanced
+ Hibernation"), compile and install your kernel. TuxOnIce works with SMP,
+ Highmem, preemption, fuse filesystems, x86-32, PPC and x86_64.
+
+ TuxOnIce patches are available from http://tuxonice.net.
+
+ ii) Compression support.
+
+ Compression support is implemented via the cryptoapi. You will therefore want
+ to select any Cryptoapi transforms that you want to use on your image from
+ the Cryptoapi menu while configuring your kernel. We recommend the use of the
+ LZO compression method - it is very fast and still achieves good compression.
+
+ You can also tell TuxOnIce to write its image to an encrypted and/or
+ compressed filesystem/swap partition. In that case, you don't need to do
+ anything special for TuxOnIce when it comes to kernel configuration.
+
+ iii) Configuring other options.
+
+ While you're configuring your kernel, try to configure as much as possible
+ to build as modules. We recommend this because there are a number of drivers
+ that are still in the process of implementing proper power management
+ support. In those cases, the best way to work around their current lack is
+ to build them as modules and remove the modules while hibernating. You might
+ also bug the driver authors to get their support up to speed, or even help!
+
+ b. Storage.
+
+ i) Swap.
+
+ TuxOnIce can store the hibernation image in your swap partition, a swap file or
+ a combination thereof. Whichever combination you choose, you will probably
+ want to create enough swap space to store the largest image you could have,
+ plus the space you'd normally use for swap. A good rule of thumb would be
+ to calculate the amount of swap you'd want without using TuxOnIce, and then
+ add the amount of memory you have. This swapspace can be arranged in any way
+ you'd like. It can be in one partition or file, or spread over a number. The
+ only requirement is that they be active when you start a hibernation cycle.
+
+ There is one exception to this requirement. TuxOnIce has the ability to turn
+ on one swap file or partition at the start of hibernating and turn it back off
+ at the end. If you want to ensure you have enough memory to store an image
+ when your memory is fully used, you might want to make one swap partition or
+ file for 'normal' use, and another for TuxOnIce to activate & deactivate
+ automatically. (Further details below).
+
+ ii) Normal files.
+
+ TuxOnIce includes a 'file allocator'. The file allocator can store your
+ image in a simple file. Since Linux has the concept of everything being a
+ file, this is more powerful than it initially sounds. If, for example, you
+ were to set up a network block device file, you could hibernate to a network
+ server. This has been tested and works to a point, but nbd itself isn't
+ stateless enough for our purposes.
+
+ Take extra care when setting up the file allocator. If you just type
+ commands without thinking and then try to hibernate, you could cause
+ irreversible corruption on your filesystems! Make sure you have backups.
+
+ Most people will only want to hibernate to a local file. To achieve that, do
+ something along the lines of:
+
+ echo "TuxOnIce" > /hibernation-file
+ dd if=/dev/zero bs=1M count=512 >> /hibernation-file
+
+ This will create a 512MB file called /hibernation-file. To get TuxOnIce to use
+ it:
+
+ echo /hibernation-file > /sys/power/tuxonice/file/target
+
+ Then
+
+ cat /sys/power/tuxonice/resume
+
+ Put the results of this into your bootloader's configuration (see also step
+ C, below):
+
+ ---EXAMPLE-ONLY-DON'T-COPY-AND-PASTE---
+ # cat /sys/power/tuxonice/resume
+ file:/dev/hda2:0x1e001
+
+ In this example, we would edit the append= line of our lilo.conf|menu.lst
+ so that it included:
+
+ resume=file:/dev/hda2:0x1e001
+ ---EXAMPLE-ONLY-DON'T-COPY-AND-PASTE---
+
+ For those who are thinking 'Could I make the file sparse?', the answer is
+ 'No!'. At the moment, there is no way for TuxOnIce to fill in the holes in
+ a sparse file while hibernating. In the longer term (post merge!), I'd like
+ to change things so that the file could be dynamically resized and have
+ holes filled as needed. Right now, however, that's not possible and not a
+ priority.
+
+ c. Bootloader configuration.
+
+ Using TuxOnIce also requires that you add an extra parameter to
+ your lilo.conf or equivalent. Here's an example for a swap partition:
+
+ append="resume=swap:/dev/hda1"
+
+ This would tell TuxOnIce that /dev/hda1 is a swap partition you
+ have. TuxOnIce will use the swap signature of this partition as a
+ pointer to your data when you hibernate. This means that (in this example)
+ /dev/hda1 doesn't need to be _the_ swap partition where all of your data
+ is actually stored. It just needs to be a swap partition that has a
+ valid signature.
+
+ You don't need to have a swap partition for this purpose. TuxOnIce
+ can also use a swap file, but usage is a little more complex. Having made
+ your swap file, turn it on and do
+
+ cat /sys/power/tuxonice/swap/headerlocations
+
+ (this assumes you've already compiled your kernel with TuxOnIce
+ support and booted it). The results of the cat command will tell you
+ what you need to put in lilo.conf:
+
+ For swap partitions like /dev/hda1, simply use resume=/dev/hda1.
+ For swapfile `swapfile`, use resume=swap:/dev/hda2:0x242d.
+
+ If the swapfile changes for any reason (it is moved to a different
+ location, it is deleted and recreated, or the filesystem is
+ defragmented) then you will have to check
+ /sys/power/tuxonice/swap/headerlocations for a new resume_block value.
+
+ Once you've compiled and installed the kernel and adjusted your bootloader
+ configuration, you should only need to reboot for the most basic part
+ of TuxOnIce to be ready.
+
+ If you only compile in the swap allocator, or only compile in the file
+ allocator, you don't need to add the "swap:" part of the resume=
+ parameters above. resume=/dev/hda2:0x242d will work just as well. If you
+ have compiled both and your storage is on swap, you can also use this
+ format (the swap allocator is the default allocator).
+
+ When compiling your kernel, one of the options in the 'Power Management
+ Support' menu, just above the 'Enhanced Hibernation (TuxOnIce)' entry is
+ called 'Default resume partition'. This can be used to set a default value
+ for the resume= parameter.
+
+ d. The hibernate script.
+
+ Since the driver model in 2.6 kernels is still being developed, you may need
+ to do more than just configure TuxOnIce. Users of TuxOnIce usually start the
+ process via a script which prepares for the hibernation cycle, tells the
+ kernel to do its stuff and then restores things afterwards. This script might
+ involve:
+
+ - Switching to a text console and back if X doesn't like the video card
+ status on resume.
+ - Un/reloading drivers that don't play well with hibernation.
+
+ Note that you might not be able to unload some drivers if there are
+ processes using them. You might have to kill off processes that hold
+ devices open. Hint: if your X server accesses a USB mouse, doing a
+ 'chvt' to a text console releases the device and you can unload the
+ module.
+
+ Check out the latest script (available on tuxonice.net).
+
+ e. The userspace user interface.
+
+ TuxOnIce has very limited support for displaying status if you only apply
+ the kernel patch - it can printk messages, but that is all. In addition,
+ some of the functions mentioned in this document (such as cancelling a cycle
+ or performing interactive debugging) are unavailable. To utilise these
+ functions, or simply get a nice display, you need the 'userui' component.
+ Userui comes in three flavours, usplash, fbsplash and text. Text should
+ work on any console. Usplash and fbsplash require the appropriate
+ (distro specific?) support.
+
+ To utilise a userui, TuxOnIce just needs to be told where to find the
+ userspace binary:
+
+ echo "/usr/local/sbin/tuxoniceui_fbsplash" > /sys/power/tuxonice/user_interface/program
+
+ The hibernate script can do this for you, and a default value for this
+ setting can be configured when compiling the kernel. This path is also
+ stored in the image header, so if you have an initrd or initramfs, you can
+ use the userui during the first part of resuming (prior to the atomic
+ restore) by putting the binary in the same path in your initrd/ramfs.
+ Alternatively, you can put it in a different location and do an echo
+ similar to the above prior to the echo > do_resume. The value saved in the
+ image header will then be ignored.
+
+4. Why not just use the version already in the kernel?
+
+ The version in the vanilla kernel has a number of drawbacks. The most
+ serious of these are:
+ - it has a maximum image size of 1/2 total memory;
+ - it doesn't allocate storage until after it has snapshotted memory.
+ This means that you can't be sure hibernating will work until you
+ see it start to write the image;
+ - it does not allow you to press escape to cancel a cycle;
+ - it does not allow you to press escape to cancel resuming;
+ - it does not allow you to automatically swapon a file when
+ starting a cycle;
+ - it does not allow you to use multiple swap partitions or files;
+ - it does not allow you to use ordinary files;
+ - it just invalidates an image and continues to boot if you
+ accidentally boot the wrong kernel after hibernating;
+ - it doesn't support any sort of nice display while hibernating;
+ - it is moving toward requiring that you have an initrd/initramfs
+ to ever have a hope of resuming (uswsusp). While uswsusp will
+ address some of the concerns above, it won't address all of them,
+ and will be more complicated to get set up;
+ - it doesn't have support for suspend-to-both (write a hibernation
+ image, then suspend to ram; I think this is known as ReadySafe
+ under M$).
+
+5. How do you use it?
+
+ A hibernation cycle can be started directly by doing:
+
+ echo > /sys/power/tuxonice/do_hibernate
+
+ In practice, though, you'll probably want to use the hibernate script
+ to unload modules, configure the kernel the way you like it and so on.
+ In that case, you'd do (as root):
+
+ hibernate
+
+ See the hibernate script's man page for more details on the options it
+ takes.
+
+ If you're using the text or splash user interface modules, one feature of
+ TuxOnIce that you might find useful is that you can press Escape at any time
+ during hibernating, and the process will be aborted.
+
+ Due to the way hibernation works, this means you'll have your system back and
+ perfectly usable almost instantly. The only exception is when it's at the
+ very end of writing the image. Then it will need to reload a small (usually
+ 4-50MB, depending upon the image characteristics) portion first.
+
+ Likewise, when resuming, you can press escape and resuming will be aborted.
+ The computer will then power down again according to settings at that time for
+ the powerdown method or rebooting.
+
+ You can change the settings for powering down while the image is being
+ written by pressing 'R' to toggle rebooting and 'O' to toggle between
+ suspending to ram and powering down completely.
+
+ If you run into problems with resuming, adding the "noresume" option to
+ the kernel command line will let you skip the resume step and recover your
+ system. This option shouldn't normally be needed, because TuxOnIce modifies
+ the image header prior to the atomic restore, and will thus prompt you
+ if it detects that you've tried to resume an image before (this flag is
+ removed if you press Escape to cancel a resume, so you won't be prompted
+ then).
+
+ Recent kernels (2.6.24 onwards) add support for resuming from a different
+ kernel to the one that was hibernated (thanks to Rafael for his work on
+ this - I've just embraced and enhanced the support for TuxOnIce). This
+ should further reduce the need for you to use the noresume option.
+
+6. What do all those entries in /sys/power/tuxonice do?
+
+ /sys/power/tuxonice is the directory which contains files you can use to
+ tune and configure TuxOnIce to your liking. The exact contents of
+ the directory will depend upon the version of TuxOnIce you're
+ running and the options you selected at compile time. In the following
+ descriptions, names in brackets refer to compile time options.
+ (Note that they're all dependent upon you having selected CONFIG_TUXONICE
+ in the first place!).
+
+ Since the values of these settings can open potential security risks, the
+ writeable ones are accessible only to the root user. You may want to
+ configure sudo to allow you to invoke your hibernate script as an ordinary
+ user.
+
+ - alloc/failure_test
+
+ This debugging option provides a way of testing TuxOnIce's handling of
+ memory allocation failures. Each allocation type that TuxOnIce makes has
+ been given a unique number (see the source code). Echo the appropriate
+ number into this entry, and when TuxOnIce attempts to do that allocation,
+ it will pretend there was a failure and act accordingly.
+
+ - alloc/find_max_mem_allocated
+
+ This debugging option will cause TuxOnIce to find the maximum amount of
+ memory it used during a cycle, and report that information in debugging
+ information at the end of the cycle.
+
+ - alt_resume_param
+
+ Instead of powering down after writing a hibernation image, TuxOnIce
+ supports resuming from a different image. This entry lets you set the
+ location of the signature for that image (the resume= value you'd use
+ for it). Using an alternate image and keep_image mode, you can do things
+ like using an alternate image to power down an uninterruptible power
+ supply.
+
+ - block_io/target_outstanding_io
+
+ This value controls the amount of memory that the block I/O code says it
+ needs when the core code is calculating how much memory is needed for
+ hibernating and for resuming. It doesn't directly control the amount of
+ I/O that is submitted at any one time - that depends on the amount of
+ available memory (we may have more available than we asked for), the
+ throughput that is being achieved and the ability of the CPU to keep up
+ with disk throughput (particularly where we're compressing pages).
+
+ - checksum/enabled
+
+ Use cryptoapi hashing routines to verify that Pageset2 pages don't change
+ while we're saving the first part of the image, and to get any pages that
+ do change resaved in the atomic copy. This should normally not be needed,
+ but if you're seeing issues, please enable this. If your issues stop you
+ being able to resume, enable this option, hibernate and cancel the cycle
+ after the atomic copy is done. If the debugging info shows a non-zero
+ number of pages resaved, please report this to Nigel.
+
+ - compression/algorithm
+
+ Set the cryptoapi algorithm used for compressing the image.
+
+ - compression/expected_compression
+
+ This value allows you to set an expected compression ratio, which TuxOnIce
+ will use in calculating whether it meets constraints on the image size. If
+ this expected compression ratio is not attained, the hibernation cycle will
+ abort, so it is wise to allow some spare. You can see what compression
+ ratio is achieved in the logs after hibernating.
+
+ - debug_info:
+
+ This file returns information about your configuration that may be helpful
+ in diagnosing problems with hibernating.
+
+ - did_suspend_to_both:
+
+ This file can be used when you hibernate with powerdown method 3 (ie suspend
+ to ram after writing the image). There can be two outcomes in this case. We
+ can resume from the suspend-to-ram before the battery runs out, or we can run
+ out of juice and end up resuming like normal. This entry lets you find out,
+ post resume, which way we went. If the value is 1, we resumed from suspend
+ to ram. This can be useful when actions need to be run post suspend-to-ram
+ that don't need to be run if we did the normal resume from power off.
+
+ - do_hibernate:
+
+ When anything is written to this file, the kernel side of TuxOnIce will
+ begin to attempt to write an image to disk and power down. You'll normally
+ want to run the hibernate script instead, to get modules unloaded first.
+
+ - do_resume:
+
+ When anything is written to this file TuxOnIce will attempt to read and
+ restore an image. If there is no image, it will return almost immediately.
+ If an image exists, the echo > will never return. Instead, the original
+ kernel context will be restored and the original echo > do_hibernate will
+ return.
+
+ - */enabled
+
+ These options can be used to temporarily disable various parts of TuxOnIce.
+
+ - extra_pages_allowance
+
+ When TuxOnIce does its atomic copy, it calls the driver model suspend
+ and resume methods. If you have DRI enabled with a driver such as fglrx,
+ this can result in the driver allocating a substantial amount of memory
+ for storing its state. Extra_pages_allowance tells TuxOnIce how much
+ extra memory it should ensure is available for those allocations. If
+ your attempts at hibernating end with a message in dmesg indicating that
+ insufficient extra pages were allowed, you need to increase this value.
+
+ - file/target:
+
+ Read this value to get the current setting. Write to it to point TuxOnIce
+ at a new storage location for the file allocator. See section 3.b.ii above
+ for details of how to set up the file allocator.
+
+ - freezer_test
+
+ This entry can be used to get TuxOnIce to just test the freezer and prepare
+ an image without actually doing a hibernation cycle. It is useful for
+ diagnosing freezing and image preparation issues.
+
+ - full_pageset2
+
+ TuxOnIce divides the pages that are stored in an image into two sets. The
+ difference between the two sets is that pages in pageset 1 are atomically
+ copied, and pages in pageset 2 are written to disk without being copied
+ first. A page CAN be written to disk without being copied first if and only
+ if its contents will not be modified or used at any time after userspace
+ processes are frozen. A page MUST be in pageset 1 if its contents are
+ modified or used at any time after userspace processes have been frozen.
+
+ Normally (ie if this option is enabled), TuxOnIce will put all pages on the
+ per-zone LRUs in pageset2, then remove those pages used by any userspace
+ user interface helper and TuxOnIce storage manager that are running,
+ together with pages used by the GEM memory manager introduced around 2.6.28
+ kernels.
+
+ If this option is disabled, a much more conservative approach will be taken.
+ The only pages in pageset2 will be those belonging to userspace processes,
+ with the exclusion of those belonging to the TuxOnIce userspace helpers
+ mentioned above. This will result in a much smaller pageset2, and will
+ therefore result in smaller images than are possible with this option
+ enabled.
+
+ - ignore_rootfs
+
+ TuxOnIce records which device is mounted as the root filesystem when
+ writing the hibernation image. It will normally check at resume time that
+ this device isn't already mounted - that would be a cause of filesystem
+ corruption. In some particular cases (RAM based root filesystems), you
+ might want to disable this check. This option allows you to do that.
+
+ - image_exists:
+
+ Can be used in a script to determine whether a valid image exists at the
+ location currently pointed to by resume=. Returns up to three lines.
+ The first is whether an image exists (-1 for unsure, otherwise 0 or 1).
+ If an image exists, additional lines will return the machine and version.
+ Echoing anything to this entry removes any current image.
+
+ - image_size_limit:
+
+ The maximum size of hibernation image written to disk, measured in megabytes
+ (1024*1024).
+
+ - last_result:
+
+ The result of the last hibernation cycle, as defined in
+ include/linux/suspend-debug.h with the values SUSPEND_ABORTED to
+ SUSPEND_KEPT_IMAGE. This is a bitmask.
+
+ - late_cpu_hotplug:
+
+ This sysfs entry controls whether cpu hotplugging is done - as normal - just