Extract content files from a Google Takeout archive, rename files with a consistent naming scheme and add missing metadata. Runs on Windows and MacOS.

andyjohnson0/TakeoutExtractor

Takeout Extractor

Extracts the contents of a Google Takeout archive - re-organising it, adding missing metadata, and applying a uniform file naming convention. Runs on Windows and MacOS. Requires .NET 7 or later.

GUI and command-line implementations are provided - see the TakeoutExtractor.Gui and TakeoutExtractor.Cli projects respectively. Releases include the GUI for Windows and MacOS, and the CLI for Windows only.

Takeout Extractor GUI screenshot

This software currently extracts only photo and video files. Support for other Takeout media types is planned.

  • Photos and Videos: Image files in a Google Takeout dataset have inconsistent naming. They also do not contain EXIF timestamps - although, confusingly, they do contain other metadata such as location information and camera settings. This software builds a uniformly named copy of the image and video files in a Takeout dataset and restores their EXIF timestamps.

Getting Started

  1. Build the solution in Visual Studio 2022

  2. The command-line extractor, tex.exe, will be found in TakeoutExtractorCli\bin\Release\net6.0\. Run tex /h for help.

Built With

  • Visual Studio 2022. The MAUI-based GUI currently requires VS2022 17.3.0 preview 2.0 or later.
  • .NET 7.0, with nullable reference type checking enabled
  • A fork of ExifLibNet v2.1.4 with additional fixes, available at https://github.com/andyjohnson0/exiflibrary. A pre-built dll is included in the ThirdParty directory.

Author

Andrew Johnson | github.com/andyjohnson0 | https://andyjohnson.uk

Licence

Except for third-party elements that are licensed separately, this project is licensed under the MIT License - see the LICENSE.md file for details.

The folder picker implementations used in the GUI project are based on code from MauiFolderPickerSample and https://blog.verslu.is/maui/folder-picker-with-dotnet-maui/ by Gerald Versluis. Licensed as Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) by the original author.

The shovel icon used by the GUI project is from svgrepo.com and is licensed as CC0 / Public Domain by svgrepo.

Future Enhancements, TODOs, and known bugs

These are logged as issues at https://github.com/andyjohnson0/TakeoutExtractor/issues/

Things to do are flagged with TODO: in the code.

Implementation Notes

Project Structure

  • TakeoutExtractor.Gui: GUI front-end using MAUI.

  • TakeoutExtractor.Cli: Command-line tex app that drives the extraction process.

  • TakeoutExtractor.Lib: Core library for the Takeout Extractor project. Exposes the TakeoutExtractor class which coordinates the extraction and reassembly of files from an unzipped Google Takeout archive.

  • TakeoutExtractor.Cli.Tests: Tests for the TakeoutExtractor.Cli project.

  • TakeoutExtractor.Lib.Tests: Tests for the TakeoutExtractor.Lib project.

Method

Images and Videos

  1. Iterate over all .json files containing image or video metadata.

  2. Extract the title element. This will be the original image file name, including extension.

  3. Truncate the name part of the title to a maximum of 47 characters. This gives the image file name in the archive.

  4. Search for an image with the same name but with a (possibly truncated) "-edited" suffix. If such a file exists then the image was edited, and the image file from the previous step is the original, un-edited version. If there is no edited file then only the original image exists.

  5. Extract timestamps from the json file and, for images only, update the image's embedded exif metadata. Google seems to preserve all(?) other exif fields that were populated at the time of image capture.

  6. Rename the file or files according to the timestamp and place into appropriate directories.
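
Steps 1 and 2 above can be sketched roughly as follows. This is an illustrative Python sketch, not the project's C# code. The function name `iter_titles` is hypothetical, and the only assumption about the manifest layout is the top-level `title` element mentioned in these notes.

```python
import json
from pathlib import Path
from typing import Iterator, Tuple


def iter_titles(root: Path) -> Iterator[Tuple[Path, str]]:
    """Yield (manifest_path, title) for each .json file under root that has a title."""
    for manifest in sorted(root.rglob("*.json")):
        try:
            data = json.loads(manifest.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            # Unreadable, badly encoded, or not valid JSON: skip the file.
            continue
        title = data.get("title") if isinstance(data, dict) else None
        if isinstance(title, str) and title:
            yield manifest, title
```

Manifests without a usable `title` are skipped here; a real extractor would probably want to report them instead.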

Some Resources

EXIF tag reference: https://exiftool.org/TagNames/EXIF.html

Some Notes on Takeout's Image File Naming

Google Takeout appears to provide access to the original captured form of an image, together with the last edited version of the image, if any. The images are linked together by a photo sidecar (json metadata) file. The easiest way to iterate over the images is to iterate over the metadata files.

Image file extensions can be ".jpeg", ".jpg", ".png", ".gif" and ".mp4". Sometimes an image may have a different extension in archives requested at different times. For example, it may be .jpeg in one archive and .jpg in an archive created some time later. I suspect that this may be caused by the introduction of an attempt to normalise file extensions.

The maximum length of file names, including the extension but excluding the dot/period, appears to be 50 characters. So a jpg file will have a name part with a maximum length of 47, and for a json file this will be 46. File names are truncated to fit these limits, preserving the extension.

The /title element in the metadata file gives the full file name of the original image. However, Google truncates the name part of the image file name, so a title of "a5025662-cb40-45dd-be98-684ee48aa226_IMG_20210818_122959697_HDR.jpg" would refer to an original image file named "a5025662-cb40-45dd-be98-684ee48aa226_IMG_202108.jpg".

If the title contains & or ? characters then these are substituted with _ characters in the file names of the corresponding images.
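
The truncation and substitution rules above could be combined along these lines. This is a hedged Python sketch, not the project's C# implementation: `archive_file_name` and its `max_total` parameter are hypothetical names, and the 50-character limit is only the apparent limit noted above.

```python
def archive_file_name(title: str, max_total: int = 50) -> str:
    """Guess the in-archive file name for a manifest title."""
    # Substitute the characters that the notes say are replaced with '_'.
    cleaned = title.replace("&", "_").replace("?", "_")
    name, dot, ext = cleaned.rpartition(".")
    if not dot:
        # No extension present: truncate the whole name to the limit.
        return cleaned[:max_total]
    # The limit counts the name part and the extension, but not the dot.
    max_name = max(max_total - len(ext), 0)
    return f"{name[:max_name]}.{ext}"
```

For the title quoted earlier, `archive_file_name("a5025662-cb40-45dd-be98-684ee48aa226_IMG_20210818_122959697_HDR.jpg")` gives "a5025662-cb40-45dd-be98-684ee48aa226_IMG_202108.jpg".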

If an edited version of the image exists then it will have the same name part as the original, but with a suffix. This suffix (which I suspect is generated by Google Photos, not Takeout) is usually "-edited". It can be truncated (e.g. to "-edit" or "-edi") where required by the 47-character name-part limit.

It is possible to have an "-edited" file with no original file - for example, if the original has been deleted. In this case the name part of the json file name will end with the edited suffix. E.g. IMG_20190329_083618347-edited.jpeg.json and IMG_20190329_083618347-edited.jpeg.
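
As a rough illustration of the suffix truncation described above, the expected name part of an edited file might be derived like this. This is a Python sketch under assumptions: `edited_name_part` is a hypothetical helper, and it assumes the suffix is simply cut off at the 47-character limit, which is inferred from the observed "-edit" and "-edi" cases and is not a documented rule.

```python
from typing import Optional

EDITED_SUFFIX = "-edited"


def edited_name_part(name_part: str, max_name: int = 47) -> Optional[str]:
    """Return the expected name part of the edited file, or None if no suffix fits."""
    original = name_part[:max_name]
    candidate = (name_part + EDITED_SUFFIX)[:max_name]
    # If truncation removed the whole suffix, the name alone cannot
    # distinguish the edited file from the original.
    return None if candidate == original else candidate
```

A `None` result corresponds to the case where the original name already fills the limit, so the edited file has to be identified some other way.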

File Name Uniqueness

To ensure that names are unique, Takeout appends a "uniqueness suffix" in the form of a bracketed integer (e.g. "(1)") to the end of the name part of the filename. Commonly this will be present in the name of the original file, because there is another original file that would otherwise have the same name. The uniqueness suffix will also be present in the json manifest filename, but in a different position. For example:

  • IMG_20180830_123540573.jpg(1).json
  • IMG_20180830_123540573(1).jpg
  • IMG_20180830_123540573-edited(1).jpg

Here the manifest filename is IMG_20180830_123540573.jpg(1).json, not IMG_20180830_123540573(1).jpg.json as would be expected.

If the original filename (excluding extension) is 47 characters or more in length then the json manifest will use the first 46 characters in its filename - because the extension is one character longer. If there is an edited file then it will have the same name as the original, but will have a "uniqueness suffix" appended to distinguish it from its own original.
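
The suffix placement shown in the example above can be sketched as a small name transformation. This is an illustrative Python sketch only: `manifest_name` is a hypothetical helper, it applies to original (not edited) image files, and it deliberately ignores the length truncation of manifest names.

```python
import re

# Name part, optional "(n)" uniqueness suffix, then the extension.
_IMAGE_NAME = re.compile(r"^(?P<stem>.+?)(?P<n>\(\d+\))?\.(?P<ext>[^.]+)$")


def manifest_name(image_file_name: str) -> str:
    """Build the json manifest name, moving any "(n)" suffix after the extension."""
    match = _IMAGE_NAME.match(image_file_name)
    if match is None:
        raise ValueError(f"not a recognisable image file name: {image_file_name!r}")
    suffix = match.group("n") or ""
    return f"{match.group('stem')}.{match.group('ext')}{suffix}.json"
```

For example, `manifest_name("IMG_20180830_123540573(1).jpg")` gives "IMG_20180830_123540573.jpg(1).json", matching the pair listed above.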

EXIF Metadata

EXIF image metadata is often - but not always - present in the images. This includes timestamps, description (if the user has provided one), and geolocation data. The data is included in the json manifest. It appears that the original EXIF metadata is preserved, but if it is edited in the Google Photos website then the edits are only reflected in the json manifest.
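
When restoring timestamps, note that EXIF date/time tags such as DateTimeOriginal store text in the form "YYYY:MM:DD HH:MM:SS". A minimal Python sketch of that formatting follows. It assumes the manifest timestamp is available as Unix epoch seconds and that UTC is wanted. Neither is stated in these notes, and `exif_datetime` is a hypothetical helper, not part of the project.

```python
from datetime import datetime, timezone


def exif_datetime(epoch_seconds: int) -> str:
    """Format a Unix timestamp as an EXIF date/time string (assumes UTC)."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    # EXIF uses colons in the date part as well as in the time part.
    return moment.strftime("%Y:%m:%d %H:%M:%S")
```

Writing the resulting string into the image would still need an EXIF library, which this sketch does not cover.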
