<a href="https://colab.research.google.com/github/e3la/i2dc/blob/main/instagram2digitalcommons.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Welcome to instagram2digitalcommons!

This Colab notebook is designed to help you transform an Instagram archive (the `.zip` file you download from Instagram) into structured packages that are easier to ingest into a Digital Commons (or similar) institutional repository.

This tool was built by vibecoding. The workflow varied, but the most successful approach was to download the `.ipynb` file from Colab, upload it to Gemini in AI Studio, ask it to add the next cell, tweak the result until it worked, and then repeat for each piece of the code.

**What you need to do:**

1.  **Have a Google account:**
    *   To run this you'll need to log in to a Google account, or know enough magic to run this pile of Python somewhere else.
2.  **Provide your Instagram Archive:**
    *   The first code cell will ask you how you want to provide your Instagram `.zip` file. You can either:
        *   Upload it directly to this Colab session (suitable for smaller archives).
        *   Place it in a specific folder (`MyDrive/i2dc/`) on your Google Drive and let the script find it (recommended for larger archives or repeated use).
    *   Make sure you have your Instagram archive `.zip` file ready. You can request it from Instagram by going to your Profile -> Your Activity -> Download Your Information. **Crucially, request the data in JSON format and select High media quality.**

3.  **Run the Cells Sequentially:**
    *   Execute each code cell in this notebook from top to bottom by clicking the "play" button next to each cell or by using "Runtime" > "Run all".
    *   The notebook will:
        *   Extract the contents of your `.zip` file.
        *   Scan the archive and provide a summary of its contents.
        *   Attempt to fix common text encoding issues (mojibake) in captions and titles using `ftfy`.
        *   Process **Reels**, **Stories**, and **Posts** separately.
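
To make the mojibake step above concrete, here is a minimal standard-library sketch of the most common failure mode: UTF-8 bytes that were decoded as Latin-1. The notebook itself relies on `ftfy.fix_text`, which handles many more cases; the helper name `repair_latin1_mojibake` is hypothetical and only illustrates the idea.

```python
def repair_latin1_mojibake(text: str) -> str:
    """Undo the 'UTF-8 bytes read as Latin-1' mix-up.

    If the round trip fails, the text was probably not garbled this way,
    so it is returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Simulate the damage: encode correctly, then decode with the wrong codec.
garbled = "café".encode("utf-8").decode("latin-1")
repaired = repair_latin1_mojibake(garbled)
print(repr(garbled), "->", repr(repaired))
```

Text that is already correct, or that contains characters outside Latin-1 such as emoji, passes through untouched in this sketch.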

**What the notebook will produce:**

For each content type (Reels, Stories, Posts), the notebook will:

1.  **Create a dedicated export folder** within the Colab environment (e.g., `/content/extracted_data/reels_export_dc_format/`).
2.  **Copy and rename relevant media files** (videos, images) into this folder. For Reels, it will also include `.srt` subtitle files if they exist in your archive. Filenames are structured for clarity (e.g., `instagram_yourhandle_reel_YYYY-MM-DD_item-number_original-name.mp4`).
3.  **Generate an Excel metadata file** (e.g., `reels_metadata_dc_format.xlsx`). This file is formatted with columns typically used for Digital Commons batch uploads (`title`, `fulltext_url`, `keywords`, `abstract`, etc.).
    *   For **Posts** containing multiple media items (carousels), each media item will get its own row in the Excel sheet, sharing common post metadata but having a unique `fulltext_url`.
    *   For **Stories** posted at the same time (grouped by timestamp), they will share common metadata like `title` and `abstract`, but each media item will have its own row and unique `fulltext_url`.
4.  **Create a `README.txt` file** in the export folder, explaining the contents of that specific package, how metadata was generated, and a summary of processed/skipped items.
5.  **Package these items into a single `.zip` file** (e.g., `reels_package_dc_format.zip`) within its export folder.
6.  **Copy this package to `/content/batchup/`** and rename it (e.g., `reels.zip`, `stories.zip`, `posts.zip`).
7.  **Offer this final `.zip` file for download** directly to your computer after each respective section (Reels, Stories, Posts) is processed.
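
The filename pattern from step 2 can be sketched roughly as follows. The function name `build_export_filename` and the sample URI are illustrative assumptions, not part of the notebook. This sketch formats the date in UTC so the result is reproducible, whereas the notebook calls `datetime.fromtimestamp` without a timezone and so uses the runtime's local time.

```python
import os
from datetime import datetime, timezone


def build_export_filename(handle, kind, timestamp, item_number, original_uri):
    """Build 'instagram_<handle>_<kind>_<date>_<n>_<original-name><ext>'."""
    date_str = datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y-%m-%d")
    base, ext = os.path.splitext(os.path.basename(original_uri))
    # Replace anything that is not a letter or digit with an underscore.
    safe_base = "".join(c if c.isalnum() else "_" for c in base)
    return f"instagram_{handle}_{kind}_{date_str}_{item_number}_{safe_base}{ext}"


example_name = build_export_filename(
    handle="yourhandle",
    kind="reel",
    timestamp=1700000000,  # 2023-11-14 in UTC
    item_number=1,
    original_uri="media/reels/202311/clip-01.mp4",  # hypothetical archive path
)
print(example_name)
```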

You will end up with three main downloadable `.zip` files: `reels.zip`, `stories.zip`, and `posts.zip`, each ready for further review and potential batch upload to Digital Commons.
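
As a rough illustration of the one-row-per-media-item layout described above for carousels and grouped Stories, the sketch below builds plain dictionaries rather than a real spreadsheet. `rows_for_media_group` is a hypothetical helper, and using the caption as the `abstract` is an assumption made for the example; the hashtag regex mirrors the notebook's `extract_hashtags`.

```python
import re


def extract_hashtags(text):
    """Collect '#tag' words from a caption and join them with commas."""
    if not isinstance(text, str):
        return ""
    return ", ".join(re.findall(r"#(\w+)", text))


def rows_for_media_group(title, caption, export_filenames):
    """Return one metadata row per media file, sharing the group's metadata."""
    shared = {
        "title": title,
        "keywords": extract_hashtags(caption),
        "abstract": caption,  # assumption: caption reused as the abstract
    }
    # Each row differs only in its fulltext_url.
    return [{**shared, "fulltext_url": name} for name in export_filenames]


rows = rows_for_media_group(
    title="Example carousel post",
    caption="New exhibit opens today #library #archives",
    export_filenames=["post_1_a.jpg", "post_1_b.jpg"],
)
for row in rows:
    print(row)
```

A list of dictionaries like this can be passed to `pandas.DataFrame` and written out with `to_excel` if a spreadsheet is needed.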

**Important Notes:**
*   This notebook processes media files that are *locally available* within your downloaded Instagram archive. It does **not** download content from web links (e.g., some older story links might point to Instagram's servers). The script attempts to skip these web links.
*   The quality of the output depends on the completeness and structure of your Instagram archive.
*   Always review the generated metadata and files before uploading to your repository.
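
The first note, about skipping web links, could be checked along these lines. This is a sketch under the assumption that web links can be recognised by an `http://` or `https://` prefix; `is_local_media` is a hypothetical helper, not the notebook's actual check.

```python
import os
import tempfile


def is_local_media(uri, archive_root):
    """Return True only for URIs that resolve to a file inside the archive."""
    if not isinstance(uri, str) or not uri:
        return False
    # Web links point at remote servers, not at files in the archive.
    if uri.lower().startswith(("http://", "https://")):
        return False
    return os.path.isfile(os.path.join(archive_root, uri))


# Small demonstration against a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "media", "stories"))
    local_uri = os.path.join("media", "stories", "a.jpg")
    with open(os.path.join(root, local_uri), "wb") as fh:
        fh.write(b"\xff\xd8")
    results = {
        "local": is_local_media(local_uri, root),
        "web": is_local_media("https://example.com/story.jpg", root),
        "missing": is_local_media("media/stories/missing.jpg", root),
    }
print(results)
```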

Let's get started! Run the first code cell below (the one that asks how you want to provide the ZIP file).

In [None]:
import os
import shutil # For file operations like deleting directories
from google.colab import files, drive
import zipfile

zip_filepath = None # This will store the path to the zip file
uploaded_filename_original = None # To store the original name from upload

# --- Helper function to clean up previous uploads from /content/ if method 1 is chosen ---
def cleanup_content_directory(filename_to_keep=None):
    """Removes all files and folders from /content/ except specified ones."""
    print("Cleaning up /content/ directory...")
    items_to_preserve = ["drive", "sample_data"] # Default Colab folders
    if filename_to_keep:
        items_to_preserve.append(os.path.basename(filename_to_keep))

    for item in os.listdir("/content/"):
        if item in items_to_preserve:
            continue
        item_path = os.path.join("/content/", item)
        try:
            if os.path.isfile(item_path) or os.path.islink(item_path):
                os.unlink(item_path)
                # print(f"Removed file: {item_path}")
            elif os.path.isdir(item_path):
                shutil.rmtree(item_path)
                # print(f"Removed directory: {item_path}")
        except Exception as e:
            print(f"Failed to delete {item_path}. Reason: {e}")
    print("Cleanup of /content/ complete.")

# --- Ask the user how they want to provide the file ---
while True:
    print("-" * 50)
    method = input(
        "How do you want to provide the ZIP file?\n"
        "1. Upload directly to Colab (for smaller files, keeps original name).\n"
        "2. Use the ZIP file from Google Drive (searches 'MyDrive/i2dc/' for a unique .zip file).\n"
        "Enter choice (1 or 2): "
    ).strip()
    if method in ['1', '2']:
        break
    else:
        print("Invalid choice. Please enter 1 or 2.")
print("-" * 50)

# --- Option 1: Direct Upload ---
if method == '1':
    print("Selected: Upload directly to Colab.")
    print("Please wait for the upload dialog and select your ZIP file...")

    # Clean up /content/ before new upload
    cleanup_content_directory()

    try:
        uploaded = files.upload()
        if not uploaded:
            print("No file was uploaded. Exiting.")
        else:
            # Get the uploaded file name (key in the 'uploaded' dict)
            uploaded_filename_original = list(uploaded.keys())[0]

            if not uploaded_filename_original.lower().endswith('.zip'):
                print(f"Error: The uploaded file '{uploaded_filename_original}' is not a ZIP file.")
                # Clean up the wrongly uploaded file
                wrong_file_path = os.path.join("/content/", uploaded_filename_original)
                if os.path.exists(wrong_file_path):
                    os.remove(wrong_file_path)
                uploaded_filename_original = None # Reset as it's not a valid zip
            else:
                # The file is uploaded directly to /content/
                zip_filepath = os.path.join("/content/", uploaded_filename_original)
                print(f"Successfully uploaded: '{uploaded_filename_original}'")
                print(f"File path in Colab: {zip_filepath}")

    except Exception as e:
        print(f"An error occurred during upload: {e}")
        print("If the file is too large, please try the Google Drive option next time.")

# --- Option 2: Google Drive ---
elif method == '2':
    print("Selected: Use ZIP file from Google Drive.")
    print("Attempting to mount Google Drive...")
    try:
        drive.mount('/content/drive', force_remount=True)
        print("Google Drive mounted successfully at /content/drive")

        gdrive_my_drive_path = "/content/drive/MyDrive/"
        target_folder_name = "i2dc"
        target_folder_path_in_drive = os.path.join(gdrive_my_drive_path, target_folder_name)

        print(f"Searching for a unique .zip file in: '{target_folder_path_in_drive}'")

        if not os.path.isdir(target_folder_path_in_drive):
            print(f"Error: The folder '{target_folder_path_in_drive}' ('{target_folder_name}' in your MyDrive) does not exist.")
            print(f"Please ensure you have a folder named '{target_folder_name}' directly under 'My Drive' containing your .zip file.")
        else:
            zip_files_found = []
            for item_name in os.listdir(target_folder_path_in_drive):
                item_full_path = os.path.join(target_folder_path_in_drive, item_name)
                if os.path.isfile(item_full_path) and item_name.lower().endswith('.zip'):
                    zip_files_found.append(item_full_path)

            if len(zip_files_found) == 0:
                print(f"No .zip files found in '{target_folder_path_in_drive}'.")
            elif len(zip_files_found) == 1:
                zip_filepath = zip_files_found[0]
                uploaded_filename_original = os.path.basename(zip_filepath) # Get original name from path
                print(f"Found ZIP file: '{uploaded_filename_original}'")
                print(f"File path in Colab (via Drive): {zip_filepath}")
            else:
                print(f"Error: Multiple .zip files found in '{target_folder_path_in_drive}':")
                for f_path in zip_files_found:
                    print(f" - {os.path.basename(f_path)}")
                print("Please ensure there is only one .zip file in that folder for this option to work automatically.")

    except Exception as e:
        print(f"An error occurred while mounting or accessing Google Drive: {e}")

print("-" * 50)

# --- Proceed with the zip file if one was successfully identified ---
if zip_filepath and os.path.exists(zip_filepath):
    print(f"\nProceeding with ZIP file: '{uploaded_filename_original}'")
    print(f"Located at: {zip_filepath}")

    # --- Your next steps using zip_filepath ---
    extract_to_folder = "/content/extracted_data" # Define your extraction path

    # Clean up previous extraction if it exists
    if os.path.exists(extract_to_folder):
        print(f"Cleaning up previous extraction at '{extract_to_folder}'...")
        shutil.rmtree(extract_to_folder)
    os.makedirs(extract_to_folder, exist_ok=True)
    print(f"Extraction target directory: '{extract_to_folder}'")

    try:
        print(f"Attempting to extract '{uploaded_filename_original}'...")
        with zipfile.ZipFile(zip_filepath, 'r') as zip_ref:
            zip_ref.extractall(extract_to_folder)
        print(f"Successfully extracted '{uploaded_filename_original}' to '{extract_to_folder}'")

        print("\nContents of the extracted folder:")
        extracted_items = os.listdir(extract_to_folder)
        if not extracted_items:
            print("(The extracted folder is empty)")
        else:
            for item in extracted_items:
                print(f"- {item}")

    except zipfile.BadZipFile:
        print(f"Error: The file '{uploaded_filename_original}' is not a valid ZIP file or is corrupted.")
    except Exception as e:
        print(f"An error occurred during extraction: {e}")

elif method == '1' and not zip_filepath and uploaded_filename_original is None and 'uploaded' in locals() and not uploaded:
    # This case is already handled by "No file was uploaded."
    pass
elif method == '1' and not zip_filepath and uploaded_filename_original is None:
    # This case is for "uploaded file was not a ZIP"
    print("\nCannot proceed as the uploaded file was not a valid ZIP file.")
elif not zip_filepath:
    print("\nNo ZIP file was successfully specified or found. Cannot proceed with further operations.")
elif not os.path.exists(zip_filepath): # Should be rare if logic above is correct
    print(f"\nError: The determined ZIP file path '{zip_filepath}' does not seem to exist. This is unexpected. Cannot proceed.")

print("-" * 50)
print("Script finished.")

--------------------------------------------------
How do you want to provide the ZIP file?
1. Upload directly to Colab (for smaller files, keeps original name).
2. Use the ZIP file from Google Drive (searches 'MyDrive/i2dc/' for a unique .zip file).
Enter choice (1 or 2): 2
--------------------------------------------------
Selected: Use ZIP file from Google Drive.
Attempting to mount Google Drive...
Mounted at /content/drive
Google Drive mounted successfully at /content/drive
Searching for a unique .zip file in: '/content/drive/MyDrive/i2dc'
Found ZIP file: 'instagram-umsllibraries-2025-03-07-Ardjbhx1.zip'
File path in Colab (via Drive): /content/drive/MyDrive/i2dc/instagram-umsllibraries-2025-03-07-Ardjbhx1.zip
--------------------------------------------------

Proceeding with ZIP file: 'instagram-umsllibraries-2025-03-07-Ardjbhx1.zip'
Located at: /content/drive/MyDrive/i2dc/instagram-umsllibraries-2025-03-07-Ardjbhx1.zip
Cleaning up previous extraction at '/content/extracted_data'
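
The Google Drive option only proceeds when the folder holds exactly one `.zip` file. That rule can be isolated into a small function, sketched here with a hypothetical name (`find_single_zip`) and exercised against a temporary directory instead of a mounted Drive.

```python
import os
import tempfile


def find_single_zip(folder):
    """Return the path of the only .zip file in `folder`, else raise ValueError."""
    candidates = sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if name.lower().endswith(".zip") and os.path.isfile(os.path.join(folder, name))
    )
    if len(candidates) != 1:
        raise ValueError(
            f"Expected exactly one .zip in {folder!r}, found {len(candidates)}"
        )
    return candidates[0]


with tempfile.TemporaryDirectory() as folder:
    open(os.path.join(folder, "archive.ZIP"), "wb").close()
    open(os.path.join(folder, "notes.txt"), "wb").close()
    found_name = os.path.basename(find_single_zip(folder))

    # A second archive makes the choice ambiguous, so the helper refuses.
    open(os.path.join(folder, "second.zip"), "wb").close()
    try:
        find_single_zip(folder)
        ambiguous_rejected = False
    except ValueError:
        ambiguous_rejected = True

print(found_name, ambiguous_rejected)
```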

In [None]:
# 🔍 Inspect the contents of the extracted Instagram archive and count files by type
import os

MEDIA_DIR = "/content/extracted_data"

print(f"\n🔎 Scanning '{MEDIA_DIR}' for Instagram data...")

file_count = 0
image_count = 0
video_count = 0

# Use `filenames` (not `files`) so the loop does not shadow `files` from google.colab
for root, dirs, filenames in os.walk(MEDIA_DIR):
    for name in filenames:
        file_count += 1
        if name.lower().endswith(('.png', '.jpg', '.jpeg', '.gif', '.bmp', '.webp')):
            image_count += 1
        elif name.lower().endswith('.mp4'):
            video_count += 1

print("\n📊 Archive Summary")
print("-----------------------------------------")
print(f"Total files: {file_count}")
print(f"Image files: {image_count}")
print(f"MP4 video files: {video_count}")
print("You can also browse these files using the 'Files' panel on the left sidebar.")
print("-----------------------------------------")

# 📦 Count Instagram media types from extracted JSON data
import json

# Base directory: the folder the archive was extracted into
base_dir = MEDIA_DIR

# Default activity folder name
activity_dir = 'your_instagram_activity'
print(f"📁 Using default Instagram activity folder: '{activity_dir}'")

media_json_path = os.path.join(base_dir, activity_dir, 'media')
print(f"\n📂 Searching media JSON files in: {media_json_path}")
# Define target files
json_files = {
    "posts": "posts_1.json",
    "reels": "reels.json",
    "stories": "stories.json"
}

counts = {}

for media_type, filename in json_files.items():
    file_path = os.path.join(media_json_path, filename)
    print(f"\n🔍 {media_type.capitalize()}: {filename}")

    try:
        if not os.path.exists(file_path):
            print("   ⚠️ File not found.")
            counts[media_type] = 0
            continue

        with open(file_path, 'r', encoding='utf-8') as f:
            data = json.load(f)

        if isinstance(data, list):  # posts_1.json
            counts[media_type] = len(data)
            print(f"   ✅ Found {counts[media_type]} entries (flat list).")

        elif isinstance(data, dict):
            top_key = {
                "reels": "ig_reels_media",
                "stories": "ig_stories"
            }.get(media_type)

            if top_key not in data:
                print(f"   ⚠️ Key '{top_key}' not found.")
                counts[media_type] = 0
                continue

            flat_list = []
            for group in data[top_key]:
                if isinstance(group, dict) and "media" in group:
                    flat_list.extend(group["media"])
                else:
                    flat_list.append(group)

            counts[media_type] = len(flat_list)
            print(f"   ✅ Found {counts[media_type]} entries (flattened).")

        else:
            print("   ❌ Unrecognized JSON structure.")
            counts[media_type] = 0

    except json.JSONDecodeError:
        print("   ❌ Error decoding JSON.")
        counts[media_type] = 0
    except Exception as e:
        print(f"   ❌ Unexpected error: {e}")
        counts[media_type] = 0

# Extract the account username from personal_information.json.
# Expected shape (as checked below):
#   {"profile_user": [{"string_map_data": {"Username": {"value": "<handle>"}}}]}

# Path to the JSON file
json_file_path = "/content/extracted_data/personal_information/personal_information/personal_information.json"

# Load the JSON data
try:
    with open(json_file_path, 'r', encoding='utf-8') as f:
        data = json.load(f)

    # Navigate through the structure to find the Username value
    username_value = None
    if isinstance(data, dict) and "profile_user" in data and isinstance(data["profile_user"], list) and len(data["profile_user"]) > 0:
        # Assuming the relevant data is in the first element of the profile_user list
        profile_info = data["profile_user"][0]
        if isinstance(profile_info, dict) and "string_map_data" in profile_info and isinstance(profile_info["string_map_data"], dict):
            string_data = profile_info["string_map_data"]
            if "Username" in string_data and isinstance(string_data["Username"], dict) and "value" in string_data["Username"]:
                username_value = string_data["Username"]["value"]

    # Print the extracted username
    if username_value is not None:
        print(f"\nExtracted Username: {username_value}")
    else:
        print("Username not found in the expected structure.")

except FileNotFoundError:
    print(f"Error: The file was not found at {json_file_path}")
except json.JSONDecodeError:
    print(f"Error: Could not decode JSON from {json_file_path}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")



🔎 Scanning '/content/extracted_data' for Instagram data...

📊 Archive Summary
-----------------------------------------
Total files: 1445
Image files: 1229
MP4 video files: 103
You can also browse these files using the 'Files' panel on the left sidebar.
-----------------------------------------
📁 Using default Instagram activity folder: 'your_instagram_activity'

📂 Searching media JSON files in: /content/extracted_data/your_instagram_activity/media

🔍 Posts: posts_1.json
   ✅ Found 646 entries (flat list).

🔍 Reels: reels.json
   ✅ Found 25 entries (flattened).

🔍 Stories: stories.json
   ✅ Found 485 entries (flattened).

Extracted Username: umsllibraries
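
The "flattened" counts reported above come from unpacking entries that may either be media items themselves or groups holding a `media` list. A minimal sketch of that flattening as a reusable function follows; the name `flatten_media_entries` and the sample data are made up for illustration.

```python
def flatten_media_entries(data, top_key):
    """Flatten a reels/stories-style JSON dict into a flat list of media items.

    Entries under `top_key` that carry a "media" list are expanded;
    any other entry is kept as a single item.
    """
    if not isinstance(data, dict) or top_key not in data:
        return []
    flat = []
    for group in data[top_key]:
        if isinstance(group, dict) and "media" in group:
            flat.extend(group["media"])
        else:
            flat.append(group)
    return flat


sample = {
    "ig_reels_media": [
        {"media": [{"uri": "a.mp4"}, {"uri": "b.mp4"}]},  # a group of two
        {"uri": "c.mp4"},                                  # a direct item
    ]
}
flat_items = flatten_media_entries(sample, "ig_reels_media")
print(len(flat_items), [item["uri"] for item in flat_items])
```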


In [None]:
# CELL TO RUN BEFORE CONTENT PROCESSING CELLS
# This cell will iterate through all .json files in the media directory,
# back them up, and attempt to fix mojibake in known text fields using ftfy.

# 1. Install ftfy
!pip install ftfy -q

import os
import json
import shutil # For copying files
import ftfy

print("--- Starting JSON fixing process for all media files ---")

# Define the path to the media directory
MEDIA_DIR_BASE = "/content/extracted_data" # Make sure this matches your setup
TARGET_MEDIA_DIR = os.path.join(MEDIA_DIR_BASE, 'your_instagram_activity', 'media')

if not os.path.isdir(TARGET_MEDIA_DIR):
    print(f"❌ ERROR: Media directory not found at: {TARGET_MEDIA_DIR}")
    print("Please ensure the extraction in the first cell was successful and the path is correct.")
else:
    print(f"✅ Found media directory: {TARGET_MEDIA_DIR}")

    total_files_processed = 0
    total_files_changed_by_ftfy = 0
    total_titles_fixed_overall = 0

    # Helper function to fix text fields in common Instagram JSON structures
    def fix_text_fields_in_data(loaded_data, filename_basename):
        """
        Modifies loaded_data in-place by applying ftfy.fix_text to known text fields.
        Returns: (bool: data_was_changed, int: titles_fixed_count)
        """
        data_was_changed_locally = False
        titles_fixed_count_locally = 0

        def _apply_fix(obj, key):
            nonlocal data_was_changed_locally, titles_fixed_count_locally
            original_text = obj.get(key)
            if isinstance(original_text, str) and original_text: # Only fix non-empty strings
                fixed_text = ftfy.fix_text(original_text)
                if fixed_text != original_text:
                    obj[key] = fixed_text
                    data_was_changed_locally = True
                    titles_fixed_count_locally += 1

        if filename_basename == "reels.json" or filename_basename == "stories.json":
            top_level_key = "ig_reels_media" if filename_basename == "reels.json" else "ig_stories"
            if isinstance(loaded_data, dict) and top_level_key in loaded_data:
                item_list = loaded_data.get(top_level_key, [])
                for item_dict in item_list: # Can be group or direct media item
                    if isinstance(item_dict, dict):
                        if "media" in item_dict and isinstance(item_dict["media"], list): # It's a group
                            for media_sub_item_dict in item_dict["media"]:
                                if isinstance(media_sub_item_dict, dict):
                                    _apply_fix(media_sub_item_dict, "title")
                        else: # It's a direct media item in the list
                            _apply_fix(item_dict, "title")
            else:
                print(f"    ⚠️ Structure of {filename_basename} not as expected (missing '{top_level_key}' or not a dict). Skipping detailed fix.")


        elif filename_basename == "posts_1.json":
            if isinstance(loaded_data, list): # posts_1.json is a list of post objects
                for post_item_dict in loaded_data:
                    if isinstance(post_item_dict, dict):
                        _apply_fix(post_item_dict, "title") # Overall post caption/title

                        media_list_in_post = post_item_dict.get("media", [])
                        if isinstance(media_list_in_post, list):
                            for media_item_dict in media_list_in_post:
                                if isinstance(media_item_dict, dict):
                                    _apply_fix(media_item_dict, "title") # Caption for individual media
            else:
                print(f"    ⚠️ Structure of {filename_basename} not as expected (not a list). Skipping detailed fix.")

        # Add handlers for other specific JSON files here if needed
        # else:
        #     print(f"    ℹ️ No specific ftfy fixing logic defined for {filename_basename}. Text fields might not be fixed.")

        return data_was_changed_locally, titles_fixed_count_locally

    # Iterate through all files in the target directory
    for filename in os.listdir(TARGET_MEDIA_DIR):
        if filename.lower().endswith('.json'):
            total_files_processed += 1
            current_json_path = os.path.join(TARGET_MEDIA_DIR, filename)
            print(f"\n--- Processing: {filename} ---")

            # 2. Backup original file
            backup_json_path = os.path.join(TARGET_MEDIA_DIR, f"{os.path.splitext(filename)[0]}-original.json")
            if not os.path.exists(backup_json_path):
                try:
                    shutil.copy2(current_json_path, backup_json_path)
                    print(f"  💾 Original backed up to: {os.path.basename(backup_json_path)}")
                except Exception as e:
                    print(f"  ⚠️ WARNING: Could not create backup of {filename}: {e}")
            else:
                print(f"  ℹ️ Backup '{os.path.basename(backup_json_path)}' already exists. Not overwriting.")

            # 3. Load, Fix, and Save the current JSON file
            try:
                print(f"  Loading {filename} for fixing...")
                with open(current_json_path, 'r', encoding='utf-8') as f:
                    json_content = json.load(f)
                print(f"    {filename} loaded successfully.")

                # Apply ftfy fixes based on file type
                data_actually_changed_in_file, titles_fixed_in_file = fix_text_fields_in_data(json_content, filename)

                if titles_fixed_in_file > 0:
                    print(f"    🔧 Titles/text fields fixed by ftfy in {filename}: {titles_fixed_in_file}")
                    total_titles_fixed_overall += titles_fixed_in_file

                if data_actually_changed_in_file:
                    total_files_changed_by_ftfy +=1
                    print(f"  Saving modified {filename} back to disk...")
                    try:
                        with open(current_json_path, 'w', encoding='utf-8') as f:
                            json.dump(json_content, f, ensure_ascii=False, indent=2)
                        print(f"    ✅ Successfully saved fixed {filename}.")
                    except Exception as e:
                        print(f"    ❌ ERROR saving modified {filename}: {e}")
                elif titles_fixed_in_file > 0 and not data_actually_changed_in_file : # Should not happen if titles_fixed_in_file > 0
                    print(f"    ℹ️ ftfy processed text in {filename} but resulted in no net change. File not rewritten.")
                else: # No titles fixed or no changes made
                    print(f"    ℹ️ No changes made by ftfy to {filename}. File not rewritten.")

            except json.JSONDecodeError:
                print(f"  ❌ ERROR: Could not decode JSON from {filename}. The file might be corrupted. Skipping this file.")
            except FileNotFoundError:
                print(f"  ❌ ERROR: File {filename} not found during processing loop (should not happen if listed). Skipping.")
            except Exception as e:
                print(f"  ❌ An unexpected error occurred while processing {filename}: {e}")

    print("\n--- Summary of JSON fixing process ---")
    print(f"Total .json files found and processed: {total_files_processed}")
    print(f"Total files modified by ftfy and resaved: {total_files_changed_by_ftfy}")
    print(f"Total text fields fixed across all files: {total_titles_fixed_overall}")
    print("--- Finished JSON fixing process for all media files ---")

--- Starting JSON fixing process for all media files ---
✅ Found media directory: /content/extracted_data/your_instagram_activity/media

--- Processing: posts_1.json ---
  💾 Original backed up to: posts_1-original.json
  L_oading posts_1.json for fixing...
    posts_1.json loaded successfully.
    🔧 Titles/text fields fixed by ftfy in posts_1.json: 225
  S_aving modified posts_1.json back to disk...
    ✅ Successfully saved fixed posts_1.json.

--- Processing: stories.json ---
  💾 Original backed up to: stories-original.json
  L_oading stories.json for fixing...
    stories.json loaded successfully.
    🔧 Titles/text fields fixed by ftfy in stories.json: 23
  S_aving modified stories.json back to disk...
    ✅ Successfully saved fixed stories.json.

--- Processing: reels.json ---
  💾 Original backed up to: reels-original.json
  L_oading reels.json for fixing...
    reels.json loaded successfully.
    🔧 Titles/text fields fixed by ftfy in reels.json: 14
  S_aving modified reels.json bac

In [None]:
import os
import shutil
from google.colab import files, drive
import zipfile
import json
from datetime import datetime
import pandas as pd
from shutil import copy2
import re # For hashtag extraction

# Ensure MEDIA_DIR is defined, as it's used in subsequent cells
MEDIA_DIR = "/content/extracted_data" # This should be consistent with the extraction path

media_type = 'reels' # We are focusing on reels
output_dir = os.path.join(MEDIA_DIR, f'{media_type}_export_dc_format') # New output dir name
os.makedirs(output_dir, exist_ok=True)

# Excel and ZIP file paths
excel_filename = f'{media_type}_metadata_dc_format.xlsx'
excel_path = os.path.join(output_dir, excel_filename)
readme_path = os.path.join(output_dir, 'README.txt')
zip_path = os.path.join(output_dir, f'{media_type}_package_dc_format.zip') # New zip name

# Attempt to get username_value from global scope (set in a previous cell)
instagram_handle = globals().get('username_value', "unknown_user")
print(f"Using Instagram handle: {instagram_handle}")

# --- Helper function to extract hashtags ---
def extract_hashtags(text):
    if not isinstance(text, str):
        return ""
    hashtags = re.findall(r"#(\w+)", text)
    return ", ".join(hashtags)

# --- Load Reels JSON Data ---
instagram_media_data = {}
try:
    reels_json_path = os.path.join(MEDIA_DIR, 'your_instagram_activity', 'media', 'reels.json')
    if os.path.exists(reels_json_path):
        with open(reels_json_path, 'r', encoding='utf-8') as f:
            reels_full_data = json.load(f)
            if isinstance(reels_full_data, dict) and "ig_reels_media" in reels_full_data:
                flat_list = []
                for group in reels_full_data["ig_reels_media"]:
                    if isinstance(group, dict) and "media" in group:
                        flat_list.extend(group["media"])
                    else:
                        flat_list.append(group)
                instagram_media_data['reels'] = flat_list # Assign to 'reels' key
            else:
                print("Warning: Reels JSON structure unexpected or 'ig_reels_media' key missing.")
                instagram_media_data['reels'] = []
    else:
        print(f"Reels JSON file not found at: {reels_json_path}")
        instagram_media_data['reels'] = []
except Exception as e:
    print(f"Error loading Reels JSON for export: {e}")
    instagram_media_data['reels'] = []

# --- Process Reels Data ---
reels_data_to_process = instagram_media_data.get('reels') # Get data from 'reels' key

if not reels_data_to_process:
    print("No Reels data loaded or found for export.")
else:
    print(f"Preparing to export metadata for {len(reels_data_to_process)} Reels in Digital Commons format...")

    excel_data_rows = []

    # Define the new column order for the Excel sheet
    column_names = [
        'title', 'fulltext_url', 'additional_files', 'keywords', 'abstract',
        'author1_fname', 'author1_mname', 'author1_lname', 'author1_suffix',
        'author1_email', 'author1_institution', 'author1_is_corporate',
        'author2_fname', 'author2_mname', 'author2_lname', 'author2_suffix',
        'author2_email', 'author2_institution', 'author2_is_corporate',
        'author3_fname', 'author3_mname', 'author3_lname', 'author3_suffix',
        'author3_email', 'author3_institution', 'author3_is_corporate',
        'author4_fname', 'author4_mname', 'author4_lname', 'author4_suffix',
        'author4_email', 'author4_institution', 'author4_is_corporate',
        'disciplines', 'instagram_username', 'document_type'
    ]

    copied_files_for_zip = []

    for i, item in enumerate(reels_data_to_process):
        original_reel_uri = item.get('uri')
        if not original_reel_uri:
            print(f"Warning: Reel item {i} has no URI, skipping.")
            continue

        media_path = os.path.join(MEDIA_DIR, original_reel_uri)
        if not os.path.exists(media_path):
            print(f"Warning: Media file not found - {media_path}, skipping reel item.")
            continue

        # --- Date and Filename (Reel Video) ---
        timestamp = item.get('creation_timestamp')
        date_obj = datetime.fromtimestamp(timestamp) if timestamp else datetime.now()
        date_str_for_title = date_obj.strftime('%Y-%m-%d')

        ext = os.path.splitext(original_reel_uri)[-1]
        handle_for_filename = instagram_handle if instagram_handle else "unknown_user"
        original_file_basename = os.path.splitext(os.path.basename(original_reel_uri))[0]
        sanitized_original_basename = ''.join(c if c.isalnum() else '_' for c in original_file_basename)

        # This is the reel video filename, goes into 'fulltext_url'
        reel_export_filename = f"instagram_{handle_for_filename}_reel_{date_str_for_title}_{i+1}_{sanitized_original_basename}{ext}"
        reel_export_path = os.path.join(output_dir, reel_export_filename)

        try:
            copy2(media_path, reel_export_path)
            copied_files_for_zip.append(reel_export_path)
        except Exception as e:
            print(f"Error copying media file {media_path} to {reel_export_path}: {e}. Skipping this reel.")
            continue

        # --- Abstract and Keywords ---
        original_caption = item.get('title', '') # This is the Instagram caption
        keywords_str = extract_hashtags(original_caption)

        # --- SRT File (Additional File) ---
        srt_export_filename = '' # Initialize to empty
        original_srt_uri = ''

        try:
            subtitles_data = item.get('media_metadata', {}).get('video_metadata', {}).get('subtitles', {})
            original_srt_uri = subtitles_data.get('uri', '')

            if original_srt_uri:
                srt_original_full_path = os.path.join(MEDIA_DIR, original_srt_uri)
                if os.path.exists(srt_original_full_path):
                    base_export_videoname, _ = os.path.splitext(reel_export_filename)
                    srt_export_filename = base_export_videoname + ".srt" # This goes into 'additional_files'
                    srt_export_path = os.path.join(output_dir, srt_export_filename)

                    copy2(srt_original_full_path, srt_export_path)
                    copied_files_for_zip.append(srt_export_path)
                else:
                    # srt_export_filename remains empty if source SRT not found
                    pass
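        except (KeyError, TypeError, AttributeError, OSError) as e:
            # Malformed subtitle metadata or a failed copy: leave 'additional_files' empty
            print(f"Warning: Could not export subtitles for reel item {i}: {e}")
            srt_export_filename = ''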

        # --- Construct new title ---
        dc_title = f"Instagram {handle_for_filename} {date_str_for_title}"

        # --- Prepare row data for Excel ---
        row = {
            'title': dc_title,
            'fulltext_url': reel_export_filename,
            'additional_files': srt_export_filename, # Will be empty if no SRT
            'keywords': keywords_str,
            'abstract': original_caption,
            'author1_fname': '', 'author1_mname': '', 'author1_lname': '', 'author1_suffix': '',
            'author1_email': '', 'author1_institution': '', 'author1_is_corporate': '',
            'author2_fname': '', 'author2_mname': '', 'author2_lname': '', 'author2_suffix': '',
            'author2_email': '', 'author2_institution': '', 'author2_is_corporate': '',
            'author3_fname': '', 'author3_mname': '', 'author3_lname': '', 'author3_suffix': '',
            'author3_email': '', 'author3_institution': '', 'author3_is_corporate': '',
            'author4_fname': '', 'author4_mname': '', 'author4_lname': '', 'author4_suffix': '',
            'author4_email': '', 'author4_institution': '', 'author4_is_corporate': '',
            'disciplines': '',
            'instagram_username': instagram_handle,
            'document_type': 'Instagram Reel'
        }
        excel_data_rows.append(row)

    # --- Create DataFrame and save to Excel ---
    if excel_data_rows:
        df = pd.DataFrame(excel_data_rows, columns=column_names)
        try:
            df.to_excel(excel_path, index=False, engine='openpyxl')
            print(f"Metadata written to Excel: {excel_path}")
        except Exception as e:
            print(f"Error writing to Excel file {excel_path}: {e}")
            print("Make sure 'openpyxl' library is installed (e.g., !pip install openpyxl).")
    else:
        print("No data to write to Excel.")

    # --- Write README file (Updated) ---
    with open(readme_path, 'w', encoding='utf-8') as f:
        f.write(f"""Instagram Reels Export Package (Digital Commons Format)
=====================================================

Handle: @{instagram_handle}
Exported: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}

This package contains:
- Exported Reel video files.
- Associated .srt subtitle files (if they existed in the archive and were found), renamed to match their corresponding video files.
- An Excel file ({excel_filename}) with metadata for each reel, formatted for potential Digital Commons import.
- This README.txt file.

Fields in the Excel file ({excel_filename}):
- title: A generated title in the format "Instagram [username] [YYYY-MM-DD]".
- fulltext_url: The filename of the exported reel video file in this package. This file should be uploaded as the primary file.
- additional_files: The filename of the exported .srt subtitle file (if available) in this package. This should be uploaded as an additional file.
- keywords: Comma-separated hashtags extracted from the reel's original caption.
- abstract: The original caption/text of the Instagram reel.
- author1_fname to author4_is_corporate: Fields for author information (currently left blank).
- disciplines: Field for academic disciplines (currently left blank).
- instagram_username: The Instagram handle from which the reel originated.
- document_type: Set to "Instagram Reel".

Generated by Instagram archive processing script.
""")
    print(f"README file written to: {readme_path}")

    # --- Zip everything up ---
    with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zipf:
        if os.path.exists(excel_path):
            zipf.write(excel_path, arcname=os.path.basename(excel_path))
        zipf.write(readme_path, arcname='README.txt')
        for file_to_zip in copied_files_for_zip: # Includes videos and SRTs
            if os.path.exists(file_to_zip):
                zipf.write(file_to_zip, arcname=os.path.basename(file_to_zip))
            else:
                print(f"Warning: File {file_to_zip} not found for zipping.")
    print(f"ZIP archive created: {zip_path}")

# The download cell that follows reads this package from
# /content/extracted_data/reels_export_dc_format/reels_package_dc_format.zip
# and copies it to /content/batchup/reels.zip.

Using Instagram handle: umsllibraries
Preparing to export metadata for 25 Reels in Digital Commons format...
Metadata written to Excel: /content/extracted_data/reels_export_dc_format/reels_metadata_dc_format.xlsx
README file written to: /content/extracted_data/reels_export_dc_format/README.txt
ZIP archive created: /content/extracted_data/reels_export_dc_format/reels_package_dc_format.zip
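
The reels cell above derives two spreadsheet fields from each archive entry: the `keywords` (hashtags pulled from the caption) and the `fulltext_url` (a renamed copy of the video). The following is a minimal, standalone sketch of that derivation, not the notebook's actual code path. `build_reel_filename` is a hypothetical helper, the handle, caption, and URI are made-up sample values, and UTC is used instead of the notebook's local time so the result is reproducible.

```python
import os
import re
from datetime import datetime, timezone


def extract_hashtags(text):
    """Return the hashtags in a caption as a comma-separated string."""
    if not isinstance(text, str):
        return ""
    return ", ".join(re.findall(r"#(\w+)", text))


def build_reel_filename(handle, uri, timestamp, position):
    """Hypothetical helper mirroring the naming pattern used in the reels cell."""
    # Assumption: UTC instead of local time, so the date does not depend on the machine.
    date_str = datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y-%m-%d")
    basename, ext = os.path.splitext(os.path.basename(uri))
    # Replace every non-alphanumeric character with an underscore.
    sanitized = "".join(c if c.isalnum() else "_" for c in basename)
    return f"instagram_{handle}_reel_{date_str}_{position}_{sanitized}{ext}"


# Sample values (not taken from any real archive).
caption = "New study rooms are open! #library #StudySpace"
keywords = extract_hashtags(caption)
filename = build_reel_filename(
    "example_handle", "media/reels/202401/clip-01.mp4", 1704110400, 1
)

print(keywords)  # library, StudySpace
print(filename)  # instagram_example_handle_reel_2024-01-01_1_clip_01.mp4
```

Keeping the sanitized original basename in the new name makes it possible to trace an exported file back to its entry in the archive.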


In [None]:
import os
import shutil
from google.colab import files # Make sure this is imported

# --- Assuming your previous script has run and created the source ZIP ---
# Define the source and target paths
source_zip_path = "/content/extracted_data/reels_export_dc_format/reels_package_dc_format.zip"
target_base_dir = "/content/batchup" # The directory where you want the new ZIP
target_zip_filename = "reels.zip"     # The desired name for the ZIP in the target directory
target_zip_path = os.path.join(target_base_dir, target_zip_filename)

# 1. Check if the source ZIP file exists
if not os.path.exists(source_zip_path):
    print(f"❌ ERROR: Source ZIP file not found at: {source_zip_path}")
    print("Please ensure the previous steps to create the ZIP were successful.")
else:
    print(f"✔️ Source ZIP found: {source_zip_path}")

    # 2. Ensure the target directory exists, create it if not
    os.makedirs(target_base_dir, exist_ok=True)
    print(f"✔️ Ensured target directory exists: {target_base_dir}")

    try:
        # 3. Copy the file
        shutil.copy2(source_zip_path, target_zip_path) # copy2 preserves metadata
        print(f"✅ Successfully copied '{os.path.basename(source_zip_path)}' to '{target_zip_path}'")

        # 4. Offer a download link for the new file
        print(f"\n⬇️ Click the link below to download '{target_zip_filename}':")
        # Note: files.download() directly initiates the download in the browser
        # It doesn't print a clickable link in the classic HTML sense in the output cell,
        # but Colab's UI will typically show a download prompt or progress.
        files.download(target_zip_path)
        print(f"(If download doesn't start automatically, check your browser's download manager or pop-up blocker.)")
        print(f"The file is located at: {target_zip_path} in the Colab environment.")

    except Exception as e:
        print(f"❌ ERROR during copy or download: {e}")


✔️ Source ZIP found: /content/extracted_data/reels_export_dc_format/reels_package_dc_format.zip
✔️ Ensured target directory exists: /content/batchup
✅ Successfully copied 'reels_package_dc_format.zip' to '/content/batchup/reels.zip'

⬇️ Click the link below to download 'reels.zip':


(If download doesn't start automatically, check your browser's download manager or pop-up blocker.)
The file is located at: /content/batchup/reels.zip in the Colab environment.
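
Before uploading a downloaded package, it can be useful to confirm that every file named in the spreadsheet is actually inside the ZIP. The sketch below shows one way to do that with the standard `zipfile` module. `missing_from_zip` is a hypothetical helper that is not part of the notebook, and the demo archive and filenames are invented so the example runs anywhere.

```python
import os
import tempfile
import zipfile


def missing_from_zip(zip_path, expected_names):
    """Return the expected filenames that are not present in the ZIP archive."""
    with zipfile.ZipFile(zip_path) as zipf:
        present = set(zipf.namelist())
    return [name for name in expected_names if name not in present]


# Build a tiny throwaway archive so the sketch does not depend on a real export.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_zip = os.path.join(tmp_dir, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w", zipfile.ZIP_DEFLATED) as zipf:
        zipf.writestr("README.txt", "demo")
        zipf.writestr("clip_1.mp4", b"")
    missing = missing_from_zip(demo_zip, ["README.txt", "clip_1.mp4", "clip_1.srt"])

print(missing)  # ['clip_1.srt']
```

In the notebook, the expected names would presumably come from the `fulltext_url` and `additional_files` columns of the exported spreadsheet.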


In [None]:
import os
import shutil
from google.colab import files, drive # files is used for download in a later cell
import zipfile
import json
from datetime import datetime
import pandas as pd
from shutil import copy2
import re # For hashtag extraction

# Ensure MEDIA_DIR is defined, as it's used in subsequent cells
MEDIA_DIR = "/content/extracted_data" # This should be consistent with the extraction path

media_type = 'stories' # We are focusing on stories
output_dir = os.path.join(MEDIA_DIR, f'{media_type}_export_dc_format') # New output dir name
os.makedirs(output_dir, exist_ok=True)
print(f"Created output directory for stories: {output_dir}")

# Excel and ZIP file paths
excel_filename = f'{media_type}_metadata_dc_format.xlsx'
excel_path = os.path.join(output_dir, excel_filename)
readme_path = os.path.join(output_dir, 'README.txt')
zip_package_filename = f'{media_type}_package_dc_format.zip'
zip_path = os.path.join(output_dir, zip_package_filename)

instagram_handle = globals().get('username_value', "unknown_user")
print(f"Using Instagram handle: {instagram_handle}")

def extract_hashtags(text):
    if not isinstance(text, str):
        return ""
    hashtags = re.findall(r"#(\w+)", text)
    return ", ".join(hashtags)

# --- Load Stories JSON Data ---
instagram_media_data = {}
total_stories_in_json = 0
try:
    stories_json_path = os.path.join(MEDIA_DIR, 'your_instagram_activity', 'media', 'stories.json')
    if os.path.exists(stories_json_path):
        with open(stories_json_path, 'r', encoding='utf-8') as f:
            stories_full_data = json.load(f)
            if isinstance(stories_full_data, dict) and "ig_stories" in stories_full_data and isinstance(stories_full_data["ig_stories"], list):
                # Sort by creation_timestamp to group items from the same "post time" consecutively
                # Handle items with no timestamp by placing them at the end (or beginning, consistently)
                instagram_media_data['stories'] = sorted(
                    stories_full_data["ig_stories"],
                    key=lambda x: x.get('creation_timestamp') or float('inf') # Missing or null timestamps sort last
                )
                total_stories_in_json = len(stories_full_data["ig_stories"])
            elif isinstance(stories_full_data, list):
                print("Assuming the JSON root is the list of stories (older format or direct list). Sorting by timestamp.")
                instagram_media_data['stories'] = sorted(
                    stories_full_data,
                    key=lambda x: x.get('creation_timestamp') or float('inf') # Missing or null timestamps sort last
                )
                total_stories_in_json = len(stories_full_data)
            else:
                print(f"Warning: Stories JSON structure unexpected. Expected dict with 'ig_stories' list or a direct list. Found: {type(stories_full_data)}")
                instagram_media_data['stories'] = []
    else:
        print(f"Stories JSON file not found at: {stories_json_path}")
        instagram_media_data['stories'] = []
except Exception as e:
    print(f"Error loading or sorting Stories JSON for export: {e}")
    instagram_media_data['stories'] = []

# --- Process Stories Data ---
stories_data_to_process = instagram_media_data.get('stories')
skipped_absolute_uri_count = 0
skipped_missing_local_file_count = 0
skipped_no_uri_count = 0
skipped_no_extension_count = 0
processed_item_excel_count = 0 # Counter for items written to Excel

# For handling repeated metadata for same-timestamp posts
last_processed_group_timestamp = None
shared_group_metadata = {
    'title': '',
    'abstract': '',
    'keywords': ''
}

if not stories_data_to_process:
    print("No Stories data loaded or found for export.")
else:
    print(f"Found {total_stories_in_json} story entries in JSON. Preparing to export metadata for locally available items...")

    excel_data_rows = []
    copied_files_for_zip = []

    column_names = [
        'title', 'fulltext_url', 'additional_files', 'keywords', 'abstract',
        'author1_fname', 'author1_mname', 'author1_lname', 'author1_suffix',
        'author1_email', 'author1_institution', 'author1_is_corporate',
        'disciplines', 'instagram_username', 'document_type'
    ]

    for i, item in enumerate(stories_data_to_process): # i is index after sorting
        original_story_uri = item.get('uri')

        if not original_story_uri:
            skipped_no_uri_count += 1
            continue
        if original_story_uri.startswith('http://') or original_story_uri.startswith('https://'):
            skipped_absolute_uri_count +=1
            continue

        media_path = os.path.join(MEDIA_DIR, original_story_uri)
        if not os.path.exists(media_path):
            skipped_missing_local_file_count += 1
            continue

        current_item_timestamp = item.get('creation_timestamp')

        date_obj = datetime.fromtimestamp(current_item_timestamp) if current_item_timestamp else datetime.now()
        date_str_for_filename_and_title = date_obj.strftime('%Y-%m-%d')

        _, ext = os.path.splitext(original_story_uri)
        if not ext:
            if 'video_metadata' in item.get('media_metadata', {}): ext = '.mp4'
            elif 'photo_metadata' in item.get('media_metadata', {}): ext = '.jpg'
            else:
                skipped_no_extension_count +=1
                continue

        handle_for_filename = instagram_handle.replace("@", "") if instagram_handle else "unknown_user"
        original_file_basename = os.path.splitext(os.path.basename(original_story_uri))[0]
        sanitized_original_basename = ''.join(c if c.isalnum() else '_' for c in original_file_basename).strip('_')
        if not sanitized_original_basename:
             sanitized_original_basename = f"media_item_{processed_item_excel_count + 1}"

        # Filename must be unique for each distinct media file.
        # Using processed_item_excel_count ensures unique numbering for exported files.
        story_export_filename = f"instagram_{handle_for_filename}_story_{date_str_for_filename_and_title}_{processed_item_excel_count + 1}_{sanitized_original_basename}{ext}"
        story_export_path = os.path.join(output_dir, story_export_filename)

        try:
            copy2(media_path, story_export_path)
            copied_files_for_zip.append(story_export_path)
        except Exception as e:
            print(f"Error copying media file {media_path} to {story_export_path}: {e}. Skipping this story.")
            skipped_missing_local_file_count += 1
            continue

        # --- Handle grouped metadata based on timestamp ---
        if current_item_timestamp != last_processed_group_timestamp or last_processed_group_timestamp is None:
            # This is the first item of a new timestamp group (or the very first item overall)
            shared_group_metadata['abstract'] = item.get('title', '') # Story's text/caption
            shared_group_metadata['keywords'] = extract_hashtags(shared_group_metadata['abstract'])
            shared_group_metadata['title'] = f"Instagram Story by {instagram_handle} - {date_str_for_filename_and_title}"
            last_processed_group_timestamp = current_item_timestamp

        # All items in the same timestamp group will use these:
        dc_title_to_use = shared_group_metadata['title']
        abstract_to_use = shared_group_metadata['abstract']
        keywords_to_use = shared_group_metadata['keywords']

        processed_item_excel_count += 1

        row = {
            'title': dc_title_to_use,
            'fulltext_url': story_export_filename, # Unique per file
            'additional_files': '',
            'keywords': keywords_to_use,
            'abstract': abstract_to_use,
            'author1_fname': '', 'author1_mname': '', 'author1_lname': '', 'author1_suffix': '',
            'author1_email': '', 'author1_institution': '', 'author1_is_corporate': False,
            'disciplines': '',
            'instagram_username': instagram_handle,
            'document_type': 'Instagram Story'
        }
        excel_data_rows.append(row)

    print(f"\n--- Processing Summary ---")
    print(f"Total story entries in JSON: {total_stories_in_json}")
    print(f"Successfully processed and included in package (rows in Excel): {processed_item_excel_count}")
    print(f"Stories skipped due to being a web link (absolute URI): {skipped_absolute_uri_count}")
    print(f"Stories skipped because local media file was missing: {skipped_missing_local_file_count}")
    print(f"Stories skipped because JSON entry had no URI: {skipped_no_uri_count}")
    print(f"Stories skipped because media type/extension couldn't be determined: {skipped_no_extension_count}")
    print(f"--------------------------\n")

    if excel_data_rows:
        df = pd.DataFrame(excel_data_rows, columns=column_names)
        try:
            df.to_excel(excel_path, index=False, engine='openpyxl')
            print(f"Metadata for {len(excel_data_rows)} locally available stories written to Excel: {excel_path}")
        except Exception as e:
            print(f"Error writing to Excel file {excel_path}: {e}")
            print("Make sure 'openpyxl' library is installed (e.g., !pip install openpyxl).")
    else:
        print("No locally available story data processed to write to Excel.")

    with open(readme_path, 'w', encoding='utf-8') as f:
        f.write(f"""Instagram Stories Export Package (Digital Commons Format)
========================================================

Handle: @{instagram_handle}
Exported: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}

This package contains:
- Exported Story media files (images or videos) that were found locally within your downloaded Instagram archive.
- An Excel file ({excel_filename}) with metadata for these locally available stories, formatted for potential Digital Commons import.
- This README.txt file.

Important Note on Metadata for Grouped Stories:
-----------------------------------------------
If multiple story items were posted at the exact same time (same `creation_timestamp`),
this script groups them. For such groups:
- The 'title', 'abstract', and 'keywords' metadata fields in the Excel sheet will be
  identical for all items within that same-timestamp group. This metadata is taken
  from the first story item encountered within that group.
- Each story media file itself (referenced in 'fulltext_url') remains unique and is
  individually packaged.

Important Note on Potentially Missing Story Items:
-------------------------------------------------
The number of stories included in this package might be less than the total number of
story entries in your `stories.json` file. This is because:
1. Some story entries in `stories.json` may reference media using a web link
   (e.g., starting with "http://"). These point to files on Instagram's servers,
   not files included locally in your archive.
2. A story entry might list a local file path, but that file could be missing from
   your specific archive download or the script couldn't determine its type.
3. Some JSON entries might be incomplete (e.g., missing a URI).

This script packages *only* media files physically present and accessible within your
downloaded Instagram archive. It does *not* download files from web links.

Summary from this export run:
- Total story entries originally in stories.json: {total_stories_in_json}
- Stories successfully processed and included in this package: {processed_item_excel_count}
- Stories skipped (media was a web link): {skipped_absolute_uri_count}
- Stories skipped (local media file missing): {skipped_missing_local_file_count}
- Stories skipped (JSON entry had no URI): {skipped_no_uri_count}
- Stories skipped (media type/extension undetermined): {skipped_no_extension_count}

Fields in the Excel file ({excel_filename}):
- title: A generated title. For stories posted at the same time, this title will be repeated.
- fulltext_url: The unique filename of the exported story media file.
- additional_files: Typically empty for stories.
- keywords: Hashtags from the story's text. Repeated for same-timestamp groups.
- abstract: The story's text. Repeated for same-timestamp groups.
- (Author fields): For author information.
- disciplines: For academic disciplines.
- instagram_username: The Instagram handle.
- document_type: "Instagram Story".

Generated by Instagram archive processing script.
""")
    print(f"README file written to: {readme_path}")

    if copied_files_for_zip or (os.path.exists(excel_path) and excel_data_rows):
        print(f"Creating ZIP archive: {zip_path}")
        with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zipf:
            if os.path.exists(excel_path) and excel_data_rows:
                zipf.write(excel_path, arcname=os.path.basename(excel_path))
                print(f"Added {os.path.basename(excel_path)} to ZIP.")
            elif not excel_data_rows: print(f"Excel file {excel_filename} not added (no data).")
            elif not os.path.exists(excel_path): print(f"Excel file {excel_path} not found.")

            if os.path.exists(readme_path):
                zipf.write(readme_path, arcname='README.txt')
                print(f"Added README.txt to ZIP.")
            else: print(f"README.txt not found at {readme_path}.")

            successfully_zipped_media_count = 0
            for file_to_zip in copied_files_for_zip:
                if os.path.exists(file_to_zip):
                    zipf.write(file_to_zip, arcname=os.path.basename(file_to_zip))
                    successfully_zipped_media_count += 1
                else:
                    print(f"Warning: Copied media file {file_to_zip} not found during zipping.")
            print(f"Added {successfully_zipped_media_count} media files to ZIP.")
        print(f"ZIP archive '{zip_package_filename}' created successfully in '{output_dir}'.")
    else:
        print(f"No files processed/copied for stories. ZIP archive '{zip_package_filename}' not created.")

print(f"\n--- Story Export Process Finished ---")

Created output directory for stories: /content/extracted_data/stories_export_dc_format
Using Instagram handle: umsllibraries
Found 485 story entries in JSON. Preparing to export metadata for locally available items...

--- Processing Summary ---
Total story entries in JSON: 485
Successfully processed and included in package (rows in Excel): 450
Stories skipped due to being a web link (absolute URI): 35
Stories skipped because local media file was missing: 0
Stories skipped because JSON entry had no URI: 0
Stories skipped because media type/extension couldn't be determined: 0
--------------------------

Metadata for 450 locally available stories written to Excel: /content/extracted_data/stories_export_dc_format/stories_metadata_dc_format.xlsx
README file written to: /content/extracted_data/stories_export_dc_format/README.txt
Creating ZIP archive: /content/extracted_data/stories_export_dc_format/stories_package_dc_format.zip
Added stories_metadata_dc_format.xlsx to ZIP.
Added README.txt 
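
The stories cell above repeats the `title`, `abstract`, and `keywords` of the first item across every item that shares the same `creation_timestamp`. The sketch below isolates that grouping rule on a few invented entries. `assign_group_metadata` is a hypothetical name, it tracks only the caption for brevity, and it assumes the input is already sorted by timestamp, as the cell does before processing.

```python
def assign_group_metadata(items):
    """Give same-timestamp items the caption of the first item in their group.

    Assumes `items` is already sorted by 'creation_timestamp'.
    """
    rows = []
    last_timestamp = None
    shared_caption = ""
    for item in items:
        timestamp = item.get("creation_timestamp")
        # A new group starts on the first item or whenever the timestamp changes.
        if last_timestamp is None or timestamp != last_timestamp:
            shared_caption = item.get("title", "")
            last_timestamp = timestamp
        rows.append({"uri": item.get("uri"), "abstract": shared_caption})
    return rows


# Invented sample entries: the first two share a timestamp.
sample_items = [
    {"uri": "a.jpg", "creation_timestamp": 100, "title": "Opening day #welcome"},
    {"uri": "b.jpg", "creation_timestamp": 100, "title": ""},
    {"uri": "c.mp4", "creation_timestamp": 200, "title": "Tour"},
]
rows = assign_group_metadata(sample_items)
for row in rows:
    print(row["uri"], "->", row["abstract"])
```

Here `b.jpg` inherits the caption of `a.jpg` because both carry timestamp 100, while `c.mp4` starts a new group.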

In [None]:
# --- Assuming your previous script has run and created the source ZIP ---
# Define the source and target paths
source_zip_path = "/content/extracted_data/stories_export_dc_format/stories_package_dc_format.zip"
target_base_dir = "/content/batchup" # The directory where you want the new ZIP
target_zip_filename = "stories.zip"     # The desired name for the ZIP in the target directory
target_zip_path = os.path.join(target_base_dir, target_zip_filename)

# 1. Check if the source ZIP file exists
if not os.path.exists(source_zip_path):
    print(f"❌ ERROR: Source ZIP file not found at: {source_zip_path}")
    print("Please ensure the previous steps to create the ZIP were successful.")
else:
    print(f"✔️ Source ZIP found: {source_zip_path}")

    # 2. Ensure the target directory exists, create it if not
    os.makedirs(target_base_dir, exist_ok=True)
    print(f"✔️ Ensured target directory exists: {target_base_dir}")

    try:
        # 3. Copy the file
        shutil.copy2(source_zip_path, target_zip_path) # copy2 preserves metadata
        print(f"✅ Successfully copied '{os.path.basename(source_zip_path)}' to '{target_zip_path}'")

        # 4. Offer a download link for the new file
        print(f"\n⬇️ Click the link below to download '{target_zip_filename}':")
        # Note: files.download() directly initiates the download in the browser
        # It doesn't print a clickable link in the classic HTML sense in the output cell,
        # but Colab's UI will typically show a download prompt or progress.
        files.download(target_zip_path)
        print(f"(If download doesn't start automatically, check your browser's download manager or pop-up blocker.)")
        print(f"The file is located at: {target_zip_path} in the Colab environment.")

    except Exception as e:
        print(f"❌ ERROR during copy or download: {e}")

✔️ Source ZIP found: /content/extracted_data/stories_export_dc_format/stories_package_dc_format.zip
✔️ Ensured target directory exists: /content/batchup
✅ Successfully copied 'stories_package_dc_format.zip' to '/content/batchup/stories.zip'

⬇️ Click the link below to download 'stories.zip':


(If download doesn't start automatically, check your browser's download manager or pop-up blocker.)
The file is located at: /content/batchup/stories.zip in the Colab environment.
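
Both the stories cell above and the posts cell that follows skip an entry when its `uri` is missing, points to a web address, or names a file that is not in the extracted archive. The sketch below condenses those checks into one function. `classify_media_uri` and its return labels are illustrative assumptions rather than names used by the notebook, and the temporary directory stands in for `MEDIA_DIR`.

```python
import os
import tempfile


def classify_media_uri(uri, media_dir):
    """Return 'ok' or the reason an entry would be skipped (labels are illustrative)."""
    if not uri:
        return "no_uri"
    if uri.startswith(("http://", "https://")):
        # The file lives on Instagram's servers, not in the archive.
        return "absolute_uri"
    if not os.path.exists(os.path.join(media_dir, uri)):
        return "missing_local_file"
    return "ok"


with tempfile.TemporaryDirectory() as media_dir:
    # Create one local file so the 'ok' branch can be exercised.
    os.makedirs(os.path.join(media_dir, "media", "stories"))
    local_uri = os.path.join("media", "stories", "photo.jpg")
    open(os.path.join(media_dir, local_uri), "wb").close()

    sample_uris = [None, "https://example.com/photo.jpg", "media/stories/gone.jpg", local_uri]
    results = [classify_media_uri(uri, media_dir) for uri in sample_uris]

print(results)  # ['no_uri', 'absolute_uri', 'missing_local_file', 'ok']
```

Counting these labels would give the same kind of tally that the processing summary prints.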


In [None]:
import os
import shutil
# from google.colab import files, drive # files is used for download in a later cell. Keep if this cell is standalone runnable.
import zipfile
import json
from datetime import datetime
import pandas as pd
from shutil import copy2
import re # For hashtag extraction

# Ensure MEDIA_DIR is defined, as it's used in subsequent cells
MEDIA_DIR = "/content/extracted_data" # This should be consistent with the extraction path

media_type = 'posts' # We are focusing on posts
output_dir = os.path.join(MEDIA_DIR, f'{media_type}_export_dc_format_individual_media') # New output dir name
os.makedirs(output_dir, exist_ok=True)
print(f"Created output directory for posts (individual media items): {output_dir}")

# Excel and ZIP file paths
excel_filename = f'{media_type}_metadata_dc_format_individual_media.xlsx'
excel_path = os.path.join(output_dir, excel_filename)
readme_path = os.path.join(output_dir, 'README.txt')
zip_package_filename = f'{media_type}_package_dc_format_individual_media.zip'
zip_path = os.path.join(output_dir, zip_package_filename)

instagram_handle = globals().get('username_value', "unknown_user")
print(f"Using Instagram handle: {instagram_handle}")

def extract_hashtags(text):
    if not isinstance(text, str):
        return ""
    hashtags = re.findall(r"#(\w+)", text)
    return ", ".join(hashtags)

# --- Load Posts JSON Data ---
instagram_media_data = {}
total_posts_in_json = 0
try:
    posts_json_path = os.path.join(MEDIA_DIR, 'your_instagram_activity', 'media', 'posts_1.json')
    if os.path.exists(posts_json_path):
        with open(posts_json_path, 'r', encoding='utf-8') as f:
            loaded_post_data = json.load(f) # posts_1.json is a list of post objects
            if isinstance(loaded_post_data, list):
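                # Sort posts chronologically; entries without a usable timestamp go last
                def _post_sort_key(post):
                    # Prefer the post-level timestamp, then the first media item's timestamp
                    if not isinstance(post, dict):
                        return float('inf')
                    timestamp = post.get('creation_timestamp')
                    media = post.get('media')
                    if timestamp is None and isinstance(media, list) and media and isinstance(media[0], dict):
                        timestamp = media[0].get('creation_timestamp')
                    return timestamp if timestamp is not None else float('inf')

                instagram_media_data['posts'] = sorted(loaded_post_data, key=_post_sort_key)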
                total_posts_in_json = len(loaded_post_data)
            else:
                print(f"Warning: Posts JSON ({posts_json_path}) was not a list as expected. Found type: {type(loaded_post_data)}")
                instagram_media_data['posts'] = []
    else:
        print(f"Posts JSON file not found at: {posts_json_path}")
        instagram_media_data['posts'] = []
except Exception as e:
    print(f"Error loading or sorting Posts JSON for export: {e}")
    instagram_media_data['posts'] = []

# --- Process Posts Data ---
posts_data_to_process = instagram_media_data.get('posts')

# Counters for summary
processed_media_items_excel_count = 0 # Counts each media item that gets a row
skipped_posts_no_media_list = 0
skipped_posts_no_timestamp = 0
skipped_media_items_no_uri = 0
skipped_media_items_absolute_uri = 0
skipped_media_items_missing_local_file = 0
skipped_media_items_no_extension = 0


if not posts_data_to_process:
    print("No Posts data loaded or found for export.")
else:
    print(f"Found {total_posts_in_json} post entries in JSON. Preparing to export metadata for each media item individually...")

    excel_data_rows = []
    copied_files_for_zip = []

    column_names = [
        'title', 'fulltext_url', 'additional_files', 'keywords', 'abstract',
        'author1_fname', 'author1_mname', 'author1_lname', 'author1_suffix',
        'author1_email', 'author1_institution', 'author1_is_corporate',
        'disciplines', 'instagram_username', 'document_type', 'post_id' # Added post_id
    ]

    for post_index, post_item in enumerate(posts_data_to_process):
        media_list_in_post = post_item.get('media')
        if not media_list_in_post or not isinstance(media_list_in_post, list) or len(media_list_in_post) == 0:
            # print(f"Debug: Post item {post_index} has no media or media is not a list. Content: {post_item}")
            skipped_posts_no_media_list += 1
            continue

        # Determine post caption and timestamp (consistent for all media items in this post)
        # Overall post caption is in post_item['title'] for carousels.
        # For single media posts, it's often in post_item['media'][0]['title'].
        # Post timestamp is in post_item['creation_timestamp'] for carousels or single media posts.
        # For single media, it can also be in post_item['media'][0]['creation_timestamp'].

        post_level_caption = post_item.get('title', '')
        post_level_timestamp = post_item.get('creation_timestamp')

        # Fallback for single media posts where caption/timestamp might be in the media item itself
        first_media_item_for_fallback = media_list_in_post[0]  # The list is non-empty (checked above)
        if not isinstance(first_media_item_for_fallback, dict): # Ensure it's a dict before .get()
            first_media_item_for_fallback = {}


        if not post_level_caption: # If no overall post caption, try first media's title
            post_level_caption = first_media_item_for_fallback.get('title', '')
        if post_level_timestamp is None:
            post_level_timestamp = first_media_item_for_fallback.get('creation_timestamp')

        if post_level_timestamp is None:
            # print(f"Debug: Post item {post_index} has no valid timestamp. Content: {post_item}")
            skipped_posts_no_timestamp += 1
            continue

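        # Use the derived post_level_timestamp as the post_id.
        # Note: two posts that share the same creation timestamp would also share a post_id.
        current_post_id = post_level_timestamp
        # fromtimestamp() without a tz argument converts using the runtime's local time zone.
        date_obj = datetime.fromtimestamp(current_post_id)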
        date_str_for_filename_and_title = date_obj.strftime('%Y-%m-%d')
        handle_for_filename = instagram_handle.replace("@", "") if instagram_handle else "unknown_user"

        # Post-level abstract and keywords
        abstract_for_post = post_level_caption
        keywords_for_post = extract_hashtags(abstract_for_post)

        media_items_successfully_processed_for_this_post = 0

        for media_index, media_item in enumerate(media_list_in_post):
            if not isinstance(media_item, dict):
                # print(f"Debug: Media item {media_index} in post {post_index} is not a dictionary. Content: {media_item}")
                continue

            original_media_uri = media_item.get('uri')
            if not original_media_uri:
                # print(f"Debug: Media item {media_index} in post {post_index} has no URI. Content: {media_item}")
                skipped_media_items_no_uri += 1
                continue

            if original_media_uri.startswith(('http://', 'https://')):
                # print(f"Info: Media item {media_index} ('{original_media_uri}') in post {post_index} is an absolute URI. Skipping.")
                skipped_media_items_absolute_uri += 1
                continue

            media_path = os.path.join(MEDIA_DIR, original_media_uri)
            if not os.path.exists(media_path):
                # print(f"Warning: Media file not found for URI '{original_media_uri}' in post {post_index}. Path: {media_path}. Skipping media item.")
                skipped_media_items_missing_local_file += 1
                continue

            _, ext = os.path.splitext(original_media_uri)
            if not ext:
                # No extension in the URI: infer '.mp4' when the entry carries video metadata.
                # `or {}` guards against a null 'media_metadata' value in the JSON.
                if 'video_metadata' in (media_item.get('media_metadata') or {}):
                    ext = '.mp4'
                else:
                    # Add more specific inferences here if needed, e.g. image types based on other metadata.
                    # print(f"Warning: Media URI '{original_media_uri}' in post {post_index} has no extension and type couldn't be inferred. Skipping media item.")
                    skipped_media_items_no_extension += 1
                    continue

            original_file_basename = os.path.splitext(os.path.basename(original_media_uri))[0]
            sanitized_original_basename = ''.join(c if c.isalnum() else '_' for c in original_file_basename).strip('_')
            if not sanitized_original_basename:  # Handle cases where the basename becomes empty after sanitizing
                sanitized_original_basename = f"media_{media_index + 1}"

            # Filename for this specific media item
            media_export_filename = f"instagram_{handle_for_filename}_post_{date_str_for_filename_and_title}_p{post_index + 1}_m{media_index + 1}_{sanitized_original_basename}{ext}"
            media_export_path = os.path.join(output_dir, media_export_filename)

            try:
                copy2(media_path, media_export_path)
                copied_files_for_zip.append(media_export_path)
            except Exception as e:
                print(f"Error copying media file {media_path} to {media_export_path} for post {post_index}, media {media_index}: {e}. Skipping this media item.")
                skipped_media_items_missing_local_file += 1
                continue

            # DC Title for this specific media item
            dc_title_for_media_item = f"Instagram Post by {instagram_handle} - {date_str_for_filename_and_title} (Post {post_index + 1} - Media {media_index + 1} of {len(media_list_in_post)})"

            row = {
                'title': dc_title_for_media_item,
                'fulltext_url': media_export_filename, # This media item is the primary
                'additional_files': '', # No additional files for this individual media item's record
                'keywords': keywords_for_post, # From overall post
                'abstract': abstract_for_post, # From overall post
                'author1_fname': '', 'author1_mname': '', 'author1_lname': '', 'author1_suffix': '',
                'author1_email': '', 'author1_institution': '', 'author1_is_corporate': False,
                'disciplines': '',
                'instagram_username': instagram_handle,
                'document_type': 'Instagram Post', # Kept as 'Instagram Post'
                'post_id': current_post_id # Timestamp of the original post for grouping
            }
            excel_data_rows.append(row)
            processed_media_items_excel_count += 1
            media_items_successfully_processed_for_this_post += 1

        if media_items_successfully_processed_for_this_post == 0:
            print(f"Info: Post {post_index} had {len(media_list_in_post)} media items, but none could be processed successfully (e.g. all web links or files missing).")


    # --- Summary ---
    print(f"\n--- Processing Summary for Posts (Individual Media Items) ---")
    print(f"Total post entries in JSON: {total_posts_in_json}")
    print(f"Total media items successfully processed and included in Excel: {processed_media_items_excel_count}")
    print(f"Posts skipped entirely (e.g., no media list, no timestamp): {skipped_posts_no_media_list + skipped_posts_no_timestamp}")
    print(f"  - Due to no media list or empty media list: {skipped_posts_no_media_list}")
    print(f"  - Due to no valid timestamp for the post: {skipped_posts_no_timestamp}")
    print(f"Individual media items skipped (breakdown):")
    print(f"  - No URI in JSON entry: {skipped_media_items_no_uri}")
    print(f"  - Web link (absolute HTTP/S URI): {skipped_media_items_absolute_uri}")
    print(f"  - Local file missing or copy error: {skipped_media_items_missing_local_file}")
    print(f"  - No file extension and type not inferred: {skipped_media_items_no_extension}")
    print(f"-----------------------------------------------------------\n")

    # --- Create DataFrame and save to Excel ---
    if excel_data_rows:
        df = pd.DataFrame(excel_data_rows, columns=column_names)
        try:
            df.to_excel(excel_path, index=False, engine='openpyxl')
            print(f"Metadata for {len(excel_data_rows)} post media items written to Excel: {excel_path}")
        except Exception as e:
            print(f"Error writing to Excel file {excel_path}: {e}")
            print("Make sure 'openpyxl' library is installed (e.g., !pip install openpyxl).")
    else:
        print("No post media item data processed to write to Excel.")

    # --- Write README file ---
    with open(readme_path, 'w', encoding='utf-8') as f:
        f.write(f"""Instagram Posts Export Package (Digital Commons Format - Individual Media Items)
===============================================================================

Handle: @{instagram_handle.replace('@', '') if instagram_handle else 'unknown_user'}
Exported: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}
This script processed {total_posts_in_json} post entries from 'posts_1.json'.

This package contains:
- Exported media files (images or videos) from Instagram posts that were found locally.
- An Excel file ({excel_filename}) with metadata. Each row in this Excel file
  represents a single media item (image or video) from an Instagram post.
- This README.txt file.

Structure of Excel Data:
------------------------
- Each media item from an Instagram post (including items from carousels) gets its own row.
- The `post_id` column contains the creation timestamp of the original Instagram post.
  This ID can be used to group all media items that originated from the same post.
- The `fulltext_url` column lists the unique filename of the specific media item for that row.
- The `additional_files` column is intentionally left blank for each row in this format.
- Metadata common to the original post (like caption/abstract and keywords) is duplicated
  across all rows corresponding to media items from that same post.

Metadata Fields in Excel ({excel_filename}):
---------------------------------------------
- title: A generated title, unique for each media item, indicating its position within the original post
         (e.g., "Instagram Post by [username] - [YYYY-MM-DD] (Post [index] - Media [media_index] of [total_media])").
- fulltext_url: The filename of the exported media file for this specific row.
- additional_files: Blank in this version.
- keywords: Hashtags extracted from the original post's caption (repeated for all media from the same post).
- abstract: The original caption/text of the Instagram post (repeated for all media from the same post).
- author1_...: Author fields (defaulted to blank/False).
- disciplines: Academic disciplines (defaulted to blank).
- instagram_username: The Instagram handle.
- document_type: "Instagram Post".
- post_id: The creation timestamp of the original Instagram post, used for grouping.

Processing Summary (from this export run):
------------------------------------------
- Total post entries in 'posts_1.json': {total_posts_in_json}
- Total media items successfully processed and included in Excel: {processed_media_items_excel_count}
- Posts skipped entirely (no media list or no timestamp): {skipped_posts_no_media_list + skipped_posts_no_timestamp}
  - Due to no media list: {skipped_posts_no_media_list}
  - Due to no timestamp: {skipped_posts_no_timestamp}
- Individual media items skipped (breakdown):
  - No URI in JSON: {skipped_media_items_no_uri}
  - Web link (HTTP/S): {skipped_media_items_absolute_uri}
  - Local file missing/error: {skipped_media_items_missing_local_file}
  - No extension/type unknown: {skipped_media_items_no_extension}

Generated by Instagram archive processing script.
""")
    print(f"README file written to: {readme_path}")

    # --- Zip everything up ---
    if copied_files_for_zip or (os.path.exists(excel_path) and excel_data_rows):
        print(f"Creating ZIP archive: {zip_path}")
        with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zipf:
            if os.path.exists(excel_path) and excel_data_rows:
                zipf.write(excel_path, arcname=os.path.basename(excel_path))
                print(f"Added {os.path.basename(excel_path)} to ZIP.")
            elif not excel_data_rows:
                print(f"Excel file {excel_filename} not added (no data).")
            else:
                print(f"Excel file {excel_path} not found, not added.")

            if os.path.exists(readme_path):
                zipf.write(readme_path, arcname='README.txt')
                print("Added README.txt to ZIP.")
            else:
                print(f"README.txt not found at {readme_path}, not added.")

            successfully_zipped_media_count = 0
            for file_to_zip in copied_files_for_zip:
                if os.path.exists(file_to_zip):
                    zipf.write(file_to_zip, arcname=os.path.basename(file_to_zip))
                    successfully_zipped_media_count += 1
                else:
                    print(f"Warning: Copied media file {file_to_zip} not found during zipping, not added.")
            print(f"Added {successfully_zipped_media_count} media files to ZIP.")
        print(f"ZIP archive '{zip_package_filename}' created successfully in '{output_dir}'.")
    else:
        print(f"No files processed/copied for posts. ZIP archive '{zip_package_filename}' not created.")

print(f"\n--- Post Export Process (Individual Media Items) Finished ---")

# To download this specific package, the next cell would need:
# source_zip_path = "/content/extracted_data/posts_export_dc_format_individual_media/posts_package_dc_format_individual_media.zip"
# target_zip_filename = "posts_individual_media.zip" # Or similar

Created output directory for posts (individual media items): /content/extracted_data/posts_export_dc_format_individual_media
Using Instagram handle: umsllibraries
Found 646 post entries in JSON. Preparing to export metadata for each media item individually...

--- Processing Summary for Posts (Individual Media Items) ---
Total post entries in JSON: 646
Total media items successfully processed and included in Excel: 860
Posts skipped entirely (e.g., no media list, no timestamp): 0
  - Due to no media list or empty media list: 0
  - Due to no valid timestamp for the post: 0
Individual media items skipped (breakdown):
  - No URI in JSON entry: 0
  - Web link (absolute HTTP/S URI): 0
  - Local file missing or copy error: 0
  - No file extension and type not inferred: 0
-----------------------------------------------------------

Metadata for 860 post media items written to Excel: /content/extracted_data/posts_export_dc_format_individual_media/posts_metadata_dc_format_individual_media.xlsx
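
The README written by the cell above notes that `post_id` can be used to regroup media items by their original post. A minimal sketch of that regrouping follows, assuming rows shaped like the `excel_data_rows` dictionaries built during export. The sample rows, the filenames, and the `group_rows_by_post` helper are hypothetical and only illustrate the idea. With the exported sheet loaded into pandas, `df.groupby('post_id')` would do the same job.

```python
from collections import defaultdict
from datetime import datetime, timezone


def group_rows_by_post(rows):
    """Group per-media rows by their shared post_id (the post's creation timestamp)."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["post_id"]].append(row["fulltext_url"])
    return dict(grouped)


# Hypothetical sample rows: a two-item carousel and a single-image post.
sample_rows = [
    {"post_id": 1700000000, "fulltext_url": "instagram_example_post_2023-11-14_p1_m1_a.jpg"},
    {"post_id": 1700000000, "fulltext_url": "instagram_example_post_2023-11-14_p1_m2_b.jpg"},
    {"post_id": 1700086400, "fulltext_url": "instagram_example_post_2023-11-15_p2_m1_c.jpg"},
]

posts = group_rows_by_post(sample_rows)
for post_id, filenames in posts.items():
    # Interpret the timestamp as UTC so the date does not depend on the machine's time zone.
    posted_on = datetime.fromtimestamp(post_id, tz=timezone.utc).strftime("%Y-%m-%d")
    print(f"{posted_on}: {len(filenames)} media item(s)")
```

Because the ID is a timestamp, two posts created in the same second would land in the same group; that is an assumption worth checking against your own archive before relying on the grouping.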

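
The export cell builds each media filename from the handle, the post date, the post and media positions, and a sanitized copy of the original basename. The sketch below isolates that naming pattern so it can be tried on its own. `build_export_filename` and its parameters are hypothetical names introduced here for illustration, not part of the notebook, and the sample handle and path are made up.

```python
import os


def build_export_filename(handle, date_str, post_number, media_number, original_uri, fallback_ext=""):
    """Mirror the naming pattern used in the export cell (a sketch, not the cell's exact code)."""
    base, ext = os.path.splitext(os.path.basename(original_uri))
    # Replace anything that is not a letter or digit with '_' and trim stray underscores.
    safe_base = "".join(c if c.isalnum() else "_" for c in base).strip("_")
    if not safe_base:
        # Fall back to a positional name when nothing survives sanitizing.
        safe_base = f"media_{media_number}"
    return (
        f"instagram_{handle.replace('@', '')}_post_{date_str}"
        f"_p{post_number}_m{media_number}_{safe_base}{ext or fallback_ext}"
    )


name = build_export_filename("@example_library", "2024-01-31", 3, 2, "media/posts/202401/IMG 0042 (1).jpg")
print(name)
```

Note that `str.isalnum()` is Unicode-aware, so accented or non-Latin letters in a basename are kept rather than replaced with underscores.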

In [None]:
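# --- Assumes the previous cell has run and created the source ZIP ---
import os
import shutil

from google.colab import files  # Colab-only helper that triggers browser downloads

# Define the source and target paths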
source_zip_path = "/content/extracted_data/posts_export_dc_format_individual_media/posts_package_dc_format_individual_media.zip"
target_base_dir = "/content/batchup" # The directory where you want the new ZIP
target_zip_filename = "posts.zip"     # The desired name for the ZIP in the target directory
target_zip_path = os.path.join(target_base_dir, target_zip_filename)

# 1. Check if the source ZIP file exists
if not os.path.exists(source_zip_path):
    print(f"❌ ERROR: Source ZIP file not found at: {source_zip_path}")
    print("Please ensure the previous steps to create the ZIP were successful.")
else:
    print(f"✔️ Source ZIP found: {source_zip_path}")

    # 2. Ensure the target directory exists, create it if not
    os.makedirs(target_base_dir, exist_ok=True)
    print(f"✔️ Ensured target directory exists: {target_base_dir}")

    try:
        # 3. Copy the file
        shutil.copy2(source_zip_path, target_zip_path) # copy2 preserves metadata
        print(f"✅ Successfully copied '{os.path.basename(source_zip_path)}' to '{target_zip_path}'")

        # 4. Trigger a browser download of the new file
        print(f"\n⬇️ Click the link below to download '{target_zip_filename}':")
        # Note: files.download() starts the download in the browser directly. It does not
        # render a clickable link in the output cell; Colab's UI typically shows a
        # download prompt or progress indicator instead.
        files.download(target_zip_path)
        print(f"(If download doesn't start automatically, check your browser's download manager or pop-up blocker.)")
        print(f"The file is located at: {target_zip_path} in the Colab environment.")

    except Exception as e:
        print(f"❌ ERROR during copy or download: {e}")

✔️ Source ZIP found: /content/extracted_data/posts_export_dc_format_individual_media/posts_package_dc_format_individual_media.zip
✔️ Ensured target directory exists: /content/batchup
✅ Successfully copied 'posts_package_dc_format_individual_media.zip' to '/content/batchup/posts.zip'

⬇️ Click the link below to download 'posts.zip':


(If download doesn't start automatically, check your browser's download manager or pop-up blocker.)
The file is located at: /content/batchup/posts.zip in the Colab environment.
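
Before uploading the package, it can be worth confirming that the archive opens cleanly and contains the files you expect. The following is a minimal sketch using only the standard library. `missing_from_zip` is a hypothetical helper, and the in-memory archive stands in for `posts.zip` so the example runs anywhere.

```python
import io
import zipfile


def missing_from_zip(zip_source, expected_names):
    """Return the expected member names that are absent from the archive."""
    with zipfile.ZipFile(zip_source) as zf:
        # testzip() reads every member and returns the first corrupt name, or None.
        corrupt = zf.testzip()
        if corrupt is not None:
            raise zipfile.BadZipFile(f"Corrupt member: {corrupt}")
        present = set(zf.namelist())
    return sorted(set(expected_names) - present)


# Hypothetical in-memory archive standing in for posts.zip.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("README.txt", "demo")
    zf.writestr("posts_metadata_dc_format_individual_media.xlsx", b"placeholder")
buffer.seek(0)

missing = missing_from_zip(
    buffer,
    ["README.txt", "posts_metadata_dc_format_individual_media.xlsx", "example.jpg"],
)
print(missing)
```

In the notebook itself, you would pass `target_zip_path` as the source and the `fulltext_url` values from the spreadsheet (plus `README.txt` and the Excel filename) as the expected names.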
