## Modifying ZIM Files

#### The Larger Picture
* Kiwix scrapes many useful sources, but sometimes the chunks are too big for IIAB.
* Using the zimdump program, the highly compressed ZIM files can be flattened into a file tree, modified, and then re-packaged as a ZIM file.
* This notebook is a collection of tools that filter the content into a new, smaller ZIM, selecting videos by YouTube views per year.


#### How to Use this notebook
* The zimdump program (at https://github.com/openzim/zim-tools) needs to be compiled from source.
* A bash script makes it easy to compile zim-tools (which contains zimdump) on Ubuntu 20.04. Compilation instructions are also available at the GitHub URL above. In a terminal, do the following:

```
cd /opt/iiab/iiab-factory/content/kiwix/generic/ 
sudo ./install-zim-tools.sh

```
* This ```zimfilter``` program can be set up in a Python virtual environment using a role in the iiab-factory repository:

```
sudo ./runrole youtube
```

* **Some conventions**: Jupyter does not want to run as root. We will create a file structure that exists in the user's home directory, so the application will be able to write all the files it needs to function.
```
<PREFIX><project name>
├── default_config.yaml
├── new-zim
├── output_tree
├── proof
├── tree
├── working
└── zim-src
```
In general terms, this program will dump the ZIM data into "tree", modify it, gather additional data into "working", copy the desired videos to "output_tree", and create a ZIM file in "new-zim". After the new ZIM is created, it is re-dumped to "proof".
* For testing purposes, the user will need to link from the server's document root to their home directory (so that the nginx http server in IIAB will serve the candidate in "tree"):

```
cd
mkdir -p zims
ln -s /home/<user name>/zims /library/www/html/zims
```
**Note**: At the bottom of this notebook there is information about installing jupyterhub on VirtualBox.

#### Declare input
* We let the Admin Console download the source ZIM to /library/zims/content
* There may be other ZIM files in /library/zims/content.
* Choose a string in the ZIM file you want to process that is not contained in any other ZIMs in /library/zims/content and write that string to "current_project".
* This string will be used to create a set of folders as described above.
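
The uniqueness requirement above can be checked before running the notebook. The following is a minimal sketch, assuming the ZIM file names are available as a plain list; `matching_zims` is a hypothetical helper that is not part of the notebook, and the second file name in the listing is made up for illustration.

```python
def matching_zims(filenames, project_string):
    """Return the ZIM file names that contain project_string."""
    return [name for name in filenames
            if name.endswith('.zim') and project_string in name]

# Hypothetical directory listing, for illustration only
listing = [
    'ted_en_playlist-9-trippy-ted-talks_2021-01.zim',
    'wikipedia_en_medicine_2021-01.zim',
]

matches = matching_zims(listing, 'trippy')
if len(matches) != 1:
    print('Pick a more specific string; matches: %s' % matches)
else:
    print('Unique match: %s' % matches[0])
```

A string such as `en` would match both files in this listing, so it would not be a usable project string.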

In [1]:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import os,sys
import json
import youtube_dl
import pprint as pprint
from types import SimpleNamespace
import subprocess
from ruamel.yaml import YAML 
from pprint import pprint

# Get the current project from its pointer in iiab-factory repo
FACTORY_REPO = '/opt/iiab/iiab-factory'
PREFIX = '/library/www/html/zims'
#PREFIX = '/ext/zims'

current_project = FACTORY_REPO + '/content/kiwix/zim-filter/current_project'
if not os.path.isfile(current_project):
    print(f'\"current_project\" file is missing: {current_project}')
    sys.exit(1)
with open(current_project,'r') as fp:
    project_name = fp.read().strip().split('/')
    if len(project_name) > 0:
        prefix = project_name[:-1]
        project_name = project_name[-1]
lookfor = f"{PREFIX}/{project_name}/default_config.yaml"
dflt_cfg = f'{FACTORY_REPO}/content/kiwix/zim-filter/default_filter.yaml'
yml = YAML()
if not os.path.isfile(lookfor):
    with open(dflt_cfg,'r') as fp:
        cfg = yml.load(fp)
    cfg['PREFIX'] = PREFIX
    cfg['PROJECT_NAME'] = project_name
    if not os.path.isdir(PREFIX + '/' + project_name):
       os.makedirs(PREFIX + '/' + project_name)
    with open(lookfor,'w') as newfp:
        yml.dump(cfg,newfp) 
else:
    with open(lookfor,'r') as fp:
        cfg = yml.load(fp)

PROJECT_NAME = cfg['PROJECT_NAME']
SOURCE_URL = cfg['SOURCE_URL']
CACHE_DIR = PREFIX + '/youtube/cache'
if not os.path.isdir(CACHE_DIR):
   os.makedirs(CACHE_DIR)
TARGET_SIZE = cfg['TARGET_SIZE']  #10GB

# The rest of the paths are computed and represent the standard layout
WORKING_DIR = PREFIX + '/' + PROJECT_NAME + '/working'
PROJECT_DIR = PREFIX + '/' + PROJECT_NAME + '/tree'
OUTPUT_DIR = PREFIX + '/' + PROJECT_NAME + '/output_tree'
SOURCE_DIR = PREFIX + '/' + PROJECT_NAME + '/zim-src'
NEW_ZIM_DIR = PREFIX + '/' + PROJECT_NAME + '/new-zim'
PROOF_DIR = PREFIX + '/' + PROJECT_NAME + '/proof'
dir_list = ['working','output_tree','tree','../youtube/cache/video_json','zim-src','new-zim','proof']
for f in dir_list: 
    if not os.path.isdir(PREFIX + '/' + PROJECT_NAME +'/' + f):
       os.makedirs(PREFIX + '/' + PROJECT_NAME +'/' + f)

# If this is not first filter of this zim, let cfg file specify source 
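if SOURCE_URL != '':
    # wget saves the file into zim-src, so ZIM_PATH must be the local copy, not the URL
    ZIM_PATH = os.path.join(SOURCE_DIR, os.path.basename(SOURCE_URL))
    cmd = 'wget -P %s %s'%(SOURCE_DIR,SOURCE_URL)
    print('command:%s'%cmd)
    subprocess.run(cmd,shell=True)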
else:
    # pick the downloaded ZIM which contains <project> string       
    zim_path_contents = os.listdir('/library/zims/content')
    for zim in zim_path_contents:
        if zim.find(PROJECT_NAME) != -1:
            zim_file = '/library/zims/content/%s'%(zim)
            if not os.path.exists(f'{SOURCE_DIR}/{zim}'):
                os.symlink(zim_file,f'{SOURCE_DIR}/{zim}')
            ZIM_PATH = f'{SOURCE_DIR}/{zim}'
# abort if the input file cannot be found
if not os.path.exists(ZIM_PATH):
    print('%s path not found. Quitting. . .'%ZIM_PATH)
    sys.exit(1)


In [2]:
print(f'{PREFIX},{PROJECT_DIR},{project_name}')

/library/www/html/zims,/library/www/html/zims/trippy/tree,trippy


In [3]:
# The following command will zimdump to the "tree" directory
#  Despite the name, removing namespaces seems unnecessary, and more complex
# It will return without doing anything if the "tree" is not empty
print('Using zimdump to expand the zim file to %s'%PROJECT_DIR)
progname = '%s/content/kiwix/zim-filter/de-namespace.sh'%(cfg['FACTORY_REPO'])
cmd = "%s %s %s"%(progname,ZIM_PATH,PREFIX + '/' + PROJECT_NAME)
print('command:%s'%cmd)
subprocess.run(cmd,shell=True,capture_output=True)


Using zimdump to expand the zim file to /library/www/html/zims/trippy/tree
command:/opt/iiab/iiab-factory/content/kiwix/zim-filter/de-namespace.sh /library/www/html/zims/trippy/zim-src/ted_en_playlist-9-trippy-ted-talks_2021-01.zim /library/www/html/zims/trippy


CompletedProcess(args='/opt/iiab/iiab-factory/content/kiwix/zim-filter/de-namespace.sh /library/www/html/zims/trippy/zim-src/ted_en_playlist-9-trippy-ted-talks_2021-01.zim /library/www/html/zims/trippy', returncode=0, stdout=b'', stderr=b"+ set -e\n+ '[' 2 -lt 2 ']'\n+ '[' '!' -f /library/www/html/zims/trippy/zim-src/ted_en_playlist-9-trippy-ted-talks_2021-01.zim ']'\n++ ls /library/www/html/zims/trippy/tree\n++ wc -l\n+ contents=0\n+ '[' 0 -ne 0 ']'\n+ rm -rf /library/www/html/zims/trippy/tree\n+ mkdir -p /library/www/html/zims/trippy/tree\n+ echo 'This de-namespace file reminds you that this folder will be overwritten?'\n+ zimdump dump --dir=/library/www/html/zims/trippy/tree /library/www/html/zims/trippy/zim-src/ted_en_playlist-9-trippy-ted-talks_2021-01.zim\n+ mv /library/www/html/zims/trippy/tree/I/assets /library/www/html/zims/trippy/tree/I/favicon.png /library/www/html/zims/trippy/tree/I/videos /library/www/html/zims/trippy/tree\n+ '[' -d I ']'\n+ cp -rp /library/www/html/zims/t

* The next step is a manual one that you will need to do with your browser: verify that the namespace directories were removed and that all of the html links have been adjusted correctly. Point your browser to <hostname>/zims/\<PROJECT_NAME\>/tree.
* If everything is working, it's time to go fetch the information about each video from youtube.

#### Unfortunate choices by Kiwix
* The first few youtube ZIMs I examined used the 11 character Youtube_id as the directory name where the video was stored, and also the link in the ```data.js``` file which links categories to the videos themselves.
* But some ZIMs use random numbers (of unknown origin) as the video identity and link.
* We need the actual Youtube_id in order to get the views per year, which we use to select which videos to include in the output ZIM. So we use a search function in the YouTube Google API to look up the unique 11 character id.
* Google charges 100 units of funny money to do a search, and allocates 10000 units to a developer per day. This means that I can do 100 searches per day. Obviously, search results need to be recorded and accumulated.
* This Kiwix decision adds a lot of complexity to the zim-filter process. I will try to compartmentalize this complexity into a mostly self-contained set of cells.
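
The quota arithmetic above can be sketched as follows. The two constants are the figures quoted in the text (actual quotas may change over time), and `searches_per_day` and `days_needed` are hypothetical helper names used only for this illustration.

```python
# Figures quoted in the text above; treat them as assumptions
DAILY_QUOTA_UNITS = 10000
SEARCH_COST_UNITS = 100

def searches_per_day(quota=DAILY_QUOTA_UNITS, cost=SEARCH_COST_UNITS):
    """Number of searches that fit into one day's quota."""
    return quota // cost

def days_needed(uncached_ids, per_day=None):
    """Days of quota needed to look up every id that is not yet cached."""
    if per_day is None:
        per_day = searches_per_day()
    return -(-uncached_ids // per_day)  # ceiling division

print(searches_per_day())
print(days_needed(250))
```

This is why the cells below persist every search result: a ZIM with a few hundred unmatched ids takes several days of quota, and a cached id never has to be searched again.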

In [4]:
# Save and restore youtube_id's corresponding to Kiwix's arbitrary ids
lookup_yt_id = {}
def recall_youtube_ids():
    global lookup_yt_id
    if not os.path.exists(CACHE_DIR + '/yt_id_from_kiwix_id'):
        with open(CACHE_DIR + '/yt_id_from_kiwix_id','w') as fp:
            # write an empty JSON object (not the string '{}') so json.loads returns a dict
            fp.write(json.dumps({}))
    with open(CACHE_DIR + '/yt_id_from_kiwix_id','r') as fp:
        lookup_yt_id = json.loads(fp.read())
        
def save_youtube_ids():
    with open(CACHE_DIR + '/yt_id_from_kiwix_id','w') as fp:
        fp.write(json.dumps(lookup_yt_id,indent=2))
    

#### Pick the best Youtube answer to get the best Youtube_id
* Youtube returns many items that may differ in small ways from the search query submitted. We need to use fuzzy logic to pick the best match. 
* The longest string for match may be best (maybe description). 
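
To illustrate the idea of picking the closest match, here is a minimal sketch that scores candidates with the standard library's `difflib.SequenceMatcher` instead of the `fuzzywuzzy` package used in the next cell. The scoring is therefore similar in spirit but not identical, and `best_match` and the candidate titles (other than the first) are hypothetical.

```python
from difflib import SequenceMatcher

def best_match(query, candidates):
    """Return (candidate, score) for the candidate most similar to query.

    Scores are on a 0-100 scale; returns (None, 0) for an empty list.
    """
    best, best_score = None, 0
    for cand in candidates:
        ratio = SequenceMatcher(None, query.lower(), cand.lower()).ratio()
        score = round(100 * ratio)
        if score > best_score:
            best, best_score = cand, score
    return best, best_score

# Hypothetical search results, for illustration only
titles = [
    'We need to talk about an injustice | Bryan Stevenson',
    'Bryan Stevenson: interview',
    'Talking about justice',
]
query = 'We need to talk about an injustice'
print(best_match(query, titles))
```

A longer query string (such as a description) gives the matcher more to work with, which is the reasoning behind the second bullet above.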

In [5]:
# Use the youtube API to get Id from string (description)
from fuzzywuzzy import fuzz
from googleapiclient.discovery import build
import googleapiclient.errors

api_key = os.environ['API_KEY']
os.environ["OAUTHLIB_INSECURE_TRANSPORT"] = "1"
api_service_name = "youtube"
api_version = "v3"
youtube = build(api_service_name, api_version, developerKey=api_key)

def search_youtube(kiwix_id,project_name,description):
    request = youtube.search().list(
        part="id,snippet",
        type='video',
        maxResults=10,
        q=description
    )
    max_item = {}
    max_value = 0
    videoid = ''
    response = request.execute()
    #pprint(response)
    for item in response['items']:
        value = fuzz.ratio(item['snippet']['description'], description)
        if value > max_value:
            max_value = value
            max_item = item
    if max_item:
        videoid = max_item['id']['videoId']
        lookup_yt_id[kiwix_id] = {'project_name':project_name,
                                  'youtube_id':videoid}
        save_youtube_ids()
    return videoid

print(search_youtube(10,'test','We need to talk about an injustice | Bryan Stevenson'))



IKdwdPk_scU


In [6]:
def get_assets_data():
    # the file <root>/assets/data.js holds the category to video mappings
    outstr = '['
    data = {}
    with open(PROJECT_DIR + '/assets/data.js', 'r') as fp:
        line = fp.read()
        if line[-1:] != ']':
            line += ']'
        if line.startswith('json_data'):
            try:
                data = json.loads(line[11:])
            except Exception:
                print('startswith json_data parse error:%s'%line[11:])
                sys.exit(1)
        else:
            with open(PROJECT_DIR + '/assets/data.js', 'r') as fp:
                line = fp.readline()
                while True:
                    if line.startswith('var') or not line :
                        if len(outstr) > 3:
                            # clip off the trailing semicolon
                            outstr = outstr[:-2]
                            try:
                                data[category] = json.loads(outstr)
                            except Exception:
                                print('Parse error: %s'%outstr)
                                sys.exit(1)
                        category = line[9:-4]
                        outstr = '['
                        if not line: break
                    else:
                        outstr += line
                    line = fp.readline()
    return data

In [7]:
def fix_ids(data):
    # input data parameter is either a list or dict drawn from assets/data.js
    # on return the youtube_id variable in data has valid youtube_id
    global lookup_yt_id
    recall_youtube_ids()  # read the stored values
    # if id is 11 char, use as youtube_id, otherwise look up from title
    if isinstance(data,list):
        for index in range(len(data)):
            if len(data[index]['id']) == 11:
                data[index]['youtube_id'] = data[index]['id']
            else:
                # videos in more than one category, don't look up more than once
                kiwix_id = data[index]['id']
                yt = lookup_yt_id.get(kiwix_id,'')
                if yt == '':
                    yt = search_youtube(kiwix_id,PROJECT_NAME,data[index]['description'][0]['text'])
                else: # we may have more than one with this kiwix id
                    if lookup_yt_id[kiwix_id]['project_name'] != PROJECT_NAME:
                        # object and quit
                        print('There are more than one kiwix ids with the value:%s'%kiwix_id)
                        print('You need to delete the file at zims/youtube/cache/yt_id_from_kiwix_id and start over')
                        sys.exit(1)
                    else:
                        yt = lookup_yt_id[kiwix_id]['youtube_id']
                data[index]['youtube_id'] = yt
    else: # data must be a dictionary of categories, each holding a list of videos
        for cat in data:
            for video in data[cat]:
                if len(video['id']) == 11:
                    video['youtube_id'] = video['id']
                else:
                    # videos in more than one category, don't look up more than once
                    entry = lookup_yt_id.get(video['id'])
                    if entry:
                        yt = entry['youtube_id']
                    else:
                        yt = search_youtube(video['id'],PROJECT_NAME,video['title'])
                    video['youtube_id'] = yt
    #pprint(data) 
    return data

In [8]:
data_js = get_assets_data()
#print(json.dumps(data_js,indent=2))
fix_ids(data_js) # Does a youtube search if id's are not youtube_ids
def get_zim_data(kiwix_id):
    rtn_dict = {}
    if isinstance(data_js,list):
        for video in range(len(data_js)):
            if data_js[video]['id'] == kiwix_id:
                rtn_dict = data_js[video]
                break
        return rtn_dict
        
    else:
        for cat in  iter(data_js.keys()):
            for video in range(len(data_js[cat])):
                if data_js[cat][video]['id'] == kiwix_id:
                    rtn_dict = data_js[cat][video]
                    break
            if len(rtn_dict) > 0: 
                break
        return rtn_dict

ans = get_zim_data('usdJgEwMinM')
#print(json.dumps(ans,indent=2))

In [9]:
ydl = youtube_dl.YoutubeDL()
print('Downloading metadata from Youtube')
downloaded = 0
skipped = 0
# Create a list of video id's (which may not be youtube_ids)
kiwix_id_list = os.listdir(PROJECT_DIR + '/videos/')
# And we also need a list of actual youtube ids.
yt_id_list = []
for id in iter(kiwix_id_list):
    if len(id) == 11:
        yt_id = id
    else:
        # The global lookup_yt_id contains the persistent youtube search result
        yt_id = lookup_yt_id[id]['youtube_id']
    yt_id_list.append(yt_id)
    if os.path.exists(CACHE_DIR + '/video_json/' + yt_id + '.json'):
        # skip over items that are already downloaded
        skipped += 1
        continue
    with ydl:
       result = ydl.extract_info(
                'http://www.youtube.com/watch?v=%s'%yt_id,
                download=False # We just want to extract the info
                )
       downloaded += 1

    with open(CACHE_DIR + '/video_json/' + yt_id + '.json','w') as fp:
        fp.write(json.dumps(result))
    #pprint.pprint(result['upload_date'],result['view_count'])
print('%s skipped and %s downloaded'%(skipped,downloaded))

Downloading metadata from Youtube
11 skipped and 0 downloaded


#### Playlist Navigation to Videos
* On the home page that shows when viewing a ZIM, there is a drop-down selector which lists about 70 categories (or playlists).
* The value from that drop-down is used to pick an entry in "-/assets/data.js", which in turn specifies the YouTube IDs that are displayed when a selection is made.
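
The category-to-video mapping can be pictured with a small parser. This is a minimal sketch, assuming `data.js` consists of statements of the form `var json_<category> = [...];` (the form that the notebook's own cells read and write); `parse_data_js` is a hypothetical helper, the sample content is made up, and the regular expression would not cope with a `];` sequence inside a string value.

```python
import json
import re

def parse_data_js(text):
    """Parse 'var json_<category> = [...];' statements into a dict of lists."""
    pattern = re.compile(r'var\s+json_(\S+)\s*=\s*(\[.*?\])\s*;', re.DOTALL)
    categories = {}
    for name, body in pattern.findall(text):
        categories[name] = json.loads(body)
    return categories

# Made-up sample in the assumed data.js layout
sample = '''var json_music = [
 {"id": "2366", "title": "A"},
 {"id": "24", "title": "B"}
];
var json_art = [
 {"id": "24", "title": "B"}
];
'''

data = parse_data_js(sample)
print({cat: [v['id'] for v in vids] for cat, vids in data.items()})
```

Note that the same video id can appear in more than one category, which is why the id lookup caches its results.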

#### The following cells are subroutines and can be left minimized

In [10]:
from pprint import pprint
from pymediainfo import MediaInfo

def mediainfo_dict(path):
    try:
        minfo = MediaInfo.parse(path)
    except Exception:
        print('mediainfo_dict. file not found: %s'%path)
        return {}
    return minfo.to_data()

def select_info(path):
    global data
    data = mediainfo_dict(path)
    rtn = {}
    infotrack = data.get('tracks',0)
    if infotrack == 0:
        return {}
    for index in range(len(infotrack)):
        #if index
        track = data['tracks'][index]
        if track['kind_of_stream'] == 'General':
            rtn['file_size'] = track.get('file_size',0)
            rtn['bit_rate'] = track.get('overall_bit_rate',0)
            rtn['time'] = track['other_duration'][0]
        if track['kind_of_stream'] == 'Audio':
            rtn['a_stream'] = track.get('stream_size',0)
            rtn['a_rate'] = track.get('maximum_bit_rate',0)
            rtn['a_channels'] = track.get('channel_s',0)
        if track['kind_of_stream'] == 'Video':
            rtn['v_stream'] = track.get('stream_size',0)
            rtn['v_format'] = track['other_format'][0]
            rtn['v_rate'] = track.get('bit_rate',0)
            rtn['v_frame_rate'] = track.get('frame_rate',0)
            rtn['v_width'] = track.get('width',0)
            rtn['v_height'] = track.get('height',0)
    return rtn

In [11]:
import sqlite3
class Sqlite():
   def __init__(self, filename):
      self.conn = sqlite3.connect(filename)
      self.conn.row_factory = sqlite3.Row
      self.conn.text_factory = str
      self.c = self.conn.cursor()
    
   def __del__(self):
      self.conn.commit()
      self.c.close()
      del self.conn

def get_video_json(path):
    with open(path,'r') as fp:
        try:
            jsonstr = fp.read()
            #print(path)
            modules = json.loads(jsonstr.strip())
        except Exception as e:
            print(e)
            print(jsonstr[:80])
            return {}
    return modules

def video_size(kiwix_id):
    # after de-namespacing, videos live directly under tree/videos
    return os.path.getsize(PROJECT_DIR + '/videos/' + kiwix_id + '/video.webm')

def make_directory(path):
    if not os.path.exists(path):
        os.makedirs(path)

import requests

def download_file(url,todir):
    # Stream the download so large files are not held in memory
    local_filename = url.split('/')[-1]
    r = requests.get(url, stream=True)
    r.raise_for_status()
    with open(todir + '/' + local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=512 * 1024):
            if chunk:
                f.write(chunk)
    
from datetime import datetime
def age_in_years(upload_date):
    uploaded_dt = datetime.strptime(upload_date,"%Y%m%d")
    now_dt = datetime.now()
    days_delta = now_dt - uploaded_dt
    years = days_delta.days/365 + 1
    return years

#### Create a sqlite database which collects Data about each Video
* We've already downloaded the data from YouTube for each video, so extract the items that are interesting to us, such as size, date uploaded to YouTube, and view count.
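
The derived value that drives the selection is views per year. Here is a minimal sketch of that calculation, mirroring the age rule used by the notebook's `age_in_years` (days since upload divided by 365, plus one, then rounded). The `views_per_year` function and its `now` parameter are hypothetical, added so the result can be reproduced for a fixed date.

```python
from datetime import datetime

def views_per_year(view_count, upload_date, now=None):
    """upload_date is a 'YYYYMMDD' string; age is at least one year."""
    now = now or datetime.now()
    uploaded = datetime.strptime(upload_date, '%Y%m%d')
    age = max(1, round((now - uploaded).days / 365 + 1))
    return view_count // age

# Assumed evaluation date, chosen only for this example
print(views_per_year(18, '20180106', now=datetime(2021, 1, 19)))
```

Adding one to the age keeps a video uploaded last month from being divided by a number close to zero.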

In [12]:
def initialize_db():
    sql = 'CREATE TABLE IF NOT EXISTS video_info ('\
            'yt_id TEXT UNIQUE, zim_size INTEGER, view_count INTEGER, age INTEGER, '\
            'views_per_year INTEGER, upload_date TEXT, duration TEXT, '\
            'height INTEGER, width INTEGER,'\
            'bit_rate TEXT, format TEXT, '\
            'average_rating REAL,slug TEXT, title TEXT, kiwix_id TEXT)'
    db.c.execute(sql)

print('Creating/Updating a Sqlite database with information about the Videos in this ZIM.')
print(WORKING_DIR)
db = Sqlite(WORKING_DIR + '/zim_video_info.sqlite')
initialize_db()
sql = 'select count() as num from video_info'
db.c.execute(sql)
row = db.c.fetchone()
if row[0] == len(kiwix_id_list):
    print('skipping update of sqlite database. Number of records equals number of videos')
else:
    for kiwix_id in iter(kiwix_id_list):
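        # some defaults
        age = 0
        views_per_year = 1
        duration = "not found"
        bit_rate = v_format = v_height = v_width = ""
        # an 11 character id is already a youtube id; otherwise use the cached search result
        if len(kiwix_id) == 11:
            yt_id = kiwix_id
        else:
            yt_id = lookup_yt_id[kiwix_id]['youtube_id']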
        # fetch data from assets/data.js
        zim_data = get_zim_data(kiwix_id)
        if len(zim_data) == 0: 
            print('get_zim_data returned no data for %s'%kiwix_id)
        #pprint(zim_data)
        slug = zim_data.get('slug','')

        # We already have youtube data for every video, use it 
        data = get_video_json(CACHE_DIR + "/video_json/" + yt_id + '.json')
        if len(data) == 0:
            print('get_video_json returned no data for %s'%yt_id)
        vsize = data.get('filesize',0)
        view_count = data.get('view_count',0)
        upload_date = data.get('upload_date','')
        average_rating = data.get('average_rating',0)
        title = data.get('title','unknown title')
        # calculate the views_per_year since it was uploaded
        if upload_date != '':
            age = round(age_in_years(upload_date))
            views_per_year = int(view_count / age)

        # interrogate the video itself
        filename = PROJECT_DIR + '/videos/' + kiwix_id + '/video.webm'
        if os.path.isfile(filename):
            vsize = os.path.getsize(filename)
            #print('vsize:%s'%vsize)
            selected_data = select_info(filename)
            if len(selected_data) == 0:
                duration = "not found"
                bit_rate = "" 
                v_format = ""
                v_height = ""
                v_width = ""
            else:
                duration = selected_data['time']
                bit_rate = selected_data['bit_rate']
                v_format = selected_data['v_format']
                v_height = selected_data['v_height']
                v_width = selected_data['v_width']

        # column names: yt_id,zim_size,view_count,age,views_per_year,upload_date,duration,
        #         height,width,bit_rate,format,average_rating,slug,title,kiwix_id
        sql = 'INSERT OR REPLACE INTO video_info VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)'
        db.c.execute(sql,[yt_id,vsize,view_count,round(age),views_per_year,upload_date, \
                          duration,v_height,v_width,bit_rate,v_format,average_rating,slug,title,kiwix_id ])
    db.conn.commit()
    print(yt_id,vsize,view_count,views_per_year,upload_date, \
                          duration,bit_rate,v_format,average_rating,slug,round(age))

Creating/Updating a Sqlite database with information about the Videos in this ZIM.
/library/www/html/zims/trippy/working
DOW2YTnzQBI 44957833 18 4 20180106 13 min 2 s 459434 VP8 0.0 a-scientific-approach-to-the-paranormal 4


#### Select the cutoff using view count and total size
* Order the videos by view count per year, then find the row where the running total of sizes reaches the target size.
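
The cutoff logic can be sketched independently of the database. This is a minimal illustration, assuming each video is a `(views_per_year, size_in_bytes)` pair; `boundary_views_per_year` here is a hypothetical function (the notebook computes the same-named value inline), and the sample numbers are made up.

```python
def boundary_views_per_year(videos, target_size):
    """Walk videos from most to least viewed, summing sizes.

    Returns the views_per_year of the first video that pushes the running
    total past target_size, or 0 if everything fits.
    """
    total = 0
    for vpy, size in sorted(videos, key=lambda v: v[0], reverse=True):
        total += size
        if total > target_size:
            return vpy
    return 0

# Made-up (views_per_year, size) pairs
videos = [(120, 40), (954000, 30), (32800, 20), (178000, 15)]

boundary = boundary_views_per_year(videos, target_size=50)
wanted = [v for v in videos if v[0] > boundary]
print(boundary, wanted)
```

Selecting with a strict `>` comparison excludes the video that crossed the target, so the kept videos stay within the target size.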

In [13]:
import pandas as pd
from IPython.display import display 
global tot_sum

def human_readable(num):
    # return 3 significant digits and unit specifier
    num = float(num)
    units = [ '','K','M','G']
    for i in range(4):
        if num<10.0:
            return "%.2f%s"%(num,units[i])
        if num<100.0:
            return "%.1f%s"%(num,units[i])
        if num < 1000.0:
            return "%.0f%s"%(num,units[i])
        num /= 1024.0

sql = 'select title,zim_size,views_per_year,view_count,duration,upload_date,'\
       'format,width,height,bit_rate from video_info order by views_per_year desc'
tot_sum = 0
db.c.execute(sql)
rows = db.c.fetchall()
row_list = []
boundary_views_per_year = 0
for row in rows:
    tot_sum += row['zim_size']
    row_list.append([row['title'][:60],human_readable(row['zim_size']),\
                              human_readable(tot_sum),human_readable(row['view_count']),\
                              human_readable(row['views_per_year']),\
                              row['upload_date'],row['duration'],row['bit_rate']])
    if tot_sum > TARGET_SIZE and boundary_views_per_year == 0:
        boundary_views_per_year = row['views_per_year']
print('%60s %6s %6s %6s %6s %8s %8s'%('Name','Size','Sum','Views','Views','Date  ','Duration'))
print('%60s %6s %6s %6s %6s %8s %8s'%('','','','','/ yr','',''))
tot_sum = 0
for row in rows:
    tot_sum += row['zim_size']
    print('%60s %6s %6s %6s %6s %8s %8s'%(row['title'][:60],human_readable(row['zim_size']),\
                              human_readable(tot_sum),human_readable(row['view_count']),\
                              human_readable(row['views_per_year']),\
                              row['upload_date'],row['duration']))
#df = pd.read_sql(sql,db.conn)
#df = pd.DataFrame(row_list,columns=['Name','Size','Sum','Views','Views','Date','Duration','Bit Rate'])
#display(df)

                                                        Name   Size    Sum  Views  Views   Date   Duration
                                                                                    / yr                  
    Reggie Watts disorients you in the most entertaining way  32.7M  32.7M  9.32M   954K 20120525 9 min 40 s
                                    Ze Frank: Are you human?  14.2M  46.9M  1.39M   178K 20140718 4 min 34 s
                        Your body is my canvas | Alexa Meade  22.7M  69.5M   296K  32.8K 20130906 7 min 4 s
A Musical Escape Into a World of Light and Color | Kaki King  47.0M   116M   147K  24.5K 20151203 11 min 35 s
             Psychedelic Science | Fabian Oefner | TED Talks  43.5M   160M   201K  22.3K 20131003 12 min 5 s
                       Bruno Maisonnier: Dance, tiny robots!  12.7M   173M   120K  13.3K 20130226 3 min 6 s
Quixotic Fusion dancers set to Serenity - HYPE RMX  feat Lau  37.8M   211M    841    120 20150410 12 min 19 s
              The first

* Now determine the video ID's that we want in our new zim

In [14]:
print('We will include videos with views_per_year greater than %s'%boundary_views_per_year)
wanted_ids = []
sql = 'SELECT kiwix_id, title from video_info where views_per_year > ?'
db.c.execute(sql,[boundary_views_per_year,])
rows = db.c.fetchall()
for row in rows:
    wanted_ids.append(row['kiwix_id'])

#with open(HOME + '/zims/' + PROJECT_NAME + '/wanted_list.csv','w') as fp:
#    for row in rows:
#        fp.write('%s,%s\n'%(row['yt_id'],row['title'],))

We will include videos with views_per_year greater than 0


In [15]:
wanted_ids

['2366',
 '2049',
 '1512',
 '1834',
 '1458',
 '1677',
 '1247',
 '1814',
 '24',
 '1464',
 '2702']

In [16]:
import shutil
# copy the default top level directories (these were in the zim's "-" directory )
print('Copying wanted folders and Videos to %s'%OUTPUT_DIR)
cpy_dirs = ['assets','cache','channels']
for d in cpy_dirs:
    shutil.rmtree(os.path.join(OUTPUT_DIR,d),ignore_errors=True)
    os.makedirs(os.path.join(OUTPUT_DIR,d))
    src = os.path.join(PROJECT_DIR,d)
    dest = os.path.join(OUTPUT_DIR,d)
    if os.path.exists(src):
        shutil.copytree(src,dest,dirs_exist_ok=True, symlinks=True)

Copying wanted folders and Videos to /library/www/html/zims/trippy/output_tree


In [17]:
# Copy the videos selected by the wanted_ids list to output file
import shutil
for f in wanted_ids:
    if not os.path.isdir(os.path.join(OUTPUT_DIR,'videos',f)):
        os.makedirs(os.path.join(OUTPUT_DIR,'videos',f))
        src = os.path.join(PROJECT_DIR,'videos',f)
        dest = os.path.join(OUTPUT_DIR,'videos',f)
        shutil.copytree(src,dest,dirs_exist_ok=True)

In [18]:
#  Copy the files in the root directory
import shutil
for kiwix_id in wanted_ids:
    map_index_to_slug = get_zim_data(kiwix_id)
    if len(map_index_to_slug) > 0:
        title = map_index_to_slug.get('slug','')
        if title == '':
            title = kiwix_id
        src = os.path.join(PROJECT_DIR,title)
        dest = OUTPUT_DIR + '/' + title
        # copy only when the source exists and has not been copied already
        if os.path.isfile(src) and not os.path.isfile(dest):
            shutil.copyfile(src,dest)
        else:
            print('src:%s'%src)

In [19]:
# There are essential files that are needed in the zim
needed = ['/favicon.png','/favicon.jpg','/home.html',\
          '/index','/profile.jpg','/index.html']
for f in needed:
    if os.path.exists(PROJECT_DIR  + f):
        cmd = '/bin/cp %s %s'%(PROJECT_DIR  + f,OUTPUT_DIR)
        subprocess.run(cmd,shell=True)

In [20]:
# Grab the meta data from the original zim "M" directory 
#   and create a script for zimwriterfs
def get_file_value(path):
    with open(path,'r') as fp:
        
        try:
            return fp.read()
        except:
            return ""
meta_data ={}
meta_file_names = os.listdir(PROJECT_DIR + '/M/')
for f in meta_file_names:
    meta_data[f] = get_file_value(PROJECT_DIR + '/M/' + f)
pprint(meta_data)
    



{'Counter': 'video/webm=11;image/webp=22;text/vtt=356;text/html=13;application/font-sfnt=6;text/css=6;application/javascript=83;text/plain=8;application/octet-stream=15;application/x-shockwave-flash=1;text/markdown=1;image/svg+xml=1;application/json=41;image/png=5',
 'Creator': 'TED',
 'Date': '2021-01-19',
 'Description': 'These surprising, slightly psychedelic talks are better than '
                'staring at a blacklight poster.',
 'IndexLanguage': 'eng',
 'Language': 'eng',
 'Name': 'ted_en_playlist-9-trippy-ted-talks',
 'Publisher': 'Kiwix',
 'Scraper': 'ted2zim 2.0.8',
 'Tags': '_category:ted;ted;_videos:yes',
 'Title': '11 trippy TED Talks'}


In [21]:
# Write a new mapping from categories to videos (with some removed)
print('Creating a new mapping from Categories to videos within each category.')
outstr = ''
if isinstance(data_js,list):
    json_data = []  
    outstr += 'json_data = ['
    for index in range(len(data_js)):
        item_dict = data_js[index]
        if not item_dict['id'] in wanted_ids: continue
        outstr += json.dumps(item_dict,indent=2)
        outstr += ','
    outstr = outstr[:-1] + ']'
else:
    # data_js is the category dictionary returned by get_assets_data()
    for cat in data_js:
        outstr += 'var json_%s = [\n'%cat
        for video in range(len(data_js[cat])):
            if data_js[cat][video].get('id','') in wanted_ids:
                outstr += json.dumps(data_js[cat][video],indent=1)
                outstr += ','
        outstr = outstr[:-1]
        outstr += '];\n'
with open(OUTPUT_DIR + '/assets/data.js','w') as fp:
    fp.write(outstr)
    

Creating a new mapping from Categories to videos within each category.


In [22]:
print('Creating a new ZIM and Indexing it')

from pathlib import Path
from zimscraperlib.zim import make_zim_file
from glob import glob
from datetime import datetime

original_name = glob("%s/*.zim"%(SOURCE_DIR))
fname = os.path.basename(original_name[0].replace('all','top'))
print('fname:%s'%fname)
#sys.exit(1)

os.chdir(OUTPUT_DIR)
if os.path.exists(OUTPUT_DIR + '/favicon.png'):
    favicon_fn = 'favicon.png'
else:
    favicon_fn = 'favicon.jpg'
if not os.path.isfile(os.path.join(NEW_ZIM_DIR,fname)):
    make_zim_file(
        build_dir=Path(OUTPUT_DIR),
        fpath=Path(NEW_ZIM_DIR) / fname,
        name=fname,
        main_page= "home.html",
        favicon=favicon_fn,
        title=meta_data['Title'],
        description=meta_data['Description'],
        language=meta_data['Language'],
        creator=meta_data['Creator'],
        publisher="Internet In A Box",
        tags=meta_data['Tags'],
        scraper=meta_data['Scraper'],
    )


Creating a new ZIM and Indexing it
fname:ted_en_playlist-9-trippy-ted-talks_2021-01.zim


RuntimeError: Traceback (most recent call last):
  File "libzim/wrapper.pyx", line 121, in libzim.wrapper.blob_cy_call_fct
  File "/opt/iiab/jupyterhub/lib/python3.8/site-packages/libzim/writer.py", line 88, in _get_data
    self._blob = self.get_data()
  File "/opt/iiab/jupyterhub/lib/python3.8/site-packages/zimscraperlib/zim/filesystem.py", line 110, in get_data
    rewriter(self.fpath, root=self.root).encode("utf-8")
  File "/opt/iiab/jupyterhub/lib/python3.8/site-packages/zimscraperlib/zim/rewriting.py", line 221, in fix_links_in_html_file
    return fix_links_in_html(url, fh.read())
  File "/opt/iiab/jupyterhub/lib/python3.8/site-packages/zimscraperlib/zim/rewriting.py", line 191, in fix_links_in_html
    fixed = fix_target_for(
  File "/opt/iiab/jupyterhub/lib/python3.8/site-packages/zimscraperlib/zim/rewriting.py", line 130, in fix_target_for
    flat_target = flat_target.resolve().relative_to(root.resolve())
  File "/usr/lib/python3.8/pathlib.py", line 904, in relative_to
    raise ValueError("{!r} does not start with {!r}"
ValueError: '/hd/library/www/html/zims/trippy/I/favicon.png' does not start with '/hd/library/www/html/zims/trippy/output_tree'
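
The `ValueError` above is raised when an HTML file in the build directory links to a target (here `I/favicon.png`, one level above `output_tree`) that resolves outside the build directory. A minimal sketch of a pre-check that lists such links before `make_zim_file` is called is shown below. `find_escaping_links` is a hypothetical helper, not part of zimscraperlib; it only looks at `href` and `src` attributes and only approximates the `relative_to()` test seen in the traceback.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urlparse, unquote


class _LinkCollector(HTMLParser):
    """Collect href/src attribute values from one HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def find_escaping_links(build_dir):
    """Return (html_file, link) pairs whose target resolves outside build_dir.

    Hypothetical helper: an approximation of the check that failed above,
    not the library's own logic.
    """
    root = Path(build_dir).resolve()
    problems = []
    for html_file in sorted(root.rglob("*.html")):
        collector = _LinkCollector()
        collector.feed(html_file.read_text(encoding="utf-8", errors="replace"))
        for link in collector.links:
            parsed = urlparse(link)
            # Skip external URLs (http:, mailto:, data:, ...) and pure fragments.
            if parsed.scheme or parsed.netloc or not parsed.path:
                continue
            if parsed.path.startswith("/"):
                continue  # absolute paths are outside the scope of this sketch
            target = (html_file.parent / unquote(parsed.path)).resolve()
            try:
                target.relative_to(root)
            except ValueError:
                problems.append((str(html_file.relative_to(root)), link))
    return problems
```

Calling `find_escaping_links(OUTPUT_DIR)` before the cell above would list pages such as `home.html` that still point at `../I/favicon.png`. An empty list does not guarantee that `make_zim_file` will succeed; it only rules out this particular failure.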


In [None]:
# Dump the zim file to get the metadata accumulated during its creation
cmd = f'zimdump dump --dir={PROOF_DIR} {NEW_ZIM_DIR}/{fname}'
subprocess.run(cmd,shell=True)


In [None]:
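# Create a dict with the counts of file mime types in the ZIM
from pprint import pprint
with open(f'{PROOF_DIR}/M/Counter','r') as fp:
    counts = fp.read().strip().split(';')
countdict = {}
for nibble in counts:
    # Skip empty entries (e.g. after a trailing ';')
    if '=' not in nibble:
        continue
    # Split on the last '=' so the mime type stays intact
    mime, count = nibble.rsplit('=', 1)
    countdict[mime] = count
pprint(countdict)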
    

In [None]:
# Fetch the uuid from the new zim
cmd = f'zimdump info {NEW_ZIM_DIR}/{fname}'
infodump = subprocess.run(cmd,shell=True,capture_output=True)
lines = infodump.stdout.decode('utf-8').split('\n')
uuid = None  # stays None if no 'uuid:' line is found
for line in lines:
    if line.split(' ')[0] == 'uuid:':
        uuid = line.split(' ')[1].strip()
if not uuid:
    print('failed to get uuid')
else:
    print("uuid:%s"%uuid)

In [None]:
# Create a catalog fragment for this ZIM
import base64
uuidstr = uuid
CATALOG_FRAG_DIR = '/opt/iiab/iiab-content/catalogs/zim-cat-fragments'
WASABI_URL = 'https://s3.us-east-2.wasabisys.com/iiab-zims'
# Maintain the order of the zim catalog
cat_keys = ['path','title','description','language','creator','publisher','tags','faviconMimeType',
           'favicon','date','articleCount','mediaCount','size','url','name','flavour']
outstr = '{"%s": {\n'%uuidstr
outstr += f'"path": "../library/zims/content/{fname}",\n'
outstr += f'"title": "{meta_data["Title"]}",\n'
outstr += f'"description": "{meta_data["Description"]}",\n'
outstr += f'"language": "{meta_data["Language"]}",\n'
outstr += f'"creator": "{meta_data["Creator"]}",\n'
outstr += '"publisher": "Internet In A Box",\n'
outstr += f'"tags": "{meta_data["Tags"]}",\n'
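# favicon_fn is either favicon.png or favicon.jpg (chosen when the ZIM was built)
favicon_mime = 'image/png' if favicon_fn.endswith('.png') else 'image/jpeg'
outstr += f'"faviconMimeType": "{favicon_mime}",\n'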
with open(PROJECT_DIR + '/' + favicon_fn,'rb') as fp:
    favi = fp.read()
# b64encode returns bytes; decode so the JSON holds plain text, not b'...'
b64 = base64.b64encode(favi).decode('ascii')
outstr += f'"favicon": "{b64}",\n'
outstr += f'"date": "{meta_data["Date"]}",\n'
outstr += f'"articleCount": "{countdict["text/html"]}",\n'
outstr += f'"mediaCount": "{countdict["video/webm"]}",\n'
size = os.path.getsize(f'{NEW_ZIM_DIR}/{fname}')
outstr += f'"size": "{size}",\n'
outstr += f'"url": "{WASABI_URL}/{fname}",\n'
outstr += f'"name": "{meta_data["Name"]}",\n'
outstr += f'"flavour": ""\n'
outstr += '}}'
cat_fragment = '%s/%s'%(CATALOG_FRAG_DIR,fname.replace('.zim','.json'))
with open(cat_fragment,'w') as fp:
    fp.write(outstr)
print(outstr)
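
Building the fragment by string concatenation produces invalid JSON as soon as a title or description contains a double quote. A minimal sketch of the same fragment built as a dict and serialized with `json.dumps` is shown below. `build_catalog_fragment` and its parameter names are hypothetical; the key names and their order are copied from the cell above. Falling back to `"0"` when a mime type is missing from the counts is an assumption of this sketch.

```python
import base64
import json


def build_catalog_fragment(uuid, fname, meta, counts, size, favicon_bytes,
                           base_url, favicon_mime="image/png"):
    """Return a JSON string for one catalog entry, keyed by the ZIM uuid.

    Hypothetical helper: key names and order mirror the cell above.
    """
    entry = {
        "path": f"../library/zims/content/{fname}",
        "title": meta["Title"],
        "description": meta["Description"],
        "language": meta["Language"],
        "creator": meta["Creator"],
        "publisher": "Internet In A Box",
        "tags": meta["Tags"],
        "faviconMimeType": favicon_mime,
        # b64encode returns bytes; decode so the JSON holds plain text.
        "favicon": base64.b64encode(favicon_bytes).decode("ascii"),
        "date": meta["Date"],
        # Assumption: report "0" when a mime type is absent from the counts.
        "articleCount": str(counts.get("text/html", "0")),
        "mediaCount": str(counts.get("video/webm", "0")),
        "size": str(size),
        "url": f"{base_url}/{fname}",
        "name": meta["Name"],
        "flavour": "",
    }
    # dicts keep insertion order, so the key order above is preserved.
    return json.dumps({uuid: entry}, indent=2)
```

With the notebook's variables this could be called as `build_catalog_fragment(uuid, fname, meta_data, countdict, size, favi, WASABI_URL)`, and the returned string written to `cat_fragment`.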

In [None]:
# Now parse the file we just created to validate the JSON
import json
with open(cat_fragment,'r') as fp:
    json_str = fp.read()
frag_dict = json.loads(json_str)

In [None]:
# Now recreate the iiab-zim-catalog by updating from zim fragments
from glob import glob
import json
CATALOG_FRAG_DIR = '/opt/iiab/iiab-content/catalogs/zim-cat-fragments'
ZIM_CATALOG = '/opt/iiab/iiab-content/catalogs/iiab-zim-cat.json'
with open(ZIM_CATALOG,'r') as zcat:
    zim_dict = json.loads(zcat.read())
frags = glob(CATALOG_FRAG_DIR + '/*.json')
for frag in frags:
    with open(frag,'r') as fp:
        frag_dict = json.loads(fp.read())
    zim_dict.update(frag_dict)
with open(ZIM_CATALOG,'w') as zcat:
    zcat.write(json.dumps(zim_dict,indent=2))
    
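
The cell above rewrites the catalog in place, so an interruption during the write (for example a full disk) could leave a truncated `iiab-zim-cat.json`. A minimal sketch of the same merge that writes to a temporary file and then renames it into place is shown below. `merge_fragments` is a hypothetical helper; it assumes the temporary file can be created in the same directory as the catalog.

```python
import json
import os
import shutil
import tempfile
from glob import glob


def merge_fragments(catalog_path, frag_dir):
    """Merge every *.json fragment in frag_dir into the catalog at catalog_path.

    Hypothetical helper: the merged catalog is written to a temporary file
    and renamed over the original only after the write has completed.
    """
    with open(catalog_path, "r") as zcat:
        zim_dict = json.load(zcat)
    # Sorted so that the merge order is reproducible between runs.
    for frag in sorted(glob(os.path.join(frag_dir, "*.json"))):
        with open(frag, "r") as fp:
            zim_dict.update(json.load(fp))
    fd, tmp_path = tempfile.mkstemp(
        dir=os.path.dirname(catalog_path) or ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(zim_dict, tmp, indent=2)
        # mkstemp creates the file with mode 0600; keep the catalog's mode.
        shutil.copymode(catalog_path, tmp_path)
        os.replace(tmp_path, catalog_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return zim_dict
```

It could be called as `merge_fragments(ZIM_CATALOG, CATALOG_FRAG_DIR)`. The rename is atomic only when the temporary file and the catalog are on the same filesystem, which is why the sketch creates the temporary file next to the catalog.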

In [None]:
# Validate the Zim Catalog
ZIM_CATALOG = '/opt/iiab/iiab-content/catalogs/iiab-zim-cat.json'

with open(ZIM_CATALOG,'r') as zcat:
    try:
        valid_dict = json.loads(zcat.read())
    except Exception:
        print('ZIM catalog does not parse')
        sys.exit(1)
    print('ZIM catalog parsed successfully')

#### Notes on Installing on VirtualBox Ubuntu 20.04
1. When I ran the IIAB jupyterhub role, I got a combination of Python packages that did not work.
2. I went to an earlier Ubuntu 20.04 install (in a virtualenv) that did work, and executed ```pip freeze > requirements.txt```.
3. I transferred the requirements.txt file to the failing VirtualBox instance, activated the venv with ```source /opt/iiab/jupyterhub/bin/activate```, and ran ```pip install -r requirements.txt```. After that, ```iiab-factory/content/kiwix/zim-filter/start_remote_notebook.sh``` worked.
4. So the requirements.txt may be required until the jupyterhub role is fixed. It is in the repo in the same place as the notebook.