# Read

> Read X for LLM context

In [None]:
#| default_exp read

In [None]:
#| hide
from nbdev.showdoc import *

In [None]:
#| export
import httpx

Todo ideas:
- [X] read_url
- [X] read_gist
- [X] read_gh_file
- [ ] read_ghurl
- [X] read_file
- [X] read_dir
- [X] read_pdf
- [ ] read_msword
- [X] read_gdoc
- [ ] read_yt
- [X] read_yt_transcript
- [X] read_gsheet

One possible interface:

```
read_thing(s)
```

Where the function would be smart enough to look at s and determine if
it is:

- a Github URL
- a YT URL
- a Google Doc URL
- a "plain" URL (not identified as more specific)
- a path to a file on disk
- etc...

But this is only a convenience interface.

The library should also expose the separate, dedicated `read_` functions.

Ideally, these should also "just work" when given a single positional argument, with any further arguments being optional keyword args that request more specific behavior when necessary (for example, an output format other than a string).

They should all return the same kind of thing, namely whatever is easiest to drop into context. str? dict?

To start, let us suppose:
- Each `read_` function MUST work with one positional arg and MUST return a string.
- Later: optional args, maybe controlling other output formats, such as a dictionary, a Claude-optimized bit of XML, etc.
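
A minimal sketch of how the `read_thing(s)` dispatch described above might look. It is an illustration, not the library's implementation: `classify_source`, the label strings, and the `readers` mapping are hypothetical names introduced here, and the host checks cover only the URL shapes that appear in this notebook.

```python
import os
from urllib.parse import urlparse

def classify_source(s):
    "Return a coarse label for `s` (the labels are hypothetical names for this sketch)"
    parsed = urlparse(s)
    if parsed.scheme in ("http", "https"):
        host = parsed.netloc.lower()
        if host == "gist.github.com":
            return "gist"
        if host in ("github.com", "www.github.com"):
            return "github"
        if host in ("youtu.be", "youtube.com", "www.youtube.com"):
            return "youtube"
        if host == "docs.google.com" and parsed.path.startswith("/document/"):
            return "gdoc"
        if host == "docs.google.com" and parsed.path.startswith("/spreadsheets/"):
            return "gsheet"
        return "url"  # a "plain" URL, not identified as anything more specific
    # Anything that is not an http(s) URL is treated as a path on disk
    return "dir" if os.path.isdir(s) else "file"

def read_thing(s, readers):
    "Dispatch `s` to the reader registered for its label; every reader takes one positional arg"
    return readers[classify_source(s)](s)
```

In the notebook, `readers` could map `"gist"` to `read_gist`, `"url"` to `read_url`, and so on, which keeps the dedicated functions usable on their own.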



requirements.txt

```
PyPDF2
httpx
requests
html2text
youtube_transcript_api
pytube
```

In [None]:
from aimagic import create_magic,models
create_magic(models[1])

## Defining read_ functions

In [None]:
%ai reset

In [None]:
%%ai
hello

Hello! How can I assist you today?

In [None]:
%%ai 0
Tell me about the python function or package html2md or html2text


There are two popular Python packages for converting HTML to Markdown:

1. html2text:
   - Widely used and mature library
   - Converts HTML to Markdown-formatted plain text
   - Available via pip: `pip install html2text`
   - Usage:
     ```python
     import html2text
     h = html2text.HTML2Text()
     markdown = h.handle("<h1>Hello World</h1>")
     ```

2. html2markdown:
   - Newer alternative
   - Aims to produce cleaner Markdown output
   - Available via pip: `pip install html2markdown`
   - Usage:
     ```python
     from html2markdown import convert
     markdown = convert("<h1>Hello World</h1>")
     ```

Both libraries are useful for tasks like web scraping, content processing, or converting HTML emails to plain text.

In [None]:
#| export

def read_url(url):
    "Fetch `url` and return its HTML converted to Markdown-formatted text"
    import html2text, httpx
    return html2text.html2text(httpx.get(url).text)

In [None]:
sample_gist_url = "https://gist.github.com/algal/a490024ad088de1b857531c83abef0a0"
raw_gist_url = "https://gist.githubusercontent.com/algal/a490024ad088de1b857531c83abef0a0/raw/d8b04e5b7c11d5b753b9225978e0216098295e9a/iterm2-url.source"
simpleraw_gist_url = "https://gist.githubusercontent.com/algal/a490024ad088de1b857531c83abef0a0/raw"

In [None]:
print(sample_gist_url)

https://gist.github.com/algal/a490024ad088de1b857531c83abef0a0


In [None]:
print(simpleraw_gist_url)

https://gist.githubusercontent.com/algal/a490024ad088de1b857531c83abef0a0/raw


In [None]:
%%aip 0
generate python code which uses regexes to go from a URL like
sample_gist_url to the URL in simpleraw_gist_url

In [None]:
#| export
def read_gist(url):
    "Returns raw gist content, or None"
    import re, httpx
    pattern = r'https://gist\.github\.com/([^/]+)/([^/]+)'
    match = re.match(pattern, url)
    if match:
        user, gist_id = match.groups()
        raw_url = f'https://gist.githubusercontent.com/{user}/{gist_id}/raw'
        return httpx.get(raw_url).text
    else:
        return None
    
    

In [None]:
gh_file_url="https://github.com/hamelsmu/getrich-fasthtml/blob/main/.gitignore"
gh_raw_file_url="https://raw.githubusercontent.com/hamelsmu/getrich-fasthtml/refs/heads/main/.gitignore"
print(gh_file_url)
print(gh_raw_file_url)

https://github.com/hamelsmu/getrich-fasthtml/blob/main/.gitignore
https://raw.githubusercontent.com/hamelsmu/getrich-fasthtml/refs/heads/main/.gitignore


In [None]:
%%aip 0
Please generate regex code to transform gh_file_url-like URLs to gh_raw_file_url-like URLs

In [None]:

import re

def github_url_to_raw(url):
    pattern = r'https://github\.com/([^/]+)/([^/]+)/blob/([^/]+)/(.+)'
    replacement = r'https://raw.githubusercontent.com/\1/\2/refs/heads/\3/\4'
    return re.sub(pattern, replacement, url)

# Test the function
result = github_url_to_raw(gh_file_url)
print(result)

# Comment: This function assumes the URL structure remains consistent.
# It may need adjustment if GitHub changes their URL format.
# Also, it doesn't handle cases where the URL doesn't match the expected pattern.
# You might want to add error handling for such cases.

https://raw.githubusercontent.com/hamelsmu/getrich-fasthtml/refs/heads/main/.gitignore


In [None]:
#| export
def read_gh_file(url):
    import httpx
    import re
    pattern = r'https://github\.com/([^/]+)/([^/]+)/blob/([^/]+)/(.+)'
    replacement = r'https://raw.githubusercontent.com/\1/\2/refs/heads/\3/\4'
    raw_url = re.sub(pattern, replacement, url)
    return httpx.get(raw_url).text
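
The generated code above notes that it does not handle URLs that fail to match the pattern. A sketch of one way to add that check, with the URL rewrite split out so it can be tested without a network call. `gh_blob_to_raw` is a hypothetical helper name. The sketch also drops the `refs/heads/` segment, on the assumption that the shorter `raw.githubusercontent.com/{user}/{repo}/{ref}/{path}` form is acceptable, since that form also resolves tags and commit SHAs.

```python
import re

# Same capture groups as the pattern used in read_gh_file
_BLOB_RE = re.compile(r'https://github\.com/([^/]+)/([^/]+)/blob/([^/]+)/(.+)')

def gh_blob_to_raw(url):
    "Rewrite a github.com blob URL to its raw URL, or raise ValueError if it does not match"
    m = _BLOB_RE.fullmatch(url)
    if m is None:
        raise ValueError(f"Not a GitHub blob URL: {url!r}")
    user, repo, ref, path = m.groups()
    # Caveat: a branch name containing "/" cannot be told apart from the file path here
    return f"https://raw.githubusercontent.com/{user}/{repo}/{ref}/{path}"

print(gh_blob_to_raw("https://github.com/hamelsmu/getrich-fasthtml/blob/main/.gitignore"))
```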

    

In [None]:
s = read_gh_file(gh_file_url)

In [None]:
#| export
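def read_file(path):
    "Return the text content of the file at `path`"
    # Use a context manager so the file handle is closed promptly
    with open(path, 'r') as f:
        return f.read()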


In [None]:
%%aip 0
Generate Python code which reads all files below a certain path,
concatenating their contents into a single string, adding within
the string delimiter lines which communicate the paths of the
individual files. Use glob patterns, please.

In [None]:
%%aip 0
Observe the TODO comments in the code above. Please generate a new
function which fills those TODOs, using only stdlib modules.

In [None]:
#| export
import os
import glob
import fnmatch

def is_unicode(filepath, sample_size=1024):
    "Return True if the first `sample_size` characters of the file decode as text"
    try:
        with open(filepath, 'r') as file:
            file.read(sample_size)
        return True
    except UnicodeDecodeError:
        return False

def read_dir(path, 
             exclude_non_unicode=True,
             excluded_patterns=(".git/**",),
             verbose=True):
    "Concatenate all files below `path` into one string, with delimiter lines naming each file"
    pattern = '**/*'
    result = []
    for file_path in glob.glob(os.path.join(path, pattern), recursive=True):
        # Match exclusions against the path relative to `path`, so ".git/**" works for any root
        rel_path = os.path.relpath(file_path, path)
        if any(fnmatch.fnmatch(rel_path, pat) for pat in excluded_patterns):
            continue
        if os.path.isfile(file_path):
            if exclude_non_unicode and not is_unicode(file_path):
                continue
            if verbose:
                print(f"Including {file_path}")
            result.append(f"--- File: {file_path} ---")
            with open(file_path, 'r', errors='ignore') as f:
                result.append(f.read())
            result.append(f"--- End of {file_path} ---")
    return '\n'.join(result)

# Comment: This implementation uses fnmatch for pattern matching and a trial
# text decode (is_unicode) for binary file detection. Only the first
# sample_size characters are checked, so you might want to implement a more
# robust method if accuracy is crucial.
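
To make the delimiter format concrete, here is a small self-contained sketch of the same idea run against a throwaway directory. It is a `pathlib`-based variant written for illustration, not the exported `read_dir`: `concat_tree` is a hypothetical name, and it reports paths relative to the root rather than the full paths that `read_dir` prints.

```python
import tempfile
from pathlib import Path

def concat_tree(root):
    "Join every UTF-8 file under `root` into one string, with delimiter lines naming each file"
    root = Path(root)
    parts = []
    for p in sorted(root.rglob("*")):
        if not p.is_file():
            continue
        try:
            text = p.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            continue  # skip files that are not valid UTF-8 (likely binary)
        rel = p.relative_to(root).as_posix()
        parts += [f"--- File: {rel} ---", text, f"--- End of {rel} ---"]
    return "\n".join(parts)

# Build a tiny tree: two text files and one file that is not valid UTF-8
with tempfile.TemporaryDirectory() as d:
    (Path(d) / "a.txt").write_text("alpha", encoding="utf-8")
    (Path(d) / "sub").mkdir()
    (Path(d) / "sub" / "b.txt").write_text("beta", encoding="utf-8")
    (Path(d) / "blob.bin").write_bytes(b"\xff\xfe\x00")
    combined = concat_tree(d)

print(combined)
```

The binary file is skipped, and each remaining file is wrapped in a `--- File: ... ---` / `--- End of ... ---` pair.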

### PDF reader

In [None]:
#| export
from PyPDF2 import PdfReader

def read_pdf(file_path: str) -> str:
    with open(file_path, 'rb') as file:
        reader = PdfReader(file)
        return ' '.join(page.extract_text() for page in reader.pages)
    

In [None]:
!pwd

/home/algal/gits/ContextKit/nbs


In [None]:
read_pdf('./test_dir/legal.pdf')

" \n PARETO INC.  \nCONFIDENTIAL INFORMATION AND  \nINVENTION ASSIGNMENT AGREEMENT  \nConsultant Name : James Macharia  (“Consultant ”)  \nEffective Date : May 05, 2021  \nAs a condition of becoming retained (or Consultant’s consulting relationship being \ncontinued) by Pareto Inc. , a Delaware  corporation, or any of its current or future subsidiaries, \naffiliates, successors or assigns (collectively, the “ Company ”), and in cons ideration of \nConsultant’s consulting relationship with the Company and receipt of the compensation now and \nhereafter paid by the Company, the receipt of Confidential Information (as defined below) while \nassociated with the Company, and other good and valua ble consideration, the receipt and \nsufficiency of which are hereby acknowledged, Consultant agrees to the following:  \n1. Relationship .  This Confidential Information and Invention Assignment \nAgreement (this “ Agreement ”) will apply to Consultant’s consulting re lationship with the \nCompany.

### YT Transcript

In [None]:
#| export
def read_yt_transcript(yt_url):
    "Return the transcript of the video at `yt_url` as one string, or None if the URL cannot be parsed"
    from pytube import YouTube
    from youtube_transcript_api import YouTubeTranscriptApi
    try:
        yt = YouTube(yt_url)
        video_id = yt.video_id
    except Exception as e:
        print(f"An error occurred parsing the YT URL: {e}")
        return None
    transcript = YouTubeTranscriptApi.get_transcript(video_id)
    return ' '.join(entry['text'] for entry in transcript)
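
In the function above, `pytube` is used only to extract the video ID from the URL. As a sketch of an alternative, the ID can be pulled out with the standard library alone. `youtube_video_id` is a hypothetical helper, and the set of URL shapes it recognises (`youtu.be/<id>`, `/watch?v=<id>`, `/embed/<id>`, `/shorts/<id>`) is an assumption about what callers will pass in, not an exhaustive list.

```python
from urllib.parse import urlparse, parse_qs

def youtube_video_id(url):
    "Return the video ID embedded in a YouTube URL, or None if no known shape matches"
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    if host == "youtu.be":
        # Short links carry the ID as the first path segment
        return parsed.path.lstrip("/").split("/")[0] or None
    if host in ("youtube.com", "m.youtube.com"):
        if parsed.path == "/watch":
            return parse_qs(parsed.query).get("v", [None])[0]
        for prefix in ("/embed/", "/shorts/"):
            if parsed.path.startswith(prefix):
                return parsed.path[len(prefix):].split("/")[0] or None
    return None

print(youtube_video_id("https://youtu.be/MRtg6A1f2Ko?si=C7YZU6FFLdi6v9rk"))
```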


    

In [None]:
yt_url = "https://youtu.be/MRtg6A1f2Ko?si=C7YZU6FFLdi6v9rk"
s = read_yt_transcript(yt_url)

In [None]:
print(s[:100])

- [Tim] A widescreen
iPod with touch controls, a revolutionary mobile phone, and a breakthrough inte


### Google Sheet

In [None]:
orig_url = 'https://docs.google.com/spreadsheets/d/17Q3LzRCyM4md28IBxzSSERpaafLgOH8MjH5r6UkyVz8/edit?gid=0#gid=0'
csv_url='https://docs.google.com/spreadsheets/d/17Q3LzRCyM4md28IBxzSSERpaafLgOH8MjH5r6UkyVz8/export?format=csv&id=17Q3LzRCyM4md28IBxzSSERpaafLgOH8MjH5r6UkyVz8&gid=0'
import requests as rs
res=rs.get(url=csv_url)
res.content


b'Band Pull Around/Aparts\r\nShoulder Dislocations Straight\r\nShoulder Dislocations Side\r\nSuperman Dislocation\r\nScorpion Chest Stretch\r\nLatt Pulldown\r\nTwisty Shoulders\r\nRotator Cuff Pull\r\nWide bent over row'

In [None]:
#| export
import requests

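def read_google_sheet(orig_url):
    "Return the first sheet (gid=0) of a public Google Sheet as a CSV string"
    sheet_id = orig_url.split('/d/')[1].split('/')[0]
    csv_url = f'https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv&id={sheet_id}&gid=0'
    res = requests.get(url=csv_url)
    # Return text rather than bytes, since every read_ function must return a string
    return res.text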
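
The function above always requests `gid=0`. A sketch of how the `gid` could instead be taken from the original URL, where it may appear in the query string or in the fragment, as it does in `orig_url`. `gsheet_csv_url` is a hypothetical helper that only builds the export URL, so it can be checked without a network call.

```python
from urllib.parse import urlparse, parse_qs

def gsheet_csv_url(url):
    "Build the CSV export URL for a Google Sheets URL, keeping its gid if one is present"
    parsed = urlparse(url)
    parts = parsed.path.split("/")  # e.g. ['', 'spreadsheets', 'd', '<sheet_id>', 'edit']
    if "d" not in parts or parts.index("d") + 1 >= len(parts) or not parts[parts.index("d") + 1]:
        raise ValueError(f"No sheet ID found in {url!r}")
    sheet_id = parts[parts.index("d") + 1]
    # Look for gid in the query string first, then the fragment, and default to "0"
    gid = (parse_qs(parsed.query).get("gid")
           or parse_qs(parsed.fragment).get("gid")
           or ["0"])[0]
    return f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv&id={sheet_id}&gid={gid}"

print(gsheet_csv_url("https://docs.google.com/spreadsheets/d/17Q3LzRCyM4md28IBxzSSERpaafLgOH8MjH5r6UkyVz8/edit?gid=0#gid=0"))
```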


### Google Doc:

In [None]:
public_gdoc_url = "https://docs.google.com/document/d/13g-IDyuJyk5wE60bOH1YhhFgW8rlh2LnSXccBS0CQd0/edit"

In [None]:
print(public_gdoc_url)

https://docs.google.com/document/d/13g-IDyuJyk5wE60bOH1YhhFgW8rlh2LnSXccBS0CQd0/edit


In [None]:
parseable_gdoc_url = "https://docs.google.com/document/d/13g-IDyuJyk5wE60bOH1YhhFgW8rlh2LnSXccBS0CQd0/export?format=html"
print(parseable_gdoc_url)    

https://docs.google.com/document/d/13g-IDyuJyk5wE60bOH1YhhFgW8rlh2LnSXccBS0CQd0/export?format=html


In [None]:
import html2text

doc_url = 'https://docs.google.com/document/d/13g-IDyuJyk5wE60bOH1YhhFgW8rlh2LnSXccBS0CQd0/edit?format=html'
doc_id = doc_url.split('/d/')[1].split('/')[0]
export_url = f'https://docs.google.com/document/d/{doc_id}/export?format=html'
doc_content = rs.get(export_url).text
doc_content = html2text.html2text(doc_content)

In [None]:
doc_content

'# Top heading\n\nHello this is a context reading test\n\n## Heading 2\n\nBolded text is here as well as italisized\n\n  * I have bullets\n  * Of things\n\n## Heading 3\n\nAnd ordered\n\n  1. Lists\n  2. Of\n  3. Things\n\n#\n\n'

In [None]:
%%aip
Generate python which translates URLs from the shape used in public_gdoc_url to parseable_gdoc_url


In [None]:
import re

def gdoc_url_to_parseable(url):
    pattern = r'(https://docs\.google\.com/document/d/[^/]+)/edit'
    replacement = r'\1/export?format=html'
    return re.sub(pattern, replacement, url)

# Test the function
result = gdoc_url_to_parseable(public_gdoc_url)
print(result)

# Comment: This function assumes the URL structure remains consistent.
# It may need adjustment if Google changes their URL format.
# Also, it doesn't handle cases where the URL doesn't match the expected pattern.
# You might want to add error handling for such cases.

https://docs.google.com/document/d/13g-IDyuJyk5wE60bOH1YhhFgW8rlh2LnSXccBS0CQd0/export?format=html


In [None]:
#| export
def read_gdoc(url):
    "Return the content of a public Google Doc as Markdown-formatted text"
    import requests
    import html2text
    doc_id = url.split('/d/')[1].split('/')[0]
    export_url = f'https://docs.google.com/document/d/{doc_id}/export?format=html'
    # Use the locally imported `requests`; the `rs` alias only exists in a non-exported cell
    html_doc_content = requests.get(export_url).text
    doc_content = html2text.html2text(html_doc_content)
    return doc_content
    

In [None]:
read_gdoc("https://docs.google.com/document/d/10pmXIbmQCnh0BpaSFvMGJ8_QzKAyfqXnWV1YIp5J71o/edit")

' FRIDAY|  \n| Workshops in shaded boxes are $7 per person. Included with MAX pass; or buy\ntickets onsite.  \n---|---|---  \n | O\'Hare 1| O\'Hare 2  \n2:00PM| Waltz (N/I): Welcome back! We\'ve missed you!Paula & Bob Graves| WCS\n(N/I): Give and TakeJason Miklic & Sophy Kdep  \n3:00 PM| Two Step (N/I): Hold My Beer and Watch This!  \nDavid Miller & Erin Frazier| WCS (N): Find Your Groove  \nByron Calix  \n4:00 PM| WCS (I): Cool WhipsPJ Turner| Cha-Cha (N/I): We Make this Sh** Look\nGoodJohn Burns & Suzanne Mosley (shOW, silly people!)  \n5:00 PM| Triple Two (N): Building on the Basics  \nAmanda Clark| Hustle (I/A): Million Dollar Moves & Style  \nNino DiGiulio and Dawn Lara  \n6:00 PM| Nightclub (N/I): Best Moves from Calgary, eh?  \nBryan Senn| WCS (I): Battle of the WhipsJennifer Norris and Derek Leyva  \nLearn HOW TO DANCE WEST COAST SWING. No prior knowledge necessary. From the\nbeginning. Yes, you can do it.  \nSATURDAY: NOON and 3PM in the BALMORAL ballroom with Ryan Dobbins and

## Next:

In [None]:
%%aip
Generate python code which will take url and then return the string containing
the contents of the file whose download is triggered by navigating to that url.


In [None]:
import requests

def download_file_content(url):
    try:
        response = requests.get(url)
        response.raise_for_status()  # Raises an HTTPError for bad responses
        return response.text
    except requests.RequestException as e:
        print(f"An error occurred: {e}")
        return None

# Usage example:
# content = download_file_content(url)
# if content:
#     print(content)

# Comment: This function uses the requests library to download the file content.
# It handles HTTP errors and network issues, but you might want to add more 
# specific error handling or timeout settings depending on your use case.
# Also, this assumes the file content can be reasonably loaded into memory.
# For very large files, you might need to use streaming or chunked downloads.
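
The closing comment above mentions timeouts and chunked downloads. A sketch of both, using the standard library's `urllib.request` rather than `requests` so that it runs without extra dependencies. `download_text` and its `max_bytes` and `chunk_size` parameters are hypothetical names and defaults chosen for illustration.

```python
import tempfile
import urllib.request
from pathlib import Path

def download_text(url, timeout=10.0, max_bytes=5_000_000, chunk_size=65536):
    "Read `url` in chunks and return it as text, raising ValueError past `max_bytes`"
    chunks, total = [], 0
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        # Fall back to UTF-8 when the response does not declare a charset
        charset = resp.headers.get_content_charset() or "utf-8"
        while True:
            chunk = resp.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
            if total > max_bytes:
                raise ValueError(f"Response exceeds {max_bytes} bytes")
            chunks.append(chunk)
    return b"".join(chunks).decode(charset, errors="replace")

# Offline demonstration using a file:// URL, which urllib also understands
with tempfile.TemporaryDirectory() as d:
    sample = Path(d) / "sample.txt"
    sample.write_text("hello context", encoding="utf-8")
    fetched = download_text(sample.as_uri())

print(fetched)
```

The size cap keeps an unexpectedly large download from filling memory before the text ever reaches the context window.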

In [None]:
gdoc_url

In [None]:
s = read_gdoc(public_gdoc_url)
print(s[100:])




Hello! How can I assist you today?

In [None]:
%%aip
Please generate code that transforms from one URL shape to another, where the src shape
is orig_url and the dest shape is csv_url

In [None]:
%%ai
Count how many notebook cells precede this cell, please.

ValueError: not enough values to unpack (expected at least 1, got 0)

In [None]:
import re

def transform_url(orig_url):
    # Extract the spreadsheet ID that follows /spreadsheets/d/ in orig_url
    match = re.match(r'https://docs\.google\.com/spreadsheets/d/([^/]+)', orig_url)
    if not match:
        raise ValueError("Invalid URL format")
    
    sheet_id = match.group(1)
    
    # Construct csv_url (gid=0 matches the sheet requested in csv_url above)
    csv_url = f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv&id={sheet_id}&gid=0"
    
    return csv_url

# Example usage:
# csv_url = transform_url(orig_url)
# print(csv_url)

# Note: This assumes a specific URL structure. Adjust regex if needed.