Add the ability to generate web pages #6

Closed
killlowkey opened this issue Jun 10, 2022 · 55 comments

Comments

@killlowkey

Hello, thank you for creating an excellent project; it has been very helpful to me.
I have an idea: could you add the ability to generate web pages like the one linked below? Pages saved as images are hard to read.
url-shortening-service-like-tiny-url
Thank you very much.

@anilabhadatta
Owner

@killlowkey I considered that approach, but some images don't show up in the saved HTML. That is why I had to take a screenshot of each webpage.
Also, you cannot be sure when Educative may change the URLs of certain images, which in turn may affect the offline HTML file.
Example:
image

I would suggest you use educative-viewer to view the scraped courses, as it is also designed for mobile view.
You can use an OCR-to-text extension to copy text from the images.

@killlowkey
Author

@anilabhadatta I might have a solution to this problem: convert the image to Base64 encoding, as shown below.

Snipaste_2022-06-10_22-05-12
After conversion to Base64:
Snipaste_2022-06-10_22-05-29
This should work.
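
A minimal Python sketch of the Base64 idea above, assuming the image bytes have already been downloaded. The function name `to_data_uri` and the default MIME type are illustrative, not part of the project.

```python
import base64


def to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Return a data: URI that embeds the image bytes as Base64 text."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# The 8-byte signature that starts every PNG file, used as stand-in image data.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
print(to_data_uri(PNG_SIGNATURE))  # data:image/png;base64,iVBORw0KGgo=
```

A string like this can be placed directly in an `img` tag's `src` attribute, so the saved page no longer depends on the original image URL.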

@anilabhadatta
Owner

@killlowkey try to implement it

@killlowkey
Author

thank you

@BoostUpStation
Contributor

Hi killlowkey, any progress on saving the webpage as HTML or MHTML?

@anilabhadatta Can you please try saving the webpage as .mhtml instead of taking a screenshot? That would be really helpful.
Also, please suggest a learning path and/or a series of courses that would help me become a web scraper like you :) [I know Python from competitive programming, but that's it.]

@killlowkey
Author

@BoostUpStation I haven't found a way yet. Images are not shown if the browser's save feature is used to save the webpage as HTML or MHTML.

@BoostUpStation
Contributor

Here's an old repo that saves pages as HTML/MHTML and PDF, but it is written in TypeScript, which I don't know :(
https://github.com/MrAbdulQadeer/educative.io-downloader

Hoping somebody can implement it here in Python :)

@killlowkey
Author

killlowkey commented Jul 3, 2022

@BoostUpStation I have never used Python or TypeScript, so I can't help you with that. I think the key idea for saving a webpage as HTML or MHTML is to convert the image URLs to Base64 encoding. Hope that helps.

@anilabhadatta
Owner

@BoostUpStation @killlowkey The main problem is that a few svg tags contain image URLs, so the main option is to find every image URL, convert it to Base64, and also keep track of the image tags inside the SVGs so that they show up in the MHTML.

https://www.educative.io/courses/operating-systems-virtualization-concurrency-persistence/3jj3lxm03xr

This is a test URL where you can see that the image won't show up in the MHTML.
If you find a way to make it show up in the MHTML by manually placing it in the right place, I will look into it.
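
The "find every image URL and convert to base64" step could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: `inline_images` and `fake_fetch` are hypothetical names, the regex handles only double-quoted `src` attributes on `img` tags, and it does not cover images nested inside an `object` or SVG document, which is the hard case discussed here.

```python
import base64
import re
from typing import Callable, Tuple

# Matches src="http(s)://..." inside an <img ...> tag (double-quoted only).
IMG_SRC = re.compile(r'(<img\b[^>]*?\ssrc=")(https?://[^"]+)(")', re.IGNORECASE)


def inline_images(html: str, fetch: Callable[[str], Tuple[bytes, str]]) -> str:
    """Replace remote <img> URLs with Base64 data: URIs.

    `fetch` is a hypothetical callable returning (bytes, mime_type) for a URL.
    """
    def replace(match: re.Match) -> str:
        data, mime = fetch(match.group(2))
        encoded = base64.b64encode(data).decode("ascii")
        return f"{match.group(1)}data:{mime};base64,{encoded}{match.group(3)}"

    return IMG_SRC.sub(replace, html)


def fake_fetch(url: str) -> Tuple[bytes, str]:
    # Stand-in for a real download; always returns the same three bytes.
    return b"abc", "image/png"


page = '<p><img alt="x" src="https://example.com/a.png"></p>'
print(inline_images(page, fake_fetch))
# <p><img alt="x" src="data:image/png;base64,YWJj"></p>
```

Passing the downloader in as a parameter keeps the rewriting logic testable offline; a real run would supply a function that downloads each URL.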

@killlowkey
Author

@anilabhadatta I can't test that URL at the moment because I don't have an Educative Pro account. If you can find a URL that isn't restricted, I'll take a look at the result.

@anilabhadatta
Owner

anilabhadatta commented Jul 4, 2022

@killlowkey I will try to send a free course link that has the same issue.
Here, use this: https://www.educative.io/courses/getting-started-braintree-api/qABYKBmxEY0

@killlowkey
Author

@anilabhadatta This is a tricky problem; I currently have no way to display the SVG in MHTML.

@anilabhadatta
Owner

@killlowkey Yes, that is why I didn't implement it. See if you can find a way to show the SVG image element in MHTML.
I also thought of saving plain HTML, but that won't work because of styling issues.
PDF is out of the question, since text may be missing or cut off where there is a page break.

@BoostUpStation
Contributor

BoostUpStation commented Jul 4, 2022

@anilabhadatta
It's very simple, with one manual step: press Ctrl+S in the Chrome window opened by the webdriver, select the second option, 'save as single file .mhtml', and press Enter.

These steps now need to be scripted in Python/JS/HTML, so please do this rather than converting, decoding, and encoding things.

@anilabhadatta
Owner

@BoostUpStation Actually, if you save as MHTML with Ctrl+S, you won't be able to see images that sit inside an iframe > SVG.
You also have to change each image URL to Base64; a few of them may already be converted.
This is required because the images won't load if Educative changes its domain in the future, the URL is updated to something new, or your system is offline.
Base64 ensures the images are available for offline use.

@BoostUpStation
Contributor

@anilabhadatta Yes, you are right.
It didn't even work when saving as 'complete html with files included'.
None of those options work as expected.
It didn't work on Android either.

So Base64 is the only way then, apart from screenshots.
Hope you implement it :)

Please see the old repo link I shared; I guess he also took screenshots, with some additional implementation (TypeScript was used). In its output, even when zoomed past 400%, the quality stays the same and the pixels don't break up.

@BoostUpStation
Contributor

@anilabhadatta Here's some Python code that converts an image to Base64 and vice versa:
https://superuser.com/questions/263634/decoding-base64-images-and-saving-to-a-file

The links to those SVGs can easily be collected via JS
by searching for 'data:' in the document.
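
The comment suggests doing this search in JS; as a rough Python equivalent on already-saved HTML text, the 'data:' search could look like the sketch below. The pattern and the name `find_data_uris` are assumptions for illustration and only match Base64 image URIs.

```python
import re

# Matches Base64 image data: URIs such as data:image/png;base64,AAAA...
DATA_URI = re.compile(r"data:image/[\w.+-]+;base64,[A-Za-z0-9+/]+=*")


def find_data_uris(html: str) -> list:
    """Return every Base64 image data: URI found in the HTML text."""
    return DATA_URI.findall(html)


sample = '<img src="data:image/png;base64,YWJj"><img src="https://example.com/b.png">'
print(find_data_uris(sample))  # ['data:image/png;base64,YWJj']
```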

@anilabhadatta
Owner

@BoostUpStation The issue is not finding the Base64 data or doing the conversion.
The issue is how to place an img element with Base64 data in the MHTML in place of that iframe -> svg.

@anilabhadatta
Owner

@killlowkey @BoostUpStation New update.
I did some testing on the SVG image elements. It seems the images inside the SVG were never the problem; the {object tag and #document} were the main issue.
I was able to get the content inside #document and attach it to the object tag's parent element.

ifrm = document.querySelectorAll("object[aria-label='svg viewer']")[0]
svg_element = ifrm.contentDocument.documentElement
ifrm.parentNode.append(svg_element)
cls_name = ifrm.className
svg_element.classList.add(cls_name)

Try this in your Chrome console and then save the file using SingleFile.
I can iterate over all the object tags and change the HTML.
After the conversion, I was thinking of using the SingleFile extension to save the HTML page, because it automatically converts all the image URLs to Base64 and keeps the HTML intact.
I need some help with this extension. If there is any way to call the SingleFile extension from the Chrome console and get the scraped HTML file, then I can just add the quiz images and the scraping would be complete.

@killlowkey
Author

@anilabhadatta
Calling the Chrome extension via JavaScript in the console and getting the output HTML may be difficult to implement. You can compare the HTML saved by the SingleFile extension with the previously saved HTML to see how it displays the SVG. Hopefully this helps.

@anilabhadatta
Owner

@killlowkey I will test this in a few hours: gildas-lormeau/SingleFile#820

@BoostUpStation
Contributor

@anilabhadatta
Yes, after running the script in the console and then using the extension, all the SVGs are saved in the HTML file.
That's great.
Now we only need to call the extension.
I'll also look for a way if I can.

@BoostUpStation
Contributor

BoostUpStation commented Jul 19, 2022

@anilabhadatta
I get this error when running the code in the console for this link.

ifrm = document.querySelectorAll("object[aria-label='svg viewer']")[0]
svg_element = ifrm.contentDocument.documentElement
ifrm.parentNode.append(svg_element)
cls_name = ifrm.className
svg_element.classList.add(cls_name)

https://www.educative.io/courses/getting-started-braintree-api/x1BG30wrnol

Uncaught TypeError: Cannot read properties of undefined (reading 'contentDocument')
at :2:20

@BoostUpStation
Contributor

So here we have to check whether the webpage has a 'contentDocument' element or not.
Then it will work fine.

@BoostUpStation
Contributor

@anilabhadatta You could do it like this, if it works:
Add the quizzes and other such elements one below another by modifying the currently open web page.

Then run the 4-5 line script above,
and then call the SingleFile extension or implement its code from GitHub.

@anilabhadatta
Owner

@BoostUpStation
You get the error because, for testing, I hardcoded ifrm to take the node at index zero, but querySelectorAll returns an empty list on that page.
When I implement it, I will loop over the list, so it won't raise any error.
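
A hedged sketch of what that loop could look like when driven from Python. It assumes `driver` is a Selenium WebDriver (or any object with a compatible `execute_script` method); the names `INLINE_SVG_JS` and `inline_svg_objects` are hypothetical, and this is not the code that was eventually pushed.

```python
# JavaScript run inside the page: move each SVG out of its <object> wrapper.
INLINE_SVG_JS = """
const objects = document.querySelectorAll("object[aria-label='svg viewer']");
let moved = 0;
for (const obj of objects) {
    const doc = obj.contentDocument;
    if (!doc) continue;  // nothing loaded or not accessible, skip it
    const svg = doc.documentElement;
    obj.parentNode.append(svg);
    // Copy the wrapper's classes one token at a time.
    const names = obj.className.split(/\\s+/).filter(Boolean);
    if (names.length) svg.classList.add(...names);
    moved += 1;
}
return moved;
"""


def inline_svg_objects(driver) -> int:
    """Run the loop in the page and return how many SVGs were moved.

    `driver` is assumed to be a Selenium WebDriver or a compatible object.
    """
    return driver.execute_script(INLINE_SVG_JS)
```

Because `querySelectorAll` returns an empty list when no matching object tag exists, the loop body simply never runs on such pages, which avoids the `contentDocument` error reported above.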

I will have to look at the SingleFile injection part.
I was thinking of adding the quiz images after getting the HTML content from SingleFile, because my program is set up to run that way; otherwise I would have to change a lot of code. It might also interfere with the code containers, so it is better to just get the HTML content using SingleFile and then append all the quiz images. That is much safer in many ways.
Also, I would ask you to test the topic list URL traversal method, create a pull request, and attach 2-3 course zips. After that I will push my code; otherwise you may need to delete the fork and re-fork it.

@anilabhadatta
Owner

@BoostUpStation @killlowkey The implementation is complete.
launching a personal website using r.zip

@killlowkey
Author

@anilabhadatta It works perfectly. Nice.

@anilabhadatta
Owner

@killlowkey @BoostUpStation I will do some testing and then push it.

@BoostUpStation
Contributor

@anilabhadatta Awesome.
No issues, everything is working perfectly.

Once you add the code, I'll re-fork it,
because it sometimes exits on some URLs.
So after you have uploaded the latest SingleFile HTML code,
I'll test it and then create a pull request.

Waiting for the code update from your side.

Also, will the code inside the HTML be scrollable, or will separate code files still be needed to view the code?

@anilabhadatta
Owner

@BoostUpStation The code will not be scrollable, because that is loaded dynamically from Educative's servers. I recommend using educative-viewer to open the code window and for easier access to the HTML files as well.
I will push it in a few hours; I am currently testing it.

@anilabhadatta
Owner

@killlowkey @BoostUpStation I have pushed the latest version. Clone it and test it on a few courses.

@anilabhadatta
Owner

@killlowkey @BoostUpStation Refer to v5.2, the latest commit, pushed a few minutes ago.

@BoostUpStation
Contributor

BoostUpStation commented Jul 19, 2022

@anilabhadatta Yes, I have pulled the latest code and am testing it.
Wouldn't it be better to bundle the SingleFile script locally with the code rather than pulling it from Git on the fly?

Also, once we have scraped individual courses, why scrape the same courses again when using the scraper for paths?
Paths contain many, or all, of the same courses that are offered as separate courses.

@anilabhadatta
Owner

@BoostUpStation I tried local injection but it failed, so I am pulling it from Git. (If you are able to implement it, you can commit it.)
I built the scraper to work on course URLs regardless of whether they belong to a single course or a path. I added a single condition on the next-button page to check whether it is the last page of that path, so that the scraper can exit.
In paths, most of the content is generally the same apart from one or two pages (or more, I guess), but the content is already organized into paths, so there is no need to check manually.

@anilabhadatta
Owner

@BoostUpStation I have updated educative-viewer as well. It will show content at 100% zoom.

@BoostUpStation
Contributor

@anilabhadatta So is it better to scrape individual courses or paths?
And if we scrape individual courses, how do we skip them when scraping paths? I don't want to download them again.

@anilabhadatta
Owner

@BoostUpStation Basically, the scraper needs the first topic URL and an index (for resuming).
You can just leave the topic URLs of path modules out of the URL list.
Just go to educative.io/explore, copy the topic URL from each course, and paste them into url.txt.

@anilabhadatta
Owner

@BoostUpStation If you want to avoid re-scraping a course that is already downloaded while scraping paths, you will need to remove those URLs manually.
Currently there is no way to check for and skip those courses, because both the URL and the name are different.
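
The manual removal itself could be scripted once you have a hand-maintained list of the topic URLs to skip. The sketch below is only an illustration: `remove_downloaded` is a hypothetical helper, the example URLs are made up, and it assumes the URL is the first whitespace-separated token on each line of the URL list. It does not solve the underlying problem that a course reached through a path has a different URL and name.

```python
def remove_downloaded(url_lines, downloaded_urls):
    """Drop lines whose URL is in the hand-maintained set of scraped URLs.

    Assumes the URL is the first whitespace-separated token on each line;
    anything after it (such as a resume index) is kept untouched.
    """
    skip = {url.strip().rstrip("/") for url in downloaded_urls}
    kept = []
    for line in url_lines:
        tokens = line.split()
        if tokens and tokens[0].rstrip("/") in skip:
            continue  # already scraped, leave it out
        kept.append(line)
    return kept


# Made-up URLs, for illustration only.
pending = [
    "https://www.educative.io/courses/course-a/aaa 0",
    "https://www.educative.io/courses/course-b/bbb 0",
]
done = ["https://www.educative.io/courses/course-a/aaa/"]
print(remove_downloaded(pending, done))
```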

@BoostUpStation
Contributor

@anilabhadatta OK, thanks, I'll try that in a few days.
Also, could you please tell me what goes into the "code widget" folders? So far they have been empty, and I have tried more than 10 courses.
For example, in this course topic:
https://www.educative.io/courses/master-deno-javascript-runtime/3w7RNLk1W7p

@anilabhadatta
Owner

anilabhadatta commented Jul 19, 2022

@BoostUpStation The code widget folder may or may not contain code.
You will see there that the widget only has an output tab, so there is no code, and that is why the folder is empty.
If there were multiple tabs, the folder would contain the code.
https://www.educative.io/module/lesson/ace-html/g2DpwW50279
Test this link.
I found a bug: the widget type is also present inside code-download-type containers. I will fix it tonight.

@anilabhadatta
Owner

anilabhadatta commented Jul 19, 2022

@BoostUpStation Fixed, and added a feature to collect data from runjs-type containers.
A few text/output files may not be saved from widget-type containers, since this is in the beta stage, and I am not planning to fix that 🤣 because of the highly complex cases. Most of the content will still be downloaded from widget-type containers.
You may also notice that the HTML doesn't show output images that are inside widgets, so I have tried to capture those images and add them to their respective widget folders.
Test link: https://www.educative.io/module/lesson/ace-html/g2DpwW50279
I won't be able to show the image in the HTML itself, since the iframe doesn't allow me to access it from outside (CORS issue).

@BoostUpStation
Contributor

@anilabhadatta
The whole HTML is text-selectable except these runjs containers.
Is there any possibility of making them text-selectable as well?

@BoostUpStation
Contributor

@anilabhadatta
I found the solution for that: you just have to remove the 'no-user-select' property from the 'monaco-editor' class div, if the property exists; otherwise continue.
Try to implement this when saving with SingleFile; if that's not possible, the HTML will have to be edited afterwards.
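
For the "edit the HTML afterwards" fallback, a rough Python sketch follows. It assumes that 'no-user-select' appears as a class token on the same element as 'monaco-editor' and that class attributes are double-quoted; `make_editors_selectable` is a hypothetical name, not part of the project.

```python
import re

CLASS_ATTR = re.compile(r'class="([^"]*)"')


def make_editors_selectable(html: str) -> str:
    """Drop the 'no-user-select' class token from monaco-editor elements."""
    def fix(match: re.Match) -> str:
        tokens = match.group(1).split()
        if "monaco-editor" in tokens and "no-user-select" in tokens:
            tokens = [t for t in tokens if t != "no-user-select"]
            return 'class="' + " ".join(tokens) + '"'
        return match.group(0)  # leave every other class attribute alone

    return CLASS_ATTR.sub(fix, html)


snippet = '<div class="monaco-editor no-user-select vs"><span class="no-user-select">x</span></div>'
print(make_editors_selectable(snippet))
# <div class="monaco-editor vs"><span class="no-user-select">x</span></div>
```

Elements without the 'monaco-editor' class are left untouched, which matches the "if the property exists, otherwise continue" condition.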

@anilabhadatta
Owner

@BoostUpStation The whole code won't be available if the widget has a scrollbar.

@BoostUpStation
Contributor

@anilabhadatta
I have implemented it; I will test it and report back.

@BoostUpStation
Contributor

BoostUpStation commented Jul 20, 2022

Yes, but when the widget doesn't have a scrollbar it is more helpful, and I have implemented it. If you allow it, I can create a pull request for just that.

@anilabhadatta
Owner

@BoostUpStation Create a PR then.

@BoostUpStation
Contributor

@anilabhadatta One issue when saving the single file:
The first question of each quiz is repeated,
and all quizzes are added at the end of the page rather than at their specific places, one after another.
Also, the quiz screenshots aren't zoom-independent and aren't selectable (though the first question of the quiz is selectable, as I guess it is captured by the SingleFile script). See if anything can be done about them.

@anilabhadatta
Owner

@BoostUpStation Nothing can be done, because I have to take screenshots of the quizzes, which is why they are not selectable. Let the first question repeat; there may be cases where the first question does not show up in the single file.

@BoostUpStation
Contributor

@anilabhadatta The page is not responsive; please do something about it.
Images go off-screen in portrait mode on mobile and can't be viewed.

@anilabhadatta
Owner

@BoostUpStation Which images?

@BoostUpStation
Contributor

@anilabhadatta
I was trying to implement the code widget, which is what caused the issue.
It's working fine now.
Thanks.

@anilabhadatta
Owner

anilabhadatta commented Jul 20, 2022

@BoostUpStation Oh, okay. The only issue you may face is the code containers in the HTML going off-screen in educative-viewer. That is a CSS issue; see if you can fix it.

@anilabhadatta
Owner

@BoostUpStation @killlowkey I am closing this issue as it is now fixed. If there are any bugs, create a new issue.
