better source for data model #2

Closed
cch5ng opened this issue Mar 22, 2015 · 8 comments

@cch5ng
Owner

cch5ng commented Mar 22, 2015

currently web scraping the contents of the h5bp interview questions gh-pages index.html. while this works, it is unreliable b/c changes to the master README.md (the original questions source) have likely not been pushed to the gh-pages branch yet.
would like to parse the text of the README.md directly but not exactly sure how to do this. may require learning regular expressions. may use some 3rd party text parsing library??
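
a rough sketch of what the regular expression route could look like. this is not the app's actual code: it assumes the README marks each category with a markdown heading (e.g. `#### General Questions:`) and each question with a top-level `*` or `-` bullet, and the function name `parseQuestions` is made up for illustration.

```javascript
// Sketch only. Assumptions (not verified against the real README):
//   - a category is a markdown heading of level 2-6, optionally ending in ":"
//   - a question is a top-level "*" or "-" bullet under that heading
//   - indented (nested) bullets are ignored
function parseQuestions(markdown) {
  const categories = [];
  let current = null;

  for (const line of markdown.split(/\r?\n/)) {
    const heading = line.match(/^#{2,6}\s+(.+?):?\s*$/);
    if (heading) {
      current = { category: heading[1], questions: [] };
      categories.push(current);
      continue;
    }

    const bullet = line.match(/^[*-]\s+(.+)$/);
    if (bullet && current) {
      current.questions.push(bullet[1].trim());
    }
  }

  // Drop headings that had no bullets under them (e.g. a table of contents).
  return categories.filter((c) => c.questions.length > 0);
}
```

the result is an array of `{ category, questions }` objects, so `JSON.stringify(parseQuestions(readmeText))` would give JSON without going through HTML first. sections that are not written as bullet lists would be skipped by this approach.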

@cch5ng
Owner Author

cch5ng commented Mar 28, 2015

  • try to update the scraper to point to the html output from https://github.com/h5bp/Front-end-Developer-Interview-Questions ... appears that the README content is all contained in <article> ... verified that this is a unique outer element for the existing scraper logic
  • or alternatively look at a markdown parser (but those appear to output html instead of json)

@cch5ng
Owner Author

cch5ng commented Mar 28, 2015

  • skimmed docs for one markdown parser (https://github.com/evilstreak/markdown-js) and it sounds like its output is basically markdown to HTML. I want the interim JSON. need to test it out
  • on a little further reading, it is possible to use an interim function to get from markdown to JSONML but it does not look super easy to go from JSONML to the JSON I am interested in. the kicker is the embedded html tags like <code> and then a bunch of nested lists
    • would it be viable to grab a copy of the raw .md file, run it through the markdown parser (to HTML) and then apply my current logic to those results? but then there are additional variables, like what if the parser introduces bugs and causes my app to fail?
    • would there be a way to automate grabbing a copy of the raw .md file (weekly), running that thru the markdown parser, and saving the HTML results to my github repo? then my existing logic should work automatically (and there wouldn't be a slowdown from doing parsing every time)
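
for the JSONML to JSON step, a sketch of a walker over a JsonML-style tree. the node names used here (`header`, `bulletlist`, `numberlist`, `listitem`, `inlinecode`) are assumptions about what the parser emits and would need checking against its real output; the function names are made up. it does not call the library itself, it only works on the nested-array shape.

```javascript
// Sketch only. Assumed tree shape (to be verified against the parser):
//   ["markdown",
//     ["header", { level: 4 }, "General Questions:"],
//     ["bulletlist", ["listitem", "text ", ["inlinecode", "x"]], ...]]

// Flatten a node's inline children into plain text, skipping nested lists
// and attribute objects.
function textOf(node) {
  if (typeof node === "string") return node;
  if (!Array.isArray(node)) return ""; // attribute object such as { level: 4 }
  if (node[0] === "bulletlist" || node[0] === "numberlist") return "";
  return node.slice(1).map(textOf).join("");
}

// Group each top-level bullet list under the header that precedes it.
function jsonmlToCategories(tree) {
  const categories = [];
  let current = null;

  for (const node of tree.slice(1)) {
    if (!Array.isArray(node)) continue;

    if (node[0] === "header") {
      const name = textOf(node).replace(/:\s*$/, "").trim();
      current = { category: name, questions: [] };
      categories.push(current);
    } else if (node[0] === "bulletlist" && current) {
      for (const item of node.slice(1)) {
        if (Array.isArray(item) && item[0] === "listitem") {
          current.questions.push(textOf(item).trim());
        }
      }
    }
  }
  return categories;
}
```

inline elements like `<code>` get flattened to their text, and nested lists are dropped rather than kept, which is a simplification; keeping sub-questions would need a recursive `questions` structure.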

@cch5ng
Owner Author

cch5ng commented Mar 28, 2015

retried web scraping the master github pages for the project root and for the project README.md file, but got errors like:

@cch5ng
Owner Author

cch5ng commented Mar 29, 2015

(self note: from what I can tell, the updates from the original h5bp project README to their gh-pages index.html are being maintained manually by one person; I cannot detect any automated update process in the repo source files)

@cch5ng
Owner Author

cch5ng commented Mar 29, 2015

temp workaround for time being...

  • grabbed h5bp raw README.md (master) and put it into http://dillinger.io/ > output as html
    • 03 29 15: 2 04a ... see one issue related to a formatting inconsistency in the readme. the coding questions use <p> tags and <pre> tags, so the max number counts by category are getting messed up. probably should swap the order of fun questions and coding questions. 2nd issue is that the form labels are currently hard coded and they should be dynamic based on the readme html contents
    • 03 29 15: 1 29a ... got slightly further trying to read the generated readme html on my gh-pages. now am getting a legit list of categories but for some reason the questions are not getting read and appended into the final js array of categories/questions
    • trying to test the results but the jquery .find() is not reading the html results correctly so I don't know what is the difference between the dillinger.io output and the html format used in h5bp's gh-pages index
  • plan to add resulting HTML into a new src folder in my repo and point to that file from my XHR
    • don't like introducing a manual dependency but really hate giving people unreliable content
    • in long term, would need better solution but in short would really prefer working on app functionality and improving angular skills
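
on the hard coded form labels noted above: a small sketch of deriving the labels and the max counts from the parsed data instead. it assumes the scraper ends up with an array of `{ category, questions }` objects (a hypothetical shape) and `toFormFields` is a made-up name.

```javascript
// Sketch only: `categories` is assumed to look like
//   [{ category: "General Questions", questions: ["...", "..."] }, ...]
// Each form field gets its label from the category name and its max
// from the number of questions actually found for that category.
function toFormFields(categories) {
  return categories.map(({ category, questions }) => ({
    label: category,
    max: questions.length,
  }));
}
```

with this, a category added or renamed in the readme would show up in the form without touching the template.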

@cch5ng
Owner Author

cch5ng commented Mar 29, 2015

  • temp workaround to inconsistent formatting for the coding questions section
    • hardcode the form labels. swap positions of fun questions and coding questions
    • show coding questions as just read-only text or a read-only input that communicates that all coding questions will be returned no matter what
    • store the coding questions (category and questions set) in a different variable than the other category/question groups
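
the workaround above could be sketched roughly like this. the data shapes, the `buildQuiz` name, and the random selection are all assumptions for illustration, not what the app necessarily does.

```javascript
// Sketch only. Assumed shapes:
//   categories:       [{ category, questions: [...] }, ...]  (list-based sections)
//   codingQuestions:  { category, questions: [...] }         (kept separately)
//   requestedCounts:  { "General Questions": 3, ... }        (from the form)
// `random` is injectable so the selection can be made deterministic in tests.
function buildQuiz(categories, codingQuestions, requestedCounts, random = Math.random) {
  const quiz = categories.map(({ category, questions }) => {
    // Clamp the requested count to the range [0, questions.length].
    const requested = requestedCounts[category] || 0;
    const wanted = Math.max(0, Math.min(requested, questions.length));

    // Fisher-Yates shuffle on a copy, then take the first `wanted` items.
    const pool = questions.slice();
    for (let i = pool.length - 1; i > 0; i--) {
      const j = Math.floor(random() * (i + 1));
      [pool[i], pool[j]] = [pool[j], pool[i]];
    }
    return { category, questions: pool.slice(0, wanted) };
  });

  // Coding questions are always returned in full, whatever the form says.
  quiz.push({
    category: codingQuestions.category,
    questions: codingQuestions.questions.slice(),
  });
  return quiz;
}
```

keeping the coding questions out of `categories` means their different markup never affects the per-category max counts.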

@cch5ng
Owner Author

cch5ng commented Mar 31, 2015

fixed the handling of the inconsistency with coding questions (non-list format and different html tags).
a16c78b

@cch5ng
Owner Author

cch5ng commented Mar 31, 2015

  • this is about as much as I plan to do for this iteration
    • in the future may want to revisit having a better data model and better way of pulling data from the h5bp repo's README file.
    • but would like to wrap up this project more quickly and work on different projects

@cch5ng cch5ng closed this as completed Mar 31, 2015