This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Added scrape_grades command #159

Merged
30 commits merged on Mar 22, 2020

Commits on Mar 22, 2020

  1. Added documents/ to .gitignore

    Keeps the downloaded grade distribution PDFs from being tracked by git
    gannonprudhomme committed Mar 22, 2020
    17440e7
  2. Added pdf_parser

    Most of this is from Good Bull Schedules, but will most likely change as we go along
    
    Also added __init__.py for it
    gannonprudhomme committed Mar 22, 2020
    fced98b
  3. a4ff5df
  4. Updated load_json with load_pdf function

    Also added a _generate_path function for use in it + load_json_file
    gannonprudhomme committed Mar 22, 2020
    029e8db
  5. a1a1ed4
  6. Extracted out functions from pdf_parser

    Moved them into pdf_helper and simplified the parse_page function accordingly
    gannonprudhomme committed Mar 22, 2020
    2c01670
  7. 41078cc
  8. 6fe96b2
  9. Changed pdf_parser functions to use GradeData

    This basically just cleans up the return types so they're easier to understand
    gannonprudhomme committed Mar 22, 2020
    6d791cf
  10. Extracted out extract_letter_grades from parse_page

    Purely for readability; behavior is unchanged
    
    Also removed unused function generate_year_semesters()
    gannonprudhomme committed Mar 22, 2020
    eb20c98
  11. db67218
  12. Added parse_page test to pdf_parser_tests

    Also some misc linting fixes
    gannonprudhomme committed Mar 22, 2020
    8d2a00c
  13. Added PyPDF2 to requirements for pdf_parser

    Also added it to the lint-requirements for GitHub Actions
    gannonprudhomme committed Mar 22, 2020
    33bb200
  14. Changed generate_path to be public function in load_json.py

    Also removed the redundant json.close() in load_json_file and returned the loaded data directly instead
    gannonprudhomme committed Mar 22, 2020
    dadf932
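
A minimal sketch of what commit 14 describes, assuming a layout the commit message does not spell out: the `base_dir` parameter and both function signatures are hypothetical, and only the names `generate_path` and `load_json_file` come from the commit message.

```python
import json
import os
from typing import Any


def generate_path(file_name: str, base_dir: str) -> str:
    """Build the path to a data file (base_dir is a hypothetical parameter)."""
    return os.path.join(base_dir, file_name)


def load_json_file(file_name: str, base_dir: str) -> Any:
    """Return the parsed JSON directly.

    The with-block closes the file on exit, so an explicit close() call
    after loading would be redundant.
    """
    with open(generate_path(file_name, base_dir), encoding="utf-8") as json_file:
        return json.load(json_file)
```

Making `generate_path` public (no leading underscore) lets other modules, such as tests, reuse the same path logic.
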
  15. Added parse_pdf test

    gannonprudhomme committed Mar 22, 2020
    a5f2012
  16. Added return of pdf_data in pdf_parser.parse_pdf

    Also changed pdf_reader.getNumPages() to .numPages
    
    Also fixed linting error
    gannonprudhomme committed Mar 22, 2020
    573bd95
  17. Semantic fixes in pdf_parser

    Changed get_pdf_skip_count to assign returned variables inline
    
    Removed extra grades iteration by adding up num_students in existing for-loop
    
    Changed list addition operator to .extend for readability
    gannonprudhomme committed Mar 22, 2020
    79080ef
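
The loop and `.extend` changes in commit 17 can be illustrated with a short sketch. This is not the repository's code: `tally_grades`, its parameters, and the choice to count only the first group of values toward `num_students` are all assumptions made for illustration.

```python
from typing import List, Tuple


def tally_grades(raw_counts: List[str], other_counts: List[int]) -> Tuple[List[int], int]:
    """Convert raw grade counts and total them in a single pass (hypothetical helper)."""
    grades: List[int] = []
    num_students = 0
    for raw in raw_counts:
        count = int(raw)
        grades.append(count)
        # Total the students here rather than iterating over grades a second time
        num_students += count
    # .extend appends in place, instead of building a new list with grades + other_counts
    grades.extend(other_counts)
    return grades, num_students


grades, num_students = tally_grades(["10", "5", "2"], [1])
```
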
  18. Added Grades model + GradeManager

    GradeManager is used for calculating an instructor's past grade distributions
    gannonprudhomme committed Mar 22, 2020
    c777a45
  19. 107ad0d
  20. Minor fixes in models_tests & grades model

    Changed instructor_performance return to specify that Dict value can be a float or int
    
    Rest of commit is minor comment fixes
    gannonprudhomme committed Mar 22, 2020
    6a130a9
  21. Added beautifulsoup and lxml to requirements.txt

    Also added beautiful soup to lint-requirements
    gannonprudhomme committed Mar 22, 2020
    692c49c
  22. bbf4fb1
  23. Added tests for scrape_grades

    These are incomplete; more need to be added, as noted in the comments
    gannonprudhomme committed Mar 22, 2020
    0a49dc5
  24. Updated pdf_parser to work with old pdf style

    Only the header row of a PDF page indicates the old PDF style, so we knew the style for that first row but not for the grade rows that follow it. Because the old style uses a different row format, the section's grades could not be parsed correctly.
    
    To remedy this, whenever pdf_helper.get_pdf_skip_count reports old_pdf_style as True, pdf_parser.parse_page stores the flag and uses it for the rest of the page.
    
    Also adds the corresponding tests
    gannonprudhomme committed Mar 22, 2020
    bfcf7b4
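
The fix in commit 24 amounts to remembering a flag once the header row has set it. A minimal sketch under stated assumptions: the marker string, `is_old_style_header`, and `parse_page_sketch` are hypothetical stand-ins, and the real header detection in pdf_helper.get_pdf_skip_count is not shown in this PR summary.

```python
from typing import List, Tuple

OLD_STYLE_MARKER = "OLD-HEADER"  # hypothetical marker; the real check is not shown here


def is_old_style_header(row: str) -> bool:
    """Stand-in for the header detection (assumed behavior)."""
    return row.startswith(OLD_STYLE_MARKER)


def parse_page_sketch(rows: List[str]) -> List[Tuple[str, bool]]:
    """Tag each grade row with the style detected from the page's header row."""
    old_pdf_style = False
    parsed: List[Tuple[str, bool]] = []
    for row in rows:
        if is_old_style_header(row):
            # Store the flag so every later row on this page sees it
            old_pdf_style = True
            continue
        parsed.append((row, old_pdf_style))
    return parsed
```

Because the flag lives outside the per-row logic, rows after the header are handled with the old format even though they carry no style marker of their own.
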
  25. Added suggestions to scrape_grades

    Changed PDF_DOWNLOAD_DIR to use dirname instead of relative path
    
    Changed scrape_pdf's counts dictionary to use defaultdict
    
    Other misc semantic syntax changes
    gannonprudhomme committed Mar 22, 2020
    bc21c29
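
The first two suggestions in commit 25 might look roughly like the sketch below. The helper names `download_dir` and `count_by_key` are hypothetical, as is what exactly gets counted; only the use of dirname, `defaultdict`, and the documents/grade_dists folder name come from this PR's commit messages.

```python
import os
from collections import defaultdict
from typing import DefaultDict, Iterable


def download_dir(module_file: str) -> str:
    """Anchor the download folder to a module's location, not the working directory."""
    module_dir = os.path.dirname(os.path.abspath(module_file))
    return os.path.join(module_dir, "documents", "grade_dists")


def count_by_key(keys: Iterable[str]) -> DefaultDict[str, int]:
    """Count occurrences without first checking whether a key exists."""
    counts: DefaultDict[str, int] = defaultdict(int)
    for key in keys:
        counts[key] += 1  # missing keys start at 0
    return counts
```

In a module, something like `PDF_DOWNLOAD_DIR = download_dir(__file__)` would then resolve to the same folder regardless of where the command is run from.
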
  26. e3fba69
  27. Updated documents/grade_dists error catching

    Moved to _create_documents_folder, since that's where the error will actually occur
    gannonprudhomme committed Mar 22, 2020
    dc9fbf5
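
One way the error handling in commit 27 could sit next to the folder creation is sketched below. The function body, its return value, and the printed message are assumptions; the commit message only says the handling was moved into _create_documents_folder.

```python
import os


def create_documents_folder(path: str) -> bool:
    """Create the folder if needed and report whether it is usable (hypothetical helper)."""
    try:
        # exist_ok=True makes a second call on an existing folder a no-op
        os.makedirs(path, exist_ok=True)
    except OSError as err:
        # Catch the failure where it actually happens, e.g. on a permissions problem
        print(f"Could not create {path}: {err}")
        return False
    return True
```
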
  28. Added optional arguments for scrape_grades

    Example usage:
    
    python manage.py scrape_grades -c EN --year 2015
    gannonprudhomme committed Mar 22, 2020
    12ae93c
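
The example usage in commit 28 suggests two optional arguments. A minimal sketch with plain `argparse` follows; Django passes an argparse-style parser to a management command's `add_arguments` method, so the same `add_argument` calls would likely carry over. The long name `--college`, the short name `-y`, and the `None` defaults are assumptions; only `-c` and `--year` appear in the commit message.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser accepting the two optional filters from the usage example."""
    parser = argparse.ArgumentParser(prog="scrape_grades")
    # None means "no filter given", i.e. scrape everything (assumed semantics)
    parser.add_argument("-c", "--college", type=str, default=None,
                        help="college abbreviation, e.g. EN")
    parser.add_argument("-y", "--year", type=int, default=None,
                        help="year to scrape, e.g. 2015")
    return parser


# Equivalent to: python manage.py scrape_grades -c EN --year 2015
args = build_parser().parse_args(["-c", "EN", "--year", "2015"])
```
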
  29. Misc semantic changes in scrape_grades

    Also adds SSL verification back to scrape_grades.fetch_page_data
    gannonprudhomme committed Mar 22, 2020
    463c1e6
  30. Minor syntax changes in scrape_grades per PR comments

    - Removed unnecessary import to pass linting
    - Changed task collecting to use list comprehension
    - Changed colleges & years assignment to use ternary operators
    gannonprudhomme committed Mar 22, 2020
    d9947e9
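
The last two bullets of commit 30 can be sketched as follows. The default lists, `build_tasks`, and the shape of a "task" (here a plain tuple) are hypothetical; the actual command presumably builds something richer, such as download jobs.

```python
from typing import List, Optional, Tuple

ALL_COLLEGES = ["EN", "SC"]  # hypothetical defaults
ALL_YEARS = [2018, 2019]     # hypothetical defaults


def build_tasks(college: Optional[str] = None,
                year: Optional[int] = None) -> List[Tuple[int, str]]:
    """Pick the colleges and years to scrape, then pair them up."""
    # Conditional expressions (Python's ternary form) replace if/else blocks
    colleges = [college] if college is not None else ALL_COLLEGES
    years = [year] if year is not None else ALL_YEARS
    # A list comprehension collects one task per (year, college) pair
    return [(y, c) for y in years for c in colleges]
```
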