Added scrape_grades command #159
Added scrape_grades command #159
Conversation
a0b0fcc
to
730932f
Compare
I got this error: EDIT: Never mind. I had to make the |
Damn I was afraid of that. Do you have the |
So I resolved the FileNotFound by creating the directory manually, but now I get this error:
I tried |
No the honors field is in there, I'm honestly not sure why that's happening. In situations like these I generally do the following:
|
Ok it owrks now. I'm assuming that eventually |
# Assert | ||
self.assertEqual(expected, result) | ||
|
||
# Test that it throws an error on a section-not-found? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add that test, but other than that, the tests look good to me for now. scrape_pdf
is a pretty simple functionality, so there's not much to test, and if you can figure out how to mock file IO and network requests, the rest of it should be good too.
So grade distributions don't really do terms like Banner does them. Instead they're ordered by year+semester and the school/college its under. For instance, there's a So like I mentioned in the PR description, we could add a year argument that takes the For testing though I'd say it'd be useful to have arguments just so we can fill the DB quickly for testing various features that use the grade distributions |
Oh, right, that makes more sense. If it's possible to get all of the data at once, then I think that's the way to go |
So I'm not sure how I just realized this, but the old pdf style (believe it changed around 2016 Fall) isn't actually parsed correctly, so none of the PDF's before 2016 Fall will actually have grades for them. You can see this if you try to run it for 2015 (change |
Just added the fix for the above comment. You can read the commit description for more information about it. |
I also added optional CLI arguments for scrape_grades, so you can run |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of these are nitpicky/subjective so let me know what you think about about these suggestions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good now 👍
I'm just gonna wait until #154 is merged to merge this so I can deal with rebasing here rather than in that PR |
784d066
to
256e8cf
Compare
c2503a8
to
1268c7f
Compare
256e8cf
to
58133e2
Compare
1268c7f
to
da1d7f5
Compare
Since this is completed, I'm going to change the base from |
This basically just cleans up the return types so they're easier to understand
Basically just for readability purposes, functions the same Also removed unused function generate_year_semesters()
Also some misc linting fixes
Also added it to the lint-requirements for GitHub Actions
ALso removed redundant json.close() in load_json_file and instead returned it directly
Also changed pdf_reader.getNumPages() to .numPages Also fixed linting error
Changed get_pdf_skip_count to assign returned variables inline Removed extra grades iteration by adding up num_students in existing for-loop Changed list addition operator to .extend for readability
GradeManager is used for calculating an instructor's past grade distributions
Changed instructor_performance return to specify that Dict value can be a float or int Rest of commit is minor comment fixes
Also added beautiful soup to lint-requirements
These are incomplete, and more need to be added as commented
Since only the header row of the PDF indicates that it's an old pdf style, we only knew that it was an old pdf style for the first row and not the actual grades themselves, which prevented us from actually correctly parsing the section's grades, since the old style has a different format. To remedy this, anytime old_pdf_style is True in pdf_helper.get_pdf_skip_count, we store it (in pdf_parser.parse_page) and use it for the rest of the page. Also adds the according tests for it
Changed PDF_DOWNLOAD_DIR to use dirname instead of relative path Changed scrape_pdf's counts dictionary to use defaultdict Other misc semantic syntax changes
Moved to _create_documents_folder since thats where the actual error will occur
Example usage: python manage.py scrape_grades -c EN --year 2015
Also adds SSL verification back to scrape_grades.fetch_page_data
- Removed unnecessary import to pass linting - Changed task collecting to use list comprehension - Changed colleges & years assignment to use ternary operators
Ok so I think I fixed it to be back to normal, but to double check I'm gonna reset the base back to |
784d066
to
6a130a9
Compare
c2503a8
to
d9947e9
Compare
Ok should be good to go now. Changing the base back to |
So this is mostly implemented other than the tests. I made this a PR (although probs should be a draft) just so we could discuss the testing for it, since a lot of the functions in it are kind of funky to test.
I'm also considering adding optional
--year
and--college
arguments to it so we can quickly scrape a small amount, rather than scraping everything from 2013 - 2019.I also need to figure out how to add SSL certificate creation to
requests.get
call infetch_page_data
, as not doing so gives a warning on the function call.