This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Added scrape_grades command #159

Merged
merged 30 commits into backend/master from backend/scraper/scrape_grades_command on Mar 22, 2020

Conversation

gannonprudhomme
Member

So this is mostly implemented other than the tests. I made this a PR (although probs should be a draft) just so we could discuss the testing for it, since a lot of the functions in it are kind of funky to test.

I'm also considering adding optional --year and --college arguments to it so we can quickly scrape a small amount, rather than scraping everything from 2013 - 2019.

I also need to figure out how to add SSL certificate verification to the requests.get call in fetch_page_data, as not doing so gives a warning on every call.
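For reference, a minimal sketch of what that could look like, assuming fetch_page_data just wraps requests.get (the signature, timeout, and use of certifi here are my own placeholders, not the actual implementation):

```python
import certifi
import requests

def fetch_page_data(url: str) -> bytes:
    """Fetch a grade-distribution page/PDF, verifying the server's SSL certificate.

    Passing a CA bundle (or simply leaving verify=True, the default) avoids the
    InsecureRequestWarning that verify=False produces.
    """
    response = requests.get(url, verify=certifi.where(), timeout=30)
    response.raise_for_status()
    return response.content
```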

@gannonprudhomme gannonprudhomme added the backend Anything related to the backend API/Django label Mar 3, 2020
@gannonprudhomme gannonprudhomme added this to the Backend v0.2 milestone Mar 3, 2020
@gannonprudhomme gannonprudhomme self-assigned this Mar 3, 2020
@gannonprudhomme gannonprudhomme force-pushed the backend/scraper/scrape_grades_command branch from a0b0fcc to 730932f on March 3, 2020 22:25
@firejake308
Collaborator

firejake308 commented Mar 4, 2020

I got this error: No such file or directory: '/home/firejake308/AAS/autoscheduler/documents/grade_dists/grd20193GV.pdf' Do I need to download a PDF for this to work?

EDIT: Never mind. I had to make the documents and grade_dists directories. Should probably be done by the script itself though, since the server isn't smart enough to mkdir, like I am.
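A minimal sketch of what the script could do, assuming the save path matches the one in the error message (the _create_documents_folder name comes from a later commit; the constant here is a placeholder):

```python
import os

# Placeholder path; the real scraper builds this relative to the project
PDF_DOWNLOAD_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)),
                                'documents', 'grade_dists')

def _create_documents_folder() -> None:
    """Create documents/grade_dists if it doesn't exist yet, so a fresh
    checkout doesn't hit FileNotFoundError on the first PDF save."""
    os.makedirs(PDF_DOWNLOAD_DIR, exist_ok=True)
```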

@gannonprudhomme
Member Author

gannonprudhomme commented Mar 4, 2020

I got this error: No such file or directory: '/home/firejake308/AAS/autoscheduler/documents/grade_dists/grd20193GV.pdf' Do I need to download a PDF for this to work?

Damn, I was afraid of that. Do you have the autoscheduler/documents/grade_dists folder created? It should have been created automatically, but a missing folder is generally why FileNotFound errors occur for me. The download_pdf function downloads the PDF and then saves it to that directory w/ save_pdf, so it's nothing on your end.

@firejake308
Collaborator

So I resolved the FileNotFound by creating the directory manually, but now I get this error:

django.db.utils.ProgrammingError: column sections.honors does not exist
LINE 1: ...ections"."min_credits", "sections"."max_credits", "sections"...

I tried makemigrations and migrate. Do I need to migrate the schema from a different branch to get the sections.honors field?

@gannonprudhomme
Member Author

So I resolved the FileNotFound by creating the directory manually, but now I get this error:

django.db.utils.ProgrammingError: column sections.honors does not exist
LINE 1: ...ections"."min_credits", "sections"."max_credits", "sections"...

I tried makemigrations and migrate. Do I need to migrate the schema from a different branch to get the sections.honors field?

No, the honors field is in there; I'm honestly not sure why that's happening. In situations like these I generally do the following:

  1. pip install django-extensions
  2. Go to autoscheduler/autoscheduler/settings/base.py and add "django_extensions" under INSTALLED_APPS, as in the snippet after this list. (tutorial for setting it up here)
  3. Run python manage.py reset_db
  4. Delete all of the migrations in scraper/migrations
  5. Run makemigrations and migrate
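For step 2, the settings change is just one added entry; a sketch assuming the usual INSTALLED_APPS layout in autoscheduler/autoscheduler/settings/base.py (the surrounding entries are illustrative):

```python
# autoscheduler/autoscheduler/settings/base.py
INSTALLED_APPS = [
    'django.contrib.contenttypes',
    'django.contrib.staticfiles',
    'scraper',              # illustrative: the project's existing apps
    'django_extensions',    # adds extra manage.py commands such as reset_db
]
```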

@firejake308
Collaborator

Ok, it works now. I'm assuming that eventually term will be an argument to this command, just like it is for scrape_depts and scrape_courses?

# Assert
self.assertEqual(expected, result)

# Test that it throws an error on a section-not-found?
Collaborator


I would add that test, but other than that, the tests look good to me for now. scrape_pdf is pretty simple, so there's not much to test, and if you can figure out how to mock file IO and network requests, the rest of it should be good too.
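On mocking file IO and network requests, a minimal sketch with unittest.mock, assuming download_pdf and save_pdf live in the same scrape_grades module and that download_pdf takes just a URL (both assumptions; the import path and URL are placeholders):

```python
from unittest import TestCase
from unittest.mock import patch

# Assumed module path; adjust to wherever download_pdf actually lives
from scraper.management.commands import scrape_grades


class DownloadPDFTests(TestCase):
    @patch('scraper.management.commands.scrape_grades.save_pdf')
    @patch('scraper.management.commands.scrape_grades.requests.get')
    def test_download_pdf_saves_response_body(self, mock_get, mock_save):
        """download_pdf should hand the downloaded bytes to save_pdf
        without touching the network or the filesystem."""
        mock_get.return_value.content = b'%PDF-1.4 fake data'

        scrape_grades.download_pdf('https://example.com/grd20191EN.pdf')

        mock_get.assert_called_once()
        mock_save.assert_called_once()
```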

@gannonprudhomme
Member Author

Ok it works now. I'm assuming that eventually term will be an argument to this command, just like it is for scrape_depts and scrape_courses?

So grade distributions don't really do terms the way Banner does. Instead they're organized by year + semester and the school/college they fall under. For instance, there's a 20191 (Spring 2019) for EN, which is the College of Engineering in College Station, as well as a 20191 for GV, which covers all courses at TAMU Galveston.

So like I mentioned in the PR description, we could add a year argument that takes the 20191 part of the term, as well as a college argument that takes the EN. That being said, if we can always get all of the grade distributions, I'd say it's fine to leave that as the default, since ideally for deployment we only want to run most of the scraping once to fill the DB (other than running scrape_courses periodically to update the current seat count). Plus, since grade distributions cover past semesters, it's important to have all of the data we can get.

For testing, though, I'd say it'd be useful to have the arguments just so we can fill the DB quickly when testing the various features that use grade distributions.
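For context, the filenames follow a grd + year + semester digit + college pattern (e.g. grd20193GV.pdf from the earlier error, presumably Fall 2019 Galveston). A rough sketch of how --year/--college combinations could map to PDFs to fetch; the URL template, semester mapping, and helper are illustrative, not the scraper's actual code:

```python
import itertools

# Illustrative URL template and college list, not the scraper's real constants
GRADE_DIST_URL = 'https://web-as.tamu.edu/gradereport/PDFReports/{term}/grd{term}{college}.pdf'
SEMESTERS = ('1', '2', '3')   # assumed: spring, summer, fall
COLLEGES = ('EN', 'GV')       # e.g. Engineering, Galveston

def grade_dist_urls(years, colleges=COLLEGES):
    """Yield one PDF URL per (year, semester, college) combination."""
    for year, semester, college in itertools.product(years, SEMESTERS, colleges):
        term = f'{year}{semester}'
        yield GRADE_DIST_URL.format(term=term, college=college.upper())
```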

@firejake308
Collaborator

Oh, right, that makes more sense. If it's possible to get all of the data at once, then I think that's the way to go

@gannonprudhomme
Member Author

So I'm not sure why I only just realized this, but the old PDF style (I believe it changed around Fall 2016) isn't actually parsed correctly, so none of the PDFs before Fall 2016 will actually have grades for them. You can see this if you run it for 2015 (change years in handle() from get_available_years to [2015]): it reports "No grades scraped". The only time that should happen is if it returns a bunch of "Section not found"s, which would just mean you haven't run scrape_courses for that term. Since it doesn't, that shows it isn't actually scraping any grades from PDFs of that style. Working on a fix right now, and I'll add tests for it in pdf_parser_tests.

@gannonprudhomme
Member Author

Just added the fix for the above comment. You can read the commit description for more information about it.
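In rough terms, the fix carries the style flag for the whole page instead of losing it after the header row; a sketch assuming get_pdf_skip_count returns the skip count together with the old_pdf_style flag, and with parse_row standing in for however individual rows are actually handled:

```python
def parse_page(page_rows, term):
    """Parse one page of a grade-distribution PDF.

    The old PDF style (roughly pre-Fall 2016) can only be detected from the
    header row, so the flag is computed once and reused for every grade row
    on the page rather than being re-derived per row.
    """
    skip_count, old_pdf_style = get_pdf_skip_count(page_rows)

    for row in page_rows[skip_count:]:
        # Every row on this page uses the same style flag
        parse_row(row, term, old_pdf_style=old_pdf_style)
```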

@gannonprudhomme
Member Author

I also added optional CLI arguments for scrape_grades, so you can run python manage.py scrape_grades --year 2015 --college EN, or for short, python manage.py scrape_grades -y 2015 -c EN. (The case of the college doesn't actually matter, so en would also work)
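For reference, the wiring for those flags in a Django management command looks roughly like this (a sketch; the help text and the fallback in handle are illustrative, though get_available_years is the helper mentioned above):

```python
from django.core.management.base import BaseCommand


class Command(BaseCommand):
    """Scrapes grade distribution PDFs and saves the grades to the database."""

    def add_arguments(self, parser):
        # Both flags are optional; omitting them scrapes everything
        parser.add_argument('-y', '--year', type=int,
                            help='Only scrape grade distributions for this year, e.g. 2015')
        parser.add_argument('-c', '--college', type=str,
                            help='Only scrape this college, e.g. EN (case-insensitive)')

    def handle(self, *args, **options):
        # get_available_years is the existing helper; fall back to all years when -y is omitted
        years = [options['year']] if options['year'] else get_available_years()
```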

Collaborator

@rachelconn rachelconn left a comment


Most of these are nitpicky/subjective, so let me know what you think about these suggestions.

Collaborator

@rachelconn rachelconn left a comment


Looks good now 👍

@gannonprudhomme
Member Author

gannonprudhomme commented Mar 22, 2020

I'm just gonna wait until #154 is merged to merge this so I can deal with rebasing here rather than in that PR

@gannonprudhomme gannonprudhomme force-pushed the backend/scraper/scrape_grades_command branch from c2503a8 to 1268c7f on March 22, 2020 21:47
@gannonprudhomme gannonprudhomme force-pushed the backend/scraper/scrape_grades_command branch from 1268c7f to da1d7f5 on March 22, 2020 21:49
@gannonprudhomme
Member Author

Since this is completed, I'm going to change the base from backend/scraper/scrape_grades to backend/master, instead of merging this into scrape_grades then from there making a PR to merge into backend/master.

This basically just cleans up the return types so they're easier to understand
Basically just for readability purposes, functions the same

Also removed unused function generate_year_semesters()
Also some misc linting fixes
Also added it to the lint-requirements for GitHub Actions
Also removed redundant json.close() in load_json_file and instead returned it directly
Also changed pdf_reader.getNumPages() to .numPages

Also fixed linting error
Changed get_pdf_skip_count to assign returned variables inline

Removed extra grades iteration by adding up num_students in existing for-loop

Changed list addition operator to .extend for readability
GradeManager is used for calculating an instructor's past grade distributions
Changed instructor_performance return to specify that Dict value can be a float or int

Rest of commit is minor comment fixes
Also added beautiful soup to lint-requirements
These are incomplete, and more need to be added as commented
Since only the header row of the PDF indicates that it's the old PDF style, we only knew the style for the header row and not for the actual grade rows, which prevented us from correctly parsing the section's grades, since the old style has a different format.

To remedy this, anytime old_pdf_style is True in pdf_helper.get_pdf_skip_count, we store it (in pdf_parser.parse_page) and use it for the rest of the page.

Also adds the according tests for it
Changed PDF_DOWNLOAD_DIR to use dirname instead of relative path

Changed scrape_pdf's counts dictionary to use defaultdict

Other misc semantic syntax changes
Moved to _create_documents_folder since that's where the actual error will occur
Example usage:

python manage.py scrape_grades -c EN --year 2015
Also adds SSL verification back to scrape_grades.fetch_page_data
- Removed unnecessary import to pass linting
- Changed task collecting to use list comprehension
- Changed colleges & years assignment to use ternary operators
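On the defaultdict change mentioned a few commits up, the pattern is roughly this (the counter names are illustrative, not the scraper's actual keys):

```python
from collections import defaultdict

# Unseen keys start at 0, so scrape_pdf can increment counters without
# first checking whether the key exists
counts = defaultdict(int)

counts['sections saved'] += 1
counts['section not found'] += 1
counts['sections saved'] += 1

print(dict(counts))  # {'sections saved': 2, 'section not found': 1}
```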
@gannonprudhomme
Member Author

gannonprudhomme commented Mar 22, 2020

Ok so I think I fixed it to be back to normal, but to double check I'm gonna reset the base back to backend/scraper/scrape_grades

@gannonprudhomme gannonprudhomme changed the base branch from backend/master to backend/scraper/scrape_grades March 22, 2020 21:59
@gannonprudhomme gannonprudhomme force-pushed the backend/scraper/scrape_grades_command branch from c2503a8 to d9947e9 on March 22, 2020 22:00
@gannonprudhomme
Member Author

Ok, should be good to go now. Changing the base back to backend/master, then rebasing & merging.

@gannonprudhomme gannonprudhomme changed the base branch from backend/scraper/scrape_grades to backend/master March 22, 2020 22:03
@gannonprudhomme gannonprudhomme merged commit 960b8ad into backend/master Mar 22, 2020
@gannonprudhomme gannonprudhomme deleted the backend/scraper/scrape_grades_command branch March 22, 2020 22:06