Added scrape_grades command #159

gannonprudhomme · 2020-03-03T22:23:43Z

So this is mostly implemented other than the tests. I made this a PR (although probs should be a draft) just so we could discuss the testing for it, since a lot of the functions in it are kind of funky to test.

I'm also considering adding optional --year and --college arguments to it so we can quickly scrape a small amount, rather than scraping everything from 2013 - 2019.

I also need to figure out how to add SSL certificate creation to requests.get call in fetch_page_data, as not doing so gives a warning on the function call.

firejake308 · 2020-03-04T03:06:18Z

I got this error: No such file or directory: '/home/firejake308/AAS/autoscheduler/documents/grade_dists/grd20193GV.pdf' Do I need to download a PDF for this to work?

EDIT: Never mind. I had to make the documents and grade_dists directories. Should probably be done by the script itself though, since the server isn't smart enough to mkdir, like I am.

gannonprudhomme · 2020-03-04T03:10:37Z

I got this error: No such file or directory: '/home/firejake308/AAS/autoscheduler/documents/grade_dists/grd20193GV.pdf' Do I need to download a PDF for this to work?

Damn I was afraid of that. Do you have the autoscheduler/documents/grade_dists folder created? It should have created it automatically, but that's generally why FileNotFound errors occur for me. The download_pdf function downloads the PDF and then saves them to the directory w/ save_pdf, so it's nothing on your end

firejake308 · 2020-03-04T03:14:00Z

So I resolved the FileNotFound by creating the directory manually, but now I get this error:

django.db.utils.ProgrammingError:` column sections.honors does not exist
LINE 1: ...ections"."min_credits", "sections"."max_credits", "sections"...

I tried makemirgrations and migrate. Do I need to migrate the schema from a different branch to get the sections.honors field?

gannonprudhomme · 2020-03-04T03:21:02Z

So I resolved the FileNotFound by creating the directory manually, but now I get this error:

django.db.utils.ProgrammingError:` column sections.honors does not exist
LINE 1: ...ections"."min_credits", "sections"."max_credits", "sections"...

I tried makemirgrations and migrate. Do I need to migrate the schema from a different branch to get the sections.honors field?

No the honors field is in there, I'm honestly not sure why that's happening. In situations like these I generally do the following:

pip install django-extensions
Go to autoscheduler/autoscheduler/settings/base.py and add "django_extensions" under INSTALLED_APPS. (tutorial for setting it up here)
Run python manage.py reset_db
Delete all of the migrations in scraper/migrations
Run makemigrations and migrate

firejake308 · 2020-03-04T03:50:02Z

Ok it owrks now. I'm assuming that eventually term will be an argument to this command, just like it is for scrape_depts and scrape_courses?

firejake308 · 2020-03-04T04:10:02Z

autoscheduler/scraper/tests/commands/scrape_grades_tests.py

+        # Assert
+        self.assertEqual(expected, result)
+
+    # Test that it throws an error on a section-not-found?


I would add that test, but other than that, the tests look good to me for now. scrape_pdf is a pretty simple functionality, so there's not much to test, and if you can figure out how to mock file IO and network requests, the rest of it should be good too.

gannonprudhomme · 2020-03-04T04:24:44Z

Ok it works now. I'm assuming that eventually term will be an argument to this command, just like it is for scrape_depts and scrape_courses?

So grade distributions don't really do terms like Banner does them. Instead they're ordered by year+semester and the school/college its under. For instance, there's a 20191 (spring 2019) for EN, which is the College of Engineering in College Station, as well as a 20191 for GV which is for all courses in TAMU Galveston.

So like I mentioned in the PR description, we could add a year argument that takes the 20191 part of the term, as well as a college argument that takes the EN. That being said, if we can always get all of the grade distributions I'd say it's good leaving it as the default, cause ideally for deploying we only want to run most of the scraping one time to fill the DB (other than running scrape_courses periodically for updating the current seat count). Plus since grade distributions uses past semesters, it's important to have all of the data we can get.

For testing though I'd say it'd be useful to have arguments just so we can fill the DB quickly for testing various features that use the grade distributions

firejake308 · 2020-03-04T04:33:38Z

Oh, right, that makes more sense. If it's possible to get all of the data at once, then I think that's the way to go

autoscheduler/scraper/management/commands/scrape_grades.py

gannonprudhomme · 2020-03-14T17:32:14Z

So I'm not sure how I just realized this, but the old pdf style (believe it changed around 2016 Fall) isn't actually parsed correctly, so none of the PDF's before 2016 Fall will actually have grades for them. You can see this if you try to run it for 2015 (change years in handle() from get_available_years to [2015]), you'll see that it says "No grades scraped". The only time it should do this is if it returns a bunch of Section not found's, which would indicate that you just haven't run scrape_courses for that term. But since it doesn't, this shows that it actually doesn't scrape any grades for the PDF's of that style. Working on a fix right now, and will add tests for it in pdf_parser_tests

gannonprudhomme · 2020-03-14T18:14:51Z

Just added the fix for the above comment. You can read the commit description for more information about it.

gannonprudhomme · 2020-03-14T18:46:03Z

I also added optional CLI arguments for scrape_grades, so you can run python manage.py scrape_grades --year 2015 --college EN, or for short, python manage.py scrape_grades -y 2015 -c EN. (The case of the college doesn't actually matter, so en would also work)

autoscheduler/scraper/management/commands/scrape_grades.py

rachelconn

Most of these are nitpicky/subjective so let me know what you think about about these suggestions

autoscheduler/scraper/management/commands/scrape_grades.py

rachelconn

Looks good now 👍

gannonprudhomme · 2020-03-22T17:27:56Z

I'm just gonna wait until #154 is merged to merge this so I can deal with rebasing here rather than in that PR

gannonprudhomme · 2020-03-22T21:52:00Z

Since this is completed, I'm going to change the base from backend/scraper/scrape_grades to backend/master, instead of merging this into scrape_grades then from there making a PR to merge into backend/master.

This basically just cleans up the return types so they're easier to understand

Basically just for readability purposes, functions the same Also removed unused function generate_year_semesters()

Also some misc linting fixes

Also added it to the lint-requirements for GitHub Actions

ALso removed redundant json.close() in load_json_file and instead returned it directly

Also changed pdf_reader.getNumPages() to .numPages Also fixed linting error

Changed get_pdf_skip_count to assign returned variables inline Removed extra grades iteration by adding up num_students in existing for-loop Changed list addition operator to .extend for readability

GradeManager is used for calculating an instructor's past grade distributions

Changed instructor_performance return to specify that Dict value can be a float or int Rest of commit is minor comment fixes

Also added beautiful soup to lint-requirements

These are incomplete, and more need to be added as commented

Since only the header row of the PDF indicates that it's an old pdf style, we only knew that it was an old pdf style for the first row and not the actual grades themselves, which prevented us from actually correctly parsing the section's grades, since the old style has a different format. To remedy this, anytime old_pdf_style is True in pdf_helper.get_pdf_skip_count, we store it (in pdf_parser.parse_page) and use it for the rest of the page. Also adds the according tests for it

Changed PDF_DOWNLOAD_DIR to use dirname instead of relative path Changed scrape_pdf's counts dictionary to use defaultdict Other misc semantic syntax changes

Moved to _create_documents_folder since thats where the actual error will occur

Example usage: python manage.py scrape_grades -c EN --year 2015

Also adds SSL verification back to scrape_grades.fetch_page_data

- Removed unnecessary import to pass linting - Changed task collecting to use list comprehension - Changed colleges & years assignment to use ternary operators

gannonprudhomme · 2020-03-22T21:59:24Z

Ok so I think I fixed it to be back to normal, but to double check I'm gonna reset the base back to backend/scraper/scrape_grades

gannonprudhomme · 2020-03-22T22:03:15Z

Ok should be good to go now. Changing the base back to backend/master then rebase & merging.

gannonprudhomme added the backend Anything related to the backend API/Django label Mar 3, 2020

gannonprudhomme added this to the Backend v0.2 milestone Mar 3, 2020

gannonprudhomme requested review from firejake308 and rachelconn March 3, 2020 22:23

gannonprudhomme self-assigned this Mar 3, 2020

gannonprudhomme force-pushed the backend/scraper/scrape_grades_command branch from a0b0fcc to 730932f Compare March 3, 2020 22:25

firejake308 reviewed Mar 4, 2020

View reviewed changes

rachelconn suggested changes Mar 12, 2020

View reviewed changes

gannonprudhomme requested a review from rachelconn March 14, 2020 22:54

rachelconn suggested changes Mar 17, 2020

View reviewed changes

gannonprudhomme requested a review from rachelconn March 17, 2020 21:05

rachelconn suggested changes Mar 21, 2020

View reviewed changes

gannonprudhomme requested a review from rachelconn March 21, 2020 22:47

rachelconn approved these changes Mar 22, 2020

View reviewed changes

gannonprudhomme force-pushed the backend/scraper/scrape_grades branch from 784d066 to 256e8cf Compare March 22, 2020 21:46

gannonprudhomme force-pushed the backend/scraper/scrape_grades_command branch from c2503a8 to 1268c7f Compare March 22, 2020 21:47

gannonprudhomme force-pushed the backend/scraper/scrape_grades branch from 256e8cf to 58133e2 Compare March 22, 2020 21:49

gannonprudhomme force-pushed the backend/scraper/scrape_grades_command branch from 1268c7f to da1d7f5 Compare March 22, 2020 21:49

gannonprudhomme added 22 commits March 22, 2020 16:57

Changed pdf_parser functions to use GradeData

6d791cf

This basically just cleans up the return types so they're easier to understand

Extracted out extract_letter_grades from parse_page

eb20c98

Basically just for readability purposes, functions the same Also removed unused function generate_year_semesters()

Linting fixes & cleanup in pdf_parser+helper

db67218

Added parse_page test to pdf_parser_tests

8d2a00c

Also some misc linting fixes

Added PyPDF2 to requirements for pdf_parser

33bb200

Also added it to the lint-requirements for GitHub Actions

Changed generate_path to be public function in load_json.py

dadf932

ALso removed redundant json.close() in load_json_file and instead returned it directly

Added parse_pdf test

a5f2012

Added returned of pdf_data in pdf_parse.parse_pdf

573bd95

Also changed pdf_reader.getNumPages() to .numPages Also fixed linting error

Semantic fixes in pdf_parser

79080ef

Changed get_pdf_skip_count to assign returned variables inline Removed extra grades iteration by adding up num_students in existing for-loop Changed list addition operator to .extend for readability

Added Grades model + GradeManager

c777a45

GradeManager is used for calculating an instructor's past grade distributions

Added GradeManager.instructor_performance tests

107ad0d

Minor fixes in models_tests & grades model

6a130a9

Changed instructor_performance return to specify that Dict value can be a float or int Rest of commit is minor comment fixes

Added beautifulsoup and lxml to requirements.txt

692c49c

Also added beautiful soup to lint-requirements

Added scrape_grades command

bbf4fb1

Added tests for scrape_grades

0a49dc5

These are incomplete, and more need to be added as commented

Added suggestions to scrape_grades

bc21c29

Changed PDF_DOWNLOAD_DIR to use dirname instead of relative path Changed scrape_pdf's counts dictionary to use defaultdict Other misc semantic syntax changes

Added additional scrape_pdf test

e3fba69

Updated documents/grade_dists error catching

dc9fbf5

Moved to _create_documents_folder since thats where the actual error will occur

Added optional arguments for scrape_grades

12ae93c

Example usage: python manage.py scrape_grades -c EN --year 2015

Misc semantic changes in scrape_grades

463c1e6

Also adds SSL verification back to scrape_grades.fetch_page_data

Minor syntax changes in scrape_grades per PR comments

d9947e9

- Removed unnecessary import to pass linting - Changed task collecting to use list comprehension - Changed colleges & years assignment to use ternary operators

gannonprudhomme changed the base branch from backend/master to backend/scraper/scrape_grades March 22, 2020 21:59

gannonprudhomme force-pushed the backend/scraper/scrape_grades branch from 784d066 to 6a130a9 Compare March 22, 2020 22:00

gannonprudhomme force-pushed the backend/scraper/scrape_grades_command branch from c2503a8 to d9947e9 Compare March 22, 2020 22:00

gannonprudhomme changed the base branch from backend/scraper/scrape_grades to backend/master March 22, 2020 22:03

gannonprudhomme merged commit 960b8ad into backend/master Mar 22, 2020

gannonprudhomme deleted the backend/scraper/scrape_grades_command branch March 22, 2020 22:06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added scrape_grades command #159

Added scrape_grades command #159

gannonprudhomme commented Mar 3, 2020

firejake308 commented Mar 4, 2020 •

edited

gannonprudhomme commented Mar 4, 2020 •

edited

firejake308 commented Mar 4, 2020

gannonprudhomme commented Mar 4, 2020

firejake308 commented Mar 4, 2020

firejake308 Mar 4, 2020

gannonprudhomme commented Mar 4, 2020

firejake308 commented Mar 4, 2020

gannonprudhomme commented Mar 14, 2020

gannonprudhomme commented Mar 14, 2020

gannonprudhomme commented Mar 14, 2020

rachelconn left a comment

rachelconn left a comment

gannonprudhomme commented Mar 22, 2020 •

edited

gannonprudhomme commented Mar 22, 2020

gannonprudhomme commented Mar 22, 2020 •

edited

gannonprudhomme commented Mar 22, 2020

Added scrape_grades command #159

Added scrape_grades command #159

Conversation

gannonprudhomme commented Mar 3, 2020

firejake308 commented Mar 4, 2020 • edited

gannonprudhomme commented Mar 4, 2020 • edited

firejake308 commented Mar 4, 2020

gannonprudhomme commented Mar 4, 2020

firejake308 commented Mar 4, 2020

firejake308 Mar 4, 2020

Choose a reason for hiding this comment

gannonprudhomme commented Mar 4, 2020

firejake308 commented Mar 4, 2020

gannonprudhomme commented Mar 14, 2020

gannonprudhomme commented Mar 14, 2020

gannonprudhomme commented Mar 14, 2020

rachelconn left a comment

Choose a reason for hiding this comment

rachelconn left a comment

Choose a reason for hiding this comment

gannonprudhomme commented Mar 22, 2020 • edited

gannonprudhomme commented Mar 22, 2020

gannonprudhomme commented Mar 22, 2020 • edited

gannonprudhomme commented Mar 22, 2020

firejake308 commented Mar 4, 2020 •

edited

gannonprudhomme commented Mar 4, 2020 •

edited

gannonprudhomme commented Mar 22, 2020 •

edited

gannonprudhomme commented Mar 22, 2020 •

edited