- Clone this repo
- Create a `.env` file in the root of the project directory
- For each default section in the `db.ini` file, there are placeholders for certain key-value pairs. Example from the current `db.ini` file:
```ini
[elephantsql] #default
host=satao.db.elephantsql.com
database=${ELEPHANTUSER}
user=${ELEPHANTUSER}
password=${ELEPHANTSQL}
```
The values `${ELEPHANTUSER}` and `${ELEPHANTSQL}` are placeholders for the username and password used to access the ElephantSQL database. These secret values are stored as key-value pairs in the `.env` file, or as config vars in a Heroku deployment:
```
#elephantsql
ELEPHANTSQL=hunter2
ELEPHANTUSER=admin
```
Otherwise, you can replace these placeholder values in `db.ini` directly with the corresponding secrets, but you will then have to add `db.ini` to your `.gitignore` file.
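A minimal sketch of how this placeholder expansion could work, assuming the config is read with `configparser` and `${...}` values are substituted from environment variables (the function name and structure here are illustrative, not the repo's actual `config.py`):

```python
import os
from configparser import ConfigParser

def load_db_config(section="elephantsql", filename="db.ini"):
    """Read one section of db.ini, expanding ${VAR} placeholders
    from environment variables (e.g. those loaded from .env)."""
    parser = ConfigParser()
    parser.read(filename)
    if not parser.has_section(section):
        raise KeyError(f"section [{section}] not found in {filename}")
    # os.path.expandvars replaces ${ELEPHANTUSER} etc. with the
    # corresponding environment-variable values
    return {key: os.path.expandvars(value)
            for key, value in parser.items(section)}
```

With `ELEPHANTUSER` and `ELEPHANTSQL` exported (or loaded from `.env`), `load_db_config()` returns a dict of fully-resolved connection parameters.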
There are 3 default sections in the `db.ini` file. They are the defaults because they are referenced on lines #10 and #12 of `main.py` and line #4 of `config.py`. You can specify alternative sections to use by changing the values at those lines.
The `[postgresql]` and `[telebot_public]` sections in the repo's `db.ini` file are examples of alternative sections.
This folder contains all the vacancy reports from previous rounds. The naming convention is `{year} Sem {semester} Round {round}.pdf`, e.g. `2020 Sem 2 Round 1.pdf`.
For the `year` variable, if the academic year is AY19/20, then `year` will be `2019`: always take the lower year in an academic year.
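The convention above can be sketched as a small helper (the function name is made up for illustration and is not part of the repo):

```python
def report_filename(academic_year: str, semester: int, round_no: int) -> str:
    """Build a report filename from an academic year string like
    'AY19/20', always taking the lower year (here 2019).
    Assumes years 2000 and later."""
    lower_two_digits = int(academic_year.removeprefix("AY").split("/")[0])
    return f"20{lower_two_digits:02d} Sem {semester} Round {round_no}.pdf"
```

For example, `report_filename("AY20/21", 2, 1)` yields `2020 Sem 2 Round 1.pdf`.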
During modreg, we wanted some way of quickly looking up old vacancy reports to gauge the trend in the number of vacancies left for a particular mod. There was hardly any archive of past vacancy reports (save for one Reddit thread), hence the idea for a vacancy report scraper/database and a Python bot to query the database.
The frontend is a bot that queries the PostgreSQL database. On the backend, PDFs of old vacancy reports are fed through a scraper to generate the relevant tables, which are stored in the database.
- scrape PDFs using tabula
- perform data cleaning on the scraped data
- insert the cleaned data into the PostgreSQL database
- write functions to query the database
- have a Python bot invoke these functions
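Of these steps, the scraping boils down to `tabula.read_pdf`, which returns one pandas DataFrame per table found. The cleaning stage might then look roughly like this (column names and cleaning rules here are invented for illustration, not the repo's actual code):

```python
import pandas as pd

def clean_scraped_table(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleaning pass applied to each table the scraper
    extracts: drop fully-empty rows, normalise headers, and coerce
    the vacancy counts to integers."""
    # pdf extraction often produces rows that are entirely blank
    df = df.dropna(how="all")
    # normalise headers: 'Module Code' -> 'module_code'
    df.columns = [str(c).strip().lower().replace(" ", "_")
                  for c in df.columns]
    if "vacancy" in df.columns:
        # counts sometimes come out of the pdf as strings
        df["vacancy"] = pd.to_numeric(df["vacancy"],
                                      errors="coerce").astype("Int64")
    return df
```

Each cleaned DataFrame can then be written to the PostgreSQL table row by row, with the report's year, semester, and round attached.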
- dealing with pandas DataFrames
- hosting a PostgreSQL db
- implementing a good workflow in the scraper so more vacancy reports can be added to the database as they come
python-telegram-bot, PostgreSQL, data cleaning
- we are missing vacancy reports for Sem 1!
- expand the different ways data can be queried
- move from text-based to image-based data visualisation for a better viewing experience
- normalise the database