2. Pick a topic #2

Closed
jianructose opened this issue Feb 24, 2021 · 11 comments

@jianructose
Collaborator

jianructose commented Feb 24, 2021

  • Come up with a topic for your project. Discuss the topic with your TA or lab instructor and proceed only after your topic has been approved by one of them.
  • Please vote by emojis.
  • It would be great to also include Pros/Cons of your proposed package topic.
  • Please have this ready by 2pm Thursday lab time.
@yhchen20
Collaborator

My idea is to build a simple bank system, but I'm not sure if it's feasible.

Possible functions:

  1. Create account - Input First Name, Last Name, Balance (default balance = 0). The function will create an account number and store all personal information with the account number.
  2. Deposit and Withdraw - increase or decrease the balance (the withdrawal amount can't exceed the balance).
  3. Certificates of Deposit - Print out all information about the account and the account balance.
  4. Transfer - transfer money to another account.
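
A minimal sketch of how these four functions could fit together, assuming accounts live in an in-memory dictionary keyed by account number. The class and method names (`Bank`, `create_account`, `statement`, ...) are hypothetical and only illustrate the idea.

```python
import itertools


class Bank:
    """Hypothetical in-memory bank; every name here is illustrative."""

    def __init__(self):
        self._accounts = {}
        self._numbers = itertools.count(1)  # simple sequential account numbers

    def create_account(self, first_name, last_name, balance=0):
        if balance < 0:
            raise ValueError("opening balance cannot be negative")
        number = next(self._numbers)
        self._accounts[number] = {
            "first_name": first_name,
            "last_name": last_name,
            "balance": balance,
        }
        return number

    def deposit(self, number, amount):
        if amount <= 0:
            raise ValueError("amount must be positive")
        self._accounts[number]["balance"] += amount

    def withdraw(self, number, amount):
        if amount <= 0:
            raise ValueError("amount must be positive")
        if amount > self._accounts[number]["balance"]:
            raise ValueError("insufficient funds")
        self._accounts[number]["balance"] -= amount

    def transfer(self, source, target, amount):
        if target not in self._accounts:
            raise KeyError(target)  # check first so no money is lost
        self.withdraw(source, amount)
        self.deposit(target, amount)

    def statement(self, number):
        # Return a copy so callers cannot edit the stored record.
        return dict(self._accounts[number])
```

Raising an error on an overdraft (rather than silently ignoring it) is one possible design choice; it also makes the function easy to test.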

@tdkhanhvu
Collaborator

tdkhanhvu commented Feb 25, 2021

I have 2 ideas:

I) StockPortfolio:
(Parameters are given in parentheses.)

  • Take historical prices of stocks (symbol, date, price)
  • Register stock purchases (symbol, quantity, date purchased, price)
  • Calculate the portfolio value at a point of time based on historical prices from above (date)
  • Simulate the portfolio value in the future (date): using moving average to predict the price in the future.

Pros: Easy to implement + test
Cons: Not sure if the complexity is sufficient.
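
A rough sketch of the two calculations, assuming prices are stored as `{symbol: {date: price}}` and purchases as `(symbol, quantity, date_purchased, price)` tuples. The function names and the window size are assumptions made for illustration.

```python
from datetime import date


def portfolio_value(prices, purchases, on):
    """Value, on date `on`, of everything purchased on or before that date."""
    total = 0.0
    for symbol, quantity, purchased, _price_paid in purchases:
        if purchased <= on:
            total += quantity * prices[symbol][on]
    return total


def moving_average_forecast(history, window=3):
    """Predict the next price as the mean of the last `window` prices."""
    if window <= 0 or len(history) < window:
        raise ValueError("need at least `window` historical prices")
    recent = history[-window:]
    return sum(recent) / window


prices = {"ABC": {date(2021, 2, 1): 10.0, date(2021, 2, 2): 12.0}}
purchases = [
    ("ABC", 5, date(2021, 2, 1), 9.5),
    ("ABC", 2, date(2021, 2, 3), 11.0),  # bought later, so not counted on Feb 2
]
print(portfolio_value(prices, purchases, date(2021, 2, 2)))  # 60.0
print(moving_average_forecast([10.0, 12.0, 14.0, 16.0]))     # 14.0
```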


II) Greedier / GreedyOptimizer:
A miscellaneous package that contains multiple optimization functions using greedy algorithms. It will be fun to implement these functions, and the solutions are not that hard.

Pros: These functions are independent of each other, so working on one will not break the others. Easy to test (just think of some test cases).
Cons: Will be harder to implement compared to ordinary functions. But I can help. Not sure whether Tiff / the TA is okay with this approach or whether they want a list of functions that are related to each other.

1) Coin changer
Problem: Given a list of coin denominations, make change for a sum of money so that the total number of coins is minimal.
Condition: Each coin denomination divides the next larger one. Ex: 50, 10, 5, 1
Solution: Divide by the largest denomination, then the second largest...
75 = 50 X 1 + 10 X 2 + 5 X 1
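
The division steps above can be sketched as follows. `greedy_change` is a hypothetical name, and the greedy choice is only guaranteed to be optimal under the divisibility condition stated above.

```python
def greedy_change(amount, denominations):
    """Return {coin: count}, always taking the largest coin that still fits."""
    if amount < 0:
        raise ValueError("amount must be non-negative")
    counts = {}
    remaining = amount
    for coin in sorted(denominations, reverse=True):
        if coin <= 0:
            raise ValueError("denominations must be positive")
        count, remaining = divmod(remaining, coin)
        if count:
            counts[coin] = count
    if remaining:
        raise ValueError("amount cannot be represented with these coins")
    return counts


print(greedy_change(75, [50, 10, 5, 1]))  # {50: 1, 10: 2, 5: 1}
```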

2) Max value in a 2D array
Problem: Given a 2D n x n array in which each cell holds a number, find the path from row 0 to row n - 1 whose total sum is the largest.
Condition: When going from row i to row i + 1, you can go straight (same column), to the left (-1 column) or to the right (+1 column).
Solution: For each cell, compare the best paths that continue through the 3 reachable cells of the next row (straight, left and right) and keep the largest.
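
One way to read the solution above is as a bottom-up pass that stores, for every cell, the best sum reachable from it. Strictly speaking this is dynamic programming rather than a purely greedy choice (always stepping to the largest neighbour can miss the best path). `max_path_sum` is a hypothetical name.

```python
def max_path_sum(grid):
    """Largest sum of a top-to-bottom path moving straight, left or right."""
    if not grid:
        return 0
    n_cols = len(grid[0])
    best = list(grid[-1])  # best sum starting from each cell of the last row
    for row in reversed(grid[:-1]):
        new_best = []
        for col, value in enumerate(row):
            lo = max(col - 1, 0)
            hi = min(col + 1, n_cols - 1)
            # Add the best of the (up to) three reachable cells below.
            new_best.append(value + max(best[lo:hi + 1]))
        best = new_best
    return max(best)


grid = [
    [1, 100, 1],
    [1, 1, 1],
    [50, 1, 1],
]
print(max_path_sum(grid))  # 151 (100 -> 1 -> 50)
```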

3) Activity selection:
Problem: Given a list of activities with start and end times, schedule them so that as many activities as possible can take place.
Condition: No two activities can overlap.
Solution: Pick the activity that finishes first, then the next one that starts after it ends and finishes first...
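
A short sketch of that earliest-finish-first rule, assuming activities are `(start, end)` tuples and that an activity may start exactly when the previous one ends. `select_activities` is a hypothetical name.

```python
def select_activities(activities):
    """Greedy activity selection: repeatedly take the earliest-finishing fit."""
    chosen = []
    last_end = None
    for start, end in sorted(activities, key=lambda a: a[1]):
        if last_end is None or start >= last_end:
            chosen.append((start, end))
            last_end = end
    return chosen


activities = [(1, 4), (3, 5), (0, 6), (5, 7), (3, 9), (5, 9),
              (6, 10), (8, 11), (8, 12), (2, 14), (12, 16)]
print(select_activities(activities))  # [(1, 4), (5, 7), (8, 11), (12, 16)]
```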

4) Police catch thief:
Problem: Given an array of size n that has the following specifications:

  • Each element in the array contains either a policeman or a thief.
  • Each policeman can catch only one thief.
  • A policeman cannot catch a thief who is more than K units away from the policeman.
We need to find the maximum number of thieves that can be caught.
Solution: link
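
The linked solution is not reproduced here. One common greedy approach, sketched below under the assumption that the array holds `"P"` and `"T"` markers, walks two pointers over the policeman and thief positions and pairs the leftmost unmatched ones whenever they are within K units. `max_thieves_caught` is a hypothetical name.

```python
def max_thieves_caught(positions, k):
    """Two-pointer greedy over the sorted indices of policemen and thieves."""
    police = [i for i, who in enumerate(positions) if who == "P"]
    thieves = [i for i, who in enumerate(positions) if who == "T"]
    caught = 0
    p = t = 0
    while p < len(police) and t < len(thieves):
        if abs(police[p] - thieves[t]) <= k:
            caught += 1  # pair them up and move both pointers on
            p += 1
            t += 1
        elif thieves[t] < police[p]:
            t += 1       # this thief is too far left for any remaining policeman
        else:
            p += 1       # this policeman is too far left for any remaining thief
    return caught


print(max_thieves_caught(["P", "T", "T", "P", "T"], 1))  # 2
```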

@jraza19
Collaborator

jraza19 commented Feb 25, 2021

EDA for a supervised learning dataset (with target column):

Possible functions:

  1. Show the number of NA/missing values in each column of your data (if there are many columns, limit the output to the top 10 or 20)

  2. Create overlapping histograms for the numerical columns against the target column (using the repeat)

  3. Create a heat-map to compare the categorical columns against the target column

  4. Create a correlation matrix and correlation df to compare against the target column

  5. Show unique ID columns so you know what to drop

  6. Find the columns that are boolean

Pros - can see its application to many projects that we already do

Cons -

  • not that original an idea (pandas profiling kind of does this for us, and there is a way to do this in R too)
  • limited to a particular type of dataset
  • if the dataset has too many columns it might be hard to pick - might have to let the user decide the columns
  • might be hard to write tests for this
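
On the testing concern: the non-plotting functions (1, 5 and 6) return plain values, so they are straightforward to assert on. The sketch below uses plain Python rows (a list of dictionaries with `None` for missing values) purely so that it runs without dependencies; the real package would presumably work on dataframes, and all function names are hypothetical.

```python
def missing_counts(rows, limit=10):
    """(column, number of missing values) pairs, most missing first."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (value is None)
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


def unique_id_columns(rows):
    """Columns whose values are all distinct, i.e. likely identifiers."""
    columns = list(rows[0]) if rows else []
    return [c for c in columns if len({row[c] for row in rows}) == len(rows)]


def boolean_columns(rows):
    """Columns in which every value is a bool."""
    columns = list(rows[0]) if rows else []
    return [c for c in columns if all(isinstance(row[c], bool) for row in rows)]


rows = [
    {"id": 1, "age": 30, "member": True},
    {"id": 2, "age": None, "member": False},
    {"id": 3, "age": 30, "member": True},
]
print(missing_counts(rows))     # [('age', 1), ('id', 0), ('member', 0)]
print(unique_id_columns(rows))  # ['id']
print(boolean_columns(rows))    # ['member']
```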

@jianructose
Collaborator Author

mds-tracker

  • This is an MDS-specific package for course management as well as progress tracking. It can return a dataframe and some visualizations of how many courses/days have been covered and how many are left.

  • Possible functions: default will be based on 2020-2021 cohort time

    1. setup_a_course (course_code=default is based on this cohort, time=now)->
course start_time end_time lab1 lab2 quiz1 lab3 lab4 quiz2 days_completed days_left %done notes
554 xxx xxx 1 0 0 10 30 25%
563 xxx xxx 1
591 xxx 0
    2. Update progress and return the updated dataframe
    3. Delete progress and return the updated dataframe
    4. Return a bar chart or pie chart visualizing progress by block, course, month, week, etc.
  • Pros: easy to understand and would benefit this cohort and next few years for targeted clients
  • Cons: not sure how feasible this can be

@jianructose
Collaborator Author

job-post-nlp

  • This package is for text analysis on job descriptions specifically for ds/ml/da/bi roles, which will return a dataframe of: job title, post time, deadline, PT/FT/contract, location, salary, skill-keywords, benefit, etc.
  • Possible functions:
  1. Parse a job_post.txt as a list of strings
  2. clean up the corpus using regex to scrub out any hyperlinks
  3. parse the corpus and return a df of job title, employer name, location, salary, ddl, etc.
  4. cluster viz on job titles
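
Functions 1 and 2 could look roughly like this. The URL pattern is a deliberately simple assumption (anything starting with `http://`, `https://` or `www.`), and the function names are hypothetical.

```python
import re

# Simplified assumption of what counts as a hyperlink.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")


def parse_job_post(text):
    """Split a job post into a list of non-empty, stripped lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def scrub_links(lines):
    """Remove hyperlinks, collapse leftover spaces, drop lines left empty."""
    cleaned = []
    for line in lines:
        without_links = URL_PATTERN.sub("", line)
        tidy = re.sub(r"\s{2,}", " ", without_links).strip()
        if tidy:
            cleaned.append(tidy)
    return cleaned


post = "Data Scientist\n\nApply at https://example.com/jobs/123 today\nwww.example.com\n"
print(scrub_links(parse_job_post(post)))  # ['Data Scientist', 'Apply at today']
```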

@jianructose
Collaborator Author

pywash

  • a package containing data cleaning functions for the downstream eda/ml analysis.
  • possible functions:
  1. identify data type and missing values
  2. proper imputation for missing values
  3. scaler for numerical features
  4. onehot encoder for categorical features
  5. train/valid/test split
  • Pros: easy to understand; feasible for this course; can benefit any ml/eda work in the long run
  • Cons: has been done too much??
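
For a sense of scale, functions 4 and 5 fit in a few lines each. This is a dependency-free sketch with hypothetical names and default fractions; existing libraries already cover the same ground, which is the "done too much" concern.

```python
import random


def one_hot(values):
    """One-hot encode a list of categories (columns in sorted category order)."""
    categories = sorted(set(values))
    return [[int(value == category) for category in categories] for value in values]


def train_valid_test_split(rows, valid_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle reproducibly, then slice into train / valid / test lists."""
    if valid_frac < 0 or test_frac < 0 or valid_frac + test_frac >= 1:
        raise ValueError("fractions must be non-negative and sum to less than 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_valid = int(len(shuffled) * valid_frac)
    test = shuffled[:n_test]
    valid = shuffled[n_test:n_test + n_valid]
    train = shuffled[n_test + n_valid:]
    return train, valid, test


print(one_hot(["a", "b", "a"]))  # [[1, 0], [0, 1], [1, 0]]
train, valid, test = train_valid_test_split(list(range(10)))
print(len(train), len(valid), len(test))  # 6 2 2
```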

@jraza19 jraza19 added this to the Milestone 1 milestone Feb 25, 2021
@tdkhanhvu
Collaborator

tdkhanhvu commented Feb 25, 2021

MDSGradeTracker
Manage the grades of students for each course, with the option to calculate the summary and suggest the adjusted scores for any assessment components.

1) Register the courses

  • (Course ID, Course Name) + assessments (ex: 15% lab 1 / lab 2... 20% quiz 1 / quiz 2)
  • Assessment weights must sum to 100%
  • Assessments must be in order (we must have lab1 before we can have lab2)

2) Record the grade for students:

  • Student ID, Course ID, Assessment ID, Grade
    => will throw an error if Course Id or Assessment ID do not exist

3) Summarize the grades of students:
Each course:

  • Mean / Median / Quantile 1,2,3,4

Across courses:
Based on the average grade for each course, rank the courses in decreasing order.

Rank students:
In terms of their GPAs

4) Suggest grade adjustment:
Based on a predefined benchmark (90% for each lab, 90% for whole course, 85% for quiz), suggest grade adjustments for any course and return the suggested grades.

Pros:

  • Easy to implement
  • Useful for lecturers
  • Intuitive concepts

Cons:

  • Too simple? => We can add extra steps to check the input format in the first 2 functions.

@tdkhanhvu
Collaborator

I know it is too soon to think about the implementation, but I believe giving some thought to how we will store the data in memory will help us write better function specifications.

I am thinking of 2 ways:

1) Using dictionary:

courses = {
    "511": {
        "lab1": 0.15,
        "quiz1": 0.2,
        # ...
    },
    "523": {
        "lab1": 0.13,
        "worksheet1": 0.01,
        "quiz1": 0.2,
        # ...
    },
}

grades = {
    "511": {
        "Mr100": {
            "lab1": 100,
            "quiz1": 100,
            # ...
        },
        "MrBarelyPass": {
            "lab1": 60.5,
            "quiz1": 65.5,
            # ...
        },
    },
}

2) Using dataframe:
courses

CourseID  Lab1  Lab2
511       0.15  0.15
523       0.13  0.13

grades
(the first dimension will be the course id)

StudentID     Lab1  Lab2
Mr100         100   100
MrBarelyPass  60    65

Dictionary
Pros:

  • Each course may have different components (self-reflection, worksheet...), so adding a new course with new components does not disturb the existing data structure

Cons:

  • Not supported in R; we would need to use a named list instead

Dataframe
Pros:

  • Easier to visualize the data
  • Easier to match columns between grades and courses (hopefully we can utilize numpy / matrix operation?)

Cons:

  • Need to assign Component Name / Student ID... as row name / column name
  • Many cells may be 0 if some courses do not have that component (we could use a sparse matrix, which R also supports, but that may overcomplicate things). As the data size will be small (25 courses x 20 components x 100 students = 50k cells), it may not be a big issue.
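
To make the dictionary option concrete, here is a sketch of a weighted final grade computed from the nested layout. The weights are made up so that they sum to 1, `final_grade` is a hypothetical helper, and treating a missing assessment as 0 is an assumption rather than a decision we have made.

```python
# Illustrative data only; real course weights would differ.
courses = {"511": {"lab1": 0.5, "quiz1": 0.5}}
grades = {
    "511": {
        "Mr100": {"lab1": 100, "quiz1": 100},
        "MrBarelyPass": {"lab1": 60.5, "quiz1": 65.5},
    },
}


def final_grade(courses, grades, course_id, student_id):
    """Weighted sum of a student's assessment grades for one course."""
    weights = courses[course_id]
    scores = grades[course_id][student_id]
    # Assumption: an assessment with no recorded grade counts as 0.
    return sum(weight * scores.get(name, 0) for name, weight in weights.items())


print(final_grade(courses, grades, "511", "MrBarelyPass"))  # 63.0
```

With the dataframe option the same calculation would be a row-wise weighted sum, which is where the column matching mentioned above comes in.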

@jraza19
Collaborator

jraza19 commented Feb 26, 2021


This is a good point and definitely needs to be discussed. I prefer the dataframe method as this is what I am most used to; plus, working with named lists is kind of a pain in R, in my opinion.

@jraza19
Collaborator

jraza19 commented Feb 26, 2021

APPROVED IDEA

Documenting our final approved idea for reference here:

MDSGradeTracker
Manage the grades of students for each course, with the option to calculate the summary and suggest the adjusted scores for any assessment components.

  1. Register the courses (Jianru)

Purpose:
Read/store the input data as a dataframe
Check/validate the data to ensure it is in the format that we need for the rest of the functions

Input - csv file
Output - None

Details:
(Course ID, Course Name) + assessments (ex: 15% lab 1 / lab 2... 20% quiz 1 / quiz 2)
Assessment weights must sum to 100%
Assessments must be in order (we must have lab1 before we can have lab2)
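
A rough sketch of the two checks, assuming a long-format CSV with `course_id`, `assessment_id` and `weight` columns (the column layout is an assumption, not something we have agreed on). It returns a dictionary only to keep the example dependency-free and testable; the spec above stores a dataframe and returns None.

```python
import csv
import io
import math
import re


def check_order(course_id, assessments):
    """Raise if e.g. lab2 is registered for a course that has no lab1."""
    for name in assessments:
        match = re.fullmatch(r"([a-z]+)(\d+)", name)
        if match and int(match.group(2)) > 1:
            previous = f"{match.group(1)}{int(match.group(2)) - 1}"
            if previous not in assessments:
                raise ValueError(f"{course_id}: {name} requires {previous}")


def register_courses(csv_file):
    """Read assessment weights and validate them course by course."""
    courses = {}
    for row in csv.DictReader(csv_file):
        weights = courses.setdefault(row["course_id"], {})
        weights[row["assessment_id"]] = float(row["weight"])
    for course_id, weights in courses.items():
        if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-9):
            raise ValueError(f"{course_id}: weights must sum to 100%")
        check_order(course_id, weights)
    return courses


good = io.StringIO(
    "course_id,assessment_id,weight\n"
    "511,lab1,0.4\n511,lab2,0.4\n511,quiz1,0.2\n"
)
print(register_courses(good))  # {'511': {'lab1': 0.4, 'lab2': 0.4, 'quiz1': 0.2}}
```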

  2. Record the grade for students (Jianru)

Purpose:
Check/validate the data to ensure it is in the format that we need for the rest of the functions
Deal with missing data

Input - dataframe (Student ID, Course ID, Assessment ID, Grade)
Output - None

Details:
will throw an error if Course Id or Assessment ID do not exist
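
The error behaviour could be sketched like this, using nested dictionaries instead of a dataframe so the example stays self-contained. The 0 to 100 grade scale and the choice to store missing grades as `None` are assumptions; `record_grade` is a hypothetical name.

```python
def record_grade(courses, grades, student_id, course_id, assessment_id, grade):
    """Store one grade after checking that the course and assessment exist."""
    if course_id not in courses:
        raise KeyError(f"unknown course: {course_id}")
    if assessment_id not in courses[course_id]:
        raise KeyError(f"unknown assessment for {course_id}: {assessment_id}")
    # Assumption: grades are percentages; None marks a missing grade.
    if grade is not None and not 0 <= grade <= 100:
        raise ValueError("grade must be between 0 and 100")
    grades.setdefault(course_id, {}).setdefault(student_id, {})[assessment_id] = grade


courses = {"511": {"lab1": 0.5, "quiz1": 0.5}}
grades = {}
record_grade(courses, grades, "Mr100", "511", "lab1", 100)
record_grade(courses, grades, "Mr100", "511", "quiz1", None)
print(grades)  # {'511': {'Mr100': {'lab1': 100, 'quiz1': None}}}
```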

  3. Summarize the grades by course (Yanhua):

Purpose:
Provides summary statistics on the courses
Lecturers can benchmark the difficulty based on the grades

Input - option to choose which method with a default option (summary, across_summary)
Output - summary (summary statistics on each course: Mean / Median / Quantile 1,2,3,4) and across_summary (courses ranked in decreasing order of their average grade)
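
Both outputs can be sketched with the standard `statistics` module, assuming each course already has a list of final grades. The function names are hypothetical, and `statistics.quantiles` (Python 3.8+) returns the three quartile cut points rather than four quantiles.

```python
import statistics


def summarize_course(final_grades):
    """Mean, median and quartile cut points for one course."""
    return {
        "mean": statistics.mean(final_grades),
        "median": statistics.median(final_grades),
        "quartiles": statistics.quantiles(final_grades, n=4),
    }


def rank_courses(grades_by_course):
    """Course IDs ordered by decreasing average grade."""
    means = {course: statistics.mean(g) for course, g in grades_by_course.items()}
    return sorted(means, key=means.get, reverse=True)


grades_by_course = {"523": [60, 50], "511": [90, 80, 70]}
print(summarize_course(grades_by_course["511"])["mean"])  # 80
print(rank_courses(grades_by_course))                     # ['511', '523']
```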

  4. Summarize the grades by students (Javairia):

Purpose:
Calculate the average grade for all students and provide the ranking for the course selected or the whole program.
Lecturers can find students that are struggling with the courses

Input - a specified course with a default option for the entire program
Output - student’s ranking
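
A minimal sketch of the ranking, assuming final grades are available as `{course_id: {student_id: final_grade}}` and that the program-wide rank uses a simple unweighted average across courses (both are assumptions). `rank_students` is a hypothetical name.

```python
def rank_students(final_grades, course_id=None):
    """(student, average) pairs, best first; course_id=None means whole program."""
    if course_id is None:
        selected = final_grades
    else:
        selected = {course_id: final_grades[course_id]}
    per_student = {}
    for course in selected.values():
        for student, grade in course.items():
            per_student.setdefault(student, []).append(grade)
    averages = {s: sum(g) / len(g) for s, g in per_student.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


final_grades = {
    "511": {"MrBarelyPass": 62, "Mr100": 100},
    "523": {"MrBarelyPass": 70, "Mr100": 90},
}
print(rank_students(final_grades))         # [('Mr100', 95.0), ('MrBarelyPass', 66.0)]
print(rank_students(final_grades, "511"))  # [('Mr100', 100.0), ('MrBarelyPass', 62.0)]
```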

  5. Suggest grade adjustment (Vu):

Purpose:
Based on a predefined benchmark (90% for each lab, 90% for whole course, 85% for quiz), suggest grade adjustments for any course and return the suggested grades.

Input - three benchmark variables (course, lab, quiz), course ID
Output - summary table with suggested grade adjustment

function(courseid, benchmark_course = 0.9, benchmark_lab = 0.85, benchmark_quiz = 0.85)
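
The adjustment rule itself is not pinned down yet, so the sketch below assumes one possible rule: shift each lab or quiz up by the gap between its class mean and its benchmark, then report any gap that remains at course level. It takes the weights and grades directly instead of a course ID so that it is self-contained; the defaults follow the signature above, and the function name and return format are hypothetical.

```python
def suggest_grade_adjustment(weights, grades, benchmark_course=0.9,
                             benchmark_lab=0.85, benchmark_quiz=0.85):
    """Suggested additive shift (in grade points) per assessment and for the course.

    weights: {assessment_id: weight}
    grades:  {student_id: {assessment_id: grade on a 0-100 scale}}
    """
    suggestions = {}
    adjusted_course_mean = 0.0
    for assessment, weight in weights.items():
        scores = [student[assessment] for student in grades.values()]
        mean = sum(scores) / len(scores)
        if assessment.startswith("lab"):
            target = benchmark_lab * 100
        elif assessment.startswith("quiz"):
            target = benchmark_quiz * 100
        else:
            target = mean  # assumption: other components are left unchanged
        shift = max(0.0, target - mean)
        suggestions[assessment] = shift
        adjusted_course_mean += weight * (mean + shift)
    # Whatever gap is left after the component shifts is reported for the course.
    suggestions["course"] = max(0.0, benchmark_course * 100 - adjusted_course_mean)
    return suggestions


weights = {"lab1": 0.5, "quiz1": 0.5}
grades = {
    "A": {"lab1": 80, "quiz1": 90},
    "B": {"lab1": 70, "quiz1": 90},
}
print(suggest_grade_adjustment(weights, grades))
```

In this example the lab mean of 75 sits 10 points under its benchmark, the quiz needs no change, and about 2.5 points remain to reach the course benchmark.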

@tdkhanhvu
Collaborator

To make the parameter naming consistent, should we adopt this style (Python style)?

  • course_id
  • course_name
  • assessment_id
  • benchmark_course
  • benchmark_lab
  • benchmark_quiz

@jraza19 jraza19 closed this as completed Feb 27, 2021