2. Pick a topic #2

jianructose · 2021-02-24T00:55:17Z

Come up with a topic for your project. Discuss the topic with your TA or lab instructor and proceed only after your topic has been approved by one of them.
Please vote by emojis.
It would be great to also include Pros/Cons of your proposed package topic.
Please have this ready by 2pm Thursday lab time.

yhchen20 · 2021-02-24T23:07:35Z

My idea is to build a simple bank system, but I'm not sure if it's feasible.

Possible functions:

Create account - Input First Name, Last Name, Balance (default balance=0). The function will create an account number and store all personal information with the account number
Deposit and Withdraw - increase or decrease the balance (the withdrawal amount can't exceed the balance).
Certificates of Deposit - Print out all information about the account and the account balance
Transfer - transfer money to another account
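
A minimal sketch of how the functions listed above could fit together, assuming an in-memory store. The class name `Bank` and all method names are hypothetical and only illustrate the proposed behaviour (sequential account numbers, no overdrafts).

```python
import itertools


class Bank:
    """Toy in-memory bank. All names here are illustrative assumptions."""

    def __init__(self):
        self._accounts = {}
        self._numbers = itertools.count(1)  # sequential account numbers

    def create_account(self, first_name, last_name, balance=0):
        if balance < 0:
            raise ValueError("opening balance cannot be negative")
        number = next(self._numbers)
        self._accounts[number] = {
            "first_name": first_name,
            "last_name": last_name,
            "balance": balance,
        }
        return number

    def deposit(self, number, amount):
        if amount <= 0:
            raise ValueError("amount must be positive")
        self._accounts[number]["balance"] += amount

    def withdraw(self, number, amount):
        if amount <= 0:
            raise ValueError("amount must be positive")
        if amount > self._accounts[number]["balance"]:
            raise ValueError("withdrawal exceeds balance")
        self._accounts[number]["balance"] -= amount

    def transfer(self, source, target, amount):
        if target not in self._accounts:
            raise KeyError(target)
        # Withdraw first so an overdraft fails before any money moves.
        self.withdraw(source, amount)
        self.deposit(target, amount)

    def balance(self, number):
        return self._accounts[number]["balance"]
```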

tdkhanhvu · 2021-02-25T01:23:35Z

I have 2 ideas:

I) StockPortfolio:
Inside parentheses are parameters

Take historical prices of stocks (symbol, date, price)
Register stock purchases (symbol, quantity, date purchased, price)
Calculate the portfolio value at a point of time based on historical prices from above (date)
Simulate the portfolio value in the future (date): using moving average to predict the price in the future.

Pros: Easy to implement + test
Cons: Not sure if the complexity is sufficient.
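
A rough sketch of the valuation and moving-average steps, assuming dates are ISO strings (so they compare correctly as text) and prices are stored as nested dictionaries. The function names and data layout are hypothetical.

```python
def portfolio_value(purchases, prices, date):
    """Value of all holdings bought on or before `date`.

    purchases: list of (symbol, quantity, date_purchased)
    prices:    {symbol: {date: price}}  (assumed layout)
    """
    total = 0.0
    for symbol, quantity, purchased in purchases:
        if purchased <= date:  # ISO date strings sort chronologically
            total += quantity * prices[symbol][date]
    return total


def moving_average_forecast(history, window=3):
    """Predict the next price as the mean of the last `window` prices."""
    if not history:
        raise ValueError("history must not be empty")
    recent = history[-window:]
    return sum(recent) / len(recent)
```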

II) Greedier / GreedyOptimizer:
A miscellaneous package that contains multiple optimization functions using Greedy algorithm. It will be fun to implement these functions, and the solution is not that hard.

Pros: These functions are independent of each other. So working on 1 will not break the other one. Easy to test (just think of some test cases).
Cons: Will be harder to implement compared to ordinary functions. But I can help. Not sure if Tiff / TA is okay with this approach or they want a list of functions that are related to each other.

1) Coin changer
Problem: Given a list of coin denominations, make change for a sum of money so that the total number of coins is minimal.
Condition: Coin denominations are divisible by one another. Ex: 50, 10, 5, 1
Solution: Divide by the largest denomination, then the second largest...
75 = 50 X 1 + 10 X 2 + 5 X 1
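
A short sketch of the greedy step described above. The function name `change_coins` is hypothetical, and the code relies on the stated condition that denominations divide one another.

```python
def change_coins(amount, denominations):
    """Return {coin: count}, always taking the largest coin first."""
    result = {}
    for coin in sorted(denominations, reverse=True):
        count, amount = divmod(amount, coin)
        if count:
            result[coin] = count
    if amount:
        raise ValueError("amount cannot be paid with these denominations")
    return result
```

With the example above, `change_coins(75, [50, 10, 5, 1])` gives `{50: 1, 10: 2, 5: 1}`.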

2) Max value in a 2D array
Problem: Given a 2D n x n array in which each cell holds a number, find the path from row 0 to row n - 1 whose total sum is the largest.
Condition: When going from row i to i + 1, you can either go straight (same column), to the left (-1 column), or to the right (+1 column).
Solution: For each cell, compare to see the biggest path from that cell to these 3 cells: straight, left and right of the next row.
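
One way to sketch this solution is to work from the last row upward, keeping the best total reachable from each cell. This is usually classed as dynamic programming rather than a purely greedy method. The function name is hypothetical.

```python
def max_path_sum(grid):
    """Largest sum of a top-to-bottom path moving to column -1, 0 or +1."""
    best = list(grid[-1])  # best totals starting from the last row
    for row in range(len(grid) - 2, -1, -1):
        new_best = []
        for col in range(len(grid[row])):
            # Neighbours in the next row: left, straight, right.
            candidates = best[max(col - 1, 0): col + 2]
            new_best.append(grid[row][col] + max(candidates))
        best = new_best
    return max(best)
```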

3) Activity selection:
Problem: Given a list of activities with start and end times, schedule them so that as many activities as possible can take place.
Condition: No two activities can overlap.
Solution: Pick the activity that finishes first, then the next one that starts after it ends and finishes first...
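
A minimal sketch of this earliest-finish-first rule, assuming each activity is a `(start, end)` tuple and that an activity may start exactly when the previous one ends. The function name is hypothetical.

```python
def select_activities(activities):
    """Greedy activity selection: always take the earliest finisher."""
    selected = []
    last_end = None
    for start, end in sorted(activities, key=lambda a: a[1]):
        if last_end is None or start >= last_end:
            selected.append((start, end))
            last_end = end
    return selected
```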

4) Police catch thief:
Problem: Given an array of size n that has the following specifications:

Each element in the array contains either a policeman or a thief.
Each policeman can catch only one thief.
A policeman cannot catch a thief who is more than K units away from the policeman.
We need to find the maximum number of thieves that can be caught.
Solution: link
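
The linked solution is not reproduced here. As an illustration only, a commonly used two-pointer greedy for this problem could look like the sketch below, assuming the array holds the strings `"P"` and `"T"`. The function name is hypothetical.

```python
def max_thieves_caught(line, k):
    """Match each policeman to the leftmost unmatched thief within k."""
    police = [i for i, who in enumerate(line) if who == "P"]
    thieves = [i for i, who in enumerate(line) if who == "T"]
    caught = p = t = 0
    while p < len(police) and t < len(thieves):
        if abs(police[p] - thieves[t]) <= k:
            caught += 1
            p += 1
            t += 1
        elif thieves[t] < police[p]:
            t += 1  # this thief is too far left for any remaining policeman
        else:
            p += 1  # this policeman is too far left for any remaining thief
    return caught
```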

jraza19 · 2021-02-25T01:41:10Z

EDA for a supervised learning dataset (with target column):

Possible functions:

Shows the number of NA/missing data in the different columns of your data (if there are many columns - set a limit for top 10 or 20)
Create overlapping histograms for the numerical columns against the target column (using the repeat)
Create a heat-map to compare the categorical columns against the target column
Create a correlation matrix and correlation df to compare against the target column
Shows unique id columns so you know what to drop
Find the columns that are boolean
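
As a rough illustration of the first function in this list, here is a standard-library sketch that counts missing values per column and keeps the top N. It assumes rows are plain dictionaries; a real implementation would more likely work on a dataframe. The function name and the definition of "missing" (None, empty string, NaN) are assumptions.

```python
import math
from collections import Counter


def count_missing(rows, top=10):
    """Return [(column, n_missing), ...] for the `top` worst columns."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            is_nan = isinstance(value, float) and math.isnan(value)
            if value is None or value == "" or is_nan:
                counts[column] += 1
    return counts.most_common(top)
```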

Pros - can see its application to many projects that we already do

Cons -

not that original of an idea (pandas profiling kind of does this for us/there is way to do this in R too)
limited to a particular type of dataset
if the dataset has too many columns it might be hard to pick - might have to let the user decide the columns
might be hard to write tests for this

jianructose · 2021-02-25T07:32:19Z

mds-tracker

This is an MDS-specific package for course management as well as progress tracking. It can return a dataframe and some visualizations showing how many courses/days have been covered and how many are left.
Possible functions: default will be based on 2020-2021 cohort time
1. setup_a_course (course_code=default is based on this cohort, time=now)->

course	start_time	end_time	lab1	lab2	quiz1	days_completed	days_left	%done
554	xxx	xxx	1	0	0	10	30	25%
563	xxx	xxx	1
591	xxx		0

 2. update a progress and return an updated dataframe
 3. delete a progress and return a dataframe
 4. return a bar chart or pie chart viz on progress by block or course or month or week etc.

Pros: easy to understand and would benefit this cohort and next few years for targeted clients
Cons: not sure how feasible this can be
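
A small sketch of how the `days_completed`, `days_left` and `%done` columns in the table above might be derived from a course's start and end dates. The function name and the rounding to whole percent are assumptions.

```python
from datetime import date


def course_progress(start, end, today):
    """Compute progress columns for one course from calendar dates."""
    total_days = (end - start).days
    if total_days <= 0:
        raise ValueError("end must be after start")
    # Clamp so courses not yet started or already finished stay in range.
    days_completed = min(max((today - start).days, 0), total_days)
    return {
        "days_completed": days_completed,
        "days_left": total_days - days_completed,
        "percent_done": round(100 * days_completed / total_days),
    }
```

For a 40-day course that started 10 days ago this returns 10 days completed, 30 days left and 25 percent done, matching the first row of the table.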

jianructose · 2021-02-25T08:01:45Z

job-post-nlp

This package is for text analysis on job descriptions specifically for ds/ml/da/bi roles, which will return a dataframe of: job title, post time, deadline, PT/FT/contract, location, salary, skill-keywords, benefit, etc.
Possible functions:

Parse a job_post.txt as a list of strings
clean up the corpus using regex to scrub out any hyperlinks
parse the corpus and return a df of job title, employer name, location, salary, ddl, etc.
cluster viz on job titles

a good example: https://sites.northwestern.edu/msia/2020/11/30/what-skills-do-data-scientists-need-a-text-analysis-of-job-postings/
https://medium.com/analytics-vidhya/classifying-tech-data-job-postings-on-indeed-com-1fd8ca6e7cdd
https://medium.com/data-science-101/classifying-job-posts-via-nlp-3b2b49a33247
Pros: good application on 563 with possible clustering/pca and Word2vec embedding techniques etc.; useful for job searching
Cons: nlp knowledge might not be sufficient for now (feasibility)
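
A minimal sketch of the regex clean-up step, assuming the post has already been read into a list of strings. The pattern is deliberately simple and the function name is hypothetical.

```python
import re

# Matches http(s) URLs and bare www. links up to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def scrub_links(lines):
    """Remove hyperlinks, collapse whitespace, and drop empty lines."""
    cleaned = []
    for line in lines:
        text = URL_PATTERN.sub("", line)
        text = re.sub(r"\s+", " ", text).strip()
        if text:
            cleaned.append(text)
    return cleaned
```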

jianructose · 2021-02-25T08:14:46Z

pywash

a package containing data cleaning functions for the downstream eda/ml analysis.
possible functions:

identify data type and missing values
proper imputation for missing values
scaler for numerical features
onehot encoder for categorical features
train/valid/test split

Pros: easy to understand; feasible for this course; can benefit in a long run for any ml/eda
Cons: has been done too much??
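
To illustrate the last function in the list, here is a standard-library sketch of a three-way split. The function name, default proportions and seed are assumptions.

```python
import random


def train_valid_test_split(rows, valid_size=0.2, test_size=0.2, seed=123):
    """Shuffle rows reproducibly and cut them into three parts."""
    if valid_size + test_size >= 1:
        raise ValueError("valid_size + test_size must be below 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_size)
    n_valid = int(len(shuffled) * valid_size)
    test = shuffled[:n_test]
    valid = shuffled[n_test:n_test + n_valid]
    train = shuffled[n_test + n_valid:]
    return train, valid, test
```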

tdkhanhvu · 2021-02-25T21:33:54Z

MDSGradeTracker
Manage the grades of students for each course, with the option to calculate the summary and suggest the adjusted scores for any assessment components.

1) Register the courses

(Course ID, Course Name) + assessments (ex: 15% lab 1 / lab 2... 20% quiz 1 / quiz 2)
Assessments must sum up to 100%
Assessments must have the order (we must have lab1 before we can have lab2)
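
The two rules above could be checked roughly as follows, assuming weights are given as fractions keyed by names such as `lab1` or `quiz2`. The function name and the naming pattern are assumptions.

```python
import math
import re


def validate_weights(weights):
    """Raise ValueError if weights do not sum to 1 or skip a number."""
    total = sum(weights.values())
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"weights sum to {total}, expected 1.0")
    for name in weights:
        match = re.fullmatch(r"([a-z_]+)(\d+)", name)
        if match and int(match.group(2)) > 1:
            # lab2 is only allowed when lab1 is also present, and so on.
            previous = f"{match.group(1)}{int(match.group(2)) - 1}"
            if previous not in weights:
                raise ValueError(f"{name} is defined without {previous}")
```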

2) Record the grade for students:

Student ID, Course ID, Assessment ID, Grade
=> will throw an error if the Course ID or Assessment ID does not exist

3) Summarize the grades of students:
Each course:

Mean / Median / Quantile 1,2,3,4

Across courses:
Based on average grade for each course, rank the courses in the decreasing order.

Rank students:
In terms of their GPAs
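
A sketch of the per-course summary and the cross-course ranking using the standard library. It assumes each course maps to a flat list of final grades and reports the three quartile cut points; the function names are hypothetical. `statistics.quantiles` requires Python 3.8 or later.

```python
from statistics import mean, median, quantiles


def summarize_course(grades):
    """Mean, median and quartile cut points for one course."""
    return {
        "mean": mean(grades),
        "median": median(grades),
        "quartiles": tuple(quantiles(grades, n=4)),
    }


def rank_courses(final_grades):
    """Course IDs ordered by average grade, highest first."""
    averages = {course: mean(g) for course, g in final_grades.items()}
    return sorted(averages, key=averages.get, reverse=True)
```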

4) Suggest grade adjustment:
Based on a predefined benchmark (90% for each lab, 90% for whole course, 85% for quiz), suggest grade adjustments for any course and return the suggested grades.

Pros:

Easy to implement
Useful for lecturers
Intuitive concepts

Cons:

Too simple? => We can add extra steps to check the input format in the first 2 functions.

tdkhanhvu · 2021-02-26T03:10:49Z

I know it is too soon to think about the implementation, but I believe giving some thought to how we will store the data in memory will help us write better function specifications.

I am thinking of 2 ways:

1) Using dictionary:

courses = {
    "511": {
        "lab1": 0.15,
        "quiz1": 0.2,
        ...
    },
    "523": {
        "lab1": 0.13,
        "worksheet1": 0.01,
        "quiz1": 0.2,
        ...
    }
}

grades = {
    "511": {
        "Mr100": {
            "lab1": 100,
            "quiz1": 100,
            ...
        },
        "MrBarelyPass": {
            "lab1": 60.5,
            "quiz1": 65.5,
            ...
        },        
    }
}
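
To show how the two dictionaries would be used together, here is a sketch that computes a student's weighted course grade. The weights in the usage example are made up (the originals are elided), and treating a missing assessment as 0 is an assumption.

```python
def final_grade(courses, grades, course_id, student_id):
    """Weighted sum of a student's scores for one course."""
    weights = courses[course_id]
    scores = grades[course_id][student_id]
    # Assumption: an assessment without a recorded score counts as 0.
    return sum(weight * scores.get(name, 0) for name, weight in weights.items())


# Illustrative data only; these weights are not from the original post.
example_courses = {"511": {"lab1": 0.4, "quiz1": 0.6}}
example_grades = {
    "511": {
        "Mr100": {"lab1": 100, "quiz1": 100},
        "MrBarelyPass": {"lab1": 60.5, "quiz1": 65.5},
    }
}
```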

2) Using dataframe:
courses

CourseID	Lab1	Lab2
511	0.15	0.15
523	0.13	0.13

grades
(the first dimension will be the course id)

StudentID	Lab1	Lab2
Mr100	100	100
MrBarelyPass	60	65

Dictionary
Pros:

Each course may have different components (self-reflection, worksheet...), so adding a new course with new components does not disturb any fixed column structure, as it would with a dataframe.

Cons:

Dictionaries are not supported in R. We would need to use a list instead.

Dataframe
Pros:

Easier to visualize the data
Easier to match columns between grades and courses (hopefully we can utilize numpy / matrix operation?)

Cons:

Need to assign Component Name / Student ID... as row name / column name
Many columns may be 0 if some courses do not have that component. (We could use SparseMatrix, which is also supported in R, but I am not sure whether that would overcomplicate things.) As the data size will be small (25 courses x 20 components x 100 students ~ 50k cells), it may not be a big issue.

jraza19 · 2021-02-26T04:07:54Z

This is a good point and definitely needs to be discussed. I prefer the dataframe method as this is what I am most used to. Plus, working with named lists is kind of a pain in R, in my opinion.

jraza19 · 2021-02-26T04:14:12Z

APPROVED IDEA

Documenting our final approved idea for reference here:

MDSGradeTracker
Manage the grades of students for each course, with the option to calculate the summary and suggest the adjusted scores for any assessment components.

Register the courses (Jianru)

Purpose:
Read/store the input data as a dataframe
Checking/performing assessments on the data to ensure it is in a format that we need for the rest of the functions

Input - csv file
Output - None

Details:
(Course ID, Course Name) + assessments (ex: 15% lab 1 / lab 2... 20% quiz 1 / quiz 2)
Assessments must sum up to 100%
Assessments must have the order (we must have lab1 before we can have lab2)

Record the grade for students (Jianru)
Purpose:
Checking/performing assessments on the data to ensure it is in a format that we need for the rest of the functions
Deal with missing data

Input - dataframe (Student ID, Course ID, Assessment ID, Grade)
Output - None

Details:
will throw an error if the Course ID or Assessment ID does not exist

Summarize the grades by course (Yanhua):

Purpose:
Provides summary statistics on the courses
Lecturers can benchmark the difficulty based on the grades

Input - option to choose which method with a default option (summary, across_summary)
Output - summary (summary statistics on each course - Mean / Median / Quantile 1,2,3,4) and across_summary(based on average grade for each course, rank the courses in the decreasing order)

Summarise the grades by students (Javairia):

Purpose:
Calculate the average grade for all students and provide the ranking for the course selected or the whole program.
Lecturers can find students that are struggling with the courses

Input - a specified course with a default option for the entire program
Output - student’s ranking

Suggest grade adjustment (Vu):

Purpose:
Based on a predefined benchmark (90% for each lab, 90% for whole course, 85% for quiz), suggest grade adjustments for any course and return the suggested grades.

Input - three benchmark variables (course, lab, quiz), course ID
Output - summary table with suggested grade adjustment

function(courseid, benchmark_course = 0.9, benchmark_lab = 0.85, benchmark_quiz = 0.85)
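
The thread does not define how an adjustment is computed. One possible reading, sketched below, is to compare each assessment's class average with its benchmark and suggest adding the shortfall. The function name, the input layout (scores out of 100 keyed by assessment name) and this interpretation are all assumptions; the defaults simply copy the signature above.

```python
from statistics import mean


def suggest_grade_adjustment(grades, benchmark_lab=0.85, benchmark_quiz=0.85):
    """Suggested points to add per assessment so the mean meets its benchmark.

    grades: {assessment_name: [scores out of 100]}  (assumed layout)
    """
    suggestions = {}
    for assessment, scores in grades.items():
        if assessment.startswith("lab"):
            benchmark = benchmark_lab
        elif assessment.startswith("quiz"):
            benchmark = benchmark_quiz
        else:
            continue  # no benchmark defined for other components
        gap = benchmark * 100 - mean(scores)
        suggestions[assessment] = round(max(gap, 0.0), 2)
    return suggestions
```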

tdkhanhvu · 2021-02-26T18:00:08Z

To make the parameter naming consistent, should we adopt this style (Python style)?

course_id
course_name
assessment_id
benchmark_course
benchmark_lab
benchmark_quiz

tdkhanhvu self-assigned this Feb 24, 2021

jraza19 assigned jraza19, jianructose and yhchen20 Feb 24, 2021

jraza19 added this to the Milestone 1 milestone Feb 25, 2021

jianructose mentioned this issue Feb 26, 2021

Function: Load the course and record the grade #10

Closed

jraza19 closed this as completed Feb 27, 2021
