
Data Science Ethics Considerations

A. Data Collection

  • A.1 Informed consent: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?

We only accept participants' data if they sign a form similar to this.

  • A.2 Collection bias: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?

We refer to the 12 sources of bias outlined here and attempt to eliminate the bias by incorporating all the suggestions provided.

  • A.3 Limit PII exposure: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?

When data is submitted, the participant's name is hashed before storage. Therefore, at no stage is any personal data (such as names) available to us. A participant can use their hash to request removal of their data in the future.
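
As a minimal sketch of this approach, assuming names are hashed with a salted SHA-256 digest (the salt handling and record layout below are illustrative, not our exact pipeline):

```python
import hashlib

def hash_name(name: str, salt: str) -> str:
    """Return a salted SHA-256 digest of a participant's name.

    The digest is stored in place of the name; the participant keeps
    the digest so they can later request removal of their records.
    """
    normalized = name.strip().lower()
    return hashlib.sha256((salt + normalized).encode("utf-8")).hexdigest()

# Example: the stored record never contains the raw name.
record = {"id_hash": hash_name("Jane Doe", salt="project-secret-salt"), "score": 0.72}
```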

B. Data Storage

  • B.1 Data security: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?

We have implemented a three-tier DBMS architecture that enforces data security and access controls. Refer to the Wiki page on DBMS architecture for more details.
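
For illustration only (this is not taken from our DBMS setup), encrypting records at rest could look like the following sketch using the `cryptography` package's Fernet recipe:

```python
from cryptography.fernet import Fernet

# Illustrative only: in practice the key belongs in a secrets store,
# not in source code.
key = Fernet.generate_key()
fernet = Fernet(key)

ciphertext = fernet.encrypt(b"participant record: id_hash=ab12..., score=0.72")
plaintext = fernet.decrypt(ciphertext)
assert plaintext.startswith(b"participant record")
```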

  • B.2 Right to be forgotten: Do we have a mechanism through which an individual can request their personal information be removed?

Individuals retain the right to be forgotten at any phase of this research. They are provided with a data-removal form, and once the form is signed, the participant's data is removed from the database. An example form can be found here.
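
A minimal sketch of how such a removal could be carried out, assuming records are keyed by the name hash described in A.3 (the table and column names here are hypothetical):

```python
import sqlite3

def forget_participant(db_path: str, id_hash: str) -> int:
    """Delete every record belonging to the given participant hash.

    Returns the number of rows removed. Table/column names are hypothetical.
    """
    with sqlite3.connect(db_path) as conn:
        cur = conn.execute("DELETE FROM participant_data WHERE id_hash = ?", (id_hash,))
        conn.commit()
        return cur.rowcount
```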

  • B.3 Data retention plan: Is there a schedule or plan to delete the data after it is no longer needed?

The data is retained only for as long as it is needed for the progress of the research. A data-retention schedule has been created and is enforced. A sample plan can be found here.
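
One way such a schedule could be enforced, sketched here with a hypothetical 24-month retention window (the window, table, and column names are illustrative, not our actual plan):

```python
import sqlite3
from datetime import datetime, timedelta

RETENTION_MONTHS = 24  # hypothetical retention window

def purge_expired(db_path: str) -> int:
    """Remove records older than the retention window; returns rows deleted."""
    cutoff = datetime.utcnow() - timedelta(days=30 * RETENTION_MONTHS)
    with sqlite3.connect(db_path) as conn:
        cur = conn.execute(
            "DELETE FROM participant_data WHERE collected_at < ?",
            (cutoff.isoformat(),),
        )
        conn.commit()
        return cur.rowcount
```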

C. Analysis

  • C.1 Missing perspectives: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?

The assumptions and implications are discussed with subject matter experts to check whether we are missing perspectives that are important for the research. The experts are consulted on a regular schedule as the research progresses.

  • C.2 Dataset bias: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?

As our data is credit based, we believe that our dataset already captures sufficiently high variation in each attribute of the data.
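
A quick way to check this claim is to inspect class balance and per-attribute spread; a minimal sketch with pandas (the file name and label column are hypothetical):

```python
import pandas as pd

df = pd.read_csv("credit_data.csv")  # hypothetical path to the Kaggle dataset

# Class balance of the target: heavily skewed classes would point to imbalance.
print(df["default"].value_counts(normalize=True))

# Spread of each numeric attribute: near-zero variance would contradict the
# "high variation" assumption.
print(df.describe().loc[["mean", "std", "min", "max"]])
```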

  • C.3 Honest representation: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?

All statistics and visualizations are generated directly from the data and from the results of the designed algorithms, and are reported honestly in the reports and presentations.

  • C.4 Privacy in analysis: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?

PII is displayed only when necessary for comparisons that show the results of the proposed algorithms and architectures. We respect the privacy of this data and have taken steps to avoid any unnecessary display of it.

  • C.5 Auditability: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

All processes for regenerating the analysis are well documented in the README file, including every step needed to reproduce the research reported in this repository.

D. Modeling

  • D.1 Proxy discrimination: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?

Yes, all the variables used to generate the predictions are simply the data obtained from Kaggle and do not exhibit any form of unfair discrimination.

  • D.2 Fairness across groups: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?

All the test samples have been collected from different affected groups. The model performs well for most of the test cases. A study of bias in prediction instruments is mentioned here.
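
A concrete way to test this is to compare error rates across groups; a minimal sketch (the group labels, predictions, and column names are hypothetical):

```python
import pandas as pd

# Hypothetical evaluation frame: one row per test sample.
results = pd.DataFrame({
    "group": ["A", "A", "B", "B", "B", "A"],
    "y_true": [1, 0, 1, 1, 0, 0],
    "y_pred": [1, 0, 0, 1, 1, 0],
})

# Error rate per group; large gaps between groups indicate disparate error rates.
error_by_group = (
    results.assign(error=lambda d: (d["y_true"] != d["y_pred"]).astype(int))
           .groupby("group")["error"]
           .mean()
)
print(error_by_group)
```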

  • D.3 Metric selection: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?

Optimizing the metrics against a single fixed test set could bias the model towards that test set. This can be mitigated by increasing the test-set size or by considering different train/test splits each time, as sketched below.
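
A minimal sketch of evaluating over multiple splits rather than one fixed test set, using scikit-learn cross-validation (the model and data here are placeholders, not our actual pipeline):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for the credit dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

model = LogisticRegression(max_iter=1000)

# Score over 5 different train/test splits instead of a single fixed one.
scores = cross_val_score(model, X, y, cv=5, scoring="f1")
print(scores.mean(), scores.std())
```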

  • D.4 Explainability: Can we explain in understandable terms a decision the model made in cases where a justification is needed?

We can, to some extent, explain to a stakeholder why the model made a particular decision. As seen from the EDA graphs generated, certain highly important features are able to explain the data quite well.
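
For instance, a tree-based model's feature importances can be ranked to show which attributes drove a decision; a minimal sketch with placeholder data (not our actual model or features):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder data and feature names standing in for the credit attributes.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

model = RandomForestClassifier(random_state=0).fit(X, y)

# Rank features by importance; the top few can anchor an explanation.
order = np.argsort(model.feature_importances_)[::-1]
for i in order[:3]:
    print(feature_names[i], round(model.feature_importances_[i], 3))
```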

  • D.5 Communicate bias: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

The main shortcoming of this model is the high variance in the collected data, which shows how difficult it is to find a particular spending/credit pattern among such a huge variety of individuals.

E. Deployment

  • E.1 Redress: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?

If users were to be harmed by our results, we have a feedback system in place through which we can find the exact case, determine where the predictions went wrong, investigate, and work to fix these issues in the future.

  • E.2 Roll back: Is there a way to turn off or roll back the model in production if necessary?

In critical conditions where our model goes haywire and has to be stopped, we will implement a kill switch to safely stop the model and also remove all data fed to it. Access to the kill switch lies only with the highest-level stakeholders. The idea of a kill switch and why it is necessary is explained in Safely Interruptible Agents.
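
A minimal sketch of how such a switch could gate the deployed model (the flag file and function names are hypothetical, not our production setup):

```python
import os

KILL_SWITCH_PATH = "/etc/credit-model/KILL_SWITCH"  # hypothetical flag file

def predict_or_refuse(model, features):
    """Serve a prediction only while the kill switch is absent."""
    if os.path.exists(KILL_SWITCH_PATH):
        # Switch engaged: refuse to serve and let the caller fall back.
        raise RuntimeError("Model disabled by kill switch")
    return model.predict([features])[0]
```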

  • E.3 Concept drift: Do we test and monitor for concept drift to ensure the model remains fair over time?

In order to address concept drift, we periodically re-fit the model. This involves back-testing the model in order to select a suitable amount of historical data to include when re-fitting the static model. The problem of concept drift and its related work are discussed here.
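
A minimal sketch of one simple drift monitor, which flags a re-fit when recent error drifts above the back-tested baseline (the baseline, margin, and window below are illustrative assumptions):

```python
import numpy as np

BASELINE_ERROR = 0.12   # error rate measured during back-testing (illustrative)
DRIFT_MARGIN = 0.05     # degradation tolerated before triggering a re-fit

def needs_refit(y_true: np.ndarray, y_pred: np.ndarray) -> bool:
    """Return True when the recent error rate exceeds the baseline by the margin."""
    recent_error = float(np.mean(y_true != y_pred))
    return recent_error > BASELINE_ERROR + DRIFT_MARGIN

# Example: recent window of labels vs. predictions.
print(needs_refit(np.array([1, 0, 1, 1]), np.array([0, 0, 0, 1])))
```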

  • E.4 Unintended use: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?

We have not taken steps to identify unintended use. However, we have implemented access rights and permission hierarchies to prevent unintended use of the model.