arvinsingh/biases-in-data

Biases in data and models.

This repository explores biases and abuses in data and studies their effects through a series of experiments. The experiments will be conducted in Jupyter Notebook to analyze the impact of biases in data and to find ways to minimize them.

Tech Stack

  1. Keras with TensorFlow
  2. Numerical Python Stack
  3. Word2Vec
  4. Scikit-Learn
  5. Jupyter

Datasets

  1. Cat Vs Dog
  2. Titanic Dataset
  3. Statlog - German Credit Data

Introduction

In today's data-driven world, it is crucial to be aware of the biases and abuses that can exist within datasets. Biases can arise from various sources, such as data collection methods, sampling techniques, or even human judgment. These biases can lead to skewed results and unfair outcomes, impacting decision-making processes and perpetuating inequalities.

The purpose of this project is to shed light on the presence of biases and abuses in data and trained models, and to explore ways to mitigate their effects.

Topics to explore

  1. Bias in Natural Language Processing models.
  2. Convolutional Neural Network Manifold Learning.
  3. Global Black-box Explanation.
  4. Local Black-box Explanation.
  5. FairML.

Biases in Data

Biases in data can occur in different forms, including:

  • Selection Bias: When certain groups or characteristics are overrepresented or underrepresented in the dataset due to biased sampling methods.
  • Confirmation Bias: When data is selectively collected or interpreted to support preconceived notions or beliefs.
  • Measurement Bias: When measurement instruments or techniques introduce systematic errors or inaccuracies.
  • Cultural Bias: When data reflects the biases and perspectives of a particular culture or group.
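
As an illustration of selection bias, the sketch below builds a small synthetic population and then draws a sample that underrepresents one group. The data, the group names `A` and `B`, and the helper functions `group_share` and `positive_rate` are all hypothetical and are not taken from the notebooks in this repository.

```python
# Synthetic, hypothetical data: 500 records per group.
# Group A has a 60% positive-outcome rate, group B a 30% rate.
group_a = [{"group": "A", "outcome": 1 if i % 10 < 6 else 0} for i in range(500)]
group_b = [{"group": "B", "outcome": 1 if i % 10 < 3 else 0} for i in range(500)]
population = group_a + group_b

# Biased sampling: every group-A record is kept, but only the first 100
# group-B records are, so group B is underrepresented in the sample.
sample = group_a + group_b[:100]


def group_share(rows, group):
    """Fraction of rows that belong to `group`."""
    return sum(row["group"] == group for row in rows) / len(rows)


def positive_rate(rows):
    """Fraction of rows with a positive outcome."""
    return sum(row["outcome"] for row in rows) / len(rows)


for name, rows in (("population", population), ("biased sample", sample)):
    print(f"{name}: share of B = {group_share(rows, 'B'):.2f}, "
          f"positive rate = {positive_rate(rows):.2f}")
```

In this toy setup the overall positive rate rises from 0.45 in the population to 0.55 in the biased sample, even though the rate within each group is unchanged. The shift comes entirely from how the sample was drawn.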

Experimental Setup

The experiments will be conducted using Jupyter Notebook, a popular tool for data analysis and visualization. The datasets used in the experiments will be carefully selected to highlight different types of biases and potential abuses. The code and analysis will be documented in the Jupyter Notebook files provided in this repository.

Results and Analysis

The results obtained from the experiments will be analyzed to identify the presence and impact of biases in the data. Various statistical techniques and machine learning algorithms will be used to quantify and understand the biases. Additionally, strategies and methodologies to minimize biases and improve the fairness of the data will be explored.
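
One simple way to quantify bias in a model's output is to compare how often each group receives a positive prediction. The sketch below computes two common group-fairness measures, the demographic parity difference and the disparate impact ratio, in plain Python. The predictions, the group labels `a` and `b`, and the function names are illustrative assumptions; the notebooks may use different metrics or libraries.

```python
def selection_rate(predictions, groups, group):
    """Share of positive predictions (1s) among members of `group`."""
    picked = [p for p, g in zip(predictions, groups) if g == group]
    if not picked:
        raise ValueError(f"no rows for group {group!r}")
    return sum(picked) / len(picked)


def demographic_parity_difference(predictions, groups, group_a, group_b):
    """Selection rate of `group_a` minus selection rate of `group_b`."""
    return (selection_rate(predictions, groups, group_a)
            - selection_rate(predictions, groups, group_b))


def disparate_impact_ratio(predictions, groups, protected, reference):
    """Selection rate of `protected` divided by that of `reference`."""
    reference_rate = selection_rate(predictions, groups, reference)
    if reference_rate == 0:
        raise ZeroDivisionError("reference group has no positive predictions")
    return selection_rate(predictions, groups, protected) / reference_rate


# Hypothetical binary predictions for ten individuals in two groups.
predictions = [1, 1, 1, 0, 1, 0, 0, 0, 1, 0]
groups = ["a"] * 5 + ["b"] * 5

gap = demographic_parity_difference(predictions, groups, "a", "b")
ratio = disparate_impact_ratio(predictions, groups, "b", "a")
print(f"demographic parity difference: {gap:.2f}")
print(f"disparate impact ratio: {ratio:.2f}")
```

A difference near 0 and a ratio near 1 indicate similar treatment of the two groups. A ratio below 0.8 is often treated as a warning sign (the "four-fifths" rule of thumb), though the appropriate threshold depends on the application.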

Conclusion

By studying biases and abuses in data, I aim to raise awareness about their existence and impact on decision-making processes. Through rigorous experimentation and analysis, I strive to develop best practices and guidelines to minimize biases and promote fairness in data-driven applications.

Please refer to the Jupyter Notebook files in this repository for detailed experiments, code, and analysis.

Insights

Insights are presented as critical questions and discussions at the end of each notebook.