
# Section 4: Python example - ethical data analysis scenarios

In this section, we work through practical Python examples of ethical decision-making in data analysis. The scenarios focus on two practices, data anonymization and bias detection in a trained model, and give a glimpse of how each fits into the data analysis workflow.

1. Setting Up the Environment:

For these examples, ensure your Python environment includes libraries that support data manipulation and ethical modeling practices:

In [None]:
%pip install pandas numpy scikit-learn

2. Importing Required Libraries:

We'll use pandas for data manipulation, NumPy for numerical operations, and scikit-learn for modeling and for the metrics used to compare groups:

In [None]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score


3. Scenario 1: Data Anonymization

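Anonymization helps protect personal information in datasets. In this example, we remove the direct identifier (the `Name` column) and shuffle the rows so that record order no longer hints at identity. This is only a first step: quasi-identifiers such as `Age` can still allow re-identification when combined with other data.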

In [None]:
# Sample data creation
data = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Medical Condition': ['Diabetes', 'Healthy', 'Heart Disease', 'Healthy']
})

# Removing the direct identifier
data = data.drop(columns='Name')

# Shuffling rows so the original record order is not preserved
data = data.sample(frac=1).reset_index(drop=True)
print(data)
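
Dropping names and shuffling rows leaves the exact `Age` values in place. One common follow-up is to generalize such quasi-identifiers and check how many records share each generalized value (the idea behind k-anonymity). The cell below is a minimal sketch of that check, not a complete anonymization procedure; the `records` data, the 10-year bin edges, and the `Age_Band` column name are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical records with the same columns as the anonymized example
records = pd.DataFrame({
    'Age': [25, 27, 31, 38, 44, 46],
    'Medical Condition': ['Diabetes', 'Healthy', 'Heart Disease',
                          'Healthy', 'Diabetes', 'Healthy'],
})

# Generalize exact ages into 10-year bands (bin edges are an assumption)
records['Age_Band'] = pd.cut(
    records['Age'],
    bins=[20, 30, 40, 50],
    labels=['21-30', '31-40', '41-50'],
)
generalized = records.drop(columns='Age')

# k is the size of the smallest group sharing the same quasi-identifier value
group_sizes = generalized.groupby('Age_Band', observed=True).size()
k = int(group_sizes.min())

print(group_sizes)
print(f'Smallest group size (k): {k}')
```

A larger `k` means each record is harder to single out; if `k` is 1, at least one person is still unique on the generalized attributes and wider bands (or suppression of that record) may be needed.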

4. Scenario 2: Detecting and Mitigating Bias

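Bias in datasets can lead to unfair outcomes. Here, we simulate a dataset with a gender attribute and check whether a trained model performs differently for each group, which is the first step toward addressing bias. Because the synthetic scores are generated independently of gender and the test set holds only 20 rows, any gap in the scores below mostly reflects sampling noise; on real data, a persistent gap would warrant closer investigation.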

In [None]:
# Generating synthetic data
np.random.seed(0)
data = pd.DataFrame({
    'Gender': ['Male'] * 50 + ['Female'] * 50,
    'Hours_Studied': np.random.normal(30, 10, 100),
    'Exam_Score': np.random.normal(75, 15, 100)
})

# Encoding categorical data (LabelEncoder sorts labels alphabetically: Female -> 0, Male -> 1)
data['Gender'] = LabelEncoder().fit_transform(data['Gender'])

# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(
    data[['Gender', 'Hours_Studied']],
    data['Exam_Score'],
    test_size=0.2,
    random_state=42
)

# Model Training
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)

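# Detecting bias: compare the model's R-squared score on each group
# (this is a descriptive comparison, not a statistical significance test)
male_indices = X_test['Gender'] == 1    # Male was encoded as 1
female_indices = X_test['Gender'] == 0  # Female was encoded as 0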

male_pred = model.predict(X_test[male_indices])
female_pred = model.predict(X_test[female_indices])

male_actual = y_test[male_indices]
female_actual = y_test[female_indices]

male_score = r2_score(male_actual, male_pred)
female_score = r2_score(female_actual, female_pred)

print(f'Male Score: {male_score:.3f}, Female Score: {female_score:.3f}')
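
The cell above only detects a performance gap. One simple mitigation to try is removing the sensitive attribute from the features and comparing per-group results again. The cell below is a sketch of that comparison under stated assumptions: `group_report` is a hypothetical helper, mean absolute error is swapped in for R-squared because it is more stable on small groups, and the split is stratified by gender so both groups appear equally in the test set. Dropping the attribute is a weak safeguard on real data, since other features can act as proxies for it.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Same kind of synthetic setup: scores are generated independently of gender
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'Gender': [1] * 50 + [0] * 50,  # 1 = Male, 0 = Female
    'Hours_Studied': rng.normal(30, 10, 100),
    'Exam_Score': rng.normal(75, 15, 100),
})

# Stratify on Gender so each group contributes equally to the test set
X_train, X_test, y_train, y_test = train_test_split(
    df[['Gender', 'Hours_Studied']], df['Exam_Score'],
    test_size=0.2, random_state=42, stratify=df['Gender'],
)

def group_report(feature_cols):
    """Train on feature_cols; return per-group size, MAE, and mean prediction."""
    model = RandomForestRegressor(random_state=42)
    model.fit(X_train[feature_cols], y_train)
    preds = pd.Series(model.predict(X_test[feature_cols]), index=X_test.index)
    rows = {}
    for label, code in [('Male', 1), ('Female', 0)]:
        mask = X_test['Gender'] == code
        rows[label] = {
            'n': int(mask.sum()),
            'mae': mean_absolute_error(y_test[mask], preds[mask]),
            'mean_pred': preds[mask].mean(),
        }
    return pd.DataFrame(rows).T

with_gender = group_report(['Gender', 'Hours_Studied'])
without_gender = group_report(['Hours_Studied'])

print('With Gender as a feature:')
print(with_gender)
print('Without Gender as a feature:')
print(without_gender)
```

Comparing the `mae` and `mean_pred` columns across the two reports shows whether removing the attribute narrows the gap between groups; with data this small and random, differences in either direction should be read as noise rather than evidence.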


5. Conclusion:

These examples highlight the importance of ethical considerations in data science projects. Anonymizing data helps protect individual privacy, while checking models for bias is a necessary step toward fair and equitable outcomes. Neither practice is a guarantee on its own: removing names does not rule out re-identification, and a single metric comparison does not prove a model is fair. By integrating these practices into data science workflows, practitioners can uphold ethical standards and foster trust in their applications. Ethics should be treated as a continuous process that evolves with new research findings and changing societal norms.