# Ethics in Data Science

**Instructions:** This is an individual assignment, but you may discuss your code with your neighbors.

Please see the README for instructions on how to obtain and submit the lab.

In [5]:
#### NO NEED TO EDIT ####
from pathlib import Path
home = str(Path.home()) # all other paths are relative to this path. change to something else if this is not the case on your system
REPO = f"{home}/csc-466-student"
NOTEBOOK = "Lab1"

%load_ext autoreload
%autoreload 2

from importlib import import_module
helper = import_module(f'{NOTEBOOK}_helper')

#### NO NEED TO EDIT ####

We have preprocessed a dataset on loan applications to make this example appropriate for linear regression. The independent variable data is real and has not been modified apart from being transformed into numeric form (e.g., Married=Yes => Married=1). In other words, this is a real dataset with minimal modifications.

Our client is a loan company. They would like you to look at historical data on 296 loans that were approved for varying amounts, with the approved amount stored in the column ``LoanAmountApproved``. They are interested in identifying which independent variables are the most influential when predicting the approved loan amount. Upon ethical review, they have determined that ``Gender`` is a protected column and should not be considered in the analysis.

In [6]:
import pandas as pd

# Read in the data into a pandas dataframe
credit = pd.read_csv(f"{REPO}/data/credit.csv",index_col=0)
credit.head()

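| Unnamed: 0 | Married | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | Loan_Amount_Term | Credit_History | Property_Area | Gender | LoanAmountApproved |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1.0 | 1 | 0.0 | 4583 | 1508.0 | 360.0 | 1.0 | 0.0 | 0.0 | 366797.0 |
| 2 | 1.0 | 1 | 1.0 | 3000 | 0.0 | 360.0 | 1.0 | 1.0 | 0.0 | 242073.0 |
| 3 | 1.0 | 0 | 0.0 | 2583 | 2358.0 | 360.0 | 1.0 | 1.0 | 0.0 | 162284.0 |
| 4 | 0.0 | 1 | 0.0 | 6000 | 0.0 | 360.0 | 1.0 | 1.0 | 0.0 | 356509.0 |
| 5 | 1.0 | 1 | 1.0 | 5417 | 4196.0 | 360.0 | 1.0 | 1.0 | 0.0 | 371674.0 |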


#### Exercise 1. Construct a linear model.

Your model should predict ``LoanAmountApproved`` using all of the columns except ``Gender``, which an ethical review deemed inappropriate to consider when determining the loan amount approved for an applicant.

Use ``sklearn.linear_model.LinearRegression`` with the default constructor arguments. You'll need to call ``.fit``. The documentation for this is so easy to get I almost hesitate to put a link in here, but here you go :)

https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares
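
If the fit-then-inspect pattern is new to you, here is a minimal sketch on made-up data (not the credit dataset, and not the contents of the helper file). The names ``toy_X``, ``toy_y``, and ``toy_model`` are hypothetical. The target is built as exactly ``3*a - 2*b + 5``, so the fitted coefficients should come back as roughly 3 and -2 with an intercept near 5.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data (NOT the credit dataset): y = 3*a - 2*b + 5 with no noise
toy_X = pd.DataFrame({
    "a": [0.0, 1.0, 2.0, 3.0, 4.0],
    "b": [1.0, 0.0, 2.0, 1.0, 3.0],
})
toy_y = 3 * toy_X["a"] - 2 * toy_X["b"] + 5

toy_model = LinearRegression()  # default constructor arguments
toy_model.fit(toy_X, toy_y)     # learns one coefficient per column plus an intercept

# Label each coefficient with the column it belongs to
toy_coef = pd.Series(toy_model.coef_, index=toy_X.columns)
print(toy_coef)
print(toy_model.intercept_)
```

The same two calls (construct, then ``.fit``) apply to any numeric feature table and target column.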

In [7]:
X = credit.drop(['Gender','LoanAmountApproved'],axis=1)
y = credit['LoanAmountApproved']

model = helper.exercise_1(X,y)

# Here is code that takes the numpy array of coefficients stored in model.coef_ and formats it nicely for the screen
coef = pd.Series(model.coef_,index=X.columns)
coef.abs().sort_values(ascending=False) # this takes the absolute value and then sorts the values in descending order

Married              70486.456486
Education            16443.486789
Self_Employed         8051.192483
Property_Area         3148.984202
Credit_History        1968.229524
ApplicantIncome         49.547937
Loan_Amount_Term        47.370787
CoapplicantIncome        0.539701
dtype: float64

In [8]:
!pytest -vv --diff-symbols {REPO}/tests/test_{NOTEBOOK}.py::test_exercise_1

platform linux -- Python 3.9.7, pytest-7.4.0, pluggy-1.2.0 -- /opt/tljh/user/bin/python3.9
cachedir: .pytest_cache
rootdir: /home/jupyter-pander14/csc-466-instructor
plugins: clarity-1.0.1, anyio-3.7.0
collected 1 item

../tests/test_Lab1.py::test_exercise_1 PASSED                            [100%]



Now let's write some code that calculates the mean absolute error (MAE) of our model, which is one measure of how well the model is performing. Note that we are measuring it on the same data the model was fit on. It looks like our predictions are off by approximately $27K on average.

In [9]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y,model.predict(X))

26916.852660086042
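
To make the metric concrete: the mean absolute error is the average of the absolute differences between each true value and its prediction. The sketch below uses made-up numbers (``y_true`` and ``y_pred`` are hypothetical, not taken from the loan data) and checks that a manual calculation agrees with ``mean_absolute_error``.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical values, chosen only to make the arithmetic easy to follow
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 400.0])

# Absolute errors are 10, 10, 30, and 0, so the mean is 50 / 4
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(mae_manual, mae_sklearn)
```

Because the errors are not squared, the MAE is in the same units as the target, which is why we can read the value above directly as dollars.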

The company asks you for your interpretation of the model. You say that being married is a strong indicator of receiving a high loan amount. This surprises some of your colleagues, but they find it reasonable. Everyone seems happy with the work. However, an experienced data scientist on your team suggests you check how strongly each column used in the regression correlates with the column ``Gender``, since it is considered a protected column. You do so quickly to satisfy this request and get:

In [10]:
Xgender = X.copy()
Xgender['Gender'] = credit['Gender']
Xgender.corr().loc['Gender'].abs().sort_values(ascending=False).drop('Gender')

Married              0.361358
CoapplicantIncome    0.164699
Loan_Amount_Term     0.148585
Education            0.081054
Self_Employed        0.033836
ApplicantIncome      0.024081
Property_Area        0.018244
Credit_History       0.007251
Name: Gender, dtype: float64
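
The idea behind this check is that a column left in the model can act as a proxy for a protected column that was removed. Here is a small sketch of the same kind of check on made-up data, using ``DataFrame.corrwith`` as an alternative to building the full correlation matrix. The column names ``protected``, ``proxy``, and ``unrelated`` are hypothetical and have nothing to do with the credit data.

```python
import pandas as pd

# Hypothetical toy columns (NOT the credit data)
toy = pd.DataFrame({
    "protected": [0, 0, 0, 0, 1, 1, 1, 1],
    "proxy":     [0, 0, 0, 1, 1, 1, 1, 0],  # agrees with "protected" in 6 of 8 rows
    "unrelated": [0, 1, 0, 1, 0, 1, 0, 1],  # split evenly within both groups
})

# Correlate every remaining feature with the protected column
features = toy.drop(columns="protected")
corr_with_protected = (
    features.corrwith(toy["protected"]).abs().sort_values(ascending=False)
)
print(corr_with_protected)
```

In this toy example ``proxy`` has a correlation of 0.5 with the protected column and ``unrelated`` has a correlation of 0, even though the protected column itself never appears in ``features``.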

**The problems labeled "Problem X" will not be autograded.** They will be factored into your participation score at the end of the quarter. In other words, complete them :)

#### Problem 1: 

What do you think about the results? Specifically, is the fact that ``Married`` is correlated with ``Gender`` at a correlation of 0.36 concerning from an ethical standpoint? What do you as an individual think? Can you think of any suggestions about what to do? Which ethical framework did you use in your decision-making process: utilitarianism, deontology, or virtue ethics?

Let's assume your suggestion was to remove the ``Married`` column. Do so, refit the model, and then compare its mean absolute error with that of the original model.

In [11]:
from sklearn.linear_model import LinearRegression

# YOUR SOLUTION HERE
# X2 = # make this a dataframe without the married column
# model2 = # make this your linear model
mean_absolute_error(y,model2.predict(X2))

39450.6338289212
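
To see why dropping an influential column tends to raise the error, here is a sketch of the drop, refit, and compare pattern on made-up data. It is not the solution on the credit data: ``demo_X``, ``demo_y``, and the ``flag`` and ``x`` columns are all hypothetical. The target depends heavily on ``flag``, so the model without it cannot fit as well.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Hypothetical toy data (NOT the credit dataset): y = 10*flag + x with no noise
demo_X = pd.DataFrame({
    "flag": [0, 1, 0, 1, 0, 1],
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})
demo_y = 10 * demo_X["flag"] + demo_X["x"]

# Model fit on every column
full_model = LinearRegression().fit(demo_X, demo_y)
mae_full = mean_absolute_error(demo_y, full_model.predict(demo_X))

# Model fit after dropping the influential column
demo_X_reduced = demo_X.drop(columns=["flag"])
reduced_model = LinearRegression().fit(demo_X_reduced, demo_y)
mae_reduced = mean_absolute_error(demo_y, reduced_model.predict(demo_X_reduced))

print(mae_full, mae_reduced)
```

Here the full model fits essentially perfectly while the reduced model does not. This is the same trade-off between predictive performance and excluding a column that the next problem asks you to weigh.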

**Upload your solution/answer to Canvas.**

#### Problem 2:

What do you think now: should you drop the ``Married`` column? The mean absolute error has risen from about 27K to about 39K, an increase of more than $12,000. Is this OK?

**Upload your solution/answer to Canvas.**

In [12]:
# Good job!
# Don't forget to push with ./submit.sh