# An introduction to Bias in Data

**Instructions:** This is an individual assignment, but you may discuss your code with your classmates.

**Problem type key and definition:**
* _Exercises_ are autograded on GitHub classroom
* _Problems_ are manually graded and often open ended without a single correct answer.
* _Stop and think_ prompts are not graded, and are provided to guide you.

Please see the README for instructions on how to submit and obtain the lab.

In [2]:
%load_ext autoreload
%autoreload 2


# Put all your solutions into Lab1_helper.py, which is the script that is autograded
import Lab1_helper

from pathlib import Path
home = str(Path.home()) # all other paths are relative to this path. 
# This is not relevant to most people because I recommended you use my server, but
# change home to where you are storing everything. Again. Not recommended.

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [3]:
import pandas as pd
df_airbnb = pd.read_csv(f'{home}/data-301-student/data/airbnb.csv')

FileNotFoundError: [Errno 2] No such file or directory: '/home/jupyter-pander14/data-301-student/data/airbnb.csv'

#### Discuss your previous answer in the context of selection bias, algorithmic bias, human bias, a biased estimator (namely one where $E(\hat{\beta}) \neq \beta$), and regularization bias (e.g., the ridge estimator $\hat{\beta} = (X'X + \lambda I)^{-1}X'y$) (15). You should cite at least two of the articles/podcasts we covered in this module:

[Fisman and Luca (2016), Fixing Discrimination](https://hbr.org/2016/12/fixing-discrimination-in-online-marketplaces)

[NPR, Bias in the sharing economy](https://www.npr.org/2016/04/26/475623339/-airbnbwhileblack-how-hidden-bias-shapes-the-sharing-economy)

[Benjamin 2019, Automating Bias](https://science.sciencemag.org/content/366/6464/421)

[O'Neil Weapons of Math Destruction](https://www.npr.org/2018/01/26/580617998/cathy-oneil-do-algorithms-perpetuate-human-bias) 

[MIT article, Machine Learning Bias](https://sloanreview.mit.edu/article/the-risk-of-machine-learning-bias-and-how-to-prevent-it/)

[Vox Fight for your Face](https://www.vox.com/today-explained/archives)




#your answer here

We can see that how often an Airbnb listing has been reviewed is correlated with higher ratings, suggesting that there is a **selection bias** occurring in how customers use Airbnb. Customers are more inclined to stay at, and then review, Airbnbs that already have many ratings than to select Airbnbs with fewer ratings (similar to position bias in recommender systems, where users are more likely to click on higher-ranked search results).

Further, it suggests that the review process is not [incentive compatible from a microeconomic theory perspective](https://www.britannica.com/topic/incentive-compatibility) (note that Airbnb has since changed the review process). Namely, customers are more likely to leave a review when their experience is positive than when it is negative. Economists [Fisman and Luca (2016)](https://hbr.org/2016/12/fixing-discrimination-in-online-marketplaces) proposed a series of market design choices that might reduce discrimination in online markets more generally, such as further automating transactions on platforms. As a result of this work, the company created a task force that weighed the different options, which led to a full-time team of data scientists to explore discrimination on an ongoing basis.


Note that the selection bias above could be heavily driven by **human bias**. [Recall the Harvard Business School experiment on bias in the sharing economy](https://www.npr.org/2016/04/26/475623339/-airbnbwhileblack-how-hidden-bias-shapes-the-sharing-economy). If customers are less likely to stay at Airbnbs hosted by an African American or Black host, then those listings will progressively lose their ranking in users' search results.

Finally, the algorithm (**algorithmic bias**) exacerbates this selection by being more likely to show Airbnb users the listings that already have a lot of reviews.

Of course, all of the above biases are distinct from the notion of a **biased estimator** in statistical inference (though **selection bias** is one cause of estimators being biased). That notion is in turn different from the bias that **regularization** deliberately introduces in order to reduce variance in the bias/variance tradeoff within prediction! But neither ridge nor RF can be used for inference.
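To make the distinction between an unbiased estimator and regularization bias concrete, here is a minimal simulation sketch. It is not part of the lab data: the true coefficients, the sample size, and the penalty `lam` are all assumed values chosen for illustration. It repeatedly draws noisy outcomes from a known linear model and averages the ordinary least squares and ridge estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 50, 25.0                  # sample size and ridge penalty (assumed values)
beta = np.array([2.0, -1.0])       # "true" coefficients (assumed values)
X = rng.normal(size=(n, 2))        # one fixed design matrix reused in every replication

def ols(X, y):
    # Solves (X'X) b = X'y
    return np.linalg.solve(X.T @ X, X.T @ y)

def ridge(X, y, lam):
    # Solves (X'X + lam * I) b = X'y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

ols_draws, ridge_draws = [], []
for _ in range(5000):
    y = X @ beta + rng.normal(size=n)   # new noise each replication
    ols_draws.append(ols(X, y))
    ridge_draws.append(ridge(X, y, lam))

ols_mean = np.mean(ols_draws, axis=0)      # approximates E(beta_hat) for OLS
ridge_mean = np.mean(ridge_draws, axis=0)  # approximates E(beta_hat) for ridge

print("true beta :", beta)
print("OLS mean  :", ols_mean)
print("ridge mean:", ridge_mean)
```

Under these assumptions the OLS average should sit very close to the true coefficients, while the ridge average is pulled toward zero. That shrinkage is the bias that regularization accepts in exchange for lower variance.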

## Exercise 0
Please read and reference the following as you progress through this course.

* [What is the Jupyter Notebook?](https://nbviewer.jupyter.org/github/jupyter/notebook/blob/master/docs/source/examples/Notebook/What%20is%20the%20Jupyter%20Notebook.ipynb#)
* [Notebook Tutorial](https://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook)
* [Notebook Basics](https://nbviewer.jupyter.org/github/jupyter/notebook/blob/master/docs/source/examples/Notebook/Notebook%20Basics.ipynb)

## Exercises 1-2
We will use Pandas a lot in this course (NumPy some but less so). Please read and study [10 minutes to Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html) before proceeding to any of the exercises below.

#### Exercise 1. Make a dataframe called ``a`` of size 6 x 4 where every element is a 2.

In [2]:
a = Lab1_helper.exercise_1()
a

Unnamed: 0,0,1,2,3
0,2,2,2,2
1,2,2,2,2
2,2,2,2,2
3,2,2,2,2
4,2,2,2,2
5,2,2,2,2


#### Exercise 2. Create a dataframe that contains the content from the following table. Set the index of this dataframe to Series.

Notes: All of the columns should be strings. Missing values should be filled with ``np.nan`` (the ``np.NaN`` alias was removed in NumPy 2.0).

|             Series             |  Aired  |            Episodes           |
|:------------------------------:|:-------:|:-----------------------------:|
| The Marvel Super Heroes        | 1966    | 65                            |
| Fantastic Four                 | 1967-68 | 20                            |
| Spider-Man                     | 1967-70 | 52                            |
| The New Fantastic Four         | 1978    | 13                            |
| Fred and Barney Meet the Thing | 1979    | 13 (26 segments of The Thing) |
| Spider-Woman                   | 1979-80 |                               |


In [3]:
b = Lab1_helper.exercise_2()
b

Unnamed: 0_level_0,Aired,Episodes
Series,Unnamed: 1_level_1,Unnamed: 2_level_1
The Marvel Super Heroes,1966,65
Fantastic Four,1967-68,20
Spider-Man,1967-70,52
The New Fantastic Four,1978,13
Fred and Barney Meet the Thing,1979,13 (26 segments of The Thing)
Spider-Woman,1979-80,


## Exercises 3-10
Now let's look at a dataset read from a file, and talk about ``.iloc`` and ``.loc``. For this exercise, we will use the popular Titanic dataset from Kaggle. Here is some sample code to read it into a dataframe.

In [4]:
import pandas as pd
titanic_df = pd.read_csv("https://dlsun.github.io/pods/data/titanic.csv")
titanic_df

Unnamed: 0,name,gender,age,class,embarked,country,ticketno,fare,survived
0,"Abbing, Mr. Anthony",male,42.0,3rd,S,United States,5547.0,7.11,0
1,"Abbott, Mr. Eugene Joseph",male,13.0,3rd,S,United States,2673.0,20.05,0
2,"Abbott, Mr. Rossmore Edward",male,16.0,3rd,S,United States,2673.0,20.05,0
3,"Abbott, Mrs. Rhoda Mary 'Rosa'",female,39.0,3rd,S,England,2673.0,20.05,1
4,"Abelseth, Miss. Karen Marie",female,16.0,3rd,S,Norway,348125.0,7.13,1
...,...,...,...,...,...,...,...,...,...
2202,"Wynn, Mr. Walter",male,41.0,deck crew,B,England,,,1
2203,"Yearsley, Mr. Harry",male,40.0,victualling crew,S,England,,,1
2204,"Young, Mr. Francis James",male,32.0,engineering crew,S,England,,,0
2205,"Zanetti, Sig. Minio",male,20.0,restaurant staff,S,England,,,0


In [5]:
titanic_df.index # default index

RangeIndex(start=0, stop=2207, step=1)

#### Stop and think: How would you set the index of the dataframe to the ``gender`` column and then select only those passengers who are female?

In [6]:
df = titanic_df.set_index('gender').loc['female']
df

Unnamed: 0_level_0,name,age,class,embarked,country,ticketno,fare,survived
gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
female,"Abbott, Mrs. Rhoda Mary 'Rosa'",39.0,3rd,S,England,2673.0,20.0500,1
female,"Abelseth, Miss. Karen Marie",16.0,3rd,S,Norway,348125.0,7.1300,1
female,"Abelson, Mrs. Hannah",28.0,2nd,C,France,3381.0,24.0000,1
female,"Ahlin, Mrs. Johanna Persdotter",40.0,3rd,S,Sweden,7546.0,9.0906,0
female,"Aks, Mrs. Leah",18.0,3rd,S,England,392091.0,9.0700,1
...,...,...,...,...,...,...,...,...
female,"Smith, Miss. Katherine Elizabeth",45.0,victualling crew,S,England,,,1
female,"Snape, Mrs. Lucy Violet",22.0,victualling crew,S,England,,,0
female,"Stap, Miss. Sarah Agnes",47.0,victualling crew,S,England,,,1
female,"Wallis, Mrs. Catherine Jane",36.0,victualling crew,S,England,,,0


What is important about the above statement is that we use ``.loc`` when we want to reference the pandas dataframe index by label. The index labels can be integers (starting at 0, starting at 1, or in no particular order), or they can be strings as in the example above. If you want a traditional positional index like you would use with an array, then use ``.iloc``.
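The difference is easiest to see on a small frame whose index labels do not match row positions. The following is a minimal sketch on a made-up dataframe (the names and ages are invented for illustration and are not from the Titanic data):

```python
import pandas as pd

toy = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cy", "Dee"], "age": [31, 45, 27, 52]},
    index=[10, 20, 30, 40],          # integer labels that do not start at 0
)

by_label = toy.loc[20]               # the row whose index label is 20
by_position = toy.iloc[1]            # the second row, whatever its label is

# Slices behave differently: .loc includes the end label,
# while .iloc excludes the end position (like a Python list).
loc_slice = toy.loc[10:30]           # labels 10, 20, 30
iloc_slice = toy.iloc[0:2]           # positions 0 and 1

# With a string index, .loc takes the strings directly.
by_name = toy.set_index("name").loc["Cy"]

print(by_label["name"], by_position["name"])
print(len(loc_slice), len(iloc_slice), by_name["age"])
```

Here `toy.loc[20]` and `toy.iloc[1]` return the same row, but `toy.loc[1]` would raise a `KeyError` because no row is labeled 1.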

#### Exercise 3. Select the ``name`` column without using ``.iloc``.

In [7]:
sel = Lab1_helper.exercise_3(titanic_df)
sel

0                  Abbing, Mr. Anthony
1            Abbott, Mr. Eugene Joseph
2          Abbott, Mr. Rossmore Edward
3       Abbott, Mrs. Rhoda Mary 'Rosa'
4          Abelseth, Miss. Karen Marie
                     ...              
2202                  Wynn, Mr. Walter
2203               Yearsley, Mr. Harry
2204          Young, Mr. Francis James
2205               Zanetti, Sig. Minio
2206                Zarracchi, Sig. L.
Name: name, Length: 2207, dtype: object

#### Exercise 4. After setting the index to ``gender``, select all passengers that are ``male``.

In [8]:
sel = Lab1_helper.exercise_4(titanic_df)
sel

Unnamed: 0_level_0,name,age,class,embarked,country,ticketno,fare,survived
gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
male,"Abbing, Mr. Anthony",42.0,3rd,S,United States,5547.0,7.11,0
male,"Abbott, Mr. Eugene Joseph",13.0,3rd,S,United States,2673.0,20.05,0
male,"Abbott, Mr. Rossmore Edward",16.0,3rd,S,United States,2673.0,20.05,0
male,"Abelseth, Mr. Olaus Jørgensen",25.0,3rd,S,United States,348122.0,7.13,1
male,"Abelson, Mr. Samuel",30.0,2nd,C,France,3381.0,24.00,0
...,...,...,...,...,...,...,...,...
male,"Wynn, Mr. Walter",41.0,deck crew,B,England,,,1
male,"Yearsley, Mr. Harry",40.0,victualling crew,S,England,,,1
male,"Young, Mr. Francis James",32.0,engineering crew,S,England,,,0
male,"Zanetti, Sig. Minio",20.0,restaurant staff,S,England,,,0


#### Exercise 5. Reset the index of ``titanic_df_copy`` using ``inplace=True``.

In [9]:
titanic_df_copy = titanic_df.set_index('name')
Lab1_helper.exercise_5(titanic_df_copy)
titanic_df_copy

Unnamed: 0,name,gender,age,class,embarked,country,ticketno,fare,survived
0,"Abbing, Mr. Anthony",male,42.0,3rd,S,United States,5547.0,7.11,0
1,"Abbott, Mr. Eugene Joseph",male,13.0,3rd,S,United States,2673.0,20.05,0
2,"Abbott, Mr. Rossmore Edward",male,16.0,3rd,S,United States,2673.0,20.05,0
3,"Abbott, Mrs. Rhoda Mary 'Rosa'",female,39.0,3rd,S,England,2673.0,20.05,1
4,"Abelseth, Miss. Karen Marie",female,16.0,3rd,S,Norway,348125.0,7.13,1
...,...,...,...,...,...,...,...,...,...
2202,"Wynn, Mr. Walter",male,41.0,deck crew,B,England,,,1
2203,"Yearsley, Mr. Harry",male,40.0,victualling crew,S,England,,,1
2204,"Young, Mr. Francis James",male,32.0,engineering crew,S,England,,,0
2205,"Zanetti, Sig. Minio",male,20.0,restaurant staff,S,England,,,0


# Ethics

We are finally ready to think about data science ethics! 

We have preprocessed a dataset on loan applications to make this example appropriate for linear regression (i.e., y=mx+b). The independent variable data is real and has not been modified apart from being transformed (e.g., Married=Yes => Married=1). In other words, this is a real dataset with minimal modifications.

Our client is a loan company. They would like you to look at historical data on 296 loans that were approved for varying amounts, which are stored in the column ``LoanAmountApproved``. They are interested in finding out which independent variables are the most influential/important when predicting the amount of the approved loan. Upon ethical review, they have determined that ``Gender`` is a protected column and should not be considered in the analysis.

I am doing the majority of the coding for you in this part. I want you to use the ethical frameworks presented in class (see slides and video) to discuss.

In [10]:
import pandas as pd

# Read in the data into a pandas dataframe
credit = pd.read_csv(f"{home}/data-301-student/data/credit.csv",index_col=0)
credit.head()

Unnamed: 0,Married,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,Loan_Amount_Term,Credit_History,Property_Area,Gender,LoanAmountApproved
1,1.0,1,0.0,4583,1508.0,360.0,1.0,0.0,0.0,366797.0
2,1.0,1,1.0,3000,0.0,360.0,1.0,1.0,0.0,242073.0
3,1.0,0,0.0,2583,2358.0,360.0,1.0,1.0,0.0,162284.0
4,0.0,1,0.0,6000,0.0,360.0,1.0,1.0,0.0,356509.0
5,1.0,1,1.0,5417,4196.0,360.0,1.0,1.0,0.0,371674.0


#### Construct a linear model.

Our model should predict ``LoanAmountApproved`` using all of the columns except ``Gender``, which after an ethical review was deemed inappropriate to consider when making a determination on the amount of loan approved for an applicant.

Use ``sklearn.linear_model.LinearRegression`` with the default constructor arguments. We will need to call ``.fit``. The documentation for this is available at:

https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares

#### Exercise 6.

Your exercise is to create a new dataframe called ``X`` that does not have ``Gender`` or ``LoanAmountApproved``. You must also create a series object called ``y``. The order of the columns in ``X`` must be the same as mine.

In [11]:
X,y = Lab1_helper.exercise_6(credit)

X

Unnamed: 0,Married,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,Loan_Amount_Term,Credit_History,Property_Area
1,1.0,1,0.0,4583,1508.0,360.0,1.0,0.0
2,1.0,1,1.0,3000,0.0,360.0,1.0,1.0
3,1.0,0,0.0,2583,2358.0,360.0,1.0,1.0
4,0.0,1,0.0,6000,0.0,360.0,1.0,1.0
5,1.0,1,1.0,5417,4196.0,360.0,1.0,1.0
...,...,...,...,...,...,...,...,...
608,1.0,1,0.0,3232,1950.0,360.0,1.0,0.0
609,0.0,1,0.0,2900,0.0,360.0,1.0,0.0
610,1.0,1,0.0,4106,0.0,180.0,1.0,0.0
611,1.0,1,0.0,8072,240.0,360.0,1.0,1.0


In [12]:
y

1      366797.0
2      242073.0
3      162284.0
4      356509.0
5      371674.0
         ...   
608    234568.0
609    112928.0
610    292763.0
611    475431.0
612    523493.0
Name: LoanAmountApproved, Length: 296, dtype: float64

In [13]:
# Now we can create our model
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X,y)

coef = pd.Series(model.coef_,index=X.columns)
coef.abs().sort_values(ascending=False) # this takes the absolute value and then sorts the values in descending order

Married              70486.456486
Education            16443.486789
Self_Employed         8051.192483
Property_Area         3148.984202
Credit_History        1968.229524
ApplicantIncome         49.547937
Loan_Amount_Term        47.370787
CoapplicantIncome        0.539701
dtype: float64

Now let's write some code that calculates the mean absolute error of our model, which is one measure of how well our model is performing. It looks like our predictions are off by approximately $27K on average.

In [14]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y,model.predict(X))

26916.852660085868
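To see what this number means, here is a small worked sketch with three made-up loan amounts (the values are invented for illustration and are not from `credit.csv`). The mean absolute error is the average of the absolute differences between the true and predicted values, and `mean_absolute_error` computes the same quantity.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([100.0, 200.0, 300.0])   # made-up "true" amounts
y_pred = np.array([110.0, 190.0, 330.0])   # made-up predictions

# Absolute errors are 10, 10, and 30, so the mean is 50 / 3.
manual_mae = np.mean(np.abs(y_true - y_pred))
sklearn_mae = mean_absolute_error(y_true, y_pred)

print(manual_mae, sklearn_mae)
```

Note that in the cell above the error is computed on the same rows the model was fit on, so it describes the fit to the historical data rather than performance on new applicants.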

The company asks you for your interpretation of the model. You say that being married is a strong indicator of being approved for a larger loan amount. This surprises some of your colleagues, but they find it reasonable. Everyone seems happy with the work. However, an experienced data scientist on your team suggests you run a correlation of the columns used in the regression against the column ``Gender`` since it is considered a protected column. You do so quickly to satisfy this request and get:

In [15]:
# I know this code is beyond the Chapter at this point, so ignore the details.
Xgender = X.copy()
Xgender['Gender'] = credit['Gender']
Xgender.corr().loc['Gender'].abs().sort_values(ascending=False).drop('Gender')

Married              0.361358
CoapplicantIncome    0.164699
Loan_Amount_Term     0.148585
Education            0.081054
Self_Employed        0.033836
ApplicantIncome      0.024081
Property_Area        0.018244
Credit_History       0.007251
Name: Gender, dtype: float64
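The chained call above can be unpacked on synthetic data. The following is a minimal sketch, assuming a made-up 0/1 `protected` column, a `proxy` column that agrees with it about 80% of the time, and an unrelated `noise` column. All three names and the 80% rate are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

protected = rng.integers(0, 2, size=n)        # hypothetical protected 0/1 column
flip = rng.random(n) < 0.2                    # flip about 20% of the values (assumed rate)
proxy = np.where(flip, 1 - protected, protected)
noise = rng.integers(0, 2, size=n)            # unrelated 0/1 column

toy = pd.DataFrame({"proxy": proxy, "noise": noise, "protected": protected})

corr_with_protected = (
    toy.corr()                     # pairwise Pearson correlation matrix
       .loc["protected"]           # the row of correlations against `protected`
       .abs()                      # only the strength matters, not the sign
       .sort_values(ascending=False)
       .drop("protected")          # remove the trivial self-correlation of 1.0
)
print(corr_with_protected)
```

Even if `protected` is never passed to a model, a column like `proxy` still carries much of the same information, which is why this check is worth running after a protected column has been dropped.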

**The problems labeled "Problem X" will not be autograded.** You must still complete them.

#### Problem 1: 

What do you think about the results? Specifically, is the fact that ``Married`` is correlated with ``Gender`` at a correlation of 0.36 concerning from an ethical standpoint? What do you as an individual think? Can you think of any suggestions about what to do? Answer these questions using the ethical frameworks we discussed in the ethics video/slides: utilitarianism, deontology, or virtue ethics. In other words, answer these questions as you explain your answers:

Evaluate the alternative options using the following questions:
* Which option will produce the most good and do the least harm?
* Which option best enables me to fulfill my duties to all who have a stake? 
* Which option leads me to act as the sort of person I want to be?

**Your answer here**

Let's assume your suggestion was to remove the ``Married`` column. Let's do so and then compare the mean absolute error of the new model with that of the original.

In [16]:
from sklearn.linear_model import LinearRegression

X2 = X.drop('Married',axis=1)
model2 = LinearRegression().fit(X2, y)
mean_absolute_error(y,model2.predict(X2))

39450.63382892131

#### Problem 2:

What do you think now that you know how this decision changes the mean absolute error: should you drop ``Married``? Your predictions are now off by about $39K on average, more than $10,000 worse than before. Is this OK? Again, use the frameworks outlined above and in the ethics slides.

**YOUR SOLUTION HERE**

In [17]:
# Good job!
# Don't forget to push with ./submit.sh