In [None]:
%config InlineBackend.figure_format='retina'

# DSC 80
The Practice and Application of Data Science

**Fall 2021**

### Part 1

# Syllabus and Course Info

## Course Goals

* Practice translating potentially vague questions into quantitative questions about measurable observations.
* Learn to reason about 'black-box' processes (e.g. complicated models).
* Understand computational and statistical implications of working with data.
* Learn to use real data tools (e.g. love the documentation!).
* Give a taste of the "life of a data scientist".

## Course Outcomes

* Prepare you for internships and data science "take home" interviews!
* Enable you to create your own portfolio of personal projects.
* Prepare you for upper-division ML and Stats courses (material and maturity).

## Schedule

- Week 1:
Tables and Messy Data
- Week 2:
Hypothesis Testing and Data Granularity
- Week 3:
Permutation Tests
- Week 4:
Missingness
- Week 5:
HTTP and HTML
- Week 6:
Regex
- Week 7:
Features and sklearn
- Week 8:
Thanksgiving Week
- Week 9:
Model Selection
- Week 10:
Model Evaluation and Fairness

## Course Materials and Information

* The course [website](http://dsc80.com/): syllabus, links, schedule.
    - http://dsc80.com

## Course Materials on GitHub

* The course [github repository](https://github.com/dsc-courses/dsc80-2021-fa): assignments, lectures, references.
    - https://github.com/dsc-courses/dsc80-2021-fa

## Course Materials

* This is (still) a new course (anywhere!); be patient, please. 😁
* We kinda-sorta have a book! (finally!)
    - https://afraenkel.github.io/practical-data-science
    - the lecture slides serve as main source of information and practice.
    
* Just like a working Data Scientist:
    - using new, evolving tools requires doing your own research!
    - using real data requires doing your own research!
    - *You* will be responsible for assessing the correctness of your research!

## Secondary references

* Wes McKinney. "Python for Data Analysis" ([Link - requires UCSD internet](proquest.safaribooksonline.com/9781449323592))
* Sam Lau, Joey Gonzalez, and Deb Nolan. "Principles and Techniques of Data Science" (https://www.textbook.ds100.org/)
* Ani Adhikari and John DeNero. "Computational and Inferential Thinking" (https://www.inferentialthinking.com)
* On-line tutorials are great, but be sure to understand the *concepts* in the lecture!

## Working locally vs remotely

- You can choose to set up your own Python environment or use DataHub
- We recommend setting up a local environment with Anaconda Python
- Tips: http://dsc80.com/tech_support.html
- Materials will be available on GitHub; you'll need to use Git
- Assignments are submitted to Gradescope, though (not using git)
- demo?

## A few last remarks...

* This course requires *a lot* of work; gaining fluency working with data is hard!
* You will have to learn things on your own (e.g. love the documentation).
* Learning to effectively check your work (and debug) pays dividends:
    - Does my answer *look* right for the context?
    - Does my code do what I think? (small data testing)
    - Does my code generalize properly? (unseen data testing)

### Part 2

# What is a data scientist, anyways?

# What is a data scientist?

<div class="image-txt-container">
    
* [The Venn Diagram](http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram), Drew Conway, 2010
* Statistics: drawing conclusions.
* Hacking: extracting information.
* Substantive Expertise: understanding context.

<img src="imgs/image_0.png">

</div>

# What is a data scientist?

<div class="image-txt-container">

* [Battle of the Data Science Venn Diagrams](http://www.prooffreader.com/2016/09/battle-of-data-science-venn-diagrams.html)
* Stephan Kolassa, 2014
* Data Science is expanding and maturing.
* Find a niche you enjoy!
* Understanding a little of everything is powerful.

<img src="imgs/image_1.png">

</div>

## What is a data scientist?

Who knows? What do data scientists do?

* Extract usable information from data. (~CS)
* Use that information to answer questions. (~Stats)
* Use that information to solve problems. (~Science)

Let's look at a few examples...

## Predicting Elections

<div class="image-txt-container">
    <div></div>
        
* What does the electorate look like?
* What are the important traits?
* How do we gather/measure those traits?
* Quantify into a model / Draw conclusion with confidence!
        
<img src="imgs/538.png" width="50%">
</div>

## Internet Advertisements
Can using internet ads increase a dealership's truck sales?

<div class="image-txt-container">
    <div></div>
        
* How likely is a person to click?
* Who is it shown to?
* How do ad clicks translate to sales? 
* Are they *causing* higher sales?
* What data should I collect?

<img src="imgs/internet_ads.gif" width="500">
</div>




## Image recognition for celebrity GIFs

[Gfycat](https://www.wired.com/story/how-coders-are-fighting-bias-in-facial-recognition-software/) resolves celebrity faces to generate memes.

<div class="image-txt-container">
<img src="imgs/Chris.png" width="50%">
<img src="imgs/Twice.png" width="50%">
</div>




## Data Science requires responsible practice

<div class="image-txt-container">
    
- Easy to obscure complex decisions made with data:
    * 2009 market crash.
    * Criminal sentencing.
    * Hiring and Admissions.
    * Hyper-personalized ad recommendations.

<img src="imgs/image_2.png" width="50%">

</div>

## Data Science requires responsible practice

<div class="image-txt-container">
    
* Reinforcing historical trends and biases:
    - Hiring based on previous hiring data.
    - Criminal sentencing using previous decisions.
    - Social media, news, and politics

* Data is generated by people; treat people responsibly!


<img src="imgs/image_2.png" width="50%">

</div>

### Part 3

# The (Applied) Data Science Lifecycle

## The scientific method hides complexity

<div class="image-txt-container">
    
* From what context did your hypothesis come?
* What data are you using/measuring?
    - What if the data isn't sufficient?
* Under which conditions are the conclusions valid?
* The language of modeling helps answer these questions.

<img src="imgs/image_3.png">

</div>

### The Data Science Life-cycle: Everything leads to more questions!
<img src="imgs/DSLC.png" width="40%">

<img src="./imgs/eeg.png" width=60%>

### Research Domain

<div class="image-txt-container">
    <div></div>

* What subject do we care about?
* How is relevant data generated?

<img src="imgs/DSLC.png">
</div>





### Question or Hypothesis

<div class="image-txt-container">
    <div></div>

* What do we want to know?
* What problem are we trying to solve?
* What are our metrics for success?
* Hypotheses are refined from any stage of work!

<img src="imgs/DSLC.png">
</div>

### Find and Clean Data

<div class="image-txt-container">
    <div></div>

* What data exists and can it answer the question?
* Do we need to collect/measure our own data?
* Cleaning the data: does it well-represent the domain?

<img src="imgs/DSLC.png">
</div>


### Data Modeling

<div class="image-txt-container">
    <div></div>

* What assumptions are made of data to draw conclusions?
* What biases or anomalies exist in the data?
* How is the data simplified to use for predictions and inference?

<img src="imgs/DSLC.png">
</div>



### Predictions and Inference

<div class="image-txt-container">
    <div></div>

* What does the data say about the world?
* Does it answer our questions? Solve our problem?
* Can we trust our conclusions and predictions?

<img src="imgs/DSLC.png">
</div>




## Aside

- I called this the *applied* data science lifecycle
- I think you can be a *theoretical* data scientist, too
    - Do math and prove stuff.
- But either way, you should be familiar with the applied data science lifecycle.

### Part 4

# Example: SD Employee Salaries

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import util

### Research Domain & Questions

<img src="imgs/DSLC.png">




## Example: SD Employee Salaries

* Build a "profile" of city of SD employees?
* Which jobs make more/less? How much?
* Who works part-time? full-time?
* Are salaries "fair"?

### SD Employee Salaries: background research

* Why try to understand this dataset?
    - Such a profile could inform 3rd party workplace programs.
    - Journalists might search for salary anomalies.
    - Auditors may want actionable advice on fair employment practices.

### Find and Clean Data

<img src="imgs/DSLC.png">




### Initial look at the data
* [Transparent California](https://transparentcalifornia.com/salaries/san-diego/) salary website.

In [None]:
salary_path = util.safe_download('https://transcal.s3.amazonaws.com/public/export/san-diego-2017.csv')

In [None]:
salaries = pd.read_csv(salary_path)
util.anonymize_names(salaries)
salaries

### Aside on privacy and ethics

* Employee names correspond to *real* people.
* PII (personally identifiable information).
* Legal vs Ethical:
    - Public record vs Searchable record.
    - Don't propagate people's data.

### Cleaning data / EDA
* Not something we did much of in DSC 10
* A *huge* component of real-world data science
* What does "typical" look like?
* Any questions about data reliability?

In [None]:
# .T is for transpose()
salaries.describe().T

* Negative payments? Near zero salaries?
* Where is "Other Pay"?
* Max values -- are outliers real?
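
The questions above can be turned into quick filters. The following is a minimal sketch on a hypothetical toy table; the column names (`Base Pay`, `Other Pay`, `Total Pay`) and the $1,000 cutoff are assumptions for illustration, not properties of the real salary file.

```python
import pandas as pd

# Hypothetical toy data; column names are assumed to resemble the salary table.
toy = pd.DataFrame({
    'Job Title': ['Clerk', 'Engineer', 'Lifeguard', 'Planner'],
    'Base Pay': [42000.0, 98000.0, 350.0, 61000.0],
    'Other Pay': [1200.0, -500.0, 0.0, 3000.0],
    'Total Pay': [43200.0, 97500.0, 350.0, 64000.0],
})

pay_cols = ['Base Pay', 'Other Pay', 'Total Pay']

# Rows with at least one negative pay component.
has_negative = toy[(toy[pay_cols] < 0).any(axis=1)]

# Rows whose total pay falls below an arbitrary threshold (assumed: $1,000).
near_zero = toy[toy['Total Pay'] < 1000]
```

Rows flagged this way are candidates for a closer look, not automatic deletions; a negative payment may be a legitimate correction.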

### Empirical Distribution of Salaries
* Bimodal (two distributions?)
* Typical right skew, since pay is (or should be) greater than 0

In [None]:
salaries['Total Pay'].plot(kind='hist', bins=50, density=True);

### Empirical Distribution of Salaries
* Part-time vs full-time

In [None]:
bystatus = salaries.groupby('Status')
bystatus['Total Pay'].plot(kind='kde', alpha=0.5, title='Salary by Full-time/Part-time')
plt.legend(bystatus.groups);

### Does gender influence pay?
* Do employees of different genders have similar pay?
* Don't have gender information in our salary data.
    - Do have (first) names of employees!
    - Join to Social Security Administration "baby names" dataset!

In [None]:
names_path = util.safe_download('https://www.ssa.gov/oact/babynames/names.zip')

In [None]:
import pathlib

dfs = []
for path in pathlib.Path('data/names/').glob('*.txt'):
    # files are named like 'yob1964.txt'; read the year from the file name
    year = int(path.stem[3:])
    if year >= 1964:
        df = pd.read_csv(path, names=['firstname', 'gender', 'count']).assign(year=year)
        dfs.append(df)
        
names = pd.concat(dfs)
names

### SSA names dataset
* Contains the list of names on social security applications
* For each name, it contains:
    - the number of applications per year identified as Male/Female
* We want a list of names and most likely gender to join to `salaries`.
* Note that the dataset contains only "M" and "F"

### Basic check of `names`:
* Many names have nonzero counts for both M and F
* Most names occur only a few times per year.
* A few names occur very often.

In [None]:
# glance at data
names.head()

In [None]:
# look at a single name
names[names['firstname'] == 'Madison']

In [None]:
# look at statistics of counts
names.describe()

### Data Modeling

<img src="imgs/DSLC.png">


### Approach to inferring gender:

* Our model: if someone has a name that is predominantly used by gender $\alpha$, we'll infer their gender to be $\alpha$.
    - This isn't 100% accurate!
    - But it should be good enough in aggregate.

* Create a data frame of distinct names with, for each name, the proportion of applications identified as female.
* That is, for each name $N$, we compute:
$$P({\rm person\ is\ female\ }|{\rm \ person\ has\ first\ name\ } N)$$

* Join this data frame to the salaries dataset
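
To make the conditional probability concrete, here is a small sketch on hypothetical counts. The names and numbers are invented for illustration; only the long format (`firstname`, `gender`, `count`) is taken from the `names` table.

```python
import pandas as pd

# Hypothetical application counts, in the same long format as `names`.
toy_names = pd.DataFrame({
    'firstname': ['Alex', 'Alex', 'Maria', 'John', 'John'],
    'gender':    ['F',    'M',    'F',     'F',    'M'],
    'count':     [30,     70,     200,     1,      999],
})

# One row per name, one column per gender; missing combinations become 0.
counts = toy_names.pivot_table(
    index='firstname', columns='gender', values='count',
    aggfunc='sum', fill_value=0,
)

# Estimate of P(person is female | person has first name N).
toy_prop_female = counts['F'] / counts.sum(axis=1)
```

In this toy table, 'Alex' gets 30 / (30 + 70) = 0.3, so the model would label every Alex as 'M' even though 30% of the applications say otherwise.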

### Calculate names and their most likely genders

In [None]:
# Counts by gender

counts_by_gender = (
    names
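    .groupby(['firstname', 'gender'])['count']
    .sum()
    .reset_index()
    # keyword arguments: newer pandas versions no longer accept positional ones here
    .pivot(index='firstname', columns='gender', values='count')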
    .fillna(0)
)
counts_by_gender

In [None]:
# proportion of a given name that's identified female
prop_female = (counts_by_gender['F'] / counts_by_gender.sum(axis=1))
prop_female.head(10)

In [None]:
genders = prop_female.rename('prop_female').to_frame().assign(gender=np.where((prop_female > 0.5), 'F', 'M'))
genders

### Add a given name column to `salaries` and join names

In [None]:
# Add firstname column
salaries['firstname'] = salaries['Employee Name'].str.split().str[0]
salaries

In [None]:
# join gender
salaries_with_gender = salaries.merge(genders, on='firstname', how='left')
salaries_with_gender.head()

### Predictions and Inference

<img src="imgs/DSLC.png">




### Do women earn similar pay to their contemporaries?
* Is this difference significant, or just due to chance?

In [None]:
pd.concat([
    salaries_with_gender.groupby('gender')['Total Pay'].describe().T,
    salaries_with_gender['Total Pay'].describe().rename('All')
], axis=1)

### Use a hypothesis test
* Can women's median pay be explained as the median pay of a random subset of the population of city of SD salaries?
    - If so, the salary of women doesn't significantly differ from the population
    - If not, some other mechanism is needed to explain the difference!

In [None]:
# size of sample is number of women:
n_female = (salaries_with_gender['gender'] == 'F').sum()


# calculate observed median pay of women
female_median = salaries_with_gender.loc[salaries_with_gender['gender'] == 'F', 'Total Pay'].median()


# simulate 1000 random groups of size n_female drawn from the population
medians = []
for _ in np.arange(1000):
    median = salaries_with_gender.sample(n_female)['Total Pay'].median()
    medians.append(median)

In [None]:
title='Median salary of randomly chosen groups from population'
pd.Series(medians).plot(kind='hist', title=title);
plt.plot([female_median], [0], marker='o', markersize=10)
plt.legend(['Observed Median Salary of Women', 'Median Salaries of Random Groups']);
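
The plot compares the observed median to the simulated ones by eye. One way to summarize the comparison in a single number is an empirical p-value. The sketch below uses hypothetical pay values generated with NumPy, with a lower-paid group built in on purpose; it does not use the salary data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical pay values: 1,000 people, the first 300 form the group of interest.
pay = rng.lognormal(mean=11, sigma=0.5, size=1000)
pay[:300] *= 0.8  # build a lower-paid group into the toy data
n_group = 300
observed_median = np.median(pay[:n_group])

# Medians of random groups of the same size, drawn without replacement.
simulated = np.array([
    np.median(rng.choice(pay, size=n_group, replace=False))
    for _ in range(1000)
])

# Share of random groups whose median is at or below the observed one.
p_value = (simulated <= observed_median).mean()
```

A small `p_value` says that random groups rarely have a median as low as the observed one. It does not say why the observed group differs.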

### Part 5

# The Truism of Data Science

### or how answering one question always raises 10 more

### Questions:

* Can we trust the SD employee population?
* Can we trust the name-to-gender assignment?
* Can we trust our join?
* Is the disparity correlated with pay-type? job status? job type?
* What is the cause of the disparity?

### Can we trust the SD employee population?

* Look up "Transparent California" and verify that this dataset is a *census* (everyone).
* Is "Total Pay" the most appropriate field to use?
* Cross-reference independent counts of city of SD employees to assess the salary data.

### Can we trust the name-to-gender assignment?
* How many names are borderline male/female?
* Does it make sense to incorporate name usage from 1880-2017?

In [None]:
title = 'distribution of gendered-ness of names\n 0 = masculine \n 1 = feminine'
prop_female.plot(kind='hist', bins=20, title=title);
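
To put a number on "borderline", one option is to count names whose proportion is close to an even split. This is a sketch on hypothetical proportions; the 0.25 cutoff is an arbitrary assumption.

```python
import pandas as pd

# Hypothetical proportions, indexed by name like `prop_female`.
toy_prop = pd.Series(
    {'Alex': 0.30, 'Casey': 0.55, 'John': 0.001, 'Maria': 0.999, 'Riley': 0.48}
)

# Call a name "borderline" if it is within 0.25 of an even split (assumed cutoff).
borderline = toy_prop[(toy_prop - 0.5).abs() < 0.25]
share_borderline = len(borderline) / len(toy_prop)
```

Weighting each name by how many employees carry it would be a natural refinement, since a rare borderline name matters less than a common one.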

### Can we trust our join?
* Are there names in the salary dataset that aren't in the SSA dataset?
    - Who might not be in the SSA dataset? 
    - Might they be biased toward certain salaries?
* Does the salary dataset have a disproportionately large share of unisex names?
* Is it better to use a subset of the SSA dataset (e.g. by state)?
    - Does the gender associated with a name typically vary by geography?

### Can we trust our join?

In [None]:
# proportion of employees not in SSA dataset
salaries_with_gender['gender'].isnull().mean()

In [None]:
# Description of total pay by joined vs not joined
(
    salaries_with_gender
    .assign(joined=salaries_with_gender['gender'].notnull())
    .groupby('joined')['Total Pay']
    .describe()
    .T
)

In [None]:
nonjoins = salaries_with_gender.loc[salaries_with_gender['gender'].isnull()]

title = 'Distribution of Salaries'
nonjoins['Total Pay'].plot(kind='hist', bins=50, alpha=0.5, density=True, sharex=True)
salaries_with_gender['Total Pay'].plot(kind='hist', bins=50, alpha=0.5, density=True, sharex=True, title=title)
plt.legend(['Not in SSA','All']);

### Can we trust our join?
* Lesson: joining to another dataset can bias your sample! 
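
A quick way to see which rows failed to match is the `indicator` option of `merge`, which adds a `_merge` column. The sketch below uses hypothetical toy tables; the names and pay values are invented.

```python
import pandas as pd

# Hypothetical toy tables; 'Qwerty' stands in for a name missing from the lookup.
toy_salaries = pd.DataFrame({
    'firstname': ['Maria', 'John', 'Qwerty', 'Alex'],
    'Total Pay': [70000, 65000, 30000, 52000],
})
toy_genders = pd.DataFrame({
    'firstname': ['Maria', 'John', 'Alex'],
    'gender': ['F', 'M', 'M'],
})

merged = toy_salaries.merge(toy_genders, on='firstname', how='left', indicator=True)

# '_merge' is 'both' for matched rows and 'left_only' for unmatched ones.
unmatched = merged[merged['_merge'] == 'left_only']

# Compare pay between matched and unmatched rows.
pay_by_match = merged.groupby('_merge', observed=True)['Total Pay'].median()
```

If the unmatched rows have systematically different pay, dropping them changes the population being studied.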

### Is the pay disparity correlated with another field? job status? job type?
* Is the proportion of women in a job type correlated with pay?
    - Controlling for job type, do women earn similar salaries?

In [None]:
# select jobs with word 'fire' in them
firejobs = salaries_with_gender.loc[salaries_with_gender['Job Title'].str.contains('Fire')]
firejobs.head()

In [None]:
# Proportion of fire-related jobs held by women
(firejobs['gender'] == 'F').mean()

In [None]:
# Pay Statistics for fire-related jobs
firejobs['Total Pay'].describe()

In [None]:
# select library-related jobs
libjobs = salaries_with_gender.loc[salaries_with_gender['Job Title'].str.contains('Librar')]
libjobs.head()

In [None]:
# Proportion of library-related jobs held by women
(libjobs['gender'] == 'F').mean()

In [None]:
# Pay Statistics for library-related jobs
libjobs['Total Pay'].describe()
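
"Controlling for job type" can be sketched as a grouped comparison: compute the statistic within each job category, then compare it with the overall figure. The table below is hypothetical and was constructed so that the two views disagree; the `job_category` column is an assumption.

```python
import pandas as pd

# Hypothetical toy data; 'job_category' is an assumed, already-cleaned column.
toy = pd.DataFrame({
    'job_category': ['Fire', 'Fire', 'Fire', 'Library', 'Library', 'Library'],
    'gender':       ['M',    'M',    'F',    'F',       'F',       'M'],
    'Total Pay':    [90000,  110000, 100000, 50000,     60000,     55000],
})

# Median pay by gender, ignoring job category.
overall = toy.groupby('gender')['Total Pay'].median()

# Median pay for each (job category, gender) pair, with genders as columns.
median_by_group = (
    toy
    .groupby(['job_category', 'gender'])['Total Pay']
    .median()
    .unstack('gender')
)
```

In this toy table the overall medians differ (60000 for 'F', 90000 for 'M'), yet the medians within each category are equal. The whole gap comes from who holds which jobs.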

### What is the cause of the disparity?
* Now that the picture is better understood, you can:
    - research historical gender preferences across jobs,
    - list possibilities and formulate hypotheses for a cause,
    - find data capable of describing these possibilities,
    - use a [natural experiment](https://en.wikipedia.org/wiki/Natural_experiment) to argue causality.

### Possible follow-ups
* Clean job titles to reflect broader job categories.
* Further investigate correlation between job categories and gender proportions.
* Do the analogous investigation with age in place of gender.
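
As a starting point for the first follow-up, job titles can be mapped to broader categories with keyword rules. This is a sketch only: the titles, the rule set, and the `categorize` helper are all hypothetical, and real titles will need many more rules.

```python
import re

import pandas as pd

# Hypothetical job titles for illustration.
titles = pd.Series([
    'Fire Fighter 2', 'Fire Captain', 'Librarian 3',
    'Library Assistant', 'Police Officer 1', 'Senior Planner',
])

# Hypothetical keyword-to-category rules; the first matching pattern wins.
rules = {'Fire': r'\bfire', 'Library': r'\blibrar', 'Police': r'\bpolice'}


def categorize(title):
    """Return the first category whose pattern matches, else 'Other'."""
    for category, pattern in rules.items():
        if re.search(pattern, title, flags=re.IGNORECASE):
            return category
    return 'Other'


categories = titles.apply(categorize)
```

The share of titles that land in 'Other' is a useful check on how much of the data the rules actually cover.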