# The Career Decisions of Young Men

This notebook processes and explores the data for the seminal work by Michael Keane and Kenneth Wolpin studying the career decisions of young men.

> Michael P. Keane, Kenneth I. Wolpin (1997). [The Career Decisions of Young Men](http://www.journals.uchicago.edu/doi/10.1086/262080). *Journal of Political Economy*, 105(3): 473-522.

The original file is part of the online repository and can be accessed [here](https://github.com/structDataset/career_decisions_data/blob/master/KW_97.raw). 

## Preparations

We first perform some basic preparations on the original dataset to ease further processing.

In [1]:
import pandas as pd
import numpy as np

columns = ['Identifier', 'Age', 'Schooling', 'Choice', 'Wage']
dtype = {'Identifier': int, 'Age': int, 'Schooling': int, 'Choice': 'category'}

df = pd.DataFrame(np.genfromtxt('KW_97.raw'), columns=columns).astype(dtype)
df.set_index(['Identifier', 'Age'], inplace=True, drop=False)
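# Replace the numeric choice codes with labels (assigned in sorted order of the codes)
df['Choice'] = df['Choice'].cat.rename_categories(['Schooling', 'Home', 'White', 'Blue', 'Military'])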

## Basic Descriptives

We start by reproducing some basic descriptive statistics from the paper.

### Choice Probabilities

We reproduce the choice probabilities reported in Table 1 of the original paper.

In [2]:
# Produce the raw table
table_1 = pd.crosstab(index=df.Age, columns=df.Choice, margins=True)
# Convert the counts to row percentages
table_1_rel = table_1.div(table_1.All, axis=0) * 100

Defaulting to column but this will raise an ambiguity error in a future version
  grouped = data.groupby(keys)
Defaulting to column but this will raise an ambiguity error in a future version
  margin = data[rows + values].groupby(rows).agg(aggfunc)


In [3]:
table_1

Age,Schooling,Home,White,Blue,Military,All
16,1178,145,4,45,1,1373
17,1014,197,15,113,20,1359
18,561,296,92,331,70,1350
19,420,293,115,406,107,1341
20,341,273,149,454,113,1330
21,275,257,170,498,106,1306
22,169,212,256,559,90,1286
23,105,185,336,546,68,1240
24,65,112,284,416,44,921
25,24,61,215,267,24,591


In [4]:
table_1_rel

Age,Schooling,Home,White,Blue,Military,All
16,85.797524,10.560816,0.291333,3.277495,0.072833,100.0
17,74.613687,14.495953,1.103753,8.314937,1.47167,100.0
18,41.555556,21.925926,6.814815,24.518519,5.185185,100.0
19,31.319911,21.849366,8.57569,30.275913,7.97912,100.0
20,25.639098,20.526316,11.203008,34.135338,8.496241,100.0
21,21.056662,19.678407,13.016845,38.1317,8.116386,100.0
22,13.141524,16.485226,19.906687,43.468118,6.998445,100.0
23,8.467742,14.919355,27.096774,44.032258,5.483871,100.0
24,7.057546,12.160695,30.836048,45.168295,4.777416,100.0
25,4.060914,10.321489,36.379019,45.177665,4.060914,100.0
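
As a cross-check on the row percentages, `pd.crosstab` can also normalize the rows itself. The following is a minimal sketch on a made-up toy frame (the `toy` data and all variable names are illustrative assumptions, not taken from `KW_97.raw`). It suggests that dividing the counts by the `All` margin and passing `normalize='index'` yield the same shares.

```python
import pandas as pd

# Toy data, purely illustrative (not the KW data).
toy = pd.DataFrame({
    'Age': [16, 16, 16, 16, 17, 17, 17, 17],
    'Choice': ['Schooling', 'Schooling', 'Schooling', 'Home',
               'Schooling', 'Blue', 'Blue', 'Home'],
})

# Approach used in the notebook: counts with margins, then divide by the row total.
counts = pd.crosstab(index=toy['Age'], columns=toy['Choice'], margins=True)
shares_manual = counts.div(counts['All'], axis=0) * 100

# Alternative: let crosstab normalize each row directly.
shares_direct = pd.crosstab(index=toy['Age'], columns=toy['Choice'], normalize='index') * 100

# Compare both results on the age rows and the choice columns they share.
print(shares_manual.loc[[16, 17], shares_direct.columns])
print(shares_direct)
```

The direct variant skips the `All` column, so it is convenient when only the shares are needed and the raw counts are not.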


### Transition Matrix

We compute the transition probabilities reported in Table 2 of the original paper. So far, the tables computed here differ substantially from the published one.

In [5]:
# Create a column that holds each individual's choice in the previous period
df['Choice_t_1'] = df.groupby(level='Identifier').Choice.shift(1)
# Create table with absolute numbers
table_2_abs = pd.crosstab(index=df.Choice_t_1, columns=df.Choice, margins=True)
# Create table with row frequencies
table_2_row = table_2_abs.div(table_2_abs.All, axis=0) * 100
# Create table with column frequencies
table_2_col = table_2_abs.div(table_2_abs.loc['All', :]) * 100

Defaulting to column but this will raise an ambiguity error in a future version
  


In [6]:
table_2_row

Choice_t_1 \ Choice,Schooling,Home,White,Blue,Military,All
Schooling,64.502592,13.009134,8.639842,12.095779,1.752654,100.0
Home,9.754797,47.17484,8.102345,31.289979,3.678038,100.0
White,5.703422,6.311787,67.376426,19.923954,0.684411,100.0
Blue,3.41448,12.361682,9.927284,73.411318,0.885236,100.0
Military,1.376936,5.507745,3.098107,9.638554,80.378657,100.0
All,27.18915,17.458584,15.65629,33.833971,5.862006,100.0


In [7]:
table_2_col

Choice_t_1 \ Choice,Schooling,Home,White,Blue,Military,All
Schooling,87.479076,27.476538,20.348837,13.182674,11.024845,36.874204
Home,6.126548,46.141814,8.837209,15.792306,10.714286,17.076279
White,2.51088,4.327424,51.511628,7.048695,1.397516,11.96978
Blue,3.615668,20.385819,18.255814,62.469734,4.347826,28.791189
Military,0.267827,1.668405,1.046512,1.506591,72.515528,5.288549
All,100.0,100.0,100.0,100.0,100.0,100.0
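
The lagged choice is the key ingredient of the transition matrix, so it may help to see the mechanics on a small example. The sketch below uses a hypothetical toy panel (the data and variable names are assumptions for illustration only). It shifts the choice within each individual and then tabulates last period's choice against the current one.

```python
import pandas as pd

# Toy panel, purely illustrative: two individuals observed at three ages.
toy = pd.DataFrame({
    'Identifier': [1, 1, 1, 2, 2, 2],
    'Age': [16, 17, 18, 16, 17, 18],
    'Choice': ['Schooling', 'Schooling', 'Blue', 'Home', 'Blue', 'Blue'],
})

# Sort so that shifting moves along time within each individual.
toy = toy.sort_values(['Identifier', 'Age'])

# Lag the choice within each individual; the first observation has no predecessor.
toy['Choice_t_1'] = toy.groupby('Identifier')['Choice'].shift(1)

# Row percentages: rows are last period's choice, columns this period's choice.
transitions = pd.crosstab(index=toy['Choice_t_1'], columns=toy['Choice'],
                          normalize='index') * 100

print(toy)
print(transitions)
```

Each individual's first observation has a missing lag and therefore drops out of the cross-tabulation, which is one reason the transition counts are smaller than the number of person-period observations.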


### Average Real Wages

We reproduce the average real wages by occupation reported in Table 4 of the original paper. The tables below contain the same information as the published table.

In [8]:
table_4_mean = pd.crosstab(index=df.Age, columns=df.Choice, values=df.Wage, aggfunc='mean', margins=True)
table_4_mean = table_4_mean[['All', 'White', 'Blue', 'Military']]

Defaulting to column but this will raise an ambiguity error in a future version
  grouped = data.groupby(keys)
Defaulting to column but this will raise an ambiguity error in a future version
  margin = data[rows + values].groupby(rows).agg(aggfunc)


In [9]:
table_4_mean

Age,All,White,Blue,Military
16,10217.740418,9320.762,10286.738758,
17,11036.597108,10049.757071,11572.887893,9005.362615
18,12060.746019,11775.341338,12603.820833,10171.86815
19,12246.684578,12376.418072,12949.838227,9714.600108
20,13635.869637,13824.013219,14363.658471,10852.506971
21,14977.004406,15578.139155,15313.451473,12619.374667
22,17561.28014,20236.075551,16947.904935,13771.555541
23,18719.83594,20745.564706,17884.949782,14868.653698
24,20942.417442,24066.635884,19245.185944,15910.839514
25,22754.544937,24899.227802,21473.314696,17134.463455


In [10]:
table_4_count = pd.crosstab(index=df.Age, columns=df.Choice, values=df.Wage, aggfunc='count', margins=True).drop(['Schooling', 'Home'], axis=1)
table_4_count = table_4_count[['All', 'White', 'Blue', 'Military']]

Defaulting to column but this will raise an ambiguity error in a future version
  grouped = data.groupby(keys)
Defaulting to column but this will raise an ambiguity error in a future version
  margin = data[rows + values].groupby(rows).agg(aggfunc)


In [11]:
table_4_count

Age,All,White,Blue,Military
16,28.0,2.0,26.0,0.0
17,102.0,14.0,75.0,13.0
18,377.0,71.0,246.0,60.0
19,507.0,97.0,317.0,93.0
20,587.0,128.0,357.0,102.0
21,657.0,142.0,419.0,96.0
22,764.0,214.0,476.0,74.0
23,833.0,299.0,481.0,53.0
24,667.0,259.0,373.0,35.0
25,479.0,207.0,250.0,22.0
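
The means and the observation counts above come from two separate `pd.crosstab` calls. As an alternative, a single `groupby` can produce both statistics at once. This is only a sketch on hypothetical toy data (the frame and the name `summary` are assumptions), not a replacement for the cells above.

```python
import numpy as np
import pandas as pd

# Toy data, purely illustrative: wages are missing for non-working choices.
toy = pd.DataFrame({
    'Age': [16, 16, 16, 17, 17, 17],
    'Choice': ['White', 'Blue', 'Schooling', 'White', 'White', 'Home'],
    'Wage': [9000.0, 10000.0, np.nan, 11000.0, 13000.0, np.nan],
})

# Mean wage and number of non-missing wage observations by age and choice.
summary = (toy.groupby(['Age', 'Choice'])['Wage']
              .agg(['mean', 'count'])
              .unstack('Choice'))

print(summary['mean'])
print(summary['count'])
```

Note that `count` only counts non-missing wages, so the cell counts in a wage table can be smaller than the corresponding choice counts whenever a wage is not recorded.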


## State Variables

We extend the original data and derive the full set of state variables.


* Blue-collar work experience, linear and squared

* White-collar work experience, linear and squared

* Military work experience, linear

In [12]:
temp = df.reset_index(drop=True).copy()

# Loop over ages and count the accumulated experience
# up to and including each age.
frames = []
for i in range(16, 27):
    a = temp[temp.Age <= i].groupby(['Identifier']).Choice.value_counts().unstack('Choice').add_prefix('Experience_').reset_index()
    a['Age'] = i
    frames.append(a)
exp = pd.concat(frames)
    
exp.set_index(['Identifier', 'Age'], inplace=True)
exp.drop(['Experience_Home', 'Experience_Schooling'], axis=1, inplace=True)

# Merge both DataFrames
df = pd.concat([df, exp], axis=1)

In [13]:
df[['Choice', 'Experience_White', 'Experience_Blue', 'Experience_Military']].head(10)

Identifier,Age,Choice,Experience_White,Experience_Blue,Experience_Military
6,16,Schooling,,,
6,17,Schooling,,,
6,18,Schooling,,,
6,19,Schooling,,,
6,20,Schooling,,,
6,21,Home,,,
6,22,White,1.0,,
6,23,White,2.0,,
6,24,White,3.0,,
6,25,White,4.0,,
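
The experience variables can also be built without a loop over ages. The sketch below is one possible alternative on hypothetical toy data. The `_Sq` column names are assumptions introduced here for illustration and do not appear in the original data. Like the cell above, it counts the current period as experience, but it yields `0` rather than a missing value where no experience has been accumulated.

```python
import pandas as pd

# Toy panel, purely illustrative: one individual observed at four ages.
toy = pd.DataFrame({
    'Identifier': [6, 6, 6, 6],
    'Age': [21, 22, 23, 24],
    'Choice': ['Home', 'White', 'White', 'Blue'],
}).sort_values(['Identifier', 'Age'])

for occ in ['White', 'Blue', 'Military']:
    # Indicator for working in this occupation in the given period.
    worked = (toy['Choice'] == occ).astype(int)
    # Running count within each individual, including the current period.
    toy['Experience_' + occ] = worked.groupby(toy['Identifier']).cumsum()

# Squared terms for the two civilian occupations (hypothetical column names).
for occ in ['White', 'Blue']:
    toy['Experience_' + occ + '_Sq'] = toy['Experience_' + occ] ** 2

print(toy)
```

If experience should instead measure periods worked before the current one, subtracting `worked` from the running count inside the first loop would give that variant.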


## Further Processing

We store the final DataFrame as a pickle and as a CSV file to ease further processing. Both files are checked into version control, so you can download them directly [here](https://github.com/structDataset/career_decisions_data).

In [14]:
fname = 'career_decisions_data'
df.to_pickle(fname + '.pkl')
df.to_csv(fname + '.csv', index=False)
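
To illustrate the difference between the two stored formats, the following sketch writes a hypothetical toy frame to a temporary directory and reads it back (the frame, the file name `toy_data`, and the variable names are assumptions). The pickle preserves pandas dtypes such as `category`, whereas the CSV is plain text and its dtypes are inferred again on reading.

```python
import os
import tempfile

import pandas as pd

# Toy frame, purely illustrative, with a categorical column.
toy = pd.DataFrame({'Identifier': [6, 6], 'Age': [16, 17],
                    'Choice': ['Schooling', 'Home']})
toy['Choice'] = toy['Choice'].astype('category')

with tempfile.TemporaryDirectory() as tmp:
    base = os.path.join(tmp, 'toy_data')
    toy.to_pickle(base + '.pkl')
    toy.to_csv(base + '.csv', index=False)

    # Read both files back while the temporary directory still exists.
    from_pickle = pd.read_pickle(base + '.pkl')
    from_csv = pd.read_csv(base + '.csv')

print(from_pickle['Choice'].dtype)
print(from_csv['Choice'].dtype)
```

When working from the CSV, the categorical column can be restored with `astype('category')` after loading.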